Question:

Nutch IllegalArgumentException: Row length 41221 > 32767

姜博
2023-03-14

I have added a set of seeds and want to crawl them with the following command:

./bin/crawl largeseeds 1 http://localhost:8983/solr/ddcd 4

In the first iteration, all steps (inject, generate, fetch, parse, update-table, index, and delete duplicates) completed successfully. In the second iteration, the CrawlDB update step failed (see the error log below), and because of that failure the whole crawl terminated.

16/01/20 02:45:19 INFO parse.ParserJob: ParserJob: finished at 2016-01-20 02:45:19, time elapsed: 00:06:57
CrawlDB update for 1
/usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1453230757-13191 -crawlId 1
16/01/20 02:45:27 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting at 2016-01-20 02:45:27
16/01/20 02:45:27 INFO crawl.DbUpdaterJob: DbUpdaterJob: batchId: 1453230757-13191
16/01/20 02:45:27 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-root/hadoop-unjar5654418190157422003/classes/plugins
16/01/20 02:45:28 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
16/01/20 02:45:28 INFO plugin.PluginRepository: Registered Plugins:
16/01/20 02:45:28 INFO plugin.PluginRepository:     HTTP Framework (lib-http)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Html Parse Plug-in (parse-html)
16/01/20 02:45:28 INFO plugin.PluginRepository:     MetaTags (parse-metatags)
16/01/20 02:45:28 INFO plugin.PluginRepository:     the nutch core extension points (nutch-extensionpoints)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Basic Indexing Filter (index-basic)
16/01/20 02:45:28 INFO plugin.PluginRepository:     XML Libraries (lib-xml)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Anchor Indexing Filter (index-anchor)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Basic URL Normalizer (urlnormalizer-basic)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Language Identification Parser/Filter (language-identifier)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Metadata Indexing Filter (index-metadata)
16/01/20 02:45:28 INFO plugin.PluginRepository:     CyberNeko HTML Parser (lib-nekohtml)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Subcollection indexing and query filter (subcollection)
16/01/20 02:45:28 INFO plugin.PluginRepository:     SOLRIndexWriter (indexer-solr)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Rel-Tag microformat Parser/Indexer/Querier (microformats-reltag)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Http / Https Protocol Plug-in (protocol-httpclient)
16/01/20 02:45:28 INFO plugin.PluginRepository:     JavaScript Parser (parse-js)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Tika Parser Plug-in (parse-tika)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Top Level Domain Plugin (tld)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Regex URL Filter Framework (lib-regex-filter)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Regex URL Normalizer (urlnormalizer-regex)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Link Analysis Scoring Plug-in (scoring-link)
16/01/20 02:45:28 INFO plugin.PluginRepository:     OPIC Scoring Plug-in (scoring-opic)
16/01/20 02:45:28 INFO plugin.PluginRepository:     More Indexing Filter (index-more)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Http Protocol Plug-in (protocol-http)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Creative Commons Plugins (creativecommons)
16/01/20 02:45:28 INFO plugin.PluginRepository: Registered Extension-Points:
16/01/20 02:45:28 INFO plugin.PluginRepository:     Parse Filter (org.apache.nutch.parse.ParseFilter)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch Index Cleaning Filter (org.apache.nutch.indexer.IndexCleaningFilter)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch Content Parser (org.apache.nutch.parse.Parser)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch URL Filter (org.apache.nutch.net.URLFilter)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch Protocol (org.apache.nutch.protocol.Protocol)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
16/01/20 02:45:28 INFO plugin.PluginRepository:     Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
16/01/20 02:45:29 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
16/01/20 02:45:29 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
16/01/20 02:45:29 INFO Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
16/01/20 02:45:29 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/01/20 02:45:29 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x60a2630a connecting to ZooKeeper ensemble=localhost:2181
16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:host.name=cism479
16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:java.version=1.8.0_65
16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
16/01/20 02:45:29 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/jdk1.8.0_65/jre
16/01/20 02:45:35 INFO zookeeper.ClientCnxn: EventThread shut down
16/01/20 02:45:35 INFO mapreduce.JobSubmitter: number of splits:2
16/01/20 02:45:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1453210838763_0011
16/01/20 02:45:36 INFO impl.YarnClientImpl: Submitted application application_1453210838763_0011
16/01/20 02:45:36 INFO mapreduce.Job: The url to track the job: http://cism479:8088/proxy/application_1453210838763_0011/
16/01/20 02:45:36 INFO mapreduce.Job: Running job: job_1453210838763_0011
16/01/20 02:45:48 INFO mapreduce.Job: Job job_1453210838763_0011 running in uber mode : false
16/01/20 02:45:48 INFO mapreduce.Job:  map 0% reduce 0%
16/01/20 02:47:31 INFO mapreduce.Job:  map 33% reduce 0%
16/01/20 02:47:47 INFO mapreduce.Job:  map 50% reduce 0%
16/01/20 02:48:08 INFO mapreduce.Job:  map 83% reduce 0%
16/01/20 02:48:16 INFO mapreduce.Job:  map 100% reduce 0%
16/01/20 02:48:31 INFO mapreduce.Job:  map 100% reduce 31%
16/01/20 02:48:34 INFO mapreduce.Job:  map 100% reduce 33%
16/01/20 02:50:30 INFO mapreduce.Job:  map 100% reduce 34%
16/01/20 03:01:18 INFO mapreduce.Job:  map 100% reduce 35%
16/01/20 03:11:58 INFO mapreduce.Job:  map 100% reduce 36%
16/01/20 03:22:50 INFO mapreduce.Job:  map 100% reduce 37%
16/01/20 03:24:22 INFO mapreduce.Job:  map 100% reduce 50%
16/01/20 03:24:35 INFO mapreduce.Job:  map 100% reduce 82%
16/01/20 03:24:38 INFO mapreduce.Job:  map 100% reduce 83%
16/01/20 03:26:33 INFO mapreduce.Job:  map 100% reduce 84%
16/01/20 03:37:35 INFO mapreduce.Job:  map 100% reduce 85%
16/01/20 03:39:38 INFO mapreduce.Job: Task Id : attempt_1453210838763_0011_r_000001_0, Status : FAILED
Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767
    at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506)
    at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487)
    at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89)
    at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208)
    at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79)
    at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
    at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
    at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114)
    at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

16/01/20 03:39:39 INFO mapreduce.Job:  map 100% reduce 50%
16/01/20 03:39:52 INFO mapreduce.Job:  map 100% reduce 82%
16/01/20 03:39:55 INFO mapreduce.Job:  map 100% reduce 83%
16/01/20 03:41:56 INFO mapreduce.Job:  map 100% reduce 84%
16/01/20 03:53:39 INFO mapreduce.Job:  map 100% reduce 85%
16/01/20 03:55:49 INFO mapreduce.Job: Task Id : attempt_1453210838763_0011_r_000001_1, Status : FAILED
Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767
    at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506)
    at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487)
    at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89)
    at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208)
    at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79)
    at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
    at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
    at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114)
    at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

16/01/20 03:55:50 INFO mapreduce.Job:  map 100% reduce 50%
16/01/20 03:56:01 INFO mapreduce.Job:  map 100% reduce 83%
16/01/20 03:58:02 INFO mapreduce.Job:  map 100% reduce 84%
16/01/20 04:10:09 INFO mapreduce.Job:  map 100% reduce 85%
16/01/20 04:12:33 INFO mapreduce.Job: Task Id : attempt_1453210838763_0011_r_000001_2, Status : FAILED
Error: java.lang.IllegalArgumentException: Row length 41221 is > 32767
    at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:506)
    at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:487)
    at org.apache.hadoop.hbase.client.Get.<init>(Get.java:89)
    at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:208)
    at org.apache.gora.hbase.store.HBaseStore.get(HBaseStore.java:79)
    at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:156)
    at org.apache.gora.store.impl.DataStoreBase.get(DataStoreBase.java:56)
    at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:114)
    at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:42)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

16/01/20 04:12:34 INFO mapreduce.Job:  map 100% reduce 50%
16/01/20 04:12:45 INFO mapreduce.Job:  map 100% reduce 82%
16/01/20 04:12:48 INFO mapreduce.Job:  map 100% reduce 83%
16/01/20 04:14:46 INFO mapreduce.Job:  map 100% reduce 84%
16/01/20 04:26:53 INFO mapreduce.Job:  map 100% reduce 85%
16/01/20 04:29:09 INFO mapreduce.Job:  map 100% reduce 100%
16/01/20 04:29:10 INFO mapreduce.Job: Job job_1453210838763_0011 failed with state FAILED due to: Task failed task_1453210838763_0011_r_000001
Job failed as tasks failed. failedMaps:0 failedReduces:1

16/01/20 04:29:11 INFO mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=38378343
        FILE: Number of bytes written=115957636
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=2382
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=2
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Failed reduce tasks=4
        Launched map tasks=2
        Launched reduce tasks=5
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=789909
        Total time spent by all reduces in occupied slots (ms)=30215090
        Total time spent by all map tasks (ms)=263303
        Total time spent by all reduce tasks (ms)=6043018
        Total vcore-seconds taken by all map tasks=263303
        Total vcore-seconds taken by all reduce tasks=6043018
        Total megabyte-seconds taken by all map tasks=808866816
        Total megabyte-seconds taken by all reduce tasks=30940252160
    Map-Reduce Framework
        Map input records=49929
        Map output records=1777904
        Map output bytes=382773368
        Map output materialized bytes=77228942
        Input split bytes=2382
        Combine input records=0
        Combine output records=0
        Reduce input groups=754170
        Reduce shuffle bytes=38318183
        Reduce input records=881156
        Reduce output records=754170
        Spilled Records=2659060
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=17993
        CPU time spent (ms)=819690
        Physical memory (bytes) snapshot=4080136192
        Virtual memory (bytes) snapshot=15234293760
        Total committed heap usage (bytes)=4149739520
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
Exception in thread "main" java.lang.RuntimeException: job failed: name=[1]update-table, jobid=job_1453210838763_0011
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:111)
    at org.apache.nutch.crawl.DbUpdaterJob.updateTable(DbUpdaterJob.java:140)
    at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:174)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:178)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Error running:
  /usr/share/searchEngine/nutch-branch-2.3.1/runtime/deploy/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1453230757-13191 -crawlId 1
Failed with exit value 1.
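The stack trace narrows the failure down to HBase's row-key length check: org.apache.hadoop.hbase.client.Mutation.checkRow rejects any row key longer than HConstants.MAX_ROW_LENGTH (Short.MAX_VALUE, i.e. 32767 bytes). In Nutch 2.x the row key of the webpage table is the (reversed) page URL, so a roughly 41 KB URL picked up as an outlink makes the Get call in DbUpdateReducer fail. Below is a minimal sketch that reproduces the check in isolation; it assumes only the standard hbase-client dependency on the classpath, and the 41221-byte key simply mirrors the length reported in the log:

import org.apache.hadoop.hbase.client.Get;

// Minimal sketch: triggers the same row-length validation that fails in the
// reducer. The Get constructor validates its row key via Mutation.checkRow,
// which is the Get.<init> -> Mutation.checkRow path seen in the stack trace.
public class RowLengthCheckDemo {
    public static void main(String[] args) {
        byte[] hugeRowKey = new byte[41221]; // same length as in the error log
        try {
            new Get(hugeRowKey);
        } catch (IllegalArgumentException e) {
            // Prints something like: Row length 41221 is > 32767
            System.out.println(e.getMessage());
        }
    }
}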

1 answer

陈高寒
2023-03-14

I had the same problem with the same stack.

I solved it by changing the HConstants.java file in hbase-common-0.98.17-hadoop2.jar (the jar is present both in Nutch under nutch/build/lib and in HBase under /hbase/lib).

I removed this line:

public static final short MAX_ROW_LENGTH = Short.MAX_VALUE;

and replaced it with:

public static final long MAX_ROW_LENGTH = Long.MAX_VALUE;
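
For context, here is a rough sketch of what the patched region of HConstants.java might look like after that change; everything else in the class is unchanged and omitted, and the exact position of the declaration can differ between HBase releases:

// Sketch of the patched constant in org.apache.hadoop.hbase.HConstants
// (all other members omitted). The stock value, Short.MAX_VALUE = 32767,
// is exactly the limit reported in the error log; Mutation.checkRow compares
// every row key's length against this constant, so widening it lifts the cap.
public final class HConstants {

    // Stock declaration, removed as described above:
    // public static final short MAX_ROW_LENGTH = Short.MAX_VALUE;

    // Replacement declaration with a wider type:
    public static final long MAX_ROW_LENGTH = Long.MAX_VALUE;

    private HConstants() {
        // constants holder, never instantiated
    }
}

Note that the patched jar has to be picked up everywhere the check runs, i.e. both by the job jar Nutch ships to the reduce tasks and by the copy under /hbase/lib; if one side still carries the stock class, the 32767-byte limit is still enforced there. A less invasive alternative (not part of the answer above) is to keep the stock HBase jars and instead drop such extreme URLs with a URL filter, so they never become row keys in the first place.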