When I run the following command, I get an error:
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TSolr urls/ TestCrawl/ 2
Above, TSolr is just the name of the Solr core, as you may have guessed.
I am pasting the error log from hadoop.log below:
2016-10-28 16:21:20,982 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: TestCrawl/crawldb
2016-10-28 16:21:20,982 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: TestCrawl/linkdb
2016-10-28 16:21:20,982 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: TestCrawl/segments/20161028161642
2016-10-28 16:21:46,353 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1281422650/.staging/job_local1281422650_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-10-28 16:21:46,355 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1281422650/.staging/job_local1281422650_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-10-28 16:21:46,415 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1281422650_0001/job_local1281422650_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-10-28 16:21:46,416 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1281422650_0001/job_local1281422650_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-10-28 16:21:46,565 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-10-28 16:21:52,308 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: content dest: content
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: title dest: title
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: host dest: host
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: segment dest: segment
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: boost dest: boost
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: digest dest: digest
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2016-10-28 16:21:52,424 INFO solr.SolrIndexWriter - Indexing 42/42 documents
2016-10-28 16:21:52,424 INFO solr.SolrIndexWriter - Deleting 0 documents
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: content dest: content
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: title dest: title
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: host dest: host
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: segment dest: segment
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: boost dest: boost
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: digest dest: digest
2016-10-28 16:21:53,469 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2016-10-28 16:21:53,472 INFO indexer.IndexingJob - Indexer: number of documents indexed, deleted, or skipped:
2016-10-28 16:21:53,476 INFO indexer.IndexingJob - Indexer: 42 indexed (add/update)
2016-10-28 16:21:53,477 INFO indexer.IndexingJob - Indexer: finished at 2016-10-28 16:21:53, elapsed: 00:00:32
2016-10-28 16:21:54,199 INFO indexer.CleaningJob - CleaningJob: starting at 2016-10-28 16:21:54
2016-10-28 16:21:54,344 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-10-28 16:22:19,739 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1653313730/.staging/job_local1653313730_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-10-28 16:22:19,741 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1653313730/.staging/job_local1653313730_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-10-28 16:22:19,797 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1653313730_0001/job_local1653313730_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-10-28 16:22:19,799 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1653313730_0001/job_local1653313730_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-10-28 16:22:19,807 WARN output.FileOutputCommitter - Output Path is null in setupJob()
2016-10-28 16:22:25,113 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: content dest: content
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: title dest: title
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: host dest: host
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: segment dest: segment
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: boost dest: boost
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: digest dest: digest
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2016-10-28 16:22:25,191 INFO solr.SolrIndexWriter - SolrIndexer: deleting 6/6 documents
2016-10-28 16:22:25,300 WARN output.FileOutputCommitter - Output Path is null in cleanupJob()
2016-10-28 16:22:25,301 WARN mapred.LocalJobRunner - job_local1653313730_0001
java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.IllegalStateException: Connection pool shut down
at org.apache.http.util.Asserts.check(Asserts.java:34)
at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:480)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:483)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:464)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:190)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-10-28 16:22:25,841 ERROR indexer.CleaningJob - CleaningJob: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:172)
at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:195)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:206)
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TSolr urls/ TestCrawl/ 1
==> Note that I changed the number of Nutch rounds to "1". This run crawls and indexes successfully.
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TSolr urls/ TestCrawl/ 2
==> This gives me the same error as the hadoop.log pasted above!! So my Solr fails to index whatever Nutch crawls on the second round, i.e. one level deeper than the seed site.
Could the error be caused by the size of the parsed content from the seed site? The seed site is a newspaper company's website, so I am sure the second round (one level deeper) contains a huge amount of parsed data to index. If the problem is the parsed content size, how can I configure my Solr to handle it?
If the error comes from somewhere else, can anyone help me identify what it is and how to fix it?
For anyone who runs into the same problem I did, I thought I would post the solution to the issue I was experiencing.
First of all, Apache Nutch 1.12 does not seem to support Apache Solr 6.x. If you look at the Apache Nutch 1.12 release notes, support for Apache Solr 5.x was only recently added in Nutch 1.12, and support for Solr 6.x is not included. So, instead of Solr 6.2.1, I decided to use Solr 5.5.3, and installed Apache Solr 5.5.3 to work with Apache Nutch 1.12.
As Jorge Luis pointed out, Apache Nutch 1.12 has a bug that produces this error when it works with Apache Solr. The fix will be released at some point in Nutch 1.13, but I don't know when that will be, so I decided to patch the bug myself.
The reason I got the error is that in Nutch's CleaningJob.java the close method is called first and the commit method afterwards, which throws the following exception: java.lang.IllegalStateException: Connection pool shut down.
The fix is actually simple. To see the solution, go here: https://github.com/apache/nutch/pull/156/commits/327e256bb72f0385563021995a9d0e96bb83c4f8
As you can see in the link above, you simply need to relocate the "writers.close();" call.
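The failure mode can be illustrated with a minimal, self-contained sketch (the class and field names below are hypothetical stand-ins, not Nutch's actual code): once close() has released the HttpClient connection pool, a later commit() has no connections to lease, which is exactly the close-then-commit ordering that throws the exception.

```java
// Hypothetical stand-in for SolrIndexWriter: commit() needs the
// connection pool that close() shuts down.
class SolrWriterSketch {
    private boolean poolOpen = true;
    private int pendingDeletes = 6; // mirrors "deleting 6/6 documents" in the log
    int committed = 0;

    void commit() {
        if (!poolOpen) {
            // The path that produced the error in hadoop.log
            throw new IllegalStateException("Connection pool shut down");
        }
        committed += pendingDeletes;
        pendingDeletes = 0;
    }

    void close() {
        poolOpen = false; // releases the HttpClient connection pool
    }
}

public class CleaningJobFixDemo {
    public static void main(String[] args) {
        // Buggy order (Nutch 1.12): close() before commit() -> exception
        SolrWriterSketch buggy = new SolrWriterSketch();
        buggy.close();
        boolean threw = false;
        try {
            buggy.commit();
        } catch (IllegalStateException e) {
            threw = true;
        }
        System.out.println("buggy order throws: " + threw);

        // Fixed order (as in the linked commit): commit() before close()
        SolrWriterSketch fixed = new SolrWriterSketch();
        fixed.commit();
        fixed.close();
        System.out.println("fixed order committed: " + fixed.committed);
    }
}
```

Running the sketch prints `buggy order throws: true` and `fixed order committed: 6`, which is why moving the close call after the pending commit is enough to make the CleaningJob succeed.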
By the way, to apply the fix you need the Nutch src package, not the binary package, since you cannot edit the CleaningJob.java file in the binary distribution. After making the change, run ant and you are all set.
After the fix, I no longer get the error!