Question:

Nutch 1.11 (1.x) integration with Solr 5.3.1 (5.x)

姚信鸥
2023-03-14

I have just started using Nutch 1.11 and Solr 5.3.1.

I want to crawl data with Nutch, then index it with Solr and make it searchable.

bin/solr start
bin/solr create -c files -d example/files/conf
http://localhost:8983/solr/#/files
bin/nutch index crawl/crawldb \
-linkdb crawl/linkdb \
-params solr.server.url=127.0.0.1:8983/solr/files \
-dir crawl/segments

I hoped that Solr 5's new automatic schema feature would let me rest easy, but I got the following error (copied from the log file):

WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO  segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s1.
INFO  segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s2.
INFO  segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s3.
INFO  indexer.IndexingJob - Indexer: starting at 2015-12-14 15:21:39
INFO  indexer.IndexingJob - Indexer: deleting gone documents: false
INFO  indexer.IndexingJob - Indexer: URL filtering: false
INFO  indexer.IndexingJob - Indexer: URL normalizing: false
INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
INFO  indexer.IndexingJob - Active IndexWriters :
SolrIndexWriter
    solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
    solr.server.url : URL of the Solr instance (mandatory)
    solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
    solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.commit.size : buffer size when sending to Solr (default 1000)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s1
INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s2
INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s3
WARN  conf.Configuration - file:/tmp/hadoop-user/mapred/staging/user117437667/.staging/job_local117437667_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
WARN  conf.Configuration - file:/tmp/hadoop-user/mapred/staging/user117437667/.staging/job_local117437667_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
WARN  conf.Configuration - file:/tmp/hadoop-user/mapred/local/localRunner/user/job_local117437667_0001/job_local117437667_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
WARN  conf.Configuration - file:/tmp/hadoop-user/mapred/local/localRunner/user/job_local117437667_0001/job_local117437667_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
INFO  solr.SolrMappingReader - source: content dest: content
INFO  solr.SolrMappingReader - source: title dest: title
INFO  solr.SolrMappingReader - source: host dest: host
INFO  solr.SolrMappingReader - source: segment dest: segment
INFO  solr.SolrMappingReader - source: boost dest: boost
INFO  solr.SolrMappingReader - source: digest dest: digest
INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
INFO  solr.SolrIndexWriter - Indexing 250 documents
INFO  solr.SolrIndexWriter - Deleting 0 documents
INFO  solr.SolrIndexWriter - Indexing 250 documents
WARN  mapred.LocalJobRunner - job_local117437667_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>

</body>
</html>

    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>

</body>
</html>

    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:512)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:134)
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:356)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)

I recall that this error

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html.

is related to the Solr URL, but I double-checked the URL I used, 127.0.0.1:8983/solr/files, and I believe it is correct.
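
The 404 in the log above was reported for /solr/update rather than /solr/files/update, which suggests the indexer may not have used the URL given on the command line. SolrJ sends updates to the base URL plus /update, so one way to sanity-check a URL outside Nutch is sketched below. This is a minimal bash sketch, assuming curl is installed and Solr is running locally; the function names are hypothetical and are not part of Nutch or Solr.

```shell
#!/usr/bin/env bash
# Sketch only: these helper names are hypothetical, not Nutch or Solr commands.

# Prepend http:// when the value has no scheme, and drop one trailing slash.
normalize_solr_url() {
  local url="$1"
  case "$url" in
    http://*|https://*) ;;      # scheme already present, leave as-is
    *) url="http://$url" ;;     # bare host:port/path, assume plain HTTP
  esac
  printf '%s\n' "${url%/}"
}

# Build the update endpoint that a client would call for a given base URL.
solr_update_url() {
  printf '%s/update\n' "$(normalize_solr_url "$1")"
}

# Print the HTTP status code returned by the update endpoint.
# Requires curl and a reachable Solr instance.
check_solr_update() {
  curl -s -o /dev/null -w '%{http_code}\n' "$(solr_update_url "$1")"
}
```

For example, `solr_update_url 127.0.0.1:8983/solr/files` prints the endpoint for the files core, and `check_solr_update 127.0.0.1:8983/solr/files` prints the status code Solr returns for it. A 404 there would mean the path does not exist on the server.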

I then tried the solrindex command with the full URL:

bin/nutch solrindex http://127.0.0.1:8983/solr/files crawl/crawldb -linkdb crawl/linkdb crawl/segments/s1

Error message:

INFO  solr.SolrIndexWriter - Indexing 250 documents
INFO  solr.SolrIndexWriter - Deleting 0 documents
INFO  solr.SolrIndexWriter - Indexing 250 documents
INFO  solr.SolrIndexWriter - Deleting 0 documents
INFO  solr.SolrIndexWriter - Indexing 250 documents
WARN  mapred.LocalJobRunner - job_local1306504137_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Unable to invoke function processAdd in script: update-script.js: Can't unambiguously select between fixed arity signatures [(java.lang.String, java.io.Reader), (java.lang.String, java.lang.String)] of the method org.apache.solr.analysis.TokenizerChain.tokenStream for argument types [java.lang.String, null]
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Unable to invoke function processAdd in script: update-script.js: Can't unambiguously select between fixed arity signatures [(java.lang.String, java.io.Reader), (java.lang.String, java.lang.String)] of the method org.apache.solr.analysis.TokenizerChain.tokenStream for argument types [java.lang.String, null]
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:134)
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:356)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)

1 answer

封飞
2023-03-14

Instead, try this command to integrate Solr and Nutch:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/