Nutch does not index properly on Elasticsearch when using MongoDB

袁晟

2023-03-14

Question:

I am running Nutch 2.3.1, MongoDB 3.2.9, and Elasticsearch 2.4.1. I followed this tutorial:

https://qbox.io/blog/scraping-the-web-with-nutch-for-elasticsearch

and this tutorial:

http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/

in order to build a web crawler out of these three pieces of software.

Everything worked fine until it came to indexing. Whenever I run Nutch's index command:

# bin/nutch index elasticsearch -all

this is all that happens:

IndexingJob: starting
Active IndexWriters :
ElasticIndexWriter
        elastic.cluster : elastic prefix cluster
        elastic.host : hostname
        elastic.port : port (default 9300)
        elastic.index : elastic index command
        elastic.max.bulk.docs : ealstic bulk index doc counts. (default 250)
        elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

IndexingJob: done.

My nutch-site.xml:

<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>AOssama Crawler</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>

  <property>
    <name>elastic.cluster</name>
    <value>aossama</value>
  </property>

  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>6553600</value>
  </property>
</configuration>

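I also looked at the ElasticIndexWriter.java code and noticed that the ElasticIndexWriter class is invoked around line 250. I am digging into that further, but I have no idea why indexing does not work with Mongo. I am close to giving up and trying HBase instead, which I would rather not use.

Thanks!

Joe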

Answer:

After a lot of trouble, I got it working. I ended up using ES 1.4.4, Nutch 2.3.1, MongoDB 3.10, and JDK 8.

Many of the problems I ran into are still unanswered in other threads:

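1. (This one is simple, but...) Make sure everything is running. Make sure Elasticsearch is running on the right machine and on the right port, and that you can talk to it. Make sure MongoDB is up and running on the right port, and that you can talk to it.
2. Use the correct index command. For Nutch 2.3.1 it is ./bin/nutch index -all (run it after fetching and parsing). If you get a Solr error, your nutch-site.xml does not have the correct indexer configured.
3. Name the cluster the SAME THING in elasticsearch.yml and in nutch-site.xml. This is huge. It was the main reason my index command ran without throwing any errors.
4. Versioning. I tried this with newer versions of Elasticsearch and kept running into problems. I will try to build it on the latest versions of Elasticsearch and Mongo and report back in this thread. Try the same build I used first, and only then move on to other versions. The Elasticsearch version seems to be the most important part of getting this to work with Nutch, because it is tied to gora through the settings in ivy/ivy.xml and in indexer-elastic/plugin.xml.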
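
The first and third checks above can be automated. What follows is only a minimal sketch in Python, not part of Nutch or Elasticsearch: the helper names (nutch_property, es_cluster_name, cluster_names_match, port_open) are made up for illustration, and the sketch assumes that elasticsearch.yml sets the name with a flat "cluster.name: value" line and that the files are passed in as text.

```python
import socket
import xml.etree.ElementTree as ET


def nutch_property(site_xml_text, name):
    """Return the <value> of the named <property> in nutch-site.xml text, or None."""
    root = ET.fromstring(site_xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


def es_cluster_name(yml_text):
    """Return cluster.name from elasticsearch.yml text, or None if it is not set.

    Assumption: only the flat `cluster.name: value` form is handled.
    Commented-out lines are skipped.
    """
    for line in yml_text.splitlines():
        line = line.strip()
        if line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if key.strip() == "cluster.name":
            # Drop a trailing comment and any surrounding quotes.
            return value.split(" #")[0].strip().strip("'\"")
    return None


def cluster_names_match(site_xml_text, yml_text):
    """True only if elastic.cluster (Nutch) equals cluster.name (Elasticsearch)."""
    nutch_name = nutch_property(site_xml_text, "elastic.cluster")
    return nutch_name is not None and nutch_name == es_cluster_name(yml_text)


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

To use it, read nutch-site.xml and elasticsearch.yml from wherever your installation keeps them and pass their contents to cluster_names_match. For the connectivity check, port_open("localhost", 9300) probes the default elastic.port shown in the index command's output, and port_open("localhost", 27017) probes MongoDB's default port. Adjust the host and ports if your setup differs.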

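Please, please, please let me know if you have any questions about this. It took me almost 2 weeks to figure out this build, and I know how frustrating it can be. If you run into problems, PM me or post here, and I am sure I can help you sort them out.

Joe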
