Question:

Nutch crawl not working

丁鸿云

2023-03-14

I want to crawl a site with Apache Nutch 1.12 and index the data into Apache Solr. I followed this tutorial.

The URL in my seed.txt file is http://nutch.apache.org/

In my regex URL filter I have the following: +^http://([a-z0-9]*.)*nutch.apache.org/
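As a quick sanity check on that rule, here is a minimal Python sketch of how an ordered list of +/- regex rules decides whether a URL is kept. It assumes Python's re module behaves like Java's regex engine for this pattern, and that the first matching rule decides while a URL matching no rule is dropped, which is how I understand Nutch's regex URL filter to work. The helper name accepts and the test URLs are made up for illustration.

```python
import re

# The rule is copied verbatim from the filter line quoted above.
# Each entry is (sign, compiled pattern); "+" keeps a URL, "-" drops it.
RULES = [("+", re.compile(r"^http://([a-z0-9]*.)*nutch.apache.org/"))]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        # search() looks for the pattern anywhere; the ^ anchors it to the start.
        if pattern.search(url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped


for url in (
    "http://nutch.apache.org/",           # the seed URL
    "http://wiki.nutch.apache.org/page",  # hypothetical subdomain
    "https://nutch.apache.org/",          # https is not covered by the rule
    "http://example.com/",                # hypothetical unrelated host
):
    print(url, accepts(url))
```

One thing the sketch makes visible: the dots in the rule as quoted are unescaped, so each one matches any character, and only the http scheme is accepted. If the site redirects to https, outlinks under https would not pass this rule.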

When I try to fetch the data, I only get the URL from the seed.txt file.

Fetcher: starting at 2017-01-03 09:56:23
Fetcher: segment: crawl/segments/20170103095613
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 2 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://nutch.apache.org/ (queue crawl delay=5000ms)
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
robots.txt whitelist not configured.
robots.txt whitelist not configured.
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2
Thread FetcherThread has no more work available
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0

What am I missing here?

1 answer

蓟辰沛

2023-03-14

I ran the fetch step again and got the expected results.
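One plausible reading of why a second run helped: Nutch crawls in rounds (generate, fetch, parse, updatedb), and each round can only fetch URLs that are already in the crawldb. Right after inject, that is just the seed, so the first fetch shows only the seed URL and the outlinks it discovers become fetchable in the next round. The following is a toy Python model of that cycle, not Nutch code. The link graph and the names LINKS and crawl_round are hypothetical.

```python
# Toy model of the generate -> fetch -> parse -> updatedb cycle.
# LINKS is a hypothetical link graph, not real crawl data.
LINKS = {
    "http://nutch.apache.org/": [
        "http://nutch.apache.org/page-a.html",
        "http://nutch.apache.org/page-b.html",
    ],
    "http://nutch.apache.org/page-a.html": [],
    "http://nutch.apache.org/page-b.html": [],
}


def crawl_round(crawldb):
    """Fetch every unfetched URL in crawldb, then record its outlinks."""
    # "generate": pick the URLs that are known but not fetched yet.
    to_fetch = sorted(url for url, fetched in crawldb.items() if not fetched)
    for url in to_fetch:
        crawldb[url] = True  # "fetch" and "parse"
        for outlink in LINKS.get(url, []):
            crawldb.setdefault(outlink, False)  # "updatedb": new URLs are unfetched
    return to_fetch


crawldb = {"http://nutch.apache.org/": False}  # state right after inject
round1 = crawl_round(crawldb)
round2 = crawl_round(crawldb)
print("round 1:", round1)
print("round 2:", round2)
```

In this model the first round fetches only the seed and the second round fetches the two pages it linked to, which matches the behaviour described in the question and in this answer.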
