Question:

Nutch 1.13 fetch of URL failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http

隆飞宇

2023-03-14

Fetch of the http URL failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
    at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:85)
    at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:285)

Using queue mode: byHost
Fetch of the https URL failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
    at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:85)
    at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:285)

I get the errors above when running Nutch 1.13 with Solr 6.6.0.

The command I am using is:

bin/crawl -i -D solr.server.url=http://myip/solr/nutch/ urls/ crawl 2

  <name>plugin.includes</name>
  <value>
protocol-(http|httpclient)|urlfilter-regex|parse-(html)|index-(basic|anchor)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
</value>

    [root@localhost apache-nutch-1.13]# ls plugins
creativecommons      index-more           nutch-extensionpoints   protocol-file                 scoring-similarity         urlnormalizer-ajax
feed                 index-replace        parse-ext               protocol-ftp                  subcollection              urlnormalizer-basic
headings             index-static         parsefilter-naivebayes  protocol-htmlunit             tld                        urlnormalizer-host
index-anchor         language-identifier  parsefilter-regex       protocol-http                 urlfilter-automaton        urlnormalizer-pass
index-basic          lib-htmlunit         parse-html              protocol-httpclient           urlfilter-domain           urlnormalizer-protocol
indexer-cloudsearch  lib-http             parse-js                protocol-interactiveselenium  urlfilter-domainblacklist  urlnormalizer-querystring
indexer-dummy        lib-nekohtml         parse-metatags          protocol-selenium             urlfilter-ignoreexempt     urlnormalizer-regex
indexer-elastic      lib-regex-filter     parse-replace           publish-rabbitmq              urlfilter-prefix           urlnormalizer-slash
indexer-solr         lib-selenium         parse-swf               publish-rabitmq               urlfilter-regex
index-geoip          lib-xml              parse-tika              scoring-depth                 urlfilter-suffix
index-links          microformats-reltag  parse-zip               scoring-link                  urlfilter-validator
index-metadata       mimetype-filter      plugin                  scoring-opic                  urlmeta

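I am stuck on this problem. As you can see, I have included both protocol-(http|httpclient), but the URLs still cannot be fetched. Thanks in advance.

Update to the question: hadoop.log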

    2017-09-01 14:35:07,172 INFO  solr.SolrIndexWriter - SolrIndexer: deleting 1/1 documents
    2017-09-01 14:35:07,321 WARN  output.FileOutputCommitter - Output Path is null in cleanupJob()
    2017-09-01 14:35:07,323 WARN  mapred.LocalJobRunner - job_local1176811933_0001
    java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down
        at LocalJobRunner$Job.run(LocalJobRunner.java:529)
    Caused by: java.lang.IllegalStateException: Connection pool shut down
        at org.apache.http.util.Asserts.check(Asserts.java:34)
        at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
        at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
        at [...]DefaultRequestDirector.java:415)
        at org.apache.ht[...]
        at org.apache.http.impl.client.[...]CloseableHttpClient.java:863)
        at org.apache.http.impl.client.[...]CloseableHttpClient.java:106)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:481)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
        at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
        at org.apache[...]
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
        at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:122)
        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
        at org.apache.hadoop.mapred.ReduceTask.run[...]JobRunner.java:319)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([...]
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
        at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:174)
        at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:174)
        at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:197)
        at org.apache.hadoop.util.ToolRunner.run([...]

1 answer

金令秋

2023-03-14

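I somehow solved this problem. I think the whitespace inside the plugin.includes value in nutch-site.xml was causing it. Here is the new plugin.includes section for other people who come here.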

  <name>plugin.includes</name>
  <value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(html)|index-(basic|anchor)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
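
One possible explanation for why the line breaks matter, as a minimal sketch. It assumes that Nutch compiles the plugin.includes value as a Java regular expression without trimming it and requires each plugin id to match in full; that behaviour is an assumption here, not something confirmed in this thread. The class name PluginIncludesCheck and the method isIncluded are hypothetical, and the regex is shortened to three alternatives for readability.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {

    // Hypothetical helper: true if the whole plugin id matches the
    // plugin.includes value interpreted as a Java regular expression.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Shortened version of the value from the question (assumption: the
        // remaining alternatives behave like the middle one here).
        String core = "protocol-(http|httpclient)|urlfilter-regex|urlnormalizer-(pass|regex|basic)";

        // The question's layout puts a line break after <value> and before
        // </value>, so the raw value starts and ends with a newline.
        String withNewlines = "\n" + core + "\n";

        for (String id : new String[] {"protocol-http", "urlfilter-regex", "urlnormalizer-basic"}) {
            System.out.println(id
                + " multi-line=" + isIncluded(withNewlines, id)
                + " single-line=" + isIncluded(core, id));
        }
    }
}
```

Under that assumption the stray newlines become part of the first and last alternatives only, so ids such as protocol-http stop matching while the alternatives in the middle still match. That would be consistent with the ProtocolNotFound error in the question, and with the fix of writing the value on one line directly between the value tags.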
