Question:

Authentication refused and connection errors when crawling with Nutch

包子航
2023-03-14

I followed the Nutch tutorial at

http://wiki.apache.org/nutch/httpauthenticationschemes#a_note_on_ntlm_domains

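  • I have already set up the auth configuration in the httpclient-auth.xml file:

    name: http.auth.file
    value: httpclient-auth.xml
    description: Authentication configuration file for the "protocol-httpclient" plugin.

    But it did not work for me.

    Did I configure authentication incorrectly, or am I missing something?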

    2014-04-16 05:11:23,712 DEBUG httpclient.HttpMethodDirector - Authorization required
    2014-04-16 05:11:23,712 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
    2014-04-16 05:11:23,731 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
    2014-04-16 05:11:23,733 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@sp.zzz.com:80
    2014-04-16 05:11:23,732 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
    2014-04-16 05:11:23,733 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
    2014-04-16 05:11:23,733 DEBUG httpclient.HttpMethodDirector - Retry authentication
    2014-04-16 05:11:23,733 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
    2014-04-16 05:11:23,734 DEBUG httpclient.HttpMethodBase - Resorting to protocol version default close connection policy
    2014-04-16 05:11:23,733 DEBUG cookie.CookieSpec - Unrecognized cookie attribute: name=HttpOnly, value=null
    2014-04-16 05:11:23,734 DEBUG httpclient.HttpMethodBase - Should NOT close connection, using HTTP/1.1
    2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodBase - Cookie accepted: "PHPSESSID=9f9378mvh9e720f5o3l0ibc1o7"
    2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodDirector - Authenticating with NTLM <any realm>@sp.zzz.com:80
    2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodDirector - Redirect required
    2014-04-16 05:11:23,735 DEBUG params.HttpMethodParams - Credential charset not configured, using HTTP element charset
    2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodBase - Should close connection in response to directive: close
    2014-04-16 05:11:23,735 DEBUG httpclient.HttpConnection - Releasing connection back to connection manager.
    2014-04-16 05:11:23,736 DEBUG httpclient.MultiThreadedHttpConnectionManager - Freeing connection, hostConfig=HostConfiguration[host=www.xxxportal.com]
    2014-04-16 05:11:23,736 DEBUG util.IdleConnectionHandler - Adding connection at: 1397643083736
    2014-04-16 05:11:23,736 DEBUG httpclient.MultiThreadedHttpConnectionManager - Notifying no-one, there are no waiting threads
    2014-04-16 05:11:23,737 DEBUG httpclient.HttpMethodBase - Adding Host request header
    2014-04-16 05:11:23,744 DEBUG httpclient.HttpMethodDirector - Authorization required
    2014-04-16 05:11:23,744 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
    2014-04-16 05:11:23,744 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
    2014-04-16 05:11:23,744 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@sp.zzz.com:80
    2014-04-16 05:11:23,745 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
    2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodDirector - Credentials required
    2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodDirector - Credentials provider not available
    2014-04-16 05:11:23,745 INFO  httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@sp.zzz.com:80
    2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodBase - Resorting to protocol version default close connection policy
    2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodBase - Should NOT close connection, using HTTP/1.1
    2014-04-16 05:11:23,746 DEBUG httpclient.HttpConnection - Releasing connection back to connection manager.
    2014-04-16 05:11:23,746 DEBUG httpclient.MultiThreadedHttpConnectionManager - Freeing connection, hostConfig=HostConfiguration[host=sp.zzz.com]
    

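    I am not behind any proxy, and I have turned off all firewall settings on my system, so I do not understand why I get a connection refused exception.

    I cannot find the exact cause in the log below either.

    Please help me understand what exactly is going wrong here.

    Log part 2: connection refused.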

    2014-04-16 05:11:26,443 INFO  fetcher.Fetcher - * queue: www.xxxportal.com
    2014-04-16 05:11:26,443 INFO  fetcher.Fetcher -   maxThreads    = 1
    2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   inProgress    = 0
    2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   crawlDelay    = 5000
    2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   minCrawlDelay = 0
    2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   nextFetchTime = 1397643088739
    2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   now           = 1397643086444
    2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   0. www.xxxportal.com/profiles/
    2014-04-16 05:11:26,445 INFO  fetcher.Fetcher -   1. www.xxxportal.com/wiki/index.php
    2014-04-16 05:11:26,445 INFO  fetcher.Fetcher -   2. www.xxxportal.com/sop/
    2014-04-16 05:11:26,560 DEBUG httpclient.HttpMethodDirector - Closing the connection.
    2014-04-16 05:11:26,560 INFO  httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect
    2014-04-16 05:11:26,560 DEBUG httpclient.HttpMethodDirector - Connection refused: connect
    java.net.ConnectException: Connection refused: connect
                    at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
                    at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85)
                    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
                    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
                    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
                    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
                    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
                    at java.net.Socket.connect(Socket.java:579)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                    at java.lang.reflect.Method.invoke(Method.java:606)
                    at org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(ReflectionSocketFactory.java:140)
                    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:125)
                    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
                    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
                    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
                    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
                    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
                    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
                    at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:94)
                    at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
                    at org.apache.nutch.protocol.http.api.HttpRobotRulesParser.getRobotRulesSet(HttpRobotRulesParser.java:75)
                    at org.apache.nutch.protocol.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:157)
                    at org.apache.nutch.protocol.http.api.HttpBase.getRobotRules(HttpBase.java:391)
                    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:676)
    2014-04-16 05:11:26,564 INFO  httpclient.HttpMethodDirector - Retrying request
    2014-04-16 05:11:26,565 DEBUG httpclient.HttpConnection - Open connection to www.zzzlearninglounge.com:80
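
A "Connection refused" at the socket level usually means the TCP connection itself was rejected, before any HTTP or NTLM exchange took place. One way to narrow this down is to test plain TCP reachability of the hosts from the same machine, outside Nutch. The sketch below is only an illustration: the helper name `tcp_status` and the 5-second timeout are assumptions, and the host names to try are the ones that appear in the logs above.

```python
import socket


def tcp_status(host, port, timeout=5.0):
    """Attempt a plain TCP connect and classify the outcome.

    Hypothetical helper for diagnosis only; it is not part of Nutch.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # the remote side (or something in between) rejected the connection
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors
        return f"error: {exc}"


# Example (not executed here):
#   tcp_status("www.zzzlearninglounge.com", 80)
#   tcp_status("www.xxxportal.com", 80)
```

If this also reports "refused" for a host, the problem is probably on the network or server side rather than in the Nutch authentication settings.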
    
  • 1 answer

    饶德本
    2023-03-14

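    1) The log clearly shows that NTLM authentication is failing for that particular site.

    First, check the username and password.

    Then check the auth scheme (basic or ntlm), and after that the port on which authentication takes place.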
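
The checks above can be sketched as a small script that reads an auth configuration and reports which credentials would cover the scope seen in the log (`NTLM <any realm>@sp.zzz.com:80`). This is a rough sketch under assumptions: the element and attribute names (`auth-configuration`, `credentials`, `default`, `authscope`, `host`, `port`, `scheme`) follow the layout described on the Nutch wiki page linked in the question, the username and password in the sample are made up, and the matching rule (a missing or empty attribute matches anything) is a simplification rather than Nutch's actual logic.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample configuration; username and password are placeholders.
SAMPLE = """
<auth-configuration>
  <credentials username="crawler" password="secret">
    <authscope host="sp.zzz.com" port="80" scheme="NTLM"/>
  </credentials>
</auth-configuration>
"""


def _scope_matches(scope, host, port, scheme):
    """True if one <default> or <authscope> element covers the target."""
    def attr_ok(name, wanted):
        value = scope.get(name, "").strip()
        # assumption: a missing or empty attribute means "any"
        return value == "" or value.lower() == str(wanted).lower()

    if scope.tag == "default":
        return attr_ok("scheme", scheme)
    if scope.tag == "authscope":
        return (attr_ok("host", host)
                and attr_ok("port", port)
                and attr_ok("scheme", scheme))
    return False


def find_credentials(xml_text, host, port, scheme):
    """Return the usernames whose scopes cover (host, port, scheme)."""
    root = ET.fromstring(xml_text)
    return [
        cred.get("username")
        for cred in root.iter("credentials")
        if any(_scope_matches(s, host, port, scheme) for s in cred)
    ]


print(find_credentials(SAMPLE, "sp.zzz.com", 80, "NTLM"))    # ['crawler']
print(find_credentials(SAMPLE, "sp.zzz.com", 8080, "NTLM"))  # []
```

An empty result for the host, port, and scheme shown in the log would be consistent with the "Credentials required" and "Credentials provider not available" lines, which may indicate that no configured credentials matched that scope.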
