
How to use Scrapy with both Splash and Tor (via Privoxy) in Docker Compose

郁鸿博
2023-03-14
Question

I am trying to run a Scrapy spider with two "extensions":

  1. Splash, to render JavaScript,
  2. Tor-Privoxy, to provide anonymity.

As an example, I am using the quotes.toscrape.com scraper from https://github.com/scrapy-plugins/scrapy-splash/tree/master/example. This is my directory structure:

.
├── docker-compose.yml
└── example
    ├── Dockerfile
    ├── scrapy.cfg
    └── scrashtest
        ├── __init__.py
        ├── settings.py
        └── spiders
            ├── __init__.py
            └── quotes.py

where the example directory is cloned from the scrapy-splash repository. I added the following docker-compose.yml file:

version: '3'

services:
  scraper:
    build: ./example
    environment:
      - http_proxy=http://tor-privoxy:8118
    links:
      - tor-privoxy
      - splash

  tor-privoxy:
    image: rdsubhas/tor-privoxy-alpine

  splash:
    image: scrapinghub/splash

In settings.py I changed SPLASH_URL:

# SPLASH_URL = 'http://127.0.0.1:8050/'
SPLASH_URL = 'http://splash:8050'

because Splash does not run on localhost but in a separate linked container named splash. The Dockerfile for the scraper is:

FROM python:alpine
RUN apk --update add libxml2-dev libxslt-dev libffi-dev gcc musl-dev libgcc openssl-dev curl bash
RUN pip install scrapy scrapy-splash
COPY . /scraper
WORKDIR /scraper
CMD ["scrapy", "crawl", "quotes"]

The problem is that when I build this with docker-compose build and run it with docker-compose up, I get the following logs:

Starting examplecompose_tor-privoxy_1
Starting examplecompose_splash_1
Recreating examplecompose_scraper_1
Attaching to examplecompose_splash_1, examplecompose_tor-privoxy_1, examplecompose_scraper_1
splash_1       | 2017-07-11 16:10:13+0000 [-] Log opened.
splash_1       | 2017-07-11 16:10:13.794595 [-] Splash version: 3.0
tor-privoxy_1  | 2017-07-11 16:10:13.568 7f08e999eee8 Info: Privoxy version 3.0.23
tor-privoxy_1  | 2017-07-11 16:10:13.568 7f08e999eee8 Info: Program name: privoxy
tor-privoxy_1  | Jul 11 16:10:13.578 [notice] Tor v0.2.6.10 (git-58c51dc6087b0936) running on Linux with Libevent 2.0.22-stable, OpenSSL 1.0.2d and Zlib 1.2.8.
tor-privoxy_1  | Jul 11 16:10:13.578 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://www.torproject.org/download/download#warning
splash_1       | 2017-07-11 16:10:13.795925 [-] Qt 5.9.1, PyQt 5.9, WebKit 602.1, sip 4.19.3, Twisted 16.1.1, Lua 5.2
splash_1       | 2017-07-11 16:10:13.796204 [-] Python 3.5.2 (default, Nov 17 2016, 17:05:23) [GCC 5.4.0 20160609]
tor-privoxy_1  | Jul 11 16:10:13.578 [notice] Configuration file "/etc/tor/torrc" not present, using reasonable defaults.
tor-privoxy_1  | Jul 11 16:10:13.581 [notice] Opening Socks listener on 127.0.0.1:9050
splash_1       | 2017-07-11 16:10:13.796541 [-] Open files limit: 1048576
tor-privoxy_1  | Jul 11 16:10:13.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
splash_1       | 2017-07-11 16:10:13.796706 [-] Can't bump open files limit
tor-privoxy_1  | Jul 11 16:10:13.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
splash_1       | 2017-07-11 16:10:13.903844 [-] Xvfb is started: ['Xvfb', ':1896918638', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
splash_1       | QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
tor-privoxy_1  | Jul 11 16:10:13.000 [warn] You are running Tor as root. You don't need to, and you probably shouldn't.
splash_1       | 2017-07-11 16:10:13.984515 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
tor-privoxy_1  | Jul 11 16:10:13.000 [notice] Bootstrapped 0%: Starting
splash_1       | 2017-07-11 16:10:14.041562 [-] verbosity=1
splash_1       | 2017-07-11 16:10:14.041732 [-] slots=50
tor-privoxy_1  | Jul 11 16:10:13.000 [notice] Bootstrapped 5%: Connecting to directory server
splash_1       | 2017-07-11 16:10:14.041806 [-] argument_cache_max_entries=500
tor-privoxy_1  | Jul 11 16:10:13.000 [notice] Bootstrapped 80%: Connecting to the Tor network
splash_1       | 2017-07-11 16:10:14.043083 [-] Web UI: enabled, Lua: enabled (sandbox: enabled)
splash_1       | 2017-07-11 16:10:14.044088 [-] Site starting on 8050
splash_1       | 2017-07-11 16:10:14.044240 [-] Starting factory <twisted.web.server.Site object at 0x7f73a4e4b3c8>
tor-privoxy_1  | Jul 11 16:10:14.000 [notice] Bootstrapped 85%: Finishing handshake with first hop
scraper_1      | 2017-07-11 16:10:15 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrashtest)
scraper_1      | 2017-07-11 16:10:15 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'scrashtest', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'NEWSPIDER_MODULE': 'scrashtest.spiders', 'SPIDER_MODULES': ['scrashtest.spiders']}
scraper_1      | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled extensions:
scraper_1      | ['scrapy.extensions.corestats.CoreStats',
scraper_1      |  'scrapy.extensions.telnet.TelnetConsole',
scraper_1      |  'scrapy.extensions.memusage.MemoryUsage',
scraper_1      |  'scrapy.extensions.logstats.LogStats']
scraper_1      | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
scraper_1      | ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
scraper_1      |  'scrapy_splash.SplashCookiesMiddleware',
scraper_1      |  'scrapy_splash.SplashMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
scraper_1      |  'scrapy.downloadermiddlewares.stats.DownloaderStats']
scraper_1      | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled spider middlewares:
scraper_1      | ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
scraper_1      |  'scrapy_splash.SplashDeduplicateArgsMiddleware',
scraper_1      |  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
scraper_1      |  'scrapy.spidermiddlewares.referer.RefererMiddleware',
scraper_1      |  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
scraper_1      |  'scrapy.spidermiddlewares.depth.DepthMiddleware']
scraper_1      | 2017-07-11 16:10:15 [scrapy.middleware] INFO: Enabled item pipelines:
scraper_1      | []
scraper_1      | 2017-07-11 16:10:15 [scrapy.core.engine] INFO: Spider opened
scraper_1      | 2017-07-11 16:10:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
scraper_1      | 2017-07-11 16:10:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
tor-privoxy_1  | Jul 11 16:10:16.000 [notice] Bootstrapped 90%: Establishing a Tor circuit
tor-privoxy_1  | Jul 11 16:10:17.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
tor-privoxy_1  | Jul 11 16:10:17.000 [notice] Bootstrapped 100%: Done
tor-privoxy_1  | Jul 11 16:10:17.000 [warn] Received http status code 404 ("Not found") from server '216.218.222.10:443' while fetching "/tor/keys/fp/585769C78764D58426B8B52B6651A5A71137189A+80550987E1D626E3EBA5E5E75A458DE0626D088C".
scraper_1      | 2017-07-11 16:10:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
scraper_1      | 2017-07-11 16:10:29 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.goodreads.com': <GET https://www.goodreads.com/quotes>
scraper_1      | 2017-07-11 16:10:29 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'scrapinghub.com': <GET https://scrapinghub.com>
tor-privoxy_1  | Jul 11 16:10:44.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
tor-privoxy_1  | Jul 11 16:10:44.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
scraper_1      | 2017-07-11 16:10:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/adulthood/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
scraper_1      | 2017-07-11 16:10:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/be-yourself/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
tor-privoxy_1  | Jul 11 16:10:55.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
tor-privoxy_1  | Jul 11 16:10:55.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
scraper_1      | 2017-07-11 16:10:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/success/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
scraper_1      | 2017-07-11 16:10:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/books/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
tor-privoxy_1  | Jul 11 16:10:56.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
scraper_1      | 2017-07-11 16:10:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
tor-privoxy_1  | Jul 11 16:10:57.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
tor-privoxy_1  | Jul 11 16:10:57.000 [notice] Have tried resolving or connecting to address '[scrubbed]' at 3 different places. Giving up.
scraper_1      | 2017-07-11 16:10:57 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/classic/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error
scraper_1      | 2017-07-11 16:10:57 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/tag/aliteracy/page/1/ via http://splash:8050/render.json> (failed 1 times): 500 Internal Server Error

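I interrupted the process for brevity. The scraper and tor-privoxy services appear to alternate: the scraper reports a 500 Internal Server Error, and tor-privoxy reports that it has "tried resolving or connecting to address" and is giving up.

I am struggling to figure out why http_proxy and Splash do not "work together". Can anyone point me in the right direction?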


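Answer:

Following the Aquarium template project (https://github.com/TeamHG-Memex/aquarium), I found that the trick is to make Splash use Tor, rather than having the spider use it directly.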
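One plausible reading of the logs in the question (an interpretation, not something the source confirms) is that the http_proxy variable applies to every HTTP request the scraper makes, including its calls to http://splash:8050, so Privoxy and Tor are asked to reach a host named splash that only exists on the Compose network. The sketch below uses only the Python standard library to show how an environment-driven client picks a proxy per host. The proxy_for helper is hypothetical, and the no_proxy line is an illustrative alternative that the answer does not use.

```python
import os
from urllib.request import getproxies_environment, proxy_bypass_environment


def proxy_for(host):
    """Return the HTTP proxy an environment-driven client would pick for host.

    Hypothetical helper for illustration; returns None when the host is
    exempted through no_proxy.
    """
    proxies = getproxies_environment()
    if proxy_bypass_environment(host, proxies):
        return None
    return proxies.get("http")


# Environment of the scraper container from the question.
os.environ["http_proxy"] = "http://tor-privoxy:8118"
os.environ.pop("no_proxy", None)
os.environ.pop("NO_PROXY", None)

# Both the target site and the Splash container are routed to Privoxy.
print("quotes.toscrape.com ->", proxy_for("quotes.toscrape.com"))
print("splash              ->", proxy_for("splash"))

# Assumption, not part of the answer: exempting the splash host would keep
# Splash traffic off the proxy, but pages rendered by Splash would then
# leave the network without going through Tor.
os.environ["no_proxy"] = "splash"
print("splash (no_proxy)   ->", proxy_for("splash"))
```

Moving the proxy configuration into Splash, as described next, avoids this because the scraper then talks to Splash directly and only Splash's outgoing page requests pass through Privoxy.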

My adapted project has the following structure:

.
├── docker-compose.yml
├── example
│   ├── Dockerfile
│   ├── scrapy.cfg
│   └── scrashtest
│       ├── __init__.py
│       ├── settings.py
│       └── spiders
│           ├── __init__.py
│           └── quotes.py
└── splash
    └── proxy-profiles
        └── default.ini

The docker-compose.yml file is now:

version: '3'

services:
  scraper:
    build: ./example
    links:
      - splash

  tor-privoxy:
    image: rdsubhas/tor-privoxy-alpine

  splash:
    image: scrapinghub/splash
    volumes:
      - ./splash/proxy-profiles:/etc/splash/proxy-profiles:ro
    links:
      - tor-privoxy

Following http://splash.readthedocs.io/en/stable/api.html#proxy-profiles, I mounted the proxy-profiles directory as a volume in the splash container. The default.ini file reads:

[proxy]

host=tor-privoxy
port=8118

(I also noticed that naming the file default.ini is essential.)
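As far as the Splash documentation describes it, a profile named default.ini is applied whenever a request carries no proxy argument, which would explain why the file name matters. A typo in the profile is easy to miss because the file is only read inside the container, so a quick local check can help. The following is a minimal sketch using Python's configparser; the read_proxy helper is hypothetical, and the profile text is copied from the file above.

```python
import configparser

# Contents of splash/proxy-profiles/default.ini as shown in the answer.
PROFILE = """\
[proxy]

host=tor-privoxy
port=8118
"""


def read_proxy(profile_text):
    """Parse a proxy profile and return (host, port).

    Hypothetical helper; raises KeyError if the [proxy] section or the
    host key is missing, and ValueError if the port is not an integer.
    """
    parser = configparser.ConfigParser()
    parser.read_string(profile_text)
    section = parser["proxy"]
    return section["host"], section.getint("port")


host, port = read_proxy(PROFILE)
print(f"Splash would forward page requests to {host}:{port}")
```

The host must match the Compose service name (tor-privoxy) and the port must match the one Privoxy listens on (8118 in the question's http_proxy setting).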

With this setup, after docker-compose build and docker-compose up, the scraper runs successfully using Splash.
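To check the setup independently of the spider, one option is to ask Splash to render a page over its HTTP API from inside the Compose network. The sketch below builds a request for Splash's render.html endpoint with the standard library. The render_url and fetch_rendered helpers are hypothetical names, and the splash hostname only resolves from a container on the same Compose network.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(splash_url, target, **args):
    """Build a render.html URL for target; extra Splash arguments go in args."""
    query = urlencode({"url": target, **args})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


def fetch_rendered(splash_url, target, timeout=90):
    """Fetch the rendered HTML through Splash (needs a reachable Splash)."""
    with urlopen(render_url(splash_url, target), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only prints the URL; call fetch_rendered() from a container that can
    # reach the splash service to perform the actual request.
    print(render_url("http://splash:8050", "http://quotes.toscrape.com/"))
```

If the request returns HTML instead of a 500 response, Splash was able to reach the target site through the proxy named in default.ini.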


