Why not use the Splash HTTP API directly?

壤驷高洁

2023-12-01

https://github.com/scrapy-plugins/scrapy-splash#why-not-use-the-splash-http-api-directly

The obvious alternative to scrapy-splash would be to send requests directly to the Splash HTTP API. Take a look at the example below and make sure to read the observations after it:

import json

import scrapy
from scrapy.http.headers import Headers

RENDER_HTML_URL = "http://127.0.0.1:8050/render.html"

class MySpider(scrapy.Spider):
    name = "myspider"  # Scrapy requires every spider to have a name
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            # sort_keys=True keeps the body byte-identical for equal arguments
            body = json.dumps({"url": url, "wait": 0.5}, sort_keys=True)
            headers = Headers({'Content-Type': 'application/json'})
            yield scrapy.Request(RENDER_HTML_URL, self.parse, method="POST",
                                 body=body, headers=headers)

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # ...
        pass

It works and is easy enough, but there are some issues that you should be aware of:

There is a bit of boilerplate.
As seen by Scrapy, we're sending requests to RENDER_HTML_URL instead of the target URLs. This affects concurrency and politeness settings: CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY, etc. could behave in unexpected ways, since delays and concurrency limits are no longer applied per target domain.
As seen by Scrapy, response.url is the URL of the Splash server. scrapy-splash fixes it to be the URL of the requested page. The "real" URL is still available as response.real_url.
Some options depend on each other. For example, if you use the Splash timeout option, you may want to set the download_timeout scrapy.Request meta key as well.
It is easy to get it subtly wrong. For example, if you don't pass sort_keys=True when preparing the JSON body, the binary POST body content could vary even if all keys and values are the same, which means the dupefilter and cache will work incorrectly.
The default Scrapy duplicates filter doesn't take Splash specifics into account. For example, if a URL is sent in a JSON POST request body, Scrapy will compute the request fingerprint without canonicalizing this URL.
Splash Bad Request (HTTP 400) errors are hard to debug because by default the response content is not displayed by Scrapy. SplashMiddleware logs the content of HTTP 400 Splash responses by default (this can be turned off by setting the SPLASH_LOG_400 = False option).
Cookie handling is tedious to implement, and you can't use Scrapy's built-in cookies middleware to handle cookies when working with Splash.
Large Splash arguments which don't change with every request (e.g. lua_source) may take a lot of space when saved to Scrapy disk request queues. scrapy-splash provides a way to store such static parameters only once.
Splash 2.1+ provides a way to save network traffic by caching large static arguments on the server, but it requires client support: the client should send proper save_args and load_args values and handle HTTP 498 responses.
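
The sort_keys and fingerprinting pitfalls above can be illustrated without Scrapy at all. The sketch below is only an approximation: body_fingerprint is a hypothetical helper that hashes the POST body with SHA-1 as a stand-in for a request fingerprint (Scrapy's real fingerprint covers more than the body), and the query-string URLs are made up for illustration.

```python
import hashlib
import json


def body_fingerprint(body: str) -> str:
    """Hypothetical stand-in for a request fingerprint: SHA-1 of the POST body."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


# Same keys and values, different insertion order.
args_a = {"url": "http://example.com", "wait": 0.5}
args_b = {"wait": 0.5, "url": "http://example.com"}

# Without sort_keys the serialized bodies differ, and so do the hashes.
unsorted_differs = (
    body_fingerprint(json.dumps(args_a)) != body_fingerprint(json.dumps(args_b))
)

# With sort_keys=True both dicts serialize to identical text.
sorted_matches = (
    body_fingerprint(json.dumps(args_a, sort_keys=True))
    == body_fingerprint(json.dumps(args_b, sort_keys=True))
)

# Equivalent target URLs that differ only in query-parameter order still
# yield different bodies, because nothing canonicalizes a URL inside JSON.
body_1 = json.dumps({"url": "http://example.com/foo?a=1&b=2", "wait": 0.5}, sort_keys=True)
body_2 = json.dumps({"url": "http://example.com/foo?b=2&a=1", "wait": 0.5}, sort_keys=True)
uncanonicalized_differs = body_fingerprint(body_1) != body_fingerprint(body_2)

print(unsorted_differs, sorted_matches, uncanonicalized_differs)
```

In a real crawl, the first and third cases would let the same page be scheduled or cached twice.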

scrapy-splash utilities handle such edge cases and reduce the boilerplate.
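
For comparison, handling just one of these cases by hand, the dependency between the Splash timeout option and the download_timeout meta key, might look like the sketch below. It assumes the raw HTTP API approach from the example above; build_render_kwargs and TIMEOUT_MARGIN are hypothetical names, and the 5-second margin is an arbitrary illustrative value.

```python
import json

# Hypothetical safety margin (seconds) added on top of the Splash timeout,
# so that Splash gives up on rendering before Scrapy's downloader does.
TIMEOUT_MARGIN = 5.0


def build_render_kwargs(url, wait=0.5, timeout=30.0):
    """Return keyword arguments for a scrapy.Request aimed at render.html.

    The Splash-side timeout and Scrapy's download_timeout are derived
    from the same value, so they cannot drift apart.
    """
    body = json.dumps(
        {"url": url, "wait": wait, "timeout": timeout},
        sort_keys=True,  # keep the body stable for dupefilter and cache
    )
    return {
        "method": "POST",
        "body": body,
        "headers": {"Content-Type": "application/json"},
        "meta": {"download_timeout": timeout + TIMEOUT_MARGIN},
    }


kwargs = build_render_kwargs("http://example.com", timeout=20.0)
print(kwargs["body"])
print(kwargs["meta"])
```

The result could then be passed on as scrapy.Request(RENDER_HTML_URL, self.parse, **kwargs). Every other item in the list would need similar glue code.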
