https://github.com/scrapy-plugins/scrapy-splash#why-not-use-the-splash-http-api-directly
The obvious alternative to scrapy-splash would be to send requests directly to the Splash HTTP API. Take a look at the example below and make sure to read the observations after it:
```python
import json

import scrapy
from scrapy.http.headers import Headers

RENDER_HTML_URL = "http://127.0.0.1:8050/render.html"


class MySpider(scrapy.Spider):
    name = "myspider"  # a spider needs a name to be runnable via "scrapy crawl"
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            body = json.dumps({"url": url, "wait": 0.5}, sort_keys=True)
            headers = Headers({'Content-Type': 'application/json'})
            yield scrapy.Request(RENDER_HTML_URL, self.parse, method="POST",
                                 body=body, headers=headers)

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # ...
```
It works and is easy enough, but there are some issues that you should be aware of:
1. There is a bit of boilerplate.

2. As seen by Scrapy, we're sending requests to RENDER_HTML_URL instead of the target URLs. This affects concurrency and politeness settings: CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY, etc. could behave in unexpected ways, since delays and concurrency limits are no longer applied per target domain.

3. As seen by Scrapy, response.url is the URL of the Splash server, not of the page being rendered. scrapy-splash fixes it to be the URL of the requested page; the actual Splash endpoint URL remains available as response.real_url.

4. Some options depend on each other - for example, if you use the Splash timeout option, you may also want to set the download_timeout scrapy.Request meta key so that Scrapy and Splash time out consistently (see the SplashRequest sketch below).

5. It is easy to get subtly wrong - e.g. if you don't pass the sort_keys=True argument when preparing the JSON body, the binary POST body can differ even when all keys and values are the same, which means the dupefilter and the HTTP cache will work incorrectly.

6. Splash "Bad Request" (HTTP 400) errors are hard to debug, because by default Scrapy does not display the content of error responses; scrapy-splash logs the content of HTTP 400 responses (this can be turned off with the SPLASH_LOG_400 = False option, see the settings sketch below).

7. Some Splash arguments (e.g. lua_source) may take a lot of space when saved to Scrapy disk request queues. scrapy-splash provides a way to store such static parameters only once.

8. Splash 2.1+ can save network traffic by caching large static arguments on the Splash server, but this requires client support: the client should send proper save_args and load_args values and handle HTTP 498 responses. scrapy-splash implements this via its cache_args feature (see the cache_args sketch below).

scrapy-splash utilities handle these edge cases and reduce the boilerplate.
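For comparison, here is a minimal sketch of the same spider written with scrapy-splash's SplashRequest, assuming the scrapy-splash middlewares are enabled as in the settings sketch further below. The spider name and the timeout value are illustrative; the point is that Scrapy keeps reasoning about the target URL, and the Splash-side timeout is paired with Scrapy's download_timeout meta key.

```python
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "myspider"  # illustrative spider name
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            # SplashRequest builds the POST body for the Splash endpoint for us.
            # By default the downloader slot is chosen per target domain, so
            # CONCURRENT_REQUESTS_PER_DOMAIN / DOWNLOAD_DELAY behave as expected,
            # and response.url is fixed to the target page URL.
            yield SplashRequest(
                url,
                self.parse,
                args={"wait": 0.5, "timeout": 60},  # Splash-side options
                meta={"download_timeout": 60},      # keep Scrapy's timeout in sync
            )

    def parse(self, response):
        # response.body is the HTML rendered by Splash;
        # the Splash endpoint URL is still available as response.real_url.
        ...
```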
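The last two points are what the cache_args request parameter addresses: arguments listed there (typically a large lua_source script) are cached on the Splash server via save_args/load_args, HTTP 498 "cache miss" responses are handled by resending the full values, and duplicate copies of the script are not stored in disk request queues. A minimal sketch, with a deliberately tiny Lua script (a real script is usually much larger, which is when caching pays off):

```python
import scrapy
from scrapy_splash import SplashRequest

# Illustrative Lua script for Splash's /execute endpoint.
LUA_SCRIPT = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(0.5)
    return splash:html()
end
"""


class MyLuaSpider(scrapy.Spider):
    name = "my_lua_spider"  # illustrative spider name
    start_urls = ["http://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint="execute",               # run the Lua script
                args={"lua_source": LUA_SCRIPT},
                cache_args=["lua_source"],        # send the script body only once
            )

    def parse(self, response):
        # response.body is whatever main() returned -- here, the rendered HTML.
        ...
```

Note that cache_args relies on scrapy_splash.SplashDeduplicateArgsMiddleware being enabled in SPIDER_MIDDLEWARES, which is part of the settings sketch below.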
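The dupefilter/cache and 400-logging points are handled at the settings level. A sketch of the relevant settings.py entries, following the configuration section of this README (the middleware order numbers are the ones the README recommends; adjust SPLASH_URL to your Splash instance):

```python
# settings.py (sketch)

SPLASH_URL = "http://127.0.0.1:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    # Avoids storing duplicate Splash arguments (e.g. lua_source) in disk
    # request queues; also needed by cache_args.
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware dupefilter and HTTP cache: request fingerprints are computed
# from canonicalized Splash arguments, so equivalent requests are recognized
# as duplicates regardless of JSON key order.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"

# HTTP 400 responses from Splash are logged by default to help debugging;
# set this to False to silence them.
SPLASH_LOG_400 = True
```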