如何使用硒和scrapy来自动化该过程？

丌官绍元

2023-03-14

问题内容：

我一开始就知道您需要使用诸如硒之类的webtoolkit来自动进行抓取。

我将如何能够单击Google Play商店上的下一个按钮，以便仅出于我的大学目的刮取评论！

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from selenium import webdriver
import time


class Product(scrapy.Item):
    title = scrapy.Field()


class FooSpider(CrawlSpider):
    name = 'foo'

    start_urls = ["https://play.google.com/store/apps/details?id=com.gaana&hl=en"]

    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        self.download_delay = 0.25
        self.browser = webdriver.Chrome(executable_path="C:\chrm\chromedriver.exe")
        self.browser.implicitly_wait(60) #

    def parse(self,response):
        self.browser.get(response.url)
        sites = response.xpath('//div[@class="single-review"]/div[@class="review-header"]')
        items = []
        for i in range(0,200):
            time.sleep(20)
            button = self.browser.find_element_by_xpath("/html/body/div[4]/div[6]/div[1]/div[2]/div[2]/div[1]/div[2]/button[1]/div[2]/div/div")
            button.click()
            self.browser.implicitly_wait(30)    
            for site in sites:
                item = Product()

                item['title'] = site.xpath('.//div[@class="review-info"]/span[@class="author-name"]/a/text()').extract()
                yield item

我已经更新了代码，一次又一次地重复了40个项目。for循环出了什么问题？

似乎正在更新的源代码没有传递到xpath，这就是为什么它返回相同的40个项目的原因

问题答案：

我会做这样的事情：

from scrapy import CrawlSpider
from selenium import webdriver
import time

class FooSpider(CrawlSpider):
    name = 'foo'
    allow_domains = 'foo.com'
    start_urls = ['foo.com']

    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        self.download_delay = 0.25
        self.browser = webdriver.Firefox()
        self.browser.implicitly_wait(60)

    def parse_foo(self.response):
        self.browser.get(response.url)  # load response to the browser
        button = self.browser.find_element_by_xpath("path") # find 
        # the element to click to
        button.click() # click
        time.sleep(1) # wait until the page is fully loaded
        source = self.browser.page_source # get source of the loaded page
        sel = Selector(text=source) # create a Selector object
        data = sel.xpath('path/to/the/data') # select data
        ...

不过，最好不要等待固定的时间。因此time.sleep(1)，您可以使用http://www.obeythetestinggoat.com/how-
to-get-selenium-to-wait-for-page-load-after-a-
click.html中

介绍的方法之一来代替。

如何使用硒和scrapy来自动化该过程？

相关阅读

相关文章

相关问答

相关工具

相关文档