Question:

Selenium: scrape a page until all products are loaded

晋奕
2023-03-14

I am new to Selenium and am working on a project that needs to scrape URLs from a page.

Source: https://www.autofurnish.com/audi-car-accessories

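I want to scrape the page to get the URLs of these products. I can collect them, but I am stuck on the scrolling part. I need the URLs of every product on this page, and it is a huge page with a lot of results.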

I have tried:

1.

 driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

I tried this code, but it only scrolls straight to the bottom, and not all of the products get loaded.

2.

data = driver.find_elements(By.XPATH,"//h2[@class='product-title']//a")
for i in data:
    driver.execute_script("arguments[0].scrollIntoView();", i)

items = []
last_height = driver.execute_script("return document.body.scrollHeight")
item_targetcount = 1000
while item_targetcount

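I tried to get help from: "How to scroll down in Python Selenium" and "Scroll to an element step by step using WebDriver?". I also tried watching some YouTube videos, but I still cannot solve this problem.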

My main code for scraping the other details is:

prod_details = []
for i in models:
    driver.find_element(By.XPATH,"//span[@aria-labelledby='select2-brand-container']").click()
    time.sleep(2)
    driver.find_element(By.XPATH,"//input[@class='select2-search__field']").send_keys(i)
    driver.find_element(By.XPATH,"//input[@class='select2-search__field']").send_keys(Keys.ENTER)
    driver.find_element(By.XPATH,"//div[@class='btnred sbv-link sbv-inactive']").click()
    time.sleep(3)
    prod = driver.find_elements(By.XPATH,"//h2[@class='product-title']//a")
    # use a separate name so the outer loop variable `i` is not shadowed
    for link in prod:
        prod_details.append(link.get_attribute("href"))
    driver.get('https://www.autofurnish.com/')
    time.sleep(2)

I still cannot get the page to load completely and return all of the results.

2 answers

洪胤
2023-03-14

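To extract the value of the href attribute from the elements, you can use a list comprehension together with either of the following locator strategies: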


  • Using CSS_SELECTOR:

    driver.get('https://www.autofurnish.com/audi-car-accessories#/pageSize=32&viewMode=grid&orderBy=0')
    print([my_elem.get_attribute("href") for my_elem in driver.find_elements(By.CSS_SELECTOR, "h2.product-title a")])
    driver.quit()
    

    Using XPATH:

    driver.get('https://www.autofurnish.com/audi-car-accessories#/pageSize=32&viewMode=grid&orderBy=0')
    print([my_elem.get_attribute("href") for my_elem in driver.find_elements(By.XPATH, "//h2[@class='product-title']//a")])
    driver.quit()
    

    Console output:

    ['https://www.autofurnish.com/combo-of-7d-premium-car-pillow-neck-rest-hecta-6841-back-cushion-hecta-6851-each-set-of-two-beige', 'https://www.autofurnish.com/combo-of-7d-premium-car-pillow-neck-rest-hecta-6840-back-cushion-hecta-6850-each-set-of-two-black', 'https://www.autofurnish.com/combo-of-7d-premium-car-pillow-neck-rest-hecta-6843-back-cushion-hecta-6853-each-set-of-two-coffee', 'https://www.autofurnish.com/combo-of-7d-premium-car-pillow-neck-rest-hecta-6842-back-cushion-hecta-6852-each-set-of-two-tan', 'https://www.autofurnish.com/universal-2d-premium-leather-car-foot-mats-for-2-rows-beige', 'https://www.autofurnish.com/universal-2d-premium-leather-car-foot-mats-for-2-rows-black', 'https://www.autofurnish.com/universal-2d-premium-leather-car-foot-mats-for-2-rows-coffee', 'https://www.autofurnish.com/universal-2d-premium-leather-car-foot-mats-for-2-rows-tan', 'https://www.autofurnish.com/autofurnish-3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-beige', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-black', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-coffee', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-set-of-two-beige', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-set-of-two-black', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-set-of-two-brown', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-set-of-two', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-holder-hanger-accessory-tan', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-beige', 
'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-black', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-coffee', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-set-of-2-beige', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-set-of-2-black', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-set-of-2-coffee', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-set-of-2-tan', 'https://www.autofurnish.com/3d-car-auto-seat-back-multi-pocket-storage-bag-organizer-with-car-meal-tray-tan', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a4-2021-beige', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a4-2021-black', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a4-2021-coffee', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a4-2021-tan', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a6-2020-beige', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a6-2020-black', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a6-2020-coffee', 'https://www.autofurnish.com/5d-premium-custom-fitted-car-mats-for-audi-a6-2020-tan']
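The console output above contains 32 URLs, which matches the `pageSize=32` value in the URL fragment, so this approach appears to return only the first batch of products rather than the whole page. As a minimal sketch, the fragment settings can be read like this. The helper name `fragment_params` is hypothetical, and whether the site honors a larger `pageSize` value is an assumption that is not confirmed anywhere in this thread:

```python
from urllib.parse import parse_qs, urlsplit

url = "https://www.autofurnish.com/audi-car-accessories#/pageSize=32&viewMode=grid&orderBy=0"


def fragment_params(page_url: str) -> dict:
    """Parse key=value pairs out of a '#/a=1&b=2' style URL fragment.

    Hypothetical helper for illustration; not part of Selenium or the site.
    """
    # urlsplit keeps everything after '#' in .fragment; drop the leading '/'
    fragment = urlsplit(page_url).fragment.lstrip("/")
    # parse_qs returns a list per key; keep the first value of each
    return {key: values[0] for key, values in parse_qs(fragment).items()}


params = fragment_params(url)
print(params["pageSize"])
```

If the count you get back always equals `pageSize`, that is a hint that more products are still waiting to be loaded by scrolling.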
    

  • 包翔
    2023-03-14

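    This was a fairly tricky one... I ran into several unexpected problems trying to get it to work.

    The main issue was waiting for the loading spinner and keeping it on screen. I initially tried scrolling to the bottom of the page like you did, which sent the page into an infinite loop of loading new product sections, because the footer is so large that the loading spinner sat above the visible part of the page (at least for me). I fixed this by scrolling to the last visible product instead. That is low enough to trigger the next section to load, but not so low that the page goes into endless loading mode.

    In most cases where a loading spinner is involved, you want to wait for it to become visible and then invisible. This prevents bad timing situations and is the most reliable way to wait for new products to load.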

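    The basic flow is

      1. Load the page
      2. Start a loop
         1. Get all product A tags
         2. Use JS to scroll the page down to the last A tag
         3. Wait for the loading spinner to become visible, then invisible
         4. If no more products are loaded or the max product count is reached, exit the loop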

    The code

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    ...
    
    # may need to adjust the timeout based on your experience... the site is really slow for me
    wait = WebDriverWait(driver, 60)
    new_count = 0
    old_count = 0
    while True:
        old_count = new_count
        products = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h2.product-title > a")))
        new_count = len(products)
    
        # scroll down to last product to trigger the loading spinner
        driver.execute_script("arguments[0].scrollIntoView();", products[-1])
    
        # wait for loading spinner to appear and then disappear
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.infinite-scroll-loader")))
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, "div.infinite-scroll-loader")))
    
        # if the count didn't change, we've loaded all products on the page
        # I put a max of 50 products to load as a demo. You can adjust higher as needed but you should put something reasonably sized here to prevent the script from running for an hour
        if new_count == old_count or new_count > 50:
            break
    
    # print results
    print(len(products))
    for product in products:
        print(product.get_attribute("href"))
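One thing to watch with the loop above: if the spinner never appears after the last batch, the first `wait.until(...)` would raise a timeout instead of reaching the count check. Below is a rough sketch of the same control flow with that case handled, written against plain callables so it can be tried without a browser. The names `load_until_stable`, `get_items`, and `load_more` are hypothetical and not part of Selenium:

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def load_until_stable(
    get_items: Callable[[], List[T]],
    load_more: Callable[[T], bool],
    max_items: int = 50,
) -> List[T]:
    """Keep loading until the item count stops growing or max_items is reached.

    get_items returns the items currently on the page.
    load_more scrolls to the given (last) item, waits for the load to finish,
    and returns False if no load was triggered (e.g. the spinner never showed).
    """
    items = get_items()
    while items and len(items) < max_items:
        if not load_more(items[-1]):
            break  # nothing was triggered, so assume the page is complete
        refreshed = get_items()
        if len(refreshed) <= len(items):
            break  # the count did not grow, so all products are loaded
        items = refreshed
    return items
```

With Selenium, `get_items` could be `lambda: driver.find_elements(By.CSS_SELECTOR, "h2.product-title > a")`, and `load_more` could run the `scrollIntoView` script followed by the two spinner waits from the answer, returning `False` when `selenium.common.exceptions.TimeoutException` is raised by the visibility wait. Because products arrive in batches, the result may contain a few more than `max_items` entries.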
    