Question:

Scraping URLs with Python and Selenium

秦俊发

2023-03-14

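Take the text file booktitle.txt, which is a list of book titles.

Then use Python/Selenium to search for each title on goodreads.com.

Get the URL of the result and create a new .csv file where column 1 = book title and column 2 = site URL.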

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.firefox.options import Options
from pyvirtualdisplay import Display
#from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common import keys
import csv
import time
import json

class Book:
    def __init__(self, title, url):
        self.title = title
        self.url = url
    def __iter__(self):
        return iter([self.title, self.url])

url = 'https://www.goodreads.com/'

def create_csv_file():
    header = ['Title', 'URL']
    with open('/home/l/gDrive/AudioBookReviews/WebScraping/GoodReadsBooksNew.csv', 'w+', encoding='utf-8') as csv_file:
        wr = csv.writer(csv_file, delimiter=',')
        wr.writerow(header)

def read_from_txt_file():
    lines = [line.rstrip('\n') for line in open('/home/l/gDrive/AudioBookReviews/WebScraping/BookTitles.txt', encoding='utf-8')]
    return lines

def init_selenium():
    chrome_options = Options()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage') 
    options = Options()
    options.add_argument('--headless')
    global driver
    driver = webdriver.Chrome("/home/l/gDrive/AudioBookReviews/WebScraping/chromedriver",  chrome_options=chrome_options)
    driver.get(url)
    time.sleep(30)
    driver.get('https://www.goodreads.com/search?q=')

def search_for_title(title):
    search_field = driver.find_element_by_xpath('//*[@id="search_query_main"]')
    search_field.clear()
    search_field.send_keys(title)
    search_button = driver.find_element_by_xpath('/html/body/div[2]/div[3]/div[1]/div[1]/div[2]/form/div[1]/input[3]')
    search_button.click()

def scrape_url():
    try:
        url = driver.find_element_by_css_selector('a.bookTitle').get_attribute('href')
    except:
        url = "N/A"

    return url

def write_into_csv_file(vendor):
   with open('/home/l/gDrive/AudioBookReviews/WebScraping/GoodReadsBooksNew.csv', 'a', encoding='utf-8') as csv_file:
        wr = csv.writer(csv_file, delimiter=',')
        wr.writerow(list(vendor))

create_csv_file()
titles = read_from_txt_file()    
init_selenium()

for title in titles:
    search_for_title(title)
    url = scrape_url()
    book = Book(title, url)
    write_into_csv_file(book)

1 answer

农弘毅

2023-03-14

I can see a few errors right away:

1) Since you pass chromedriver later in the code, you must uncomment the Chrome options import and comment out the Firefox one:

# from selenium.webdriver.firefox.options import Options
from selenium.webdriver.chrome.options import Options

By the way, pyvirtualdisplay is an alternative to headless Chrome, so you do not need to import it.

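def init_selenium():
    # Keep every flag on one Options object; the question's code put
    # '--headless' on a second `options` object that was never passed
    # to the driver.
    chrome_options = Options()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--headless')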
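
For reference, here is a minimal sketch of how the whole task could look once those fixes are applied. It is not the asker's or the answerer's code, and it rests on several assumptions: Selenium 4 is installed (so `Service`, `options=` and `find`-style locators via `By` replace the older `chrome_options=` and `find_element_by_*` calls), the search URL prefix and the `a.bookTitle` selector are taken from the question and may no longer match the live site, and the function names (`build_search_url`, `read_titles`, `write_rows`, `make_driver`, `first_result_url`, `scrape`) are hypothetical. It also swaps in a different technique for searching: instead of typing into the search box, it opens the search URL directly with the title URL-encoded.

```python
import csv
from urllib.parse import quote_plus

# Prefix taken from the question's code; assumed to still be valid.
SEARCH_URL = "https://www.goodreads.com/search?q="


def build_search_url(title):
    """Return the search URL for one title, with spaces and punctuation escaped."""
    return SEARCH_URL + quote_plus(title)


def read_titles(path):
    """Read one title per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_rows(path, rows):
    """Write the header and all (title, url) rows in one pass."""
    # newline="" stops the csv module from emitting blank lines on Windows.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Title", "URL"])
        writer.writerows(rows)


def make_driver(chromedriver_path):
    """Start headless Chrome using the Selenium 4 calling convention."""
    # Imported lazily so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()  # a single Options object carries every flag
    for flag in ("--no-sandbox", "--disable-dev-shm-usage", "--headless"):
        options.add_argument(flag)
    return webdriver.Chrome(service=Service(chromedriver_path), options=options)


def first_result_url(driver, title, timeout=10):
    """Open the search page for a title and return the first hit, or "N/A"."""
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(build_search_url(title))
    try:
        # 'a.bookTitle' is the selector from the question; the site may have changed.
        link = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "a.bookTitle"))
        )
    except TimeoutException:
        return "N/A"
    return link.get_attribute("href")


def scrape(titles_path, csv_path, chromedriver_path):
    """Tie the pieces together: titles in, CSV of (title, url) out."""
    driver = make_driver(chromedriver_path)
    try:
        rows = [(t, first_result_url(driver, t)) for t in read_titles(titles_path)]
    finally:
        driver.quit()  # always close the browser, even if a lookup fails
    write_rows(csv_path, rows)
```

Calling `scrape()` with the paths to the titles file, the output CSV and the chromedriver binary would run the full loop. Waiting explicitly for the result link, rather than sleeping for a fixed 30 seconds, keeps each lookup as short as the page allows and records "N/A" when no result appears within the timeout.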
