I am trying to read a data table from the India Central Pollution Control Board using selenium/python. Here is a sample of the output. I am basically following the approach described here: https://github.com/RachitKamdar/Python-Scraper.
Thanks to @Prophet, I was able to read the data from the first page (Selecting elements with XPath in Python?), but I cannot get selenium to wait for the data table to reload when switching to page 2. I tried adding a WebDriverWait instruction, but that does not seem to work. Any help would be greatly appreciated. Thanks.
This is what I am trying to do:
browser.find_element_by_tag_name("select").send_keys("100")
WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, "//*[@id='DataTables_Table_0_paginate']/span/a")))
maxpage = int(browser.find_elements(By.XPATH, "//*[@id='DataTables_Table_0_paginate']/span/a")[-1].text)
i = 1
while i < maxpage + 1:
    browser.find_element(By.XPATH, "//*[@id='DataTables_Table_0_paginate']/span/a[contains(text(),'{}')]".format(i)).click()
    WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.ID, "DataTables_Table_0_wrapper")))
    # this works ok for page 1
    # this does not wait after the click for the data table to update;
    # as a result res is wrong (empty) for page 2
    res = browser.page_source
    soup = BeautifulSoup(res, 'html.parser')
    soup = soup.find(id='DataTables_Table_0')
    ...
    i = i + 1
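(For context on what the wait is doing: `WebDriverWait(...).until(condition)` simply re-evaluates the condition at a fixed poll interval until it returns something truthy or the timeout expires, at which point it raises a TimeoutException. A rough standard-library sketch of that behavior, illustrative only and not selenium's actual implementation:)

```python
import time

def wait_until(condition, timeout=10, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires."""
    end = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() > end:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll)

# toy condition: becomes truthy on the third poll
calls = {"n": 0}
def table_reloaded():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until(table_reloaded, timeout=5, poll=0.01))  # True
```

This is why a wait on an element that is *already visible* (like the table wrapper) returns immediately and does not block until the new page's rows arrive.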
Update 1: following Prophet's suggestion, I made the following modifications:
browser.find_element_by_tag_name("select").send_keys("100")
WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.ID, "DataTables_Table_0_wrapper")))
WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, "//*[@id='DataTables_Table_0_paginate']/span/a")))
maxpage = int(browser.find_elements(By.XPATH, "//*[@id='DataTables_Table_0_paginate']/span/a")[-1].text)
print(maxpage)
i = 1
while i < maxpage + 1:
    WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.ID, "DataTables_Table_0_wrapper")))
    res = browser.page_source
    soup = BeautifulSoup(res, 'html.parser')
    soup = soup.find(id='DataTables_Table_0')
    if i == 1:
        data = getValsHtml(soup)
    else:
        data = data.append(getValsHtml(soup))
    print(i)
    print(data)
    i = i + 1
    browser.find_element(By.XPATH, '//a[@class="paginate_button next"]').click()
This still crashes on page 2 (data is empty). Also, data should contain the 100 items from page 1 but only contains 10. The maxpage number is correct (15).
Update 2:
Here is the whole script after incorporating Prophet's suggestions [the original is from https://github.com/RachitKamdar/Python-Scraper]. It only retrieves 10 points from the first page and fails to switch to the next page.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import Select
import pandas as pd

def getValsHtml(table):
    data = []
    heads = table.find_all('th')
    data.append([ele.text.strip() for ele in heads])
    rows = table.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols])  # get rid of empty values
    data.pop(1)
    data = pd.DataFrame(data[1:], columns=data[0])
    return data

def parameters(br, param):
    br.find_element_by_class_name("list-filter").find_element_by_tag_name("input").send_keys(param)
    br.find_elements_by_class_name("pure-checkbox")[1].click()
    br.find_element_by_class_name("list-filter").find_element_by_tag_name("input").clear()

timeout = 60
url = 'https://app.cpcbccr.com/ccr/#/caaqm-dashboard-all/caaqm-landing/data'
chdriverpath = "/net/f1p/my_soft/chromedriver"
option = webdriver.ChromeOptions()
browser = webdriver.Chrome(executable_path="{}".format(chdriverpath), chrome_options=option)
browser.get(url)

station = "Secretariat, Amaravati - APPCB"
state = "Andhra Pradesh"
city = "Amaravati"
sd = ['01', 'Jan', '2018']
ed = ['31', 'Dec', '2021']
duration = "24 Hours"

WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.CLASS_NAME, "toggle")))
browser.find_elements_by_class_name("toggle")[0].click()
browser.find_element_by_tag_name("input").send_keys(state)
browser.find_element_by_class_name("options").click()
browser.find_elements_by_class_name("toggle")[1].click()
browser.find_element_by_tag_name("input").send_keys(city)
browser.find_element_by_class_name("options").click()
browser.find_elements_by_class_name("toggle")[2].click()
browser.find_element_by_tag_name("input").send_keys(station)
browser.find_element_by_class_name("options").click()
browser.find_elements_by_class_name("toggle")[4].click()
browser.find_element_by_class_name("filter").find_element_by_tag_name("input").send_keys(duration)
browser.find_element_by_class_name("options").click()
browser.find_element_by_class_name("c-btn").click()

for p in ['NH3']:
    print(p)
    try:
        parameters(browser, p)
    except:
        print("miss")
        browser.find_element_by_class_name("list-filter").find_element_by_tag_name("input").clear()
        pass

browser.find_element_by_class_name("wc-date-container").click()
browser.find_element_by_class_name("month-year").click()
browser.find_element_by_id("{}".format(sd[1].upper())).click()
browser.find_element_by_class_name("year-dropdown").click()
browser.find_element_by_id("{}".format(int(sd[2]))).click()
browser.find_element_by_xpath('//span[text()="{}"]'.format(int(sd[0]))).click()

browser.find_elements_by_class_name("wc-date-container")[1].click()
browser.find_elements_by_class_name("month-year")[1].click()
browser.find_elements_by_id("{}".format(ed[1].upper()))[1].click()
browser.find_elements_by_class_name("year-dropdown")[1].click()
browser.find_element_by_id("{}".format(int(ed[2]))).click()
browser.find_elements_by_xpath('//span[text()="{}"]'.format(int(ed[0])))[1].click()
browser.find_elements_by_tag_name("button")[-1].click()

next_page_btn_xpath = '//a[@class="paginate_button next"]'
actions = ActionChains(browser)
# This is how you should treat the Select drop down
select = Select(browser.find_element_by_tag_name("select"))
select.select_by_value('100')
WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, '//div[@class="dataTables_wrapper no-footer"]')))
WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, "//*[@id='DataTables_Table_0_paginate']/span/a")))
maxpage = int(browser.find_elements(By.XPATH, "//*[@id='DataTables_Table_0_paginate']/span/a")[-1].text)

i = 1
while i < maxpage + 1:
    res = browser.page_source
    soup = BeautifulSoup(res, 'html.parser')
    soup = soup.find(id='DataTables_Table_0')
    if i == 1:
        data = getValsHtml(soup)
    else:
        data = data.append(getValsHtml(soup))
    print(i)
    print(data)
    i = i + 1
    # scroll to the next page btn and then click it
    next_page_btn = browser.find_element_by_xpath(next_page_btn_xpath)
    actions.move_to_element(next_page_btn).perform()
    next_page_btn.click()
browser.quit()
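As a sanity check of the extraction step, the logic of getValsHtml (header from th cells, one row per tr, first header row dropped from the body) can be reproduced with only the standard library's html.parser; the sample HTML below is made up for illustration and only stands in for the real DataTables markup:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect th texts as the header and td texts grouped by tr as rows."""
    def __init__(self):
        super().__init__()
        self.header, self.rows = [], []
        self._row, self._cell = None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []          # start a fresh row
        elif tag in ("td", "th"):
            self._cell = ""         # start collecting cell text

    def handle_data(self, data):
        if self._cell is not None:  # only keep text inside a cell
            self._cell += data

    def handle_endtag(self, tag):
        if tag == "th":
            self.header.append(self._cell.strip()); self._cell = None
        elif tag == "td":
            self._row.append(self._cell.strip()); self._cell = None
        elif tag == "tr" and self._row:  # skip the header-only row
            self.rows.append(self._row)

# hypothetical sample of what one page of the table might look like
html = """<table id="DataTables_Table_0">
<tr><th>Date</th><th>NH3</th></tr>
<tr><td>01-01-2018</td><td>12.3</td></tr>
<tr><td>02-01-2018</td><td>11.8</td></tr>
</table>"""

p = TableScraper()
p.feed(html)
print(p.header)  # ['Date', 'NH3']
print(p.rows)    # [['01-01-2018', '12.3'], ['02-01-2018', '11.8']]
```

Running the parsing step in isolation like this makes it easy to confirm that an empty result means the page source itself lacked the rows, i.e. the wait failed, not the parse.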
Instead of
browser.find_element(By.XPATH,"//*[@id='DataTables_Table_0_paginate']/span/a[contains(text(),'{}')]".format(i)).click()
try clicking this element:
browser.find_element(By.XPATH,'//a[@class="paginate_button next"]').click()
This is simply the Next page button, so its locator does not change from page to page.
Similarly, instead of
WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.ID,"DataTables_Table_0_wrapper")))
try this:
WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH,'//div[@class="dataTables_wrapper no-footer"]')))
This element is the same on all the pages, whereas the element you were using is only defined for the first page.
UPD
The correct code should look like this:
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import Select

next_page_btn_xpath = '//a[@class="paginate_button next"]'
actions = ActionChains(browser)
# This is how you should treat the Select drop down
select = Select(browser.find_element_by_tag_name("select"))
select.select_by_value('100')
WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, '//div[@class="dataTables_wrapper no-footer"]')))
maxpage = int(browser.find_elements(By.XPATH, "//*[@id='DataTables_Table_0_paginate']/span/a")[-1].text)
i = 1
while i < maxpage + 1:
    res = browser.page_source
    soup = BeautifulSoup(res, 'html.parser')
    soup = soup.find(id='DataTables_Table_0')
    if i == 1:
        data = getValsHtml(soup)
    else:
        data = data.append(getValsHtml(soup))
    print(i)
    print(data)
    i = i + 1
    # scroll to the next page btn and then click it
    next_page_btn = browser.find_element_by_xpath(next_page_btn_xpath)
    actions.move_to_element(next_page_btn).perform()
    next_page_btn.click()
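One more note on the accumulation step: DataFrame.append returns a new frame on every call (and is deprecated in recent pandas), so the usual idiom is to collect each page's result in a plain list and combine once after the loop. Sketched here without pandas or a browser; fetch_page is a made-up stand-in for scraping one page:

```python
def fetch_page(page):
    """Stand-in for scraping one page: returns that page's rows."""
    return ["row-{}-{}".format(page, j) for j in range(3)]

pages = []
for page in range(1, 4):            # pretend maxpage == 3
    pages.append(fetch_page(page))  # one scrape per page, appended to a list

# single combine at the end instead of repeated append
all_rows = [row for page_rows in pages for row in page_rows]
print(len(all_rows))  # 9
```

With pandas the same shape would be a list of per-page DataFrames followed by one pd.concat after the loop.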