Question:

Selenium loop: appending multiple tables together

蒋永宁
2023-03-14

I'm a new Python user here. I've been writing code that uses Selenium and Beautiful Soup to go to a website, get an HTML table, and turn it into a DataFrame.

states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "District of Columbia",
"Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", 
"Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire",
"New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", 
"Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", 
"Washington", "West Virginia", "Wisconsin", "Wyoming"]

period = "2020"

num_states = len(states)

state_list = []

for state in states:
    driver = webdriver.Chrome(executable_path = 'C:/webdrivers/chromedriver.exe')
    driver.get('https://www.nbc.gov/pilt/counties.cfm')
    driver.implicitly_wait(20)
    state_s = driver.find_element(By.NAME, 'state_code')
    drp = Select(state_s)
    drp.select_by_visible_text(state)
    year_s = driver.find_element(By.NAME, 'fiscal_yr')
    drp = Select(year_s)
    drp.select_by_visible_text(period)
    driver.implicitly_wait(10)
    link = driver.find_element(By.NAME, 'Search')
    link.click()
    url = driver.current_url
    page = requests.get(url)
    #dfs  = pd.read_html(addrss)[2]
    # Get the html
    soup = BeautifulSoup(page.text, 'lxml')
    table = soup.findAll('table')[2]
    headers = []

    for i in table.find_all('th'):
        title = i.text.strip()
        headers.append(title)

    df = pd.DataFrame(columns = headers)

    for row in table.find_all('tr')[1:]:
        data = row.find_all('td')
        row_data = [td.text.strip() for td in data]
        length = len(df)
        df.loc[length] = row_data
    df = pd.DataFrame.rename(columns={'Total Acres':'Total_acres'})
    for i in range(s,num_states):
        state_list.append([County[i].text, Payment[i].text, Total_acres[i].text])

print(df)

state_list = []

df = pd.DataFrame()

for state in states:
    driver = webdriver.Chrome(executable_path='c://webdrivers/chromedriver.exe')
    driver.get('https://www.nbc.gov/pilt/county.cfm')
    driver.implicitly_wait(20)
    state_s = driver.find_element(By.NAME, 'state_code')
    driver.implicitly_wait(20)
    date_s = driver.find_element(By.NAME, 'state_code')
    drp = find_element(By.NAME, 'search')
    link.click()
    url = driver.current_url
    page = requests.get(url)
    #dfs  = pd.read_html(addrss)[2]
    # Get the html
    soup = BeautifulSoup(page.text, 'lxml')
    table = soup.findAll('table')[2]
    headers = []

    for i in table.find_all('th'):
        title = i.text.strip()
        headers.append(title)

    for row in table.find_all('tr')[1:]:
        data = row.find_all('td')
        row_data = [td.text.strip() for td in data]
        length = len(df)
        df.loc[length] = row_data

dfs = pd.concat([df for state in states])

print(df)

Result: ValueError: cannot set a frame with no defined columns
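
The error can be reproduced without Selenium, which may help narrow it down. The following is a minimal sketch that assumes only pandas; the column names and the sample row are placeholders, not data taken from the site. It shows that assigning a row with `.loc` to a DataFrame created without columns raises the `ValueError`, while the same assignment works once the columns are defined.

```python
import pandas as pd

# Placeholder column names and row values, used only for illustration.
headers = ["County", "Payment", "Total Acres"]
row_data = ["Example County", "$0", "0"]

# 1) A frame created with no columns cannot accept a new row via .loc.
empty = pd.DataFrame()
try:
    empty.loc[len(empty)] = row_data
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# 2) Defining the columns first makes the same assignment valid.
with_columns = pd.DataFrame(columns=headers)
with_columns.loc[len(with_columns)] = row_data

# 3) Alternative: collect the rows in a list and build the frame once.
rows = [row_data]
built_once = pd.DataFrame(rows, columns=headers)
print(built_once)
```

In the second attempt above, `df = pd.DataFrame()` is created without columns and the scraped `headers` list is never passed to it, which matches case 1. Collecting rows in a plain list (case 3) also avoids growing the frame one row at a time.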

1 answer

晋涛
2023-03-14

Access the table through pandas! Please refer to the comments on the lines that were added.

from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

states = ["Alabama", "Alaska"]

period = "2020"

driver = webdriver.Chrome()  # create the browser once, outside the loop
result = []  # change 1: list to store one {state: df} dict per state
for state in states:
    driver.get('https://www.nbc.gov/pilt/counties.cfm')
    driver.implicitly_wait(20)
    state_s = driver.find_element(By.NAME, 'state_code')
    drp = Select(state_s)
    drp.select_by_visible_text(state)
    year_s = driver.find_element(By.NAME, 'fiscal_yr')
    drp = Select(year_s)
    drp.select_by_visible_text(period)
    driver.implicitly_wait(10)
    link = driver.find_element(By.NAME, 'Search')
    link.click()
    temp_res = {}
    # Parse the page Selenium is showing; a separate requests.get(url) is not needed
    soup = BeautifulSoup(driver.page_source, 'lxml')
    # Access the tables through pandas; StringIO avoids passing a literal HTML string
    df_list = pd.read_html(StringIO(soup.prettify()), thousands=',,')
    try:
        df_list[2].drop('PAYMENT.1', axis=1, inplace=True)  # some states return this column, so drop it
    except KeyError:
        print(f"state: {state} does not have a PAYMENT.1 column")
    try:
        df_list[2].drop('PAYMENT.2', axis=1, inplace=True)  # some states return this column, so drop it
    except KeyError:
        print(f"state: {state} does not have a PAYMENT.2 column")
    temp_res[state] = df_list[2]  # the table at index 2
    result.append(temp_res)

driver.quit()  # close the browser when the loop is done

The output looks like this:

for each_run in result :
    for each_state in each_run:
        print(each_run[each_state].head(1))
           COUNTY PAYMENT TOTAL ACRES
0  AUTAUGA COUNTY  $4,971       1,758
                   COUNTY   PAYMENT TOTAL ACRES
0  ALEUTIANS EAST BOROUGH  $668,816   2,663,160
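
To end up with one table instead of a list of per-state dicts, the frames can be tagged with their state and concatenated. This is a minimal sketch, not part of the original answer: the `result` list below is a stand-in holding only the two rows printed above, and the `STATE` and `PAYMENT_USD` column names are assumptions introduced for illustration.

```python
import pandas as pd

# Stand-in for the `result` list built by the scraping loop; the two rows
# are the ones shown in the printed output above.
result = [
    {"Alabama": pd.DataFrame({
        "COUNTY": ["AUTAUGA COUNTY"],
        "PAYMENT": ["$4,971"],
        "TOTAL ACRES": ["1,758"],
    })},
    {"Alaska": pd.DataFrame({
        "COUNTY": ["ALEUTIANS EAST BOROUGH"],
        "PAYMENT": ["$668,816"],
        "TOTAL ACRES": ["2,663,160"],
    })},
]

# Tag every frame with its state (hypothetical STATE column), then
# concatenate once instead of appending inside the loop.
frames = [
    df.assign(STATE=state)
    for run in result
    for state, df in run.items()
]
combined = pd.concat(frames, ignore_index=True)

# Optional: turn the "$4,971" style strings into integers
# (PAYMENT_USD is a hypothetical column name).
combined["PAYMENT_USD"] = (
    combined["PAYMENT"].str.replace(r"[$,]", "", regex=True).astype(int)
)

print(combined)
```

Calling `pd.concat` once on a list of frames is generally cheaper than growing a DataFrame inside the loop, and `ignore_index=True` gives the combined table a fresh 0..n-1 index.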