Question:

Scraping table data with BeautifulSoup or Pandas

墨雨华

2023-03-14

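I'm fairly new to Python, and I've been given a task that requires scraping data from a table. I don't know much HTML either. I have never done this before and have spent a few days researching various ways to scrape tables. Unfortunately, all the examples use web pages with a simpler layout than the one I'm dealing with. I have tried many different approaches, but none of them let me select the table data I need.

How can I scrape the table under the "Daily Water Level" tab at the bottom of the following web page?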

url = "https://apps.wrd.state.or.us/apps/gw/gw_info/gw_hydrograph/hydrograph.aspx?gw_logid=harn0052657"

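I have tried the methods from the following links, as well as others not shown here:

Beautiful Soup scraping table

Scrape table with BeautifulSoup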

from bs4 import BeautifulSoup
import requests

html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("table")  # {"class": "xxxx"})

import pandas as pd
df_list = pd.read_html(url)
df_list

1 answer

越雨泽

2023-03-14

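Try the approach below, which uses Python Requests. For this kind of request it is simple, straightforward, reliable and fast, and it needs less code. I found the API URL on the website itself by inspecting the Network section of the Google Chrome developer tools.

What exactly the script below does:

First, it takes the API URL and performs a GET request with dynamic parameters (the names in CAPS). You can change the well number and the start and end dates to get the results you need.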
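As a minimal sketch of that first step, the request URL can be assembled from the three parameters with the standard library rather than by string concatenation. The helper name `build_daily_mean_url` is hypothetical; the endpoint path and query parameter names are copied from the answer's script, and keeping the slashes in the dates unescaped is an assumption made only so the result matches that script's URL exactly.

```python
from urllib.parse import urlencode

# Endpoint pieces copied from the answer's script; the well number goes between them.
API_BASE = "https://apps.wrd.state.or.us/apps/gw/gw_data_rws/api/"
ENDPOINT = "/gw_recorder_water_level_daily_mean_public/"


def build_daily_mean_url(well_no, start_date, end_date):
    """Return the daily-mean water level URL for one well and date range.

    Hypothetical helper for illustration; it is not part of any library or API.
    """
    query = urlencode(
        {
            "start_date": start_date,
            "end_date": end_date,
            "reviewed_status": "",
            "restrict_to_owrd_only": "n",
        },
        safe="/",  # leave the slashes in dates such as 1/1/1905 unescaped
    )
    return API_BASE + well_no + ENDPOINT + "?" + query


url = build_daily_mean_url("HARN0052657", "1/1/1905", "12/30/2050")
print(url)
```

Changing the well number or the date range then only means passing different arguments to the helper.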

import csv

import requests
from urllib3.exceptions import InsecureRequestWarning

# The request below uses verify=False (no TLS certificate check),
# so suppress the warning that urllib3 would otherwise print for it.
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


def scrap_daily_water_level():
    file_path = ''  # Input file path here
    file_name = 'daily_water_level_data.csv'  # File name

    # CSV headers
    csv_headers = ['Line #', 'GW Log Id', 'GW Site Id', 'Land Surface Elevation', 'Record Date', 'Restrict to OWRD only', 'Reviewed Status', 'Reviewed Status Description', 'Water level ft above mean sea level', 'Water level ft below land surface']
    list_of_water_readings = []

    # Dynamic params
    WELL_NO = 'HARN0052657'
    START_DATE = '1/1/1905'
    END_DATE = '12/30/2050'

    # API URL
    URL = 'https://apps.wrd.state.or.us/apps/gw/gw_data_rws/api/' + WELL_NO + '/gw_recorder_water_level_daily_mean_public/?start_date=' + START_DATE + '&end_date=' + END_DATE + '&reviewed_status=&restrict_to_owrd_only=n'

    response = requests.get(URL, verify=False)  # GET API call
    json_result = response.json()  # Parse the JSON body of the response

    print('Daily water level data count ', json_result['feature_count'])  # Number of records returned
    extracted_data = json_result['feature_list']  # Extracted data as a list of dicts

    for idx, item in enumerate(extracted_data):  # Iterate over the extracted records
        list_of_water_readings.append({  # Map each record onto the CSV headers
            'Line #': idx + 1,
            'GW Log Id': item['gw_logid'],
            'GW Site Id': item['gw_site_id'],
            'Land Surface Elevation': item['land_surface_elevation'],
            'Record Date': item['record_date'],
            'Restrict to OWRD only': item['restrict_to_owrd_only'],
            'Reviewed Status': item['reviewed_status'],
            'Reviewed Status Description': item['reviewed_status_description'],
            'Water level ft above mean sea level': item['waterlevel_ft_above_mean_sea_level'],
            'Water level ft below land surface': item['waterlevel_ft_below_land_surface'],
        })

    # Create the CSV and write the data into it.
    # Mode 'w' replaces any earlier file, so the header is not repeated on re-runs.
    with open(file_path + file_name, 'w', newline='') as daily_water_level_data_CSV:
        csvwriter = csv.DictWriter(daily_water_level_data_CSV, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
        print('Writing CSV header now...')
        csvwriter.writeheader()  # Write headers to the CSV file
        print('Writing data rows now..')
        for item in list_of_water_readings:  # Save each collected row to the CSV file
            print(item)
            csvwriter.writerow(item)


scrap_daily_water_level()
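
Since the question also mentions pandas, here is a hedged sketch of loading the same records into a DataFrame instead of a CSV. It assumes `feature_list` is a list of flat dictionaries with the keys the script above reads. The helper name `features_to_frame` is hypothetical, and the sample records are made up for illustration; they are not real readings.

```python
import pandas as pd

# A subset of the keys that the script above reads from each record.
COLUMNS = [
    "gw_logid",
    "record_date",
    "waterlevel_ft_above_mean_sea_level",
    "waterlevel_ft_below_land_surface",
]


def features_to_frame(feature_list):
    """Build a DataFrame from the API's feature_list, keeping only COLUMNS."""
    return pd.DataFrame(feature_list, columns=COLUMNS)


# Made-up records for illustration only; in practice pass json_result['feature_list'].
sample_features = [
    {
        "gw_logid": "HARN0052657",
        "record_date": "2020-01-01",
        "waterlevel_ft_above_mean_sea_level": 4100.5,
        "waterlevel_ft_below_land_surface": 20.5,
        "reviewed_status": "example",
    },
    {
        "gw_logid": "HARN0052657",
        "record_date": "2020-01-02",
        "waterlevel_ft_above_mean_sea_level": 4100.0,
        "waterlevel_ft_below_land_surface": 21.0,
        "reviewed_status": "example",
    },
]

df = features_to_frame(sample_features)
print(df)
```

Keys not listed in `COLUMNS` (here `reviewed_status`) are dropped, so the list can be extended to keep more fields.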
