Question:

find_all() in BeautifulSoup returns an empty result set

韩喜

2023-03-14

I am trying to scrape data from a website to practice web scraping, but find_all() returns an empty set. How can I fix this?

#importing required modules

import requests,bs4

#sending request to the server

req = requests.get("https://www.udemy.com/courses/search/?q=python")

# checking the status on the request

print(req.status_code)
req.raise_for_status()

#converting using BeautifulSoup

soup = bs4.BeautifulSoup(req.text,'html.parser')

#Trying to scrape the particular div with the class but returning 0

container = soup.find_all('div',class_='popover--popover--t3rNO popover--popover-hover--14ngr')

#trying to print the number of container returned.
print(len(container))

Output:

200
0

2 answers

罗浩然

2023-03-14

To get the data you need, send your requests to the appropriate API. To do that, create a session:

import requests

s = requests.Session()
cookies = s.get('https://www.udemy.com').cookies
headers={"Referer": "https://www.udemy.com/courses/search/?q=python&skip_price=false"}

for page_counter in range(1, 500):
    data = s.get('https://www.udemy.com/api-2.0/search-courses/?p={}&q=python&skip_price=false'.format(page_counter), cookies=cookies, headers=headers).json()
    for course in data['courses']:
        params = {'course_ids': [str(course['id']),],
              'fields[pricing_result]': ['price',]}
        title = course['title']
        price = s.get('https://www.udemy.com/api-2.0/pricing/', params=params, cookies=cookies).json()['courses'][str(course['id'])]['price']['amount']
        print({'title': title, 'price': price})
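
The loop above asks for pages 1 to 499 regardless of how many exist. A minimal sketch of stopping at the first empty page follows. `fetch_page` is a hypothetical stand-in for the `s.get(...).json()` call, and the idea that an exhausted search returns an empty `courses` list is an assumption, not something the source confirms.

```python
from typing import Callable, Iterator


def iter_courses(fetch_page: Callable[[int], dict], max_pages: int = 500) -> Iterator[dict]:
    """Yield courses page by page until a page comes back empty."""
    for page in range(1, max_pages):
        data = fetch_page(page)
        courses = data.get("courses", [])
        if not courses:  # assumed end-of-results signal
            return
        yield from courses


# Hypothetical stand-in for the real HTTP call, with two pages of fake data.
FAKE_PAGES = {
    1: {"courses": [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}]},
    2: {"courses": [{"id": 3, "title": "C"}]},
}


def fake_fetch(page: int) -> dict:
    return FAKE_PAGES.get(page, {"courses": []})


titles = [course["title"] for course in iter_courses(fake_fetch)]
print(titles)
```

With a real session, `fetch_page` would wrap the request and return its parsed JSON; keeping the HTTP call behind a function also makes the loop easy to try out offline.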

令狐和裕

2023-03-14

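See my comment about the content being entirely JavaScript-driven. Modern websites often use JavaScript to make HTTP requests to the server and fetch data on demand. If you disable JavaScript here (easy to do in Chrome: inspect the page and look under more settings), you will see that no text is available on this site. As you pointed out, this is probably quite different from IMDb. If you inspect the HTML that BeautifulSoup parsed, you will see that it contains none of the page content that JavaScript renders.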
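
To illustrate why the search comes back empty, here is a small sketch using a made-up HTML shell of the kind a JavaScript-driven page might return. It uses the standard library's `html.parser.HTMLParser` rather than BeautifulSoup so that it runs without third-party packages; the markup and the class name `course-card` are invented for the example.

```python
from html.parser import HTMLParser

# Invented example of what the server might send before JavaScript runs.
SERVER_HTML = """
<html>
  <head><script src="/static/app.js"></script></head>
  <body><div id="root"></div></body>
</html>
"""


class ClassCounter(HTMLParser):
    """Count start tags of one kind that carry a given CSS class."""

    def __init__(self, wanted_tag: str, wanted_class: str) -> None:
        super().__init__()
        self.wanted_tag = wanted_tag
        self.wanted_class = wanted_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.wanted_tag and self.wanted_class in classes:
            self.count += 1


counter = ClassCounter("div", "course-card")
counter.feed(SERVER_HTML)
print(counter.count)  # 0: the shell has no such div until JavaScript adds it
```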

There are two ways to get data from a JavaScript-rendered website:

Mimic the HTTP requests to the server
Use a browser automation package such as Selenium

The first option is better and more efficient, because the second is more brittle and not well suited to larger datasets.

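Fortunately, Udemy fetches the required data from an API endpoint: the page uses JavaScript to make an HTTP request to that endpoint, and the response is fed back to the browser.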

import requests

cookies = {
    '__udmy_2_v57r': '4f711b308da548b49394854a189d3179',
    'ud_firstvisit': '2020-05-29T13:48:56.584511+00:00:1jefNY:9F1BJVEUJpv7gmNPgYNini76UaE',
    'existing_user': 'true',
    'optimizelyEndUserId': 'oeu1590760136407r0.2130390415126655',
    'EUCookieMessageShown': 'true',
    '_ga': 'GA1.2.1359933509.1590760142',
    '_pxvid': '26d89ed1-a1b3-11ea-9179-cb750fa4136b',
    '_ym_uid': '1585144165890161851',
    '_ym_d': '1590760145',
    '__ssid': 'd191bc02a1063fd2c75fbab525ededc',
    'stc111655': 'env:1592304425%7C20200717104705%7C20200616111705%7C1%7C1014616:20210616104705|uid:1590760145861.374775813.04725504.111655.1839745362:20210616104705|srchist:1069270%3A1%3A20200629134905%7C1014624%3A1592252104%3A20200716201504%7C1014616%3A1592304425%3A20200717104705:20210616104705|tsa:0:20200616111705',
    'ki_t': '1590760146239%3B1592304425954%3B1592304425954%3B3%3B5',
    'ki_r': 'aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8%3D',
    'IR_PI': '00aea1e6-9da9-11ea-af3a-42010a24660a%7C1592390825988',
    '_gac_UA-12366301-1': '1.1592304441.CjwKCAjw26H3BRB2EiwAy32zhfcltNEr_HHFK5JRaJar5qxUn4ifG9FVFctWyTUXigNZvKeOCz7PgxoCAfAQAvD_BwE',
    'csrftoken': 'pPOdtdbH0HPaHvDfAZMzEOdvWqKZuQWufu8dUrEeXuy5mOOrnFRbWZ9vq8Dfd2ts',
    '__cfruid': 'f1963d736e3891a2e307ebc9f918c89065ffe40f-1596962093',
    '__cfduid': 'df4d951c87bc195c73b2f12b5e29568381597085850',
    'ud_cache_price_country': 'GB',
    'ud_cache_device': 'desktop',
    'ud_cache_language': 'en',
    'ud_cache_logged_in': '0',
    'ud_cache_release': '0804b40d37e001f97dfa',
    'ud_cache_modern_browser': '1',
    'ud_cache_marketplace_country': 'GB',
    'ud_cache_brand': 'GBen_US',
    'ud_cache_version': '1',
    'ud_cache_user': '',
    'seen': '1',
    'eventing_session_id': '66otW5O9TQWd5BYq1_etrA-1597087737933',
    'ud_cache_campaign_code': '',
    'exaff': '%7B%22start_date%22%3A%222020-08-09T08%3A52%3A04.083577Z%22%2C%22code%22%3A%22_7fFXpljNdk-m3_OJPaWBwAQc5gVKutaSg%22%2C%22merchant_id%22%3A39197%2C%22aff_type%22%3A%22LS%22%2C%22aff_id%22%3A60680%7D:1k5D3W:2PemPLTm4xaHixBYRvRyBaAukL4',
    'evi': 'SlFfLh4RBzwTSVBjXFdHehNJUGMYQE99HVFdIExYQ3gARVY8QkAWIEEDCXsVQEd0BEsJexVAA24LQgdjGANXdgZBG3ETH1luRBdHKBoHV3ZKURl5XVBXdkpRXWNUU1luRxIJe1lTQXhMDgdjHRAFbgsICXNWVk1uCwgJN0xYRGATBUpjVFVEdAEOB2NcWkR+E0lQYxhAT30dUV0gTFhCfAhDVm1MUEJ0B1EROkwUV3YAXwk3D0BPewFAHzxCQEd0BUcJexVAA24LQgdjGANXdgZCHHETTld+BkUdY1QZVzoTSRptTBQUbgtFEnleHwhgEwBcY1QZV34HShtjVBlXOhNJE21MFBRuC0UceV4fWW4DSxh3TFgObkdREXBCQAMtE0kccFtUCGATQR54VkBPNxMFCXtfTlc6UFERd1tUTTEdURlzX1JXdkpRXWNUU1luRxIJe1tXQnpMXwlzVldDbgsICTdMWEdgEwVKY1RVRHUJDgdjXFdCdBNJUGMYQE99HVFdIExYQ3kCQ1Y8Ew==',
    'ud_rule_vars': 'eJyFjkuOwyAQBa9isZ04agyYz1ksIYxxjOIRGmhPFlHuHvKVRrPItvWqus4EXT4EDJP9jSViyobPktKRgZqc4GrkmmmuBHdU6YlRqY1P6RgDMQ05D2SOueCDtZPDMNT7QDrooAXRdrqhzHBlRL8XUjPgXwAGYCC7ulpdRX3acglPA8bvPwbVgm6g4p0Bvqeyhsh_BkybXyxmN8_R21J9vvpcjm5cn7ZDTidc7G2xxnvlm87hZwvlU7wE2VP1en0hlyuoG10j:1k5D3W:nxRv-tyLU7lxhsF2jRYvkJA53uM',
}

headers = {
    'authority': 'www.udemy.com',
    'x-udemy-cache-release': '0804b40d37e001f97dfa',
    'x-udemy-cache-language': 'en',
    'x-udemy-cache-user': '',
    'x-udemy-cache-modern-browser': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
    'accept': 'application/json, text/plain, */*',
    'x-udemy-cache-brand': 'GBen_US',
    'x-udemy-cache-version': '1',
    'x-requested-with': 'XMLHttpRequest',
    'x-udemy-cache-logged-in': '0',
    'x-udemy-cache-price-country': 'GB',
    'x-udemy-cache-device': 'desktop',
    'x-udemy-cache-marketplace-country': 'GB',
    'x-udemy-cache-campaign-code': '',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.udemy.com/courses/search/?q=python',
    'accept-language': 'en-US,en;q=0.9',
}

params = (
    ('q', 'python'),
    ('skip_price', 'false'),
)

response = requests.get('https://www.udemy.com/api-2.0/search-courses/', headers=headers, params=params, cookies=cookies)

ids = []
titles = []
durations = []
ratings = []
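for course in response.json()['courses']:
    # estimated_content_length is in minutes; convert it to hours
    duration = int(course['estimated_content_length']) / 60
    # ids must be strings so they can be joined for the pricing request
    course_id = str(course['id'])
    titles.append(course['title'])
    ids.append(course_id)
    durations.append(duration)
    ratings.append(course['rating'])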


clean_ids = ','.join(ids)
params2 = (
    ('course_ids', clean_ids),
    ('fields[pricing_result]', 'price,discount_price,list_price,price_detail,price_serve_tracking_id'),
)

# the pricing request only needs the query parameters
response = requests.get('https://www.udemy.com/api-2.0/pricing/', params=params2)
pricing = response.json()['courses']  # parse the JSON once and reuse it
prices = []
for course_id in ids:
    price = pricing[course_id]['price']['amount']
    prices.append(price)

data = zip(titles, durations,ratings, prices)
for a in data:
    print(a)

('Learn Python Programming Masterclass', 56.53333333333333, 4.54487, 14.99)
('The Python Mega Course: Build 10 Real World Applications', 25.3, 4.51476, 16.99)
('Python for Beginners: Learn Python Programming (Python 3)', 2.8833333333333333, 4.4391, 17.99)
('The Python Bible™ | Everything You Need to Program in Python', 9.15, 4.64238, 17.99)
('Python for Absolute Beginners', 3.066666666666667, 4.42209, 14.99)
('The Modern Python 3 Bootcamp', 30.3, 4.64714, 16.99)
('Python for Finance: Investment Fundamentals & Data Analytics', 8.25, 4.52908, 12.99)
('The Complete Python Course | Learn Python by Doing', 35.31666666666667, 4.58885, 17.99)
('REST APIs with Flask and Python', 17.033333333333335, 4.61233, 12.99)
('Python for Financial Analysis and Algorithmic Trading', 16.916666666666668, 4.53173, 12.99)
('Python for Beginners with Examples', 4.25, 4.27316, 12.99)
('Python OOP : Four Pillars of OOP in Python 3 for Beginners', 2.6166666666666667, 4.46451, 12.99)
('Python Bootcamp 2020 Build 15 working Applications and Games', 32.13333333333333, 4.2519, 14.99)
('The Complete Python Masterclass: Learn Python From Scratch', 32.36666666666667, 4.39151, 16.99)
('Learn Python MADE EASY : A Concise Python Course in Python 3', 2.1166666666666667, 4.76601, 12.99)
('Complete Python Web Course: Build 8 Python Web Apps', 15.65, 4.37577, 13.99)
('Python for Excel: Use xlwings for Data Science and Finance', 16.116666666666667, 4.92293, 12.99)
('Python 3 Network Programming - Build 5 Network Applications', 12.216666666666667, 4.66143, 12.99)
('The Complete Python & PostgreSQL Developer Course', 21.833333333333332, 4.5664, 12.99)
('The Complete Python Programmer Bootcamp 2020', 13.233333333333333, 4.63859, 12.99)

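There are two ways to do this; the one shown here, re-engineering the HTTP requests, is the more efficient solution. To get the necessary information, you need to inspect the page and see which HTTP requests supply which information. You can do this with the browser's network tools.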

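I usually copy the HTTP request that the JavaScript makes as a cURL command and paste it into curl.trillworks.com, which converts the necessary headers, parameters and cookies into Python format.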
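
The conversion such a tool performs can be approximated by hand. The sketch below parses a shortened, made-up "Copy as cURL" command with the standard library's `shlex`. It handles only the URL and `-H` headers, and the command text itself is an assumption for illustration.

```python
import shlex

# Shortened, made-up example of a command copied from the browser's network tab.
CURL_COMMAND = (
    "curl 'https://www.udemy.com/api-2.0/search-courses/?q=python' "
    "-H 'accept: application/json, text/plain, */*' "
    "-H 'x-requested-with: XMLHttpRequest'"
)


def parse_curl(command: str) -> tuple[str, dict[str, str]]:
    """Return (url, headers) from a simple curl command with -H options."""
    tokens = shlex.split(command)
    url = ""
    headers: dict[str, str] = {}
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        if tokens[i] in ("-H", "--header"):
            # Split "name: value" at the first colon only.
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip()] = value.strip()
            i += 2
        else:
            url = tokens[i]
            i += 1
    return url, headers


url, headers = parse_curl(CURL_COMMAND)
print(url)
print(headers)
```

The resulting `headers` dict has the same shape as the one passed to `requests.get` in the answer's code.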

The first request needs headers, cookies and parameters. The second request needs only parameters.

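The response you get is a JSON object; response.json() converts it to a Python dictionary. To get what you want, you have to dig around in this dictionary. Each item in response.json()['courses'] holds all the necessary data for one "card" on the website, so we run a for loop over that part of the dictionary. I would play around with response.json() until you get a feel for what the object gives you; that will help you understand the code.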
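
As a small illustration of exploring such a response, the sketch below uses a made-up payload shaped like the fields the code above reads (`courses`, `title`, `estimated_content_length`, `rating`, `id`). The values are invented, and the real response contains many more keys.

```python
import json

# Invented sample in the shape the answer's loop expects.
SAMPLE = """
{"courses": [
  {"id": 101, "title": "Sample Course A", "estimated_content_length": 120, "rating": 4.5},
  {"id": 102, "title": "Sample Course B", "estimated_content_length": 90, "rating": 4.1}
]}
"""

payload = json.loads(SAMPLE)      # a dict, like the one response.json() returns
print(list(payload.keys()))       # top-level keys

first = payload["courses"][0]
print(sorted(first.keys()))       # fields available for one course "card"

for course in payload["courses"]:
    print(course["id"], course["title"])
```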

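The duration is given in minutes, so I do a quick conversion to hours here. The ids also have to be strings, because in the second request we use them as a parameter to fetch the prices of the courses. We convert the ids to strings and pass them along as a parameter.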
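
The two conversions described here can be sketched with invented values. `urlencode` from the standard library is used only to show roughly how the joined ids end up in a query string; `requests` does its own encoding when given `params`.

```python
from urllib.parse import parse_qs, urlencode

# Invented sample values: (course id, length in minutes).
courses = [(101, 120), (102, 90), (103, 45)]

hours = [minutes / 60 for _, minutes in courses]
ids = [str(course_id) for course_id, _ in courses]  # str.join needs strings

course_ids = ",".join(ids)
query = urlencode({"course_ids": course_ids})

print(hours)
print(course_ids)
print(parse_qs(query))  # decoding the query string recovers the joined ids
```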

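The second request then gives us the prices. Again, you have to dig through the dictionary object, and I suggest you do that yourself to confirm that the price is nested inside it.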

We zip the data to combine all the lists, and then I use a for loop to print everything. You could feed this into pandas if you wanted...
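
A minimal sketch of the combining step, with invented values and a made-up pricing payload nested the same way as in the code above (`courses`, then the id, then `price`, then `amount`). It builds a list of dicts, a shape that `pandas.DataFrame` accepts directly should you want a table.

```python
# Invented sample data standing in for the two API responses.
titles = ["Sample Course A", "Sample Course B"]
durations = [2.0, 1.5]
ratings = [4.5, 4.1]
ids = ["101", "102"]
pricing = {"courses": {"101": {"price": {"amount": 14.99}},
                       "102": {"price": {"amount": 12.99}}}}

# Look each id up in the nested pricing dictionary.
prices = [pricing["courses"][course_id]["price"]["amount"] for course_id in ids]

# zip pairs the lists element by element; each group becomes one row dict.
rows = [
    {"title": t, "hours": d, "rating": r, "price": p}
    for t, d, r, p in zip(titles, durations, ratings, prices)
]
for row in rows:
    print(row)
```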
