Question:

Moving to the next page when scraping with BeautifulSoup

华峰

2023-03-14

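I need to scrape content (just the titles) from a website. I have done it for one page, but I need to do it for every page on the site. Currently, I am doing the following: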

import bs4, requests
import pandas as pd
import re

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
    
    
r = requests.get(website, headers=headers)
soup = bs4.BeautifulSoup(r.text, 'html')


title=soup.find_all('h2')

I know that when I move to the next page, the URL changes as follows:

website/page/2/
website/page/3/
... 
website/page/49/
...

I tried to build a recursive function using next_page_url = base_url + next_page_partial, but it does not move to the next page.

if soup.find("span", text=re.compile("Next")):
    page = "https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina/page/".format(page_num)
    page_num +=10 # I should scrape all the pages so maybe this number should be changed as I do not know at the beginning how many pages there are for that section
    print(page_num)
else:
    break

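I followed this question (and its answers): Moving to the next page when scraping with BeautifulSoup

Let me know if you need more information. Many thanks.

Updated code: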

import bs4, requests
import pandas as pd
import re

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

page_num=1
website="https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina"

while True:
  r = requests.get(website, headers=headers)
  soup = bs4.BeautifulSoup(r.text, 'html')


  title=soup.find_all('h2')

  if soup.find("span", text=re.compile("Next")):
      page = f"https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina/page/{page_num}".format(page_num)
      page_num +=10
  else:
      break

1 answer

陈胤

2023-03-14

If you use f"url/{page_num}", remove the .format(page_num) call.

You can use either of the following:

page = f"https://catania.liveuniversity.it/notizie-catania-cronaca/cronacaCateneseSicilina/page/{page_num}"

or

page = "https://catania.liveuniversity.it/notizie-catania-cronaca/cronacaCateneseSicilina/page/{}".format(page_num)
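
As a small illustration of the point above, the sketch below compares the three spellings. BASE is a placeholder URL chosen for this example, not the site from the question. Calling .format() on a string that contains no {} placeholder returns the string unchanged, which is why the first snippet in the question never picked up the page number.

```python
# Placeholder base URL for illustration only; substitute the real section URL.
BASE = "https://example.com/section/page/"

page_num = 3

# No {} in the string, so str.format has nothing to fill in.
no_placeholder = BASE.format(page_num)

# str.format fills the {} placeholder with page_num.
with_format = (BASE + "{}").format(page_num)

# An f-string interpolates the variable directly; no .format call is needed.
with_fstring = f"{BASE}{page_num}"

print(no_placeholder)
print(with_format)
print(with_fstring)
```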

Good luck!

The final answer is:

import bs4, requests
import pandas as pd
import re

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

page_num=1
website="https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina"

while True:
  r = requests.get(website, headers=headers)
  soup = bs4.BeautifulSoup(r.text, 'html')


  title=soup.find_all('h2')

  if soup.find("span", text=re.compile("Next")):
      website = f"https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina/page/{page_num}"
      page_num +=1
  else:
      break
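
For reference, one way the loop could be restructured is sketched below. It is an illustration under assumptions, not the asker's exact code: scrape_titles, fetch_html, and the max_pages safety cap are names invented here, the "Next" marker is assumed to be a span as in the question, and the page URLs are assumed to follow the website/page/N/ pattern described earlier. Compared with the code above, it accumulates the titles from every page instead of overwriting title on each iteration, and it requests page 2 right after the base URL rather than page/1.

```python
import re
from typing import Callable, List

import bs4
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url: str) -> str:
    """Download one page and return its HTML (raises on HTTP errors)."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text


def scrape_titles(
    base_url: str,
    fetch: Callable[[str], str] = fetch_html,
    max_pages: int = 1000,  # hypothetical safety cap to avoid an endless loop
) -> List[str]:
    """Collect <h2> text from base_url, then base_url/page/2/, /page/3/, ..."""
    titles: List[str] = []
    url = base_url
    for page_num in range(1, max_pages + 1):
        soup = bs4.BeautifulSoup(fetch(url), "html.parser")
        titles.extend(h2.get_text(strip=True) for h2 in soup.find_all("h2"))
        # Stop when the page has no "Next" marker (assumed to be a <span>).
        if not soup.find("span", string=re.compile("Next")):
            break
        # Build the URL of the following page for the next iteration.
        url = f"{base_url.rstrip('/')}/page/{page_num + 1}/"
    return titles
```

Calling scrape_titles(website) with the section URL from the question would then return a single list of title strings. Passing a different fetch callable makes the loop testable without network access.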
