I have looked at most of the Stack Overflow questions about Beautiful Soup scraping only half of a site's data, but so far none of the suggestions has worked. I tried switching the parser to lxml, html5lib, and so on. I also tried Selenium: my current attempt uses Selenium to scroll to the bottom of the page so that everything loads, and then scrapes the data with Beautiful Soup. But although the list has more than 100 items, it still only scrapes 16 of them. My code is attached below.
Link to the site I am trying to scrape: https://www.ranker.com/list/kpop-disbanded-groups/ranker-music?ref=listed_on
from selenium import webdriver
from selenium.webdriver.common import timeouts
from selenium.webdriver.common.keys import Keys
import os
from bs4 import BeautifulSoup
import requests
import time

url = 'https://www.ranker.com/list/kpop-disbanded-groups/ranker-music?ref=listed_on&pos=2'
driver = webdriver.Safari()
driver.get(url)

SCROLL_PAUSE_TIME = 3

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        html_content = driver.execute_script('return document.body.innerHTML')
        soup = BeautifulSoup(html_content, 'html.parser')
        for years in soup.findAll('div', class_='gridItem_itemDescription__2Etxm gridItem_blather__2Mozw'):
            print(years.p.text)
        break
    last_height = new_height
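A likely reason the scroll-to-bottom approach caps out at 16 items is that the list is rendered as a virtualized grid: items that scroll out of view are dropped from the DOM, so only the currently visible chunk is present when the page is finally parsed. One workaround is to parse a snapshot after every scroll step and accumulate descriptions as they appear. The sketch below shows only that accumulation logic on static HTML; the two snippet strings are hypothetical stand-ins for what `driver.execute_script` would return between scroll steps (the class name is taken from the question's code).

```python
from bs4 import BeautifulSoup

DESC_CLASS = 'gridItem_itemDescription__2Etxm gridItem_blather__2Mozw'

def collect_descriptions(html, seen):
    """Parse one DOM snapshot and append any descriptions not seen yet."""
    soup = BeautifulSoup(html, 'html.parser')
    for div in soup.find_all('div', class_=DESC_CLASS):
        if div.p and div.p.text not in seen:
            seen.append(div.p.text)
    return seen

# Two hypothetical snapshots: the second was taken after a scroll step,
# by which time the first item had already been removed from the DOM.
snap1 = ('<div class="gridItem_itemDescription__2Etxm '
         'gridItem_blather__2Mozw"><p>2009-2016</p></div>')
snap2 = ('<div class="gridItem_itemDescription__2Etxm '
         'gridItem_blather__2Mozw"><p>2017-2019</p></div>')

years = []
collect_descriptions(snap1, years)
collect_descriptions(snap2, years)
print(years)  # ['2009-2016', '2017-2019']
```

Inside the real scroll loop you would call `collect_descriptions(driver.execute_script('return document.body.innerHTML'), years)` once per iteration instead of parsing only after the last scroll.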
A more reliable approach is to skip the rendered page entirely and query the JSON API that backs the list, which returns every item in one response:

import requests
from bs4 import BeautifulSoup

def main(url):
    params = {
        "limit": 200,
        "offset": 0,
        "useDefaultNodeLinks": "false",
        "liCacheKey": "decacb20-5d77-4b04-a871-7b2c54e3db15",
        "include": "votes,wikiText,rankings,serviceProviders,openListItemContributors,taggedLists",
        "propertyFetchType": "ALL"
    }
    r = requests.get(url, params=params).json()['listItems']
    for x in r:
        yr = x.get('blather', 'N/A')
        soup = BeautifulSoup(yr, 'lxml')
        print("Name: {:<30}, Year: {}".format(
            x['name'], soup.get_text(strip=True)))

main('https://api.ranker.com/lists/2713714/items')
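The second Beautiful Soup pass exists because each item's 'blather' field is an HTML fragment rather than plain text; `get_text(strip=True)` flattens it to the bare year range. The step can be illustrated offline with a hypothetical item shaped like one entry of the API's 'listItems' array (using the stdlib 'html.parser' here so the sketch needs no lxml install):

```python
from bs4 import BeautifulSoup

# Hypothetical item shaped like one entry of the API's 'listItems' array.
item = {"name": "2NE1", "blather": "<p><span>2009-2016</span></p>"}

year_html = item.get('blather', 'N/A')
year = BeautifulSoup(year_html, 'html.parser').get_text(strip=True)
line = "Name: {:<30}, Year: {}".format(item['name'], year)
print(line)
```

`{:<30}` left-pads the name to 30 characters, which is why the "Year:" column lines up in the output below; an item without a 'blather' key would fall through to the "N/A" default.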
Output:
Name: 2NE1 , Year: 2009–2016
Name: Wanna One , Year: 2017-2019
Name: SISTAR , Year: 2010–2017
Name: 4Minute , Year: 2009–2016
Name: I.O.I , Year: 2016–2017
Name: Wonder Girls , Year: 2007–2017
Name: X1 , Year: 2019–2020
Name: T-ara , Year: N/A
Name: Miss A , Year: 2010–2017
Name: PRISTIN , Year: N/A
Name: IZ*ONE , Year: 2018–2021
Name: KARA , Year: 2007-2016
Name: Triple H , Year: N/A
Name: Orange Caramel , Year: N/A
Name: After School , Year: N/A
Name: GFriend , Year: 2015–2021
Name: Pristin V , Year: N/A
Name: Nine Muses , Year: 2010-2019
Name: 2AM , Year: 2008–2017
Name: JBJ , Year: 2017–2018
Name: Boyfriend , Year: N/A
Name: HELLOVENUS , Year: N/A
Name: Sistar19 , Year: 2011–2017
Name: UNB , Year: N/A
Name: Rainbow , Year: 2009–2016
Name: The Ark , Year: N/A
Name: History , Year: 2013–2017
Name: UNI.T , Year: N/A
Name: MADTOWN , Year: 2014–2017
Name: MYTEEN , Year: N/A
Name: Speed , Year: 2012–2016
Name: SPICA , Year: 2012–2017
Name: Stellar , Year: 2011–2018
Name: FIESTAR , Year: 2012-2018
Name: Secret , Year: 2009–2018
Name: B.I.G , Year: 2014- 2020
Name: Melody Day , Year: 2012-2018
Name: RaNia , Year: N/A
Name: C-Clown , Year: 2012–2015
Name: Hi Suhyun , Year: N/A
Name: BESTie , Year: 2013-2018
Name: Playback , Year: N/A
Name: HIGH4 , Year: N/A
Name: Seo Taiji and Boys , Year: N/A
Name: 15& , Year: N/A
Name: Glam , Year: 2012–2015
Name: CHI CHI , Year: N/A
Name: BIGSTAR , Year: N/A
Name: Coed School , Year: 2010-2013
Name: Bambino , Year: N/A
Name: 1Punch , Year: 2015
Name: EvoL , Year: 2012–2015
Name: A-Jax , Year: N/A
Name: Rainz , Year: N/A
Name: Leessang , Year: 2002–2017
Name: Tiny-G , Year: 2012–2015
Name: M&D , Year: N/A
Name: LIPBUBBLE , Year: N/A
Name: Homme , Year: 2010–2018
Name: Drug Restaurant , Year: N/A
Name: 2YOON , Year: 2013–2016
Name: HONEYST , Year: N/A
Name: Jewelry , Year: 2001–2015
Name: D.Holic , Year: 2014–2017
Name: Lucky J , Year: 2014-2016
Name: M.I.B , Year: 2011–2017
Name: Tahiti , Year: N/A
Name: The Legend , Year: 2014–2017
Name: Wassup , Year: N/A
Name: Rainbow Pixie , Year: 2009–2016
Name: 14U , Year: N/A
Name: The East Light , Year: N/A
Name: F-ve Dolls , Year: 2011–2015
Name: 8Eight , Year: 2007–2014
Name: 2EYES , Year: N/A
Name: Untouchable , Year: N/A
Name: DMTN , Year: 2010–2014
Name: Baby V.O.X , Year: 2013-2015
Name: LC9 , Year: 2013–2015
Name: ChoColat , Year: 2011–2017
Name: CoCoSoRi , Year: N/A
Name: MyB , Year: 2015–2016
Name: I.B.I , Year: 2016–2017
Name: A-Prince , Year: 2012–2015
Name: M.Pire , Year: 2013-2015
Name: Bob Girls , Year: 2014–2015
Name: Gangkiz , Year: 2012–2014
Name: TraxX , Year: N/A
Name: GI , Year: 2013–2016
Name: BTL , Year: 2014–2016
Name: SKarf , Year: 2012–2014
Name: T-max , Year: 2007–2012
Name: Sunny Days , Year: 2012–2016
Name: SeeYa , Year: 2006–2011
Name: 4L , Year: 2014–2016
Name: Blady , Year: 2011-2017
Name: Phantom , Year: 2011–2017
Name: Puretty , Year: 2012–2014
Name: Double-A , Year: 2011–2015
Name: The SeeYa , Year: 2012–2015
Name: D-Unit , Year: 2012–2013
Name: Unicorn , Year: 2015–2017
Name: N-Sonic , Year: 2011–2016
Name: Supreme Team , Year: 2009–2013
Name: GP Basic , Year: 2010–2015
Name: Shu-I , Year: 2009 -2015
Name: Big Mama , Year: 2003—2012
Name: N-Train , Year: 2011–2013
Name: NOM , Year: 2013–2016
Name: Ledt , Year: 2010–2016
Name: PARAN , Year: 2005-2011
Name: N.EX.T , Year: 1992-1997, 2003-2014
Name: Kiha & The Faces , Year: N/A
Name: Rumble Fish , Year: 2003-2010