Question:

Parsing the HTML of an entire web page that loads more content as you scroll down

宇文兴言

2023-03-14

from urllib.request import urlopen

from bs4 import BeautifulSoup

# Fetch the page exactly as the server first sends it
r = urlopen('https://twitter.com/ndtv').read()
soup = BeautifulSoup(r, 'html.parser')

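This does not give me the whole page as it looks after scrolling all the way down, which is what I want; it gives me only part of it.

Edit: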

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup


class wait_for_more_than_n_elements_to_be_present(object):
    """Custom expected condition: true once more than `count` elements match `locator`."""

    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        try:
            # locator is a (By, value) tuple, so unpack it into find_elements
            elements = driver.find_elements(*self.locator)
            return len(elements) > self.count
        except StaleElementReferenceException:
            return False


def return_html_code(url):
    driver = webdriver.Firefox()
    driver.maximize_window()
    driver.get(url)
    # initial wait for the tweets to load
    wait = WebDriverWait(driver, 10)
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "li[data-item-id]")))
    # scroll down to the last tweet until no more tweets are loaded
    while True:
        tweets = driver.find_elements(By.CSS_SELECTOR, "li[data-item-id]")
        number_of_tweets = len(tweets)
        print(number_of_tweets)
        driver.execute_script("arguments[0].scrollIntoView();", tweets[-1])
        try:
            wait.until(wait_for_more_than_n_elements_to_be_present((By.CSS_SELECTOR, "li[data-item-id]"), number_of_tweets))
        except TimeoutException:
            break
    html_full_source = driver.page_source
    driver.close()
    return html_full_source


url = 'https://twitter.com/thecoolstacks'
# using the selenium browser
html_source = return_html_code(url)
soup_selenium = BeautifulSoup(html_source, 'html.parser')
print(soup_selenium)
text_tweet = []
alltweets_selenium = soup_selenium.find_all(attrs={'data-item-type': 'tweet'})
for tweet in alltweets_selenium:
    # text of the tweet
    html_tweet = tweet.find_all("p", class_="TweetTextSize TweetTextSize--16px js-tweet-text tweet-text")
    if html_tweet:  # skip items that have no matching text paragraph
        text_tweet.append(html_tweet[0].get_text())
print(text_tweet)

Expected output:

import requests
from bs4 import BeautifulSoup

url = 'https://twitter.com/thecoolstacks'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
alltweets = soup.find_all(attrs={'data-item-type': 'tweet'})
print(alltweets[0])

1 answer

孟哲

2023-03-14

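I would still insist on using the Twitter API.

Alternatively, here is how to solve the problem with selenium:

Use an explicit wait and define a custom expected condition that waits for more tweets to load after each scroll.

Implementation: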

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class wait_for_more_than_n_elements_to_be_present(object):
    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        try:
            # locator is a (By, value) tuple, so unpack it into the public find_elements API
            elements = driver.find_elements(*self.locator)
            return len(elements) > self.count
        except StaleElementReferenceException:
            return False


url = "https://twitter.com/ndtv"
driver = webdriver.Firefox()
driver.maximize_window()
driver.get(url)

# initial wait for the tweets to load
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "li[data-item-id]")))

# scroll down to the last tweet until no more tweets are loaded
while True:
    tweets = driver.find_elements(By.CSS_SELECTOR, "li[data-item-id]")
    number_of_tweets = len(tweets)

    driver.execute_script("arguments[0].scrollIntoView();", tweets[-1])

    try:
        wait.until(wait_for_more_than_n_elements_to_be_present((By.CSS_SELECTOR, "li[data-item-id]"), number_of_tweets))
    except TimeoutException:
        break

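This will scroll down as many times as needed to load all of the existing tweets in that channel.

Here is the HTML-parsing snippet that extracts the tweets: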
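
The stopping rule in the loop above (keep scrolling until a wait times out with no new items) does not depend on Selenium itself. Below is a minimal sketch of the same control flow using only the standard library. `scroll_until_stable`, `FakeFeed`, and their parameters are hypothetical names introduced for illustration, and the fake feed merely stands in for a browser page:

```python
import time


def scroll_until_stable(count_items, scroll_to_last, timeout=10.0, poll=0.5):
    """Scroll repeatedly until no new items appear within `timeout` seconds.

    count_items    -- callable returning the number of items currently loaded
    scroll_to_last -- callable that scrolls the last item into view
    Returns the final item count.
    """
    while True:
        before = count_items()
        scroll_to_last()
        deadline = time.monotonic() + timeout
        # Poll until the count grows or the deadline passes.
        while count_items() <= before:
            if time.monotonic() >= deadline:
                return count_items()
            time.sleep(poll)


class FakeFeed:
    """Stand-in for a page that loads `batch` more items per scroll, up to `total`."""

    def __init__(self, total, batch):
        self.total = total
        self.batch = batch
        self.loaded = min(batch, total)

    def count(self):
        return self.loaded

    def scroll(self):
        self.loaded = min(self.loaded + self.batch, self.total)


feed = FakeFeed(total=45, batch=20)
final_count = scroll_until_stable(feed.count, feed.scroll, timeout=0.2, poll=0.05)
print(final_count)
```

With Selenium, `count_items` could be something like `lambda: len(driver.find_elements(By.CSS_SELECTOR, "li[data-item-id]"))`, and `scroll_to_last` a call to `driver.execute_script` as in the loop above.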

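from bs4 import BeautifulSoup

page_source = driver.page_source
driver.close()

# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page_source, "html.parser")
for tweet in soup.select("div.tweet div.content"):
    print(tweet.p.text)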

It prints:

Father's Day Facebook post by arrested cop Suhas Gokhale's son got nearly 10,000 likes http://goo.gl/aPqlxf  pic.twitter.com/JUqmdWNQ3c
#HWL2015 End of third quarter! Breathtaking stuff. India 2-2 Pakistan - http://sports.ndtv.com/hockey/news/244463-hockey-world-league-semifinal-india-vs-pakistan-antwerp …
Why these Kashmiri boys may miss their IIT dream http://goo.gl/9LVKfK  pic.twitter.com/gohX21Gibi
...
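
To see what the selector `div.tweet div.content` and `tweet.p` pick out, here is a small sketch run against a hand-written HTML fragment. The markup is an assumption made up to mirror the selectors used above, not Twitter's real markup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the structure the selectors expect.
sample_html = """
<ol>
  <li data-item-id="1" data-item-type="tweet">
    <div class="tweet"><div class="content"><p class="tweet-text">First tweet</p></div></div>
  </li>
  <li data-item-id="2" data-item-type="tweet">
    <div class="tweet"><div class="content"><p class="tweet-text">Second <b>tweet</b></p></div></div>
  </li>
</ol>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# One entry per tweet: the text of the first <p> inside each content div.
texts = [content.p.get_text() for content in soup.select("div.tweet div.content")]
print(texts)
```

`get_text()` flattens nested tags such as `<b>`, so each tweet comes back as a single string.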
