Question:

Extracting Twitter follower data in Python with the Selenium Chrome webdriver? Unable to load all the followers

白泽语
2023-03-14

I am trying to scrape Twitter follower data for an account with 80K followers, using the Selenium Chrome webdriver and BeautifulSoup. I am facing two problems with my script:

1) While scrolling to the bottom of the page to load all the followers before grabbing the full page source, my script never scrolls all the way down. After loading a random number of followers it stops scrolling partway, and then starts iterating over each follower profile to collect its data. I want it to load every follower on the page first and only then start iterating over the profiles (see the sketch after this list).

2) My second problem is that on every run the script scrolls to the bottom one step at a time until all the followers are loaded, and only then starts extracting data by parsing one follower at a time. In my case (80K followers) that would take 4 to 5 days. Is there a better way?
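
For problem 1, a scroll loop that keys off the number of loaded follower cards, rather than scrollHeight alone, tends to be more reliable. A minimal sketch, assuming driver is already logged in and sitting on the followers page, and that each follower renders as a div.ProfileCard element (the same markup the script below parses):

import time

def load_all_followers(driver, pause=5, max_stalls=3):
    # Keep scrolling until the count of rendered follower cards stops
    # growing for max_stalls consecutive attempts.
    stalls = 0
    last_count = 0
    while stalls < max_stalls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the next batch time to render
        count = len(driver.find_elements_by_css_selector("div.ProfileCard"))
        if count == last_count:
            stalls += 1    # may just be a slow response, so retry a few times
        else:
            stalls = 0
            last_count = count
    return last_count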

Here is my script:

from bs4 import BeautifulSoup
import sys
import os,re
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from os import listdir
from os.path import isfile, join

print "Running for chrome."

chromedriver=sys.argv[1]
download_path=sys.argv[2]
os.system('killall -9 "Google Chrome"')
try:
	os.environ["webdriver.chrome.driver"]=chromedriver
	chromeOptions = webdriver.ChromeOptions()
	prefs = {"download.default_directory" : download_path}
	chromeOptions.add_experimental_option("prefs",prefs)
	driver = webdriver.Chrome(executable_path=chromedriver, chrome_options=chromeOptions)
	driver.implicitly_wait(20)
	driver.maximize_window()
except Exception as err:
	print "Error:Failed to open chrome."
	print "Error: ",err
	driver.stop_client()
	driver.close()
	
#opening the web page
try:
	driver.get('https://twitter.com/login')
except Exception as err:
	print "Error:Failed to open url."
	print "Error: ",err
	driver.stop_client()
	driver.close()

username = driver.find_element_by_xpath("//input[@name='session[username_or_email]' and @class='js-username-field email-input js-initial-focus']")
password = driver.find_element_by_xpath("//input[@name='session[password]' and @class='js-password-field']")

username.send_keys("###########")
password.send_keys("###########")
driver.find_element_by_xpath("//button[@type='submit']").click()
#os.system('killall -9 "Google Chrome"')
driver.get('https://twitter.com/sadserver/followers')



followers_link=driver.page_source  # followers page loads 18 at a time
soup=BeautifulSoup(followers_link,'html.parser')

output=open('twitter_follower_sadoperator.csv','a')
output.write('Name,Twitter_Handle,Location,Bio,Join_Date,Link'+'\n')
div = soup.find('div',{'class':'GridTimeline-items has-items'})
bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'})
name_list=[]
lastHeight = driver.execute_script("return document.body.scrollHeight")


followers_per_page = 18   # followers rendered per scroll batch
followers_count = 80000   # total followers of the account (80K per the question)

for _ in xrange(0, followers_count/followers_per_page + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5)
        newHeight = driver.execute_script("return document.body.scrollHeight")
        if newHeight == lastHeight:
                followers_link=driver.page_source  # followers page loads 18 at a time
                soup=BeautifulSoup(followers_link,'html.parser')
                div = soup.find('div',{'class':'GridTimeline-items has-items'})
                bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'})
                for name in bref:
                        name_list.append(name['href'])
                break
        lastHeight = newHeight
        followers_link=''

print len(name_list)


for x in range(0,len(name_list)):
        #print name['href']
        #print name.text
        driver.stop_client()
        driver.get('https://twitter.com'+name_list[x])
        page_source=driver.page_source
        each_soup=BeautifulSoup(page_source,'html.parser')
        profile=each_soup.find('div',{'class':'ProfileHeaderCard'})
                            
        try:
                name = profile.find('h1',{'class':'ProfileHeaderCard-name'}).find('a').text
                if name:
                        output.write('"'+name.strip().encode('utf-8')+'"'+',')
                else:
                        output.write(' '+',')
        except Exception as e:
                output.write(' '+',')
                print 'Error in name:',e

        try:
                handle=profile.find('h2',{'class':'ProfileHeaderCard-screenname u-inlineBlock u-dir'}).text
                if handle:
                        output.write('"'+handle.strip().encode('utf-8')+'"'+',')
                else:
                        output.write(' '+',')
        except Exception as e:
                output.write(' '+',')
                print 'Error in handle:',e

        try:
                location = profile.find('div',{'class':'ProfileHeaderCard-location'}).text
                if location:
                        output.write('"'+location.strip().encode('utf-8')+'"'+',')
                else:
                        output.write(' '+',')
        except Exception as e:
                output.write(' '+',')
                print 'Error in location:',e

        try:
                bio=profile.find('p',{'class':'ProfileHeaderCard-bio u-dir'}).text
                if bio:
                        output.write('"'+bio.strip().encode('utf-8')+'"'+',')
                else:
                        output.write(' '+',')
        except Exception as e:
                output.write(' '+',')
                print 'Error in bio:',e
                        
        try:
                joinDate = profile.find('div',{'class':'ProfileHeaderCard-joinDate'}).text
                if joinDate:
                        output.write('"'+joinDate.strip().encode('utf-8')+'"'+',')
                else:
                        output.write(' '+',')
        except Exception as e:
                output.write(' '+',')
                print 'Error in joindate:',e
        
        try:
                url =  [check.find('a') for check in profile.find('div',{'class':'ProfileHeaderCard-url'}).findAll('span')][1]
                if url:
                        output.write('"'+url['href'].strip().encode('utf-8')+'"'+'\n')
                else:
                        output.write(' '+'\n')
        except Exception as e:
                output.write(' '+'\n')
                print 'Error in url:',e
        


        
output.close()


os.system("kill -9 `ps -deaf | grep chrome | awk '{print $2}'`")

3 Answers

汤博
2023-03-14
1. In Firefox or another browser, open the developer console and record (copy) the request that fires as you scroll down the page; you will use it to construct your own requests. The request looks like this: https://twitter.com/DiaryofaMadeMan/followers/users?include_available_features=1

This approach is much better than the API, because you can load large amounts of data without any rate limits.
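
A rough sketch of paging that endpoint directly, assuming the legacy response format of the time: JSON carrying the rendered cards in an items_html fragment, a paging cursor in min_position, and a has_more_items flag (the endpoint name and parameter come from the request captured above; cookies and headers must be copied from a logged-in browser session):

import requests
from bs4 import BeautifulSoup

URL = "https://twitter.com/sadoperator/followers/users"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",  # mark the request as AJAX
})
# session.cookies.update({...})  # paste auth cookies copied from the browser

handles = []
cursor = None
while True:
    params = {"include_available_features": 1}
    if cursor:
        params["max_position"] = cursor  # assumed paging parameter
    data = session.get(URL, params=params).json()
    # items_html holds the same ProfileCard markup the Selenium script
    # parses, so BeautifulSoup can be reused on the fragment directly.
    soup = BeautifulSoup(data["items_html"], "html.parser")
    for a in soup.find_all("a", {"class": "ProfileCard-bg js-nav"}):
        handles.append(a["href"])
    cursor = data.get("min_position")
    if not data.get("has_more_items"):
        break

print(len(handles))

Each response returns a whole batch of followers at once, with no browser rendering in between, which is why this is so much faster than scrolling in Selenium.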

鱼征
2023-03-14

There is a better way: use the Twitter API. Here is a quick GitHub script I found: Github Script. You may feel the time spent on Selenium was wasted (though there are benefits to not using the API). This is a good post about automating things versus knowing how they work: twitterapi

There is a way to scroll repeatedly, but you have to do some math, or set a condition to stop it:

driver.execute_script("window.scrollTo(0, 10000);") 

Let's say you have 100K followers, the initial load shows 100 of them, and each scroll loads 10 more. You would need to scroll roughly 9,990 more times.
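
In code, that arithmetic simply becomes the loop bound. A tiny sketch with the hypothetical numbers above (driver as in the question; scrolling to document.body.scrollHeight rather than a fixed pixel offset, so each scroll lands at the current bottom):

import time

total_followers = 100000   # hypothetical figures from the paragraph above
initially_loaded = 100
loaded_per_scroll = 10

scrolls_needed = (total_followers - initially_loaded) // loaded_per_scroll  # 9990
for _ in range(scrolls_needed):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # let the next batch render before scrolling again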

And of course, here is alecxe's exact approach for your case :D (Quora answer by alecxe):

html = driver.page_source

Once all the followers are displayed (after scrolling), you can take page_source and parse it with something like BeautifulSoup.
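
Put together, the tail end of the scrape might look like this minimal sketch (selectors copied from the question's script; assumes scrolling has already finished):

from bs4 import BeautifulSoup

html = driver.page_source  # grab the DOM only after all followers are rendered
soup = BeautifulSoup(html, "html.parser")
grid = soup.find("div", {"class": "GridTimeline-items has-items"})
handles = [a["href"] for a in grid.find_all("a", {"class": "ProfileCard-bg js-nav"})]
print(len(handles))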

毛勇
2023-03-14

I haven't implemented it exactly the way alecxe described in his answer, but my script is still not parsing all the followers. It still loads a random number of them, and I can't figure out what is going on. Could someone try running this on their end and see whether they can load all the followers? Here is the modified script:

from bs4 import BeautifulSoup
import sys
import os,re
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from os import listdir
from os.path import isfile, join

print "Running for chrome."

chromedriver=sys.argv[1]
download_path=sys.argv[2]
os.system('killall -9 "Google Chrome"')
try:
	os.environ["webdriver.chrome.driver"]=chromedriver
	chromeOptions = webdriver.ChromeOptions()
	prefs = {"download.default_directory" : download_path}
	chromeOptions.add_experimental_option("prefs",prefs)
	driver = webdriver.Chrome(executable_path=chromedriver, chrome_options=chromeOptions)
	driver.implicitly_wait(20)
	driver.maximize_window()
except Exception as err:
	print "Error:Failed to open chrome."
	print "Error: ",err
	driver.stop_client()
	driver.close()
	
#opening the web page
try:
	driver.get('https://twitter.com/login')
except Exception as err:
	print "Error:Failed to open url."
	print "Error: ",err
	driver.stop_client()
	driver.close()

username = driver.find_element_by_xpath("//input[@name='session[username_or_email]' and @class='js-username-field email-input js-initial-focus']")
password = driver.find_element_by_xpath("//input[@name='session[password]' and @class='js-password-field']")

username.send_keys("*****************")
password.send_keys("*****************")
driver.find_element_by_xpath("//button[@type='submit']").click()
#os.system('killall -9 "Google Chrome"')
driver.get('https://twitter.com/sadoperator/followers')



followers_link=driver.page_source  # followers page loads 18 at a time
soup=BeautifulSoup(followers_link,'html.parser')

output=open('twitter_follower_sadoperator.csv','a')
output.write('Name,Twitter_Handle,Location,Bio,Join_Date,Link'+'\n')
div = soup.find('div',{'class':'GridTimeline-items has-items'})
bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'})
name_list=[]
lastHeight = driver.execute_script("return document.body.scrollHeight")

followers_link=driver.page_source  # followers page loads 18 at a time
soup=BeautifulSoup(followers_link,'html.parser')

followers_per_page = 18
followers_count = 15777


for _ in xrange(0, followers_count/followers_per_page + 1):
        driver.execute_script("window.scrollTo(0, 7755000);")
        time.sleep(2)
        newHeight = driver.execute_script("return document.body.scrollHeight")
        if newHeight == lastHeight:
                followers_link=driver.page_source  # followers page loads 18 at a time
                soup=BeautifulSoup(followers_link,'html.parser')
                div = soup.find('div',{'class':'GridTimeline-items has-items'})
                bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'})
                for name in bref:
                        name_list.append(name['href'])
                break
        lastHeight = newHeight
        followers_link=''

print len(name_list)

'''
for x in range(0,len(name_list)):
        #print name['href']
        #print name.text
        driver.stop_client()
        driver.get('https://twitter.com'+name_list[x])
        page_source=driver.page_source
        each_soup=BeautifulSoup(page_source,'html.parser')
        profile=each_soup.find('div',{'class':'ProfileHeaderCard'})
                            
        try:
                name = profile.find('h1',{'class':'ProfileHeaderCard-name'}).find('a').text
                if name:
                        output.write('"'+name.strip().encode('utf-8')+'"'+',')
                else:
                        output.write(' '+',')
        except Exception as e:
                output.write(' '+',')
                print 'Error in name:',e

        try:
                handle=profile.find('h2',{'class':'ProfileHeaderCard-screenname u-inlineBlock u-dir'}).text
                if handle:
                        output.write('"'+handle.strip().encode('utf-8')+'"'+',')
                else:
                        output.write(' '+',')
        except Exception as e:
                output.write(' '+',')
                print 'Error in handle:',e

        try:
                location = profile.find('div',{'class':'ProfileHeaderCard-location'}).text
                if location:
                        output.write('"'+location.strip().encode('utf-8')+'"'+',')
                else:
                        output.write(' '+',')
        except Exception as e:
                output.write(' '+',')
                print 'Error in location:',e

        try:
                bio=profile.find('p',{'class':'ProfileHeaderCard-bio u-dir'}).text
                if bio:
                        output.write('"'+bio.strip().encode('utf-8')+'"'+',')
                else:
                        output.write(' '+',')
        except Exception as e:
                output.write(' '+',')
                print 'Error in bio:',e
                        
        try:
                joinDate = profile.find('div',{'class':'ProfileHeaderCard-joinDate'}).text
                if joinDate:
                        output.write('"'+joinDate.strip().encode('utf-8')+'"'+',')
                else:
                        output.write(' '+',')
        except Exception as e:
                output.write(' '+',')
                print 'Error in joindate:',e
        
        try:
                url =  [check.find('a') for check in profile.find('div',{'class':'ProfileHeaderCard-url'}).findAll('span')][1]
                if url:
                        output.write('"'+url['href'].strip().encode('utf-8')+'"'+'\n')
                else:
                        output.write(' '+'\n')
        except Exception as e:
                output.write(' '+'\n')
                print 'Error in url:',e
        


        
output.close()
'''

os.system("kill -9 `ps -deaf | grep chrome | awk '{print $2}'`")