Reading the source code of the open-source C# crawler NCrawler and porting it to Python 3.2 (5) (enter selenium)

钮高朗

2023-12-01

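"In this article we will use BeautifulSoup, a well-known Python library for parsing web pages, to implement a standard Handler, and use a breadth-first algorithm to get the crawler working."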

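As the quote above from the previous article shows, the plan was to use bs4. But bs4 is simple enough, and tutorials for it are easy to find online, so this article uses selenium instead.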

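selenium is a well-known framework for automated testing of websites. It can simulate a person operating a browser, which lets it fetch page content that a traditional crawler cannot get (for example, content loaded later through js + ajax).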

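This article uses the rc version of selenium, that is, the remote-control version, which is distributed as a jar package. To run it you must have a java runtime installed, then run the following command line in the directory that contains selenium-server.jar: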

java -jar selenium-server.jar

This command line also accepts extra parameters, for example to set the timeout for page requests.

If everything runs normally, a console appears. The next step is to install the selenium package for python and import it:

from selenium import selenium
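Before calling the client api, it can help to confirm that the server is actually listening. The following is a minimal sketch using only the standard library; it assumes the host and port used later in this post (localhost, 4444), and the helper name `is_port_open` is made up for illustration.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError if nothing accepts the connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed default address of the selenium rc server
    print(is_port_open("localhost", 4444))
```

This only shows that something accepts connections on the port, not that it is a selenium server.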

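Then call the api to control selenium rc, and selenium rc in turn drives the specified browser. Locating tags through the api requires XPath. The XPath of a page element can be obtained with a firefox plugin, which you can search for yourself.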
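To make the role of XPath concrete, here is a small sketch that uses only Python's standard library rather than selenium. The markup and the class name `topic` are hypothetical. `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed XML, so this illustrates how a path expression picks out elements, not how selenium evaluates it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a real page
PAGE = """
<html>
  <body>
    <div id="list">
      <a class="topic" href="/t/1">first</a>
      <a class="topic" href="/t/2">second</a>
      <a class="other" href="/about">about</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# ".//a[@class='topic']" selects every <a> descendant whose class is "topic"
links = [a.get("href") for a in root.findall(".//a[@class='topic']")]
print(links)
```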

Here is some code:

import Handler.BaseHandler
from selenium import selenium
import time


class SeleniumHandler(Handler.BaseHandler.BaseHandler):
    def __init__(self):
        super(SeleniumHandler, self).__init__()
        # selenium instance, created on the first call to Handle
        # (added here so that the "if not self.s" check in Handle cannot fail)
        self.s = None
        # config begin
        self.host = 'localhost'
        self.port = 4444
        self.baseurl = 'http://huati.weibo.com/'
        # self.browserpath = '*firefox3 D:/Program Files/Mozilla Firefox/firefox.exe'
        # raw string, so the backslashes in the Windows path are kept literally
        self.browserpath = r'*googlechrome C:\Program Files\Google\Chrome\Application\chrome.exe'

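The above is the initialization code. 4444 is the port on which selenium rc listens for commands; it can also be changed with a command-line parameter. One instance of the selenium object can only crawl one domain.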
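Since one instance is tied to one domain, a crawler that visits several domains needs one instance per domain. One possible way to organize that is sketched below; `SessionPool` and the `factory` callback are hypothetical names, not part of selenium or NCrawler, and the demo factory returns a plain dict instead of a real selenium object.

```python
from urllib.parse import urlsplit


class SessionPool:
    """Keep one session object per domain (scheme + host)."""

    def __init__(self, factory):
        # factory(base_url) must return a new session for that base URL
        self._factory = factory
        self._sessions = {}

    def get(self, url):
        parts = urlsplit(url)
        base_url = "{}://{}/".format(parts.scheme, parts.netloc)
        # Create the session lazily, the first time a domain is seen
        if base_url not in self._sessions:
            self._sessions[base_url] = self._factory(base_url)
        return self._sessions[base_url]


if __name__ == "__main__":
    # Stand-in factory; a real one would build and start a selenium instance
    pool = SessionPool(lambda base_url: {"base_url": base_url})
    a = pool.get("http://huati.weibo.com/page/1")
    b = pool.get("http://huati.weibo.com/page/2")
    print(a is b)
```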

Here is another piece of reference code:

    def Handle(self, url, pbags):
        super(SeleniumHandler, self).Handle(url, pbags)
        if not self.s:
            # first call: create the instance and start the browser
            self.s = selenium(self.host, self.port, self.browserpath, self.baseurl)
            self.s.start()
            self.openpage('/')
            self.s.window_maximize()
        else:
            self.s.open(self.baseurl)
        self.topicpage = 1
        # walk through the topic pages one by one
        while True:
            try:
                self.settopicurl()
                self.gettopicinfo()
            except Exception as ext:
                # error handle: exceptions have no .message attribute in python 3
                print(ext)
            if self.topicpage < self.topicpagecount:
                self.topicpage += 1
                self.gotonexttopicspage(self.topicpage)
            else:
                break
        self.TopiclistHandler()
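The paging loop in Handle can be tried out without a browser. The sketch below keeps the same shape (try a page, log the error, move on until the last page) but replaces the selenium calls with a callback; `crawl_pages`, `fetch_page` and `fake_fetch` are hypothetical names used only for this illustration.

```python
def crawl_pages(page_count, fetch_page):
    """Visit pages 1..page_count, collecting results and skipping failures."""
    results = []
    errors = []
    page = 1
    while True:
        try:
            results.append(fetch_page(page))
        except Exception as exc:
            # Record the failure and continue with the next page
            errors.append((page, str(exc)))
        if page < page_count:
            page += 1
        else:
            break
    return results, errors


def fake_fetch(page):
    # Stand-in for the real page handling; fails on page 2 on purpose
    if page == 2:
        raise ValueError("boom")
    return "page {}".format(page)


if __name__ == "__main__":
    print(crawl_pages(3, fake_fetch))
```

A failure on one page does not stop the loop, which matches the original code, where the exception is only printed.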

Here is one more link with a detailed explanation of selenium, for anyone interested:

http://www.cnblogs.com/hyddd/archive/2009/05/30/1492536.html
