
WebCollector-Python

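An open-source web crawler framework based on Python

License: GPLv3

Language: Python

Category: Application Tools, Web Crawlers

Software type: Open source

Region: China

Submitted by: 唐麒

Operating system: Cross-platform

Open-source organization: None

Target audience: Unknown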

Software Overview

WebCollector-Python

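WebCollector-Python is a Python crawler framework (kernel) that needs no configuration and is easy to extend. It provides a compact API, so a capable crawler can be written in a small amount of code.

WebCollector Java Version

The Java version of WebCollector is more efficient than WebCollector-Python: https://github.com/CrawlScript/WebCollector

Installation

Install with pip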

pip install https://github.com/CrawlScript/WebCollector-Python/archive/master.zip

Examples

Basic

Quick Start

Automatically Detecting URLs

demo_auto_news_crawler.py:

# coding=utf-8
import webcollector as wc


class NewsCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=True)
        self.num_threads = 10
        self.add_seed("https://github.blog/")
        self.add_regex("+https://github.blog/[0-9]+.*")
        self.add_regex("-.*#.*")  # do not detect urls that contain "#"

    def visit(self, page, detected):
        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")


crawler = NewsCrawler()
crawler.start(10)
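
The "+" and "-" prefixes passed to add_regex above mark include and exclude patterns. The following is a minimal sketch of how such a rule set could be evaluated, assuming a URL is accepted only when it fully matches at least one "+" pattern and no "-" pattern. RegexRule and accepts are hypothetical names used for illustration, not the library's own classes, and the example URLs are made up.

```python
import re


class RegexRule:
    """Hypothetical stand-in for a crawler's URL rule set (not the library's class)."""

    def __init__(self):
        self.positive = []  # compiled include patterns
        self.negative = []  # compiled exclude patterns

    def add(self, rule):
        # "+" marks an include pattern, "-" an exclude pattern.
        # Assumption: a rule without a prefix is treated as an include pattern.
        if rule.startswith("-"):
            self.negative.append(re.compile(rule[1:]))
        elif rule.startswith("+"):
            self.positive.append(re.compile(rule[1:]))
        else:
            self.positive.append(re.compile(rule))

    def accepts(self, url):
        # Any matching exclude pattern rejects the URL outright.
        if any(p.fullmatch(url) for p in self.negative):
            return False
        # Otherwise at least one include pattern must match.
        return any(p.fullmatch(url) for p in self.positive)


rules = RegexRule()
rules.add("+https://github.blog/[0-9]+.*")
rules.add("-.*#.*")

post_ok = rules.accepts("https://github.blog/2019-02-14-some-post/")
post_with_fragment = rules.accepts("https://github.blog/2019-02-14-some-post/#comments")
category_page = rules.accepts("https://github.blog/category/engineering/")
print(post_ok, post_with_fragment, category_page)
```

Under these assumptions, only the first URL is accepted: the second contains "#", and the third does not start with digits after the host.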

Manually Detecting URLs

demo_manual_news_crawler.py:

# coding=utf-8
import webcollector as wc


class NewsCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=False)
        self.num_threads = 10
        self.add_seed("https://github.blog/")

    def visit(self, page, detected):

        detected.extend(page.links("https://github.blog/[0-9]+.*"))

        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")


crawler = NewsCrawler()
crawler.start(10)
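
With auto_detect=False, the visit method decides which links to follow by passing a regex to page.links. As a rough illustration of that idea using only the standard library, the sketch below collects the href values of anchor tags, resolves them against the page URL, and keeps those that fully match a pattern. extract_links and LinkParser are hypothetical helpers, the HTML is made up, and the library's actual implementation may differ.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.urls.append(urljoin(self.base_url, href))


def extract_links(html, base_url, pattern):
    """Return the links in html whose absolute URL fully matches pattern."""
    parser = LinkParser(base_url)
    parser.feed(html)
    regex = re.compile(pattern)
    return [url for url in parser.urls if regex.fullmatch(url)]


html = """
<a href="/2019-02-14-some-post/">post</a>
<a href="/about/">about</a>
<a href="https://github.blog/2019-03-01-another-post/">another</a>
"""
links = extract_links(html, "https://github.blog/", r"https://github\.blog/[0-9]+.*")
print(links)
```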

Filtering Detected URLs with the detected_filter Plugin

demo_detected_filter.py:

# coding=utf-8
import webcollector as wc
from webcollector.filter import Filter
import re


class RegexDetectedFilter(Filter):
    def filter(self, crawl_datum):
        if re.fullmatch("https://github.blog/2019-02.*", crawl_datum.url):
            return crawl_datum
        else:
            print("filtered by detected_filter: {}".format(crawl_datum.brief_info()))
            return None


class NewsCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=True, detected_filter=RegexDetectedFilter())
        self.num_threads = 10
        self.add_seed("https://github.blog/")

    def visit(self, page, detected):

        detected.extend(page.links("https://github.blog/[0-9]+.*"))

        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")


crawler = NewsCrawler()
crawler.start(10)
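
The filter above follows a simple contract: return the datum to keep it, or None to drop it. The sketch below exercises that contract in isolation, without the crawler. FakeCrawlDatum and apply_filter are hypothetical stand-ins written for this illustration, and the URLs are made up.

```python
import re
from dataclasses import dataclass


@dataclass
class FakeCrawlDatum:
    """Hypothetical stand-in that carries only the url attribute used here."""
    url: str


def february_2019_only(crawl_datum):
    # Same contract as the filter above: return the datum to keep it, None to drop it.
    if re.fullmatch(r"https://github\.blog/2019-02.*", crawl_datum.url):
        return crawl_datum
    return None


def apply_filter(filter_func, datums):
    """Run every datum through filter_func and keep the ones that are not None."""
    kept = []
    for datum in datums:
        result = filter_func(datum)
        if result is not None:
            kept.append(result)
    return kept


candidates = [
    FakeCrawlDatum("https://github.blog/2019-02-14-some-post/"),
    FakeCrawlDatum("https://github.blog/2019-03-01-another-post/"),
    FakeCrawlDatum("https://github.blog/about/"),
]
kept_urls = [d.url for d in apply_filter(february_2019_only, candidates)]
print(kept_urls)
```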

Resumable Crawling with RedisCrawler (Can Resume After Shutdown)

demo_redis_crawler.py:

# coding=utf-8
from redis import StrictRedis
import webcollector as wc


class NewsCrawler(wc.RedisCrawler):

    def __init__(self):
        super().__init__(redis_client=StrictRedis("127.0.0.1"),
                         db_prefix="news",
                         auto_detect=True)
        self.num_threads = 10
        self.resumable = True  # you can resume crawling after shutdown
        self.add_seed("https://github.blog/")
        self.add_regex("+https://github.blog/[0-9]+.*")
        self.add_regex("-.*#.*")  # do not detect urls that contain "#"

    def visit(self, page, detected):
        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")


crawler = NewsCrawler()
crawler.start(10)

Customizing HTTP Requests with Requests

demo_custom_http_request.py:

# coding=utf-8

import webcollector as wc
from webcollector.model import Page
from webcollector.plugin.net import HttpRequester

import requests


class MyRequester(HttpRequester):
    def get_response(self, crawl_datum):
        # custom http request
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
        }

        print("sending request with MyRequester")

        # send request and get response
        response = requests.get(crawl_datum.url, headers=headers)

        # update code
        crawl_datum.code = response.status_code

        # wrap http response as a Page object
        page = Page(crawl_datum,
                    response.content,
                    content_type=response.headers["Content-Type"],
                    http_charset=response.encoding)

        return page


class NewsCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=True)
        self.num_threads = 10

        # set requester to enable MyRequester
        self.requester = MyRequester()

        self.add_seed("https://github.blog/")
        self.add_regex("+https://github.blog/[0-9]+.*")
        self.add_regex("-.*#.*")  # do not detect urls that contain "#"

    def visit(self, page, detected):
        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")


crawler = NewsCrawler()
crawler.start(10)
