
AntNest

A simple, fast asynchronous web crawler framework

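License: LGPL

Language: Python

Category: Application tools, web crawlers

Software type: Open source

Region: China

Submitted by: 黄朗

Operating system: Cross-platform

Open-source organization: None

Target audience: Unknown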

Overview

AntNest

A simple, fast asynchronous web crawler framework (Python 3.6+) with only about 600 lines of code.

Features

An out-of-the-box HTTP client
An Item extractor that lets you declare explicitly how to parse data from a response (supports xpath, jpath, or regex)
A convenient workflow through the "ensure_future" and "as_completed" APIs
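
The "ensure_future" and "as_completed" workflow named in the feature list follows a pattern that also exists in the standard library's asyncio module. The sketch below shows that pattern with plain asyncio only. It is not AntNest's own API, and `fake_fetch` and the example.com URLs are hypothetical stand-ins for real requests.

```python
import asyncio


async def fake_fetch(url: str, delay: float) -> str:
    """Hypothetical stand-in for an HTTP request; sleeps instead of using the network."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def crawl() -> list:
    # Schedule every coroutine right away so they run concurrently.
    tasks = [
        asyncio.ensure_future(fake_fetch("https://example.com/slow", 0.2)),
        asyncio.ensure_future(fake_fetch("https://example.com/fast", 0.01)),
    ]
    results = []
    # as_completed yields awaitables in the order they finish,
    # not in the order they were scheduled.
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return results


if __name__ == "__main__":
    print(asyncio.run(crawl()))
```

Because results are consumed as they finish, a slow page does not hold up the processing of faster ones.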

Installation

pip install ant_nest

Usage:

Create a demo project:

>>> ant_nest -c examples

The following files are created automatically:

drwxr-xr-x   5 bruce  staff  160 Jun 30 18:24 ants
-rw-r--r--   1 bruce  staff  208 Jun 26 22:59 settings.py

Suppose we want to fetch GitHub's trending repositories. Let's create "examples/ants/example2.py":

from ant_nest import *
from yarl import URL


class GithubAnt(Ant):
    """Crawl trending repositories from github"""
    item_pipelines = [
        ItemFieldReplacePipeline(
            ('meta_content', 'star', 'fork'),
            excess_chars=('\r', '\n', '\t', '  '))
    ]
    concurrent_limit = 1  # save the website's and your bandwidth!

    def __init__(self):
        super().__init__()
        self.item_extractor = ItemExtractor(dict)
        self.item_extractor.add_pattern(
            'xpath', 'title', '//h1/strong/a/text()')
        self.item_extractor.add_pattern(
            'xpath', 'author', '//h1/span/a/text()', default='Not found')
        self.item_extractor.add_pattern(
            'xpath', 'meta_content',
            '//div[@class="repository-meta-content col-11 mb-1"]//text()',
            extract_type=ItemExtractor.EXTRACT_WITH_JOIN_ALL)
        self.item_extractor.add_pattern(
            'xpath',
            'star', '//a[@class="social-count js-social-count"]/text()')
        self.item_extractor.add_pattern(
            'xpath', 'fork', '//a[@class="social-count"]/text()')

    async def crawl_repo(self, url):
        """Crawl information from one repo"""
        response = await self.request(url)
        # extract item from response
        item = self.item_extractor.extract(response)
        item['origin_url'] = response.url

        await self.collect(item)  # let item go through pipelines(be cleaned)
        self.logger.info('*' * 70 + 'I got one hot repo!\n' + str(item))

    async def run(self):
        """App entrance, our play ground"""
        response = await self.request('https://github.com/explore')
        for url in response.html_element.xpath(
                '/html/body/div[4]/div[2]/div/div[2]/div[1]/article//h1/a[2]/'
                '@href'):
            # crawl many repos with our coroutines pool
            self.schedule_coroutine(
                self.crawl_repo(response.url.join(URL(url))))
        self.logger.info('Waiting...')
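
To make the `item_pipelines` setting in the example more concrete, here is a minimal sketch of what a field-cleaning step such as `ItemFieldReplacePipeline` might do, assuming it deletes every listed substring from the named fields. The function `strip_excess_chars` and the sample input are hypothetical and not part of AntNest.

```python
from typing import Dict, Iterable


def strip_excess_chars(item: Dict[str, object],
                       fields: Iterable[str],
                       excess_chars: Iterable[str] = ("\r", "\n", "\t")) -> Dict[str, object]:
    """Remove every occurrence of each unwanted substring from the given string fields."""
    for field in fields:
        value = item.get(field)
        if not isinstance(value, str):
            continue  # skip missing or non-string fields
        for chars in excess_chars:
            value = value.replace(chars, "")
        item[field] = value
    return item


# Made-up raw values, as they might look straight after xpath extraction.
raw = {"star": "\n  3,743\n", "fork": "\t327 ", "title": "NLP-progress\n"}
clean = strip_excess_chars(raw, ("star", "fork"), ("\r", "\n", "\t", "  "))
print(clean)
```

Only the fields passed in are touched, so `title` keeps its trailing newline in this run.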

Then we can list all runnable ants (inside the "examples" directory):

>>> ant_nest -l
ants.example2.GithubAnt

Run it (without debug logging):

>>> ant_nest -a ants.example2.GithubAnt
INFO:GithubAnt:Opening
INFO:GithubAnt:Waiting...
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'NLP-progress', 'author': 'sebastianruder', 'meta_content': 'Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.', 'star': '3,743', 'fork': '327', 'origin_url': URL('https://github.com/sebastianruder/NLP-progress')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'material-dashboard', 'author': 'creativetimofficial', 'meta_content': 'Material Dashboard - Open Source Bootstrap 4 Material Design Adminhttps://demos.creative-tim.com/materi…', 'star': '6,032', 'fork': '187', 'origin_url': URL('https://github.com/creativetimofficial/material-dashboard')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'mkcert', 'author': 'FiloSottile', 'meta_content': "A simple zero-config tool to make locally-trusted development certificates with any names you'd like.", 'star': '2,311', 'fork': '60', 'origin_url': URL('https://github.com/FiloSottile/mkcert')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'pure-bash-bible', 'author': 'dylanaraps', 'meta_content': 'A collection of pure bash alternatives to external processes.', 'star': '6,385', 'fork': '210', 'origin_url': URL('https://github.com/dylanaraps/pure-bash-bible')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'flutter', 'author': 'flutter', 'meta_content': 'Flutter makes it easy and fast to build beautiful mobile apps.https://flutter.io', 'star': '30,579', 'fork': '1,337', 'origin_url': URL('https://github.com/flutter/flutter')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'Java-Interview', 'author': 'crossoverJie', 'meta_content': 'Java related : basic, concurrent, algorithm https://crossoverjie.top/categories/J…', 'star': '4,687', 'fork': '409', 'origin_url': URL('https://github.com/crossoverJie/Java-Interview')}
INFO:GithubAnt:Closed
INFO:GithubAnt:Get 7 Request in total
INFO:GithubAnt:Get 7 Response in total
INFO:GithubAnt:Get 6 dict in total
INFO:GithubAnt:Run GithubAnt in 18.157656 seconds

We can configure our ant through class attributes:

class Ant(abc.ABC):
    response_pipelines: List[Pipeline] = []
    request_pipelines: List[Pipeline] = []
    item_pipelines: List[Pipeline] = []
    request_cls = Request
    response_cls = Response
    request_timeout = DEFAULT_TIMEOUT.total
    request_retries = 3
    request_retry_delay = 5
    request_proxies: List[Union[str, URL]] = []
    request_max_redirects = 10
    request_allow_redirects = True
    response_in_stream = False
    connection_limit = 100  # see "TCPConnector" in "aiohttp"
    connection_limit_per_host = 0
    concurrent_limit = 100
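
A setting such as `concurrent_limit` caps how many coroutines run at the same time. One common way to get that behaviour in plain asyncio is a semaphore. The sketch below illustrates the idea and is not AntNest's actual implementation; `LimitedPool`, `job`, and the numbers are hypothetical.

```python
import asyncio


class LimitedPool:
    """Hypothetical helper: runs coroutines with at most `limit` active at a time."""

    def __init__(self, limit: int):
        self._semaphore = asyncio.Semaphore(limit)
        self.active = 0
        self.peak = 0  # highest number of simultaneously running jobs seen

    async def run(self, coro_func, *args):
        # Wait here until one of the `limit` slots is free.
        async with self._semaphore:
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await coro_func(*args)
            finally:
                self.active -= 1


async def job(n: int) -> int:
    await asyncio.sleep(0.01)  # pretend to do network I/O
    return n * 2


async def main(limit: int = 2, count: int = 6):
    pool = LimitedPool(limit)
    # gather returns results in submission order, whatever order jobs finish in.
    results = await asyncio.gather(*(pool.run(job, n) for n in range(count)))
    return results, pool.peak


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With a limit of 1, as in the GithubAnt example, jobs would simply run one after another.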

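About Item

An Item is a single piece of data that we ultimately want to collect. It is not a specific class: it can be a plain dict, a custom class, or even an ORM object, depending on our needs and preferences.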
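
Since an item can be any container, the same extraction rules can fill either a plain dict or a custom class. The sketch below illustrates this with a tiny regex-based extractor built on the standard library only. `RegexExtractor`, `Repo`, and the sample HTML are hypothetical and do not reproduce AntNest's `ItemExtractor`.

```python
import re
from dataclasses import dataclass


@dataclass
class Repo:
    """Hypothetical custom item class."""
    title: str = ""
    star: str = ""


class RegexExtractor:
    """Hypothetical extractor: maps field names to regex patterns with one capture group."""

    def __init__(self, item_factory):
        self.item_factory = item_factory  # e.g. dict or a custom class
        self.patterns = {}

    def add_pattern(self, field: str, pattern: str, default=None):
        self.patterns[field] = (re.compile(pattern), default)

    def extract(self, text: str):
        item = self.item_factory()
        for field, (regex, default) in self.patterns.items():
            match = regex.search(text)
            value = match.group(1) if match else default
            # dict items use key access, other objects use attributes
            if isinstance(item, dict):
                item[field] = value
            else:
                setattr(item, field, value)
        return item


html = "<h1>mkcert</h1><span class='star'>2,311</span>"

for factory in (dict, Repo):
    extractor = RegexExtractor(factory)
    extractor.add_pattern("title", r"<h1>(.*?)</h1>")
    extractor.add_pattern("star", r"class='star'>(.*?)<")
    print(extractor.extract(html))
```

The extractor only needs a zero-argument factory, which is what keeps the item type a free choice.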
