gecco爬虫初见与综述

蔡明贤

2023-12-01

2021SC@SDUSC

Gecco是一款用java语言开发的轻量化的易用的网络爬虫。作为国内大佬研发的java网络爬虫，Gecco整合了jsoup、httpclient、fastjson、spring、htmlunit、redission等优秀框架，我们使用者只需要配置一些jquery风格的选择器就能很快的写出一个爬虫。同时这个Gecco框架有优秀的可扩展性，框架基于开闭原则进行设计，对修改关闭、对扩展开放。

以上是gecco爬虫官网的介绍，作为一个java爬虫，它

简单易用，使用jquery风格的选择器抽取元素
支持爬取规则的动态配置和加载
支持页面中的异步ajax请求
支持页面中的javascript变量抽取
利用Redis实现分布式抓取
支持结合Spring开发业务逻辑
支持htmlunit扩展
支持插件扩展机制
支持下载时UserAgent随机选取
支持下载代理服务器随机选取

其中的quick start提供了一个简易的网络爬虫

@Gecco(matchUrl="https://github.com/{user}/{project}", pipelines="consolePipeline")
public class MyGithub implements HtmlBean {

    private static final long serialVersionUID = -7127412585200687225L;

    @RequestParameter("user")
    private String user;//url中的{user}值

    @RequestParameter("project")
    private String project;//url中的{project}值

    @Text
    @HtmlField(cssPath=".repository-meta-content")
    private String title;//抽取页面中的title

    @Text
    @HtmlField(cssPath=".pagehead-actions li:nth-child(2) .social-count")
    private int star;//抽取页面中的star

    @Text
    @HtmlField(cssPath=".pagehead-actions li:nth-child(3) .social-count")
    private int fork;//抽取页面中的fork

    @Html
    @HtmlField(cssPath=".entry-content")
    private String readme;//抽取页面中的readme

    public String getReadme() {
        return readme;
    }

    public void setReadme(String readme) {
        this.readme = readme;
    }

    public String getUser() {
        return user;
    }

    public void setUser(String user) {
        this.user = user;
    }

    public String getProject() {
        return project;
    }

    public void setProject(String project) {
        this.project = project;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public int getStar() {
        return star;
    }

    public void setStar(int star) {
        this.star = star;
    }

    public int getFork() {
        return fork;
    }

    public void setFork(int fork) {
        this.fork = fork;
    }

    public static void main(String[] args) {
        GeccoEngine.create()
        //工程的包路径
        .classpath("com.geccocrawler.gecco.demo")
        //开始抓取的页面地址
        .start("https://github.com/xtuhcy/gecco")
        //开启几个爬虫线程
        .thread(1)
        //单个爬虫每次抓取完一个请求后的间隔时间
        .interval(2000)
        //循环抓取
        .loop(true)
        //使用pc端userAgent
        .mobile(false)
        //开始运行
        .run();
    }
}

作为软件工程实践的项目，接下来的一段时间中，我会使用gecco，通过爬取京东，淘宝等商品网站的数据，以及csdn的博客数据，与我的小组成员一道，通过该项目，理解网络爬虫的运行原理，内部构造，设计逻辑，举一反三，对这个爬虫框架的内容提出自己的分析和优化。

gecco爬虫初见与综述

相关阅读

相关文章

相关问答

相关文档