
spider-utils-for-php

A web crawler toolkit for PHP

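License: GPL

Language: PHP

Category: Application tools, web crawlers

Software type: Open source

Region: China

Submitted by: 史景铄

Operating system: Cross-platform

Open-source organization: None

Intended audience: Unknown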


Overview

spider-utils-for-php:

Principles:

Simple, easy to use, flexible, and unapologetically willful about it!

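Features:

Billed as the simplest, easiest-to-use HTTP utility in the PHP world; it automatically detects and uses whichever of curl, socket, or file_get_contents is available.
HTTP requests support gzip, which speeds up requests and reduces transfer cost.
Follows 301 and 302 redirects (the maximum number of redirects is configurable).
Converts every response to UTF-8, so you no longer need to care whether a page is encoded as GBK, Big5, UTF-8, and so on.
Strings can be matched in three ways: wildcards, regular expressions, and DOM expressions.
Matched URLs can be converted automatically from relative to absolute paths.
To be continued.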
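
The transport auto-detection in the first feature could work roughly like the sketch below. This is not the library's code: `detect_capabilities` and `pick_transport` are hypothetical names, and the preference order (curl, then socket, then file_get_contents) is only an assumption taken from the order in which the feature list names them.

```php
<?php
// Hypothetical sketch, not spider-utils-for-php's actual implementation.

// Report which HTTP transports this PHP installation can use.
function detect_capabilities(): array
{
    return array(
        'curl'              => function_exists('curl_init'),
        'socket'            => function_exists('fsockopen'),
        // file_get_contents() can only open URLs when allow_url_fopen is on.
        'file_get_contents' => filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN),
    );
}

// Return the first usable transport in the assumed preference order, or null.
function pick_transport(array $caps): ?string
{
    foreach (array('curl', 'socket', 'file_get_contents') as $name) {
        if (!empty($caps[$name])) {
            return $name;
        }
    }
    return null;
}

$transport = pick_transport(detect_capabilities());
```

Keeping detection separate from selection makes the choice easy to check without touching the real environment.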

What? Convert a relative path to an absolute path?

    // $result = http://baidu.com/bac/index.html
    $result = spider::abs_url('http://baidu.com/abc/', '../bac/index.html');
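
For readers curious how such a conversion works, here is a minimal sketch in plain PHP. `resolve_url` is a hypothetical helper, not the library's `abs_url()`, and it deliberately skips some RFC 3986 edge cases (for example, it drops the base URL's query string).

```php
<?php
// Hypothetical helper showing one way to resolve a relative URL; it is not
// the library's abs_url() and it skips some RFC 3986 edge cases.
function resolve_url(string $base, string $rel): string
{
    if ($rel === '') {
        return $base;
    }
    // Already absolute (has a scheme): nothing to do.
    if (preg_match('#^[a-z][a-z0-9+.\-]*://#i', $rel)) {
        return $rel;
    }
    $b = parse_url($base);
    // Protocol-relative URL: borrow only the scheme from the base.
    if (strpos($rel, '//') === 0) {
        return $b['scheme'] . ':' . $rel;
    }
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = isset($b['path']) ? $b['path'] : '/';

    // Split the reference into its path and its "?query#fragment" suffix.
    $n = strcspn($rel, '?#');
    $relPath = substr($rel, 0, $n);
    $suffix = (string) substr($rel, $n);

    if ($relPath === '') {
        $path = $basePath;            // a bare "?x=1" keeps the base path
    } elseif ($relPath[0] === '/') {
        $path = $relPath;             // root-relative reference
    } else {
        // Replace everything after the last "/" of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relPath;
    }

    // Collapse "." and ".." segments without climbing above the root.
    $segments = explode('/', $path);
    $out = array();
    foreach ($segments as $seg) {
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    // A trailing "." or ".." names a directory, so keep the final slash.
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = '';
    }
    return $root . implode('/', $out) . $suffix;
}
```

Called with the same arguments as the example above, `resolve_url('http://baidu.com/abc/', '../bac/index.html')` gives `http://baidu.com/bac/index.html`.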

What? html2txt?

    // $result = 123
    $result = spider::html2txt('<p><a href="">1</a>23<p>');
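
A comparable conversion can be sketched with PHP's built-in string functions. `html_to_text` is a hypothetical name, and the library's `html2txt()` may treat whitespace, scripts, or entities differently.

```php
<?php
// Hypothetical sketch of an HTML-to-text step; the library's html2txt() may differ.
function html_to_text(string $html): string
{
    // Drop script and style blocks entirely, including their contents.
    $html = (string) preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    // Remove the remaining tags, then decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace into single spaces.
    return trim((string) preg_replace('/\s+/', ' ', $text));
}
```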

What? Cut out a substring?

    // $result = 23abcde
    $result = spider::cut_str('123abcdef', '1', 'f');
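
The same cut can be reproduced with `strpos()` and `substr()`. `cut_between` below is a hypothetical equivalent for illustration only; returning an empty string when a marker is missing is an assumption, not documented library behavior.

```php
<?php
// Hypothetical equivalent of the cut_str() call above, built on strpos().
function cut_between(string $str, string $start, string $end): string
{
    $from = strpos($str, $start);
    if ($from === false) {
        return ''; // assumed behavior when the start marker is missing
    }
    // Begin just after the start marker.
    $from += strlen($start);
    // Look for the end marker only after the start marker.
    $to = strpos($str, $end, $from);
    if ($to === false) {
        return ''; // assumed behavior when the end marker is missing
    }
    return substr($str, $from, $to - $from);
}
```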

What? Wildcard matching?

    // $result = abc
    $result = spider::mask_match('123abc123', '123(*)123');
    // $result = abc
    $result = spider::mask_match('abc123', '(*)123');
    // $result = 123
    $result = spider::mask_match('123abcabc', '(*)abc');
    // $result = 123abc
    $result = spider::mask_match('123abcdef', '(*)abc', true);
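
One way to get this behavior is to translate the mask into a regular expression. The sketch below is hypothetical, not the library's `mask_match()`; the meaning of the third argument (return the whole matched span instead of the captured part) is inferred from the last example above.

```php
<?php
// Hypothetical sketch: translate a "(*)" mask into a regex. Not the library's code.
function wildcard_match(string $str, string $mask, bool $whole = false): string
{
    // Escape the mask literally, then turn each escaped "(*)" into a lazy group.
    $regex = str_replace('\(\*\)', '(.*?)', preg_quote($mask, '#'));
    // A lazy group at the very end would match nothing, so anchor it to the end.
    if (substr($mask, -3) === '(*)') {
        $regex .= '\z';
    }
    if (!preg_match('#' . $regex . '#s', $str, $m)) {
        return '';
    }
    // $m[0] is the whole matched span, $m[1] the first wildcard's capture.
    return $whole ? $m[0] : (isset($m[1]) ? $m[1] : '');
}
```

The lazy `(.*?)` is what makes `'(*)abc'` stop at the first `abc` in `'123abcabc'`.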

What? Send an HTTP GET request?

    // The response is converted to UTF-8 automatically
    $result = spider::fetch_url('http://www.baidu.com/');
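
How the automatic UTF-8 conversion is implemented is not shown here. A common approach is to read the charset from the Content-Type header or the page's meta tag and convert with mbstring or iconv, as in the sketch below. `detect_charset` and `to_utf8` are hypothetical names, the UTF-8 default is an assumption, and at least one of the two extensions must be available.

```php
<?php
// Hypothetical sketch of charset detection and conversion; assumes the
// mbstring or iconv extension is available.

// Look for a charset in the Content-Type header first, then in the HTML itself.
function detect_charset(string $contentType, string $html): string
{
    $pattern = '#charset\s*=\s*["\']?([\w-]+)#i';
    if (preg_match($pattern, $contentType, $m) || preg_match($pattern, $html, $m)) {
        return strtoupper($m[1]);
    }
    return 'UTF-8'; // assumed default when nothing is declared
}

// Convert a response body from the detected charset to UTF-8.
function to_utf8(string $body, string $charset): string
{
    if ($charset === 'UTF-8' || $charset === 'UTF8') {
        return $body;
    }
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($body, 'UTF-8', $charset);
    }
    // Fallback: iconv, silently dropping bytes it cannot convert.
    return (string) iconv($charset, 'UTF-8//IGNORE', $body);
}
```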

What? Send an HTTP POST request?

    $post = "wd=".urlencode("your URL");
    // An array works the same way
    // $post = array("wd" => urlencode("your URL"));
    $result = spider::fetch_url('http://www.baidu.com/s?',$post);
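
When building a POST body by hand in plain PHP, note that `http_build_query()` URL-encodes values itself, so an array passed to it should hold raw values. Whether `fetch_url` encodes array values internally is not stated here; the hypothetical `build_post_body` helper below only illustrates the standard PHP behavior.

```php
<?php
// Hypothetical helper: normalize string or array POST data to one body string.
function build_post_body($post): string
{
    if (is_array($post)) {
        // http_build_query() URL-encodes keys and values itself,
        // so pass raw values here rather than urlencode()d ones.
        return http_build_query($post);
    }
    return (string) $post; // a string is assumed to be encoded already
}
```

Both forms then produce the same body: `build_post_body(array('wd' => 'a b'))` and `build_post_body('wd=' . urlencode('a b'))` each yield `wd=a+b`.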

What? POST a file?

    $post = array("wd" => "http://", "file" => "@c:/1.txt");
    $result = spider::fetch_url('http://www.baidu.com/s?',$post);
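
On the wire, a file upload is a multipart/form-data body. The sketch below shows how such a body could be assembled without any extension. `build_multipart` is a hypothetical function, and treating a leading `@` as "upload this file path" is an assumption modeled on the example above, not confirmed library behavior.

```php
<?php
// Hypothetical sketch of a multipart/form-data encoder; not the library's code.
// Values starting with "@" are treated as paths of files to upload (an
// assumption modeled on the example above).
function build_multipart(array $fields, string $boundary): string
{
    $body = '';
    foreach ($fields as $name => $value) {
        $body .= '--' . $boundary . "\r\n";
        $isFile = is_string($value) && strlen($value) > 1
            && $value[0] === '@' && is_file(substr($value, 1));
        if ($isFile) {
            $path = substr($value, 1);
            $body .= 'Content-Disposition: form-data; name="' . $name
                . '"; filename="' . basename($path) . "\"\r\n";
            $body .= "Content-Type: application/octet-stream\r\n\r\n";
            $body .= file_get_contents($path) . "\r\n";
        } else {
            // Ordinary field: just the name and the raw value.
            $body .= 'Content-Disposition: form-data; name="' . $name . "\"\r\n\r\n";
            $body .= $value . "\r\n";
        }
    }
    // The closing delimiter carries two extra dashes.
    return $body . '--' . $boundary . "--\r\n";
}
```

The request must also carry a `Content-Type: multipart/form-data; boundary=...` header naming the same boundary string.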

What? Need to send a User-Agent and cookies?

    // Any headers can be passed in
    $headers = array(
        'Cookie' => 'uid=1; my_name_is=mzphp',
        'UserAgent' => 'userAgentForIphone',
        'Referer' => 'http://baidu.com/',
    );
    $result = spider::fetch_url('http://www.baidu.com/s?', $post, $headers);
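
A name => value array like this has to become raw header lines before it reaches the network. The hypothetical `format_headers` helper below shows that step in the form curl's `CURLOPT_HTTPHEADER` option expects. Note that the standard header name on the wire is `User-Agent`; whether the library rewrites the `UserAgent` key is not stated here.

```php
<?php
// Hypothetical helper: flatten a name => value array into raw header lines,
// the form that curl's CURLOPT_HTTPHEADER option expects.
function format_headers(array $headers): array
{
    $lines = array();
    foreach ($headers as $name => $value) {
        // Strip CR/LF so a value cannot inject an extra header line.
        $lines[] = $name . ': ' . str_replace(array("\r", "\n"), '', $value);
    }
    return $lines;
}
```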

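What? How do these operations come together nicely?

    // First, fetch the search results page for a keyword
    $key = "魔爪小说阅读器";
    $url = 'http://www.sogou.com/web?query='.urlencode($key).'&ie=utf8';
    $html = spider::fetch_url($url, '', array('Referer'=>'http://www.sogou.com/'));
    // Then analyze the page: 'cut' isolates the block after the
    // "相关搜索" (related searches) caption, and 'pattern' captures
    // each suggestion in the named group "key"
    $keywordlist = spider::match($html, array('list'=>array(
        'cut' => '相关搜索</caption>(*)</tr></table>',
        'pattern' => '#id="sogou_\d+_\d+">(?<key>[^>]*?)</a>#is',
    )));
    // Index the results by keyword, which also removes duplicates
    $newarr = array();
    foreach($keywordlist['list'] as $val){
        $newarr[$val['key']] = array('key'=>$val['key']);
    }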
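
The "cut, then pattern" idea can be sketched in plain PHP as follows. `extract_rows` is a hypothetical function, not the library's `match()`. It takes explicit start and end markers instead of a `(*)` mask, and falling back to the rest of the document when the end marker is missing is an assumption.

```php
<?php
// Hypothetical sketch of a "cut, then pattern" extraction in plain PHP.
function extract_rows(string $html, string $start, string $end, string $pattern): array
{
    // Step 1: narrow the document to the region between the two markers.
    $from = strpos($html, $start);
    if ($from === false) {
        return array();
    }
    $from += strlen($start);
    $to = strpos($html, $end, $from);
    $region = $to === false ? substr($html, $from) : substr($html, $from, $to - $from);

    // Step 2: collect every match; PREG_SET_ORDER yields one array per match.
    preg_match_all($pattern, $region, $matches, PREG_SET_ORDER);

    // Step 3: keep only the named groups (string keys), dropping numeric ones.
    $rows = array();
    foreach ($matches as $match) {
        $rows[] = array_filter($match, 'is_string', ARRAY_FILTER_USE_KEY);
    }
    return $rows;
}
```

Cutting first keeps the pattern from matching similar markup elsewhere on the page, which is why a narrow regex like the one above can stay simple.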

More?

All right, you can refer to the on_spider method of index_control in the start_example directory of the mzphp2 project:

http://git.oschina.net/mz/mzphp2/blob/master/start_example/control/index_control.class.php

