A Brief Introduction to Scraping Website Content with phpQuery

常源

2023-12-01

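I often need to scrape content from other people's websites. Fetching the raw page, though, usually means filtering the markup with regular expressions, which is a headache for anyone who isn't comfortable with regex.

With phpQuery, scraping becomes much simpler: if you know jQuery, you can pull content out of a page with jQuery-style selectors.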

Below is a brief introduction to using phpQuery, along with a few problems I ran into while using it.

First, download phpQuery. You can get it directly from the phpQuery website.

I have also uploaded a copy to Baidu Cloud. Link: https://pan.baidu.com/s/1devVhXlL5noNvRlO1EHLbQ Password: irq7

// Include phpQuery

include_once 'xxx/lib/phpQuery/phpQuery.php';

// Set the default charset to match the encoding of the page being scraped

phpQuery::$defaultCharset = 'utf-8';

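$url = "http://xxx.com"; // Address of the page to scrape; include the scheme (http:// or https://), otherwise it is treated as a local file path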

phpQuery::newDocumentFile($url);

// Example: grab the elements with the class .star_m or .star_blue

$star = pq(".star_m,.star_blue")->html();

echo $star; // Outputs the HTML of the matched content
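To make the selector step more concrete, here is a minimal sketch that runs phpQuery against an inline HTML string instead of a live page and collects the link text and `href` of each match. The sample markup and the `$links` array are made up for illustration, and the include path is the same placeholder used above, so adjust it to wherever phpQuery lives on your machine.

```php
<?php
// Assumption: phpQuery sits at this placeholder path (same as in the snippet above).
include_once 'xxx/lib/phpQuery/phpQuery.php';

// Hypothetical sample markup standing in for a fetched page.
$html = '<div class="star_m"><a href="/a">Alice</a></div>'
      . '<div class="star_blue"><a href="/b">Bob</a></div>';

// Parse the string as HTML; the second argument is the charset.
phpQuery::newDocumentHTML($html, 'utf-8');

$links = [];
// Iterating a pq() result yields plain DOM nodes, so wrap each one in pq() again
// to get the jQuery-style methods back.
foreach (pq('.star_m a, .star_blue a') as $node) {
    $a = pq($node);
    $links[] = ['text' => $a->text(), 'href' => $a->attr('href')];
}

print_r($links);
```

Working on a fixed string like this is a handy way to check a selector before pointing the script at a real site.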

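// On some pages the output is still garbled even though the charset was set. Workaround: re-encode the result with mb_convert_encoding (arguments: string, target encoding, source encoding);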

$star = mb_convert_encoding($star,'ISO-8859-1','UTF-8');
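Why would converting *to* ISO-8859-1 fix anything? My understanding is that the garbling in this case is double encoding: the page's UTF-8 bytes were read as ISO-8859-1 and then encoded to UTF-8 a second time. The sketch below reproduces that situation with a made-up sample string and then undoes it with the same call as above. It only needs the mbstring extension, and it is an illustration of the idea rather than a diagnosis of every garbled page.

```php
<?php
// Requires the mbstring extension.
$original = '你好'; // sample UTF-8 text (hypothetical page content)

// Simulate the failure: the UTF-8 bytes are misread as ISO-8859-1
// and re-encoded as UTF-8, which yields mojibake.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo it. Signature: mb_convert_encoding($string, $to_encoding, $from_encoding).
// Mapping each character back to a single ISO-8859-1 byte restores the original bytes.
$repaired = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

var_dump($garbled === $original);
var_dump($repaired === $original);
```

If the text was not double-encoded in the first place, this conversion will damage it instead of repairing it, so only apply it when the output really is garbled.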

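// On some pages phpQuery finds nothing even though the charset was set, probably because it failed to detect the page encoding correctly. Workaround:

// fetch the page with file_get_contents first, convert it to UTF-8 and rewrite the charset declaration, then pass it to phpQuery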

$content = file_get_contents($url);

$content = mb_convert_encoding($content,"utf-8","gb2312");

$content = str_ireplace('charset=gb2312', 'charset=utf-8', $content); // Rewrite the declared charset; case-insensitive, since pages often write GB2312 in upper case

phpQuery::newDocument($content);
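The workaround above hard-codes gb2312. A slightly more general sketch is to read whatever charset the page declares, convert from that to UTF-8, and rewrite the declaration to match. `toUtf8Html` is a hypothetical helper name of my own, the GB2312 fallback is an assumption, and the regex only looks at the first `charset=` it finds, so treat this as a starting point rather than a robust detector.

```php
<?php
// Requires the mbstring extension.

/**
 * Hypothetical helper: convert an HTML string to UTF-8 based on its declared charset.
 * $fallback is used when the page declares no charset (assumed GB2312 here).
 */
function toUtf8Html(string $html, string $fallback = 'GB2312'): string
{
    // Find the first charset declaration, e.g. charset=gb2312 or charset="GBK".
    $from = preg_match('/charset\s*=\s*["\']?([\w-]+)/i', $html, $m) ? $m[1] : $fallback;

    if (strcasecmp($from, 'utf-8') === 0) {
        return $html; // already UTF-8, nothing to do
    }

    // Signature: mb_convert_encoding($string, $to_encoding, $from_encoding).
    $converted = mb_convert_encoding($html, 'UTF-8', $from);

    // Rewrite only the first declaration so it matches the new encoding.
    return preg_replace('/(charset\s*=\s*["\']?)[\w-]+/i', '${1}utf-8', $converted, 1);
}

// Build a GB2312 sample page from a UTF-8 literal (made-up markup).
$utf8Page = '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
          . '<p class="star_m">你好</p>';
$gbPage = mb_convert_encoding($utf8Page, 'GB2312', 'UTF-8');

$content = toUtf8Html($gbPage);
echo $content, PHP_EOL;
```

The returned string can then be passed to `phpQuery::newDocument($content)` as in the snippet above. If the page declares an encoding that mbstring does not know, `mb_convert_encoding` will fail, so a real script should handle that case.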