python goose: Extracting Data with goose

方承弼

2023-12-01

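1. Introduction

The Python-goose project is a Python rewrite of Goose, an article extraction tool originally written in Java. Given any news article or article-style web page, Python-goose aims to extract not only the main body of the article but also all of its metadata, images, and related information. Chinese web pages are supported.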

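Python-goose can extract the following information:

The main body text of the article

The main image of the article

Any YouTube/Vimeo videos embedded in the article

The meta description

The meta tags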
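As a rough sketch of how the items above map onto the article object used in section 3, the helper below gathers them into one dictionary. The attribute names `cleaned_text`, `top_image.src`, and `meta_description` appear in the usage session; `movies` and `meta_keywords` are assumptions about python-goose's article class, and `summarize_article` is a hypothetical helper name, so every lookup falls back to a default when an attribute is missing.

```python
def summarize_article(article):
    """Collect the listed fields from an article-like object.

    `movies` and `meta_keywords` are assumed attribute names; a missing
    attribute yields an empty default instead of raising an error.
    """
    top_image = getattr(article, 'top_image', None)
    movies = getattr(article, 'movies', None) or []
    return {
        'body': getattr(article, 'cleaned_text', u''),
        # top_image may be None when no main image was found
        'top_image_src': getattr(top_image, 'src', None),
        'video_srcs': [getattr(m, 'src', None) for m in movies],
        'meta_description': getattr(article, 'meta_description', u''),
        'meta_keywords': getattr(article, 'meta_keywords', u''),
    }
```

It would be called as `summarize_article(g.extract(url=url))` once a `Goose` instance `g` exists.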

2. Installation

virtualenv --no-site-packages goose

cd goose

# On Windows:

Scripts\activate

# On Linux, use this instead: source bin/activate

git clone https://github.com/grangier/python-goose.git

cd python-goose

pip install -r requirements.txt

python setup.py install
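One way to confirm that the installation succeeded is to try the same import that section 3 uses, from inside the activated virtualenv. This is only a minimal sketch; `goose_available` is a hypothetical helper name, not part of python-goose.

```python
from __future__ import print_function


def goose_available():
    """Return True if `from goose import Goose` works in this environment."""
    try:
        from goose import Goose  # noqa: F401
    except ImportError:
        return False
    return True


if __name__ == '__main__':
    print('goose importable:', goose_available())
```

If this reports False, the virtualenv is probably not activated or `python setup.py install` did not complete.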

3. Usage

>>> from goose import Goose

>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'

>>> g = Goose()

>>> article = g.extract(url=url)

>>> article.title

u'Occupy London loses eviction fight'

>>> article.meta_description

"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."

>>> article.cleaned_text[:150]

(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi

>>> article.top_image.src

http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg
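The session above handles one URL. When several pages are processed, it may be convenient to reuse one extractor and record failures per URL instead of stopping at the first error. The sketch below assumes only what the session shows: an object with an `extract(url=...)` method that returns an article with a `title`. The name `extract_titles` is hypothetical.

```python
def extract_titles(urls, extractor):
    """Return (titles, failures) for a list of URLs.

    `extractor` is any object with an extract(url=...) method, for
    example a Goose() instance. Errors are recorded per URL so one bad
    page does not abort the whole batch.
    """
    titles = {}
    failures = {}
    for url in urls:
        try:
            article = extractor.extract(url=url)
        except Exception as exc:  # network or parsing errors, kept broad on purpose
            failures[url] = str(exc)
            continue
        titles[url] = article.title
    return titles, failures
```

With python-goose installed, this would be called as `extract_titles([url], Goose())`.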

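For Chinese articles, you need to configure Goose with a browser user agent and the Chinese stopwords class:

>>> from goose.text import StopWordsChinese

>>> g = Goose({'browser_user_agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36', 'stopwords_class': StopWordsChinese})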
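The Chinese configuration can also be wrapped in a small helper so the long user-agent string lives in one place. This is a sketch rather than a recommended layout: `chinese_config` and `extract_chinese` are hypothetical names, and the goose imports are done lazily inside the function so that building the configuration does not require goose to be installed.

```python
DEFAULT_USER_AGENT = (
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36'
)


def chinese_config(stopwords_class, user_agent=DEFAULT_USER_AGENT):
    """Build the configuration dict passed to Goose for Chinese pages."""
    return {
        'browser_user_agent': user_agent,
        'stopwords_class': stopwords_class,
    }


def extract_chinese(url):
    """Extract a Chinese article and return (title, cleaned_text)."""
    # Imported here so chinese_config() stays usable without goose.
    from goose import Goose
    from goose.text import StopWordsChinese

    g = Goose(chinese_config(StopWordsChinese))
    article = g.extract(url=url)
    return article.title, article.cleaned_text
```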
