The goose3 Library for Python

赵锐

2023-12-01

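Goose was originally an article extractor written in Java and was later (August 2011) converted to a Scala project; goose3 is a complete rewrite in Python. The goal of the software is to take any news article or article-type web page and extract not only the main body of the article but also all of its metadata and images.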

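Specifically, goose3 extracts:

The main text of the article
Images in the article
Any videos embedded in the article
The meta description
The meta tags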

Installation:

pip install goose3

Or install from source inside a virtual environment:

mkvirtualenv --no-site-packages goose3
git clone https://github.com/goose3/goose3.git
cd goose3
pip install -r ./requirements/python
python setup.py install

Usage:

from goose3 import Goose

url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
g = Goose()
article = g.extract(url=url)

print(article.title)
# Occupy London loses eviction fight

print(article.meta_description)
# Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal.

print(article.cleaned_text[:150])
# (CNN) - Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi

print(article.top_image.src)
# http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg
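The attribute lookups above can be bundled into a small helper. This is only a sketch: summarize_article and extract_summary are hypothetical names, not part of goose3, and the sample object uses made-up values so the snippet runs without network access. It assumes top_image is None when no image is available.

```python
from types import SimpleNamespace


def summarize_article(article):
    """Collect the fields used above from an article-like object into a dict."""
    # Assumption: top_image is None when goose3 found no usable image.
    top_image = getattr(article, "top_image", None)
    return {
        "title": article.title,
        "description": article.meta_description,
        "preview": article.cleaned_text[:150],
        "top_image": top_image.src if top_image is not None else None,
    }


def extract_summary(url):
    """Fetch `url` with goose3 and summarize it (needs goose3 and network access)."""
    from goose3 import Goose  # imported lazily so summarize_article works without it

    g = Goose()
    return summarize_article(g.extract(url=url))


# Stand-in object with made-up values, used only to show the shape of the result.
sample = SimpleNamespace(
    title="Example title",
    meta_description="Example description",
    cleaned_text="Example body text. " * 20,
    top_image=None,
)
summary = summarize_article(sample)
print(summary["title"])
```

With goose3 installed, extract_summary(url) would return the same dict shape for a live page.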

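Configuration:

Options can be passed to Goose as a dict. To change the user agent used for requests:

g = Goose({'browser_user_agent': 'Mozilla'})

To switch from the default lxml HTML parser to the soup parser:

g = Goose({'browser_user_agent': 'Mozilla', 'parser_class':'soup'})

To be more lenient with network errors instead of raising them:

g = Goose({'strict': False})

To enable image fetching:

g = Goose({'enable_image_fetching': True})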
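The options above can also be combined into one dict before it is handed to Goose. The helper below is a sketch under stated assumptions: build_config is a hypothetical name, 'lxml' is assumed to be the value that selects the default parser, and the defaults chosen here (strict on, image fetching off) are assumptions rather than something this article states.

```python
def build_config(user_agent=None, parser="lxml", strict=True, fetch_images=False):
    """Assemble a goose3 configuration dict from the options shown above."""
    # Assumption: 'lxml' and 'soup' are the two accepted parser_class values.
    if parser not in ("lxml", "soup"):
        raise ValueError(f"unsupported parser: {parser!r}")
    config = {
        "parser_class": parser,
        "strict": strict,
        "enable_image_fetching": fetch_images,
    }
    # Only set the user agent when the caller asked for a specific one.
    if user_agent is not None:
        config["browser_user_agent"] = user_agent
    return config


config = build_config(user_agent="Mozilla", parser="soup", strict=False)
print(config)
```

The resulting dict would then be passed on as Goose(config).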

Language support:

goose3 is language aware. For example, extracting a Spanish-language page:

from goose3 import Goose

url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
g = Goose()
article = g.extract(url=url)

print(article.title)
print(article.cleaned_text[:150])
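When a page carries no usable language meta tag, the language can be forced through the configuration. The sketch below assumes the use_meta_language and target_language options described in the goose3 README; language_config is a hypothetical helper name.

```python
def language_config(lang=None):
    """Return config entries that force a target language, or {} to auto-detect."""
    if lang is None:
        # Leave goose3 to rely on the page's own meta language tags.
        return {}
    # Assumption: these two option names match the goose3 README.
    return {"use_meta_language": False, "target_language": lang}


spanish = language_config("es")
print(spanish)
```

With goose3 installed, this would be used as Goose(language_config('es')).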

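Chinese language support:

Chinese text needs a dedicated stop-word analyser, which is passed in through the configuration:

from goose3 import Goose
from goose3.text import StopWordsChinese

url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
g = Goose({'stopwords_class': StopWordsChinese})
article = g.extract(url=url)
print(article.cleaned_text[:150])
# Output: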
'''
香港行政长官梁振英在各方压力下就其大宅的违章建筑（僭建）问题到立法会接受质询，并向香港民众道歉。

梁振英在星期二（12月10日）的答问大会开始之际在其演说中道歉，但强调他在违章建筑问题上没有隐瞒的意图和动机。

一些亲北京阵营议员欢迎梁振英道歉，且认为应能获得香港民众接受，但这些议员也质问梁振英有
'''
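The same pattern should carry over to other languages that need their own stop-word class. The sketch below assumes that goose3.text also provides StopWordsArabic and StopWordsKorean alongside StopWordsChinese; stopwords_config and the mapping are hypothetical names introduced for illustration.

```python
# Assumption: these class names exist in goose3.text.
STOPWORDS_CLASS_NAMES = {
    "zh": "StopWordsChinese",
    "ar": "StopWordsArabic",
    "ko": "StopWordsKorean",
}


def stopwords_config(lang):
    """Return the config entry for languages that need a dedicated stop-word class."""
    name = STOPWORDS_CLASS_NAMES.get(lang)
    if name is None:
        # Other languages use the default stop-word handling.
        return {}
    import goose3.text  # imported lazily; only needed for the languages above

    return {"stopwords_class": getattr(goose3.text, name)}


print(stopwords_config("es"))
```

With goose3 installed, Goose(stopwords_config('zh')) would be equivalent to the explicit StopWordsChinese configuration shown above.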
