
The goose3 Library for Python

赵锐
2023-12-01

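Goose was originally an article extractor written in Java, later (Aug 2011) converted to a Scala project; goose3 is a complete rewrite in Python. Its goal is to take any news article or article-type web page and extract not only the main body of the article but also all of its metadata and images.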

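goose3 extracts the following:

  • The main text of the article
  • Images in the article
  • Any videos embedded in the article
  • The article's meta description
  • Meta tags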

Installation:

  • From PyPI:
    pip install goose3
  • From source, inside a virtual environment:
    mkvirtualenv --no-site-packages goose3
    git clone https://github.com/goose3/goose3.git
    cd goose3
    pip install -r ./requirements/python
    python setup.py install
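
After either install route, it can help to confirm that the package is importable from the interpreter you intend to use. A minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of goose3:

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if `package` can be found by the current interpreter."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # Reports whether goose3 is visible in this environment.
    print("goose3 available:", is_installed("goose3"))
```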

Usage:

>>> from goose3 import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
"(CNN) - Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi"
>>> article.top_image.src
'http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg'
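
When processing many pages, it is often convenient to gather these fields into a plain dict. The following is a rough sketch, not part of goose3: `summarize_article` is a hypothetical helper that relies only on the four attributes used in the example above, and it assumes `top_image` may be `None` when no image was found. A stand-in object is used so that the sketch runs without network access.

```python
from types import SimpleNamespace


def summarize_article(article, preview_chars=150):
    """Collect title, description, text preview and image URL into a dict.

    `article` is any object exposing title, meta_description,
    cleaned_text and top_image.
    """
    top_image = getattr(article, "top_image", None)
    return {
        "title": article.title,
        "description": article.meta_description,
        "preview": article.cleaned_text[:preview_chars],
        # Assumption: top_image can be None if no image was extracted.
        "image": top_image.src if top_image is not None else None,
    }


if __name__ == "__main__":
    # Stand-in for a real extraction result (no network needed).
    fake = SimpleNamespace(
        title="Occupy London loses eviction fight",
        meta_description="Occupy London protesters ...",
        cleaned_text="(CNN) - Occupy London protesters ...",
        top_image=None,
    )
    print(summarize_article(fake))
```

With a real extraction, the call would look like `summarize_article(Goose().extract(url=url))`.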

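Configuration:

# Options are passed to Goose as a dict. Set the User-Agent sent with requests:
g = Goose({'browser_user_agent': 'Mozilla'})
# Additionally select the 'soup' parser:
g = Goose({'browser_user_agent': 'Mozilla', 'parser_class':'soup'})
# Turn strict mode off, so errors are swallowed rather than raised:
g = Goose({'strict': False})
# Enable image fetching (needed for article.top_image):
g = Goose({'enable_image_fetching': True})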
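
Because the configuration is an ordinary dict, options can be combined in one place and checked before they reach `Goose`. A minimal sketch under stated assumptions: `build_config`, `BASE_CONFIG` and `ALLOWED_KEYS` are hypothetical names, the allowed keys are only those shown above, and the base values are this sketch's own choices rather than goose3's documented defaults.

```python
# Keys taken from the examples above; the values are illustrative choices,
# not goose3's own defaults.
BASE_CONFIG = {
    "browser_user_agent": "Mozilla",
    "strict": False,
}
ALLOWED_KEYS = {
    "browser_user_agent",
    "parser_class",
    "strict",
    "enable_image_fetching",
}


def build_config(**overrides):
    """Merge overrides into BASE_CONFIG, rejecting keys not in ALLOWED_KEYS."""
    unknown = sorted(set(overrides) - ALLOWED_KEYS)
    if unknown:
        raise ValueError(f"unknown config keys: {unknown}")
    return {**BASE_CONFIG, **overrides}


if __name__ == "__main__":
    print(build_config(enable_image_fetching=True))
```

The result would then be passed along as `Goose(build_config(enable_image_fetching=True))`. Rejecting unknown keys is a design choice of this sketch, meant to surface typos in option names early.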

Language support:

from goose3 import Goose

# A Spanish-language article from El País
url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
g = Goose()
article = g.extract(url=url)
print(article.title)
print(article.cleaned_text[:150])
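
If the language should be forced rather than inferred from the page, goose3's configuration exposes `use_meta_language` and `target_language` options. The sketch below only builds the dict; `language_config` is a hypothetical helper, and it assumes the two options behave as described in the goose3 README (meta-language detection off, explicit target language set).

```python
def language_config(lang_code, trust_page_meta=False):
    """Return a config dict that pins the extraction language.

    Assumption: with use_meta_language set to False, goose3 uses
    target_language instead of the page's declared language.
    """
    return {
        "use_meta_language": trust_page_meta,
        "target_language": lang_code,
    }


if __name__ == "__main__":
    print(language_config("es"))
```

Usage would then be `Goose(language_config('es'))`.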

Chinese language support:

Chinese needs a dedicated stop-word analyser, which is passed in through the config:

from goose3 import Goose
from goose3.text import StopWordsChinese

url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
g = Goose({'stopwords_class': StopWordsChinese})
article = g.extract(url=url)
print(article.cleaned_text[:150])
## output>>>:
'''
香港行政长官梁振英在各方压力下就其大宅的违章建筑(僭建)问题到立法会接受质询,并向香港民众道歉。

梁振英在星期二(12月10日)的答问大会开始之际在其演说中道歉,但强调他在违章建筑问题上没有隐瞒的意图和动机。

一些亲北京阵营议员欢迎梁振英道歉,且认为应能获得香港民众接受,但这些议员也质问梁振英有
'''
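
The same pattern can be generalised when the language is only known at run time. A hedged sketch: it assumes `goose3.text` also provides `StopWordsArabic` and `StopWordsKorean` alongside `StopWordsChinese`; the helper names and the language-code mapping are hypothetical. The goose3 import is deferred so that the lookup itself has no dependencies.

```python
# Assumed mapping from language code to class name in goose3.text.
STOPWORDS_CLASS_NAMES = {
    "zh": "StopWordsChinese",
    "ar": "StopWordsArabic",
    "ko": "StopWordsKorean",
}


def stopwords_class_name(lang_code):
    """Return the dedicated stop-word class name for a language, or None."""
    primary = lang_code.lower().split("-")[0]  # 'zh-CN' -> 'zh'
    return STOPWORDS_CLASS_NAMES.get(primary)


def stopwords_config(lang_code):
    """Build a Goose config dict; empty if no dedicated class is needed."""
    name = stopwords_class_name(lang_code)
    if name is None:
        return {}
    import goose3.text as goose_text  # imported lazily, only when required
    return {"stopwords_class": getattr(goose_text, name)}


if __name__ == "__main__":
    print(stopwords_class_name("zh-CN"))
```

With goose3 installed, `Goose(stopwords_config('zh'))` would be equivalent to the Chinese example above.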

 
