GOOSE3 最初是用Java编写的一篇文章提取器,最近将它(Auff2011)转换成Scala项目,这是python中的完全重写。该软件的目标是获取任何新闻文章或文章类型的网页,不仅提取文章的主体,而且还提取所有元数据和图片。
GOOSE3具体实现功能:
安装:
mkvirtualenv --no-site-packages goose3 git clone https://github.com/goose3/goose3.git cd goose3 pip install -r ./requirements/python python setup.py install
使用:
from goose3 import Goose
url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
g = Goose()
article = g.extract(url=url)
article.title
u'Occupy London loses eviction fight'
article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
article.cleaned_text[:150]
(CNN) - Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
article.top_image.src
http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg
配置:
g = Goose({'browser_user_agent': 'Mozilla'})
g = Goose({'browser_user_agent': 'Mozilla', 'parser_class':'soup'})
g = Goose({'strict': False})
g = Goose({'enable_image_fetching': True})
GOOSE支持语言:
from goose3 import Goose
url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
g = Goose()
article = g.extract(url=url)
article.title
article.cleaned_text[:150]
中文语言支持:
from goose3 import Goose
url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
g = Goose({'stopwords_class': StopWordsChinese})
article = g.extract(url=url)
print article.cleaned_text[:150]
## output>>>:
'''
香港行政长官梁振英在各方压力下就其大宅的违章建筑(僭建)问题到立法会接受质询,并向香港民众道歉。
梁振英在星期二(12月10日)的答问大会开始之际在其演说中道歉,但强调他在违章建筑问题上没有隐瞒的意图和动机。
一些亲北京阵营议员欢迎梁振英道歉,且认为应能获得香港民众接受,但这些议员也质问梁振英有
'''