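goose3 is mainly used to extract the key information from news stories and articles.
Goose will try to extract the following information:
Main text of the article
Article image
Any YouTube/Vimeo videos embedded in the article
Meta description
Meta tags
Install with pip: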
pip install goose3
Usage:
>>> from goose3 import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
"(CNN) - Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi"
>>> article.top_image.src
'http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg'
Of course, goose3 also supports Chinese:
>>> from goose3 import Goose
>>> from goose3.text import StopWordsChinese
>>> url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
>>> g = Goose({'stopwords_class': StopWordsChinese})
>>> article = g.extract(url=url)
>>> print(article.cleaned_text[:150])
香港行政长官梁振英在各方压力下就其大宅的违章建筑(僭建)问题到立法会接受质询,并向香港民众道歉。
梁振英在星期二(12月10日)的答问大会开始之际在其演说中道歉,但强调他在违章建筑问题上没有隐瞒的意图和动机。
一些亲北京阵营议员欢迎梁振英道歉,且认为应能获得香港民众接受,但这些议员也质问梁振英有
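You can see there is still a bit of a web-scraper flavor here. Below are the install dependencies, which include plenty of familiar names: requests as the downloader, lxml as the parser, Pillow for image processing, and jieba and nltk for NLP.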
requests
Pillow
lxml
cssselect
jieba
beautifulsoup4
nltk
python-dateutil
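To see which of these dependencies are already present in the current environment, a small sketch like the following may help. It relies only on the standard library's importlib.metadata; the `DEPENDENCIES` list and the `installed_versions` helper are names made up for this illustration and are not part of goose3.

```python
from importlib import metadata

# Distribution names copied from the dependency list above.
DEPENDENCIES = [
    "requests",
    "Pillow",
    "lxml",
    "cssselect",
    "jieba",
    "beautifulsoup4",
    "nltk",
    "python-dateutil",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(DEPENDENCIES).items():
        print(f"{name}: {version or 'not installed'}")
```

Running it before and after `pip install goose3` should show which packages pip pulled in.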
So how well does it work in practice?
I tried it on a Tencent News article, and the results seemed decent, as shown here.
from goose3 import Goose
from goose3.text import StopWordsChinese
g = Goose({'stopwords_class': StopWordsChinese})
article = g.extract(url='https://new.qq.com/omn/20181129/20181129A1EPXY.html')
Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/x5/kc5k_f1n6sn7wv8jv2jkv6rh0000gn/T/jieba.cache
Loading model cost 1.480 seconds.
Prefix dict has been built succesfully.
article.cleaned_text[:150]
Out[10]: '蒋劲夫家暴日本女友事件在一周内持续发酵,11月28日,蒋劲夫在日本自首,使此案有了新的进展。28日晚间,蒋劲夫律师团接受访问,表示能否和解取决于赔偿金能否使受害者一方感到满意。\n\n且在昨日的新闻中,有细心网友发现,为蒋劲夫打官司的是一个律师团队,而非一位律师。\n\n要知道日本的律师是十分高昂的,单单是'
article.title
Out[11]: '疑似蒋劲夫家境曝光:老爸名下四家公司,聘律师团打官司'
article.meta_keywords
Out[12]: '蒋劲夫,蒋春来,腾讯网,腾讯新闻'
After digging around for a while, I found the code that extracts the title:
title_element = self.parser.getElementsByTag(self.article.doc, tag='title')
if title_element is not None and len(title_element) > 0:
    title = self.parser.getText(title_element[0])
    return self.clean_title(title)
###
@classmethod
def getText(cls, node):
    txts = [i for i in node.itertext()]
    return innerTrim(' '.join(txts).strip())
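The getText helper above can be approximated with the standard library to see what it does. This is only a sketch: it uses xml.etree.ElementTree instead of lxml, and it assumes that innerTrim simply collapses runs of whitespace into single spaces, which the snippet above does not show. The `get_text` name is made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET


def get_text(node):
    """Join every text fragment under the node, then normalize whitespace."""
    joined = " ".join(node.itertext()).strip()
    # Assumption: innerTrim collapses whitespace runs; approximated with a regex.
    return re.sub(r"\s+", " ", joined)


node = ET.fromstring("<title>Occupy <b>London</b>\n loses</title>")
print(get_text(node))
```

The point is that itertext() walks nested tags too, so text inside child elements ends up in the result.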
It looks like it simply runs through a lot of checks, searching one layer after another.
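That kind of layered lookup can be sketched with nothing but the standard library. This is not goose3's actual code: the order of checks used here (an og:title meta tag, then a meta tag named headline, then the title tag) is an assumption for illustration, and `TitleCandidates`, `inner_trim` and `guess_title` are hypothetical names.

```python
import re
from html.parser import HTMLParser


class TitleCandidates(HTMLParser):
    """Collect possible title sources while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self.meta_headline = None
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            content = attrs.get("content")
            if attrs.get("property") == "og:title":
                self.og_title = content
            elif attrs.get("name") == "headline":
                self.meta_headline = content
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def inner_trim(text):
    """Collapse whitespace runs into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def guess_title(html):
    """Return the first non-empty candidate, checked in a fixed order."""
    parser = TitleCandidates()
    parser.feed(html)
    parser.close()
    candidates = [
        parser.og_title,                # layer 1: Open Graph title
        parser.meta_headline,           # layer 2: <meta name="headline">
        "".join(parser.title_parts),    # layer 3: plain <title> tag
    ]
    for candidate in candidates:
        if candidate and candidate.strip():
            return inner_trim(candidate)
    return ""


page = '<html><head><meta property="og:title" content="OG  Title"><title>Page</title></head></html>'
print(guess_title(page))
```

Each layer is only consulted when the one before it came up empty, which matches the "check, then fall through" shape of the snippet above.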
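There is actually another third-party library called python-goose whose usage is very similar, but it only supports Python 2, which makes it a pain to use. These are the Python versions listed in python-goose's classifiers: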
'Programming Language :: Python',
'Programming Language :: Python :: 2',
'Programming Language :: Python :: 2.6',
'Programming Language :: Python :: 2.7',
After all, very few people should still be using Python 2 these days.