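Motivation
News web pages mostly share a similar structure.
So is there a general-purpose scraping method that can extract the data they contain?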
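Introduction
Goose was originally a Java project and was converted to a Scala project in 2011 [1].
Py-goose [2] is a rewrite in Python. Its main goal is not only to extract the main text of a news or article page, but also to try to extract all of its metadata and image data.
Notably, compared with newspaper [3], py-goose additionally supports web pages in a number of other languages: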
- Spanish
- Chinese
- Arabic
- Korean
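Selecting one of the languages listed above is done through the configuration passed to `Goose`. The following is a minimal sketch, assuming the `goose` package (python-goose) is installed and that its `goose.text` module provides the `StopWordsChinese`, `StopWordsArabic` and `StopWordsKorean` classes described in the project README. The helper names `build_goose_config` and `extract_article`, and the two-letter keys `zh`, `ar` and `ko`, are hypothetical and exist only for this illustration.

```python
# Languages that, per the python-goose README, need a dedicated stop-words class.
# The dictionary keys are an assumption made for this sketch.
STOPWORDS_CLASS_NAMES = {
    "zh": "StopWordsChinese",
    "ar": "StopWordsArabic",
    "ko": "StopWordsKorean",
}


def build_goose_config(lang):
    """Return a config dict for Goose for the given language code (hypothetical helper)."""
    if lang in STOPWORDS_CLASS_NAMES:
        # Imported lazily so that the plain target-language path
        # works even when goose is not importable.
        import goose.text
        stopwords_class = getattr(goose.text, STOPWORDS_CLASS_NAMES[lang])
        return {"stopwords_class": stopwords_class}
    # Other languages (for example Spanish, "es"): set the target language
    # explicitly instead of relying on the page's meta language tag.
    return {"use_meta_language": False, "target_language": lang}


def extract_article(url, lang):
    """Extract an article with a language-specific configuration (hypothetical helper)."""
    from goose import Goose
    return Goose(build_goose_config(lang)).extract(url=url)
```

For a Spanish page, `extract_article(url, "es")` would build the extractor with `{'use_meta_language': False, 'target_language': 'es'}`, while `"zh"` switches to the Chinese stop-words class instead.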
Usage
>>> from goose import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
>>> article.top_image.src
http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg
Personal rating
Category | Rating |
---|---|
Practicality | ⭐️⭐️ |
Ease of use | ⭐️⭐️⭐️ |
Fun | ⭐️⭐️⭐️⭐️ |