news-please is an open source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both the most recent and older, archived articles. You only need to provide the root URL of the news website to crawl it completely. news-please combines the power of multiple state-of-the-art libraries and tools, such as scrapy, Newspaper, and readability. news-please also features a library mode, which allows Python developers to use the crawling and extraction functionality within their own programs. Moreover, news-please allows you to conveniently crawl and extract articles from the (very) large news archive at commoncrawl.org.
If you want to contribute to news-please, please have a look at our list of issues that need help or look here.
03/23/2021: If you're interested in sentiment classification in news articles, check out our large-scale dataset for target-dependent sentiment classification. We also publish an easy-to-use neural model that achieves state-of-the-art performance. Visit the project here.
06/01/2018: If you're interested in news analysis, you might also want to check out our new project, Giveme5W1H - a tool that extracts phrases answering the journalistic five W and one H questions to describe an article's main event, i.e., who did what, when, where, why, and how.
news-please extracts the following attributes from news articles. An example JSON file as extracted by news-please can be found here.
news-please supports three main use cases, which are explained in more detail in the following.
python3 -m newsplease.examples.commoncrawl
It's super easy, we promise!
news-please runs on Python 3.5+.
$ pip3 install news-please
You can access the core functionality of news-please, i.e. extraction of semi-structured information from one or more news articles, in your own code by using news-please in library mode. If you want to use news-please's full website extraction (given only the root URL) or continuous crawling mode (using RSS), you'll need to use the CLI mode.
from newsplease import NewsPlease
article = NewsPlease.from_url('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html?hp')
print(article.title)
A sample of an extracted article can be found here (as a JSON file).
If you want to crawl multiple articles at a time, optionally with a timeout in seconds
NewsPlease.from_urls([url1, url2, ...], timeout=6)
or if you have a file containing all URLs (each line containing a single URL)
NewsPlease.from_file(path)
or if you have raw HTML data (you can also provide the original URL to increase the accuracy of extracting the publishing date)
NewsPlease.from_html(html, url=None)
or if you have a WARC file (also check out our commoncrawl workflow, which provides convenient methods to filter commoncrawl's archive for specific news outlets and dates)
NewsPlease.from_warc(warc_record)
In library mode, news-please will attempt to download and extract information from each URL. The previously described functions are blocking, i.e., will return once news-please has attempted all URLs. The resulting list contains all successfully extracted articles.
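As a rough sketch of post-processing such results in library mode, the snippet below filters out failed extractions and serializes the rest to JSON. Plain dicts stand in for article objects here so that it runs offline; the exact return shape is an assumption (for `from_urls`, a URL-to-article mapping, with failed URLs yielding no usable article), and the field names are illustrative.

```python
import json

# Stand-in results: in real use this would be something like
#   results = NewsPlease.from_urls([url1, url2], timeout=6)
results = {
    "https://example.com/a": {"title": "First article", "maintext": "..."},
    "https://example.com/b": None,  # download or extraction failed
}

# Keep only successfully extracted articles.
articles = {url: a for url, a in results.items() if a is not None}

# Serialize each article to JSON, one record per URL (printed here).
for url, article in sorted(articles.items()):
    print(url, "->", json.dumps(article))
```

With real article objects, you would typically convert each article to a plain dict first before dumping it to JSON.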
$ news-please
news-please will then start crawling a few example pages. To terminate the process, press CTRL+C. news-please will then shut down within 5-60 seconds. You can also press CTRL+C twice, which will immediately kill the process (not recommended, though).
The results are stored by default in JSON files in the data folder. In the default configuration, news-please also stores the original HTML files.
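For illustration, one of these stored article files might look roughly like this. This is a trimmed, hand-written sketch: the field names follow the attributes listed above, but the values are invented and this is not the authoritative schema.

```json
{
  "authors": ["Jane Doe"],
  "date_publish": "2017-02-23 00:00:00",
  "description": "A short teaser for the article.",
  "language": "en",
  "maintext": "The full extracted article text ...",
  "source_domain": "www.example.com",
  "title": "Example headline",
  "url": "https://www.example.com/example-article"
}
```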
Most likely, you will not want to crawl the websites provided in our example configuration. Simply head over to the sitelist.hjson file and add the root URLs of the news outlets of your choice. news-please can also extract the most recent events from the GDELT project, see here.
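A minimal sitelist.hjson entry might look like the sketch below. The URLs are placeholders, and the optional per-site crawler setting shown is an assumption about the file's schema; check the example sitelist shipped with news-please for the authoritative format.

```hjson
{
  base_urls : [
    {
      # Hypothetical outlet -- replace with a real root URL.
      url : "https://www.example.com/"
    },
    {
      # Optionally pick a crawler strategy per site, e.g. an RSS-based one.
      url : "https://www.example.org/",
      crawler : "RssCrawler"
    }
  ]
}
```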
news-please also supports export to Elasticsearch. Using Elasticsearch will also enable the versioning feature. First, enable it in the config.cfg in the config directory, which is by default ~/news-please/config but can be changed to a custom location with the -c parameter. If the directory does not exist, a default directory will be created at the specified location.
[Scrapy]
ITEM_PIPELINES = {
'newsplease.pipeline.pipelines.ArticleMasterExtractor':100,
'newsplease.pipeline.pipelines.ElasticsearchStorage':350
}
That's it! However, if your Elasticsearch database is not located at http://localhost:9200, uses a different username/password, or requires CA-certificate authentication, you will also need to change the following.
[Elasticsearch]
host = localhost
port = 9200
...
# Credentials used for authentication (supports CA-certificates):
use_ca_certificates = False # True if authentication needs to be performed
ca_cert_path = '/path/to/cacert.pem'
client_cert_path = '/path/to/client_cert.pem'
client_key_path = '/path/to/client_key.pem'
username = 'root'
secret = 'password'
news-please can also store articles in a PostgreSQL database, including the versioning feature. To export to PostgreSQL, open the corresponding config file (config_lib.cfg for library mode and config.cfg for CLI mode), add the PostgresqlStorage module to the pipeline, and adjust the database credentials:
[Scrapy]
ITEM_PIPELINES = {
'newsplease.pipeline.pipelines.ArticleMasterExtractor':100,
'newsplease.pipeline.pipelines.PostgresqlStorage':350
}
[Postgresql]
# PostgreSQL connection required for saving meta-information
host = localhost
port = 5432
database = 'news-please'
user = 'user'
password = 'password'
If you plan to use news-please and its export to PostgreSQL in a production environment, we recommend uninstalling the psycopg2-binary package and installing psycopg2. We use the former since it does not require a C compiler to be installed. See here for more information on the differences between psycopg2 and psycopg2-binary and how to set up a production environment.
We have collected a bunch of useful information for both users and developers. As a user, you will most likely only deal with two files: sitelist.hjson (to define the sites to be crawled) and config.cfg (probably only rarely, in case you want to tweak the configuration).
You can find more information on usage and development in our wiki! Before contacting us, please check out the wiki. If you still have questions on how to use news-please, please create a new issue on GitHub. Please understand that we are not able to provide individual support via email. We think that help is more valuable if it is shared publicly so that more people can benefit from it.
For bug reports, we ask you to use the Bug report template. Make sure you're using the latest version of news-please, since we cannot give support for older versions. Unfortunately, we cannot give support for issues or questions sent by email.
Your donations are greatly appreciated! They will free us up to work on this project more, to take on tasks such as adding new features, bug-fix support, and addressing further concerns with the library.
This project would not have been possible without the contributions of the following students (ordered alphabetically):
We also thank all other contributors, whom you can find on the contributors page!
If you are using news-please, please cite our paper (ResearchGate, Mendeley):
@InProceedings{Hamborg2017,
author = {Hamborg, Felix and Meuschke, Norman and Breitinger, Corinna and Gipp, Bela},
title = {news-please: A Generic News Crawler and Extractor},
year = {2017},
booktitle = {Proceedings of the 15th International Symposium of Information Science},
location = {Berlin},
doi = {10.5281/zenodo.4120316},
pages = {218--223},
month = {March}
}
You can find more information on this and other news projects on our website.
Do you want to contribute? Great, we always appreciate support on this project! We are particularly looking for pull requests that fix bugs. We also welcome pull requests that contribute your own ideas.
By contributing to this project, you agree that your contributions will be licensed under the project's license.
We love contributions by our users! If you plan to submit a pull request, please open an issue first and describe the issue you want to fix or what you want to improve and how. This way, we can discuss whether your idea could be added to news-please in the first place and, if so, how it could best be implemented to fit into the architecture and coding style. In the issue, please state that you're planning to implement the described features.
Unfortunately, we do not have resources to implement features requested by users. Instead, we recommend that you implement features you need and if you'd like open a pull request here so that the community can benefit from your improvements, too.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use news-please except in compliance with the License. A copy of the License is included in the project, see the file LICENSE.txt.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. The news-please logo is courtesy of Mario Hamborg.
Copyright 2016-2021 The news-please team