Wikipedia, the largest and most popular reference work on the web, has long been favored by natural language processing researchers as a high-quality source of language data. The usual way to obtain Wikipedia content is through the database dumps it publishes (http://dumps.wikimedia.org), but the dumps are huge and hard to load into a local database: the English dump alone is over 10 GB, and a machine with less than 16 GB of RAM will struggle to process it. Getting Wikipedia data quickly has therefore always been a pain point.
While studying concept prerequisite relations based on Wikipedia in my lab, I put together a set of routines that query the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page) to fetch Wikipedia information. Wikipedia exposes far too much data to cover exhaustively, but the program below covers most of the concept features in Wikipedia (link relations, category relations, and so on), so by analogy you can fetch everything else as well. Python also has a third-party library (wikipedia) that wraps this API, but in my tests its network speed was inconsistent, so in practice it can feel very slow or simply raise errors; it is still worth a look if you are interested.
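For comparison, this is roughly what the wikipedia library looks like in use (a minimal sketch based on its documented interface; install it with pip install wikipedia first):

import wikipedia

wikipedia.set_lang('en')                                   # pick the language edition
print(wikipedia.summary('Machine learning', sentences=2))  # short intro text
page = wikipedia.page('Machine learning')
print(page.links[:10])                                     # outgoing links, like linkss() below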
I hope this post is useful to researchers interested in NLP and to Wikipedia enthusiasts.
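Before diving into the code, it helps to know the rough shape of an action=query response, since every function below unwraps it the same way: the 'pages' object is keyed by page id, and a missing page gets the key '-1'. Illustrative shape only, trimmed to the fields used here:

# {
#   "query": {
#     "pages": {
#       "233488": { "pageid": 233488, "title": "Machine learning", "extract": "..." }
#     }
#   }
# }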
# -*- coding:utf-8 -*-
# Author:Zhou Yang
import requests
import json
import logging
import sys
import os.path
import re
# Build the API endpoint, e.g. https://en.wikipedia.org/w/api.php
agreement = 'https://'
language = 'en'
organization = '.wikipedia.org/w/api.php'
API_URL = agreement + language + organization
program = os.path.basename(sys.argv[0])
logger = logging.getLogger(program)
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
def pageid(title=None, np=0):
    """Return the numeric page id for a title, or -1 on failure.
    Pass np != 0 to look up the page 'Category:<title>' instead."""
    query_params = {
        'action': 'query',
        'prop': 'info',
        'format': 'json',
        'titles': title
    }
    if np != 0:
        query_params['titles'] = 'Category:' + title
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = r.json()
    except (requests.RequestException, ValueError):
        return -1
    try:
        # 'pages' is keyed by page id; a missing page yields the key '-1'.
        for i in text["query"]['pages']:
            return int(i)
    except (KeyError, ValueError):
        return -1
def summary(title=None):
    """Return the plain-text lead section (intro) of a page, '' on failure."""
    query_params = {
        'action': 'query',
        'prop': 'extracts',
        'explaintext': '',   # plain text instead of HTML
        'exintro': '',       # lead section only
        'format': 'json',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = r.json()
    except (requests.RequestException, ValueError):
        logger.error('error summary about ' + title)
        return ""
    id = list(text["query"]["pages"].keys())[0]
    try:
        return text["query"]["pages"][id]["extract"]
    except KeyError:
        return ""
def body(title=None):
    """Return the full page text with HTML tags stripped, '' on failure."""
    query_params = {
        'action': 'query',
        'prop': 'extracts',
        'exlimit': 'max',
        'format': 'json',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = r.json()
    except (requests.RequestException, ValueError):
        logger.error('error body about ' + title)
        return ""
    id = list(text["query"]["pages"].keys())[0]
    try:
        html_text = text["query"]["pages"][id]["extract"]

        def stripTagSimple(htmlStr):
            '''
            The simplest way to strip HTML <...> tags. Note the pattern must
            match <some characters>, not a bare <>.
            :param htmlStr: string possibly containing HTML tags
            '''
            # dr = re.compile(r'<[^>]+>', re.S)
            dr = re.compile(r'</?\w+[^>]*>', re.S)
            return re.sub(dr, '', htmlStr)

        html_text = stripTagSimple(html_text)
        html_text = str(html_text).replace("\n", "")
        return html_text
    except KeyError:
        return ""
def links(title=None):
    """Return the outgoing article links (namespace 0) whose titles also
    appear in the page's intro text, i.e. the concepts actually mentioned
    in the summary."""
    query_params = {
        'action': 'query',
        'prop': 'links',
        'pllimit': 'max',
        'plnamespace': '0',   # article namespace only
        'format': 'json',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = r.json()
    except (requests.RequestException, ValueError):
        logger.error('error links about ' + title)
        return list()
    id = list(text["query"]["pages"].keys())[0]
    link = list()
    summ = summary(title)
    try:
        for obj in text["query"]['pages'][id]["links"]:
            # Keep only links whose title occurs in the intro text.
            if obj['title'] in summ or obj['title'].lower() in summ:
                link.append(obj['title'])
    except KeyError:
        return link
    return link
def linkss(title=None):
    """Return all outgoing article links (namespace 0), unfiltered."""
    query_params = {
        'action': 'query',
        'prop': 'links',
        'pllimit': 'max',
        'plnamespace': '0',
        'format': 'json',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = r.json()
    except (requests.RequestException, ValueError):
        logger.error('error linkss about ' + title)
        return list()
    id = list(text["query"]["pages"].keys())[0]
    link = list()
    try:
        for obj in text["query"]['pages'][id]["links"]:
            link.append(obj['title'])
    except KeyError:
        return link
    return link
def backlinks(title=None):
    """Return the titles of article pages that link to this page."""
    query_params = {
        'action': 'query',
        'list': 'backlinks',
        'bllimit': 'max',
        'blnamespace': '0',
        'format': 'json',
        'bltitle': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = r.json()
    except (requests.RequestException, ValueError):
        logger.error('error backlinks about ' + title)
        return list()
    link = list()
    try:
        link = [obj['title'] for obj in text["query"]["backlinks"]]
    except KeyError:
        return link
    return link
def categories(title=None):
    """Return the non-hidden categories of a page, 'Category:' prefix stripped."""
    query_params = {
        'action': 'query',
        'prop': 'categories',
        'cllimit': 'max',
        'clshow': '!hidden',   # skip hidden/maintenance categories
        'format': 'json',
        'clcategories': '',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = r.json()
    except (requests.RequestException, ValueError):
        logger.error('error categories about ' + title)
        return list()
    id = list(text["query"]["pages"].keys())[0]
    category = list()
    if id != '-1':   # '-1' marks a missing page
        try:
            # [9:] strips the 'Category:' prefix.
            category = [obj['title'][9:] for obj in text["query"]['pages'][id]["categories"]]
        except KeyError:
            return category
    return category
def redirects(title=None):
    """Return the titles of pages that redirect to this page."""
    query_params = {
        'action': 'query',
        'prop': 'redirects',
        'rdlimit': 'max',
        'format': 'json',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = r.json()
    except (requests.RequestException, ValueError):
        logger.error('error redirects about ' + title)
        return list()
    id = list(text["query"]["pages"].keys())[0]
    redirect = list()
    if id != '-1':
        try:
            redirect = [obj['title'] for obj in text["query"]['pages'][id]["redirects"]]
        except KeyError:
            return redirect
    return redirect
def subcats(title=None):
    """Return the subcategories of 'Category:<title>', prefix stripped."""
    query_params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtype': 'subcat',
        'cmlimit': 'max',
        'format': 'json',
        'cmtitle': 'Category:' + title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = r.json()
    except (requests.RequestException, ValueError):
        logger.error('error subcats about ' + title)
        return list()
    subcat = list()
    try:
        subcat = [obj['title'][9:] for obj in text["query"]['categorymembers']]
    except KeyError:
        return subcat
    return subcat
def supercats(title=None):
    """Return the parent categories of 'Category:<title>', prefix stripped."""
    query_params = {
        'action': 'query',
        'prop': 'categories',
        'cllimit': 'max',
        'format': 'json',
        'clshow': '!hidden',
        'titles': 'Category:' + title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = r.json()
    except (requests.RequestException, ValueError):
        logger.error('error supercats about ' + title)
        return list()
    id = list(text["query"]["pages"].keys())[0]
    supercat = list()
    if id != '-1':
        try:
            supercat = [obj['title'][9:] for obj in text["query"]['pages'][id]["categories"]]
        except KeyError:
            return supercat
    return supercat
def contributors(title=None):
    """Return the user ids of registered contributors to a page."""
    query_params = {
        'action': 'query',
        'prop': 'contributors',
        'pclimit': 'max',
        'format': 'json',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = r.json()
    except (requests.RequestException, ValueError):
        logger.error('error contributors about ' + title)
        return list()
    id = list(text["query"]["pages"].keys())[0]
    contributors = list()
    try:
        for obj in text["query"]['pages'][id]["contributors"]:
            contributors.append(obj['userid'])
    except KeyError:
        return contributors
    return contributors
if __name__ == '__main__':
    title = "Machine learning"
    id = pageid(title)
    summ = summary(title)
    Out = links(title)
    print(id)
    print(summ)
    print(Out)

Running the script prints the page id, the intro summary, and the filtered links:
233488
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in a wide variety of applications, such as email filtering, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.
['Algorithm', 'Artificial intelligence', 'Computational statistics', 'Computer systems', 'Computer vision', 'Data mining', 'Email filtering', 'Exploratory data analysis', 'Inference', 'Mathematica', 'Mathematical optimization', 'Predictive analytics', 'STATISTICA', 'Statistical model', 'Statistics', 'Supervised learning', 'Training data']
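One caveat: with 'pllimit': 'max' a single request returns at most 500 links, and pages with more come back with a top-level 'continue' object that has to be merged into the follow-up request. Here is a minimal sketch of that loop (the helper all_links is my own addition, reusing the API_URL defined above):

def all_links(title):
    # Follow the API's 'continue' tokens until every link has been returned.
    params = {'action': 'query', 'prop': 'links', 'pllimit': 'max',
              'plnamespace': '0', 'format': 'json', 'titles': title}
    link = []
    while True:
        data = requests.get(API_URL, params=params).json()
        for page in data['query']['pages'].values():
            link.extend(obj['title'] for obj in page.get('links', []))
        if 'continue' not in data:
            return link
        params.update(data['continue'])   # e.g. adds a 'plcontinue' token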
You can explore the other API modules on your own. One reminder: to query a different language edition, you only need to change the language variable in the code. This post uses the English Wikipedia; for the Chinese one, change it to "zh" (accessing the Chinese Wikipedia from mainland China requires a proxy), and other languages work the same way.
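For example, switching to the Chinese Wikipedia takes two lines (a sketch; the title below is just an illustration):

language = 'zh'
API_URL = agreement + language + organization   # now https://zh.wikipedia.org/w/api.php
print(pageid('机器学习'))                       # page id of the zh article on machine learning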
This is my first technical blog post, so corrections and suggestions are very welcome!