TF-IDF (Term Frequency-Inverse Document Frequency) is one of the most common methods for extracting keywords or features from text. Instead of considering only how often a word occurs (TF), TF-IDF also brings in the inverse document frequency (IDF), which makes the extracted keywords more representative of the document they come from.
The main idea: if a word occurs frequently in one article but is unlikely to occur in the other articles of the corpus, that word is a good candidate keyword for the article.
Next, we explain the principle in detail:
A word with a high term frequency in an article should be more representative of it than a word with a low one.
TF = \dfrac{n_i}{\sum n_i}
n_i: the number of times the word occurs in the article
\sum n_i: the total number of words in the article
A word's TF therefore rises with its frequency in the article; for example, a word occurring 5 times in a 100-word article has TF = 5/100 = 0.05.
A word that occurs in few other articles (which is what the inverse document frequency captures) should be more representative than one that occurs in many.
IDF = \lg \dfrac{|D|}{|j: t_i \in d_j| + 1}
|D|: the total number of documents in the corpus
|j: t_i \in d_j|: the number of documents that contain the word
Note: the +1 in the denominator prevents division by zero when the word does not occur anywhere in the corpus.
A word's IDF therefore grows as the number of documents in the corpus containing it shrinks.
Combining the two: TF-IDF = TF × IDF.
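Before turning to the library code, a minimal self-contained sketch may help make the arithmetic concrete; the toy corpus and word counts below are invented purely for illustration:

import math

# Toy corpus: three documents as word -> count maps (illustrative values only)
docs = [
    {"apple": 3, "banana": 1},
    {"banana": 2, "cherry": 2},
    {"cherry": 1, "durian": 3},
]

word, doc = "apple", docs[0]
tf = doc[word] / sum(doc.values())      # 3 / 4 = 0.75
df = sum(1 for d in docs if word in d)  # "apple" occurs in 1 document
idf = math.log10(len(docs) / (df + 1))  # lg(3 / 2) ≈ 0.176
print(tf * idf)                         # ≈ 0.132

The base of the logarithm only rescales the IDF values; it does not affect how words rank relative to one another.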
This article uses the nltk package, which can be installed with pip; after installation, run nltk.download() to fetch the additional resources it needs (e.g., the punkt tokenizer models and the stopwords corpus).
First, we import the required packages and set up the text material:
import nltk
import math
import string
import nltk.stem
from nltk.corpus import stopwords
from collections import Counter
# Set up three passages of text
text_1 = "In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. Tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf."
text_2 = "Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification."
text_3 = "One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model."
punctuation_map = dict((ord(char), None) for char in string.punctuation)  # map each punctuation character to None, ready for stripping below
s = nltk.stem.SnowballStemmer('english')  # stemmer configured for English
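As a quick sanity check (the sample strings here are my own, not part of the corpus above), the two helpers behave as follows:

print("Hello, world!".lower().translate(punctuation_map))  # hello world
print(s.stem("ranking"), s.stem("documents"))              # rank document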
Before extracting keywords from the texts, we need to preprocess them: strip the punctuation and remove stop words (prepositions such as "in", conjunctions such as "and", and so on). Words like "in" and "and" may occur very frequently in an article, but they carry no real meaning, so we remove them.
def stem_count(text):
    l_text = text.lower()  # lowercase everything to simplify matching
    without_punctuation = l_text.translate(punctuation_map)  # strip all punctuation
    tokens = nltk.word_tokenize(without_punctuation)  # split the passage into a list of tokens
    without_stopwords = [w for w in tokens if w not in stopwords.words('english')]  # drop stop words
    cleaned_text = [s.stem(w) for w in without_stopwords]  # reduce each word to its stem
    count = Counter(cleaned_text)  # count occurrences of each stem
    return count
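As an illustration (the sentence is made up), the function returns a Counter keyed by word stems:

print(stem_count("Ranking functions rank documents."))
# expected: Counter({'rank': 2, 'function': 1, 'document': 1})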
Next, we define the TF-IDF computation.
# Define the TF-IDF computation
def D_con(word, count_list):
    # document frequency: the number of documents that contain the word
    D_con = 0
    for count in count_list:
        if word in count:
            D_con += 1
    return D_con

def tf(word, count):
    return count[word] / sum(count.values())

def idf(word, count_list):
    # a simplified variant of the lg(|D| / (df + 1)) formula above: it divides
    # log|D| by (df + 1) rather than taking the log of the quotient, but it
    # likewise shrinks as more documents contain the word
    return math.log(len(count_list)) / (1 + D_con(word, count_list))

def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)
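Plugging in the corpus size we use below (three documents), a minimal check of the weighting behaviour, with illustrative document frequencies:

print(math.log(3) / (1 + 1))  # stem found in one document -> ≈ 0.549
print(math.log(3) / (1 + 3))  # stem found in all three    -> ≈ 0.275

Rarer stems are weighted more heavily, as intended.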
Finally, we analyse the three texts:
texts = [text_1, text_2, text_3]
count_list = []
for text in texts:
    count_list.append(stem_count(text))  # cleaned word counts for each text

for i in range(len(count_list)):
    print('For document {}'.format(i + 1))
    tf_idf = {}
    for word in count_list[i]:
        tf_idf[word] = tfidf(word, count_list[i], count_list)
    sort = sorted(tf_idf.items(), key=lambda x: x[1], reverse=True)  # sort by TF-IDF value, descending
    for word, score in sort[:7]:
        print("\tWord: {} : {}".format(word, round(score, 6)))
Output:
For document 1
    Word: word : 0.033803
    Word: document : 0.022536
    Word: inform : 0.016902
    Word: retriev : 0.016902
    Word: tf–idf : 0.016902
    Word: corpus : 0.016902
    Word: number : 0.016902
For document 2
    Word: use : 0.025255
    Word: variat : 0.018942
    Word: tf–idf : 0.018942
    Word: engin : 0.018942
    Word: central : 0.018942
    Word: tool : 0.018942
    Word: score : 0.018942
For document 3
    Word: function : 0.068663
    Word: rank : 0.045776
    Word: simplest : 0.034332
    Word: comput : 0.034332
    Word: sum : 0.034332
    Word: mani : 0.034332
    Word: sophist : 0.034332