tf-idf, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
The tf-idf value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps adjust for the fact that some words appear more frequently in general.
Term Frequency
The number of times a term occurs in a document is called its term frequency.
Inverse document frequency
However, because the term "the" is so common, term frequency alone will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword for distinguishing relevant from non-relevant documents and terms, unlike the less common words "brown" and "cow". Hence an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.
Step 1: Compute the term frequency

Term frequency (TF) = number of times the term appears in the document

Since documents vary in length, the term frequency is normalized to make different documents comparable:

Term frequency (TF) = (number of times the term appears in the document) / (total number of terms in the document)
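The normalized term frequency above can be sketched in a few lines of Python (the whitespace tokenization and function name here are illustrative, not from the original text):

```python
from collections import Counter

def term_frequency(term, document):
    # document: a list of tokens; normalized TF = occurrences / total tokens
    return Counter(document)[term] / len(document)

doc = "the quick brown cow jumps over the lazy dog".split()
print(term_frequency("the", doc))  # 2 occurrences out of 9 tokens ≈ 0.222
```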
Step 2: Compute the inverse document frequency

Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the term + 1))

The more common a term is, the larger the denominator and the smaller the IDF, approaching 0. The 1 is added to the denominator to avoid division by zero (i.e., the case where no document contains the term).
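A minimal sketch of this IDF formula, including the +1 smoothing in the denominator (the toy corpus and function name are my own):

```python
import math

def inverse_document_frequency(term, corpus):
    # corpus: a list of tokenized documents; the +1 in the denominator
    # avoids division by zero when no document contains the term
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))

corpus = [
    "the brown cow".split(),
    "the quick dog".split(),
    "a lazy cat sat".split(),
]
print(inverse_document_frequency("the", corpus))  # log(3/3) = 0.0
print(inverse_document_frequency("cow", corpus))  # log(3/2) ≈ 0.405
```

Note that with this smoothing, a term contained in all but one document already gets an IDF of exactly 0.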
Step 3: Compute TF-IDF

TF-IDF = term frequency (TF) × inverse document frequency (IDF)

As you can see, TF-IDF is proportional to the number of times a term appears in a document, and inversely related to how often the term occurs across the corpus as a whole.
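Putting the steps together, TF-IDF is just the product of the two factors. In this sketch (toy corpus and names are illustrative), the common word "the" scores 0 while the rarer "cow" scores positive, matching the earlier motivation:

```python
import math
from collections import Counter

def tf_idf(term, document, corpus):
    # TF-IDF = normalized term frequency * inverse document frequency
    tf = Counter(document)[term] / len(document)
    containing = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / (containing + 1))
    return tf * idf

corpus = [
    "the brown cow".split(),
    "the quick dog".split(),
    "a lazy cat sat".split(),
]
doc = corpus[0]
print(tf_idf("the", doc, corpus))  # 0.0: "the" is in 2 of 3 docs, so idf = log(3/3) = 0
print(tf_idf("cow", doc, corpus))  # > 0: "cow" is rare in the corpus
```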
The advantages of the TF-IDF algorithm are that it is simple and fast, and its results match intuition reasonably well. Its drawbacks: measuring a word's importance by term frequency alone is not comprehensive, since important words sometimes do not appear very often. Moreover, the algorithm cannot capture word position: a word appearing early in a document is treated as equally important as one appearing late, which is not correct.