How to extract keywords from text with TF-IDF and Python's Scikit-Learn

by Kavita Ganesan

Back in 2006, when I had to use TF-IDF for keyword extraction in Java, I ended up writing all of the code from scratch. Neither Data Science nor GitHub was a thing back then, and libraries were limited.

The world is much different today. You have several libraries and open-source code repositories on Github that provide a decent implementation of TF-IDF. If you don’t need a lot of control over how the TF-IDF math is computed, I highly recommend re-using libraries from known packages such as Spark’s MLLib or Python’s scikit-learn.

The one problem that I noticed with these libraries is that they are meant as a pre-step for other tasks like clustering, topic modeling, and text classification. TF-IDF can actually be used to extract important keywords from a document to get a sense of what characterizes a document. For example, if you are dealing with Wikipedia articles, you can use tf-idf to extract words that are unique to a given article. These keywords can be used as a very simple summary of a document, and for text-analytics when we look at these keywords in aggregate.

In this article, I will show you how you can use scikit-learn to extract keywords from documents using TF-IDF. We will specifically do this on a Stack Overflow dataset. If you want access to the full Jupyter Notebook, please head over to my repo.

Important note: I’m assuming that folks following this tutorial are already familiar with the concept of TF-IDF. If you are not, please familiarize yourself with the concept before reading on. There are a couple of videos online that give an intuitive explanation of what it is. For a more academic explanation I would recommend my Ph.D advisor’s explanation.

Dataset

In this example, we will be using a Stack Overflow dataset which is a bit noisy and simulates what you could be dealing with in real life. You can find this dataset in my tutorial repo.

Notice that there are two files. The larger file, stackoverflow-data-idf.json with 20,000 posts, is used to compute the Inverse Document Frequency (IDF). The smaller file, stackoverflow-test.json with 500 posts, would be used as a test set for us to extract keywords from. This dataset is based on the publicly available Stack Overflow dump from Google’s Big Query.

Let’s take a peek at our dataset. The code below reads the one-JSON-string-per-line file data/stackoverflow-data-idf.json into a pandas data frame and prints out its schema and total number of posts.
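A minimal sketch of that step might look like this (assuming the file sits at data/stackoverflow-data-idf.json relative to the notebook):

```python
import pandas as pd

# read the one-JSON-object-per-line file into a data frame
df_idf = pd.read_json("data/stackoverflow-data-idf.json", lines=True)

# print the schema and the total number of posts
print("Schema:\n\n", df_idf.dtypes)
print("Number of questions, columns =", df_idf.shape)
```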

Here, lines=True simply means we are treating each line in the text file as a separate json string.

Notice that this Stack Overflow dataset contains 19 fields including post title, body, tags, dates, and other metadata which we don’t need for this tutorial. For this tutorial, we are mostly interested in the body and title. These will become our source of text for keyword extraction.

We will now create a field that combines both body and title so we have the two in one field. We will also print the second text entry in our new field just to see what the text looks like.
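Something along these lines works, with pre_process(..) kept deliberately mild; the exact cleaning rules below are an assumption, and the notebook in the repo may differ:

```python
import re

def pre_process(text):
    # lowercase everything
    text = text.lower()
    # strip HTML tags
    text = re.sub(r"<.*?>", " ", text)
    # keep letters only (drops digits and special characters)
    text = re.sub(r"[^a-z]+", " ", text)
    return text

# combine title and body into one text field, then clean it
df_idf['text'] = df_idf['title'] + " " + df_idf['body']
df_idf['text'] = df_idf['text'].apply(pre_process)

# peek at the second entry to see what the cleaned text looks like
print(df_idf['text'][1])
```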

Uh oh, this doesn’t look very readable! Well, that’s because of all the cleaning that went on in pre_process(..). You can do a lot more stuff in pre_process(..), such as eliminating all code sections and normalizing the words to their root forms. For simplicity, we will perform only some mild pre-processing.

Creating vocabulary and word counts for the IDF

We now need to create the vocabulary and start the counting process. We can use the CountVectorizer to create a vocabulary from all the text in our df_idf['text'] , followed by the counts of words in the vocabulary:
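A sketch of that step follows; the stop word file path is an assumption about the repo layout:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = df_idf['text'].tolist()

# load the custom stop word list (this path is an assumption about the repo layout)
with open("resources/stopwords.txt") as f:
    stopwords = f.read().splitlines()

# build the vocabulary, ignoring words that appear in more than 85% of documents,
# then count the words from the vocabulary in each document
cv = CountVectorizer(max_df=0.85, stop_words=stopwords)
word_count_vector = cv.fit_transform(docs)
```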

The result of the last two lines from the code above is a sparse matrix representation of the counts. Each column represents a word in the vocabulary. Each row represents a document in our dataset, and the values are the word counts.

Note that, with this representation, counts of some words could be 0 if the word did not appear in the corresponding document.

Here we are passing two parameters to CountVectorizer, max_df and stop_words. The first is just to ignore all words that appear in more than 85% of the documents, since those may be unimportant. The latter is a custom stop words list. You can also use stop words that are native to sklearn by setting stop_words='english'. The stop word list used for this tutorial can be found here.

The resulting shape of word_count_vector is (20000,124901) since we have 20,000 documents in our dataset (the rows) and the vocabulary size is 124,901.

In some text mining applications, such as clustering and text classification, we typically limit the size of the vocabulary. It’s really easy to do this by setting max_features=vocab_size when instantiating CountVectorizer. For this tutorial let’s limit our vocabulary size to 10,000:
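Re-instantiating the vectorizer with the cap might look like this (reusing the stopwords list and docs from above):

```python
# same setup as before, but cap the vocabulary at the 10,000 most frequent terms
cv = CountVectorizer(max_df=0.85, stop_words=stopwords, max_features=10000)
word_count_vector = cv.fit_transform(docs)
```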

Now, let’s look at 10 words from our vocabulary:
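One way to peek at them; the exact ten words you get depend on the fitted data and dictionary ordering:

```python
# the fitted vocabulary is a dict mapping term -> column index
list(cv.vocabulary_.keys())[:10]
```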

['serializing', 'private', 'struct', 'public', 'class', 'contains', 'properties', 'string', 'serialize', 'attempt']

Sweet, these are mostly programming-related.

TfidfTransformer to compute the IDF

It’s now time to compute the IDF values.

In the code below, we are essentially taking the sparse matrix from CountVectorizer (word_count_vector) to generate the IDF when you invoke fit(...) :
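A minimal sketch of that fit:

```python
from sklearn.feature_extraction.text import TfidfTransformer

# learn the IDF values from the word counts computed above
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)
```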

Extremely important point: the IDF should always be based on a large corpus, and it should be representative of the texts you would be using to extract keywords. I’ve seen several articles on the Web that compute the IDF using a handful of documents. You will defeat the whole purpose of IDF weighting if it’s not based on a large corpus, as:

  1. your vocabulary becomes too small, and

  2. you have limited ability to observe the behavior of words that you do know about.

Computing TF-IDF and extracting keywords

Once we have our IDF computed, we are ready to compute TF-IDF and then extract top keywords from the TF-IDF vectors.

In this example, we will extract the top keywords for the questions in data/stackoverflow-test.json. This data file has 500 questions with fields identical to that of data/stackoverflow-data-idf.json as we saw above. We will start by reading our test file, extracting the necessary fields — title and body — and getting the texts into a list.
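A sketch of that step; the variable names df_test and docs_test are my own and not necessarily the repo's:

```python
# read the test set and build the combined title + body texts
df_test = pd.read_json("data/stackoverflow-test.json", lines=True)
df_test['text'] = df_test['title'] + " " + df_test['body']
df_test['text'] = df_test['text'].apply(pre_process)

docs_test = df_test['text'].tolist()
```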

The next step is to compute the tf-idf value for a given document in our test set by invoking tfidf_transformer.transform(...). This generates a vector of tf-idf scores.
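For a single test document, that step might look like this (cv.get_feature_names_out() assumes scikit-learn 1.0 or later; older versions use cv.get_feature_names()):

```python
# map column indices back to words, then compute the tf-idf vector for one document
feature_names = cv.get_feature_names_out()
doc = docs_test[0]
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))
```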

Next, we sort the words in the vector in descending order of tf-idf values and then iterate over to extract the top-n keywords. In the example below, we are extracting keywords for the first document in our test set.

The sort_coo(...) method essentially sorts the values in the vector while preserving the column index. Once you have the column index, it’s really easy to look up the corresponding word value, as you would see in extract_topn_from_vector(...) where we do feature_vals.append(feature_names[idx]).
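A sketch of those two helpers, reconstructed from the description above; the repo's implementation may differ in detail:

```python
def sort_coo(coo_matrix):
    # pair each column index with its score and sort by score, descending
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    # keep only the top-n highest-scoring entries
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []
    for idx, score in sorted_items:
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    # return keyword -> tf-idf score
    return dict(zip(feature_vals, score_vals))

# extract the top 10 keywords for the first document in the test set
sorted_items = sort_coo(tf_idf_vector.tocoo())
keywords = extract_topn_from_vector(feature_names, sorted_items, 10)
print(keywords)
```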

Some results!

In this section, you will see the Stack Overflow question followed by the corresponding extracted keywords.

Question about Eclipse Plugin integration

The top keywords above actually make sense: the post talks about eclipse, maven, integrate, war, and tomcat, all of which are unique to this specific question.

There are a couple of keywords that could have been eliminated, such as possibility and perhaps even project. You can do this by adding more common words to your stop list. You can even create your own stop word list, very specific to your domain.

Now let’s look at another example.

Question about SQL import

Even with all the HTML tags, because of the pre-processing we are able to extract some pretty nice keywords here. The last word appropriately would qualify as a stop word. You can keep running different examples to get ideas of how to fine-tune the results.

Voilà! Now you can extract important keywords from any type of text!

Translated from: https://www.freecodecamp.org/news/how-to-extract-keywords-from-text-with-tf-idf-and-pythons-scikit-learn-b2a0f3d7e667/
