Today we'll briefly introduce and implement the TextTeaser summarization algorithm.
Statistical features:
1) Sentence length: a sentence of some ideal length scores highest, and every sentence is scored by how far its length deviates from that ideal.
2) Sentence position: a sentence is scored according to its position in the document (for example, roughly 70% of the time the first sentence of a paragraph is the core sentence).
3) Keyword score: after preprocessing the text, the top 10 keywords are extracted by term frequency; each sentence is then scored by which of those keywords it contains and how they are distributed within it.
The three scores above are summed, and the sentences are ranked in descending order to obtain each sentence's importance. At this point the summary's readability also matters; the usual practice is to output the selected sentences in the order they appear in the original article.
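As a concrete illustration, here is a minimal, self-contained sketch of the length and position features and of the order-preserving output step. The ideal length of 20 words and the position-bucket weights are taken from common open-source TextTeaser ports; they are assumptions, not necessarily the exact values used in this author's code.

```python
# Hypothetical sketch of features 1 and 2 plus the final ordering step.
# The ideal length (20 words) and the position-bucket weights follow
# common TextTeaser ports and are assumptions here.

def sentence_length_score(words, ideal=20):
    # Peaks at the ideal length and decays linearly with distance from it.
    return max(1.0 - abs(ideal - len(words)) / ideal, 0.0)

def sentence_position_score(i, total):
    # Weight a sentence by its relative position; openings score highest.
    normalized = (i + 1) / total
    buckets = [(0.1, 0.17), (0.2, 0.23), (0.3, 0.14), (0.4, 0.08),
               (0.5, 0.05), (0.6, 0.04), (0.7, 0.06), (0.8, 0.04),
               (0.9, 0.04), (1.0, 0.15)]
    for upper, weight in buckets:
        if normalized <= upper:
            return weight
    return 0.0

def extract_summary(scored, n=3):
    # Take the n highest-scoring sentences, then restore document order
    # so the summary reads naturally.
    top = sorted(scored, key=lambda s: s['totalScore'], reverse=True)[:n]
    return [s['sentence'] for s in sorted(top, key=lambda s: s['order'])]
```

Note that `extract_summary` sorts twice: once by score to pick the sentences, then by original position to keep the summary readable, exactly as described above.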
The main program code is as follows:
def computeScore(self, sentences, topKeywords):
    keywordList = [keyword['word'] for keyword in topKeywords]  # extract the highest-frequency words
    summaries = []  # per-sentence results
    self.ws = WordSegmentation(stop_words_file=None, allow_speech_tags=libiary2.allow_speech_tags)
    for i, sentence in enumerate(sentences):  # iterate over the sentences
        sent = self.parser.removePunctations(sentence)  # strip punctuation from the sentence
        words = self.ws.segment(text=sent, lower=True, use_stop_words=False,
                                use_speech_tags_filter=False)  # segment the sentence into words
        # Pass in the sentence words, the top-frequency keywords, and the keyword list;
        # returns (1 / word count) * keyword score for the sentence
        sbsFeature = self.sbs(words, topKeywords, keywordList)
        dbsFeature = self.dbs(words, topKeywords, keywordList)
        sentenceLength = self.parser.getSentenceLengthScore(words)  # length score; the ideal length is 20
        sentencePosition = self.parser.getSentencePositionScore(i, len(sentences))  # position weight
        keywordFrequency = (sbsFeature + dbsFeature) / 2.0 * 10.0
        totalScore = (keywordFrequency * 2.0 + sentenceLength * 0.5 + sentencePosition * 1.0) / 4.0
        summaries.append({
            # 'titleFeature': titleFeature,
            # 'sentenceLength': sentenceLength,
            # 'sentencePosition': sentencePosition,
            # 'keywordFrequency': keywordFrequency,
            'totalScore': totalScore,
            'sentence': sentence,
            'order': i
        })
    return summaries
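The sbs and dbs helpers invoked above are not shown. A minimal standalone sketch of what they typically compute in TextTeaser ports (summation-based vs. density-based keyword scoring) might look like the following; the 'totalScore' key on each keyword entry is an assumption mirroring the dictionary shape used elsewhere in the code.

```python
# Hypothetical standalone versions of the sbs/dbs helpers called in
# computeScore; modeled on common TextTeaser ports, not the author's code.

def sbs(words, top_keywords, keyword_list):
    # Summation-based score: the keyword scores summed over the sentence,
    # averaged by sentence length, i.e. (1 / word count) * total score.
    if not words:
        return 0.0
    score = sum(top_keywords[keyword_list.index(w)]['totalScore']
                for w in words if w in keyword_list)
    return score / len(words)

def dbs(words, top_keywords, keyword_list):
    # Density-based score: consecutive keyword pairs contribute more the
    # closer together they occur, weighted by 1 / distance ** 2.
    k = len(set(words) & set(keyword_list)) + 1
    total, prev = 0.0, None  # prev holds (index, score) of the last keyword
    for i, w in enumerate(words):
        if w in keyword_list:
            score = top_keywords[keyword_list.index(w)]['totalScore']
            if prev is not None:
                total += (score * prev[1]) / ((i - prev[0]) ** 2)
            prev = (i, score)
    return (1.0 / k) * (k + 1.0) * total
```

Together these capture the two sides of feature 3: sbs rewards how many keywords a sentence contains, while dbs rewards how densely they cluster.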