Java library for extracting keywords from input text

孔海超

2023-03-14

Question:

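I am looking for a Java library to extract keywords from a block of text.

The process should be as follows:

stop-word cleaning -> stemming -> searching for keywords based on English-language statistics, meaning that a word which appears proportionally more often in the text than in the English language is a keyword candidate.

Is there a library that performs this task?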

Answer:

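Here is a possible solution using Apache Lucene. I did not use the latest version but version 3.6.2, since it is the one I know best. Besides /lucene-core-x.x.x.jar, do not forget to add /contrib/analyzers/common/lucene-analyzers-x.x.x.jar from the downloaded archive to your project: it contains the language-specific analyzers (the English one in particular, in your case).

Note that this will only find the frequencies of the input text words based on their respective stems. Comparing these frequencies with English-language statistics has to be done afterwards.

One keyword corresponds to one stem. Different words may share the same stem, hence the terms set. The keyword frequency is incremented each time a term is added, even if that term has already been found (the set automatically removes duplicates).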

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Keyword implements Comparable<Keyword> {

  private final String stem;
  private final Set<String> terms = new HashSet<String>();
  private int frequency = 0;

  public Keyword(String stem) {
    this.stem = stem;
  }

  public void add(String term) {
    terms.add(term);
    frequency++;
  }

  @Override
  public int compareTo(Keyword o) {
    // descending order
    return Integer.valueOf(o.frequency).compareTo(frequency);
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    } else if (!(obj instanceof Keyword)) {
      return false;
    } else {
      return stem.equals(((Keyword) obj).stem);
    }
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(new Object[] { stem });
  }

  public String getStem() {
    return stem;
  }

  public Set<String> getTerms() {
    return terms;
  }

  public int getFrequency() {
    return frequency;
  }

}

Utilities

To stem a word:

public static String stem(String term) throws IOException {

  TokenStream tokenStream = null;
  try {

    // tokenize
    tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
    // stem
    tokenStream = new PorterStemFilter(tokenStream);

    // add each token in a set, so that duplicates are removed
    Set<String> stems = new HashSet<String>();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      stems.add(token.toString());
    }

    // if no stem or 2+ stems have been found, return null
    if (stems.size() != 1) {
      return null;
    }
    String stem = stems.iterator().next();
    // if the stem has non-alphanumerical chars, return null
    if (!stem.matches("[a-zA-Z0-9-]+")) {
      return null;
    }

    return stem;

  } finally {
    if (tokenStream != null) {
      tokenStream.close();
    }
  }

}

To search a collection (this will be used on the list of potential keywords):

public static <T> T find(Collection<T> collection, T example) {
  for (T element : collection) {
    if (element.equals(example)) {
      return element;
    }
  }
  collection.add(example);
  return example;
}
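
The linear scan in find is fine for short texts, but each lookup walks the whole list, so the cost grows with the number of distinct stems. As a possible alternative, a map keyed by stem gives the same get-or-create behavior with a single hash lookup. The sketch below is only an illustration: the StemIndex class and its findOrCreate method are hypothetical names, and it stores plain term sets instead of Keyword objects so that it stays self-contained.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not part of the answer's code): get-or-create lookup keyed by stem.
class StemIndex {

  // insertion-ordered, so iteration follows the order in which stems were first seen
  private final Map<String, Set<String>> termsByStem = new LinkedHashMap<String, Set<String>>();

  // returns the existing term set for this stem, creating and registering it if absent
  Set<String> findOrCreate(String stem) {
    Set<String> terms = termsByStem.get(stem);
    if (terms == null) {
      terms = new LinkedHashSet<String>();
      termsByStem.put(stem, terms);
    }
    return terms;
  }

  // number of distinct stems seen so far
  int size() {
    return termsByStem.size();
  }

  public static void main(String[] args) {
    StemIndex index = new StemIndex();
    index.findOrCreate("compil").add("compiler");
    index.findOrCreate("compil").add("compiled");
    index.findOrCreate("java").add("java");
    System.out.println(index.size());                 // 2 distinct stems
    System.out.println(index.findOrCreate("compil")); // [compiler, compiled]
  }
}
```

The list-based find keeps the code shorter and lets Collections.sort work directly on the result, so the map is only worth it when the input is large.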

Core

Here is the main input method:

public static List<Keyword> guessFromString(String input) throws IOException {

  TokenStream tokenStream = null;
  try {

    // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
    input = input.replaceAll("-+", "-0");
    // replace any punctuation char but apostrophes and dashes by a space
    input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
    // replace most common english contractions
    input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

    // tokenize input
    tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
    // to lowercase
    tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
    // remove dots from acronyms (and "'s" but already done manually above)
    tokenStream = new ClassicFilter(tokenStream);
    // convert any char to ASCII
    tokenStream = new ASCIIFoldingFilter(tokenStream);
    // remove english stop words
    tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());

    List<Keyword> keywords = new LinkedList<Keyword>();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      String term = token.toString();
      // stem each term
      String stem = stem(term);
      if (stem != null) {
        // create the keyword or get the existing one if any
        Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
        // add its corresponding initial token
        keyword.add(term.replaceAll("-0", "-"));
      }
    }

    // reverse sort by frequency
    Collections.sort(keywords);

    return keywords;

  } finally {
    if (tokenStream != null) {
      tokenStream.close();
    }
  }

}
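
To see what the three replaceAll calls at the top of guessFromString do before any Lucene filter runs, they can be replayed on a sample sentence. This is a small sketch using only the JDK; the PreprocessDemo class, the preprocess method and the sample sentence are made up for illustration, while the regular expressions are copied from the method above.

```java
// Hypothetical demo class: replays the string clean-up done at the start of guessFromString.
class PreprocessDemo {

  static String preprocess(String input) {
    // the answer's hack for dashed words: turn each run of dashes into "-0"
    input = input.replaceAll("-+", "-0");
    // replace any punctuation other than apostrophes and dashes with a space
    input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
    // drop common English contraction suffixes ('t, 'd, 's, 'm, 've, 're, 'll)
    input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");
    return input;
  }

  public static void main(String[] args) {
    // "It's" loses "'s", "they've" loses "'ve", "non-specific" becomes "non-0specific",
    // and the comma and the final period become spaces
    System.out.println(preprocess("It's a non-specific term, they've said."));
  }
}
```

The "-0" marker is undone later in guessFromString by the replaceAll("-0", "-") calls applied to both the stem and the term.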

Example

Running the guessFromString method on the introduction of the Java Wikipedia article gives the following 10 most frequent keywords (i.e. stems):

java         x12    [java]
compil       x5     [compiled, compiler, compilers]
sun          x5     [sun]
develop      x4     [developed, developers]
languag      x3     [languages, language]
implement    x3     [implementation, implementations]
applic       x3     [application, applications]
run          x3     [run]
origin       x3     [originally, original]
gnu          x3     [gnu]

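Iterate over the output list and read each keyword's terms set (shown between brackets [...] in the example above) to see which original words were found for each stem.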
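
One way to print a listing like the one above is sketched here. The KeywordLine class and its format method are hypothetical, and the column widths are simply chosen to match the sample output. With the real code, the three arguments would come from getStem(), getFrequency() and getTerms() on each Keyword returned by guessFromString.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical formatter: builds one output line from a stem, its frequency and its terms.
class KeywordLine {

  // left-aligns the stem in 12 columns and the frequency in 5, then appends the term set
  static String format(String stem, int frequency, Set<String> terms) {
    return String.format("%-12s x%-5d %s", stem, frequency, terms);
  }

  public static void main(String[] args) {
    // sample values taken from the listing above
    Set<String> terms = new LinkedHashSet<String>(Arrays.asList("compiled", "compiler", "compilers"));
    System.out.println(format("compil", 5, terms));
  }
}
```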

What's next

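Compare the stem frequency / frequency sum ratios with the corresponding English-language statistics. If you manage it, keep me in the loop: I would be quite interested too :)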
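
A minimal sketch of that comparison, under stated assumptions: the RatioComparison class and its candidates method are hypothetical, the reference shares in main are made-up numbers rather than real English statistics, and a stem missing from the reference table is assumed to be rare in English. A stem is kept as a candidate when its share of the text exceeds its reference share, which is the criterion described in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the comparison step; names and reference values are illustrative only.
class RatioComparison {

  // returns the stems whose share of the text exceeds their share in the reference table
  static List<String> candidates(Map<String, Integer> textCounts, Map<String, Double> englishShare) {
    int total = 0;
    for (int count : textCounts.values()) {
      total += count;
    }
    List<String> result = new ArrayList<String>();
    for (Map.Entry<String, Integer> entry : textCounts.entrySet()) {
      double textShare = (double) entry.getValue() / total;
      Double reference = englishShare.get(entry.getKey());
      // assumption: a stem absent from the reference table is treated as rare in English
      if (reference == null || textShare > reference) {
        result.add(entry.getKey());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
    counts.put("java", 12);
    counts.put("run", 3);
    counts.put("sun", 5);
    // made-up reference shares, NOT real English statistics
    Map<String, Double> english = new LinkedHashMap<String, Double>();
    english.put("run", 0.5);
    english.put("sun", 0.01);
    System.out.println(candidates(counts, english)); // [java, sun] with these made-up numbers
  }
}
```

Real reference frequencies would have to be stemmed with the same stemmer before the lookup, since the keys here are stems and not surface words.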
