Question:

Splitting a token into sub-tokens with a Lucene TokenFilter

蔺山

2023-03-14

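My program needs to index unstructured documents with Lucene (4.10), and their content can be anything. My custom analyzer therefore uses ClassicTokenizer to tokenize the document first.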

public class SymbolSplitterFilter extends TokenFilter {

private final CharTermAttribute termAtt;
private final PositionIncrementAttribute posIncAtt;
private final Stack<String> termStack;
private AttributeSource.State current;

public SymbolSplitterFilter(TokenStream in) {
    super(in);
    termStack = new Stack<>();
    termAtt = addAttribute(CharTermAttribute.class);
    posIncAtt = addAttribute(PositionIncrementAttribute.class);
}

@Override
public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
        return false;
    }

    final String currentTerm = termAtt.toString();

    System.err.println("The original word was " + termAtt.toString());
    final int bufferLength = termAtt.length();

    if (bufferLength > 1 && currentTerm.indexOf("@") > 0) { // There must be sth more than just @
        // If this is the first pass we fill in the stack with the terms
        if (termStack.isEmpty()) {
            // We split the token abc@cd.com into abc and cd.com
            termStack.addAll(Arrays.asList(currentTerm.split("@")));
            // Now we have the constituting terms of the email in the stack
            System.err.println("The terms on the stacks are ");
            for (int i = 0; i < termStack.size(); i++) {
                System.err.println(termStack.get(i));
                /** The terms on the stacks are 
                * xyz
                * gmail.com
                */

            }

            // I am not sure it is the right place for this.
             current = captureState();

        } else {
            // This part seems to never be reached!
            // We add the constituents terms as tokens.
            String part = termStack.pop();
            System.err.println("Current part is " + part);
            restoreState(current);
            termAtt.setEmpty().append(part);                 
            posIncAtt.setPositionIncrement(0);
        }
    }

    System.err.println("In the end we have " + termAtt.toString());
    // In the end we have xyz@gmail.com
    return true;

}

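However, the stack is never processed. In fact, I do not understand how the incrementToken method works (even though I have read this SO question), nor when it processes a given token from the TokenStream.

Ultimately, what I want to achieve is this: for the input text xyz@gmail.com, I want to generate the following sub-tokens: xyz@gmail.com xyz gmail.com

Any help is appreciated,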

彭坚壁

2023-03-14

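Your problem is that by the time the stack is first filled, the input token stream is already exhausted, so the next call to input.incrementToken() returns false and the stacked parts are never emitted. You should check whether the stack has entries before advancing the input. Like this: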

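import java.io.IOException;
import java.util.Arrays;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.AttributeSource;

public final class SymbolSplitterFilter extends TokenFilter {

    private final CharTermAttribute termAtt;
    private final PositionIncrementAttribute posIncAtt;
    // Not read here; registering it ensures the stream exposes a TypeAttribute.
    private final TypeAttribute typeAtt;
    private final Stack<String> termStack;
    private AttributeSource.State current;

    public SymbolSplitterFilter(TokenStream in) {
        super(in);
        termStack = new Stack<>();
        termAtt = addAttribute(CharTermAttribute.class);
        posIncAtt = addAttribute(PositionIncrementAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!termStack.isEmpty()) {
            // Pending sub-tokens: emit one without advancing the input.
            String part = termStack.pop();
            restoreState(current);              // copy offsets, type, etc. from the original token
            termAtt.setEmpty().append(part);
            posIncAtt.setPositionIncrement(0);  // same position as the original token
            return true;
        } else if (!input.incrementToken()) {
            return false;                       // input exhausted and nothing buffered
        } else {
            final String currentTerm = termAtt.toString();
            // There must be something more than just "@".
            if (termAtt.length() > 1 && currentTerm.indexOf("@") > 0) {
                // Buffer the parts; the unsplit token itself is returned by this call.
                termStack.addAll(Arrays.asList(currentTerm.split("@")));
                current = captureState();
            }
            return true;
        }
    }

    @Override
    public void reset() throws IOException {
        // Drop state left over from a previous stream when the filter is reused.
        super.reset();
        termStack.clear();
        current = null;
    }
}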
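
If the calling convention of incrementToken() is still unclear, the following may help. It is only a plain-Java model of the pull-based contract, not Lucene code: the class `PullModelSketch`, the nested `SplittingSource`, and the methods `next()` and `drain()` are hypothetical names made up for this illustration, with `next()` standing in for incrementToken() and a `null` return standing in for `false`. The consumer calls the method once per emitted token, so anything buffered has to be handed out before the input is advanced again. Note that this sketch uses a FIFO queue, so the parts come out in the order asked for in the question, whereas a Stack hands them out in reverse.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Plain-Java model of the pull-based contract; it does not use any Lucene classes.
final class PullModelSketch {

    /** Hypothetical stand-in for a TokenFilter: next() plays the role of incrementToken(). */
    static final class SplittingSource {
        private final Iterator<String> input;                     // stands in for the upstream TokenStream
        private final Deque<String> pending = new ArrayDeque<>(); // buffered sub-tokens, FIFO

        SplittingSource(Iterator<String> input) {
            this.input = input;
        }

        /** Returns the next token, or null once everything has been emitted. */
        String next() {
            // 1. Hand out buffered parts first, without touching the input.
            if (!pending.isEmpty()) {
                return pending.pollFirst();
            }
            // 2. Only then advance the input.
            if (!input.hasNext()) {
                return null;
            }
            String token = input.next();
            // 3. Buffer the parts; the unsplit token is what this call returns.
            if (token.length() > 1 && token.indexOf('@') > 0) {
                for (String part : token.split("@")) {
                    pending.addLast(part);
                }
            }
            return token;
        }
    }

    /** Plays the consumer: keeps calling next() until the source is exhausted. */
    static List<String> drain(String... tokens) {
        SplittingSource source = new SplittingSource(Arrays.asList(tokens).iterator());
        List<String> out = new ArrayList<>();
        for (String t = source.next(); t != null; t = source.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [hey, xyz@gmail.com, xyz, gmail.com]
        System.out.println(drain("hey", "xyz@gmail.com"));
    }
}
```

If step 1 and step 2 are swapped, the last token's parts are lost whenever the input has nothing left, which is the symptom described in the question.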

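Note that you may also want to correct your offsets and change the order of the tokens, as the following test shows for the tokens that are currently produced: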

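import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.Tokenizer;
import org.junit.Test;

public class SymbolSplitterFilterTest extends BaseTokenStreamTestCase {

    @Test
    public void testSomeMethod() throws IOException {
        Analyzer analyzer = this.getAnalyzer();
        assertAnalyzesTo(analyzer, "hey xyz@example.com",
            new String[]{"hey", "xyz@example.com", "example.com", "xyz"}, // terms
            new int[]{0, 4, 4, 4},                                         // start offsets
            new int[]{3, 19, 19, 19},                                      // end offsets
            new String[]{"word", "word", "word", "word"},                  // types
            new int[]{1, 1, 0, 0}                                          // position increments
        );
    }

    private Analyzer getAnalyzer() {
        return new Analyzer() {
            // This is the Lucene 5+ signature. On 4.10, createComponents also takes a
            // Reader, which is passed to the MockTokenizer constructor.
            @Override
            protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
                Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
                SymbolSplitterFilter testFilter = new SymbolSplitterFilter(tokenizer);
                return new Analyzer.TokenStreamComponents(tokenizer, testFilter);
            }
        };
    }

}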
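
As for correcting the offsets, here is a rough sketch of the arithmetic only, written in plain Java without Lucene classes. The class `SubTokenOffsets`, its nested `Part` type, and the `split` method are hypothetical names used for illustration. It assumes each sub-token should point at its own slice of the original text instead of inheriting the offsets of the whole address; in a real filter the computed values would presumably be passed to OffsetAttribute.setOffset(start, end) after restoreState, and the expected offset arrays in the test above would have to change accordingly.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the offset arithmetic; it does not use any Lucene classes.
final class SubTokenOffsets {

    /** Hypothetical value type: a sub-token and its character offsets in the original text. */
    static final class Part {
        final String text;
        final int start; // inclusive
        final int end;   // exclusive

        Part(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /**
     * Splits token at every '@' and shifts each piece by tokenStart,
     * the start offset of the unsplit token. Empty pieces are skipped.
     */
    static List<Part> split(String token, int tokenStart) {
        List<Part> parts = new ArrayList<>();
        int from = 0;
        while (from <= token.length()) {
            int at = token.indexOf('@', from);
            int to = (at < 0) ? token.length() : at;
            if (to > from) {
                parts.add(new Part(token.substring(from, to), tokenStart + from, tokenStart + to));
            }
            from = to + 1; // continue after the '@'
        }
        return parts;
    }

    public static void main(String[] args) {
        // "hey xyz@example.com": the address starts at offset 4
        // prints [xyz[4,7), example.com[8,19)]
        System.out.println(split("xyz@example.com", 4));
    }
}
```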
