Understanding slop in Lucene's PhraseQuery
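A PhraseQuery searches by phrase. Say I want to find the phrase "big car": if the chosen field of a document contains "big car", that document matches. If the field contains "big black car" instead, it does not match. To make that case match as well, you have to set a slop. First the definition: slop is the maximum distance allowed between the positions of two terms. The examples below explain what that means in practice.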
The sentence I want to match is: the quick brown fox jumped over the lazy dog.
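Example 1: Suppose I want "quick fox" to match the sentence above. The sentence has quick [brown] fox, so it is one word away from my "quick fox". I therefore set slop to 1, meaning at most one word may sit between quick and fox. Every "quick [***] fox" will then match.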
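Example 2: Matching the sentence with "fox quick" is also possible, but it takes more work than Example 1. We have to work out how the terms of "fox quick" must be moved to form "quick [***] fox". As the table below shows, moving fox to the right 3 times does it, so the slop has to be 3: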
move | slot 1     | slot 2 | slot 3
0    | fox        | quick  |
1    | fox, quick |        |
2    | quick      | fox    |
3    | quick      |        | fox
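Example 3: How can "lazy jumped quick" match the sentence above? This is harder than Example 2 because three words are involved. However many words there are, slop still expresses the maximum distance allowed. To be thorough, let's look at each combination in turn. (The sentence to match is: the quick brown fox jumped over the lazy dog.)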
move | slot 1       | slot 2 | slot 3 | slot 4
0    | lazy         | jumped |        |
1    | lazy, jumped |        |        |
2    | jumped       | lazy   |        |
3    | jumped       |        | lazy   |
4    | jumped       |        |        | lazy
move | slot 1       | slot 2      | slot 3 | slot 4 | slot 5 | slot 6 | slot 7 | slot 8
0    | lazy         | jumped      | quick  |        |        |        |        |
1    | lazy, jumped | quick       |        |        |        |        |        |
2    | jumped       | lazy, quick |        |        |        |        |        |
3    | jumped       | quick       | lazy   |        |        |        |        |
4    | jumped       | quick       |        | lazy   |        |        |        |
5    | jumped       | quick       |        |        | lazy   |        |        |
6    | jumped       | quick       |        |        |        | lazy   |        |
7    | jumped       | quick       |        |        |        |        | lazy   |
8    | jumped       | quick       |        |        |        |        |        | lazy
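Putting the cases above together, the slop has to be set to 8 before "lazy jumped quick" matches the original sentence. Another way to count it: with jumped held in place, lazy has to shift 4 positions to the right and quick 4 positions to the left, 8 moves in total.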
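The three results above (1, 3 and 8) can be checked with a few lines of plain Java. This is only a minimal sketch of the counting rule as I understand it, not Lucene's own code: for every query term, take its position in the sentence minus its offset in the query, and the required slop is the largest of those values minus the smallest. The class name SlopSketch and the method requiredSlop are made up for this illustration, and the sketch assumes each query term occurs exactly once in the sentence.

```java
import java.util.Arrays;
import java.util.List;

public class SlopSketch {
    // Hypothetical helper: smallest slop at which `phrase` matches `sentence`,
    // assuming every phrase term occurs exactly once in the sentence.
    // Returns -1 when a term is missing, since no slop can help then.
    public static int requiredSlop(String sentence, String phrase) {
        List<String> doc = Arrays.asList(sentence.split(" "));
        String[] terms = phrase.split(" ");
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int offset = 0; offset < terms.length; offset++) {
            int pos = doc.indexOf(terms[offset]); // position in the sentence
            if (pos < 0) {
                return -1;
            }
            int shift = pos - offset; // how far this term sits from its query slot
            min = Math.min(min, shift);
            max = Math.max(max, shift);
        }
        return max - min;
    }

    public static void main(String[] args) {
        String sentence = "the quick brown fox jumped over the lazy dog";
        System.out.println(requiredSlop(sentence, "quick fox"));         // prints 1
        System.out.println(requiredSlop(sentence, "fox quick"));         // prints 3
        System.out.println(requiredSlop(sentence, "lazy jumped quick")); // prints 8
    }
}
```

For "lazy jumped quick" the shifts are 7, 3 and -1, so the spread is 7 - (-1) = 8, the same number the tables arrive at.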
OK, that's it for now. I hope this helps, and if I have misunderstood anything, please point it out. Thanks!
3. PhrasePrefixQuery is mainly used for synonym queries (later Lucene releases replaced it with MultiPhraseQuery):
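// Assumed setup, not shown in the original snippet: an in-memory index
Directory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);
Document doc1 = new Document();
doc1.add(Field.Text("field", "the quick brown fox jumped over the lazy dog"));
writer.addDocument(doc1);
Document doc2 = new Document();
doc2.add(Field.Text("field", "the fast fox hopped over the hound"));
writer.addDocument(doc2);
writer.close(); // assumed: close the writer so both documents are searchable
IndexSearcher searcher = new IndexSearcher(directory); // assumed: searcher was undefined in the original
// The first position accepts either "quick" or "fast"; the second must be "fox"
PhrasePrefixQuery query = new PhrasePrefixQuery();
query.add(new Term[] {new Term("field", "quick"), new Term("field", "fast")});
query.add(new Term("field", "fox"));
// Default slop is 0, so only the adjacent "fast fox" matches
Hits hits = searcher.search(query);
assertEquals("fast fox match", 1, hits.length());
// Slop 1 also lets "quick [brown] fox" match
query.setSlop(1);
hits = searcher.search(query);
assertEquals("both match", 2, hits.length());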
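The two assertions can be reproduced without Lucene. The sketch below is a simplified model of the matching rule, not the library's implementation: each query position carries a list of alternative terms, and a sentence matches when some choice of occurrences keeps the spread of (sentence position minus query offset) within the slop. SynonymPhraseSketch, matches and search are hypothetical names, and the model does not handle a query that repeats the same term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SynonymPhraseSketch {
    // slots[i] lists the alternative terms accepted at query position i.
    public static boolean matches(String sentence, String[][] slots, int slop) {
        String[] doc = sentence.split(" ");
        List<List<Integer>> shifts = new ArrayList<>();
        for (int offset = 0; offset < slots.length; offset++) {
            List<String> alternatives = Arrays.asList(slots[offset]);
            List<Integer> slotShifts = new ArrayList<>();
            for (int pos = 0; pos < doc.length; pos++) {
                if (alternatives.contains(doc[pos])) {
                    slotShifts.add(pos - offset); // candidate occurrence for this slot
                }
            }
            shifts.add(slotShifts);
        }
        return search(shifts, 0, Integer.MAX_VALUE, Integer.MIN_VALUE, slop);
    }

    // Tries one occurrence per slot and keeps the spread (max - min) within slop.
    private static boolean search(List<List<Integer>> shifts, int slot,
                                  int min, int max, int slop) {
        if (slot == shifts.size()) {
            return true;
        }
        for (int shift : shifts.get(slot)) {
            int newMin = Math.min(min, shift);
            int newMax = Math.max(max, shift);
            if (newMax - newMin <= slop && search(shifts, slot + 1, newMin, newMax, slop)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] docs = {
            "the quick brown fox jumped over the lazy dog",
            "the fast fox hopped over the hound"
        };
        String[][] query = {{"quick", "fast"}, {"fox"}};
        for (int slop = 0; slop <= 1; slop++) {
            int hits = 0;
            for (String doc : docs) {
                if (matches(doc, query, slop)) {
                    hits++;
                }
            }
            System.out.println("slop " + slop + ": " + hits + " hit(s)");
        }
        // prints "slop 0: 1 hit(s)" then "slop 1: 2 hit(s)"
    }
}
```

With slop 0 only "fast fox" is adjacent, and with slop 1 the gap left by brown in "quick brown fox" is tolerated too, which lines up with the 1 and 2 in the assertions.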