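MG4J is a full-text search library similar to Lucene. There is very little material about it online, so here is a short summary of what I have learned.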
1. DocumentSequence: A sequence of documents
Main methods:
DocumentFactory factory()
DocumentIterator iterator()
2. it.unimi.di.big.mg4j.document.DocumentIterator: An iterator over documents
Main method:
Document nextDocument()
3. AbstractDocumentSequence: An abstract, safely closeable implementation of a document sequence
Implements DocumentSequence. Its main subclass is CompositeDocumentSequence, whose DocumentFactory is a CompositeDocumentFactory.
4. DocumentCollection: A collection of documents
Extends DocumentSequence; its abstract base class, AbstractDocumentCollection, extends AbstractDocumentSequence. Main implementations include StringArrayDocumentCollection, JdbcDocumentCollection and ConcatenatedDocumentCollection.
The DocumentFactory of StringArrayDocumentCollection is IdentityDocumentFactory.
The DocumentFactory of IntArrayDocumentCollection is IntArrayDocumentFactory.
Main method:
Document document(long index)
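Document: An indexable document
Holds the fields. The main implementations are the Document created by IdentityDocumentFactory and CompositeDocument.
5. CompositeDocument: the form in which a document is retrieved from the index store.
Main methods:
Object content(int field) // note: the concrete type of the returned Object depends on the document implementation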
WordReader wordReader(int field)
6. DocumentCollectionBuilder: An interface for classes that can build collections during the indexing process
Writes a DocumentCollection to a file ending in .collection. The collection (mainly a ConcatenatedDocumentCollection) is built during indexing so that the original document, that is, a CompositeDocument, can later be retrieved from the index store.
7. DocumentFactory: A factory parsing and building documents of the same type
Builds documents from different input sources. The main implementations are CompositeDocumentFactory, IdentityDocumentFactory and IntArrayDocumentFactory.
CompositeDocumentFactory combines several DocumentFactory instances and produces a CompositeDocument.
Main method:
// Returns the document obtained by parsing the given byte stream.
// Builds a document from the given rawContent and metadata
Document getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata)
8. QueryBuilderVisitor: A visitor for a composite query
The Visitor of the visitor pattern; its main implementation is DocumentIteratorBuilderVisitor.
Main methods:
T visit( Term node ) throws QueryBuilderVisitorException;
T visit( Prefix node ) throws QueryBuilderVisitorException;
boolean visitPre( And node ) throws QueryBuilderVisitorException;
boolean visitPre( Or node ) throws QueryBuilderVisitorException;
// visitPost calls AndDocumentIterator.getInstance on the results of the subnodes
T visitPost( And node, T[] subNodeResult ) throws QueryBuilderVisitorException;
9. it.unimi.di.big.mg4j.query.nodes.Query: A node of a composite representing a query
The main implementations include Align, And, Or, MultiTerm, Term and Weight.
Method:
public <T> T accept( QueryBuilderVisitor<T> visitor ) throws QueryBuilderVisitorException;
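The interplay between accept, visit, visitPre and visitPost is easier to see in a small self-contained sketch. The classes below (QueryNode, TermNode, AndNode, SketchVisitor, PrintingVisitor) are simplified stand-ins invented for this note, not MG4J's real classes, and the visitor here just builds a string rather than a DocumentIterator.

```java
import java.util.ArrayList;
import java.util.List;

// Entry point of the sketch: builds a tiny query tree and walks it.
class VisitorSketch {
    public static void main(String[] args) {
        QueryNode query = new AndNode(new TermNode("foo"), new TermNode("bar"));
        System.out.println(query.accept(new PrintingVisitor())); // prints AND(foo, bar)
    }
}

// Hypothetical, simplified counterpart of QueryBuilderVisitor<T>.
interface SketchVisitor<T> {
    T visit(TermNode node);                        // leaf node
    boolean visitPre(AndNode node);                // called before the children
    T visitPost(AndNode node, List<T> subResults); // called after the children
}

// Hypothetical, simplified counterpart of Query.
interface QueryNode {
    <T> T accept(SketchVisitor<T> visitor);
}

final class TermNode implements QueryNode {
    final String term;
    TermNode(String term) { this.term = term; }
    public <T> T accept(SketchVisitor<T> visitor) { return visitor.visit(this); }
}

final class AndNode implements QueryNode {
    final QueryNode[] children;
    AndNode(QueryNode... children) { this.children = children; }
    public <T> T accept(SketchVisitor<T> visitor) {
        // visitPre may veto the traversal of this subtree.
        if (!visitor.visitPre(this)) return null;
        List<T> results = new ArrayList<>();
        for (QueryNode child : children) results.add(child.accept(visitor));
        // visitPost combines the children's results into one result.
        return visitor.visitPost(this, results);
    }
}

// A visitor that renders the tree as text instead of building iterators.
final class PrintingVisitor implements SketchVisitor<String> {
    public String visit(TermNode node) { return node.term; }
    public boolean visitPre(AndNode node) { return true; }
    public String visitPost(AndNode node, List<String> subResults) {
        return "AND(" + String.join(", ", subResults) + ")";
    }
}
```

In MG4J the role of PrintingVisitor is played by DocumentIteratorBuilderVisitor, which returns iterators instead of strings.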
10. it.unimi.di.big.mg4j.search.DocumentIterator: An iterator over documents (pointers) and their intervals
The main implementations include AlignDocumentIterator, AndDocumentIterator and OrDocumentIterator.
Main methods:
long nextDocument()
boolean mayHaveNext()
11. it.unimi.di.big.mg4j.index.IndexIterator: An iterator over an inverted list.
Extends DocumentIterator. Main methods:
// Returns the count, that is, the number of occurrences of the term in the current document.
int count()
// Returns the frequency, that is, the number of documents that will be returned by this iterator
long frequency()
// Returns the index over which this iterator is built
Index index()
// Returns the next position at which the term appears in the current document
int nextPosition()
12. IndexBuilder
run -> Scan.run -> Paste.run / Concatenate.run -> DiskBasedIndex.getInstance -> produces a QuasiSuccinctIndex
13. Scan
run -> new Scan[ numberOfIndexedFields ] -> documentSequence.iterator() ->
document.content, document.wordReader -> scan[ i ].processDocument -> sizes.writeGamma and build the termMap -> scan[ i ].dumpBatch(), scan[ i ].openSizeBitStream(), accumulator[ i ].writeData()
14. JdbcDocumentCollection
iterator() returns a JdbcDocumentIterator -> nextDocument()
-> CompositeDocumentFactory.getDocument( getStreamFromResultSet( rs, title ), metadata( index++, title ) )
-> new CompositeDocument( metadata, rawContent )
where currDocument = documentFactory[ 0 ].getDocument( rawContent, metadata );
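15. CompositeDocumentFactory
The numberOfFields of this factory is the sum of numberOfFields() over all entries of DocumentFactory[]. factoryIndex[] has length numberOfFields. The initialisation code is:
int n = 0; // running index over the fields of all factories
for( int i = 0; i < this.documentFactory.length; i++ ) {
    for( int j = 0; j < documentFactory[ i ].numberOfFields(); j++ ) {
        fieldType[ n ] = documentFactory[ i ].fieldType( j );
        // Which factory owns field n. For example, if every factory has 10 fields, the stored values are:
        //   0=0; 1=0; ...; 9=0
        //   10=1; 11=1; ...; 19=1
        factoryIndex[ n ] = i;
        // Index of field n within its own factory. The stored values are:
        //   0=0; 1=1; ...; 9=9
        //   10=0; 11=1; ...; 19=9
        originalFieldIndex[ n ] = j;
        n++;
    }
}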
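To check the two mappings by hand, here is a minimal standalone sketch of the same double loop. It replaces the real DocumentFactory[] with a plain array of field counts; FieldMappingSketch and buildMapping are names made up for this note.

```java
import java.util.Arrays;

class FieldMappingSketch {
    /**
     * Returns { factoryIndex, originalFieldIndex } for the given per-factory
     * field counts (a stand-in for documentFactory[ i ].numberOfFields()).
     */
    static int[][] buildMapping(int[] fieldsPerFactory) {
        int total = 0;
        for (int count : fieldsPerFactory) total += count;

        int[] factoryIndex = new int[total];
        int[] originalFieldIndex = new int[total];
        int n = 0; // running index over all fields of all factories
        for (int i = 0; i < fieldsPerFactory.length; i++) {
            for (int j = 0; j < fieldsPerFactory[i]; j++) {
                factoryIndex[n] = i;       // which factory owns composite field n
                originalFieldIndex[n] = j; // field number inside that factory
                n++;
            }
        }
        return new int[][] { factoryIndex, originalFieldIndex };
    }

    public static void main(String[] args) {
        // Two factories: the first with 2 fields, the second with 3.
        int[][] mapping = buildMapping(new int[] { 2, 3 });
        System.out.println(Arrays.toString(mapping[0])); // [0, 0, 1, 1, 1]
        System.out.println(Arrays.toString(mapping[1])); // [0, 1, 0, 1, 2]
    }
}
```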
document.content( indexedField[ i ] ) resolves to CompositeDocument.content( final int field ):
// Look up in factoryIndex the factory that owns this field, and call its getDocument method to obtain currDocument
if ( currFactory < factoryIndex[ field ] ) {
    while( currFactory < factoryIndex[ field ] ) {
        rawContent.reset();
        currFactory++;
    }
    if ( currDocument != null ) currDocument.close();
    currDocument = documentFactory[ currFactory ].getDocument( rawContent, metadata );
}
// Use originalFieldIndex to map the composite field number to currDocument's own field number
currField = field;
return currDocument.content( originalFieldIndex[ field ] );
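The point of the excerpt above is that the underlying document is re-parsed only when the requested field belongs to a later factory than the current one. The following sketch models just that bookkeeping. LazyFieldAccessSketch, factoryFor and reparseCount are invented for illustration, and the guard against going backwards is this sketch's own choice, since the excerpt does not show how the real class treats that case.

```java
class LazyFieldAccessSketch {
    private final int[] factoryIndex; // composite field -> owning factory
    private int currFactory = 0;      // factory 0 is parsed up front in the excerpt
    int reparseCount = 0;             // hypothetical counter, only for illustration

    LazyFieldAccessSketch(int[] factoryIndex) {
        this.factoryIndex = factoryIndex;
    }

    /** Returns the factory that serves the field, switching factories only when needed. */
    int factoryFor(int field) {
        if (factoryIndex[field] < currFactory) {
            // Assumption of this sketch: fields are requested in non-decreasing order.
            throw new IllegalStateException("field " + field + " belongs to an earlier factory");
        }
        if (currFactory < factoryIndex[field]) {
            // Corresponds to resetting rawContent and calling getDocument on the new factory.
            currFactory = factoryIndex[field];
            reparseCount++;
        }
        return currFactory;
    }

    public static void main(String[] args) {
        // Fields 0-1 belong to factory 0, fields 2-4 to factory 1.
        LazyFieldAccessSketch doc = new LazyFieldAccessSketch(new int[] { 0, 0, 1, 1, 1 });
        for (int field = 0; field < 5; field++) {
            System.out.println("field " + field + " -> factory " + doc.factoryFor(field));
        }
        System.out.println("re-parses: " + doc.reparseCount); // prints re-parses: 1
    }
}
```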
16. Index: An abstract representation of an index
Corresponds to a single field.
IndexIterator: An iterator over an inverted list
A subtype of it.unimi.di.big.mg4j.search.DocumentIterator; it is what a search returns.
17. Flow of a simple term query
query.accept(QueryBuilderVisitor) -> select.accept(QueryBuilderVisitor) -> Term.accept(QueryBuilderVisitor) -> visitor.visit( this ) -> DocumentIteratorBuilderVisitor.visit(Term)
if ( node.termNumber != -1 ) {
return curr.top().documents( node.termNumber ).weight( weight() );
}
// Call the documents method of the current Index to obtain an IndexIterator
return curr.top().documents( node.term ).weight( weight() );
18. DocumentIteratorBuilderVisitor
DocumentIteratorBuilderVisitor(
    // Map of indices: the key is the user-defined index name, the value is the Index.
    final Object2ReferenceMap<String,Index> indexMap,
    final Reference2ReferenceMap<Index,Object> index2Parser,
    // The default Index, used when the query does not name one.
    final Index defaultIndex,
    final int limit ) {
    curr = new ObjectArrayList<Index>();
    curr.push( defaultIndex );
}
// visitPre method
public boolean visitPre( final Select node ) throws QueryBuilderVisitorException {
    curr.push( indexMap.get( node.index.toString() ) );
    return true;
}
// visitPost method
public DocumentIterator visitPost( final Select node, final DocumentIterator subNode ) {
    curr.pop();
    return subNode;
}
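The push in visitPre and the pop in visitPost keep a stack whose top is always the index that terms should currently be resolved against. A minimal sketch of that mechanism follows, using plain strings in place of Index objects. IndexStackSketch and its method names are invented for this note, and throwing on an unknown index name is this sketch's own choice, not necessarily what MG4J does.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

class IndexStackSketch {
    private final Map<String, String> indexMap;            // alias -> index (a String stands in for Index)
    private final Deque<String> curr = new ArrayDeque<>(); // top = index currently in effect

    IndexStackSketch(Map<String, String> indexMap, String defaultIndex) {
        this.indexMap = indexMap;
        curr.push(defaultIndex); // used when the query names no index
    }

    /** Entering a Select node: the named index becomes the current one. */
    void visitPreSelect(String alias) {
        String index = indexMap.get(alias);
        if (index == null) throw new IllegalArgumentException("unknown index: " + alias);
        curr.push(index);
    }

    /** Leaving a Select node: restore whatever index was current before it. */
    void visitPostSelect() {
        curr.pop();
    }

    /** The index a Term visited right now would be resolved against. */
    String currentIndex() {
        return curr.peek();
    }

    public static void main(String[] args) {
        IndexStackSketch visitor = new IndexStackSketch(Map.of("title", "titleIndex"), "textIndex");
        System.out.println(visitor.currentIndex()); // textIndex
        visitor.visitPreSelect("title");
        System.out.println(visitor.currentIndex()); // titleIndex
        visitor.visitPostSelect();
        System.out.println(visitor.currentIndex()); // textIndex
    }
}
```

A stack, rather than a single field, is what lets Select nodes nest: each one restores the index of its enclosing scope when it ends.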