Question:

Term-document matrix in Lucene

谢叶五
2023-03-14
IndexReader reader = DirectoryReader.open(index);

for (int i = 0;  i < reader.maxDoc(); i++) {
    Document doc = reader.document(i);
    Terms terms = reader.getTermVector(i, "country_text");

    if (terms != null && terms.size() > 0) {
        // access the terms for this field
        TermsEnum termsEnum = terms.iterator(); 
        BytesRef term = null;

        // explore the terms for this field
        while ((term = termsEnum.next()) != null) {
            // enumerate through documents, in this case only one
            DocsEnum docsEnum = termsEnum.docs(null, null); 
            int docIdEnum;
            while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                // get the term frequency in the document 
                System.out.println(term.utf8ToString()+ " " + docIdEnum + " " + docsEnum.freq()); 
            }
        }
    }
}

Full code:

import java.io.*;
import java.util.Iterator;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.JSONValue;
import org.json.simple.parser.JSONParser;

public class LuceneIndex {

    public static void main(String[] args) throws IOException, ParseException {

        String jsonFilePath = "wiki_data.json";
        JSONParser parser = new JSONParser();
        // Specify the analyzer for tokenizing text.
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter w = new IndexWriter(index, config);

        try {     
            JSONArray a = (JSONArray) parser.parse(new FileReader(jsonFilePath));

            for (Object o : a) {
                JSONObject country = (JSONObject) o;
                String countryName = (String) country.get("country_name");
                String cityName = (String) country.get("city_name");
                String countryText = (String) country.get("country_text");
                String cityText = (String) country.get("city_text");
                System.out.println(cityName);
                addDoc(w, countryName, cityName, countryText, cityText);
            }
            w.close();

            IndexReader reader = DirectoryReader.open(index);

            for (int i = 0;  i < reader.maxDoc(); i++) {
                Document doc = reader.document(i);
                Terms terms = reader.getTermVector(i, "country_text");

                if (terms != null && terms.size() > 0) {
                    // access the terms for this field
                    TermsEnum termsEnum = terms.iterator(); 
                    BytesRef term = null;

                    // explore the terms for this field
                    while ((term = termsEnum.next()) != null) {
                        // enumerate through documents, in this case only one
                        DocsEnum docsEnum = termsEnum.docs(null, null); 
                        int docIdEnum;
                        while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                            // get the term frequency in the document 
                            System.out.println(term.utf8ToString()+ " " + docIdEnum + " " + docsEnum.freq()); 
                        }
                    }
                }
            }

            // reader can be closed when there
            // is no need to access the documents any more.
            reader.close();

        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (org.json.simple.parser.ParseException e) {
            e.printStackTrace();
        }
    }

    private static void addDoc(IndexWriter w, String countryName, String cityName, 
            String countryText, String cityText) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("country_name", countryName, Field.Store.YES));
        doc.add(new StringField("city_name", cityName, Field.Store.YES));
        doc.add(new TextField("country_text", countryText, Field.Store.YES));
        doc.add(new TextField("city_text", cityText, Field.Store.YES));

        w.addDocument(doc);
    }

}

1 answer

沙海
2023-03-14

First of all, thank you for your code; I had a small bug and your code helped me fix it.

For me, it works as follows (Lucene 7.2.1):

for(int i = 0; i < reader.maxDoc(); i++){
    Document doc = reader.document(i);
    Terms terms = reader.getTermVector(i, "text");

    if (terms != null && terms.size() > 0) {
        // access the terms for this field
        TermsEnum termsEnum = terms.iterator();
        BytesRef term = null;

        // explore the terms for this field
        while ((term = termsEnum.next()) != null) {
            // enumerate through documents, in this case only one
            PostingsEnum docsEnum = termsEnum.postings(null); 
            int docIdEnum;
            while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                // get the term frequency in the document
                System.out.println(term.utf8ToString()+ " " + docIdEnum + " " + docsEnum.freq());
            }
        }
    }
}

The change here is that I used PostingsEnum (via termsEnum.postings(null)). DocsEnum is no longer available in Lucene 7.2.1.

private void addDoc(IndexWriter w, String text, String name, String id) throws IOException {
    Document doc = new Document();
    // Create a custom FieldType so that term vectors are stored for this field
    FieldType ft = new FieldType();
    ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
    ft.setTokenized(true);
    ft.setStored(true);
    ft.setStoreTermVectors(true);  // store term vectors
    ft.freeze();
    // Field is the general-purpose class for a custom FieldType;
    // this field is indexed as well as stored
    Field t = new Field("text", text, ft);
    doc.add(t);

    doc.add(new StringField("name", name, Field.Store.YES));
    doc.add(new StringField("id", id, Field.Store.YES));
    w.addDocument(doc);
}
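
The loop in this answer only prints one "term docID frequency" line per posting. To end up with an actual term-document matrix, one option is to collect those triples into a nested map. The sketch below is plain Java with no Lucene dependency; TermDocumentMatrix and its methods are hypothetical names, not part of Lucene. It assumes you would call add(term.utf8ToString(), i, docsEnum.freq()) inside the inner loop, with the outer loop index i as the column. A term vector describes a single document, so the doc ID returned while iterating its postings is not the index-wide document number.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper (not a Lucene class): a sparse term-document matrix. */
final class TermDocumentMatrix {
    // term -> (docId -> frequency); TreeMap keeps terms and doc IDs sorted
    private final Map<String, Map<Integer, Integer>> rows = new TreeMap<>();
    private int numDocs = 0;

    /** Records that term occurs freq times in document docId (repeated calls add up). */
    void add(String term, int docId, int freq) {
        if (docId < 0 || freq < 0) {
            throw new IllegalArgumentException("docId and freq must be non-negative");
        }
        rows.computeIfAbsent(term, t -> new TreeMap<>()).merge(docId, freq, Integer::sum);
        numDocs = Math.max(numDocs, docId + 1);
    }

    /** Frequency of term in docId, or 0 if the pair was never recorded. */
    int get(String term, int docId) {
        Map<Integer, Integer> sparse = rows.get(term);
        return sparse == null ? 0 : sparse.getOrDefault(docId, 0);
    }

    /** All recorded terms in sorted order. */
    Set<String> terms() {
        return rows.keySet();
    }

    /** Dense row for one term, with one column per document ID seen so far. */
    int[] row(String term) {
        int[] dense = new int[numDocs];
        Map<Integer, Integer> sparse = rows.get(term);
        if (sparse != null) {
            for (Map.Entry<Integer, Integer> e : sparse.entrySet()) {
                dense[e.getKey()] = e.getValue();
            }
        }
        return dense;
    }

    public static void main(String[] args) {
        TermDocumentMatrix m = new TermDocumentMatrix();
        // Made-up sample triples standing in for the values printed by the loop
        m.add("capital", 0, 2);
        m.add("river", 0, 1);
        m.add("capital", 1, 1);
        m.add("island", 2, 3);
        for (String term : m.terms()) {
            System.out.println(term + " " + Arrays.toString(m.row(term)));
        }
    }
}
```

The nested map stays sparse while the index is scanned, and row() only builds a dense array when a row is actually needed, which matters because most terms appear in few documents.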