A Simple Apache Lucene Example

太叔天宇

2023-12-01

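Lucene is an open-source full-text search toolkit from the Apache Software Foundation. It began as a subproject of the Jakarta project and is now a top-level Apache project. It is not a complete full-text search engine but a framework for building one: it provides a complete query engine and indexing engine, plus part of a text analysis engine (originally for two Western languages, English and German). Its goal is to give developers a simple, easy-to-use toolkit for adding full-text search to a target system, or a foundation on which to build a complete search engine. Lucene exposes a simple yet powerful API for full-text indexing and searching. In the Java world it is a mature, free, open-source library, and for years it has been among the most popular information retrieval libraries. An information retrieval library is related to a search engine, but the two should not be confused.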
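To make the terms "indexing engine" and "query engine" more concrete, here is a minimal sketch of an inverted index in plain Java. It only illustrates the idea and is not how Lucene is implemented. The class name `TinyInvertedIndex` and its methods are invented for this example, and the tokenization (lowercasing and splitting on non-word ASCII characters) is a deliberate simplification.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy class for illustration; not part of Lucene.
public class TinyInvertedIndex {
    // Maps each term to the sorted ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // "Indexing engine": split the text into terms and record the document id under each term.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // "Query engine": look up a single term and return the ids of the matching documents.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Lucene is a search library");
        index.add(2, "A library is not a search engine");
        System.out.println(index.search("library")); // prints [1, 2]
        System.out.println(index.search("engine"));  // prints [2]
    }
}
```

A real library adds much more on top of this idea (analysis, scoring, on-disk storage), which is what the example in the rest of this article relies on Lucene for.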

Official website: http://lucene.apache.org

A simple example

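1. Add the Maven dependencies
JDK version: 1.8.0_181
Lucene version: 4.0.0
POI version: 3.17, which reads Word and Excel files in both the legacy binary formats (.doc, .xls) and the Office Open XML formats (.docx, .xlsx)
The latest versions can be looked up on mvnrepository.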

    <properties>
        <lucene.version>4.0.0</lucene.version>
        <poi.version>3.17</poi.version>
    </properties>
    
    <dependencies>
        <!-- Lucene core packages START -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- Common analyzers, suitable for English text -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- Chinese analyzer -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-smartcn</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- Query parser for searching the index -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- Highlighting of matched keywords -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-highlighter</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <!-- Lucene core packages END -->

        <!-- Excel and Word document handling dependencies START -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml-schemas</artifactId>
            <version>${poi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.8</version>
        </dependency>
        <!-- Excel and Word document handling dependencies END -->
    </dependencies>

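2. Create the files to be searched
In the folder D:\luceneData\, manually create three files, 1.txt, 2.docx and 3.xlsx, each containing text that includes the two Chinese characters “中国” (China). Save 1.txt in GBK encoding, because the indexing code in the next step reads .txt files as GBK.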
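If you would rather generate the text file programmatically, the sketch below writes a GBK-encoded file and reads it back, which matches the GBK decoding used when the .txt file is indexed. The class `SampleDataWriter` and its method names are made up for this example, and it assumes the JDK provides the GBK charset. Creating 2.docx and 3.xlsx would additionally need POI and is not shown. The sample sentence “中国人民解放军” is written with Unicode escapes so that the source compiles regardless of the source-file encoding.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the original example.
public class SampleDataWriter {
    // Assumption: the running JDK supports the GBK charset.
    private static final Charset GBK = Charset.forName("GBK");

    // Creates the directory if needed and writes the text as GBK bytes.
    public static Path writeGbkText(Path dir, String fileName, String text) {
        try {
            Files.createDirectories(dir);
            return Files.write(dir.resolve(fileName), text.getBytes(GBK));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole file back, decoding it as GBK.
    public static String readGbkText(Path file) {
        try {
            return new String(Files.readAllBytes(file), GBK);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass the data directory as the first argument; defaults to ./luceneData
        Path dir = Paths.get(args.length > 0 ? args[0] : "luceneData");
        // Unicode escapes for the seven-character sample sentence
        Path file = writeGbkText(dir, "1.txt", "\u4e2d\u56fd\u4eba\u6c11\u89e3\u653e\u519b");
        System.out.println("wrote " + file.toAbsolutePath());
    }
}
```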

3. Create the index for the file directory

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.poi.POIXMLTextExtractor;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.*;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.List;

public class CreateLuceneIndex {
    private final static Logger log = LoggerFactory.getLogger(CreateLuceneIndex.class);

    private static String content = "";// text extracted from the current file
    private static String INDEX_DIR = "D:\\luceneIndex";// where the index is stored
    private static String DATA_DIR = "D:\\luceneData";// where the files to index are stored
    private static Analyzer analyzer = null;
    private static Directory directory = null;
    private static IndexWriter indexWriter = null;

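    /**
     * Creates an index for the files in the given directory.
     *
     * @param path the directory containing the files to index
     * @return true on success, false otherwise
     */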
    public static boolean createIndex(String path) {
        Date date1 = new Date();
        File indexFile = new File(INDEX_DIR);
        if (!indexFile.exists()) {
            indexFile.mkdirs();
        }
        List<File> fileList = getFileList(path);
        for (File file : fileList) {
            content = "";
            // get the file extension
            String type = file.getName().substring(file.getName().lastIndexOf(".") + 1);
            if ("txt".equalsIgnoreCase(type)) {
                content += txt2String(file);
            } else if ("doc".equalsIgnoreCase(type) || "docx".equalsIgnoreCase(type)) {
                content += doc2String(file);
            } else if ("xls".equalsIgnoreCase(type) || "xlsx".equalsIgnoreCase(type)) {
                content += xls2String(file);
            }

            System.out.println("name :" + file.getName());
            System.out.println("path :" + file.getPath());
            System.out.println("content :" + content);
            System.out.println("=======================");

            try {
                // StandardAnalyzer is a general-purpose analyzer, not a Chinese word segmenter
                analyzer = new StandardAnalyzer(Version.LUCENE_40);
                directory = FSDirectory.open(new File(INDEX_DIR));

                IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
                indexWriter = new IndexWriter(directory, config);
                Document document = new Document();
                document.add(new TextField("filename", file.getName(), Field.Store.YES));
                document.add(new TextField("content", content, Field.Store.YES));
                document.add(new TextField("path", file.getPath(), Field.Store.YES));
                indexWriter.addDocument(document);// add the document
                indexWriter.commit();
                closeWriter();// release the index write lock before the next file opens a new writer
            } catch (Exception e) {
                log.error("创建文件目录索引异常:" + e.getMessage(), e);
                return false;
            }
        }
        Date date2 = new Date();
        System.out.println("创建索引-----耗时：" + (date2.getTime() - date1.getTime()) + "ms");
        return true;
    }

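    /**
     * Collects the files in a directory that have a supported extension.
     *
     * @param dirPath the directory to scan
     * @return the list of matching files
     */
    private static List<File> getFileList(String dirPath) {
        File[] files = new File(dirPath).listFiles();
        List<File> fileList = new ArrayList<File>();
        if (files == null) {
            // listFiles() returns null if the path is not a directory or cannot be read
            return fileList;
        }
        for (File file : files) {
            if (isTxtFile(file.getName())) {
                fileList.add(file);
            }
        }
        return fileList;
    }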

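    /**
     * Reads the content of a txt file.
     *
     * @param file the file to read
     * @return the file content
     */
    private static String txt2String(File file) {
        StringBuilder result = new StringBuilder();
        // Decode the file as GBK so that Chinese text is not garbled;
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file), "GBK"))) {
            String s;
            // read one line at a time
            while ((s = br.readLine()) != null) {
                if (result.length() > 0) {
                    result.append('\n');// keep words on adjacent lines from running together
                }
                result.append(s);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result.toString();
    }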

    /**
     * Reads the content of a Word file (.doc or .docx).
     *
     * @param file the file to read
     * @return the file content
     */
    public static String doc2String(File file) {
        String result = "";
        if (file.exists() && file.isFile()) {
            InputStream is = null;
            HWPFDocument doc = null;
            XWPFDocument docx = null;
            POIXMLTextExtractor extractor = null;
            try {
                // assign to "is" so that the finally block actually closes the stream
                is = new FileInputStream(file);
                // distinguish the two Word formats: doc and docx
                if (file.getPath().toLowerCase().endsWith("doc")) {
                    doc = new HWPFDocument(is);
                    // text content of the document
                    result = doc.getDocumentText();
                } else if (file.getPath().toLowerCase().endsWith("docx")) {
                    docx = new XWPFDocument(is);
                    extractor = new XWPFWordExtractor(docx);
                    // text content of the document
                    result = extractor.getText();
                } else {
                    log.error("不是word文档:" + file.getPath());
                }
            } catch (Exception e) {
                log.error("word文件读取异常:" + e.getMessage(), e);
            } finally {
                try {
                    if (doc != null) {
                        doc.close();
                    }
                    if (extractor != null) {
                        extractor.close();
                    }
                    if (docx != null) {
                        docx.close();
                    }
                    if (is != null) {
                        is.close();
                    }
                } catch (Exception e) {
                    log.error("关闭IO异常:" + e.getMessage(), e);
                }
            }
        }
        return result;
    }

    /**
     * Reads the content of an Excel file (.xls or .xlsx).
     *
     * @param file the file to read
     * @return the file content
     */
    public static String xls2String(File file) {
        String result = "";
        FileInputStream fis = null;
        try {
            fis = new FileInputStream(file);
            Workbook workbook = null;
            // distinguish the two Excel formats: xls and xlsx
            if (file.getPath().toLowerCase().endsWith("xlsx")) {
                workbook = new XSSFWorkbook(fis);
            } else if (file.getPath().toLowerCase().endsWith("xls")) {
                workbook = new HSSFWorkbook(fis);
            }
            if (workbook == null) {
                // not an Excel file; the finally block still closes the stream
                return result;
            }
            // number of sheets in the workbook
            int numberOfSheets = workbook.getNumberOfSheets();
            // loop over every sheet
            for (int i = 0; i < numberOfSheets; i++) {
                Sheet sheet = workbook.getSheetAt(i);
                // iterate over the rows of the sheet
                Iterator<Row> rowIterator = sheet.iterator();
                while (rowIterator.hasNext()) {
                    Row row = rowIterator.next();
                    // iterate over the cells of the row
                    Iterator<Cell> cellIterator = row.cellIterator();
                    while (cellIterator.hasNext()) {
                        Cell cell = cellIterator.next();
                        // append the cell value according to its type
                        switch (cell.getCellTypeEnum()) {
                            case _NONE:
                                break;
                            case STRING:
                                result += cell.getStringCellValue() + " ";
                                break;
                            case NUMERIC:
                                result += String.valueOf(cell.getNumericCellValue()) + " ";
                                break;
                            case BOOLEAN:
                                result += cell.getBooleanCellValue() + " ";
                                break;
                            case BLANK:
                                break;
                            default:
                                result += cell.toString() + " ";
                        }
                    } // end of cell iterator
                    System.out.println();// prints one empty line per row to the console
                } // end of row iterator
            } // end of sheet loop
        } catch (Exception e) {
            log.error("异常:" + e.getMessage(), e);
        } finally {
            // close the file input stream
            if (fis != null) {
                try {
                    fis.close();
                } catch (IOException e) {
                    log.error("FileInputStream关闭异常:" + e.getMessage(), e);
                }
            }
        }
        return result;
    }

    /**
     * Checks whether a file is one of the supported types (txt, xls, xlsx, doc, docx).
     *
     * @param fileName the file name
     * @return true if the extension is supported, false otherwise
     */
    public static boolean isTxtFile(String fileName) {
        String name = fileName.toLowerCase();
        // endsWith avoids accepting names such as "notes.txt.bak"
        return name.endsWith(".txt")
                || name.endsWith(".xls") || name.endsWith(".xlsx")
                || name.endsWith(".doc") || name.endsWith(".docx");
    }

    /**
     * Closes the index writer.
     *
     * @throws Exception if closing fails
     */
    private static void closeWriter() throws Exception {
        if (indexWriter != null) {
            indexWriter.close();
        }
    }

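    /**
     * Deletes a file, or a directory together with everything in it.
     *
     * @param file the file or directory to delete
     * @return always true; the result of the deletion is not checked
     */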
    private static boolean deleteDir(File file) {
        if (file.isDirectory()) {
            File[] files = file.listFiles();
            for (int i = 0; i < files.length; i++) {
                deleteDir(files[i]);
            }
        }
        file.delete();
        return true;
    }

    /**
     * Test entry point.
     *
     * @param args not used
     */
    public static void main(String[] args) {
        // start from an empty index directory
        File fileIndex = new File(INDEX_DIR);
        deleteDir(fileIndex);
        fileIndex.mkdir();
        // create the index
        if (createIndex(DATA_DIR)) {
            log.info("创建索引成功");
        } else {
            log.error("创建索引失败");
        }
    }
}

The output is as follows:

name :1.txt
path :D:\luceneData\1.txt
content :中国人民解放军
=======================
name :2.docx
path :D:\luceneData\2.docx
content :中国国庆节

=======================

name :3.xlsx
path :D:\luceneData\3.xlsx
content :中国 国家 篮球队 
=======================
创建索引-----耗时：1337ms
16:28:13.031 [main] INFO com.example.testspringboot.apachelucene.CreateLuceneIndex - 创建索引成功

Process finished with exit code 0

4. Keyword search

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.util.Date;

public class SearchLuceneIndex {
    private final static Logger log = LoggerFactory.getLogger(SearchLuceneIndex.class);

    private static String INDEX_DIR = "D:\\luceneIndex";// where the index is stored
    private static Analyzer analyzer = null;
    private static Directory directory = null;

    /**
     * Searches the index and prints the files whose content matches.
     *
     * @param text the query string
     */
    public static void searchIndex(String text) {
        System.out.println("查找关键字“" + text + "”开始......");
        Date date1 = new Date();
        DirectoryReader ireader = null;
        try {
            directory = FSDirectory.open(new File(INDEX_DIR));
            analyzer = new StandardAnalyzer(Version.LUCENE_40);
            ireader = DirectoryReader.open(directory);
            IndexSearcher isearcher = new IndexSearcher(ireader);

            QueryParser parser = new QueryParser(Version.LUCENE_40, "content", analyzer);
            Query query = parser.parse(text);
            ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;

            for (int i = 0; i < hits.length; i++) {
                Document hitDoc = isearcher.doc(hits[i].doc);
                System.out.println("找到包含“" + text + "”的文件信息如下:");
                System.out.println("文件名:" + hitDoc.get("filename"));
                System.out.println("内容:" + hitDoc.get("content"));
                System.out.println("文件路径:" + hitDoc.get("path"));
                System.out.println("____________________________");
            }
        } catch (Exception e) {
            log.error("查找索引异常:" + e.getMessage(), e);
        } finally {
            try {
                if (ireader != null) {
                    ireader.close();
                }
                if (directory != null) {
                    directory.close();
                }
            } catch (Exception e) {
                log.error("关闭查找索引IO异常:" + e.getMessage(), e);
            }
        }
        Date date2 = new Date();
        System.out.println("查看索引-----耗时：" + (date2.getTime() - date1.getTime()) + "ms");
    }

    /**
     * Test entry point.
     *
     * @param args not used
     */
    public static void main(String[] args) {
        searchIndex("中国");
    }
}

The output is as follows:

查找关键字“中国”开始......
找到包含“中国”的文件信息如下:
文件名:2.docx
内容:中国国庆节

文件路径:D:\luceneData\2.docx
____________________________
找到包含“中国”的文件信息如下:
文件名:3.xlsx
内容:中国 国家 篮球队 
文件路径:D:\luceneData\3.xlsx
____________________________
找到包含“中国”的文件信息如下:
文件名:1.txt
内容:中国人民解放军
文件路径:D:\luceneData\1.txt
____________________________
查看索引-----耗时：412ms

Process finished with exit code 0
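
All three files match, and part of the reason is how the text is tokenized. To my understanding, `StandardAnalyzer` in Lucene 4.0 emits each Han character as its own token, and the classic `QueryParser` by default combines the tokens of an unquoted query with OR, so a query for “中国” behaves roughly like “中 OR 国”. The plain-Java sketch below imitates that behavior to show the consequence: a document containing only “国家” would also match. It does not call Lucene; `UnigramMatchDemo` and its methods are invented for illustration, and the strings use Unicode escapes (`\u4e2d\u56fd` is “中国”, `\u56fd\u5bb6` is “国家”).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; it imitates single-character tokenization and does not use Lucene.
public class UnigramMatchDemo {
    // One token per non-whitespace character (code point).
    public static Set<String> unigrams(String text) {
        Set<String> tokens = new HashSet<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // OR semantics: the document matches if it shares at least one token with the query.
    public static boolean matchesAny(String document, String query) {
        Set<String> shared = unigrams(document);
        shared.retainAll(unigrams(query));
        return !shared.isEmpty();
    }

    // AND semantics: the document must contain every token of the query.
    public static boolean matchesAll(String document, String query) {
        return unigrams(document).containsAll(unigrams(query));
    }

    public static void main(String[] args) {
        String query = "\u4e2d\u56fd";   // the keyword from the search example
        String partial = "\u56fd\u5bb6"; // shares only one character with the query
        System.out.println(matchesAny(partial, query)); // prints true
        System.out.println(matchesAll(partial, query)); // prints false
    }
}
```

If word-level matching is wanted, the `lucene-analyzers-smartcn` dependency declared in step 1 provides `SmartChineseAnalyzer`, which segments Chinese text into words. That is an option to evaluate rather than something the listings in this article use.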
