全文检索工具——Lucene学习

写在前面

Lucene是Apache旗下的一个开源的项目，主要用于全文检索。所谓的全文检索，就是将非结构化数据中的一部分信息提取出来，重新组织，使其变得有一定结构，然后对此有一定结构的数据进行搜索，从而达到搜索相对较快的目的。这种先建立索引，再对索引进行搜索的过程就叫全文检索。全文检索的应用场景十分广泛，基本上数据量上了一个等级后的查询都可以用到全文检索。这里就先来学习一下最基本的全文检索API——Lucene

Lucene的原理

我们来看一张图：

获得原始文档

首先我们需要获得原始文档，一般都通过爬虫来进行实现。

创建文档对象

在获得了原始文档后，我们要创建文档对象，文档对象（Document），文档对象主要包含了一个一个的域（Field），域中存储着内容。我们可以这样理解：磁盘上的一个文件就可以当成一个document对象，document中包含着一些Field,如图：

需要注意的是，每个Document可以有多个Field,不同的Document有不同的Field，同一个Document可以有相同的Field.

分析文档

这个步骤主要是把文档提取单词、字母转换大小写、去除标点符号和停用词。如这一句话："Spring is a very cool thing"，经过处理后会获得:spring very cool thing ，其中is和a这种词没有意义，都会被自动分析掉。每个单词就叫做一个Term,不同的域拆出来的相同的单词是不同的Term,Term中包含两部分，一部分是文档的域名，一部分是单词的内容。

创建索引

这里使用了倒排索引结构，如下图：

倒排索引结构也叫反向索引结构，包括索引和文档两部分，索引即词汇表，它的规模较小，而文档集合较大。

查询索引

首先，由用户输入查询内容，即查询接口。然后创建一个查询对象，查询对象中可以指定好要查询的Field,查询的关键字等，查询对象可以生成具体的查询语法。之后就可以执行查询了，在查询返回结果后，我们对结果进行一些处理和美化就可以输出给用户了。

使用Lucene

首先要导入jar包：

这里还顺便导入了后面要说的中文分词器，不用理会。
我们直接来看测试代码:

 @Test
    public void createIndex() throws Exception{
        /**
         * 1.创建一个Directory对象，指定索引库保存的位置
         * 2. 基于Directory对象创建一个IndexWriter对象
         * 3. 读取磁盘上的文件，对应每个文件创建一个文档对象
         * 4. 向文档对象添加域
         * 5. 向文档对象写入索引库
         * 6. 关闭IndexWriter对象
         */
        // 把索引库保存在内存中
        // Directory directory = new RAMDirectory();
        // 把索引库保存在磁盘
        Directory directory = FSDirectory.open(new File("F:\\ideaWorkSpace\\luceneTest\\index").toPath());
        //
        // 使用自定义分析器
        Analyzer analyzer = new IKAnalyzer();
        IndexWriter indexWriter = new IndexWriter(directory,new IndexWriterConfig(analyzer));
        //
        File dir = new File("F:\\JAVA学习\\黑马57期 idea\\10 Lucene\\lucene\\02.参考资料\\searchsource");
        File[] files = dir.listFiles();
        for (File file : files) {
            // 取文件名
            String fileName = file.getName();
            // 文件的路径
            String filePath = file.getPath();
            // 文件的内容
            String fileContent = FileUtils.readFileToString(file, "utf-8");
            // 文件的大小
            long fileSize = FileUtils.sizeOf(file);
            // 创建Field
            // 参数1:域的名称 2:域的内容 3.是否存储
            Field fieldName=  new TextField("name",fileName,Field.Store.YES);
//            Field fieldPath = new TextField("path",filePath, Field.Store.YES);
            Field fieldPath = new StoredField("path",filePath);
            Field fieldContent = new TextField("content",fileContent,Field.Store.YES);
//            Field fieldSize = new TextField("size",fileSize+"", Field.Store.YES);
            Field fieldSizeValue = new LongPoint("size",fileSize);
            Field fieldSizeStore = new StoredField("size",fileSize);
            // 创建文档对象
            Document document = new Document();
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
//            document.add(fieldSize);
            document.add(fieldSizeValue);
            document.add(fieldSizeStore);
            // 文档对象写入索引库
            indexWriter.addDocument(document);
        }
        // 关闭indexWriter对象
        indexWriter.close();
    }

在创建好后，我们就可以进行查询了:

    @Test
    public void searchIndex() throws  Exception{
        // 1.创建一个Directory对象，指定索引库的位置
        Directory directory = FSDirectory.open(new File("F:\\ideaWorkSpace\\luceneTest\\index").toPath());
        // 2.创建一个IndexReader对象
        IndexReader indexReader = DirectoryReader.open(directory);
        // 3.创建一个IndexSearcher对象
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // 4.创建一个Query对象
        Query query = new TermQuery(new Term("name","spring"));
        // 5.执行查询 10为查询结果返回的最大条数
        TopDocs topDocs = indexSearcher.search(query, 10);
        // 6.取查询结果的总记录
        System.out.println("查询总记录数:"+topDocs.totalHits);
        // 7.取文档列表
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            // 取文档id
            int docId = scoreDoc.doc;
            // 根据id取文档对象
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
            //System.out.println(document.get("content"));
            System.out.println("----------------分割线");
        }
        // 9. 关闭indexReader
        indexReader.close();
    }

关于查询，有很多方法：

   private IndexReader indexReader;
    private IndexSearcher indexSearcher;
    @Before
    public void init() throws Exception{
        indexReader = DirectoryReader.open(FSDirectory.open(new File("F:\\ideaWorkSpace\\luceneTest\\index").toPath()));
        indexSearcher = new IndexSearcher(indexReader);
    }
    @Test
    public void testRangeQuery() throws Exception{
        // 创建一个Query对象
        Query query = LongPoint.newRangeQuery("size", 0L, 100L);
        printResult(query);
    }

    private void printResult(Query query) throws IOException {
        // 执行查询
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("总记录数为:"+topDocs.totalHits);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            // 取文档id
            int docId = scoreDoc.doc;
            // 根据id取文档对象
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
            //System.out.println(document.get("content"));
            System.out.println("----------------分割线");
        }
        indexReader.close();
    }

    @Test
    public void testQueryParser() throws Exception{
        // 创建一个QueryParser对象 两个参数
        QueryParser queryParser = new QueryParser("name",new IKAnalyzer());
        // 参数1，默认搜索域，参数2：分析器
        // 使用QueryParser对象创建一个Query对象
        Query query = queryParser.parse("lucene是一个java开发的全文检索工具包");
        // 执行查询
        printResult(query);
    }

索引库的维护

直接看代码:

private IndexWriter indexWriter;
    @Before
    public void init() throws Exception{
        // 创建一个IndexWriter对象 使用IKAnalyzer作为分析器
       indexWriter = new IndexWriter(FSDirectory.open(new File("F:\\ideaWorkSpace\\luceneTest\\index").toPath()),
                new IndexWriterConfig(new IKAnalyzer()));
    }
    @Test
    public void addDocument() throws Exception{

        // 创建一个Document对象
        Document document = new Document();
        // 向document对象中添加域
        document.add(new TextField("name","新添加的文件", Field.Store.YES));
        document.add(new TextField("content","新添加的内容", Field.Store.NO));
        document.add(new StoredField("path","F:/ideaWorkSpace/luceneTest"));
        // 把文档写入索引库
        indexWriter.addDocument(document);
        // 关闭索引库
        indexWriter.close();
    }

    @Test
    public void deleteAllDocument() throws Exception{
        // 删除全部文档
        indexWriter.deleteAll();
        // 关闭索引库
        indexWriter.close();
    }

    @Test
    public void deleteDocumentByQuery() throws Exception{
        indexWriter.deleteDocuments(new Term("name","apache"));
        indexWriter.close();
    }

    @Test
    public void updateDocument() throws Exception{
        // 创建一个新的文档对象
        Document document = new Document();
        // 向文档对象中添加域
        document.add(new TextField("name","更新之后的文档",Field.Store.YES));
        document.add(new TextField("name","更新之后的文档1",Field.Store.YES));
        document.add(new TextField("name","更新之后的文档2",Field.Store.YES));
        // 更新
        indexWriter.updateDocument(new Term("name","spring"),document);
        // 关闭索引库
        indexWriter.close();
    }

总结

可以看到，使用Lucene还是比较麻烦的，现在市面上大部分都使用了类似ElasticSearch这种基于lucene的搜索服务器。接下来会学习ElasticSearch的使用方法。