Lucene from Beginner to Advanced (version 6.6.0)

Lucene study notes


Preface

These notes are based on the latest Lucene 6.6.0. Many older methods are deprecated and no longer apply, so this guide tries to take the simplest possible route for getting started.

The examples in chapter 2 are the official demos. They are well written and thorough, but ship without a single comment; every comment in them was added by me, so some interpretations may be wrong. Please bear with me, and feel free to report any mistaken annotations back to me.

From chapter 3 onward the examples are my own. They are simple and easy to follow; I suggest starting directly from chapter 3.

1   Preparation

1.1 Getting-started documentation

Documentation: http://lucene.apache.org/core/6_6_0/index.html

The official examples can be followed along with this documentation.

1.2 Development documentation

       Lucene core API documentation: http://lucene.apache.org/core/6_6_0/core/index.html

1.3 Maven dependencies

Import the jars required to use Lucene:

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>6.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>6.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>6.6.0</version>
</dependency>
<!-- official demo examples -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-demo</artifactId>
  <version>6.6.0</version>
</dependency>

1.4 Luke

Luke is a dedicated index inspection tool for Lucene.

GitHub: https://github.com/DmitryKey/luke

Installation steps:

  1. Clone the repository.
  2. Run mvn install from the project directory. (Make sure you have Java and Maven installed before doing this.)
  3. Use luke.sh or luke.bat for launching Luke from the command line, depending on your OS.

(Alternatively, for older versions of Luke you can directly download the jar file from the releases page and run it with the command java -jar luke-with-deps.jar.)

2 Getting started

2.1 IndexFiles

The official example IndexFiles.java creates a Lucene index.

The class needs arguments supplied to its main method. There are three ways to pass them; only one is shown here: using IDEA, enter the following program arguments in the run configuration:

2.1.1 Contents of Test.txt:

numberA

numberB

number 范德萨 jklj

test

你好

不错啊

2.1.2 Code

package com.bingo.backstage;
  
  
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.FSDirectory;
  
  /**
 * Created by MoSon on 2017/6/30.
 */
  public class IndexFiles {
    private IndexFiles() {
    }
  
    public static void main(String[] args) {
        //when running, supply arguments such as: -docs <path to your files>
        String usage = "java com.bingo.backstage.IndexFiles [-index INDEX_PATH] [-docs DOCS_PATH] [-update]\n\n" +
                "This indexes the documents in DOCS_PATH, creating a Lucene index in INDEX_PATH that can be searched with SearchFiles";
        String indexPath = "index";
        String docsPath = null;
        boolean create = true;
  
        // simple linear scan over the command-line flags
        for(int i = 0; i < args.length; ++i) {
            if("-index".equals(args[i])) {
                indexPath = args[i + 1];
                ++i;
            } else if("-docs".equals(args[i])) {
                docsPath = args[i + 1];
                ++i;
            } else if("-update".equals(args[i])) {
                create = false;
            }
        }
  
        if(docsPath == null) {
            System.err.println("Usage: " + usage);
            System.exit(1);
        }
  
        Path docDir = Paths.get(docsPath);
        if(!Files.isReadable(docDir)) {
            System.out.println("Document directory '" + docDir.toAbsolutePath() + "' does not exist or is not readable, please check the path");
            System.exit(1);
        }

        Date start = new Date();

        try {
            System.out.println("Indexing to directory '" + indexPath + "'...");
            // open the directory that will hold the index files
            FSDirectory dir = FSDirectory.open(Paths.get(indexPath));
            StandardAnalyzer analyzer = new StandardAnalyzer();
            IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
            if(create) {
                // CREATE: wipe any existing index and start fresh
                iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
            } else {
                // CREATE_OR_APPEND: add to an existing index, creating it if absent
                iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            }

            IndexWriter writer = new IndexWriter(dir, iwc);
            indexDocs(writer, docDir);
            writer.close();
            Date end = new Date();
            System.out.println(end.getTime() - start.getTime() + " total milliseconds");
        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
        }

    }
  
    static void indexDocs(final IndexWriter writer, Path path) throws IOException {
        if(Files.isDirectory(path)) {
            // recurse into the directory tree; the type parameter <Path> is required,
            // otherwise the anonymous visitFile(Path, ...) below would not override anything
            Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    try {
                        IndexFiles.indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
                    } catch (IOException ignore) {
                        // files that cannot be read are simply skipped
                    }

                    return FileVisitResult.CONTINUE;
                }
            });
        } else {
            indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
        }
  
    }
  
    static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
        // try-with-resources closes the stream automatically
        try (InputStream stream = Files.newInputStream(file)) {
            Document doc = new Document();
            // the path is stored verbatim and not tokenized, so it can act as a unique key
            doc.add(new StringField("path", file.toString(), Field.Store.YES));
            // index the last-modified time as a point value, usable in range queries
            doc.add(new LongPoint("modified", lastModified));
            // the file body is tokenized but not stored
            doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));
            if(writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
                // fresh index: no old copy can exist, so a plain add is enough
                System.out.println("adding " + file);
                writer.addDocument(doc);
            } else {
                // existing index: replace any previous document with the same path
                System.out.println("updating " + file);
                writer.updateDocument(new Term("path", file.toString()), doc);
            }
        }
    }
}

2.1.3 Running it

A directory is generated automatically under the project root to hold the index.


Viewing the result with Luke:


Note that the Chinese text was not added to the index.

2.1.4 Analysis

The IndexFiles class creates a Lucene index.

The main() method parses the command-line arguments, then prepares to instantiate an IndexWriter by opening a Directory and instantiating a StandardAnalyzer and an IndexWriterConfig.

The value of the -index command-line argument is the name of the filesystem directory where all index information should be stored. If IndexFiles is invoked with a relative path for -index, or if -index is omitted so that the default relative path "index" is used, the index path is created as a subdirectory of the current working directory (if it does not already exist). On some platforms the index path may instead be created in a different directory (such as the user's home directory).

The value of the -docs command-line argument is the location of the directory containing the files to be indexed.

The -update command-line argument tells IndexFiles not to delete the index if it already exists. When -update is not given, IndexFiles wipes the slate clean before indexing any documents.

Lucene uses a Directory to store the information in the index. Besides the FSDirectory implementation we use, there are several other Directory subclasses that can write to RAM, to databases, and so on.

A Lucene Analyzer is a processing pipeline that breaks text into indexed tokens, also called terms, and optionally performs other operations on those tokens, such as downcasing, synonym insertion, and filtering out unwanted tokens. The Analyzer we use is StandardAnalyzer, which applies the word-break rules of the Unicode text segmentation algorithm specified in Unicode Standard Annex #29, converts tokens to lowercase, and then filters out stop words. Stop words are common language words such as articles (a, an, and so on) and other tokens that may carry little search value. Note that every language has different rules, and you should use an appropriate analyzer for each language.

The IndexWriterConfig instance holds all configuration for the IndexWriter. For example, we set its OpenMode based on whether the -update command-line argument was given.

Looking further down in the file, after the IndexWriter is instantiated you will find the indexDocs() code. This recursive function walks the directory tree and creates Document objects. A Document is simply a data object representing the text content of a file along with its creation time and location. These instances are added to the IndexWriter. If the -update command-line argument is given, the IndexWriterConfig OpenMode is set to OpenMode.CREATE_OR_APPEND, and instead of simply adding documents to the index, the IndexWriter updates them: it tries to find an already-indexed document with the same identifier (in our case the file path is the identifier), deletes it from the index if present, and then adds the new document to the index.
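The demo's flag handling boils down to the linear scan described above. A stdlib-only sketch (the class name and return shape are mine, not part of the demo):

```java
// Sketch of the demo's -index/-docs/-update flag scan.
// Returns {indexPath, docsPath, createAsString} for easy inspection.
public class ArgScan {
    public static String[] parse(String[] args) {
        String indexPath = "index";   // default, as in the demo
        String docsPath = null;
        boolean create = true;        // -update flips this to append mode
        for (int i = 0; i < args.length; i++) {
            if ("-index".equals(args[i])) {
                indexPath = args[++i];        // flag consumes the next argument
            } else if ("-docs".equals(args[i])) {
                docsPath = args[++i];
            } else if ("-update".equals(args[i])) {
                create = false;               // bare flag, no value
            }
        }
        return new String[]{indexPath, docsPath, String.valueOf(create)};
    }

    public static void main(String[] args) {
        String[] r = parse(new String[]{"-docs", "H:\\test", "-update"});
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // index H:\test false
    }
}
```

The same pattern (a flag consuming the following array slot) is reused verbatim by SearchFiles below.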

2.2 SearchFiles

Searching the indexed files.

2.2.1 Code

package com.bingo.backstage;
  
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
  
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Date;
  
  /**
 * Created by MoSon on 2017/6/30.
 */
  public class SearchFiles {
    private SearchFiles() {
    }
  
    public static void main(String[] args) throws Exception {
        String usage = "Usage:\tjava com.bingo.backstage.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-query string] [-raw] [-paging hitsPerPage]\n\nSee http://lucene.apache.org/core/6_6_0/demo/ for details.";
        if(args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
            System.out.println(usage);
            System.exit(0);
        }
  
        String index = "index";
        String field = "contents";
        String queries = null;
        int repeat = 0;
        boolean raw = false;
        String queryString = null;
        int hitsPerPage = 10;
  
        for(int reader = 0; reader < args.length; ++reader) {
            if("-index".equals(args[reader])) {
                index = args[reader + 1];
                ++reader;
            } else if("-field".equals(args[reader])) {
                field = args[reader + 1];
                ++reader;
            } else if("-queries".equals(args[reader])) {
                queries = args[reader + 1];
                ++reader;
            } else if("-query".equals(args[reader])) {
                queryString = args[reader + 1];
                ++reader;
            } else if("-repeat".equals(args[reader])) {
                repeat = Integer.parseInt(args[reader + 1]);
                ++reader;
            } else if("-raw".equals(args[reader])) {
                raw = true;
            } else if("-paging".equals(args[reader])) {
                hitsPerPage = Integer.parseInt(args[reader + 1]);
                if(hitsPerPage <= 0) {
                    System.err.println("There must be at least 1 hit per page.");
                    System.exit(1);
                }
  
                ++reader;
            }
        }
  
        //open the index directory
        DirectoryReader var18 = DirectoryReader.open(FSDirectory.open(Paths.get(index, new String[0])));
        IndexSearcher searcher = new IndexSearcher(var18);
        StandardAnalyzer analyzer = new StandardAnalyzer();
        BufferedReader in = null;
        if(queries != null) {
            in = Files.newBufferedReader(Paths.get(queries, new String[0]), StandardCharsets.UTF_8);
        } else {
            in = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
        }
  
        QueryParser parser = new QueryParser(field, analyzer);
  
        do {
            if(queries == null && queryString == null) {
                System.out.println("Enter query: ");
            }
  
            String line = queryString != null?queryString:in.readLine();
            if(line == null || line.length() == -1) {
                break;
            }
  
            line = line.trim();
            if(line.length() == 0) {
                break;
            }
  
            Query query = parser.parse(line);
            System.out.println("Searching for: " + query.toString(field));
            if(repeat > 0) {
                Date start = new Date();
  
                for(int end = 0; end < repeat; ++end) {
                    searcher.search(query, 100);
                }
  
                Date var19 = new Date();
                System.out.println("Time: " + (var19.getTime() - start.getTime()) + "ms");
            }
  
            doPagingSearch(in, searcher, query, hitsPerPage, raw, queries == null && queryString == null);
        } while(queryString == null);
  
        var18.close();
    }
  
    public static void doPagingSearch(BufferedReader in, IndexSearcher searcher, Query query, int hitsPerPage, boolean raw, boolean interactive) throws IOException {
        TopDocs results = searcher.search(query, 5 * hitsPerPage);
        ScoreDoc[] hits = results.scoreDocs;
        int numTotalHits = results.totalHits;
        System.out.println(numTotalHits + " total matching documents");
        int start = 0;
        int end = Math.min(numTotalHits, hitsPerPage);
  
        while(true) {
            if(end > hits.length) {
                System.out.println("Only results 1 - " + hits.length + " of " + numTotalHits + " total matching documents collected.");
                System.out.println("Collect more (y/n) ?");
                String quit = in.readLine();
                if(quit.length() == 0 || quit.charAt(0) == 'n') {
                    break;
                }
  
                hits = searcher.search(query, numTotalHits).scoreDocs;
            }
  
            end = Math.min(hits.length, start + hitsPerPage);
  
            for(int var15 = start; var15 < end; ++var15) {
                if(raw) {
                    System.out.println("doc=" + hits[var15].doc + " score=" + hits[var15].score);
                } else {
                    Document line = searcher.doc(hits[var15].doc);
                    String page = line.get("path");
                    if(page != null) {
                        System.out.println(var15 + 1 + ". " + page);
                        String title = line.get("title");
                        if(title != null) {
                            System.out.println("   Title: " + line.get("title"));
                        }
                    } else {
                        System.out.println(var15 + 1 + ". No path for this document");
                    }
                }
            }
  
            if(!interactive || end == 0) {
                break;
            }
  
            if(numTotalHits >= end) {
                boolean var16 = false;
  
                while(true) {
                    System.out.print("Press ");
                    if(start - hitsPerPage >= 0) {
                        System.out.print("(p)revious page, ");
                    }
  
                    if(start + hitsPerPage < numTotalHits) {
                        System.out.print("(n)ext page, ");
                    }
  
                    System.out.println("(q)uit or enter number to jump to a page.");
                    String var17 = in.readLine();
                    if(var17.length() == 0 || var17.charAt(0) == 'q') {
                        var16 = true;
                        break;
                    }
  
                    if(var17.charAt(0) == 'p') {
                        start = Math.max(0, start - hitsPerPage);
                        break;
                    }
  
                    if(var17.charAt(0) == 'n') {
                        if(start + hitsPerPage < numTotalHits) {
                            start += hitsPerPage;
                        }
                        break;
                    }
  
                    int var18 = Integer.parseInt(var17);
                    if((var18 - 1) * hitsPerPage < numTotalHits) {
                        start = (var18 - 1) * hitsPerPage;
                        break;
                    }
  
                    System.out.println("No such page");
                }
  
                if(var16) {
                    break;
                }
  
                end = Math.min(numTotalHits, start + hitsPerPage);
            }
        }
  
    }
}

2.2.2 Running it

The results match what the Luke tool showed above: a query only finds a document when the term actually exists in the index.


2.2.3 Analysis

The SearchFiles class collaborates with an IndexSearcher, a StandardAnalyzer (the same one used in the IndexFiles class), and a QueryParser. The query parser is constructed with an analyzer used to interpret the query text the same way the documents were interpreted: finding word boundaries, downcasing, and removing useless words such as "a", "an" and "the". The Query object contains the result from the QueryParser, which is passed to the searcher. Note that it is also possible to build a rich Query object programmatically without using the query parser; the query parser merely decodes the Lucene query syntax into the corresponding Query object.

SearchFiles uses the IndexSearcher.search(query, n) method, which returns at most n hits. The results are printed in pages, sorted by score (i.e. relevance).
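The paging in doPagingSearch is plain window arithmetic over the hits array. A stdlib sketch of just that computation (the class and method names are mine):

```java
// Paging window over a hit list: maps a 1-based page number
// to a half-open [start, end) range into the hits array.
public class Paging {
    public static int[] window(int totalHits, int hitsPerPage, int page) {
        int start = (page - 1) * hitsPerPage;
        // the last page may be shorter than hitsPerPage
        int end = Math.min(totalHits, start + hitsPerPage);
        return new int[]{start, end};
    }

    public static void main(String[] args) {
        int[] w = window(23, 10, 3);              // third page of 23 hits
        System.out.println(w[0] + ".." + w[1]);   // 20..23
    }
}
```

doPagingSearch additionally re-runs the search with a larger n when the user pages past the hits collected so far, but the index math is exactly this.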

2.3 SimpleSortedSetFacetsExample

A simple example, somewhat easier to understand than the previous two demos.

It shows basic faceted indexing and search using SortedSetDocValuesFacetField and SortedSetDocValuesFacetCounts.

The code below contains comments; reading them alongside the code should make it easier to follow.

2.3.1 Code

package com.bingo.backstage.facet;
  
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
  
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.facet.DrillDownQuery;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.sortedset.DefaultSortedSetDocValuesReaderState;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
  
  /**
 * Created by MoSon on 2017/6/30.
 */
  public class SimpleSortedSetFacetsExample {
    //RAMDirectory: a memory-resident Directory implementation; the default lock factory is SingleInstanceLockFactory.
    private final Directory indexDir = new RAMDirectory();
    private final FacetsConfig config = new FacetsConfig();
  
    public SimpleSortedSetFacetsExample() {
    }
  
    private void index() throws IOException {
        // initialize the index writer
        // WhitespaceAnalyzer only splits on whitespace: no lowercasing, no Chinese support,
        //   and no further normalization of the generated tokens
        // OpenMode CREATE overwrites any existing index; APPEND would add to it
        // IndexWriter creates and maintains the index
        IndexWriter indexWriter = new IndexWriter(this.indexDir, (new IndexWriterConfig(new WhitespaceAnalyzer())).setOpenMode(OpenMode.CREATE));
        // build a document
        Document doc = new Document();
        // create Field objects and add them to the doc
        doc.add(new SortedSetDocValuesFacetField("Author", "Bob"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2010"));
        // hand the document to the IndexWriter
        indexWriter.addDocument(this.config.build(doc));
        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Lisa"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2010"));
        indexWriter.addDocument(this.config.build(doc));
        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Lisa"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2012"));
        indexWriter.addDocument(this.config.build(doc));
        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Susan"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "2012"));
        indexWriter.addDocument(this.config.build(doc));
        doc = new Document();
        doc.add(new SortedSetDocValuesFacetField("Author", "Frank"));
        doc.add(new SortedSetDocValuesFacetField("Publish Year", "1999"));
        indexWriter.addDocument(this.config.build(doc));
        indexWriter.close();
    }
  
    // search and gather facet counts over the documents
    private List<FacetResult> search() throws IOException {
        // mostly one layer wrapped in another
        // DirectoryReader: a CompositeReader implementation that reads an index from a Directory
        DirectoryReader indexReader = DirectoryReader.open(this.indexDir);
        // performs searches through an IndexReader
        IndexSearcher searcher = new IndexSearcher(indexReader);
        DefaultSortedSetDocValuesReaderState state = new DefaultSortedSetDocValuesReaderState(indexReader);
        // collects hits for later faceting: once a search has run and hits are collected,
        // a Facets subclass can be instantiated to do the facet counting
        FacetsCollector fc = new FacetsCollector();
        // utility method: runs the search and also gathers all hits into the provided Collector
        FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
        // counts facets over all of the supplied hits
        SortedSetDocValuesFacetCounts facets = new SortedSetDocValuesFacetCounts(state, fc);
        ArrayList results = new ArrayList();
        // getTopChildren: returns the top child labels under the given path
        results.add(facets.getTopChildren(10, "Author", new String[0]));
        results.add(facets.getTopChildren(10, "Publish Year", new String[0]));
        indexReader.close();
        return results;
    }
  
    private FacetResult drillDown() throws IOException {
        DirectoryReader indexReader = DirectoryReader.open(this.indexDir);
        IndexSearcher searcher = new IndexSearcher(indexReader);
        DefaultSortedSetDocValuesReaderState state = new DefaultSortedSetDocValuesReaderState(indexReader);
        DrillDownQuery q = new DrillDownQuery(this.config);
        // add a drill-down constraint
        q.add("Publish Year", new String[]{"2012"});
        FacetsCollector fc = new FacetsCollector();
        FacetsCollector.search(searcher, q, 10, fc);
        SortedSetDocValuesFacetCounts facets = new SortedSetDocValuesFacetCounts(state, fc);
        // fetch the matching authors
        FacetResult result = facets.getTopChildren(10, "Author", new String[0]);
        indexReader.close();
        return result;
    }
  
    public List<FacetResult> runSearch() throws IOException {
        this.index();
        return this.search();
    }
  
    public FacetResult runDrillDown() throws IOException {
        this.index();
        return this.drillDown();
    }
  
    public static void main(String[] args) throws Exception {
        System.out.println("Facet counting example:");
        System.out.println("-----------------------");
        SimpleSortedSetFacetsExample example = new SimpleSortedSetFacetsExample();
        List results = example.runSearch();
        System.out.println("Author: " + results.get(0));
        System.out.println("Publish Year: " + results.get(1));
        System.out.println("\n");
        System.out.println("Facet drill-down example (Publish Year/2012):");
        System.out.println("---------------------------------------------");
        System.out.println("Author: " + example.runDrillDown());
    }
}
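What SortedSetDocValuesFacetCounts produces for this data can be checked by hand: tally each dimension's values across the five documents. A plain-Java tally of the "Author" dimension (no Lucene involved, just the expected arithmetic):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hand tally of the example's Author facet: Lisa appears twice, the others once.
public class FacetTally {
    public static Map<String, Integer> countAuthors() {
        // the Author values of the five documents indexed above, in order
        String[] authors = {"Bob", "Lisa", "Lisa", "Susan", "Frank"};
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String a : authors) counts.merge(a, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countAuthors()); // {Bob=1, Lisa=2, Susan=1, Frank=1}
    }
}
```

The drill-down result works the same way, except that only documents matching Publish Year/2012 are tallied (Lisa and Susan, one each).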

2.3.2 Running it


3 Quick start

3.1 Creating an index

This is an example I wrote myself; it is easy to follow.

It simply adds content to the index.

3.1.1 Code

package com.bingo.backstage;
  
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LegacyLongField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
  
  
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
  
import static org.apache.lucene.document.TextField.TYPE_STORED;
  
  /**
 * Created by MoSon on 2017/6/30.
 */
  public class CreateIndex {
  
    public static void main(String[] args) throws IOException {
        // set up the IndexWriter
        // "index" is a path relative to the current project
        Path path = FileSystems.getDefault().getPath("", "index");
        Directory directory = FSDirectory.open(path);
        // define the analyzer
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer).setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

        // define a document
        Document document = new Document();
        // define the document's fields
        document.add(new LegacyLongField("id", 5499, Field.Store.YES));
        document.add(new Field("title", "小米6", TYPE_STORED));
        document.add(new Field("sellPoint", "骁龙835,6G内存,双摄!", TYPE_STORED));
        // write the document
        indexWriter.addDocument(document);
        // add another document
        document = new Document();
        document.add(new LegacyLongField("id", 8324, Field.Store.YES));
        document.add(new Field("title", "OnePlus5", TYPE_STORED));
        document.add(new Field("sellPoint", "8核,8G运行内存", TYPE_STORED));
        indexWriter.addDocument(document);
        // commit
        indexWriter.commit();
        // close
        indexWriter.close();
  
    }
   
  
}

3.1.2 Result

Below is the result viewed with Luke.


3.2 Tokenized search

Query for the content matching a condition.

3.2.1 Code

package com.bingo.backstage;
  
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
  
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
  
  /**
 * Created by MoSon on 2017/7/1.
 */
  public class Search {
  
    public static void main(String[] args) throws IOException {
        // define the index directory
        Path path = FileSystems.getDefault().getPath("index");
        Directory directory = FSDirectory.open(path);
        // define the index reader
        IndexReader indexReader = DirectoryReader.open(directory);
        // define the searcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // define the term to query for (fill in the token to look up;
        // the empty string as written will match nothing)
        Term term = new Term("sellPoint","");
        Query query = new TermQuery(term);
        // take the top 10 hits
        TopDocs topDocs = indexSearcher.search(query,10);
        // print the hit count
        System.out.println("命中数:"+topDocs.totalHits);
        // pull out the hit documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // iterate over the hits
        for (ScoreDoc scoreDoc : scoreDocs){
            // fetch the stored document via indexSearcher.doc
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println("id:"+doc.get("id"));
            System.out.println("sellPoint:"+doc.get("sellPoint"));
        }

        // close the index reader
        indexReader.close();
    }
}

3.2.2 Running it

The matching results are queried and displayed.

4   Core API for creating a Lucene index

Directory  the directory the index lives in

Analyzer   the analyzer (tokenizer)

Document  a document object inside the index

IndexableField  a data field inside a document

IndexWriterConfig  configuration for generating the index

IndexWriter  the object that generates the index

5   The IK analyzer

5.1 Download

Download a build of IKAnalyzer that works with this version of Lucene.

Link: http://download.csdn.net/detail/fanpei_moukoy/9796612

5.2 Basic usage

Use the IK analyzer to split Chinese text into meaningful words.

Usage: replace the StandardAnalyzer with an IKAnalyzer.


Result:

It recognizes and splits common words, but that is not yet enough; for example, "双摄像头" (dual camera) and "骁龙" (Snapdragon) are not recognized as words.


5.3 Custom dictionary

Create the configuration file:
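The original screenshot of the configuration is missing, so here is a typical IKAnalyzer.cfg.xml, placed on the classpath (e.g. src/main/resources); the dictionary file names follow IK's conventions and the ext.dic path used later in chapter 6:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- one or more extension dictionaries, semicolon separated -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- optional extension stop-word dictionaries -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
```

The extension dictionary itself is a plain UTF-8 text file with one word per line, for example:

```
骁龙
双摄像头
```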


Create the custom extension dictionary:


Tokenization result:


5.4 Paged search

Code:

package com.bingo.backstage;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;

/**
 * Created by MoSon on 2017/7/1.
 */
public class SearchPage {

    public static void main(String[] args) throws IOException, ParseException {
        // define the index directory
        Path path = FileSystems.getDefault().getPath("index");
        Directory directory = FSDirectory.open(path);
        // define the index reader
        IndexReader indexReader = DirectoryReader.open(directory);
        // define the searcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // search keyword
        String keyWords = "内存";

        // paging parameters
        Integer page = 1;
        Integer pageSize = 20;
        Integer start = (page - 1) * pageSize;
        Integer end = start + pageSize;

        // fuzzy search: parse the keyword with the IK analyzer
        Query query = new QueryParser("sellPoint", new IKAnalyzer()).parse(keyWords);

        // fetch hits up to the end of the requested page
        TopDocs topDocs = indexSearcher.search(query, end);

        // total pages: add one page when the hits do not divide evenly
        Integer totalPage = ((topDocs.totalHits % pageSize) == 0)
                ? topDocs.totalHits / pageSize
                : ((topDocs.totalHits / pageSize) + 1);

        System.out.println("“" + keyWords + "”搜索到" + topDocs.totalHits
                + "条数据,页数:" + page + "/" + totalPage);
        // print the hit count
        System.out.println("命中数:" + topDocs.totalHits);
        // pull out the hit documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        int length = scoreDocs.length > end ? end : scoreDocs.length;
        // iterate over the current page
        for (int i = start; i < length; i++) {
            ScoreDoc doc = scoreDocs[i];
            System.out.println("得分:" + doc.score);
            Document document = indexSearcher.doc(doc.doc);
            System.out.println("ID:" + document.get("id"));
            System.out.println("sellPoint:" + document.get("sellPoint"));
            System.out.println("-----------------------");
        }

        // close the index reader
        indexReader.close();
    }
}
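The page-count expression above is easy to get wrong (the ternary must test the remainder, not the quotient). Ceiling division avoids the branch entirely; a stdlib sketch (class name is mine):

```java
// Total pages for totalHits results at pageSize hits per page, via ceiling division.
public class PageCount {
    public static int totalPages(int totalHits, int pageSize) {
        // equivalent to (int) Math.ceil((double) totalHits / pageSize)
        return (totalHits + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        System.out.println(totalPages(21, 20)); // 2
        System.out.println(totalPages(40, 20)); // 2
    }
}
```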

Result:

 

6 Building and searching an index from a file

Import a million rows of data and build an index from them.

6.1 Creating the index

package com.bingo.backstage;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.*;
import java.nio.file.FileSystems;
import java.nio.file.Path;

import static org.apache.lucene.document.TextField.TYPE_STORED;

/**
 * Created by MoSon on 2017/7/4.
 */
public class ReadTxt {
    public static void main(String[] args) throws IOException {
        Path path = FileSystems.getDefault().getPath("", "index");
        String extPath = "H:\\IDEAWorkspace\\lucene\\src\\main\\resources\\ext.dic";
        Directory directory = FSDirectory.open(path);
        // define the analyzer
//        Analyzer analyzer = new StandardAnalyzer();
        Analyzer analyzer = new IKAnalyzer();
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer).setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

        String filePath = "H:\\myfile\\品高\\茂名全量地址20170401boss+.csv";
        FileInputStream fis = new FileInputStream(filePath);
        InputStreamReader isr = new InputStreamReader(fis, "GBK");
        BufferedReader br = new BufferedReader(isr);
        String content;
        String levelOne = "";
        String levelTwo = "";
        String levelThree = "";
        String levelFour = "";
        String levelFive = "";
        int i = 0;
      
/* while ((content = br.readLine()) != null){
            if (i == 1000) {
                break;
            }
            String[] split = content.split(",");
            String tempOne = "";
            String tempTwo = "";
            String tempThree = "";
            String tempFour = "";
            String tempFive = "";
            if (i == 1) {
                levelOne = split[2];
                levelTwo = split[3];
                levelThree = split[4];
                levelFour = split[5];
                levelFive = split[6];
            }

            tempOne = split[2];
            tempTwo = split[3];
            tempThree = split[4];
            tempFour = split[5];
            tempFive = split[6];

            StringBuilder sb = new StringBuilder();
            // use equals; when "" values may be present, avoid putting that operand first
            if (levelOne != null && levelOne != "" && tempOne!= "" && tempOne != null) {
                if(!tempOne.equals(levelOne)) {
                    sb.append("\n" + levelOne);
                    levelOne = tempOne;
                    System.out.println("11" + levelOne+tempOne);
                }
            }
            if (levelTwo != null && levelTwo != "" && tempTwo!= ""&& tempTwo != null) {
                if(!tempTwo.equals(levelTwo)) {
                    sb.append("\n" + levelTwo);
                    levelTwo = tempTwo;
                }
            }
            if (levelThree != null && levelThree != ""&& tempThree != ""&& tempThree != null) {
                if(!tempThree.equals(levelThree)) {
                    sb.append("\n" + levelThree);
                    levelThree = tempThree;
                }
            }
            if (levelFour != null && levelFour != ""&& tempFour != "" && tempFour != null) {
                if(!tempFour.equals(levelFour)) {
                    sb.append("\n" + levelFour);
                    levelFour = tempFour;
                }
            }
            if (levelFive != null && levelFive != "" && tempFive != "" && tempFive != null) {
                if(!tempFive.equals(levelFive)) {
                    sb.append("\n" + levelFive);
                    levelFive = tempFive;
                }
            }
            if(i == 422){
                System.out.println("address" + sb.toString()+tempFive+levelFive);
            }

//            System.out.println("address" + sb.toString()+tempFive+levelFive);
            if (sb != null){
                //
以追加的形式写入
                FileOutputStream fos = new FileOutputStream(extPath,true);
                OutputStreamWriter osr = new OutputStreamWriter(fos);
                BufferedWriter bw = new BufferedWriter(osr);
                bw.write(sb.toString(),0,sb.length());
                bw.close();
            }
            i++;
        }*/

       
long start = System.currentTimeMillis();
       
System.out.println("start:"+ start);
        while
((content = br.readLine()) != null) {
           
//第一行不记录
           
/*if(i == 0){
                continue;
            }*/
           /* if (i == 1000) {
                break;
            }*/

            //
定义文档
           
Document document = newDocument();
           
//读取每一行
//            System.out.println(content);
            
String[] split = content.split(",");
           
String id = split[0];
           
String address = split[1];


//            System.out.println(id + ":" + address);
           
document.add(newField("id",id,TYPE_STORED));
           
document.add(newField("address",address,TYPE_STORED));
           
indexWriter.addDocument(document);
           
i++;
       
}
       
long end = System.currentTimeMillis();
       
System.out.println("end:"+ end);
        float
time = end - start;
       
System.out.println("用时:"+ time);
       
//提交
       
indexWriter.commit();
       
//关闭
       
indexWriter.close();
       
br.close();
       
isr.close();
       
fis.close();
   
}
}

6.2 Results

At first it took about 100 seconds to build the index for one million entries. Later runs got steadily faster, probably because only two other applications were running, and finished in under a minute.


6.3 Fuzzy search

Searching for "茂名" matches everything, over a million entries, in a little over one second.

Result:


7 Getting an analyzer's tokenization result

7.1 Using the IK analyzer

Like Baidu does, we first tokenize the sentence to search, then search by the resulting keywords.

Code:

package com.bingo.backstage;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/**
 * Created by MoSon on 2017/7/5.
 */
public class AnalyzerResult {

    /**
     * Get the tokenization result of the given analyzer.
     * @param analyzeStr
     *            the string to tokenize
     * @param analyzer
     *            the analyzer to use
     * @return the list of tokens
     */
    public List<String> getAnalyseResult(String analyzeStr, Analyzer analyzer) {
        List<String> response = new ArrayList<String>();
        TokenStream tokenStream = null;
        try {
            //Returns a TokenStream suitable for fieldName, tokenizing the contents of the reader
            tokenStream = analyzer.tokenStream("address", new StringReader(analyzeStr));
            //The text of the current token
            CharTermAttribute attr = tokenStream.addAttribute(CharTermAttribute.class);
            //The consumer calls reset() before consuming with incrementToken().
            //It resets the stream to a clean state; stateful implementations must support this so they can be reused.
            tokenStream.reset();
            //Consumers (e.g. IndexWriter) use incrementToken() to advance the stream to the next token
            while (tokenStream.incrementToken()) {
                response.add(attr.toString());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (tokenStream != null) {
                try {
                    tokenStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return response;
    }

    public static void main(String[] args) {
        List<String> analyseResult = new AnalyzerResult().getAnalyseResult("茂名市信宜市丁堡镇丁堡镇片区丁堡街道181301", new IKAnalyzer());
        for (String result : analyseResult) {
            System.out.println(result);
        }
    }
}

Tokenization result:


7.2 Using the built-in CJK analyzer

Simply replace IKAnalyzer with CJKAnalyzer in the class above.


Tokenization result:


It basically splits the text into overlapping two-character tokens, which is not as good as the IK analyzer.
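That two-characters-at-a-time behavior can be illustrated with a small pure-JDK sketch. `CjkBigramSketch` and `cjkBigrams` are hypothetical names, and the real CJKAnalyzer also lowercases and treats non-CJK runs differently; this only mimics what it does within a run of CJK characters:

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigramSketch {
    //Mimics CJKAnalyzer's handling of a run of CJK characters:
    //emit every overlapping two-character window.
    static List<String> cjkBigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        //"茂名市信宜市" -> [茂名, 名市, 市信, 信宜, 宜市]
        System.out.println(cjkBigrams("茂名市信宜市"));
    }
}
```

Every character (except the last) starts a token, which is why CJK bigram indexes are larger and less precise than a dictionary-based analyzer like IK.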

8   Advanced

Putting the earlier pieces together: tokenize the query first, then search by the resulting keywords, with higher-similarity results returned first.

This uses a Boolean search.
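The SHOULD semantics used here can be sketched in plain Java before looking at the Lucene version: every clause is optional, but documents matching more clauses rank higher. `ShouldClauseSketch` and its match-counting scorer are illustrative assumptions only; Lucene's real ranking uses TF-IDF/BM25 over an inverted index, not a substring match count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ShouldClauseSketch {
    //Toy score: the number of query tokens the address contains.
    static int score(String address, List<String> tokens) {
        int hits = 0;
        for (String token : tokens) {
            if (address.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("信宜市", "丁堡镇", "181301");
        List<String> addresses = new ArrayList<String>(Arrays.asList(
                "茂名市信宜市丁堡镇丁堡街道181301",
                "茂名市信宜市某某路1号",
                "广州市天河区某某街"));
        //SHOULD: no single token is required, but more matches rank higher
        addresses.sort(Comparator.comparingInt((String a) -> score(a, tokens)).reversed());
        System.out.println(addresses);
    }
}
```

Note that even the zero-match address stays in the list here; in Lucene, a document must match at least one SHOULD clause (when there are no MUST clauses) to be a hit at all.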

Code:

package com.bingo.backstage;
  
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;
  
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
  
/**
 * Created by MoSon on 2017/7/5.
 */
public class BooleanSearchQuery {
    public static void main(String[] args) throws IOException, ParseException {
        long start = System.currentTimeMillis();
        System.out.println("start: " + start);
        //Define the index directory
        Path path = FileSystems.getDefault().getPath("index");
        Directory directory = FSDirectory.open(path);
        //Open the index reader
        IndexReader indexReader = DirectoryReader.open(directory);
        //Create the searcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        //Search content
        //Define the query terms
  
        //Boolean search over two fixed terms (term1/term2 are not defined here, so this stays commented out)
     /*   TermQuery termQuery1 = new TermQuery(term1);
        TermQuery termQuery2 = new TermQuery(term2);
        BooleanClause booleanClause1 = new BooleanClause(termQuery1, BooleanClause.Occur.MUST);
        BooleanClause booleanClause2 = new BooleanClause(termQuery2, BooleanClause.Occur.SHOULD);
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        builder.add(booleanClause1);
        builder.add(booleanClause2);
        BooleanQuery query = builder.build();*/
  
        /**
         * Advanced:
         * Boolean search over multiple keywords
         */
        //Build the list of Terms
        List<Term> termList = new ArrayList<Term>();
        //Get the tokenization result
        List<String> analyseResult = new AnalyzerResult().getAnalyseResult("信宜市1234ewrq13asd丁堡镇丁堡镇片区丁堡街道181301", new IKAnalyzer());
        for (String result : analyseResult){
            termList.add(new Term("address",result));
//            System.out.println(result);
        }
        //Build the list of TermQuery
        List<TermQuery> termQueries = new ArrayList<TermQuery>();
        //Wrap each Term in a TermQuery
        for(Term term : termList){
            termQueries.add(new TermQuery(term));
        }
        List<BooleanClause> booleanClauses = new ArrayList<BooleanClause>();
        //Wrap each TermQuery in an optional (SHOULD) clause
        for (TermQuery termQuery : termQueries){
            booleanClauses.add(new BooleanClause(termQuery, BooleanClause.Occur.SHOULD));
        }
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (BooleanClause booleanClause : booleanClauses){
            builder.add(booleanClause);
        }
        //Build the query
        BooleanQuery query = builder.build();

        //Take the top 20 hits
        TopDocs topDocs = indexSearcher.search(query,20);
        //Print the hit count
        System.out.println("hits: " + topDocs.totalHits);
        //Take the result documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        //Iterate over the results
        for (ScoreDoc scoreDoc : scoreDocs){
            float score = scoreDoc.score; //similarity score
            System.out.println("score: " + score);
            //Fetch the document through indexSearcher's doc method
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println("id:" + doc.get("id"));
            System.out.println("address:" + doc.get("address"));
        }

        //Close the index reader
        indexReader.close();
        long end = System.currentTimeMillis();
        System.out.println("end: " + end);
        long time = end - start;
        System.out.println("elapsed: " + time + " ms");
    }
  
  
  
  
    /**
     * Get the tokenization result of the given analyzer.
     * @param analyzeStr
     *            the string to tokenize
     * @param analyzer
     *            the analyzer to use
     * @return the list of tokens
     */
    public List<String> getAnalyseResult(String analyzeStr, Analyzer analyzer) {
        List<String> response = new ArrayList<String>();
        TokenStream tokenStream = null;
        try {
            //Returns a TokenStream suitable for fieldName, tokenizing the contents of the reader
            tokenStream = analyzer.tokenStream("address", new StringReader(analyzeStr));
            //The text of the current token
            CharTermAttribute attr = tokenStream.addAttribute(CharTermAttribute.class);
            //The consumer calls reset() before consuming with incrementToken().
            //It resets the stream to a clean state; stateful implementations must support this so they can be reused.
            tokenStream.reset();
            //Consumers (e.g. IndexWriter) use incrementToken() to advance the stream to the next token
            while (tokenStream.incrementToken()) {
                response.add(attr.toString());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (tokenStream != null) {
                try {
                    tokenStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return response;
    }
}

Result:

The input sentence is:


Search results:

This concludes the introduction. If you are interested in going further, see the advanced version under "More of my articles" at the bottom.


Reprinted from blog.csdn.net/weinichendian/article/details/79992813