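First, some background.

To understand how a synonym analyzer works, you first need to understand Analyzer.

How Analyzer tokenization works

Analyzer is an abstract class; the concrete tokenization rules are implemented by its subclasses. Internally, an Analyzer does its work through the TokenStream class, which has two subclasses: Tokenizer and TokenFilter.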

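The Analyzer processing flow:

1. A Tokenizer splits the text into tokens. Different analyzers use different Tokenizers.

A Tokenizer works on a stream of individual characters: it reads the data from a Reader object and converts it into tokens. Take the text "this is a apple" run through the standard analyzer as an example.

2. After tokenization, TokenFilters filter the tokens. Some analyzers remove stop words and some do not.

A TokenFilter filters or transforms the tokens it receives. When several filters are chained, pay attention to their order, because each filter sees the output of the one before it.

After the stop-word filter, only apple is left.

3. After filtering, all the tokens form a TokenStream. The stream carries attributes such as CharTermAttribute, PositionIncrementAttribute, OffsetAttribute and TypeAttribute, which describe each token in the stream.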
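The tokenizer-then-filter flow above can be sketched in plain Java. This is a toy model only: it does not use Lucene's classes, and the class name, the whitespace splitting rule and the stop-word list are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the tokenizer -> filter chain; it does not use Lucene classes. */
final class ToyAnalysisChain {
    // Hypothetical stop-word list chosen for this example only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("this", "is", "a"));

    /** Step 1: split the character stream into tokens (here: on whitespace). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Step 2: drop every token that appears in the stop-word list. */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("this is a apple");
        System.out.println(tokens);                  // [this, is, a, apple]
        System.out.println(removeStopWords(tokens)); // [apple]
    }
}
```

In real Lucene code the same two steps are a Tokenizer wrapped by one or more TokenFilters, and the tokens are exposed through attributes rather than a list of strings.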

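1. CharTermAttribute stores the text of each token produced by tokenization.

2. OffsetAttribute stores the start and end character offsets of each token in the original text.

3. PositionIncrementAttribute stores the position increment between consecutive tokens. Stop words in a field may be filtered out, for example, so the positional relationship between the remaining tokens has to be recorded.

This is where synonyms come in: when a token has a position increment of 0, it sits at the same position as the previous token, and tokens sharing a position are synonyms.

4. TypeAttribute stores the type of the token.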
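To make the position-increment idea concrete, here is a minimal sketch in plain Java. It does not use Lucene: the Token class is a hypothetical stand-in, and the sample terms are assumptions (taxi and cab stand in for a word and its synonym, and the last token has an increment of 2, as if a stop word had been removed before it).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of position increments; Token is a hypothetical stand-in, not a Lucene class. */
final class PositionIncrementSketch {

    /** One token: its text plus the increment relative to the previous token. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Sums the increments into absolute positions and groups the terms by position. */
    static Map<Integer, List<String>> groupByPosition(List<Token> tokens) {
        Map<Integer, List<String>> byPosition = new LinkedHashMap<>();
        int position = -1; // start before the first token so an increment of 1 yields position 0
        for (Token token : tokens) {
            position += token.positionIncrement;
            byPosition.computeIfAbsent(position, p -> new ArrayList<>()).add(token.term);
        }
        return byPosition;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("taxi", 1));    // first token
        tokens.add(new Token("cab", 0));     // increment 0: same position as "taxi"
        tokens.add(new Token("penalty", 2)); // increment 2: one position was skipped
        System.out.println(groupByPosition(tokens)); // {0=[taxi, cab], 2=[penalty]}
    }
}
```

Any position that ends up holding more than one term is a group of synonyms, and a gap between positions records that something was removed there.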

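With these principles understood, see:

Jijun/ik-analyzer: adds synonym support on top of the IK Chinese analyzer

Now let's write the MySynonymAnalyzer class. For IKTokenizer6x, see the link at the beginning of the article.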

public class MySynonymAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // The IK tokenizer produces the base tokens.
        Tokenizer tokenizer = new IKTokenizer6x(true);

        // Tell the factory which synonym file to load; the loader resolves it on the classpath.
        Map<String, String> paramsMap = new HashMap<>();
        paramsMap.put("synonyms", "synonyms.dic");
        SynonymFilterFactory factory = new SynonymFilterFactory(paramsMap);
        ResourceLoader loader = new ClasspathResourceLoader();

        try {
            // inform() loads and parses the synonym file.
            factory.inform(loader);
        } catch (IOException e) {
            // Without the synonym file the filter cannot be created, so fail fast.
            throw new IllegalStateException("Failed to load synonyms.dic", e);
        }
        // Chain: IKTokenizer6x -> synonym filter.
        return new TokenStreamComponents(tokenizer, factory.create(tokenizer));
    }
}

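Put the synonym dictionary on the classpath (in the resources folder). Its file name must match the synonyms parameter passed to SynonymFilterFactory, which is synonyms.dic in the code above.

Add one rule: 出租车,出租汽车 (two Chinese words for "taxi")

Test it: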
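A comma-separated line like the one above declares that all listed terms are equivalent. The sketch below shows one simplified way to read such lines in plain Java. It is not the parser that SynonymFilterFactory uses, which supports more syntax; the class name is hypothetical and the ASCII sample rule taxi,cab is a stand-in for the rule in this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified reader for comma-separated synonym lines; not Lucene's own parser. */
final class SynonymRuleSketch {

    /** Maps every term in a line to the whole group of terms on that line. */
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> table = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comment lines
            }
            List<String> group = new ArrayList<>();
            for (String term : trimmed.split(",")) {
                if (!term.trim().isEmpty()) {
                    group.add(term.trim());
                }
            }
            // Simplification: a term that appears on two lines keeps only the later group.
            for (String term : group) {
                table.put(term, group);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = parse(Arrays.asList("# sample rule", "taxi,cab"));
        System.out.println(table); // {taxi=[taxi, cab], cab=[taxi, cab]}
    }
}
```

Because each term maps to the full group, looking up either word yields both, which matches the test output later in this article where the input word is followed by its synonym at the same position.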

public class Main {

    public static void main(String[] args) throws IOException, ParseException {
        Analyzer analyzer = new MySynonymAnalyzer();
        String str = "出租车";
        displayAllTokenInfo(analyzer, str);
    }

    /**
     * Prints every token in the stream together with its attributes.
     * @param analyzer the analyzer to test
     * @param str the text to analyze
     * @throws IOException if the token stream cannot be read
     */
    public static void displayAllTokenInfo(Analyzer analyzer, String str) throws IOException {
        // The first argument of tokenStream() is a field name; any name works for this test.
        try (TokenStream toStream = analyzer.tokenStream("content", new StringReader(str))) {
            PositionIncrementAttribute pia = toStream.getAttribute(PositionIncrementAttribute.class);
            OffsetAttribute oa = toStream.getAttribute(OffsetAttribute.class);
            CharTermAttribute cta = toStream.getAttribute(CharTermAttribute.class);
            TypeAttribute ta = toStream.getAttribute(TypeAttribute.class);
            toStream.reset(); // must be called before the first incrementToken()
            while (toStream.incrementToken()) {
                System.out.print(pia.getPositionIncrement() + ":");
                System.out.print(cta + "[" + oa.startOffset() + "-" + oa.endOffset() + "]-->" + ta.type() + "\n");
            }
            toStream.end(); // signals that the stream is fully consumed
        }
    }
}

Result:

加载扩展词典：ext.dic
加载扩展停止词典：stopword.dic
加载扩展停止词典：ext_stopword.dic
1:出租车[0-3]-->出租车
0:出租汽车[0-3]-->SYNONYM

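As you can see, getPositionIncrement returns 0 for 出租汽车, which means it is a synonym of 出租车. The TypeAttribute value SYNONYM confirms this. (How does Lucene know the type is SYNONYM? Because MySynonymAnalyzer uses the SynonymFilterFactory that Lucene provides, and we gave it the synonym configuration.)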

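Two approaches to using synonyms in Lucene

Next, let's run a real test in Lucene and verify the two questions raised in this article:

lucene+ikanalyzer实现中文同义词搜索 (Chinese synonym search with Lucene and IKAnalyzer)

There are two approaches to using synonyms in Lucene:

1. Handle synonyms at index time: when the text is tokenized for indexing, the synonym terms are added to the index as well. At search time, the input is simply tokenized and searched as-is.

2. Do nothing about synonyms at index time. At search time, tokenize the input first, look up synonyms for each resulting token (that is, each keyword), combine the matched synonyms into a new query, and search the index with it.

The catch with approach 1 is that the index must be rebuilt every time the synonym dictionary changes (Elasticsearch has the same problem).

Approach 2 specifies the synonym analyzer at search time, which avoids rebuilding the index.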
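The difference between the two approaches can be sketched with a toy inverted index in plain Java. Nothing here is Lucene API: the class, the whitespace tokenization, the hard-coded synonym table and the two sample documents are assumptions for illustration, with taxi and cab standing in for 出租车 and 出租汽车.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index comparing index-time and query-time synonym expansion. */
final class SynonymStrategySketch {
    // Hypothetical synonym table: each term maps to its full group.
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("taxi", Arrays.asList("taxi", "cab"));
        SYNONYMS.put("cab", Arrays.asList("taxi", "cab"));
    }

    /** Returns the term's synonym group, or just the term if it has none. */
    static List<String> expand(String term) {
        return SYNONYMS.getOrDefault(term, Collections.singletonList(term));
    }

    /** Builds term -> document ids; with expandAtIndexTime, synonyms are indexed too (approach 1). */
    static Map<String, Set<Integer>> buildIndex(List<String> docs, boolean expandAtIndexTime) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String token : docs.get(docId).split("\\s+")) {
                List<String> terms = expandAtIndexTime ? expand(token) : Collections.singletonList(token);
                for (String term : terms) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
                }
            }
        }
        return index;
    }

    /** Looks up one query term; with expandAtQueryTime, its synonyms are searched too (approach 2). */
    static Set<Integer> search(Map<String, Set<Integer>> index, String queryTerm, boolean expandAtQueryTime) {
        Set<Integer> hits = new TreeSet<>();
        List<String> terms = expandAtQueryTime ? expand(queryTerm) : Collections.singletonList(queryTerm);
        for (String term : terms) {
            hits.addAll(index.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("cab operator rules", "taxi penalty rules");
        System.out.println(search(buildIndex(docs, true), "taxi", false));  // [0, 1]  approach 1
        System.out.println(search(buildIndex(docs, false), "taxi", true));  // [0, 1]  approach 2
        System.out.println(search(buildIndex(docs, false), "taxi", false)); // [1]     no synonyms anywhere
    }
}
```

In approach 1 the synonym table is baked into the index, so changing it means calling buildIndex again. In approach 2 only the query path consults the table, so a change takes effect on the next search.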

First, test approach 1:

Create a MyIndex class that provides the index-creation method. The index is created with the MySynonymAnalyzer analyzer.

public class MyIndex {
    public static void createIndex(String indexPath) throws IOException {
        Directory directory = FSDirectory.open(Paths.get(indexPath));
        Analyzer analyzer = new MySynonymAnalyzer();

        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

        IndexWriter indexWriter = new IndexWriter(directory, iwc);

        Document document1 = new Document();
        document1.add(new TextField("title", "出租汽车经营者不按照规定配置出租汽车相关设备", Field.Store.YES));
        indexWriter.addDocument(document1);

        Document document2 = new Document();
        document2.add(new TextField("title", "对违反出租车运营规定的处罚", Field.Store.YES));
        indexWriter.addDocument(document2);

        indexWriter.close();
    }
}

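Next, write the MySearcher class, which provides the search method. Its analyzer only tokenizes the search terms. IKAnalyzer6x is used here, though MySynonymAnalyzer would work as well.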

public class MySearcher {
    public static List<String> searchIndex(String keyword, String indexPath) throws IOException, ParseException {
        List<String> result = new ArrayList<>();
        IndexSearcher indexSearcher = null;
        IndexReader indexReader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
        indexSearcher = new IndexSearcher(indexReader);
        Analyzer analyzer = new IKAnalyzer6x(true);

        QueryParser queryParser = new QueryParser("title", analyzer);
        Query query = queryParser.parse(keyword);
        TopDocs td = indexSearcher.search(query, 10);
        // Iterate over the returned hits only; totalHits can exceed the 10 results requested.
        for (int i = 0; i < td.scoreDocs.length; i++) {
            Document document = indexSearcher.doc(td.scoreDocs[i].doc);
            result.add(document.get("title"));
        }
        return result;
    }
}

Test:

public class Main {

    public static void main(String[] args) throws IOException, ParseException {
        String indexPath = "D:\\indexFile\\test";
        String input = "出租车";
        MyIndex.createIndex(indexPath);
        List<String> docs = MySearcher.searchIndex(input, indexPath);
        for (String string : docs) {
            System.out.println(string);
        }
    }

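    /**
     * Prints every token in the stream together with its attributes.
     * @param analyzer the analyzer to test
     * @param str the text to analyze
     * @throws IOException if the token stream cannot be read
     */
    public static void displayAllTokenInfo(Analyzer analyzer,String str) throws IOException {
        StringReader reader = new StringReader(str);
        TokenStream toStream = analyzer.tokenStream(str, reader);
        toStream.reset();// must be called before the first incrementToken()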
        PositionIncrementAttribute pia = toStream.getAttribute(PositionIncrementAttribute.class);
        OffsetAttribute oa = toStream.getAttribute(OffsetAttribute.class);
        CharTermAttribute cta = toStream.getAttribute(CharTermAttribute.class);
        TypeAttribute ta = toStream.getAttribute(TypeAttribute.class);
        while (toStream.incrementToken()) {
            System.out.print(pia.getPositionIncrement()+":");
            System.out.print(cta+"["+oa.startOffset()+"-"+oa.endOffset()+"]-->"+ta.type()+"\n");
        }
    }
}

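Searching for 出租车 finds both documents: the one containing 出租车 and the one containing 出租汽车.

Now test approach 2:

Create a MyIndex2 class that provides the index-creation method. The index is created with the IKAnalyzer6x analyzer.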

public class MyIndex2 {
    public static void createIndex(String indexPath) throws IOException {
        Directory directory = FSDirectory.open(Paths.get(indexPath));
        Analyzer analyzer = new IKAnalyzer6x();

        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

        IndexWriter indexWriter = new IndexWriter(directory, iwc);

        Document document1 = new Document();
        document1.add(new TextField("title", "出租汽车经营者不按照规定配置出租汽车相关设备", Field.Store.YES));
        indexWriter.addDocument(document1);

        Document document2 = new Document();
        document2.add(new TextField("title", "对违反出租车运营规定的处罚", Field.Store.YES));
        indexWriter.addDocument(document2);

        indexWriter.close();
    }
}

Next, write the MySearcher2 class, which provides the search method. MySynonymAnalyzer is used here.

public class MySearcher2 {
    public static List<String> searchIndex(String keyword, String indexPath) throws IOException, ParseException {
        List<String> result = new ArrayList<>();
        IndexSearcher indexSearcher = null;
        IndexReader indexReader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
        indexSearcher = new IndexSearcher(indexReader);
        Analyzer analyzer = new MySynonymAnalyzer();

        QueryParser queryParser = new QueryParser("title", analyzer);
        Query query = queryParser.parse(keyword);
        TopDocs td = indexSearcher.search(query, 10);
        // Iterate over the returned hits only; totalHits can exceed the 10 results requested.
        for (int i = 0; i < td.scoreDocs.length; i++) {
            Document document = indexSearcher.doc(td.scoreDocs[i].doc);
            result.add(document.get("title"));
        }
        return result;
    }
}

Test:

public class Main2 {

    public static void main(String[] args) throws IOException, ParseException {
        String indexPath = "D:\\indexFile\\test02";
        String input = "出租车";
        // Approach 2: index without synonyms (MyIndex2), expand synonyms at search time (MySearcher2).
        MyIndex2.createIndex(indexPath);
        List<String> docs = MySearcher2.searchIndex(input, indexPath);
        for (String string : docs) {
            System.out.println(string);
        }
    }

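    /**
     * Prints every token in the stream together with its attributes.
     * @param analyzer the analyzer to test
     * @param str the text to analyze
     * @throws IOException if the token stream cannot be read
     */
    public static void displayAllTokenInfo(Analyzer analyzer,String str) throws IOException {
        StringReader reader = new StringReader(str);
        TokenStream toStream = analyzer.tokenStream(str, reader);
        toStream.reset();// must be called before the first incrementToken()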
        PositionIncrementAttribute pia = toStream.getAttribute(PositionIncrementAttribute.class);
        OffsetAttribute oa = toStream.getAttribute(OffsetAttribute.class);
        CharTermAttribute cta = toStream.getAttribute(CharTermAttribute.class);
        TypeAttribute ta = toStream.getAttribute(TypeAttribute.class);
        while (toStream.incrementToken()) {
            System.out.print(pia.getPositionIncrement()+":");
            System.out.print(cta+"["+oa.startOffset()+"-"+oa.endOffset()+"]-->"+ta.type()+"\n");
        }
    }
}

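Searching for 出租车 finds both documents: the one containing 出租车 and the one containing 出租汽车.

However, if you change the Analyzer in MySearcher2 to IKAnalyzer6x, only one document is found.

Conclusion:

This test shows that both approaches work. I personally prefer approach 2: synonyms change frequently, and rebuilding the index after every change is costly.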
