Configuring an IK Analyzer synonym analyzer in Lucene 6

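Prerequisite reading:

"Configuring the IK Analyzer for Lucene 6 in IntelliJ IDEA"

To understand a synonym analyzer you first need to understand how Analyzer works.

How Analyzer tokenizes text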

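Analyzer is an abstract class; the concrete tokenization rules are implemented by its subclasses. Internally an Analyzer does its work through TokenStream, which has two subclasses: Tokenizer and TokenFilter.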
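The relationship between the three classes can be sketched in plain Java. This is not Lucene code: MiniTokenStream, WhitespaceSource, LowerCase and Stop are hypothetical stand-ins for TokenStream, a Tokenizer and two TokenFilters, and the real classes share attributes rather than a single term field. The sketch only shows that a tokenizer is the source of the chain, that each filter wraps another stream, and that the wrapping order changes the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class MiniAnalysisChain {

    /** Stand-in for TokenStream: call incrementToken(), then read term(). */
    abstract static class MiniTokenStream {
        protected String term;
        abstract boolean incrementToken();
        String term() { return term; }
    }

    /** Stand-in for a Tokenizer: the source of the chain, splits on whitespace. */
    static final class WhitespaceSource extends MiniTokenStream {
        private final Iterator<String> parts;
        WhitespaceSource(String text) {
            parts = Arrays.asList(text.trim().split("\\s+")).iterator();
        }
        @Override boolean incrementToken() {
            if (!parts.hasNext()) return false;
            term = parts.next();
            return true;
        }
    }

    /** Stand-in for TokenFilter: wraps another stream and rewrites or drops tokens. */
    abstract static class MiniTokenFilter extends MiniTokenStream {
        protected final MiniTokenStream input;
        MiniTokenFilter(MiniTokenStream input) { this.input = input; }
    }

    static final class LowerCase extends MiniTokenFilter {
        LowerCase(MiniTokenStream input) { super(input); }
        @Override boolean incrementToken() {
            if (!input.incrementToken()) return false;
            term = input.term().toLowerCase(Locale.ROOT);
            return true;
        }
    }

    static final class Stop extends MiniTokenFilter {
        private final Set<String> stopWords;
        Stop(MiniTokenStream input, Set<String> stopWords) {
            super(input);
            this.stopWords = stopWords;
        }
        @Override boolean incrementToken() {
            // Skip tokens until one is not a stop word (comparison is case-sensitive)
            while (input.incrementToken()) {
                if (!stopWords.contains(input.term())) {
                    term = input.term();
                    return true;
                }
            }
            return false;
        }
    }

    /** Consumes a stream and collects its terms. */
    static List<String> drain(MiniTokenStream ts) {
        List<String> out = new ArrayList<>();
        while (ts.incrementToken()) out.add(ts.term());
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("this", "is", "an"));
        String text = "This is an apple";
        // Lowercase first, then stop words: prints [apple]
        System.out.println(drain(new Stop(new LowerCase(new WhitespaceSource(text)), stop)));
        // Stop words first, then lowercase: "This" slips through, prints [this, apple]
        System.out.println(drain(new LowerCase(new Stop(new WhitespaceSource(text), stop))));
    }
}
```

Swapping the two filters is enough to let a capitalized stop word through, which is why filter order deserves attention when an analyzer is assembled.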

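How an Analyzer processes text:

1. A Tokenizer splits the text into tokens. Different analyzers use different Tokenizers.

A Tokenizer works on a stream of individual characters: it reads the data from a Reader and turns it into tokens. Take "this is an apple" run through the standard analyzer as an example.

2. The tokens then pass through TokenFilters. Some analyzers filter out stop words and some do not.

A TokenFilter performs text filtering. When several filters are used, pay attention to the order in which they are applied.

After the stop-word filter only apple is left.

3. After filtering, all tokens together form a TokenStream. The stream carries attributes such as CharTermAttribute, PositionIncrementAttribute, OffsetAttribute and TypeAttribute, which describe each token in the stream: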


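1. CharTermAttribute: stores the text of each token produced by tokenization.

2. OffsetAttribute: stores each token's start and end character offsets in the original text.

3. PositionIncrementAttribute: stores the position increment between a token and the one before it. Stop words in a field may be filtered out, for example, so the positional relationship between the remaining tokens has to be recorded.

This is where synonyms come in: when several tokens share one position, that is, when a token has a position increment of 0, those tokens are synonyms.

4. TypeAttribute: the type of the token.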
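The four attributes can be modelled with a small plain-Java sketch. It does not use Lucene: the Token class, the whitespace tokenizer and the "word" type are assumptions made for illustration. What it does show is how offsets are recorded per token and how a removed stop word adds to the position increment of the next surviving token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenAttributesSketch {

    /** One token plus the four attributes discussed above (illustrative model). */
    static final class Token {
        final String term;      // CharTermAttribute
        final int startOffset;  // OffsetAttribute, start (inclusive)
        final int endOffset;    // OffsetAttribute, end (exclusive)
        final int posInc;       // PositionIncrementAttribute
        final String type;      // TypeAttribute

        Token(String term, int startOffset, int endOffset, int posInc, String type) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.posInc = posInc;
            this.type = type;
        }

        @Override public String toString() {
            return posInc + ":" + term + "[" + startOffset + "-" + endOffset + "]-->" + type;
        }
    }

    /** Splits on spaces and records character offsets; each token advances one position. */
    static List<Token> tokenize(String text) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (text.charAt(i) == ' ') { i++; continue; }
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') i++;
            out.add(new Token(text.substring(start, i), start, i, 1, "word"));
        }
        return out;
    }

    /** Drops stop words and adds each skipped position to the next surviving token. */
    static List<Token> removeStopWords(List<Token> in, Set<String> stopWords) {
        List<Token> out = new ArrayList<>();
        int skipped = 0;
        for (Token t : in) {
            if (stopWords.contains(t.term)) {
                skipped += t.posInc;
                continue;
            }
            out.add(new Token(t.term, t.startOffset, t.endOffset, t.posInc + skipped, t.type));
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("this", "is", "an"));
        for (Token t : removeStopWords(tokenize("this is an apple"), stop)) {
            System.out.println(t); // prints 4:apple[11-16]-->word
        }
    }
}
```

The increment of 4 records that three positions were skipped before apple, while the offsets still point at the original text.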

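With that background in place, see:

Jijun/ik-analyzer: adds synonym support on top of the IK Chinese analyzer

Now write the MySynonymAnalyzer class. IKTokenizer6x comes from the article linked at the start of this post.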

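import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
import org.apache.lucene.analysis.util.ClasspathResourceLoader;
import org.apache.lucene.analysis.util.ResourceLoader;

public class MySynonymAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // IKTokenizer6x is the class from the prerequisite article
        Tokenizer tokenizer = new IKTokenizer6x(true);
        // The factory consumes its arguments, so the map must be mutable
        Map<String, String> paramsMap = new HashMap<>();
        paramsMap.put("synonyms", "synonyms.dic");
        SynonymFilterFactory factory = new SynonymFilterFactory(paramsMap);
        // Resolve the synonym file from the classpath
        ResourceLoader loader = new ClasspathResourceLoader();
        try {
            factory.inform(loader);
        } catch (IOException e) {
            // Fail fast instead of continuing with an unconfigured factory
            throw new UncheckedIOException("Could not load synonyms.dic", e);
        }
        return new TokenStreamComponents(tokenizer, factory.create(tokenizer));
    }
}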

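Put the synonym dictionary on the classpath (in the resources folder). The file name must match the synonyms parameter passed to SynonymFilterFactory, which is synonyms.dic in the code above.

Add one entry: 出租车,出租汽车 (two words for "taxi")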
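SynonymFilterFactory reads this file in the Solr synonym format by default: a comma-separated line lists equivalent terms, a line of the form a => b maps a to b explicitly, and lines starting with # are comments. The sketch below is a simplified plain-Java reading of that format, not Lucene's own parser; SynonymLineSketch and its methods are hypothetical names, and the real parser additionally analyzes each term and handles escaping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SynonymLineSketch {

    /** Builds a term -> replacement-terms table from Solr-style synonym lines. */
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, Set<String>> table = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // blank line or comment
            String[] sides = line.split("=>", 2);
            List<String> inputs = splitTerms(sides[0]);
            // Without "=>" every term in the group maps to the whole group
            List<String> outputs = sides.length == 2 ? splitTerms(sides[1]) : inputs;
            for (String in : inputs) {
                table.computeIfAbsent(in, k -> new LinkedHashSet<>()).addAll(outputs);
            }
        }
        return table;
    }

    /** Splits a comma-separated list and drops empty entries. */
    static List<String> splitTerms(String s) {
        List<String> terms = new ArrayList<>();
        for (String t : s.split(",")) {
            String trimmed = t.trim();
            if (!trimmed.isEmpty()) terms.add(trimmed);
        }
        return terms;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> table = parse(Arrays.asList("# taxi", "出租车,出租汽车"));
        System.out.println(table.get("出租车")); // prints [出租车, 出租汽车]
    }
}
```

With the single entry above, each of the two words maps to both, which is why analyzing either one yields the pair.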

 

Test it:

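import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class Main {

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new MySynonymAnalyzer();
        String str = "出租车";
        displayAllTokenInfo(analyzer, str);
    }

    /**
     * Prints every token of the stream with its position increment, offsets and type.
     * @param analyzer the analyzer under test
     * @param str the text to analyze
     * @throws IOException if the token stream cannot be read
     */
    public static void displayAllTokenInfo(Analyzer analyzer, String str) throws IOException {
        // The first argument is a field name; "content" is an arbitrary placeholder
        try (TokenStream stream = analyzer.tokenStream("content", str)) {
            PositionIncrementAttribute pia = stream.addAttribute(PositionIncrementAttribute.class);
            OffsetAttribute oa = stream.addAttribute(OffsetAttribute.class);
            CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
            TypeAttribute ta = stream.addAttribute(TypeAttribute.class);
            stream.reset(); // must be called before the first incrementToken()
            while (stream.incrementToken()) {
                System.out.print(pia.getPositionIncrement() + ":");
                System.out.print(cta + "[" + oa.startOffset() + "-" + oa.endOffset() + "]-->" + ta.type() + "\n");
            }
            stream.end(); // finish the stream; try-with-resources then closes it
        }
    }
}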

Output:

加载扩展词典:ext.dic
加载扩展停止词典:stopword.dic
加载扩展停止词典:ext_stopword.dic
1:出租车[0-3]-->出租车
0:出租汽车[0-3]-->SYNONYM

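As the output shows, 出租汽车 has a position increment of 0, so it sits at the same position as 出租车 and is therefore its synonym. Its TypeAttribute value SYNONYM says the same thing. (How does Lucene know to set the type to SYNONYM? Because MySynonymAnalyzer uses the SynonymFilterFactory that Lucene provides and hands it the synonym configuration, so the filter it creates marks the tokens it injects with that type.)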
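Reading synonyms off a token stream amounts to turning increments into absolute positions and grouping the terms that land on the same one. Here is a minimal plain-Java sketch of that bookkeeping; PositionGroupingSketch and groupByPosition are hypothetical names, and the two arrays simply replay the increments and terms printed above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PositionGroupingSketch {

    /** Groups terms by absolute position; increments[i] belongs to terms[i]. */
    static Map<Integer, List<String>> groupByPosition(int[] increments, String[] terms) {
        Map<Integer, List<String>> byPosition = new LinkedHashMap<>();
        int position = -1; // so that a first increment of 1 lands on position 0
        for (int i = 0; i < terms.length; i++) {
            position += increments[i]; // an increment of 0 keeps the previous position
            byPosition.computeIfAbsent(position, k -> new ArrayList<>()).add(terms[i]);
        }
        return byPosition;
    }

    public static void main(String[] args) {
        // Increments and terms taken from the output above
        System.out.println(groupByPosition(new int[] {1, 0}, new String[] {"出租车", "出租汽车"}));
        // prints {0=[出租车, 出租汽车]}
    }
}
```

Any position that ends up with more than one term holds a set of synonyms.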

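Two approaches to using synonyms in Lucene

Next comes a practical test in Lucene, which also checks the two questions raised in this article:

"Chinese synonym search with lucene + ikanalyzer"

There are two ways to use synonyms in Lucene:

1. Handle synonyms at index time: when the text is tokenized for indexing, the synonym terms are added to the index as well. At search time the input is tokenized and searched as is.

2. Do nothing about synonyms at index time. At search time, tokenize the input first, look up synonyms for each resulting term (the keywords), combine the matches into a new query, and search the index with that.

The catch with approach 1 is that the index has to be rebuilt every time the synonym dictionary changes (Elasticsearch has the same problem).

Approach 2 specifies the synonym analyzer at search time, which avoids rebuilding the index.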
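To make the difference concrete, here is a minimal plain-Java sketch of both approaches on a toy inverted index. Nothing in it is Lucene code: SynonymStrategySketch, the hard-coded synonym table and the hand-tokenized documents are assumptions for illustration, and the term lists are not claimed to match what IK would actually produce.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SynonymStrategySketch {

    // Hypothetical synonym table: each word maps to the whole group
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("出租车", Arrays.asList("出租车", "出租汽车"));
        SYNONYMS.put("出租汽车", Arrays.asList("出租车", "出租汽车"));
    }

    static List<String> expand(String term) {
        return SYNONYMS.getOrDefault(term, Collections.singletonList(term));
    }

    /** Builds term -> doc ids; documents are given as already-tokenized term lists. */
    static Map<String, Set<Integer>> buildIndex(List<List<String>> docs, boolean expandAtIndexTime) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId)) {
                List<String> toIndex = expandAtIndexTime ? expand(term) : Collections.singletonList(term);
                for (String t : toIndex) {
                    index.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
                }
            }
        }
        return index;
    }

    /** Looks up one query term, optionally expanding it to its synonyms first. */
    static Set<Integer> search(Map<String, Set<Integer>> index, String term, boolean expandAtQueryTime) {
        Set<Integer> hits = new TreeSet<>();
        List<String> toSearch = expandAtQueryTime ? expand(term) : Collections.singletonList(term);
        for (String t : toSearch) {
            hits.addAll(index.getOrDefault(t, Collections.<Integer>emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("出租汽车", "经营者"),
                Arrays.asList("违反", "出租车", "规定"));
        // Approach 1: synonyms in the index, plain query -> [0, 1]
        System.out.println(search(buildIndex(docs, true), "出租车", false));
        // Approach 2: plain index, synonyms in the query -> [0, 1]
        System.out.println(search(buildIndex(docs, false), "出租车", true));
        // No synonyms on either side -> [1]
        System.out.println(search(buildIndex(docs, false), "出租车", false));
    }
}
```

In the first call the synonym table is baked into the index, so changing it means calling buildIndex again; in the second the same index can be reused with a different table.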

 

First, test approach 1:

Create a MyIndex class with a method that builds the index. The index is built with the MySynonymAnalyzer analyzer.

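import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MyIndex {
    public static void createIndex(String indexPath) throws IOException {
        // Synonyms are injected at index time by MySynonymAnalyzer
        Analyzer analyzer = new MySynonymAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        // CREATE overwrites any index that already exists at indexPath
        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

        // try-with-resources closes the writer first, then the directory
        try (Directory directory = FSDirectory.open(Paths.get(indexPath));
             IndexWriter indexWriter = new IndexWriter(directory, iwc)) {
            Document document1 = new Document();
            document1.add(new TextField("title", "出租汽车经营者不按照规定配置出租汽车相关设备", Field.Store.YES));
            indexWriter.addDocument(document1);

            Document document2 = new Document();
            document2.add(new TextField("title", "对违反出租车运营规定的处罚", Field.Store.YES));
            indexWriter.addDocument(document2);
        }
    }
}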

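Then write a MySearcher class with a search method. Its analyzer is only used to tokenize the search input. IKAnalyzer6x is used here, although MySynonymAnalyzer would work as well.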

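import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MySearcher {
    public static List<String> searchIndex(String keyword, String indexPath) throws IOException, ParseException {
        List<String> result = new ArrayList<>();
        // Plain IK analyzer: the query itself is not expanded with synonyms
        Analyzer analyzer = new IKAnalyzer6x(true);
        try (Directory directory = FSDirectory.open(Paths.get(indexPath));
             IndexReader indexReader = DirectoryReader.open(directory)) {
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            Query query = new QueryParser("title", analyzer).parse(keyword);
            TopDocs td = indexSearcher.search(query, 10);
            // Iterate scoreDocs, not totalHits: totalHits can exceed the 10 docs returned
            for (ScoreDoc scoreDoc : td.scoreDocs) {
                result.add(indexSearcher.doc(scoreDoc.doc).get("title"));
            }
        }
        return result;
    }
}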

Test driver:

import java.io.IOException;
import java.util.List;
import org.apache.lucene.queryparser.classic.ParseException;

public class Main {

    public static void main(String[] args) throws IOException, ParseException {
        String indexPath = "D:\\indexFile\\test";
        String input = "出租车";
        // Approach 1: synonyms in the index, plain analyzer for the query
        MyIndex.createIndex(indexPath);
        List<String> docs = MySearcher.searchIndex(input, indexPath);
        for (String string : docs) {
            System.out.println(string);
        }
    }
}

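Searching for 出租车 returns both documents: the one containing 出租车 and the one containing 出租汽车.

Now test approach 2:

Create a MyIndex2 class with a method that builds the index. This time the index is built with the IKAnalyzer6x analyzer.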

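import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MyIndex2 {
    public static void createIndex(String indexPath) throws IOException {
        // Plain IK analyzer: no synonyms are written into the index
        Analyzer analyzer = new IKAnalyzer6x();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        // CREATE overwrites any index that already exists at indexPath
        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

        // try-with-resources closes the writer first, then the directory
        try (Directory directory = FSDirectory.open(Paths.get(indexPath));
             IndexWriter indexWriter = new IndexWriter(directory, iwc)) {
            Document document1 = new Document();
            document1.add(new TextField("title", "出租汽车经营者不按照规定配置出租汽车相关设备", Field.Store.YES));
            indexWriter.addDocument(document1);

            Document document2 = new Document();
            document2.add(new TextField("title", "对违反出租车运营规定的处罚", Field.Store.YES));
            indexWriter.addDocument(document2);
        }
    }
}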

Then write a MySearcher2 class with a search method. This one uses MySynonymAnalyzer.

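import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MySearcher2 {
    public static List<String> searchIndex(String keyword, String indexPath) throws IOException, ParseException {
        List<String> result = new ArrayList<>();
        // Synonym analyzer: the query is expanded with synonyms at search time
        Analyzer analyzer = new MySynonymAnalyzer();
        try (Directory directory = FSDirectory.open(Paths.get(indexPath));
             IndexReader indexReader = DirectoryReader.open(directory)) {
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            Query query = new QueryParser("title", analyzer).parse(keyword);
            TopDocs td = indexSearcher.search(query, 10);
            // Iterate scoreDocs, not totalHits: totalHits can exceed the 10 docs returned
            for (ScoreDoc scoreDoc : td.scoreDocs) {
                result.add(indexSearcher.doc(scoreDoc.doc).get("title"));
            }
        }
        return result;
    }
}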

Test driver:

import java.io.IOException;
import java.util.List;
import org.apache.lucene.queryparser.classic.ParseException;

public class Main2 {

    public static void main(String[] args) throws IOException, ParseException {
        String indexPath = "D:\\indexFile\\test02";
        String input = "出租车";
        // Approach 2: plain index (MyIndex2), synonym analyzer for the query (MySearcher2)
        MyIndex2.createIndex(indexPath);
        List<String> docs = MySearcher2.searchIndex(input, indexPath);
        for (String string : docs) {
            System.out.println(string);
        }
    }
}

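Again, searching for 出租车 returns both documents: the one containing 出租车 and the one containing 出租汽车.

However, if you change the Analyzer in MySearcher2 to IKAnalyzer6x, only one document is found.

Conclusion:

This test shows that both approaches work. I personally lean towards approach 2: synonyms change frequently, and rebuilding the index after every change is a poor trade.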


Reposted from blog.csdn.net/u013905744/article/details/81132602