Lucene 4.3: displaying analyzer output through TokenStream (code demo)

Core code:

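import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class AnalyzerUtils {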

    /**
     * Prints each token the analyzer produces, in the form [token].
     * @param str the text to analyze
     * @param a   the analyzer to run
     * @Adder by arvin 2013-7-2 5:02:24 PM
     */
    public static void displayToken(String str,Analyzer a) {
        try {
            TokenStream stream = a.tokenStream("content",new StringReader(str));
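            // Register the term attribute on the stream; it is updated in place
            // every time incrementToken() advances to the next token.
            CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
            stream.reset(); // must be called before incrementToken(); omitting it produces a NullPointerException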
            while(stream.incrementToken()) {
                System.out.print("["+cta+"]");
            }
            System.out.println();
            stream.end();
            stream.close(); // release the stream so the analyzer can be reused
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

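    /**
     * Prints every attribute of each token: position increment, term, offsets, and type.
     * @param str the text to analyze
     * @param a   the analyzer to run
     * @Adder by arvin 2013-7-2 5:02:52 PM
     */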
    public static void displayAllTokenInfo(String str,Analyzer a){
        try {
            TokenStream stream = a.tokenStream("content",new StringReader(str));
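            // Position increment: how many positions this token lies after the previous one
            PositionIncrementAttribute pis=stream.addAttribute(PositionIncrementAttribute.class);
            // Start and end character offsets of each token in the original text
            OffsetAttribute oa=stream.addAttribute(OffsetAttribute.class);
            // The text of each token (the term itself)
            CharTermAttribute cta=stream.addAttribute(CharTermAttribute.class);
            // The lexical type the tokenizer assigned to each token
            TypeAttribute ta=stream.addAttribute(TypeAttribute.class);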
            stream.reset();
            while(stream.incrementToken()) {
                // Output labels: 增量 = position increment, 分词 = term, 位置 = character offsets, 类型 = token type
                System.out.print("增量:"+pis.getPositionIncrement()+":");
                System.out.print("分词:"+cta+"位置:["+oa.startOffset()+"~"+oa.endOffset()+"]->类型:"+ta.type()+"\n");
            }
            System.out.println();
            stream.end();
            stream.close(); // release the stream so the analyzer can be reused
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

Test code:

    @Test
    public void testAnalyzer(){

        Analyzer a1=new StandardAnalyzer(Version.LUCENE_43);
        Analyzer a2=new StopAnalyzer(Version.LUCENE_43);
        Analyzer a3=new SimpleAnalyzer(Version.LUCENE_43);
        Analyzer a4=new WhitespaceAnalyzer(Version.LUCENE_43);

        String str="this is my house,I am come from yunnang zhaotong,my email is [email protected]";
        // Chinese sample input ("My hometown is Longyan City, Fujian Province"):
        //String str="我的家乡在福建省龙岩市";
        AnalyzerUtils.displayToken(str, a1);
        AnalyzerUtils.displayToken(str, a2);
        AnalyzerUtils.displayToken(str, a3);
        AnalyzerUtils.displayToken(str, a4);

    }

    @Test
    public void testAnalyzer02(){

        Analyzer a1=new StandardAnalyzer(Version.LUCENE_43);
        Analyzer a2=new StopAnalyzer(Version.LUCENE_43);
        Analyzer a3=new SimpleAnalyzer(Version.LUCENE_43);
        Analyzer a4=new WhitespaceAnalyzer(Version.LUCENE_43);

        String str="how are you thank you";

        AnalyzerUtils.displayAllTokenInfo(str, a1);
        AnalyzerUtils.displayAllTokenInfo(str, a2);
        AnalyzerUtils.displayAllTokenInfo(str, a3);
        AnalyzerUtils.displayAllTokenInfo(str, a4);

    }

Console output:

English results:

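testAnalyzer() results (one line per analyzer, in the order they are called in the test: StandardAnalyzer, StopAnalyzer, SimpleAnalyzer, WhitespaceAnalyzer):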

[my][house][i][am][come][from][yunnang][zhaotong][my][email][342345324][qq.com]
[my][house][i][am][come][from][yunnang][zhaotong][my][email][qq][com]
[this][is][my][house][i][am][come][from][yunnang][zhaotong][my][email][is][qq][com]
[this][is][my][house,I][am][come][from][yunnang][zhaotong,my][email][is][[email protected]]
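
The last two lines differ only in where the text gets cut. As a rough illustration (plain Java that imitates the observed behaviour, not Lucene's actual implementation; `SplitSketch`, `onWhitespace`, and `onLetters` are names made up for this example), cutting only at whitespace keeps `house,I` together and preserves case, while cutting at every non-letter and lowercasing gives the SimpleAnalyzer-style result:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    // Imitates the WhitespaceAnalyzer line: cut only at whitespace,
    // keep case and punctuation untouched.
    static List<String> onWhitespace(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Imitates the SimpleAnalyzer line: a token is a maximal run of
    // letters, lowercased; everything else acts as a separator.
    static List<String> onLetters(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "this is my house,I am come from yunnang zhaotong,my email";
        System.out.println(onWhitespace(text));
        System.out.println(onLetters(text));
    }
}
```

The first two lines of the output additionally lose `this` and `is`, because StandardAnalyzer and StopAnalyzer filter out English stop words.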

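testAnalyzer02() results (one block per analyzer, in the order they are called in the test; 增量 = position increment, 分词 = term, 位置 = character offsets, 类型 = token type):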

增量:1:分词:how位置:[0~3]->类型:<ALPHANUM>
增量:2:分词:you位置:[8~11]->类型:<ALPHANUM>
增量:1:分词:thank位置:[12~17]->类型:<ALPHANUM>
增量:1:分词:you位置:[18~21]->类型:<ALPHANUM>

增量:1:分词:how位置:[0~3]->类型:word
增量:2:分词:you位置:[8~11]->类型:word
增量:1:分词:thank位置:[12~17]->类型:word
增量:1:分词:you位置:[18~21]->类型:word

增量:1:分词:how位置:[0~3]->类型:word
增量:1:分词:are位置:[4~7]->类型:word
增量:1:分词:you位置:[8~11]->类型:word
增量:1:分词:thank位置:[12~17]->类型:word
增量:1:分词:you位置:[18~21]->类型:word

增量:1:分词:how位置:[0~3]->类型:word
增量:1:分词:are位置:[4~7]->类型:word
增量:1:分词:you位置:[8~11]->类型:word
增量:1:分词:thank位置:[12~17]->类型:word
增量:1:分词:you位置:[18~21]->类型:word
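
In the first two blocks the first `you` has an increment of 2 because the stop word `are` was dropped: the position it occupied is still counted, so the next surviving token sits two positions after `how`. The offsets keep pointing into the original string. A plain-Java sketch of that bookkeeping follows (no Lucene involved; `IncrementSketch`, `Tok`, and the one-word stop list are assumptions made for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IncrementSketch {

    // One surviving token: its text, position increment, and character offsets.
    static final class Tok {
        final String term;
        final int increment;
        final int start;
        final int end;

        Tok(String term, int increment, int start, int end) {
            this.term = term;
            this.increment = increment;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + " +" + increment + " [" + start + "~" + end + "]";
        }
    }

    // Splits on whitespace, drops stop words, and records how many
    // positions each kept token lies after the previously kept one.
    static List<Tok> analyze(String text, Set<String> stopWords) {
        List<Tok> out = new ArrayList<Tok>();
        int skipped = 0; // stop words dropped since the last kept token
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i);
            if (stopWords.contains(term)) {
                skipped++;
            } else {
                out.add(new Tok(term, skipped + 1, start, i));
                skipped = 0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("are"));
        for (Tok t : analyze("how are you thank you", stop)) {
            System.out.println(t);
        }
    }
}
```

Run on "how are you thank you", the sketch reports the same increments and offsets as the first two blocks above. SimpleAnalyzer and WhitespaceAnalyzer apply no stop-word filter, which is why every increment in their blocks is 1.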

Chinese results:

testAnalyzer() results (with the commented-out Chinese string as input):

[我][的][家][乡][在][福][建][省][龙][岩][市]
[我的家乡在福建省龙岩市]
[我的家乡在福建省龙岩市]
[我的家乡在福建省龙岩市]

testAnalyzer02() results (input: "明天是我的生日", "Tomorrow is my birthday"):

增量:1:分词:明位置:[0~1]->类型:<IDEOGRAPHIC>
增量:1:分词:天位置:[1~2]->类型:<IDEOGRAPHIC>
增量:1:分词:是位置:[2~3]->类型:<IDEOGRAPHIC>
增量:1:分词:我位置:[3~4]->类型:<IDEOGRAPHIC>
增量:1:分词:的位置:[4~5]->类型:<IDEOGRAPHIC>
增量:1:分词:生位置:[5~6]->类型:<IDEOGRAPHIC>
增量:1:分词:日位置:[6~7]->类型:<IDEOGRAPHIC>

增量:1:分词:明天是我的生日位置:[0~7]->类型:word

增量:1:分词:明天是我的生日位置:[0~7]->类型:word

增量:1:分词:明天是我的生日位置:[0~7]->类型:word
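
The last three analyzers return the whole sentence as one token because it contains no whitespace and Java classifies Han characters as letters, so neither a whitespace split nor a letter-run split ever finds a boundary. StandardAnalyzer instead emits each ideograph as its own token. The sketch below imitates that in plain Java (it is not Lucene's code; `HanSketch` and `perIdeograph` are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;

public class HanSketch {

    // Emits every ideographic code point as its own token,
    // annotated with its start and end offsets.
    static List<String> perIdeograph(String text) {
        List<String> tokens = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int next = i + Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                tokens.add(text.substring(i, next) + "[" + i + "~" + next + "]");
            }
            i = next;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "明天是我的生日";
        // Han characters count as letters, so a letter-run split never cuts here.
        System.out.println(Character.isLetter(text.charAt(0)));
        // No whitespace either, so a whitespace split yields a single piece.
        System.out.println(text.split("\\s+").length);
        // One token per ideograph, as in the StandardAnalyzer block.
        System.out.println(perIdeograph(text));
    }
}
```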

一条梦想会飞的鱼
