A high-performance Java implementation of jieba word segmentation, about twice as fast as huaban jieba

Segment

Segment is a high-performance Java word segmentation library built on the jieba dictionary and designed to be more flexible.

Change Log

Project goals

Word segmentation is a very basic function in NLP-related work.

jieba-analysis is a very popular segmentation implementation, and my own project opencc4j previously used it for segmentation.

But as I learned more about segmentation, I found that jieba is not flexible enough in some of its configuration.

Many features cannot be switched off. For example, HMM is useless for traditional/simplified conversion, because the traditional-form phrases are fixed and need no prediction.

The latest version also seems to have removed part-of-speech tagging and other features, which I personally need.

So I decided to re-implement it, hoping to build a more flexible and more distinctive segmentation framework.

In addition, updates to jieba-analysis seem to have stalled, and my implementation differs from it substantially, so I started a new project.

Features

  • High-performance segmentation based on a DFA

  • Supports user-defined dictionaries

  • Supports returning the part of speech

Part-of-speech output is off by default and lazily loaded, so it does not affect performance or memory unless enabled.
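To make the DFA idea concrete, here is a minimal sketch of dictionary-based segmentation. It assumes a character trie used as the automaton and forward maximum matching as the strategy. The class TrieSegmenterSketch and its methods are hypothetical names for illustration and are not the library's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the classes used inside segment.
class TrieSegmenterSketch {

    // One automaton state: outgoing transitions plus an "accepting" flag.
    private static final class Node {
        final Map<Character, Node> next = new HashMap<Character, Node>();
        boolean wordEnd;
    }

    private final Node root = new Node();

    // Adds a word to the dictionary; user-defined words take the same path.
    void addWord(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            Character key = word.charAt(i);
            Node child = node.next.get(key);
            if (child == null) {
                child = new Node();
                node.next.put(key, child);
            }
            node = child;
        }
        node.wordEnd = true;
    }

    // Forward maximum matching: at each position take the longest dictionary
    // word, falling back to a single character when nothing matches.
    List<String> segment(String text) {
        List<String> words = new ArrayList<String>();
        int start = 0;
        while (start < text.length()) {
            Node node = root;
            int longestEnd = start + 1; // single-character fallback
            for (int i = start; i < text.length(); i++) {
                node = node.next.get(text.charAt(i));
                if (node == null) {
                    break; // no transition: stop scanning from this start
                }
                if (node.wordEnd) {
                    longestEnd = i + 1; // remember the longest match so far
                }
            }
            words.add(text.substring(start, longestEnd));
            start = longestEnd;
        }
        return words;
    }

    public static void main(String[] args) {
        TrieSegmenterSketch segmenter = new TrieSegmenterSketch();
        segmenter.addWord("北京");
        segmenter.addWord("学习");
        System.out.println(segmenter.segment("我爱北京")); // [我, 爱, 北京]
    }
}
```

Each lookup walks the trie one character at a time, so the cost per start position is bounded by the longest dictionary word rather than by the dictionary size.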

Getting Started

Prerequisites

JDK 1.7+

Maven 3.x+

Maven dependency

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>segment</artifactId>
    <version>${latest version}</version>
</dependency>

Usage examples

See SegmentBsTest.java for the relevant code.

Getting words, offsets, and other information

Part-of-speech tagging is not implemented yet; it is planned for the next version.

final String string = "这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱学习。";

List<ISegmentResult> resultList = SegmentBs.newInstance().segment(string);
Assert.assertEquals("[这[0,1), 是[1,2), 一个[2,4), 伸手不见五指[4,10), 的[10,11), 黑夜[11,13), 。[13,14), 我[14,15), 叫[15,16), 孙悟空[16,19), ,[19,20), 我[20,21), 爱[21,22), 北京[22,24), ,[24,25), 我[25,26), 爱[26,27), 学习[27,29), 。[29,30)]", resultList.toString());
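The bracketed pairs in the expected output read as half-open intervals [start, end). The following minimal sketch assumes these are Java char indexes into the input string, which the expected output above is consistent with. OffsetSketch and slice are hypothetical names, not part of the library.

```java
// Hypothetical helper showing how a [start, end) pair maps back to the input text.
class OffsetSketch {

    // A half-open interval [start, end) matches String.substring(start, end).
    static String slice(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        final String string = "这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱学习。";
        System.out.println(slice(string, 4, 10));  // 伸手不见五指
        System.out.println(slice(string, 16, 19)); // 孙悟空
    }
}
```

Because the end index is exclusive, the end of one word equals the start of the next, and end minus start is the word length.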

Getting only the words

final String string = "这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱学习。";

List<String> resultList = SegmentBs.newInstance().segmentWords(string);
Assert.assertEquals("[这, 是, 一个, 伸手不见五指, 的, 黑夜, 。, 我, 叫, 孙悟空, ,, 我, 爱, 北京, ,, 我, 爱, 学习, 。]", resultList.toString());

Returning the part of speech

Usage example

Simply set the wordType attribute to true.

final String string = "我爱学习";

List<ISegmentResult> resultList = SegmentBs
                .newInstance()
                .wordType(true)
                .segment(string);

Assert.assertEquals("[我[0,1)/r, 爱[1,2)/v, 学习[2,4)/v]", resultList.toString());

Part-of-speech description

r and v are part-of-speech codes. The meaning of each code is as follows.

Code  Description
Ag    Adjectival morpheme
a     Adjective
ad    Adverbial adjective
an    Nominal adjective
b     Distinguishing word
c     Conjunction
dg    Adverbial morpheme
d     Adverb
e     Interjection
f     Locality word
g     Morpheme
h     Prefix component
i     Idiom (chengyu)
j     Abbreviation
k     Suffix component
l     Idiomatic phrase
m     Numeral
Ng    Noun morpheme
n     Noun
nr    Person name
ns    Place name
nt    Organization
nz    Other proper noun
o     Onomatopoeia
p     Preposition
q     Measure word
r     Pronoun
s     Place word
tg    Time morpheme
t     Time word
u     Particle
vg    Verb morpheme
v     Verb
vd    Adverbial verb
vn    Nominal verb
w     Punctuation
x     Non-morpheme character
y     Modal particle
z     Status word
un    Unknown word

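See the corresponding enum class WordTypeEnum.

Benchmark

Performance comparison

The comparison is against jieba 1.0.2 under identical test conditions: both libraries are warmed up first and then process the same workload.

In these tests, segmentation is about twice as fast as jieba.

The reason is simple: word frequency and HMM have not been introduced yet.

See BenchmarkTest.java for the code.

Performance comparison chart

The same long text, looped 10,000 times.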

benchmark.png
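The methodology described above (warm up first, then time the same text over many rounds) can be sketched as below. This is not the project's BenchmarkTest.java; BenchmarkSketch, the Segmenter interface, and the per-character placeholder segmenter are hypothetical stand-ins so the example runs without either library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical timing harness; plug the real segmenters in behind Segmenter.
class BenchmarkSketch {

    // Stand-in abstraction for any segmenter under test.
    interface Segmenter {
        List<String> segment(String text);
    }

    // Runs untimed warm-up rounds so JIT compilation and lazy loading are not
    // measured, then returns the elapsed nanoseconds of the timed rounds.
    static long measureNanos(Segmenter segmenter, String text, int warmupRounds, int rounds) {
        long sink = 0; // consume results so the work is not trivially discarded
        for (int i = 0; i < warmupRounds; i++) {
            sink += segmenter.segment(text).size();
        }
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += segmenter.segment(text).size();
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) {
            System.out.println(sink); // never true here; keeps sink observable
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Placeholder segmenter: one token per character.
        Segmenter perChar = new Segmenter() {
            @Override
            public List<String> segment(String text) {
                List<String> out = new ArrayList<String>();
                for (int i = 0; i < text.length(); i++) {
                    out.add(String.valueOf(text.charAt(i)));
                }
                return out;
            }
        };
        long nanos = measureNanos(perChar, "我爱北京,我爱学习。", 1000, 10000);
        System.out.println("elapsed ms: " + nanos / 1000000);
    }
}
```

A hand-rolled loop like this only gives a rough comparison. A dedicated harness such as JMH is the safer choice when the numbers matter.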

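Road map

Core features

  • Correction based on word frequency

  • New-word prediction using the HMM algorithm

  • Common segmentation modes

  • Part-of-speech tagging for stop words, person names, place names, organization names, numbers, and other common categories

Format handling

  • Full-width/half-width handling

  • Traditional/simplified Chinese handling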
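As an illustration of what full-width/half-width handling usually involves, here is a minimal sketch of normalizing full-width characters before segmentation. It relies on the Unicode layout in which full-width ASCII variants (U+FF01 to U+FF5E) sit exactly 0xFEE0 above their half-width counterparts. WidthSketch and toHalfWidth are hypothetical names; the project has not published this feature, so this is not its implementation.

```java
// Hypothetical normalizer; not part of the segment library.
class WidthSketch {

    // Maps full-width ASCII variants (U+FF01..U+FF5E) and the ideographic
    // space (U+3000) to half-width; all other characters are left unchanged.
    static String toHalfWidth(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u3000') {
                sb.append(' ');
            } else if (c >= '\uFF01' && c <= '\uFF5E') {
                sb.append((char) (c - 0xFEE0));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Full-width "ABC123" followed by a full-width comma.
        System.out.println(toHalfWidth("\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13\uFF0C")); // ABC123,
    }
}
```

CJK punctuation such as 。 (U+3002) lies outside the mapped range and is deliberately left as-is.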

Acknowledgements

Thanks to jieba for providing the dictionary, and to jieba-analysis for its implementation.


Origin blog.51cto.com/9250070/2466813