Segment
Segment is a flexible, high-performance Java word segmenter built on the jieba ("stutter") dictionary.
Motivation
Word segmentation is a very basic building block for NLP-related work.
jieba-analysis is a very popular segmentation implementation, and my own project opencc4j previously used it for segmentation.
But as my understanding of segmentation grew, I found that jieba's configuration is not flexible enough in some places.
Many features cannot be switched off individually. For example, HMM is useless for traditional/simplified conversion: the traditional-character vocabulary is fixed, so nothing needs to be predicted.
The latest version also seems to have removed part-of-speech tagging and other features, but these are features I personally need.
So I decided to re-implement it, aiming for a more flexible segmentation framework with more features of its own.
In addition, jieba-analysis updates seem to have stalled, and my implementation differs considerably, so I started a new project.
Features
- High-performance segmentation based on a DFA
- Supports user-defined lexicons
- Supports returning part-of-speech tags
  Off by default and lazily loaded, so it does not affect performance or memory when unused.
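To illustrate what DFA-based segmentation over a lexicon can look like, here is a minimal sketch. It assumes a trie used as the automaton and forward maximum matching as the strategy. The class `DfaSegmentSketch` and its methods are hypothetical names for illustration only; they are not Segment's actual classes, and the library's real matching strategy may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not part of the Segment library. */
final class DfaSegmentSketch {

    /** One automaton state: transitions keyed by character, plus an accepting flag. */
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean wordEnd;
    }

    private final Node root = new Node();

    /** Adds a word to the lexicon, creating states as needed. */
    void addWord(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            Node child = node.next.get(c);
            if (child == null) {
                child = new Node();
                node.next.put(c, child);
            }
            node = child;
        }
        node.wordEnd = true;
    }

    /** Forward maximum matching: take the longest lexicon word at each position, else one character. */
    List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            Node node = root;
            int longestEnd = start + 1; // fall back to a single character
            for (int i = start; i < text.length(); i++) {
                node = node.next.get(text.charAt(i));
                if (node == null) {
                    break; // no transition: stop scanning from this start position
                }
                if (node.wordEnd) {
                    longestEnd = i + 1; // remember the longest match so far
                }
            }
            words.add(text.substring(start, longestEnd));
            start = longestEnd;
        }
        return words;
    }

    public static void main(String[] args) {
        DfaSegmentSketch seg = new DfaSegmentSketch();
        for (String w : Arrays.asList("一个", "伸手不见五指", "黑夜")) {
            seg.addWord(w);
        }
        // prints [这, 是, 一个, 伸手不见五指, 的, 黑夜]
        System.out.println(seg.segment("这是一个伸手不见五指的黑夜"));
    }
}
```

Each input character is followed through at most one transition per scan, which is why this kind of lookup stays cheap even with a large lexicon. A user-defined lexicon, in this sketch, is simply more calls to `addWord`.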
Getting Started
Prerequisites
JDK 1.7+
Maven 3.x+
Maven dependency
<dependency>
<groupId>com.github.houbb</groupId>
<artifactId>segment</artifactId>
<version>${latest-version}</version>
</dependency>
Usage examples
See SegmentBsTest.java for the corresponding code.
Getting words, offsets, and other information
Part-of-speech tagging is not implemented here yet; it is planned for the next version.
final String string = "这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱学习。";
List<ISegmentResult> resultList = SegmentBs.newInstance().segment(string);
Assert.assertEquals("[这[0,1), 是[1,2), 一个[2,4), 伸手不见五指[4,10), 的[10,11), 黑夜[11,13), 。[13,14), 我[14,15), 叫[15,16), 孙悟空[16,19), ,[19,20), 我[20,21), 爱[21,22), 北京[22,24), ,[24,25), 我[25,26), 爱[26,27), 学习[27,29), 。[29,30)]", resultList.toString());
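The `[start,end)` pairs in the result are half-open ranges: the start index is included and the end index is excluded. The sketch below shows how such a range maps back onto the input, assuming the offsets are Java `char` indices (an assumption; the source does not state the unit). `OffsetSketch` and `slice` are hypothetical helpers, not part of the library.

```java
/** Hypothetical helper; not part of the Segment library. */
final class OffsetSketch {

    /** Returns the text covered by the half-open range [start, end). */
    static String slice(String text, int start, int end) {
        // String.substring uses the same convention: start inclusive, end exclusive.
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "这是一个伸手不见五指的黑夜。";
        System.out.println(slice(text, 4, 10));  // prints 伸手不见五指
        System.out.println(slice(text, 13, 14)); // prints 。
    }
}
```

With this convention the length of a word is simply `end - start`, and the end of one word is the start of the next.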
Getting only the words
final String string = "这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱学习。";
List<String> resultList = SegmentBs.newInstance().segmentWords(string);
Assert.assertEquals("[这, 是, 一个, 伸手不见五指, 的, 黑夜, 。, 我, 叫, 孙悟空, ,, 我, 爱, 北京, ,, 我, 爱, 学习, 。]", resultList.toString());
Returning part-of-speech tags
Usage example
Set the wordType attribute to true.
final String string = "我爱学习";
List<ISegmentResult> resultList = SegmentBs
.newInstance()
.wordType(true)
.segment(string);
Assert.assertEquals("[我[0,1)/r, 爱[1,2)/v, 学习[2,4)/v]", resultList.toString());
Part-of-speech codes
The suffixes such as r and v are part-of-speech codes. Their meanings are listed below.
Code | Description |
---|---|
Ag | Adjectival morpheme |
a | Adjective |
ad | Adverbial adjective |
an | Nominal adjective |
b | Distinguishing word |
c | Conjunction |
dg | Adverbial morpheme |
d | Adverb |
e | Interjection |
f | Direction or position word |
g | Morpheme |
h | Prefix component |
i | Idiom (chengyu) |
j | Abbreviation |
k | Suffix component |
l | Fixed expression |
m | Numeral |
ng | Nominal morpheme |
n | Noun |
nr | Person name |
ns | Place name |
nt | Organization name |
nz | Other proper noun |
o | Onomatopoeia |
p | Preposition |
q | Measure word |
r | Pronoun |
s | Place word |
tg | Temporal morpheme |
t | Time word |
u | Particle |
vg | Verbal morpheme |
v | Verb |
vd | Adverbial verb |
vn | Nominal verb |
w | Punctuation |
x | Non-morpheme character |
y | Modal particle |
z | Stative word |
un | Unknown word |
See the corresponding enum class WordTypeEnum.
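Benchmark
Performance comparison
The comparison is based on jieba version 1.0.2. Test conditions were kept identical: both segmenters were warmed up first and then run on the same input.
In these tests, segmentation is roughly twice as fast as jieba.
The reason is simple: word frequencies and HMM have not been introduced yet.
See BenchmarkTest.java for the code.
Performance comparison chart
The same long text, looped 10,000 times.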
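The warm-up-then-measure procedure described above can be sketched roughly as follows. This is not the project's BenchmarkTest.java; `BenchmarkSketch`, its `Segmenter` interface, and the per-character placeholder segmenter are all hypothetical stand-ins, and a serious comparison would more likely use a dedicated harness such as JMH.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical timing sketch; not the project's BenchmarkTest. */
final class BenchmarkSketch {

    /** Stand-in for any segmenter under test (assumed shape, not a real API). */
    interface Segmenter {
        List<String> segment(String text);
    }

    /** Runs warm-up iterations untimed, then returns elapsed milliseconds for the timed loop. */
    static long timeMillis(Segmenter segmenter, String text, int warmUp, int iterations) {
        long sink = 0; // consume results so the calls cannot be optimized away
        for (int i = 0; i < warmUp; i++) {
            sink += segmenter.segment(text).size();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += segmenter.segment(text).size();
        }
        long elapsedNanos = System.nanoTime() - start;
        if (sink == Long.MIN_VALUE) {
            throw new IllegalStateException("unexpected sink value");
        }
        return TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
    }

    public static void main(String[] args) {
        // Placeholder segmenter: one token per character.
        Segmenter perChar = new Segmenter() {
            @Override
            public List<String> segment(String text) {
                List<String> out = new ArrayList<>(text.length());
                for (int i = 0; i < text.length(); i++) {
                    out.add(String.valueOf(text.charAt(i)));
                }
                return out;
            }
        };
        long ms = timeMillis(perChar, "这是一个伸手不见五指的黑夜。", 1000, 10000);
        System.out.println("10,000 iterations took " + ms + " ms");
    }
}
```

To compare two segmenters under the same conditions, each would be wrapped in a `Segmenter` and passed through `timeMillis` with the same text, warm-up count, and iteration count.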
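Road map
Core features
- Correction based on word frequency
- New-word prediction using the HMM algorithm
- Common segmentation modes
- Part-of-speech tagging for common categories: stop words, person names, place names, organization names, numbers, ...
Format handling
- Full-width/half-width handling
- Traditional/simplified Chinese handling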
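As an illustration of what full-width/half-width handling usually involves, here is a minimal sketch that normalizes full-width ASCII variants to their half-width forms. It relies only on the Unicode layout: U+FF01 to U+FF5E mirror U+0021 to U+007E at a fixed offset of 0xFEE0, and U+3000 is the ideographic space. `WidthSketch` is a hypothetical name, and how Segment will actually implement this road-map item is not stated in the source.

```java
/** Hypothetical sketch; not part of the Segment library. */
final class WidthSketch {

    /** Converts full-width ASCII variants and the ideographic space to half-width. */
    static String toHalfWidth(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u3000') {
                sb.append(' '); // ideographic space -> ASCII space
            } else if (c >= '\uFF01' && c <= '\uFF5E') {
                sb.append((char) (c - 0xFEE0)); // fixed offset to the ASCII range
            } else {
                sb.append(c); // everything else, including CJK text, is left unchanged
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHalfWidth("ＡＢＣ１２３，")); // prints ABC123,
    }
}
```

Note that CJK punctuation outside that range, such as 。 (U+3002), is left untouched by this sketch.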
Acknowledgements
Thanks to jieba for providing the dictionary, and to jieba-analysis for its related implementation.