How the different word segmentation tools work
A brief description of each segmentation tool, with specific reference to:
http://www.cnblogs.com/en-heng/p/6234006.html
1) jieba
Specific reference:
https://blog.csdn.net/rav009/article/details/12196623
How jieba's segmentation works
It uses Unigram + HMM; the Unigram model assumes that each word occurs independently of the others.
Specific reference:
http://www.cnblogs.com/en-heng/p/6234006.html
First, a summary of jieba's segmentation method:
jieba first loads the dictionary (the bundled dictionary plus any custom dictionaries) into a trie. For the sentence to be segmented, it scans the trie and builds a DAG (Directed Acyclic Graph) over the fragments that appear in the dictionary. Concretely, the DAG is represented as a Python dict: each key is the index of a character in the sentence that may begin a word, and its value is a list of indices, each marking the last character of a dictionary word starting at the key's position. Choosing among these paths is then converted, via dynamic programming, into a maximum-probability-path problem on the graph, where the weight of an edge is the log of the corresponding word's frequency.
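As a rough illustration of the DAG + dynamic programming step described above (a minimal sketch, not jieba's actual code; the toy dictionary and its frequency counts below are made up):

```python
import math

# Toy dictionary: word -> frequency count (stand-ins for jieba's dict.txt;
# these words and counts are made up for illustration)
FREQ = {"去": 2, "北": 1, "京": 1, "北京": 5, "大": 2, "学": 1, "大学": 4, "北京大学": 3}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """DAG as a dict: index that may start a word -> list of possible end indices."""
    dag = {}
    for k in range(len(sentence)):
        ends = [j for j in range(k, len(sentence)) if sentence[k:j + 1] in FREQ]
        dag[k] = ends or [k]  # an unknown single character falls back to itself
    return dag

def max_prob_path(sentence, dag):
    """Dynamic programming from right to left; edge weight = log word frequency."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for k in range(n - 1, -1, -1):
        route[k] = max(
            (math.log(FREQ.get(sentence[k:j + 1], 1) / TOTAL) + route[j + 1][0], j)
            for j in dag[k]
        )
    words, k = [], 0
    while k < n:  # walk the chosen route to recover the segmentation
        j = route[k][1]
        words.append(sentence[k:j + 1])
        k = j + 1
    return words

print(max_prob_path("去北京大学", build_dag("去北京大学")))  # ['去', '北京大学']
```

Because "北京大学" has a higher log-frequency weight than any path through its individual characters, the whole word wins the maximum-probability path.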
Fragments of characters that do not form dictionary words are regrouped and segmented with the HMM model; note that the Viterbi algorithm in jieba's decoding step is constrained.
For HMM-based segmentation, see specifically:
http://www.cnblogs.com/en-heng/p/6164145.html
jieba uses the four-label BEMS format, which marks each character as the beginning (B), end (E), or middle (M) of a word, or as a single-character word (S). More labels can give higher accuracy, but make training slower.
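For example, mapping a segmented sentence onto BEMS labels can be sketched as follows (a toy helper for illustration, not part of jieba's API):

```python
def words_to_bems(words):
    """Label each character: B/M/E inside a multi-character word, S for a single."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend("B" + "M" * (len(w) - 2) + "E")
    return tags

print(words_to_bems(["北京大学", "的", "学生"]))  # ['B', 'M', 'M', 'E', 'S', 'B', 'E']
```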
For finding new words, the HMM needs three sets of probabilities (initial, transition, and emission), all trained in advance on a large corpus. The initial probabilities can simply be the observed frequency of each tag; the transition and emission probabilities are estimated from a large-scale segmented training corpus (this is the HMM learning problem). Corpora: there are two main sources. One is the segmented 1998 People's Daily corpus and the msr segmentation corpus, both downloadable from the Internet; the other is a collection of txt novels I gathered myself and segmented with ICTCLAS.
When jieba applies the HMM model to segmentation, its Viterbi algorithm has been modified as follows:
To adapt to the Chinese word segmentation task, jieba's Viterbi algorithm adds this constraint: state transitions must satisfy PrevStatus, i.e. the state before B can only be E or S, ... and the final state can only be E or S, marking the end of a word.
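A minimal sketch of such a constrained Viterbi decoder (the PREV table below follows the jieba-style PrevStatus constraint; all probabilities are made-up toy values, not trained parameters):

```python
import math

STATES = "BMES"
# PrevStatus-style constraint: which states may precede each state
PREV = {"B": "ES", "M": "MB", "S": "SE", "E": "BM"}
NEG_INF = float("-inf")

def viterbi(obs, start_p, trans_p, emit_p):
    """Viterbi decoding in log space, restricted to the transitions allowed by
    PREV, with the extra constraint that the path must end in E or S."""
    V = [{s: start_p.get(s, NEG_INF) + emit_p[s].get(obs[0], NEG_INF) for s in STATES}]
    path = {s: [s] for s in STATES}
    for t in range(1, len(obs)):
        V.append({})
        newpath = {}
        for s in STATES:
            prob, prev = max(
                (V[t - 1][p] + trans_p.get((p, s), NEG_INF)
                 + emit_p[s].get(obs[t], NEG_INF), p)
                for p in PREV[s]
            )
            V[t][s] = prob
            newpath[s] = path[prev] + [s]
        path = newpath
    _, final = max((V[-1][s], s) for s in "ES")  # a word must end in E or S
    return path[final]

# Toy, made-up log-probabilities for the two-character input "中国"
start_p = {"B": math.log(0.6), "S": math.log(0.4)}
trans_p = {(a, b): math.log(p) for (a, b), p in {
    ("B", "E"): 0.7, ("B", "M"): 0.3, ("M", "E"): 0.6, ("M", "M"): 0.4,
    ("E", "B"): 0.5, ("E", "S"): 0.5, ("S", "B"): 0.5, ("S", "S"): 0.5}.items()}
emit_p = {"B": {"中": math.log(0.7), "国": math.log(0.3)},
          "M": {"中": math.log(0.1), "国": math.log(0.1)},
          "E": {"中": math.log(0.2), "国": math.log(0.8)},
          "S": {"中": math.log(0.4), "国": math.log(0.2)}}

print(viterbi("中国", start_p, trans_p, emit_p))  # ['B', 'E']
```

The constraint does two things: illegal transitions (e.g. S followed directly by E) are never even considered, and a path that would leave a word unfinished (ending in B or M) cannot be returned.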
2) FoolNLTK
1. Loading a custom dictionary:
import fool
fool.load_userdict('dict/aa.txt')
Note: aa.txt must be GBK-encoded, and each word must be followed by a weight value greater than 1.
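For illustration, such a user-dictionary file might look like this (hypothetical entries; one word plus its weight per line, per the note above):

```
难受香菇 10
蓝瘦 10
```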
2. FoolNLTK segments text based on character features + BiLSTM + CRF.
3) HIT LTP
Training corpus: Weibo
It uses the structured perceptron (Structured Perceptron, SP) method for segmentation, treating segmentation as a sequence labeling problem.
LTP user dictionary: the official documentation states that "LTP's segmentation module does not adopt a dictionary-matching strategy for external dictionaries; instead the dictionary is converted into features (whether a character is the beginning, middle, or end of a dictionary word) that are added to the machine learning algorithm, so it cannot guarantee that every word is segmented according to the dictionary."
Differences among the structured perceptron, CRF, and the plain perceptron
Specific reference:
https://www.zhihu.com/question/51872633
Q: What is the biggest difference between the structured perceptron and CRF? It seems that CRF feature templates can also be used in the structured perceptron.
The main difference between the perceptron (Perceptron) and CRF is that they optimize different objectives. CRF optimizes the log-likelihood function; it is a probabilistic model and therefore has to compute the partition function, which is computationally expensive. The perceptron instead optimizes the score difference between the correct answer and the predicted result (SP models the score function under a maximum entropy criterion; the segmentation output is the label sequence whose score function value is maximal. I don't fully understand this part.) The scoring function is a linear function. Since CRF's potential functions and the perceptron's scoring function are both linear, the feature templates they can use are consistent.
First, the concept of "global learning" is aimed primarily at structured prediction problems (structure prediction), such as sequence labeling or parsing. Unlike simple multiclass classification, structured prediction usually requires a more complex decoding process to obtain the final structured output. The structured perceptron is consistent with the ordinary perceptron learning algorithm; the main difference is that its feature extraction considers the global structured output, and the feature extraction procedure in turn determines whether the model's learning and prediction can be global.
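The optimization difference can be made concrete with a tiny structured perceptron for BEMS-style tagging (a self-contained sketch over made-up toy data, not LTP's implementation; real systems replace the brute-force argmax with Viterbi search):

```python
from itertools import product

LABELS = list("BMES")

def features(chars, tags):
    """Global feature vector Phi(x, y) as a sparse dict: emission + transition counts."""
    phi = {}
    for c, t in zip(chars, tags):
        phi[("emit", c, t)] = phi.get(("emit", c, t), 0) + 1
    for a, b in zip(tags, tags[1:]):
        phi[("trans", a, b)] = phi.get(("trans", a, b), 0) + 1
    return phi

def score(w, phi):
    """Linear scoring function: w . Phi(x, y)."""
    return sum(w.get(f, 0) * v for f, v in phi.items())

def predict(w, chars):
    """Argmax over whole label sequences (brute force, fine for toy inputs; real
    systems use Viterbi here). Note: no partition function is ever computed."""
    return max(product(LABELS, repeat=len(chars)),
               key=lambda tags: score(w, features(chars, tags)))

def train(data, epochs=3):
    w = {}
    for _ in range(epochs):
        for chars, gold in data:
            guess = list(predict(w, chars))
            if guess != gold:
                # Perceptron update: reward gold features, penalize the prediction's
                for f, v in features(chars, gold).items():
                    w[f] = w.get(f, 0) + v
                for f, v in features(chars, guess).items():
                    w[f] = w.get(f, 0) - v
    return w

data = [(list("中国"), ["B", "E"]), (list("人"), ["S"])]
w = train(data)
print(list(predict(w, list("中国"))))  # ['B', 'E']
```

The update touches only the features of the gold and predicted sequences, which is why the structured perceptron avoids the partition-function cost that CRF training pays, while using exactly the same kind of feature templates.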
4) CAS NLPIR
Training corpus: Weibo
Because NLPIR's segmentation is dictionary-based, if the user loads a custom dictionary, the user dictionary takes priority.
Its predecessor was ICTCLAS.
It uses a Bigram Word-Based Generative Model; the Bigram model assumes that the probability of each word depends only on the single word immediately before it.
The Word-Based Generative Model picks the best segmentation by maximizing the joint probability of the word sequence, P(w1, ..., wn) = ∏ P(wi | wi-1); that is, the model is word-based. This is similar to jieba's segmentation: jieba uses Unigram + HMM, while NLPIR uses Bigram + HMM.
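Scoring candidate segmentations with a word bigram model can be sketched as follows (toy, made-up counts with add-one smoothing; not NLPIR's actual statistics or smoothing scheme):

```python
import math

# Made-up counts standing in for statistics from a segmented corpus
UNIGRAM = {"<s>": 10, "北京": 5, "大学": 4, "北": 2, "京": 1, "大": 2, "学": 1}
BIGRAM = {("<s>", "北京"): 4, ("北京", "大学"): 3, ("<s>", "北"): 1,
          ("北", "京"): 1, ("京", "大"): 1, ("大", "学"): 1}
V = len(UNIGRAM)  # vocabulary size for add-one smoothing

def log_prob(seg):
    """log P(w1..wn) = sum_i log P(w_i | w_{i-1}), with add-one smoothing."""
    lp, prev = 0.0, "<s>"
    for w in seg:
        lp += math.log((BIGRAM.get((prev, w), 0) + 1) / (UNIGRAM.get(prev, 0) + V))
        prev = w
    return lp

# Two candidate segmentations of the same sentence; the joint probability decides
candidates = [["北京", "大学"], ["北", "京", "大", "学"]]
best = max(candidates, key=log_prob)
print(best)  # ['北京', '大学']
```

Because the bigram ("北京", "大学") is well attested in the toy counts, the two-word segmentation gets the higher joint probability; a Unigram model (jieba-style) would score each word independently instead of conditioning on the previous one.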
5) Tsinghua THULAC
Training corpus: People's Daily
It uses the same segmentation model as LTP (the structured perceptron).
6) Stanford Chinese segmenter
Its Chinese word segmentation is based on a CRF model.
To be updated tomorrow