NLP: word segmentation, word vectors, and pre-training

The principles behind the different word segmentation tools

Descriptions of the various word segmentation tools; see in particular:

http://www.cnblogs.com/en-heng/p/6234006.html

1)  jieba

See:

https://blog.csdn.net/rav009/article/details/12196623

The principle of jieba word segmentation

jieba uses Unigram + HMM; the Unigram model assumes that each word is independent of the other words.
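As a worked form of that assumption (a standard unigram factorization, stated here for clarity rather than taken from jieba's documentation), a segmentation S = w_1 w_2 ... w_n is scored as

P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i), \qquad S^{*} = \arg\max_{S} \sum_{i=1}^{n} \log P(w_i)

where P(w_i) is estimated from the word frequencies in the dictionary.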

See:

http://www.cnblogs.com/en-heng/p/6234006.html

First, a summary of jieba's segmentation method:

First, the dictionary is loaded (both the dictionary that ships with jieba and any custom dictionaries) and a trie is built from it. For the sentence to be segmented, the substrings that appear in the dictionary are used to build a DAG (Directed Acyclic Graph). Concretely, the DAG is represented as a Python dict: each key is the index in the sentence of a character that can start a word, and its value is a list of indices, each marking the last character of a dictionary word that begins at the key position. Finding the best segmentation is then converted, via dynamic programming, into a maximum-probability-path problem on this graph, where the edge weights are the logarithms of the word frequencies.
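A minimal sketch of this DAG + dynamic-programming step, using a tiny hypothetical frequency table FREQ rather than jieba's real dictionary (jieba's own implementation follows the same idea but is more elaborate):

import math

# Hypothetical toy dictionary: word -> frequency (jieba loads these from its dict file)
FREQ = {"去": 5, "北": 3, "京": 2, "北京": 20, "大": 8, "学": 6, "大学": 30, "北京大学": 15}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    # DAG as a dict: start index -> list of end indices (inclusive) of dictionary words
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in FREQ]
        dag[start] = ends or [start]          # fall back to a single character
    return dag

def max_prob_path(sentence, dag):
    # Dynamic programming from right to left; edge weight = log of the word frequency
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(FREQ.get(sentence[start:end + 1], 1)) - math.log(TOTAL) + route[end + 1][0], end)
            for end in dag[start]
        )
    words, i = [], 0
    while i < n:
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words

sentence = "去北京大学"
print(max_prob_path(sentence, build_dag(sentence)))   # ['去', '北京大学']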

For the fragments made of characters that do not appear in the dictionary, jieba re-segments them with the HMM-based model; note that the Viterbi decoding jieba uses there is constrained (see below).

For HMM-based word segmentation, see:

http://www.cnblogs.com/en-heng/p/6164145.html

jieba uses the four-tag BEMS format, where B, E, and M mark the beginning, end, and middle of a word and S marks a single character that forms a word on its own; for example, "北京大学" as one word is tagged B M M E, while "我" is tagged S. Using more tags could be more accurate, but would make training slower.

The HMM used to find new words needs the probability values for its three problems: the initial probabilities, which can simply be the frequency of each state, and the transition and emission probabilities, which are trained in advance from a large-scale corpus (this is the learning problem). The corpora come from two main sources: one is the 1998 People's Daily segmented corpus and the MSR segmented corpus, both of which can be downloaded from the Internet; the other is a collection of txt novels gathered by the author and segmented with ICTCLAS.
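A minimal sketch of how those three probability tables could be estimated by counting over a segmented corpus (the toy corpus below is made up for illustration, not the corpora mentioned above):

import math
from collections import Counter, defaultdict

def word_to_tags(word):
    # Map one segmented word to its BEMS tag string
    return "S" if len(word) == 1 else "B" + "M" * (len(word) - 2) + "E"

def estimate(corpus):
    # corpus: list of segmented sentences, each a list of words
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sentence in corpus:
        tags = "".join(word_to_tags(w) for w in sentence)
        chars = "".join(sentence)
        start[tags[0]] += 1
        for i, (ch, t) in enumerate(zip(chars, tags)):
            emit[t][ch] += 1
            if i > 0:
                trans[tags[i - 1]][t] += 1
    def norm(counter):
        # Turn raw counts into log probabilities
        total = sum(counter.values())
        return {k: math.log(v / total) for k, v in counter.items()}
    return norm(start), {s: norm(c) for s, c in trans.items()}, {s: norm(c) for s, c in emit.items()}

# Toy segmented corpus (hypothetical)
corpus = [["我", "爱", "北京"], ["北京", "大学", "很", "大"]]
init_p, trans_p, emit_p = estimate(corpus)
print(trans_p)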

When jieba uses the HMM model for segmentation, its Viterbi algorithm is modified as follows:

To adapt to the Chinese word segmentation task, jieba's Viterbi algorithm adds the following constraints: each state transition must satisfy the PrevStatus condition, i.e. the state before B can only be E or S, the state before M can only be M or B, and so on; and the final state can only be E or S, which marks the end of a word.
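A minimal sketch of that constrained Viterbi decoding, reusing the init_p, trans_p, and emit_p tables from the counting sketch above; jieba's real implementation (its finalseg module) differs in details such as smoothing and its pre-trained parameters:

MIN_LOG = -3.14e100   # stands in for log(0)

# Allowed previous states for each state (the PrevStatus constraint)
PREV = {"B": "ES", "M": "MB", "S": "SE", "E": "BM"}

def viterbi(text, init_p, trans_p, emit_p):
    V = [{}]          # V[t][state] = best log probability of a path ending in state at position t
    path = {}
    for s in "BMES":
        V[0][s] = init_p.get(s, MIN_LOG) + emit_p.get(s, {}).get(text[0], MIN_LOG)
        path[s] = [s]
    for t in range(1, len(text)):
        V.append({})
        new_path = {}
        for s in "BMES":
            em = emit_p.get(s, {}).get(text[t], MIN_LOG)
            prob, prev = max(
                (V[t - 1][p] + trans_p.get(p, {}).get(s, MIN_LOG) + em, p) for p in PREV[s]
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    # The last state can only be E or S (the end of a word)
    prob, last = max((V[-1][s], s) for s in "ES")
    return path[last]

print(viterbi("去北京", init_p, trans_p, emit_p))   # e.g. ['S', 'B', 'E']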

2) FoolNLTK

1. Loading a custom dictionary

import fool
fool.load_userdict('dict/aa.txt')

Note: aa.txt must be encoded in GBK, and each word must be followed by a weight value greater than 1.

FoolNLTK segments words based on character features + BiLSTM + CRF.
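A minimal usage sketch; fool.cut is FoolNLTK's segmentation call, and the output shown is only illustrative:

import fool

text = "我在北京大学读书"
print(fool.cut(text))   # segmentation result, e.g. [['我', '在', '北京大学', '读书']]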

3) HIT LTP

Trained on a Weibo corpus

LTP uses the structured perceptron (Structured Perceptron, SP) method for segmentation, treating it as a sequence labeling problem.

LTP user dictionary: the official documentation adds that "LTP's segmentation module does not use a dictionary-matching strategy for external dictionaries; the dictionary is instead converted into features (whether a character is the beginning of a dictionary word, inside a dictionary word, or the end of a dictionary word) that are fed to the machine learning algorithm, so it cannot guarantee that all words are segmented according to the dictionary."

Differences between the structured perceptron, CRF, and the plain perceptron

See:

https://www.zhihu.com/question/51872633

What is the biggest difference between the structured perceptron and CRF? It seems that CRF feature templates can also be used with the structured perceptron.

The main difference between the perceptron and CRF is that they optimize different objectives. CRF is a probabilistic model that optimizes a log-likelihood function, so it has to compute the partition function, which is computationally expensive. The perceptron instead optimizes the score difference between the correct answer and the predicted result (the SP models a score function under a maximum-entropy-style criterion, and the label sequence whose score is maximal is taken as the segmentation result; I did not fully understand the details). The scoring function is a linear function. The CRF potential functions and the perceptron scoring function are both linear, so the feature templates used are consistent between the two.

First, the "global learning" concept is aimed primarily at structural prediction problem (structure prediction), such as sequence labeling or parsing. Unlike simple multivariate classification, the configuration of prediction problems, usually require a more complex decoding process to be able to obtain a final structured output. Structured perceptron is consistent with the common sense in the learning algorithm, the main difference is that the feature extraction structured consider whether the global output. And feature extraction procedure further determines the structure of the learning and predictive models can be global.

4) CAS NLPIR

Trained on a Weibo corpus

Because NLPIR's segmentation is dictionary-based, if the user loads a custom dictionary it will be used with priority.

Its predecessor was ICTCLAS

It uses a Bigram Word-Based Generative Model; the Bigram assumption is that the probability of each word depends only on the single word immediately before it.

The Word-Based Generative Model chooses the segmentation scheme with the maximum joint probability; the formula is a word-based n-gram model. This is similar to jieba's dictionary-based segmentation, except that jieba uses Unigram + HMM while NLPIR uses Bigram + HMM.
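A worked form of that bigram objective (standard n-gram notation, not quoted from NLPIR's documentation):

S^{*} = \arg\max_{S = w_1 w_2 \dots w_m} \prod_{i=1}^{m} P(w_i \mid w_{i-1})

where w_0 is a sentence-start symbol; the unigram version used by jieba drops the conditioning and reduces to \prod_i P(w_i).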

5) Tsinghua THULAC

Trained on a People's Daily corpus

It uses the same segmentation model as LTP (the structured perceptron).

6) Stanford Chinese word segmentation

Its Chinese word segmentation is based on a CRF model.

To be updated tomorrow
