Yesterday, on the AINLP public account, I shared Le Yuquan's contributed article "Those Things About Word Segmentation". Some readers commented that it left them wanting more. Thinking it over, I realized I have accumulated quite a lot of natural language processing material on the blog. On Chinese word segmentation in particular, apart from deep-learning-based methods, which I have not yet covered, the "classical" machine learning era is well represented: from dictionary-based segmentation (the maximum matching method) to statistics-based methods (HMM, maximum entropy models, conditional random fields (CRF)), plus MeCab and NLTK for Chinese segmentation. Looking back, these articles are about ten years old and seem a little immature now, so they may not be suitable for reposting on the public account. Instead, here is an index; interested readers can read them on the blog, and most of them come with code for reference.
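As a taste of the dictionary-based approach the series starts from, here is a minimal sketch of forward maximum matching; the toy dictionary and example sentence below are purely illustrative, not from the original articles:

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Dictionary-based forward maximum matching: at each position,
    greedily take the longest dictionary word that matches."""
    tokens = []
    i = 0
    while i < len(sentence):
        matched = None
        # try the longest candidate first, shrinking one character at a time
        for j in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + j]
            if candidate in dictionary:
                matched = candidate
                break
        if matched is None:  # no dictionary hit: fall back to a single character
            matched = sentence[i]
        tokens.append(matched)
        i += len(matched)
    return tokens

# toy dictionary and sentence (illustrative only)
dictionary = {"研究", "研究生", "生命", "命", "的", "起源"}
print(forward_max_match("研究生命的起源", dictionary))  # → ['研究生', '命', '的', '起源']
```

Note how greedy matching grabs 研究生 and mis-segments the classic example 研究生命的起源 (the intended reading is 研究/生命/的/起源) — exactly the kind of limitation that motivates the statistical methods in the later articles.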
Introduction to Chinese Word Segmentation Series
Introduction to Chinese Word Segmentation: The Maximum Matching Method
Introduction to Chinese Word Segmentation: Maximum Matching Method, Extension 1
Introduction to Chinese Word Segmentation: Maximum Matching Method, Extension 2
Introduction to Chinese Word Segmentation
Resources for getting started with Chinese word segmentation
Literature for Getting Started with Chinese Word Segmentation
Chinese word segmentation method based on character tagging
Introduction to Chinese Word Segmentation: Character Tagging 1
Introduction to Chinese Word Segmentation: Character Tagging 2
Introduction to Chinese Word Segmentation: Character Tagging 3
Introduction to Chinese Word Segmentation: Character Tagging 4
Introduction to Chinese Word Segmentation: Full-Text Document
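The character-tagging method covered in the four articles above recasts segmentation as labeling each character B/M/E/S (begin, middle, or end of a multi-character word, or a single-character word). A minimal sketch of recovering words from such a tag sequence; the example sentence and tags are illustrative:

```python
def tags_to_words(chars, tags):
    """Convert a BMES tag sequence over characters back into words:
    B starts a word, M continues it, E closes it, S is a one-char word."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            buf = ch
        elif tag == "M":
            buf += ch
        else:  # "E"
            words.append(buf + ch)
            buf = ""
    return words

# illustrative example: 我/爱/北京 tagged S S B E
print(tags_to_words("我爱北京", "SSBE"))  # → ['我', '爱', '北京']
```

Once segmentation is framed this way, any sequence labeling model — HMM, maximum entropy, or CRF — can produce the tag sequence.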
Use MeCab to create a practical Chinese word segmentation system
Use MeCab to create a practical Chinese word segmentation system (2)
Use MeCab to create a practical Chinese word segmentation system (3): MeCab-Chinese
Use MeCab to create a practical Chinese word segmentation system (4): MeCab incremental update
Python Natural Language Processing in Practice: Using the Stanford Chinese Word Segmenter in NLTK
Two documents translated from Japanese by rickjin, both very helpful:
Darts: Double-ARray Trie System (translated documentation)
Documentation for MeCab, the Japanese tokenizer (translated)
Articles on Chinese word segmentation shared by other readers on the 52nlp blog — thanks to all of you:
Beginner Report: Implementing a Maximum Matching Word Segmentation Algorithm
Beginner Report (2): Implementing a 1-gram Word Segmentation Algorithm
Beginner Report (3): Understanding the CRF Chinese Word Segmentation Decoding Process
Itenyh Edition — Chinese Word Segmentation with an HMM, Part 1: Preface
Itenyh Edition — Chinese Word Segmentation with an HMM, Part 2: Model Preparation
Itenyh Edition — Chinese Word Segmentation with an HMM, Part 3: The Cost of the Forward Algorithm and the Viterbi Algorithm
Itenyh Edition — Chinese Word Segmentation with an HMM, Part 4: A Pure-HMM Segmenter
Itenyh Edition — Chinese Word Segmentation with an HMM, Part 5: A Hybrid Segmenter
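The pure-HMM segmenter in the series above decodes the best B/M/E/S tag sequence with the Viterbi algorithm. A minimal log-space sketch follows; the start, transition, and emission probabilities below are hand-set toy values for illustration, not trained parameters:

```python
import math

STATES = ["B", "M", "E", "S"]  # begin / middle / end of a word, or single-char word
NEG_INF = -1e9                 # stand-in for log(0)

def viterbi(obs, start_p, trans_p, emit_p):
    """Standard Viterbi decoding in log space; returns the best tag path."""
    # V[t][s] = best log-prob of any path ending in state s at time t
    V = [{s: start_p.get(s, NEG_INF) + emit_p[s].get(obs[0], NEG_INF)
          for s in STATES}]
    path = {s: [s] for s in STATES}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in STATES:
            # best previous state for reaching s at time t
            prev = max(STATES, key=lambda p: V[t - 1][p] + trans_p[p].get(s, NEG_INF))
            V[t][s] = (V[t - 1][prev] + trans_p[prev].get(s, NEG_INF)
                       + emit_p[s].get(obs[t], NEG_INF))
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(STATES, key=lambda s: V[-1][s])
    return path[best]

# hand-set toy parameters (illustrative, not trained)
start_p = {"B": math.log(0.5), "S": math.log(0.5)}
trans_p = {
    "B": {"M": math.log(0.3), "E": math.log(0.7)},
    "M": {"M": math.log(0.3), "E": math.log(0.7)},
    "E": {"B": math.log(0.5), "S": math.log(0.5)},
    "S": {"B": math.log(0.5), "S": math.log(0.5)},
}
emit_p = {
    "B": {"北": math.log(0.6), "京": math.log(0.1)},
    "M": {"北": math.log(0.1), "京": math.log(0.1)},
    "E": {"北": math.log(0.1), "京": math.log(0.6)},
    "S": {"北": math.log(0.2), "京": math.log(0.2)},
}
print(viterbi("北京", start_p, trans_p, emit_p))  # → ['B', 'E']
```

In a real segmenter the three probability tables are estimated from a tagged corpus; the hybrid segmenter in Part 5 of the series then combines this with other signals.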
Finally, a few more words on data resources for Chinese word segmentation. Research in this area goes back a long way and has produced many methods, but practical experience suggests that a good lexicon resource may matter even more. So, to close, here is a collection of Chinese word segmentation resources, including the full-text PDF of the character-tagging introduction and lexicon resources shared by others on the web. Interested readers can follow AINLP and reply "fenci" to receive them: