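Contents

Introduction
Rule-based segmentation
Statistical segmentation
Hybrid segmentation
- Jieba segmentation
- - High-frequency word extraction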

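Introduction

Chinese word segmentation methods fall into three groups: rule-based segmentation, statistical segmentation, and hybrid segmentation (rules + statistics).

Rule-based segmentation relies on a manually built lexicon and splits text by matching against it in a fixed way, but it handles new (out-of-vocabulary) words poorly.

Statistical methods, which arrived with the rise of statistical machine learning, cope better with new-word discovery but depend heavily on the quality of the corpus.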

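Rule-based segmentation

A dictionary is maintained. When a sentence is segmented, its substrings are matched one by one against the dictionary entries: a substring that is found is split off as a word, otherwise it is not.

The main variants are forward maximum matching, reverse maximum matching, and bidirectional maximum matching.

Forward maximum matching (MM)

Basic idea: if the longest word in the dictionary has i Chinese characters, take the first i characters of the string still to be processed as the matching field.

If the match succeeds, the matching field is split off as a word.

If the match fails, drop the last character of the matching field and try again.

Repeat until a match succeeds or the length of the matching field reaches 0.

Take “南京市长江大桥” (Nanjing Yangtze River Bridge) as an example. If the dictionary contains “南京市长” (mayor of Nanjing) and “长江大桥” (Yangtze River Bridge), and the longest word has 5 characters, the sentence is split into “南京市长”, “江”, “大桥”. This is clearly not what we want.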

# Forward maximum matching
class MM(object):
    def __init__(self, dic_path):
        self.dictionary = set()
        self.maximum = 0  # length of the longest dictionary word
        # Load one word per line, skipping blank lines
        with open(dic_path, 'r', encoding='utf8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                self.dictionary.add(line)
                if len(line) > self.maximum:
                    self.maximum = len(line)

    def cut(self, text):
        result = []
        index = 0
        length = len(text)
        while index < length:
            word = None
            # Try the longest candidate first; each miss drops the last character
            for size in range(self.maximum, 0, -1):
                if index + size > length:
                    continue
                piece = text[index:(index + size)]
                if piece in self.dictionary:
                    word = piece
                    result.append(word)
                    index += size
                    break
            # No dictionary word starts here: emit a single character
            if word is None:
                result.append(text[index])
                index += 1
        return result

def main():
    text = "南京市长江大桥"
    tokenizer = MM('imm_dic.utf8')
    print(tokenizer.cut(text))
main()

Output: ['南京市长', '江', '大桥']

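Reverse maximum matching (RMM)

Similar to MM, except that matching starts from the end of the document, and on a failed match the first character of the matching field is dropped.

Correspondingly, the classic formulation uses a reversed dictionary and reverses the document before processing. The implementation below gets the same effect by slicing from the end of the text.

Chinese has many modifier-head constructions, so matching from back to front can improve accuracy somewhat.

Reported statistics put the error rate at 1/169 for MM alone and 1/245 for RMM alone.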

# Reverse maximum matching
class RMM(object):
    def __init__(self, dic_path):
        self.dictionary = set()
        self.maximum = 0  # length of the longest dictionary word
        # Load one word per line, skipping blank lines
        with open(dic_path, 'r', encoding='utf8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                self.dictionary.add(line)
                if len(line) > self.maximum:
                    self.maximum = len(line)

    def cut(self, text):
        result = []
        index = len(text)
        while index > 0:
            word = None
            # Try the longest candidate ending at index; each miss drops the first character
            for size in range(self.maximum, 0, -1):
                if index - size < 0:
                    continue
                piece = text[(index - size):index]
                if piece in self.dictionary:
                    word = piece
                    result.append(word)
                    index -= size
                    break
            # No dictionary word ends here: emit a single character
            if word is None:
                result.append(text[index-1])
                index -= 1
        # Words were collected from back to front, so restore reading order
        return result[::-1]

def main():
    text = "南京市长江大桥"
    tokenizer = RMM('imm_dic.utf8')
    print(tokenizer.cut(text))
main()

Output: ['南京市', '长江大桥']
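The RMM class slices the text from the end rather than reversing anything. For comparison, here is a minimal sketch of the reversed-dictionary formulation described earlier: reverse the text and every dictionary entry, run forward matching, then reverse back. The names `forward_max_match`, `rmm_by_reversal`, and `toy_dict` are illustrative assumptions, and `toy_dict` only stands in for the unknown contents of imm_dic.utf8.

```python
def forward_max_match(text, dictionary, max_len):
    """Greedy forward maximum matching over an in-memory dictionary."""
    result, index = [], 0
    while index < len(text):
        # Try the longest candidate first, shrinking from the right.
        for size in range(min(max_len, len(text) - index), 0, -1):
            piece = text[index:index + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                result.append(piece)
                index += size
                break
    return result


def rmm_by_reversal(text, dictionary):
    """RMM expressed as: reverse everything, match forward, reverse back."""
    reversed_dict = {word[::-1] for word in dictionary}
    max_len = max((len(word) for word in dictionary), default=1)
    reversed_tokens = forward_max_match(text[::-1], reversed_dict, max_len)
    # Undo both reversals: the token order and the characters in each token.
    return [token[::-1] for token in reversed_tokens[::-1]]


# Hypothetical toy dictionary (an assumption, not the real imm_dic.utf8).
toy_dict = {"南京市", "南京市长", "长江大桥", "大桥"}
print(rmm_by_reversal("南京市长江大桥", toy_dict))  # ['南京市', '长江大桥']
```

With this toy dictionary the reversal-based version gives the same split as the slicing-based class.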

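Bidirectional maximum matching

Compare the MM and RMM results and, following the maximum matching principle, keep the segmentation with the fewest words.

Sun M.S. (1995) reports that both MM and RMM are wrong for only about 1% of sentences.

Bidirectional maximum matching is widely used in practical Chinese information processing systems.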

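# Bidirectional maximum matching (uses the MM and RMM classes defined above)
def main():
    text = "南京市长江大桥"
    tokenizer1 = MM('imm_dic.utf8')
    tokenizer2 = RMM('imm_dic.utf8')
    res1 = tokenizer1.cut(text)
    res2 = tokenizer2.cut(text)
    # Keep the segmentation with fewer words; ties go to the RMM result
    if len(res1) < len(res2):
        print(res1)
    else:
        print(res2)
main()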

Output: ['南京市', '长江大桥']
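The comparison can also be packaged as a reusable function that works on an in-memory dictionary. The sketch below is one possible arrangement, not the author's implementation. The tie-break (prefer the result with fewer single-character tokens, otherwise the reverse result) is a commonly used heuristic added here as an assumption, and `fmm`, `rmm`, `bidirectional`, and `toy_dict` are hypothetical names.

```python
def fmm(text, dictionary, max_len):
    """Forward maximum matching; unmatched characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if size == 1 or text[i:i + size] in dictionary:
                tokens.append(text[i:i + size])
                i += size
                break
    return tokens


def rmm(text, dictionary, max_len):
    """Reverse maximum matching, scanning from the end of the text."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            if size == 1 or text[j - size:j] in dictionary:
                tokens.append(text[j - size:j])
                j -= size
                break
    return tokens[::-1]


def count_singles(tokens):
    """Number of single-character tokens in a segmentation."""
    return sum(1 for token in tokens if len(token) == 1)


def bidirectional(text, dictionary):
    max_len = max((len(word) for word in dictionary), default=1)
    forward = fmm(text, dictionary, max_len)
    backward = rmm(text, dictionary, max_len)
    # Primary rule from the text: the segmentation with fewer words wins.
    if len(forward) != len(backward):
        return min(forward, backward, key=len)
    # Assumed tie-break: fewer single-character tokens, else the reverse result.
    if count_singles(forward) < count_singles(backward):
        return forward
    return backward


# Hypothetical toy dictionary (an assumption, not the real imm_dic.utf8).
toy_dict = {"南京市", "南京市长", "长江大桥", "大桥"}
print(bidirectional("南京市长江大桥", toy_dict))  # ['南京市', '长江大桥']
```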

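Statistical segmentation

Main idea: treat a word as a sequence of characters. The more often a run of adjacent characters appears across different texts, the more likely it is to be a word.

The frequency with which characters occur next to each other measures how reliably they form a word. When it exceeds a threshold, the combination is treated as a likely word.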
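As a rough illustration of the threshold idea, the sketch below counts adjacent character pairs in a tiny corpus and keeps the pairs that reach a cutoff. The corpus, the function name `frequent_bigrams`, the restriction to two-character candidates, and the use of `>=` for the cutoff are all assumptions made for the example. Real systems typically combine frequency with other statistics.

```python
from collections import Counter


def frequent_bigrams(sentences, threshold):
    """Count adjacent character pairs and keep those seen at least `threshold` times."""
    counts = Counter()
    for sentence in sentences:
        # zip pairs each character with its right-hand neighbour.
        for left, right in zip(sentence, sentence[1:]):
            counts[left + right] += 1
    return {pair: n for pair, n in counts.items() if n >= threshold}


# Hypothetical three-sentence corpus, for illustration only.
toy_corpus = ["我爱北京", "北京很大", "我在北京"]
print(frequent_bigrams(toy_corpus, threshold=3))  # {'北京': 3}
```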

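Building a language model
Split the sentence into candidate word sequences, compute the probability of each candidate, and keep the segmentation with the highest probability. Hidden Markov models (HMM) and conditional random fields (CRF) are among the models used here.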
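One common way to write this down is with an n-gram language model. The first equality is the chain rule. The bigram approximation that follows is an illustrative assumption rather than something the text above prescribes. Here $W = w_1 \dots w_m$ ranges over the set $\mathcal{S}$ of candidate segmentations of the character string $c_1 \dots c_n$.

```latex
P(w_1, w_2, \dots, w_m)
  = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
  \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1}),
\qquad
\hat{W} = \arg\max_{W \in \mathcal{S}(c_1 \dots c_n)} P(W)
```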

[TODO: HMM model]

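Hybrid segmentation

The most common approach is to segment with a dictionary first and then use statistical segmentation as a supplement. The Jieba segmentation tool is based on this approach.

Jieba segmentation

Jieba combines rule-based and statistical methods.

It first scans the sentence against a prefix dictionary to build a directed acyclic graph (DAG) of all possible segmentations. It then uses dynamic programming, with word frequencies derived from a corpus, to find the maximum-probability path through the graph.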
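The two steps can be sketched in a few lines. This is a simplified illustration loosely modeled on the approach described above, not Jieba's actual code: it uses a plain frequency dictionary instead of a prefix dictionary, and the frequencies in `toy_freq` are invented for the example. `build_dag` and `best_path` are hypothetical names.

```python
import math


def build_dag(sentence, freq):
    """Map each start index to the end indices of every dictionary word starting there."""
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start, len(sentence))
                if sentence[start:end + 1] in freq]
        dag[start] = ends or [start]  # fall back to the single character
    return dag


def best_path(sentence, freq):
    """Right-to-left dynamic programming for the maximum log-probability path."""
    dag = build_dag(sentence, freq)
    total = sum(freq.values())
    n = len(sentence)
    # route[i] = (best log-probability of sentence[i:], end index of the first word)
    route = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            # Unknown single characters are given a pseudo-count of 1.
            (math.log(freq.get(sentence[start:end + 1], 1) / total)
             + route[end + 1][0], end)
            for end in dag[start]
        )
    # Follow the stored end indices from the start of the sentence.
    tokens, i = [], 0
    while i < n:
        end = route[i][1]
        tokens.append(sentence[i:end + 1])
        i = end + 1
    return tokens


# Hypothetical word frequencies, invented for illustration.
toy_freq = {"南京市": 40, "南京市长": 5, "长江大桥": 20, "大桥": 30, "江": 10}
print(build_dag("南京市长江大桥", toy_freq)[0])  # [2, 3]
print(best_path("南京市长江大桥", toy_freq))     # ['南京市', '长江大桥']
```

Unlike greedy maximum matching, the dynamic program scores whole paths, so the forward ambiguity at “南京市长” is resolved by the frequencies rather than by scan direction.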

For out-of-vocabulary words, Jieba uses an HMM based on the ability of Chinese characters to form words, decoded with the Viterbi algorithm.
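To make the decoding step concrete, here is a minimal Viterbi sketch over the four position tags B (begin), M (middle), E (end), and S (single-character word). Every probability below is a made-up toy value, not one of Jieba's trained parameters, and the function names are hypothetical. The sketch also assumes a non-empty input.

```python
import math

STATES = "BMES"  # Begin, Middle, End of a word, or Single-character word
NEG_INF = -1e9   # stand-in for log(0)


def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {key: math.log(p) for key, p in table.items()}


def viterbi(observations, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for a non-empty string (log-space)."""
    scores = [{s: start_p.get(s, NEG_INF) + emit_p[s].get(observations[0], NEG_INF)
               for s in STATES}]
    back = [{}]
    for t in range(1, len(observations)):
        scores.append({})
        back.append({})
        for s in STATES:
            # Best previous state for reaching s at position t.
            prev, best = max(
                ((p, scores[t - 1][p] + trans_p[p].get(s, NEG_INF)) for p in STATES),
                key=lambda item: item[1],
            )
            scores[t][s] = best + emit_p[s].get(observations[t], NEG_INF)
            back[t][s] = prev
    # A sentence cannot end inside a word, so only E and S are valid final tags.
    state = max("ES", key=lambda s: scores[-1][s])
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]


def tags_to_words(text, tags):
    """Cut the text after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Toy parameters, invented for illustration only.
start_p = logs({"B": 0.6, "S": 0.4})
trans_p = {"B": logs({"M": 0.3, "E": 0.7}), "M": logs({"M": 0.3, "E": 0.7}),
           "E": logs({"B": 0.6, "S": 0.4}), "S": logs({"B": 0.6, "S": 0.4})}
emit_p = {"B": logs({"北": 0.8, "很": 0.1, "大": 0.1}), "M": {},
          "E": logs({"京": 0.8, "大": 0.2}), "S": logs({"很": 0.5, "大": 0.5})}

text = "北京很大"
tags = viterbi(text, start_p, trans_p, emit_p)
print(tags)                       # ['B', 'E', 'S', 'S']
print(tags_to_words(text, tags))  # ['北京', '很', '大']
```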

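Three segmentation modes:

Precise mode: tries to cut the sentence as accurately as possible; suited to text analysis.
Full mode: scans out every substring that can form a word; very fast, but cannot resolve ambiguity.
Search engine mode: builds on precise mode and splits long words again to improve recall; suited to search engine tokenization.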

import jieba
sent = '中文分词是文本处理不可或缺的一步！'
seg_list = jieba.cut(sent, cut_all=True)
print('Full mode:', '/ '.join(seg_list))

seg_list = jieba.cut(sent, cut_all=False)
print('Precise mode:', '/ '.join(seg_list))

seg_list = jieba.cut(sent)
print("Default (precise) mode:", '/ '.join(seg_list))

seg_list = jieba.cut_for_search(sent)
print('Search engine mode:', '/ '.join(seg_list))

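High-frequency word extraction

High-frequency word extraction is essentially the TF (term frequency) strategy.

The code has three steps:
get_content: read the text of a file.
cut: segment the text into split_words, ignoring some words (punctuation and single characters).
get_TF: count the words in a dictionary using the dict get method, then sort by count.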

import glob

import jieba
import numpy as np

def get_content(path):
    # Read a file and join its stripped lines into one string
    with open(path, 'r', encoding='utf8', errors='ignore') as f:
        content = ''
        for l in f:
            l = l.strip()
            content += l
        return content

def get_TF(words, topK=10):
    tf_dic = {}
    for w in words:
        tf_dic[w] = tf_dic.get(w, 0) + 1  # current count of w, defaulting to 0 if unseen
    return sorted(tf_dic.items(), key=lambda x: x[1], reverse=True)[:topK]

def stop_words(path):  # words to skip; returns a list
    with open(path, encoding='utf8') as f:
        return [l.strip() for l in f]

def cut(content):
    # Drop punctuation and single-character tokens; a stop_words() list could be used here instead
    split_words = [x for x in jieba.cut(content) if x not in ['，', '。', '、'] and len(x) > 1]
    return split_words

def main():
    files = glob.glob('news.txt')  # the pattern may match multiple files
    corpus = [get_content(x) for x in files]

    sample_inx = np.random.randint(0, len(corpus))  # pick a random file; only one is used here
    split_words = cut(corpus[sample_inx])
    print('Top-K (10) words of the sample: ' + str(get_TF(split_words)))

main()

Note: np.random.randint samples from the half-open interval [low, high), whereas random.randint samples from the closed interval [a, b].
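As an alternative to the manual dictionary counting in get_TF, the standard library's collections.Counter can do the counting and the top-K selection, and the filter can take a stop-word list directly. This is a sketch under assumptions: `top_k_terms` is a hypothetical helper, and the token list stands in for the output of jieba.cut so that the example runs without a corpus file.

```python
from collections import Counter


def top_k_terms(words, stop_words=(), top_k=10):
    """Return the `top_k` most frequent words, skipping stop words and single characters."""
    stop = set(stop_words)  # set lookup is O(1) per word
    counts = Counter(w for w in words if len(w) > 1 and w not in stop)
    return counts.most_common(top_k)


# Hypothetical pre-segmented tokens, standing in for jieba.cut output.
tokens = ["中文", "分词", "的", "中文", "处理", "分词", "中文"]
print(top_k_terms(tokens, stop_words=["处理"], top_k=2))  # [('中文', 3), ('分词', 2)]
```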
