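Rule-based segmentation: a mechanical approach that relies on a maintained dictionary. To segment a sentence, each candidate substring is compared against the dictionary entries to decide where to cut, which is slow. By matching direction, the main variants are forward maximum matching, reverse maximum matching, and bidirectional maximum matching.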

1. Forward maximum matching

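Scanning from left to right, take the first m characters of the Chinese sentence to be segmented as the matching field, where m is the length of the longest entry in the dictionary;

Look the field up in the dictionary. If it matches, extract it as a word. If it does not, drop its last character and look up the shortened field again, repeating until a match is found or only one character is left (which is then emitted as a word). Continue from the next unsegmented character until the whole sentence has been processed.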
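The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production segmenter: the function name `forward_max_match` and the toy dictionary are assumptions made for the example, and a real dictionary would be far larger.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text by forward maximum matching against a set of words."""
    if max_len is None:
        # m: length of the longest dictionary entry
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    start = 0
    while start < len(text):
        # Take up to max_len characters starting at the current position.
        end = min(start + max_len, len(text))
        # Drop the last character until the field is in the dictionary
        # or only a single character remains.
        while end - start > 1 and text[start:end] not in dictionary:
            end -= 1
        words.append(text[start:end])
        start = end
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
print(forward_max_match("南京市长江大桥", toy_dict))
```

Because the longest candidate is tried first, "南京市" wins over "南京" here, which is the greedy behaviour the method is named after.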

2. Reverse maximum matching
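Reverse maximum matching is commonly described as the mirror image of the forward method: the matching field is taken from the end of the sentence, and on a failed lookup the first character is dropped instead of the last. A minimal sketch under that assumption (the function name `reverse_max_match` and the toy dictionary are illustrative, not from the original notes):

```python
def reverse_max_match(text, dictionary, max_len=None):
    """Segment text by reverse (right-to-left) maximum matching."""
    if max_len is None:
        # m: length of the longest dictionary entry
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    end = len(text)
    while end > 0:
        # Take up to max_len characters ending at the current position.
        start = max(end - max_len, 0)
        # Drop the first character until the field is in the dictionary
        # or only a single character remains.
        while end - start > 1 and text[start:end] not in dictionary:
            start += 1
        words.append(text[start:end])
        end = start
    # Words were collected right to left, so restore reading order.
    words.reverse()
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命", "的", "起源"}
print(reverse_max_match("研究生命的起源", toy_dict))
```

On this sentence a left-to-right scan would greedily take "研究生" first, whereas the right-to-left scan keeps "生命" intact.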

3. Bidirectional maximum matching
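Bidirectional maximum matching runs both scans and keeps one of the two results. The selection rule below (fewer words first, then fewer single-character words, then the reverse result on a tie) is a commonly used heuristic and is an assumption here, since the notes do not specify one. All function names and the toy dictionary are illustrative.

```python
def max_match(text, dictionary, reverse=False):
    """Forward (default) or reverse maximum matching against a set of words."""
    max_len = max((len(w) for w in dictionary), default=1)
    words = []
    if not reverse:
        start = 0
        while start < len(text):
            end = min(start + max_len, len(text))
            while end - start > 1 and text[start:end] not in dictionary:
                end -= 1
            words.append(text[start:end])
            start = end
    else:
        end = len(text)
        while end > 0:
            start = max(end - max_len, 0)
            while end - start > 1 and text[start:end] not in dictionary:
                start += 1
            words.insert(0, text[start:end])
            end = start
    return words


def bidirectional_max_match(text, dictionary):
    """Pick between the forward and reverse results (assumed heuristic)."""
    fwd = max_match(text, dictionary)
    bwd = max_match(text, dictionary, reverse=True)
    # Prefer the segmentation with fewer words.
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    # Same word count: prefer fewer single-character words,
    # falling back to the reverse result on a tie.
    singles_fwd = sum(len(w) == 1 for w in fwd)
    singles_bwd = sum(len(w) == 1 for w in bwd)
    return fwd if singles_fwd < singles_bwd else bwd


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命", "的", "起源"}
print(bidirectional_max_match("研究生命的起源", toy_dict))
```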

4. Bag-of-words model

Three steps: tokenizing, counting (tallying each word's feature value), and normalizing.

The model only looks at word frequency; it ignores word order and context.
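A small pure-Python sketch makes the loss of context visible. The helper `bag_of_words` is hypothetical and uses plain whitespace tokenization, which is an assumption for the example; it covers only the tokenizing and counting steps.

```python
from collections import Counter


def bag_of_words(docs):
    """Return a sorted vocabulary and one count vector per document."""
    # Tokenizing: lowercase and split on whitespace (a simplifying assumption).
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        # Counting: how often each vocabulary word occurs in this document.
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


vocab, vectors = bag_of_words(["the dog bites the man",
                               "the man bites the dog"])
print(vocab)
print(vectors)
```

The two sentences mean opposite things, yet they produce identical vectors, because only the counts survive.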

scikit-learn's CountVectorizer class handles this: it counts term frequencies and turns the text into vectors.

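from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ["I come to China to travel",
          "This is a car polupar in China",
          "I love tea and Apple ",
          "The work is to write some papers in science"]
print('Document index and term counts')
# "(0, 16) 1" means: document 0, vocabulary index 16, count 1
doc = vectorizer.fit_transform(corpus)
print(doc)
print('Document-term matrix')
print(doc.toarray())
print('Vocabulary')
# "I" and "a" are missing because the default token_pattern only keeps tokens
# of two or more characters; no stop-word list is applied by default.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(vectorizer.get_feature_names_out().tolist())

# Output:
Document index and term counts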
  (0, 16)	1
  (0, 3)	1
  (0, 15)	2
  (0, 4)	1
  (1, 5)	1
  (1, 9)	1
  (1, 2)	1
  (1, 6)	1
  (1, 14)	1
  (1, 3)	1
  (2, 1)	1
  (2, 0)	1
  (2, 12)	1
  (2, 7)	1
  (3, 10)	1
  (3, 8)	1
  (3, 11)	1
  (3, 18)	1
  (3, 17)	1
  (3, 13)	1
  (3, 5)	1
  (3, 6)	1
  (3, 15)	1
Document-term matrix
[[0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 2 1 0 0]
 [0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0]
 [1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 1 1]]
Vocabulary
['and', 'apple', 'car', 'china', 'come', 'in', 'is', 'love', 'papers', 'polupar', 'science', 'some', 'tea', 'the', 'this', 'to', 'travel', 'work', 'write']

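scikit-learn's HashingVectorizer class implements an algorithm based on the signed hash trick. It fixes a hash dimension much smaller than the vocabulary size, which can be seen as a form of dimensionality reduction: every word is hashed to one position and its count is accumulated there, so some distinct words end up added together. The result contains negative numbers because the hash also assigns each word a sign of +1 or -1. As with PCA, the features produced by the hash trick no longer carry a name or meaning, so we cannot tell what each column stands for, and the hash trick is therefore not very interpretable.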

from sklearn.feature_extraction.text import HashingVectorizer

corpus = ["I come to China to travel",
          "This is a car polupar in China",
          "I love tea and Apple ",
          "The work is to write some papers in science"]

vectorizer2 = HashingVectorizer(n_features=6, norm=None)
print(vectorizer2.fit_transform(corpus))
# Output:
  (0, 1)	2.0
  (0, 2)	-1.0
  (0, 4)	1.0
  (0, 5)	-1.0
  (1, 0)	1.0
  (1, 1)	1.0
  (1, 2)	-1.0
  (1, 5)	-1.0
  (2, 0)	2.0
  (2, 5)	-2.0
  (3, 0)	0.0
  (3, 1)	4.0
  (3, 2)	-1.0
  (3, 3)	1.0
  (3, 5)	-1.0
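To make the signed hashing idea concrete, here is a hand-rolled sketch. It is not scikit-learn's implementation: it uses MD5 from `hashlib` as a stand-in hash (an assumption chosen for reproducibility), so the numbers will not match the HashingVectorizer output above. The function name `signed_hash_vector` is hypothetical.

```python
import hashlib


def signed_hash_vector(tokens, n_features=6):
    """Map tokens into a fixed-size vector with a signed hash (illustrative)."""
    vec = [0.0] * n_features
    for tok in tokens:
        # Stand-in hash: first 8 bytes of the MD5 digest as an integer.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The bucket index comes from the hash value modulo the vector size.
        index = h % n_features
        # The top bit of the hash decides the sign: +1 or -1.
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0
        # Different words that share a bucket are simply added together.
        vec[index] += sign
    return vec


tokens = "i come to china to travel".split()
print(signed_hash_vector(tokens))
```

Since the column index is just a hash bucket, there is no way to map a column back to a single word, which is the interpretability cost described above.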

5. Term frequency-inverse document frequency (TF-IDF)

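Words that occur very frequently, such as "to" and "from", tend to carry little importance. IDF captures a word's importance by reflecting how widely the word appears across all documents: a word that appears in many documents, such as "to", should have a low IDF value, while a word that appears in only a few documents, for example a technical term such as "Machine Learning", should have a high one.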

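$IDF(x) = \log\frac{N+1}{N(x)+1} + 1$

Here $N$ is the total number of documents in the corpus and $N(x)$ is the number of documents that contain the word $x$. Adding 1 to both counts is a smoothing step that avoids division by zero.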
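The formula can be checked by hand on the four-sentence corpus used in these notes. The sketch below assumes the natural logarithm and a tokenizer that keeps only tokens of two or more characters; the helper name `smooth_idf` is hypothetical.

```python
import math
import re

corpus = ["I come to China to travel",
          "This is a car polupar in China",
          "I love tea and Apple ",
          "The work is to write some papers in science"]


def smooth_idf(docs):
    """IDF(x) = log((N + 1) / (N(x) + 1)) + 1 for every word in docs."""
    # Assumed tokenizer: lowercase, keep tokens of two or more characters.
    token_re = re.compile(r"\b\w\w+\b")
    doc_words = [set(token_re.findall(d.lower())) for d in docs]
    n_docs = len(doc_words)
    vocab = sorted(set().union(*doc_words))
    idf = {}
    for word in vocab:
        # N(x): number of documents that contain the word.
        n_x = sum(word in words for words in doc_words)
        idf[word] = math.log((n_docs + 1) / (n_x + 1)) + 1
    return idf


idf = smooth_idf(corpus)
for word in ("to", "china", "travel"):
    print(word, round(idf[word], 4))
```

"to" occurs in two of the four documents and "travel" in only one, so "travel" receives the higher IDF value.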

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I come to China to travel",
          "This is a car polupar in China",
          "I love tea and Apple ",
          "The work is to write some papers in science"]

# Vectorize the text into term counts
vectorizer = CountVectorizer()
# TF-IDF transformer
transformer = TfidfTransformer()
# Convert the term counts into TF-IDF weights
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
print('TF-IDF weights', tfidf)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print('Vocabulary', vectorizer.get_feature_names_out().tolist())


# -------- In one step --------
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf2 = TfidfVectorizer()
# Named `weights` rather than `re` to avoid shadowing the standard-library re module
weights = tfidf2.fit_transform(corpus)
print('TF-IDF weights', weights)
print('Vocabulary', tfidf2.get_feature_names_out().tolist())

6. Chinese word segmentation workflow

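1. Data collection

Before any text mining we need a text corpus, obtained either by crawling it ourselves or by downloading a publicly available corpus.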

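2. Remove the non-text parts of the data

This step mainly applies to corpus data collected with a crawler, since crawled pages contain many HTML tags that need to be removed. Small amounts of non-text content can be deleted directly with Python regular expressions (re); more complex cases can be handled with beautifulsoup. Once this non-text content is gone, the actual text preprocessing can begin.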
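For the simple case, a regular-expression cleanup might look like the sketch below. The function name `strip_html` and the sample string are assumptions made for illustration, and a regex approach like this is only suitable for lightly marked-up text; malformed or deeply nested HTML is better left to a real parser such as beautifulsoup.

```python
import html
import re


def strip_html(raw):
    """Remove tags and entities from a small HTML fragment (illustrative)."""
    # Drop script/style blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", raw)
    # Drop the remaining tags, leaving a space so words do not run together.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Convert entities such as &amp; back into characters.
    text = html.unescape(text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


sample = ("<div><p>人民的名义 &amp; <b>沙瑞金</b></p>"
          "<script>var x = 1;</script></div>")
print(strip_html(sample))
```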

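3. Handle Chinese encoding issues

In Python 2 the default str type is a byte string rather than Unicode, so when preprocessing Chinese text with Python 2 the rule followed here is: store all data as UTF-8, and when reading it back for Chinese-specific processing, convert it explicitly, using a Chinese encoding such as GBK where needed. In Python 3, str is already Unicode, so passing encoding='utf-8' to open() is usually enough.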
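The relationship between the two encodings can be illustrated in Python 3, where text and bytes are separate types. This is only a sketch of the conversion itself; the helper name `transcode` is hypothetical.

```python
text = "中文分词"

# The same text has different byte representations in each encoding.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")
print(len(utf8_bytes), len(gbk_bytes))

# Decoding with the matching codec recovers the identical string.
assert utf8_bytes.decode("utf-8") == gbk_bytes.decode("gbk") == text


def transcode(data, src="gbk", dst="utf-8"):
    """Convert bytes from one encoding to another via a Unicode string."""
    return data.decode(src).encode(dst)


print(transcode(gbk_bytes) == utf8_bytes)
```

Decoding with the wrong codec either raises UnicodeDecodeError or silently yields garbled characters, which is why the storage encoding should be fixed and stated explicitly.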

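4. Chinese word segmentation

There are many Chinese word segmentation tools; jieba is a popular one and can be installed with pip install jieba. The process has three parts: produce the segmented file, load a stop-word list, and apply the TF-IDF transformation.

Segmentation with jieba: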


import jieba

# Register manually specified person and place names so they are kept whole
jieba.suggest_freq('沙瑞金', True)
jieba.suggest_freq('易学习', True)
jieba.suggest_freq('王大路', True)
jieba.suggest_freq('京州', True)

# In Python 3 the file is decoded on read, so no manual encode/decode is needed
with open('./nlp_test2.txt', encoding='utf-8') as f:
    document = f.read()

# Full mode
# document_cut = jieba.cut(document, cut_all=True)
# Accurate mode
# document_cut = jieba.cut(document, cut_all=False)
# Search-engine mode
document_cut = jieba.cut_for_search(document)
# Join the tokens with spaces so TfidfVectorizer can split them later
result = ' '.join(document_cut)

# Write the segmented text back out as UTF-8
with open('./nlp_test3.txt', 'w', encoding='utf-8') as f2:
    f2.write(result)

Apply the TF-IDF transformation to the segmented text, using the stop-word list:

from sklearn.feature_extraction.text import TfidfVectorizer

# Read the segmented files
with open('./nlp_test1.txt', encoding='utf-8') as f3:
    res1 = f3.read()
with open('./nlp_test3.txt', encoding='utf-8') as f4:
    res2 = f4.read()

# =========== Segmented documents ============
corpus = [res1, res2]

# ========= Stop-word list ==========
stpwrdpath = 'stop_words.txt'
with open(stpwrdpath, encoding='utf-8') as f5:
    stpwrd_content = f5.read()
    # Convert the stop-word file into a list, one word per line
    stpwrdlst = stpwrd_content.splitlines()

vector = TfidfVectorizer(stop_words=stpwrdlst)
tfidf = vector.fit_transform(corpus)
# print(tfidf)

# All words in the bag-of-words vocabulary
# (get_feature_names() was removed in scikit-learn 1.2)
wordlist = vector.get_feature_names_out()
# Matrix of shape (number of documents) x (vocabulary size)
weightlist = tfidf.toarray()
num_doc = len(weightlist)
for i in range(num_doc):
    print("------- TF-IDF weights for document", i, "-------")
    for j in range(len(wordlist)):
        print('Term', j, wordlist[j], weightlist[i][j])


# Output:
------- TF-IDF weights for document 0 -------
Term 0 一起 0.23276132724875156
Term 1 万块 0.23276132724875156
Term 2 三人 0.23276132724875156

Term 52 道口 0.11638066362437578
Term 53 金山 0.11638066362437578
Term 54 降职 0.11638066362437578
Term 55 风生水 0.11638066362437578
------- TF-IDF weights for document 1 -------
Term 0 一起 0.0
Term 1 万块 0.0
Term 2 三人 0.0
Term 3 三套 0.14811780175932843
Term 54 降职 0.0
Term 55 风生水 0.0
