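This chapter covers: feature extraction with the bag-of-words (BOW) model + text classification with Naive Bayes

Using CountVectorizer

Feature extraction with the bag-of-words model means we use term frequency (TF) counts as the features.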

from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird", 'bird']
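cv = CountVectorizer(analyzer='word', max_features=4000)  # build the bag-of-words vectorizer
cv_fit = cv.fit_transform(texts)
# Input is a list of strings: each string is one document, already tokenized (space-separated)

# Column labels of the sparse matrix, i.e. the feature names (the vocabulary terms).
# scikit-learn >= 1.0 provides get_feature_names_out(); older releases used get_feature_names()
print(cv.get_feature_names_out().tolist())
print(cv_fit.toarray())  # convert the sparse count matrix to a dense array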
#['bird', 'cat', 'dog', 'fish']
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]

print(cv_fit.toarray().sum(axis=0))  # total count of each term across all documents

['bird', 'cat', 'dog', 'fish']
[[0 1 1 1]
 [0 2 1 0]
 [1 0 0 1]
 [1 0 0 0]]
[2 3 2 2]

# Vocabulary: maps each term to its column index (terms are indexed in sorted order)
cv.vocabulary_

{'dog': 2, 'cat': 1, 'fish': 3, 'bird': 0}
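To make the counting concrete, here is a minimal pure-Python sketch that rebuilds the same vocabulary and count matrix by hand. It assumes whitespace tokenization, which is a simplification: CountVectorizer's default token pattern also drops single-character tokens, so the two only agree on inputs like this one. The names `vocabulary` and `matrix` are illustrative and not part of the original code.

```python
from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# Vocabulary: sorted unique tokens mapped to column indices
unique_words = sorted({word for text in texts for word in text.split()})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

# One row per document, one column per vocabulary term
matrix = []
for text in texts:
    counts = Counter(text.split())
    matrix.append([counts.get(word, 0) for word in vocabulary])

print(vocabulary)
print(matrix)
```

Each row is a document and each column is a term, so the second row records that "cat" occurs twice in "dog cat cat".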

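# Collect the BOW vocabulary together with the count of each term
word = cv.get_feature_names_out().tolist()  # get_feature_names() on scikit-learn < 1.0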
freq = cv_fit.toarray().sum(axis = 0)
print(word)
print(freq)
word_freqs = dict(zip(word,freq))
print(word_freqs)
# Sort the dictionary entries by count, in descending order
word_freqs = sorted(word_freqs.items(),key=lambda d:d[1],reverse=True)
print(word_freqs)

['bird', 'cat', 'dog', 'fish']
[2 3 2 2]
{'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}
[('cat', 3), ('bird', 2), ('dog', 2), ('fish', 2)]

# Reading the first output line: document 0, the term at vocabulary index 3, and its count
print(cv_fit)

  (0, 3)	1
  (0, 1)	1
  (0, 2)	1
  (1, 1)	2
  (1, 2)	1
  (2, 0)	1
  (2, 3)	1
  (3, 0)	1
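The sparse printout above lists only the non-zero cells as (row, column) pairs followed by a count. As a rough illustration, the sketch below derives the same triples from the dense matrix shown earlier and maps each column index back to its term. The variable names are hypothetical, and the ordering within a row may differ from what the sparse matrix prints.

```python
feature_names = ['bird', 'cat', 'dog', 'fish']
dense = [
    [0, 1, 1, 1],
    [0, 2, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]

# Keep only the non-zero cells as (document index, term index, count)
triples = [
    (row, col, count)
    for row, counts in enumerate(dense)
    for col, count in enumerate(counts)
    if count != 0
]

# Replace the term index with the term itself for readability
readable = [(row, feature_names[col], count) for row, col, count in triples]
for entry in readable:
    print(entry)
```

Only 8 of the 16 cells are stored, which is why the sparse representation pays off once the vocabulary grows to thousands of terms.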

Import libraries

from sklearn.model_selection import train_test_split

Load the data

sentences = []
with open("../data/news.csv", 'r', encoding='utf8') as f:
    for line in f:
        # Each line holds space-separated tokens; the last token is the label
        splits = line.split(' ')
        feat = splits[:-1]
        label = splits[-1]
        sentences.append((" ".join(feat), label.strip()))

sentences[:2]

[('另一边 舞王 韩庚 跟随 欢乐 起舞 八十年代 迪斯科 舞步 轮番上阵 场面 精彩 歌之夜 敬请期待 浙江 卫视 2017 周五 00 畅意 100% 乳酸菌 饮品 独家 冠名 二十四 小时 第二季 水手 欢乐 出发',
  'entertainment'),
 ('三是 改变 割裂 状况 建立 一体化 防御 体系', 'technology')]

Split into training and test sets

Pay close attention to the format of the input data.

X, y = zip(*sentences)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1234)
print('X_train len=',len(X_train))
print('X_test len=',len(X_test))
print(X_train[:2])
print(y_train[:2])

X_train len= 82110
X_test len= 27370
['依托 腾讯 强大 技术 平台 视频 直播 经验 海外 用户 需求 技术 团队 联合 第三方 厂商 海外 用户 体验 优化 地区 打开 直播 缓冲 时间 两秒 以内 国内 用户 相差无几 成功 保障 全球 2145 在线 人群 观看 4K VR 5.1 声道 环绕声 传送 画面 稳定 流畅', '光圈 倒闭 直播 面临 生死战']
['technology', 'entertainment']
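The sizes printed above (82110 and 27370) correspond to a 75/25 split, which is what train_test_split uses when no test_size is given. The sketch below shows the idea in plain Python on toy data. The helper `simple_split` is hypothetical and does not reproduce scikit-learn's shuffling, so the same seed will not select the same rows.

```python
import math
import random

def simple_split(items, test_fraction=0.25, seed=1234):
    """Shuffle a copy of items and hold out test_fraction of them as test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    n_train = len(shuffled) - n_test
    return shuffled[:n_train], shuffled[n_train:]

# Toy (text, label) pairs in the same shape as `sentences`
pairs = [(f"doc {i}", "sports" if i % 2 else "technology") for i in range(20)]
train, test = simple_split(pairs)

# Same unzip idiom as in the article: separate texts from labels
X_train_demo, y_train_demo = zip(*train)
print(len(train), len(test))
```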

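Bag-of-words feature extraction

X_train and y_train have the following formats, respectively:

['音乐大师播出', '设计公司承担四届奥运场馆设计']
['entertainment', 'sports']

Note: the input is a list of strings. Each string represents one document and has already been tokenized.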

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    analyzer='word',
    max_features=4000,  # keep at most max_features terms, chosen by corpus-wide term frequency
    min_df=100,  # ignore terms that appear in fewer than min_df documents
)
vec.fit(X_train)

def get_features(x):
    return vec.transform(x)
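The interaction between `min_df` and `max_features` is easy to misread, so here is a small pure-Python sketch of the two filters. It assumes whitespace tokenization and breaks frequency ties alphabetically; both are simplifications of this sketch rather than documented CountVectorizer behavior, and `build_vocabulary` is a hypothetical helper.

```python
from collections import Counter

def build_vocabulary(docs, min_df=1, max_features=None):
    doc_freq = Counter()   # number of documents that contain each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for doc in docs:
        tokens = doc.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    # min_df filters on document frequency, not on raw term counts
    kept = [term for term in term_freq if doc_freq[term] >= min_df]

    # max_features keeps the most frequent of the surviving terms
    kept.sort(key=lambda term: (-term_freq[term], term))
    if max_features is not None:
        kept = kept[:max_features]

    # Column indices follow sorted term order
    return {term: idx for idx, term in enumerate(sorted(kept))}

docs = ["dog cat fish", "dog cat cat", "fish bird", "dog"]
vocab_min2 = build_vocabulary(docs, min_df=2)
vocab_top2 = build_vocabulary(docs, max_features=2)
print(vocab_min2)
print(vocab_top2)
```

When fewer terms pass the `min_df` filter than `max_features` allows, the vocabulary simply ends up smaller than the cap.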

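Let's look at what information the bag-of-words extraction gives us.

# Column labels of the sparse matrix, i.e. the feature names (the vocabulary).
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
print(vec.get_feature_names_out()[:20].tolist())
print('X_train=', X_train[:2])
print('y_train=', y_train[:2])
words_vec = vec.transform(X_train)  # sparse matrix, [n_samples, n_features]
print(words_vec[:10].toarray())  # dense view of the first 10 rows of the sparse count matrix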

['00', '10', '100', '1000', '11', '12', '120', '13', '14', '15', '150', '1500', '16', '17', '19', '1997', '20', '200', '2000', '2002']
X_train= ['依托 腾讯 强大 技术 平台 视频 直播 经验 海外 用户 需求 技术 团队 联合 第三方 厂商 海外 用户 体验 优化 地区 打开 直播 缓冲 时间 两秒 以内 国内 用户 相差无几 成功 保障 全球 2145 在线 人群 观看 4K VR 5.1 声道 环绕声 传送 画面 稳定 流畅', '光圈 倒闭 直播 面临 生死战']
y_train= ['technology', 'entertainment']
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

# Collect each feature term together with its count
word = vec.get_feature_names_out().tolist()  # the terms in the vocabulary
print(word[-10:-1])

['黄磊', '黄金', '黎明', '黑客', '黑色', '黑马', '黑龙江', '默契', '鼓励']

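# Count of each vocabulary term; summing the sparse matrix directly avoids
# materializing a very large dense array
import numpy as np
freq = np.asarray(words_vec.sum(axis=0)).ravel()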
print(freq[:10])

# <word,count>
word_freqs = dict(zip(word,freq))
# Sort the dictionary entries by count, in descending order
word_freqs = sorted(word_freqs.items(),key=lambda d:d[1],reverse=True)
print('word_freqs size = ',len(word_freqs))
print(word_freqs[:10])

[443 315 132 405 192 193 210 174 148 216]
word_freqs size =  3685
[('中国', 18494), ('比赛', 7962), ('电影', 7883), ('发展', 7626), ('用户', 6486), ('技术', 6161), ('市场', 6135), ('汽车', 6072), ('平台', 5891), ('北京', 5478)]

# Vocabulary
vocab_dict = dict(vec.vocabulary_)
vocab_dict_results = sorted(vocab_dict.items(), key=lambda d: d[1],reverse=True) 
print(vocab_dict_results[:5])

[('龙舟', 3684), ('鼓励', 3683), ('默契', 3682), ('黑龙江', 3681), ('黑马', 3680)]

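# Save the vocabulary to a file (explicit UTF-8 so the Chinese terms are written correctly on any platform)
with open('../data/bow_vocab.txt', 'w', encoding='utf8') as f: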
    for vocab in vocab_dict_results:
        text = "{}|{}".format(vocab[0],vocab[1])
        f.write(text+"\n")

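Model training

We build a Chinese text classifier with Naive Bayes. Given enough data with sufficient variety, Naive Bayes usually reaches quite good accuracy on this kind of task.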

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(vec.transform(X_train),y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
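For readers who want to see what a multinomial Naive Bayes classifier computes, the following is a minimal pure-Python sketch: a log prior per class plus Laplace-smoothed log likelihoods per term, with `alpha=1.0` as in the output above. The functions `train_nb` and `predict_nb` and the toy data are hypothetical; this is not scikit-learn's implementation, only the same idea on a tiny scale.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    class_docs = Counter(labels)          # documents per class
    word_counts = defaultdict(Counter)    # term counts per class
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {}
    for label, n_class_docs in class_docs.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)  # Laplace smoothing
        model[label] = {
            "log_prior": math.log(n_class_docs / len(docs)),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
            },
        }
    return model

def predict_nb(model, doc):
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for token in doc.split():
            # Tokens outside the training vocabulary are ignored
            if token in params["log_likelihood"]:
                score += params["log_likelihood"][token]
        scores[label] = score
    return max(scores, key=scores.get)

train_docs = ["match goal team", "team coach match",
              "phone chip software", "software update phone"]
train_labels = ["sports", "sports", "technology", "technology"]
nb_model = train_nb(train_docs, train_labels)
print(predict_nb(nb_model, "team goal"))
print(predict_nb(nb_model, "phone update"))
```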

Test-set accuracy

accuracy = clf.score(vec.transform(X_test), y_test)
print('accuracy = ',accuracy)

accuracy =  0.8751187431494337

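Cross-validation accuracy

Cross-validation is a more reliable way to evaluate the model. Ideally each fold should keep the classes roughly balanced, so we use StratifiedKFold here.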

vec.transform(X)

<109480x3685 sparse matrix of type '<class 'numpy.int64'>'
	with 1787662 stored elements in Compressed Sparse Row format>

np.array(y)

array(['entertainment', 'technology', 'sports', ..., 'entertainment',
       'entertainment', 'sports'], dtype='<U13')

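# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import numpy as np

def stratifiedkfold_cv(x, y, clf_class, shuffle=True, n_folds=5, **kwargs):
    # The current API takes n_splits in the constructor and the data in split()
    stratifiedk_fold = StratifiedKFold(n_splits=n_folds, shuffle=shuffle)
    # Copy the labels: y[:] on a NumPy array is a view, so writing predictions
    # into it would overwrite the true labels used to train later folds
    y_pred = y.copy()
    for train_index, test_index in stratifiedk_fold.split(x, y):
        X_train, X_test = x[train_index], x[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred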
NB = MultinomialNB
y_pred = stratifiedkfold_cv(vec.transform(X),np.array(y),NB,n_folds=5)
accuracy = accuracy_score(y, y_pred)
print('kfold accuracy = ',accuracy)

kfold accuracy =  0.8662404092071612
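To illustrate what "stratified" means here, the sketch below assigns sample indices to folds so that every fold keeps the same class proportions. It is a simplified stand-in for StratifiedKFold (no shuffling, round-robin assignment), and `stratified_folds` is a hypothetical helper.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=5):
    # Group sample indices by class
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal each class's indices across the folds in round-robin order
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

labels = ["sports"] * 10 + ["technology"] * 5
folds = stratified_folds(labels, n_folds=5)

# Count the classes inside each fold
fold_class_counts = [
    {label: sum(1 for i in fold if labels[i] == label)
     for label in ("sports", "technology")}
    for fold in folds
]
print(fold_class_counts)
```

Every fold holds two "sports" samples and one "technology" sample, matching the 2:1 ratio of the full label list.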

Saving the model

import pickle
with open('../model/tf_model.pkl','wb') as f:
    pickle.dump(clf,f)
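Only the classifier is pickled above, while the prediction code later relies on the fitted vectorizer still being in memory through `get_features`. One option, sketched here under that assumption, is to pickle both objects together so a fresh process can restore them as a pair. The dictionaries below are stand-ins for the article's `vec` and `clf`, used only so the example runs without the dataset.

```python
import pickle

# Stand-ins for the fitted objects; in the article these would be `vec` and `clf`
vectorizer_stand_in = {"vocabulary": {"bird": 0, "cat": 1, "dog": 2, "fish": 3}}
classifier_stand_in = {"classes": ["sports", "technology"]}

# Bundle both so they are always saved and loaded as a matching pair
bundle = {"vectorizer": vectorizer_stand_in, "classifier": classifier_stand_in}

payload = pickle.dumps(bundle)     # same serialization that pickle.dump writes to a file
restored = pickle.loads(payload)

print(restored["vectorizer"]["vocabulary"])
```

Keeping the pair together avoids accidentally combining a classifier with a vectorizer whose column order differs from the one it was trained on.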

Online prediction

# Load the stop words
with open('../data/stopwords.txt', encoding='utf8') as f:
    stopwords = [stopword.strip() for stopword in f.readlines()]
print(stopwords[:10])

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']

# Load the model
import pickle
tf_model = '../model/tf_model.pkl'
with open(tf_model,'rb') as f:
    model = pickle.load(f)

Prediction example 1: cars

Taken from Toutiao: https://www.toutiao.com/a6714271125473346055/

import jieba
text = "奥迪A3、宝马1系和奔驰A级一直纠缠不休的三个冤家"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
print(model.predict(feat)[0])

words =  ['奥迪', 'A3', '宝马', '奔驰', '纠缠', '不休', '三个', '冤家']
car

Prediction example 2: military

Taken from Toutiao news: https://www.toutiao.com/a6714188329937535496/

import jieba
text = "谁说文物只能躺在博物馆，想买一架梦想中的战斗机开着兜风吗？"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
print(model.predict(feat)[0])

words =  ['文物', '只能', '博物馆', '一架', '梦想', '战斗机', '开着', '兜风']
military

Prediction example 3: entertainment

We copy the headline from Toutiao (https://www.toutiao.com/a6689675139333751299/) and run a prediction on it.

import jieba
text = "陈晓旭：从完美林黛玉到身家过亿后剃度出家，她戏里戏外都是传奇"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
print(model.predict(feat)[0])

words =  ['陈晓旭', '完美', '林黛玉', '身家', '亿后', '剃度', '出家', '戏里', '戏外', '传奇']
entertainment

Prediction example 4: sports

Taken from Toutiao: https://www.toutiao.com/a6714266792253981192/

import jieba
text = "男女有别！国乒主力参加马来西亚T2联赛 男队站着吃自助女队吃桌餐"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
print(model.predict(feat)[0])

words =  ['男女有别', '国乒', '主力', '参加', '马来西亚', 'T2', '联赛', '男队', '自助', '女队', '桌餐']
sports

Prediction example 5: technology

import jieba
text = "摩托罗拉One Macro将是最新一款Android One智能手机"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
print(model.predict(feat)[0])

words =  ['摩托罗拉', 'One', 'Macro', '最新', '一款', 'Android', 'One', '智能手机']
technology
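The five examples repeat the same four steps: tokenize, filter, join, predict. A small helper can capture that pattern. The sketch below is hypothetical: `predict_category` takes the tokenizer, stop words, feature function and model as arguments, and the demo uses toy stand-ins (`str.split` and a keyword-based model) so it runs without jieba or the trained classifier.

```python
def predict_category(text, tokenize, stopwords, featurize, model):
    """Tokenize, drop short tokens and stop words, then classify one text."""
    words = [w for w in tokenize(text) if len(w) >= 2 and w not in stopwords]
    features = featurize([" ".join(words)])
    return words, model.predict(features)[0]

class KeywordModel:
    """Toy stand-in for the trained classifier, only to demonstrate the call."""
    def predict(self, batch):
        return ["sports" if "league" in doc.split() else "other" for doc in batch]

words, label = predict_category(
    "the league final is on",
    str.split,                 # stand-in tokenizer
    {"the", "is", "on"},       # stand-in stop words
    lambda docs: docs,         # stand-in feature function
    KeywordModel(),
)
print('words = ', words)
print(label)
```

With the objects from this article, the equivalent call would be `predict_category(text, jieba.lcut, stopwords, get_features, model)`.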

