Scikit-Learn Supervised Learning Model Case Studies: News/Email Text Classification (Multinomial Naive Bayes)

The simplest approach

Download '20news-bydate.pkz' and put it under C:\Users\[Current user]\scikit_learn_data.
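If you are not sure where that directory is on your machine, scikit-learn's `get_data_home()` reports it (by default `~/scikit_learn_data`; it can be overridden by the `SCIKIT_LEARN_DATA` environment variable):

```python
# Locate the directory where scikit-learn caches downloaded datasets.
# The downloaded .pkz file should be placed directly in this directory.
from sklearn.datasets import get_data_home

data_home = get_data_home()
print(data_home)
```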

2.1 Download the file manually

    Place it under scikit_learn_data/20news_home/ and extract it there

2.2 Edit the download_20newsgroups function in site-packages/sklearn/datasets/twenty_newsgroups.py

Comment out the code below; on the next run, the already-extracted files are automatically repacked into 20news-bydate_py3.pkz

# logger.info("Downloading dataset from %s (14 MB)", ARCHIVE.url)
# archive_path = _fetch_remote(ARCHIVE, dirname=target_dir)
#
# logger.debug("Decompressing %s", archive_path)
# tarfile.open(archive_path, "r:gz").extractall(path=target_dir)
# os.remove(archive_path)

from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')

# Check the size of the dataset
print(len(news.data))

# Import train_test_split from sklearn.model_selection for splitting the data
from sklearn.model_selection import train_test_split
# Use train_test_split with a fixed random seed (random_state) to hold out 25% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=33)

# Import the CountVectorizer feature extraction module from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import CountVectorizer

# Create the vectorizer object
vec = CountVectorizer()
# Learn the vocabulary from the training texts and convert them to count vectors
X_train = vec.fit_transform(X_train)
# Convert the test texts using the vocabulary learned from the training set
X_test = vec.transform(X_test)
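A quick way to see what fit_transform and transform actually do is to run CountVectorizer on a tiny hand-made corpus (the sentences below are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the news texts
corpus = ["the cat sat on the mat", "the dog sat"]
vec = CountVectorizer()
X = vec.fit_transform(corpus)          # learn the vocabulary, return sparse counts
print(sorted(vec.vocabulary_.keys()))  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(X.shape)                         # (2, 6) -- 2 documents, 6 unique tokens

# transform() reuses the learned vocabulary; unseen words ('bird') are ignored
X_new = vec.transform(["the bird sat"])
print(X_new.toarray())                 # [[0 0 0 0 1 1]] -- one 'sat', one 'the'
```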

# Import MultinomialNB from sklearn.naive_bayes
from sklearn.naive_bayes import MultinomialNB

# Create the naive Bayes classifier
mnb = MultinomialNB()
# Train the model
mnb.fit(X_train, y_train)

# Mean accuracy on the test set
score = mnb.score(X_test, y_test)
print(score)

# Evaluate the model in more detail with a per-class report
from sklearn.metrics import classification_report
result = mnb.predict(X_test)
report = classification_report(y_test, result, target_names=news.target_names)
print(report)
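Beyond the per-class report, a confusion matrix shows which classes get mistaken for which. A minimal sketch with made-up labels (on the real data you would pass y_test and the predictions from mnb.predict):

```python
from sklearn.metrics import confusion_matrix

# Toy labels standing in for y_test and the model's predictions
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred)
# Row i = true class i, column j = predicted class j;
# off-diagonal entries count misclassifications.
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]
```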

# Two new short documents to classify
docs_new = ['God is love', 'OpenGL on the GPU is fast']
# Convert them to count vectors with the fitted vectorizer
X_new_counts = vec.transform(docs_new)
# Predict a category for each document
predict = mnb.predict(X_new_counts)
# Print the predicted newsgroup name for each document
for index in predict:
    print(news.target_names[index])


Reposted from blog.csdn.net/huanghong6956/article/details/85762491