Topic Extraction from Hillary Clinton's Emails: An Application of the LDA Model


Code example:
1. Import libraries and load the file

import numpy as np
import pandas as pd
import re
from gensim import corpora,models,similarities
from nltk.corpus import stopwords


df = pd.read_csv('H:/HillaryEmails.csv')
df = df[['Id','ExtractedBodyText']].dropna()

2. Text preprocessing

'''
Text preprocessing
'''
def clean_email_text(text):
    text = text.replace('\n', ' ')            # remove line breaks
    text = re.sub("-", " ", text)             # replace '-' with a space
    text = re.sub(r"\d+/\d+/\d+", " ", text)  # remove dates
    text = re.sub(r"[0-2]?[0-9]:[0-5][0-9]", "", text)  # times carry no topical meaning
    text = re.sub(r"[\w]+@[\.\w]+", "", text)  # email addresses carry no topical meaning
    text = re.sub(r"(?:https?://|www\.)\S+", "", text, flags=re.IGNORECASE)  # URLs carry no topical meaning
    pure_text = ''
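    # Other special characters (digits and so on) may remain, so loop over every character and filter them out
    for letter in text:
        # keep only letters and spaces
        if letter.isalpha() or letter == ' ':
            pure_text += letter
    # Stripping special characters can leave stray one-letter fragments; drop them
    # so that only meaningful words remain.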
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)
    return text

docs = df['ExtractedBodyText']
docs = docs.apply(clean_email_text)
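
To see what the last two steps of the cleaning function do, here is a minimal standalone sketch. The helper name `keep_letters_and_long_words` and the sample sentence are made up for illustration; the logic mirrors the letter filter and the short-token filter above, without the regex substitutions.

```python
def keep_letters_and_long_words(text):
    """Keep only letters and spaces, then drop one-character tokens."""
    # Hypothetical helper: same idea as the loop and the final join in clean_email_text
    pure_text = ''.join(ch for ch in text if ch.isalpha() or ch == ' ')
    return ' '.join(word for word in pure_text.split() if len(word) > 1)

sample = "Call me at 5 pm, I'll be in room B 12!"
cleaned = keep_letters_and_long_words(sample)
print(cleaned)  # Call me at pm Ill be in room
```

Note the side effect on contractions: the apostrophe is removed, so "I'll" becomes "Ill". For topic modelling this is usually harmless, but it is worth knowing when reading the resulting topic words.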

3. Build the model

'''
Build the model with gensim
    1. Load the stop-word list from nltk.corpus and tokenize
    2. Build the corpus
'''
doclist = docs.values

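# Remove stop words
# The stop-word list must be loaded before filtering; if the corpus is missing, run nltk.download('stopwords') once
words = set(stopwords.words('english'))
texts = [[word for word in doc.lower().split() if word not in words] for doc in doclist]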

# Build the corpus using a bag-of-words representation
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# print(corpus[0])
# [(0, 3), (1, 2), (2, 1), (3, 2), (4, 1), (5, 2), (6, 2), (7, 2), (8, 1), (9, 1), (10, 1), (11, 3), (12, 1)]
# (0, 3) means the word with id 0 appears three times, and so on

lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
# lda.print_topic(10, topn=5)                            # the 5 highest-weighted words of topic 10
# print(lda.print_topics(num_topics=20, num_words=5))    # all 20 topics with their 5 highest-weighted words
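
The `(word id, count)` pairs in the corpus can be illustrated with a small pure-Python sketch. This is not gensim's implementation: `build_vocab`, `to_bow` and the toy documents are hypothetical, and gensim may assign ids in a different order. It only shows the shape of the data that the model is trained on.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in texts:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens that are not in vocab."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Toy tokenized documents (made up for illustration)
toy_texts = [["state", "department", "state"], ["meeting", "state", "call"]]
vocab = build_vocab(toy_texts)
toy_bows = [to_bow(tokens, vocab) for tokens in toy_texts]
print(toy_bows)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1), (3, 1)]]
```

Words that never appeared in the training texts have no id, so they are dropped when a new document is converted. This is why new text has to be cleaned and tokenized exactly like the training data before it is passed to the model.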

4. Testing

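'''
With
lda.get_document_topics(bow)
or
lda.get_term_topics(word_id)
we can map new text or individual words onto the 20 topics; each call returns (topic id, probability) pairs rather than a single label.
Note that the text and words must first go through the same preprocessing and bag-of-words conversion used for training, i.e. be turned into integer word ids.
'''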

text1 = 'We have still not shattered that highest and hardest glass ceiling. But some day, someone will. To Barack and Michelle Obama, our country owes you an enormous debt of gratitude. We thank you for your graceful, determined leadership'
text1=clean_email_text(text1)
text1 = [word for word in text1.lower().split() if word not in words]
text1_bows = dictionary.doc2bow(text1)
print(lda.get_document_topics(text1_bows))
# Example output; exact values vary between runs unless random_state is fixed when training:
# [(0, 0.52221924), (2, 0.1793758), (9, 0.13047828), (15, 0.12792665)]
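
If a single label is wanted, one option is to take the topic with the highest probability. The sketch below works on the example output shown above, copied in as a plain list; the variable names are chosen for illustration.

```python
# (topic id, probability) pairs, copied from the example output above
doc_topics = [(0, 0.52221924), (2, 0.1793758), (9, 0.13047828), (15, 0.12792665)]

# Rank topics from most to least probable
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
top_topic_id, top_prob = ranked[0]
print(top_topic_id, top_prob)

# Probability mass covered by the listed topics
covered = sum(prob for _, prob in doc_topics)
print(round(covered, 2))
```

The listed probabilities add up to about 0.96 rather than 1, because `get_document_topics` leaves out topics whose probability falls below its `minimum_probability` threshold. To see which words characterize the winning topic, call `lda.print_topic(top_topic_id, topn=5)` on the trained model.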

For an explanation of how LDA works, see: https://blog.csdn.net/v_july_v/article/details/41209515
