The LDA model is mainly used to extract topics from documents
Foreword
Understanding the LDA model takes some mathematical background, but it can also be used as a black box.
One, the principle
You can learn more about the principle from the following resources:
https://www.jianshu.com/p/5c510694c07e
Topic Model: Detailed Explanation and Application of LDA Principles
Topic Model - Latent Dirichlet Allocation (LDA) (video on bilibili)
Latent Dirichlet Allocation (LDA) is a topic model and a typical bag-of-words model: it treats a document as an unordered collection of words and ignores word order. A document can contain multiple topics, and each word in the document is generated by one of those topics. LDA gives the topics of every document in a collection as a probability distribution, which can be used to summarize what each document is about. It is an unsupervised learning method.
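To make the "each word is generated by one of the topics" idea concrete, here is a minimal sketch of the generative story using only the standard library. The vocabulary, the two topic-word distributions and the function names are made up for illustration; they are assumptions, not part of any library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, topic_word, doc_alpha, vocab, rng):
    """Generate one bag-of-words document following the LDA story."""
    # theta: the topic mixture of this document
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(n_words):
        # pick a topic for this word position, then a word from that topic
        z = rng.choices(range(len(theta)), weights=theta)[0]
        w = rng.choices(vocab, weights=topic_word[z])[0]
        words.append(w)
    return theta, words

rng = random.Random(0)
vocab = ["goal", "match", "team", "stock", "bank", "market"]
# Hypothetical topic-word distributions: topic 0 is about sports, topic 1 about finance.
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],
]
theta, words = generate_document(10, topic_word, [0.5, 0.5], vocab, rng)
print(theta, words)
```

Training LDA goes in the opposite direction: given only the words, it estimates the document-topic mixtures and the topic-word distributions.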
It should not be confused with another classic method that shares the abbreviation: Linear Discriminant Analysis (also LDA), a dimensionality reduction method that is widely used in pattern recognition (face recognition, ship recognition, and so on).
LDA does not need a manually labeled training set. It only needs a document collection and the number of topics k, which you set by hand as a model parameter. Another advantage of LDA is that every topic can be described by a set of words. After training, each document gets a probability distribution over topics, and each topic gets a probability distribution over words. What each topic actually means is left for you to interpret.
The generative model is shown in the following figure:
Two, the code
1. Import library
import os
import pandas as pd
import re
import jieba
import jieba.posseg as psg
2. Path reading
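# The relative paths assume the script starts in a folder next to data, result and stop_dic.
output_path = '../result'
file_path = '../data'
os.chdir(file_path)
data = pd.read_excel("data.xlsx")  # the text column must be named "content"
os.chdir(output_path)  # files written later end up in result
dic_file = "../stop_dic/dict.txt"
stop_file = "../stop_dic/stopwords.txt"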
Create three folders side by side in the project directory: result, data, and stop_dic. Put data.xlsx in data, and dict.txt and stopwords.txt in stop_dic.
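If you prefer to create the folders from code, a small sketch with pathlib could look like this. The helper name make_project_dirs is hypothetical, and the demonstration runs in a temporary directory so it does not touch your real project.

```python
from pathlib import Path
import tempfile

def make_project_dirs(root):
    """Create data, result and stop_dic under root and return their paths."""
    root = Path(root)
    dirs = {name: root / name for name in ("data", "result", "stop_dic")}
    for d in dirs.values():
        # exist_ok=True makes the call safe to repeat
        d.mkdir(parents=True, exist_ok=True)
    return dirs

# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_project_dirs(tmp)
    created = sorted(p.name for p in Path(tmp).iterdir())
    print(created)
```

Working with full paths like these would also make the os.chdir calls unnecessary, but that is a design choice, not a requirement.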
3. Word segmentation
# Load the custom dictionary and the stop words once, not on every call.
jieba.load_userdict(dic_file)
jieba.initialize()
try:
    with open(stop_file, encoding='utf-8') as f:
        stop_words = {line.strip() for line in f if line.strip()}
except OSError:
    stop_words = set()
    print("error in stop_file")

# Keep only nouns (n), other proper nouns (nz) and verbal nouns (vn).
flag_list = {'n', 'nz', 'vn'}

def chinese_word_cut(mytext):
    word_list = []
    # jieba word segmentation with part-of-speech tags
    for seg_word in psg.cut(mytext):
        # remove everything that is not a Chinese character
        word = re.sub('[^\u4e00-\u9fa5]', '', seg_word.word)
        # skip single characters and stop words
        if len(word) < 2 or word in stop_words:
            continue
        if seg_word.flag in flag_list:
            word_list.append(word)
    return " ".join(word_list)

data["content_cutted"] = data.content.apply(chinese_word_cut)
Word segmentation takes a little while, especially on a large corpus.
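The filtering rule used in chinese_word_cut can be tried without jieba by feeding it (word, tag) pairs directly. This is only a sketch: filter_tokens and the sample pairs are made up for illustration, and the tags are written by hand instead of coming from psg.cut.

```python
import re

NON_CHINESE = re.compile('[^\u4e00-\u9fa5]')

def filter_tokens(pairs, stop_words, keep_flags=frozenset({"n", "nz", "vn"})):
    """Keep Chinese words of length >= 2 that are not stop words and carry a wanted tag."""
    kept = []
    for word, flag in pairs:
        # remove everything that is not a Chinese character
        word = NON_CHINESE.sub("", word)
        if len(word) < 2 or word in stop_words or flag not in keep_flags:
            continue
        kept.append(word)
    return " ".join(kept)

# Hand-written (word, tag) pairs standing in for jieba output
pairs = [("我们", "r"), ("研究", "vn"), ("主题", "n"),
         ("模型", "n"), ("的", "uj"), ("LDA", "eng")]
result = filter_tokens(pairs, stop_words={"模型"})
print(result)
```

Keeping the stop words in a set makes each lookup cheap, which matters when the function is applied to every document.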
4. LDA analysis
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
This step requires the scikit-learn library (imported as sklearn).
def print_top_words(model, feature_names, n_top_words):
    tword = []
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        # argsort is ascending, so this slice walks the n_top_words largest weights in descending order
        topic_w = " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        tword.append(topic_w)
        print(topic_w)
    return tword
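n_features = 1000  # keep at most the 1000 most frequent terms
tf_vectorizer = CountVectorizer(strip_accents='unicode',
                                max_features=n_features,
                                stop_words='english',
                                max_df=0.5,   # drop terms that appear in more than 50% of the documents
                                min_df=10)    # drop terms that appear in fewer than 10 documents
tf = tf_vectorizer.fit_transform(data.content_cutted)

n_topics = 8  # number of topics k, chosen by hand
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=50,
                                learning_method='batch',
                                learning_offset=50,  # only matters for learning_method='online'
                                # doc_topic_prior=0.1,
                                # topic_word_prior=0.01,
                                random_state=0)
lda.fit(tf)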
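To see what min_df, max_df and max_features do to the vocabulary, here is a simplified stand-in written with the standard library only. It is a sketch under assumptions: build_vocabulary is a hypothetical helper, it splits on whitespace, and it leaves out details of the real CountVectorizer such as its token pattern.

```python
from collections import Counter

def build_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Select terms by document frequency, then keep the most frequent ones."""
    n_docs = len(docs)
    doc_freq = Counter()   # in how many documents a term appears
    term_freq = Counter()  # how often a term appears overall
    for doc in docs:
        tokens = doc.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    # min_df is an absolute count here, max_df a proportion of the documents
    kept = [t for t, df in doc_freq.items()
            if df >= min_df and df <= max_df * n_docs]
    # keep the max_features terms with the highest overall frequency
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

docs = ["主题 模型 数据", "主题 模型", "主题 文本", "主题 数据"]
vocab = build_vocabulary(docs, min_df=2, max_df=0.5)
print(vocab)
```

In this toy corpus the word that appears in every document is dropped by max_df, and the word that appears only once is dropped by min_df.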
5. Output the words corresponding to each topic
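n_top_words = 25
# scikit-learn >= 1.0 provides get_feature_names_out(); older versions only have get_feature_names()
get_names = getattr(tf_vectorizer, "get_feature_names_out", None) or tf_vectorizer.get_feature_names
tf_feature_names = get_names()
topic_word = print_top_words(lda, tf_feature_names, n_top_words)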
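The slice topic.argsort()[:-n_top_words - 1:-1] in print_top_words is compact but hard to read. The following sketch does the same selection with plain Python lists; top_words and the sample weights are made up for illustration.

```python
def top_words(weights, feature_names, n_top_words):
    """Return the n_top_words feature names with the largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [feature_names[i] for i in order[:n_top_words]]

# One hypothetical row of model.components_ and its feature names
weights = [0.1, 2.0, 0.5, 1.5]
names = ["a", "b", "c", "d"]
best = top_words(weights, names, 2)
print(best)
```

When several weights are exactly equal, the order among them may differ from the argsort version.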
6. Output the corresponding topic of each article
import numpy as np
topics = lda.transform(tf)  # one row per document, one column per topic
# assign each document to its most probable topic
data['topic'] = np.argmax(topics, axis=1)
data.to_excel("data_topic.xlsx", index=False)
topics[0]  # topic distribution of the first document
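Assigning a document to its most probable topic throws away how confident that assignment is. As a small sketch, the hypothetical helper below returns the probability together with the topic index, so that weak assignments can be spotted.

```python
def dominant_topic(doc_topic_row):
    """Return (index, probability) of the most probable topic in one row."""
    best = max(range(len(doc_topic_row)), key=lambda k: doc_topic_row[k])
    return best, doc_topic_row[best]

# One hypothetical row of the document-topic matrix
row = [0.1, 0.7, 0.2]
topic_idx, topic_prob = dominant_topic(row)
print(topic_idx, topic_prob)
```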
7. Visualization
import pyLDAvis
import pyLDAvis.sklearn  # newer pyLDAvis releases (3.4+) rename this module to pyLDAvis.lda_model
pyLDAvis.enable_notebook()
pic = pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)
pyLDAvis.save_html(pic, 'lda_pass' + str(n_topics) + '.html')
pyLDAvis.show(pic)
8. Perplexity
import matplotlib.pyplot as plt
plexs = []
scores = []
n_max_topics = 16
for i in range(1, n_max_topics):
    print(i)
    # use a separate name so the 8-topic model fitted above is not overwritten
    lda_i = LatentDirichletAllocation(n_components=i, max_iter=50,
                                      learning_method='batch',
                                      learning_offset=50, random_state=0)
    lda_i.fit(tf)
    plexs.append(lda_i.perplexity(tf))
    scores.append(lda_i.score(tf))

n_t = 15  # right end of the plotted range (exclusive); must not exceed n_max_topics
x = list(range(1, n_t))
# plexs[0] belongs to 1 topic, so the values matching x are plexs[:n_t - 1]
plt.plot(x, plexs[:n_t - 1])
plt.xlabel("number of topics")
plt.ylabel("perplexity")
plt.show()

n_t = 15  # right end of the plotted range (exclusive); must not exceed n_max_topics
x = list(range(1, n_t))
plt.plot(x, scores[:n_t - 1])
plt.xlabel("number of topics")
plt.ylabel("score")
plt.show()
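Perplexity and score are two views of the same quantity. As far as I understand it, perplexity is the exponential of the negative log-likelihood per word, and scikit-learn uses an approximate bound in place of the exact log-likelihood. The sketch below only shows that general relationship on a toy case; perplexity_from_log_likelihood is a hypothetical helper, not a library function.

```python
import math

def perplexity_from_log_likelihood(log_likelihood, n_words):
    """Perplexity = exp(-log-likelihood per word); lower is better."""
    return math.exp(-log_likelihood / n_words)

# Toy check: a model that spreads probability uniformly over 4 words
# gives every token log(1/4), so its perplexity is 4.
n_words = 10
log_likelihood = n_words * math.log(1 / 4)
toy_perplexity = perplexity_from_log_likelihood(log_likelihood, n_words)
print(toy_perplexity)
```

This is why a lower perplexity and a higher score point in the same direction when comparing numbers of topics.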
Three, practical operation
I ran into some problems when running the code. After some searching, they turned out to be environment and version issues, and adjusting the versions solved them. I recommend running it with conda=4.12.0, pandas=1.3.0, and pyLDAvis=2.1.2; with these, there should basically be no problems.
The final result is as follows:
Summary
My understanding of the LDA model is still not very thorough. If your data is in a csv file, you only need to switch to read_csv(). You may also run into encoding and decoding problems; try changing the encoding, for example to gbk. I am still a beginner, so advice from more experienced readers is welcome.
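For the encoding problem, one possible approach is to try a few candidate encodings before loading the file. This is only a sketch: detect_encoding is a hypothetical helper, the candidate list is an assumption, and the demonstration writes its own small gbk file into a temporary directory.

```python
import os
import tempfile

def detect_encoding(path, candidates=("utf-8", "gbk")):
    """Return the first candidate encoding that decodes the whole file."""
    for enc in candidates:
        try:
            with open(path, encoding=enc) as f:
                f.read()
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode " + str(path))

# Demonstration: write a small gbk-encoded csv and detect its encoding
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    with open(csv_path, "w", encoding="gbk") as f:
        f.write("content\n主题模型\n")
    detected = detect_encoding(csv_path)
    print(detected)
```

The detected name could then be passed on, for example as pd.read_csv(path, encoding=detected). utf-8 is tried first because gbk can sometimes decode utf-8 bytes without an error and produce garbled text.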