LDA model: principle, code, and practical operation

The LDA model is mainly used to extract topics from a set of documents.


Foreword

Understanding the LDA model requires some mathematical background, but it can also be used as a black box.

One, the principle

The following resources explain the principle in more detail.

[python-sklearn] Chinese text | topic model analysis - LDA (Latent Dirichlet Allocation)_哔哩哔哩_bilibili

https://www.jianshu.com/p/5c510694c07e

Topic Model: Detailed Explanation and Application of LDA Principles

Topic Model - Latent Dirichlet Allocation (LDA)_哔哩哔哩_bilibili

Latent Dirichlet Allocation (LDA) is a topic model and a typical bag-of-words model: it treats a document as a collection of words, with no ordering or sequential relationship between them. A document can contain multiple topics, and each word in the document is generated by one of those topics. LDA gives the topics of each document in the document set as a probability distribution, which can be used to summarize what the articles are about. It is an unsupervised learning method.

LDA should not be confused with another classic dimensionality reduction method, Linear Discriminant Analysis, which is also abbreviated as LDA. That LDA is widely used in pattern recognition (for example, face recognition and ship recognition).

LDA does not require a manually labeled training set; it only needs a document set and a specified number of topics k. Another advantage of LDA is that, for each topic, a few words can be found that describe it. The number of topics is a parameter you set by hand. Given that, the model assigns each input article a probability for each topic, and each topic a probability for each word. What a topic actually means is left for you to interpret from its words.
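The generative story described above can be sketched in a few lines of plain Python. This is a toy illustration of the standard LDA generative process, not the code used later in this post. The vocabulary, the number of topics, and the prior values `alpha` and `beta` are invented for the example, and `sample_dirichlet` and `generate_document` are hypothetical helper names.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet by normalizing gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, vocab, topic_word, alpha, rng):
    """Generate one bag-of-words document; return (words, topic mixture)."""
    k = len(topic_word)
    theta = sample_dirichlet(alpha, k, rng)  # document-topic distribution
    words = []
    for _ in range(n_words):
        z = rng.choices(range(k), weights=theta)[0]  # pick a topic for this word
        words.append(rng.choices(vocab, weights=topic_word[z])[0])  # pick a word from that topic
    return words, theta

rng = random.Random(0)
vocab = ["price", "market", "stock", "match", "team", "goal"]  # toy vocabulary (assumed)
k = 2                    # number of topics, chosen by hand
alpha, beta = 0.5, 0.1   # assumed prior values

# One word distribution per topic
topic_word = [sample_dirichlet(beta, len(vocab), rng) for _ in range(k)]
doc, theta = generate_document(10, vocab, topic_word, alpha, rng)
print(doc)
print(theta)
```

Training LDA runs this story in reverse: given only the words, it estimates the per-document topic mixtures and the per-topic word distributions.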

The generative model is shown in the following figure:

Two, the code

1. Import library

import os
import pandas as pd
import re
import jieba
import jieba.posseg as psg

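2. Path reading

output_path = '../result'
file_path = '../data'
os.chdir(file_path)
data = pd.read_excel("data.xlsx")  # the sheet must contain a "content" column
os.chdir(output_path)  # from here on, the working directory is the result folder
dic_file = "../stop_dic/dict.txt"  # custom jieba dictionary
stop_file = "../stop_dic/stopwords.txt"  # stopword list, one word per line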

Create three folders in the same directory: result, data, and stop_dic. The relative paths above assume the code is run from a folder that sits next to them.

3. Word segmentation

# Load the custom dictionary and the stopword list once, not on every call
jieba.load_userdict(dic_file)
jieba.initialize()
try:
    with open(stop_file, encoding='utf-8') as f:
        stop_words = {line.strip() for line in f}
except OSError:
    stop_words = set()
    print("error in stop_file")

# parts of speech to keep: nouns, other proper nouns, verbal nouns
flag_list = {'n', 'nz', 'vn'}

def chinese_word_cut(mytext):
    word_list = []
    # jieba segmentation with part-of-speech tags
    for seg_word in psg.cut(mytext):
        # keep only Chinese characters
        word = re.sub(u'[^\u4e00-\u9fa5]', '', seg_word.word)
        # skip words shorter than two characters and stopwords
        if len(word) < 2 or word in stop_words:
            continue
        if seg_word.flag in flag_list:
            word_list.append(word)
    return " ".join(word_list)

data["content_cutted"] = data.content.apply(chinese_word_cut)

The word segmentation step takes a while to run.
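The filtering rule used in the segmentation step can be tried out without jieba. The sketch below works on hand-written (word, flag) pairs that stand in for the output of `psg.cut`; the sample pairs, the stopword set, and the helper name `keep_word` are assumptions made for illustration only.

```python
import re

def keep_word(word, flag, stop_words, flags=("n", "nz", "vn")):
    """Return the cleaned word if it passes all filters, otherwise None."""
    cleaned = re.sub(r"[^\u4e00-\u9fa5]", "", word)  # drop non-Chinese characters
    if len(cleaned) < 2:        # too short to be a useful topic word
        return None
    if cleaned in stop_words:   # stopword
        return None
    if flag not in flags:       # keep only the selected parts of speech
        return None
    return cleaned

# Hand-written stand-ins for (word, flag) pairs
tagged = [("我们", "r"), ("经济", "n"), ("发展", "vn"),
          ("的", "uj"), ("问题", "n"), ("GDP", "eng")]
stop_words = {"问题"}

kept = [w for w in (keep_word(word, flag, stop_words) for word, flag in tagged) if w]
print(" ".join(kept))
```

Only 经济 and 发展 survive here: the other pairs fail the part-of-speech, length, stopword, or Chinese-character check.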

4. LDA analysis

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

This step requires the scikit-learn (sklearn) library.

def print_top_words(model, feature_names, n_top_words):
    tword = []
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        topic_w = " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        tword.append(topic_w)
        print(topic_w)
    return tword

n_features = 1000  # keep the 1000 most frequent terms as features
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                max_features=n_features,
                                stop_words='english',
                                max_df = 0.5,
                                min_df = 10)
tf = tf_vectorizer.fit_transform(data.content_cutted)
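To make the `max_df`, `min_df`, and `max_features` settings more concrete, here is a rough pure-Python sketch of the kind of vocabulary pruning they ask for. It is a simplified illustration, not scikit-learn's implementation; the toy corpus, the threshold values, and the function name `prune_vocabulary` are assumptions.

```python
from collections import Counter

def prune_vocabulary(docs, max_df=0.5, min_df=2, max_features=3):
    """Drop terms that are too common or too rare, then cap the vocabulary size."""
    n_docs = len(docs)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for doc in docs:
        tokens = doc.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    # max_df is a proportion of documents, min_df an absolute document count
    kept = [t for t in doc_freq if min_df <= doc_freq[t] <= max_df * n_docs]
    # Of the surviving terms, keep the most frequent ones
    kept.sort(key=lambda t: (-term_freq[t], t))
    return sorted(kept[:max_features])

docs = [
    "经济 市场 发展",
    "经济 市场 价格",
    "经济 球队 比赛",
    "经济 球队 比赛 比赛",
]
print(prune_vocabulary(docs))
```

In this toy corpus 经济 appears in every document and is dropped as too common, while 发展 and 价格 appear only once and are dropped as too rare. With `min_df = 10` as in the code above, a small data set can end up with a very small or empty vocabulary, so the thresholds may need adjusting.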

n_topics = 8
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=50,
                                learning_method='batch',
                                learning_offset=50,
#                                 doc_topic_prior=0.1,
#                                 topic_word_prior=0.01,
                               random_state=0)
lda.fit(tf)

5. Output the words corresponding to each topic

n_top_words = 25
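# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names()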
topic_word = print_top_words(lda, tf_feature_names, n_top_words)
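The slice `topic.argsort()[:-n_top_words - 1:-1]` in print_top_words picks the indices of the largest weights, highest first. A plain-Python equivalent on made-up numbers may make this easier to read; the weights, the words, and the helper name `top_words` are invented for illustration.

```python
def top_words(weights, feature_names, n_top_words):
    """Return the n_top_words terms with the largest weights, highest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])  # ascending, like argsort
    top = order[:-n_top_words - 1:-1]  # walk backwards from the largest weight
    return [feature_names[i] for i in top]

weights = [0.2, 3.1, 0.7, 1.5]  # one row of a topic-word matrix (invented)
names = ["价格", "经济", "比赛", "市场"]
print(top_words(weights, names, 2))  # ['经济', '市场']
```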

6. Output the corresponding topic of each article 

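import numpy as np
topics = lda.transform(tf)  # document-topic probability matrix, one row per article
# assign each article to its most probable topic
data['topic'] = np.argmax(topics, axis=1)
data.to_excel("data_topic.xlsx", index=False)
topics[0]  # topic probabilities of the first article (topic 0, 1, 2, ...)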
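Assigning each article to a single topic amounts to taking the position of the largest value in each row of the document-topic matrix. A small sketch on invented probabilities (the matrix below is not real model output, and `dominant_topics` is a hypothetical helper name):

```python
def dominant_topics(doc_topic):
    """Return (topic index, probability) of the most probable topic per document."""
    result = []
    for row in doc_topic:
        best = max(range(len(row)), key=lambda k: row[k])
        result.append((best, row[best]))
    return result

doc_topic = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.40, 0.30],
]
for topic_idx, prob in dominant_topics(doc_topic):
    print(topic_idx, prob)
```

Keeping the probability next to the index is useful because a row like the third one has no clearly dominant topic, which the index alone would hide.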

7. Visualization 

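import pyLDAvis
import pyLDAvis.sklearn  # renamed to pyLDAvis.lda_model in pyLDAvis 3.4 and later
pyLDAvis.enable_notebook()  # render inside a Jupyter notebook
pic = pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)
pyLDAvis.save_html(pic, 'lda_pass'+str(n_topics)+'.html')  # saved in the current (result) folder
pyLDAvis.show(pic)  # starts a local web server and opens the visualization in a browser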

8. Perplexity

import matplotlib.pyplot as plt
plexs = []
scores = []
n_max_topics = 16
for i in range(1, n_max_topics):  # try 1 to n_max_topics - 1 topics
    print(i)
    # use a separate name so the 8-topic model stored in lda is not overwritten
    lda_i = LatentDirichletAllocation(n_components=i, max_iter=50,
                                      learning_method='batch',
                                      learning_offset=50, random_state=0)
    lda_i.fit(tf)
    plexs.append(lda_i.perplexity(tf))
    scores.append(lda_i.score(tf))

n_t = 15  # largest number of topics to plot; must be less than n_max_topics
x = list(range(1, n_t + 1))
plt.plot(x, plexs[:n_t])  # plexs[i] is the perplexity for i + 1 topics
plt.xlabel("number of topics")
plt.ylabel("perplexity")
plt.show()
plt.plot(x, scores[:n_t])
plt.xlabel("number of topics")
plt.ylabel("score")
plt.show()
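The two curves carry closely related information. As far as I understand scikit-learn's implementation, `score` returns an approximate log-likelihood of the data and `perplexity` is the exponential of the negative log-likelihood per word, so a higher score corresponds to a lower perplexity. The sketch below shows that relationship on invented numbers; `perplexity_from_score` and `pick_lowest` are hypothetical helper names and the scores are not real results.

```python
import math

def perplexity_from_score(log_likelihood, n_words):
    """Convert a total log-likelihood into per-word perplexity."""
    return math.exp(-log_likelihood / n_words)

def pick_lowest(values, candidates):
    """Return the candidate whose value is smallest."""
    best = min(range(len(values)), key=lambda i: values[i])
    return candidates[best]

# Invented log-likelihood scores for 1 to 5 topics on a corpus of 1000 word tokens
scores_demo = [-7200.0, -6900.0, -6750.0, -6800.0, -6950.0]
n_words = 1000
plex_demo = [perplexity_from_score(s, n_words) for s in scores_demo]
best_k = pick_lowest(plex_demo, [1, 2, 3, 4, 5])
print(best_k)
```

Lower perplexity is better. Because the values in this post are computed on the same data the model was trained on, the curve is best treated as a rough guide for choosing the number of topics rather than a definitive answer.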

Three, practical operation

I ran into some problems when running the code. After some checking and searching, they turned out to be environment and version issues, which I solved by adjusting the package versions. I recommend running it with conda=4.12.0, pandas=1.3.0, and pyLDAvis=2.1.2; with these versions there should basically be no problems.

The final result is as follows:


Summary

My understanding of the LDA model is not very thorough. If your data is in a csv file, you only need to switch the reading call to read_csv(). You may also run into encoding and decoding problems; try changing the encoding to gbk or something similar. I am still a beginner, and I hope more experienced readers can offer advice.
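For the encoding problems mentioned above, one simple approach is to try a short list of candidate encodings and use the first one that decodes the file. The sketch below uses only the standard library; the candidate list, the demo file, and the helper name `detect_encoding` are assumptions, and a file may decode without error under a wrong encoding, so the result is worth checking by eye.

```python
import os
import tempfile

def detect_encoding(path, encodings=("utf-8", "gbk")):
    """Return the first encoding in the list that decodes the whole file."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked: %r" % (encodings,))

# Demo: write a small GBK-encoded csv file and detect its encoding
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "data.csv")
with open(csv_path, "w", encoding="gbk") as f:
    f.write("content\n经济发展\n")

encoding = detect_encoding(csv_path)
print(encoding)
# The result could then be passed on, e.g. pd.read_csv(csv_path, encoding=encoding)
```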


Origin blog.csdn.net/weixin_46451009/article/details/127671146