【NLP】Latent Dirichlet Allocation

1. Principles of the LDA Topic Model

Intro:

Consider the following sentences:

I like to eat broccoli and bananas.
I ate a banana and spinach smoothie for breakfast.
Chinchillas and kittens are cute.
My sister adopted a kitten yesterday.
Look at this cute hamster munching on a piece of broccoli.

Q: What is latent Dirichlet allocation?
A: It automatically discovers which topics the sentences belong to.
For example, if the sentences above belong to two topics A and B, LDA might produce output like this:

Sentences 1 and 2: 100% Topic A
Sentences 3 and 4: 100% Topic B
Sentence 5: 60% Topic A, 40% Topic B
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (which we can interpret as being about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (which we can interpret as being about cute animals)

Steps:

Suppose you have a set of documents and a fixed number K of topics, and you want LDA to learn each document's topic mixture and each topic's word distribution:

  • Go through each document, and randomly assign each word in the document to one of the K topics.
  • Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
  • So to improve on them, for each document d…
    • Go through each word w in d…
      • And for each topic t, compute two things:
        1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and
        2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w.
        Reassign w a new topic, where we choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability). (Also, I’m glossing over a couple of things here, in particular the use of priors/pseudocounts in these probabilities.)
      • In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
  • After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
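The steps above are collapsed Gibbs sampling. A minimal, self-contained sketch (function and variable names are my own, not from the original post; the `alpha`/`beta` pseudocounts are the priors the quoted text says it glosses over):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of token lists; K: number of topics.
    Returns per-document topic proportions and per-topic word proportions.
    """
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})  # vocabulary size

    # Step 1: randomly assign each word in each document to one of the K topics.
    z = [[rng.randrange(K) for _ in d] for d in docs]

    # Count tables kept in sync with the assignments z.
    n_dt = [[0] * K for _ in docs]               # words in doc d assigned to topic t
    n_tw = [defaultdict(int) for _ in range(K)]  # times word w is assigned to topic t
    n_t = [0] * K                                # total assignments to topic t
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dt[d][t] += 1; n_tw[t][w] += 1; n_t[t] += 1

    # Step 2: repeatedly resample each word's topic given all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current assignment before resampling.
                n_dt[d][t] -= 1; n_tw[t][w] -= 1; n_t[t] -= 1
                # p(topic t | document d) * p(word w | topic t), smoothed
                # with Dirichlet pseudocounts alpha and beta.
                weights = [
                    (n_dt[d][k] + alpha) * (n_tw[k][w] + beta) / (n_t[k] + V * beta)
                    for k in range(K)
                ]
                t = rng.choices(range(K), weights=weights)[0]
                z[d][i] = t
                n_dt[d][t] += 1; n_tw[t][w] += 1; n_t[t] += 1

    # Step 3: estimate topic mixtures and topic-word distributions from counts.
    doc_topics = [[n / len(doc) for n in n_dt[d]] for d, doc in enumerate(docs)]
    topic_words = [{w: c / max(1, n_t[t]) for w, c in n_tw[t].items() if c}
                   for t in range(K)]
    return doc_topics, topic_words

# Demo on the five example sentences, lowercased and whitespace-tokenized.
docs = [s.split() for s in [
    "i like to eat broccoli and bananas",
    "i ate a banana and spinach smoothie for breakfast",
    "chinchillas and kittens are cute",
    "my sister adopted a kitten yesterday",
    "look at this cute hamster munching on a piece of broccoli",
]]
doc_topics, topic_words = lda_gibbs(docs, K=2)
```

On such a tiny corpus the sampler may not separate the two topics cleanly on every seed, but each row of `doc_topics` is a proper distribution over the K topics.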

2. Learning LDA Parameters

See the official sklearn documentation.

3. Generating Topic Features with LDA

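A minimal sketch of the idea, assuming the sklearn setup from section 2: the doc–topic matrix returned by `transform` is a K-dimensional feature vector per document, which can be concatenated with other features (the corpus and variable names here are illustrative, not from the original post):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train = ["broccoli bananas breakfast", "kittens cute hamster", "banana spinach smoothie"]
vec = CountVectorizer()
X = vec.fit_transform(train)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Topic features for new, unseen documents: reuse the fitted
# vectorizer and LDA; transform (never re-fit) at prediction time.
new_docs = ["cute kitten eating broccoli"]
topic_features = lda.transform(vec.transform(new_docs))  # shape (1, 2)

# The K topic proportions can be stacked with other features,
# e.g. the raw bag-of-words counts.
features = np.hstack([vec.transform(new_docs).toarray(), topic_features])
```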


Reposted from blog.csdn.net/weixin_42317507/article/details/89415608