Gensim: A Python library for text topic recognition

Automatically extracting the topics people are talking about from large volumes of text (topic recognition) is one of the fundamental applications of natural language processing. Examples of such text include social media feeds, consumer reviews of hotels, movies, and other businesses, news articles, and customer emails.

In this article, we will work through a practical case of extracting topics from the 20Newsgroups dataset with LDA.

Fundamentals of Topic Recognition

This section covers the principles of topic identification and modeling. You will learn how to detect and extract topics from text using the bag-of-words approach and simple NLP models.

Lemmatization

Reducing words to their base or dictionary form (the lemma) is called lemmatization; cutting them down to their roots or stems is called stemming.

First instantiate WordNetLemmatizer. Call its '.lemmatize()' method to build a new list of tokens named lem_tokens. Then pass that list to the Counter class to build bag_words, and finally print the six most common words.

from collections import Counter
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# stopwords_removed: tokens from the text with stop words already removed (assumed from earlier preprocessing)
lem_tokens = [lemmatizer.lemmatize(t) for t in stopwords_removed]
bag_words = Counter(lem_tokens)
print(bag_words.most_common(6))

Gensim and LDA

LDA stands for Latent Dirichlet Allocation.

Gensim is an open source natural language processing (NLP) library that can create and query corpora. It operates by building word embeddings or vectors, which are then used to model topics.

Deep learning algorithms are used to build multidimensional mathematical representations of words, called word vectors. These capture the relationships between terms in the corpus. For example, the distance between the words "India" and "New Delhi" may be comparable to the distance between "China" and "Beijing", because both pairs encode the same country-capital relationship.
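To make the word-vector idea concrete, here is a minimal sketch using a pretrained embedding loaded through gensim's downloader API. The model name "glove-wiki-gigaword-100" and the example words are illustrative assumptions; the nearest neighbours you get back depend on the embedding used.

import gensim.downloader as api

# Download a small pretrained embedding (one-time download) and get a KeyedVectors object
word_vectors = api.load("glove-wiki-gigaword-100")

# vector("india") - vector("delhi") + vector("beijing") should land near "china"
print(word_vectors.most_similar(positive=["beijing", "india"],
                                negative=["delhi"], topn=3))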

Gensim for creating and querying corpora

Having covered the basics of gensim, in this article we will build our first gensim dictionary and corpus.

These data structures make it possible to examine word trends and other interesting patterns in the document collection. First, we import some Wikipedia articles that have been preprocessed: all words lower-cased, tokenized, with stop words and punctuation removed. These files are stored as articles, a list of token lists, one per document. Before creating the gensim dictionary and corpus, some preliminary work is needed.
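Here is a minimal sketch of that preliminary work, assuming articles is already a list of token lists (lower-cased, tokenized, stop words and punctuation removed), as described above.

from gensim.corpora.dictionary import Dictionary

# Map each unique token to an integer id
dictionary = Dictionary(articles)

# token2id lets you look up the integer id of any token in the vocabulary
print(list(dictionary.token2id.items())[:5])

# Convert each article into a bag of words: a list of (token_id, token_count) tuples
corpus = [dictionary.doc2bow(article) for article in articles]
print(corpus[0][:5])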

Gensim's bag of words

Now use the new gensim corpus and dictionary to see the most frequently used terms in each document and across all documents. You can look up these terms in a dictionary.

You can use defaultdict to create a dictionary that assigns a default value to missing keys. Passing int as the factory ensures that any missing key is automatically given a default value of 0.
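A small sketch of that pattern, applied to the corpus built above (it assumes the dictionary and corpus variables from the previous snippet): defaultdict(int) totals the word counts across every document, and missing keys start at 0 automatically.

from collections import defaultdict
import itertools

total_word_count = defaultdict(int)  # any missing key defaults to 0
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

# Print the five most frequent terms across all documents
sorted_word_count = sorted(total_word_count.items(), key=lambda item: item[1], reverse=True)
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)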

Document term matrix for LDA

Next we create an LDA model and train it on the document-term matrix. The number of topics and the dictionary must be specified. Since this example corpus is small, only 9 documents, we might limit the number of topics to 2 or 3.
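A minimal sketch of such a model, again assuming the small corpus and dictionary created above; the choice of 3 topics and 10 passes is illustrative only.

from gensim.models import LdaModel

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10)

# Show the five highest-weighted words in each of the three topics
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)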

Bag-of-words information (LDA or TF-IDF) is very good at identifying topics by detecting frequent words when the text itself is coherent. When the text is incoherent (in terms of words or sentences), more contextual information is needed to fully reflect the idea of the text.

Dataset

This example uses the 20Newsgroups dataset, which can be downloaded via sklearn.

from sklearn.datasets import fetch_20newsgroups 
newsgroups_train = fetch_20newsgroups(subset='train', 
                                      shuffle = True) 
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)

News in this dataset has been categorized into key topics.

print(list(newsgroups_train.target_names))

As you can see from the results, it covers a variety of topics such as science, politics, sports, religion and technology. Let's look at a couple of example posts.

newsgroups_train.data[:2]

Data preprocessing

Specific steps are as follows:

  1. Use tokenization to split text into sentences and sentences into words.

  2. Remove all punctuation and convert all words to lowercase.

  3. Remove words with three or fewer characters.

  4. Remove all stop words.

  5. Words are lemmatized: words in the third person are changed to the first person, and verbs in past and future tenses are changed to the present tense.

  6. Words are stemmed, i.e. reduced to their root form.

Related library preparation

Import the necessary packages and download the NLTK WordNet data.

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)
import nltk
nltk.download('wordnet')

Lemmatizer

Before preprocessing the data, look at an example of lemmatization. What happens if we lemmatize the word "gone"?

Take, for example, converting the past tense to the present tense.

print(WordNetLemmatizer().lemmatize('gone', pos = 'v'))
go

Next, an example of stemming. Feed a few words into the stemmer and see what the output looks like.

import pandas as pd
stemmer = SnowballStemmer("english")
original_words = ['sings', 'classes', 'dies', 'mules', 'denied', 'played', 'agreement', 'owner',
                  'humbled', 'sized', 'meeting', 'stating', 'siezing', 'itemization', 'sensational',
                  'traditional', 'reference', 'colon', 'plotting']
singles = [stemmer.stem(plural) for plural in original_words]
pd.DataFrame(data={'original word': original_words, 'stemmed': singles})

Next write a function to run the preprocessing stage on the entire dataset.

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize and lemmatize
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

Now preview a document after preprocessing to see the tokenized and lemmatized output.

doc_sample = 'Sara did not like to read. She was not very good at it.'
print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))
Original document:
['Sara', 'did', 'not', 'like', 'to', 'read.',
'She', 'was', 'not', 'very', 'good', 'at', 'it.']

Tokenized and lemmatized document:
['sara', 'like', 'read', 'good']

Now preprocess all of the news documents in the training set and check the resulting list.

processed_docs = []
for doc in newsgroups_train.data:
    processed_docs.append(preprocess(doc))
# Preview 'processed_docs'
print(processed_docs[:2])

"processed_docs" will now be used to build a dictionary containing the number of times each word appears in the training set. To do this, call it "dictionary" and feed the processed document to gensim.corpora.Dictionary()[1].

Create a bag of words

Create a bag of words from text

Before topic recognition, we convert the tokenized and lemmatized text into a bag of words, which can be thought of as a dictionary where the keys are words and the values are the number of times that word appears in the corpus.

Use gensim.corpora.Dictionary to create, from "processed_docs", a dictionary containing the number of occurrences of each term in the training set, and name it "dictionary".

dictionary = gensim.corpora.Dictionary(processed_docs)

First check if the dictionary was created.

count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break
0 addit
1 bodi
2 bricklin
3 bring
4 bumper
5 call
6 colleg
7 door
8 earli
9 engin
10 enlighten

Filter extremes

Remove from the dictionary all tokens that:

  • appear in fewer than no_below documents (an absolute number), or in more than no_above documents (a fraction of the total corpus size, not an absolute number).

  • After filtering on the two conditions above, keep only the first keep_n most frequent tokens (if keep_n is None, all tokens are kept).

dictionary.filter_extremes(no_below=15,no_above=0.1,keep_n= 100000)

This removes both very rare words and words that occur too frequently to be informative.
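If you only want to drop the most common tokens, gensim's Dictionary also provides filter_n_most_frequent; a small sketch (the value 20 is illustrative, and you would normally use either this or filter_extremes, not both):

# Remove the 20 most frequent tokens from the dictionary
dictionary.filter_n_most_frequent(20)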

Now use the resulting dictionary object to convert each preprocessed document into a bag of words. That is, build a dictionary for each document, storing which words it contains and how many times those words appear.

Gensim doc2bow

doc2bow(document)

Converts a document (a list of words) into a list of 2-tuples in the format (token_id, token_count). Each word is assumed to be a normalized and tokenized string (unicode or utf-8 encoded). Apply tokenization, stemming, and other preprocessing to the words in the document before calling this function.

Using the bag-of-words model, we build such a representation for each document, recording which words occur and how many times they appear. We store the result in "bow_corpus".

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

Now, preview the BOW of the preprocessed example document 11.

document_num = 11
bow_doc_x = bow_corpus[document_num]

for i in range(len(bow_doc_x)):
    print('Word {} ("{}") appears {} time.'.format(
        bow_doc_x[i][0], dictionary[bow_doc_x[i][0]], bow_doc_x[i][1]))

Execute LDA

Using Bag of Words

We will extract a fixed number of topics from the document corpus (eight, as set below). To parallelize and speed up training, we run LDA on multiple CPU cores using LdaMulticore.

Here are some parameters we will be tuning:

  1. num_topics: the number of latent topics to be extracted from the training corpus.

  2. id2word: a mapping from word ids (integers) to words (strings), used to determine the vocabulary size, as well as for debugging and topic printing.

  3. workers: the number of extra processes to use for parallelization. By default, all available cores are used.

  4. The hyperparameters alpha and eta affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions, respectively. For now, these are left at their defaults (the default is 1/num_topics).

Alpha is the prior on the topic distribution of each document:

  • High alpha: Each document has multiple topics (documents look similar to each other).

  • Low alpha: Each document contains only a few topics.

Eta is the prior on the word distribution of each topic:

  • High eta: Each topic contains a variety of words (topics look similar to each other).

  • Low eta: Each topic contains a small number of words.

Since we can use gensim's LDA implementation, this is fairly straightforward, but the number of topics in the dataset must be specified. Suppose we start with eight different topics. The number of training passes over the corpus is called the number of passes.

Train the LDA model with gensim.models.LdaMulticore and store it in "lda_model".

lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=8, id2word=dictionary, passes=10, workers=2)
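If you want to override the default alpha and eta priors discussed above, LdaMulticore also accepts them explicitly; a minimal sketch with illustrative values (not recommendations):

lda_model_sparse = gensim.models.LdaMulticore(
    bow_corpus,
    num_topics=8,
    id2word=dictionary,
    passes=10,
    workers=2,
    alpha=0.01,  # low alpha: few topics per document
    eta=0.01)    # low eta: few dominant words per topic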

After training the model, look at the words that appear in each topic and their relative weights.

for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic words

What categories can you derive from the words in each topic and their corresponding weights?

  • 0: Gun Violence

  • 1: Sports

  • 2: Politics

  • 3: Space

  • 4: Encryption

  • 5: Technology

  • 6: Graphics Cards

  • 7: Religion

Model testing

Preprocess an unseen document from the test set.

num = 70
unseen_document = newsgroups_test.data[num]
print(unseen_document)

bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

The model here is complete. Now think about how to interpret it and see if the results make sense.

The model outputs eight topics, each characterized by a set of words. LDA does not assign a human-readable name to these topics.
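Since LDA only returns numeric topic ids, any human-readable names have to be attached by hand. A small sketch using the interpretation proposed above (the mapping is our own reading of the topics, not model output):

# Manually chosen labels for the eight topics identified above
topic_labels = {0: "Gun Violence", 1: "Sports", 2: "Politics", 3: "Space",
                4: "Encryption", 5: "Technology", 6: "Graphics Cards", 7: "Religion"}

# Pick the highest-scoring topic for the unseen document from the previous section
dominant_topic, dominant_score = max(lda_model[bow_vector], key=lambda tup: tup[1])
print("Predicted label:", topic_labels[dominant_topic], "score:", round(float(dominant_score), 3))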

Model evaluation

① The model does a good job of extracting the distinct topics in the dataset, and it can be checked against the dataset's target names (a rough sketch follows this list).

② The model runs very fast. In just a few minutes, topics can be extracted from the dataset.

③ The model assumes that the dataset contains distinct topics; if the dataset were instead a collection of random tweets, the results could be difficult to interpret.
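A rough sketch of the check mentioned in ①: compare the dominant LDA topic of a few test posts with their true newsgroup category. Because the topic ids are unsupervised, this is a qualitative eyeball test rather than an accuracy score.

# Compare dominant topics with the true categories for a few test documents
for num in range(3):
    doc = newsgroups_test.data[num]
    bow = dictionary.doc2bow(preprocess(doc))
    dominant_topic = max(lda_model[bow], key=lambda tup: tup[1])[0]
    true_label = newsgroups_test.target_names[newsgroups_test.target[num]]
    print("true category:", true_label, "| dominant LDA topic:", dominant_topic)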

Simple summary

  1. By combining LDA topic probabilities and sentence embeddings, the contextual topic recognition model utilizes both bag-of-words and contextual information.

  2. Although LDA performs well on topic recognition tasks, it struggles with short texts and with documents that do not discuss their topics coherently. It is also limited by being based purely on a bag of words.

  3. Bag-of-words information (LDA or TF-IDF) is very good at identifying topics by detecting frequent words when the text is internally coherent. When the text is incoherent (in terms of words or sentences), more information is needed to reflect the idea of the text.

