Python text statistics and analysis from basics to advanced

This article is shared from the Huawei Cloud Community " Python Text Statistics and Analysis from Basics to Advanced " by Lemony Hug.

In today’s digital age, text data is everywhere, from social media posts to news articles to academic papers, and it contains a wealth of information. Statistical analysis is a common requirement when processing this text data, and Python, a powerful and easy-to-learn programming language, provides a rich set of tools and libraries for the job. This article introduces how to use Python to implement English text statistics, including word frequency statistics, vocabulary statistics, and text sentiment analysis.

Word frequency statistics

Word frequency counting is one of the most basic tasks in text analysis. There are many ways to implement it in Python; the following is one basic approach:

def count_words(text): 
    # Remove punctuation from the text and convert it to lowercase 
    text = text.lower() 
    for char in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~': 
        text = text.replace(char, ' ') 
    
    # Split the text into a list of words 
    words = text.split() 

    # Create an empty dictionary to store the word count 
    word_count = {} 
    
    # Traverse each word and update the count in the dictionary 
    for word in words: 
        if word in word_count: 
            word_count[word] += 1 
        else: 
            word_count[word] = 1 
    
    return word_count 

# Test code 
if __name__ == "__main__": 
    text = "This is a sample text. We will use this text to count the occurrences of each word." 
    word_count = count_words(text) 
    for word, count in word_count.items(): 
        print(f"{word} : {count}")

This code defines a function count_words(text) that accepts a text string as a parameter and returns a dictionary containing each word in the text and the number of times it occurs. Here is a line-by-line analysis of the code:

def count_words(text):: Defines a function count_words that accepts one parameter, text, the text string to be processed.

text = text.lower(): Converts the text string to lowercase, which makes the word statistics case-insensitive.

for char in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~':: This loop iterates over each punctuation character that should be removed from the text.

text = text.replace(char, ' '): Replaces every occurrence of the current punctuation character with a space, removing it from the text.

words = text.split(): Splits the processed text string into a list of words on whitespace.

word_count = {}: Creates an empty dictionary to store word counts, where the keys are words and the values are the number of times each word appears in the text.

for word in words:: Iterate through each word in the word list.

if word in word_count:: Check if the current word already exists in the dictionary.

word_count[word] += 1: If the word already exists in the dictionary, add 1 to its occurrence count.

else:: If the word is not in the dictionary, execute the following code.

word_count[word] = 1: Adds a new word to the dictionary and sets its occurrence count to 1.

return word_count: Returns a dictionary containing word counts.

if __name__ == "__main__":: Check if the script is running as the main program.

text = "This is a sample text. We will use this text to count the occurrences of each word.": Defines a test text.

word_count = count_words(text): Calls the count_words function with the test text as a parameter and saves the result in the variable word_count.

for word, count in word_count.items():: Iterates through each key-value pair in the word_count dictionary.

print(f"{word}: {count}"): Prints each word and its number of occurrences.

The running results are as follows:

this : 2
is : 1
a : 1
sample : 1
text : 2
we : 1
will : 1
use : 1
to : 1
count : 1
the : 1
occurrences : 1
of : 1
each : 1
word : 1

Further optimization and expansion

import re 
from collections import Counter 
def count_words(text): 
    # Use regular expressions to split the text into a list of words (including hyphenated words) 
    words = re.findall(r'\b\w+(?:-\w+)*\b', text.lower()) 

    # Use Counter to quickly count the number of word occurrences 
    word_count = Counter(words) 

    return word_count 
# Test code 
if __name__ == "__main__": 
    text = "This is a sample text. We will use this text to count the occurrences of each word." 
    word_count = count_words(text) 
    for word, count in word_count.items(): 
        print(f"{word}: {count}")

This code differs from the previous example in the following ways:

  1. The re.findall() function with the regular expression \b\w+(?:-\w+)*\b is used to split the text into a list of words. This pattern matches words, including hyphenated words (like "high-tech").
  2. The Counter class from the Python standard library's collections module is used for word counting, which is more efficient and makes the code cleaner.

This implementation is more advanced, more robust, and handles more special cases such as hyphenated words.
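Since count_words now returns a Counter, vocabulary statistics such as finding the most frequent words come almost for free via Counter.most_common(). A minimal sketch (the helper name top_words is ours, not from the article):

```python
import re
from collections import Counter

def count_words(text):
    # Same tokenization as above: words, including hyphenated ones
    words = re.findall(r'\b\w+(?:-\w+)*\b', text.lower())
    return Counter(words)

def top_words(text, n=3):
    # Counter.most_common(n) returns the n most frequent (word, count) pairs
    return count_words(text).most_common(n)

if __name__ == "__main__":
    text = "This is a sample text. We will use this text to count the occurrences of each word."
    print(top_words(text, 2))  # the two most frequent words
```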

The running results are as follows:

this: 2
is: 1
a: 1
sample: 1
text: 2
we: 1
will: 1
use: 1
to: 1
count: 1
the: 1
occurrences: 1
of: 1
each: 1
word: 1

Text preprocessing

Before text analysis, text preprocessing is usually required, including punctuation removal, case processing, lemmatization, and stemming. This can make text data more standardized and accurate.
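As a rough illustration of these preprocessing steps using only the standard library (the naive_stem function below is a toy suffix-stripper of our own for demonstration, not a real stemmer; real projects would use something like NLTK's PorterStemmer or spaCy's lemmatizer):

```python
import re

def preprocess(text):
    # Case processing: lowercase everything
    text = text.lower()
    # Punctuation removal: keep only word characters, whitespace, and hyphens
    text = re.sub(r"[^\w\s-]", " ", text)
    return text.split()

def naive_stem(word):
    # Toy stemming: strip a few common English suffixes.
    # This is only a sketch of the idea, not linguistically accurate.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

if __name__ == "__main__":
    tokens = [naive_stem(w) for w in preprocess("The Cats were running and jumped!")]
    print(tokens)
```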

Use more advanced models

In addition to basic statistical methods, we can also use machine learning and deep learning models for text analysis, such as text classification, named entity recognition, and sentiment analysis. There are many powerful machine learning libraries in Python, such as Scikit-learn and TensorFlow, that can help us build and train these models.
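For instance, a tiny sentiment classifier can be sketched with Scikit-learn's CountVectorizer and MultinomialNB, assuming scikit-learn is installed; the toy training data below is ours, purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled data, invented for this sketch
train_texts = [
    "I love this product, it is great",
    "What a wonderful and happy experience",
    "This is terrible, I hate it",
    "Awful quality, very disappointing",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Bag-of-words features + naive Bayes classifier chained in one pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["I love it, great quality"]))
```

A real sentiment model would need far more training data; the point here is only the shape of the pipeline: vectorize text into features, then fit a classifier.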

Handle large-scale data

When faced with large-scale text data, we may need to consider technologies such as parallel processing and distributed computing to improve processing efficiency and reduce computing costs. There are some libraries and frameworks in Python that can help us achieve these functions, such as Dask and Apache Spark.
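The core idea behind those frameworks is map-reduce: count each chunk of text independently, then merge the partial results. A minimal single-machine sketch of that pattern (Dask or Spark would run the per-chunk "map" step in parallel across workers):

```python
import re
from collections import Counter

def count_chunk(chunk):
    # "Map" step: count words in one chunk of text
    return Counter(re.findall(r'\b\w+\b', chunk.lower()))

def count_large_text(text, chunk_size=1000):
    # Split on whitespace so chunk boundaries never cut a word in half
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    # "Reduce" step: Counter supports +, so partial counts merge directly
    return sum((count_chunk(c) for c in chunks), Counter())

if __name__ == "__main__":
    text = "spark dask spark " * 2000
    print(count_large_text(text).most_common(2))
```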

Combine with other data sources

In addition to text data, we can also combine other data sources, such as image data, time series data, and geospatial data, to conduct more comprehensive and multi-dimensional analysis. There are many data processing and visualization tools in Python that can help us process and analyze this data.

Summary

This article provides an in-depth introduction to how to use Python to implement English text statistics, including word frequency statistics, vocabulary statistics, and text sentiment analysis. Here's a summary:

Word frequency statistics:

  • The Python function count_words(text) processes the text and counts the frequency of word occurrences.
  • Text preprocessing includes converting text to lowercase and removing punctuation.
  • A loop iterates over the words in the text, and a dictionary stores each word and its occurrence count.

Further optimization and expansion:

  • Regular expressions and the Counter class are introduced to make the code more efficient and robust.
  • Regular expressions split the text into a list of words, including hyphenated words.
  • Using the Counter class for word counting simplifies the code.

Text preprocessing:

Text preprocessing is an important step in text analysis, including punctuation removal, case processing, lemmatization, and stemming, all of which normalize text data.

Use more advanced models:

The possibility of using machine learning and deep learning models for text analysis, such as text classification, named entity recognition, and sentiment analysis, is introduced.

Handle large-scale data :

Technical considerations when processing large-scale text data are mentioned, including parallel processing and distributed computing to improve efficiency and reduce costs.

Combine with other data sources:

The possibility of combining other data sources for more comprehensive and multidimensional analysis is explored, such as image data, time series data, and geospatial data.

Summary:

The article recaps its content and offers prospects for future work, encouraging further research and exploration to handle more complex and diverse text data analysis tasks.

By studying this article, readers can master the basic methods of using Python for English text statistics and understand how to further optimize and extend these methods to handle more complex text analysis tasks.



Origin my.oschina.net/u/4526289/blog/11090410