This article is shared from the Huawei Cloud Community article "Python Text Statistics and Analysis from Basics to Advanced" by Lemony Hug.
In today's digital age, text data is everywhere and contains a wealth of information, from social media posts to news articles to academic papers. Statistical analysis is a common requirement when processing this text data, and Python, as a powerful and easy-to-learn programming language, provides a wealth of tools and libraries for the job. This article introduces how to use Python to implement English text statistics, including word frequency statistics, vocabulary statistics, and text sentiment analysis.
Word frequency statistics
Word frequency counting is one of the most basic tasks in text analysis. There are many ways to implement it in Python; the following is one basic method:
```python
def count_words(text):
    # Remove punctuation from the text and convert it to lowercase
    text = text.lower()
    for char in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~':
        text = text.replace(char, ' ')
    # Split the text into a list of words
    words = text.split()
    # Create an empty dictionary to store the word count
    word_count = {}
    # Traverse each word and update the count in the dictionary
    for word in words:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    return word_count

# Test code
if __name__ == "__main__":
    text = "This is a sample text. We will use this text to count the occurrences of each word."
    word_count = count_words(text)
    for word, count in word_count.items():
        print(f"{word}: {count}")
```
This code defines a function `count_words(text)` that accepts a text string as a parameter and returns a dictionary containing each word in the text and the number of times it occurs. Here is a line-by-line analysis of the code:

- `def count_words(text):` defines the function `count_words`, which accepts one parameter, `text`, the text string to be processed.
- `text = text.lower()` converts the text string to lowercase, which makes the word statistics case-insensitive.
- ``for char in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~':`` is a loop that traverses the punctuation marks to be removed.
- `text = text.replace(char, ' ')` replaces every occurrence of the current punctuation mark with a space, removing it from the text.
- `words = text.split()` splits the processed text string into a list of words on whitespace.
- `word_count = {}` creates an empty dictionary to store word counts, where the keys are words and the values are the number of times each word appears in the text.
- `for word in words:` iterates through each word in the word list.
- `if word in word_count:` checks whether the current word already exists in the dictionary.
- `word_count[word] += 1` adds 1 to the word's occurrence count if it already exists in the dictionary.
- `else:` handles the case where the word is not yet in the dictionary.
- `word_count[word] = 1` adds the new word to the dictionary and sets its occurrence count to 1.
- `return word_count` returns the dictionary containing the word counts.
- `if __name__ == "__main__":` checks whether the script is running as the main program.
- `text = "This is a sample text. We will use this text to count the occurrences of each word."` defines a test text.
- `word_count = count_words(text)` calls `count_words` with the test text as a parameter and saves the result in the variable `word_count`.
- `for word, count in word_count.items():` iterates through each key-value pair in the `word_count` dictionary.
- `print(f"{word}: {count}")` prints each word and its number of occurrences.
The running results are as follows:

```
this: 2
is: 1
a: 1
sample: 1
text: 2
we: 1
will: 1
use: 1
to: 1
count: 1
the: 1
occurrences: 1
of: 1
each: 1
word: 1
```
Further optimization and expansion
```python
import re
from collections import Counter

def count_words(text):
    # Use a regular expression to split the text into a list of words (including hyphenated words)
    words = re.findall(r'\b\w+(?:-\w+)*\b', text.lower())
    # Use Counter to quickly count the number of word occurrences
    word_count = Counter(words)
    return word_count

# Test code
if __name__ == "__main__":
    text = "This is a sample text. We will use this text to count the occurrences of each word."
    word_count = count_words(text)
    for word, count in word_count.items():
        print(f"{word}: {count}")
```
This code differs from the previous example in the following ways:

- `re.findall()` with the regular expression `\b\w+(?:-\w+)*\b` is used to split the text into a list of words. This pattern matches words, including hyphenated words (like "high-tech").
- The `Counter` class from the Python standard library (`collections`) is used for word counting, which is more efficient and keeps the code cleaner.

This implementation is more robust and handles more special cases, such as hyphenated words.
The running results are the same as before:

```
this: 2
is: 1
a: 1
sample: 1
text: 2
we: 1
will: 1
use: 1
to: 1
count: 1
the: 1
occurrences: 1
of: 1
each: 1
word: 1
```
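Beyond behaving like a dictionary, `Counter` offers conveniences such as `most_common(n)`, which returns the `n` most frequent words sorted by descending count. A small sketch with a hand-built word list:

```python
from collections import Counter

words = ["this", "is", "a", "sample", "text", "this", "text"]
word_count = Counter(words)

# most_common(n) returns (word, count) pairs sorted by descending count;
# ties keep their first-seen order
print(word_count.most_common(2))  # [('this', 2), ('text', 2)]
```

This is handy when you only care about the top of the frequency distribution, such as building a vocabulary of the most common terms.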
Text preprocessing
Before text analysis, text preprocessing is usually required, including punctuation removal, case normalization, lemmatization, and stemming. This makes the text data more standardized and the resulting statistics more accurate.
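As a minimal sketch of the first steps (lemmatization and stemming typically rely on libraries such as NLTK or spaCy and are not shown here), using only the standard library; the stopword list below is an illustrative assumption, not a standard one:

```python
import string

# A tiny illustrative stopword list; real pipelines use much larger ones
STOPWORDS = {"a", "an", "the", "is", "of", "to", "on"}

def preprocess(text):
    # Case normalization
    text = text.lower()
    # Punctuation removal via a translation table
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize on whitespace and drop stopwords
    return [w for w in text.split() if w not in STOPWORDS]

print(preprocess("The cat sat on the mat!"))  # ['cat', 'sat', 'mat']
```

The cleaned token list can then be fed directly into `count_words` or `Counter`.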
Use more advanced models
In addition to basic statistical methods, we can also use machine learning and deep learning models for text analysis, such as text classification, named entity recognition, and sentiment analysis. There are many powerful machine learning libraries in Python, such as Scikit-learn and TensorFlow, that can help us build and train these models.
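To keep this article dependency-free, here is a toy lexicon-based sentiment scorer rather than a trained model; the positive/negative word lists are illustrative assumptions, and a real system would use a learned classifier (e.g. with Scikit-learn) or a curated lexicon:

```python
import string

# Illustrative word lists, not a real sentiment lexicon
POSITIVE = {"good", "great", "excellent", "happy", "love"}
NEGATIVE = {"bad", "terrible", "awful", "sad", "hate"}

def sentiment_score(text):
    # Strip punctuation from each token before lexicon lookup
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment_score("I love this great product"))    # positive
print(sentiment_score("This is a terrible, sad day"))  # negative
```

The same counting machinery from earlier sections drives the score: sentiment here is just word frequency restricted to two word sets.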
Handle large-scale data
When faced with large-scale text data, we may need to consider technologies such as parallel processing and distributed computing to improve processing efficiency and reduce computing costs. There are some libraries and frameworks in Python that can help us achieve these functions, such as Dask and Apache Spark.
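As a minimal sketch of the split/map/merge idea behind those frameworks, using only the standard library's `multiprocessing` module: each text chunk is counted in a separate process and the partial `Counter` results are merged. Dask and Spark apply the same pattern at a much larger scale:

```python
from collections import Counter
from multiprocessing import Pool

def count_chunk(chunk):
    # Count words in one chunk of text
    return Counter(chunk.lower().split())

def parallel_count(texts, workers=2):
    # Map: count each chunk in a separate process
    with Pool(workers) as pool:
        partials = pool.map(count_chunk, texts)
    # Reduce: merge the partial Counters into one
    total = Counter()
    for c in partials:
        total += c
    return total

if __name__ == "__main__":
    chunks = ["this is a sample text", "this text is another sample"]
    print(parallel_count(chunks))
```

The chunking granularity and worker count shown here are arbitrary; in practice they are tuned to the data size and available cores.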
Combine with other data sources
In addition to text data, we can also combine other data sources, such as image data, time series data, and geospatial data, to conduct more comprehensive and multi-dimensional analysis. There are many data processing and visualization tools in Python that can help us process and analyze this data.
Summary
This article has provided an in-depth introduction to using Python for English text statistics, including word frequency statistics, vocabulary statistics, and text sentiment analysis. Here's a summary:
Word frequency statistics:
- The Python function `count_words(text)` processes the text and counts the frequency of word occurrences.
- Text preprocessing includes converting the text to lowercase, removing punctuation, etc.
- A loop iterates over the words in the text, and a dictionary stores each word and its occurrence count.

Further optimization and expansion:
- Regular expressions and the `Counter` class make the code more efficient and robust.
- Regular expressions split the text into a list of words, including hyphenated words.
- Using the `Counter` class for word counting simplifies the code.
Text preprocessing:
Text preprocessing is an important step in text analysis, including punctuation removal, case processing, lemmatization, and stemming, etc., to normalize text data.
Use more advanced models:
The possibilities of using machine learning and deep learning models for text analysis such as text classification, named entity recognition, and sentiment analysis are introduced.
Handle large-scale data:
Technical considerations when processing large-scale text data are mentioned, including parallel processing and distributed computing to improve efficiency and reduce costs.
Combine with other data sources:
The possibility of combining other data sources for more comprehensive and multidimensional analysis is explored, such as image data, time series data, and geospatial data.
Summary:
The article's main content is recapped, along with prospects for future work, encouraging further research and exploration to handle more complex and diverse text analysis tasks.
By studying this article, readers can master the basic methods of using Python for English text statistics, and understand how to further optimize and extend these methods to cope with more complex text analysis tasks.