NLP Notes --- 2. Text Processing



To do NLP, the first problem is getting hold of data; only once we have data can we move forward. Text data can come from documents, CSV tables, web pages, and other sources. Once we have it, we run the text through a standard processing pipeline (a minimal sketch of the whole pipeline follows this list):

  • 1. Convert the text to lowercase, since "car" and "Car" mean the same thing
  • 2. Remove punctuation with a regular expression, ideally replacing each punctuation character with a space
  • 3. Split the text into words on whitespace, producing a list
  • 4. Remove stop words
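
The rest of this post walks through each step; as a preview, a minimal sketch of the whole pipeline (assuming NLTK's stopwords corpus is available locally) could look like this:

import re
from nltk.corpus import stopwords

def normalize_text(text):
    # 1. Lowercase everything
    text = text.lower()
    # 2. Replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # 3. Split into words on whitespace
    words = text.split()
    # 4. Drop stop words
    stop_words = set(stopwords.words("english"))
    return [w for w in words if w not in stop_words]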

Manual Text Processing

1. Convert the text to lowercase

When we process text, a word's case does not change a sentence's meaning, but it does inflate the number of distinct tokens we have to deal with, so we convert every uppercase character to lowercase.

# Sample text
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

# Convert to lowercase
text = text.lower() 
print(text)

2. Remove punctuation with a regular expression

Punctuation is there for human readers; for machine processing it only adds to the data volume, so we remove it. We use the sub function from the re library with a regular expression to replace each punctuation character in the text with a space.

import re

# Replace every non-alphanumeric character (punctuation included) with a space
text = re.sub(r"[^a-zA-Z0-9]", " ", text) 
print(text)
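
Note that this pattern replaces every non-alphanumeric character, not only punctuation, so a contraction like "isn't" becomes "isn t". If that matters for your data, one illustrative variation (not from the original post) is to keep apostrophes in the character class:

# Variation: keep apostrophes so contractions survive (illustrative)
text_with_apostrophes = re.sub(r"[^a-zA-Z0-9']", " ", text)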

3. Tokenization: splitting into words

Next we split the text into individual words. Since punctuation has already been replaced with spaces, splitting on whitespace with split() is enough (it also collapses the runs of spaces left behind by the substitution):

# Split text into tokens (words)
words = text.split()
print(words)

Using NLTK: the Natural Language Toolkit

The steps above process the text by hand. Alternatively, we can let the NLTK toolkit do the work.

import os
import nltk

# Also look for NLTK data (tokenizer models, corpora) in a local ./nltk_data directory
nltk.data.path.append(os.path.join(os.getcwd(), "nltk_data"))
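
If the required resources are not already in that data path, they can be fetched once with nltk.download; the ones used in this post are the punkt tokenizer models, the stop word lists, and the WordNet data:

# One-time download of the NLTK resources used below
nltk.download("punkt")      # models for word_tokenize / sent_tokenize
nltk.download("stopwords")  # stop word lists
nltk.download("wordnet")    # data for WordNetLemmatizer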

# Another sample text
text = "Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers."
print(text)

1. Word tokenization

from nltk.tokenize import word_tokenize

# Split text into words using NLTK
words = word_tokenize(text)
print(words)
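
Unlike the manual split() above, word_tokenize also breaks punctuation off into tokens of its own (the final "." becomes a separate token). If you want word characters only, NLTK's RegexpTokenizer is one alternative:

from nltk.tokenize import RegexpTokenizer

# Keep only runs of word characters, dropping punctuation entirely
tokenizer = RegexpTokenizer(r"\w+")
print(tokenizer.tokenize(text))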

2. Sentence tokenization

from nltk.tokenize import sent_tokenize

# Split text into sentences
sentences = sent_tokenize(text)
print(sentences)
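
The abbreviation "Dr." in the sample text shows why this is useful: a naive split on periods would break the first sentence after "Dr", while sent_tokenize's pretrained model handles common abbreviations correctly:

# Naive splitting wrongly breaks after the abbreviation "Dr."
print(text.split(". "))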

Removing stop words

Many of the words in a sentence, such as "the", contribute little to its meaning, so we can remove them.

# List stop words
from nltk.corpus import stopwords
print(stopwords.words("english"))

# Reset text
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"

# Normalize it
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Tokenize it
words = text.split()
print(words)

# Remove stop words
words = [w for w in words if w not in stopwords.words("english")]
print(words)
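
One practical note: in the list comprehension above, stopwords.words("english") is re-evaluated for every word, and membership tests on a list are linear. For anything beyond toy inputs it is worth building a set once:

# Build the stop word set once; set membership checks are O(1)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]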

Lemmatization: reducing words to their base form

Words in a sentence carry tense and singular/plural variation, so we want to reduce them to a base form (lemma). For example, "ones" should become "one".

from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their root form
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
print(lemmed)

We can specify the part of speech with the pos parameter; by default, lemmatize treats every word as a noun (pos='n'):

# Lemmatize verbs by specifying pos
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in lemmed]
print(lemmed)
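
A small practical note: the comprehensions above construct a new WordNetLemmatizer for every word; creating one instance and reusing it is cleaner and faster:

# Create the lemmatizer once and reuse it
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in words]            # nouns (the default)
lemmed = [lemmatizer.lemmatize(w, pos='v') for w in lemmed]  # then verbs
print(lemmed)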
