Chapter 2: Text Wrangling and Cleaning (tokenization, stop word removal, stemming, lemmatization, etc.)

The text processing pipeline:

Tokenization -> stop word removal -> stemming or lemmatization
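
To make the pipeline concrete, here is a minimal standard-library sketch that chains the three steps. It is only an illustration: the stop word set and the suffix-stripping rule are assumptions made up for this example, not what NLTK does internally.

```python
import re

# Hypothetical, deliberately tiny stop word set (an assumption for this sketch)
STOP_WORDS = {"the", "is", "a", "an", "to", "we", "this"}

def tokenize(text):
    # Lowercase the text and keep runs of letters/digits as tokens
    return re.findall(r"\w+", text.lower())

def remove_stop_words(tokens):
    # Drop every token that appears in the stop word set
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    # Crude suffix stripping; this is NOT the Porter algorithm
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Tokenization -> stop word removal -> stemming
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]

print(preprocess("This is the first day we walked to schools"))
# ['first', 'day', 'walk', 'school']
```

Each stage is a plain function, so any of them could be swapped for the NLTK equivalents covered in the sections that follow.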
1. A quick look at the basic contents of a JSON file:

example.json:
{
"array": [1,2,3,4],
"boolean": "True",
"object": {
"a": "b"
},
"string": "Hello World"
}

Simple code to process it:

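import json
# Open the file; the with-block closes it automatically
with open("example.json", encoding="utf-8") as jsonfile:
    # Parse the JSON content into Python objects
    data=json.load(jsonfile)
print(data['array'],data['boolean'],data['object'],data['string'])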

Result:

[1, 2, 3, 4] True {'a': 'b'} Hello World
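
One detail in example.json is easy to miss: the value of "boolean" is the string "True", not a JSON boolean. The sketch below parses a similar document with `json.loads` and prints the Python type of each value. The extra key "flag" is a hypothetical addition for contrast and is not part of the original file.

```python
import json

# Same shape as example.json, plus a hypothetical "flag" key holding a real JSON boolean
raw = '{"array": [1, 2, 3, 4], "boolean": "True", "flag": true, "object": {"a": "b"}}'
data = json.loads(raw)

# Show which Python type each JSON value was mapped to
for key, value in data.items():
    print(key, type(value).__name__)
# array list
# boolean str
# flag bool
# object dict
```

If a real boolean is wanted, the file would need to store `true` without quotes.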
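2. Sentence splitting
Text cleaning should come first, for example stripping unnecessary characters from HTML as described earlier and dropping very short tokens.
Sentence splitting divides a large chunk of raw text into a series of sentences.
Use sent_tokenize to split sentences: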

from nltk.tokenize import sent_tokenize
# sent_tokenize splits text into sentences by detecting sentence boundaries
inputstring=" This is an example sent. The sentence splitter will split on sent markers. Ohh really !!"
all_sent=sent_tokenize(inputstring)
print(all_sent)

Result:

[' This is an example sent.', 'The sentence splitter will split on sent markers.', 'Ohh really !', '!']
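
For comparison, here is a naive regex-based splitter written with the standard library only. It is a sketch of the simplest possible rule (split after ., ! or ? when followed by whitespace), not how sent_tokenize works, and the second call shows where such a rule breaks down.

```python
import re

def naive_sent_split(text):
    # Split at whitespace that directly follows '.', '!' or '?'
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = " This is an example sent. The sentence splitter will split on sent markers. Ohh really !!"
print(naive_sent_split(text))
# ['This is an example sent.', 'The sentence splitter will split on sent markers.', 'Ohh really !!']

# Abbreviations fool the naive rule
print(naive_sent_split("Dr. Smith arrived at 5 p.m. yesterday."))
# ['Dr.', 'Smith arrived at 5 p.m.', 'yesterday.']
```

Handling abbreviations like these is the reason to prefer a trained boundary detector over a hand-written pattern.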

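3. Tokenization
There are several tokenizers to choose from. The simplest is the split() method of Python strings, which splits on whitespace. word_tokenize() is a more robust alternative, and another option is regexp_tokenize(), which splits the string based on a regular expression.
The code is as follows: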

  • Using split():
s = "Hi Everyone ! This is the first day we go to school."
print(s.split())

Result:

['Hi', 'Everyone', '!', 'This', 'is', 'the', 'first', 'day', 'we', 'go', 'to', 'school.']

  • Using word_tokenize:
from nltk.tokenize import word_tokenize
s = "Hi Everyone ! This is the first day we go to school."
all_word=word_tokenize(s)
print(all_word)

Result:

['Hi', 'Everyone', '!', 'This', 'is', 'the', 'first', 'day', 'we', 'go', 'to', 'school', '.']

  • Using regexp_tokenize:

The regular expression \w+ picks out words and numbers, while \d+ extracts only the digit sequences.

from nltk.tokenize import regexp_tokenize
s = "Hi Everyone ! This is the first day we go to school."
# Use a raw string so that \w is not interpreted as a string escape
all_word=regexp_tokenize(s, pattern=r'\w+')
print(all_word)

Result:

['Hi', 'Everyone', 'This', 'is', 'the', 'first', 'day', 'we', 'go', 'to', 'school']
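
The \d+ pattern mentioned above is not demonstrated in the example, so here is a small sketch of the difference between the two patterns. It uses `re.findall` from the standard library as a stand-in, and the sample sentence is made up for illustration. I would expect `regexp_tokenize(s, pattern=r'\d+')` to behave the same way, but that is an assumption rather than something verified here.

```python
import re

# Hypothetical sample sentence containing both words and numbers
s = "Room 101 opens at 9 for 25 students."

# \w+ matches runs of letters, digits and underscores
print(re.findall(r"\w+", s))
# ['Room', '101', 'opens', 'at', '9', 'for', '25', 'students']

# \d+ matches runs of digits only
print(re.findall(r"\d+", s))
# ['101', '9', '25']
```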

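4. Stemming
For example:
eating, eaten, eats -> eat
Stemming reduces different inflected forms of a word to the same root. Even simple rules such as stripping -s/-es, -ing, or -ed can reach an accuracy of over 70%.
The code is as follows: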

from nltk.stem import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
# Create a Porter stemmer
pst=PorterStemmer()
# Create a Lancaster stemmer
lst=LancasterStemmer()
print(lst.stem("eating"))
print(pst.stem("shopping"))

Result:

eat
shop
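
To show what "stripping -s/-es, -ing, or -ed" amounts to, here is a hand-rolled suffix stripper. It is a rough sketch under simplified rules that I made up for illustration, and it is neither the Porter nor the Lancaster algorithm. The function name `simple_stem` is hypothetical.

```python
def simple_stem(word):
    word = word.lower()
    # Try longer suffixes first and keep at least three characters of stem
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "shopp" -> "shop"
            if (suffix in ("ing", "ed") and len(stem) >= 2
                    and stem[-1] == stem[-2] and stem[-1] not in "aeiou"):
                stem = stem[:-1]
            return stem
    return word

for w in ["eating", "eats", "shopping", "walked", "boxes", "driving", "eaten"]:
    print(w, "->", simple_stem(w))
# eating -> eat
# eats -> eat
# shopping -> shop
# walked -> walk
# boxes -> box
# driving -> driv
# eaten -> eaten
```

The last two lines show the limits of pure suffix rules: "driving" loses its final e, and the irregular form "eaten" is not reduced at all.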

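5. Lemmatization
Lemmatization is a more methodical approach: it uses the context and the part of speech to map a word to its base form.
The code is as follows: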

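from nltk.stem import WordNetLemmatizer
wlem=WordNetLemmatizer()
# Note: lemmatize() treats its argument as a single word (default pos is noun),
# so a whole sentence like this one comes back unchanged
print(wlem.lemmatize("I want to shopping"))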

Result:

I want to shopping
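
The unchanged output above comes from passing a whole sentence to a function that works word by word. The sketch below illustrates the underlying idea with a dictionary lookup keyed by (word, part of speech). The table `LEMMAS` and the function `lookup_lemma` are hypothetical and made up for this example; a real lemmatizer relies on a full lexicon such as WordNet.

```python
# Hypothetical mini-lexicon mapping (word, part of speech) to a lemma
LEMMAS = {
    ("drove", "v"): "drive",
    ("driving", "v"): "drive",
    ("shopping", "v"): "shop",
    ("shopping", "n"): "shopping",
}

def lookup_lemma(word, pos="n"):
    # Fall back to the word itself when the lexicon has no entry
    return LEMMAS.get((word.lower(), pos), word)

# Tokenize first, then look up each word separately
sentence = "I want to shopping"
print([lookup_lemma(w) for w in sentence.split()])
# ['I', 'want', 'to', 'shopping']

# The part of speech changes the answer
print(lookup_lemma("shopping", pos="v"))  # shop
print(lookup_lemma("drove", pos="v"))     # drive
```

The point of the sketch is that the same surface form can have different lemmas depending on its part of speech, which is why the tag matters.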

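6. Stop word removal
Stop words contribute little to a document or a query. There are two ways to obtain a stop word list:
Method 1: use a list compiled by hand or found online.
Method 2: build the list from word frequencies.
NLTK ships with its own stop word corpus.
The code is as follows: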

# Import the stop word lists from the corpus package
from nltk.corpus import stopwords
# Get the English stop words
stoplist=stopwords.words('english')
# Take a look at which words are in the list
print(stoplist)
text="This is just a test"
# Convert the text to lowercase
text=text.lower()
print(text)
# Drop the words that appear in the stop word list
cleanwordlist=[word for word in text.split() if word not in stoplist]
print(cleanwordlist)

Result:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her',
'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their',
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this',
'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did',
'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as',
'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
'between', 'into', 'through', 'during', 'before', 'after', 'above',
'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just',
'don', 'should', 'now']
this is just a test
['test']
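
Method 2 is not shown in the code above, so here is one possible sketch of it: treat a word as a stop word when it occurs in most documents. The toy corpus, the function name `build_stoplist`, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical toy corpus
docs = [
    "the cat sat on the mat",
    "the dog slept on the sofa",
    "the bird sang on the roof",
]

def build_stoplist(documents, min_doc_ratio=0.8):
    # Count in how many documents each word occurs (document frequency)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))
    threshold = min_doc_ratio * len(documents)
    # Words present in at least min_doc_ratio of the documents become stop words
    return sorted(w for w, df in doc_freq.items() if df >= threshold)

stoplist = build_stoplist(docs)
print(stoplist)
# ['on', 'the']

print([w for w in docs[0].split() if w not in stoplist])
# ['cat', 'sat', 'mat']
```

On a real corpus the threshold needs tuning, because frequent domain terms can pass it even though they carry meaning.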

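7. Spelling correction
We can build a very basic spell checker with a pure dictionary lookup, or use fuzzy string matching. The most commonly used approach is the edit-distance algorithm, which is covered in later chapters.
The code is as follows: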

from nltk.metrics import edit_distance
print(edit_distance("rain","shine"))

Result:

3
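
To show what the edit distance measures and how it can drive a basic spell checker, here is a standard-library sketch of the Levenshtein distance (insertions, deletions and substitutions each cost 1) plus a nearest-word lookup. The vocabulary `VOCAB` and the function `suggest` are hypothetical, and this is an illustration rather than NLTK's implementation.

```python
def levenshtein(a, b):
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

# Hypothetical dictionary of known words
VOCAB = ["rain", "shine", "train", "brain"]

def suggest(word, vocabulary=VOCAB):
    # Pick the known word with the smallest edit distance
    return min(vocabulary, key=lambda v: levenshtein(word, v))

print(levenshtein("rain", "shine"))  # 3
print(suggest("shne"))               # shine
```

The value 3 matches the result of edit_distance("rain","shine") shown above.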

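What is the difference between stemming and lemmatization?
In my view, stemming is a crude reduction that chops off the end of a word; an aggressive stemmer may turn driving into driv rather than drive. Lemmatization instead maps a word to its dictionary form using context and part of speech, e.g. drove -> drive.
Chapter summary:
This chapter was mainly about text processing. We covered:
sentence splitting, word tokenization, stemming, lemmatization, stop word removal, and spelling correction.
Note:
Can we still perform other NLP operations after removing stop words?
The answer is no. All typical NLP applications, such as part-of-speech tagging and chunking, rely on context to generate tags for the given text. Once the stop words are removed, that context is gone.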
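
A small sketch makes the loss of context visible. The stop word set below is a hand-picked subset of the NLTK English list printed in section 6, and the function name `strip_stop_words` is hypothetical.

```python
# Hand-picked subset of the NLTK English stop word list shown in section 6
STOP = {"to", "be", "or", "not", "is", "this", "a", "the"}

def strip_stop_words(text):
    # Lowercase, split on whitespace, and drop stop words
    return [w for w in text.lower().split() if w not in STOP]

print(strip_stop_words("To be or not to be"))
# []

print(strip_stop_words("This is not a good movie"))
# ['good', 'movie']
```

In the second sentence the negation disappears, so anything downstream that depends on word order or function words no longer sees the original meaning.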

Reposted from blog.csdn.net/qq_43582620/article/details/105702729