1. LookupError: Resource not found.

For example, running the following code raises this error:

from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I am a good boy')

[Figure: screenshot of the LookupError message, which lists the directories NLTK searched]

Solution 1:

import nltk
nltk.download('punkt')

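However, the download may fail with the error "An existing connection was forcibly closed by the remote host". In that case another approach is needed.

Solution 2:

Manually download all of the NLTK data sets and unzip them into one of the directories shown in the figure above (the search paths listed in the error message).
Then run the following code: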

from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I love a good boy')
print(tokenized_word)

If this still fails, run the following code:

import nltk
nltk.download('punkt')

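Note 1: Re-downloading is needed because the version of the previously downloaded data set does not match the latest NLTK version installed with pip install.

Note 2: On Linux, it is best to first configure the data path and then place the downloaded NLTK data package in that directory, for example: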

import nltk
nltk.data.path.append("/home/library/nltk_data")
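
The manual installation in Solution 2 can be sketched with the standard library alone. This is a minimal sketch, not NLTK's own installer: the function name `install_resource_zip` and the example paths are hypothetical. It assumes the usual NLTK data layout, in which a tokenizer archive such as punkt.zip is unpacked under a `tokenizers` subdirectory of the data directory.

```python
import os
import zipfile


def install_resource_zip(zip_path, data_dir, category):
    """Unpack a manually downloaded NLTK data archive (hypothetical helper).

    zip_path: path to the downloaded archive, e.g. punkt.zip
    data_dir: an NLTK data directory, e.g. /home/library/nltk_data
    category: resource category subdirectory, e.g. "tokenizers" or "corpora"
    """
    target = os.path.join(data_dir, category)
    # Create <data_dir>/<category> if it does not exist yet.
    os.makedirs(target, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For example, `install_resource_zip("punkt.zip", "/home/library/nltk_data", "tokenizers")` would unpack the archive under /home/library/nltk_data/tokenizers. The data directory still has to be one that NLTK searches, for instance by appending it to nltk.data.path.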

2. Tokenization and stop words

Tokenization

from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I love a good boy')
print(tokenized_word)

Stop words

from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
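
A typical next step is to drop the stop words from the tokenized text. The sketch below uses a small hand-written set as a stand-in for `set(stopwords.words("english"))` so that it runs without the corpus installed; the helper name `remove_stop_words` is hypothetical. Tokens are lowercased before the lookup because the NLTK English stop word list is lowercase.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


# Stand-in for set(stopwords.words("english")); the real list is much longer.
demo_stop_words = {"i", "a", "am", "the", "is"}

# The tokens that word_tokenize returns for 'I love a good boy'.
tokens = ["I", "love", "a", "good", "boy"]
filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)  # ['love', 'good', 'boy']
```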

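3. Part-of-speech tagging and lemmatization

Lemmatization is similar to stemming. The difference is that stemming often produces strings that are not real words, whereas the result of lemmatization is an actual word, so only lemmatization is covered here. Lemmatization in turn depends on the part of speech, so it relies on the results of part-of-speech tagging.

Part-of-speech tagging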

import nltk
text = nltk.word_tokenize('what does the fox say')
print(text)
print(nltk.pos_tag(text))
 
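Output:
['what', 'does', 'the', 'fox', 'say']
The second line of output is a list of tuples; the first element of each tuple is the word and the second is its part-of-speech tag:
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]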
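
Because each element is a (word, tag) tuple, the result is easy to filter with a list comprehension. The small sketch below uses the tagged output from this section as literal data, so it runs without NLTK; the helper name `words_with_tag_prefix` is hypothetical. It relies on the fact that noun tags in the Penn Treebank tagset start with NN and verb tags with VB.

```python
def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


# Tagged tokens copied from the output in this section.
tagged = [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'),
          ('fox', 'NNS'), ('say', 'VBP')]

nouns = words_with_tag_prefix(tagged, "NN")
verbs = words_with_tag_prefix(tagged, "VB")
print(nouns)  # ['fox']
print(verbs)  # ['does', 'say']
```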

Tag	Meaning	Examples
ADJ	adjective	new, good, high, special, big
ADV	adverb	really, already, still, early, now
CNJ	conjunction	and, or, but, if, while
DET	determiner	the, a, some, most, every
EX	existential	there, there’s
FW	foreign word	dolce, ersatz, esprit, quo, maitre
MOD	modal verb	will, can, would, may, must
N	noun	year, home, costs, time
NP	proper noun	Alison, Africa, April, Washington
NUM	number	twenty-four, fourth, 1991, 14:24
PRO	pronoun	he, their, her, its, my, I, us
P	preposition	on, of, at, with, by, into, under
TO	the word to	to
UH	interjection	ah, bang, ha, whee, hmpf, oops
V	verb	is, has, get, do, make, see, run
VD	past tense	said, took, told, made, asked
VG	present participle	making, going, playing, working
VN	past participle	given, taken, begun, sung
WH	wh determiner	who, which, when, what, where

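The tags can also be listed with nltk.help.upenn_tagset(). Reference: https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/ (Warning: the table above contains errors! In particular, its tags do not match the Penn Treebank tags, such as WDT and VBZ, that nltk.pos_tag returns in the output above.)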

Lemmatization

# { Part-of-speech constants
ADJ, ADJ_SAT, ADV, NOUN, VERB = "a", "s", "r", "n", "v"
# }

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))
'''
Output:
play
playing
playing
playing
'''
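
Since the lemma depends on the part of speech, the Penn Treebank tags produced by nltk.pos_tag have to be converted to the WordNet constants before calling lemmatize. One common way to do this is sketched below. The helper name `penn_to_wordnet` is hypothetical, and the fallback to noun for every other tag is an assumption chosen to match the default `pos` of `lemmatize`.

```python
# WordNet part-of-speech constants (adjective, adverb, noun, verb).
ADJ, ADV, NOUN, VERB = "a", "r", "n", "v"


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS constant (hypothetical helper)."""
    if tag.startswith("JJ"):
        return ADJ
    if tag.startswith("VB"):
        return VERB
    if tag.startswith("RB"):
        return ADV
    # Assumption: treat everything else as a noun, the lemmatizer's default.
    return NOUN


# Tagged tokens in the (word, tag) form returned by nltk.pos_tag.
tagged = [("what", "WDT"), ("does", "VBZ"), ("the", "DT"),
          ("fox", "NNS"), ("say", "VBP")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
# [('what', 'n'), ('does', 'v'), ('the', 'n'), ('fox', 'n'), ('say', 'v')]
```

Each pair can then be passed on as `lemmatizer.lemmatize(word, pos=wn_pos)`.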

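4. Sentence splitting

Because word2vec essentially learns its word vectors from the words within each sentence, the article first has to be split into sentences.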

from nltk.tokenize import sent_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)
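
The sentence list is then usually turned into one token list per sentence, which is the input shape that word2vec implementations such as gensim's Word2Vec commonly expect. A minimal sketch follows: the helper name `build_corpus` is hypothetical, and `str.split` is only a stand-in tokenizer so that the example runs without NLTK data. In practice `word_tokenize` would be passed as the `tokenize` argument.

```python
def build_corpus(sentences, tokenize=str.split):
    """Turn a list of sentences into a list of lowercase token lists."""
    return [[token.lower() for token in tokenize(sentence)]
            for sentence in sentences]


# Two sentences written out by hand, standing in for sent_tokenize output.
sentences = ["The sky is pinkish-blue.", "You shouldn't eat cardboard"]
corpus = build_corpus(sentences)
print(corpus)
# [['the', 'sky', 'is', 'pinkish-blue.'], ['you', "shouldn't", 'eat', 'cardboard']]
```

Note that the whitespace stand-in keeps punctuation attached to words ("pinkish-blue."), which is one reason to use a real tokenizer here.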

herosunly
