NLTK Usage Summary

1. LookupError: Resource not found.

  For example, running the following code raises this error:

from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I am a good boy')


  • Solution 1:
import nltk
nltk.download('punkt')

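However, the download may fail with the error "An existing connection was forcibly closed by the remote host". In that case another approach is needed.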

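  • Solution 2:
  1. Manually download the complete NLTK data collection and extract it into one of the directories that the LookupError message lists under "Searched in:".
  2. Run the following code: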
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I love a good boy')
print(tokenized_word)
  3. If an error still occurs, run the following code as well:
import nltk
nltk.download('punkt')

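Note 1: The download has to be repeated because the manually downloaded data may not match the latest NLTK version installed via pip install.

Note 2: On Linux, it is best to register the data directory first, as in the code below, and then place the downloaded NLTK data package in it.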

import nltk
nltk.data.path.append("/home/library/nltk_data")
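
After extracting the data by hand, it can help to confirm that the resource sits where NLTK will search. The following is a minimal sketch using only the standard library. `dirs_with_resource` is a hypothetical helper (not part of NLTK), and it assumes the unzipped layout in which punkt lives at tokenizers/punkt inside an nltk_data directory.

```python
import os

def dirs_with_resource(search_dirs, resource_subpath):
    """Return the directories in search_dirs that contain resource_subpath."""
    found = []
    for base in search_dirs:
        # Build an OS-specific path from the slash-separated resource name.
        candidate = os.path.join(base, *resource_subpath.split("/"))
        if os.path.exists(candidate):
            found.append(base)
    return found

# Assumption: newer NLTK releases may ask for "tokenizers/punkt_tab" instead.
print(dirs_with_resource(["/home/library/nltk_data"], "tokenizers/punkt"))
```

In a real session the first argument would typically be nltk.data.path, so that the directories checked are exactly the ones NLTK searches.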

2. Tokenization and stop words

  • Tokenization
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I love a good boy')
print(tokenized_word)
  • Stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
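
The stop-word set is typically used to filter a token list. A minimal sketch follows, in which `remove_stop_words` is a hypothetical helper and the tiny hand-written set merely stands in for stopwords.words("english") so that the example runs without the corpus installed.

```python
def remove_stop_words(tokens, stop_words):
    """Keep only tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Stand-in for set(stopwords.words("english")); the real list is much longer.
demo_stop_words = {"i", "a", "the", "is"}

tokens = ["I", "love", "a", "good", "boy"]
filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)  # ['love', 'good', 'boy']
```

Tokens are lowercased before the lookup because the English stop-word list is stored in lowercase.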

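3. Part-of-speech tagging and lemmatization

  Lemmatization is similar to stemming. The difference is that stemming often produces strings that are not real words, whereas the result of lemmatization is always a real word, so only lemmatization is covered here. Lemmatization depends on the part of speech, however, so it needs the output of part-of-speech tagging.

  • Part-of-speech tagging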
import nltk
text = nltk.word_tokenize('what does the fox say')
print(text)
print(nltk.pos_tag(text))
 
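Output:
['what', 'does', 'the', 'fox', 'say']
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
The second line, from pos_tag, is a list of tuples: the first element of each tuple is the word and the second is its part-of-speech tag.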
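
Because each item is a (word, tag) tuple, the result can be unpacked directly in a loop or comprehension. The small sketch below reuses the output shown above as a literal, so it runs without the tagger, and picks out the words tagged as verbs.

```python
# The tagged output from above, written as a literal so no tagger is needed.
tagged = [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]

# Penn Treebank verb tags all start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
verbs = [word for word, tag in tagged if tag.startswith('VB')]
print(verbs)  # ['does', 'say']
```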

Tag   Meaning              Examples
ADJ   adjective            new, good, high, special, big
ADV   adverb               really, already, still, early, now
CNJ   conjunction          and, or, but, if, while
DET   determiner           the, a, some, most, every
EX    existential          there, there's
FW    foreign word         dolce, ersatz, esprit, quo, maitre
MOD   modal verb           will, can, would, may, must
N     noun                 year, home, costs, time
NP    proper noun          Alison, Africa, April, Washington
NUM   number               twenty-four, fourth, 1991, 14:24
PRO   pronoun              he, their, her, its, my, I, us
P     preposition          on, of, at, with, by, into, under
TO    the word to          to
UH    interjection         ah, bang, ha, whee, hmpf, oops
V     verb                 is, has, get, do, make, see, run
VD    past tense           said, took, told, made, asked
VG    present participle   making, going, playing, working
VN    past participle      given, taken, begun, sung
WH    wh determiner        who, which, when, what, where

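The tags can also be looked up with nltk.help.upenn_tagset(). See also https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/. (Note that the table above contains errors! It appears to list a simplified tagset, not the Penn Treebank tags such as WDT, VBZ, and DT that nltk.pos_tag returns in the output shown earlier.)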

  • Lemmatization
# { Part-of-speech constants
ADJ, ADJ_SAT, ADV, NOUN, VERB = "a", "s", "r", "n", "v"
# }
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))
'''
Output:
play
playing
playing
playing
'''
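
Since pos_tag returns Penn Treebank tags while the lemmatizer expects the single-letter constants listed above, a small mapping is needed to connect the two steps. A minimal sketch: `penn_to_wordnet` is a hypothetical helper, and falling back to noun for any other tag is an assumption chosen here to mirror the lemmatizer's default of pos="n".

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r', 'n')."""
    if tag.startswith('JJ'):   # JJ, JJR, JJS: adjectives
        return 'a'
    if tag.startswith('VB'):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return 'v'
    if tag.startswith('RB'):   # RB, RBR, RBS: adverbs
        return 'r'
    return 'n'                 # assumption: treat everything else as a noun

tagged = [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
# [('what', 'n'), ('does', 'v'), ('the', 'n'), ('fox', 'n'), ('say', 'v')]
```

Each pair can then be passed on as lemmatizer.lemmatize(word, pos=pos).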

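4. Sentence splitting

  Because word2vec essentially learns word vectors sentence by sentence, the document needs to be split into sentences first.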

from nltk.tokenize import sent_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)
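
Word2vec implementations (for example gensim's Word2Vec) generally take the corpus as a list of sentences, each of which is a list of tokens. A minimal sketch of that final shaping step follows. `to_training_sentences` is a hypothetical helper, and str.split is only a stand-in tokenizer so that the example runs without NLTK data; in practice word_tokenize would be passed instead.

```python
def to_training_sentences(sentences, tokenize=str.split):
    """Turn a list of sentence strings into a list of lowercase token lists."""
    return [[tok.lower() for tok in tokenize(sent)] for sent in sentences]

# Two sentences in the form sent_tokenize might return them.
sentences = ["The sky is pinkish-blue.", "You shouldn't eat cardboard"]
corpus = to_training_sentences(sentences)
print(corpus)
# [['the', 'sky', 'is', 'pinkish-blue.'], ['you', "shouldn't", 'eat', 'cardboard']]
```

Note that the whitespace stand-in leaves punctuation attached to words; word_tokenize separates it into its own tokens.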

Reposted from blog.csdn.net/herosunly/article/details/105017811