1. LookupError: Resource not found.
For example, running the following code raises this error:
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I am a good boy')
- Solution 1:
import nltk
nltk.download('punkt')
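However, the download may fail with the error "An existing connection was forcibly closed by the remote host". In that case, another approach is needed.
- Solution 2:
- Manually download all NLTK datasets, then unzip them into one of the directories shown in the figure above (NLTK also lists its search directories in the LookupError message):
- Run the following code: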
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I love a good boy')
print(tokenized_word)
- If an error still occurs, run the following code:
import nltk
nltk.download('punkt')
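Note 1: The re-download is needed because the NLTK version of the previously downloaded datasets does not match the latest version installed by pip install.
Note 2: On Linux, it is best to first set the data path in the configuration, then put the downloaded NLTK data package in that directory.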
import nltk
nltk.data.path.append("/home/library/nltk_data")
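To confirm that a manually placed data package is actually visible to NLTK, one option is to probe for it before tokenizing. The sketch below is illustrative only: `resource_available` is a hypothetical helper name, the directory is the one used in the note above, and the resource path `tokenizers/punkt` assumes the standard layout of the `punkt` package.

```python
import nltk

# Directory from the note above; adjust it to where the data was unpacked.
DATA_DIR = "/home/library/nltk_data"
if DATA_DIR not in nltk.data.path:
    nltk.data.path.append(DATA_DIR)


def resource_available(resource_path):
    """Return True if NLTK can locate the resource in any search directory."""
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        return False


# Directories NLTK searches, in order.
for directory in nltk.data.path:
    print(directory)

# "tokenizers/punkt" is assumed to be where the punkt package unpacks.
print("punkt found:", resource_available("tokenizers/punkt"))
```

If the last line prints `False`, the data is not in any of the listed directories, and either the unzip location or the appended path needs to change.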
2. Tokenization and stop words
- Tokenization
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I love a good boy')
print(tokenized_word)
- Stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
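A common next step is to drop the stop words from a token list. Here is a minimal sketch, assuming case-insensitive matching is wanted. `remove_stop_words` is a hypothetical helper, and the small `demo_stop_words` set is only a stand-in so the snippet runs without downloaded data; in practice the `stop_words` set built above would be passed instead.

```python
def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


# Stand-in for set(stopwords.words("english")); for illustration only.
demo_stop_words = {"i", "a", "an", "the", "am", "is", "are"}

tokens = ["I", "love", "a", "good", "boy"]
filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)  # ['love', 'good', 'boy']
```

Storing the stop words in a set keeps each membership check constant-time, which matters once the token list grows.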
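3. POS tagging and lemmatization
Lemmatization is similar to stemming. The difference is that stemming often produces words that do not exist, whereas the result of lemmatization is always a real word, so only lemmatization is covered here. Lemmatization depends on the part of speech, however, so we need the results of POS tagging.
- POS tagging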
import nltk
text = nltk.word_tokenize('what does the fox say')
print(text)
print(nltk.pos_tag(text))
Output:
['what', 'does', 'the', 'fox', 'say']
The output of pos_tag is a list of tuples; the first element of each tuple is the word and the second is its POS tag.
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
Tag | Meaning | Examples |
---|---|---|
ADJ | adjective | new, good, high, special, big |
ADV | adverb | really, already, still, early, now |
CNJ | conjunction | and, or, but, if, while |
DET | determiner | the, a, some, most, every |
EX | existential | there, there’s |
FW | foreign word | dolce, ersatz, esprit, quo, maitre |
MOD | modal verb | will, can, would, may, must |
N | noun | year, home, costs, time |
NP | proper noun | Alison, Africa, April, Washington |
NUM | number | twenty-four, fourth, 1991, 14:24 |
PRO | pronoun | he, their, her, its, my, I, us |
P | preposition | on, of, at, with, by, into, under |
TO | the word to | to |
UH | interjection | ah, bang, ha, whee, hmpf, oops |
V | verb | is, has, get, do, make, see, run |
VD | past tense | said, took, told, made, asked |
VG | present participle | making, going, playing, working |
VN | past participle | given, taken, begun, sung |
WH | wh determiner | who, which, when, what, where |
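You can also list the tags with nltk.help.upenn_tagset(); see also https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/. (Note that the table above contains errors!!! For example, its tags do not match tags such as WDT, VBZ, and DT in the output above.)
- Lemmatization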
# { Part-of-speech constants
ADJ, ADJ_SAT, ADV, NOUN, VERB = "a", "s", "r", "n", "v"
# }
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))
'''
Output:
play
playing
playing
playing
'''
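Because pos_tag returns Penn Treebank tags while the lemmatizer expects the one-letter constants listed above, the two need to be bridged. The sketch below shows one common convention: map by the tag prefix and fall back to noun. `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, and the `lemmatize` argument is assumed to be a callable such as `WordNetLemmatizer().lemmatize`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS constant ("a", "v", "r", "n")."""
    if tag.startswith("JJ"):
        return "a"  # JJ, JJR, JJS: adjectives
    if tag.startswith("VB"):
        return "v"  # VB, VBD, VBG, VBN, VBP, VBZ: verbs
    if tag.startswith("RB"):
        return "r"  # RB, RBR, RBS: adverbs
    return "n"      # assumption: treat everything else as a noun


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs with a callable taking (word, pos=...)."""
    return [lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# Tagged sentence copied from the POS tagging output above.
tagged = [("what", "WDT"), ("does", "VBZ"), ("the", "DT"), ("fox", "NNS"), ("say", "VBP")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With the NLTK data installed, `lemmatize_tagged(nltk.pos_tag(text), WordNetLemmatizer().lemmatize)` would apply this mapping to a tagged sentence. The noun fallback mirrors the lemmatizer's own default of `pos="n"`.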
4. Sentence splitting
Since word2vec is essentially trained on one sentence at a time, we need to split the document into sentences.
from nltk.tokenize import sent_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)
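For word2vec training, each sentence is usually split further into words, giving one list of tokens per sentence. Here is a minimal sketch of that step, assuming lowercasing is wanted. `build_corpus` is a hypothetical helper, and the splitters used in the demo are simple stand-ins so the snippet runs without NLTK data; `sent_tokenize` and `word_tokenize` can be passed in their place.

```python
def build_corpus(text, split_sentences, split_words):
    """Return one list of lowercase tokens per sentence."""
    return [
        [word.lower() for word in split_words(sentence)]
        for sentence in split_sentences(text)
    ]


def naive_sentences(text):
    """Stand-in sentence splitter: breaks on '. ' only (illustration only)."""
    return [part.strip() for part in text.split(". ") if part.strip()]


demo_text = "The sky is pinkish-blue. You shouldn't eat cardboard"
# str.split is a stand-in word splitter that breaks on whitespace.
corpus = build_corpus(demo_text, naive_sentences, str.split)
print(corpus)
```

The naive splitter would wrongly break after "Mr." in the example text above, which is exactly the case `sent_tokenize` handles.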