Notes on "Natural Language Processing with Python", Part 1

I. Installing NLTK

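NLTK provides interfaces to corpora and lexical resources (such as WordNet), along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.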

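1. From the Python interpreter (the nltk package itself must already be installed, for example with pip install nltk; the commands below open the downloader for its data):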

>>> import nltk
>>> nltk.download()
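
As a possible alternative to the interactive window, the data can also be requested by collection name. The sketch below assumes that the "book" collection is the one needed by nltk.book; the helper name download_book_data is made up for this note.

```python
def download_book_data():
    """Fetch the data used by nltk.book without opening the downloader window.

    Assumption: the "book" collection covers everything nltk.book loads.
    """
    # Imported lazily so that defining this helper does not itself require NLTK.
    import nltk

    # nltk.download returns True when the download succeeded.
    return nltk.download("book")
```

Calling download_book_data() once should be enough; later runs find the data already on disk.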

2. Or download the data archive from the official website and extract it into the Download Directory.

II. Basic usage

1. Loading the data

>>> from nltk.book import *      # load all the required data from NLTK's book module
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text1                   # typing a name at the Python prompt is enough to look up a text
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>

2. Searching text
⑴ text.concordance(word)

Shows every occurrence of word in text, one per line, together with its surrounding context.

>>> text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
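
To make the idea concrete, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name concordance, the width parameter, and the sample sentence are all made up for illustration, and it assumes the text is already a list of tokens.

```python
def concordance(tokens, word, width=3):
    """Return (left, keyword, right) context triples for each match of `word`.

    Matching is case-insensitive; `width` is the number of tokens kept per side.
    """
    target = word.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Made-up sample text, already split into tokens.
tokens = "the monstrous whale swam past a most monstrous ship at dawn".split()
for left, keyword, right in concordance(tokens, "monstrous"):
    print(f"{left:>15} [{keyword}] {right}")
```

Aligning the keyword in a fixed column, as the real output does, is what makes the contexts easy to compare by eye.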

⑵ text.similar(word)

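Finds other words that appear in the same kinds of context as word (the same neighbouring words); the comparison is based on usage only and does not involve meaning.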

>>> text1.similar("monstrous")
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
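
A rough way to picture this is to treat the pair (previous word, next word) as a context and rank other words by how many contexts they share with the target. The sketch below is a simplified stand-in under that assumption, not NLTK's actual code; similar_words and the sample sentence are hypothetical.

```python
from collections import Counter, defaultdict


def similar_words(tokens, word, num=5):
    """Rank other words by the number of (left, right) contexts shared with `word`."""
    tokens = [t.lower() for t in tokens]
    target_word = word.lower()

    # Map every word to the set of (previous token, next token) pairs around it.
    contexts = defaultdict(set)
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))

    target = contexts.get(target_word, set())
    scores = Counter()
    for other, ctxs in contexts.items():
        shared = len(ctxs & target)
        if other != target_word and shared:
            scores[other] = shared
    return [w for w, _ in scores.most_common(num)]


# Made-up sample text.
tokens = ("a monstrous whale and a huge whale met a small boat "
          "near a monstrous boat").split()
print(sorted(similar_words(tokens, "monstrous")))  # ['huge', 'small']
```

Here "huge" and "small" are returned only because they sit in the same slots as "monstrous" (a _ whale, a _ boat), which is why the results need not be synonyms.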

⑶ text.common_contexts([word1,word2…])

Finds the contexts shared by two or more words, printed as left_right pairs of neighbouring words.


>>> text1.common_contexts(["monstrous","very"])
No common contexts were found
>>> text2.common_contexts(["monstrous","very"])
a_pretty am_glad a_lucky is_pretty be_glad
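
The shared contexts can be sketched as a set intersection over (previous word, next word) pairs. This is only an illustration of the idea, assuming a tokenized text; common_contexts here is a standalone hypothetical function and the sample sentence is made up.

```python
def common_contexts(tokens, words):
    """Return the 'left_right' contexts in which every word in `words` occurs."""
    tokens = [t.lower() for t in tokens]
    shared = None
    for word in words:
        # Collect every (previous token, next token) pair around this word.
        ctxs = {
            (tokens[i - 1], tokens[i + 1])
            for i in range(1, len(tokens) - 1)
            if tokens[i] == word.lower()
        }
        shared = ctxs if shared is None else shared & ctxs
    return sorted(f"{left}_{right}" for left, right in (shared or set()))


# Made-up sample text.
tokens = ("she is very glad and he is monstrous glad to be a very pretty "
          "sight or a monstrous pretty one").split()
print(common_contexts(tokens, ["monstrous", "very"]))  # ['a_pretty', 'is_glad']
```

An empty result corresponds to the "No common contexts were found" message in the session above.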

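⑷ text.dispersion_plot([word1, word2, …])

Draws a dispersion plot of where each word occurs, measured as the number of words from the beginning of the text, to show how the words are distributed through it. Plotting requires Matplotlib.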

>>> text4.dispersion_plot(["liberty","constitution"])
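
The data behind such a plot is just a list of token positions per word. The sketch below computes those offsets and prints a crude text version of the plot; word_offsets and the sample sentence are hypothetical, and matching is case-sensitive here as a simplifying assumption.

```python
def word_offsets(tokens, words):
    """Map each word to the list of token positions at which it occurs."""
    offsets = {word: [] for word in words}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets


# Made-up sample text.
tokens = ("liberty and the constitution protect liberty "
          "under the constitution").split()
offsets = word_offsets(tokens, ["liberty", "constitution"])
print(offsets)  # {'liberty': [0, 5], 'constitution': [3, 8]}

# Crude text rendering: one row per word, one column per token position.
for word, positions in offsets.items():
    row = "".join("|" if i in positions else "." for i in range(len(tokens)))
    print(f"{word:>12} {row}")
```

In the real plot each row is a word and each vertical stroke marks one occurrence along the length of the text.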


3. Generating text
⑴ text.generate()

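Generates random text in the style of the source text, built from word sequences that are common in it (punctuation is split off from the words as separate tokens).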


>>> text3.generate()
Building ngram index...
laid by her , and said unto Cain , Where art thou , and said , Go to ,
I will not do it for ten ' s sons ; we dreamed each man according to
their generatio the firstborn said unto Laban , Because I said , Nay ,
but Sarah shall her name be . , duke Elah , duke Shobal , and Akan .
and looked upon my affliction . Bashemath Ishmael ' s blood , but Isra
for as a prince hast thou found of all the cattle in the valley , and
the wo The
"laid by her , and said unto Cain , Where art thou , and said , Go to ,\nI will not do it for ten ' s sons ; we dreamed each man according to\ntheir generatio the firstborn said unto Laban , Because I said , Nay ,\nbut Sarah shall her name be . , duke Elah , duke Shobal , and Akan .\nand looked upon my affliction . Bashemath Ishmael ' s blood , but Isra\nfor as a prince hast thou found of all the cattle in the valley , and\nthe wo The"
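
One simple way to get text "in the style of" a source is a bigram Markov chain: pick each next word at random from the words that followed the current word in the source. The sketch below uses that technique as a simplified stand-in; it is not how NLTK builds its n-gram index, and build_bigram_model, generate, and the seed parameter are hypothetical names.

```python
import random
from collections import defaultdict


def build_bigram_model(tokens):
    """Map each token to the list of tokens that follow it in the source."""
    followers = defaultdict(list)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current].append(nxt)
    return followers


def generate(tokens, length=10, seed=0):
    """Generate `length` tokens by a random walk over the bigram model."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    model = build_bigram_model(tokens)
    word = rng.choice(tokens)
    output = [word]
    while len(output) < length:
        choices = model.get(word)
        # Dead end (word was only seen at the very end): restart anywhere.
        word = rng.choice(choices) if choices else rng.choice(tokens)
        output.append(word)
    return " ".join(output)


# Made-up sample text; punctuation is its own token.
tokens = ("in the beginning God created the heaven and the earth . "
          "and the earth was without form").split()
print(generate(tokens, length=12, seed=1))
```

Because followers are stored with repetition, frequent continuations are chosen more often, which is what gives the output its familiar flavour.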

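4. Counting vocabulary
set()

This function creates an unordered collection of unique elements, which makes membership tests and set operations convenient.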

>>> len(text1)                   # number of tokens (words and punctuation) in the text
260819
>>> sorted(set(text3))           # sorted list of vocabulary items (no duplicates)
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', 'Adam', 'Adbeel', 'Admah', 'Adullamite', 'After', 'Aholibamah', 'Ahuzzath', 'Ajah', 'Akan', 'All', 'Allonbachuth', 'Almighty', 'Almodad', 'Also', 'Alvah', 'Alvan', 'Am', 'Amal', 'Amalek', ...
>>> len(set(text3))              # number of distinct vocabulary items
2789
>>> text5.count("lol")          # number of times "lol" occurs in text5
704
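
The same counts can be tried on an ordinary list of tokens, which behaves like the texts above for len(), set(), and count(). The helper vocabulary_stats and the sample sentence are made up for this sketch.

```python
def vocabulary_stats(tokens, word):
    """Collect the basic counts from this section for a list of tokens."""
    vocabulary = sorted(set(tokens))  # distinct items, in sorted order
    return {
        "tokens": len(tokens),        # every occurrence counts
        "vocabulary": vocabulary,
        "types": len(vocabulary),     # each distinct item counts once
        "count": tokens.count(word),  # occurrences of one specific word
    }


# Made-up sample text; the comma is its own token.
tokens = "lol that was funny lol , really funny".split()
stats = vocabulary_stats(tokens, "lol")
print(stats["tokens"], stats["types"], stats["count"])  # 8 6 2
print(stats["vocabulary"])  # [',', 'funny', 'lol', 'really', 'that', 'was']
```

Note that punctuation counts as a vocabulary item, just as '!' and ',' appear in the sorted list for text3.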


Reposted from blog.csdn.net/CHAINQWE/article/details/107078385