Natural Language Processing with Python: Study Notes, Part 1

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources (such as WordNet), along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, plus wrappers for industrial-strength NLP libraries and an active discussion forum.

Searching Text

Let's look up the word monstrous in Moby Dick:

from nltk.book import *

text1.concordance("monstrous")

The output is as follows:

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

The function common_contexts allows us to examine the contexts that two or more words share, such as monstrous and very. We have to enclose these words in square brackets as well as parentheses, separated by commas.

The code is as follows:

text2.common_contexts(["monstrous", "very"])

The result is:

a_pretty am_glad a_lucky is_pretty be_glad
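The idea behind common_contexts can be sketched in plain Python: for each target word, collect its (previous word, next word) pairs, then intersect the sets. This is an illustration of the concept only, not NLTK's actual implementation, and the sample token list is made up for the demo:

```python
from collections import defaultdict

def shared_contexts(tokens, words):
    """For each target word, collect (prev, next) context pairs,
    then return the contexts shared by all target words.
    Illustrative sketch only, not NLTK's implementation."""
    contexts = defaultdict(set)
    for i in range(1, len(tokens) - 1):
        w = tokens[i].lower()
        if w in words:
            contexts[w].add((tokens[i - 1].lower(), tokens[i + 1].lower()))
    shared = set.intersection(*(contexts[w] for w in words))
    return {f"{prev}_{nxt}" for prev, nxt in shared}

# Hypothetical mini-corpus in which both words occur in two shared contexts.
tokens = ["a", "monstrous", "size", "and", "a", "very", "size",
          "am", "monstrous", "glad", "am", "very", "glad"]
print(shared_contexts(tokens, ["monstrous", "very"]))
```

Each context is rendered as `prev_word` as in NLTK's output above (e.g. `a_pretty` means the word appeared between "a" and "pretty").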

text1.generate()  # prints a randomly generated passage based on the text. This method is absent in early NLTK 3 releases (it was later restored, around NLTK 3.4); on an affected version you would need NLTK 2 or an upgrade.

set(text3)  # the vocabulary of text3: the set of distinct tokens

sorted(set(text3))  # the sorted vocabulary; capitalized words appear before lowercase ones
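Since an NLTK text behaves like a list of token strings, the same pattern works on any list of strings; note how ASCII ordering puts uppercase letters before lowercase ones. A minimal plain-Python sketch with made-up tokens:

```python
# A text is just a sequence of tokens, so plain Python suffices here.
tokens = ["The", "whale", "the", "Whale", "a", "whale"]
vocab = sorted(set(tokens))  # distinct tokens, uppercase-first ordering
print(vocab)       # ['The', 'Whale', 'a', 'the', 'whale']
print(len(vocab))  # number of distinct word types: 5
```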

We can use FreqDist to find the 50 most common words in Moby Dick.

The code is as follows:

fdist1 = FreqDist(text1)
fdist1.most_common(50)

The collocations() method finds bigrams that occur more often than we would expect based on the frequencies of the individual words; it is called on a text, e.g. text1.collocations().
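The underlying idea can be sketched with collections.Counter: keep bigrams that occur at least twice, then score each by how much more often it occurs than independent unigram frequencies would predict (a crude PMI-style score). This is a sketch of the concept on a made-up sentence; NLTK's collocations() uses a more refined association measure:

```python
from collections import Counter
from math import log

def top_collocations(tokens, n=3, min_count=2):
    """Rank bigrams occurring at least min_count times by a PMI-style
    score: observed bigram frequency versus what independent unigram
    frequencies would predict. Crude sketch, not NLTK's implementation."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    def pmi(pair):
        a, b = pair
        return log(bigrams[pair] * total / (unigrams[a] * unigrams[b]))
    frequent = [p for p in bigrams if bigrams[p] >= min_count]
    return sorted(frequent, key=pmi, reverse=True)[:n]

tokens = "the red herring swam past the red herring and the old boat".split()
print(top_collocations(tokens, n=2))  # [('red', 'herring'), ('the', 'red')]
```

"red herring" outranks "the red" because "the" is common on its own, so "the red" is closer to what chance would predict.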

fdist = FreqDist([len(w) for w in text1])
print(fdist.items())

The output is as follows:

(1, 47933), (4, 42345), (2, 38513), (6, 17111), (8, 9966), (9, 6428), (11, 1873), (5, 26597), (7, 14399), (3, 50223), (10, 3528), (12, 1053), (13, 567), (14, 177), (16, 22), (15, 70), (17, 12), (18, 1), (20, 1)

Here, (1, 47933) means there are 47,933 tokens of length 1.
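FreqDist is essentially a frequency counter, so the word-length distribution above can be reproduced on any token list with the standard library's collections.Counter. A plain-Python sketch on a made-up token list:

```python
from collections import Counter

tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago"]
length_dist = Counter(len(w) for w in tokens)  # maps word length -> count
print(sorted(length_dist.items()))
# (1, 1) means one token of length 1 (the "."); (4, 2) means two tokens of length 4
```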

s.startswith(t) tests whether s starts with t
s.endswith(t) tests whether s ends with t
t in s tests whether s contains t
s.islower() tests whether all cased characters in s are lowercase
s.isupper() tests whether all cased characters in s are uppercase
s.isalpha() tests whether all characters in s are alphabetic
s.isalnum() tests whether all characters in s are alphanumeric
s.isdigit() tests whether all characters in s are digits
s.istitle() tests whether s is titlecased (every word in s starts with an uppercase letter)
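A few quick checks of these predicates (plain Python, no NLTK needed):

```python
s = "Moby Dick"
print(s.startswith("Moby"))  # True
print(s.endswith("Dick"))    # True
print("Dick" in s)           # True
print(s.islower())           # False: contains uppercase letters
print(s.isalpha())           # False: the space is not alphabetic
print("Whale123".isalnum())  # True: letters and digits only
print("1851".isdigit())      # True
print(s.istitle())           # True: every word starts with an uppercase letter
```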


Reprinted from blog.csdn.net/Mr_wuliboy/article/details/82493170