Strip the extra header and footer material from the file so that the text begins with:
One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
and ends with:
And, as if in confirmation of their new dreams and good intentions, as soon as they reached their destination Grete was the first to get up and stretch out her young body.
Save the result as metamorphosis_clean.txt.
Load the data:
filename = 'metamorphosis_clean.txt'
# open in text mode and read the whole file into one string
file = open(filename, 'rt')
text = file.read()
file.close()
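A common alternative for the loading step is a `with` block, which closes the file even if reading fails. The sketch below is only an illustration: the helper name `load_text` and the UTF-8 encoding are assumptions, and it reads a throwaway file so it runs without the book text.

```python
import os
import tempfile

def load_text(path, encoding="utf-8"):
    """Read a whole text file; the with-block closes it automatically."""
    with open(path, "rt", encoding=encoding) as handle:
        return handle.read()

# Demo on a temporary file (hypothetical content, not the real book file).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write("One morning, when Gregor Samsa woke")
    demo_text = load_text(demo_path)

print(demo_text)
```

With the real file, the call would simply be `text = load_text('metamorphosis_clean.txt')`.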
1. Split on whitespace:
words = text.split()
print(words[:100])
# ['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', ...]
2. Split into words with re:
Unlike the previous method, 'armour-like' is split into two words, 'armour' and 'like', and "What's" becomes 'What' and 's'.
import re
# split on every run of non-word characters
words = re.split(r'\W+', text)
print(words[:100])
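To make the difference between the two splitting methods concrete, here is a small side-by-side sketch on a made-up sample sentence (the sentence and variable names are assumptions for illustration). It also shows one detail worth knowing: `re.split` can return empty strings when the text starts or ends with a non-word character.

```python
import re

sample = "His armour-like back. What's happened to me?"

# Whitespace split keeps punctuation attached to the words.
by_space = sample.split()

# \W+ splits on every run of non-word characters, so hyphens and
# apostrophes break words apart; the trailing "?" leaves an empty string.
by_regex = re.split(r"\W+", sample)

# Drop the empty strings left over at the edges.
by_regex_clean = [w for w in by_regex if w]

print(by_space)
print(by_regex)
print(by_regex_clean)
```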
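3. Split on whitespace and remove punctuation:
string.punctuation lists the characters that count as punctuation.
str.maketrans('', '', string.punctuation) builds a translation table that maps nothing to anything else; its third argument is the set of characters to delete.
translate() applies that table to a string.
As a result, 'armour-like' becomes 'armourlike' and "What's" becomes 'Whats'.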
import string
words = text.split()
# translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
print(stripped[:100])
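The following sketch applies the same translation table to a few hand-picked tokens (the sample list is an assumption). It also illustrates a limitation: `string.punctuation` contains only ASCII punctuation, so typographic quotes such as “ are left in place.

```python
import string

# Table that deletes every character listed in string.punctuation.
table = str.maketrans("", "", string.punctuation)

samples = ["armour-like", "What's", "dreams,", "“Oh,"]
stripped = [w.translate(table) for w in samples]

# The curly quote survives because it is not in string.punctuation.
print(stripped)
```

If the text contains many typographic quotes or dashes, one option is to append those characters to the third argument of `str.maketrans()`.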
4. Convert everything to lowercase:
For uppercase, use word.upper() instead.
words = [word.lower() for word in words]
print(words[:100])
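One reason to normalize case is that it merges variants of the same word before counting. A minimal sketch with a made-up word list (the list is an assumption, not taken from the book):

```python
from collections import Counter

words = ["The", "door", "the", "Door", "THE"]

# Before lowercasing, every spelling counts as a distinct word.
before = Counter(words)

# After lowercasing, the variants collapse into two entries.
after = Counter(word.lower() for word in words)

print(len(before), "distinct words before,", len(after), "after")
print(after)
```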
Install NLTK:
Calling nltk.download() opens a dialog; select "all" and click Download.
import nltk
nltk.download()
5. Split into sentences:
Use sent_tokenize().
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[0])
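To see why a dedicated sentence tokenizer is useful, it helps to look at what a naive rule does. The sketch below is not how sent_tokenize() works; it is a deliberately crude regex splitter, and `naive_sentences` is a hypothetical helper written for this comparison only. It breaks after every ".", "!" or "?" followed by whitespace, so an abbreviation such as "Mr." is wrongly treated as a sentence end.

```python
import re

def naive_sentences(text):
    # Split wherever ., ! or ? is followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Mr. Samsa woke up. What had happened? It was no dream."
parts = naive_sentences(sample)

for part in parts:
    print(repr(part))
```

Running sent_tokenize() on the same sample is a quick way to check how it treats the abbreviation.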
6. Split into words:
Use word_tokenize().
This time 'armour-like' stays 'armour-like', and "What's" becomes 'What' and "'s".
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:100])
7. Filter out punctuation:
Keep only purely alphabetic tokens and drop everything else.
This also drops 'armour-like' and "'s".
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# keep only tokens made up entirely of letters
words = [word for word in tokens if word.isalpha()]
print(words[:100])
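The effect of the isalpha() filter is easy to check on a short hand-written token list (the list is an assumption that mimics tokenizer output). Anything containing a hyphen, apostrophe, digit, or punctuation mark is dropped.

```python
# Hand-written tokens standing in for tokenizer output.
tokens = ["his", "armour-like", "back", ",", "What", "'s", "1915", "dreams"]

# str.isalpha() is True only when every character is a letter.
words = [t for t in tokens if t.isalpha()]

print(words)
```

Note that the filter removes hyphenated words entirely rather than cleaning them, which may or may not be what a given task needs.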
8. Filter out stop words, which carry little meaning on their own:
stopwords.words('english') returns the list of such words.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# the NLTK list is lowercase, so lowercase `words` first (step 4)
words = [w for w in words if not w in stop_words]
print(words[:100])
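The filtering logic itself does not depend on NLTK, so it can be tried with a tiny stand-in list. In the sketch below the stop list is hand-picked and hypothetical, not the real stopwords.words('english') list. It also shows why lowercasing matters: a lowercase stop list does not match capitalized words.

```python
# A tiny hand-picked stop list standing in for stopwords.words('english').
stop_words = {"the", "a", "in", "his", "he", "into", "was"}

words = ["He", "found", "himself", "transformed", "in", "his", "bed"]

# Without lowercasing, "He" slips through the filter.
kept_raw = [w for w in words if w not in stop_words]

# Lowercase first, then filter.
kept = [w for w in (w.lower() for w in words) if w not in stop_words]

print(kept_raw)
print(kept)
```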
9. Reduce words to their stems:
porter.stem(word) reduces a word to its stem; for example, "fishing" and "fished" both become "fish".
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
tokens = word_tokenize(text)
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])
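The stemmer can also be tried on a few individual words, which needs only the nltk package itself and no downloaded data. The word list below is an assumption chosen for illustration. Stems are not always dictionary words, so it is worth printing a handful before relying on them.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# A few sample words; stem() is called once per word.
samples = ["fishing", "fished", "Dreams", "transformed"]
stems = {word: porter.stem(word) for word in samples}

for word, stem in stems.items():
    print(word, "->", stem)
```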