1. Preparation: tokenization and cleaning
- import nltk
- from nltk.corpus import stopwords
- from nltk.corpus import brown
- import numpy as np
- # Tokenize
- text = ("Sentiment analysis is a challenging subject in machine learning. "
-         "People express their emotions in language that is often obscured by sarcasm, "
-         "ambiguity, and plays on words, all of which could be very misleading for "
-         "both humans and computers.").lower()
- text_list = nltk.word_tokenize(text)
- # Remove punctuation
- english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']
- text_list = [word for word in text_list if word not in english_punctuations]
- # Remove stopwords
- stops = set(stopwords.words("english"))
- text_list = [word for word in text_list if word not in stops]
2. Using a part-of-speech tagger: process a sequence of words, attaching a part-of-speech tag to each word
- nltk.pos_tag(text_list)
- Out[81]:
- [('sentiment', 'NN'),
- ('analysis', 'NN'),
- ('challenging', 'VBG'),
- ('subject', 'JJ'),
- ('machine', 'NN'),
- ('learning', 'VBG'),
- ('people', 'NNS'),
- ('express', 'JJ'),
- ('emotions', 'NNS'),
- ('language', 'NN'),
- ('often', 'RB'),
- ('obscured', 'VBD'),
- ('sarcasm', 'JJ'),
- ('ambiguity', 'NN'),
- ('plays', 'NNS'),
- ('words', 'NNS'),
- ('could', 'MD'),
- ('misleading', 'VB'),
- ('humans', 'NNS'),
- ('computers', 'NNS')]
- brown_tagged = nltk.corpus.brown.tagged_words()
- brown_tagged_sents = brown.tagged_sents(categories='news')
- brown_sents = brown.sents(categories='news')
- # Default tagging
- tags = [tag for (word,tag) in brown.tagged_words(categories='news')]
- print(nltk.FreqDist(tags).max())
- NN
- raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
- tokens = nltk.word_tokenize(raw)
- default_tagger = nltk.DefaultTagger('NN')
- print(default_tagger.tag(tokens))
- print(default_tagger.evaluate(brown_tagged_sents))
- [('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'), ('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'), ('I', 'NN'), ('am', 'NN'), ('!', 'NN')]
- 0.13089484257215028
- # Regular-expression tagger
- patterns = [(r'.*ing$', 'VBG'), (r'.*ed$', 'VBD'), (r'.*es$', 'VBZ'), (r'.*ould$', 'MD'),
-             (r'.*\'s$', 'NN$'), (r'.*s$', 'NNS'), (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), (r'.*', 'NN')]
- regexp_tagger = nltk.RegexpTagger(patterns)
- regexp_tagger.tag(brown_sents[3])
- print(regexp_tagger.evaluate(brown_tagged_sents))
- 0.20326391789486245
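RegexpTagger tries the patterns in order and uses the first regex that matches, which is why the catch-all `r'.*'` must come last. A minimal sketch (with hypothetical tokens chosen to exercise each pattern, not from the original):

```python
import nltk

# Patterns are tried in order; the first regex that matches the word wins.
patterns = [(r'.*ing$', 'VBG'),                 # gerunds
            (r'.*ed$', 'VBD'),                  # simple past
            (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
            (r'.*', 'NN')]                      # default: tag everything else as noun
tagger = nltk.RegexpTagger(patterns)
print(tagger.tag(['running', 'jumped', '3.14', 'table']))
# → [('running', 'VBG'), ('jumped', 'VBD'), ('3.14', 'CD'), ('table', 'NN')]
```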
- # Lookup tagger: find the 100 most frequent words and store their most likely tag.
- # This information can then serve as the model for a "lookup tagger" (NLTK UnigramTagger).
- fd = nltk.FreqDist(brown.words(categories='news'))
- cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
- most_freq_words = [w for (w, _) in fd.most_common(100)]  # fd.keys() is not frequency-sorted in Python 3
- likely_tags = dict((word,cfd[word].max()) for word in most_freq_words)
- # baseline_tagger = nltk.UnigramTagger(model=likely_tags)
- # Many words get the tag None because they are not among the 100 most frequent words;
- # the backoff parameter supplies a default tag for them
- baseline_tagger = nltk.UnigramTagger(model=likely_tags,backoff=nltk.DefaultTagger('NN'))
- print(baseline_tagger.evaluate(brown_tagged_sents))
- 0.46063806511923944
(1) Unigram tagger: a simple algorithm that assigns each token its most likely tag, ignoring context
- In[87]: unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)  # Train a unigram tagger
- print(unigram_tagger.tag(brown_sents[2007]))
- unigram_tagger.evaluate(brown_tagged_sents)
- [('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
- Out[87]: 0.9349006503968017
- # Split into training and test sets
- size = int(len(brown_tagged_sents)*0.9)
- train_sents = brown_tagged_sents[:size]
- test_sents = brown_tagged_sents[size:]
- unigram_tagger = nltk.UnigramTagger(train_sents)
- unigram_tagger.evaluate(test_sents)
- Out[89]: 0.8121200039868434
(2) N-gram taggers: the NgramTagger class uses a tagged training corpus to determine which part-of-speech tag is most likely in each context.
- bigram_tagger = nltk.BigramTagger(train_sents)
- bigram_tagger.tag(brown_sents[2007])
- bigram_tagger.evaluate(test_sents)
- Out[90]: 0.10206319146815508
Note that the bigram tagger can tag every word in the sentences it saw during training, but fails on an unseen sentence. As soon as it hits a new word it cannot assign a tag, and it also cannot tag the word that follows, because during training it never saw any word preceded by a None tag. Its overall accuracy is therefore very low.
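This sparse-data failure can be reproduced without the Brown corpus; a minimal sketch with a hypothetical two-sentence training set:

```python
import nltk

# Tiny hypothetical training corpus: two tagged sentences.
train = [[('the', 'AT'), ('dog', 'NN'), ('runs', 'VBZ')],
         [('the', 'AT'), ('cat', 'NN'), ('runs', 'VBZ')]]
bigram = nltk.BigramTagger(train)

# A sentence seen in training is tagged completely.
print(bigram.tag(['the', 'cat', 'runs']))
# One unseen word ('bird') gets None, and so does the word after it:
# the context (None, 'runs') was never observed during training.
print(bigram.tag(['the', 'bird', 'runs']))
```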
(3) Combining taggers
Try tagging the token with the bigram tagger.
If the bigram tagger cannot find a tag, try the unigram tagger.
If the unigram tagger also fails, use the default tagger.
- t0 = nltk.DefaultTagger('NN')
- t1 = nltk.UnigramTagger(train_sents,backoff=t0)
- t2 = nltk.BigramTagger(train_sents,backoff=t1)
- t2.evaluate(test_sents)
- Out[92]: 0.8452108043456593
- t3 = nltk.BigramTagger(train_sents, cutoff=2, backoff=t1)  # cutoff=2: discard contexts seen only once or twice
- t3.evaluate(test_sents)
- Out[95]: 0.8424200139539519
(4) Saving a tagger: training a tagger on a large corpus can take a long time, so saving the trained tagger is worthwhile
- In[101]: # Save the tagger
- from pickle import dump
- output = open('t2.pkl','wb')
- dump(t2,output,-1)
- output.close()
- # Load the tagger
- from pickle import load
- input = open('t2.pkl','rb')
- tagger = load(input)
- input.close()
- # Use the tagger
- text = "Sentiment analysis is a challenging subject in machine learning."
- tokens = text.split()
- tagger.tag(tokens)
- Out[101]:
- [('Sentiment', 'NN'),
- ('analysis', 'NN'),
- ('is', 'BEZ'),
- ('a', 'AT'),
- ('challenging', 'JJ'),
- ('subject', 'NN'),
- ('in', 'IN'),
- ('machine', 'NN'),
- ('learning.', 'NN')]