Study notes for Natural Language Processing with Python (2nd edition, Steven Bird et al.), Chapter 5: Categorizing and Tagging Words


# -*- coding: utf-8 -*-

import nltk, re, pprint
from nltk import word_tokenize

The process of classifying words into their parts of speech (POS) and labeling them accordingly is known as part-of-speech tagging (POS tagging), or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is called a tagset.

5.1 Using a Tagger

A part-of-speech tagger (POS tagger) processes a sequence of words and attaches a part-of-speech tag to each word.

text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)
[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]
nltk.help.upenn_tagset('RB')
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
# homonyms: the same word form with two distinct parts of speech
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)
[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

The text.similar() method takes a word w, finds all contexts w1 w w2, and then finds all words w' that appear in those same contexts, i.e. w1 w' w2.

text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')  # searching for woman finds mostly nouns
man time day year car moment world house family child country boy
state job way war place girl week case
text.similar('bought') # searching for bought finds mostly verbs
made done put said found seen had left given heard got brought was
been set told that in took felt
text.similar('over')   # searching for over generally finds prepositions
in on to of and for with from at by that into as up out down through
about all is
text.similar('the')    # searching for the finds several determiners
a his this their its her an that our any all one these my in your no
some other and
  • A tagger can correctly identify the tags of these words in the context of a sentence.
  • A tagger can also model our knowledge of unknown words; for example, we can guess that scrobbling is probably a verb, with the root scrobble (see the quick check below).
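A quick check of this last point; a sketch whose exact output depends on your NLTK version and tagging model:

text = word_tokenize("He was scrobbling happily")
nltk.pos_tag(text)  # 'scrobbling' should be tagged as a verb (VBG), guessed from its -ing ending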

5.2 Tagged Corpora

Representing Tagged Tokens

By convention in NLTK, a tagged token is represented as a tuple consisting of the token and its tag. The function str2tuple() creates such a tuple from the standard string representation of a tagged token.

tagged_token = nltk.tag.str2tuple('fly/NN')
tagged_token
('fly', 'NN')
tagged_token[0]
'fly'
tagged_token[1]
'NN'
# constructing a list of tagged tokens from a string
sent = '''
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
interest/NN of/IN both/ABX governments/NNS ''/'' ./.
'''
[nltk.tag.str2tuple(t) for t in sent.split()][:3]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN')]

Reading Tagged Corpora

The corpus readers in NLTK provide a uniform interface for reading tagged corpora, so we don't have to worry about the different file formats.

nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
print(nltk.corpus.nps_chat.tagged_words())
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
nltk.corpus.conll2000.tagged_words()
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
nltk.corpus.brown.tagged_words(tagset='universal')
[('The', 'DET'), ('Fulton', 'NOUN'), ...]
nltk.corpus.treebank.tagged_words(tagset='universal')
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]

NLTK also includes tagged corpora for several other languages, including Chinese.

nltk.corpus.sinica_treebank.tagged_words()
[('一', 'Neu'), ('友情', 'Nad'), ('嘉珍', 'Nba'), ...]
nltk.corpus.indian.tagged_words()
[('মহিষের', 'NN'), ('সন্তান', 'NN'), (':', 'SYM'), ...]
nltk.corpus.mac_morpho.tagged_words()
[('Jersei', 'N'), ('atinge', 'V'), ('média', 'N'), ...]
nltk.corpus.conll2002.tagged_words()
[('Sao', 'NC'), ('Paulo', 'VMI'), ('(', 'Fpa'), ...]
nltk.corpus.cess_cat.tagged_words()
[('El', 'da0ms0'), ('Tribunal_Suprem', 'np0000o'), ...]

A Simplified Part-of-Speech Tagset

Table 5-1. Simplified part-of-speech tagset

Tag   Meaning             Examples
ADJ   adjective           new, good, high, special, big, local
ADV   adverb              really, already, still, early, now
CNJ   conjunction         and, or, but, if, while, although
DET   determiner          the, a, some, most, every, no
EX    existential         there, there's
FW    foreign word        dolce, ersatz, esprit, quo, maitre
MOD   modal verb          will, can, would, may, must, should
N     noun                year, home, costs, time, education
NP    proper noun         Alison, Africa, April, Washington
NUM   number              twenty-four, fourth, 1991, 14:24
PRO   pronoun             he, their, her, its, my, I, us
P     preposition         on, of, at, with, by, into, under
TO    the word to         to
UH    interjection        ah, bang, ha, whee, hmpf, oops
V     verb                is, has, get, do, make, see, run
VD    past tense          said, took, told, made, asked
VG    present participle  making, going, playing, working
VN    past participle     given, taken, begun, sung
WH    wh determiner       who, which, when, what, where, how
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()
[('NOUN', 30654),
 ('VERB', 14399),
 ('ADP', 12355),
 ('.', 11928),
 ('DET', 11389),
 ('ADJ', 6706),
 ('ADV', 3349),
 ('CONJ', 2717),
 ('PRON', 2535),
 ('PRT', 2264),
 ('NUM', 2166),
 ('X', 92)]

Nouns

Nouns generally refer to people, places, things, or concepts, e.g. woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of a verb, as shown in Table 5-2.

Table 5-2. Syntactic patterns involving some nouns

Word          After a determiner                            Subject of the verb
woman         the woman who I saw yesterday ...             the woman sat down
Scotland      the Scotland I remember as a child ...        Scotland has five million people
book          the book I bought yesterday ...               this book recounts the colonization of Australia
intelligence  the intelligence displayed by the child ...   Mary's intelligence impressed her teachers
word_tag_pairs = nltk.bigrams(brown_news_tagged)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
fdist = nltk.FreqDist(noun_preceders)
[tag for (tag, _) in fdist.most_common()]
['NOUN',
 'DET',
 'ADJ',
 'ADP',
 '.',
 'VERB',
 'CONJ',
 'NUM',
 'ADV',
 'PRT',
 'PRON',
 'X']

Verbs

Verbs are words that describe events and actions, e.g. fall and eat, as shown in Table 5-3. In the context of a sentence, verbs typically express a relation involving the referents of one or more noun phrases.

Table 5-3. Syntactic patterns involving some verbs

Word  Simple            With modifiers and adjuncts (italicized)
fall  Rome fell         Dot com stocks suddenly fell like a stone
eat   Mice eat cheese   John ate the pizza with gusto
wsj = nltk.corpus.treebank.tagged_words(tagset='universal')
word_tag_fd = nltk.FreqDist(wsj)
[wt[0] for (wt, _) in word_tag_fd.most_common() if wt[1] == 'VERB'][:10]
['is', 'said', 'are', 'was', 'be', 'has', 'have', 'will', 'says', 'would']
cfd1 = nltk.ConditionalFreqDist(wsj)
cfd1['yield'].most_common()
[('VERB', 28), ('NOUN', 20)]
cfd1['cut'].most_common()
[('VERB', 25), ('NOUN', 3)]
wsj = nltk.corpus.treebank.tagged_words()
cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
list(cfd2['VBN'])[:5]
['discarded', 'Filmed', 'repaid', 'expected', 'alarmed']
cfd1 = nltk.ConditionalFreqDist(wsj)  # rebuild cfd1 from the original (non-universal) tags
[w for w in cfd1.conditions() if 'VBD' in cfd1[w] and 'VBN' in cfd1[w]][:5]
idx1 = wsj.index(('kicked', 'VBD'))
wsj[idx1-4:idx1+1]
[('While', 'IN'),
 ('program', 'NN'),
 ('trades', 'NNS'),
 ('swiftly', 'RB'),
 ('kicked', 'VBD')]
idx2 = wsj.index(('kicked', 'VBN'))
wsj[idx2-4:idx2+1]
[('head', 'NN'),
 ('of', 'IN'),
 ('state', 'NN'),
 ('has', 'VBZ'),
 ('kicked', 'VBN')]

Adjectives and Adverbs

Two other important word classes are adjectives and adverbs. Adjectives describe nouns and can be used as modifiers (e.g. large in the large pizza) or as predicates (e.g. the pizza is large). English adjectives can have internal structure (e.g. fall+ing in the falling stocks). Adverbs modify verbs, specifying the time, manner, place, or direction of the event described by the verb (e.g. quickly in the stocks fell quickly). Adverbs may also modify adjectives (e.g. really in Mary's teacher was really nice).

English also has several closed classes of words, including prepositions, articles (often called determiners, e.g. the, a), modal verbs (e.g. should, may), and personal pronouns (e.g. she, they). Each dictionary and grammar classifies these words differently.
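Reusing the word-tag frequency pattern from the verbs example, we can list frequent members of these classes; a minimal sketch, assuming the treebank corpus with the universal tagset as before:

wsj = nltk.corpus.treebank.tagged_words(tagset='universal')
word_tag_fd = nltk.FreqDist(wsj)
[wt[0] for (wt, _) in word_tag_fd.most_common() if wt[1] == 'ADJ'][:10]  # frequent adjectives
[wt[0] for (wt, _) in word_tag_fd.most_common() if wt[1] == 'ADV'][:10]  # frequent adverbs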

Unsimplified Tags

Example 5-1. A program to find the most frequent noun tags
The program below finds all tags starting with NN and provides a few example words for each. There are many variants of noun tags; the most important contain $ for possessive nouns, S for plural nouns (since plural nouns typically end in s), and P for proper nouns. In addition, most tags have suffix modifiers: -NC for citations, -HL for words in headlines, and -TL for titles (features of the Brown tagset).

def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())
tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
NN [('year', 137), ('time', 97), ('state', 88), ('week', 85), ('home', 72)]
NN$ [("year's", 13), ("world's", 8), ("state's", 7), ("nation's", 6), ("city's", 6)]
NN$-HL [("Golf's", 1), ("Navy's", 1)]
NN$-TL [("President's", 11), ("Administration's", 3), ("League's", 3), ("University's", 3), ("Army's", 3)]
NN-HL [('war', 2), ('party', 2), ('problem', 2), ('cut', 2), ('Question', 2)]
NN-NC [('aya', 1), ('eva', 1), ('ova', 1)]
NN-TL [('President', 88), ('House', 68), ('State', 59), ('University', 42), ('City', 41)]
NN-TL-HL [('Fort', 2), ('Oak', 1), ('City', 1), ('Commissioner', 1), ('Dr.', 1)]
NNS [('years', 101), ('members', 69), ('people', 52), ('sales', 51), ('men', 46)]
NNS$ [("children's", 7), ("women's", 5), ("men's", 3), ("janitors'", 3), ("builders'", 2)]
NNS$-HL [("Idols'", 1), ("Dealers'", 1)]
NNS$-TL [("Women's", 4), ("States'", 3), ("Giants'", 2), ("Grizzlies'", 1), ("Raiders'", 1)]
NNS-HL [('pools', 1), ('$14', 1), ('bids', 1), ('subpoenas', 1), ('effects', 1)]
NNS-TL [('States', 38), ('Nations', 11), ('Masters', 10), ('Rules', 9), ('Giants', 9)]
NNS-TL-HL [('Nations', 1)]

Exploring Tagged Corpora

brown_learned_text = brown.words(categories='learned')
sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often'))[:10]
[',',
 '.',
 'accomplished',
 'analytically',
 'appear',
 'apt',
 'associated',
 'assuming',
 'became',
 'become']
brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()
VERB  ADV  ADP  ADJ    .  PRT 
  37    8    7    6    4    2 

Example 5-2. Searching for three-word phrases using POS tags

from nltk.corpus import brown
def process(sentence):
    for (w1,t1), (w2,t2), (w3,t3) in nltk.trigrams(sentence):
        if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):
            print(w1, w2, w3)
for tagged_sent in brown.tagged_sents():
    process(tagged_sent)
combined to achieve
continue to place
serve to protect

scheduled to vanish
vanish to make
continued to live
seem to cascade
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
data = nltk.ConditionalFreqDist((word.lower(), tag)
                    for (word, tag) in brown_news_tagged)
for word in sorted(data.conditions()):
    if len(data[word]) > 3:
        tags = [tag for (tag, _) in data[word].most_common()]
        print(word, ' '.join(tags))
best ADJ VERB ADV NOUN
close ADV ADJ VERB NOUN
open ADJ VERB NOUN ADV
present ADJ ADV VERB NOUN
that ADP DET PRON ADV

5.3 Mapping Words to Properties Using Python Dictionaries

Indexed Lists vs. Dictionaries

Table 5-4. Linguistic objects as mappings from keys to values

Linguistic object     Maps from      Maps to
Document index        Word           List of pages (where word is found)
Thesaurus             Word sense     List of synonyms
Dictionary            Headword       Entry (part of speech, sense definitions, etymology)
Comparative wordlist  Gloss term     Cognates (list of words, one per language)
Morph analyzer        Surface form   Morphological analysis (list of component morphemes)
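As a quick contrast between the two kinds of lookup (a minimal illustration, not from the book): a list is indexed by integer position, while a dictionary is indexed by an arbitrary immutable key such as a string.

words = ['colorless', 'green', 'ideas']
words[1]                # list lookup is by position: 'green'
pos = {'green': 'ADJ'}
pos['green']            # dictionary lookup is by key: 'ADJ'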

Dictionaries in Python

pos = {}
pos
{}
pos['colorless'] = 'ADJ'
pos['ideas'] = 'N'
pos['sleep'] = 'V'
pos['furiously'] = 'ADV'
pos
{'colorless': 'ADJ', 'furiously': 'ADV', 'ideas': 'N', 'sleep': 'V'}
pos['ideas']
'N'
pos['colorless']
'ADJ'
pos['green']
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-82-7a088695b94e> in <module>()
----> 1 pos['green']


KeyError: 'green'
list(pos)
['sleep', 'furiously', 'colorless', 'ideas']
sorted(pos)
['colorless', 'furiously', 'ideas', 'sleep']
[w for w in pos if w.endswith('s')]
['colorless', 'ideas']
for word in sorted(pos):
    print(word + ":", pos[word])
colorless: ADJ
furiously: ADV
ideas: N
sleep: V
pos.keys()
dict_keys(['sleep', 'furiously', 'colorless', 'ideas'])
pos.values()
dict_values(['V', 'ADV', 'ADJ', 'N'])
pos.items()
dict_items([('sleep', 'V'), ('furiously', 'ADV'), ('colorless', 'ADJ'), ('ideas', 'N')])
pos['sleep'] = 'V'
pos['sleep']
'V'
pos['sleep'] = 'N'
pos['sleep']
'N'
pos['sleep'] = ['N', 'V']
pos['sleep']
['N', 'V']

Defining Dictionaries

pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
pos = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')
# Dictionary keys must be immutable types, such as strings and tuples; trying to define a dictionary with a mutable key raises a TypeError
pos = {['ideas', 'blogs', 'adventures']: 'N'} 
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-99-b550aa25e9a6> in <module>()
      1 # Dictionary keys must be immutable types, such as strings and tuples; trying to define a dictionary with a mutable key raises a TypeError
----> 2 pos = {['ideas', 'blogs', 'adventures']: 'N'}


TypeError: unhashable type: 'list'

Default Dictionaries

from collections import defaultdict
frequency = defaultdict(int)
frequency['colorless'] = 4
frequency['ideas']
0
pos = defaultdict(list)
pos['sleep'] = ['NOUN', 'VERB']
pos['ideas']
[]
pos = defaultdict(lambda: 'NOUN')
pos['colorless'] = 'ADJ'
pos['blog']
'NOUN'
list(pos.items())
[('blog', 'NOUN'), ('colorless', 'ADJ')]
f = lambda: 'N'
f()
'N'
def g():
    return 'N'
g()
'N'
alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
vocab = nltk.FreqDist(alice)
v50 = [word for (word, _) in vocab.most_common(50)]
mapping = defaultdict(lambda: 'UNK')
for v in v50:
    mapping[v] = v
alice2 = [mapping[v] for v in alice]
alice2[:50]
['UNK',
 'Alice',
 "'",
 's',
 'UNK',
 'in',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'I',
 '.',
 'UNK',
 'the',
 'UNK',
 '-',
 'UNK',
 'Alice',
 'was',
 'UNK',
 'to',
 'UNK',
 'very',
 'UNK',
 'of',
 'UNK',
 'UNK',
 'her',
 'UNK',
 'on',
 'the',
 'UNK',
 ',',
 'and',
 'of',
 'UNK',
 'UNK',
 'to',
 'UNK',
 ':',
 'UNK',
 'UNK',
 'UNK',
 'she',
 'had',
 'UNK',
 'UNK']
len(set(alice2))
51

Incrementally Updating a Dictionary

Example 5-3. Incrementally updating a dictionary, and sorting by value

from collections import defaultdict
counts = defaultdict(int)
from nltk.corpus import brown
for (word, tag) in brown.tagged_words(categories='news', tagset='universal'):
    counts[tag] += 1
counts['NOUN']
30654
sorted(counts)
['.',
 'ADJ',
 'ADP',
 'ADV',
 'CONJ',
 'DET',
 'NOUN',
 'NUM',
 'PRON',
 'PRT',
 'VERB',
 'X']
from operator import itemgetter
sorted(counts.items(), key=itemgetter(1), reverse=True)
[('NOUN', 30654),
 ('VERB', 14399),
 ('ADP', 12355),
 ('.', 11928),
 ('DET', 11389),
 ('ADJ', 6706),
 ('ADV', 3349),
 ('CONJ', 2717),
 ('PRON', 2535),
 ('PRT', 2264),
 ('NUM', 2166),
 ('X', 92)]
[t for t, c in sorted(counts.items(), key=itemgetter(1), reverse=True)]
['NOUN',
 'VERB',
 'ADP',
 '.',
 'DET',
 'ADJ',
 'ADV',
 'CONJ',
 'PRON',
 'PRT',
 'NUM',
 'X']
pair = ('NP', 8336)
pair[1]
8336
itemgetter(1)(pair)
8336
last_letters = defaultdict(list)
words = nltk.corpus.words.words('en')
for word in words:
    key = word[-2:]
    last_letters[key].append(word)
last_letters['ly'][:10]
['abactinally',
 'abandonedly',
 'abasedly',
 'abashedly',
 'abashlessly',
 'abbreviately',
 'abdominally',
 'abhorrently',
 'abidingly',
 'abiogenetically']
last_letters['zy'][:10]
['blazy',
 'bleezy',
 'blowzy',
 'boozy',
 'breezy',
 'bronzy',
 'buzzy',
 'Chazy',
 'cozy',
 'crazy']
anagrams = defaultdict(list)
for word in words:
    key = ''.join(sorted(word))
    anagrams[key].append(word)
anagrams['aeilnrt']
['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']
anagrams = nltk.Index((''.join(sorted(w)), w) for w in words)
anagrams['aeilnrt']
['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']

Complex Keys and Values

pos = defaultdict(lambda: defaultdict(int))
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
for ((w1, t1), (w2, t2)) in nltk.bigrams(brown_news_tagged):
    pos[(t1, w2)][t2] += 1
pos[('DET', 'right')]
defaultdict(int, {'ADJ': 11, 'NOUN': 5})

Inverting a Dictionary

counts = defaultdict(int)
for word in nltk.corpus.gutenberg.words('milton-paradise.txt'):
    counts[word] += 1
[key for (key, value) in counts.items() if value == 32][:5]
['virtue', 'every', 'There', 'mortal', 'Him']
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
pos2 = dict((value, key) for (key, value) in pos.items())
pos2['N']
'ideas'
pos.update({'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'})
pos2 = nltk.defaultdict(list)
for key, value in pos.items():
    pos2[value].append(key)
pos2['ADV']
['peacefully', 'furiously']
pos2 = nltk.Index((value, key) for (key, value) in pos.items())
pos2['ADV']
['peacefully', 'furiously']

Table 5-5. Python dictionary methods: a summary of commonly used methods and idioms

Example                        Description
d = {}                         create an empty dictionary and assign it to d
d[key] = value                 assign a value to a given dictionary key
d.keys()                       the keys of the dictionary
list(d)                        the keys of the dictionary, as a list
sorted(d)                      the keys of the dictionary, sorted
key in d                       test whether a particular key is in the dictionary
for key in d                   iterate over the keys of the dictionary
d.values()                     the values of the dictionary
dict([(k1,v1), (k2,v2), ...])  create a dictionary from a list of key-value pairs
d1.update(d2)                  add all items from d2 to d1
defaultdict(int)               a dictionary whose default value is 0

5.4 Automatic Tagging

from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

The Default Tagger

tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
nltk.FreqDist(tags).max()
'NN'
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)
[('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('green', 'NN'),
 ('eggs', 'NN'),
 ('and', 'NN'),
 ('ham', 'NN'),
 (',', 'NN'),
 ('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('them', 'NN'),
 ('Sam', 'NN'),
 ('I', 'NN'),
 ('am', 'NN'),
 ('!', 'NN')]
default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028

The Regular Expression Tagger

patterns = [
    (r'.*ing$', 'VBG'), # gerunds
    (r'.*ed$', 'VBD'), # simple past
    (r'.*es$', 'VBZ'), # 3rd singular present
    (r'.*ould$', 'MD'), # modals
    (r'.*\'s$', 'NN$'), # possessive nouns
    (r'.*s$', 'NNS'), # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
    (r'.*', 'NN') # nouns (default)
    ]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[3])
[('``', 'NN'),
 ('Only', 'NN'),
 ('a', 'NN'),
 ('relative', 'NN'),
 ('handful', 'NN'),
 ('of', 'NN'),
 ('such', 'NN'),
 ('reports', 'NNS'),
 ('was', 'NNS'),
 ('received', 'VBD'),
 ("''", 'NN'),
 (',', 'NN'),
 ('the', 'NN'),
 ('jury', 'NN'),
 ('said', 'NN'),
 (',', 'NN'),
 ('``', 'NN'),
 ('considering', 'VBG'),
 ('the', 'NN'),
 ('widespread', 'NN'),
 ('interest', 'NN'),
 ('in', 'NN'),
 ('the', 'NN'),
 ('election', 'NN'),
 (',', 'NN'),
 ('the', 'NN'),
 ('number', 'NN'),
 ('of', 'NN'),
 ('voters', 'NNS'),
 ('and', 'NN'),
 ('the', 'NN'),
 ('size', 'NN'),
 ('of', 'NN'),
 ('this', 'NNS'),
 ('city', 'NN'),
 ("''", 'NN'),
 ('.', 'NN')]
regexp_tagger.evaluate(brown_tagged_sents)
0.20326391789486245

The Lookup Tagger

fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
most_freq_words = fd.most_common(100)
likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
baseline_tagger.evaluate(brown_tagged_sents)
0.45578495136941344
sent = brown.sents(categories='news')[3]
baseline_tagger.tag(sent)
[('``', '``'),
 ('Only', None),
 ('a', 'AT'),
 ('relative', None),
 ('handful', None),
 ('of', 'IN'),
 ('such', None),
 ('reports', None),
 ('was', 'BEDZ'),
 ('received', None),
 ("''", "''"),
 (',', ','),
 ('the', 'AT'),
 ('jury', None),
 ('said', 'VBD'),
 (',', ','),
 ('``', '``'),
 ('considering', None),
 ('the', 'AT'),
 ('widespread', None),
 ('interest', None),
 ('in', 'IN'),
 ('the', 'AT'),
 ('election', None),
 (',', ','),
 ('the', 'AT'),
 ('number', None),
 ('of', 'IN'),
 ('voters', None),
 ('and', 'CC'),
 ('the', 'AT'),
 ('size', None),
 ('of', 'IN'),
 ('this', 'DT'),
 ('city', None),
 ("''", "''"),
 ('.', '.')]
# add a default tagger as a backoff
baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))

We want to use the lookup table first, and if it cannot assign a tag, fall back to the default tagger; this process is called backoff.
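With the backoff in place, every token receives some tag, so accuracy improves over the bare lookup table; a quick check (the exact score depends on your NLTK and corpus version):

baseline_tagger.evaluate(brown_tagged_sents)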

Example 5-4. Lookup tagger performance with varying model size

def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))
def display():
    import pylab
    word_freqs = nltk.FreqDist(brown.words(categories='news')).most_common()
    words_by_freq = [w for (w, _) in word_freqs]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()
display()

![Lookup tagger performance with varying model size](https://img-blog.csdnimg.cn/20190116093244498.png)

Evaluation

We evaluate the performance of a tagger by comparing its output with the tags assigned by a human expert. Since we usually cannot obtain professional, unbiased human judgments on demand, gold standard test data is used instead: a corpus that has been manually annotated and is accepted as a standard against which automatic output is judged.
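Under the hood, evaluate() is just token-level accuracy against the gold-standard tags. A minimal sketch of the equivalent computation, assuming brown_tagged_sents, brown_sents, and baseline_tagger from above:

from nltk.metrics import accuracy
gold = [tag for sent in brown_tagged_sents for (word, tag) in sent]
test = [tag for sent in brown_sents for (word, tag) in baseline_tagger.tag(sent)]
accuracy(gold, test)  # the same score that baseline_tagger.evaluate() reports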

5.5 N-Gram Tagging

Unigram Tagging

from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'QL'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]
unigram_tagger.evaluate(brown_tagged_sents)
0.9349006503968017

Separating the Training and Testing Data

size = int(len(brown_tagged_sents) * 0.9)
size
4160
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)
0.8120203329014253

General N-Gram Tagging

bigram_tagger = nltk.BigramTagger(train_sents)
bigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'CS'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]
unseen_sent = brown_sents[4203]
bigram_tagger.tag(unseen_sent)
[('The', 'AT'),
 ('population', 'NN'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('Congo', 'NP'),
 ('is', 'BEZ'),
 ('13.5', None),
 ('million', None),
 (',', None),
 ('divided', None),
 ('into', None),
 ('at', None),
 ('least', None),
 ('seven', None),
 ('major', None),
 ('``', None),
 ('culture', None),
 ('clusters', None),
 ("''", None),
 ('and', None),
 ('innumerable', None),
 ('tribes', None),
 ('speaking', None),
 ('400', None),
 ('separate', None),
 ('dialects', None),
 ('.', None)]
bigram_tagger.evaluate(test_sents)
0.10335891557859066

Note why the overall accuracy collapses: when the bigram tagger encounters an unseen word (such as 13.5), it cannot assign a tag; it then cannot tag the following word either, because that word was never seen during training preceded by a None tag, and so on to the end of the sentence.

Combining Taggers

t0 = nltk.DefaultTagger('NN')                    # last resort: tag everything as NN
t1 = nltk.UnigramTagger(train_sents, backoff=t0) # if the unigram tagger fails, fall back to t0
t2 = nltk.BigramTagger(train_sents, backoff=t1)  # try the bigram tagger first; fall back to t1
t2.evaluate(test_sents)
0.8456094886873318
t3 = nltk.TrigramTagger(train_sents, backoff=t2)
t3.evaluate(test_sents)
0.843317053722715
t4 = nltk.BigramTagger(train_sents, cutoff=2, backoff=t1)  # cutoff=2: discard contexts seen only once or twice
t4.evaluate(test_sents)
0.84251968503937

Tagging Unknown Words
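The book's suggested approach for unknown words is to limit the vocabulary to the n most frequent words and replace every other word with a special token UNK, so that an n-gram tagger can learn contexts for UNK (e.g. that a UNK following the is probably a noun). A minimal sketch of the preprocessing step, reusing the defaultdict idiom from Section 5.3:

alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
vocab = nltk.FreqDist(alice)
v1000 = [word for (word, _) in vocab.most_common(1000)]
mapping = defaultdict(lambda: 'UNK')
for v in v1000:
    mapping[v] = v
alice_unk = [mapping[v] for v in alice]  # all rare words are now 'UNK'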

Storing Taggers

from pickle import dump
output = open('t2.pkl', 'wb')
dump(t2, output, -1)
output.close()
from pickle import load
input = open('t2.pkl', 'rb')
tagger = load(input)
input.close()
text = """The board's action shows what free enterprise
    is up against in our complex maze of regulatory laws ."""
tokens = text.split()
tagger.tag(tokens)
[('The', 'AT'),
 ("board's", 'NN$'),
 ('action', 'NN'),
 ('shows', 'NNS'),
 ('what', 'WDT'),
 ('free', 'JJ'),
 ('enterprise', 'NN'),
 ('is', 'BEZ'),
 ('up', 'RP'),
 ('against', 'IN'),
 ('in', 'IN'),
 ('our', 'PP$'),
 ('complex', 'JJ'),
 ('maze', 'NN'),
 ('of', 'IN'),
 ('regulatory', 'NN'),
 ('laws', 'NNS'),
 ('.', '.')]

Performance Limitations

cfd = nltk.ConditionalFreqDist(
    ((x[1], y[1], z[0]), z[1])  # condition: the two preceding tags plus the current word
    for sent in brown_tagged_sents
    for x, y, z in nltk.trigrams(sent))
ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()  # fraction of contexts that are ambiguous
0.049297702068029296
test_tags = [tag for sent in brown.sents(categories='editorial')
             for (word, tag) in t2.tag(sent)]
gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
# print(nltk.ConfusionMatrix(gold_tags, test_tags))  # output omitted: the matrix is large

Tagging Across Sentence Boundaries

Example 5-5. N-gram tagging at the sentence level

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)
0.8456094886873318

5.6 Transformation-Based Tagging
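The chapter describes Brill tagging, which learns a list of repair rules of the form "change tag s to tag t in context c", applied on top of an initial tagger. A minimal sketch using NLTK 3's brill_trainer module, assuming t2 and the train/test split from Example 5-5 (fntbl37 is one of the template sets shipped with NLTK; max_rules is kept small here for speed, and scores will vary):

from nltk.tag import brill, brill_trainer
templates = brill.fntbl37()  # a standard set of rule templates
trainer = brill_trainer.BrillTaggerTrainer(t2, templates, trace=0)
brill_tagger = trainer.train(train_sents, max_rules=10)
brill_tagger.evaluate(test_sents)
brill_tagger.rules()[:3]  # inspect the first few learned rules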

5.7 How to Determine the Category of a Word

Morphological Clues

Syntactic Clues

Semantic Clues

New Words

Morphology in Part-of-Speech Tagsets

5.8 Summary

  • Words can be grouped into classes, such as nouns, verbs, adjectives, and adverbs. These classes are known as lexical categories or parts of speech. Parts of speech are assigned short labels, or tags, such as NN and VB.
  • The process of automatically assigning parts of speech to words in text is called part-of-speech tagging, POS tagging, or just tagging.
  • Automatic tagging is an important step in the NLP pipeline and is useful in a variety of situations, including predicting the behavior of previously unseen words, analyzing word usage in corpora, and text-to-speech systems.
  • Some linguistic corpora, such as the Brown Corpus, have been POS tagged.
  • A variety of tagging methods are possible, e.g. default tagger, regular expression tagger, unigram tagger, and n-gram taggers. These can be combined using a technique known as backoff.
  • Taggers can be trained and evaluated using tagged corpora.
  • Backoff is a method for combining models: when a more specialized model (such as a bigram tagger) cannot assign a tag in a given context, we back off to a more general model (such as a unigram tagger).
  • Part-of-speech tagging is an important, early example of a sequence classification task in NLP: a classification decision at any point in the sequence makes use of the words and tags in the local context.
  • A dictionary is used to map between arbitrary types of information, such as a string and a number: freq['cat'] = 12. We create dictionaries using brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}.
  • N-gram taggers can be defined for large values of n, but once n is larger than 3, we usually encounter the sparse data problem; even with a large quantity of training data, we see only a tiny fraction of the possible contexts.
  • Transformation-based tagging learns a series of repair rules of the form "change tag s to tag t in context c"; each rule fixes mistakes, and may also introduce a (smaller) number of new mistakes.

Acknowledgments

Natural Language Processing with Python [1][2][3][4], by Steven Bird, Ewan Klein & Edward Loper, is a very practical introductory text (first edition 2009, second edition 2015). These study notes draw on both editions, extending and practicing parts of the material. I share them here in the hope that they are useful; feel free to contact me on WeChat (verification: NLP) to study and discuss together, and corrections are welcome.

References

  1. http://nltk.org/
  2. Steven Bird, Ewan Klein & Edward Loper, Natural Language Processing with Python, 2009.
  3. Steven Bird, Ewan Klein & Edward Loper, 《Python自然语言处理》 (Chinese translation), Southeast University Press, 2010.
  4. Steven Bird, Ewan Klein & Edward Loper, Natural Language Processing with Python, 2015.
