NLP from Basics to Practice: Named Entity Recognition + Chinese Preprocessing (Traditional/Simplified Conversion and Pinyin)

Part 1: Chinese Preprocessing: Traditional/Simplified Conversion and Pinyin

Everyday Chinese NLP work frequently involves converting between traditional and simplified Chinese and annotating characters with pinyin. This article shows how to implement both.
First, traditional/simplified conversion. No extra Python package is needed; all you need are two Python source files, langconv.py and zh_wiki.py (see the references at the end for download links).

Example code follows (place langconv.py and zh_wiki.py in the same directory as the script below):

from langconv import Converter

# convert Traditional Chinese to Simplified Chinese
def cht_2_chs(line):
    line = Converter('zh-hans').convert(line)
    return line

line_cht= '''
台北市長柯文哲今在臉書開直播,先向網友報告自己3月16日至24日要出訪美國東部4城市,接著他無預警宣布,
2月23日要先出訪以色列,預計停留4至5天。雖他強調台北市、以色列已在資安方面有所交流,也可到當地城市交流、
參觀產業創新等內容,但柯也說「也是去看看一個小國在這麼惡劣環境,howtosurvive,他的祕訣是什麼?」這番話,
也被解讀,頗有更上層樓、直指總統大位的思維。
'''

line_cht = line_cht.replace('\n', '')
ret_chs = cht_2_chs(line_cht)
print(ret_chs)

# convert Simplified Chinese to Traditional Chinese
def chs_2_cht(sentence):
    sentence = Converter('zh-hant').convert(sentence)
    return sentence

line_chs = '忧郁的台湾乌龟台北市長柯文哲今在臉書開直播,先向網友報告自己3月16日至24日要出訪美國東部4城市,接著他無預警宣布,'
line_cht = chs_2_cht(line_chs)
print(line_cht)


The output is as follows:

台北市长柯文哲今在脸书开直播,先向网友报告自己3月16日至24日要出访美国东部4城市,接着他无预警宣布,2月23日要先出访以色列,预计停留4至5天。虽他强调台北市、以色列已在资安方面有所交流,也可到当地城市交流、参观产业创新等内容,但柯也说「也是去看看一个小国在这么恶劣环境,howtosurvive,他的祕诀是什么?」这番话,也被解读,颇有更上层楼、直指总统大位的思维。

憂郁的臺灣烏龜


Next, obtaining the pinyin of Chinese characters. Python modules for this include xpinyin, pypinyin, and others. This article uses xpinyin as the example. The code is as follows:

from xpinyin import Pinyin

p = Pinyin()

# the default separator is '-'
print(p.get_pinyin("上海"))

# show tones (as diacritics or as numbers)
print(p.get_pinyin("上海", tone_marks='marks'))
print(p.get_pinyin("上海", tone_marks='numbers'))

# no separator
print(p.get_pinyin("上海", ''))
# use a space as the separator
print(p.get_pinyin("上海", ' '))

# get the initial letter(s) of the pinyin
print(p.get_initial("上"))
print(p.get_initials("上海"))
print(p.get_initials("上海", ''))
print(p.get_initials("上海", ' '))

The output is as follows:

shang-hai
shàng-hǎi
shang4-hai3
shanghai
shang hai
S
S-H
SH
S H
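The pypinyin module mentioned above offers similar functionality with a slightly different API. Below is a minimal sketch, assuming pypinyin has been installed (e.g. via pip install pypinyin); the outputs in the comments follow pypinyin's documented behavior:

from pypinyin import pinyin, lazy_pinyin, Style

# pinyin() returns one list per character, with tone marks by default
print(pinyin('上海'))                           # [['shàng'], ['hǎi']]
# Style.TONE3 appends tone numbers instead of diacritics
print(pinyin('上海', style=Style.TONE3))        # [['shang4'], ['hai3']]
# lazy_pinyin() returns a flat, tone-free list
print(lazy_pinyin('上海'))                      # ['shang', 'hai']
# Style.FIRST_LETTER yields only the first letter of each syllable
print(pinyin('上海', style=Style.FIRST_LETTER)) # [['s'], ['h']]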

Part 2: Getting Started with Named Entity Recognition (NER)

This part gives a brief introduction to named entity recognition (NER) in natural language processing (NLP).
Named Entity Recognition (NER) is a fundamental building block for information extraction, question answering, syntactic parsing, machine translation, and other applications, and it plays an important role in making NLP technology practical. Generally speaking, the NER task is to identify named entities in text across three major classes (entities, times, and numbers) and seven subclasses (person names, organization names, place names, times, dates, currency, and percentages).
As a simple example, running NER on the sentence “小明早上8点去学校上课。” (“Xiao Ming goes to school for class at 8 a.m.”) should extract:

Person: 小明 (Xiao Ming); Time: 早上8点 (8 a.m.); Place: 学校 (school).
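In code, such output is commonly represented as (entity, type) pairs. A purely illustrative sketch of what the extraction above might look like:

# hypothetical NER output for the example sentence, as (entity, type) tuples
extracted = [('小明', 'PERSON'), ('早上8点', 'TIME'), ('学校', 'LOCATION')]
for entity, entity_type in extracted:
    print(f'{entity_type}: {entity}')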

This article introduces a few tools for performing NER; later, if the opportunity arises, we will try implementing NER ourselves with HMMs, CRFs, or deep learning.
First, let's look at the entity categories used by NLTK and Stanford NLP, shown in the figure below:

[Figure: NER entity categories in NLTK and Stanford NLP]

In the figure above, LOCATION and GPE overlap. GPE usually denotes geo-political entities such as cities, states, countries, and continents. LOCATION covers those as well, plus natural landmarks such as famous mountains and rivers. FACILITY usually denotes well-known monuments, artifacts, and other man-made structures.
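To make the distinction concrete, here are a few illustrative examples per category (my own examples, not taken from the NLTK documentation):

category_examples = {
    'GPE': ['Zürich', 'Belgium', 'Asia'],          # geo-political entities
    'LOCATION': ['Mount Everest', 'the Yangtze'],  # also includes natural landmarks
    'FACILITY': ['the Eiffel Tower'],              # monuments and man-made structures
}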
Below, two tools for the NER task are introduced: NLTK and Stanford NLP.
First, NLTK. Our sample document (an introduction to FIFA, taken from Wikipedia) is as follows:

FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.

The Python code implementing NER with NLTK is as follows:

import re
import pandas as pd
import nltk

def parse_document(document):
    # validate the input type, then collapse newlines
    if not isinstance(document, str):
        raise ValueError('Document is not a string!')
    document = re.sub('\n', ' ', document).strip()
    # split the document into cleaned-up sentences
    sentences = nltk.sent_tokenize(document)
    sentences = [sentence.strip() for sentence in sentences]
    return sentences

# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium, 
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its 
membership now comprises 211 national associations. Member countries must each also be members of one of 
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America 
and the Caribbean, Oceania, and South America.
"""

# tokenize sentences
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
# tag sentences and use nltk's Named Entity Chunker
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]
# extract all named entities
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # extract only chunks having NE labels
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())  # get NE name
            entity_type = tagged_tree.label()  # get NE category
            named_entities.append((entity_name, entity_type))

# get unique named entities
named_entities = list(set(named_entities))

# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
# display results
print(entity_frame)

The output is as follows:

        Entity Name   Entity Type
0              FIFA  ORGANIZATION
1   Central America  ORGANIZATION
2           Belgium           GPE
3         Caribbean      LOCATION
4              Asia           GPE
5            France           GPE
6           Oceania           GPE
7           Germany           GPE
8     South America           GPE
9           Denmark           GPE
10           Zürich           GPE
11           Africa        PERSON
12           Sweden           GPE
13      Netherlands           GPE
14            Spain           GPE
15      Switzerland           GPE
16            North           GPE
17           Europe           GPE

As we can see, NLTK handles the NER task reasonably well overall: it recognizes FIFA as an ORGANIZATION, and Belgium and Asia as GPEs. There are also some less satisfying results: Central America is tagged as ORGANIZATION when it should be a GPE, and Africa is tagged as PERSON when it, too, should be a GPE.
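One pragmatic workaround for such systematic mistakes is a small manual override table applied after tagging. A minimal sketch (the override entries are illustrative, not exhaustive):

# hypothetical corrections for known misclassifications
overrides = {'Central America': 'GPE', 'Africa': 'GPE'}
named_entities = [(name, overrides.get(name, etype)) for name, etype in named_entities]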

Next, let's try the Stanford NLP toolkit; specifically, the Stanford NER tagger. Before using it, you need Java (typically a JDK) installed and added to the system path, and you need to download the English NER package stanford-ner-2018-10-16.zip (about 172 MB) from https://nlp.stanford.edu/software/CRF-NER.shtml. On my machine, for example, Java lives at C:\Program Files\Java\jdk1.8.0_161\bin\java.exe, and the Stanford NER zip is extracted to E://stanford-ner-2018-10-16, as shown below:

[Figure: contents of the extracted E://stanford-ner-2018-10-16 folder]

The classifiers folder contains the following model files:

[Figure: model files in E://stanford-ner-2018-10-16/classifiers]

They cover the following entity classes:

3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Location, Person, Organization, Money, Percent, Date, Time
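To use a different model, point StanfordNERTagger at the corresponding .ser.gz file. For example, here is a minimal sketch loading the 3-class model instead (file names follow this release's naming; adjust the paths to your own setup):

from nltk.tag import StanfordNERTagger

# 3-class model: LOCATION, PERSON, ORGANIZATION
sn3 = StanfordNERTagger(
    'E://stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
    path_to_jar='E://stanford-ner-2018-10-16/stanford-ner.jar')
print(sn3.tag('FIFA was founded in 1904 .'.split()))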

Stanford NER can be driven from Python; the complete code is as follows:

import re
from nltk.tag import StanfordNERTagger
import os
import pandas as pd
import nltk

def parse_document(document):
    # validate the input type, then collapse newlines
    if not isinstance(document, str):
        raise ValueError('Document is not a string!')
    document = re.sub('\n', ' ', document).strip()
    # split the document into cleaned-up sentences
    sentences = nltk.sent_tokenize(document)
    sentences = [sentence.strip() for sentence in sentences]
    return sentences

# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium, 
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its 
membership now comprises 211 national associations. Member countries must each also be members of one of 
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America 
and the Caribbean, Oceania, and South America.
"""

sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# set java path in environment variables
java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
os.environ['JAVAHOME'] = java_path
# load stanford NER
sn = StanfordNERTagger('E://stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz',
                       path_to_jar='E://stanford-ner-2018-10-16/stanford-ner.jar')

# tag sentences
ne_annotated_sentences = [sn.tag(sent) for sent in tokenized_sentences]
# extract named entities
named_entities = []
for sentence in ne_annotated_sentences:
    temp_entity_name = ''
    temp_named_entity = None
    for term, tag in sentence:
        # accumulate consecutive terms that carry NE tags
        if tag != 'O':
            temp_entity_name = ' '.join([temp_entity_name, term]).strip()  # get NE name
            temp_named_entity = (temp_entity_name, tag)  # get NE and its category
        else:
            if temp_named_entity:
                named_entities.append(temp_named_entity)
                temp_entity_name = ''
                temp_named_entity = None
    # flush an entity that ends the sentence (otherwise it would be dropped)
    if temp_named_entity:
        named_entities.append(temp_named_entity)

# get unique named entities
named_entities = list(set(named_entities))
# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
# display results
print(entity_frame)

The output is as follows:

                Entity Name   Entity Type
0                      1904          DATE
1                   Denmark      LOCATION
2                     Spain      LOCATION
3   North & Central America  ORGANIZATION
4             South America      LOCATION
5                   Belgium      LOCATION
6                    Zürich      LOCATION
7           the Netherlands      LOCATION
8                    France      LOCATION
9                 Caribbean      LOCATION
10                   Sweden      LOCATION
11                  Oceania      LOCATION
12                     Asia      LOCATION
13                     FIFA  ORGANIZATION
14                   Europe      LOCATION
15                   Africa      LOCATION
16              Switzerland      LOCATION
17                  Germany      LOCATION

As we can see, Stanford NER performs better here: it recognizes Africa as a LOCATION and 1904 as a DATE (which NLTK missed), though it still misclassifies North & Central America as an ORGANIZATION.
It is worth noting that Stanford NER does not necessarily beat NLTK's NER in general; the two tools may differ in target domain, training corpora, and algorithms, so choose whichever fits your needs.
That concludes this post; if the opportunity arises, a future post will try implementing NER with HMMs, CRFs, or deep learning.


References:

https://github.com/skydark/nstools/blob/master/zhtools/zh_wiki.py#L8271

https://github.com/skydark/nstools/blob/master/zhtools/langconv.py

https://blog.csdn.net/lotusws/article/details/82934599
