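Using Transformers pretrained models: named entity recognition

Named entity recognition (NER) is the task of classifying each token in a text, for example deciding whether a token belongs to a person name, an organization name, or a location name. CoNLL-2003 is one dataset built for named entity recognition and is well suited to this task.

Using a pipeline

The following example uses a pipeline to perform named entity recognition. The model behind it assigns each token to one of nine label classes: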

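O: not part of a named entity.
B-MISC: beginning of a miscellaneous entity that directly follows another miscellaneous entity.
I-MISC: miscellaneous entity.
B-PER: beginning of a person name that directly follows another person name.
I-PER: person name.
B-ORG: beginning of an organization name that directly follows another organization name.
I-ORG: organization name.
B-LOC: beginning of a location name that directly follows another location name.
I-LOC: location name.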

Code example:

from transformers import pipeline

nlp = pipeline("ner")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window."
print(nlp(sequence))

Output:

[
    {'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
    {'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
    {'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
    {'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
    {'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
    {'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
    {'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
    {'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
    {'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
    {'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
    {'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
    {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]
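
The pipeline output above is still split into word pieces: "Hu" and "##gging" are two separate entries. As a minimal sketch, the word pieces can be glued back into whole words in plain Python. The helper name merge_word_pieces is hypothetical (it is not part of the transformers library), the scores are shortened copies of the values above, and keeping the lowest score per word is only one possible choice.

```python
# Sample entries copied from the pipeline output above (scores shortened).
ner_output = [
    {"word": "Hu", "score": 0.9996, "entity": "I-ORG"},
    {"word": "##gging", "score": 0.9916, "entity": "I-ORG"},
    {"word": "Face", "score": 0.9983, "entity": "I-ORG"},
    {"word": "D", "score": 0.9826, "entity": "I-LOC"},
    {"word": "##UM", "score": 0.9370, "entity": "I-LOC"},
    {"word": "##BO", "score": 0.8987, "entity": "I-LOC"},
]


def merge_word_pieces(entries):
    """Join '##' continuation pieces onto the preceding piece (hypothetical helper)."""
    words = []
    for entry in entries:
        piece = entry["word"]
        if piece.startswith("##") and words:
            # Continuation piece: extend the previous word and keep its lowest score.
            words[-1]["word"] += piece[2:]
            words[-1]["score"] = min(words[-1]["score"], entry["score"])
        else:
            # Start of a new word: copy the entry so the input is left untouched.
            words.append(dict(entry))
    return words


for word in merge_word_pieces(ner_output):
    print(word["word"], word["entity"], word["score"])
```

This sketch only rebuilds whole words. The entries shown above carry no position information, so neighbouring entries of the same type (for example "DUMBO" followed by "Manhattan") cannot be told apart from a single multi-word entity by looking at this list alone.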

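Using a model and a tokenizer

The process is as follows:

Instantiate a pretrained model and its matching tokenizer. This example uses a BERT model.
Define the list of labels the model was trained with.
Define a sequence that contains known named entities, such as "New York City".
Split the words into tokens so that they can be matched to the model's predictions. A small trick helps here: encoding the sequence and then decoding it again yields a string that includes BERT's special tokens (the [CLS] and [SEP] tokens that appear in the output further down).
Encode the sequence into an array of token IDs. Special tokens are added automatically, so there is no need to add them by hand.
Feed the encoded sequence to the model to obtain the predictions. For every token the model returns one score per class, nine in total, which can be read as a distribution over the nine classes. The label with the highest score is normally taken as the final prediction.
Print each token together with its predicted label.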
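
To make the prediction step concrete, here is a minimal sketch in plain Python of how per-token scores become a probability distribution and then a label. The logits are made-up numbers for illustration only, not real model output, and the softmax helper is written by hand so the snippet runs without any dependencies.

```python
import math

label_list = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Made-up logits for two tokens, for illustration only (not real model output).
logits = [
    [0.2, -1.0, -0.5, -0.8, -0.3, -0.9, 6.1, -1.2, 0.4],
    [7.3, -0.6, -0.2, -1.1, 0.1, -0.7, 0.3, -0.9, -0.4],
]


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


for row in logits:
    probs = softmax(row)
    # argmax: index of the most probable class for this token
    best = max(range(len(probs)), key=probs.__getitem__)
    print(label_list[best], round(probs[best], 3))
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw logits gives the same label as taking the argmax of the probabilities.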

Sample code

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Local directory where the downloaded model and vocabulary files are cached
cache_dir = "./transformersModels/ner"

# BERT-large checkpoint fine-tuned on CoNLL-2003 for token classification
model = AutoModelForTokenClassification.from_pretrained(
    "dbmdz/bert-large-cased-finetuned-conll03-english", cache_dir=cache_dir, return_dict=True
)
# The tokenizer is loaded from the bert-base-cased checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", cache_dir=cache_dir)

# Labels in the order of the model's output classes
label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]

# Note the trailing space after "very": without it the two string literals join into "veryclose"
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
           "close to the Manhattan Bridge."

# Bit of a hack: encode, then decode, so that the token list includes the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

# Logits have shape (batch, sequence_length, num_labels); argmax picks one label per token
outputs = model(inputs).logits
predictions = torch.argmax(outputs, dim=2)

for token, prediction in zip(tokens, predictions[0].numpy()):
    print(token, label_list[prediction])

Output:

[CLS] O
Hu I-ORG
##gging I-ORG
Face I-ORG
Inc I-ORG
. O
is O
a O
company O
based O
in O
New I-LOC
York I-LOC
City I-LOC
. O
Its O
headquarters O
are O
in O
D I-LOC
##UM I-LOC
##BO I-LOC
, O
therefore O
very O
close O
to O
the O
Manhattan I-LOC
Bridge I-LOC
. O
[SEP] O

The final predicted entities are obtained by combining these per-token results:

ORG:Hugging Face Inc
LOC:New York City
LOC:DUMBO
LOC:Manhattan Bridge

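Unlike the pipeline, this approach does not drop the tokens labeled "O", and nothing merges the tokens into whole entities automatically, so that step has to be implemented by hand.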
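
As a minimal sketch of that manual step, the function below walks over token/label pairs, skips "O" tokens, glues "##" word pieces back together, and groups consecutive tokens of the same type into one entity. The name group_entities is hypothetical, and the pairs are an abridged copy of the per-token output above (a few "O" tokens are left out). It assumes the labeling scheme described earlier, where a B- label only appears when an entity directly follows another of the same type.

```python
def group_entities(tokens, labels):
    """Collect consecutive non-"O" tokens of the same type into (type, text) pairs."""
    entities = []        # finished (type, text) pairs
    current_type = None  # type of the entity being built, or None
    words = []           # whole words collected for that entity

    for token, label in zip(tokens, labels):
        entity_type = None if label == "O" else label.split("-", 1)[1]
        is_piece = token.startswith("##")
        # A new span starts on a type change, or on an explicit B- tag at a word start.
        starts_new = entity_type != current_type or (label.startswith("B-") and not is_piece)
        if starts_new:
            if current_type is not None:
                entities.append((current_type, " ".join(words)))
            current_type, words = entity_type, []
        if entity_type is None:
            continue  # "O" tokens are dropped
        if is_piece and words:
            words[-1] += token[2:]  # glue the word piece onto the previous word
        else:
            words.append(token[2:] if is_piece else token)

    if current_type is not None:
        entities.append((current_type, " ".join(words)))
    return entities


# Token/label pairs abridged from the output above (a few "O" tokens omitted).
pairs = [
    ("[CLS]", "O"), ("Hu", "I-ORG"), ("##gging", "I-ORG"), ("Face", "I-ORG"), ("Inc", "I-ORG"),
    (".", "O"), ("is", "O"), ("a", "O"), ("company", "O"), ("based", "O"), ("in", "O"),
    ("New", "I-LOC"), ("York", "I-LOC"), ("City", "I-LOC"), (".", "O"),
    ("Its", "O"), ("headquarters", "O"), ("are", "O"), ("in", "O"),
    ("D", "I-LOC"), ("##UM", "I-LOC"), ("##BO", "I-LOC"), (",", "O"),
    ("to", "O"), ("the", "O"), ("Manhattan", "I-LOC"), ("Bridge", "I-LOC"),
    (".", "O"), ("[SEP]", "O"),
]

tokens, labels = zip(*pairs)
for entity_type, text in group_entities(tokens, labels):
    print(f"{entity_type}:{text}")
```

Grouping works here because the full token sequence is available: the "O" tokens between "DUMBO" and "Manhattan" are what keep the two locations apart.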
