Conversational emotion recognition is an important natural language processing (NLP) task with applications in chatbots, customer service, and social media analysis. In this blog post, we explore how to use PaddlePaddle's ERNIE model for dialogue emotion recognition.
1. What is ERNIE?
ERNIE (Enhanced Representation through Knowledge Integration) is a knowledge-enhanced semantic pre-training model developed by Baidu, which has demonstrated excellent performance in many Chinese NLP tasks.
Model | Chit-chat | Customer service | |
---|---|---|---|
BOW | 90.2% | 87.6% | 74.2% |
LSTM | 91.4% | 90.1% | 73.8% |
Bi-LSTM | 91.2% | 89.9% | 73.6% |
CNN | 90.8% | 90.7% | 76.3% |
TextCNN | 91.1% | 91.0% | 76.8% |
BERT | 93.6% | 92.3% | 78.6% |
ERNIE | 94.4% | 94.0% | 80.6% |
2. Dataset introduction
The input of the dialogue emotion recognition task is a piece of user text, and the output is the detected emotion category, including negative, positive, and neutral. This is a classic short text three-category task.
After the dataset is decompressed, a data directory is generated. It contains the training set (train.tsv), the development set (dev.tsv), the test set (test.tsv), the data to be predicted (infer.tsv) and the corresponding dictionary (vocab.txt).
Examples of the data used for training, prediction, and evaluation are shown below. Each line consists of two columns separated by a tab ('\t'). The first column is the emotion category (0 means negative; 1 means neutral; 2 means positive), and the second column is the Chinese text, already segmented into words separated by spaces:
label text_a
0 谁 骂人 了 ? 我 从来 不 骂人 , 我 骂 的 都 不是 人 , 你 是 人 吗 ?
1 我 有事 等会儿 就 回来 和 你 聊
2 我 见到 你 很高兴 谢谢 你 帮 我
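To make the file layout concrete, here is a minimal sketch of reading rows in this format with plain Python. The helper `read_examples` and the `LABEL_NAMES` mapping are illustrative names that are not part of the notebook, and the in-memory sample stands in for train.tsv.

```python
import io

# Label ids as described above (the mapping name is hypothetical).
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}


def read_examples(fd, delimiter="\t"):
    """Yield (label_id, tokens) pairs from a TSV stream that has a header row."""
    header = next(fd).rstrip("\n").split(delimiter)
    label_col = header.index("label")
    text_col = header.index("text_a")
    for line in fd:
        fields = line.rstrip("\n").split(delimiter)
        # The text column is already word-segmented with single spaces.
        yield int(fields[label_col]), fields[text_col].split(" ")


# In-memory stand-in for train.tsv, using two rows from the sample above.
sample = io.StringIO(
    "label\ttext_a\n"
    "1\t我 有事 等会儿 就 回来 和 你 聊\n"
    "2\t我 见到 你 很高兴 谢谢 你 帮 我\n"
)
examples = list(read_examples(sample))
for label_id, tokens in examples:
    print(LABEL_NAMES[label_id], len(tokens))
```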
In [1]
# Unzip the dataset
!cd /home/aistudio/data/data9740 && unzip -qo 对话情绪识别.zip
In [2]
# Imports
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import io
import json
import os
import six
import sys
import time
import random
import string
import logging
import argparse
import collections
import unicodedata
from functools import partial
from collections import namedtuple
import multiprocessing
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
import numpy as np
In [3]
# Shared logger configuration
logger = None


def init_log_config():
    """
    Initialize the logging configuration.
    :return:
    """
    global logger
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    log_path = os.path.join(os.getcwd(), 'logs')
    if not os.path.exists(log_path):
        os.makedirs(log_path)
    log_name = os.path.join(log_path, 'train.log')
    sh = logging.StreamHandler()
    fh = logging.FileHandler(log_name, mode='w')
    fh.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s")
    fh.setFormatter(formatter)
    sh.setFormatter(formatter)
    # Drop any handlers left over from a previous run of this cell
    logger.handlers = []
    logger.addHandler(sh)
    logger.addHandler(fh)
In [4]
# Utilities
def print_arguments(args):
    """
    Print the configuration arguments.
    """
    logger.info('----------- Configuration Arguments -----------')
    for key in args.keys():
        logger.info('%s: %s' % (key, args[key]))
    logger.info('------------------------------------------------')


def init_checkpoint(exe, init_checkpoint_path, main_program):
    """
    Load a saved checkpoint.
    """
    assert os.path.exists(
        init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path

    def existed_persitables(var):
        """
        Return True if var is persistable and has a saved file in the checkpoint.
        """
        if not fluid.io.is_persistable(var):
            return False
        return os.path.exists(os.path.join(init_checkpoint_path, var.name))

    fluid.io.load_vars(
        exe,
        init_checkpoint_path,
        main_program=main_program,
        predicate=existed_persitables)
    logger.info("Load model from {}".format(init_checkpoint_path))


def csv_reader(fd, delimiter='\t'):
    """
    Read delimiter-separated lines from an open file object.
    """
    def gen():
        for i in fd:
            slots = i.rstrip('\n').split(delimiter)
            if len(slots) == 1:
                # Single-column line: the list is wrapped in a 1-tuple
                yield slots,
            else:
                yield slots
    return gen()
3. Construction of network structure
ERNIE is a general-purpose text semantic representation model developed by Baidu and pre-trained on massive data and prior knowledge. Here we fine-tune it on the dialogue emotion classification dataset.
Released in March 2019, ERNIE learns real-world semantic knowledge by modeling words, entities, and entity relationships in massive data. Whereas BERT learns from the raw language signal, ERNIE directly models units of prior semantic knowledge, which strengthens the model's semantic representation ability.
In July of the same year, Baidu released ERNIE 2.0, a semantic understanding pre-training framework based on continual learning that uses multi-task learning to build pre-training tasks incrementally. In ERNIE 2.0, newly constructed pre-training task types can be added to the training framework seamlessly, so the model keeps learning semantic understanding. Through new semantic tasks such as entity prediction, sentence causality judgment, and article sentence structure reconstruction, the ERNIE 2.0 pre-training model acquires natural language information along multiple dimensions such as morphology, syntax, and semantics from the training data, which greatly enhances its general semantic representation capability.
References:
- ERNIE: Enhanced Representation through Knowledge Integration
- ERNIE 2.0: A Continual Pre-training Framework for Language Understanding
- ERNIE Preview: Baidu Knowledge Enhanced Semantic Representation Model ERNIE
- ERNIE 2.0 GitHub
3.1 ERNIE model definition
The ErnieModel class defines the ERNIE encoder network structure:
- Inputs: src_ids, position_ids, sentence_ids and input_mask
- Outputs: sequence_output and pooled_output
In [5]
class ErnieModel(object):
    """ERNIE model definition."""

    def __init__(self,
                 src_ids,
                 position_ids,
                 sentence_ids,
                 input_mask,
                 config,
                 weight_sharing=True,
                 use_fp16=False):
        # ERNIE hyperparameters
        self._emb_size = config['hidden_size']
        self._n_layer = config['num_hidden_layers']
        self._n_head = config['num_attention_heads']
        self._voc_size = config['vocab_size']
        self._max_position_seq_len = config['max_position_embeddings']
        self._sent_types = config['type_vocab_size']
        self._hidden_act = config['hidden_act']
        self._prepostprocess_dropout = config['hidden_dropout_prob']
        self._attention_dropout = config['attention_probs_dropout_prob']
        self._weight_sharing = weight_sharing
        self._word_emb_name = "word_embedding"
        self._pos_emb_name = "pos_embedding"
        self._sent_emb_name = "sent_embedding"
        self._dtype = "float16" if use_fp16 else "float32"
        # Initialize all weights by truncated normal initializer, and all biases
        # will be initialized by constant zero by default.
        self._param_initializer = fluid.initializer.TruncatedNormal(
            scale=config['initializer_range'])
        self._build_model(src_ids, position_ids, sentence_ids, input_mask)

    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
        # padding id in vocabulary must be set to 0
        emb_out = fluid.layers.embedding(
            input=src_ids,
            size=[self._voc_size, self._emb_size],
            dtype=self._dtype,
            param_attr=fluid.ParamAttr(
                name=self._word_emb_name, initializer=self._param_initializer),
            is_sparse=False)
        position_emb_out = fluid.layers.embedding(
            input=position_ids,
            size=[self._max_position_seq_len, self._emb_size],
            dtype=self._dtype,
            param_attr=fluid.ParamAttr(
                name=self._pos_emb_name, initializer=self._param_initializer))
        sent_emb_out = fluid.layers.embedding(
            sentence_ids,
            size=[self._sent_types, self._emb_size],
            dtype=self._dtype,
            param_attr=fluid.ParamAttr(
                name=self._sent_emb_name, initializer=self._param_initializer))
        # Sum the word, position and sentence-type embeddings
        emb_out = emb_out + position_emb_out
        emb_out = emb_out + sent_emb_out
        emb_out = pre_process_layer(
            emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
        if self._dtype == "float16":
            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
        # Turn the 0/1 padding mask into an additive attention bias:
        # (mask * mask^T - 1) * 10000 is 0 for real tokens and -10000 for padding
        self_attn_mask = fluid.layers.matmul(
            x=input_mask, y=input_mask, transpose_y=True)
        self_attn_mask = fluid.layers.scale(
            x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
        n_head_self_attn_mask = fluid.layers.stack(
            x=[self_attn_mask] * self._n_head, axis=1)
        n_head_self_attn_mask.stop_gradient = True
        self._enc_out = encoder(
            enc_input=emb_out,
            attn_bias=n_head_self_attn_mask,
            n_layer=self._n_layer,
            n_head=self._n_head,
            d_key=self._emb_size // self._n_head,
            d_value=self._emb_size // self._n_head,
            d_model=self._emb_size,
            d_inner_hid=self._emb_size * 4,
            prepostprocess_dropout=self._prepostprocess_dropout,
            attention_dropout=self._attention_dropout,
            relu_dropout=0,
            hidden_act=self._hidden_act,
            preprocess_cmd="",
            postprocess_cmd="dan",
            param_initializer=self._param_initializer,
            name='encoder')

    def get_sequence_output(self):
        """Get embedding of each token for sequence labeling"""
        return self._enc_out

    def get_pooled_output(self):
        """Get the first feature of each sequence for classification"""
        next_sent_feat = fluid.layers.slice(
            input=self._enc_out, axes=[1], starts=[0], ends=[1])
        next_sent_feat = fluid.layers.fc(
            input=next_sent_feat,
            size=self._emb_size,
            act="tanh",
            param_attr=fluid.ParamAttr(
                name="pooled_fc.w_0", initializer=self._param_initializer),
            bias_attr="pooled_fc.b_0")
        return next_sent_feat
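The way `_build_model` turns `input_mask` into an attention bias can be easier to follow with concrete numbers. The sketch below reproduces the arithmetic of the `matmul` and `scale` calls for a single sequence in plain Python. The function name `attention_bias` is hypothetical, and the per-head stacking step is left out.

```python
def attention_bias(input_mask, scale=10000.0):
    """Build the additive attention bias for one sequence.

    input_mask holds 1.0 for real tokens and 0.0 for padding. The outer
    product mask[i] * mask[j] is 1.0 only when both positions are real, and
    (product - 1.0) * scale maps that to 0.0 and everything else to -scale.
    """
    n = len(input_mask)
    return [[(input_mask[i] * input_mask[j] - 1.0) * scale for j in range(n)]
            for i in range(n)]


# Three real tokens followed by one padding position.
mask = [1.0, 1.0, 1.0, 0.0]
bias = attention_bias(mask)
for row in bias:
    print(row)
```

Adding a large negative bias before the softmax pushes the attention weight of padded positions to effectively zero.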
3.2 Basic network structure definition
The following 4 cells define the basic network structures used in ErnieModel:
- multi_head_attention
- positionwise_feed_forward
- pre_post_process_layer: adds residual connection, layer normalization and dropout; used before and after multi_head_attention and positionwise_feed_forward
- encoder_layer: calls the three structures above to build one encoder layer
- encoder: stacks encoder_layer to build the complete encoder
For an introduction to multi_head_attention and positionwise_feed_forward, please refer to: The Annotated Transformer
In [6]
def multi_head_attention(queries, keys, values, attn_bias, d_key, d_value, d_model, n_head=1, dropout_rate=0.,
                         cache=None, param_initializer=None, name='multi_head_att'):
    """
    Multi-Head Attention. Note that attn_bias is added to the logit before
    computing softmax activation to mask certain selected positions so that
    they will not be considered in attention weights.
    """
    keys = queries if keys is None else keys
    values = keys if values is None else values
    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
        raise ValueError(
            "Inputs: queries, keys and values should all be 3-D tensors.")

    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
        """
        Add linear projection to queries, keys, and values.
        """
        q = layers.fc(input=queries,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_query_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_query_fc.b_0')
        k = layers.fc(input=keys,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_key_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_key_fc.b_0')
        v = layers.fc(input=values,
                      size=d_value * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_value_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_value_fc.b_0')
        return q, k, v

    def __split_heads(x, n_head):
        """
        Reshape the last dimension of input tensor x so that it becomes two
        dimensions and then transpose. Specifically, input a tensor with shape
        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
        with shape [bs, n_head, max_sequence_length, hidden_dim].
        """
        hidden_size = x.shape[-1]
        # The value 0 in shape attr means copying the corresponding dimension
        # size of the input as the output dimension size.
        reshaped = layers.reshape(
            x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
        # permute the dimensions into:
        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])

    def __combine_heads(x):
        """
        Transpose and then reshape the last two dimensions of input tensor x
        so that it becomes one dimension, which is reverse to __split_heads.
        """
        if len(x.shape) == 3:
            return x
        if len(x.shape) != 4:
            raise ValueError("Input(x) should be a 4-D Tensor.")
        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
        # The value 0 in shape attr means copying the corresponding dimension
        # size of the input as the output dimension size.
        return layers.reshape(
            x=trans_x,
            shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
            inplace=True)

    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
        """
        Scaled Dot-Product Attention
        """
        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
        if attn_bias is not None:
            product += attn_bias
        weights = layers.softmax(product)
        if dropout_rate:
            weights = layers.dropout(
                weights,
                dropout_prob=dropout_rate,
                dropout_implementation="upscale_in_train",
                is_test=False)
        out = layers.matmul(weights, v)
        return out

    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
    if cache is not None:  # use cache and concat time steps
        # Since the inplace reshape in __split_heads changes the shape of k and
        # v, which is the cache input for next time step, reshape the cache
        # input from the previous time step first.
        k = cache["k"] = layers.concat(
            [layers.reshape(
                cache["k"], shape=[0, 0, d_model]), k], axis=1)
        v = cache["v"] = layers.concat(
            [layers.reshape(
                cache["v"], shape=[0, 0, d_model]), v], axis=1)
    q = __split_heads(q, n_head)
    k = __split_heads(k, n_head)
    v = __split_heads(v, n_head)
    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
                                                  dropout_rate)
    out = __combine_heads(ctx_multiheads)
    # Project back to the model size.
    proj_out = layers.fc(input=out,
                         size=d_model,
                         num_flatten_dims=2,
                         param_attr=fluid.ParamAttr(
                             name=name + '_output_fc.w_0',
                             initializer=param_initializer),
                         bias_attr=name + '_output_fc.b_0')
    return proj_out
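The core of the cell above is the inner scaled dot-product attention. As a rough illustration of that computation, here is a single-head version in plain Python on nested lists. It is a sketch only: it leaves out the linear projections, head splitting, dropout and caching, and the toy inputs are made up for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v, attn_bias=None):
    """Single-head attention; q, k, v are lists of equal-length vectors."""
    d_key = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # Dot product of the query with every key, scaled by d_key ** -0.5.
        scores = [sum(a * b for a, b in zip(q_i, k_j)) * d_key ** -0.5
                  for k_j in k]
        if attn_bias is not None:
            scores = [s + b for s, b in zip(scores, attn_bias[i])]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
# Mask out the third position (treated as padding) for every query.
bias = [[0.0, 0.0, -10000.0] for _ in range(3)]
out = scaled_dot_product_attention(q, k, v, bias)
print(out)
```

Because the third key receives a bias of -10000, its value vector `[5.0, 5.0]` does not contribute to any output row.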
In [7]
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
    """
    Position-wise Feed-Forward Networks.
    This module consists of two linear transformations with an activation
    (given by hidden_act) in between, which is applied to each position
    separately and identically.
    """
    hidden = layers.fc(input=x, size=d_inner_hid, num_flatten_dims=2, act=hidden_act,
                       param_attr=fluid.ParamAttr(
                           name=name + '_fc_0.w_0',
                           initializer=param_initializer),
                       bias_attr=name + '_fc_0.b_0')
    if dropout_rate:
        hidden = layers.dropout(hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
    out = layers.fc(input=hidden, size=d_hid, num_flatten_dims=2,
                    param_attr=fluid.ParamAttr(
                        name=name + '_fc_1.w_0', initializer=param_initializer),
                    bias_attr=name + '_fc_1.b_0')
    return out
In [8]
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
                           name=''):
    """
    Add residual connection, layer normalization and dropout to the out tensor
    optionally according to the value of process_cmd.
    This will be used before or after multi-head attention and position-wise
    feed-forward networks.
    """
    for cmd in process_cmd:
        if cmd == "a":  # add residual connection
            out = out + prev_out if prev_out is not None else out
        elif cmd == "n":  # add layer normalization
            out_dtype = out.dtype
            # Layer normalization is computed in float32, then cast back
            if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float32")
            out = layers.layer_norm(
                out,
                begin_norm_axis=len(out.shape) - 1,
                param_attr=fluid.ParamAttr(
                    name=name + '_layer_norm_scale',
                    initializer=fluid.initializer.Constant(1.)),
                bias_attr=fluid.ParamAttr(
                    name=name + '_layer_norm_bias',
                    initializer=fluid.initializer.Constant(0.)))
            if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float16")
        elif cmd == "d":  # add dropout
            if dropout_rate:
                out = layers.dropout(
                    out,
                    dropout_prob=dropout_rate,
                    dropout_implementation="upscale_in_train",
                    is_test=False)
    return out


# Pre-processing has no residual input, so prev_out is fixed to None
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
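The `process_cmd` string works like a tiny instruction list: each character triggers one step, in order. The following plain-Python sketch shows this on a single vector. The names `layer_norm` and `apply_process_cmd` are hypothetical, dropout is treated as a no-op, and the learned scale and bias of layer normalization are assumed to be at their initial values of 1 and 0.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def apply_process_cmd(prev_out, out, process_cmd):
    """Apply the steps named in process_cmd, one character at a time."""
    for cmd in process_cmd:
        if cmd == "a":  # residual connection
            if prev_out is not None:
                out = [o + p for o, p in zip(out, prev_out)]
        elif cmd == "n":  # layer normalization
            out = layer_norm(out)
        elif cmd == "d":  # dropout, skipped in this sketch
            pass
    return out


residual = [1.0, 1.0, 1.0, 1.0]
sublayer_out = [1.0, 2.0, 3.0, 4.0]
# "dan": dropout, then add the residual, then normalize.
result = apply_process_cmd(residual, sublayer_out, "dan")
print(result)
```

With `postprocess_cmd="dan"`, as passed by ErnieModel, normalization is applied after the residual is added, while an empty command string leaves the tensor untouched.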
In [9]
def encoder_layer(enc_input, attn_bias, n_head, d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout,
                  attention_dropout, relu_dropout, hidden_act, preprocess_cmd="n", postprocess_cmd="da",
                  param_initializer=None, name=''):
    """The encoder layers that can be stacked to form a deep encoder.
    This module consists of a multi-head (self) attention followed by
    position-wise feed-forward networks, and both components are accompanied
    by the post_process_layer to add residual connection, layer normalization
    and dropout.
    """
    attn_output = multi_head_attention(
        pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
        None, None, attn_bias, d_key, d_value, d_model, n_head, attention_dropout,
        param_initializer=param_initializer, name=name + '_multi_head_att')
    attn_output = post_process_layer(enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
    ffd_output = positionwise_feed_forward(
        pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
        d_inner_hid, d_model, relu_dropout, hidden_act, param_initializer=param_initializer,
        name=name + '_ffn')
    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')


def encoder(enc_input, attn_bias, n_layer, n_head, d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout,
            attention_dropout, relu_dropout, hidden_act, preprocess_cmd="n", postprocess_cmd="da",
            param_initializer=None, name=''):
    """
    The encoder is composed of a stack of identical layers returned by calling
    encoder_layer.
    """
    for i in range(n_layer):
        enc_output = encoder_layer(enc_input, attn_bias, n_head, d_key, d_value, d_model, d_inner_hid,
                                   prepostprocess_dropout, attention_dropout, relu_dropout, hidden_act, preprocess_cmd,
                                   postprocess_cmd, param_initializer=param_initializer, name=name + '_layer_' + str(i))
        # The output of one layer is the input of the next
        enc_input = enc_output
    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
    return enc_output
3.3 Encoder and classifier definitions
The following cell defines how the encoder and the classifier are organized:
- ernie_encoder: builds ErnieModel and returns its sentence and token embeddings
- create_ernie_model: defines the classification network, which takes the embeddings as input and classifies them with a fully connected layer followed by softmax
In [10]
def ernie_encoder(ernie_inputs, ernie_config):
    """return sentence embedding and token embeddings"""
    ernie = ErnieModel(
        src_ids=ernie_inputs["src_ids"],
        position_ids=ernie_inputs["pos_ids"],
        sentence_ids=ernie_inputs["sent_ids"],
        input_mask=ernie_inputs["input_mask"],
        config=ernie_config)
    enc_out = ernie.get_sequence_output()
    unpad_enc_out = fluid.layers.sequence_unpad(
        enc_out, length=ernie_inputs["seq_lens"])
    cls_feats = ernie.get_pooled_output()
    embeddings = {
        "sentence_embeddings": cls_feats,
        "token_embeddings": unpad_enc_out,
    }
    for k, v in embeddings.items():
        v.persistable = True
    return embeddings


def create_ernie_model(args,
                       embeddings,
                       labels,
                       is_prediction=False):
    """
    Create Model for sentiment classification based on ERNIE encoder
    """
    sentence_embeddings = embeddings["sentence_embeddings"]
    token_embeddings = embeddings["token_embeddings"]
    cls_feats = fluid.layers.dropout(
        x=sentence_embeddings,
        dropout_prob=0.1,
        dropout_implementation="upscale_in_train")
    logits = fluid.layers.fc(
        input=cls_feats,
        size=args['num_labels'],
        param_attr=fluid.ParamAttr(
            name="cls_out_w",
            initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
        bias_attr=fluid.ParamAttr(
            name="cls_out_b", initializer=fluid.initializer.Constant(0.)))
    ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
        logits=logits, label=labels, return_softmax=True)
    if is_prediction:
        return probs
    loss = fluid.layers.mean(x=ce_loss)
    num_seqs = fluid.layers.create_tensor(dtype='int64')
    accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)
    return loss, accuracy, num_seqs
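To show what the classification head computes once the logits are available, here is a small plain-Python sketch of softmax with cross-entropy, the mean loss and the accuracy for a three-class batch. The logits and labels are made-up numbers, and the function is a stand-in written for this example rather than the Paddle operator itself.

```python
import math


def softmax_with_cross_entropy(logits, label):
    """Return (loss, probs) for one example with an integer class label."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[label]), probs


# Made-up logits for two examples over (negative, neutral, positive).
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
batch_labels = [0, 2]

losses, predictions = [], []
for logits, label in zip(batch_logits, batch_labels):
    loss, probs = softmax_with_cross_entropy(logits, label)
    losses.append(loss)
    predictions.append(probs.index(max(probs)))

mean_loss = sum(losses) / len(losses)
accuracy = sum(p == y for p, y in zip(predictions, batch_labels)) / len(batch_labels)
print(mean_loss, accuracy)
```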
3.4 Word segmentation code
The following three cells define the word segmentation (tokenization) classes:
- FullTokenizer: end-to-end tokenization used by the data reading code; it calls BasicTokenizer and then WordpieceTokenizer
- BasicTokenizer: basic tokenization, including punctuation splitting, lowercase conversion, etc.
- WordpieceTokenizer: splits each token into WordPiece sub-tokens
In [11]
class FullTokenizer(object):
    """Runs end-to-end tokenization."""

    def __init__(self, vocab_file, do_lower_case=True):
        self.vocab = load_vocab(vocab_file)
        self.inv_vocab = {v: k for k, v in self.vocab.items()}
        self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

    def tokenize(self, text):
        split_tokens = []
        for token in self.basic_tokenizer.tokenize(text):
            for sub_token in self.wordpiece_tokenizer.tokenize(token):
                split_tokens.append(sub_token)
        return split_tokens

    def convert_tokens_to_ids(self, tokens):
        return convert_by_vocab(self.vocab, tokens)
In [12]
class BasicTokenizer(object):
    """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""

    def __init__(self, do_lower_case=True):
        """Constructs a BasicTokenizer.
        Args:
            do_lower_case: Whether to lower case the input.
        """
        self.do_lower_case = do_lower_case

    def tokenize(self, text):
        """Tokenizes a piece of text."""
        text = convert_to_unicode(text)
        text = self._clean_text(text)
        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
        # matter since the English models were not trained on any Chinese data
        # and generally don't have any Chinese data in them (there are Chinese
        # characters in the vocabulary because Wikipedia does have some Chinese
        # words in the English Wikipedia.).
        text = self._tokenize_chinese_chars(text)
        orig_tokens = whitespace_tokenize(text)
        split_tokens = []
        for token in orig_tokens:
            if self.do_lower_case:
                token = token.lower()
                token = self._run_strip_accents(token)
            split_tokens.extend(self._run_split_on_punc(token))
        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens

    def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)

    def _run_split_on_punc(self, text):
        """Splits punctuation on a piece of text."""
        chars = list(text)
        i = 0
        start_new_word = True
        output = []
        while i < len(chars):
            char = chars[i]
            if _is_punctuation(char):
                output.append([char])
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])
                start_new_word = False
                output[-1].append(char)
            i += 1
        return ["".join(x) for x in output]

    def _tokenize_chinese_chars(self, text):
        """Adds whitespace around any CJK character."""
        output = []
        for char in text:
            cp = ord(char)
            if self._is_chinese_char(cp):
                output.append(" ")
                output.append(char)
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

    def _is_chinese_char(self, cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # This defines a "chinese character" as anything in the CJK Unicode block:
        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        #
        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
        # despite its name. The modern Korean Hangul alphabet is a different block,
        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
        # space-separated words, so they are not treated specially and handled
        # like all of the other languages.
        if ((cp >= 0x4E00 and cp <= 0x9FFF) or
                (cp >= 0x3400 and cp <= 0x4DBF) or
                (cp >= 0x20000 and cp <= 0x2A6DF) or
                (cp >= 0x2A700 and cp <= 0x2B73F) or
                (cp >= 0x2B740 and cp <= 0x2B81F) or
                (cp >= 0x2B820 and cp <= 0x2CEAF) or
                (cp >= 0xF900 and cp <= 0xFAFF) or
                (cp >= 0x2F800 and cp <= 0x2FA1F)):
            return True
        return False

    def _clean_text(self, text):
        """Performs invalid character removal and whitespace cleanup on text."""
        output = []
        for char in text:
            cp = ord(char)
            if cp == 0 or cp == 0xfffd or _is_control(char):
                continue
            if _is_whitespace(char):
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)
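One consequence of `_tokenize_chinese_chars` is worth seeing on a concrete string: every CJK character becomes its own token, even when it was part of a longer word. The sketch below is a simplified stand-alone version that checks only two of the code point ranges listed above. The function names are hypothetical.

```python
def is_cjk(cp):
    """Simplified check covering only two of the CJK ranges used above."""
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def space_out_cjk(text):
    """Surround every CJK character with spaces, leaving other text as-is."""
    return "".join(" %s " % ch if is_cjk(ord(ch)) else ch for ch in text)


# Latin text stays together, while each Chinese character is split off.
tokens = space_out_cjk("ERNIE很高兴").split()
print(tokens)
```

This suggests that the word-level segmentation in the dataset (for example 很高兴 as one word) is effectively reduced to single characters before WordPiece is applied.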
In [13]
class WordpieceTokenizer(object):
    """Runs WordPiece tokenization."""

    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.
        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary.
        For example:
            input = "unaffable"
            output = ["un", "##aff", "##able"]
        Args:
            text: A single token or whitespace separated tokens. This should have
                already been passed through `BasicTokenizer`.
        Returns:
            A list of wordpiece tokens.
        """
        text = convert_to_unicode(text)
        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue
            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                # Try the longest remaining substring first, then shrink it
                end = len(chars)
                cur_substr = None
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end
            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens
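The greedy longest-match-first procedure can be tried in isolation with a toy vocabulary. The sketch below condenses the loop above into one function for a single token. The name `wordpiece` and the vocabulary contents are made up for illustration; the real vocabulary comes from vocab.txt.

```python
def wordpiece(token, vocab, unk_token="[UNK]"):
    """Split one token into word pieces by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No prefix matched: the whole token maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, hypothetical.
toy_vocab = {"un", "##aff", "##able", "play", "##ing"}
for word in ["unaffable", "playing", "xyz"]:
    print(word, wordpiece(word, toy_vocab))
```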
3.5 Word segmentation auxiliary code
The following cell defines helper functions used during word segmentation, including convert_to_unicode, whitespace_tokenize, etc.
In [14]
def convert_to_unicode(text):
    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
    if six.PY3:
        if isinstance(text, str):
            return text
        elif isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    elif six.PY2:
        if isinstance(text, str):
            return text.decode("utf-8", "ignore")
        elif isinstance(text, unicode):
            return text
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    else:
        raise ValueError("Not running on Python 2 or Python 3?")


def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    with io.open(vocab_file, encoding="utf8") as fin:
        for num, line in enumerate(fin):
            items = convert_to_unicode(line.strip()).split("\t")
            if len(items) > 2:
                break
            token = items[0]
            # Use the explicit index column if present, else the line number
            index = items[1] if len(items) == 2 else num
            token = token.strip()
            vocab[token] = int(index)
    return vocab


def convert_by_vocab(vocab, items):
    """Converts a sequence of [tokens|ids] using the vocab."""
    output = []
    for item in items:
        output.append(vocab[item])
    return output


def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens


def _is_whitespace(char):
    """Checks whether `char` is a whitespace character."""
    # \t, \n, and \r are technically control characters but we treat them
    # as whitespace since they are generally considered as such.
    if char == " " or char == "\t" or char == "\n" or char == "\r":
        return True
    cat = unicodedata.category(char)
    if cat == "Zs":
        return True
    return False


def _is_control(char):
    """Checks whether `char` is a control character."""
    # These are technically control characters but we count them as whitespace
    # characters.
    if char == "\t" or char == "\n" or char == "\r":
        return False
    cat = unicodedata.category(char)
    if cat.startswith("C"):
        return True
    return False


def _is_punctuation(char):
    """Checks whether `char` is a punctuation character."""
    cp = ord(char)
    # We treat all non-letter/number ASCII as punctuation.
    # Characters such as "^", "$", and "`" are not in the Unicode
    # Punctuation class but we treat them as punctuation anyways, for
    # consistency.
    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
            (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
        return True
    cat = unicodedata.category(char)
    if cat.startswith("P"):
        return True
    return False
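The three character-class helpers rely on Unicode general categories plus a few ASCII special cases. The following stand-alone sketch restates them compactly and probes a few characters that show why the special cases exist. The function names without the leading underscore are local to this example.

```python
import unicodedata


def is_whitespace(ch):
    """Tab, newline and carriage return count as whitespace, plus category Zs."""
    return ch in (" ", "\t", "\n", "\r") or unicodedata.category(ch) == "Zs"


def is_control(ch):
    """Any category starting with "C", except the three whitespace controls."""
    if ch in ("\t", "\n", "\r"):
        return False
    return unicodedata.category(ch).startswith("C")


def is_punctuation(ch):
    """ASCII symbols are always punctuation; otherwise use category P*."""
    cp = ord(ch)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


# "$" is a currency symbol (Sc), so only the ASCII rule makes it punctuation.
print(unicodedata.category("$"), is_punctuation("$"))
# The fullwidth comma used in Chinese text is in a punctuation category.
print(unicodedata.category("\uff0c"), is_punctuation("\uff0c"))
# The ideographic space U+3000 is treated as whitespace through category Zs.
print(unicodedata.category("\u3000"), is_whitespace("\u3000"))
```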
3.6 Data reading and preprocessing code
The following 4 cells define the data readers and preprocessing code:
- BaseReader: data reader base class
- ClassifyReader: data reader for classification models; overrides the _read_tsv and _pad_batch_records methods
- pad_batch_data: data preprocessing; pads the data and generates the position ids and input mask
- ernie_pyreader: generates the pyreader for training, validation and prediction
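As a rough picture of what the padding step produces, here is a minimal plain-Python sketch that pads a batch of token id lists and derives position ids, an input mask and sequence lengths. It is not the notebook's pad_batch_data: the name `pad_batch`, the return layout and the choice to pad position ids with `pad_id` are assumptions made for this example, and the token ids are invented.

```python
def pad_batch(token_id_lists, pad_id=0):
    """Pad every sequence to the longest one in the batch.

    Returns (src_ids, pos_ids, input_mask, seq_lens) as nested Python lists.
    """
    max_len = max(len(ids) for ids in token_id_lists)
    src_ids, pos_ids, input_mask, seq_lens = [], [], [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        src_ids.append(ids + [pad_id] * n_pad)
        # Positions count up from 0 for real tokens (padding value assumed).
        pos_ids.append(list(range(len(ids))) + [pad_id] * n_pad)
        # 1.0 marks a real token, 0.0 marks padding.
        input_mask.append([1.0] * len(ids) + [0.0] * n_pad)
        seq_lens.append(len(ids))
    return src_ids, pos_ids, input_mask, seq_lens


# Invented ids: 1 stands for [CLS] and 2 for [SEP] in this toy batch.
batch = [[1, 5, 6, 2], [1, 7, 2]]
src_ids, pos_ids, input_mask, seq_lens = pad_batch(batch)
print(src_ids)
print(input_mask)
```

The input mask produced here is the kind of 0/1 tensor that ErnieModel receives as `input_mask`.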
In [15]
class BaseReader(object):
"""BaseReader for classify and sequence labeling task"""
def __init__(self,
vocab_path,
label_map_config=None,
max_seq_len=512,
do_lower_case=True,
in_tokens=False,
random_seed=None):
self.max_seq_len = max_seq_len
self.tokenizer = FullTokenizer(
vocab_file=vocab_path, do_lower_case=do_lower_case)
self.vocab = self.tokenizer.vocab
self.pad_id = self.vocab["[PAD]"]
self.cls_id = self.vocab["[CLS]"]
self.sep_id = self.vocab["[SEP]"]
self.in_tokens = in_tokens
np.random.seed(random_seed)
self.current_example = 0
self.current_epoch = 0
self.num_examples = 0
if label_map_config:
with open(label_map_config) as f:
self.label_map = json.load(f)
else:
self.label_map = None
def _read_tsv(self, input_file, quotechar=None):
"""Reads a tab separated value file."""
with io.open(input_file, "r", encoding="utf8") as f:
reader = csv_reader(f, delimiter="\t")
headers = next(reader)
Example = namedtuple('Example', headers)
examples = []
for line in reader:
example = Example(*line)
examples.append(example)
return examples
def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
    def _convert_example_to_record(self, example, max_seq_length, tokenizer):
        """Converts a single `Example` into a single `Record`."""
        text_a = convert_to_unicode(example.text_a)
        tokens_a = tokenizer.tokenize(text_a)
        tokens_b = None
        if "text_b" in example._fields:
            text_b = convert_to_unicode(example.text_b)
            tokens_b = tokenizer.tokenize(text_b)

        if tokens_b:
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is at most the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[0:(max_seq_length - 2)]

        # The convention in BERT/ERNIE is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids: 0     0   0   0  0     0 0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = []
        text_type_ids = []
        tokens.append("[CLS]")
        text_type_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            text_type_ids.append(0)
        tokens.append("[SEP]")
        text_type_ids.append(0)

        if tokens_b:
            for token in tokens_b:
                tokens.append(token)
                text_type_ids.append(1)
            tokens.append("[SEP]")
            text_type_ids.append(1)

        token_ids = tokenizer.convert_tokens_to_ids(tokens)
        position_ids = list(range(len(token_ids)))

        if self.label_map:
            label_id = self.label_map[example.label]
        else:
            label_id = example.label

        Record = namedtuple(
            'Record',
            ['token_ids', 'text_type_ids', 'position_ids', 'label_id', 'qid'])

        qid = None
        if "qid" in example._fields:
            qid = example.qid

        record = Record(
            token_ids=token_ids,
            text_type_ids=text_type_ids,
            position_ids=position_ids,
            label_id=label_id,
            qid=qid)
        return record
    def _prepare_batch_data(self, examples, batch_size, phase=None):
        """generate batch records"""
        batch_records, max_len = [], 0
        for index, example in enumerate(examples):
            if phase == "train":
                self.current_example = index
            record = self._convert_example_to_record(example, self.max_seq_len,
                                                     self.tokenizer)
            max_len = max(max_len, len(record.token_ids))
            if self.in_tokens:
                to_append = (len(batch_records) + 1) * max_len <= batch_size
            else:
                to_append = len(batch_records) < batch_size
            if to_append:
                batch_records.append(record)
            else:
                yield self._pad_batch_records(batch_records)
                batch_records, max_len = [record], len(record.token_ids)

        if batch_records:
            yield self._pad_batch_records(batch_records)
    def get_num_examples(self, input_file):
        """return total number of examples"""
        examples = self._read_tsv(input_file)
        return len(examples)

    def get_examples(self, input_file):
        examples = self._read_tsv(input_file)
        return examples

    def data_generator(self,
                       input_file,
                       batch_size,
                       epoch,
                       shuffle=True,
                       phase=None):
        """return generator which yields batch data for pyreader"""
        examples = self._read_tsv(input_file)

        def _wrapper():
            for epoch_index in range(epoch):
                if phase == "train":
                    self.current_example = 0
                    self.current_epoch = epoch_index
                if shuffle:
                    np.random.shuffle(examples)

                for batch_data in self._prepare_batch_data(
                        examples, batch_size, phase=phase):
                    yield batch_data

        return _wrapper
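To make the record layout concrete, here is a minimal standalone sketch of the same truncation and [CLS]/[SEP] assembly logic. It uses whitespace-split tokens instead of the real tokenizer, and the names `truncate_seq_pair` and `build_inputs` are illustrative only; they are not part of the reader class above.

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Trim the longer list one token at a time until the pair fits."""
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def build_inputs(tokens_a, tokens_b=None, max_seq_length=16):
    """Return (tokens, text_type_ids, position_ids) for one example."""
    tokens_a = list(tokens_a)
    tokens_b = list(tokens_b) if tokens_b else []
    if tokens_b:
        # Reserve room for [CLS], [SEP], [SEP]
        truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        # Reserve room for [CLS] and [SEP]
        tokens_a = tokens_a[:max_seq_length - 2]

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    text_type_ids = [0] * len(tokens)
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        text_type_ids += [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, text_type_ids, position_ids


tokens, type_ids, pos_ids = build_inputs(
    "is this jacksonville ?".split(),
    "no it is not .".split(),
    max_seq_length=10)
print(tokens)
print(type_ids)
```

With `max_seq_length=10`, only seven content tokens fit, so the second sequence loses its last two tokens while the first is left intact.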
In [16]
class ClassifyReader(BaseReader):
    """ClassifyReader"""

    def _read_tsv(self, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with io.open(input_file, "r", encoding="utf8") as f:
            reader = csv_reader(f, delimiter="\t")
            headers = next(reader)
            text_indices = [
                index for index, h in enumerate(headers) if h != "label"
            ]
            Example = namedtuple('Example', headers)

            examples = []
            for line in reader:
                for index, text in enumerate(line):
                    if index in text_indices:
                        line[index] = text.replace(' ', '')
                example = Example(*line)
                examples.append(example)
            return examples

    def _pad_batch_records(self, batch_records):
        batch_token_ids = [record.token_ids for record in batch_records]
        batch_text_type_ids = [record.text_type_ids for record in batch_records]
        batch_position_ids = [record.position_ids for record in batch_records]
        batch_labels = [record.label_id for record in batch_records]
        batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1])

        # padding
        padded_token_ids, input_mask, seq_lens = pad_batch_data(
            batch_token_ids,
            pad_idx=self.pad_id,
            return_input_mask=True,
            return_seq_lens=True)
        padded_text_type_ids = pad_batch_data(
            batch_text_type_ids, pad_idx=self.pad_id)
        padded_position_ids = pad_batch_data(
            batch_position_ids, pad_idx=self.pad_id)

        return_list = [
            padded_token_ids, padded_text_type_ids, padded_position_ids,
            input_mask, batch_labels, seq_lens
        ]
        return return_list
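The TSV parsing step can be tried in isolation. The sketch below is an approximation under stated assumptions: it uses the standard library's `csv.reader` in place of the `csv_reader` helper referenced above, reads from an in-memory string rather than a file, and the second sample row is made up for illustration.

```python
import csv
import io
from collections import namedtuple

# Header row plus two rows in the "label<TAB>space-separated text" layout.
sample = "label\ttext_a\n0\t谁 骂人 了 ?\n2\t我 喜欢 兔子\n"


def read_tsv(handle):
    """Parse a TSV with a header row into a list of Example namedtuples."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    headers = next(reader)
    Example = namedtuple("Example", headers)
    examples = []
    for line in reader:
        # Strip the word-segmentation spaces from every column except the label.
        fields = [
            text if name == "label" else text.replace(" ", "")
            for name, text in zip(headers, line)
        ]
        examples.append(Example(*fields))
    return examples


examples = read_tsv(io.StringIO(sample))
print(examples[0])
```

Note that labels stay as strings at this stage, exactly as in `_read_tsv`; they are only converted to `int64` when a batch is padded.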
In [17]
def pad_batch_data(insts,
                   pad_idx=0,
                   return_pos=False,
                   return_input_mask=False,
                   return_max_len=False,
                   return_num_token=False,
                   return_seq_lens=False):
    """
    Pad the instances to the max sequence length in batch, and generate the
    corresponding position data and input mask.
    """
    return_list = []
    max_len = max(len(inst) for inst in insts)
    # Any token included in dict can be used to pad, since the paddings' loss
    # will be masked out by weights and make no effect on parameter gradients.
    inst_data = np.array(
        [inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
    return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]

    # position data
    if return_pos:
        inst_pos = np.array([
            list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
            for inst in insts
        ])
        return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]

    if return_input_mask:
        # This is used to avoid attention on paddings.
        input_mask_data = np.array([[1] * len(inst) + [0] *
                                    (max_len - len(inst)) for inst in insts])
        input_mask_data = np.expand_dims(input_mask_data, axis=-1)
        return_list += [input_mask_data.astype("float32")]

    if return_max_len:
        return_list += [max_len]

    if return_num_token:
        num_token = 0
        for inst in insts:
            num_token += len(inst)
        return_list += [num_token]

    if return_seq_lens:
        seq_lens = np.array([len(inst) for inst in insts])
        return_list += [seq_lens.astype("int64").reshape([-1])]

    return return_list if len(return_list) > 1 else return_list[0]
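The padding logic is easier to follow on a tiny batch. The following is a simplified pure-Python sketch (the name `pad_batch` is hypothetical); unlike `pad_batch_data`, it returns plain lists and does not reshape the outputs to `[-1, max_len, 1]` NumPy arrays.

```python
def pad_batch(insts, pad_idx=0):
    """Pad every sequence to the longest one in the batch.

    Returns the padded ids, a 1.0/0.0 mask marking real tokens,
    and the original sequence lengths.
    """
    max_len = max(len(inst) for inst in insts)
    padded = [inst + [pad_idx] * (max_len - len(inst)) for inst in insts]
    input_mask = [
        [1.0] * len(inst) + [0.0] * (max_len - len(inst)) for inst in insts
    ]
    seq_lens = [len(inst) for inst in insts]
    return padded, input_mask, seq_lens


# Two token-id sequences of different lengths (the ids are arbitrary).
batch = [[1, 23, 45, 2], [1, 67, 2]]
padded, mask, lens = pad_batch(batch)
print(padded)  # [[1, 23, 45, 2], [1, 67, 2, 0]]
print(mask)    # [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 0.0]]
print(lens)    # [4, 3]
```

The mask is what keeps attention away from the padded positions, which is why the choice of padding id does not affect the result.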
In [18]
def ernie_pyreader(args, pyreader_name):
    """define standard ernie pyreader"""
    pyreader_name += '_' + ''.join(random.sample(string.ascii_letters + string.digits, 6))
    pyreader = fluid.layers.py_reader(
        capacity=50,
        shapes=[[-1, args['max_seq_len'], 1], [-1, args['max_seq_len'], 1],
                [-1, args['max_seq_len'], 1], [-1, args['max_seq_len'], 1], [-1, 1],
                [-1]],
        dtypes=['int64', 'int64', 'int64', 'float32', 'int64', 'int64'],
        lod_levels=[0, 0, 0, 0, 0, 0],
        name=pyreader_name,
        use_double_buffer=True)

    (src_ids, sent_ids, pos_ids, input_mask, labels,
     seq_lens) = fluid.layers.read_file(pyreader)

    ernie_inputs = {
        "src_ids": src_ids,
        "sent_ids": sent_ids,
        "pos_ids": pos_ids,
        "input_mask": input_mask,
        "seq_lens": seq_lens
    }
    return pyreader, ernie_inputs, labels
General parameter introduction
- Dataset related configuration
data_config = {
    'data_dir': 'data/data9740/data',
    'vocab_path': 'data/data9740/data/vocab.txt',
    'batch_size': 32,
    'random_seed': 0,
    'num_labels': 3,
    'max_seq_len': 512,
    'train_set': 'data/data9740/data/test.tsv',
    'test_set': 'data/data9740/data/test.tsv',
    'dev_set': 'data/data9740/data/dev.tsv',
    'infer_set': 'data/data9740/data/infer.tsv',
    'label_map_config': None,
    'do_lower_case': True,
}
Parameter introduction:
- data_dir : dataset path, default 'data/data9740/data'
- vocab_path : path to vocab.txt, default 'data/data9740/data/vocab.txt'
- batch_size : batch size for training and validation, default 32
- random_seed : random seed, default 0
- num_labels : number of categories, default 3
- max_seq_len : maximum sequence length in tokens, default 512
- train_set : training set path, default 'data/data9740/data/test.tsv'
- test_set : test set path, default 'data/data9740/data/test.tsv'
- dev_set : validation set path, default 'data/data9740/data/dev.tsv'
- infer_set : prediction set path, default 'data/data9740/data/infer.tsv'
- label_map_config : path to the label map file, default None
- do_lower_case : whether to lowercase the input text, default True
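Because every split is just a path in the config, it is easy for two splits to end up pointing at the same file. The helper below is a small hypothetical sanity check (`shared_splits` is not part of the original code) that reports any pair of split keys sharing a path. The dataset also ships a train.tsv, so a check like this can help confirm that the splits are configured as intended.

```python
from itertools import combinations

# Only the split paths from the configuration above are needed here.
data_config = {
    'train_set': 'data/data9740/data/test.tsv',
    'test_set': 'data/data9740/data/test.tsv',
    'dev_set': 'data/data9740/data/dev.tsv',
    'infer_set': 'data/data9740/data/infer.tsv',
}

SPLIT_KEYS = ('train_set', 'dev_set', 'test_set', 'infer_set')


def shared_splits(config, keys=SPLIT_KEYS):
    """Return every pair of split keys that point at the same file."""
    return [(a, b) for a, b in combinations(keys, 2) if config[a] == config[b]]


print(shared_splits(data_config))  # [('train_set', 'test_set')]
```

With the defaults shown here, the training and test splits resolve to the same file, so the reported test accuracy is measured on data the model was trained on.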
- ERNIE network structure related configuration
ernie_net_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "relu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 513,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 18000,
}
Parameter introduction:
- attention_probs_dropout_prob : dropout ratio applied to the attention probabilities, default 0.1
- hidden_act : hidden layer activation function, default 'relu'
- hidden_dropout_prob : hidden layer dropout ratio, default 0.1
- hidden_size : hidden layer size, default 768
- initializer_range : range used for parameter initialization, default 0.02
- max_position_embeddings : maximum length of the position sequence, default 513
- num_attention_heads : number of attention heads per attention block, default 12
- num_hidden_layers : number of hidden layers, default 12
- type_vocab_size : number of sentence (segment) types, default 2
- vocab_size : vocabulary size, default 18000
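A few quantities follow directly from these values. The sketch below assumes the usual BERT-style layout, in which the word, position, and sentence-type embedding tables are each `hidden_size` wide and the hidden size is split evenly across attention heads; the encoder definition itself is not shown in this part of the post, so treat the numbers as back-of-the-envelope estimates.

```python
ernie_net_config = {
    "hidden_size": 768,
    "max_position_embeddings": 513,
    "num_attention_heads": 12,
    "type_vocab_size": 2,
    "vocab_size": 18000,
}
max_seq_len = 512  # from data_config

hidden = ernie_net_config["hidden_size"]
heads = ernie_net_config["num_attention_heads"]

# The hidden size must divide evenly across the attention heads.
assert hidden % heads == 0
head_dim = hidden // heads

# Every position in a sequence needs a row in the position embedding table.
assert max_seq_len <= ernie_net_config["max_position_embeddings"]

# Rows in each embedding table times the embedding width.
embedding_params = hidden * (
    ernie_net_config["vocab_size"]
    + ernie_net_config["max_position_embeddings"]
    + ernie_net_config["type_vocab_size"]
)

print(head_dim)          # 64
print(embedding_params)  # 14219520
```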
In [19]
# Dataset-related configuration
data_config = {
    'data_dir': 'data/data9740/data',  # Directory path to training data.
    'vocab_path': 'pretrained_model/ernie_finetune/vocab.txt',  # Vocabulary path.
    'batch_size': 32,  # Total number of examples in a training batch.
    'random_seed': 0,  # Random seed.
    'num_labels': 3,  # Number of labels.
    'max_seq_len': 512,  # Maximum sequence length in tokens.
    'train_set': 'data/data9740/data/test.tsv',  # Path to training data.
    'test_set': 'data/data9740/data/test.tsv',  # Path to test data.
    'dev_set': 'data/data9740/data/dev.tsv',  # Path to validation data.
    'infer_set': 'data/data9740/data/infer.tsv',  # Path to infer data.
    'label_map_config': None,  # Path to the label map file.
    'do_lower_case': True,  # Whether to lower case the input text. Should be True for uncased models and False for cased models.
}
# ERNIE network structure configuration
ernie_net_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "relu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 513,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 18000,
}
4. Model training
Starting from Baidu's open-source dialogue emotion recognition model, you can fine-tune on your own data to obtain better results. Baidu provides two pre-trained models, TextCNN and ERNIE. To fine-tune the ERNIE model:
- Download the pre-trained model
- Set the parameter 'init_checkpoint': 'pretrained_model/ernie_finetune/params'
- Run the "ERNIE training code" cell
Configuration related to the training phase
train_config = {
    'init_checkpoint': 'pretrained_model/ernie_finetune/params',
    'output_dir': 'train_model',
    'epoch': 10,
    'save_steps': 100,
    'validation_steps': 100,
    'lr': 0.00002,
    'skip_steps': 10,
    'verbose': False,
    'use_cuda': True,
}
Parameter introduction:
- init_checkpoint : path to the pre-trained model parameters used for initialization, default 'pretrained_model/ernie_finetune/params'
- output_dir : directory where checkpoints are saved, default 'train_model'
- epoch : number of training epochs, default 10
- save_steps : number of steps between checkpoint saves, default 100
- validation_steps : number of steps between validation runs, default 100
- lr : learning rate, default 0.00002
- skip_steps : number of steps between log outputs, default 10
- verbose : whether to output verbose logs, default False
- use_cuda : whether to use the GPU, default True
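To see how these settings interact, the sketch below reproduces the step estimate used by the training code (`epoch * num_examples // batch_size // dev_count + 1`) and compares it with the number of batches the data generator yields. The figures of 5 epochs and 1036 training examples come from the training cell and its log further down; the function names are illustrative, and the batch count assumes one device and batching by example count rather than by token count.

```python
import math


def estimated_train_steps(epoch, num_examples, batch_size, dev_count=1):
    """Step estimate computed the same way as in the training code."""
    return epoch * num_examples // batch_size // dev_count + 1


def batches_yielded(epoch, num_examples, batch_size):
    """Batches produced by the generator: the last batch of each epoch may be partial."""
    return epoch * math.ceil(num_examples / batch_size)


estimated = estimated_train_steps(epoch=5, num_examples=1036, batch_size=32)
actual = batches_yielded(epoch=5, num_examples=1036, batch_size=32)

# Steps at which a checkpoint is saved and validation runs (every 100 steps).
save_points = [step for step in range(1, actual + 1) if step % 100 == 0]

print(estimated)    # 162
print(actual)       # 165
print(save_points)  # [100]
```

With a run this short, only one intermediate checkpoint and one validation pass happen before the final save at the end of training.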
In [20]
# Download the pre-trained model
!mkdir pretrained_model
# Download and extract the ERNIE pre-trained model
!cd pretrained_model && wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/emotion_detection_ernie_finetune-1.0.0.tar.gz
!cd pretrained_model && tar xzf emotion_detection_ernie_finetune-1.0.0.tar.gz
mkdir: cannot create directory ‘pretrained_model’: File exists
--2020-02-12 15:26:18-- https://baidu-nlp.bj.bcebos.com/emotion_detection_ernie_finetune-1.0.0.tar.gz
Resolving baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)... 182.61.200.195, 182.61.200.229
Connecting to baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)|182.61.200.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 744568046 (710M) [application/x-gzip]
Saving to: ‘emotion_detection_ernie_finetune-1.0.0.tar.gz.2’
emotion_detection_e 100%[===================>] 710.08M 72.2MB/s in 15s
2020-02-12 15:26:33 (48.4 MB/s) - ‘emotion_detection_ernie_finetune-1.0.0.tar.gz.2’ saved [744568046/744568046]
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
In [21]
# ERNIE training code
train_config = {
    'init_checkpoint': 'pretrained_model/ernie_finetune/params',  # Init checkpoint to resume training from.
    # 'init_checkpoint': 'None',
    'output_dir': 'train_model',  # Directory path to save checkpoints.
    'epoch': 5,  # Number of epochs for training.
    'save_steps': 100,  # The steps interval to save checkpoints.
    'validation_steps': 100,  # The steps interval to evaluate model performance.
    'lr': 0.00002,  # The learning rate value for training.
    'skip_steps': 10,  # The steps interval to print loss.
    'verbose': False,  # Whether to output verbose log.
    'use_cuda': True,  # If set, use GPU for training.
}
train_config.update(data_config)
def evaluate(exe, test_program, test_pyreader, fetch_list, eval_phase):
    """
    Evaluation Function
    """
    test_pyreader.start()
    total_cost, total_acc, total_num_seqs = [], [], []
    time_begin = time.time()
    while True:
        try:
            # Run one evaluation step
            np_loss, np_acc, np_num_seqs = exe.run(program=test_program,
                                                   fetch_list=fetch_list,
                                                   return_numpy=False)
            np_loss = np.array(np_loss)
            np_acc = np.array(np_acc)
            np_num_seqs = np.array(np_num_seqs)
            total_cost.extend(np_loss * np_num_seqs)
            total_acc.extend(np_acc * np_num_seqs)
            total_num_seqs.extend(np_num_seqs)
        except fluid.core.EOFException:
            test_pyreader.reset()
            break
    time_end = time.time()
    logger.info("[%s evaluation] avg loss: %f, ave acc: %f, elapsed time: %f s" %
                (eval_phase, np.sum(total_cost) / np.sum(total_num_seqs),
                 np.sum(total_acc) / np.sum(total_num_seqs), time_end - time_begin))
def main(config):
    """
    Main Function
    """
    # Define the executor
    if config['use_cuda']:
        place = fluid.CUDAPlace(0)
        dev_count = fluid.core.get_cuda_device_count()
    else:
        place = fluid.CPUPlace()
        dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
    exe = fluid.Executor(place)

    # Define the data reader
    reader = ClassifyReader(
        vocab_path=config['vocab_path'],
        label_map_config=config['label_map_config'],
        max_seq_len=config['max_seq_len'],
        do_lower_case=config['do_lower_case'],
        random_seed=config['random_seed'])

    startup_prog = fluid.Program()
    if config['random_seed'] is not None:
        startup_prog.random_seed = config['random_seed']

    # Initialize the training phase
    train_data_generator = reader.data_generator(
        input_file=config['train_set'],
        batch_size=config['batch_size'],
        epoch=config['epoch'],
        shuffle=True,
        phase="train")

    num_train_examples = reader.get_num_examples(config['train_set'])

    # Derive the total number of training steps from
    # training set size * number of epochs
    max_train_steps = config['epoch'] * num_train_examples // config['batch_size'] // dev_count + 1

    logger.info("Device count: %d" % dev_count)
    logger.info("Num train examples: %d" % num_train_examples)
    logger.info("Max train steps: %d" % max_train_steps)

    train_program = fluid.Program()

    with fluid.program_guard(train_program, startup_prog):
        with fluid.unique_name.guard():
            # create ernie_pyreader
            train_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='train_reader')
            embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)

            # user defined model based on ernie embeddings
            loss, accuracy, num_seqs = create_ernie_model(config, embeddings, labels=labels, is_prediction=False)

            # sgd_optimizer = fluid.optimizer.Adagrad(learning_rate=config['lr'])
            # sgd_optimizer.minimize(loss)
            optimizer = fluid.optimizer.Adam(learning_rate=config['lr'])
            optimizer.minimize(loss)

    if config['verbose']:
        lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
            program=train_program, batch_size=config['batch_size'])
        logger.info("Theoretical memory usage in training: %.3f - %.3f %s" %
                    (lower_mem, upper_mem, unit))

    # Initialize the validation phase
    test_prog = fluid.Program()
    with fluid.program_guard(test_prog, startup_prog):
        with fluid.unique_name.guard():
            # create ernie_pyreader
            test_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='eval_reader')
            embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)

            # user defined model based on ernie embeddings
            loss, accuracy, num_seqs = create_ernie_model(config, embeddings, labels=labels, is_prediction=False)

    test_prog = test_prog.clone(for_test=True)

    exe.run(startup_prog)

    # Load the pre-trained model
    # if config['init_checkpoint']:
    #     init_checkpoint(exe, config['init_checkpoint'], main_program=train_program)

    # Model training code
    if not os.path.exists(config['output_dir']):
        os.mkdir(config['output_dir'])

    logger.info('Start training')
    train_pyreader.decorate_tensor_provider(train_data_generator)
    train_pyreader.start()
    steps = 0
    total_cost, total_acc, total_num_seqs = [], [], []
    time_begin = time.time()
    while True:
        try:
            steps += 1
            if steps % config['skip_steps'] == 0:
                fetch_list = [loss.name, accuracy.name, num_seqs.name]
            else:
                fetch_list = []
            # Run one training step
            outputs = exe.run(program=train_program, fetch_list=fetch_list, return_numpy=False)
            if steps % config['skip_steps'] == 0:
                # Print the log
                np_loss, np_acc, np_num_seqs = outputs
                np_loss = np.array(np_loss)
                np_acc = np.array(np_acc)
                np_num_seqs = np.array(np_num_seqs)
                total_cost.extend(np_loss * np_num_seqs)
                total_acc.extend(np_acc * np_num_seqs)
                total_num_seqs.extend(np_num_seqs)

                if config['verbose']:
                    verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size()
                    logger.info(verbose)

                time_end = time.time()
                used_time = time_end - time_begin
                logger.info("step: %d, avg loss: %f, "
                            "avg acc: %f, speed: %f steps/s" %
                            (steps, np.sum(total_cost) / np.sum(total_num_seqs),
                             np.sum(total_acc) / np.sum(total_num_seqs),
                             config['skip_steps'] / used_time))
                total_cost, total_acc, total_num_seqs = [], [], []
                time_begin = time.time()

            if steps % config['save_steps'] == 0:
                # Save a checkpoint
                # fluid.io.save_persistables(exe, config['output_dir'], train_program)
                fluid.save(train_program, os.path.join(config['output_dir'], "checkpoint"))

            if steps % config['validation_steps'] == 0:
                # Evaluate on the validation set
                test_pyreader.decorate_tensor_provider(
                    reader.data_generator(
                        input_file=config['dev_set'],
                        batch_size=config['batch_size'],
                        phase='dev',
                        epoch=1,
                        shuffle=False))
                evaluate(exe, test_prog, test_pyreader,
                         [loss.name, accuracy.name, num_seqs.name],
                         "dev")

        except fluid.core.EOFException:
            # Training finished
            # fluid.io.save_persistables(exe, config['output_dir'], train_program)
            fluid.save(train_program, os.path.join(config['output_dir'], "checkpoint"))
            train_pyreader.reset()
            logger.info('Training end.')
            break

    # Model evaluation code
    test_pyreader.decorate_tensor_provider(
        reader.data_generator(
            input_file=config['test_set'],
            batch_size=config['batch_size'], phase='test', epoch=1,
            shuffle=False))
    logger.info("Final validation result:")
    evaluate(exe, test_prog, test_pyreader,
             [loss.name, accuracy.name, num_seqs.name], "test")


if __name__ == "__main__":
    init_log_config()
    print_arguments(train_config)
    main(train_config)
2020-02-12 15:26:37,110 - <ipython-input-4-ad5dfe890543>[line:7] - INFO: ----------- Configuration Arguments -----------
2020-02-12 15:26:37,112 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: init_checkpoint: pretrained_model/ernie_finetune/params
2020-02-12 15:26:37,113 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: output_dir: train_model
2020-02-12 15:26:37,114 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: epoch: 5
2020-02-12 15:26:37,114 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: save_steps: 100
2020-02-12 15:26:37,115 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: validation_steps: 100
2020-02-12 15:26:37,115 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: lr: 2e-05
2020-02-12 15:26:37,116 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: skip_steps: 10
2020-02-12 15:26:37,117 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: verbose: False
2020-02-12 15:26:37,118 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: use_cuda: True
2020-02-12 15:26:37,118 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: data_dir: data/data9740/data
2020-02-12 15:26:37,119 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: vocab_path: pretrained_model/ernie_finetune/vocab.txt
2020-02-12 15:26:37,119 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: batch_size: 32
2020-02-12 15:26:37,120 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: random_seed: 0
2020-02-12 15:26:37,120 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: num_labels: 3
2020-02-12 15:26:37,121 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: max_seq_len: 512
2020-02-12 15:26:37,121 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: train_set: data/data9740/data/test.tsv
2020-02-12 15:26:37,121 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: test_set: data/data9740/data/test.tsv
2020-02-12 15:26:37,122 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: dev_set: data/data9740/data/dev.tsv
2020-02-12 15:26:37,122 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: infer_set: data/data9740/data/infer.tsv
2020-02-12 15:26:37,123 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: label_map_config: None
2020-02-12 15:26:37,123 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: do_lower_case: True
2020-02-12 15:26:37,124 - <ipython-input-4-ad5dfe890543>[line:10] - INFO: ------------------------------------------------
2020-02-12 15:26:37,155 - <ipython-input-21-a69ca2beccb7>[line:87] - INFO: Device count: 1
2020-02-12 15:26:37,156 - <ipython-input-21-a69ca2beccb7>[line:88] - INFO: Num train examples: 1036
2020-02-12 15:26:37,157 - <ipython-input-21-a69ca2beccb7>[line:89] - INFO: Max train steps: 162
2020-02-12 15:26:37,157 - io.py[line:690] - WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-02-12 15:26:38,624 - io.py[line:690] - WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-02-12 15:26:41,918 - <ipython-input-21-a69ca2beccb7>[line:138] - INFO: Start training
2020-02-12 15:26:43,916 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 10, avg loss: 1.100799, avg acc: 0.656250, speed: 5.022962 steps/s
2020-02-12 15:26:45,749 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 20, avg loss: 0.826600, avg acc: 0.687500, speed: 5.464362 steps/s
2020-02-12 15:26:47,657 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 30, avg loss: 0.786590, avg acc: 0.750000, speed: 5.247581 steps/s
2020-02-12 15:26:49,552 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 40, avg loss: 0.942438, avg acc: 0.593750, speed: 5.286641 steps/s
2020-02-12 15:26:51,397 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 50, avg loss: 0.457657, avg acc: 0.875000, speed: 5.426712 steps/s
2020-02-12 15:26:53,161 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 60, avg loss: 0.668453, avg acc: 0.718750, speed: 5.680988 steps/s
2020-02-12 15:26:54,994 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 70, avg loss: 1.101823, avg acc: 0.562500, speed: 5.463856 steps/s
2020-02-12 15:26:56,797 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 80, avg loss: 0.627734, avg acc: 0.812500, speed: 5.554073 steps/s
2020-02-12 15:26:58,488 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 90, avg loss: 0.610504, avg acc: 0.750000, speed: 5.927647 steps/s
2020-02-12 15:27:00,257 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 100, avg loss: 0.598749, avg acc: 0.781250, speed: 5.664717 steps/s
2020-02-12 15:27:15,284 - <ipython-input-21-a69ca2beccb7>[line:45] - INFO: [dev evaluation] avg loss: 0.637231, ave acc: 0.787037, elapsed time: 2.431484 s
2020-02-12 15:27:17,637 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 110, avg loss: 0.523219, avg acc: 0.843750, speed: 0.575587 steps/s
2020-02-12 15:27:19,406 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 120, avg loss: 0.484762, avg acc: 0.812500, speed: 5.671743 steps/s
2020-02-12 15:27:21,776 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 130, avg loss: 0.280636, avg acc: 0.937500, speed: 4.227057 steps/s
2020-02-12 15:27:24,124 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 140, avg loss: 0.624467, avg acc: 0.687500, speed: 4.264188 steps/s
2020-02-12 15:27:26,591 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 150, avg loss: 0.506643, avg acc: 0.875000, speed: 4.058757 steps/s
2020-02-12 15:27:28,928 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 160, avg loss: 0.584385, avg acc: 0.750000, speed: 4.284505 steps/s
2020-02-12 15:27:42,380 - <ipython-input-21-a69ca2beccb7>[line:201] - INFO: Training end.
2020-02-12 15:27:42,384 - <ipython-input-21-a69ca2beccb7>[line:210] - INFO: Final validation result:
2020-02-12 15:27:45,154 - <ipython-input-21-a69ca2beccb7>[line:45] - INFO: [test evaluation] avg loss: 0.436581, ave acc: 0.837838, elapsed time: 2.761349 s
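The evaluation figures in this log are averages weighted by batch size: each batch's loss and accuracy is multiplied by its number of sequences before summing, so a small final batch does not count as much as a full one. A minimal sketch of that aggregation, with made-up per-batch numbers and a hypothetical helper name:

```python
def weighted_average(values, counts):
    """Average per-batch values, weighting each by its number of sequences."""
    total = sum(value * count for value, count in zip(values, counts))
    return total / sum(counts)


# Illustrative numbers only: two full batches and one small final batch.
batch_acc = [0.75, 0.5, 1.0]
batch_sizes = [32, 32, 4]

weighted = weighted_average(batch_acc, batch_sizes)
naive = sum(batch_acc) / len(batch_acc)

print(round(weighted, 4))  # 0.6471
print(round(naive, 4))     # 0.75
```

The unweighted mean overstates accuracy here because the four-example batch counts as much as a full batch of 32.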
5. Model prediction
In the prediction stage, the saved model is loaded and used to make predictions on the prediction set. This is controlled by the following parameters:
Forecast related configuration
infer_config = {
    'init_checkpoint': 'train_model',
    'use_cuda': True,
}
Parameter introduction:
- init_checkpoint : directory containing the trained model checkpoint to load, default 'train_model'
- use_cuda : whether to use the GPU, default True
In [22]
# ERNIE prediction code
infer_config = {
    'init_checkpoint': 'train_model',  # Directory of the trained checkpoint to load.
    'use_cuda': True,  # If set, use GPU for prediction.
}
infer_config.update(data_config)
def init_checkpoint_infer(exe, init_checkpoint_path, main_program):
    """
    Load the saved model
    """
    assert os.path.exists(
        init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path
    # fluid.io.load_vars(
    #     exe,
    #     init_checkpoint_path,
    #     main_program=main_program,
    #     predicate=existed_persitables)
    fluid.load(main_program, os.path.join(init_checkpoint_path, "checkpoint"), exe)
    logger.info("Load model from {}".format(init_checkpoint_path))
def infer(exe, infer_program, infer_pyreader, fetch_list, infer_phase, examples):
    """Infer"""
    infer_pyreader.start()
    time_begin = time.time()
    # Index of the next example across all batches, so that each logged
    # prediction is paired with its own input even after the first batch.
    example_index = 0
    while True:
        try:
            # Run one prediction step
            batch_probs = exe.run(program=infer_program, fetch_list=fetch_list,
                                  return_numpy=True)
            for probs in batch_probs[0]:
                logger.info("Probs: %f %f %f, prediction: %d, input: %s" % (
                    probs[0], probs[1], probs[2], np.argmax(probs),
                    examples[example_index]))
                example_index += 1
        except fluid.core.EOFException:
            infer_pyreader.reset()
            break
    time_end = time.time()
    logger.info("[%s] elapsed time: %f s" % (infer_phase, time_end - time_begin))
def main(config):
    """
    Main Function
    """
    # Define the executor
    if config['use_cuda']:
        place = fluid.CUDAPlace(0)
        dev_count = fluid.core.get_cuda_device_count()
    else:
        place = fluid.CPUPlace()
        dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
    exe = fluid.Executor(place)

    # Define the data reader
    reader = ClassifyReader(
        vocab_path=config['vocab_path'],
        label_map_config=config['label_map_config'],
        max_seq_len=config['max_seq_len'],
        do_lower_case=config['do_lower_case'],
        random_seed=config['random_seed'])

    startup_prog = fluid.Program()
    if config['random_seed'] is not None:
        startup_prog.random_seed = config['random_seed']

    # Initialize the prediction phase
    test_prog = fluid.Program()
    with fluid.program_guard(test_prog, startup_prog):
        with fluid.unique_name.guard():
            infer_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='infer_reader')
            embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)

            probs = create_ernie_model(config, embeddings, labels=labels, is_prediction=True)

    test_prog = test_prog.clone(for_test=True)

    exe.run(startup_prog)

    # Load the trained model
    if not config['init_checkpoint']:
        raise ValueError("args 'init_checkpoint' should be set if "
                         "only doing validation or infer!")
    init_checkpoint_infer(exe, config['init_checkpoint'], main_program=test_prog)

    # Model prediction code
    infer_pyreader.decorate_tensor_provider(
        reader.data_generator(
            input_file=config['infer_set'],
            batch_size=config['batch_size'],
            phase='infer',
            epoch=1,
            shuffle=False))
    logger.info("Final test result:")
    infer(exe, test_prog, infer_pyreader,
          [probs.name], "infer", reader.get_examples(config['infer_set']))


if __name__ == "__main__":
    init_log_config()
    print_arguments(infer_config)
    main(infer_config)
2020-02-12 15:27:45,174 - <ipython-input-4-ad5dfe890543>[line:7] - INFO: ----------- Configuration Arguments -----------
2020-02-12 15:27:45,175 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: init_checkpoint: train_model
2020-02-12 15:27:45,176 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: use_cuda: True
2020-02-12 15:27:45,177 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: data_dir: data/data9740/data
2020-02-12 15:27:45,178 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: vocab_path: pretrained_model/ernie_finetune/vocab.txt
2020-02-12 15:27:45,179 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: batch_size: 32
2020-02-12 15:27:45,179 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: random_seed: 0
2020-02-12 15:27:45,180 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: num_labels: 3
2020-02-12 15:27:45,180 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: max_seq_len: 512
2020-02-12 15:27:45,181 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: train_set: data/data9740/data/test.tsv
2020-02-12 15:27:45,182 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: test_set: data/data9740/data/test.tsv
2020-02-12 15:27:45,183 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: dev_set: data/data9740/data/dev.tsv
2020-02-12 15:27:45,183 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: infer_set: data/data9740/data/infer.tsv
2020-02-12 15:27:45,183 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: label_map_config: None
2020-02-12 15:27:45,184 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: do_lower_case: True
2020-02-12 15:27:45,184 - <ipython-input-4-ad5dfe890543>[line:10] - INFO: ------------------------------------------------
2020-02-12 15:27:45,210 - io.py[line:690] - WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-02-12 15:27:47,503 - <ipython-input-22-09c53c7a0ed3>[line:22] - INFO: Load model from train_model
2020-02-12 15:27:47,505 - <ipython-input-22-09c53c7a0ed3>[line:95] - INFO: Final test result:
2020-02-12 15:27:47,558 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.019586 0.875026 0.105388, prediction: 1, input: Example(label='1', text_a= 'I want to be objective')
2020-02-12 15:27:47,559 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.608523 0.325968 0.065509, prediction: 0, input: Example(label='0', text_a= 'Are you really talking nonsense')
2020-02-12 15:27:47,560 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.003523 0.947431 0.049045, prediction: 1, input: Example(label='1', text_a='口嗅会')
2020-02-12 15:27:47,560 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.014141 0.889832 0.096027, prediction: 1, input: Example(label='1', text_a= 'Every time it's the cousin who takes the nest because the nest is crazy')
2020-02-12 15:27:47,561 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.234133 0.636430 0.129437, prediction: 1, input: Example(label='0', text_a= 'Stop talking nonsense I'm asking you a question')
2020-02-12 15:27:47,561 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.014605 0.887870 0.097524, prediction: 1, input: Example(label='1', text_a= '4967 is the bank in Singapore')
2020-02-12 15:27:47,562 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.692878 0.215159 0.091963, prediction: 0, input: Example(label='2', text_a= 'Yes I like rabbits')
2020-02-12 15:27:47,562 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.019696 0.888937 0.091367, prediction: 1, input: Example(label='1', text_a= 'Have you ever written about Huangshan Strange Rock')
2020-02-12 15:27:47,563 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.012140 0.872288 0.115572, prediction: 1, input: Example(label='1', text_a= 'One by one slowly')
2020-02-12 15:27:47,563 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.770847 0.185456 0.043697, prediction: 0, input: Example(label='0', text_a= 'I've played this and it's not fun at all')
2020-02-12 15:27:47,563 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.007810 0.900273 0.091916, prediction: 1, input: Example(label='1', text_a= 'Online development of girls' QQ')
2020-02-12 15:27:47,564 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.072372 0.808013 0.119615, prediction: 1, input: Example(label='1', text_a= 'You guessed it right')
2020-02-12 15:27:47,564 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.874610 0.099676 0.025713, prediction: 0, input: Example(label='0', text_a= 'I hate you, hehehe...')
2020-02-12 15:27:47,596 - <ipython-input-22-09c53c7a0ed3>[line:40] - INFO: [infer] elapsed time: 0.085669 s
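Each logged prediction is simply the index of the largest of the three probabilities, which maps to the label scheme from the dataset introduction (0 negative, 1 neutral, 2 positive). The sketch below replays that step on three rows copied from the log and then computes accuracy over all 13 logged examples; the helper name `predict` and the `LABEL_NAMES` mapping are illustrative.

```python
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}


def predict(probs):
    """Return the index of the highest probability (an argmax)."""
    return max(range(len(probs)), key=lambda i: probs[i])


# (probabilities, gold label) pairs copied from the log above.
rows = [
    ((0.019586, 0.875026, 0.105388), 1),
    ((0.608523, 0.325968, 0.065509), 0),
    ((0.692878, 0.215159, 0.091963), 2),
]
for probs, gold in rows:
    pred = predict(probs)
    print("predicted %s, gold %s" % (LABEL_NAMES[pred], LABEL_NAMES[gold]))

# Gold labels and predictions for all 13 logged examples, in order.
gold_labels = [1, 0, 1, 1, 0, 1, 2, 1, 1, 0, 1, 1, 0]
predictions = [1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
correct = sum(g == p for g, p in zip(gold_labels, predictions))
print("%d/%d correct" % (correct, len(gold_labels)))  # 11/13 correct
```

The two misses are a negative example predicted as neutral and the only positive example predicted as negative.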