Dialogue Emotion Recognition Based on PaddlePaddle: Exploring Deep Learning with the ERNIE Model

Table of contents

1. What is ERNIE?

2. Dataset introduction

3. Construction of network structure

3.1 ERNIE model definition

3.2 Basic network structure definition

3.3 Encoder and classifier definitions

3.4 Word segmentation code

3.5 Word segmentation auxiliary code

3.6 Data reading and preprocessing code

General parameter introduction

4. Model training

Configuration related to the training phase

5. Model prediction

Prediction-related configuration


Conversational emotion recognition is an important task in natural language processing (NLP), which has a wide range of applications in many fields such as chatbots, customer service, and social media analysis. In this blog, we will explore how to use PaddlePaddle's ERNIE model for dialogue emotion recognition.

1. What is ERNIE?

ERNIE (Enhanced Representation through Knowledge Integration) is a knowledge-enhanced semantic pre-training model developed by Baidu, which has demonstrated excellent performance in many Chinese NLP tasks.

Accuracy of several baseline models and ERNIE on three Chinese emotion test sets (chitchat, customer service, Weibo):

Model      Chitchat   Customer Service   Weibo
BOW        90.2%      87.6%              74.2%
LSTM       91.4%      90.1%              73.8%
Bi-LSTM    91.2%      89.9%              73.6%
CNN        90.8%      90.7%              76.3%
TextCNN    91.1%      91.0%              76.8%
BERT       93.6%      92.3%              78.6%
ERNIE      94.4%      94.0%              80.6%

2. Dataset introduction

The input of the dialogue emotion recognition task is a piece of user text, and the output is the detected emotion category: negative, positive, or neutral. This is a classic three-class short-text classification task.

After the dataset is decompressed, a data directory is generated. It contains the training set (train.tsv), development set (dev.tsv), test set (test.tsv), data to be predicted (infer.tsv) and the corresponding dictionary (vocab.txt).
Examples of the data used for training, prediction, and evaluation are shown below. Each line consists of two columns separated by a tab ('\t'). The first column is the emotion category (0 means negative; 1 means neutral; 2 means positive), and the second column is the Chinese text with words separated by spaces:

label   text_a
0   谁 骂人 了 ? 我 从来 不 骂人 , 我 骂 的 都 不是 人 , 你 是 人 吗 ?
1   我 有事 等会儿 就 回来 和 你 聊
2   我 见到 你 很高兴 谢谢 你 帮 我

In [1]

# Unzip the dataset
!cd /home/aistudio/data/data9740 && unzip -qo 对话情绪识别.zip
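After unzipping, a quick sanity check of the label distribution is often useful. A minimal sketch, assuming the data directory layout described above (data/data9740/data):

import collections
import io

# Count how many examples of each label are in the training file.
counts = collections.Counter()
with io.open('data/data9740/data/train.tsv', 'r', encoding='utf8') as f:
    next(f)  # skip the "label\ttext_a" header line
    for line in f:
        label = line.rstrip('\n').split('\t', 1)[0]
        counts[label] += 1
print(counts)  # counts for 0 (negative), 1 (neutral) and 2 (positive)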

In [2]

# Imports
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import io
import os
import six
import sys
import time
import random
import string
import json
import logging
import argparse
import collections
import unicodedata
from functools import partial
from collections import namedtuple

import multiprocessing

import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
import numpy as np

In [3]

# Unified logger configuration

logger = None

def init_log_config():
    """
    Initialize logging configuration
    :return:
    """
    global logger
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    log_path = os.path.join(os.getcwd(), 'logs')
    if not os.path.exists(log_path):
        os.makedirs(log_path)
    log_name = os.path.join(log_path, 'train.log')
    sh = logging.StreamHandler()
    fh = logging.FileHandler(log_name, mode='w')
    fh.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s")
    fh.setFormatter(formatter)
    sh.setFormatter(formatter)
    logger.handlers = []
    logger.addHandler(sh)
    logger.addHandler(fh)

In [4]

# util

def print_arguments(args):
    """
    Print configuration arguments
    """
    logger.info('-----------  Configuration Arguments -----------')
    for key in args.keys():
        logger.info('%s: %s' % (key, args[key]))
    logger.info('------------------------------------------------')


def init_checkpoint(exe, init_checkpoint_path, main_program):
    """
    Load parameters from a checkpoint
    """
    assert os.path.exists(
        init_checkpoint_path), "[%s] cannot be found." % init_checkpoint_path

    def existed_persitables(var):
        """
        Check whether the persistable variable exists in the checkpoint
        """
        if not fluid.io.is_persistable(var):
            return False
        return os.path.exists(os.path.join(init_checkpoint_path, var.name))

    fluid.io.load_vars(
        exe,
        init_checkpoint_path,
        main_program=main_program,
        predicate=existed_persitables)
    logger.info("Load model from {}".format(init_checkpoint_path))


def csv_reader(fd, delimiter='\t'):
    """
    Read a tab-separated file
    """
    def gen():
        for i in fd:
            slots = i.rstrip('\n').split(delimiter)
            if len(slots) == 1:
                yield slots,
            else:
                yield slots
    return gen()

3. Construction of network structure

ERNIE: a general text semantic representation model developed by Baidu, pre-trained on massive data and prior knowledge; here we fine-tune it on the dialogue emotion classification dataset.

Released in March 2019, ERNIE learns real-world semantic knowledge by modeling words, entities, and entity relationships in massive data. Whereas BERT learns from the raw language signal, ERNIE directly models prior semantic knowledge units, which strengthens the model's semantic representation ability.
In July of the same year, Baidu released ERNIE 2.0, a semantic understanding pre-training framework based on continual learning that uses multi-task learning to build pre-training tasks incrementally. Newly constructed pre-training task types can be added to the framework seamlessly for continual semantic understanding learning. Through new semantic tasks such as entity prediction, sentence causality judgment, and article sentence structure reconstruction, the ERNIE 2.0 pre-trained model captures information from the training data along multiple dimensions such as morphology, syntax, and semantics, which greatly enhances its general semantic representation capability.

References:

  1. ERNIE: Enhanced Representation through Knowledge Integration
  2. ERNIE 2.0: A Continual Pre-training Framework for Language Understanding
  3. ERNIE Preview: Baidu Knowledge Enhanced Semantic Representation Model ERNIE
  4. ERNIE 2.0 GitHub

3.1 ERNIE model definition

The class ErnieModel defines the ERNIE encoder network structure.
Inputs: src_ids, position_ids, sentence_ids and input_mask
Outputs: sequence_output and pooled_output
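As a rough sketch of the expected tensor shapes (placeholder sizes; the padded id inputs are rank-3 with a trailing dimension of 1, matching the py_reader shapes defined in section 3.6):

import numpy as np

batch_size, seq_len = 2, 16  # placeholder sizes for illustration
src_ids      = np.zeros((batch_size, seq_len, 1), dtype='int64')   # token ids
position_ids = np.zeros((batch_size, seq_len, 1), dtype='int64')   # position ids
sentence_ids = np.zeros((batch_size, seq_len, 1), dtype='int64')   # sentence/type ids
input_mask   = np.ones((batch_size, seq_len, 1), dtype='float32')  # 1.0 for real tokens, 0.0 for padding
# sequence_output has shape [batch_size, seq_len, hidden_size];
# pooled_output is the [CLS] position passed through a tanh FC, roughly [batch_size, hidden_size].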

In [5]

class ErnieModel(object):
    """Ernie模型定义"""

    def __init__(self,
                 src_ids,
                 position_ids,
                 sentence_ids,
                 input_mask,
                 config,
                 weight_sharing=True,
                 use_fp16=False):

        # ERNIE hyperparameters
        self._emb_size = config['hidden_size']
        self._n_layer = config['num_hidden_layers']
        self._n_head = config['num_attention_heads']
        self._voc_size = config['vocab_size']
        self._max_position_seq_len = config['max_position_embeddings']
        self._sent_types = config['type_vocab_size']
        self._hidden_act = config['hidden_act']
        self._prepostprocess_dropout = config['hidden_dropout_prob']
        self._attention_dropout = config['attention_probs_dropout_prob']
        self._weight_sharing = weight_sharing

        self._word_emb_name = "word_embedding"
        self._pos_emb_name = "pos_embedding"
        self._sent_emb_name = "sent_embedding"
        self._dtype = "float16" if use_fp16 else "float32"

        # Initialize all weights by truncated normal initializer, and all biases
        # will be initialized by constant zero by default.
        self._param_initializer = fluid.initializer.TruncatedNormal(
            scale=config['initializer_range'])

        self._build_model(src_ids, position_ids, sentence_ids, input_mask)

    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
        # padding id in vocabulary must be set to 0
        emb_out = fluid.layers.embedding(
            input=src_ids,
            size=[self._voc_size, self._emb_size],
            dtype=self._dtype,
            param_attr=fluid.ParamAttr(
                name=self._word_emb_name, initializer=self._param_initializer),
            is_sparse=False)
        position_emb_out = fluid.layers.embedding(
            input=position_ids,
            size=[self._max_position_seq_len, self._emb_size],
            dtype=self._dtype,
            param_attr=fluid.ParamAttr(
                name=self._pos_emb_name, initializer=self._param_initializer))

        sent_emb_out = fluid.layers.embedding(
            sentence_ids,
            size=[self._sent_types, self._emb_size],
            dtype=self._dtype,
            param_attr=fluid.ParamAttr(
                name=self._sent_emb_name, initializer=self._param_initializer))

        emb_out = emb_out + position_emb_out
        emb_out = emb_out + sent_emb_out

        emb_out = pre_process_layer(
            emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')

        if self._dtype == "float16":
            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
        self_attn_mask = fluid.layers.matmul(
            x=input_mask, y=input_mask, transpose_y=True)

        self_attn_mask = fluid.layers.scale(
            x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
        n_head_self_attn_mask = fluid.layers.stack(
            x=[self_attn_mask] * self._n_head, axis=1)
        n_head_self_attn_mask.stop_gradient = True

        self._enc_out = encoder(
            enc_input=emb_out,
            attn_bias=n_head_self_attn_mask,
            n_layer=self._n_layer,
            n_head=self._n_head,
            d_key=self._emb_size // self._n_head,
            d_value=self._emb_size // self._n_head,
            d_model=self._emb_size,
            d_inner_hid=self._emb_size * 4,
            prepostprocess_dropout=self._prepostprocess_dropout,
            attention_dropout=self._attention_dropout,
            relu_dropout=0,
            hidden_act=self._hidden_act,
            preprocess_cmd="",
            postprocess_cmd="dan",
            param_initializer=self._param_initializer,
            name='encoder')

    def get_sequence_output(self):
        """Get embedding of each token for squence labeling"""
        return self._enc_out

    def get_pooled_output(self):
        """Get the first feature of each sequence for classification"""
        next_sent_feat = fluid.layers.slice(
            input=self._enc_out, axes=[1], starts=[0], ends=[1])
        next_sent_feat = fluid.layers.fc(
            input=next_sent_feat,
            size=self._emb_size,
            act="tanh",
            param_attr=fluid.ParamAttr(
                name="pooled_fc.w_0", initializer=self._param_initializer),
            bias_attr="pooled_fc.b_0")
        return next_sent_feat

3.2 Basic network structure definition

The following 4 cells define the basic network structure used in ErnieModel, including:

  1. multi_head_attention
  2. positionwise_feed_forward
  3. pre_post_process_layer: adds residual connection, layer normalization and dropout; used before and after multi_head_attention and positionwise_feed_forward
  4. encoder_layer: call the above three structures to generate the encoder layer
  5. encoder: Stack encoder_layer to generate a complete encoder

For an introduction to multi_head_attention and positionwise_feed_forward, please refer to: The Annotated Transformer
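For intuition, here is a minimal NumPy sketch of the scaled dot-product attention that multi_head_attention implements (single head, no dropout; attn_bias plays the role of the additive mask built in ErnieModel._build_model):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, attn_bias=None):
    # q, k, v: [batch, seq_len, d_key]
    d_key = q.shape[-1]
    scores = np.matmul(q * d_key ** -0.5, np.swapaxes(k, -1, -2))  # [batch, seq_len, seq_len]
    if attn_bias is not None:
        scores += attn_bias  # large negative values mask out padded positions
    weights = softmax(scores)
    return np.matmul(weights, v)  # [batch, seq_len, d_key]

q = k = v = np.random.rand(1, 4, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # (1, 4, 8)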

In [6]

def multi_head_attention(queries, keys, values, attn_bias, d_key, d_value, d_model, n_head=1, dropout_rate=0.,
                         cache=None, param_initializer=None, name='multi_head_att'):
    """
    Multi-Head Attention. Note that attn_bias is added to the logit before
    computing the softmax activation to mask certain selected positions so that
    they will not be considered in the attention weights.
    """
    keys = queries if keys is None else keys
    values = keys if values is None else values

    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
        raise ValueError(
            "Inputs: quries, keys and values should all be 3-D tensors.")

    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
        """
        Add linear projection to queries, keys, and values.
        """
        q = layers.fc(input=queries,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_query_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_query_fc.b_0')
        k = layers.fc(input=keys,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_key_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_key_fc.b_0')
        v = layers.fc(input=values,
                      size=d_value * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_value_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_value_fc.b_0')
        return q, k, v

    def __split_heads(x, n_head):
        """
        Reshape the last dimension of input tensor x so that it becomes two
        dimensions and then transpose. Specifically, input a tensor with shape
        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
        with shape [bs, n_head, max_sequence_length, hidden_dim].
        """
        hidden_size = x.shape[-1]
        # The value 0 in shape attr means copying the corresponding dimension
        # size of the input as the output dimension size.
        reshaped = layers.reshape(
            x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)

        # permute the dimensions into:
        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])

    def __combine_heads(x):
        """
        Transpose and then reshape the last two dimensions of input tensor x
        so that it becomes one dimension, which is reverse to __split_heads.
        """
        if len(x.shape) == 3:
            return x
        if len(x.shape) != 4:
            raise ValueError("Input(x) should be a 4-D Tensor.")

        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
        # The value 0 in shape attr means copying the corresponding dimension
        # size of the input as the output dimension size.
        return layers.reshape(
            x=trans_x,
            shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
            inplace=True)

    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
        """
        Scaled Dot-Product Attention
        """
        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
        if attn_bias:
            product += attn_bias
        weights = layers.softmax(product)
        if dropout_rate:
            weights = layers.dropout(
                weights,
                dropout_prob=dropout_rate,
                dropout_implementation="upscale_in_train",
                is_test=False)
        out = layers.matmul(weights, v)
        return out

    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)

    if cache is not None:  # use cache and concat time steps
        # Since the inplace reshape in __split_heads changes the shape of k and
        # v, which is the cache input for next time step, reshape the cache
        # input from the previous time step first.
        k = cache["k"] = layers.concat(
            [layers.reshape(
                cache["k"], shape=[0, 0, d_model]), k], axis=1)
        v = cache["v"] = layers.concat(
            [layers.reshape(
                cache["v"], shape=[0, 0, d_model]), v], axis=1)

    q = __split_heads(q, n_head)
    k = __split_heads(k, n_head)
    v = __split_heads(v, n_head)

    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
                                                  dropout_rate)

    out = __combine_heads(ctx_multiheads)

    # Project back to the model size.
    proj_out = layers.fc(input=out,
                         size=d_model,
                         num_flatten_dims=2,
                         param_attr=fluid.ParamAttr(
                             name=name + '_output_fc.w_0',
                             initializer=param_initializer),
                         bias_attr=name + '_output_fc.b_0')
    return proj_out

In [7]

def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
    """
    Position-wise Feed-Forward Networks.
    This module consists of two linear transformations with a ReLU activation
    in between, which is applied to each position separately and identically.
    """
    hidden = layers.fc(input=x, size=d_inner_hid, num_flatten_dims=2, act=hidden_act,
                       param_attr=fluid.ParamAttr(
                           name=name + '_fc_0.w_0',
                           initializer=param_initializer),
                       bias_attr=name + '_fc_0.b_0')
    if dropout_rate:
        hidden = layers.dropout(hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
    out = layers.fc(input=hidden, size=d_hid, num_flatten_dims=2,
                    param_attr=fluid.ParamAttr(
                        name=name + '_fc_1.w_0', initializer=param_initializer),
                    bias_attr=name + '_fc_1.b_0')
    return out

In [8]

def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
                           name=''):
    """
    Add residual connection, layer normalization and dropout to the out tensor
    optionally according to the value of process_cmd.
    This will be used before or after multi-head attention and position-wise
    feed-forward networks.
    """
    for cmd in process_cmd:
        if cmd == "a":  # add residual connection
            out = out + prev_out if prev_out else out
        elif cmd == "n":  # add layer normalization
            out_dtype = out.dtype
            if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float32")
            out = layers.layer_norm(
                out,
                begin_norm_axis=len(out.shape) - 1,
                param_attr=fluid.ParamAttr(
                    name=name + '_layer_norm_scale',
                    initializer=fluid.initializer.Constant(1.)),
                bias_attr=fluid.ParamAttr(
                    name=name + '_layer_norm_bias',
                    initializer=fluid.initializer.Constant(0.)))
            if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float16")
        elif cmd == "d":  # add dropout
            if dropout_rate:
                out = layers.dropout(
                    out,
                    dropout_prob=dropout_rate,
                    dropout_implementation="upscale_in_train",
                    is_test=False)
    return out


pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
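The process_cmd string controls which operations run and in what order: 'a' adds the residual connection, 'n' applies layer normalization, and 'd' applies dropout. In this notebook, ErnieModel passes 'nd' for the embedding pre-processing and builds the encoder with preprocess_cmd="" and postprocess_cmd="dan". A rough NumPy analogue of the dispatch (ignoring the trainable scale/bias and the upscale_in_train dropout scaling):

import numpy as np

def pre_post_process(prev_out, out, process_cmd, dropout_rate=0.0):
    for cmd in process_cmd:
        if cmd == 'a' and prev_out is not None:   # residual connection
            out = out + prev_out
        elif cmd == 'n':                          # layer normalization over the last axis
            mean = out.mean(axis=-1, keepdims=True)
            std = out.std(axis=-1, keepdims=True)
            out = (out - mean) / (std + 1e-5)
        elif cmd == 'd' and dropout_rate:         # dropout (simplified)
            out = out * (np.random.rand(*out.shape) >= dropout_rate)
    return out

x = np.random.rand(2, 4, 8)   # e.g. the sub-layer input
y = np.random.rand(2, 4, 8)   # e.g. the sub-layer output
print(pre_post_process(x, y, 'dan', 0.1).shape)  # (2, 4, 8)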

In [9]

def encoder_layer(enc_input, attn_bias, n_head, d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout,
                  attention_dropout, relu_dropout, hidden_act, preprocess_cmd="n", postprocess_cmd="da",
                  param_initializer=None, name=''):
    """The encoder layers that can be stacked to form a deep encoder.
    This module consists of a multi-head (self) attention sub-layer followed by
    position-wise feed-forward networks, both of which are wrapped with
    post_process_layer to add residual connection, layer normalization
    and dropout.
    """
    attn_output = multi_head_attention(
        pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
        None, None, attn_bias, d_key, d_value, d_model, n_head, attention_dropout,
        param_initializer=param_initializer, name=name + '_multi_head_att')
    attn_output = post_process_layer(enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
    ffd_output = positionwise_feed_forward(
        pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
        d_inner_hid, d_model, relu_dropout, hidden_act, param_initializer=param_initializer,
        name=name + '_ffn')
    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')


def encoder(enc_input, attn_bias, n_layer, n_head, d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout,
            attention_dropout, relu_dropout, hidden_act, preprocess_cmd="n", postprocess_cmd="da",
            param_initializer=None, name=''):
    """
    The encoder is composed of a stack of identical layers returned by calling
    encoder_layer.
    """
    for i in range(n_layer):
        enc_output = encoder_layer(enc_input, attn_bias, n_head, d_key, d_value, d_model, d_inner_hid,
            prepostprocess_dropout, attention_dropout, relu_dropout, hidden_act, preprocess_cmd,
            postprocess_cmd, param_initializer=param_initializer, name=name + '_layer_' + str(i))
        enc_input = enc_output
    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")

    return enc_output

3.3 Encoder and classifier definitions

The following cells define the encoder and the classifier:

  1. ernie_encoder: builds sentence and token embeddings from the outputs of ErnieModel
  2. create_ernie_model: defines the classification network, which takes the embeddings as input and classifies them with a fully connected layer + softmax

In [10]

def ernie_encoder(ernie_inputs, ernie_config):
    """return sentence embedding and token embeddings"""

    ernie = ErnieModel(
        src_ids=ernie_inputs["src_ids"],
        position_ids=ernie_inputs["pos_ids"],
        sentence_ids=ernie_inputs["sent_ids"],
        input_mask=ernie_inputs["input_mask"],
        config=ernie_config)

    enc_out = ernie.get_sequence_output()
    unpad_enc_out = fluid.layers.sequence_unpad(
        enc_out, length=ernie_inputs["seq_lens"])
    cls_feats = ernie.get_pooled_output()

    embeddings = {
        "sentence_embeddings": cls_feats,
        "token_embeddings": unpad_enc_out,
    }

    for k, v in embeddings.items():
        v.persistable = True

    return embeddings


def create_ernie_model(args,
                 embeddings,
                 labels,
                 is_prediction=False):

    """
    Create Model for sentiment classification based on ERNIE encoder
    """
    sentence_embeddings = embeddings["sentence_embeddings"]
    token_embeddings = embeddings["token_embeddings"]

    cls_feats = fluid.layers.dropout(
        x=sentence_embeddings,
        dropout_prob=0.1,
        dropout_implementation="upscale_in_train")
    logits = fluid.layers.fc(
        input=cls_feats,
        size=args['num_labels'],
        param_attr=fluid.ParamAttr(
            name="cls_out_w",
            initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
        bias_attr=fluid.ParamAttr(
            name="cls_out_b", initializer=fluid.initializer.Constant(0.)))

    ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
        logits=logits, label=labels, return_softmax=True)
    if is_prediction:
        return probs
    loss = fluid.layers.mean(x=ce_loss)

    num_seqs = fluid.layers.create_tensor(dtype='int64')
    accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)

    return loss, accuracy, num_seqs

3.4 Word segmentation code

The following three cells define the word segmentation (tokenizer) classes:

  1. FullTokenizer: the complete tokenizer used by the data-reading code; it combines BasicTokenizer and WordpieceTokenizer
  2. BasicTokenizer: basic tokenization, including punctuation splitting, lower-casing, etc.
  3. WordpieceTokenizer: WordPiece (sub-word) tokenization

In [11]

class FullTokenizer(object):
    """Runs end-to-end tokenziation."""

    def __init__(self, vocab_file, do_lower_case=True):
        self.vocab = load_vocab(vocab_file)
        self.inv_vocab = {v: k for k, v in self.vocab.items()}
        self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

    def tokenize(self, text):
        split_tokens = []
        for token in self.basic_tokenizer.tokenize(text):
            for sub_token in self.wordpiece_tokenizer.tokenize(token):
                split_tokens.append(sub_token)

        return split_tokens

    def convert_tokens_to_ids(self, tokens):
        return convert_by_vocab(self.vocab, tokens)

In [12]

class BasicTokenizer(object):
    """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""

    def __init__(self, do_lower_case=True):
        """Constructs a BasicTokenizer.
        Args:
            do_lower_case: Whether to lower case the input.
        """
        self.do_lower_case = do_lower_case

    def tokenize(self, text):
        """Tokenizes a piece of text."""
        text = convert_to_unicode(text)
        text = self._clean_text(text)

        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
        # matter since the English models were not trained on any Chinese data
        # and generally don't have any Chinese data in them (there are Chinese
        # characters in the vocabulary because Wikipedia does have some Chinese
        # words in the English Wikipedia.).
        text = self._tokenize_chinese_chars(text)

        orig_tokens = whitespace_tokenize(text)
        split_tokens = []
        for token in orig_tokens:
            if self.do_lower_case:
                token = token.lower()
                token = self._run_strip_accents(token)
            split_tokens.extend(self._run_split_on_punc(token))

        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens

    def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)

    def _run_split_on_punc(self, text):
        """Splits punctuation on a piece of text."""
        chars = list(text)
        i = 0
        start_new_word = True
        output = []
        while i < len(chars):
            char = chars[i]
            if _is_punctuation(char):
                output.append([char])
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])
                start_new_word = False
                output[-1].append(char)
            i += 1

        return ["".join(x) for x in output]

    def _tokenize_chinese_chars(self, text):
        """Adds whitespace around any CJK character."""
        output = []
        for char in text:
            cp = ord(char)
            if self._is_chinese_char(cp):
                output.append(" ")
                output.append(char)
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

    def _is_chinese_char(self, cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # This defines a "chinese character" as anything in the CJK Unicode block:
        #     https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        #
        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
        # despite its name. The modern Korean Hangul alphabet is a different block,
        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
        # space-separated words, so they are not treated specially and handled
        # like the all of the other languages.
        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
            (cp >= 0x3400 and cp <= 0x4DBF) or  #
            (cp >= 0x20000 and cp <= 0x2A6DF) or  #
            (cp >= 0x2A700 and cp <= 0x2B73F) or  #
            (cp >= 0x2B740 and cp <= 0x2B81F) or  #
            (cp >= 0x2B820 and cp <= 0x2CEAF) or
            (cp >= 0xF900 and cp <= 0xFAFF) or  #
            (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
            return True

        return False

    def _clean_text(self, text):
        """Performs invalid character removal and whitespace cleanup on text."""
        output = []
        for char in text:
            cp = ord(char)
            if cp == 0 or cp == 0xfffd or _is_control(char):
                continue
            if _is_whitespace(char):
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

In [13]

class WordpieceTokenizer(object):
    """Runs WordPiece tokenziation."""

    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.
        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary.
        For example:
            input = "unaffable"
            output = ["un", "##aff", "##able"]
        Args:
            text: A single token or whitespace separated tokens. This should have
                already been passed through `BasicTokenizer`.
        Returns:
            A list of wordpiece tokens.
        """

        text = convert_to_unicode(text)

        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue

            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end

            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens

3.5 Word segmentation auxiliary code

The following cell defines auxiliary functions used by the tokenizers, including convert_to_unicode, whitespace_tokenize, etc.

In [14]

def convert_to_unicode(text):
    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
    if six.PY3:
        if isinstance(text, str):
            return text
        elif isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    elif six.PY2:
        if isinstance(text, str):
            return text.decode("utf-8", "ignore")
        elif isinstance(text, unicode):
            return text
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    else:
        raise ValueError("Not running on Python2 or Python 3?")


def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    fin = io.open(vocab_file, encoding="utf8")
    for num, line in enumerate(fin):
        items = convert_to_unicode(line.strip()).split("\t")
        if len(items) > 2:
            break
        token = items[0]
        index = items[1] if len(items) == 2 else num
        token = token.strip()
        vocab[token] = int(index)
    return vocab


def convert_by_vocab(vocab, items):
    """Converts a sequence of [tokens|ids] using the vocab."""
    output = []
    for item in items:
        output.append(vocab[item])
    return output


def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a peice of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens


def _is_whitespace(char):
    """Checks whether `chars` is a whitespace character."""
    # \t, \n, and \r are technically control characters but we treat them
    # as whitespace since they are generally considered as such.
    if char == " " or char == "\t" or char == "\n" or char == "\r":
        return True
    cat = unicodedata.category(char)
    if cat == "Zs":
        return True
    return False


def _is_control(char):
    """Checks whether `chars` is a control character."""
    # These are technically control characters but we count them as whitespace
    # characters.
    if char == "\t" or char == "\n" or char == "\r":
        return False
    cat = unicodedata.category(char)
    if cat.startswith("C"):
        return True
    return False


def _is_punctuation(char):
    """Checks whether `chars` is a punctuation character."""
    cp = ord(char)
    # We treat all non-letter/number ASCII as punctuation.
    # Characters such as "^", "$", and "`" are not in the Unicode
    # Punctuation class but we treat them as punctuation anyways, for
    # consistency.
    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
        (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
        return True
    cat = unicodedata.category(char)
    if cat.startswith("P"):
        return True
    return False
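With the tokenizer classes and helper functions above in place, a short usage sketch (assuming the earlier cells have been run; the vocab path is the dictionary from the dataset described in section 2):

# Tokenize one of the sample sentences and map the tokens to ids.
tokenizer = FullTokenizer(vocab_file='data/data9740/data/vocab.txt', do_lower_case=True)
tokens = tokenizer.tokenize('我 见到 你 很高兴')
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)     # Chinese characters are split individually, e.g. ['我', '见', '到', ...]
print(token_ids)  # the corresponding ids from vocab.txt (unknown tokens become [UNK])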

3.6 Data reading and preprocessing code

The following 4 cells define data readers and preprocessing code, including:

  1. BaseReader: data reader base class
  2. ClassifyReader: data reader for classification models, overriding the _read_tsv and _pad_batch_records methods
  3. pad_batch_data: pads each batch to the maximum sequence length and generates the corresponding position data and input mask
  4. ernie_pyreader: generate pyreader for training, validation and prediction

In [15]

class BaseReader(object):
    """BaseReader for classify and sequence labeling task"""

    def __init__(self,
                 vocab_path,
                 label_map_config=None,
                 max_seq_len=512,
                 do_lower_case=True,
                 in_tokens=False,
                 random_seed=None):
        self.max_seq_len = max_seq_len
        self.tokenizer = FullTokenizer(
            vocab_file=vocab_path, do_lower_case=do_lower_case)
        self.vocab = self.tokenizer.vocab
        self.pad_id = self.vocab["[PAD]"]
        self.cls_id = self.vocab["[CLS]"]
        self.sep_id = self.vocab["[SEP]"]
        self.in_tokens = in_tokens

        np.random.seed(random_seed)

        self.current_example = 0
        self.current_epoch = 0
        self.num_examples = 0

        if label_map_config:
            with open(label_map_config) as f:
                self.label_map = json.load(f)
        else:
            self.label_map = None

    def _read_tsv(self, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with io.open(input_file, "r", encoding="utf8") as f:
            reader = csv_reader(f, delimiter="\t")
            headers = next(reader)
            Example = namedtuple('Example', headers)

            examples = []
            for line in reader:
                example = Example(*line)
                examples.append(example)
            return examples

    def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
        """Truncates a sequence pair in place to the maximum length."""

        # This is a simple heuristic which will always truncate the longer sequence
        # one token at a time. This makes more sense than truncating an equal percent
        # of tokens from each, since if one sequence is very short then each token
        # that's truncated likely contains more information than a longer sequence.
        while True:
            total_length = len(tokens_a) + len(tokens_b)
            if total_length <= max_length:
                break
            if len(tokens_a) > len(tokens_b):
                tokens_a.pop()
            else:
                tokens_b.pop()

    def _convert_example_to_record(self, example, max_seq_length, tokenizer):
        """Converts a single `Example` into a single `Record`."""

        text_a = convert_to_unicode(example.text_a)
        tokens_a = tokenizer.tokenize(text_a)
        tokens_b = None
        if "text_b" in example._fields:
            text_b = convert_to_unicode(example.text_b)
            tokens_b = tokenizer.tokenize(text_b)

        if tokens_b:
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[0:(max_seq_length - 2)]

        # The convention in BERT/ERNIE is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids: 0     0   0   0  0     0 0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = []
        text_type_ids = []
        tokens.append("[CLS]")
        text_type_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            text_type_ids.append(0)
        tokens.append("[SEP]")
        text_type_ids.append(0)

        if tokens_b:
            for token in tokens_b:
                tokens.append(token)
                text_type_ids.append(1)
            tokens.append("[SEP]")
            text_type_ids.append(1)

        token_ids = tokenizer.convert_tokens_to_ids(tokens)
        position_ids = list(range(len(token_ids)))

        if self.label_map:
            label_id = self.label_map[example.label]
        else:
            label_id = example.label

        Record = namedtuple(
            'Record',
            ['token_ids', 'text_type_ids', 'position_ids', 'label_id', 'qid'])

        qid = None
        if "qid" in example._fields:
            qid = example.qid

        record = Record(
            token_ids=token_ids,
            text_type_ids=text_type_ids,
            position_ids=position_ids,
            label_id=label_id,
            qid=qid)
        return record

    def _prepare_batch_data(self, examples, batch_size, phase=None):
        """generate batch records"""
        batch_records, max_len = [], 0
        for index, example in enumerate(examples):
            if phase == "train":
                self.current_example = index
            record = self._convert_example_to_record(example, self.max_seq_len,
                                                     self.tokenizer)
            max_len = max(max_len, len(record.token_ids))
            if self.in_tokens:
                to_append = (len(batch_records) + 1) * max_len <= batch_size
            else:
                to_append = len(batch_records) < batch_size
            if to_append:
                batch_records.append(record)
            else:
                yield self._pad_batch_records(batch_records)
                batch_records, max_len = [record], len(record.token_ids)

        if batch_records:
            yield self._pad_batch_records(batch_records)

    def get_num_examples(self, input_file):
        """return total number of examples"""
        examples = self._read_tsv(input_file)
        return len(examples)
    
    def get_examples(self, input_file):
        examples = self._read_tsv(input_file)
        return examples

    def data_generator(self,
                       input_file,
                       batch_size,
                       epoch,
                       shuffle=True,
                       phase=None):
        """return generator which yields batch data for pyreader"""
        examples = self._read_tsv(input_file)

        def _wrapper():
            for epoch_index in range(epoch):
                if phase == "train":
                    self.current_example = 0
                    self.current_epoch = epoch_index
                if shuffle:
                    np.random.shuffle(examples)

                for batch_data in self._prepare_batch_data(
                        examples, batch_size, phase=phase):
                    yield batch_data

        return _wrapper

In [16]

class ClassifyReader(BaseReader):
    """ClassifyReader"""

    def _read_tsv(self, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with io.open(input_file, "r", encoding="utf8") as f:
            reader = csv_reader(f, delimiter="\t")
            headers = next(reader)
            text_indices = [
                index for index, h in enumerate(headers) if h != "label"
            ]
            Example = namedtuple('Example', headers)

            examples = []
            for line in reader:
                for index, text in enumerate(line):
                    if index in text_indices:
                        line[index] = text.replace(' ', '')
                example = Example(*line)
                examples.append(example)
            return examples

    def _pad_batch_records(self, batch_records):
        batch_token_ids = [record.token_ids for record in batch_records]
        batch_text_type_ids = [record.text_type_ids for record in batch_records]
        batch_position_ids = [record.position_ids for record in batch_records]
        batch_labels = [record.label_id for record in batch_records]
        batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1])

        # padding
        padded_token_ids, input_mask, seq_lens = pad_batch_data(
            batch_token_ids,
            pad_idx=self.pad_id,
            return_input_mask=True,
            return_seq_lens=True)
        padded_text_type_ids = pad_batch_data(
            batch_text_type_ids, pad_idx=self.pad_id)
        padded_position_ids = pad_batch_data(
            batch_position_ids, pad_idx=self.pad_id)

        return_list = [
            padded_token_ids, padded_text_type_ids, padded_position_ids,
            input_mask, batch_labels, seq_lens
        ]

        return return_list

In [17]

def pad_batch_data(insts,
                   pad_idx=0,
                   return_pos=False,
                   return_input_mask=False,
                   return_max_len=False,
                   return_num_token=False,
                   return_seq_lens=False):
    """
    Pad the instances to the max sequence length in batch, and generate the
    corresponding position data and input mask.
    """
    return_list = []
    max_len = max(len(inst) for inst in insts)
    # Any token included in dict can be used to pad, since the paddings' loss
    # will be masked out by weights and make no effect on parameter gradients.

    inst_data = np.array(
        [inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
    return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]

    # position data
    if return_pos:
        inst_pos = np.array([
            list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
            for inst in insts
        ])

        return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]

    if return_input_mask:
        # This is used to avoid attention on paddings.
        input_mask_data = np.array([[1] * len(inst) + [0] *
                                    (max_len - len(inst)) for inst in insts])
        input_mask_data = np.expand_dims(input_mask_data, axis=-1)
        return_list += [input_mask_data.astype("float32")]

    if return_max_len:
        return_list += [max_len]

    if return_num_token:
        num_token = 0
        for inst in insts:
            num_token += len(inst)
        return_list += [num_token]

    if return_seq_lens:
        seq_lens = np.array([len(inst) for inst in insts])
        return_list += [seq_lens.astype("int64").reshape([-1])]

    return return_list if len(return_list) > 1 else return_list[0]
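A quick sketch of what pad_batch_data returns for a toy batch (pad id 0 here; in ClassifyReader the pad id comes from the [PAD] entry of the vocabulary):

# Toy batch: three sequences of different lengths.
toy_batch = [[5, 6, 7], [8, 9], [10]]
padded, mask, seq_lens = pad_batch_data(
    toy_batch, pad_idx=0, return_input_mask=True, return_seq_lens=True)
print(padded.shape)  # (3, 3, 1) - token ids padded to the batch max length
print(mask.shape)    # (3, 3, 1) - 1.0 for real tokens, 0.0 for padding
print(seq_lens)      # [3 2 1]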

In [18]

def ernie_pyreader(args, pyreader_name):
    """define standard ernie pyreader"""
    pyreader_name += '_' + ''.join(random.sample(string.ascii_letters + string.digits, 6))
    pyreader = fluid.layers.py_reader(
        capacity=50,
        shapes=[[-1, args['max_seq_len'], 1], [-1, args['max_seq_len'], 1],
                [-1, args['max_seq_len'], 1], [-1, args['max_seq_len'], 1], [-1, 1],
                [-1]],
        dtypes=['int64', 'int64', 'int64', 'float32', 'int64', 'int64'],
        lod_levels=[0, 0, 0, 0, 0, 0],
        name=pyreader_name,
        use_double_buffer=True)

    (src_ids, sent_ids, pos_ids, input_mask, labels,
     seq_lens) = fluid.layers.read_file(pyreader)

    ernie_inputs = {
        "src_ids": src_ids,
        "sent_ids": sent_ids,
        "pos_ids": pos_ids,
        "input_mask": input_mask,
        "seq_lens": seq_lens
    }
    return pyreader, ernie_inputs, labels

General parameter introduction

  1. Dataset related configuration
data_config = {
    'data_dir': 'data/data9740/data',
    'vocab_path': 'data/data9740/data/vocab.txt',
    'batch_size': 32,
    'random_seed': 0,
    'num_labels': 3,
    'max_seq_len': 512,
    'train_set': 'data/data9740/data/test.tsv',
    'test_set':  'data/data9740/data/test.tsv',
    'dev_set':   'data/data9740/data/dev.tsv',
    'infer_set': 'data/data9740/data/infer.tsv',
    'label_map_config': None,
    'do_lower_case': True,
}

Parameter introduction:

  • data_dir : dataset path, default 'data/data9740/data'
  • vocab_path : the path where vocab.txt is located, the default is 'data/data9740/data/vocab.txt'
  • batch_size : batch size for training and validation, default: 32
  • random_seed : random seed, default 0
  • num_labels : number of categories, default 3
  • max_seq_len : maximum sequence length (in tokens), default 512
  • train_set : training set path, default 'data/data9740/data/test.tsv'
  • test_set : test set path, default 'data/data9740/data/test.tsv'
  • dev_set : Validation set path, default 'data/data9740/data/dev.tsv'
  • infer_set : prediction set path, default 'data/data9740/data/infer.tsv'
  • label_map_config : label_map path, default None
  • do_lower_case : Whether to perform additional lowercase processing on the input, the default is True

     
  2. ERNIE network structure related configuration
ernie_net_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "relu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 513,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 18000,
}

Parameter introduction:

  • attention_probs_dropout_prob : attention block dropout ratio, default 0.1
  • hidden_act : Hidden layer activation function, default 'relu'
  • hidden_dropout_prob : Hidden layer dropout ratio, default 0.1
  • hidden_size : hidden layer size, default 768
  • initializer_range : scale of the truncated normal parameter initializer, default 0.02
  • max_position_embeddings : maximum length of the position sequence, default 513
  • num_attention_heads : number of attention heads, default 12
  • num_hidden_layers : number of hidden layers, default 12
  • type_vocab_size : number of sentence categories, default 2
  • vocab_size : dictionary length, default 18000
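As a rough cross-check of the model size implied by this configuration, the trainable parameters can be estimated from the layer definitions above (embeddings, per-layer attention and feed-forward weights, layer norms, and the pooled_fc layer). This is a back-of-the-envelope sketch to be run after the configuration cell below, not an official figure:

cfg = ernie_net_config
h, v = cfg['hidden_size'], cfg['vocab_size']
embeddings = (v + cfg['max_position_embeddings'] + cfg['type_vocab_size']) * h
per_layer = (4 * (h * h + h)          # Q/K/V/output projections with biases
             + (h * 4 * h + 4 * h)    # feed-forward expansion
             + (4 * h * h + h)        # feed-forward projection back
             + 2 * 2 * h)             # two layer norms (scale + bias)
encoder = cfg['num_hidden_layers'] * per_layer + 2 * h  # plus the post_encoder layer norm
pooler = h * h + h
print('approx. trainable parameters: %.1fM' % ((embeddings + encoder + pooler) / 1e6))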

In [19]

# Dataset configuration
data_config = {
    'data_dir': 'data/data9740/data',               # Directory path to training data.
    'vocab_path': 'pretrained_model/ernie_finetune/vocab.txt',   # Vocabulary path.
    'batch_size': 32,   # Total examples' number in batch for training.
    'random_seed': 0,   # Random seed.
    'num_labels': 3,    # label number
    'max_seq_len': 512, # Number of words of the longest sequence.
    'train_set': 'data/data9740/data/test.tsv',   # Path to training data.
    'test_set':  'data/data9740/data/test.tsv',   # Path to test data.
    'dev_set':   'data/data9740/data/dev.tsv',    # Path to validation data.
    'infer_set': 'data/data9740/data/infer.tsv',  # Path to infer data.
    'label_map_config': None,   # label_map_path
    'do_lower_case': True,      # Whether to lower case the input text. Should be True for uncased models and False for cased models.
}

# ERNIE network structure configuration
ernie_net_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "relu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 513,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 18000,
}

4. Model training

Users can fine-tune Baidu's open-source dialogue emotion recognition models on their own data to obtain better results. Baidu provides two pre-trained models, TextCNN and ERNIE. The fine-tuning procedure for ERNIE is as follows:

  1. Download the pretrained model
  2. Change parameters
    • 'init_checkpoint':'pretrained_model/ernie_finetune/params'
  3. Execute "ERNIE training code"
     

Configuration related to the training phase

train_config = {
    'init_checkpoint': 'pretrained_model/ernie_finetune/params',
    'output_dir': 'train_model',
    
    'epoch': 10,
    'save_steps': 100,
    'validation_steps': 100,
    'lr': 0.00002,
    
    'skip_steps': 10,
    'verbose': False,
    
    'use_cuda': True,
}

Parameter introduction:

  • init_checkpoint : path of the pre-trained model used for initialization, default 'pretrained_model/ernie_finetune/params'
  • output_dir : model cache path, default 'train_model'
  • epoch : number of training rounds, default 10
  • save_steps : model cache interval, default 100
  • validation_steps : validation interval, default 100
  • lr : learning rate, default 0.00002
  • skip_steps : log output interval, default 10
  • verbose : Whether to output verbose logs, the default is False
  • use_cuda : whether to use GPU, default True

In [20]

# Download the pre-trained model
!mkdir pretrained_model
# Download and unpack the ERNIE pre-trained model
!cd pretrained_model && wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/emotion_detection_ernie_finetune-1.0.0.tar.gz
!cd pretrained_model && tar xzf emotion_detection_ernie_finetune-1.0.0.tar.gz
mkdir: cannot create directory ‘pretrained_model’: File exists
--2020-02-12 15:26:18--  https://baidu-nlp.bj.bcebos.com/emotion_detection_ernie_finetune-1.0.0.tar.gz
Resolving baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)... 182.61.200.195, 182.61.200.229
Connecting to baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)|182.61.200.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 744568046 (710M) [application/x-gzip]
Saving to: ‘emotion_detection_ernie_finetune-1.0.0.tar.gz.2’

emotion_detection_e 100%[===================>] 710.08M  72.2MB/s    in 15s     

2020-02-12 15:26:33 (48.4 MB/s) - ‘emotion_detection_ernie_finetune-1.0.0.tar.gz.2’ saved [744568046/744568046]


gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

In [21]

# ERNIE training code
train_config = {
    'init_checkpoint': 'pretrained_model/ernie_finetune/params',        # Init checkpoint to resume training from.
    # 'init_checkpoint': 'None',
    'output_dir': 'train_model',    # Directory path to save checkpoints
    
    'epoch': 5,                # Number of epochs for training.
    'save_steps': 100,          # The steps interval to save checkpoints.
    'validation_steps': 100,    # The steps interval to evaluate model performance.
    'lr': 0.00002,              # The Learning rate value for training.
    
    'skip_steps': 10,   # The steps interval to print loss.
    'verbose': False,   # Whether to output verbose log
    
    'use_cuda':True,   # If set, use GPU for training.
}
train_config.update(data_config)


def evaluate(exe, test_program, test_pyreader, fetch_list, eval_phase):
    """
    Evaluation Function
    """
    test_pyreader.start()
    total_cost, total_acc, total_num_seqs = [], [], []
    time_begin = time.time()
    while True:
        try:
            # Run one evaluation step
            np_loss, np_acc, np_num_seqs = exe.run(program=test_program,
                                                   fetch_list=fetch_list,
                                                   return_numpy=False)
            np_loss = np.array(np_loss)
            np_acc = np.array(np_acc)
            np_num_seqs = np.array(np_num_seqs)
            total_cost.extend(np_loss * np_num_seqs)
            total_acc.extend(np_acc * np_num_seqs)
            total_num_seqs.extend(np_num_seqs)
        except fluid.core.EOFException:
            test_pyreader.reset()
            break
    time_end = time.time()
    logger.info("[%s evaluation] avg loss: %f, ave acc: %f, elapsed time: %f s" %
        (eval_phase, np.sum(total_cost) / np.sum(total_num_seqs),
        np.sum(total_acc) / np.sum(total_num_seqs), time_end - time_begin))


def main(config):
    """
    Main Function
    """

    # Define the executor
    if config['use_cuda']:
        place = fluid.CUDAPlace(0)
        dev_count = fluid.core.get_cuda_device_count()
    else:
        place = fluid.CPUPlace()
        dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
    exe = fluid.Executor(place)

    # Define the data reader
    reader = ClassifyReader(
        vocab_path=config['vocab_path'],
        label_map_config=config['label_map_config'],
        max_seq_len=config['max_seq_len'],
        do_lower_case=config['do_lower_case'],
        random_seed=config['random_seed'])

    startup_prog = fluid.Program()
    if config['random_seed'] is not None:
        startup_prog.random_seed = config['random_seed']

    # Initialization for the training phase
    train_data_generator = reader.data_generator(
        input_file=config['train_set'],
        batch_size=config['batch_size'],
        epoch=config['epoch'],
        shuffle=True,
        phase="train")

    num_train_examples = reader.get_num_examples(config['train_set'])

    # Total training steps = epochs * number of training examples / batch size / device count
    max_train_steps = config['epoch'] * num_train_examples // config['batch_size'] // dev_count + 1

    logger.info("Device count: %d" % dev_count)
    logger.info("Num train examples: %d" % num_train_examples)
    logger.info("Max train steps: %d" % max_train_steps)

    train_program = fluid.Program()

    with fluid.program_guard(train_program, startup_prog):
        with fluid.unique_name.guard():
            # create ernie_pyreader
            train_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='train_reader')
            embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)

            # user defined model based on ernie embeddings
            loss, accuracy, num_seqs = create_ernie_model(config, embeddings, labels=labels, is_prediction=False)

            """
            sgd_optimizer = fluid.optimizer.Adagrad(learning_rate=config['lr'])
            sgd_optimizer.minimize(loss)
            """
            optimizer = fluid.optimizer.Adam(learning_rate=config['lr'])
            optimizer.minimize(loss)

    if config['verbose']:
        lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
            program=train_program, batch_size=config['batch_size'])
        logger.info("Theoretical memory usage in training: %.3f - %.3f %s" %
            (lower_mem, upper_mem, unit))

    # Evaluation-phase initialization
    test_prog = fluid.Program()
    with fluid.program_guard(test_prog, startup_prog):
        with fluid.unique_name.guard():
            # create ernie_pyreader
            test_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='eval_reader')
            embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)

            # user defined model based on ernie embeddings
            loss, accuracy, num_seqs = create_ernie_model(config, embeddings, labels=labels, is_prediction=False)

    test_prog = test_prog.clone(for_test=True)
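    # clone(for_test=True) prunes the backward and optimizer ops, so the same network
    # definition is reused for evaluation without updating any parameters.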

    exe.run(startup_prog)

    # Load the pre-trained model
    # if config['init_checkpoint']:
    #     init_checkpoint(exe, config['init_checkpoint'], main_program=train_program)

    # Model training loop
    if not os.path.exists(config['output_dir']):
        os.mkdir(config['output_dir'])
    
    logger.info('Start training')
    train_pyreader.decorate_tensor_provider(train_data_generator)
    train_pyreader.start()
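    # The py_reader feeds batches asynchronously from train_data_generator; when the
    # generator is exhausted (after the final epoch), exe.run raises
    # fluid.core.EOFException, which ends the training loop below.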
    steps = 0
    total_cost, total_acc, total_num_seqs = [], [], []
    time_begin = time.time()
    while True:
        try:
            steps += 1
            if steps % config['skip_steps'] == 0:
                fetch_list = [loss.name, accuracy.name, num_seqs.name]
            else:
                fetch_list = []

            # Run one training step
            outputs = exe.run(program=train_program, fetch_list=fetch_list, return_numpy=False)
            if steps % config['skip_steps'] == 0:
                # Log training statistics
                np_loss, np_acc, np_num_seqs = outputs
                np_loss = np.array(np_loss)
                np_acc = np.array(np_acc)
                np_num_seqs = np.array(np_num_seqs)
                total_cost.extend(np_loss * np_num_seqs)
                total_acc.extend(np_acc * np_num_seqs)
                total_num_seqs.extend(np_num_seqs)

                if config['verbose']:
                    verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size()
                    logger.info(verbose)

                time_end = time.time()
                used_time = time_end - time_begin
                logger.info("step: %d, avg loss: %f, "
                    "avg acc: %f, speed: %f steps/s" %
                    (steps, np.sum(total_cost) / np.sum(total_num_seqs),
                    np.sum(total_acc) / np.sum(total_num_seqs),
                    config['skip_steps'] / used_time))
                total_cost, total_acc, total_num_seqs = [], [], []
                time_begin = time.time()

            if steps % config['save_steps'] == 0:
                # Save a checkpoint
                # fluid.io.save_persistables(exe, config['output_dir'], train_program)
                fluid.save(train_program, os.path.join(config['output_dir'], "checkpoint"))
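                # fluid.save stores the parameters (and optimizer state) under
                # output_dir with the "checkpoint" prefix; the inference cell below
                # reloads them with fluid.load using the same prefix.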
            if steps % config['validation_steps'] == 0:
                # Run evaluation on the dev set
                test_pyreader.decorate_tensor_provider(
                    reader.data_generator(
                        input_file=config['dev_set'],
                        batch_size=config['batch_size'],
                        phase='dev',
                        epoch=1,
                        shuffle=False))

                evaluate(exe, test_prog, test_pyreader,
                        [loss.name, accuracy.name, num_seqs.name],
                        "dev")

        except fluid.core.EOFException:
            # End of training: save the final checkpoint
            # fluid.io.save_persistables(exe, config['output_dir'], train_program)
            fluid.save(train_program, os.path.join(config['output_dir'], "checkpoint"))
            train_pyreader.reset()
            logger.info('Training end.')
            break

    # Final evaluation on the test set
    test_pyreader.decorate_tensor_provider(
        reader.data_generator(
            input_file=config['test_set'],
            batch_size=config['batch_size'], phase='test', epoch=1,
            shuffle=False))
    logger.info("Final validation result:")
    evaluate(exe, test_prog, test_pyreader,
        [loss.name, accuracy.name, num_seqs.name], "test")


if __name__ == "__main__":
    init_log_config()
    print_arguments(train_config)
    main(train_config)
2020-02-12 15:26:37,110 - <ipython-input-4-ad5dfe890543>[line:7] - INFO: -----------  Configuration Arguments -----------
2020-02-12 15:26:37,112 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: init_checkpoint: pretrained_model/ernie_finetune/params
2020-02-12 15:26:37,113 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: output_dir: train_model
2020-02-12 15:26:37,114 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: epoch: 5
2020-02-12 15:26:37,114 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: save_steps: 100
2020-02-12 15:26:37,115 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: validation_steps: 100
2020-02-12 15:26:37,115 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: lr: 2e-05
2020-02-12 15:26:37,116 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: skip_steps: 10
2020-02-12 15:26:37,117 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: verbose: False
2020-02-12 15:26:37,118 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: use_cuda: True
2020-02-12 15:26:37,118 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: data_dir: data/data9740/data
2020-02-12 15:26:37,119 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: vocab_path: pretrained_model/ernie_finetune/vocab.txt
2020-02-12 15:26:37,119 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: batch_size: 32
2020-02-12 15:26:37,120 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: random_seed: 0
2020-02-12 15:26:37,120 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: num_labels: 3
2020-02-12 15:26:37,121 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: max_seq_len: 512
2020-02-12 15:26:37,121 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: train_set: data/data9740/data/test.tsv
2020-02-12 15:26:37,121 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: test_set: data/data9740/data/test.tsv
2020-02-12 15:26:37,122 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: dev_set: data/data9740/data/dev.tsv
2020-02-12 15:26:37,122 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: infer_set: data/data9740/data/infer.tsv
2020-02-12 15:26:37,123 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: label_map_config: None
2020-02-12 15:26:37,123 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: do_lower_case: True
2020-02-12 15:26:37,124 - <ipython-input-4-ad5dfe890543>[line:10] - INFO: ------------------------------------------------
2020-02-12 15:26:37,155 - <ipython-input-21-a69ca2beccb7>[line:87] - INFO: Device count: 1
2020-02-12 15:26:37,156 - <ipython-input-21-a69ca2beccb7>[line:88] - INFO: Num train examples: 1036
2020-02-12 15:26:37,157 - <ipython-input-21-a69ca2beccb7>[line:89] - INFO: Max train steps: 162
2020-02-12 15:26:37,157 - io.py[line:690] - WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-02-12 15:26:38,624 - io.py[line:690] - WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-02-12 15:26:41,918 - <ipython-input-21-a69ca2beccb7>[line:138] - INFO: Start training
2020-02-12 15:26:43,916 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 10, avg loss: 1.100799, avg acc: 0.656250, speed: 5.022962 steps/s
2020-02-12 15:26:45,749 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 20, avg loss: 0.826600, avg acc: 0.687500, speed: 5.464362 steps/s
2020-02-12 15:26:47,657 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 30, avg loss: 0.786590, avg acc: 0.750000, speed: 5.247581 steps/s
2020-02-12 15:26:49,552 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 40, avg loss: 0.942438, avg acc: 0.593750, speed: 5.286641 steps/s
2020-02-12 15:26:51,397 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 50, avg loss: 0.457657, avg acc: 0.875000, speed: 5.426712 steps/s
2020-02-12 15:26:53,161 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 60, avg loss: 0.668453, avg acc: 0.718750, speed: 5.680988 steps/s
2020-02-12 15:26:54,994 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 70, avg loss: 1.101823, avg acc: 0.562500, speed: 5.463856 steps/s
2020-02-12 15:26:56,797 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 80, avg loss: 0.627734, avg acc: 0.812500, speed: 5.554073 steps/s
2020-02-12 15:26:58,488 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 90, avg loss: 0.610504, avg acc: 0.750000, speed: 5.927647 steps/s
2020-02-12 15:27:00,257 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 100, avg loss: 0.598749, avg acc: 0.781250, speed: 5.664717 steps/s
2020-02-12 15:27:15,284 - <ipython-input-21-a69ca2beccb7>[line:45] - INFO: [dev evaluation] avg loss: 0.637231, ave acc: 0.787037, elapsed time: 2.431484 s
2020-02-12 15:27:17,637 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 110, avg loss: 0.523219, avg acc: 0.843750, speed: 0.575587 steps/s
2020-02-12 15:27:19,406 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 120, avg loss: 0.484762, avg acc: 0.812500, speed: 5.671743 steps/s
2020-02-12 15:27:21,776 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 130, avg loss: 0.280636, avg acc: 0.937500, speed: 4.227057 steps/s
2020-02-12 15:27:24,124 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 140, avg loss: 0.624467, avg acc: 0.687500, speed: 4.264188 steps/s
2020-02-12 15:27:26,591 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 150, avg loss: 0.506643, avg acc: 0.875000, speed: 4.058757 steps/s
2020-02-12 15:27:28,928 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 160, avg loss: 0.584385, avg acc: 0.750000, speed: 4.284505 steps/s
2020-02-12 15:27:42,380 - <ipython-input-21-a69ca2beccb7>[line:201] - INFO: Training end.
2020-02-12 15:27:42,384 - <ipython-input-21-a69ca2beccb7>[line:210] - INFO: Final validation result:
2020-02-12 15:27:45,154 - <ipython-input-21-a69ca2beccb7>[line:45] - INFO: [test evaluation] avg loss: 0.436581, ave acc: 0.837838, elapsed time: 2.761349 s

5. Model prediction

In the prediction stage, the checkpoint saved during training is loaded and used to make predictions on the prediction set (infer.tsv). This behavior is controlled by the following parameters:
 

Forecast related configuration

infer_config = {
    'init_checkpoint': 'train_model',
    'use_cuda': True,
}

Parameter introduction:

  • init_checkpoint : directory of the checkpoint saved during training, default: 'train_model'
  • use_cuda : whether to use GPU for prediction, default: True

In [22]

# ERNIE inference code
infer_config = {
    'init_checkpoint': 'train_model',        # Checkpoint directory to load for prediction.
    'use_cuda': True,   # If set, use GPU for inference.
}
infer_config.update(data_config)


def init_checkpoint_infer(exe, init_checkpoint_path, main_program):
    """
    Load the saved checkpoint
    """
    assert os.path.exists(
        init_checkpoint_path), "[%s] cannot be found." % init_checkpoint_path

    # fluid.io.load_vars(
    #     exe,
    #     init_checkpoint_path,
    #     main_program=main_program,
    #     predicate=existed_persitables)
    fluid.load(main_program, os.path.join(init_checkpoint_path, "checkpoint"), exe)
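    # The "checkpoint" prefix passed to fluid.load must match the prefix used with
    # fluid.save in the training cell, i.e. the files saved under train_model/.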
    logger.info("Load model from {}".format(init_checkpoint_path))


def infer(exe, infer_program, infer_pyreader, fetch_list, infer_phase, examples):
    """Infer"""
    infer_pyreader.start()
    time_begin = time.time()
    example_idx = 0  # running offset into `examples`, so inputs stay aligned across batches
    while True:
        try:
            # Run one inference step
            batch_probs = exe.run(program=infer_program, fetch_list=fetch_list,
                                  return_numpy=True)
            for i, probs in enumerate(batch_probs[0]):
                logger.info("Probs: %f %f %f, prediction: %d, input: %s" %
                    (probs[0], probs[1], probs[2], np.argmax(probs), examples[example_idx + i]))
            example_idx += len(batch_probs[0])
        except fluid.core.EOFException:
            infer_pyreader.reset()
            break
    time_end = time.time()
    logger.info("[%s] elapsed time: %f s" % (infer_phase, time_end - time_begin))


def main(config):
    """
    Main Function
    """

    # Define the executor
    if config['use_cuda']:
        place = fluid.CUDAPlace(0)
        dev_count = fluid.core.get_cuda_device_count()
    else:
        place = fluid.CPUPlace()
        dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
    exe = fluid.Executor(place)

    # Define the data reader
    reader = ClassifyReader(
        vocab_path=config['vocab_path'],
        label_map_config=config['label_map_config'],
        max_seq_len=config['max_seq_len'],
        do_lower_case=config['do_lower_case'],
        random_seed=config['random_seed'])

    startup_prog = fluid.Program()
    if config['random_seed'] is not None:
        startup_prog.random_seed = config['random_seed']

    # Inference-phase initialization
    test_prog = fluid.Program()
    with fluid.program_guard(test_prog, startup_prog):
        with fluid.unique_name.guard():
            infer_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='infer_reader')
            embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)

            probs = create_ernie_model(config, embeddings, labels=labels, is_prediction=True)
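            # With is_prediction=True the classifier returns only the per-class
            # probabilities over the 3 emotion labels (no loss/accuracy); the infer
            # loop fetches probs and takes argmax to get the predicted class.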
    test_prog = test_prog.clone(for_test=True)

    exe.run(startup_prog)

    # Load the saved checkpoint
    if not config['init_checkpoint']:
        raise ValueError("args 'init_checkpoint' should be set if "
                         "only doing validation or infer!")
    init_checkpoint_infer(exe, config['init_checkpoint'], main_program=test_prog)

    # Run prediction on the infer set
    infer_pyreader.decorate_tensor_provider(
        reader.data_generator(
            input_file=config['infer_set'],
            batch_size=config['batch_size'],
            phase='infer',
            epoch=1,
            shuffle=False))
    logger.info("Final test result:")
    infer(exe, test_prog, infer_pyreader,
        [probs.name], "infer", reader.get_examples(config['infer_set']))

if __name__ == "__main__":
    init_log_config()
    print_arguments(infer_config)
    main(infer_config)
2020-02-12 15:27:45,174 - <ipython-input-4-ad5dfe890543>[line:7] - INFO: -----------  Configuration Arguments -----------
2020-02-12 15:27:45,175 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: init_checkpoint: train_model
2020-02-12 15:27:45,176 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: use_cuda: True
2020-02-12 15:27:45,177 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: data_dir: data/data9740/data
2020-02-12 15:27:45,178 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: vocab_path: pretrained_model/ernie_finetune/vocab.txt
2020-02-12 15:27:45,179 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: batch_size: 32
2020-02-12 15:27:45,179 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: random_seed: 0
2020-02-12 15:27:45,180 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: num_labels: 3
2020-02-12 15:27:45,180 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: max_seq_len: 512
2020-02-12 15:27:45,181 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: train_set: data/data9740/data/test.tsv
2020-02-12 15:27:45,182 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: test_set: data/data9740/data/test.tsv
2020-02-12 15:27:45,183 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: dev_set: data/data9740/data/dev.tsv
2020-02-12 15:27:45,183 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: infer_set: data/data9740/data/infer.tsv
2020-02-12 15:27:45,183 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: label_map_config: None
2020-02-12 15:27:45,184 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: do_lower_case: True
2020-02-12 15:27:45,184 - <ipython-input-4-ad5dfe890543>[line:10] - INFO: ------------------------------------------------
2020-02-12 15:27:45,210 - io.py[line:690] - WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-02-12 15:27:47,503 - <ipython-input-22-09c53c7a0ed3>[line:22] - INFO: Load model from train_model
2020-02-12 15:27:47,505 - <ipython-input-22-09c53c7a0ed3>[line:95] - INFO: Final test result:
2020-02-12 15:27:47,558 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.019586 0.875026 0.105388, prediction: 1, input: Example(label='1', text_a= 'I want to be objective')
2020-02-12 15:27:47,559 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.608523 0.325968 0.065509, prediction: 0, input: Example(label='0', text_a= 'Are you really talking nonsense')
2020-02-12 15:27:47,560 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.003523 0.947431 0.049045, prediction: 1, input: Example(label='1', text_a='口嗅会')
2020-02-12 15:27:47,560 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.014141 0.889832 0.096027, prediction: 1, input: Example(label='1', text_a= 'Every time it's the cousin who takes the nest because the nest is crazy')
2020-02-12 15:27:47,561 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.234133 0.636430 0.129437, prediction: 1, input: Example(label='0', text_a= 'Stop talking nonsense I'm asking you a question')
2020-02-12 15:27:47,561 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.014605 0.887870 0.097524, prediction: 1, input: Example(label='1', text_a= '4967 is the bank in Singapore')
2020-02-12 15:27:47,562 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.692878 0.215159 0.091963, prediction: 0, input: Example(label='2', text_a= 'Yes I like rabbits')
2020-02-12 15:27:47,562 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.019696 0.888937 0.091367, prediction: 1, input: Example(label='1', text_a= 'Have you ever written about Huangshan Strange Rock')
2020-02-12 15:27:47,563 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.012140 0.872288 0.115572, prediction: 1, input: Example(label='1', text_a= 'One by one slowly')
2020-02-12 15:27:47,563 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.770847 0.185456 0.043697, prediction: 0, input: Example(label='0', text_a= 'I've played this and it's not fun at all')
2020-02-12 15:27:47,563 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.007810 0.900273 0.091916, prediction: 1, input: Example(label='1', text_a= 'Online development of girls' QQ')
2020-02-12 15:27:47,564 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.072372 0.808013 0.119615, prediction: 1, input: Example(label='1', text_a= 'You guessed it right')
2020-02-12 15:27:47,564 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.874610 0.099676 0.025713, prediction: 0, input: Example(label='0', text_a= 'I hate you, hehehe...')
2020-02-12 15:27:47,596 - <ipython-input-22-09c53c7a0ed3>[line:40] - INFO: [infer] elapsed time: 0.085669 s

 
