[Deep Learning] BERT Variation—Baidu ERNIE 3.0

         Pre-trained models achieve state-of-the-art results on a wide range of natural language processing (NLP) tasks, and scaling up a pre-trained language model generally improves its generalization ability. However, existing large-scale pre-trained models mainly rely on plain-text learning, lack large-scale knowledge-guided learning, and therefore have limited capabilities. ERNIE 3.0 further taps the potential of large-scale pre-training by training a large-scale knowledge-enhanced model.

         The ERNIE 3.0 framework explores the effectiveness of knowledge-enhanced large-scale pre-training and pre-trains the model on large-scale unsupervised corpora that include both plain text and a knowledge graph. It further adopts various pre-training tasks so that the model can more efficiently learn different levels of knowledge composed of lexical, syntactic and semantic information; these tasks are distributed over three task paradigms, namely natural language understanding, natural language generation and knowledge extraction. To this end, ERNIE 3.0 designs a continual multi-paradigm unified pre-training framework that enables collaborative pre-training across the task paradigms. A detailed introduction to ERNIE 3.0 is given in the following sections.

        Paper title: ERNIE 3.0: Large-Scale Knowledge Enhanced Pre-training for Language Understanding and Generation

Paper link: https://arxiv.org/pdf/2107.02137.pdf

1. Introduction to ERNIE 3.0

        Large-scale pre-trained models have brought new breakthroughs to the field of artificial intelligence. Their strong versatility and excellent transfer ability have set off a wave of ever-larger pre-trained models. However, these large models are trained on plain text without introducing linguistic knowledge or world knowledge, and most of them are trained in a purely autoregressive fashion, which leads to relatively weak performance when they are fine-tuned on downstream language understanding tasks. In short, existing large-scale pre-trained models rely mainly on plain-text learning, lack large-scale knowledge-guided learning, and therefore have limited capabilities.

        The researchers behind ERNIE 3.0 further explored the potential of large-scale pre-trained models. To address the problems caused by a single autoregressive framework and to explore knowledge-enhanced pre-training at a large parameter scale, a large-scale knowledge graph was, for the first time, introduced into a ten-billion-parameter pre-trained model, and a parallel pre-training method over massive unsupervised text and a large-scale knowledge graph, Universal Knowledge-Text Prediction, was proposed.

        By feeding the entity relations of a large-scale knowledge graph and large-scale text data into the pre-trained model simultaneously for joint masked training, information sharing between structured knowledge and unstructured text is promoted, which greatly improves the model's ability to memorize and reason about knowledge.

        Integrating autoregressive and auto-encoding networks, the 10-billion-parameter model is trained on a 4 TB corpus consisting of plain text and a large-scale knowledge graph. The trained model can therefore be easily applied to natural language understanding and generation tasks through zero-shot learning, few-shot learning or fine-tuning.

        In addition, the framework supports the introduction of custom tasks at any time. These tasks share the same encoding network and are trained through multi-task learning, which allows lexical, syntactic and semantic information to be shared across tasks. Moreover, when given a new task, the framework can incrementally train distributed representations on top of previously learned parameters without training from scratch. To help the model learn lexical, syntactic and semantic representations effectively, ERNIE 3.0 also adopts the continual multi-task learning framework introduced in ERNIE 2.0.

2. ERNIE 3.0 framework

        Pre-trained language models acquire syntactic and semantic knowledge from large-scale corpora, but they lack world knowledge. A typical carrier of world knowledge is a knowledge graph, and many works have embedded the entities and relations of knowledge graphs into pre-trained language models.

        The framework of ERNIE 3.0, illustrated in the paper's figure, can be used for pre-training, fine-tuning and zero/few-shot learning. Specifically, ERNIE 3.0 employs a collaborative architecture consisting of a universal representation module (a shared backbone network) and two task-specific representation modules: a natural language understanding (NLU) specific representation module and a natural language generation (NLG) specific representation module.

        The ERNIE 3.0 framework is divided into two layers. The first layer is a universal semantic representation network that learns basic, general knowledge from the data and acts as a universal semantic feature extractor (for example, a multi-layer Transformer); its parameters are shared across all task paradigms. The second layer consists of task-specific semantic representation networks that learn task-related knowledge on top of the universal representations. The task-specific networks can be implemented with an auto-encoding or an autoregressive structure, and they interact with and enhance each other through the shared lower layer. During training, each task-specific representation network only learns the pre-training tasks of its own paradigm, with parameters updated by the corresponding task objectives, while the universal representation network is trained on all pre-training tasks.
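
        To make the two-layer design concrete, here is a minimal sketch in PyTorch (the official implementation uses PaddlePaddle and a far larger Transformer-XL; all class names, layer counts and sizes below are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ERNIE3Sketch(nn.Module):
    """Illustrative two-layer layout: shared backbone + task-specific modules."""
    def __init__(self, vocab_size=30000, d_model=256, n_shared=4, n_task=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Universal representation module: parameters shared by all task paradigms.
        shared_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.universal = nn.TransformerEncoder(shared_layer, num_layers=n_shared)
        # Task-specific representation modules (the paper uses Transformer-XL blocks).
        nlu_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        nlg_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.nlu_head = nn.TransformerEncoder(nlu_layer, num_layers=n_task)
        self.nlg_head = nn.TransformerEncoder(nlg_layer, num_layers=n_task)

    def forward(self, token_ids, paradigm="nlu", causal_mask=None):
        h = self.universal(self.embed(token_ids))      # shared features
        if paradigm == "nlu":
            return self.nlu_head(h)                    # bidirectional encoding
        return self.nlg_head(h, mask=causal_mask)      # unidirectional, causal mask
```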

         Baidu researchers proposed a model framework that combines universal semantic representation with task-specific semantic representation. The framework integrates different task-specific representation networks, such as auto-encoding and autoregressive networks; it can handle language understanding and language generation tasks at the same time, and supports both zero-shot learning without labeled data and fine-tuning with labeled data. In addition, on the basis of the continual learning framework, ERNIE 3.0 can add new task-specific representation networks to accelerate model evolution.

2-1 General Representation Module

        ERNIE 3.0 uses a multi-layer Transformer-XL as the backbone network, like other pre-trained models such as XLNet, Segatron and ERNIE-Doc. Transformer-XL is similar to the Transformer but introduces an auxiliary recurrent memory module to help model long texts. This backbone is called the universal representation module and is shared across all task paradigms. The Transformer captures the contextual information of each token in a sequence through self-attention and produces a sequence of contextual embeddings, and the larger the Transformer, the stronger its capacity to capture and store semantic information at different levels. ERNIE 3.0 therefore sets up a large-scale universal representation module, which learns general lexical and syntactic information from the training data through the various pre-training tasks of the different paradigms. Note that the memory module is only enabled for natural language generation tasks, where the attention mask is controlled accordingly.
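
        The practical difference between the bidirectional NLU setting and the unidirectional NLG setting is the attention mask. A minimal sketch of the additive causal mask (the helper name and usage are illustrative assumptions):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Upper-triangular mask that blocks attention to future positions (NLG)."""
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# NLU: no mask, every token attends to the full sequence (bidirectional).
# NLG: pass causal_mask(seq_len) so position i only attends to positions <= i.
print(causal_mask(4))
```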

2-2 Task-Specific Representation Module

        Similar to the shared representation module, a task-specific representation module is also a multi-layer Transformer-XL, which captures the top-level semantic representations of a particular task paradigm. ERNIE 3.0 sets the task-specific representation modules to a manageable size, namely the base-model scale, instead of the multi-layer perceptron or shallow Transformer commonly used in multi-task learning. This brings three obvious benefits: first, a base-sized network has a stronger ability to capture semantic information than a multi-layer perceptron or a shallow Transformer; second, it allows the model to distinguish the top-level semantic information of different task paradigms without significantly increasing the parameters of the large-scale model; third, the smaller size of a task-specific network compared with the shared network makes practical application of the large-scale pre-trained model feasible when only the task-specific representation module is fine-tuned. ERNIE 3.0 builds two task-specific representation modules, an NLU-specific module and an NLG-specific module, where the former is a bidirectional modeling network and the latter is a unidirectional modeling network.

 3. Pre-training tasks

        Several tasks are constructed in ERNIE 3.0 for the various task paradigms to capture different aspects of information in the training corpus and to equip the pre-trained model with comprehension, generation and reasoning capabilities.

3-1 Word-aware pre-training tasks

3-1-1 Knowledge Masked Language Modeling

        ERNIE 1.0 proposed an effective strategy to enhance representations through knowledge integration, namely the knowledge masked language modeling task. It introduces phrase masking and named-entity masking: by predicting whole masked phrases and named entities, the model learns dependency information in both local and global contexts.
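
        A minimal sketch of span-level (phrase/entity) masking, assuming the phrase and entity spans have already been produced by an external tagger; the function name, masking probability and example spans are illustrative:

```python
import random

MASK = "[MASK]"

def knowledge_mask(tokens, spans, mask_prob=0.15):
    """Mask whole phrase/entity spans instead of independent tokens.

    tokens: list of tokens, e.g. ["Harry", "Potter", "was", "written", "by", "J.", "K.", "Rowling"]
    spans:  list of (start, end) index pairs covering phrases / named entities
    """
    tokens = list(tokens)
    labels = [None] * len(tokens)          # original tokens at masked positions
    for start, end in spans:
        if random.random() < mask_prob:
            for i in range(start, end):
                labels[i] = tokens[i]
                tokens[i] = MASK           # the whole span is masked together
    return tokens, labels

masked, labels = knowledge_mask(
    ["Harry", "Potter", "was", "written", "by", "J.", "K.", "Rowling"],
    spans=[(0, 2), (5, 8)], mask_prob=0.5)
```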

3-1-2 Document Language Modeling

        Generative pre-training models usually use either a traditional (left-to-right) language model (e.g., GPT, GPT-2) or a sequence-to-sequence language model (e.g., BART, T5, ERNIE-GEN) as the pre-training task, where the latter requires an auxiliary decoder. ERNIE 3.0 chooses the traditional language model as the pre-training task to reduce network complexity and improve the effectiveness of unified pre-training. In addition, to allow the NLG network of ERNIE 3.0 to model longer texts, it adopts the enhanced recurrence memory mechanism proposed in ERNIE-Doc, which models a larger effective context length than traditional recurrent Transformers by changing the shift-one-layer-down recurrence to same-layer recurrence.
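
        Document language modeling itself is plain next-token prediction over long documents. The sketch below shows the shifted cross-entropy objective (shapes are illustrative; the actual ERNIE-Doc recurrence memory is not shown):

```python
import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """logits: [batch, seq_len, vocab]; token_ids: [batch, seq_len].

    Position t predicts token t+1, so both tensors are shifted by one.
    """
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels)
```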

3-2 Structure-aware pre-training tasks

3-2-1 Sentence rearrangement

        The sentence rearrangement task was introduced in ERNIE 2.0 and trains the model to learn the relationship between sentences by reordering sentence segments. During pre-training, a given paragraph is randomly split into 1 to m segments, and the segments are randomly shuffled. The pre-trained model is then asked to recover the original order, which is modeled as a k-class classification problem over all possible permutations.
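
        A minimal sketch of how such a training example could be constructed; the splitting and labeling scheme below is a simplified assumption (the label indexes the sampled permutation among the permutations of the chosen number of segments):

```python
import itertools
import random

def rearrangement_example(sentences, max_segments=3):
    """Shuffle up to max_segments contiguous segments and return (shuffled, label)."""
    m = random.randint(1, min(max_segments, len(sentences)))
    # Split the paragraph into m contiguous segments.
    bounds = sorted(random.sample(range(1, len(sentences)), m - 1)) if m > 1 else []
    segments, start = [], 0
    for b in bounds + [len(sentences)]:
        segments.append(sentences[start:b])
        start = b
    perms = list(itertools.permutations(range(m)))
    perm = random.choice(perms)
    shuffled = [s for i in perm for s in segments[i]]
    label = perms.index(perm)          # classification target, m! classes for this m
    return shuffled, label

shuffled, label = rearrangement_example(["A.", "B.", "C.", "D."])
```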

3-2-2 Sentence distance

        The sentence distance task is an extension of the traditional next sentence prediction (NSP) task, which is widely used in pre-trained models to improve their ability to learn sentence-level information. It is modeled as a 3-class classification problem, where the three classes indicate that two sentences are adjacent, not adjacent but in the same document, or from two different documents.
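
        The 3-class label can be derived as in the following sketch, assuming each sentence is identified by its document id and its index within that document (names are illustrative):

```python
def sentence_distance_label(doc_a: str, idx_a: int, doc_b: str, idx_b: int) -> int:
    """0: adjacent in the same document, 1: same document but not adjacent, 2: different documents."""
    if doc_a != doc_b:
        return 2
    return 0 if abs(idx_a - idx_b) == 1 else 1
```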

3-3 Knowledge-aware pre-training tasks

3-3-1 Universal Knowledge-Text Prediction

        To incorporate knowledge into the pre-trained language model, we introduce the Universal Knowledge-Text Prediction (UKTP) task, an extension of the knowledge masked language model. While knowledge masked language modeling only requires unstructured text, universal knowledge-text prediction requires both unstructured text and a knowledge graph.

        The figure in the paper illustrates the universal knowledge-text prediction task. Given a triple from the knowledge graph and the corresponding sentence from an encyclopedia, we randomly mask either the relation in the triple or words in the sentence. To predict the relation in the triple, the model needs to detect the mentions of the head and tail entities and determine their semantic relation in the corresponding sentence. This process is similar in spirit to the distant-supervision assumption in relation extraction, which holds that if two entities participate in a relation, any sentence containing both entities is likely to express that relation. Meanwhile, to predict the masked words in the corresponding sentence, the model must consider not only the dependency information in the sentence but also the logical relation in the triple. Specifically, a triple and its corresponding sentence are obtained as follows: given a document from an encyclopedia, we first find candidate triples in the knowledge graph whose head or tail entity is the title of the document, and then select from these candidates the triples whose head and tail entities are both mentioned in the same sentence of the document.
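
        A minimal sketch of how a UKTP training sample could be assembled; the triple format, special tokens, masking probabilities and example data are illustrative assumptions rather than the paper's exact preprocessing:

```python
import random

MASK = "[MASK]"

def uktp_example(triple, sentence_tokens):
    """triple: (head, relation, tail); the sentence mentions both head and tail."""
    head, relation, tail = triple
    if random.random() < 0.5:
        # Mask the relation: the model must infer it from the sentence.
        knowledge_part = [head, MASK, tail]
        text_part = list(sentence_tokens)
    else:
        # Mask words in the sentence: the model can also use the triple as a clue.
        knowledge_part = [head, relation, tail]
        text_part = [MASK if random.random() < 0.15 else t for t in sentence_tokens]
    return ["[CLS]"] + knowledge_part + ["[SEP]"] + text_part + ["[SEP]"]

sample = uktp_example(
    ("Andersen", "author_of", "Nightingale"),
    ["Nightingale", "is", "written", "by", "Danish", "author", "Andersen", "."])
```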

        ERNIE 3.0 trains the NLU network with knowledge masked language modeling to improve its ability to capture lexical information, uses the sentence rearrangement and sentence distance tasks to enhance its ability to capture syntactic information, and finally optimizes the model with the universal knowledge-text prediction task to improve knowledge memorization and reasoning. Meanwhile, ERNIE 3.0 trains the NLG network with the document language modeling task to support various kinds of generation.

4. Data and Settings

4-1 Pre-training data

        To ensure the success of ERNIE 3.0 pre-training, we constructed a large-scale, multi-category, high-quality Chinese text corpus of 4 TB covering 11 different categories. As far as we know, this is currently the largest Chinese pre-training corpus, compared with CLUECorpus2020 (100 GB), the Chinese multimodal pre-training data (300 GB), WuDaoCorpus2.0 (2.3 TB of Chinese data and 300 GB of English data) used by CPM-2, and the PanGu corpus (1.1 TB).

        Specifically, the ERNIE 3.0 corpus was built on top of the ERNIE 2.0 data (including encyclopedia, feed, etc.), Baidu Search (including Baijiahao, Zhihu, Tieba and Experience), web text, QA-long, QA-short, Poetry & Couplet, domain-specific data in medicine, law, finance and other fields, and the Baidu knowledge graph (more than 50 million facts). To improve data quality, we adopted the following preprocessing strategies:

        Data deduplication is performed at different granularities, including the character, paragraph and document levels. At the character level, consecutive identical characters (spaces, tabs, exclamation marks, question marks, etc.) are replaced with a single character. At the paragraph level, two identical consecutive paragraphs consisting of N sentences (0 < N < 100) are replaced with a single paragraph. These two deduplication strategies are crucial for ERNIE 3.0 to generate non-repetitive content. Finally, the Message Digest Algorithm 5 (MD5) is used to filter duplicate documents by comparing the MD5 hashes of the three longest sentences of each document.
        Sentences with fewer than 10 words are filtered out, as they may be problematic or incomplete and carry limited semantic information for model pre-training.
        Sentence segmentation is then performed with regular expressions, and word segmentation with Baidu's word segmentation tool. This helps ERNIE 3.0 learn better sentence boundaries and named-entity knowledge during pre-training.
        Finally, each dataset is multiplied by a user-defined multiplier to increase data diversity after truncating the data for NLU-network pre-training.
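
        A rough sketch of the character-level and document-level deduplication steps described above, fingerprinting each document by the MD5 of its three longest sentences (a simplified assumption of the actual pipeline):

```python
import hashlib
import re

def doc_fingerprint(document: str) -> str:
    """Fingerprint a document by hashing its three longest sentences."""
    sentences = [s.strip() for s in re.split(r"[。!?.!?]", document) if s.strip()]
    top3 = sorted(sentences, key=len, reverse=True)[:3]
    return hashlib.md5("".join(top3).encode("utf-8")).hexdigest()

def deduplicate(documents):
    seen, unique = set(), []
    for doc in documents:
        # Character-level dedup: collapse runs of repeated whitespace/punctuation.
        doc = re.sub(r"([\s!?])\1+", r"\1", doc)
        fp = doc_fingerprint(doc)
        if fp not in seen:          # document-level dedup via MD5 fingerprint
            seen.add(fp)
            unique.append(doc)
    return unique
```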

4-2 Pre-training settings

        Both the universal representation module and the task-specific representation modules of ERNIE 3.0 use the Transformer-XL structure as the backbone. The universal representation module has 48 layers, 4096 hidden units and 64 attention heads; each task-specific representation module has 12 layers, 768 hidden units and 12 heads. Together, the universal and task-specific representation modules have 10 billion parameters. GeLU is used as the activation function. The maximum sequence length of the context and the memory length for language generation are set to 512 and 128, respectively, and the total batch size over all pre-training tasks is 6144. Training uses Adam with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999 and L2 weight decay of 0.01; the learning rate is warmed up over the first 10,000 steps and then decays linearly. During the first 10,000 steps, progressive learning is also used to speed up convergence in the initial phase of pre-training. The model is trained on a total of 375 billion tokens with 384 NVIDIA V100 GPU cards and implemented in the PaddlePaddle framework, where memory optimizations are applied because the model's total parameters exceed the memory of a single GPU card.
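
        The learning-rate schedule described above (peak 1e-4, linear warmup over the first 10,000 steps, then linear decay) can be sketched as follows; the total number of training steps is an illustrative parameter:

```python
def learning_rate(step: int, peak_lr: float = 1e-4,
                  warmup_steps: int = 10_000, total_steps: int = 1_000_000) -> float:
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```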

4-3 Experiments on fine-tuning tasks

4-3-1 Fine-tuning of natural language understanding tasks

       1. Sentiment Analysis

        Sentiment analysis is a classification task that aims to determine whether a sentence is positive, negative or neutral. ERNIE 3.0 uses 4 datasets from different domains, including shopping (NLPCC2014-SC), electronics (SE-ABSA16_PHNS, SE-ABSA16_CAM) and finance (BDCI2019). ERNIE 3.0 achieves substantial improvements on all four datasets.

       2. Opinion Extraction

        Similar to the sentiment analysis task, opinion extraction requires the model to mine the opinion of a sentence. ERNIE 3.0 uses 3 sub-datasets from Chinese Customer Reviews (COTE). Experimental results show that ERNIE 3.0 also outperforms current SoTA systems by a large margin.

       3. Natural Language Inference

        The natural language inference task is to determine whether a given premise semantically entails a hypothesis. ERNIE 3.0 uses the OCNLI and XNLI datasets. The results show accuracy improvements of 3.9 and 0.7 points on the two datasets, respectively. The improvement on XNLI is rather limited, which may be due to the dataset's quality, since XNLI is translated from English.

        4. Winograd Schemas Challenge

        WSC2020 is a coreference resolution task that requires the model to decide whether a pronoun and a noun in a sentence refer to the same entity; ERNIE 3.0 achieves a significant improvement of 25.7 points.

        5. Relation Extraction

        The task of relation extraction is to identify the relations between different entities such as people and organizations. ERNIE 3.0 considers two relation extraction datasets, FinRE and SanWen, for financial news and Chinese literature, respectively. ERNIE 3.0 outperforms the previous SoTA model by an average of 2.46 points.

        6. Event Extraction

        Similar to relation extraction, the task of event extraction aims to identify event entities and classify them into different categories. ERNIE 3.0 selects CCKS2020 – a text-level event subject extraction dataset in the financial domain. ERNIE 3.0 has an improvement of 3 points on the test set.

        7. Semantic Similarity

        Semantic similarity is a classic NLP task that determines the similarity between units such as words, sentences and documents. ERNIE 3.0 is tested on several datasets from different domains, including AFQMC, LCQMC, CSL, PAWS-X and BQ, focusing on sentence-level similarity. Experimental results show that ERNIE 3.0 significantly outperforms the baseline models. In particular, with a comparable number of parameters, ERNIE 3.0 outperforms CPM-2 by 1.2 points on the LCQMC dataset.

        8. Chinese news classification

        ERNIE 3.0 is evaluated on Chinese news classification. ERNIE 3.0 considers 6 datasets, including news headlines (TNEWS), application descriptions (IFLYTEK) and news stories (THUCNEWS, CNSE, CNSS). Under different types of classification tasks, ERNIE 3.0 can consistently achieve better accuracy with an average improvement of 2.8 points.

        9. Closed-book question answering

        The purpose of closed-book question answering is to answer questions directly without any additional reference or external knowledge. ERNIE 3.0 selects a general QA dataset, NLPCC-DBQA, and three medical-domain datasets – CHIP2019, cMedQA and cMedQA2 – to test its capabilities. The experimental results show that ERNIE 3.0 performs better on all QA tasks, and the knowledge-enhanced pre-training method does bring benefits to closed-book QA.

       10. Cant understanding

        Cant, a form of veiled insider language, is an advanced usage of human language, and it is quite difficult for a machine to understand. The cant-understanding ability of ERNIE 3.0 was tested on DogWhistle, a dataset based on the Decrypto game. The model needs to choose the correct answer guided by the corresponding cant. ERNIE 3.0 achieves the best results, showing its potential for understanding more difficult language.

        11. Named entity recognition

        Named entity recognition is a classic NLP task that extracts entities from text and classifies them. ERNIE 3.0 is evaluated on the widely used OntoNotes, CLUENER and Weibo datasets, as well as the domain-specific dataset CCKS2019. The results show that ERNIE 3.0 outperforms the baseline models on all datasets.

        12. Machine Reading Comprehension

        The machine reading comprehension capabilities of ERNIE 3.0 are comprehensively evaluated from different aspects, including span-prediction reading comprehension (CMRC2018, DuReader, DRCD, DuReader-checklist), multiple-choice reading comprehension (C3, DuReader-yesno), cloze and completion (CHID, CMRC2019), and robustness testing (DuReader-robust). With the help of knowledge-enhanced pre-training, ERNIE 3.0 outperforms the baseline models with significant improvements on all types of tasks. More specifically, ERNIE 3.0 achieves at least a 1.0-point EM improvement on the five span-prediction tasks and an average accuracy improvement of 0.89 points on the multiple-choice tasks. In addition, with the same number of parameters, ERNIE 3.0 surpasses CPM-2 by 0.6 points on the C3 dataset. For robustness testing, ERNIE 3.0 also performs best on test sets with over-sensitive and over-stable samples.

        13. Legal document analysis        

        To test the capability of ERNIE 3.0 on document analysis, two domain-specific legal tasks were chosen. Both datasets, from CAIL2018, are multi-label document classification tasks. ERNIE 3.0 performs significantly better than ERNIE 2.0 on both.

        14. Document retrieval

        The goal of document retrieval is to match documents to a given query. The retrieval capability of ERNIE 3.0 is evaluated on Sogou logs. Following previous work, performance is reported as NDCG@1 on the test-same test set and MRR on the test-original test set; ERNIE 3.0 outperforms CPM-2 on both.

4-3-2 Fine-tuning for natural language generation tasks

        1. Text summarization

        We consider the Large-scale Chinese Short Text Summarization (LCSTS) dataset, which requires a model to understand a text and distill its key information to generate a coherent, informative summary. LCSTS is a classic Chinese text summarization dataset consisting of 2 million real Chinese short texts with their summaries, collected from Sina Weibo. ERNIE 3.0 achieves a Rouge-L score of 48.46%, surpassing CPM-2 (11B), which has a comparable number of parameters, and the current SoTA ProphetNet-zh.

        2. Question Generation

        Question generation is the inverse task of machine reading comprehension (MRC): given a document and a short answer, the model must generate a plausible question. ERNIE 3.0 uses three datasets, the Knowledge Base Question Generation (KBQG) dataset and two MRC datasets, DuReader and DuReader-robust. Compared with the baselines, ERNIE 3.0 performs best on all three datasets.

        3. Mathematics

        To test the ability of ERNIE 3.0 to perform simple arithmetic operations, the Math23K dataset was used, which contains 23,161 real math word problems for elementary school students, with problem descriptions, structured equations and answers. After fine-tuning, ERNIE 3.0 can generate the expression of the structured equation from the problem description; the final answer is then calculated with Python's eval() function (note that '[' and ']' are replaced with '(' and ')' respectively, and '%' is replaced with '*0.01', so that the expression can be evaluated). This shows that ERNIE 3.0 is a capable math solver, achieving an accuracy of 75% compared with 69.37% for CPM-2.
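
        A small sketch of the answer-computation step described above; the example expression is made up, and the exact output format of the model follows the Math23K annotation:

```python
def compute_answer(expression: str) -> float:
    """Normalize a generated equation string and evaluate it with Python."""
    expression = expression.replace("[", "(").replace("]", ")")
    expression = expression.replace("%", "*0.01")
    # eval is acceptable here because the string is a plain arithmetic expression.
    return eval(expression)

# e.g. a generated equation like "[3+2]*20%" becomes "(3+2)*20*0.01" -> 1.0
print(compute_answer("[3+2]*20%"))
```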

        4. Advertisement Generation

        Consider AdGen, which consists of 119K pairs of advertising texts and clothing specification tables from a Chinese e-commerce platform. The model is asked to generate a long advertising text that covers all the given attribute-value pairs of a piece of clothing. Each attribute and its value are joined with a colon, and successive attribute-value pairs are joined with "|" according to their segment numbers; the resulting structured string is taken as the input to ERNIE 3.0. The results show that ERNIE 3.0 can generate coherent and attractive long advertising texts by extracting information from the structured input, achieving a 19.56% improvement on BLEU-4 compared with CPM-2.
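
        The structured input described above can be linearized as in the following sketch; the attribute names in the example are made up for illustration:

```python
def linearize_adgen(pairs):
    """Join each attribute and value with ':' and separate pairs with '|'."""
    return "|".join(f"{attr}:{value}" for attr, value in pairs)

prompt = linearize_adgen([("材质", "棉"), ("风格", "休闲"), ("颜色", "白色")])
# -> "材质:棉|风格:休闲|颜色:白色", fed to ERNIE 3.0 as the NLG input
```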

        5. Translation

        ERNIE 3.0 is pre-trained mainly on a Chinese corpus. To test its multilingual ability, the vocabulary was expanded with an additional 10K English subwords. On the classic multilingual dataset WMT20-enzh, ERNIE 3.0 is fine-tuned to translate English into Chinese. Compared with mT5-xxLarge and CPM-2, ERNIE 3.0 performs best, showing excellent multilingual ability.

        6. Dialog generation

        ERNIE 3.0 is also evaluated on dialogue generation. A Chinese multi-domain knowledge-driven dialogue dataset containing 4.5K dialogues in three domains (movies, music and travel) is used. ERNIE 3.0 is trained and tested on the fused dataset of these three domains, generating the current response given only the dialogue history. Knowledge triples are excluded from the input, so the setting tests the model's ability to exploit the knowledge internalized during pre-training to simulate multi-turn dialogue. Compared with the baseline, ERNIE 3.0 improves performance by 8.1 points, which verifies that the knowledge graph greatly enhances pre-training.

5. Summary

        ERNIE 3.0 is a unified framework that combines an autoregressive network and an auto-encoding network, so that the trained model can handle both natural language understanding and generation tasks through zero-shot learning, few-shot learning or fine-tuning.
        A knowledge-enhanced model with 10 billion parameters is pre-trained and then evaluated on a series of natural language understanding and generation tasks. Experimental results show that ERNIE 3.0 consistently outperforms state-of-the-art models by a large margin on 54 benchmarks and achieves first place on the SuperGLUE benchmark.

Source: blog.csdn.net/weixin_44750512/article/details/129332841