Talking about language models again


Language models play an extremely important role in natural language processing tasks. The kind most people are familiar with is the n-gram style language model: whether traditional count-based or neural-network-based, it predicts the current word, or computes its probability, from the preceding N words of context. When the probability of an entire sequence is needed, the model has to walk through the sequence and accumulate the per-word probabilities. Language modeling for natural language does not have to work this way, however. The following introduces some newer language models and discusses their applications.
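As a concrete illustration of that accumulation, here is a minimal sketch in Python (the toy bigram table and probability floor are invented purely for illustration) of scoring a sequence with the chain rule:

import math

def sequence_log_prob(tokens, cond_prob):
    # Chain rule: log P(w1..wN) = sum_k log P(w_k | w_1..w_{k-1})
    log_p, history = 0.0, ("<s>",)
    for w in tokens:
        log_p += math.log(cond_prob(w, history))
        history = history + (w,)
    return log_p

# Toy bigram table P(word | previous word); values are made up for illustration.
bigrams = {"<s>": {"the": 0.5}, "the": {"cat": 0.2}, "cat": {"sat": 0.3}}

def bigram_prob(word, history, floor=1e-6):
    return bigrams.get(history[-1], {}).get(word, floor)

print(sequence_log_prob(["the", "cat", "sat"], bigram_prob))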

New language models:

BERT pre-training method:

Task 1: Masked LM
Intuitively, the research team had reason to believe that a deep bidirectional model is more powerful than either a left-to-right model or a shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, a standard conditional language model can only be trained left-to-right or right-to-left, because bidirectional conditioning would allow each word to indirectly "see itself" across multiple layers of context.

In order to train a deep bidirectional representation, the research team adopted a simple approach: randomly mask some of the input tokens and then predict only those masked tokens. The paper refers to this procedure as a "masked LM" (MLM), although it is often called the Cloze task in the literature (Taylor, 1953).

In this setup, the final hidden vector corresponding to each masked token is fed into an output softmax over the vocabulary, just as in a standard LM. In all of the team's experiments, 15% of the WordPiece tokens in each sequence were masked at random. In contrast to denoising autoencoders (Vincent et al., 2008), only the masked words are predicted rather than reconstructing the entire input.
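A minimal sketch of that prediction head, written in PyTorch (which the post does not specify; the dimensions and token ids below are illustrative), showing how the hidden vectors at the masked positions are projected to vocabulary logits and scored with cross-entropy:

import torch
import torch.nn as nn

vocab_size, hidden_size = 30522, 768              # BERT-base style sizes, illustrative
mlm_head = nn.Linear(hidden_size, vocab_size)     # hidden vector -> vocabulary logits

encoder_output = torch.randn(2, 128, hidden_size)      # [batch, seq_len, hidden] from the encoder
masked_positions = torch.tensor([[5, 17], [3, 42]])    # positions that were masked
masked_hidden = torch.gather(
    encoder_output, 1,
    masked_positions.unsqueeze(-1).expand(-1, -1, hidden_size))  # [batch, n_masked, hidden]

logits = mlm_head(masked_hidden)                   # softmax over the vocabulary is applied inside the loss
target_ids = torch.randint(0, vocab_size, (2, 2))  # true ids of the masked tokens (dummy here)
loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), target_ids.view(-1))
print(loss.item())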

Although this does give the team a bidirectional pre-trained model, the approach has two disadvantages. First, there is a mismatch between pre-training and fine-tuning, because the [MASK] token is never seen during fine-tuning. To mitigate this, the team does not always replace a "masked" word with the actual [MASK] token. Instead, the training data generator randomly selects 15% of the tokens; for example, in the sentence "my dog is hairy", the token it chooses might be "hairy". It then performs the following procedure:

Rather than always replacing the selected word with [MASK], the data generator does the following:

80% of the time: replace the word with the [MASK] token, for example, my dog is hairy → my dog is [MASK]
10% of the time: replace the word with a random word, for example, my dog is hairy → my dog is apple
10% of the time: keep the word unchanged, for example, my dog is hairy → my dog is hairy. The purpose of this is to bias the representation toward the actually observed word.
The Transformer encoder does not know which words it will be asked to predict or which words have been replaced by random ones, so it is forced to maintain a distributional contextual representation of every input token. In addition, because random replacement occurs for only 1.5% of all tokens (that is, 10% of 15%), it does not seem to harm the model's language understanding.
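A minimal sketch of this 80%/10%/10% masking rule (not the official BERT data pipeline; the vocabulary and masking rate are passed in as assumptions):

import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_rate=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)
    n_masked = max(1, int(len(tokens) * mask_rate))
    for i in random.sample(range(len(tokens)), n_masked):
        labels[i] = tokens[i]              # only these positions are predicted
        r = random.random()
        if r < 0.8:                        # 80%: replace with [MASK]
            inputs[i] = MASK
        elif r < 0.9:                      # 10%: replace with a random word
            inputs[i] = random.choice(vocab)
        # remaining 10%: keep the original word unchanged
    return inputs, labels

print(mask_tokens("my dog is hairy".split(), vocab=["apple", "cat", "run"]))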

The second disadvantage of using an MLM is that only 15% of the tokens in each batch are predicted, which suggests the model may need more pre-training steps to converge. The team showed that the MLM converges slightly more slowly than a left-to-right model (which predicts every token), but the empirical improvements from the MLM far exceed the increased training cost.

Task 2: Next sentence prediction

Many important downstream tasks, such as question answering (QA) and natural language inference (NLI), are based on understanding the relationship between two sentences, which is not directly captured by language modeling.

To train a model that understands sentence relationships, a binarized next-sentence prediction task is pre-trained; it can be generated trivially from any monolingual corpus. Specifically, when sentences A and B are chosen for each pre-training example, 50% of the time B is the actual sentence that follows A, and 50% of the time it is a random sentence from the corpus. For example:

Input = [CLS] the man went to [MASK] store [SEP]

he bought a gallon [MASK] milk [SEP]

Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP]

penguin [MASK] are flight ##less birds [SEP]

Label = NotNext
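A minimal sketch of how such sentence pairs can be sampled (the corpus layout, a list of documents each holding a list of sentences, is an assumption for illustration):

import random

def make_nsp_example(documents):
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)      # documents are assumed to have >= 2 sentences
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"                                 # the actual next sentence
    else:
        sent_b, label = random.choice(random.choice(documents)), "NotNext"   # a random sentence
    return sent_a, sent_b, label

docs = [["the man went to the store", "he bought a gallon of milk"],
        ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(make_nsp_example(docs))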

The NotNext sentences were chosen completely at random, and the final pre-trained model achieved 97%-98% accuracy on this task.
Retrieved from: https://blog.csdn.net/qq_39521554/article/details/83062188

ELMo model introduction

ELMo is a new type of deep contextualized word representation that models complex characteristics of words (such as syntax and semantics) as well as how a word's usage varies across linguistic contexts (i.e., it models polysemy). The word vectors are functions of the internal states of a deep bidirectional language model (biLM) pre-trained on a large text corpus.
When it comes to word vectors, word2vec inevitably comes to mind, because the word-vector idea it popularized gave a huge boost to the development of NLP. ELMo's main approach is to first train a complete language model and then use that language model to process the text of interest, generating the corresponding word vectors. This is why it is emphasized that ELMo can generate different vectors for the same word when it appears in different sentences.
ELMo uses a bidirectional LSTM language model, composed of a forward language model and a backward language model; the objective is to maximize the joint likelihood of the language models in both directions.
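Written out (a sketch following the usual ELMo formulation, where \Theta_x are the token-embedding parameters, \Theta_s the softmax parameters, and the arrows denote the forward and backward LSTMs), the objective is:

\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)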
(1) ELMo starts from the premise that a word's vector should not be fixed, so for polysemous words ELMo should work better than word2vec.
word2vec learns a word's vector from a window of context around the center word, so its learning scope is small; ELMo instead learns its language model from the entire corpus, and the word vectors that language model produces are effectively learned from the whole corpus, so they represent a word's meaning more accurately.
(2) Another advantage of ELMo is that it can learn its language model from a large, task-independent corpus; once trained, it can be carried over to similar problems.
Retrieved from: https://www.cnblogs.com/huangyc/p/9860430.html

Application

The methods above are really language-model training methods; at application time the pre-trained language model is loaded and used for the semantic encoding of text. Next, two application directions are discussed: NER and text generation with a text VAE.
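As one way to do that loading step (the post does not name a toolkit; this sketch assumes the HuggingFace transformers library and PyTorch are installed and uses the public bert-base-uncased checkpoint), the pre-trained encoder can be loaded and its hidden states taken as the text's semantic encoding:

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("ELMo and BERT are pre-trained language models.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings, usable as features for NER or as the encoder side of a text VAE.
token_embeddings = outputs.last_hidden_state   # [batch, seq_len, hidden]
print(token_embeddings.shape)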
