Ten NLP Research Highlights!

Foreword

DeepMind research scientist Sebastian Ruder has summarized ten high-impact and interesting research directions in machine learning and natural language processing. This article walks through the major progress made in each of the ten directions, briefly explains why the author considers each direction important, and closes each one with a short outlook on future work.

The ten directions are:

  1. Universal unsupervised pretraining

  2. The lottery ticket hypothesis

  3. The neural tangent kernel

  4. Unsupervised multilingual learning

  5. More robust benchmarks

  6. ML and NLP for science

  7. Fixing decoding errors in NLG

  8. Augmenting pretrained models

  9. Efficient and long-range Transformers

  10. More reliable analysis methods

Universal unsupervised pretraining

Since BERT (Devlin et al., 2019) and its variants appeared, unsupervised pretraining has been shining in this year's natural language processing (NLP) research. Numerous BERT variants have been applied in multimodal settings, mostly involving images or video together with the associated text (as shown below). Unsupervised pretraining has also begun to infiltrate fields previously governed by supervised learning. In bioinformatics, pretrained Transformer language models are now being applied to protein sequence prediction (Rives et al., 2019).

In computer vision, models such as CPC (Hénaff et al., 2019), MoCo (He et al., 2019), and PIRL (Misra & van der Maaten, 2019), as well as the powerful generative model BigBiGAN (Donahue & Simonyan, 2019), exploit self-supervised learning to improve data efficiency on ImageNet and image generation quality. In speech, representations learned with multi-layer convolutional neural networks (Schneider et al., 2019) and bidirectional CPC (Kawakami et al., 2019) outperform state-of-the-art models while needing far less training data.
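All of these approaches share the same recipe: invent a prediction task on unlabeled data and let the model learn representations by solving it. As a concrete NLP illustration, the sketch below fills in a masked token with a pretrained BERT; it assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, which are my choices for illustration rather than anything prescribed by the papers above.

```python
# A minimal sketch of the masked-language-modeling objective behind BERT-style
# pretraining (assumes the Hugging Face `transformers` library is installed).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# During pretraining the model learns to reconstruct tokens hidden behind [MASK].
inputs = tokenizer("Unsupervised [MASK] reduces the need for labeled data.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most likely token at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_pos].argmax().item()
print(tokenizer.decode([predicted_id]))
```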

Why is it important?

Unsupervised pretraining dramatically reduces the amount of labeled data needed to train a model. This gives new life to fields where the demand for labeled data previously could not be met.

What's next?

Researchers have started to study unsupervised pretraining and have already achieved great success in several separate domains. It will be interesting to see whether it evolves toward a tighter integration of multiple modalities.

VideoBERT (Sun et al., 2019), a recently proposed multimodal variant of BERT. It can generate video "tokens" from a recipe (upper half of the figure) and, given a video "token", predict future tokens at different time scales (lower half of the figure).

The lottery ticket hypothesis

As shown below, Frankle and Carbin (2019) define "winning tickets": well-initialized subnetworks found inside dense, randomly initialized feed-forward networks, such that training these subnetworks on their own reaches an accuracy similar to that of training the full network.

While the initial pruning procedure only worked on small vision tasks, later work (Frankle et al., 2019) applies pruning early in training rather than at initialization, which makes it possible to find winning tickets in deeper models. Yu et al. (2019) also find winning tickets in the LSTM and Transformer models used in NLP and RL. Although winning tickets are still expensive to find, they appear to transfer across datasets and optimizers (Morcos et al., 2019).
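The core of the original procedure is a short loop: train, prune the smallest-magnitude weights, rewind the survivors to their initial values, and repeat. Below is a simplified PyTorch sketch of that idea; the function name and the `train_fn` callback are placeholders of mine, not the authors' code.

```python
import copy
import torch
import torch.nn as nn

def find_winning_ticket(model: nn.Module, train_fn, prune_fraction=0.2, rounds=3):
    """Iterative magnitude pruning in the spirit of Frankle & Carbin (2019).

    `train_fn(model)` is assumed to train the model in place; pruning is
    represented by per-parameter binary masks that zero out weights.
    """
    init_state = copy.deepcopy(model.state_dict())            # theta_0
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train_fn(model)                                        # train to convergence
        for name, param in model.named_parameters():
            # Prune the smallest surviving weights by magnitude.
            alive = param[masks[name].bool()].abs()
            if alive.numel() == 0:
                continue
            threshold = alive.quantile(prune_fraction)
            masks[name] *= (param.abs() > threshold).float()
        # Rewind the surviving weights to their original initialization.
        model.load_state_dict(init_state)
        with torch.no_grad():
            for name, param in model.named_parameters():
                param *= masks[name]
    return model, masks
```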

Why is it important?

As neural networks become more advanced, their sizes keep growing, and so does the compute needed to train them and to run predictions. Being able to reliably identify smaller subnetworks that achieve comparable performance would greatly reduce the compute required for training and inference. It would speed up model iteration and open new possibilities for on-device and edge computing.

What's next?

Currently, finding winning tickets still carries a large computational overhead, so it does not yet deliver real benefits in low-resource settings. More robust one-shot pruning methods that are less sensitive to noise in the pruning process could alleviate this to some extent. Studying the properties of winning tickets may also help us better understand initialization and the training dynamics of neural networks.

Test accuracy at different pruning rates: solid lines show winning tickets, dashed lines show randomly sampled subnetworks (Frankle & Carbin, 2019).

The neural tangent kernel

It may seem counterintuitive, but a neural network that is very wide (more precisely, infinitely wide) is easier to study theoretically than a narrower one. It has been shown that in the infinite-width limit a neural network can be approximated by a linear model with a kernel, the neural tangent kernel (NTK; Jacot et al., 2018). In practice, however, these models underperform their finite-width deep counterparts (Novak et al., 2019; Allen-Zhu et al., 2019; Bietti & Mairal, 2019), which limits applying the results to standard methods.
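For a finite network one can at least compute the empirical tangent kernel, which is simply an inner product of parameter gradients: k(x, x') = ⟨∂f(x)/∂θ, ∂f(x')/∂θ⟩. The sketch below computes it for a tiny scalar-output MLP in PyTorch; it is my own illustration of the definition, not code from the cited papers.

```python
import torch
import torch.nn as nn

# Empirical NTK between two inputs: the inner product of the gradients of the
# scalar network output with respect to all parameters.
net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))

def param_grad(x):
    net.zero_grad()
    net(x).sum().backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

x1, x2 = torch.randn(1, 3), torch.randn(1, 3)
g1, g2 = param_grad(x1), param_grad(x2)
print("empirical NTK value:", torch.dot(g1, g2).item())
```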

However, recent work (Li et al., 2019; Arora et al., 2019) has greatly narrowed the performance gap between the neural tangent kernel and standard methods (see Chip Huyen's blog post on NeurIPS 2019 for other related papers).

Why is it important?

The NTK may be the most powerful tool at our disposal for analyzing the theoretical behavior of neural networks. Although it has its limitations (practical neural network models still outperform their NTK counterparts) and the results in this area have not yet translated into practical gains, it may help us pry open the black box of deep learning.

What's next?

The gap between the NTK and standard methods now seems to come mainly from the difference in width, and future work may try to characterize that gap. Doing so would also help put the ideas behind the infinite-width limit into practice. Ultimately, the NTK may help us understand the training dynamics and generalization behavior of neural networks.

The learning dynamics of a linear model with the NTK for different values of the output scale factor α, visualized with the ellipse of the NTK.

Unsupervised multilingual learning

For years, cross-lingual representation learning was dominated by word-level approaches; see the survey "A Survey of Cross-lingual Word Embedding Models". Thanks to the progress of unsupervised pretraining, 2019 saw the emergence of deep multilingual models such as multilingual BERT, XLM (Conneau & Lample, 2019), and XLM-R (Conneau et al., 2019). Although these models use no explicit cross-lingual signal, they generalize surprisingly well across languages, even without a shared vocabulary or joint training (Artetxe et al., 2019; Karthikeyan et al., 2019; Wu et al., 2019).

"Unsupervised Cross-lingual Representation Learning" multi-language model are outlined. This model also gives depth unsupervised machine translation field brings a lot of lifting (Song et al, 2019;. Conneau & Lample, 2019). In 2018 this area has also made significant progress, due to the more rational integration of statistical methods and neural network method has improved. Another good news is that we can build depth development of multi-language model (see below) in accordance with the pre-existing English training characterization.

Why is it important?

Ready-to-use cross-lingual representations make it possible to train models for languages other than English with less training data. Moreover, given sufficient labeled English data, they enable zero-shot transfer. Ultimately, they may also help us understand the relationships between different languages.
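In practice the zero-shot recipe is short: fine-tune a multilingual encoder on English labels only, then run it unchanged on the target language. The sketch below illustrates this with the Hugging Face transformers API and the xlm-roberta-base checkpoint; the model name, task, and example text are my own assumptions, not prescriptions from the cited papers.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Zero-shot cross-lingual transfer: one multilingual model, English-only labels.
model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 1) Fine-tune on an English-labeled dataset with a standard training loop
#    (omitted here); no target-language labels are used.
# 2) At test time, feed target-language text to the very same model.
spanish_batch = tokenizer(["Esta película fue excelente."],
                          return_tensors="pt", padding=True, truncation=True)
predictions = model(**spanish_batch).logits.argmax(dim=-1)
print(predictions)
```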

What's next?

It is still unclear why these methods work so well without any cross-lingual supervision. Understanding their working mechanism better may help us design more powerful algorithms and could also reveal relationships between the structures of different languages. Moreover, we should not only focus on zero-shot transfer but also consider learning from the few labeled examples that do exist in the target language.

The four-step monolingual transfer process proposed by Artetxe et al. (2019).

More robust benchmarks

"Something is rotten in the state of the art." (Nie et al., 2019, paraphrasing Shakespeare's "Something is rotten in the state of Denmark.")

 

Recently, new NLP datasets such as HellaSWAG (Zellers et al., 2019) have been created specifically to challenge the current best models. Dataset examples are filtered by humans so that only those that current state-of-the-art models demonstrably fail on are kept (see the example below). This human-in-the-loop adversarial construction can be repeated several times; for instance, the recent Adversarial NLI benchmark (Nie et al., 2019) iterates it to build a dataset that is much more challenging for current natural language inference models.

Why is it important?

Many researchers have observed that current NLP models do not learn what they are supposed to learn but instead exploit shallow cues in the data via very simple heuristics; see "NLP's Clever Hans Moment has Arrived". As datasets become more robust, we hope that newly proposed models will be forced to learn the genuinely deep relationships in the data.

What's next?

As models become more and more powerful, most datasets will need to be continually improved or they will quickly become obsolete. We need dedicated infrastructure and tools to make this process easier. In addition, appropriate baselines should be run, including simple methods and model variants that only see incomplete input, so that the initial version of a dataset is as robust as possible.
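One such incomplete-input baseline is a partial-input model, for example an NLI classifier that sees only the hypothesis; if it scores well above chance, the hypotheses alone leak label information. Below is a minimal scikit-learn sketch of that check, with tiny placeholder data of my own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothesis-only baseline for an NLI-style dataset: the classifier never sees
# the premise. High held-out accuracy would mean the labels can be guessed from
# the hypotheses alone. The data below is a toy placeholder.
train_hypotheses = ["A man is sleeping.", "Nobody is outside.",
                    "The dog is running.", "There is no cat."]
train_labels = ["entailment", "contradiction", "entailment", "contradiction"]

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(train_hypotheses, train_labels)

# In practice, evaluate on a held-out split of the real dataset.
print(baseline.predict(["A woman is not singing."]))
```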

The picture above shows a multiple-choice sentence-completion question from HellaSWAG that current state-of-the-art models struggle to answer. The most difficult examples lie in a zone of complexity that is "just right", consisting of three context sentences and two generated sentences (Zellers et al., 2019).

ML and NLP for science

Machine learning has been applied to fundamental scientific problems with some important progress. Here the author highlights deep neural networks applied to protein folding prediction and to the many-electron Schrödinger equation (Pfau et al., 2019). From the NLP perspective, the good news is that even standard models can yield big gains when brought to other fields. In materials science, researchers used word embeddings to mine the latent knowledge in the literature (Tshitoyan et al., 2019) and to predict whether a material will have certain properties (see the figure below). In biology, genes and proteins are sequential data, so NLP methods (LSTMs, Transformers, and so on) are a natural fit; they have already been applied to protein classification tasks (Strodthoff et al., 2019; Rives et al., 2019).
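The materials-science result essentially boils down to ranking material names by their embedding similarity to a property word. The toy sketch below illustrates the idea with gensim (assuming gensim 4.x); the three-sentence corpus and the property word "thermoelectric" are placeholders, whereas Tshitoyan et al. trained on millions of abstracts.

```python
from gensim.models import Word2Vec

# Toy version of the idea in Tshitoyan et al. (2019): train word embeddings on
# tokenized abstracts, then rank candidate materials by their similarity to a
# property word such as "thermoelectric". The corpus here is a tiny placeholder.
abstracts = [
    ["Bi2Te3", "is", "a", "well", "known", "thermoelectric", "material"],
    ["PbTe", "shows", "promising", "thermoelectric", "performance"],
    ["NaCl", "is", "a", "common", "ionic", "crystal"],
]
model = Word2Vec(sentences=abstracts, vector_size=50, window=5,
                 min_count=1, epochs=50, seed=0)

candidates = ["Bi2Te3", "PbTe", "NaCl"]
ranked = sorted(candidates,
                key=lambda m: model.wv.similarity(m, "thermoelectric"),
                reverse=True)
print(ranked)
```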

Why is it important?

Science is arguably one of the most impactful application areas of machine learning. Solutions there can have a large effect on many other fields and can help solve real-world problems.

What's next?

From modeling energy in physical systems (Greydanus et al., 2019) to solving differential equations (Lample & Charton, 2020), machine learning techniques keep being applied to new scientific problems. It will be very interesting to see which of these turns out to be the most influential work of 2020!

Word embeddings trained on literature from different time periods are used to predict which materials will later be studied as ferroelectrics, photovoltaics, or topological insulators. The figure compares the top 50 predicted candidate materials against all studied materials (Tshitoyan et al., 2019).

Fixing decoding errors in natural language generation (NLG)

Although models in natural language generation (NLG) keep getting stronger, they still frequently produce repetitive or meaningless output (as shown below). This is mainly a consequence of maximum-likelihood training. Fortunately, this is being improved, and the progress is orthogonal to advances in modeling: the fixes mostly take the form of new sampling methods (such as nucleus sampling, Holtzman et al., 2019) or new loss functions (Welleck et al., 2019).
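Nucleus (top-p) sampling keeps only the smallest set of most probable tokens whose cumulative probability exceeds p, renormalizes, and samples from that set. The function below is a minimal PyTorch sketch of that idea, not the authors' reference implementation.

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample one token id with nucleus (top-p) sampling (Holtzman et al., 2019)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens until the cumulative mass first reaches p (always keep one).
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p)).item()) + 1
    kept_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept_probs, num_samples=1)
    return int(sorted_ids[choice].item())

# Example with a toy vocabulary of five tokens.
print(nucleus_sample(torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0]), p=0.9))
```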

Another surprising finding is that better search does not help the model generate better output: current models depend to some extent on an imprecise search and the errors of beam search. By contrast, in machine translation, exact search often returns meaningless translations (Stahlberg & Byrne, 2019). This finding suggests that advances in search and in modeling must go hand in hand.

Why is it important?

NLG is one of the fundamental tasks of NLP. In NLP and machine learning research, most papers focus on improving the model, while other parts of the pipeline are often neglected. For NLG researchers, it is important to remember that our models are still flawed and that the output may be improved by fixing the search or the training process.

What's next?

Although NLG models have become more powerful, helped along by transfer learning, their predictions still contain a large share of artifacts. Identifying and understanding the causes of these artifacts is an important research direction.

Output produced by GPT-2 with beam search and simple (greedy) sampling: the blue parts are repetitive and the red parts are meaningless.

Augmenting pretrained models

In 2019, the author was pleased to see methods that give pretrained models new capabilities. Some methods augment pretrained models with a knowledge base to improve performance on entity recognition (Liu et al., 2019) and recall of facts (Logan et al., 2019). Others give the model access to predefined executable programs for simple reasoning (Andor et al., 2019). Because most models have weak inductive biases and learn most of their knowledge from data, another option is to enhance the training data itself, for example by capturing common-sense knowledge (Bosselut et al., 2019), as shown below.
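The data-augmentation route of Bosselut et al. (2019) amounts to serializing knowledge-base triples into text so that a language model can be fine-tuned to complete them. The sketch below shows one possible serialization; the field format and the example triples are my own illustration, not the paper's exact scheme.

```python
# Schematic: turn knowledge-base triples (head, relation, tail) into plain text
# so that a pretrained language model can be fine-tuned to generate the tail
# given the head and relation. The separators below are illustrative only.
triples = [
    ("going to a restaurant", "xIntent", "to eat a meal"),
    ("PersonX wins the lottery", "xReact", "excited"),
]

def serialize(head: str, relation: str, tail: str) -> str:
    # The tail is the continuation the model learns to generate.
    return f"{head} <{relation}> {tail}"

training_texts = [serialize(*t) for t in triples]
for text in training_texts:
    print(text)
# These strings would then feed a standard language-model fine-tuning loop.
```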

Why is it important?

Models are becoming more and more powerful, but there is a lot of knowledge that a model cannot learn from text alone. Especially when dealing with more complex tasks, the available data may be too limited to learn common-sense facts or explicit reasoning, so stronger inductive biases may be required.

What's next?

As these models are applied to more challenging problems, it will become increasingly necessary to modify how they are composed. In the future, we may combine powerful pretrained models with learned compositional programs (Pierrot et al., 2019).

A standard Transformer trained on knowledge-base triples. Given the head entity and the relation, the trained model can predict the tail entity of the triple (Bosselut et al., 2019).

Efficient and long-range Transformers

This year saw several improvements to the Transformer architecture (Vaswani et al., 2017). New architectures such as Transformer-XL (Dai et al., 2019) and the Compressive Transformer (Rae et al., 2020) enable it to capture long-range dependencies.

Other methods make the Transformer more efficient by using different, usually sparse, attention mechanisms, such as adaptively sparse attention (Correia et al., 2019), adaptive attention spans (Sukhbaatar et al., 2019), product-key attention (Lample et al., 2019), and locality-sensitive hashing (Kitaev et al., 2020).

On the pretraining side, more efficient Transformer variants have appeared, such as ALBERT (Lan et al., 2020), which uses parameter sharing, and ELECTRA (Clark et al., 2020), which uses a more efficient pretraining task. There are also more efficient pretrained models that do not use a Transformer at all, such as the unigram document model VAMPIRE (Gururangan et al., 2019) and the QRNN-based MultiFiT (Eisenschlos et al., 2019). Another trend worth noting is distilling large BERT models into smaller ones (Tang et al., 2019; Tsai et al., 2019; Sanh et al., 2019).
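Cross-layer parameter sharing, the main source of ALBERT's parameter efficiency, simply reuses one Transformer layer's weights at every depth. The PyTorch sketch below illustrates the idea; it is a minimal illustration, not ALBERT's actual implementation, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Applies a single Transformer encoder layer `depth` times (ALBERT-style).

    The parameter count is that of one layer, while the computation depth
    matches a stack of `depth` distinct layers.
    """
    def __init__(self, d_model=256, nhead=4, depth=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):
            x = self.layer(x)   # the same weights are reused at every depth
        return x

encoder = SharedLayerEncoder()
tokens = torch.randn(2, 10, 256)                    # (batch, sequence, hidden)
print(encoder(tokens).shape)                        # torch.Size([2, 10, 256])
print(sum(p.numel() for p in encoder.parameters()), "parameters for 12 layers")
```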

Why is it important?

The Transformer architecture has been influential from the moment it was born. It underpins the most advanced NLP models and has been successfully applied in many other areas (see Sections 1 and 6). Any improvement to the Transformer architecture is therefore likely to have strong ripple effects.

What's next?

These improvements will take time to arrive in practice, but given the popularity and ease of use of pretrained models, more efficient alternatives may be adopted quickly. Overall, the author hopes that researchers continue to emphasize efficiency in model architectures, with sparsity being one of the key trends.

The Compressive Transformer (Rae et al., 2020) compresses fine-grained memories of past activations into a coarser compressed memory.

More reliable analysis methods

A key trend of 2019 was the growing number of papers that analyze models. In fact, several of the author's favorite papers of the year are such analysis papers. An early highlight was Belinkov & Glass's 2019 survey of analysis methods. As far as the author can remember, this was also the first year in which papers dedicated to analyzing BERT-like models (a genre now called BERTology) started to appear. In this setting, probes have become a commonly used tool: their purpose is to predict certain properties from representations in order to see whether the model "understands" morphology, syntax, and so on.
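A probe is nothing more than a small classifier trained on frozen representations. The sketch below illustrates this with Hugging Face transformers and scikit-learn; the checkpoint, the toy sentences, and the singular/plural labels are my own illustrative choices.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

# A probe: freeze the pretrained encoder, extract representations, and train a
# simple classifier on top to test whether a property of interest is linearly
# decodable from those representations.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

sentences = ["The cat sleeps.", "The cats sleep.", "A dog barks.", "Dogs bark."]
labels = [0, 1, 0, 1]   # toy property: singular (0) vs. plural (1) subject

with torch.no_grad():
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    # Use each sentence's [CLS] vector as its frozen representation.
    features = encoder(**batch).last_hidden_state[:, 0, :].numpy()

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe training accuracy:", probe.score(features, labels))
```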

The author particularly liked papers exploring how to make probing techniques more reliable (Liu et al., 2019; Hewitt & Liang, 2019). Reliability has also been a recurring theme in the debate over whether attention provides meaningful explanations (Jain & Wallace, 2019; Wiegreffe & Pinter, 2019; Wallace, 2019). Interest in analysis keeps growing, and perhaps the best evidence is the new ACL 2020 track on model analysis and interpretability in NLP.

Why is it important?

The most advanced methods are generally used as black boxes. To develop better models and use them in the real world, we need to understand why models make the decisions they make. However, our current methods for interpreting model predictions are still limited.

What's next?

We need to do more to explain predictions that fall outside our expectations, as such explanations are generally unreliable. A notable trend in this direction is the growing number of datasets that provide human-written explanations (Camburu et al., 2018; Rajani et al., 2019; Nie et al., 2019).

The probing setup used to study the linguistic knowledge captured in learned representations.

That concludes the author's review of the NLP field in 2019. As you can see, NLP is still a booming area: many ideas are developing rapidly, and the research results ahead are worth looking forward to.


Source: blog.csdn.net/xixiaoyaoww/article/details/104548746