Large models can be fine-tuned with very little data: a detailed look at how LoRA and related methods work

Contributed by Michael Liu
QbitAI | WeChat official account QbitAI

Recently, fine-tuning methods for large models have taken off together with the large models themselves.

With only a small amount of data, these methods can make a large model "stand out" on downstream tasks where it did not previously perform well, turning it into an expert on that task.

Among them, the most popular fine-tuning method for large models is LoRA.


But what exactly is the core principle behind such methods, LoRA included? And what is their relationship with the large model? Let's take a detailed look.

1. Introduction

Let's start with the recently popular LoRA ("LoRA: Low-Rank Adaptation of Large Language Models").


This paper was published at ICLR 2022. It shows that with the low-rank adaptation method, only a small number of parameters need to be trained to adapt a large model to downstream tasks with good results.

So how does LoRA fine-tune a model and adapt it to downstream tasks?

The process is simple: using the data for the downstream task, LoRA adds a small set of new parameters and trains only those to adapt the model.

Once the new parameters are trained, they are merged with the original model weights by re-parameterization, so the effect on the new task is the same as fine-tuning the whole model, and no extra inference time is introduced.

The schematic diagram of LoRA is as follows:

[Figure: LoRA schematic, with the pre-trained weights in blue and the added low-rank modules A and B alongside them]

The blue part in the figure is the pre-trained model weights. LoRA adds two modules, A and B, alongside the pre-trained structure. A is initialized from a Gaussian distribution and B is initialized to zero, so the added branch contributes nothing at the start of training.

The input dimension of A and the output dimension of B match the input and output dimensions of the original weight matrix, while the output dimension of A and the input dimension of B are set to a value r that is much smaller than either of them. This is where "low-rank" comes in (somewhat similar in shape to a ResNet side branch), and it greatly reduces the number of parameters that need to be trained.

Only the parameters of A and B are updated during training; the pre-trained weights are frozen. At inference time, the idea of re-parameterization can be used to merge the product of A and B into W, so no extra computation is introduced during inference.

And for a new downstream task, only A and B need to be retrained on top of the pre-trained model, which also speeds up the pace of adapting large models.
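To make the structure concrete, here is a minimal sketch of what a LoRA-style linear layer could look like in PyTorch. The class name, the rank r, the scaling factor alpha, and the initialization scales are illustrative assumptions, not the official implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W plus a trainable low-rank update B @ A."""
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        # Frozen "pre-trained" weight; in practice it would be loaded from the base model.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)
        # A: Gaussian init, B: zero init, so the extra branch outputs exactly zero at first.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen = x @ self.weight.T                                    # original path
        update = (x @ self.lora_A.T) @ self.lora_B.T * self.scaling   # low-rank path
        return frozen + update

    @torch.no_grad()
    def merge(self):
        # Re-parameterization: fold B @ A into W so inference adds no extra compute.
        self.weight += (self.lora_B @ self.lora_A) * self.scaling
        self.lora_B.zero_()  # the side branch now contributes zero again

# Usage sketch: only lora_A and lora_B receive gradients during fine-tuning,
# and calling merge() afterwards reproduces the re-parameterization step above.
```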

Since this article is not a detailed introduction to LoRA, see the original paper for the full picture; for our purposes it is enough to know that the experiments in the LoRA paper demonstrate the method's effectiveness.

Going one step further: why does LoRA's idea work so well?

The answer lies in the intrinsic dimension, which is discussed next.

This point is also mentioned in the LoRA paper, which was inspired by the following two papers:

1. "Measuring the Intrinsic Dimension of Objective Landscapes", published at ICLR 2018; for convenience, referred to below as [Paper 1]

2. "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning", published at ACL 2021; for convenience, referred to below as [Paper 2]

2. What is the intrinsic dimension?

The concept of intrinsic dimension is proposed in [Paper 1].

Training a neural network often involves the following steps:

1. For a given data set, first design the structure of the network and select the corresponding loss
2. Randomly initialize the parameters in the network
3. Train the network to make the loss lower and lower

The training phase can then be viewed as searching for an effective path on a fixed objective landscape.

Why is the landscape fixed? Because once the dataset and network structure are fixed, the optimization problem is fully defined, and so the objective landscape is determined.

As shown below:

[Figure: illustration of the objective landscape determined by the dataset and network structure]

For a model with $D$ parameters $\theta^{(D)}$, training the model means finding an effective solution in the D-dimensional space. [Paper 1] argues that D may be redundant: it may actually suffice to optimize only d parameters to find an effective solution.

The formula is as follows:

$$\theta^{(D)} = \theta_0^{(D)} + P\,\theta^{(d)}$$

Here $\theta^{(D)}$ denotes the D-dimensional parameters actually used by the network, $\theta_0^{(D)}$ denotes the randomly initialized parameters, which are frozen during training, $P$ is a randomly initialized $D \times d$ matrix that is also frozen, and $\theta^{(d)}$ denotes the d-dimensional parameters to be optimized.

In other words, when training the network we update only these d-dimensional parameters, and the network can still reach the desired performance. This d is the so-called intrinsic dimension of the model.
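As a toy sketch of this setup (an illustration only, with a stand-in quadratic loss in place of a real network and dataset):

```python
import torch

D = 10_000  # total number of parameters of the (toy) model
d = 100     # dimension of the trainable subspace

theta_0 = torch.randn(D)                      # frozen random initialization theta_0^(D)
P = torch.randn(D, d) / d ** 0.5              # frozen random D x d projection
theta_d = torch.zeros(d, requires_grad=True)  # the only trainable parameters theta^(d)

def full_params() -> torch.Tensor:
    # theta^(D) = theta_0^(D) + P @ theta^(d)
    return theta_0 + P @ theta_d

def toy_loss(theta: torch.Tensor) -> torch.Tensor:
    # Stand-in objective; a real experiment would reshape theta into the
    # network's weight tensors and compute a task loss on a dataset.
    return ((theta - 1.0) ** 2).mean()

optimizer = torch.optim.SGD([theta_d], lr=0.1)
for step in range(200):
    optimizer.zero_grad()
    loss = toy_loss(full_params())
    loss.backward()   # gradients reach only theta_d; theta_0 and P stay fixed
    optimizer.step()
```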

If this feels a little dizzying, take a look at the following figure:

[Figure: schematic of training in a random subspace, with the frozen initialization, the frozen projection, and the trainable d-dimensional parameters shown in different colors]

In the figure above, the blue part is the randomly initialized network parameters $\theta_0^{(D)}$, the green part is the projection matrix $P$, and the red part is $\theta^{(d)}$. During training only the red part is updated and all other parameters are frozen; d is the intrinsic dimension.

So what counts as "the desired performance" when only the d-dimensional parameters are updated? The article defines it as follows: if updating only the d-dimensional parameters brings the network to 90% of the performance of training the full model, the desired performance has been reached, and that d is the intrinsic dimension.

For example, on the MNIST digit classification task, if the full model reaches an accuracy of 0.9, then updating only d-dimensional parameters needs to reach 90% × 0.9 = 0.81 accuracy; the d at which this happens is taken as the intrinsic dimension and written $d_{int90}$.
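A hypothetical sketch of how $d_{int90}$ could be measured in practice: sweep the subspace dimension d and keep the smallest value whose accuracy clears 90% of the full model's accuracy. The helper train_in_subspace is assumed here, not taken from either paper:

```python
def find_d_int90(train_in_subspace, full_model_accuracy, candidate_ds):
    """Return the smallest d whose subspace-trained accuracy reaches 90% of the full model's."""
    threshold = 0.9 * full_model_accuracy        # e.g. 0.9 * 0.9 = 0.81 in the MNIST example
    for d in sorted(candidate_ds):
        accuracy = train_in_subspace(d)          # train only d parameters, return test accuracy
        if accuracy >= threshold:
            return d
    return None                                  # intrinsic dimension exceeds every d tried

# Example usage with a made-up subspace trainer:
# d90 = find_d_int90(my_subspace_trainer, full_model_accuracy=0.9,
#                    candidate_ds=[10, 100, 500, 1000, 5000])
```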

3. Using the intrinsic dimension to understand why fine-tuning large models works

[Paper 2] uses the intrinsic dimension proposed earlier to examine why fine-tuning large models is effective: why can a large model now be fine-tuned effectively with only a few hundred to a few thousand examples?

According to [Paper 1], a given problem has an intrinsic dimension at a given accuracy level (such as 90% of full performance). For a large model, measuring the intrinsic dimension tells us roughly how many parameters need to be adjusted to approximately solve a given downstream problem.

If experiments show that adjusting only a few parameters is enough to solve downstream problems well, then the question above is answered: a small amount of fine-tuning (adjusting a small number of parameters) of a large model is enough to solve the problem at hand.

Unless otherwise specified below, "the article" refers to [Paper 2].

3.1 Do large models have an intrinsic dimension?

Like [Paper 1], [Paper 2] trains the model with the formula $\theta^{(D)} = \theta_0^{(D)} + P\,\theta^{(d)}$, i.e., only the d-dimensional parameters $\theta^{(d)}$ are adjusted during training. The difference from the experiments in [Paper 1] is that there $\theta_0^{(D)}$ is randomly initialized, whereas in [Paper 2] $\theta_0^{(D)}$ consists of pre-trained parameters.
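In code, the only change relative to the sketch in Section 2 would be where $\theta_0^{(D)}$ comes from. The snippet below is again an illustrative assumption, with a single linear layer standing in for a real pre-trained checkpoint such as BERT or RoBERTa:

```python
import torch

# Flatten the weights of a "pre-trained" module into one vector.
pretrained_layer = torch.nn.Linear(64, 64)   # stand-in for a pre-trained model
theta_0 = torch.nn.utils.parameters_to_vector(pretrained_layer.parameters()).detach()

D = theta_0.numel()
d = 256
P = torch.randn(D, d) / d ** 0.5              # frozen random projection, as before
theta_d = torch.zeros(d, requires_grad=True)  # trained only on the downstream task

def full_params() -> torch.Tensor:
    # Same re-parameterization as [Paper 1]; only theta_0 now comes from pre-training.
    return theta_0 + P @ theta_d
```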

[Paper 2] first selects four models, BERT-Base, BERT-Large, RoBERTa-Base, and RoBERTa-Large, and two datasets from the GLUE benchmark, MRPC and QQP (both test whether a pair of sentences has the same meaning).

[Figure: accuracy versus subspace dimension d for the four models on MRPC (top) and QQP (bottom), with dashed lines marking 90% of full fine-tuning accuracy]

The upper and lower subplots correspond to the MRPC and QQP tasks respectively. In each subplot, the four solid lines show the accuracy of the four models, the four dashed lines mark 90% of the accuracy of fully fine-tuning the model, and the horizontal axis is the subspace dimension d used for training. The figure shows that, for both tasks and all four models, training only a relatively small d-dimensional set of parameters is enough to reach 90% of full fine-tuning accuracy. The concept of intrinsic dimension therefore holds for large models as well.

So when training on a downstream task, only a small number of parameters need to be trained to achieve good results. At this point the question posed at the beginning of the article is answered. But the authors ran some further experiments and found several other interesting conclusions.

3.2 The relationship between the quality of pre-training and the intrinsic dimension

The article puts forward the hypothesis that pre-training implicitly reduces the model's intrinsic dimension on each NLP task.

Based on this conjecture, the article ran the following experiment: while pre-training a RoBERTa-Base model, a checkpoint was saved every 10K updates, and the intrinsic dimension of each saved checkpoint was then measured on six datasets: MRPC, QQP, Yelp Polarity, SST-2, MNLI, and ANLI.

The result is as follows:

[Figure: intrinsic dimension versus number of pre-training updates on each of the six datasets]

The same trend appears on every dataset: the more pre-training updates, the lower the model's intrinsic dimension on each task. The experiments never optimized the intrinsic dimension directly; the only change was longer pre-training. This supports the claim that the stronger the pre-trained model's representations (the better it is trained), the smaller its intrinsic dimension.

3.3 Relationship between pre-trained model size and intrinsic dimension

Ideally, studying the relationship between parameter count and intrinsic dimension would keep the model architecture fixed, which would be more convincing. But the authors note that this would require training many large models from scratch, so for ease of comparison the experiments are based on existing architectures; judging by the trend in the results, the conclusion holds across different architectures as well.

The article measures the intrinsic dimension of a range of existing pre-trained models on the MRPC dataset.

The experimental results are as follows:

[Figure: intrinsic dimension (vertical axis) versus model parameter count (horizontal axis) on MRPC]

In the figure above, the vertical axis is the intrinsic dimension and the horizontal axis is the model's parameter count. The trend is clear: the larger the model, the smaller the intrinsic dimension; in other words, the stronger the model, the lower the intrinsic dimension.

3.4 The relationship between intrinsic dimension and generalization ability

The sections above covered the relationship between the intrinsic dimension and fine-tuning (3.1) and pre-training (3.2), but the relationship between the intrinsic dimension and generalization has not yet been verified. That is, we now know how to make the intrinsic dimension smaller, but does a smaller intrinsic dimension actually improve generalization?

The article ran one more experiment: the RoBERTa-Base checkpoints saved in 3.2 were each trained at their corresponding intrinsic dimension and evaluated on the different datasets. The results are as follows:

[Figure: evaluation accuracy of the checkpoints saved in 3.2 plotted against their intrinsic dimension]

As can be seen, checkpoints with a lower intrinsic dimension reach higher accuracy after training. In other words, the lower the intrinsic dimension, the better the generalization.

Back to the question from the introduction: why does LoRA's idea work?

Because large models have a (low) intrinsic dimension, adjusting only a small number of parameters is enough to obtain good results on downstream tasks.

References:
[1]https://en.wikipedia.org/wiki/Gradient_descent
[2]https://arxiv.org/pdf/1804.08838.pdf
[3]https://arxiv.org/pdf/2012.13255.pdf
[4]https://arxiv.org/pdf/2106.09685.pdf

Original blog address:
https://michaelliudev.blog.csdn.net/article/details/131745794
