Background
Autoregressive Language Models
Given an incomplete sentence, predict the remaining (missing) words
- sentence completion
Transformer-based ALMs
Masked Language Models (MLMs)
The pre-trained model converts the input text into hidden feature representations
The model parameters are initialized from the pre-trained model and then fine-tuned on the given downstream task; the intermediate (pre-trained) layers can either be kept frozen or trained along with the task head (see the sketch below)
- Related papers
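As a rough illustration (not from the lecture), a minimal PyTorch / HuggingFace Transformers sketch of this setup, assuming a bert-base-uncased checkpoint and a 2-class sequence-classification head; freezing model.bert corresponds to keeping the pre-trained layers fixed, and the sentences/labels are made up:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# parameters are initialized from the pre-trained checkpoint; a new task head is added on top
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# optionally freeze the pre-trained layers and train only the new classification head
for param in model.bert.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-5)

# one toy fine-tuning step
batch = tokenizer(["a great movie", "a boring movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```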
The Problems of PLMs
Problem 1: Labeled data is scarce
Problem 2: Models keep getting bigger, and inference takes time
Fine-tuning a separate full copy for 4 tasks requires 4 times the GPU memory,
and inference takes a long time
Solution
Labeled Data Scarcity: Data-Efficient Fine-Tuning
When there is little labeled data, the model may not be able to learn the task well with standard fine-tuning alone.
Converting the data into a natural language prompt makes it easier for the model to tell what it should do (see the sketch after this list):
- 1. A prompt template: tells the model what to do, here a cloze sentence with a mask to fill in the middle
- 2. A PLM performs the task and outputs the most probable prediction at the masked position
- 3. A verbalizer: maps labels to words and their predicted probabilities
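A minimal sketch of prompt-based classification with an MLM, again assuming bert-base-uncased; the template "It was [MASK]." and the verbalizer words are illustrative choices, not from the lecture:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# verbalizer: map each class label to a single word in the vocabulary
verbalizer = {"positive": "great", "negative": "terrible"}

def classify(review: str) -> str:
    # prompt template: wrap the input in a cloze sentence with a [MASK] slot to fill in
    prompt = f"{review} It was {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, seq_len, vocab_size)
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    # score each label by the logit of its verbalizer word at the masked position
    scores = {
        label: logits[0, mask_pos, tokenizer.convert_tokens_to_ids(word)].item()
        for label, word in verbalizer.items()
    }
    return max(scores, key=scores.get)

print(classify("The plot was predictable and the acting was flat."))
```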
When the labeled data is relatively small, standard fine-tuning is hard to train well;
few-shot learning
semi-supervised learning
- PET (Pattern-Exploiting Training)
- Step 1: Design different prompts
- Step 2: Use the models trained with the different prompts to predict labels, and combine their predictions into an overall (soft) prediction
- Step 3: Train a final model with standard fine-tuning on the resulting soft labels (a minimal sketch follows below)
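A rough sketch of Steps 2-3 (ensembling predictions into soft labels and distilling them into a final model). The names prompt_models, final_model, and the plain callable-model interface are assumptions for illustration, not PET's actual code:

```python
import torch
import torch.nn.functional as F

def soft_label(prompt_models, batch):
    """Step 2: average the class probabilities predicted by the prompt-tuned models."""
    with torch.no_grad():
        probs = [F.softmax(m(batch), dim=-1) for m in prompt_models]
    return torch.stack(probs).mean(dim=0)

def distill_step(final_model, optimizer, batch, prompt_models):
    """Step 3: train a final classifier on the soft labels with a standard distillation loss."""
    targets = soft_label(prompt_models, batch)            # soft labels on unlabeled data
    log_probs = F.log_softmax(final_model(batch), dim=-1)
    loss = F.kl_div(log_probs, targets, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```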
Zero-shot learning
If the model is large enough, it can perform tasks zero-shot
Summary
- distillation
- fine-tuning to downstream tasks
Share the parameters of the transformer layers across tasks
PLMs Are Gigantic: Reducing the Number of Parameters
Instead of one full copy per task, share a single BERT model across tasks
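A minimal sketch of sharing one backbone across tasks, with only a small head per task; the task names and head sizes are made up for illustration:

```python
import torch.nn as nn
from transformers import AutoModel

backbone = AutoModel.from_pretrained("bert-base-uncased")   # one copy, shared by all tasks
heads = nn.ModuleDict({
    "sentiment": nn.Linear(768, 2),   # each task only adds a small head
    "topic": nn.Linear(768, 4),
})

def predict(task, **inputs):
    h = backbone(**inputs).last_hidden_state[:, 0]   # [CLS] representation
    return heads[task](h)
```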
Adapter
Only the adapters are updated, not the transformer; an adapter first reduces the dimension, then increases it back to produce Δh.
Each downstream task learns only its own Δh; the transformer parameters h do not change, which greatly reduces the memory/storage needed.
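A minimal sketch of a bottleneck adapter, assuming a hidden size of 768 and a bottleneck of 64 (illustrative values):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project down, apply a nonlinearity, project back up, add Δh as a residual."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)   # reduce the dimension
        self.up = nn.Linear(bottleneck_dim, hidden_dim)     # increase it back
        self.act = nn.ReLU()

    def forward(self, h):
        delta_h = self.up(self.act(self.down(h)))   # task-specific Δh
        return h + delta_h                          # the transformer's h itself is untouched

# during fine-tuning, only adapter parameters are trained; the transformer stays frozen:
# for p in transformer.parameters():
#     p.requires_grad = False
```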
LoRA
Inside the feed-forward block, the vector first goes from low to high dimension and then back down; LoRA adds a trainable low-rank update in parallel to the weight matrix, so only the small low-rank matrices are learned per task.
LoRA works better than the adapter: it does not add extra layers to the model, and it needs fewer parameters than the adapter.
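A minimal sketch of a LoRA-augmented linear layer, assuming rank r = 8 and a scaling factor alpha; in practice base would hold a pre-trained weight, and B A can be folded into it at inference so no extra layer or latency is added:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + scale * B (A x)."""
    def __init__(self, in_dim, out_dim, r=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)        # pre-trained weight, kept frozen
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, in_dim) * 0.01)   # project down to rank r
        self.B = nn.Parameter(torch.zeros(out_dim, r))          # project back up, initialized to zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self):
        """Fold the low-rank update into the base weight for inference."""
        self.base.weight.data += self.scale * (self.B @ self.A)
```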
Prefix Tuning
Insert trainable prefix vectors in front of the standard self-attention structure (prepended to the keys and values),
and at inference time throw away the extra prefix-side computation (the blue part in the slides); only the trained prefix keys and values are kept.
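A simplified single-head sketch of self-attention with trainable prefix keys/values; the dimensions and prefix length are illustrative, and the real method uses per-layer, per-head prefixes with a reparameterization during training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixSelfAttention(nn.Module):
    """Self-attention with trainable prefix keys/values prepended; only the prefixes are tuned."""
    def __init__(self, dim=768, prefix_len=10):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # trainable prefix vectors; the q/k/v projections stay frozen during fine-tuning
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)

    def forward(self, x):                       # x: (batch, seq, dim)
        b, _, d = x.shape
        q = self.q(x)
        k = torch.cat([self.prefix_k.expand(b, -1, -1), self.k(x)], dim=1)
        v = torch.cat([self.prefix_v.expand(b, -1, -1), self.v(x)], dim=1)
        attn = F.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
        return attn @ v
```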
Soft Prompting
Summary
Early Exit
Running the whole model for every input takes a long time.
Attach a classifier to each layer: if the first layer's classifier is not confident enough, go on to the second layer;
once a classifier is confident enough, the remaining layers are skipped to save time.
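A minimal sketch of early exit at inference time: a classifier after each layer, with a confidence threshold deciding whether to stop. The layer/classifier modules, the 0.9 threshold, and batch size 1 are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitEncoder(nn.Module):
    """Run layers one by one and stop as soon as a layer-level classifier is confident enough."""
    def __init__(self, layers: nn.ModuleList, classifiers: nn.ModuleList, threshold=0.9):
        super().__init__()
        self.layers = layers              # one transformer layer per entry
        self.classifiers = classifiers    # one small classifier per layer
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, h):                 # h: (1, seq, dim), batch size 1 for simplicity
        for layer, clf in zip(self.layers, self.classifiers):
            h = layer(h)
            probs = F.softmax(clf(h.mean(dim=1)), dim=-1)   # pool tokens, classify
            if probs.max().item() >= self.threshold:        # confident enough: exit early
                return probs
        return probs                                        # otherwise use the last layer's prediction
```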
Summary
Closing Remarks