Machine Learning: Self-Supervised Learning - Recent Advances in Pre-trained Language Models

Background

Autoregressive Language Models

Given an incomplete sentence, an autoregressive language model predicts the remaining words one at a time, each conditioned on the words before it.

  • sentence completion
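Below is a minimal sketch of how such left-to-right completion works. The toy vocabulary and the stub `language_model` function are illustrative assumptions standing in for a real autoregressive LM; only the greedy decoding loop is the point.

```python
# A minimal sketch of autoregressive sentence completion with a stub model.
import torch

vocab = ["<bos>", "deep", "learning", "is", "fun", "hard", "<eos>"]
stoi = {w: i for i, w in enumerate(vocab)}

def language_model(prefix_ids: torch.Tensor) -> torch.Tensor:
    """Stub ALM: returns logits over the vocabulary for the next token.
    A real model would be a Transformer conditioned on the whole prefix."""
    torch.manual_seed(int(prefix_ids[-1]))      # deterministic toy behaviour
    return torch.randn(len(vocab))

def complete(prompt: list[str], max_new_tokens: int = 5) -> list[str]:
    ids = [stoi[w] for w in prompt]
    for _ in range(max_new_tokens):             # generate left to right
        logits = language_model(torch.tensor(ids))
        next_id = int(torch.argmax(logits))     # greedy choice of the next word
        ids.append(next_id)
        if vocab[next_id] == "<eos>":
            break
    return [vocab[i] for i in ids]

print(complete(["<bos>", "deep", "learning"]))
```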

Transformer-based ALMs

Masked Language Models (MLMs)

The pre-trained model converts the input text into hidden feature representations.
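As a rough sketch of the MLM objective, the toy PyTorch snippet below masks about 15% of the tokens and trains a small encoder to recover the original tokens at the masked positions; the sizes and `MASK_ID` are arbitrary stand-ins for a real BERT-style setup.

```python
# A minimal sketch of the masked-language-model objective (toy encoder, not BERT).
import torch
import torch.nn as nn

vocab_size, hidden, seq_len = 100, 32, 10
MASK_ID = 0

encoder = nn.Sequential(
    nn.Embedding(vocab_size, hidden),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
        num_layers=2),
)
head = nn.Linear(hidden, vocab_size)             # predicts the original token

tokens = torch.randint(1, vocab_size, (1, seq_len))
mask = torch.rand(1, seq_len) < 0.15             # randomly mask ~15% of positions
mask[:, 0] = True                                # ensure at least one masked position
corrupted = tokens.masked_fill(mask, MASK_ID)

hidden_states = encoder(corrupted)               # contextual hidden representations
logits = head(hidden_states)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # loss only on masked tokens
loss.backward()
print(float(loss))
```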

The model parameters are initialized from the pre-trained model and then fine-tuned once a specific downstream task is given; the pre-trained layers can either be kept frozen or fine-tuned along with the task-specific head.
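A minimal sketch of that recipe, assuming a stand-in `backbone` instead of a real PLM: the backbone would normally be loaded from pre-trained weights, is optionally frozen, and a small task head is trained on top.

```python
# A minimal sketch of the pre-train / fine-tune recipe with a toy backbone.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 64), nn.ReLU())
task_head = nn.Linear(64, 2)                     # e.g. a binary sentiment classifier

# "Initialize from the pre-trained model": in practice, backbone.load_state_dict(...)
FREEZE_BACKBONE = True                           # fixed vs. fine-tuned backbone
for p in backbone.parameters():
    p.requires_grad = not FREEZE_BACKBONE

optimizer = torch.optim.AdamW(
    [p for p in list(backbone.parameters()) + list(task_head.parameters())
     if p.requires_grad],
    lr=1e-4)

tokens = torch.randint(0, 1000, (8, 16))         # a fake batch of 8 sentences
labels = torch.randint(0, 2, (8,))
features = backbone(tokens).mean(dim=1)          # crude sentence representation
loss = nn.functional.cross_entropy(task_head(features), labels)
loss.backward()
optimizer.step()
```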

  • Related papers

The Problems of PLMs

Problem 1: Labeled data is scarce

Problem 2: Models keep getting larger, and inference is slow

Fine-tuning and storing a separate large model for every task is costly: 4 tasks require 4 times the GPU memory, and inference with such large models takes a long time.

Solution

Labeled Data Scarcity: Data-Efficient Fine-tuning

With only a small amount of labeled data, standard training may fail to teach the model the task.
By converting each example into a natural-language prompt, the model can more easily tell what it is supposed to do.

  • A prompt template: tells the model what to do; here, a cloze sentence with a [MASK] token to fill in
  • A pre-trained language model that performs the task: it predicts the token with the highest probability at the [MASK] position

  • A verbalizer: maps each class label to a word in the vocabulary (e.g., positive → "great", negative → "terrible"), so the probability of a label word at the [MASK] position becomes the probability of that class
    When the labeled data is small, standard fine-tuning is hard to train well, while prompt-based fine-tuning still works; see the sketch after this list.
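Putting the three pieces together, here is a minimal sketch of prompt-based classification. The template, the label words in the verbalizer, and the stub `mlm_word_logits` scorer are all illustrative assumptions, not anything from the lecture.

```python
# A minimal sketch of prompt-based classification: template + MLM + verbalizer.
import torch

template = "{sentence} It was [MASK]."
verbalizer = {"positive": "great", "negative": "terrible"}   # label -> label word

def mlm_word_logits(prompted_text: str, candidate_words: list[str]) -> torch.Tensor:
    """Stub for a real MLM: returns one logit per candidate word at [MASK]."""
    torch.manual_seed(len(prompted_text))
    return torch.randn(len(candidate_words))

def classify(sentence: str) -> str:
    prompted = template.format(sentence=sentence)            # apply the prompt template
    labels = list(verbalizer.keys())
    words = [verbalizer[label] for label in labels]
    logits = mlm_word_logits(prompted, words)                # P([MASK] = word | prompt)
    return labels[int(torch.argmax(logits))]                 # verbalizer: word -> label

print(classify("The movie was a complete waste of time."))
```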

Few-shot learning


Semi-supervised learning


  • PET (Pattern-Exploiting Training); its three steps are sketched below
    • Step 1: Design several different prompts (patterns) and fine-tune one model per prompt on the small labeled set
    • Step 2: Use the trained models to predict labels on unlabeled data, and average their predictions into a soft label for each example
    • Step 3: Train a final model on the soft-labeled data with standard fine-tuning
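The snippet below sketches the three PET steps with random stub models standing in for prompt-fine-tuned PLMs; the dimensions and the `prompt_model` helper are made up for illustration.

```python
# A minimal sketch of the PET recipe: ensemble prompt models -> soft labels -> final model.
import torch
import torch.nn as nn

num_classes, num_unlabeled, feat_dim = 2, 32, 16
unlabeled_x = torch.randn(num_unlabeled, feat_dim)

def prompt_model(seed: int) -> nn.Module:
    """Stub for 'a PLM fine-tuned with prompt pattern #seed'."""
    torch.manual_seed(seed)
    return nn.Linear(feat_dim, num_classes)

# Steps 1 + 2: each prompt pattern gives one model; average their soft predictions.
with torch.no_grad():
    probs = torch.stack([prompt_model(s)(unlabeled_x).softmax(-1) for s in range(3)])
    soft_labels = probs.mean(dim=0)              # ensemble prediction per example

# Step 3: train a final classifier on the soft labels (standard training, soft targets).
final_model = nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.Adam(final_model.parameters(), lr=1e-2)
for _ in range(100):
    optimizer.zero_grad()
    log_probs = final_model(unlabeled_x).log_softmax(-1)
    loss = nn.functional.kl_div(log_probs, soft_labels, reduction="batchmean")
    loss.backward()
    optimizer.step()
print(float(loss))
```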

Zero-shot learning

When the model is large enough, it can do tasks zero-shot: given only a natural-language description of the task, it produces an answer without any task-specific training examples.
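One common way to use a large model zero-shot is to phrase the task as plain text and pick the candidate answer the model finds most probable. The sketch below assumes a toy `sequence_log_prob` scorer in place of a real LM.

```python
# A minimal sketch of zero-shot answering by scoring candidate continuations.
import random

def sequence_log_prob(text: str) -> float:
    """Stub: a real implementation would sum token log-probs from a large LM."""
    random.seed(text)
    return -abs(random.gauss(0, 1)) * len(text.split())

def zero_shot_answer(question: str, candidates: list[str]) -> str:
    scores = {c: sequence_log_prob(f"Question: {question} Answer: {c}")
              for c in candidates}
    return max(scores, key=scores.get)           # most probable continuation wins

print(zero_shot_answer("Is the sky blue on a clear day?", ["yes", "no"]))
```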

Summary

  • Distillation
  • Fine-tuning for downstream tasks

Share parameters across the transformer layers.
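A minimal sketch of that cross-layer sharing idea (as in ALBERT): one transformer layer is reused several times, so depth grows without adding parameters. The dimensions here are arbitrary.

```python
# A minimal sketch of cross-layer parameter sharing: one layer applied repeatedly.
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, d_model=64, nhead=4, num_passes=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_passes = num_passes             # depth without extra parameters

    def forward(self, x):
        for _ in range(self.num_passes):         # the same weights applied each pass
            x = self.layer(x)
        return x

shared = SharedLayerEncoder()
unshared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(64, 4, batch_first=True), num_layers=6)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(shared), "parameters shared vs.", count(unshared), "unshared")
print(shared(torch.randn(2, 10, 64)).shape)
```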

PLMs Are Gigantic: Reducing the Number of Parameters

Instead of keeping a separately fine-tuned BERT for every task, let all tasks share one BERT model and store only a small set of task-specific parameters per task.

Adapter

Only the adapters are updated, not the transformer. An adapter first projects the hidden representation down to a lower dimension and then back up, producing a correction Δh that is added to the layer output h.
Each downstream task learns only its own adapter (its own Δh), while the transformer's own parameters stay unchanged, which greatly reduces the storage needed per task.
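A minimal sketch of such an adapter block, with made-up dimensions and a single frozen linear layer standing in for a transformer sub-layer:

```python
# A minimal sketch of an adapter: project down, nonlinearity, project up, add Δh to h.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)    # reduce the dimension
        self.up = nn.Linear(bottleneck, d_model)      # restore the dimension

    def forward(self, h):
        delta_h = self.up(torch.relu(self.down(h)))   # Δh from the small bottleneck
        return h + delta_h                            # residual: h itself is untouched

frozen_layer = nn.Linear(768, 768)
for p in frozen_layer.parameters():
    p.requires_grad = False                           # transformer weights stay fixed

adapter = Adapter()                                   # only these weights are trained
x = torch.randn(4, 768)
out = adapter(frozen_layer(x))
print(out.shape, sum(p.numel() for p in adapter.parameters()), "trainable parameters")
```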

LoRA

LoRA learns a low-rank update in parallel with a pre-trained weight matrix: the input is projected down to a very small rank r and then back up to the original dimension, and the result is added to the original output.
LoRA works better than adapters: it does not add extra layers to the model, and it uses even fewer parameters.
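A minimal sketch of a LoRA-style linear layer, with illustrative dimensions and rank; the frozen `base` layer stands in for a pre-trained weight matrix:

```python
# A minimal sketch of LoRA: frozen weight W plus a learned low-rank update B·A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in=768, d_out=768, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)                      # pre-trained weight, frozen
        self.base.weight.requires_grad = False
        self.base.bias.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # project down to rank r
        self.B = nn.Parameter(torch.zeros(d_out, rank))         # project back up (starts at 0)

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T           # W x + B A x

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable {trainable} / total {total}")                 # a tiny fraction of the layer
print(layer(torch.randn(4, 768)).shape)
```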

Prefix Tuning

Prefix tuning inserts trainable vectors in front of the standard self-attention computation: each layer's keys and values get a learned prefix that the real tokens can attend to.
At inference time, only the trained prefix key and value vectors are kept; the extra computation used at the prefix positions during training is thrown away.
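A minimal sketch of the idea for a single attention head, with made-up dimensions; only `prefix_k` and `prefix_v` would be trained, while the model's own projections stay frozen.

```python
# A minimal sketch of prefix tuning for one self-attention head.
import torch
import torch.nn as nn
import torch.nn.functional as F

d, prefix_len, seq_len = 64, 5, 10
W_q, W_k, W_v = (nn.Linear(d, d) for _ in range(3))      # frozen pre-trained projections
for lin in (W_q, W_k, W_v):
    for p in lin.parameters():
        p.requires_grad = False

prefix_k = nn.Parameter(torch.randn(prefix_len, d))      # the only trained parameters
prefix_v = nn.Parameter(torch.randn(prefix_len, d))

x = torch.randn(seq_len, d)                              # hidden states of the real tokens
q = W_q(x)
k = torch.cat([prefix_k, W_k(x)], dim=0)                 # prefix keys come first
v = torch.cat([prefix_v, W_v(x)], dim=0)
attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)             # real tokens attend to the prefix
out = attn @ v
print(out.shape)                                         # (seq_len, d): no extra outputs
```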

Soft Prompting

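Soft prompting moves the trainable part to the input: a few trainable embedding vectors ("virtual tokens") are prepended to the frozen word embeddings, and only those vectors are learned. A minimal sketch with arbitrary sizes:

```python
# A minimal sketch of soft prompting: trainable embeddings prepended to the input.
import torch
import torch.nn as nn

vocab_size, d_model, prompt_len = 1000, 64, 4
word_emb = nn.Embedding(vocab_size, d_model)
word_emb.weight.requires_grad = False                     # pre-trained embeddings, frozen
soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model))  # learned "virtual tokens"

tokens = torch.randint(0, vocab_size, (2, 10))            # a batch of 2 sentences
inputs = word_emb(tokens)                                 # (2, 10, d_model)
prompt = soft_prompt.unsqueeze(0).expand(2, -1, -1)       # same prompt for every example
model_input = torch.cat([prompt, inputs], dim=1)          # (2, prompt_len + 10, d_model)
print(model_input.shape)                                  # fed to the frozen PLM as usual
```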

Summary


Early Exit

Running the input through the whole model takes a long time. Early exit attaches a small classifier to every layer: if the first layer's classifier is not confident enough, the input continues to the second layer, and so on. As soon as some layer's classifier is confident enough, the remaining layers are skipped to save time.
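A minimal sketch of early exit with toy linear "layers", per-layer classifiers, and an arbitrary confidence threshold:

```python
# A minimal sketch of early exit: stop at the first sufficiently confident layer.
import torch
import torch.nn as nn

d_model, num_classes, num_layers, threshold = 64, 2, 6, 0.9
layers = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_layers))
exit_heads = nn.ModuleList(nn.Linear(d_model, num_classes) for _ in range(num_layers))

def predict_with_early_exit(x):
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        x = torch.relu(layer(x))
        probs = head(x).softmax(-1)
        if probs.max() >= threshold:              # confident enough: skip the rest
            return int(probs.argmax()), depth
    return int(probs.argmax()), depth             # fell through: used every layer

print(predict_with_early_exit(torch.randn(d_model)))
```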

Summary

Closing Remarks

Origin blog.csdn.net/uncle_ll/article/details/131747434