[Study Notes] Meta Learning


1. Meta Learning

Meta learning: learning to learn.
In academia we do not have many GPUs, so we cannot exhaustively try every hyperparameter setting. Can the machine determine the hyperparameters automatically?
Recap: the three steps of ML.

What is meta learning?

We want the machine itself to find this function, i.e., the learning algorithm.

Meta learning – Step 1

What parts of a learning algorithm can be learned? Candidate components include the network architecture, the initial parameters, and the learning rate. Meta-learning methods are categorized according to which component is learnable.

Meta learning – Step 2

Define a loss function L(φ) for the learning algorithm Fφ.
How do we know whether a classifier is good or bad? Evaluate it on the test set. Both the training and test data here are labeled.

Everything above happens within a single task, but meta-learning involves more than one task.

In typical ML, the loss is computed on the training examples; in meta-learning, each task's loss is computed on that task's test (query) data.
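As a minimal sketch of this step (not the lecture's code): `learn` and `evaluate` below are hypothetical stand-ins for the learning algorithm Fφ and for measuring a trained classifier on labeled data; L(φ) sums the per-task losses, each computed on that task's query set.

```python
def meta_loss(phi, training_tasks, learn, evaluate):
    """L(phi) = sum over training tasks of l^n, where l^n is the loss of the
    classifier produced by F_phi on task n's support set, measured on
    task n's query (test) set."""
    total = 0.0
    for task in training_tasks:
        classifier = learn(phi, task["support"])      # f_theta^n = F_phi(support set)
        total += evaluate(classifier, task["query"])  # l^n, computed on the query set
    return total
```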

Meta learning – Step 3

Find the φ that minimizes L(φ), which gives the learned algorithm Fφ*. When L(φ) cannot be differentiated with respect to φ, reinforcement learning can be used instead.
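A sketch of this step, assuming φ is a real-valued vector and that L(φ) can at least be evaluated: when L(φ) is differentiable, plain gradient descent on φ applies; when it is not, a zeroth-order search like the toy random local search below (or RL, as the text says) can stand in.

```python
import numpy as np

def optimize_phi(meta_loss_fn, phi0, steps=100, sigma=0.1, seed=0):
    """Random local search over phi; a crude stand-in for RL / evolution
    when L(phi) is not differentiable."""
    rng = np.random.default_rng(seed)
    phi, best = phi0.copy(), meta_loss_fn(phi0)
    for _ in range(steps):
        candidate = phi + sigma * rng.standard_normal(phi.shape)
        loss = meta_loss_fn(candidate)
        if loss < best:               # keep the candidate only if L(phi) improves
            phi, best = candidate, loss
    return phi, best
```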

Framework

Using the data of the training tasks and the three steps above, we obtain a learned algorithm Fφ*.

The training (support) data of the test task is then fed into this learned algorithm to produce a classifier fθ*.
The classifier is applied to the test (query) data of the test task to obtain the final result.
The training tasks are tasks unrelated to the test task. Few-shot learning means getting good results from only a little labeled data; it is closely related to, but not the same thing as, meta-learning. Note that in ordinary ML the test data must never be used during training, whereas in meta-learning the test (query) data of each training task is used to compute the loss.
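Putting the two phases together, continuing the hypothetical `meta_loss`, `optimize_phi`, `learn`, and `evaluate` sketches from Steps 2 and 3 above (so this snippet assumes those definitions are in scope):

```python
def meta_train(training_tasks, learn, evaluate, phi0):
    # Cross-task phase: run Steps 1-3 on the training tasks to get phi*.
    phi_star, _ = optimize_phi(
        lambda p: meta_loss(p, training_tasks, learn, evaluate), phi0)
    return phi_star

def meta_test(phi_star, test_task, learn, evaluate):
    # Within-task phase: F_{phi*} turns the test task's support data into a
    # classifier, which is then scored on the test task's query data.
    classifier = learn(phi_star, test_task["support"])
    return evaluate(classifier, test_task["query"])
```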

ML vs. Meta Learning

Meta-learning is cross-task learning; ordinary ML is within-task learning.
A single episode of meta-learning involves a large amount of computation, since each episode trains a model on a whole task.
What you know about ML can generally be applied to meta-learning: you can overfit to the training tasks; collecting more training tasks improves results; task augmentation helps; the algorithm that learns the learning algorithm has hyperparameters of its own; and meta-learning should also use validation tasks.

Review GD

Learned initialization parameters

The initialization parameters θ0 are normally sampled at random, but a good initialization matters, so it can itself be learned.
Random seeds and tuning are also required.
Pre-training uses proxy (self-supervised) tasks. It is very similar to MAML in that both look for a good initialization. The difference: MAML requires labeled data, whereas self-supervised pre-training does not.
Pre-training, in this comparison, mixes many tasks together for training, which can also be viewed as multi-task learning.
Meta-learning is also very similar to domain adaptation/transfer learning
What MAML aims for is an initialization that is already close to good parameters for every task, so that a few gradient updates are enough.
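A toy sketch of the MAML loop, using the first-order approximation (the second-order terms of full MAML are dropped) on an invented family of linear-regression tasks; every name and number below is illustrative, not from the lecture.

```python
import numpy as np

def grad_mse(theta, X, y):
    """Gradient of mean squared error for the linear model y_hat = X @ theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def make_task(rng, base_w, n=10):
    """Tasks share structure: each task's w is a small perturbation of base_w."""
    w = base_w + 0.3 * rng.standard_normal(base_w.shape)
    X = rng.standard_normal((2 * n, base_w.size))
    y = X @ w
    return (X[:n], y[:n]), (X[n:], y[n:])            # (support set, query set)

rng = np.random.default_rng(0)
base_w = np.array([1.0, -2.0, 0.5])
phi = np.zeros(3)                                     # the initialization being meta-learned
inner_lr, outer_lr = 0.1, 0.05

for step in range(500):
    (Xs, ys), (Xq, yq) = make_task(rng, base_w)
    theta = phi - inner_lr * grad_mse(phi, Xs, ys)    # one inner update on the support set
    phi -= outer_lr * grad_mse(theta, Xq, yq)         # outer update from the query-set loss

print(phi)   # should drift toward base_w, i.e., a good starting point for every task
```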

Learning the learning rate

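The learning rate itself can be one of the learnable components. Below is a sketch in the spirit of Meta-SGD, reusing the toy linear-regression tasks and the same first-order approximation as the MAML sketch above; the per-parameter step size α is updated alongside the initialization φ (all numbers illustrative).

```python
import numpy as np

def grad_mse(theta, X, y):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(1)
base_w = np.array([1.0, -2.0, 0.5])
phi = np.zeros(3)              # learnable initialization
alpha = 0.1 * np.ones(3)       # learnable per-parameter learning rates
outer_lr = 0.01

for step in range(500):
    w = base_w + 0.3 * rng.standard_normal(3)        # sample one task
    X = rng.standard_normal((20, 3)); y = X @ w
    Xs, ys, Xq, yq = X[:10], y[:10], X[10:], y[10:]
    g_inner = grad_mse(phi, Xs, ys)
    theta = phi - alpha * g_inner                    # inner step uses the learnable alpha (element-wise)
    g_query = grad_mse(theta, Xq, yq)
    phi -= outer_lr * g_query                        # first-order update of the initialization
    alpha += outer_lr * g_inner * g_query            # since d(query loss)/d(alpha_i) = -g_inner_i * g_query_i

print(alpha)
```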

NAS: searching for the network architecture

The architecture search can also be made differentiable, so that the architecture parameters are updated by gradient descent.
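A sketch of that idea (in the spirit of DARTS): rather than committing to one operation per connection, the output is a softmax-weighted mixture of all candidate operations, so the architecture parameters can be optimized by gradient descent together with the weights. The candidate operations below are toy element-wise functions chosen only for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Candidate operations on one "edge" of the network (illustrative choices).
candidate_ops = [
    lambda x: x,                   # identity / skip connection
    lambda x: np.maximum(x, 0.0),  # ReLU
    lambda x: np.zeros_like(x),    # "zero" op, i.e., drop this edge
]

arch_logits = np.zeros(len(candidate_ops))    # learnable architecture parameters

def mixed_op(x, logits):
    """Continuous relaxation: output is the softmax-weighted sum of all ops."""
    weights = softmax(logits)
    return sum(w * op(x) for w, op in zip(weights, candidate_ops))

x = np.array([-1.0, 2.0, 0.5])
print(mixed_op(x, arch_logits))   # an equal mixture before any training
```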

Data augmentation


Sample Reweighting

Give different samples different weights
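A sketch of where the weights enter: each example contributes to the loss in proportion to its weight, and in a meta-learning setting the weights themselves would be produced by a learned rule (for instance, down-weighting examples suspected to be noisy). The numbers below are made up.

```python
import numpy as np

def weighted_loss(per_example_losses, weights):
    """Weighted average of per-example losses; the weights are normalized."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return float(np.sum(weights * np.asarray(per_example_losses)))

losses  = [0.2, 1.5, 0.1, 3.0]    # per-sample losses from some model
weights = [1.0, 0.3, 1.0, 0.1]    # e.g., suspected noisy samples get small weights
print(weighted_loss(losses, weights))
```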
The above methods are all based on GD.
Beyond GD-based methods, all the training data and test data can be fed into one network together, which directly outputs the result.

Few-shot Image Classification

Few-shot image classification: only a few images per class. An N-way K-shot task has N classes with K labeled examples each.
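A sketch of how an N-way K-shot episode could be sampled from a labeled pool (the random vectors stand in for images; the layout and names are illustrative):

```python
import numpy as np

def sample_episode(data_by_class, n_way=5, k_shot=1, q_query=5, rng=None):
    """Pick n_way classes, then k_shot support and q_query query examples each."""
    rng = rng or np.random.default_rng()
    classes = rng.choice(list(data_by_class), size=n_way, replace=False)
    support, query = [], []
    for label, c in enumerate(classes):
        idx = rng.permutation(len(data_by_class[c]))
        support += [(data_by_class[c][i], label) for i in idx[:k_shot]]
        query   += [(data_by_class[c][i], label) for i in idx[k_shot:k_shot + q_query]]
    return support, query

pool = {c: np.random.randn(20, 16) for c in range(10)}  # 10 classes, 20 "images" each
support, query = sample_episode(pool, n_way=5, k_shot=1)
print(len(support), len(query))                          # 5 support, 25 query examples
```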

Meta-learning and self-supervised learning

Both BERT (pre-training) and MAML look for good initialization parameters:
MAML learns the initialization parameters φ by gradient descent. But gradient descent itself needs an initialization, so what should φ0 be? It can be produced by BERT!
There is a "learning gap": the objective of self-supervision differs from the downstream tasks, and BERT's many pre-training tasks are not necessarily suitable for them, whereas MAML directly optimizes for good performance on the training tasks. On the other hand, MAML needs labeled training data, while BERT can exploit a large amount of unlabeled data.
The combination of BERT and meta-learning has good results.

Meta-learning and knowledge distillation

Knowledge distillation lets a student network learn from a teacher network, but we do not know whether the teacher network actually teaches well.
The teacher network with the best accuracy does not necessarily teach well!
We can use meta-learning to let the teacher network learn how to teach.
We update the teacher network (for example, only a temperature parameter rather than the whole large network) so that the student obtains the best results (the lowest loss).
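A sketch of the temperature idea and of tuning it for the student's benefit: the teacher's logits are softened by a temperature T, the student takes one hand-derived gradient step on the distillation loss, and T is chosen so that the updated student does best on the true label. The logits, the single-step "training", and the candidate temperatures are all illustrative; a real method would meta-learn T (or the teacher) with gradients through full student training runs.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def student_after_one_step(student_logits, teacher_logits, T, lr=1.0):
    """One gradient step on the distillation cross-entropy; for softmax
    cross-entropy the gradient w.r.t. the student logits is (p_s - p_t) / T."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return student_logits - lr * (p_s - p_t) / T

def true_label_loss(logits, label):
    return float(-np.log(softmax(logits)[label] + 1e-12))

teacher_logits = np.array([4.0, 1.0, 0.5])
student_logits = np.array([1.0, 0.8, 0.7])
true_label = 0

# "Meta" step: pick the temperature under which the *updated student* is best.
best_T = min([1.0, 2.0, 4.0, 8.0],
             key=lambda T: true_label_loss(
                 student_after_one_step(student_logits, teacher_logits, T), true_label))
print("chosen temperature:", best_T)
```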

Meta-learning and domain adaptation

When the target domain is unknown (no target-domain data is available during training), we can train for domain generalization so that the network performs well in unseen domains.
Use one of the training domains as a pseudo-target domain:
Each training domain can take a turn as the pseudo-target domain.
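A sketch of this leave-one-domain-out idea (the domain names are invented): each training domain takes a turn as the pseudo-target, which mimics at training time the unseen target domain the model will eventually face.

```python
train_domains = ["photo", "sketch", "cartoon"]

for held_out in train_domains:
    pseudo_sources = [d for d in train_domains if d != held_out]
    # Meta-train on pseudo_sources, meta-evaluate on the held-out pseudo-target,
    # then aggregate the updates over all choices of held-out domain.
    print("sources:", pseudo_sources, "-> pseudo target:", held_out)
```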
Training samples and test samples can have different distributions; similarly, training tasks and test tasks can have different distributions.

Meta-Learning and Life-long Learning

If the machine goes through lifelong learning, it will eventually become very powerful!
In practice, however, when tasks are trained one after another, the machine suffers from catastrophic forgetting.
Solving catastrophic forgetting:
Meta-learning can also be used here: it can learn how to update parameters without forgetting old tasks.
Meta-learning itself also suffers from catastrophic forgetting:
Meta-learning learns how to learn.
" insert image description here
insert image description here
Meta-learning trains a different model for each task, whereas lifelong learning uses a single model to learn multiple tasks.
MAML computes the loss with the model after it has been updated on the task, whereas pre-training computes the loss directly with the current parameters.
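In symbols, following the usual way these two objectives are written (l^n is the loss on task n, ε the inner-loop learning rate):

$$L^{\text{MAML}}(\phi)=\sum_{n=1}^{N} l^{n}\!\big(\hat{\theta}^{n}\big),\qquad \hat{\theta}^{n}=\phi-\varepsilon\,\nabla_{\phi}\,l^{n}(\phi)$$

$$L^{\text{pre-train}}(\phi)=\sum_{n=1}^{N} l^{n}(\phi)$$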
During meta-training, MAML usually performs only one gradient update per task: the few-shot data a task provides is limited, and a single step keeps the already expensive meta-training fast. When the learned initialization is actually used, it can still be updated multiple times.

Toy regression example. Each task sets a target sine function y = a sin(x + b); K points are sampled from the target function, and the model must estimate the function from those samples. Pre-training does not do well here, but MAML does much better.
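A sketch of how such a toy task could be generated (the parameter ranges are illustrative):

```python
import numpy as np

def sample_sine_task(rng, K=10):
    """One task: a target function y = a*sin(x + b), observed at only K points."""
    a = rng.uniform(0.1, 5.0)             # amplitude of this task
    b = rng.uniform(0.0, np.pi)           # phase of this task
    x = rng.uniform(-5.0, 5.0, size=K)    # the K observed inputs
    y = a * np.sin(x + b)
    return x, y, (a, b)

rng = np.random.default_rng(0)
x, y, (a, b) = sample_sine_task(rng)
print(a, b, x.shape, y.shape)
```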
(Here φ denotes the initialization parameters.)
MAML updates φ along the direction of the gradient computed after the parameters have been updated once (the second update direction), whereas pre-training updates φ along the gradient computed at the current parameters at every step. On the toy problem above, meta-learning does better than pre-training.


Origin: blog.csdn.net/Raphael9900/article/details/128628743