1. Meta Learning
Meta learning: learning to learn.
In academia we do not have many GPUs, so we cannot exhaustively try every hyperparameter setting. Can the machine determine the hyperparameters automatically?
Review: the three steps of ML.
What is meta learning?
The learning algorithm itself can be seen as a function; let the machine learn this function.
Meta Learning – Step 1
What in the learning algorithm can be learned? For example: the network architecture, the initialization parameters, the learning rate. Meta-learning methods are categorized according to which component is learnable.
Meta Learning – Step 2
Define a loss function L(φ) for the learning algorithm F_φ.
How do we know whether a classifier is good or bad? Evaluate it on the test set.
Both the training and test data of each training task are labeled.
The above describes training within a single task, but meta-learning involves more than one task.
In typical ML the loss is computed on training examples; in meta-learning, the loss is computed on the test data of each training task.
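Written out (a standard formulation consistent with the notation above, where θ̂^n denotes the parameters that F_φ produces for training task n):

```latex
\hat{\theta}^{n} = F_{\phi}(\text{training data of task } n), \qquad
L(\phi) = \sum_{n=1}^{N} l^{n}
```

Here l^n is the loss of the resulting classifier f_{θ̂^n} evaluated on the test data of training task n, and N is the number of training tasks.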
Meta Learning – Step 3
Find the best parameter φ* and hence the learned algorithm F_{φ*}. When the loss cannot be differentiated with respect to φ, reinforcement learning can be used instead.
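In symbols, step 3 solves

```latex
\phi^{*} = \arg\min_{\phi} L(\phi)
```

and when the gradient with respect to φ is unavailable, this minimization is done with RL rather than gradient descent.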
Framework
Using the data of the training tasks and the three steps above, we obtain a learned algorithm F_{φ*}.
Then the training data of the test task is fed into this learned algorithm to obtain a classifier f_{θ*}.
Finally, the classifier is applied to the test data of the test task to get the result.
The training tasks are tasks unrelated to the test task. Few-shot learning means obtaining good results from only a little labeled data; it is closely related to meta-learning but not the same thing. Note that in ML the test data cannot be used during training, but in meta-learning the test data of the training tasks must be used (to compute L(φ)).
ML和Meta
Meta-learning: cross-task learning; ML: within-task (intra-task) learning.
One episode of meta-learning involves a large amount of computation, because it contains a full within-task training run.
What you know about ML can generally be applied to meta-learning:
it can overfit to the training tasks; collecting more training tasks improves results; task augmentation can help; the learning algorithm being learned still has hyperparameters of its own to tune...; and meta-learning should also have its own validation (development) tasks.
Review GD
Learning the initialization parameters
In standard gradient descent the initialization parameters θ0 are randomly sampled, and random seeds still have to be chosen and tuned; a good initialization is important, so the initialization itself is worth learning.
Pre-training is trained on proxy tasks. Pre-training is very similar to MAML: both find a good set of initialization parameters. The difference: MAML requires labeled data, while pre-training does not need labels.
Pre-training can also mean mixing many tasks together for training, which is essentially multi-task learning.
Meta-learning is also very similar to domain adaptation / transfer learning.
MAML: the goal is to find initialization parameters that are already close to good parameters for every task, so that each task needs only a few gradient updates.
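Below is a minimal sketch of such a training loop, using the first-order variant of MAML for simplicity. It assumes PyTorch >= 2.0 (for torch.func.functional_call); the regression model, learning rates, and the sample_task() helper returning (support_x, support_y, query_x, query_y) are illustrative assumptions, not the lecture's exact setup.

```python
# Minimal first-order MAML (FOMAML) sketch -- an illustrative assumption-laden
# example, not the lecture's implementation.
import torch
import torch.nn as nn
from torch.func import functional_call

model = nn.Sequential(nn.Linear(1, 40), nn.ReLU(), nn.Linear(40, 1))  # shared init phi
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.01

def adapt(support_x, support_y):
    """One inner gradient step on the support set, starting from the shared init."""
    loss = nn.functional.mse_loss(model(support_x), support_y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return {name: p - inner_lr * g
            for (name, p), g in zip(model.named_parameters(), grads)}

for step in range(10000):
    sx, sy, qx, qy = sample_task()              # hypothetical task sampler
    theta = adapt(sx, sy)                       # inner loop: adapt to this task
    query_pred = functional_call(model, theta, (qx,))
    query_loss = nn.functional.mse_loss(query_pred, qy)
    meta_opt.zero_grad()
    query_loss.backward()                       # outer loop: update the shared init phi
    meta_opt.step()
```

The outer step updates the shared initialization φ using the loss measured on each task's query (test) data, which is exactly the meta loss L(φ) from Step 2.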
Learning the learning rate
NAS: searching for the network architecture.
When the architecture choice is not differentiable, one approach is to make it differentiable (as in DARTS):
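A minimal sketch of the differentiable-architecture idea: each candidate operation is mixed with softmax weights α, so the architecture parameters can be updated by gradient descent. The specific candidate operations and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),
        ])
        # Architecture parameters: one weight logit per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)
        # Softmax-weighted mixture of candidate operations keeps the choice differentiable.
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```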
Data augmentation
Sample Reweighting
Give different samples different weights
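A small sketch of what sample reweighting looks like in the loss; here the per-example weights come from free parameters `scores`, whereas in practice they (or the network that produces them) would be meta-learned. All tensors are toy values.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)                      # toy predictions: 8 examples, 5 classes
labels = torch.randint(0, 5, (8,))
scores = torch.zeros(8, requires_grad=True)     # per-example weight logits (learnable)

per_sample = F.cross_entropy(logits, labels, reduction="none")  # one loss per example
weights = torch.softmax(scores, dim=0)                          # normalized sample weights
loss = (weights * per_sample).sum()
```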
The above methods are all based on GD.
Beyond gradient descent: feed all of the training data and the test data into the network together, and let it directly output the predictions:
Few-shot Image Classification
Few-shot image classification: only a few images per class. In an N-way K-shot classification task there are N classes, each with K labeled examples; an episode sampler is sketched below.
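A sketch of how an N-way K-shot episode could be sampled from a dataset organized as a {class: images} dictionary; the query-set size and data layout are assumptions for illustration.

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, n_query=15):
    """Sample one N-way K-shot episode: a support set plus a query set."""
    classes = random.sample(list(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        images = random.sample(data_by_class[cls], k_shot + n_query)
        support += [(img, label) for img in images[:k_shot]]
        query += [(img, label) for img in images[k_shot:]]
    # Train the learner on `support`; compute the meta loss on `query`.
    return support, query
```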
Meta-learning and self-supervised learning
Both BERT and MAML are ways of finding initialization parameters:
MAML learns the initialization parameters φ by gradient descent, and that gradient descent itself needs a random initialization. So what should the initial φ0 be? It can be generated from BERT!
There is a "learning gap": the objective of self-supervision differs from the downstream tasks. There are many possible BERT pre-training tasks, but they are not necessarily suited to the downstream tasks, whereas MAML directly optimizes for good performance on the training tasks. On the other hand, MAML needs labeled training data, while BERT can use a large amount of unlabeled data.
The combination of BERT and meta-learning has good results.
Meta-learning and knowledge distillation
Let the student network learn from the teacher network; however, we do not know in advance whether a teacher network teaches well.
It turns out that the teacher network with the best accuracy does not necessarily teach the best!
We can use meta-learning to let the teacher network learn how to teach.
Concretely, we update the teacher network (for example only a temperature parameter, rather than the entire large network) so that the student's loss becomes as small as possible.
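A sketch of the distillation loss with a temperature parameter T; making T a learnable parameter and updating it so that the student's loss decreases is the meta-learning idea described above. The batch below is a toy example.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T):
    # Soften the teacher's targets and the student's predictions at temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions (scaled by T^2, as is standard).
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T

student_logits = torch.randn(8, 10)             # toy batch: 8 examples, 10 classes
teacher_logits = torch.randn(8, 10)
T = torch.nn.Parameter(torch.tensor(4.0))       # temperature, learnable via the meta objective
loss = distillation_loss(student_logits, teacher_logits, T)
```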
Meta-learning and domain adaptation
If we do not know the target domain (no target-domain data is available during the training phase), we can do domain generalization: train the network so that it performs well on unknown domains.
One way is to take one of the training domains and treat it as a pseudo-target domain; each training domain can take a turn as the pseudo-target domain, as sketched below:
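A sketch of this leave-one-domain-out construction; the domain names are illustrative assumptions.

```python
# Each training domain takes a turn as the pseudo-target domain; the remaining
# domains act as source domains for that meta-training task.
domains = ["photo", "art", "cartoon", "sketch"]

meta_training_tasks = []
for i, pseudo_target in enumerate(domains):
    sources = domains[:i] + domains[i + 1:]
    meta_training_tasks.append({"sources": sources, "pseudo_target": pseudo_target})
```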
Just as training samples and test samples can come from different distributions, training tasks and test tasks can also come from different distributions.
Meta-Learning and Life-long Learning
If the machine goes through lifelong learning, it will eventually become very powerful!
But in practice, if tasks are learned one after another, the machine suffers from catastrophic forgetting.
Solving catastrophic forgetting:
Meta-learning can also be used here: it can learn an update rule that does not forget old tasks when the parameters are updated for new ones.
Meta-learning itself also has the problem of catastrophic forgetting (across its training tasks):
Meta-learning is learning to learn.
Meta-learning trains a different model for each task, whereas lifelong learning uses one model to learn multiple tasks in sequence.
MAML computes the loss with the model after it has been trained (adapted) on each task, whereas model pre-training computes the loss directly with the current parameters φ.
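In symbols (θ̂^n again denotes the parameters obtained by training on task n starting from φ):

```latex
\text{MAML:}\quad L(\phi) = \sum_{n=1}^{N} l^{n}\bigl(\hat{\theta}^{n}\bigr)
\qquad\qquad
\text{model pre-training:}\quad L(\phi) = \sum_{n=1}^{N} l^{n}(\phi)
```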
In MAML, training typically uses only one inner update step per task: it is fast, a model that works after one step is what we want in the end, and the few-shot data is limited anyway; when the learned algorithm is actually used, the model can still be updated multiple times.
Toy regression example: each task sets a target sine function y = a sin(x + b), samples K points from it, and uses those samples to estimate the target function. Pre-training does not work well on this, but MAML does.
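A sketch of how such a toy task could be generated; the sampling ranges for a, b, and x are assumptions, since the text above does not specify them.

```python
import numpy as np

def sample_sine_task(k_shot=10, rng=np.random.default_rng(0)):
    """One few-shot regression task: a sine curve observed at only K points."""
    a = rng.uniform(0.1, 5.0)                    # amplitude of this task's target function
    b = rng.uniform(0.0, np.pi)                  # phase of this task's target function
    x = rng.uniform(-5.0, 5.0, size=(k_shot, 1))
    y = a * np.sin(x + b)                        # the K labeled points the learner sees
    return x, y
```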
φ denotes the initialization parameters (the meta-parameters being learned).
When updating φ, MAML uses the gradient direction obtained after the inner update (the direction of the second update step), whereas pre-training updates φ using the gradient computed at the current parameters for each task directly.
On the toy task above, meta-learning (MAML) performs better than pre-training.