A review of contrastive learning
Terminology: "latent", "hidden", "embedding", and "feature" all mean the same thing here, namely features.
The first stage
InstDisc(2018,Memory Bank)
Unsupervised Feature Learning via Non-Parametric Instance Discrimination
Contributions:
- Proposes instance discrimination as the pretext task: each image is treated as its own class (positive: the image itself; negatives: all other images).
- A memory bank stores the features used as negative samples. Each image's entry in the memory bank is only 128-dimensional, because a larger dimension would be too expensive to store.
For ImageNet, with 1.28 million images in total, the memory bank is a 1,280,000 x 128 matrix, from which 4096 negative samples are drawn at random.
With a batch size of 256, there are 256 positive samples, and 4096 negative samples are drawn. The loss is the NCE loss. After the loss is computed, the features of this batch replace the corresponding entries in the memory bank.
- Proposes a momentum-style way of updating the model (proximal regularization, which adds a constraint to model training; the later MoCo follows the same idea).
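The memory bank workflow above (look up the positive entry, sample negatives, then overwrite the entries of the current batch) can be sketched as follows. This is a minimal illustration using only the standard library, not the authors' implementation: the toy sizes, the temperature value, and the helper names `instance_logits` and `update_bank` are assumptions, and the NCE loss itself is left out.

```python
import math
import random

random.seed(0)

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy sizes (assumptions). The notes use N = 1,280,000, DIM = 128, NUM_NEG = 4096.
N, DIM, NUM_NEG, TEMPERATURE = 1000, 16, 32, 0.07

# One L2-normalised feature per image, initialised randomly.
memory_bank = [l2_normalize([random.gauss(0, 1) for _ in range(DIM)]) for _ in range(N)]

def instance_logits(feature, index):
    """Logit 0 belongs to the image's own bank entry; the rest are sampled negatives."""
    candidates = [i for i in range(N) if i != index]
    negatives = random.sample(candidates, NUM_NEG)
    return [dot(feature, memory_bank[i]) / TEMPERATURE for i in [index] + negatives]

def update_bank(index, feature):
    """After the loss is computed, overwrite the stored entry with the new feature."""
    memory_bank[index] = l2_normalize(feature)

# Example: encode image 3 (here just a random vector), score it, then refresh the bank.
feature = l2_normalize([random.gauss(0, 1) for _ in range(DIM)])
logits = instance_logits(feature, index=3)
update_bank(3, feature)
```

The logits would be fed to the loss with index 0 as the correct class; only the bank entries of the images in the batch change at each step.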
Experimental Settings:
MoCo later adopted the same experimental settings as InstDisc.
InvaSpread (CVPR, 2019, end-to-end, batch size is too small)
Unsupervised Embedding Learning via Invariant and Spreading Instance Feature
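After passing through the encoder, the features of similar images should be close (invariant), and the features of dissimilar images should be far apart (spreading).
Pretext task: instance discrimination
Contributions:
- With a batch size of 256, as shown in the figure above, data augmentation turns the images x1, x2, x3, ... into augmented copies x̂1, x̂2, x̂3, .... Positive samples: 256; negative samples for each image: (256-1)*2 = 510, that is, every other image in the batch together with its augmented copy. Because positives and negatives come from the same batch, a single encoder can be trained end to end.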
- No external data is required to provide positive samples
The results are not strong enough for three reasons: the dictionary is too small, so there are too few negative samples; the data augmentation is not strong enough; and there is no MLP projector.
CPCv1(2019,InfoNCE loss)
Representation Learning with Contrastive Predictive Coding
It can process not only audio but also text and images, and can be used in reinforcement learning.
g_ar: an autoregressive model, e.g. an RNN or LSTM
c_t (the context representation) can be used to predict future outputs (z_{t+1}, etc.)
Pretext task:
Positive sample: the g_enc feature at the future time step, which should be similar to the prediction (query) made from c_t
Negative samples: g_enc features chosen arbitrarily, e.g. at other time steps, which should not be similar to the prediction made from c_t
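The positive/negative setup above is what the InfoNCE loss scores: it is the cross-entropy of picking the positive out of one positive and several negatives. The sketch below is a simplified illustration, not the CPC code: it scores candidates with a plain dot product and a temperature (both assumptions, CPC itself uses a learned bilinear score), and the name `info_nce` is hypothetical.

```python
import math

def info_nce(query, positive, negatives, temperature=0.1):
    """Cross-entropy of selecting the positive among 1 + len(negatives) candidates."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Logit 0 is the positive; the remaining logits are the negatives.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]

# The query plays the role of the prediction made from c_t.
loss_good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
loss_bad = info_nce([1.0, 0.0], [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
```

When every candidate scores the same, the loss equals the log of the number of candidates, which is the chance level the model has to beat.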
CMC (multi-view multi-modal)
Contrastive Multiview Coding
Multiple views of the same object are treated as positive samples of each other
The disadvantage is that too many encoders are required
As an early multi-view work, it demonstrated both the flexibility of contrastive learning and the feasibility of multi-view, multi-modal learning
Abstract:
People observe the world through many sensors; for example, the eyes and ears act as different sensors that send different signals to the brain. Each view is noisy and may be incomplete, but the most important information, such as physical laws, geometry, and semantics, is shared across all views. For example, a dog can be seen, heard, or felt.
The goal is to learn a powerful feature that is invariant to the view: whether you see a dog or hear it bark, you can tell that it is a dog.
Learning objective: maximize the mutual information between all views
Dataset: NYU RGBD, with four views: the original image, depth, surface normals, and the segmentation map
All four inputs correspond to the same image, so they are positive samples of each other.
Handling several types of input may require several encoders (as in CLIP)
Distillation: the teacher and student outputs form positive pairs
Summary:
Pretext tasks: instance discrimination, predictive, multi-view, multi-modal
Objective functions: NCE, InfoNCE, and other variants
Model architecture:
- one encoder + a memory bank (InstDisc);
- one encoder (InvaSpread);
- one encoder + an autoregressive model (CPC);
- multiple encoders (CMC)
Task types: image, audio, text, reinforcement learning, etc.
The second stage
MoCov1(CVPR2020)
Momentum Contrast for Unsupervised Visual Representation Learning
My previous blog: https://blog.csdn.net/qq_52038588/article/details/130857141?spm=1001.2014.3001.5502
Writing style: generalize the problem, broaden the scope, and write top-down from the big picture so that the framing stays general
SimCLRv1(ICML,2020.2.13)
A Simple Framework for Contrastive Learning of Visual Representations
Training process:
x -> x_i, x_j (two augmentations of x, which are positive samples of each other) -> encoder $f(\cdot)$ -> projector $g(\cdot)$ -> feature z
Positive samples: the 2 views of the same image; negative samples: the other 2*(batch size - 1) views in the batch
The encoder $f(\cdot)$ shares weights between the two views
Loss: normalized temperature-scaled cross entropy (NT-Xent), similar to the InfoNCE loss
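The loss described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the SimCLR implementation: the layout of the batch (the two views of image k sit in rows 2k and 2k+1), the temperature of 0.5, and the name `nt_xent` are assumptions.

```python
import math

def nt_xent(z, temperature=0.5):
    """z holds 2N projected features; rows 2k and 2k+1 are the two views of image k."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    z = [normalize(v) for v in z]
    total = 0.0
    for i, zi in enumerate(z):
        j = i ^ 1  # index of the other view of the same image
        # Candidates are every row except i itself: 1 positive and 2 * (N - 1) negatives.
        logits = [dot(zi, zk) / temperature for k, zk in enumerate(z) if k != i]
        positive = dot(zi, z[j]) / temperature
        m = max(logits)
        total += m + math.log(sum(math.exp(l - m) for l in logits)) - positive
    return total / len(z)

# Two images, two views each: matching views give a lower loss than mismatched ones.
aligned = nt_xent([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mismatched = nt_xent([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```

Because the features are L2-normalised before the dot product, the logits are cosine similarities divided by the temperature, which is where the name "normalized temperature-scaled" comes from.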
Contributions (tricks):
- Stronger data augmentation
The data augmentation ablation shows that random cropping and color distortion are the augmentations that help
- An MLP projection head $g(\cdot)$ that reduces the dimension (2048 -> 128); it is used only during training, not for downstream tasks
$g(\cdot)$: a fully connected layer followed by a ReLU activation
Linear: the projection head without ReLU
Non-linear: the full projection head
None: no projection head
- Large-batch training for a long time
MoCov2 (2020.3.9, technical report)
Improved Baselines with Momentum Contrastive Learning
Improvements:
- Add an MLP projection head
- Add more data augmentation
- Add a cosine learning rate schedule (according to the ablation table, it adds only 0.2 points, which is not much)
Cosine learning rate schedule, as an example:
Let the initial learning rate be lr = 10^-3.
With 100 epochs in total and cosine decay applied only to the last 60, the first 40 epochs do not use the cosine; their learning rate is lr1 = epoch/40 * lr. In the last 60 epochs, lr is multiplied by 0.5 * (math.cos(t/60 * math.pi) + 1), where t is the number of epochs since the cosine phase started.
For example, at t = 40 the factor is 0.25, so the learning rate is 2.5x10^-4.
- Train for more epochs (200 -> 800; for comparison, MAE uses 1600 epochs)
Hardware: 8 V100 GPUs
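The warmup-plus-cosine schedule from the MoCo v2 notes can be written as one function. This sketch reads the description as a linear warmup over the first 40 epochs followed by cosine decay over the remaining 60; that reading, the function name `learning_rate`, and its parameter names are assumptions, and the values are the illustrative ones from the notes rather than MoCo v2's actual hyperparameters.

```python
import math

def learning_rate(epoch, base_lr=1e-3, warmup_epochs=40, total_epochs=100):
    """Linear warmup to base_lr, then cosine decay to zero at total_epochs."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    # Fraction of the cosine phase that has elapsed, from 0 to 1.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (math.cos(math.pi * progress) + 1)

schedule = [learning_rate(e) for e in range(101)]
```

Under this reading the rate peaks at 10^-3 at epoch 40 and has fallen to 2.5x10^-4 by epoch 80, two thirds of the way through the cosine phase.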
SimCLRv2 (NeurIPS, 2020)
Big Self-Supervised Models are Strong Semi-Supervised Learners
Inspired by Google's Noisy Student work (first train a teacher model, use it to generate pseudo labels on a dataset, then train a student model together with more unlabeled data; SOTA at the time)
Major improvements:
- A larger model (bigger models do better in the unsupervised setting): the backbone changes from ResNet-50 to ResNet-152, with 3x wider channels and selective kernels (SK)
- The projection head changes from a one-layer MLP to a two-layer MLP; a third layer adds little.
- Motivated by MoCo v2, a momentum encoder is used, although with a batch size of 4096 the dictionary is already large enough
SwAV
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
SwAV stands for Swapping Assignments between Views
It is recommended to read DeepCluster and the related earlier work first.
The approach on the left of the figure (standard contrastive learning on instance features) is primitive and resource-intensive
SwAV does not use negative samples; it relies on prior information and compares features against cluster centers c (prototypes)
D: feature dimension; K: number of cluster centers (3000)
A clustering step first assigns z1 and z2 to the cluster centers c, producing the targets q1 and q2 (used as ground truth)
Pretext task:
z1 and z2 should be similar, so each can predict the other's assignment: the dot product of z1 with c predicts q2, and the dot product of z2 with c predicts q1.
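The swapped prediction can be sketched as two cross-entropy terms. This is a simplified illustration, not the SwAV code: the targets q1 and q2 are passed in as given (the paper computes them with a separate clustering step), and the temperature and the name `swapped_loss` are assumptions.

```python
import math

def softmax(scores, temperature):
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs):
    # Skip zero-weight targets so that log is never taken where it is not needed.
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def swapped_loss(z1, z2, prototypes, q1, q2, temperature=0.1):
    """Predict view 2's assignment q2 from z1, and view 1's assignment q1 from z2."""
    def scores(z):
        return [sum(a * b for a, b in zip(z, c)) for c in prototypes]

    p1 = softmax(scores(z1), temperature)
    p2 = softmax(scores(z2), temperature)
    return cross_entropy(q2, p1) + cross_entropy(q1, p2)

# D = 2 feature dimensions, K = 3 prototypes (toy sizes).
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
agree = swapped_loss([1.0, 0.0], [1.0, 0.0], prototypes, [1, 0, 0], [1, 0, 0])
disagree = swapped_loss([1.0, 0.0], [1.0, 0.0], prototypes, [0, 1, 0], [0, 1, 0])
```

The comparison is always against the K prototypes, so the cost per sample depends on K rather than on the number of other instances in the dataset.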
Benefits of using clustering:
- Comparing against instance-level negatives requires thousands of negative samples, and even that is only an approximation; comparing against cluster centers needs only a few hundred, or at most 3000 on ImageNet.
- Cluster centers carry clear semantic meaning. Randomly sampled instance-level negatives may include samples that are actually positives and may be class-imbalanced, so they are less effective than cluster centers.
Important trick: multi-crop
Original: two 224*224 crops are taken from a 256*256 image to learn global features
Improved: 2 crops of 160*160 plus 4 crops of 96*96, 6 views in total
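A quick pixel count shows why six views stay affordable. Treating the number of input pixels as a rough proxy for compute is an assumption (the real cost depends on the network), and the helper name `pixels` is hypothetical.

```python
def pixels(crops):
    """Total input pixels for a list of (number of crops, side length) pairs."""
    return sum(count * side * side for count, side in crops)

baseline = pixels([(2, 224)])             # two 224x224 crops
multi_crop = pixels([(2, 160), (4, 96)])  # two 160x160 crops plus four 96x96 crops
ratio = multi_crop / baseline
```

Here baseline is 100352 pixels and multi_crop is 88064 pixels, so the six-view setting processes about 12% fewer pixels than the original two-view setting.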
CPCv2 (ICML 2020)
Data-Efficient Image Recognition with Contrastive Predictive Coding
- A larger model: CPC v1 uses only the first three residual stacks of ResNet-101, while CPC v2 deepens the model to ResNet-161 (ImageNet top-1 accuracy increases by 5%) and raises the input patch resolution from 60x60 to 80x80 (top-1 accuracy increases by 2%).
- LN instead of BN: each CPC v1 prediction should depend on only a few patches, but BN introduces information from other patches. As in image generation, BN therefore hurts CPC v1; replacing BN with LN increases ImageNet top-1 accuracy by 2%.
- A harder pretext task: because large models overfit more easily, the authors make the self-supervised task harder. To predict a patch, CPC v2 uses feature vectors from all four directions (up, down, left, and right), while CPC v1 uses only the feature vectors from above. Because CPC v2 is exposed to more semantic information, it becomes harder to extract the information relevant to the target patch. ImageNet top-1 accuracy increases by 2.5%.
- Better data augmentation: randomly dropping two of the three RGB channels increases ImageNet top-1 accuracy by 3%; additionally applying geometric, color, and elastic-deformation augmentations increases it by 4.5%. Data augmentation clearly has a large impact on self-supervised learning.
InfoMin (NeurIPS, 2020)
What Makes for Good Views for Contrastive Learning
Mainly an analytical extension work. The name refers to minimizing mutual information; the main point is that the right amount of mutual information between views matters.
It proposes the InfoMin principle: the learned representation should keep the information shared between views while discarding redundant information that is irrelevant to downstream tasks, so that it generalizes well.
The third stage: no negative samples
BYOL (2020): learning without negative samples
Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
x -> two views v and v' -> two encoders with the same architecture but different parameters. The online encoder f_θ is updated by gradient descent, while the other branch uses a momentum encoder. Each encoder is followed by a projector, and the online branch adds a predictor q_θ; the prediction should be as similar as possible to the target.
The features of one view predict the features of the other view; after training only the encoder is kept
Objective function: MSE loss
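Two pieces of the BYOL description are easy to make concrete: the MSE objective between the prediction and the target, and the momentum update of the target weights. The sketch below is only an illustration under assumptions: the vectors are L2-normalised before the squared error, the momentum value of 0.99 is arbitrary, and the names `byol_loss` and `ema_update` are hypothetical.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def byol_loss(prediction, target):
    """Squared error between L2-normalised vectors; equals 2 - 2 * cosine similarity."""
    p, t = normalize(prediction), normalize(target)
    return sum((a - b) ** 2 for a, b in zip(p, t))

def ema_update(target_params, online_params, momentum=0.99):
    """Move the momentum (target) weights a small step towards the online weights."""
    return [momentum * t + (1 - momentum) * o
            for t, o in zip(target_params, online_params)]

same = byol_loss([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = byol_loss([1.0, 0.0], [0.0, 1.0])  # cosine similarity 0
opposite = byol_loss([3.0, 0.0], [-1.0, 0.0])   # cosine similarity -1
new_target = ema_update([0.0, 2.0], [1.0, 2.0])
```

Because the loss depends only on the cosine similarity, it ranges from 0 (same direction) to 4 (opposite directions); no negative samples appear anywhere in it.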
The BYOL authors' response to the blog post claiming that BN gives BYOL implicit negative samples:
- When BN is used only in the projector, BYOL still fails to train, even though BN is present.
- Without any normalization, SimCLR fails to train even though it has negative samples.
The BYOL authors argue that BN mainly makes training stable, and that with a better initialization BYOL can be trained without BN. Using GN (group norm) and WS (weight standardization), this BN-free version of BYOL also learns well.
SimSiam (CVPR 2021, a summarizing work: no large batch size, no momentum encoder, no negative samples)
Exploring Simple Siamese Representation Learning
The two branches use the same encoder architecture and share parameters
The pseudocode computes two predictions, one from z1 and one from z2 (a symmetric loss), which differs from the figure, where only one branch has a predictor.
D is the negative cosine similarity, which is equivalent (up to scale) to the MSE loss between L2-normalized vectors
Stop-gradient is essential. Training can be viewed as an EM-like algorithm that updates the parameters step by step, which avoids collapse.
Comparison of all Siamese networks:
Without multi-crop, SwAV is not as good as MoCo v2
Barlow Twins (ICML 2021)
Barlow Twins: Self-Supervised Learning via Redundancy Reduction
It uses neither contrast nor prediction; in essence, it changes the objective function.
Specifically, it computes the cross-correlation matrix between the features of the two views and pushes it to be as close as possible to the identity matrix
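The objective just described can be sketched as follows. This is a dependency-free illustration, not the official implementation: features are standardised per dimension across the batch, and the off-diagonal weight of 5e-3 and the function names are assumptions.

```python
import math

def standardize_columns(batch):
    """Give each feature dimension zero mean and unit variance across the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / std
    return out

def barlow_twins_loss(za, zb, off_diag_weight=5e-3):
    n, d = len(za), len(za[0])
    a, b = standardize_columns(za), standardize_columns(zb)
    # Cross-correlation between feature dimension i of view A and dimension j of view B.
    c = [[sum(a[k][i] * b[k][j] for k in range(n)) / n for j in range(d)]
         for i in range(d)]
    on_diag = sum((1 - c[i][i]) ** 2 for i in range(d))              # pull diagonal to 1
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)  # push rest to 0
    return on_diag + off_diag_weight * off_diag

view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]]  # dimensions exchanged
identical_loss = barlow_twins_loss(view_a, view_a)
swapped_loss_value = barlow_twins_loss(view_a, swapped)
```

When the two views produce identical features whose dimensions are uncorrelated, the matrix is the identity and the loss is zero; the diagonal term enforces invariance and the off-diagonal term reduces redundancy between dimensions.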
The fourth stage: based on Transformers
MoCov3 (ICCV, 2021)
An Empirical Study of Training Self-Supervised Vision Transformers
ViT training becomes unstable as the batch size increases.
Trick:
Randomly initialize the patch projection layer and then freeze it, that is, the layer keeps its random initialization and is never trained. The trick also helps BYOL.
DINO
Emerging Properties in Self-Supervised Vision Transformers
The teacher network's output is normalized by centering (the mean is subtracted)
The pseudocode is similar to MoCo v3; the difference is the centering operation in the objective function.
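The centering step can be sketched as below. This is a minimal illustration, not the DINO code: the teacher output has a running center subtracted before a sharpened softmax, and the center is an exponential moving average of batch means. The temperatures, the center momentum, and all function names are assumptions.

```python
import math

def softmax(logits, temperature):
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def teacher_probs(teacher_logits, center, temperature=0.04):
    """Subtract the running center, then sharpen with a low temperature."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    return softmax(centered, temperature)

def update_center(center, batch_teacher_logits, momentum=0.9):
    """Exponential moving average of the batch mean of the teacher outputs."""
    n = len(batch_teacher_logits)
    batch_mean = [sum(col) / n for col in zip(*batch_teacher_logits)]
    return [momentum * c + (1 - momentum) * m for c, m in zip(center, batch_mean)]

def dino_loss(student_logits, teacher_logits, center, student_temperature=0.1):
    """Cross-entropy between the centered teacher distribution and the student one."""
    t = teacher_probs(teacher_logits, center)
    s = softmax(student_logits, student_temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

center = update_center([0.0, 0.0], [[1.0, 3.0], [3.0, 5.0]])
balanced = teacher_probs([5.0, 1.0], [4.0, 0.0])  # centering removes the bias towards dim 0
agree = dino_loss([1.0, 0.0], [1.0, 0.0], [0.0, 0.0])
disagree = dino_loss([0.0, 1.0], [1.0, 0.0], [0.0, 0.0])
```

Subtracting the center keeps any single output dimension from dominating the teacher distribution, which is one way such a method can avoid collapse without negative samples.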
Next is MAE
Summary
References:
1. Blog: https://www.bilibili.com/read/cv24218439?spm_id_from=333.999.0.0&jump_opus=1
2. Video: https://www.bilibili.com/video/BV19S4y1M7hm/?spm_id_from=333.999.0.0&vd_source=4e2df178682eb78a7ad1cc398e6e154d
3. Blog: https://blog.csdn.net/dhaiuda/article/details/117870030