A Review of Contrastive Learning

Note: "latent", "hidden", "embedding", and "feature" are used interchangeably here; they all mean features.

The first stage

InstDisc (2018, memory bank)

Unsupervised Feature Learning via Non-Parametric Instance Discrimination

[figure]

Contributions:

  1. Proposes the pretext task of instance discrimination: each image is treated as its own class (positive: the image itself; negatives: all other images)

  2. A memory bank stores the negative-sample features; each image's stored feature is 128-dimensional (a much larger dimension would be too costly to store). See the sketch after this list.

    For ImageNet, with 1.28 million images in total, the memory bank is a 1,280,000 x 128 matrix, and 4096 negative samples are randomly drawn from it.

    Suppose the batch size is 256: there are 256 positive samples, and 4096 negatives are sampled from the memory bank. NCE loss is computed, and afterwards the features of this batch replace the corresponding entries in the memory bank.

  3. Proposes a momentum-like parameter update (proximal regularization, which adds a constraint to model training); the later MoCo follows the same idea.
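Below is a minimal sketch of the InstDisc idea (memory-bank lookup, a loss over 4096 sampled negatives, and the momentum/proximal-style bank update). It is illustrative only: the paper uses NCE with a noise distribution, while this sketch uses a simpler InfoNCE-style softmax, and the temperature tau and momentum m are assumed values.

```python
import torch
import torch.nn.functional as F

n_images, feat_dim, n_neg, tau, m = 1_280_000, 128, 4096, 0.07, 0.5   # sizes from the text; tau, m assumed
memory_bank = F.normalize(torch.randn(n_images, feat_dim), dim=1)      # one 128-d feature per image

def instdisc_loss(features, indices):
    """features: (B, 128) L2-normalized encoder outputs; indices: image ids of this batch."""
    pos = memory_bank[indices]                          # stored feature of the same image
    neg_idx = torch.randint(0, n_images, (features.size(0), n_neg))
    neg = memory_bank[neg_idx]                          # (B, 4096, 128) randomly sampled negatives
    l_pos = (features * pos).sum(-1, keepdim=True)      # (B, 1)
    l_neg = torch.einsum('bd,bkd->bk', features, neg)   # (B, 4096)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau     # the positive sits at index 0
    return F.cross_entropy(logits, torch.zeros(features.size(0), dtype=torch.long))

def update_memory_bank(features, indices):
    # proximal / momentum-style update of the stored features after each step
    with torch.no_grad():
        new = m * memory_bank[indices] + (1 - m) * features
        memory_bank[indices] = F.normalize(new, dim=1)
```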

Experimental Settings:

[figure]

The subsequent MoCo experimental settings are the same as InstDisc

InvaSpread (CVPR 2019, end-to-end, limited by a small batch size)

Unsupervised Embedding Learning via Invariant and Spreading Instance Feature

https://i0.hdslb.com/bfs/note/fd618b06c332626cf4dc4cf9f4cd0c7d52542511.png@690w_!web-note.webp

After similar images pass through the encoder their features should be close (invariant), while the features of dissimilar images should be pushed apart (spreading).

Pretext task: instance discrimination

Contributions:

[figure]

  1. With a batch size of 256, as shown in the figure above, each image x1, x2, x3, ... is augmented once to obtain x̂1, x̂2, x̂3, ... Positives: the 256 (image, augmentation) pairs; negatives for each sample: (256 - 1) * 2, i.e. all other images in the batch together with their augmentations. A single encoder is therefore enough for end-to-end learning.
  2. No external structure (such as a memory bank) is required to provide samples: positives and negatives all come from the same mini-batch.

The results are not good enough because the dictionary (the batch) is too small, so there are too few negative samples; the data augmentation is not strong enough; and there is no MLP projection head.

CPC v1 (2019, InfoNCE loss)

Representation Learning with Contrastive Predictive Coding

[figure]

It can process not only audio but also text and images, and can be used in reinforcement learning.

g_ar: an autoregressive model, e.g., an RNN or LSTM

c_t (the context representation) can be used to predict future outputs (z_{t+1}, etc.)

Pretext task:

Positive samples: the g_enc outputs at future time steps, which should be similar to the prediction (query) produced from c_t

Negative samples: g_enc outputs taken from any other inputs or time steps, which should be dissimilar to the prediction produced from c_t
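A rough sketch of this setup, under simplifying assumptions: a GRU plays the role of g_ar, the prediction horizon is one step, and the other sequences in the batch act as negatives. Dimensions and the fake encoder outputs are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, ctx_dim, k = 128, 256, 1                 # illustrative sizes; k = prediction horizon
g_ar = nn.GRU(feat_dim, ctx_dim, batch_first=True) # the autoregressive model g_ar
W_k = nn.Linear(ctx_dim, feat_dim, bias=False)     # linear predictor for step +k

def cpc_infonce(z):
    """z: (B, T, feat_dim) sequence of g_enc outputs."""
    c, _ = g_ar(z[:, :-k])                 # context c_t from the past encodings
    pred = W_k(c[:, -1])                   # prediction of the encoding k steps ahead: (B, feat_dim)
    target = z[:, -1]                      # true future encoding (positive)
    logits = pred @ target.t()             # (B, B): other sequences in the batch act as negatives
    labels = torch.arange(z.size(0))
    return F.cross_entropy(logits, labels)

loss = cpc_infonce(torch.randn(8, 20, feat_dim))   # fake g_enc outputs, just to show the shapes
```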

CMC (multi-view, multi-modal)

Contrastive Multiview Coding

Multiple views of the same object are regarded as positive samples of each other

The disadvantage is that too many encoders are required

As very early multi-view work, it demonstrates both the flexibility of contrastive learning and the feasibility of this kind of multi-view, multi-modal setup

Abstract:

People observe the world through many sensors; for example, the eyes and ears provide different signals to the brain. Each view is noisy and may be incomplete, but the most important information is shared across all of them, such as basic physical laws, geometry, and semantics. A dog, for instance, can be seen, heard, or felt.

The goal is to learn a powerful feature that is invariant to the view: no matter which view you observe, whether you see a dog or hear it bark, you can tell it is a dog.

Learning objective: maximize the mutual information shared across all views

[figure]

Dataset: NYU RGB-D, with four views: the original image, depth, surface normals, and the segmentation map

All four inputs correspond to the same image, so they are positive samples of one another.

Multiple encoders may be needed to handle the different input types (as in CLIP)
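A hedged sketch of this multi-encoder idea (one encoder per view, with features of the same instance across views treated as positives, scored with a symmetric CLIP-style InfoNCE). The tiny linear "encoders", input sizes, and temperature are placeholders, not CMC's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# stand-in encoders for two views/modalities (e.g., RGB and depth), 32x32 inputs
enc_rgb = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
enc_depth = nn.Sequential(nn.Flatten(), nn.Linear(1 * 32 * 32, 128))

def cross_view_loss(view1, view2, tau=0.1):
    z1 = F.normalize(enc_rgb(view1), dim=1)
    z2 = F.normalize(enc_depth(view2), dim=1)
    logits = z1 @ z2.t() / tau             # (B, B): diagonal entries are the positive pairs
    labels = torch.arange(z1.size(0))
    # symmetric: contrast view 1 against view 2 and vice versa
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```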

Distillation can also be cast this way: the teacher's and student's outputs form a positive pair

Summary:

Pretext tasks: instance discrimination, predictive tasks, multi-view / multi-modal
Objective functions: NCE, InfoNCE, and other variants
Model architecture:

  1. An encoder + a memory bank (InstDisc);
  2. A single encoder (InvaSpread);
  3. An encoder + an autoregressive model (CPC);
  4. Multiple encoders (CMC)

Task type: image, audio, text, reinforcement learning, etc.

The second stage

MoCo v1 (CVPR 2020)

Momentum Contrast for Unsupervised Visual Representation Learning

[figure]

My previous blog: https://blog.csdn.net/qq_52038588/article/details/130857141?spm=1001.2014.3001.5502

On the paper's writing: it summarizes the problem, broadens the scope, writes top-down from the big picture, and frames the method in general terms.

SimCLR v1 (ICML 2020, 2020.2.13)

A Simple Framework for Contrastive Learning of Visual Representations

[figure]

Training process:

x -> x_i, x_j (two data augmentations; they are positives of each other) -> encoder f(·) -> projection head g(·) -> feature z

Positive samples: 2, negative samples: 2 * (batch size - 1)

The encoder f(·) shares its weights across the two branches

The loss is the normalized temperature-scaled cross entropy (NT-Xent), similar to the InfoNCE loss; see the sketch after the contributions list below

Contributions (tricks):

  1. Stronger data augmentation

    [figure]

    The ablation study on data augmentation is shown below; random crop and color distortion are the most useful.

    [figure]

  2. An MLP head that reduces dimensionality (2048 -> 128), i.e. the projection head g(·) is added; it is used only during training and discarded for downstream tasks

    g(·): a fully connected layer followed by a ReLU activation

    Linear: projection head without the ReLU

    Non-Linear: the full projection head

    None: no projection head

  3. Large-batch training for longer
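A minimal sketch of the projection head g(·) and the NT-Xent loss over the 2N augmented views, assuming 2048-d encoder outputs and a temperature of 0.5; this is not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

g = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 128))  # projection head g(.)

def nt_xent(h1, h2, tau=0.5):
    """h1, h2: (N, 2048) encoder outputs of the two augmented views."""
    n = h1.size(0)
    z = F.normalize(torch.cat([g(h1), g(h2)]), dim=1)   # (2N, 128) projected features
    sim = z @ z.t() / tau                               # (2N, 2N) scaled cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float('-inf'))  # drop self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of each positive
    return F.cross_entropy(sim, targets)
```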

MoCo v2 (2020.3.9, technical report)

Improved Baselines with Momentum Contrastive Learning

[figure]

Improvements:

  1. Add an MLP projection head

  2. Add stronger data augmentation

  3. Add a cosine learning rate schedule (per the table above, only about a 0.2-point gain, not much); see the sketch after this list

    cosine learning rate schedule: [figure]

    The initial learning rate is lr = 10^-3.

    If training runs for 100 epochs in total and only the last 60 apply cosine decay, then the first 40 epochs are not governed by the cosine: their learning rate is a linear warmup, lr1 = epoch / 40 * lr. The last 60 epochs use lr * 0.5 * (cos((epoch - 40) / 60 * pi) + 1).

    The learning rate computed this way toward the end of training is about 2.5 x 10^-4.

  4. Train for more epochs (200 -> 800; for comparison, MAE used 1600 epochs)

    Trained on 8 V100 GPUs
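A short sketch of the warmup + cosine schedule described in point 3 above, using this note's example split (40 linear-warmup epochs, cosine decay over the remaining 60); MoCo v2's actual recipe may differ.

```python
import math

base_lr, total_epochs, warmup_epochs = 1e-3, 100, 40   # the example split used above

def lr_at(epoch):
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs          # linear warmup over the first 40 epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (math.cos(math.pi * progress) + 1.0)   # cosine decay over the last 60
```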

SimCLR v2 (NeurIPS 2020)

Big Self-Supervised Models are Strong Semi-Supervised Learners

Inspired by Google's Noisy Student work (first train a teacher model, use it to generate pseudo-labels, then train a student model on the labeled data together with more unlabeled data; SOTA at the time)

[figure]

Major improvements:

  1. A larger model (the bigger the model, the better unsupervised learning works): the backbone goes from ResNet-50 to ResNet-152 with 3x wider channels and selective kernels (SK-Net)
  2. A deeper projection head: from a 1-layer MLP to a 2-layer MLP (a third layer brings little further improvement)
  3. Motivated by MoCo v2, a momentum encoder is added, though the gain is small because their batch size of 4096 already provides a large enough dictionary

SwAV

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

SwAV: Swapping Assignments between Views

It is recommended to read DeepCluster and related earlier work first.

[figure]

The approach on the left (contrasting instance features directly) is crude and resource-intensive

SwAV does not use negative samples; it relies on a prior in the form of cluster centers c (prototypes), which the features are compared against

D: feature dimension; K: number of cluster centers (3000)

z1 and z2 are first assigned to the prototypes c by a clustering step, producing codes q1 and q2 that serve as targets (ground truth)

Pretext task:

z1 and z2 should be similar and should be able to predict each other's codes: the dot product of z1 with c predicts q2, and the dot product of z2 with c predicts q1 (swapped prediction).
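A rough sketch of this swapped prediction objective. For simplicity, the Sinkhorn-Knopp step that SwAV uses to compute the codes q is replaced here by a plain sharpened softmax, and the prototypes are fixed random vectors instead of learned parameters, so this is only illustrative.

```python
import torch
import torch.nn.functional as F

K, D, tau = 3000, 128, 0.1                           # number of prototypes, feature dim, temperature
prototypes = F.normalize(torch.randn(K, D), dim=1)   # cluster centers c (learned in real SwAV)

def swav_loss(z1, z2):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    scores1, scores2 = z1 @ prototypes.t(), z2 @ prototypes.t()    # (B, K) similarities to c
    p1 = F.log_softmax(scores1 / tau, dim=1)                       # predictions of view 1
    p2 = F.log_softmax(scores2 / tau, dim=1)                       # predictions of view 2
    with torch.no_grad():                                          # targets ("codes") q1, q2
        q1 = F.softmax(scores1 / 0.05, dim=1)                      # stand-in for Sinkhorn-Knopp
        q2 = F.softmax(scores2 / 0.05, dim=1)
    # swapped prediction: the code of one view supervises the other view
    return -0.5 * ((q2 * p1).sum(1) + (q1 * p2).sum(1)).mean()
```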

Benefits of using clustering:

  • If you contrast against instance-level negatives, you need thousands of them, and even that is only an approximation; contrasting against cluster centers instead needs only a few hundred, or at most about 3000, centers on ImageNet.
  • Cluster centers have clear semantic meaning. Randomly sampled instance-level negatives, by contrast, run into problems such as accidentally sampling positives and class imbalance, so they are less effective than cluster centers.

[figure]

Important trick: multi-crop

Originally, two 224x224 crops are taken from a 256x256 image to learn global features

Multi-crop instead uses 2 crops of 160x160 plus 4 crops of 96x96, i.e. 6 views

CPC v2 (ICML 2020)

Data-Efficient Image Recognition with Contrastive Predictive Coding

  1. A larger model: CPC v1 only uses the first three residual stacks of ResNet-101, while CPC v2 deepens the model to ResNet-161 (ImageNet top-1 accuracy +5%) and raises the input patch resolution from 60x60 to 80x80 (+2%).
  2. Since CPC v1's prediction should depend only on certain patches, while BN leaks information from other patches (similar to what happens in image generation), BN hurts CPC v1's performance; replacing BN with LN adds another 2% ImageNet top-1 accuracy.
  3. Since large models overfit more easily, the authors make the self-supervised task harder: to predict a patch, CPC v2 uses feature vectors from all four directions (up, down, left, right), whereas CPC v1 only uses those from above. Because more context is exposed, extracting the semantics relevant to the patch below becomes harder. ImageNet top-1 accuracy +2.5%.
  4. Better data augmentation: first randomly keep two of the three RGB channels (+3% ImageNet top-1), then add geometric, color, elastic-deformation and other augmentations (+4.5%). Data augmentation clearly has a large impact on self-supervision.

[figure]

InfoMin (NeurIPS,2020)

What Makes for Good Views for Contrastive Learning

Mainly an analysis/extension paper: InfoMin stands for minimizing (superfluous) mutual information. The main point is that the right amount of mutual information matters, neither too much nor too little.

A new InfoMin principle is proposed: the learned representation should keep the information shared between views that is relevant to the downstream task and discard redundant, task-irrelevant information, so that it generalizes well.

The third stage: no negative samples

BYOL (2020): learning without negative samples

Bootstrap your own latent: A new approach to self-supervised Learning

[figure]

x -> two augmented views v and v'. The two branches have the same encoder architecture but different parameters: f_theta (the online encoder) is updated by gradients, while the lower branch is a momentum encoder. Each is followed by a projector; the online branch adds a predictor q_theta, and the prediction should be as similar as possible to the target branch's projection.

Features from one view predict the features of the other view; after training, only the encoder is kept.

Objective function: MSE loss (see the sketch below)
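A minimal sketch of the BYOL objective as described above: the online branch's prediction is matched to the detached target branch's projection with an MSE between L2-normalized vectors, symmetrized over the two views. The networks themselves are assumed to be defined elsewhere.

```python
import torch.nn.functional as F

def byol_loss(p_online, z_target):
    p = F.normalize(p_online, dim=1)            # prediction q_theta(.) from the online branch
    z = F.normalize(z_target.detach(), dim=1)   # target (momentum) branch: no gradient flows back
    return (2 - 2 * (p * z).sum(dim=1)).mean()  # MSE between unit vectors

# symmetrized over the two views v1, v2 (networks assumed defined elsewhere):
# loss = byol_loss(q(g(f_online(v1))), g_target(f_target(v2))) \
#      + byol_loss(q(g(f_online(v2))), g_target(f_target(v1)))
```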

[figure]

In response to the blog post claiming that BN provides BYOL with implicit negative samples:

[figure]

  1. With BN only in the projector, BYOL still fails to train, so BN alone does not explain its success.
  2. Without any normalization, SimCLR cannot be trained even though it has negative samples.

The BYOL authors argue that BN mainly stabilizes training, and that with better initialization BYOL can be trained without BN: using GN (group normalization) plus WS (weight standardization), this version of BYOL also learns well.

SimSiam (CVPR 2021, a summarizing work: no large batch size, no momentum encoder, no negative samples)

Exploring Simple Siamese Representation Learning

img

The two encoders share the same architecture and the same parameters.

[figure]

In the pseudocode the predictor is applied on both branches, producing p1 and p2 that predict z2 and z1 respectively, which looks inconsistent with the figure (where the predictor is drawn on one branch only).

D computes a negative cosine similarity, which is equivalent (up to scale) to an MSE between L2-normalized vectors.

The stop-gradient operation is crucial: training can be viewed as an EM-like alternating optimization that updates the parameters gradually and avoids collapse. A sketch follows.
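A sketch in the spirit of the paper's pseudocode, where f (encoder + projection MLP) and h (prediction MLP) are assumed to be defined elsewhere; D is the negative cosine similarity with a stop-gradient on z.

```python
import torch.nn.functional as F

def D(p, z):
    # negative cosine similarity with stop-gradient on the target z
    return -F.cosine_similarity(p, z.detach(), dim=1).mean()

def simsiam_loss(x1, x2, f, h):
    z1, z2 = f(x1), f(x2)      # encoder + projection MLP
    p1, p2 = h(z1), h(z2)      # prediction MLP applied to both branches
    return 0.5 * D(p1, z2) + 0.5 * D(p2, z1)
```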

All of these are Siamese (twin) networks:

[figure]

[figure]

You can see that without multi-crop, SwAV is not as good as MoCo v2

Barlow Twins (ICML 2021)

Barlow Twins: Self-Supervised Learning via Redundancy Reduction

There is neither contrasting nor prediction; essentially, it just uses a different objective function.

Specifically, it builds a cross-correlation matrix between the embeddings of the two views and pushes that matrix to be as close as possible to the identity matrix; see the sketch below.
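A minimal sketch of this objective: normalize each embedding dimension over the batch, form the cross-correlation matrix, and push its diagonal toward 1 and its off-diagonal entries toward 0. The trade-off weight lambda_ is an assumed typical value.

```python
import torch

def barlow_twins_loss(z1, z2, lambda_=5e-3):
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)           # normalize each feature dimension over the batch
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.t() @ z2 / n                          # (d, d) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()                 # diagonal -> 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()    # off-diagonal -> 0
    return on_diag + lambda_ * off_diag
```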

The fourth stage: Transformer-based

MoCo v3 (ICCV 2021)

An Empirical Study of Training Self-Supervised Vision Transformers

ViT training becomes unstable as the batch size increases, as shown below:

[figure]

Trick:

Randomly initialize the patch projection layer and then freeze it, i.e., the patch embedding stays at its random initialization throughout training. The same trick is also useful for BYOL.
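An illustrative snippet for this trick, assuming a ViT implementation (such as timm's) that exposes the patch projection as `patch_embed`; adapt the attribute name to whatever your model uses.

```python
import timm

# ViT backbone; `patch_embed` is the patch projection layer in timm's implementation
vit = timm.create_model('vit_base_patch16_224', pretrained=False)
for p in vit.patch_embed.parameters():
    p.requires_grad = False    # keep the randomly initialized patch projection frozen
```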

[figure]

DINO

Emerging Properties in Self-Supervised Vision Transformers

The teacher network's output is normalized by centering (subtracting the running mean) before the softmax

[figure]

The pseudocode is similar to MoCo v3's; the objective function adds a centering operation (see the sketch below).
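A rough sketch of the teacher-centering step plus the usual EMA update of the center; the output dimension, temperatures, and momentum are illustrative values, not necessarily DINO's exact defaults.

```python
import torch
import torch.nn.functional as F

out_dim = 65536                          # example output dimension
center = torch.zeros(1, out_dim)         # running center of the teacher outputs

def dino_loss(student_out, teacher_out, t_s=0.1, t_t=0.04, m=0.9):
    global center
    t = F.softmax((teacher_out - center) / t_t, dim=1).detach()   # center + sharpen the teacher
    s = F.log_softmax(student_out / t_s, dim=1)
    with torch.no_grad():                                          # EMA update of the center
        center = m * center + (1 - m) * teacher_out.mean(dim=0, keepdim=True)
    return -(t * s).sum(dim=1).mean()                              # cross-entropy teacher -> student
```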

[figure]

Next is MAE

Summary

[figure]

References:

1. Blog: https://www.bilibili.com/read/cv24218439?spm_id_from=333.999.0.0&jump_opus=1

2. Video: https://www.bilibili.com/video/BV19S4y1M7hm/?spm_id_from=333.999.0.0&vd_source=4e2df178682eb78a7ad1cc398e6e154d

3. Blog: https://blog.csdn.net/dhaiuda/article/details/117870030
