Deep learning pre-training and MMPretrain
Introduction to the MMPretrain algorithm library
MMPretrain is a newly upgraded, open-source pre-training framework that provides a variety of powerful pre-trained backbone networks and supports different pre-training strategies. MMPretrain grew out of the well-known open-source projects MMClassification and MMSelfSup and adds many exciting new features. The pre-training stage is crucial for visual recognition today: with rich and powerful pre-trained models, we can improve a wide range of downstream vision tasks.
Our codebase is designed to be an easy-to-use, user-friendly repository that simplifies both academic research and engineering work. The features and design of MMPretrain are detailed in the sections below.
Code repository: https://github.com/open-mmlab/mmpretrain
Documentation: https://mmpretrain.readthedocs.io/en/latest/
Supports out-of-the-box inference APIs and models, covering a rich set of tasks (see the usage sketch after this list):
- image classification
- image captioning
- visual question answering
- visual grounding
- image retrieval
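As a quick sketch of the unified inference API: `list_models` and `inference_model` are the documented entry points, but the captioning model name, task strings, and demo images below are examples and may differ across versions.

from mmpretrain import list_models, inference_model

# Discover available models for a task (task names follow the documentation)
print(list_models(task='Image Classification')[:5])

# One call covers different tasks: the inferencer is picked from the model name
result = inference_model('resnet18_8xb32_in1k', 'demo/demo.JPEG')
print(result['pred_class'])

# Image captioning needs the multimodal dependencies (see Installation below);
# this model name is an example and may vary by version
caption = inference_model('blip-base_3rdparty_caption', 'demo/cat-dog.png')
print(caption['pred_caption'])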
Install
pip install -U openmim
git clone https://github.com/open-mmlab/mmpretrain.git
cd mmpretrain
mim install -e .
Multimodal installation
# Install from source
mim install -e ".[multimodal]"
# Install as a Python package
mim install "mmpretrain[multimodal]>=1.0.0rc8"
Verify the installation
from mmpretrain import get_model, inference_model
model = get_model('resnet18_8xb32_in1k', device='cpu')  # or device='cuda:0'
inference_model(model, 'demo/demo.JPEG')
Code framework
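At a high level, the repository is organized roughly as follows (a simplified sketch; see the repository itself for the full layout):

mmpretrain/
├── configs/        # config files for models and training recipes
├── mmpretrain/
│   ├── apis/       # inference APIs (get_model, inference_model, ...)
│   ├── models/     # backbones, necks, heads, losses, self-supervised algorithms
│   ├── datasets/   # dataset classes and data transforms
│   └── engine/     # training hooks, optimizers, schedulers
└── tools/          # train.py, test.py, and other scripts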
Classic backbone networks
Basic idea of residual learning
Two types of residual modules
ResNet (2015)
Based on VGG's design
Keeps the multi-stage organization while increasing the number of layers
Adds cross-layer (skip) connections
ResNet-34 (34 layers): ImageNet Top-5 accuracy 94.4%
5 stages, each containing several residual modules; different residual module configurations yield different ResNet variants
At each stage, the output resolution is halved and the number of channels is doubled
Global average pooling compresses spatial dimensions
A single fully connected layer produces class probabilities
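As an illustration, a basic residual block (the two-3×3-convolution variant used in ResNet-18/34) can be sketched in PyTorch as follows. This is a minimal sketch, not MMPretrain's exact implementation, and it omits the strided projection shortcut used when the resolution or channel count changes:

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """y = F(x) + x: the block only needs to learn the residual F(x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                             # the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)         # add the shortcut, then activate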
ResNet's achievements and influence
One of the most influential and widely used model architectures in deep learning; won the CVPR 2016 Best Paper Award
The residual structure remains ubiquitous today: it appears in the various vision Transformers, in convolutional networks such as ConvNeXt, and in the recently popular GPT and other large language models.
Vision Transformer
The image is divided into small 16×16 patches, and the patches are arranged into a sequence of token vectors. After a linear-layer projection, an image of shape [H, W, C] becomes a sequence of shape [L, C], which then passes through a multi-layer Transformer encoder to produce the corresponding feature vectors.
An additional token is added alongside the patch tokens; it queries the features of the other patches through attention and produces the final classification.
The attention module has a global (whole-image) receptive field, and its complexity grows as the 4th power of the image size (i.e., quadratic in the number of tokens).
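A sketch of the patch-embedding step described above, assuming PyTorch; a convolution with kernel size and stride 16 is the standard way to split the image into patches and linearly project them in one operation:

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into 16x16 patches, project each patch, prepend [CLS]."""
    def __init__(self, in_chans=3, embed_dim=768, patch=16):
        super().__init__()
        # kernel = stride = patch size: one linear projection per patch
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x):                     # x: [B, C, H, W]
        x = self.proj(x)                      # [B, D, H/16, W/16]
        x = x.flatten(2).transpose(1, 2)      # [B, L, D] with L = (H/16)*(W/16)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1)     # the extra classification token

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # [1, 197, 768]: 14*14 patch tokens + 1 [CLS] token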
Common Types of Self-Supervised Learning
- Based on various pretext tasks
- Contrastive learning
- Masked image modeling
SimCLR
Basic assumption: if the model extracts the essence of the image content well, then the extracted features should be very similar no matter what data augmentation is applied to the image.
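A minimal sketch of the contrastive (NT-Xent / InfoNCE) objective behind this assumption, assuming PyTorch; z1 and z2 are the features of two differently augmented views of the same batch of images:

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """Pull the two views of each image together; push all other images away."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # [2N, D], unit-length features
    sim = z @ z.t() / temperature                 # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))             # a sample is not its own pair
    # the positive for row i is its other view, at index (i + n) mod 2n
    targets = torch.arange(2 * n, device=z.device).roll(n)
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))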
Masked Autoencoders (MAE)
Basic assumption: the model can recover randomly masked content in an image only if it understands the image content and has captured the image's contextual information.
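A sketch of MAE-style random masking, assuming PyTorch; only the visible patch tokens are fed to the encoder, and a decoder is later asked to reconstruct the masked patches:

import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random 25% of the patch tokens and drop the rest."""
    B, L, D = tokens.shape
    num_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(B, L)                       # one random score per patch
    ids_keep = noise.argsort(dim=1)[:, :num_keep]  # lowest-scoring patches survive
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_keep                       # ids let the decoder restore order

visible, ids = random_masking(torch.randn(2, 196, 768))
print(visible.shape)  # [2, 49, 768]: 25% of the 196 patch tokens are kept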