RevCol: A new paradigm for large model architecture design, adding a new dimension to neural network architectures!


Paper address: https://arxiv.org/pdf/2212.11696.pdf

Project code: https://github.com/megvii-research/RevCol


A new neural network design paradigm, the Reversible Column Network (RevCol), is proposed. The body of RevCol is composed of multiple copies of a sub-network, each called a column, with multi-level reversible connections between adjacent columns.


01

Overview

Such an architectural scheme makes RevCol behave very differently from traditional networks: during forward propagation, features in RevCol are gradually disentangled as they pass through each column, and their total information is preserved rather than compressed or discarded as in other networks.

Experiments show that CNN-style RevCol models achieve very competitive performance on multiple computer vision tasks such as image classification, object detection, and semantic segmentation, especially with large parameter budgets and large datasets. For example, RevCol-XL reaches 88.2% top-1 accuracy on ImageNet-1K after ImageNet-22K pre-training. Given more pre-training data, the largest model, RevCol-H, reaches 90.0% on ImageNet-1K, 63.8% box AP on the COCO detection minival set, and 61.0% mIoU on ADE20K segmentation.

To the best of our knowledge, these are the best COCO detection and ADE20K segmentation results among pure (static) CNN models. Moreover, as a general macro-architecture scheme, RevCol can also be applied to Transformers or other neural networks, and has been shown to improve performance on both computer vision and NLP tasks.


02

Background & Motivation

The Information Bottleneck (IB) principle rules the deep learning world. Consider a typical supervised learning network, as shown in Figure 1(a) below: layers close to the input contain more low-level information, while features close to the output are rich in semantics.

[Figure 1: (a) a typical supervised network governed by the information bottleneck; (b) the proposed RevCol, in which multiple columns are linked by reversible connections.]

In other words, target-irrelevant information is gradually compressed during layer-by-layer propagation. Although this learning paradigm has achieved great success in many practical applications, it may not be the best choice from the perspective of feature learning: if the learned features are over-compressed, or the learned semantics are irrelevant to the target task, downstream tasks may perform poorly, especially when there is a significant domain gap between the source and target tasks. Researchers have made significant efforts to make learned features more generally applicable, for example through self-supervised pre-training or multi-task learning.

In today's sharing, the researchers focus on another approach: building a network that learns disentangled representations. Unlike IB learning, disentangled feature learning does not aim to extract the most relevant information while discarding the rest; instead, it aims to embed task-related concepts or semantics into separate, decoupled dimensions, while the feature vector as a whole retains roughly as much information as the input. This is very similar to the mechanism in biological cells, where every cell shares the same copy of the whole genome but differs in expression intensity. It is therefore also reasonable to learn disentangled features in computer vision: for example, tuning high-level semantic representations during ImageNet pre-training, while keeping the low-level information required by downstream tasks such as object detection (e.g., edge locations) in other feature dimensions.

Figure 1(b) above outlines the main idea of RevCol, which is heavily inspired by the big picture of GLOM. The network consists of N sub-networks (named columns) with identical structure (but not necessarily shared weights); each receives a copy of the input and produces a prediction. Multi-level embeddings, from low-level to highly semantic representations, are thus stored in each column. Furthermore, a reversible transformation is introduced to propagate the multi-level features from column i to column i+1 without information loss. During propagation, the quality of all feature levels is expected to improve gradually thanks to the growing depth and non-linearity, so the last column (column N in Figure 1(b)) predicts the final disentangled representation of the input.


03

New Framework

Next, we introduce the design details of RevCol. Figure 1(b) above illustrates the top-level architecture. Note that, for simplicity, each column in RevCol directly reuses an existing structure such as ConvNeXt, so in the following we mainly focus on how to build the reversible connections between columns. In addition, plug-and-play intermediate supervision is introduced on top of each column, which further improves training convergence and feature quality.
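The exact compound loss used for this intermediate supervision is more involved than what fits here; as a minimal, assumed sketch (not the authors' exact formulation), one can attach an auxiliary classifier to each column's top-level feature and sum per-column cross-entropy terms with growing weights, so that later columns are supervised more strongly. All names below (`intermediate_supervision_loss`, the weight values) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def intermediate_supervision_loss(column_feats, labels, heads, weights):
    """Sum of weighted per-column cross-entropy losses (a simplification
    of the paper's compound intermediate-supervision loss)."""
    total = 0.0
    for feat, head, w in zip(column_feats, heads, weights):
        logits = head(feat.mean(dim=(2, 3)))   # global average pooling
        total = total + w * F.cross_entropy(logits, labels)
    return total

# usage: 4 columns, 8-channel top-level features, 10 classes
heads = nn.ModuleList(nn.Linear(8, 10) for _ in range(4))
feats = [torch.randn(2, 8, 4, 4) for _ in range(4)]
labels = torch.randint(0, 10, (2,))
loss = intermediate_supervision_loss(feats, labels, heads, [0.25, 0.5, 0.75, 1.0])
```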

MULTI-LEVEL REVERSIBLE UNIT

In the proposed network, reversible transformations play the key role of disentangling features without information loss; the insight comes from reversible neural networks. We first review RevNet, a representative family of reversible networks. As shown in Figure 2(a) below, RevNet first partitions the input $x$ into two groups, $x_0$ and $x_1$.

[Figure 2(a): the reversible block design of RevNet.]

Then a later block, say block $t$, takes the outputs $x_{t-1}$ and $x_{t-2}$ of the two preceding blocks as input and produces its output $x_t$. The mapping of block $t$ is reversible, i.e. $x_{t-2}$ can be reconstructed from the two later activations $x_{t-1}$ and $x_t$. Formally, the forward and inverse computations follow:

$$x_t = F_t(x_{t-1}) + \gamma x_{t-2}, \qquad x_{t-2} = \gamma^{-1}\left[x_t - F_t(x_{t-1})\right],$$

where $\gamma$ is a simple invertible operation (e.g., a channel-wise scale).
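To make this pair of computations concrete, here is a minimal PyTorch sketch (all names are illustrative: `f` stands in for the residual branch $F_t$, `gamma` for the invertible scale). The check at the end confirms that the earlier stream is reconstructed up to floating-point error.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Minimal sketch of the two-stream reversible rule above."""
    def __init__(self, f: nn.Module, gamma: float = 1.0):
        super().__init__()
        self.f = f          # plays the role of F_t
        self.gamma = gamma  # simple invertible scale

    def forward(self, x_prev1, x_prev2):
        # x_t = F_t(x_{t-1}) + gamma * x_{t-2}
        return self.f(x_prev1) + self.gamma * x_prev2

    def inverse(self, x_t, x_prev1):
        # x_{t-2} = gamma^{-1} [x_t - F_t(x_{t-1})]
        return (x_t - self.f(x_prev1)) / self.gamma

# sanity check: x_{t-2} is recovered exactly (up to float error)
blk = ReversibleBlock(nn.Conv2d(8, 8, 3, padding=1), gamma=0.5)
x0, x1 = torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16)
x2 = blk(x1, x0)                       # forward: block t
assert torch.allclose(blk.inverse(x2, x1), x0, atol=1e-5)
```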

To extend this two-stream design to multiple feature levels, the equation above can be generalized to the following form:

$$x_t = F_t(x_{t-1}, x_{t-2}, \dots, x_{t-m+1}) + \gamma x_{t-m}, \qquad x_{t-m} = \gamma^{-1}\left[x_t - F_t(x_{t-1}, x_{t-2}, \dots, x_{t-m+1})\right].$$

This computation can then be reorganized into a multi-column form, as shown in Figure 2(b) below: each column holds a group of m feature maps together with the sub-network that produces them. We name it the multi-level reversible unit; it is the fundamental building block of RevCol.

[Figure 2(b): the multi-level reversible unit formed by reorganizing the computation into columns.]
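Below is a toy rendering of that column view, under simplifying assumptions (one shared resolution for all levels, each $F_t$ a single convolution, `ToyColumn` an invented name; the real RevCol uses multi-resolution features and fusion units). It demonstrates the practical payoff of reversibility: the previous column's features can be reconstructed exactly from the current column's outputs, so earlier columns' activations need not be kept in memory during training.

```python
import torch
import torch.nn as nn

LEVELS = 4  # m feature levels per column

class ToyColumn(nn.Module):
    """Toy column: level l computes o_l = F_l(o_{l-1}) + gamma * p_l, where
    p_l is level l of the previous column (the reversible carry) and o_{-1}
    is the column's input. Each F_l is a single conv here for brevity."""
    def __init__(self, dim: int, gamma: float = 0.5):
        super().__init__()
        self.fs = nn.ModuleList(nn.Conv2d(dim, dim, 3, padding=1)
                                for _ in range(LEVELS))
        self.gamma = gamma

    def forward(self, x_in, prev):
        outs, h = [], x_in
        for f, p in zip(self.fs, prev):
            h = f(h) + self.gamma * p
            outs.append(h)
        return outs

    def reconstruct_prev(self, x_in, outs):
        # invert level by level: p_l = (o_l - F_l(o_{l-1})) / gamma
        prev, h = [], x_in
        for f, o in zip(self.fs, outs):
            prev.append((o - f(h)) / self.gamma)
            h = o
        return prev

col = ToyColumn(dim=8)
x = torch.randn(1, 8, 16, 16)
prev = [torch.randn(1, 8, 16, 16) for _ in range(LEVELS)]
outs = col(x, prev)
rec = col.reconstruct_prev(x, outs)
assert all(torch.allclose(r, p, atol=1e-5) for r, p in zip(rec, prev))
```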

REVERSIBLE COLUMN ARCHITECTURE

  • Macro design

[Figure 2(c): the overall RevCol architecture.]

Figure 2c above illustrates the framework design. Following the common practice of recent models, the input image is first segmented into non-overlapping patches by a patch embedding module. Then, the patches are fed into each sub-network (column). Columns can be implemented with any traditional single-column architecture, such as ViT or ConvNeXt. Four-level feature maps are extracted from each column to propagate information across columns; for example, if the columns are implemented with widely used hierarchical networks, multi-resolution features can simply be extracted from the output of each stage.

For classification tasks, only the feature maps of the last level (level 4) in the last column are used to obtain rich semantic information.

For other downstream tasks such as object detection and semantic segmentation, all four levels of feature maps from the last column are used, since they contain both low-level and semantic information.
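Putting the pieces together, here is a macro-level sketch, not the official implementation, reusing `ToyColumn` and `LEVELS` from the snippet above; feeding zeros as the first column's reversible carry is an assumption made for brevity.

```python
import torch
import torch.nn as nn

class RevColSketch(nn.Module):
    """Macro-level sketch: patch embedding -> N columns, each column also
    consuming the previous column's level features via the reversible unit.
    Reuses ToyColumn / LEVELS from the snippet above."""
    def __init__(self, dim: int = 8, num_columns: int = 4, num_classes: int = 1000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=4, stride=4)
        self.columns = nn.ModuleList(ToyColumn(dim) for _ in range(num_columns))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img):
        x = self.patch_embed(img)
        # assumption: zeros as the first column's reversible carry
        levels = [torch.zeros_like(x) for _ in range(LEVELS)]
        for col in self.columns:
            levels = col(x, levels)
        # classification reads only the top level of the last column
        return self.head(levels[-1].mean(dim=(2, 3)))

model = RevColSketch()
logits = model(torch.randn(2, 3, 64, 64))   # -> shape (2, 1000)
```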

  • Micro design

[Figure: the micro design of one level, a Fusion unit followed by ConvNeXt blocks.]

Within each level, a Fusion unit first adjusts inputs of different sizes to a common shape, and the result then passes through a stack of ConvNeXt blocks to produce the output. Together these implement the $F_t(\cdot)$ in the equations above; the reversible term (the scaled feature carried over from the previous column) is then added to obtain the final result.

It is worth noting that the 7×7 kernel of the original ConvNeXt block is replaced with 3×3: large kernels bring limited benefit in RevCol, while small kernels are much faster.
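A sketch of this micro design under stated assumptions: `FusionUnit` and `ConvNeXtBlock3x3` are invented names, GroupNorm stands in for LayerNorm, layer scale and drop path are omitted, and the combination of the two resized inputs is abstracted into a simple sum.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock3x3(nn.Module):
    """ConvNeXt-style block with the depthwise kernel shrunk to 3x3
    (channels-first; GroupNorm stands in for LayerNorm, no layer scale
    or drop path)."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm = nn.GroupNorm(1, dim)
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, 1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv2d(4 * dim, dim, 1)

    def forward(self, x):
        return x + self.pwconv2(self.act(self.pwconv1(self.norm(self.dwconv(x)))))

class FusionUnit(nn.Module):
    """Resize two inputs of different shapes to this level's own shape:
    down-sample the lower-level feature, up-sample the higher-level one,
    then combine (a simple sum here)."""
    def __init__(self, dim_down: int, dim_up: int, dim_out: int):
        super().__init__()
        self.down = nn.Conv2d(dim_down, dim_out, 2, stride=2)
        self.up = nn.Sequential(nn.Conv2d(dim_up, dim_out, 1),
                                nn.Upsample(scale_factor=2, mode="nearest"))

    def forward(self, x_lower, x_higher):
        return self.down(x_lower) + self.up(x_higher)

# usage: fuse, then run a small stack of 3x3 ConvNeXt blocks (the F_t)
fuse = FusionUnit(dim_down=8, dim_up=32, dim_out=16)
x_lower = torch.randn(1, 8, 32, 32)    # e.g. level l-1 of the current column
x_higher = torch.randn(1, 32, 8, 8)    # e.g. level l+1 of the previous column
blocks = nn.Sequential(*[ConvNeXtBlock3x3(16) for _ in range(2)])
out = blocks(fuse(x_lower, x_higher))  # -> (1, 16, 16, 16)
```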


04

Experiments

[Table: ImageNet classification results.]

Beyond the 2B-parameter model, the authors also collected a private dataset of 168 million images (Megdata-168M) with weak labels for pre-training. The XL model (800M parameters) reaches 88.2% with ImageNet-22K pre-training, rising to 89.4% after Megdata-168M pre-training. The Huge model, pre-trained at 224×224 and fine-tuned at 640×640, reaches 90.0% top-1 accuracy. Training cost for this model: pre-training totals 1600 ImageNet epochs; a single run takes 14 days on 80 A100 GPUs.

[Tables: downstream COCO detection and ADE20K segmentation results.]


