A detailed interpretation of the MobileViT v3 paper (translation + intensive reading)

Preface 

Today we read the MobileViT v3 paper, "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features". The experimental section of this paper is especially well written and worth learning from.

Original paper:   https://arxiv.org/abs/2209.15159

Source code: https://github.com/micronDLA/MobileViTv3

Table of contents

Preface 

ABSTRACT—Summary

1. INTRODUCTION—Introduction

2. RELATED WORK—Related work

3. NEW MOBILEVIT ARCHITECTURE—New MobileViT architecture

3.1 MOBILEVITV3 BLOCK—MobileViTV3 module

(1) Replace the 3x3 convolutional layer with a 1x1 convolutional layer in the fusion block

(2) Local and global feature fusion

(3) Fusion of input end features

(4) Use deep convolutional layers in local representation blocks

3.2 SCALING UP BUILDING BLOCKS—Scaling up the building blocks

4. EXPERIMENTAL RESULTS—Experimental results

4.1 IMAGE CLASSIFICATION ON IMAGENET1K—Image classification on IMAGENET1K

4.1.1 IMPLEMENTATION DETAILS—Experimental details

4.1.2 COMPARISON WITH MOBILEVITS—Comparison with previous MobileViT versions

4.1.3 COMPARISON WITH VITS—Comparison with ViT series

4.1.4 COMPARISON WITH CNNS—Comparison with CNN series

4.2 SEGMENTATION—Segmentation

4.2.1 IMPLEMENTATION DETAILS—Experimental details

4.2.2 RESULTS—Results

4.3 OBJECT DETECTION—Object detection

4.3.1 IMPLEMENTATION DETAILS—Experimental details

4.3.2 RESULTS—Results

4.4 IMPROVING LATENCY AND THROUGHPUT—Improving latency and throughput

4.4.1 IMPLEMENTATION DETAILS—Experimental details

4.4.2 RESULTS—Results

4.5 ABLATION STUDY OF OUR PROPOSED MOBILEVITV3 BLOCK—Ablation experiment

4.5.1 IMPLEMENTATION DETAILS—Experimental details

4.5.2 WITH 100 TRAINING EPOCHS—Results with 100 training epochs

4.5.3 WITH 300 TRAINING EPOCHS—Results with 300 training epochs

5. DISCUSSION AND LIMITATIONS—Discussion and limitations

 

ABSTRACT—Summary

translate

MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create lightweight models for mobile vision tasks. While the main MobileViTv1 block helps achieve competitive, state-of-the-art results, the fusion block inside it creates scaling challenges and a complex learning task. We propose simple and effective changes to the fusion block to create the MobileViTv3 block, which addresses the scaling issue and simplifies the learning task. Our proposed MobileViTv3 block is used to create the MobileViTv3-XXS, XS and S models, which outperform MobileViTv1 on the ImageNet-1K, ADE20K, COCO and PascalVOC 2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS outperform MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively. The recently released MobileViTv2 architecture removes the fusion block and uses linear-complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create the MobileViTv3-0.5, 0.75 and 1.0 models. Compared to MobileViTv2, these new models give better accuracy on the ImageNet-1K, ADE20K, COCO and PascalVOC 2012 datasets. MobileViTv3-0.5 and MobileViTv3-0.75 outperform MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0% respectively on ImageNet-1K. For segmentation, the mIOU of MobileViTv3-1.0 on the ADE20K and PascalVOC 2012 datasets is 2.07% and 1.1% higher than MobileViTv2-1.0 respectively. Our code and trained models are available at https://github.com/micronDLA/MobileViTv3.


intensive reading

main content:

(1) Shortcoming of MobileViTv1: the fusion block inside the MobileViT block creates scaling challenges and a complex learning task

(2) Improvement of MobileViTv2 over v1: the fusion block is removed and a linear-complexity transformer is used

(3) Main idea of MobileViTv3: propose a simpler and more effective fusion block and add it to v2

(4) Main results: the MobileViTv3-XXS, XS and S models and the MobileViTv3-0.5, 0.75 and 1.0 models are proposed

(5) Improved effect: better performance on the ImageNet-1K, ADE20K, COCO and PascalVOC 2012 datasets


1. INTRODUCTION—Introduction

translate

       Convolutional neural networks (CNNs) such as ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and EfficientNet (Tan & Le, 2019) are widely used in vision tasks such as classification, detection, and segmentation because of their strong performance on established benchmark datasets such as ImageNet (Russakovsky et al., 2015), COCO (Lin et al., 2014), PascalVOC (Everingham et al., 2015), and ADE20K (Zhou et al., 2017). When CNNs are deployed on edge devices such as mobile phones, which are often resource constrained, the lightweight CNNs suited to such environments come from the MobileNet family (MobileNetv1, MobileNetv2, MobileNetv3) (Howard et al., 2019), ShuffleNets (ShuffleNetv1 and ShuffleNetv2) (Ma et al., 2018), and lightweight versions of EfficientNet (EfficientNet-B0 and EfficientNet-B1) (Tan & Le, 2019). These lightweight models fall short in accuracy compared to models with large parameter counts and FLOPs.

       Recently, vision transformers (ViTs) have emerged as powerful alternatives to CNNs for these vision tasks. Due to its architectural design, a CNN interacts with local neighboring pixels/features to produce feature maps that embed local information. In contrast, the self-attention mechanism in ViTs interacts with all parts of the image/feature map to produce features with global information embedded in them. This has been shown to produce results comparable to CNNs, but at the cost of large amounts of pre-training data and advanced data augmentation (Dosovitskiy et al., 2020). Furthermore, this global processing matches the performance of CNNs at the expense of large parameter counts and FLOPs, as seen in ViT (Dosovitskiy et al., 2020) and its variants such as DeiT (Touvron et al., 2021), SwinT (Liu et al., 2021), MViT (Fan et al., 2021), Focal-ViT (Yang et al., 2021), PVT (Wang et al., 2021), T2T-ViT (Yuan et al., 2021b), and XCiT (Ali et al., 2021). Xiao et al. (2021) showed that ViT suffers from high sensitivity to hyperparameters such as the choice of optimizer, learning rate and weight decay, as well as from slow convergence. To address these issues, Xiao et al. (2021) propose introducing convolutional layers into ViT.

        Many recent works introduce convolutional layers into the ViT architecture to form hybrid networks that improve performance, achieve sample efficiency, and make models more efficient in terms of parameters and FLOPs, such as MobileViT (MobileViTv1 (Mehta & Rastegari, 2021), MobileViTv2 (Mehta & Rastegari, 2022)), CMT (Guo et al., 2022), CvT (Wu et al., 2021), PVTv2 (Wang et al., 2022), ResT (Zhang & Yang, 2021), MobileFormer (Chen et al., 2022), CPVT (Chu et al., 2021), MiniViT (Zhang et al., 2022), CoAtNet (Dai et al., 2021), and CoaT (Xu et al., 2021a). The performance of many of these models on ImageNet-1K is shown in Figure 1, along with their parameters and FLOPs. Among these models, only MobileViT and MobileFormer are specifically designed for resource-constrained environments such as mobile devices. These two models achieve competitive performance with fewer parameters and FLOPs than the other hybrid networks. Although such small hybrid models are crucial for on-device mobile vision tasks, little work has been done in this area.

       Our work focuses on improving the lightweight family of models known as MobileViTs (MobileViTv1 (Mehta & Rastegari, 2021) and MobileViTv2 (Mehta & Rastegari, 2022)). Compared to models with parameter budgets of 6 million (M) or less, MobileViTs achieve competitive, state-of-the-art results on classification tasks with a simple training recipe (basic data augmentation), and they can also serve as efficient backbones across different vision tasks such as detection and segmentation. While focusing only on models with 6M or fewer parameters, we asked the following question: is it possible to change the model architecture to improve its performance while keeping similar parameters and FLOPs? To this end, we study the challenges of the MobileViT block architecture and propose simple and effective methods to fuse input, local (CNN) and global (ViT) features, which significantly improves performance on the ImageNet-1K, ADE20K, PascalVOC and COCO datasets.

       We propose four main changes with respect to the MobileViTv1 block (three of which apply to the MobileViTv2 block), as shown in Figure 2. There are three changes in the fusion block. First, the 3x3 convolutional layer is replaced with a 1x1 convolutional layer. Second, the features of the local and global representation blocks are fused together, instead of the input and global representation features. Third, the input features are added to the output of the fusion block as a final step before generating the output of the MobileViT block. The fourth change is in the local representation block, where the normal 3x3 convolutional layer is replaced by a depthwise 3x3 convolutional layer. These changes reduce the parameters and FLOPs of the MobileViTv1 block and allow scaling (increasing the width of the model) to create the new MobileViTv3-S, XS and XXS architectures, which outperform MobileViTv1 on classification (Figure 1), segmentation and detection tasks. For example, MobileViTv3-XXS and MobileViTv3-XS outperform MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively on the ImageNet-1K dataset with similar parameters and FLOPs. MobileViTv2 has no fusion block; our proposed fusion block is introduced into the MobileViTv2 architecture to create the MobileViTv3-1.0, 0.75 and 0.5 architectures. MobileViTv3-0.5 and MobileViTv3-0.75 are 2.1% and 1.0% more accurate than MobileViTv2-0.5 and MobileViTv2-0.75 respectively, with similar parameters and FLOPs on the ImageNet-1K dataset.


intensive reading

Insufficiency of previous research:

(1) Lightweight CNNs, such as the MobileNetv1-v3, ShuffleNet and EfficientNet models, have insufficient accuracy

(2) ViT is highly sensitive to hyperparameters (such as the choice of optimizer, learning rate and weight decay) and converges slowly

The main purpose of this article:

Improve the previous MobileViT series to make it more accurate with fewer parameters

This article mainly improves:

First, the 3x3 convolutional layer in the fusion block is replaced with a 1x1 convolutional layer

Secondly, the features of local and global representation blocks are fused together instead of the input and global representation blocks

Third, input features are added to the fusion block as a final step before generating the output of the MobileViT block

Fourth, the normal 3x3 convolutional layer in the local representation block is replaced by a depthwise 3x3 convolutional layer

Improved results:

Fewer parameters and FLOPs than v1; similar parameters and FLOPs to v2 (actually slightly more than v2)

Better performance on classification (ImageNet-1K), segmentation and detection tasks


2. RELATED WORK—Related work

translate

ViT: ViT introduced the Transformer model from natural language processing into the vision domain, especially for image recognition. Later, variants such as DeiT (Touvron et al., 2021) improved performance by introducing new training techniques and reducing the reliance on large amounts of pre-training data. Works focused on improving the self-attention mechanism to boost performance include XCiT (Ali et al., 2021), SwinT (Liu et al., 2021), ViL (Zhang et al., 2021), and Focal-transformer (Yang et al., 2021). XCiT introduces cross-covariance attention, where self-attention operates on feature channels instead of tokens and the interaction is based on the cross-covariance matrix between keys and queries. SwinT modifies ViT into a general architecture that can be used for various vision tasks such as classification, detection and segmentation. This is achieved by replacing self-attention with shifted-window-based self-attention, which allows the model to adapt to different input image scales and to do so efficiently, with computational complexity linear in the input image size. ViL improves on ViT by encoding images at multiple scales and using a self-attention mechanism that is a variant of Longformer (Beltagy et al., 2020). Recent works such as T2T-ViT (Yuan et al., 2021b) and PVT (PVTv1) (Wang et al., 2021) also focus on introducing CNN-like hierarchical feature learning by reducing the spatial resolution or token count after each layer. T2T-ViT proposes a layer-wise token-to-token transformation, where adjacent tokens are recursively aggregated into one token to capture local structure and reduce token length. PVT is a pyramid vision transformer that continuously reduces feature map resolution, reducing computational complexity and achieving competitive results on ImageNet-1K. A few works, such as CrossViT (Chen et al., 2021), MViT (Fan et al., 2021), MViTv2 (Li et al., 2022) and Focal-transformer (Yang et al., 2021), learn both local features (learned specifically from neighboring pixels/features/patches) and global features (learned using all pixels/features/patches). The Focal-transformer replaces self-attention with focal self-attention, where each token attends to its nearest surrounding tokens at a fine-grained level and to distant tokens at a coarse-grained level, capturing both short- and long-range visual dependencies. CrossViT processes small-patch and large-patch tokens separately and fuses them through multiple attention layers so that they complement each other. Designed for both video and image recognition, MViT learns multi-scale pyramid features, where early layers capture low-level visual information and deeper layers capture complex, high-dimensional features. MViTv2 further improves MViT by incorporating positional embeddings and residual pooling connections in its architecture.

CNN: ResNet (He et al., 2016) is one of the most widely used general-purpose architectures for vision tasks such as classification, segmentation, and detection. Thanks to its residual connections, the ResNet architecture helps optimize deeper layers, allowing the construction of deep CNNs that achieve state-of-the-art results on various benchmarks. DenseNet (Huang et al., 2017) is inspired by ResNet and uses skip connections to connect each layer to every subsequent layer in a feed-forward manner. Other CNNs such as ConvNeXt (Liu et al., 2022), RegNetY (Radosavovic et al., 2020), SqueezeNet (Iandola et al., 2016), and Inception-v3 (Szegedy et al., 2016) also achieve competitive, state-of-the-art performance. However, the best-performing CNN models are usually heavy in terms of parameters and FLOPs. Lightweight CNNs that achieve competitive performance with fewer parameters and FLOPs include EfficientNet (Tan & Le, 2019), MobileNetv3 (Howard et al., 2019), ShuffleNetv2 (Ma et al., 2018), and ESPNetv2 (Mehta et al., 2019). EfficientNet studied model scaling and developed a family of EfficientNet models that remain among the most efficient CNNs in terms of parameters and FLOPs. MobileNetv3 belongs to the category of models developed specifically for resource-constrained environments such as mobile phones; its building blocks use MobileNetv2 (Sandler et al., 2018) blocks together with Squeeze-and-Excite (Hu et al., 2018) modules. ShuffleNetv2 studies and proposes guidelines for efficient model design and produces the ShuffleNetv2 family of models, which are also competitive with other lightweight CNNs. ESPNetv2 uses depthwise dilated separable convolutions to create EESP (Extremely Efficient Spatial Pyramid) units, which help reduce parameters and FLOPs while achieving competitive results.

Hybrids: Recently, many models have been proposed that combine CNNs and ViTs in one architecture, using ViT's self-attention mechanism to capture long-range dependencies and CNN's local kernels to capture local information, in order to improve performance on vision tasks. MobileViT (MobileViTv1, MobileViTv2) (Mehta & Rastegari, 2021) and MobileFormer (Chen et al., 2022) were specifically designed for constrained environments like mobile devices. MobileViTv1 and MobileViTv2 achieve state-of-the-art results compared to models with parameter budgets of 6M or less. The MobileFormer architecture combines MobileNetv3 and ViT and also achieves competitive results. The CMT (Guo et al., 2022) architecture has a convolutional stem, convolutional layers before each Transformer block, and alternately stacks convolutional and transformer layers. CvT (Wu et al., 2021) uses convolutional token embeddings instead of the linear embeddings used in ViT, as well as convolutional Transformer blocks that exploit these embeddings to improve performance. PVTv2 (Wang et al., 2022) uses convolutional feed-forward layers, overlapping patch embeddings and linear-complexity attention layers in the transformer to improve over PVT. ResT (Zhang & Yang, 2021) uses depthwise convolution in self-attention (for memory efficiency) and builds patch embeddings as a stack of overlapping strided convolutions on the token map. CoAtNet (Dai et al., 2021) uses simple relative attention to unify depthwise convolution and self-attention, and stacks convolutional and attention layers vertically. PiT's (Heo et al., 2021) pooling layer uses depthwise convolution for spatial reduction to improve performance. LVT (Yang et al., 2022) introduces convolutional self-attention, where local self-attention is placed inside the convolution kernel, and recursive self-attention to include multi-scale context, improving performance. ViTAE (Xu et al., 2021b) has convolutional layers in parallel with the multi-head self-attention module, with the two fused and fed into the feed-forward network; it also uses convolutional layers to embed the input into tokens. CeiT (Yuan et al., 2021a) introduces a locally-enhanced feed-forward layer using depthwise convolution, with additional changes, to achieve competitive results. RVT (Mao et al., 2022) uses a convolutional stem to generate patch embeddings and a convolutional feed-forward network in the Transformer to achieve better results.


intensive reading

  • ViT
  • CNN
  • Hybrids of the two: the MobileViT series, MobileFormer, CvT, PVT, ResT, CoAtNet, LVT, ViTAE, CeiT, RVT, etc.

I won’t go into detail about these~


3. NEW MOBILEVIT ARCHITECTURE—New MobileViT architecture

3.1 MOBILEVITV3 BLOCK—MobileViTV3 module

(1) Replace the 3x3 convolutional layer with a 1x1 convolutional layer in the fusion block

translate

Replacing the 3x3 convolutional layer with a 1x1 convolutional layer in the fusion block: there are two main motivations for this replacement. First, local and global features should be fused independently of other locations in the feature map, to simplify the learning task of the fusion block. Conceptually, a 3x3 convolutional layer fuses the input features, the global features, and the input and global features present at other locations in its receptive field, which is a complex task. The goal of the fusion block can be simplified by letting it fuse input and global features independently of other locations in the feature map. To do this, we use a 1x1 convolutional layer instead of a 3x3 convolutional layer in the fusion block. The second motivation is to remove one of the main obstacles to scaling the MobileViTv1 architecture. Scaling MobileViTv1 from XXS to S is done by changing the width of the network while keeping the depth constant. Changing the width of the MobileViTv1 block (the number of input and output channels) leads to a large increase in parameters and FLOPs. For example, if the input and output channels are doubled (2x) in the MobileViTv1 block, the number of input channels to the 3x3 convolutional layer inside the fusion block increases by 4x and the output channels increase by 2x, because the input to that 3x3 convolutional layer is a concatenation of the input and global representation block features. This causes a significant increase in the parameters and FLOPs of the MobileViTv1 block. Using a 1x1 convolutional layer avoids this large increase when scaling.


intensive reading

Purpose:

1. Fuse local and global features independently of other positions in the feature map, to simplify the learning task of the fusion block

2. Remove one of the main obstacles to scaling the MobileViTv1 architecture, avoiding a large increase in parameters and FLOPs when scaling (a rough parameter count is sketched below)
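
To make the scaling argument concrete, here is a rough back-of-the-envelope count (a sketch with illustrative channel numbers, not values from the paper's Table 1): doubling the block width multiplies the parameters of either fusion variant by 4x, but the 3x3 kernel always carries an extra 9x factor, so its absolute growth in parameters and FLOPs is much larger.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution, ignoring bias."""
    return k * k * c_in * c_out

# The fusion input is a concatenation of two feature maps with roughly C channels each
# (input/local features + global features), so it has about 2*C channels; the output has C.
for C in (64, 128):  # doubling the block width
    p3 = conv_params(3, 2 * C, C)  # v1-style fusion: 3x3 conv on the concatenation
    p1 = conv_params(1, 2 * C, C)  # v3-style fusion: 1x1 conv on the concatenation
    print(f"width C={C}: 3x3 fusion = {p3:,} params, 1x1 fusion = {p1:,} params")
```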


(2) Local and global feature fusion

translate

Local and global feature fusion: in the fusion layer of our proposed MobileViTv3 block, the features from the local and global representation blocks are concatenated, instead of the input and global representation features. This is because the local representation features are more closely related to the global representation features than the input features are. The number of output channels of the local representation block is slightly higher than the number of channels in the input features. This increases the number of input feature maps to the 1x1 convolutional layer of the fusion block, but the total number of parameters and FLOPs is still significantly smaller than in the baseline MobileViTv1 block because of the change from a 3x3 to a 1x1 convolutional layer.


intensive reading

v1 method: the fusion block concatenates the input features with the output of the global representation module

v3 method: the fusion block concatenates the output of the local representation module with the output of the global representation module

reason:

  • Local representation module features are more closely related to global representation module features

  • The output channel of the local representation block is slightly higher than the input channel


(3) Fusion of input end features

translate

Fused input features: The input features are added to the output of a 1x1 convolutional layer in a fusion block. Residual connections in models like ResNet and DenseNet have been shown to help optimize deeper layers in the architecture. We introduce this residual connection in the new MobileViTv3 architecture by adding input features to the output in the fusion block. The ablation study results shown in Table 6 show that this residual connection contributes to a 0.6% accuracy gain.


intensive reading

Method:  Input features are added to the output of a 1x1 convolutional layer in the fusion block

Inspiration:  Residual connections in models like ResNet and DenseNet have been shown to help optimize deeper layers in the architecture


(4) Use deep convolutional layers in local representation blocks

translate

Depthwise convolutional layer in the local representation block: to further reduce parameters, the 3x3 convolutional layer in the local representation block is replaced by a depthwise 3x3 convolutional layer. As shown in the ablation study results in Table 6, this change does not have a large impact on the Top-1 ImageNet-1K accuracy and provides a good parameter/accuracy trade-off.


intensive reading

Method: the 3x3 convolutional layer in the local representation block is replaced by a depthwise 3x3 convolutional layer

Purpose: further reduce parameters (a simplified sketch of the resulting block, combining all four changes, is given below)
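
Putting the four changes together, the following is a simplified PyTorch sketch of what a MobileViTv3-style block could look like. It is not the authors' implementation: the global representation (the unfold/transformer/fold part of the real MobileViT block) is abstracted as a placeholder module, and the channel sizes, missing normalization and activation layers are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class MobileViTv3BlockSketch(nn.Module):
    """Simplified sketch of the four MobileViTv3 changes described above."""

    def __init__(self, in_channels: int, attn_channels: int, global_rep: nn.Module):
        super().__init__()
        # (4) Local representation: depthwise 3x3 followed by pointwise 1x1
        #     (MobileViTv1 used a normal 3x3 convolution here).
        self.local_rep = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1,
                      groups=in_channels, bias=False),                      # depthwise 3x3
            nn.Conv2d(in_channels, attn_channels, kernel_size=1, bias=False),  # pointwise 1x1
        )
        # Global representation (transformer / linear-attention part), treated as a black box.
        self.global_rep = global_rep
        # (1) Fusion uses a 1x1 convolution (v1 used a 3x3 convolution).
        self.fusion = nn.Conv2d(attn_channels * 2, in_channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_feat = self.local_rep(x)             # local (CNN) features
        global_feat = self.global_rep(local_feat)  # global (ViT) features, same shape assumed
        # (2) Concatenate LOCAL and global features (v1 concatenated the block INPUT
        #     with the global features).
        fused = self.fusion(torch.cat([local_feat, global_feat], dim=1))
        # (3) Residual connection: add the block input to the fused output.
        return fused + x

# Shape check with an identity placeholder standing in for the global representation:
block = MobileViTv3BlockSketch(in_channels=64, attn_channels=80, global_rep=nn.Identity())
out = block(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```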


3.2 SCALING UP BUILDING BLOCKS—model building blocks

translate

Applying the changes proposed in Section 3.1 allows our MobileViTv3 architecture to be scaled by increasing the width of the layers (number of channels). Table 1 shows the MobileViTv3-S, XS and XXS architectures with the output channels, scaling factor, parameters and FLOPs at each layer.

intensive reading
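
As a toy illustration of width scaling: the block layout stays fixed and only the per-stage channel counts are multiplied by a width factor. The base channels, factors and rounding rule below are hypothetical and only illustrate the idea; the real values are listed in Table 1 of the paper.

```python
def scale_width(channels, factor, divisor=8):
    """Scale a list of channel counts by `factor` and round to a multiple of `divisor`
    (rounding to a hardware-friendly multiple is a common convention assumed here)."""
    return [max(divisor, int(round(c * factor / divisor)) * divisor) for c in channels]

base_channels = [16, 32, 64, 96, 128]   # illustrative, not the paper's Table 1 values
for name, factor in [("XXS-like", 0.5), ("XS-like", 0.75), ("S-like", 1.0)]:
    print(name, scale_width(base_channels, factor))
```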


4. EXPERIMENTAL RESULTS—Experimental results

translate

Our work reports results on the classification task using ImageNet-1K in Section 4.1, on the segmentation task using the ADE20K and PASCAL VOC 2012 datasets in Section 4.2, and on the detection task using the COCO dataset in Section 4.3. We also discuss changes to our proposed MobileViTv3 architecture that improve latency and throughput in Section 4.4.


intensive reading

Datasets

Classification — ImageNet-1K

Segmentation — ADE20K and PASCAL VOC 2012

Detection — COCO


4.1 IMAGE CLASSIFICATION ON IMAGENET1K—Image classification on IMAGENET1K

4.1.1 IMPLEMENTATION DETAILS—Experimental details

translate

Except for the batch size, the hyperparameters used for MobileViTv3-S, XS and XXS follow MobileViTv1. Due to resource constraints, we were limited to a total batch size of 384 (32 images per GPU) for the MobileViTv3-S and XS experiments. To keep the batches consistent, MobileViTv3-XXS is also trained with a batch size of 384. MobileViTv3-0.5, 0.75 and 1.0 are trained with a batch size of 1020 (85 images per GPU).

MobileViTv3-S, XS and XXS: the default hyperparameters from MobileViTv1 training are used, including AdamW as the optimizer, a multi-scale sampler (with resolutions including (288, 288) and (320, 320)), a learning rate increased from 0.0002 to 0.002 over the first 3K iterations and then annealed to 0.0002 with a cosine schedule, L2 weight decay of 0.01, and basic data augmentation, i.e. random resized crop and horizontal flip.

MobileViTv3-1.0, 0.75 and 0.5: the default hyperparameters from MobileViTv2 are used, including AdamW as the optimizer, a batch sampler (S = (256, 256)), a learning rate increased from 1e-6 to 0.002 over the first 20K iterations and then annealed to 0.0002 with a cosine schedule, L2 weight decay of 0.05, and advanced data augmentation, i.e. random resized crop, horizontal flip, RandAugment, random erasing, MixUp and CutMix. Performance is evaluated using single-crop top-1 accuracy, and inference uses an exponential moving average of the model weights. All classification models are trained from scratch on the ImageNet-1K classification dataset, which contains 1.28M training and 50K validation images.


intensive reading

  • Hyperparameters: MobileViTv3-S, XS and XXS follow MobileViTv1; MobileViTv3-1.0, 0.75 and 0.5 follow MobileViTv2
  • Batch size: MobileViTv3-S, XS and XXS: 384; MobileViTv3-0.5, 0.75 and 1.0: 1020
  • Optimizer: AdamW
  • Learning rate: MobileViTv3-S, XS and XXS: increased from 0.0002 to 0.002 over the first 3K iterations; MobileViTv3-1.0, 0.75 and 0.5: increased from 1e-6 to 0.002 over the first 20K iterations (see the schedule sketch below)
  • Cosine annealing: both groups annealed back down to 0.0002
  • L2 weight decay: MobileViTv3-S, XS and XXS: 0.01; MobileViTv3-1.0, 0.75 and 0.5: 0.05
  • Data augmentation: MobileViTv3-S, XS and XXS: basic (random resized crop, horizontal flip); MobileViTv3-1.0, 0.75 and 0.5: advanced (RandAugment, random erasing, MixUp, CutMix)
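
The warmup-then-cosine learning-rate schedule described above can be sketched as follows. This is a generic sketch, not the authors' training code; the total iteration count is an assumed placeholder.

```python
import math

def lr_at(step, total_steps, warmup_steps, lr_start, lr_max, lr_min):
    """Linear warmup from lr_start to lr_max, then cosine annealing down to lr_min."""
    if step < warmup_steps:
        return lr_start + (lr_max - lr_start) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Values quoted for MobileViTv3-S/XS/XXS: warm up from 2e-4 to 2e-3 over 3K iterations,
# then anneal back to 2e-4; the 100K total steps here are only an illustrative assumption.
for step in (0, 1500, 3000, 50_000, 100_000):
    print(step, round(lr_at(step, 100_000, 3_000, 2e-4, 2e-3, 2e-4), 6))
```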


4.1.2 COMPARISON WITH MOBILEVITS—Comparison with previous MobileViT versions

translate

Table 2 shows that all versions of MobileViTv3 outperform the corresponding MobileViTv1 and MobileViTv2 versions with similar parameters and FLOPs, despite smaller training batch sizes. The impact of the training batch size on MobileViTv3 is also shown in Table 2: increasing the total batch size from 192 to 384 improves the accuracy of the MobileViTv3-S, XS and XXS models, which suggests that accuracy could improve further with a batch size of 1024. It is also important to note that the MobileViTv3-S, XS and XXS models trained with basic data augmentation not only outperform MobileViTv1-S, XS and XXS, but also outperform MobileViTv2-1.0, 0.75 and 0.5. MobileViTv3-1.0, 0.75 and 0.5 fine-tuned at image size 384 also outperform the corresponding fine-tuned MobileViTv2-1.0, 0.75 and 0.5 models.


intensive reading

Table 2: Comparison of MobileViT v1, v2 and v3 in terms of Top-1 ImageNet-1K accuracy, parameters and operations.

Conclusion: Table 2 shows that all versions of MobileViTv3 outperform the corresponding MobileViTv1 and MobileViTv2 versions with similar parameters and FLOPs.


4.1.3 COMPARISON WITH VITS—Comparison with ViT series

translate

Figure 1 compares the performance of our proposed MobileViTv3 models with other ViT variants and hybrid models. Following MobileViTv1, we mainly compare models with parameter budgets of around 6M or less. Furthermore, we limit the FLOP budget to 2 GFLOPs or less when comparing against models with more than 6M parameters, since our largest model in this work has about 2 GFLOPs.

Models with 1-2 million parameters: to the best of our knowledge, only MobileViT variants exist in this range. MobileViTv3-XXS and MobileViTv3-0.5 outperform the other MobileViT variants. MobileViTv3-0.5 achieves the best accuracy so far, 72.33%, among models (ViT or hybrid) with a 1-2 million parameter budget.

Models with 2-4 million parameters: MobileViTv3-XS and MobileViTv3-0.75 outperform all models in this range. The Top-1 accuracy of MobileViTv3-XS on ImageNet-1K is 76.7%, which is 3.9% higher than Mini-DeiT-Ti (Zhang et al., 2022), 4.5% higher than XCiT-N12 (Ali et al., 2021), and 6.2% higher than PVTv2-B0 (Wang et al., 2022). While MobileFormer-53M (Chen et al., 2022) uses only 53 MFLOPs, it lags behind MobileViTv3-XS in accuracy by 12.7%.

Models with 4-8 million parameters: MobileViTv3-S achieves the highest accuracy in this parameter range. MobileViTv3-S, with a simple training recipe and 300 epochs, is 0.7% better than XCiT-T12 trained with distillation, advanced data augmentation and 400 epochs. It is also 1.8%, 2.6% and 2.9% better than CoaT-Lite-Tiny (Xu et al., 2021a), ViL-Tiny-RPB (Zhang et al., 2021) and CeiT-Ti (Yuan et al., 2021a) respectively. Compared to CoaT-Tiny, MobileViTv3-S is 1% better with 0.5x the FLOPs and similar parameters (Xu et al., 2021a).

Models with over 8M parameters: we also compare our models against existing models with more than 8M parameters and roughly 2 GFLOPs. Compared to MobileViTv3-S trained with basic data augmentation and 300 epochs: CoaT-Lite-Mini achieves a competitive 79.1% accuracy with 2x more parameters, similar FLOPs and advanced data augmentation; MobileFormer-508M achieves a similar 79.3% accuracy with 2.5x more parameters, 3.5x fewer FLOPs, advanced data augmentation and 450 training epochs; ResT-Small (Zhang & Yang, 2021) achieves a similar 79.6% accuracy with 2.5x more parameters, similar FLOPs and advanced data augmentation; PVTv2-B1 (Wang et al., 2022) achieves 78.7% with 2.3x more parameters, similar FLOPs and advanced data augmentation; and CMT-Ti (Guo et al., 2022) achieves 79.1% with 1.6x more parameters, 2.9x fewer FLOPs (due to its 160x160 input image size) and advanced data augmentation.


intensive reading

Figure 1: Comparison of Top-1 accuracy of MobileViTv 3, ViT variants and hybrid models on the ImageNet-1 K dataset.

in conclusion:

  • Models at 2 million parameters: MobileViTv 3-XXS and MobileViTv 3-0.5 outperform other MobileViT variants
  • Models between 2-4 million parameters: MobileViTv 3-XS and MobileViTv 3 -0.75 outperform all models in the series
  • Models between 4-8 million parameters: MobileViTv 3-S achieves the highest accuracy in this parameter range
  • Models with over 8 million parameters: MobileViTv3-S remains competitive with these larger models while using far fewer parameters

4.1.4 COMPARISON WITH CNNS—Comparison with CNN series

translate

Figure 3 compares our proposed model with a CNN model, which is lightweight with a parameter budget of 6M or less, similar to MobileViTv 1 (Mehta & Rastegari, 2021).

Models in the 1-2 million parameter range: MobileViTv3-0.5 and MobileViTv3-XXS achieve 72.33% and 70.98% respectively, the best accuracies in this parameter range. MobileViTv3-0.5 achieves an improvement of more than 2.5% over MobileNetv3-small (0.5) (Howard et al., 2019), MobileNetv3-small (0.75), ShuffleNetv2 (0.5) (Ma et al., 2018), ESPNetv2-28M (Mehta et al., 2019), ESPNetv2-86M, and ESPNetv2-123M.

Models with 2-4 million parameters: MobileViTv3-XS improves accuracy by more than 4% over MobileNetv3-Large (0.75), ShuffleNetv2 (1.5), ESPNetv2-284M, and MobileNetv2 (0.75).

Models with 4-8 million parameters: MobileViTv3-S improves accuracy by more than 2% over EfficientNet-B0 (Tan & Le, 2019), MobileNetv3-Large (1.25), ShuffleNetv2 (2.0), ESPNetv2-602M and MobileNetv2 (1.4). EfficientNet-B1, with 1.3x more parameters and 2.6x fewer FLOPs, reaches 79.1% accuracy, while MobileViTv3-S reaches 79.3%.


intensive reading

Figure 3: Top 1 accuracy comparison between MobileViTv 3 model and existing lightweight CNN models on the ImageNet-1 K dataset.

in conclusion:

  • Models in the 1-2 million parameter range: MobileViTv 3 -0.5 and MobileViTv 3-XXS are the best accuracy in this parameter range
  • Models with 2-4 million parameters: MobileViTv 3-XS improves performance by more than 4%
  • Model with 4-8 million parameters: The accuracy of MobileViTv 3-S is more than 2% higher than that of models with the same range of parameters.

4.2 SEGMENTATION—Segmentation

4.2.1 IMPLEMENTATION DETAILS—Experimental details

translate

PASCAL VOC 2012 dataset: following MobileViTv1, MobileViTv3 is integrated with DeepLabv3 (Chen et al., 2017) for the segmentation task on the PASCAL VOC 2012 dataset (Everingham et al., 2015). Additional annotations and data come from Hariharan et al. (2011) and Lin et al. (2014), which is standard practice for training on PascalVOC 2012 (Chen et al., 2017; Mehta et al., 2019). For MobileViTv3-S, XS and XXS, the training hyperparameters are similar to MobileViTv1 except for the batch size: a smaller batch size of 48 (12 images per GPU) is used, compared to MobileViTv1's 128 (32 images per GPU). The other default hyperparameters include AdamW as the optimizer, weight decay of 0.01, a cosine learning rate scheduler, cross-entropy loss and 50 training epochs. For MobileViTv3-1.0, 0.75 and 0.5, all hyperparameters remain the same as in MobileViTv2-1.0, 0.75 and 0.5 training; the defaults include AdamW as the optimizer, weight decay of 0.05, a cosine learning rate scheduler, cross-entropy loss and 50 training epochs. Segmentation performance is evaluated on the validation set and reported as mean intersection over union (mIOU).

ADE20K dataset (Zhou et al., 2019): contains a total of 25K images with 150 semantic categories; 20K images are used for training, 3K for testing, and 2K for validation. The MobileViTv3-1.0, 0.75 and 0.5 models use the same training hyperparameters as MobileViTv2: SGD as the optimizer, weight decay of 1e-4, momentum of 0.9, a cosine learning rate scheduler, 120 training epochs, cross-entropy loss, and a batch size of 16 (4 images per GPU). Segmentation performance is evaluated on the validation set and reported as mean intersection over union (mIOU).


intensive reading

PASCALVOC 2012 data set:

  • Hyperparameters: MobileViTv3-S, XS and XXS follow MobileViTv1 (except for the batch size of 48); MobileViTv3-1.0, 0.75 and 0.5 follow MobileViTv2
  • Optimizer: AdamW
  • Weight decay: MobileViTv3-S, XS and XXS: 0.01; MobileViTv3-1.0, 0.75 and 0.5: 0.05

ADE 20K data set:

  • Hyperparameters: the MobileViTv3-1.0, 0.75 and 0.5 models use the same training hyperparameters as MobileViTv2
  • Optimizer: SGD
  • Weight decay: 1e-4
  • Momentum: 0.9
  • Epochs: 120
  • Batch size: 16
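
For reference, mean intersection over union (mIOU), the metric reported in the next subsection, can be computed as in this NumPy sketch. It is not the authors' evaluation code; the ignore label 255 and the class count are common conventions assumed here.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes; `pred` and `target` are integer label maps of equal shape."""
    mask = target != ignore_index
    pred, target = pred[mask], target[mask]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy example with 21 classes, as in PASCAL VOC (20 object classes + background):
pred = np.random.randint(0, 21, size=(512, 512))
target = np.random.randint(0, 21, size=(512, 512))
print(mean_iou(pred, target, num_classes=21))
```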

4.2.2 RESULTS—Conclusions

translate

PASCAL VOC 2012 dataset: Table 3a shows that the MobileViTv3 models, trained with the smaller batch size of 48, outperform the corresponding MobileViTv1 and MobileViTv2 models trained with the larger batch size of 128. MobileViTv3-1.0 achieves 80.04% mIOU, 1.1% higher than MobileViTv2-1.0; MobileViTv3-XS is 1.6% better than MobileViTv1-XS; and MobileViTv3-0.5 is 1.41% better than MobileViTv2-0.5. ADE20K dataset: Table 3b shows the results of the MobileViTv3-1.0, 0.75 and 0.5 models on ADE20K, which outperform MobileViTv2-1.0, 0.75 and 0.5 by 2.07%, 1.73% and 1.64% respectively.


intensive reading

Table 3: Comparing MobileViTv 3 segmentation task results on PASCAL VOC 2012 and ADE 20 K datasets.

in conclusion:

  • On the PASCAL VOC 2012 dataset, MobileViTv3 outperforms the corresponding MobileViTv1 and MobileViTv2 models trained with the larger batch size of 128
  • On the ADE 20 K data set, MobileViTv 3 -1.0, 0.75 and 0.5 perform better than the corresponding model of MobileViTv 2 respectively

4.3 OBJECT DETECTION—Object detection

4.3.1 IMPLEMENTATION DETAILS—Experimental details

translate

The detection performance of the MobileViTv3 models is evaluated on the MS-COCO dataset, which has 117K training and 5K validation images. As in MobileViTv1, we integrate the pretrained MobileViTv3 as the backbone of a single-shot detection network (SSD) (Liu et al., 2016), and the standard convolutions in the SSD head are replaced with separable convolutions to create an SSDLite network. SSDLite is also used by other lightweight CNNs to evaluate detection performance. This SSDLite network, with pretrained MobileViTv3-1.0, 0.75 and 0.5 backbones, is fine-tuned on the MS-COCO dataset. The hyperparameters used to train MobileViTv3-1.0, 0.75 and 0.5 remain the same as for MobileViTv2-1.0, 0.75 and 0.5; the defaults include a 320x320 input image size, the AdamW optimizer, weight decay of 0.05, a cosine learning rate scheduler, a total batch size of 128 (32 images per GPU), and smooth L1 and cross-entropy losses for object localization and classification respectively. The training hyperparameters of MobileViTv3-S, XS and XXS remain the same as for MobileViTv1-S, XS and XXS; the defaults include a 320x320 input resolution, the AdamW optimizer, weight decay of 0.01, a cosine learning rate scheduler, a total batch size of 128 (32 images per GPU), and smooth L1 and cross-entropy losses for object localization and classification respectively. Performance is evaluated on the validation set using mAP@IoU of 0.50:0.05:0.95.


intensive reading

  • Dataset: MS-COCO
  • Hyperparameters: Hyperparameters of MobileViTv 3 -1.0, 0.75 and 0.5 remain the same as MobileViTv 2 -1.0, 0.75 and 0.5; Training hyperparameters of MobileViTv 3-S, XS and XXS remain the same as MobileViTv 1-S, XS and XXS
  • Optimizer: AdamW
  • Weight decay: MobileViTv3-1.0, 0.75 and 0.5: 0.05; MobileViTv3-S, XS and XXS (same as MobileViTv1): 0.01
  • Total batch size: 128
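
The reported metric is COCO-style mAP averaged over IoU thresholds 0.50:0.05:0.95. One convenient way to compute it outside the authors' pipeline is the torchmetrics implementation sketched below; the boxes, scores and labels are toy values, and the exact import path may differ slightly between torchmetrics versions.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# mAP averaged over IoU thresholds 0.50:0.05:0.95, the metric quoted above.
metric = MeanAveragePrecision(box_format="xyxy")

preds = [{
    "boxes": torch.tensor([[10.0, 10.0, 100.0, 100.0]]),
    "scores": torch.tensor([0.9]),
    "labels": torch.tensor([1]),
}]
targets = [{
    "boxes": torch.tensor([[12.0, 8.0, 98.0, 105.0]]),
    "labels": torch.tensor([1]),
}]

metric.update(preds, targets)
print(metric.compute()["map"])  # overall mAP over the IoU threshold range
```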

4.3.2 RESULTS—Conclusions

translate

Tables 4a and 4b show the detection results on the COCO dataset. The #params of the MobileViT models include only the parameters of the encoder/backbone architecture. Table 4a compares MobileViTv3 with other lightweight CNN models: MobileViTv3-XS performs 0.8% better than MobileViTv1-XS and 2.6% better than MNASNet. Table 4b details the comparison with heavier CNNs: MobileViTv3-XS and MobileViTv3-1.0 exceed MobileViTv1-XS and MobileViTv2-1.0 by 0.8% and 0.5% mAP respectively.


intensive reading

Tables 4a and 4b show the detection results of the COCO dataset.

in conclusion:

MobileViTv 3-XS performs 0.8% better than MobileViTv 1-XS and 2.6% better than MNASNet

MobileViTv 3-XS and MobileViTv 3 -1.0 exceed MobileViTv 1-XS and MobileViTv 2 -1.0 respectively by 0.8% and 0.5% mAP


4.4 IMPROVING LATENCY AND THROUGHPUT—Improving latency and throughput

4.4.1 IMPLEMENTATION DETAILS—Experimental details

translate

Implementation details: We use a GeForce RTX 2080 Ti GPU to obtain latency times. Results are averaged over 10,000 iterations. Timing results may vary ±0.1 ms. The throughput of XXS, XS and S is calculated in 1000 iterations with a batch size of 100. "Blocks" in Table 5 represents the number of MobileViTv3 blocks in "Layer 4" of the MobileViTv3 architecture (Table 1). To improve latency, we reduced the number of MobileViT blocks in "layer4" from 4 to 2.


intensive reading

  • Hardware: GeForce RTX 2080 Ti GPU
  • Latency: averaged over 10,000 iterations
  • Throughput: measured over 1,000 iterations with a batch size of 100 (see the measurement sketch below)
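
A rough way to reproduce this kind of GPU latency/throughput measurement in PyTorch is sketched below. It is a generic sketch, not the authors' benchmarking script; the warmup count, input size and iteration count are placeholder assumptions.

```python
import time
import torch

@torch.no_grad()
def measure(model, input_size=(1, 3, 256, 256), iters=1000, warmup=50):
    """Average per-iteration latency (ms) and throughput (images/s) for a model."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.eval().to(device)
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):                      # warm up kernels and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()                 # wait for queued GPU work to finish
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / iters
    throughput = input_size[0] * iters / elapsed
    return latency_ms, throughput

# e.g. latency with batch 1, throughput with batch 100 (as in the paper's setup):
# print(measure(my_model, input_size=(100, 3, 256, 256)))
```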

4.4.2 RESULTS—Conclusion

translate

Results: Table 5 shows the latency and throughput results. MobileViTv3-XXS, with parameters and FLOPs similar to the baseline MobileViTv1-XXS and a 1.98% accuracy improvement, achieves a similar latency of 7.1 ms. With two MobileViT blocks instead of four, MobileViTv3-XXS reduces FLOPs by 30% and achieves a latency of 6.24 ms, about 1 ms faster than the baseline MobileViTv1-XXS. Similar changes in the MobileViTv3-XS and MobileViTv3-S architectures reduce FLOPs by 13.5% and 17.82% respectively, and latency by 1 ms and 0.7 ms respectively.


intensive reading

Table 5 shows the latency and throughput results.

in conclusion:

Reducing the number of MobileViT blocks in "layer4" lowers the FLOPs and latency of MobileViTv3-XXS, XS and S compared to the corresponding baselines, with little impact on accuracy


4.5 ABLATION STUDY OF OUR PROPOSED MOBILEVITV3 BLOCK—Ablation experiment

4.5.1 IMPLEMENTATION DETAILS—Experimental details

translate

We investigate the impact of the four proposed changes on the MobileViTv1-S block by adding them one after another. The model with all four changes is our unscaled version, which we name MobileViTv3-S (unscaled). To match the parameter count of MobileViTv1, we then increase the width of MobileViTv3-S (unscaled), resulting in MobileViTv3-S. The Top-1 accuracy on ImageNet-1K is recorded for each change and compared with the other proposed changes. In this ablation study we train the models for 100 epochs with a batch size of 192 (32 images per GPU); the other hyperparameters are the defaults given in Section 4.1.1. All proposed changes are applied in the MobileViTv1 block, which consists of the local representation, global representation and fusion blocks. In Table 6, 'conv-3x3' denotes the 3x3 convolutional layer in the fusion block, 'conv-1x1' denotes the 1x1 convolutional layer in the fusion block, 'Input-Concat' means the input features are concatenated with the global representation in the fusion block, and 'Local-Concat' means the local representation block's output features are concatenated with the global representation in the fusion block. 'Input-Add' means the input features are added to the output of the fusion block, 'DWConv' means a depthwise convolutional layer is used instead of a normal convolutional layer in the local representation block, and 'Top-1' denotes Top-1 accuracy on the ImageNet-1K dataset.


intensive reading

  • Epochs: 100
  • Dataset: ImageNet-1K
  • Batch size: 192
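
The four toggles of the ablation can be organized as simple configuration flags. The dataclass below is a hypothetical harness that mirrors the column names of Table 6; it is not code from the paper.

```python
from dataclasses import dataclass

@dataclass
class FusionAblation:
    """One flag per proposed change, named after the Table 6 columns."""
    conv1x1_fusion: bool = False   # 'conv-1x1' instead of 'conv-3x3' in the fusion block
    local_concat: bool = False     # 'Local-Concat' instead of 'Input-Concat'
    input_add: bool = False        # 'Input-Add': residual from block input to fusion output
    dwconv_local: bool = False     # 'DWConv' in the local representation block

# The ablation adds one change at a time on top of the MobileViTv1-S baseline:
variants = [
    FusionAblation(),                                                       # baseline v1-S
    FusionAblation(conv1x1_fusion=True),
    FusionAblation(conv1x1_fusion=True, local_concat=True),
    FusionAblation(conv1x1_fusion=True, local_concat=True, input_add=True),
    FusionAblation(conv1x1_fusion=True, local_concat=True, input_add=True,
                   dwconv_local=True),                                      # v3-S (unscaled)
]
```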

4.5.2 WITH 100 TRAINING EPOCHS—The results of 100 rounds of Epoch

translate

The results are shown in Table 6. In the baseline MobileViTv1-S, the fusion block concatenates the input features with the global representation block features and uses a 3x3 convolutional layer, and the local representation block uses a normal 3x3 convolutional layer. This baseline achieves 73.7% accuracy. Replacing the 3x3 convolution with a 1x1 convolutional layer in the fusion block gives MobileViTv3-S (unscaled) a 1.1% improvement. This supports the hypothesis that simplifying the task of the fusion block (letting the fusion layer fuse local and global features independently of the local and global features at other locations) aids optimization and improves performance. Together with the 1x1 convolutional layer in the fusion block, concatenating the local representation features instead of the input features leads to a similar performance gain of 1%. This allows us to introduce the next change: adding the input features to the output of the fusion block as a residual connection to help optimize the deeper layers of the model. With this change, MobileViTv3-S (unscaled) achieves a 1.6% accuracy gain over the baseline MobileViTv1-S and a 0.6% gain over the previous change, demonstrating the clear advantage of this residual connection. To further reduce the parameters and FLOPs of the MobileViTv3 block, depthwise convolutional layers are used instead of normal convolutional layers in the local representation block. MobileViTv3-S (unscaled) maintains a high accuracy gain of 1.3% over the baseline; a 0.3% accuracy drop can be observed compared to the previous change. We adopt this change because it reduces parameters and FLOPs without significantly impacting performance and helps with scaling the MobileViTv3 block.


intensive reading

Table 6: Ablation study of MobileViTv3

in conclusion:

MobileViTv3-S (unscaled) achieves a 1.6% accuracy gain over the baseline MobileViTv1-S and a 0.6% gain over the previous change


4.5.3 WITH 300 TRAINING EPOCHS—The results of 300 rounds of Epoch

translate

When trained for 300 epochs with a batch size of 192, the baseline MobileViTv1-S achieves a Top-1 accuracy of 75.6%, which is 2.8% lower than the accuracy reported for MobileViTv1-S trained with a batch size of 1024. The results are shown in Table 7. With all four proposed changes applied to the MobileViTv1-S architecture to form MobileViTv3-S (unscaled), the model achieves a Top-1 accuracy of 77.5%, 1.9% higher than the baseline, with 22.7% fewer parameters and 18.6% fewer FLOPs.

The MobileViTv3-S (unscaled) architecture, while better than the baseline MobileViTv1-S trained with a batch size of 192, performs worse than MobileViTv1-S trained with a batch size of 1024. Therefore, the MobileViTv3-S, XS and XXS models were scaled to have parameters and FLOPs similar to MobileViTv1-S, XS and XXS, and trained with a batch size of 384. Table 7 shows that after scaling, MobileViTv3-S outperforms MobileViTv1-S, reaching 79.3% accuracy with similar parameters and FLOPs. Table 2 shows that MobileViTv3-XS and XXS also exceed MobileViTv1-XS and XXS by 1.9% and 2.0% respectively with similar parameters and FLOPs.


intensive reading

Table 7: MobileViTv 3-S (unscaled), MobileViTv 1-S and MobileViTv 3-S Top-1 ImageNet-1 K accuracy comparison.

With MobileViTv3-S (unscaled), the model achieves a Top-1 accuracy of 77.5%, 1.9% higher than the baseline, with 22.7% fewer parameters and 18.6% fewer FLOPs.


5. DISCUSSION AND LIMITATIONS—Conclusion and Outlook

translate

This work is an effort to improve the performance of models in resource-constrained environments such as mobile phones. We studied reducing memory (parameters), computation (FLOPs) and latency while increasing accuracy and throughput. With the proposed changes to the MobileViT block, we achieve higher accuracy with the same memory and computation as the baseline MobileViTv1 and v2, as shown in Section 4.1. Table 2 also shows fine-tuning results that outperform the fine-tuned MobileViTv2 models. Section 4.4 shows how to achieve better latency and throughput with minimal impact on accuracy. Although MobileViTv3 has higher accuracy with lower or similar parameters compared to other mobile CNNs, its higher FLOPs may be an issue for edge devices (Figure 3). This limitation of the MobileViTv3 architecture is inherited from the self-attention module of ViTs. To address this issue, we will further explore the optimization of the self-attention block.

Table 2 shows the results on ImageNet-1K. The accuracies reported for the MobileViTv3-XXS, XS and S models on ImageNet-1K could be further improved by increasing the training batch size to 1024, as used for the baseline models. The fusion of input features, local features (CNN features) and global features (ViT features) proposed in this paper could also be explored in other hybrid architectures.


intensive reading

The main work of this article is to make the following four improvements based on MobileViTv1 and v2:

(1) Replace the 3x3 convolution layer with a 1x1 convolution layer in the fusion block

(2) Global and local feature fusion

(3) Fusion of input end features

(4) Replace the normal convolution in the local representation block with a depthwise convolution

Finally MobileViTv3 improves accuracy and throughput.

Outlook: the higher FLOPs of MobileViTv3, inherited from the self-attention module of ViTs, remain a limitation and will be addressed by further optimizing the self-attention blocks.



Source: blog.csdn.net/weixin_43334693/article/details/132742052