Deep learning paper: TurboViT: Generating Fast Vision Transformers via Generative Architecture Search and its PyTorch implementation

TurboViT: Generating Fast Vision Transformers via Generative Architecture Search
PDF: https://arxiv.org/pdf/2308.11421.pdf
PyTorch code: https://github.com/shanglianlm0525/CvPytorch
PyTorch code: https://github.com/shanglianlm0525/PyTorch-Networks

1 Overview

This paper explores generating fast vision Transformer architecture designs via Generative Architecture Search (GAS) to achieve a strong balance between accuracy and architectural and computational efficiency. This generative architecture search process produced TurboViT, an efficient hierarchical vision Transformer architecture design built on the mask unit attention and Q-pooling design patterns.

Experimental results show that the TurboViT architecture design has significantly lower architectural complexity (more than 2.47× smaller than FasterViT-0 while achieving the same accuracy) and significantly lower computational complexity (more than 3.4× fewer FLOPs and more than 0.9% higher accuracy than MobileViT2-2.0). Compared against 10 other state-of-the-art efficient vision Transformer architecture designs on the ImageNet-1K dataset, TurboViT performs strongly within a similar accuracy range.

2 TurboViT

This article uses generative architecture search (GAS) to search for the TurboViT architecture. The figure below shows the TurboViT architecture design generated through generative architecture search. Overall, the design is quite simple and streamlined, consisting mainly of a series of ViT blocks (a minimal reference block is sketched after the figure). Compared with other state-of-the-art efficient vision Transformer architecture designs (especially the more complex hybrid convolutional-Transformer designs), TurboViT uses relatively low hidden dimensions and head counts (especially compared to ViT), which helps it achieve higher architectural and computational efficiency.
[Figure: TurboViT architecture design generated via generative architecture search]
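As a reference for what a single ViT block in this stack looks like, below is a minimal pre-norm ViT block in PyTorch. The dim, num_heads, and mlp_ratio values are illustrative placeholders rather than the searched TurboViT values; the relevant point is that both the hidden dimension and the head count stay comparatively small.

```python
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    # Standard pre-norm ViT block: the basic unit stacked in series here.
    # dim and num_heads are illustrative; TurboViT keeps both comparatively
    # small, which is where much of its efficiency comes from.
    def __init__(self, dim=96, num_heads=2, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):  # x: (B, N, dim) token sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # self-attention + residual
        return x + self.mlp(self.norm2(x))                 # MLP + residual
```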
The TurboViT architecture design uses Q-pooling at three different locations to gain architectural and computational efficiency through spatial dimensionality reduction, with most of the layers located after the second Q-pooling. In addition, the early ViT blocks rely on local attention realized through mask unit attention, while the later ViT blocks rely on global attention, yielding significant gains in computational efficiency: global attention is not used where it would contribute little to model performance (a minimal sketch of these two patterns follows below). A particularly interesting observation about the TurboViT design is the dimensionality compression mechanism introduced at the start of the architecture: the hidden dimension drops sharply in the second ViT block compared with the first, forming a highly condensed embedding, and then gradually increases again deeper in the architecture. This compression mechanism appears to greatly reduce computational complexity while preserving a high degree of representational capability in the overall design.
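To make these two design patterns concrete, here is a minimal PyTorch sketch of mask unit (local windowed) attention with optional Q-pooling, loosely following the Hiera-style formulation these patterns come from; window_size, q_stride, and the use of max-pooling over queries are illustrative assumptions, not TurboViT's exact configuration.

```python
import torch
import torch.nn as nn

class MaskUnitAttention(nn.Module):
    # Local attention inside fixed-size "mask units" (windows). Optional
    # Q-pooling max-pools the queries by q_stride, so the output sequence
    # is q_stride times shorter -- the source of the spatial reduction.
    def __init__(self, dim, dim_out, num_heads=2, q_stride=1, window_size=64):
        super().__init__()
        self.num_heads = num_heads
        self.q_stride = q_stride
        self.window_size = window_size
        self.head_dim = dim_out // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim_out)
        self.proj = nn.Linear(dim_out, dim_out)

    def forward(self, x):  # x: (B, N, dim); N must be a multiple of window_size
        B, N, _ = x.shape
        W, H, D = self.window_size, self.num_heads, self.head_dim
        qkv = self.qkv(x).reshape(B, N // W, W, 3, H, D)
        q, k, v = qkv.permute(3, 0, 4, 1, 2, 5)  # each: (B, H, windows, W, D)
        if self.q_stride > 1:
            # Q-pooling: max-pool adjacent queries within each window.
            q = q.reshape(B, H, N // W, W // self.q_stride, self.q_stride, D).amax(dim=4)
        attn = ((q * self.scale) @ k.transpose(-2, -1)).softmax(dim=-1)
        out = attn @ v                                   # (B, H, windows, W/q_stride, D)
        out = out.permute(0, 2, 3, 1, 4).reshape(B, -1, H * D)
        return self.proj(out)

# With q_stride=2, 256 tokens are halved to 128, i.e. one Q-pooling step.
x = torch.randn(2, 256, 96)
print(MaskUnitAttention(96, 96, q_stride=2)(x).shape)  # torch.Size([2, 128, 96])
```

In this sketch, a block using global attention corresponds to window_size equal to the full sequence length; per the description above, the architecture reserves that more expensive setting for the later blocks, where it contributes most to accuracy.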

3 Experiments

[Figures: experimental results comparing TurboViT against state-of-the-art efficient vision Transformer architecture designs on ImageNet-1K]

Origin: blog.csdn.net/shanglianlm/article/details/132742770