[Transformer & CNN & TiDE] From CNN to ViT, and from ViT to TiDE: a review of the development of the self-attention and convolution mechanisms, and of the recent TiDE model, through papers published at top conferences and journals over the past ten years

Table of contents

1. Introduction of Transformer in CV

2. Attention mechanism enhances CNN 

Foreword:

1、 Attention Augmented Convolutional Networks(ICCV 2019)

2、 Stand-Alone Self-Attention in Vision Models(NIPS 2019)

3、CMT: Convolutional Neural Networks Meet Vision Transformers(CVPR 2022,CMT)

4、Conformer: Convolution-augmented Transformer for Speech Recognition(2020,Conformer)

3. Transformer completely replaces CNN

1、 End-to-End Object Detection with Transformers(ECCV 2020,DETR)

2、 Generative Pretraining from Pixels(ICML 2020)

3、 AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE(ICLR 2021,VIT)

4. Efficiency and effectiveness optimization stage of the CV Transformer

1、Transformer in Transformer(NIPS 2021)

2、Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions(ICCV 2021,PVT)

3、 Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet(ICCV 2021,T2T-VIT)

4、Swin Transformer: Hierarchical Vision Transformer using Shifted Windows(ICCV 2021 Best Paper Award)

5、A Time Series is Worth 64 Words: Long-term Forecasting with Transformers(ICLR 2023) 

5. TiDE model based on MLP, without using attention mechanism, CNN and RNN


1. Introduction of Transformer in CV

With the Transformer firmly established as the mainstream architecture in NLP, more and more work has tried to bring it into computer vision. The development of the CV Transformer has gone through roughly three stages: first, the Attention mechanism was introduced into CNNs to compensate for the fact that convolutional structures only extract local information and lack the ability to model global context; next, research gradually moved toward replacing CNNs entirely with full Transformer models for image tasks; currently, with Transformers having achieved initial success on CV problems, much work focuses on optimizing the details of CV Transformers, including how to improve efficiency on high-resolution images, how to better convert images into sequences while preserving their structural information, and how to balance efficiency and accuracy. By reviewing Transformer-related papers from recent years and tracing the application of the Attention mechanism in computer vision, from ViT to Swin Transformer, we can get a complete picture of how the CV Transformer developed.

2. Attention mechanism enhances CNN 

Foreword:

Convolutional neural networks (CNNs) have achieved great success in a large number of computer vision applications, especially image classification. The design of the convolutional layer ensures locality through a limited receptive field and translation equivariance through weight sharing. Research shows that these two properties are key inductive biases when designing image processing models. However, the inherent locality of the convolution kernel makes it impossible to capture the global context of an image, and global context is essential for better recognizing objects in images.

Inductive bias: essentially a form of prior knowledge, an assumption made in advance. An inductive bias can be understood as a rule (heuristic) induced from observed phenomena that is then imposed as a constraint on the model, playing the role of "model selection", similar to the prior in Bayesian learning. For example, deep neural networks assume that processing information hierarchically works better; convolutional neural networks assume that information has spatial locality, so weights can be shared by sliding convolutions to shrink the parameter space; recurrent neural networks take temporal information into account, emphasizing the importance of order.

The self-attention mechanism is a recent advance for capturing long-range interactions, but so far it has mainly been used in sequence modeling and generative modeling tasks. The key idea behind self-attention is to take a weighted average of the values computed from hidden units. Unlike pooling or convolution operators, the weights used in this weighted average are derived dynamically from a similarity function between hidden units. Thus the interaction between input signals is determined by the signals themselves rather than predetermined by their relative positions. In particular, this enables self-attention to capture long-range interactions without increasing the number of parameters.
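As a minimal illustration of the idea described above, the sketch below (my own toy example in PyTorch, not code from any of the cited papers) computes single-head scaled dot-product self-attention; note how the weights of the weighted average come from a similarity function over the inputs, not from fixed positions.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: [batch, seq_len, d_model]; w_q/w_k/w_v: [d_model, d_head].
    The weights of the weighted average are computed from the inputs
    themselves (a similarity function), not from fixed relative positions.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                     # project inputs
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # pairwise similarity
    weights = F.softmax(scores, dim=-1)                     # normalize per query
    return weights @ v                                      # weighted average of values

x = torch.randn(2, 16, 64)                                  # toy input
w_q, w_k, w_v = (torch.randn(64, 32) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                      # [2, 16, 32]
```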

Self-attention (self-attention) mechanism: a simple understanding of the attention mechanism - Zhihu (zhihu.com)

A Brief Reading of "Attention is All You Need" (Introduction + Code) - Scientific Spaces (kexue.fm)

1、 Attention Augmented Convolutional Networks (ICCV 2019)

CNN model structures are characterized by aggregating local information, and their weakness is that long-range dependencies are hard to model. Attention models, by contrast, are strong at long-range modeling, so Attention Augmented Convolutional Networks (ICCV 2019) proposes using Attention to compensate for this shortcoming of CNNs. The paper investigates using self-attention (as a substitute for convolution) for discriminative vision tasks. The researchers developed a novel two-dimensional relative self-attention mechanism that maintains translation equivariance while incorporating relative position information, which makes it well suited to images. Their study shows that this self-attention scheme is competitive enough to completely replace convolution. Nonetheless, controlled experiments show that combining self-attention and convolution gives the best results, so rather than abandoning convolution entirely, self-attention is used to augment it. This is achieved by concatenating convolutional feature maps (which enforce locality) with self-attention feature maps (which can model longer-range dependencies). The method reshapes the input image [H, W, F] into a two-dimensional [H*W, F] tensor as the input to the Attention branch, which uses multi-head attention. The figure below shows the improvement this attention augmentation brings on image classification.

To make up for the Transformer's limited ability to capture spatial position information, the paper adopts the idea of Self-Attention with Relative Position Representations (2018) and uses relative position encodings along both the width and height dimensions to strengthen the Attention branch. Finally, the features from the Attention branch and the CNN branch are concatenated for the downstream task, so the two complement each other.

Relative positional embeddings:

Here is a brief introduction to relative position encoding, an alternative to the absolute position embedding in the Transformer. For any two positions i and j, a relative position embedding for the pair is added into the attention computation. If the distance between i and j is n, it is represented by a learnable embedding associated with distance n; a clipping threshold k is also set, so if the distance between i and j exceeds k, the embedding for distance k is used instead. The left side of the formula below shows how the relative position embedding a_ij of i and j enters multi-head attention, and the right side shows how the relative position embedding of two elements is computed. Introducing relative position embeddings also gives the attention model translation invariance, because relative positions are unchanged under translation.
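Since the original formula image is not reproduced in this text version, the following LaTeX restates the relative-position attention as formulated in Self-Attention with Relative Position Representations (Shaw et al., 2018), which this work builds on; notation follows that paper, with d_z the per-head dimension.

```latex
% Attention logit between positions i and j with a learned relative embedding a_{ij}^{K}:
e_{ij} = \frac{x_i W^Q \left( x_j W^K + a_{ij}^{K} \right)^{\top}}{\sqrt{d_z}}
% The relative embedding depends only on the clipped distance j - i (threshold k):
a_{ij}^{K} = w^{K}_{\mathrm{clip}(j-i,\,k)}, \qquad
\mathrm{clip}(x, k) = \max\!\bigl(-k, \min(k, x)\bigr)
```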

Attention Augmented Convolution:

Several previously proposed attention mechanisms for images show that convolutional operators are limited by their locality and lack of global context. These methods capture long-range dependencies by recalibrating convolutional feature maps. In particular, Squeeze-and-Excitation (SE) and Gather-Excite (GE) perform channel-wise reweighting, while BAM and CBAM reweight channels and spatial locations independently.

In contrast to these methods, the authors 1) use an attention mechanism that jointly attends to spatial and feature subspaces (one per head) and 2) introduce additional feature maps rather than refining existing ones. The figure below summarizes the proposed augmented convolution.
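A minimal sketch of this idea (an illustration, not the authors' released implementation): a standard convolution and a multi-head self-attention over the flattened H*W positions run in parallel, and the two feature maps are concatenated along the channel dimension. The channel split and the omission of the relative position encoding are simplifying assumptions here.

```python
import torch
import torch.nn as nn

class AugmentedConv2d(nn.Module):
    """Sketch of attention-augmented convolution: concatenate a convolutional
    feature map (local) with a self-attention feature map (global)."""

    def __init__(self, in_ch, conv_ch, attn_ch, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, conv_ch, kernel_size=3, padding=1)
        self.proj = nn.Conv2d(in_ch, attn_ch, kernel_size=1)         # project to attention dim
        self.attn = nn.MultiheadAttention(attn_ch, heads, batch_first=True)

    def forward(self, x):                                # x: [B, C, H, W]
        b, _, h, w = x.shape
        conv_out = self.conv(x)                          # local features [B, conv_ch, H, W]
        t = self.proj(x).flatten(2).transpose(1, 2)      # [B, H*W, attn_ch]
        attn_out, _ = self.attn(t, t, t)                 # global self-attention over positions
        attn_out = attn_out.transpose(1, 2).reshape(b, -1, h, w)
        return torch.cat([conv_out, attn_out], dim=1)    # channel-wise concatenation

y = AugmentedConv2d(64, 48, 16)(torch.randn(2, 64, 32, 32))   # [2, 64, 32, 32]
```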

Paper title: Attention Augmented Convolutional Networks (ICCV 2019)

arXiv:https://arxiv.org/abs/1904.09925

github:https://github.com/leaderj1001/Attention-Augmented-Conv2d

Note recommendation:

https://blog.51cto.com/smilecat/5936599

Paper title: Self-Attention with Relative Position Representations (2018)

arXiv:https://arxiv.org/abs/1803.02155

2、 Stand-Alone Self-Attention in Vision Models(NIPS 2019)

Stand-Alone Self-Attention in Vision Models (NIPS 2019) goes a step further and proposes replacing the convolution modules in ResNet entirely with Attention plus relative position embeddings, producing a fully attentional image model. The difference from the previous work is that this paper uses local attention, which behaves like convolution: each pixel only attends to a few pixels around it, which reduces the computational cost. The model structure for local attention on images is shown in the figure below.
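A rough sketch of the local attention idea (my own illustration, not the paper's code): each output position attends only to a k×k neighborhood of the input, gathered with unfold; the paper's relative position terms and multi-head structure are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention2d(nn.Module):
    """Each pixel attends only to its k x k neighborhood (single head)."""

    def __init__(self, channels, k=7):
        super().__init__()
        self.k = k
        self.q = nn.Conv2d(channels, channels, 1)
        self.kv = nn.Conv2d(channels, 2 * channels, 1)

    def forward(self, x):                                   # x: [B, C, H, W]
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)            # [B, H*W, C]
        kv = self.kv(x)                                      # [B, 2C, H, W]
        # gather the k*k neighbors around every position
        kv = F.unfold(kv, self.k, padding=self.k // 2)       # [B, 2C*k*k, H*W]
        kv = kv.view(b, 2 * c, self.k * self.k, h * w).permute(0, 3, 2, 1)  # [B, HW, k*k, 2C]
        k_, v_ = kv.split(c, dim=-1)                         # each [B, HW, k*k, C]
        scores = (q.unsqueeze(2) * k_).sum(-1) / c ** 0.5    # similarity with neighbors
        weights = scores.softmax(dim=-1)                     # [B, HW, k*k]
        out = (weights.unsqueeze(-1) * v_).sum(2)            # weighted average of neighbor values
        return out.transpose(1, 2).reshape(b, c, h, w)

y = LocalSelfAttention2d(32)(torch.randn(1, 32, 16, 16))     # [1, 32, 16, 16]
```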

Paper title: Stand-Alone Self-Attention in Vision Models (NIPS 2019)

arXiv:https://arxiv.org/abs/1906.05909

github: https://github.com/leaderj1001/Stand-Alone-Self-Attention

3、CMT: Convolutional Neural Networks Meet Vision Transformers(CVPR 2022,CMT)

In recent years Transformers have attracted more and more attention in vision, and a question naturally arises: which is better, CNN or Transformer? In CMT: Convolutional Neural Networks Meet Vision Transformers (CVPR 2022, CMT), researchers at Huawei Noah's Ark Lab argue that it is best to join forces and propose a new vision architecture, CMT. The network obtained by simply combining traditional convolution with the Transformer outperforms Google's EfficientNet, ViT, and MSRA's Swin Transformer. Building on a multi-stage Transformer, the paper inserts traditional convolutions between the stages of the network, aiming to extract local and global features of the image hierarchically through convolution plus global attention. This simple and effective combination shows that, in today's vision field, adding traditional convolution is the fastest way to boost model performance. On ImageNet classification, CMT-Small reaches 83.5% Top-1 accuracy under a similar compute budget, well above Swin's 81.3% and EfficientNet's 82.9%. The overall structure is shown in the figure below:

The model mainly includes 3 modules:

CMT stem (reduces the image size and extracts local information): it handles modeling of in-patch information, reduces the image size, and extracts fine-grained features and local information. It starts with a 3×3 convolution with stride 2 and 32 output channels to reduce the image size, followed by two 3×3 convolutions with stride 1 for better local feature extraction (see the sketch after this list).

Conv stride layer (reduces the feature map and increases the channels): a convolution plus layer norm that downsamples the intermediate features (resolution reduced by 2×) and projects them to a larger dimension (channel dimension doubled), producing a hierarchical representation.

The CMT block (captures global and local relations): it helps capture local and global structural information in the intermediate features simultaneously, improving the representation power of the network; it consists of a local perception unit, lightweight multi-head self-attention, and an inverted residual feed-forward network.
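As a rough sketch of the stem described above (the kernel sizes, strides, and 32 channels follow the text; the BatchNorm/GELU choices are assumptions for illustration, not Huawei's released code):

```python
import torch
import torch.nn as nn

# Sketch of the CMT stem: one 3x3 stride-2 conv (32 channels) to halve the
# resolution, then two 3x3 stride-1 convs for local feature extraction.
cmt_stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(32), nn.GELU(),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(32), nn.GELU(),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(32), nn.GELU(),
)

out = cmt_stem(torch.randn(1, 3, 224, 224))    # [1, 32, 112, 112]
```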

Paper title: CMT: Convolutional Neural Networks Meet Vision Transformers (CVPR 2022, CMT) Huawei Noah's Ark Laboratory

arXiv:https://arxiv.org/abs/2107.06263

Link to the paper: CMT: Convolutional Neural Networks Meet Vision Transformers

github:https://github.com/ggjy/CMT.pytorch

4、Conformer: Convolution-augmented Transformer for Speech Recognition(2020,Conformer)

Models based on the Transformer and on convolutional neural networks (CNNs) have achieved good results on ASR, outperforming RNNs. Transformers can capture long-range dependencies and content-based global interactions, while CNNs exploit local features effectively. Conformer: Convolution-augmented Transformer for Speech Recognition (2020, Conformer) therefore combines the Transformer and CNN to model both the local and global dependencies of audio sequences, proposing a convolution-augmented Transformer for speech recognition, called the Conformer, whose performance exceeds both Transformer and CNN baselines and set a new SOTA at the time.

The paper proposes a new way of combining self-attention and CNN that gets the best of both: self-attention learns global interactions, while the CNN learns local correlations based on relative position offsets. The two are combined in a sandwich arrangement between a pair of feed-forward modules. The model structure is shown in the figure below. The resulting model, called the Conformer, consists of a macaron-style block (feed-forward modules at both ends, with multi-head self-attention and convolution in the middle) built from residual connections, followed by a LayerNorm layer for normalization.
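A simplified sketch of the macaron structure described above (an illustration with assumed sizes, not the reference implementation): half-step feed-forward modules sandwich multi-head self-attention and a convolution module, followed by a final LayerNorm. The convolution module here is reduced to a depthwise 1-D convolution for brevity.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Reduced convolution module: LayerNorm + depthwise 1-D conv + activation."""
    def __init__(self, d_model, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dwconv = nn.Conv1d(d_model, d_model, kernel_size,
                                padding=kernel_size // 2, groups=d_model)
        self.act = nn.SiLU()

    def forward(self, x):                         # x: [B, T, d_model]
        y = self.norm(x).transpose(1, 2)          # [B, d_model, T]
        return self.act(self.dwconv(y)).transpose(1, 2)

class ConformerBlockSketch(nn.Module):
    """Macaron-style block: 1/2 FFN -> MHSA -> conv module -> 1/2 FFN -> LayerNorm."""
    def __init__(self, d_model=256, heads=4):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, 4 * d_model),
                                 nn.SiLU(), nn.Linear(4 * d_model, d_model))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.conv = ConvModule(d_model)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: [B, T, d_model]
        x = x + 0.5 * self.ffn1(x)                # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a)[0]             # multi-head self-attention (global)
        x = x + self.conv(x)                      # convolution module (local)
        x = x + 0.5 * self.ffn2(x)                # second half-step feed-forward
        return self.final_norm(x)

y = ConformerBlockSketch()(torch.randn(2, 100, 256))   # [2, 100, 256]
```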

Paper title: Conformer: Convolution-augmented Transformer for Speech Recognition (2020, Conformer)

arXiv: https://arxiv.org/abs/2005.08100

3. Transformer completely replaces CNN

1、 End-to-End Object Detection with Transformers(ECCV 2020,DETR)

End-to-End Object Detection with Transformers (ECCV 2020, DETR) proposes the DETR model, introducing the Transformer to object detection. The paper frames object detection as a set prediction problem: the model predicts N elements (each indicating whether a target exists and, if so, its class and position), where N is larger than the number of objects actually contained in the image, and the N predictions are matched against the ground truth to compute the loss. Readers interested in the matching details can consult the original paper; here we mainly introduce the model structure of DETR. DETR outputs N predictions to be matched against the ground truth, using a CNN + Transformer Encoder + Transformer Decoder architecture. The input image has dimensions [3, W, H]. First a CNN extracts a feature map, giving a high-level representation of shape [C, W', H']. A 1×1 convolution then compresses the channel dimension, and the map is flattened into a two-dimensional sequence that can be fed into the Transformer; a position embedding for each location is added and the result is passed to the Encoder, so the Transformer Encoder input has shape [d, W'*H']. The Encoder output enters the Decoder, whose input is also a set of N trainable position embeddings (object queries), corresponding to the N detection results to be predicted. Each of the N outputs finally goes through an FFN to produce the prediction. The DETR model structure is shown in the figure below.
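A compressed sketch of this pipeline (close in spirit to the minimal example in the DETR paper, but written here as an illustration with assumed hyperparameters and a simplified position embedding): a ResNet backbone, a 1×1 convolution down to d channels, a standard Transformer over the flattened feature map, N learned object queries, and classification / box heads.

```python
import torch
import torch.nn as nn
import torchvision

class DETRSketch(nn.Module):
    def __init__(self, num_classes=91, d=256, n_queries=100):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])   # [B, 2048, H', W']
        self.proj = nn.Conv2d(2048, d, kernel_size=1)        # compress channels to d
        self.transformer = nn.Transformer(d, nhead=8, batch_first=True)
        self.queries = nn.Parameter(torch.randn(n_queries, d))   # N learned object queries
        self.pos = nn.Parameter(torch.randn(50 * 50, d))          # assumed learned position embedding
        self.cls_head = nn.Linear(d, num_classes + 1)             # classes + "no object"
        self.box_head = nn.Linear(d, 4)                           # box center/size

    def forward(self, images):                                    # images: [B, 3, H, W]
        f = self.proj(self.backbone(images))                      # [B, d, H', W']
        b, d, h, w = f.shape
        src = f.flatten(2).transpose(1, 2) + self.pos[: h * w]    # [B, H'*W', d]
        tgt = self.queries.unsqueeze(0).expand(b, -1, -1)         # [B, N, d]
        hs = self.transformer(src, tgt)                           # decoder output [B, N, d]
        return self.cls_head(hs), self.box_head(hs).sigmoid()     # N predictions per image

logits, boxes = DETRSketch()(torch.randn(1, 3, 256, 256))         # [1, 100, 92], [1, 100, 4]
```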

Paper title: End-to-End Object Detection with Transformers (ECCV 2020, DETR)

arXiv:https://arxiv.org/abs/2005.12872

github: https://github.com/facebookresearch/detr​

2、 Generative Pretraining from Pixels (ICML 2020)

Generative Pretraining from Pixels (ICML 2020) proposes unsupervised GPT pre-training on large-scale image data to produce image representations for downstream tasks. The approach mirrors GPT's use in NLP: the image is preprocessed, downsampled to a lower resolution, flattened into a one-dimensional sequence, and fed into a GPT model. The paper tries two pre-training objectives: autoregressive (predicting the next pixel left to right) and BERT-style (masking some pixels and predicting them from context). After fine-tuning, the image representations obtained this way reach 72% accuracy on ImageNet.

Steps: 1. Cluster the RGB values into 512 clusters via k-means, then assign each pixel to its nearest cluster center. 2. Reduce the image resolution and reshape it into a 1-D input. 3. Train unsupervised with either the autoregressive or the BERT objective. 4. Evaluate on downstream tasks.

The authors use the GPT-2 architecture directly, ignore the two-dimensional structure of the image, and simply convert the image into a one-dimensional sequence as input. Through this kind of unsupervised generative training, GPT-2 can still learn very good representations: after fine-tuning, the pre-trained model matches or even exceeds models trained with supervised learning on downstream tasks.
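A minimal sketch of the preprocessing described above (details such as the downsampling and the centroids being given in advance are assumptions, not OpenAI's code): downsample the image, assign every RGB pixel to the nearest of 512 pre-computed color centroids, and flatten the result into a 1-D token sequence.

```python
import numpy as np

def image_to_tokens(image, centroids, size=32):
    """image: [H, W, 3] uint8; centroids: [512, 3] k-means color centers (assumed given).
    Returns a 1-D sequence of cluster indices, read in raster order."""
    # naive downsampling by striding (a placeholder for proper resizing)
    h, w, _ = image.shape
    small = image[:: h // size, :: w // size][:size, :size].astype(np.float32)
    # assign each pixel to its nearest color centroid (the "color vocabulary")
    d2 = ((small.reshape(-1, 1, 3) - centroids.reshape(1, -1, 3)) ** 2).sum(-1)  # [size*size, 512]
    return d2.argmin(axis=1)                              # [size*size] token ids in [0, 512)

tokens = image_to_tokens(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8),
                         np.random.rand(512, 3) * 255)
print(tokens.shape)                                        # (1024,)
```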

Paper title: Generative Pretraining from Pixels (ICML 2020 )

Paper link: https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf

Open source code: https://github.com/openai/image-gpt

3、 AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE(ICLR 2021,VIT)

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE (ICLR 2021, ViT) proposes the ViT model, which reuses the NLP Transformer architecture essentially unchanged to solve image problems. The idea is to process an image into a form similar to a token sequence in NLP and feed it into a standard Transformer. Specifically, the image is divided into multiple patches: assuming the original image has dimensions [W, H, C] (width × height × channels), it is converted into a tensor of shape [N, P*P*C], where P is the patch size and N is the number of patches, which becomes the length of the resulting sequence. Each patch is then mapped to a fixed dimension with a linear projection and fed into the Transformer Encoder, and the position embedding is simply one-dimensional. For image classification, similar to BERT, a special class token is prepended to the sequence. The paper also notes that the main difficulty of using a Transformer on images is the lack of image-specific inductive biases: the translation equivariance of CNNs, the neighborhood structure of 2-D space, and so on cannot easily be built into a Transformer. Nevertheless, experiments show that with pre-training on a sufficiently large dataset, ViT can still achieve strong results, while the paper also points out that with insufficient training data ViT's performance drops significantly.
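A sketch of the patchify-and-project step described above (an illustration with assumed sizes, not the official implementation): a 224×224×3 image cut into 16×16 patches yields N = 196 tokens, each linearly projected, with a class token prepended and a learned 1-D position embedding added.

```python
import torch
import torch.nn as nn

class PatchEmbedSketch(nn.Module):
    def __init__(self, img=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.n = (img // patch) ** 2                        # N = 196 patches
        # a stride-P conv is equivalent to cutting P x P patches + a linear projection
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))     # class token, as in BERT
        self.pos = nn.Parameter(torch.zeros(1, self.n + 1, dim))   # 1-D position embedding

    def forward(self, x):                                   # x: [B, 3, 224, 224]
        t = self.proj(x).flatten(2).transpose(1, 2)         # [B, N, dim]
        t = torch.cat([self.cls.expand(x.shape[0], -1, -1), t], dim=1)
        return t + self.pos                                 # [B, N + 1, dim] -> Transformer encoder

tokens = PatchEmbedSketch()(torch.randn(2, 3, 224, 224))    # [2, 197, 768]
```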

Paper title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021, VIT)

arXiv:https://arxiv.org/abs/2010.11929

github:https://github.com/lucidrains/vit-pytorch

Today, models derived from ViT are flourishing and appear in an endless stream. Even as Google releases a new generation of MLP models whose recognition accuracy surpasses the Transformer architecture, new models built on the ViT framework keep appearing at conferences such as CVPR and remain a research focus in the field. For more on the ViT family, the following articles are recommended (the quality of the Zhihu articles is genuinely high and worth pondering):

Vision Transformer, Universal Vision Backbone Super Detailed Interpretation (Principle Analysis + Code Interpretation) (Category) - Zhihu (zhihu.com)

What are the improved algorithms of ViT (Vision Transformer) in the past two years? - Zhihu (zhihu.com) 

Cutting edge dynamics|Vision Transformer in the past two years (qq.com)

4. Efficiency and effectiveness optimization stage of the CV Transformer

1、Transformer in Transformer (NIPS 2021)

On the basis of ViT, a series of improvements have emerged targeting both efficiency and accuracy. Transformer in Transformer (NIPS 2021) proposes the TNT model. Its core idea is to further subdivide the patches of the original ViT into sub-patches, treating each patch as a sentence and the elements obtained by subdividing it as words. First an inner Transformer builds representations of the sub-patches within each patch; then the patch and sub-patch information are fused, and an outer Transformer produces representations at patch granularity. The approach is analogous to combining word-level and character-level information in NLP. The underlying intuition is that dividing images into patches is too coarse: given the diversity of images in a dataset, decomposing patches into finer sub-patches allows more similar sub-patch groups to be found across the data, improving the model's learning ability and generalization.
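A compressed sketch of the inner/outer structure described above (shapes, depths, and the fusion step are assumptions for illustration, not Huawei's released code): an inner encoder runs over the sub-patches of each patch, its output is flattened and added to the patch embedding, and an outer encoder then runs over the patches.

```python
import torch
import torch.nn as nn

class TNTBlockSketch(nn.Module):
    def __init__(self, n_sub=16, sub_dim=24, patch_dim=384, heads=4):
        super().__init__()
        inner_layer = nn.TransformerEncoderLayer(sub_dim, heads, batch_first=True)
        outer_layer = nn.TransformerEncoderLayer(patch_dim, heads, batch_first=True)
        self.inner = nn.TransformerEncoder(inner_layer, num_layers=1)   # "words" inside a patch
        self.outer = nn.TransformerEncoder(outer_layer, num_layers=1)   # "sentences" = patches
        self.fuse = nn.Linear(n_sub * sub_dim, patch_dim)               # sub-patch info -> patch token

    def forward(self, sub_tokens, patch_tokens):
        # sub_tokens: [B * n_patches, n_sub, sub_dim]; patch_tokens: [B, n_patches, patch_dim]
        b, n_patches, _ = patch_tokens.shape
        inner_out = self.inner(sub_tokens)                     # model pixels within each patch
        fused = self.fuse(inner_out.reshape(b, n_patches, -1)) # [B, n_patches, patch_dim]
        return inner_out, self.outer(patch_tokens + fused)     # model relations across patches

sub = torch.randn(2 * 196, 16, 24)
patches = torch.randn(2, 196, 384)
_, out = TNTBlockSketch()(sub, patches)                        # out: [2, 196, 384]
```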

Paper title: Transformer in Transformer (NIPS 2021) Huawei Noah's Ark Laboratory

arXiv:https://arxiv.org/pdf/2103.00112.pdf

github:https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/tnt_pytorch

2、Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions(ICCV 2021,PVT)

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (ICCV 2021, PVT) proposes a Transformer suited to pixel-level image tasks (where a label must be predicted for every pixel). The earlier ViT divides the image into large patches, so the output resolution is low, and since the computational cost of the Transformer grows with sequence length, applying it directly to pixel-level prediction would make compute and memory explode; ViT is therefore only practical for image classification. To address this, the paper brings the pyramid idea of CNNs into the Transformer: starting from a higher initial resolution and reducing it stage by stage, the model can capture finer-grained information while keeping the cost under control. PVT is divided into 4 stages, each consisting of patch embedding plus Transformer blocks; the difference is that the output resolution shrinks stage by stage, from 1/4 of the input down to 1/32, making the backbone suitable for a variety of tasks. The model structure and a comparison with ViT are shown below:
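A schematic of the four-stage layout described above (stage widths and depths are placeholders, not the paper's configuration, and plain self-attention is used here for brevity): each stage embeds patches with a strided convolution, so the spatial resolution drops from 1/4 of the input to 1/32 while the channel width grows.

```python
import torch
import torch.nn as nn

class PVTStageSketch(nn.Module):
    """One pyramid stage: strided patch embedding + Transformer encoder on the tokens."""
    def __init__(self, in_ch, dim, stride, heads=1, depth=1):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=stride, stride=stride)   # downsample
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                                # x: [B, C, H, W]
        f = self.embed(x)                                # [B, dim, H/stride, W/stride]
        b, d, h, w = f.shape
        t = self.encoder(f.flatten(2).transpose(1, 2))   # attention over the reduced grid
        return t.transpose(1, 2).reshape(b, d, h, w)     # back to a feature map for the next stage

# resolution: 1/4 -> 1/8 -> 1/16 -> 1/32 of the input, channels growing stage by stage
stages = nn.Sequential(PVTStageSketch(3, 64, 4), PVTStageSketch(64, 128, 2),
                       PVTStageSketch(128, 320, 2), PVTStageSketch(320, 512, 2))
out = stages(torch.randn(1, 3, 224, 224))                # [1, 512, 7, 7]
```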

Paper title: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (ICCV 2021, PVT)

arXiv: https://arxiv.org/pdf/2102.12122.pdf

github:https://github.com/whai362/PVT

3、 Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet(ICCV 2021,T2T-VIT)

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (ICCV 2021, T2T-ViT) proposes the T2T-ViT model. The paper argues that the reason ViT cannot be trained directly on a medium-sized dataset to good accuracy is that its way of cutting the image into patches and flattening them into a sequence is too simple, so the model cannot learn the structural information of the image; this is corroborated by comparing its intermediate representations with the structured outputs of the middle layers of a CNN. The paper therefore proposes a new tokenization: the output tokens of each layer are reshaped back into an image, which is then soft split. Soft split means taking overlapping patches, so that relationships between different patches of the previous layer are established; this procedure also shortens the input sequence layer by layer. The specific structure of the network is as follows:
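A small sketch of the soft split described above (an illustration; the kernel/stride values are assumptions): overlapping windows are extracted with unfold so neighboring tokens share pixels, the tokens are re-embedded, and the sequence length shrinks at every step.

```python
import torch
import torch.nn as nn

def soft_split(feature_map, kernel=3, stride=2, padding=1):
    """Overlapping 'soft split': unfold the map into overlapping patches (tokens).
    Because stride < kernel, adjacent tokens overlap and share local structure."""
    tokens = nn.functional.unfold(feature_map, kernel, stride=stride, padding=padding)
    return tokens.transpose(1, 2)                        # [B, n_tokens, C * kernel * kernel]

x = torch.randn(1, 64, 28, 28)                           # tokens reshaped back into an "image"
t = soft_split(x)                                        # [1, 14*14, 64*9] -> shorter sequence
proj = nn.Linear(64 * 9, 64)                             # re-embed each overlapping token
print(proj(t).shape)                                     # [1, 196, 64]
```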

Paper title: Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (ICCV 2021, T2T-VIT)

arXiv:https://arxiv.org/abs/2101.11986

github: https://github.com/yitu-opensource/T2T-ViT 

4、Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV 2021 Best Paper Award)

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV 2021 Best Paper Award) is similar to PVT in that it divides the image into small patches and merges them layer by layer to reduce the resolution.

If ViT brought the Transformer from NLP into vision but only for classification, the Swin Transformer applies the Transformer across the subdivided tasks of vision, making it a general backbone network for the field. A comparison between Swin-T and ViT is shown in the figure below.

The Swin Transformer uses local attention: patches are grouped into windows, and attention between patches is computed only within each window, which improves efficiency. However, purely window-based self-attention loses the correlation between adjacent windows, which limits modeling power. To keep the efficient computation of non-overlapping windows while introducing cross-window connections, a new window partitioning scheme, the shifted window, is proposed: two partition settings are used alternately in two consecutive Swin-T blocks.

The first block uses the regular window partition starting from the top-left patch, evenly dividing an 8×8 feature map into 2×2 windows of size 4×4 (here M=4). The second block shifts the window partition by (M/2, M/2) patches relative to the regular one, so the new windows cross the boundaries of the previous windows and provide connections between them.

After the shift, the number of windows increases. When the number of windows in the regular partition is small (for example 2×2), the extra computation is substantial: (2×2)→(3×3) is a 2.25× increase, as shown in the figure above (and in general the window count should not grow). A naive solution is to pad the smaller windows up to M×M and mask the padded values when computing attention. Swin instead cyclically shifts the feature map so that the top-left blocks wrap around to the bottom-right, as shown in the figure above. After this cyclic shift, a window may consist of several sub-windows that are not adjacent in the original feature map, so a masking mechanism restricts self-attention to within each sub-window (the sub-windows correspond to the differently colored small patches in the figure; attention across colors is masked out). With cyclic shifting, the number of windows stays the same as in the regular partition.
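A minimal sketch of window partitioning and the cyclic shift described above (illustrative only; the attention mask construction is omitted): torch.roll displaces the feature map by M/2 in both directions before the windows are cut, so the window count stays the same as in the regular partition.

```python
import torch

def window_partition(x, m):
    """x: [B, H, W, C] -> [B * num_windows, m*m, C] non-overlapping m x m windows."""
    b, h, w, c = x.shape
    x = x.view(b, h // m, m, w // m, m, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, m * m, c)

m = 4
x = torch.randn(1, 8, 8, 96)                      # 8x8 feature map, M = 4 -> 2x2 windows
regular = window_partition(x, m)                  # [4, 16, 96]
# shifted partition: cyclically roll by M/2 so the window count is unchanged,
# then mask attention between sub-windows that were not adjacent originally (mask omitted here)
shifted = window_partition(torch.roll(x, shifts=(-m // 2, -m // 2), dims=(1, 2)), m)
print(regular.shape, shifted.shape)               # both [4, 16, 96]
```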

 

 The main contributions of this paper are:

1. Attention based on local windows

2. Construct a hierarchical feature representation

3. The key part is the shifted-window mechanism (W-MSA, SW-MSA), which addresses the problem that purely local window attention ignores correlations between neighboring windows.

4. The cyclic-shift and masking mechanisms keep the amount of computation unchanged while ignoring the attention weights of irrelevant parts.

5. Added relative position bias B.

Paper title: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV 2021 Best Paper Award)

arXiv:https://arxiv.org/abs/2103.14030

github:https://github.com/microsoft/Swin-Transformer

Code explanation: https://blog.csdn.net/qq_52302919/article/details/123988764

Video explanation: Intensive lecture on Swin Transformer paper and its PyTorch line-by-line reproduction_哔哩哔哩_bilibili

Note recommendation: https://blog.csdn.net/qq_45122568/article/details/124659955

5、A Time Series is Worth 64 Words: Long-term Forecasting with Transformers(ICLR 2023) 

Ever since the time series forecasting paper Are Transformers Effective for Time Series Forecasting? (2022) beat complex Transformer models with a simple model, whether the Transformer is suitable for time series forecasting has been a major debate in the community. An ICLR 2023 paper, A TIME SERIES IS WORTH 64 WORDS: LONG-TERM FORECASTING WITH TRANSFORMERS (ICLR 2023), proposes a new Transformer-based method for time series forecasting and representation learning that converts the time series into patches, similar to the Vision Transformer, and achieves very strong results.

The model proposed in this paper is called PatchTST; its overall structure is shown in the figure below. The original multivariate time series is first split into multiple univariate series; each univariate series is fed independently into a Transformer with shared parameters, the predictions are produced separately, and they are finally concatenated to obtain the multivariate forecast.

Each univariate series is split into overlapping or non-overlapping patches, and each patch is mapped into the representation space with a fully connected layer; a position embedding is also added to mark the order of the patches. The resulting patch sequence is then encoded by the Transformer, and the encoded output is passed through a fully connected layer to obtain the prediction.
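A small sketch of the patching step described above (the patch length and stride are assumed values, not the paper's settings): each univariate series is cut into overlapping patches with unfold, each patch is linearly projected, and a learned position embedding marks patch order.

```python
import torch
import torch.nn as nn

def patchify_series(x, patch_len=16, stride=8):
    """x: [B, L] univariate series -> [B, n_patches, patch_len] overlapping patches."""
    return x.unfold(dimension=1, size=patch_len, step=stride)

series = torch.randn(32, 512)                        # batch of univariate series, L = 512
patches = patchify_series(series)                    # [32, 63, 16]
proj = nn.Linear(16, 128)                            # map each patch to the model dimension
pos = nn.Parameter(torch.zeros(1, patches.shape[1], 128))   # position embedding per patch
tokens = proj(patches) + pos                         # [32, 63, 128] -> shared Transformer encoder
print(tokens.shape)
```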

Paper title: A TIME SERIES IS WORTH 64 WORDS: LONG-TERM FORECASTING WITH TRANSFORMERS (ICLR 2023)

arXiv:https://arxiv.org/abs/2211.14730

5. TiDE model based on MLP, without using attention mechanism, CNN and RNN

Following the recent line of work questioning whether the Transformer is really effective for time series forecasting and beating complex models with MLPs, Google published new long-term forecasting work: Long-term Forecasting with TiDE: Time-series Dense Encoder (2023) proposes the TiDE model, which contains no attention mechanism, RNN, or CNN at all and is built entirely from fully connected layers.

The whole model can be divided into four parts: Feature Projection, Dense Encoder, Dense Decoder, and Temporal Decoder.

Feature Projection maps the external covariates to a low-dimensional vector using a Residual Block; its main purpose is to reduce the dimensionality of the covariates.

The Dense Encoder concatenates the historical sequence, the static attribute information, and the projected low-dimensional covariates, and maps them through multiple stacked Residual Blocks to produce an encoding e.

The Dense Decoder maps e to g using the same kind of stacked Residual Blocks and reshapes g into a [p, H] matrix, where H is the length of the forecast horizon and p is the decoder output dimension, i.e. one vector for each time step of the forecast window.

The Temporal Decoder concatenates g from the previous step with the covariates x along the time dimension, applies a Residual Block to the features at each step to produce the per-step output, and then adds a direct (residual) mapping of the historical sequence to obtain the final forecast.
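A compact sketch of the MLP-only structure described above (layer sizes are placeholders and the static attribute input is omitted; this is an illustration, not Google's implementation): a residual MLP block is the only building unit, used for the feature projection, the dense encoder/decoder, and the temporal decoder, with a final skip connection from the history.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """The only building unit: MLP + skip connection, no attention, CNN, or RNN."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))
        self.skip = nn.Linear(d_in, d_out)
        self.norm = nn.LayerNorm(d_out)

    def forward(self, x):
        return self.norm(self.mlp(x) + self.skip(x))

# assumed sizes: lookback L=96, horizon H=24, r-dim covariates, p-dim per-step decoding
L, H, r, proj_dim, p, hidden = 96, 24, 8, 4, 16, 256
feature_proj = ResidualBlock(r, hidden, proj_dim)                    # per-step covariate projection
encoder = ResidualBlock(L + (L + H) * proj_dim, hidden, hidden)      # history + projected covariates
decoder = ResidualBlock(hidden, hidden, p * H)                       # e -> g, reshaped to [H, p]
temporal_decoder = ResidualBlock(p + proj_dim, hidden, 1)            # per-step prediction
lookback_skip = nn.Linear(L, H)                                      # residual mapping of the history

y_hist = torch.randn(32, L)
covariates = torch.randn(32, L + H, r)
x_proj = feature_proj(covariates)                                    # [32, L+H, proj_dim]
e = encoder(torch.cat([y_hist, x_proj.flatten(1)], dim=-1))          # dense encoding
g = decoder(e).view(32, H, p)                                        # one vector per forecast step
per_step = torch.cat([g, x_proj[:, L:, :]], dim=-1)                  # stitch with future covariates
forecast = temporal_decoder(per_step).squeeze(-1) + lookback_skip(y_hist)   # [32, H]
print(forecast.shape)
```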

Paper title: Long-term Forecasting with TiDE: Time-series Dense Encoder (2023)

arXiv:https://arxiv.org/abs/2304.08424

In addition to reading the literature myself, I also referred to the following high-quality articles; readers with time are encouraged to read them carefully:

https://mp.weixin.qq.com/s/e14myypqfYrPuBCv2rPCfA

In addition, there are more journal conference related papers:

2020 Top Conference Review | Summarize research trends from these best papers (the paper download link is attached in the article) - Tencent Cloud Developer Community - Tencent Cloud (tencent.com)

Vision Transformer pre-training model - DeepWWJ's Blog - CSDN Blog


Original source: blog.csdn.net/Next_SummerAgain/article/details/130307459