2021
1 intro
- The research question of the paper: can ViT complete the image generation task without using convolution or pooling?
- That is, use ViT instead of a CNN to build the GAN
- When integrating the ViT architecture into GANs, the authors found that existing GAN regularization methods interact poorly with the self-attention mechanism, causing serious instability during training.
- ——>They introduce new regularization techniques to train GANs with ViT
- ViTGAN substantially outperforms earlier Transformer-based GANs and, without using convolution or pooling, achieves performance comparable to CNN-based GANs such as StyleGAN2.
- ViTGAN is one of the first models to leverage Vision Transformers in GANs
2 methods
- Directly using ViT as the discriminator makes training unstable.
- The paper introduces new techniques for both the generator and the discriminator to stabilize training dynamics and promote convergence.
- (1) Regularization of ViT discriminator;
- (2) New architecture of generator
2.1 Regularization of ViT discriminator
- Lipschitz continuity is important for the GAN discriminator
- However, recent work shows that the Lipschitz constant of standard dot-product self-attention layers can be unbounded, so Lipschitz continuity is violated in ViTs.
- ——>1, use Euclidean (L2) distance instead of dot-product similarity in self-attention, which keeps the attention layer Lipschitz
- ——>2, improved spectral normalization (ISN): multiply the spectrally normalized weight matrix of each layer by the spectral norm of that matrix at initialization, i.e. W̄ = σ(W_init) · W / σ(W)
- For any matrix A, its spectral norm is defined as σ(A) = max_{x ≠ 0} ‖Ax‖₂ / ‖x‖₂
- It can also be defined as the maximum singular value of A
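The first technique above (L2 self-attention) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `l2_attention`, the temperature `tau`, and the tied query/key projection `w_qk` are assumptions for illustration; the core idea from the paper is that attention logits are negative squared Euclidean distances rather than dot products.

```python
import numpy as np

def l2_attention(x, w_qk, w_v, tau=None):
    """Sketch of L2 (Euclidean-distance) self-attention.

    x    : (n, d_model) token features
    w_qk : (d_model, d) tied query/key projection (assumption: tying q/k
           weights, as in the paper's Lipschitz argument)
    w_v  : (d_model, d_v) value projection
    """
    q = x @ w_qk                      # queries and keys share one projection
    k = x @ w_qk
    v = x @ w_v
    # Pairwise squared distances ||q_i - k_j||^2 via the expansion
    # ||q||^2 + ||k||^2 - 2 q.k (broadcasted to an (n, n) matrix).
    d2 = (np.sum(q ** 2, axis=-1)[:, None]
          + np.sum(k ** 2, axis=-1)[None, :]
          - 2.0 * (q @ k.T))
    if tau is None:
        tau = np.sqrt(q.shape[-1])    # assumed scaling, analogous to 1/sqrt(d)
    logits = -d2 / tau                # closer tokens -> larger logits
    # Numerically stable softmax over the key axis.
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Because similarity is a distance, identical tokens attend identically, and the attention map no longer involves an unbounded bilinear form.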
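The spectral norm definition and the ISN rescaling can be checked numerically. A minimal sketch, assuming a power-iteration estimator (the function names `spectral_norm` and `isn` are illustrative, not from the paper's code):

```python
import numpy as np

def spectral_norm(w, n_iters=200, seed=0):
    """Estimate sigma(w), the largest singular value of w, by power iteration
    on the pair (w, w.T); equivalent to max ||w x|| / ||x|| over x != 0."""
    u = np.random.default_rng(seed).normal(size=w.shape[0])
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    return float(u @ w @ v)           # Rayleigh-quotient estimate of sigma

def isn(w, sigma_init):
    """ISN sketch: spectrally normalize w, then rescale by the spectral norm
    the layer had at initialization (sigma_init): sigma_init * w / sigma(w)."""
    return sigma_init * w / spectral_norm(w)
```

At initialization `isn` is the identity (since `sigma(w) == sigma_init`); later in training it pins each layer's spectral norm back to its initial value instead of to 1 as in standard spectral normalization.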
2.2 Generator design
3 experiments