Paper notes: ViTGAN: Training GANs with Vision Transformers

2021

1 intro

  • The research question of the paper: can ViT perform image generation without using convolutions or pooling?
    • That is, use ViT instead of a CNN as the backbone of a GAN for image generation
  • When integrating the ViT architecture into a GAN, the authors found that existing GAN regularization methods interact poorly with the self-attention mechanism, causing severe instability during training.
    • ——>They introduce new regularization techniques to train GANs with ViT
    • ViTGAN substantially outperforms prior Transformer-based GANs, and its performance is comparable to CNN-based GANs (e.g., StyleGAN2) without using convolution or pooling.
    • ViTGAN is one of the first models to leverage Vision Transformers in GANs

2 methods

  • Directly using ViT as the discriminator makes training unstable.
    • The paper introduces new techniques for both the generator and the discriminator to stabilize training dynamics and promote convergence.
      • (1) Regularization of the ViT discriminator;
      • (2) A new generator architecture

 2.1 Regularization of ViT discriminator

  • Lipschitz continuity is important for a GAN discriminator
  • However, recent work shows that the Lipschitz constant of standard dot-product self-attention can be unbounded, so Lipschitz continuity is violated in ViTs.
    • —>1. Use Euclidean (L2) distance instead of dot-product similarity in self-attention
    • —>2. Improved spectral normalization: multiply the spectrally normalized weight matrix of each layer by the spectral norm it had at initialization
      • For any matrix A, its spectral norm is defined as σ(A) = max_{x≠0} ‖Ax‖₂ / ‖x‖₂
        • Equivalently, σ(A) is the largest singular value of A
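The two discriminator fixes above can be sketched in PyTorch. This is my own minimal illustration, not the authors' code: `l2_attention_scores` and `ISNLinear` are hypothetical names, and the exact-SVD spectral norm here stands in for the power iteration a real implementation would use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def l2_attention_scores(q, k):
    # Fix 1: replace dot-product similarity with negative squared Euclidean
    # distance between queries and keys, so the attention map stays
    # Lipschitz-bounded.
    # q, k: (batch, seq, dim)
    d = torch.cdist(q, k, p=2) ** 2          # pairwise squared L2 distances
    return F.softmax(-d / q.shape[-1] ** 0.5, dim=-1)

class ISNLinear(nn.Module):
    """Fix 2 (improved spectral normalization): rescale the spectrally
    normalized weight by the spectral norm it had at initialization,
    i.e. W_ISN = sigma_init * W / sigma(W)."""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_f))
        with torch.no_grad():
            # sigma(A) = largest singular value of A
            self.sigma_init = torch.linalg.matrix_norm(self.weight, ord=2).item()

    def forward(self, x):
        # Exact SVD is used here for clarity; standard spectral norm
        # implementations approximate sigma(W) with power iteration.
        sigma = torch.linalg.matrix_norm(self.weight, ord=2)
        return F.linear(x, self.sigma_init * self.weight / sigma, self.bias)
```

At initialization sigma equals sigma_init, so the layer starts as a plain linear map; during training the weight is continually rescaled back to its initial spectral norm.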

 2.2 Generator design

3 experiments

 

Origin blog.csdn.net/qq_40206371/article/details/133267199