SAGAN: Self-Attention Generative Adversarial Networks - 1 - Paper Learning

Abstract

In this paper, we propose Self-Attention Generative Adversarial Networks (SAGAN), which allow attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other. Furthermore, recent work has shown that generator conditioning affects GAN performance. Leveraging this insight, we apply spectral normalization to the GAN generator and find that this improves training dynamics. The proposed SAGAN performs better than prior work on the challenging ImageNet dataset, boosting the best published Inception score from 36.8 to 52.52 and reducing the Fréchet Inception distance from 27.62 to 18.65. Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape.

 

1. Introduction

Image synthesis is an important problem in computer vision. Great progress has been made with the emergence of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), though many open problems remain (Odena, 2019). In particular, GANs based on deep convolutional networks have been especially successful (Radford et al., 2016; Zhang et al., 2017; Karras et al., 2018). However, by carefully examining the samples generated by these models, we can observe that convolutional GANs (Odena et al., 2017; Miyato et al., 2018; Miyato & Koyama, 2018) have much more difficulty modeling some image classes than others when trained on multi-class datasets (e.g., ImageNet (Russakovsky et al., 2015)). For example, the state-of-the-art ImageNet GAN model (Miyato & Koyama, 2018) excels at synthesizing image classes with few structural constraints (e.g., ocean and sky, which are distinguished more by texture than by geometry), but fails to capture geometric or structural patterns that occur consistently in some classes (for example, dogs are often drawn with realistic fur texture but without clearly defined separate feet). One possible explanation is that previous models rely heavily on convolution to model the dependencies between different image regions. Since the convolution operator has a local receptive field, long-range dependencies can only be processed after passing through several convolutional layers. This hinders learning about long-range dependencies for several reasons:

  • A small model may not be able to represent them.
  • Optimization algorithms may have trouble discovering parameter values that carefully coordinate multiple layers to capture these dependencies, and such parameterizations may be statistically brittle and prone to failure when applied to previously unseen inputs.
  • Increasing the size of the convolution kernels can increase the representational capacity of the network, but doing so loses the computational and statistical efficiency obtained by using a local convolutional structure.

Self-attention (Cheng et al., 2016; Parikh et al., 2016; Vaswani et al., 2017), on the other hand, exhibits a better balance between the ability to model long-range dependencies and computational and statistical efficiency. The self-attention module computes the response at a position as a weighted sum of the features at all positions, where the weights, or attention vectors, are calculated at only a small computational cost.

In this work, we propose Self-Attention Generative Adversarial Networks (SAGANs), which introduce a self-attention mechanism into convolutional GANs.

Its benefits include:

  • The self-attention module is complementary to convolutions and helps with modeling long-range, multi-level dependencies across image regions.
  • Armed with self-attention, the generator can draw images in which fine details at every location are carefully coordinated with fine details in distant portions of the image.
  • Moreover, the discriminator can also more accurately enforce complicated geometric constraints on the global image structure.

In addition to self-attention, we also incorporate recent insights relating network conditioning to GAN performance. The work of (Odena et al., 2018) showed that well-conditioned generators tend to perform better. We propose enforcing good conditioning of the GAN generator using the spectral normalization technique that had previously been applied only to the discriminator (Miyato et al., 2018).
We conducted extensive experiments on the ImageNet dataset to validate the effectiveness of the proposed self-attention mechanism and stabilization techniques. SAGAN significantly outperforms prior work in image synthesis, raising the best published Inception score from 36.8 to 52.52 and reducing the Fréchet Inception distance from 27.62 to 18.65. Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape. Our code is available at https://github.com/brain-research/self-attention-gan.

 

2. Related Work

Generative Adversarial Networks. GANs have achieved great success in various image generation tasks, including image-to-image translation (Isola et al., 2017; Zhu et al., 2017; Taigman et al., 2017; Liu & Tuzel, 2016; Xue et al., 2018; Park et al., 2019), image super-resolution (Ledig et al., 2017; Sønderby et al., 2017), and text-to-image synthesis (Reed et al., 2016a;b; Zhang et al., 2017; Hong et al., 2018). Despite this success, GAN training is unstable and very sensitive to the choice of hyper-parameters. Several works have tried to stabilize GAN training dynamics and improve sample diversity by designing new network architectures (Radford et al., 2016; Zhang et al., 2017; Karras et al., 2018; 2019), modifying the learning objectives and dynamics (Arjovsky et al., 2017; Salimans et al., 2018; Metz et al., 2017; Che et al., 2017; Zhao et al., 2017; Jolicoeur-Martineau, 2019), adding regularization methods (Gulrajani et al., 2017; Miyato et al., 2018), and introducing heuristic tricks (Salimans et al., 2016; Odena et al., 2017; Azadi et al., 2018). Recently, Miyato et al. (Miyato et al., 2018) proposed limiting the spectral norm of the weight matrices in the discriminator in order to constrain the Lipschitz constant of the discriminator function. Combined with the projection-based discriminator (Miyato & Koyama, 2018), the spectrally normalized model greatly improves class-conditional image generation on ImageNet.

Attention Models. Recently, attention mechanisms have become an integral part of models that need to capture global dependencies (Bahdanau et al., 2014; Xu et al., 2015; Yang et al., 2016; Gregor et al., 2015; Chen et al., 2018). In particular, self-attention (Cheng et al., 2016; Parikh et al., 2016), also called intra-attention, computes the response at a position in a sequence by attending to all positions within the same sequence. Vaswani et al. (Vaswani et al., 2017) demonstrated that machine translation models can achieve state-of-the-art results by using only a self-attention model. Parmar et al. (Parmar et al., 2018) proposed the Image Transformer model, which adds self-attention to an autoregressive model for image generation. Wang et al. (Wang et al., 2018) formalized self-attention as a non-local operation to model the spatial-temporal dependencies in video sequences. Despite this progress, self-attention had not been explored in the context of GANs. (AttnGAN (Xu et al., 2018) uses attention over word embeddings of an input sequence, but not self-attention over the model's internal states.) SAGAN learns to efficiently find global, long-range dependencies within internal representations of images.

 

3. Self-Attention Generative Adversarial Networks

Most GAN-based models for image generation (Radford et al., 2016; Karras et al., 2018; Salimans et al., 2016) are built using convolutional layers. Convolution processes information in a local neighborhood, so using convolutional layers alone is computationally inefficient for modeling long-range dependencies in images. In this section, we adapt the non-local model of (Wang et al., 2018) to introduce self-attention into the GAN framework, enabling both the generator and the discriminator to efficiently model relationships between widely separated spatial regions. Because of its self-attention module (see Figure 2), we call the proposed method Self-Attention Generative Adversarial Network (SAGAN).


The image features from the previous hidden layer x ∈ R^{C×N} are first transformed into two feature spaces f and g to compute the attention, where f(x) = W_f x and g(x) = W_g x:
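β_{j,i} = exp(s_{ij}) / Σ_{i=1..N} exp(s_{ij}),  where  s_{ij} = f(x_i)ᵀ g(x_j)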

 

Here β_{j,i} indicates the extent to which the model attends to the i-th location when synthesizing the j-th region, C is the number of channels, and N is the number of feature locations from the previous hidden layer. The output of the attention layer is o = (o_1, o_2, ..., o_j, ..., o_N) ∈ R^{C×N}, where:
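o_j = v( Σ_{i=1..N} β_{j,i} h(x_i) ),  h(x_i) = W_h x_i,  v(x_i) = W_v x_i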

In the above formulation, W_g ∈ R^{C̄×C}, W_f ∈ R^{C̄×C}, W_h ∈ R^{C̄×C} and W_v ∈ R^{C×C̄} are learned weight matrices, implemented as 1×1 convolutions. Since we did not notice any significant performance decrease when reducing the number of channels C̄ to C/k with k = 1, 2, 4, 8 after a few epochs of training on ImageNet, we choose k = 8 (i.e. C̄ = C/8) in all our experiments for memory efficiency.

 

To summarize, C×N = C×(W×H), i.e. the feature map is flattened so that the above operations can be carried out as matrix multiplications. The outputs of f(x) and g(x) have shape [C/8, N]; multiplying the transpose of f(x) by g(x) yields a matrix s of size [N, N], which describes the relationship between every pair of pixel positions and can be regarded as a correlation matrix. The h(x) branch is handled slightly differently: its output of shape [C/8, N] is later projected back to C channels by the 1×1 convolution W_v.

A softmax is then applied to s to obtain the normalized matrix β, where β_{j,i} represents how much the model attends to the i-th position when synthesizing the j-th pixel; this is the attention map.

The attention map is then applied to the output of h(x): each h(x_i) is weighted by β_{j,i}, its degree of influence on the j-th generated pixel, and the weighted features are summed over i, so that each pixel is generated according to these influences. A final 1×1 convolution is applied to this result to obtain the self-attention feature map o.

 

In addition, we multiply the output of the attention layer by a learnable scale parameter and add back the input feature map. Therefore, the final output is:
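y_i = γ o_i + x_i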

The result is thus the original feature map x plus the scaled output o of the attention mechanism.

 

Here γ is a learnable scalar, initialized to 0. Introducing γ allows the network to first rely on the cues in the local neighborhood, since this is easier, and then gradually learn to assign more weight to the non-local evidence. The intuition is straightforward: we want to learn the easy task first and then progressively increase the complexity of the task. In SAGAN, the proposed attention module is applied to both the generator and the discriminator, which are trained in an alternating fashion by minimizing the hinge version of the adversarial loss (Lim & Ye, 2017; Tran et al., 2017; Miyato et al., 2018).
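To make the shapes above concrete, here is a minimal PyTorch sketch of such a self-attention block. The class and variable names are my own illustration and the official implementation linked earlier may differ in details, but it follows the f/g/h/v projections, the softmax attention map, and the γ-scaled residual described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, in_channels, k=8):
        super().__init__()
        c_bar = in_channels // k                   # C̄ = C / k (k = 8 in the paper)
        self.f = nn.Conv2d(in_channels, c_bar, 1)  # f(x) = W_f x
        self.g = nn.Conv2d(in_channels, c_bar, 1)  # g(x) = W_g x
        self.h = nn.Conv2d(in_channels, c_bar, 1)  # h(x) = W_h x
        self.v = nn.Conv2d(c_bar, in_channels, 1)  # v(x) = W_v x, back to C channels
        self.gamma = nn.Parameter(torch.zeros(1))  # γ initialized to 0

    def forward(self, x):
        B, C, H, W = x.shape
        N = H * W
        f = self.f(x).view(B, -1, N)               # [B, C/8, N]
        g = self.g(x).view(B, -1, N)               # [B, C/8, N]
        h = self.h(x).view(B, -1, N)               # [B, C/8, N]
        s = torch.bmm(f.transpose(1, 2), g)        # s_ij = f(x_i)^T g(x_j), [B, N, N]
        beta = F.softmax(s, dim=1)                 # normalize over positions i
        o = torch.bmm(h, beta)                     # o_j = sum_i beta_{j,i} h(x_i)
        o = self.v(o.view(B, -1, H, W))            # project back to [B, C, H, W]
        return self.gamma * o + x                  # y_i = γ o_i + x_i
```

A block like this can simply be inserted between two convolutional layers of the generator or discriminator, e.g. `SelfAttention(in_channels=64)` at the 32×32 feature-map stage.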

 

4. Techniques to Stabilize the Training of GANs

We also investigate two techniques to stabilize the training of GANs on challenging datasets. First, we use spectral normalization (Miyato et al., 2018) in the generator as well as the discriminator. Second, we confirm that the two-timescale update rule (TTUR) (Heusel et al., 2017) is effective, and we advocate using it specifically to address slow learning in regularized discriminators.

4.1. Spectral normalization for both generator and discriminator

Miyato et al. (Miyato et al., 2018) originally proposed stabilizing GAN training by applying spectral normalization to the discriminator network. Doing so constrains the Lipschitz constant of the discriminator by restricting the spectral norm of each layer. Compared with other normalization techniques, spectral normalization does not require extra hyper-parameter tuning (setting the spectral norm of all weight layers to 1 consistently performs well in practice). Moreover, the computational cost is relatively small.

We argue that the generator can also benefit from spectral normalization, based on recent evidence that the conditioning of the generator is an important factor affecting GAN performance (Odena et al., 2018). Spectral normalization in the generator can prevent the escalation of parameter magnitudes and avoid unusual gradients. We found empirically that spectral normalization of both the generator and the discriminator makes it possible to use fewer discriminator updates per generator update, which significantly reduces the computational cost of training. The approach also shows more stable training behavior.
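A minimal sketch (not the authors' code) of what this looks like in PyTorch: the same `spectral_norm` wrapper is applied to the weight layers of both networks, keeping the spectral norm of each weight matrix at roughly 1.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch, out_ch, **kw):
    # Wrap the convolution so the spectral norm of its weight is constrained to ~1.
    return spectral_norm(nn.Conv2d(in_ch, out_ch, 3, padding=1, **kw))

# Used identically when building generator and discriminator blocks, e.g.:
generator_block = nn.Sequential(sn_conv(128, 64), nn.ReLU(inplace=True))
discriminator_block = nn.Sequential(sn_conv(3, 64), nn.LeakyReLU(0.1, inplace=True))
```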

 

4.2. Imbalanced learning rate for generator and discriminator updates

In prior work, regularization of the discriminator (Miyato et al., 2018; Gulrajani et al., 2017) often slows down the GAN learning process. In practice, methods using regularized discriminators typically require multiple (e.g., 5) discriminator update steps per generator update step during training. Heusel et al. (Heusel et al., 2017) advocated using separate learning rates (TTUR) for the generator and the discriminator. We propose using TTUR specifically to compensate for the problem of slow learning in a regularized discriminator, making it possible to use fewer discriminator steps per generator step. Using this approach, we are able to produce better results given the same wall-clock time.
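A small sketch of TTUR with the hinge adversarial loss, using the learning rates and Adam betas reported in the implementation details later in this post (0.0001 for G, 0.0004 for D, β1 = 0, β2 = 0.9) and 1:1 updates. The tiny fully-connected G/D and random batches are placeholders, not the SAGAN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z_dim = 128
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, 784))
D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1))

g_opt = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9))  # slower generator
d_opt = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.9))  # faster discriminator

for step in range(1000):
    real = torch.randn(64, 784)                 # placeholder for a batch of real images
    z = torch.randn(64, z_dim)

    # One discriminator step per generator step (1:1 updates).
    fake = G(z).detach()
    d_loss = F.relu(1.0 - D(real)).mean() + F.relu(1.0 + D(fake)).mean()  # hinge loss for D
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    g_loss = -D(G(z)).mean()                    # hinge loss for G
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```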

 

5. Experiments

To evaluate the proposed methods, we conducted extensive experiments on the LSVRC2012 (ImageNet) dataset (Russakovsky et al., 2015). First, in Section 5.1, we present experiments designed to evaluate the effectiveness of the two proposed techniques for stabilizing GAN training. Next, the proposed self-attention mechanism is investigated in Section 5.2. Finally, our SAGAN is compared with state-of-the-art methods (Odena et al., 2017; Miyato & Koyama, 2018) on the image generation task in Section 5.3. Models were trained with synchronous SGD (as there are well-known difficulties with asynchronous SGD, see e.g. (Odena, 2016)) on 4 GPUs each, for about 2 weeks per model.

 

Evaluation metrics. We choose the Inception score (IS) (Salimans et al., 2016) and the Fréchet Inception distance (FID) (Heusel et al., 2017) for quantitative evaluation. Though alternative metrics exist (Zhou et al., 2019; Khrulkov & Oseledets, 2018; Olsson et al., 2018), they are not widely used. The Inception score (Salimans et al., 2016) computes the KL divergence between the conditional class distribution and the marginal class distribution. A higher Inception score indicates better image quality. We include the Inception score because it is widely used and thus makes it possible to compare our results with previous work. However, it is important to understand that it has serious limitations: it is intended primarily to ensure that the model generates samples that can be confidently recognized as belonging to a specific class, and that the model generates samples from many classes, not necessarily to assess the realism of details or the intra-class diversity. FID is a more principled and comprehensive metric, and has been shown to be more consistent with human evaluation in assessing the realism and variation of the generated samples (Heusel et al., 2017). FID computes the Wasserstein-2 distance between the generated images and the real images in the feature space of an Inception-v3 network. Besides the FID computed over the whole data distribution (i.e., all 1000 classes of images in ImageNet), we also compute the FID between the generated images and the dataset images within each class (called intra FID (Miyato & Koyama, 2018)). Lower FID and intra FID values mean closer distances between the synthetic and real data distributions. In all our experiments, 50k samples are randomly generated for each model to compute the Inception score, FID and intra FID.
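For reference, a small numpy/scipy sketch of the FID computation described above: the Wasserstein-2 distance between Gaussians fitted to Inception-v3 features of the real and generated image sets. Here mu_r/sigma_r and mu_g/sigma_g are assumed to be precomputed feature means and covariances.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(mu_r, sigma_r, mu_g, sigma_g):
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root
    covmean = np.real(covmean)            # drop tiny imaginary parts from sqrtm
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```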

 

Network structures and implementation details. All the SAGAN models we trained are designed to generate 128×128 images. By default, spectral normalization (Miyato et al., 2018) is used for the layers in both the generator and the discriminator. Similar to (Miyato & Koyama, 2018), SAGAN uses conditional batch normalization in the generator and projection in the discriminator. For all models, we use the Adam optimizer (Kingma & Ba, 2015) with β1 = 0 and β2 = 0.9 for training. By default, the learning rate for the discriminator is 0.0004 and the learning rate for the generator is 0.0001.

 

5.1. Evaluating the proposed stabilization techniques

In this section, experiments are conducted to evaluate the effectiveness of the proposed stabilization techniques, i.e., applying spectral normalization (SN) to the generator and using imbalanced learning rates (TTUR). In Figure 3, our models "SN on G/D" and "SN on G/D+TTUR" are compared with a baseline model implemented on the basis of the state-of-the-art image generation method (Miyato et al., 2018).

In this baseline model, SN is used only in the discriminator. When we train it with a 1:1 balanced update schedule for the discriminator (D) and the generator (G), training becomes very unstable, as shown in the leftmost sub-figures of Figure 3. It exhibits mode collapse very early in training. For example, the top-left sub-figure of Figure 4 shows some images randomly generated by the baseline model at its 10k-th iteration.

Although in the original paper (Miyato et al., 2018) this unstable training behavior was greatly mitigated by using a 5:1 imbalanced update schedule for D and G, the ability to train stably with 1:1 balanced updates is desirable for improving the convergence speed of the model. Thus, using our proposed techniques means that the model can produce better results given the same wall-clock time, and there is no need to search for a suitable update ratio for the generator and the discriminator. As shown in the middle sub-figures of Figure 3, adding SN to both the generator and the discriminator greatly stabilizes our model "SN on G/D", even when it is trained with 1:1 balanced updates. However, the sample quality does not improve monotonically during training. For example, the image quality as measured by FID and IS starts to drop at the 260k-th iteration. Example images randomly generated by this model at different iterations are shown in Figure 4. When we also apply the imbalanced learning rates to train the discriminator and the generator, the quality of images generated by our "SN on G/D+TTUR" model improves monotonically throughout the whole training process. As shown in Figures 3 and 4, we do not observe any significant decrease in sample quality, FID, or Inception score during one million training iterations. Thus, both the quantitative and the qualitative results demonstrate the effectiveness of the proposed stabilization techniques for GAN training. They also demonstrate that the effects of the two techniques are at least partly additive. In the remaining experiments, all models use spectral normalization for both the generator and the discriminator and use the imbalanced learning rates to train the generator and the discriminator with 1:1 updates.

 

5.2. Self-attention mechanism.

To explore the effect of the proposed self-attention mechanism, we build several SAGAN models by adding the self-attention mechanism to different stages of the generator and the discriminator. As shown in Table 1, the SAGAN models with self-attention at middle-to-high-level feature maps (e.g., feat32 and feat64) achieve better performance than the models with self-attention at low-level feature maps (e.g., feat8 and feat16).

For example, the FID of the "SAGAN, feat8" model is improved from 22.98 to 18.28 by "SAGAN, feat32". The reason is that self-attention receives more evidence and enjoys more freedom to choose conditions with larger feature maps (i.e., it is complementary to convolution for large feature maps), whereas it plays a role similar to local convolution when modeling dependencies for small feature maps (e.g., 8×8). This demonstrates that the attention mechanism gives both the generator and the discriminator more power to directly model long-range dependencies in the feature maps. In addition, the comparison of our SAGAN with the baseline model without attention (the 2nd column of Table 1) further shows the effectiveness of the proposed self-attention mechanism.

Compared with residual blocks with the same number of parameters, the self-attention blocks also achieve better results. For example, when we replace the self-attention block at the 8×8 feature maps with a residual block, training is not stable and the performance drops significantly (e.g., FID increases from 22.98 to 42.13). Even in cases where training goes smoothly, replacing the self-attention block with a residual block still leads to worse FID and Inception scores (e.g., FID 18.28 vs 27.33 at the 32×32 feature maps). This comparison demonstrates that the performance improvement given by SAGAN is not simply due to an increase in model depth and capacity.
To better understand what has been learned during generation, we visualize the attention weights of the generator in SAGAN for different images. Some sample images with attention are shown in Figure 5 and Figure 1. See the caption of Figure 5 for some properties of the learned attention maps.

 

 

 

5.3. Comparison with the state-of-the-art

Our SAGAN is also compared with state-of-the-art GAN models (Odena et al., 2017; Miyato & Koyama, 2018) for class-conditional image generation on ImageNet. As shown in Table 2, our proposed SAGAN achieves the best Inception score, intra FID, and FID.

SAGAN significantly improves the best published Inception score from 36.8 to 52.52. The lower FID (18.65) and intra FID (83.7) achieved by SAGAN also show that, by using the self-attention module to model the long-range dependencies between image regions, SAGAN can better approximate the original image distribution.
Figure 6 shows some comparison results and generated images for representative ImageNet classes.

We observe that our SAGAN achieves better performance (i.e., lower intra FID) than the state-of-the-art GAN model (Miyato & Koyama, 2018) for synthesizing image classes with complex geometric or structural patterns, such as goldfish and Saint Bernard. For classes with fewer structural constraints (e.g., valley, stone wall, and coral fungus, which are distinguished more by texture than by geometry), our SAGAN shows less superiority over the baseline model (Miyato & Koyama, 2018). Again, the reason is that the self-attention in SAGAN is complementary to convolution for capturing the long-range, global-level dependencies occurring consistently in geometric or structural patterns, but plays a role similar to local convolution when modeling dependencies for simple textures.

SAGAN is therefore better suited to synthesizing image classes with complex geometric or structural patterns.

 

6. Conclusion

In this paper, we proposed Self-Attention Generative Adversarial Networks (SAGANs), which incorporate a self-attention mechanism into the GAN framework. The self-attention module is effective for modeling long-range dependencies. In addition, we showed that spectral normalization applied to the generator stabilizes GAN training, and that TTUR speeds up the training of regularized discriminators. SAGAN achieves state-of-the-art performance on class-conditional image generation on ImageNet.

 


Origin www.cnblogs.com/wanghui-garcia/p/11766406.html