[PMLR 2021] Zero-Shot Text-to-Image Generation

Fig 1. Comparison of original images (top) and discrete VAE reconstructions (bottom). The encoder downsamples the spatial resolution by a factor of 8. While details (for example, the texture of cat fur, text on storefronts, and thin lines in illustrations) are sometimes lost or distorted, the main features of the image are often still recognizable. We use a large vocabulary size of 8192 to mitigate the loss of information.

Original Link: [PMLR 2021] Zero-Shot Text-to-Image Generation (by Frontiers of Small-Shot Vision and Intelligence)

01 What is insufficient about existing work?

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions may involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks provided during training.

02 What problem does the article solve?

We describe a simple approach based on a Transformer that autoregressively models text and image tokens as a single data stream, enabling zero-shot text-to-image generation.

03 What is the key solution?

In this work, we show that training a 12-billion-parameter autoregressive transformer on 250 million image-text pairs collected from the Internet results in a flexible, high-fidelity image generation model that can be controlled through natural language.

04 What is the main contribution?

  • We investigate a simple method for text-to-image generation based on autoregressive transformers.
  • The proposed method is able to perform complex tasks such as image-to-image translation at a rudimentary level. Such tasks previously required custom approaches (Isola et al., 2017) rather than emerging as a capability of a single large generative model.

05 How is the method implemented?

Our goal is to train a transformer to autoregressively model text and image tokens as a single stream of data. However, for high-resolution images, using pixels directly as image tokens would require an inordinate amount of memory. Likelihood objectives also tend to prioritize modeling short-range dependencies between pixels, so most of the modeling capacity would be spent capturing high-frequency details rather than the low-frequency structure that makes objects visually recognizable.

We address these issues with a two-stage training procedure (a code sketch follows the list below):

  1. We train a discrete variational autoencoder (dVAE) to compress each 256 × 256 RGB image into a 32 × 32 grid of image tokens, each of which can take 8192 possible values. This reduces the context size of the transformer by a factor of 192 without a large degradation in visual quality (see Figure 1).
  2. We concatenate up to 256 BPE-encoded text tokens with the 32 × 32 = 1024 image tokens and train an autoregressive transformer to model the joint distribution over text and image tokens.
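
Below is a minimal sketch in PyTorch of this two-stage token pipeline. The `dvae_encoder` and `bpe_tokenizer` callables, the text vocabulary size, and the padding scheme are illustrative assumptions rather than the paper's released code; only the shapes (an 8192-entry image codebook, a 32 × 32 = 1024 image-token grid, up to 256 text tokens) follow the paper.

```python
import torch

# Hypothetical components standing in for the paper's models:
# - dvae_encoder: maps a 256x256 RGB image to a 32x32 grid of token ids (vocab 8192)
# - bpe_tokenizer: maps a caption to at most 256 BPE token ids
TEXT_VOCAB = 16384          # assumed text vocabulary size (illustrative)
IMAGE_VOCAB = 8192          # dVAE codebook size (from the paper)
MAX_TEXT_TOKENS = 256
IMAGE_TOKENS = 32 * 32      # 1024 image tokens per image

def build_token_stream(image, caption, dvae_encoder, bpe_tokenizer, pad_id=0):
    """Stage 1 + stage 2 preprocessing: one joint token sequence per example."""
    # Stage 1: compress the 256x256x3 image into 1024 discrete tokens.
    # Context reduction factor: (256 * 256 * 3) / (32 * 32) = 192.
    with torch.no_grad():
        image_tokens = dvae_encoder(image.unsqueeze(0)).view(-1)   # (1024,)

    # Stage 2: BPE-encode the caption and pad/truncate to 256 tokens.
    text_tokens = bpe_tokenizer(caption)[:MAX_TEXT_TOKENS]
    text_tokens = torch.tensor(
        text_tokens + [pad_id] * (MAX_TEXT_TOKENS - len(text_tokens))
    )

    # Offset the image tokens so text and image ids live in disjoint ranges,
    # then concatenate into a single stream of 256 + 1024 = 1280 tokens.
    stream = torch.cat([text_tokens, image_tokens + TEXT_VOCAB])
    return stream  # fed to the autoregressive transformer with a next-token loss
```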

We model this distribution using the factorization

$$p_{\theta,\psi}(x, y, z) = p_\theta(x \mid y, z)\, p_\psi(y, z),$$

where $x$ denotes an image, $y$ its caption, and $z$ the image tokens. This yields the lower bound

$$\ln p_{\theta,\psi}(x, y) \;\geq\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\Big[\ln p_\theta(x \mid y, z) \;-\; \beta\, D_{\mathrm{KL}}\big(q_\phi(y, z \mid x),\, p_\psi(y, z)\big)\Big],$$

where $q_\phi$ is the distribution over the $32 \times 32$ image tokens produced by the dVAE encoder, $p_\theta$ is the distribution over images produced by the dVAE decoder given the image tokens, and $p_\psi$ is the joint distribution over text and image tokens modeled by the transformer.
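
The sketch below illustrates how a stage-1 (dVAE) training loss following this bound could be computed, assuming hypothetical `encoder` and `decoder` modules and a Gumbel-softmax relaxation of the discrete codes with a uniform prior over the codebook. It is a simplified stand-in, not the paper's implementation (which uses a logit-Laplace reconstruction term).

```python
import math
import torch
import torch.nn.functional as F

def dvae_elbo_loss(images, encoder, decoder, beta=6.6, tau=1.0, vocab=8192):
    """Stage-1 objective: reconstruction term plus a beta-weighted KL between
    the encoder's code distribution and a uniform prior over the 8192 codes.
    (beta = 6.6 is the KL weight reported in the paper; other details here
    are simplified assumptions.)"""
    logits = encoder(images)                        # (B, 8192, 32, 32)
    # Gumbel-softmax relaxation keeps the discrete codes differentiable.
    soft_codes = F.gumbel_softmax(logits, tau=tau, dim=1)
    recon = decoder(soft_codes)                     # (B, 3, 256, 256)

    recon_loss = F.mse_loss(recon, images)          # simplified reconstruction term

    # KL(q(z|x) || uniform) = log(vocab) - H(q), averaged over positions.
    log_q = F.log_softmax(logits, dim=1)
    kl = (log_q.exp() * log_q).sum(dim=1).mean() + math.log(vocab)

    return recon_loss + beta * kl
```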

Fig 4. Illustration of per-resblock gradient scaling for the transformer resblocks. The solid line indicates the sequence of operations in the forward pass, and the dashed line the sequence of operations in the backward pass. We scale the incoming gradient by the gradient scale of the current resblock, and unscale the outgoing gradient before adding it to the sum of gradients from successive resblocks. Activations and gradients along the identity path are stored in 32-bit precision. The "filter" operation sets all Inf and NaN values in the activation gradient to zero. Without this, a non-finite event in the current resblock would cause the gradient scales of all preceding resblocks to drop unnecessarily, resulting in underflow.
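
A minimal sketch in PyTorch of the per-resblock gradient scaling scheme the caption describes. The helper names are illustrative, the 16-bit storage details are omitted, and this is not the paper's implementation.

```python
import torch

class ScaleGrad(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by `scale` in the
    backward pass, optionally filtering non-finite values first."""

    @staticmethod
    def forward(ctx, x, scale, filter_nonfinite):
        ctx.scale = scale
        ctx.filter_nonfinite = filter_nonfinite
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        if ctx.filter_nonfinite:
            # "filter": zero out Inf/NaN so one bad resblock does not force
            # every earlier resblock's scale down and cause underflow.
            grad = torch.where(torch.isfinite(grad), grad, torch.zeros_like(grad))
        return grad * ctx.scale, None, None

def resblock_with_grad_scaling(x, resblock, scale):
    """Wrap one transformer resblock with its own gradient scale.

    During backprop, the op applied last in the forward pass runs first, so:
    the gradient entering the resblock from above is filtered and scaled, and
    the gradient leaving the resblock is unscaled before it is added back to
    the gradient sum along the identity path (kept in 32-bit precision).
    """
    h = ScaleGrad.apply(x, 1.0 / scale, False)  # backward (runs last): unscale outgoing gradient
    h = resblock(h)
    h = ScaleGrad.apply(h, scale, True)         # backward (runs first): filter + scale incoming gradient
    return x + h                                # identity path
```

In practice each resblock would keep its own scale and raise or lower it depending on whether non-finite gradients occur, similar to standard dynamic loss scaling.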

06 What are the experimental results and comparative effects?

Fig 2. With varying degrees of reliability, our model appears to be able to combine different concepts in a reasonable way, create anthropomorphic versions of animals, render text, and perform certain types of image-to-image translation.

Fig 3. Comparison of samples from our model with those from prior approaches on captions from MS-COCO. Each sample from our model is the best of 512, as ranked by the contrastive model. We do not use any manual cherry-picking when selecting either the captions or the samples from any of the models.

Fig 7. Human evaluation of our model (evaluated zero-shot without temperature reduction) versus prior work (DF-GAN) on captions from MS-COCO. In a best-of-five vote, samples from our model were chosen as the most realistic 90.0% of the time, and as the image best matching a shared caption 93.3% of the time.

Fig 8. Zero-shot samples of our model on the CUB dataset.

Fig 9. Quantitative results on MS-COCO and CUB. The solid lines represent FID computed on the original validation set, and the dashed lines represent FID computed on the validation set with overlapping images removed (see Section 3.2). For MS-COCO, we evaluate all models on a subset of 30,000 captions sampled from the validation set. For CUB, we evaluate all models on all of the unique captions in the test set.

07 What does the ablation study tell us?

Tab 1. We show the relationship between model size and the minimum compression rank for the gradients (up to a multiple of 128) needed to avoid a gap in the training loss during the first 10% of training. These results show that, in our setting, we can achieve a compression rate of about 85%, independent of model size.
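
The compression rank in Tab 1 refers to low-rank gradient compression (the paper uses PowerSGD for this). Below is a minimal, illustrative sketch of approximating a gradient matrix with rank-r factors and computing the resulting compression rate; the function names and shapes are assumptions, not the paper's code, and error feedback is omitted.

```python
import torch

def low_rank_compress(grad, rank):
    """Approximate a gradient matrix with rank-`rank` factors P (m x r) and
    Q (n x r), in the spirit of PowerSGD-style gradient compression."""
    m, n = grad.shape
    q = torch.randn(n, rank)
    p = grad @ q                       # (m, r)
    p, _ = torch.linalg.qr(p)          # orthonormalize the column space
    q = grad.t() @ p                   # (n, r)
    return p, q                        # only p and q need to be communicated

def compression_rate(grad_shape, rank):
    """Fraction of communication saved relative to sending the full gradient."""
    m, n = grad_shape
    return 1.0 - rank * (m + n) / (m * n)

# Example: a 4096 x 4096 parameter matrix compressed with rank 512
p, q = low_rank_compress(torch.randn(4096, 4096), rank=512)
approx = p @ q.t()                     # decompressed rank-512 gradient
print(f"compression rate: {compression_rate((4096, 4096), 512):.1%}")  # ~75%
```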

Fig 6. The effect of increasing the number of candidate images reranked by the contrastive model, for captions from MS-COCO.
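
A minimal sketch of the reranking procedure Fig 6 refers to, assuming hypothetical `generate` and `score` callables standing in for the generative model and the pretrained contrastive reranker: draw N candidates for a caption and keep the one the contrastive model scores highest.

```python
import torch

def rerank_samples(caption, generate, score, num_candidates=512):
    """Sample `num_candidates` images for `caption` and keep the best one
    according to a pretrained contrastive model's image-text score.

    `generate(caption, n)` and `score(images, caption)` are hypothetical
    stand-ins, not functions from the paper's code."""
    images = generate(caption, num_candidates)        # (N, 3, 256, 256)
    with torch.no_grad():
        scores = score(images, caption)               # (N,) image-text match scores
    best = torch.argmax(scores)
    return images[best]
```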

08 Conclusion

We investigate a simple approach to text-to-image generation based on an autoregressive transformer when it is executed at scale. We find that scale can lead to improved generalization, both relative to the zero-shot performance of prior domain-specific approaches and in terms of the range of capabilities that emerge from a single generative model. Our results suggest that improving generalization as a function of scale may be a useful driver of progress on this task.

Origin blog.csdn.net/NGUever15/article/details/131430402