OpenAI's DALL-E 2: Theory and Code Reproduction


Note: if you find this blog useful, don't forget to like and bookmark it. I post artificial-intelligence and big-data content every week, most of it original: Python, Java, Scala and SQL code; CV, NLP and recommendation systems; Spark, Flink, Kafka, HBase, Hive, Flume and more, all solid material, plus walkthroughs of papers from the top conferences. Let's make progress together.
Today I would like to share the theory and a code reproduction of OpenAI's DALL-E 2.
Paper: https://cdn.openai.com/papers/dall-e-2.pdf
Code: https://github.com/lucidrains/DALLE2-pytorch



Foreword

What I want to share with you today is DALL-E 2, OpenAI's 2022 masterpiece: how a single input sentence is turned into a very interesting picture.
Let's first take a look at the final results shown in the paper.

[Figure: sample images generated by DALL-E 2, from the paper]

Isn't it interesting? Let's see how the OpenAI team did it.


1. The overall framework of the DALL-E 2 model

[Figure: overall architecture of DALL-E 2]
The overall structure of the paper is actually not very complicated. It mainly combines two technology stacks:
1. CLIP, which is based on contrastive learning
2. The diffusion generative model (Diffusion Model)
Both have already been explained in my earlier blog posts; you may want to read those first, otherwise this post might feel a bit abstract.
https://blog.csdn.net/weixin_53280379/article/details/125585445?spm=1001.2014.3001.5502
https://blog.csdn.net/weixin_53280379/article/details/126250598?spm=1001.2014.3001

2. The pre-trained model CLIP

[Figure: the CLIP part (upper half) of the DALL-E 2 architecture]
The upper part of the figure uses the pre-trained CLIP model to obtain two feature tensors:

  1. The first is text_embed, a (2, 512) tensor, where 2 is the batch_size and 512 is the feature dimension. text_embed can be regarded as a summary of the whole input sentence.
  2. The second is text_encodings, a (2, 256, 512) tensor containing a feature for each token of the input sentence, where 2 is the batch_size, 256 is the maximum sentence length (shorter sentences are padded with zeros), and 512 is the feature dimension.

Corresponding to the original text:
[Figure: the corresponding excerpt from the paper]

So the role of the pre-trained CLIP model here is: given an input sentence, produce two text feature tensors that already live in a space aligned with images, and feed these two tensors to the actual downstream task.

1. Main code

# text_embed: (batch, 512) sentence-level summary; text_encodings: (batch, 256, 512) per-token features
text_embed, text_encodings = self.clip.embed_text(text)
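As a quick sanity check of the shapes described above, here is a minimal, self-contained sketch with dummy tensors (the real values of course come from CLIP; the sizes 2, 256 and 512 simply follow the description above):

import torch

batch_size, seq_len, dim = 2, 256, 512

# stand-ins for the real CLIP outputs, used only to illustrate the shapes
text_embed = torch.randn(batch_size, dim)               # sentence-level summary vector
text_encodings = torch.randn(batch_size, seq_len, dim)  # per-token features, zero-padded to length 256

print(text_embed.shape)      # torch.Size([2, 512])
print(text_encodings.shape)  # torch.Size([2, 256, 512])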

3. Prior model

With the two text feature tensors obtained from CLIP, the prior processes these features further using a diffusion model.
[Figure: the prior model]
This step mainly uses a diffusion model, but one that differs slightly from the traditional version. As the paper explains, a traditional diffusion model uses a network such as a U-Net to learn the noise and then iterates step by step using that predicted noise, whereas the DALL-E 2 prior does not learn the noise: it directly learns to predict x0, skipping that intermediate computation.
[Figures: the relevant excerpts from the paper on predicting x0 directly]
The network takes the two CLIP text feature tensors, a randomly initialized noise vector and a time-step embedding, and learns to predict this x0. With x0 in hand, exactly as in a standard diffusion model, the distribution of x_{t-1} can be derived from x_t; by resampling from this normal distribution of x_{t-1} we can push the process backwards step by step until the final feature vector is obtained.
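To make the difference concrete, here is a minimal, self-contained sketch in plain PyTorch (the tensors and alpha_bar_t are placeholders, not the repo's code) contrasting the usual noise-prediction parameterization with the direct x0 prediction used by the prior:

import torch

dim = 512
x_t = torch.randn(2, dim)            # noisy image embedding at step t
alpha_bar_t = torch.tensor(0.5)      # cumulative noise-schedule value at step t (placeholder)

# (a) classic DDPM parameterization: the network predicts the noise eps,
#     and x0 is recovered from x_t and eps
eps_pred = torch.randn(2, dim)       # stand-in for eps_theta(x_t, t, cond)
x0_from_eps = (x_t - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)

# (b) DALL-E 2 prior: the transformer outputs x0 directly
x0_direct = torch.randn(2, dim)      # stand-in for net(x_t, t, text_cond)

# either way, x0 is then plugged into q_posterior(x0, x_t, t) to obtain the
# distribution of x_{t-1}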
How is this x0 obtained? Mainly through a Transformer network:
[Figure: the prior's Transformer network]

1. Main code

Here x is the randomly initialized noise, t is the time-step embedding, and text_cond contains the CLIP text features. pred is the predicted x0 described in the previous step.

pred = self.net.forward_with_cond_scale(x, t, cond_scale = cond_scale, **text_cond)
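A note on the cond_scale argument: it controls classifier-free guidance. Conceptually (this is a sketch of the idea, not the repo's exact code; net is a placeholder callable), the conditional and unconditional predictions are blended like this:

def guided_prediction(net, x, t, text_cond, cond_scale = 1.):
    # classifier-free guidance: push the prediction away from the unconditional
    # one and towards the text-conditioned one by a factor of cond_scale
    cond_pred = net(x, t, text_cond = text_cond, cond_drop_prob = 0.)
    if cond_scale == 1.:
        return cond_pred
    uncond_pred = net(x, t, text_cond = text_cond, cond_drop_prob = 1.)  # text condition dropped
    return uncond_pred + (cond_pred - uncond_pred) * cond_scale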

With the x0 feature, the distribution of x_{t-1} is derived according to the diffusion-model formula; since this distribution is a normal distribution, it is fully described by a mean and a variance. The formula is exactly the standard diffusion-model posterior.

def q_posterior(self, x_start, x_t, t):
    # posterior mean: a weighted combination of the predicted x0 (x_start) and the current x_t
    posterior_mean = (
        extract(self.posterior_mean_coef1, t, x_t.shape) * x_start +
        extract(self.posterior_mean_coef2, t, x_t.shape) * x_t
    )
    posterior_variance = extract(self.posterior_variance, t, x_t.shape)
    posterior_log_variance_clipped = extract(self.posterior_log_variance_clipped, t, x_t.shape)
    return posterior_mean, posterior_variance, posterior_log_variance_clipped

model_mean, posterior_variance, posterior_log_variance = self.noise_scheduler.q_posterior(x_start = x_recon, x_t = x, t = t)
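For reference, the closed-form posterior that these coefficients implement is the standard DDPM one (Ho et al., 2020); assuming the usual convention \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_s \alpha_s, it reads:

q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \tilde{\mu}_t(x_t, x_0),\; \tilde{\beta}_t I\right)

\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t, \qquad \tilde{\beta}_t = \frac{(1-\bar{\alpha}_{t-1})\,\beta_t}{1-\bar{\alpha}_t}

Here the two fractions multiplying x_0 and x_t correspond to posterior_mean_coef1 and posterior_mean_coef2, and \tilde{\beta}_t corresponds to posterior_variance.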

Next comes the resampling operation that produces image_embed, the x_{t-1} feature:

def p_sample(self, x, t, text_cond = None, clip_denoised = True, cond_scale = 1.):
    b, *_, device = *x.shape, x.device
    model_mean, _, model_log_variance = self.p_mean_variance(x = x, t = t, text_cond = text_cond, clip_denoised = clip_denoised, cond_scale = cond_scale)
    noise = torch.randn_like(x)
    # no noise when t == 0
    nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
    # resampling: (0.5 * model_log_variance).exp() turns the log-variance into a standard deviation
    return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise

image_embed = self.p_sample(image_embed, times, text_cond = text_cond, cond_scale = cond_scale)
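In other words, the return line is the usual reparameterized sampling step (with the noise switched off at t = 0):

x_{t-1} = \mu_\theta(x_t, t) + \mathbb{1}[t > 0]\cdot \exp\!\left(\tfrac{1}{2}\log\sigma_t^2\right) z, \qquad z \sim \mathcal{N}(0, I)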

Finally, iterating over all time steps yields the final image-embedding output of the prior model:

for i in tqdm(reversed(range(0, self.noise_scheduler.num_timesteps)), desc='sampling loop time step', total=self.noise_scheduler.num_timesteps):
    times = torch.full((batch,), i, device = device, dtype = torch.long)
    image_embed = self.p_sample(image_embed, times, text_cond = text_cond, cond_scale = cond_scale)
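Putting the pieces together: the repository's README shows roughly the following way to build and train the prior (a sketch only; check the repo for the exact, current argument names, which may differ between versions):

import torch
from dalle2_pytorch import CLIP, DiffusionPriorNetwork, DiffusionPrior

# toy CLIP (in practice a pre-trained CLIP is used)
clip = CLIP(
    dim_text = 512, dim_image = 512, dim_latent = 512,
    num_text_tokens = 49408, text_enc_depth = 6, text_seq_len = 256, text_heads = 8,
    visual_enc_depth = 6, visual_image_size = 256, visual_patch_size = 32, visual_heads = 8
)

# the Transformer that predicts x0
prior_network = DiffusionPriorNetwork(dim = 512, depth = 6, dim_head = 64, heads = 8)

diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 100,
    cond_drop_prob = 0.2
)

# mock data just to exercise the training step
text = torch.randint(0, 49408, (2, 256))
images = torch.randn(2, 3, 256, 256)

loss = diffusion_prior(text, images)
loss.backward()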

4. The Decoder model

[Figure: the decoder]
The Decoder is similar to the prior model in that it also uses a diffusion model, but it chains 2 diffusion models together: with only one, the quality tends to be average; with two, the quality improves but sampling becomes slower.
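To illustrate the idea of chaining two diffusion models (a base generator followed by an upsampler), here is a toy, purely illustrative sketch; base_model and upsampler are placeholders standing in for the two diffusion U-Nets, not the repo's classes:

import torch
import torch.nn.functional as F

image_embed = torch.randn(2, 512)  # image embedding produced by the prior

def base_model(image_embed):
    # placeholder for the first diffusion model: image embedding -> small image
    return torch.randn(image_embed.shape[0], 3, 64, 64)

def upsampler(small_image, image_embed):
    # placeholder for the second diffusion model: refine an upsampled version of the small image
    upsampled = F.interpolate(small_image, scale_factor = 4, mode = 'bilinear', align_corners = False)
    return upsampled + 0.1 * torch.randn_like(upsampled)  # stand-in for the diffusion refinement

small_image = base_model(image_embed)              # (2, 3, 64, 64)
final_image = upsampler(small_image, image_embed)  # (2, 3, 256, 256)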

[Figure: comparison experiments from the paper]
At the end of the paper, various comparison experiments show that the diffusion-based prior performs better than the autoregressive approach.

[Figure: comparison results from the paper]


Summary

Today I shared with you the theory and code reproduction of OpenAI's DALL-E 2. It is a bit difficult and a bit abstract: you mainly need a foundation in contrastive-learning-based CLIP and in the diffusion generative model, and then see how the two are combined. This article only conveys the main idea, and many details are left out; if you have time, read the paper and the code, and if you have any questions, feel free to leave a comment for discussion.
