GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures in Text-to-Image Generation (Paper reading)

Jian Ma, OPPO Research Institute, CN, arXiv, Code, Paper

1. Introduction

Recently, impressive breakthroughs have been made in language-guided image generation, enabling the synthesis of high-quality, diverse images from user instructions. While the results are stunning, an important limitation of current image generation models is their insufficient ability to coherently render text in images, especially for complex glyph structures such as Chinese characters. To address this issue, we introduce GlyphDraw, a general learning framework designed to empower image generation models to produce images embedded with text in any specific language. We first carefully design a construction strategy for the image-text dataset, then build our model on a diffusion-based image generator and modify its network structure so that the model learns to draw language characters with the help of glyph and position information. Furthermore, we preserve the model's open-domain image synthesis ability and prevent catastrophic forgetting by using a parameter-efficient fine-tuning technique. Extensive qualitative and quantitative experiments demonstrate that our method not only accurately generates characters that match the prompts, but also seamlessly blends the generated text into the background.

2. Overall idea

First of all, we need an image dataset containing Chinese text. Given such a dataset, a prompt can be built by using BLIP-2 to generate a caption for each image, so that every sample consists of an image with embedded text plus its caption. To generate images that contain text, the diffusion model must be made aware of the text and of where it should appear. The approach in this paper is as follows: during training, the text region of the image is extracted as a mask via OCR detection, and the original image, the mask, and the text prompt are fed together to the denoiser, so the diffusion model knows the exact location of the text as well as its content. With the prompt as guidance, training the model in this way teaches it to generate the right text at the corresponding location. This may be the first work on the problem, and the generated results still look rather abrupt, much like the way text was pasted onto pictures twenty years ago.
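As a rough sketch of this data pipeline, one training sample could be assembled as below. The captioning, OCR, rasterization and glyph-rendering steps are passed in as callables because none of them is the paper's actual preprocessing code; they are hypothetical placeholders.

```python
def build_sample(image, captioner, ocr_detector, rasterize_quad, render_glyph):
    """Assemble one training sample from an image containing text (illustrative only)."""
    caption = captioner(image)                    # e.g. a BLIP-2 captioning model
    text, quad_box = ocr_detector(image)          # detected characters + quadrilateral box
    mask = rasterize_quad(quad_box, image.size)   # binary text-region mask l_m
    glyph = render_glyph(text)                    # glyph image l_g: characters on white
    return {"image": image, "prompt": caption, "mask": mask, "glyph": glyph, "text": text}
```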

3. Method

Current image synthesis methods still face many challenges when generating fine-grained, complex structures such as human hands and text. The pioneering work Imagen demonstrates that English text can be rendered in images using a frozen, pre-trained general-purpose large language model such as T5-XXL, without introducing specifically designed networks or training strategies. Another recent work proposes utilizing character-aware language models (such as the ByT5 family) to further enhance the visual text rendering capability of image synthesis models. However, as demonstrated in this work, these methods are not sufficient for generating non-Latin characters such as Chinese. This is mainly due to the more complex two-dimensional spatial structure of Chinese characters, which are composed of eight different types of basic strokes, and the large number of commonly used characters (up to thousands). Generating accurate and diverse Chinese characters is therefore more difficult and remains an open research problem. Furthermore, freezing a pre-trained general language model is inflexible when adapting an image synthesis model to render visual text in a user-specified downstream language, while training a language-specific model from scratch is costly and data-intensive. This motivates the design of a general, adaptable algorithm that tackles the challenge of visual text rendering with a lightweight training strategy and dataset.


To address this problem, we propose GlyphDraw, a general framework designed to empower image generation models to produce coherent visual text in images. GlyphDraw uses character glyphs and text positions as auxiliary information to better control the character generation process. Our method achieves impressive results, generating diverse visual text that precisely follows the given instructions. Notably, the generated text adopts the font style best suited to the context and blends seamlessly with the background, while maintaining high-quality generation and avoiding problems such as overfitting and catastrophic forgetting, as shown by the Chinese and English examples in Figure 1. Our main contributions are summarized as follows:

  1. We introduce GlyphDraw, a general and flexible framework for solving the problem of visual character generation for any specific language such as English or Chinese. GlyphDraw provides fine-grained guidance throughout the generation process, enabling high-quality complex characters to seamlessly blend into image environments in a variety of styles.
  2. We develop a parameter-efficient fine-tuning strategy based on pre-trained models to prevent overfitting and catastrophic forgetting, thus effectively maintaining the model's strong performance in open-domain generation and simultaneously achieving accurate visual text generation.
  3. We detail the construction of the training dataset and the evaluation benchmarks, on which GlyphDraw achieves excellent OCR accuracy for Chinese and English character rendering, reaching 74% and 75% respectively, significantly outperforming previous image synthesis methods.

3.1 Related work

Many studies have explored the challenge of incorporating textual content into image synthesis. For example, research on font generation aims to create novel fonts by treating it as a problem of style transfer based on a given input font. Diff-font utilizes a diffusion model to handle font generation tasks. However, these works only focus on generating font glyphs without background, and are inconsistent with our goal of improving text generation in image synthesis. Another related work proposes character-aware diffusion models to improve text generation by incorporating character-level input features. However, character-aware methods perform poorly in generating non-Latin texts due to the complexity of their spatial structure. To the best of our knowledge, our paper is the first to address the difficult problem of non-Latin text (e.g., Chinese) generation in general-purpose image synthesis.

3.2 Preliminaries

In this section, we first briefly review the necessary notation for Stable Diffusion (SD) in order to describe the proposed algorithm more conveniently. We then give an overview of the GlyphDraw framework and explain how the auxiliary information is used. We also introduce the training strategy devised to prevent catastrophic forgetting. Finally, we describe the inference process, which differs slightly from the training phase.

In SD, the input image is transformed into a latent representation by an autoencoder, and the diffusion process is performed in this latent space. A conditional U-Net is used to predict the noise $\epsilon$ at the current step $t$, given the latent $z_t$ and the condition $C$, where the condition is injected through cross-attention modules added to the U-Net:

$$\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right) \cdot V, \quad \text{where}\quad Q= W_{Q}^{(i)} \cdot \varphi_{i}(z_{t}),\; K=W_{K}^{(i)} \cdot C,\; V=W_{V}^{(i)} \cdot C.$$

Here $\varphi_{i}(z_{t})$ denotes the flattened intermediate features of the U-Net, and $W_{Q}^{(i)}, W_{K}^{(i)}, W_{V}^{(i)}$ are learnable projection matrices. In text-to-image generation, the condition $C=\tau_{\theta}(y)$ is obtained by encoding the text $y$ with the pre-trained CLIP text encoder $\tau_{\theta}$.
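For concreteness, a minimal PyTorch sketch of such a cross-attention block is given below. The module names and dimensions are illustrative assumptions, not the actual Stable Diffusion implementation.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Cross-attention: queries from U-Net features, keys/values from the condition C."""

    def __init__(self, query_dim: int, cond_dim: int, inner_dim: int = 512):
        super().__init__()
        self.scale = inner_dim ** -0.5
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)  # W_Q
        self.to_k = nn.Linear(cond_dim, inner_dim, bias=False)   # W_K
        self.to_v = nn.Linear(cond_dim, inner_dim, bias=False)   # W_V
        self.to_out = nn.Linear(inner_dim, query_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (B, N_pixels, query_dim) flattened spatial features phi_i(z_t)
        # cond: (B, N_tokens, cond_dim)  condition C (e.g. CLIP text embeddings)
        q, k, v = self.to_q(x), self.to_k(cond), self.to_v(cond)
        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)
        return self.to_out(attn @ v)
```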

3.3 Model overview


The overall training framework of the proposed GlyphDraw method is shown in Figure 2. The modification focuses on the cross-attention mechanism in Stable Diffusion. The original input latent $z_t$ is replaced by the concatenation of the image latent $z_t$, the text mask $l_m$, and the glyph image $l_g$. Furthermore, through a domain-specific fusion module, the condition $C$ carries mixed glyph and text features. The introduction of the text mask and glyph information enables fine-grained diffusion control throughout training, which is one of the key components for improving performance. For the image latent part, the character mask $l_m$ detected by OCR and a glyph image $l_g$ containing only the visual information of the characters are concatenated with the image latent features $z_t$. The combined latent feature $z'_t$ is then used as input to the U-Net. For the text conditioning part, the pre-trained CLIP model encodes the prompt and the glyph image into embeddings $e_t$ and $e_g$. A fusion module then fuses the text and glyph embeddings into the conditional feature $C$, which serves as the key and value components of the U-Net cross-attention layers. During inference, an MLP-like mask prediction module is employed to estimate the character mask map.
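A minimal sketch of how these inputs could be assembled is shown below. The function and argument names are assumptions for illustration, not the released GlyphDraw code.

```python
import torch

def glyphdraw_inputs(z_t, l_g, l_m, e_t, e_g, conv_in, fusion):
    # z_t: (B, 4, H, W) noisy image latent; l_g, l_m: (B, 1, H, W) glyph image and
    # binary text mask at latent resolution; e_t, e_g: CLIP text and glyph-image
    # embeddings; conv_in, fusion: the adapted input convolution and fusion module.
    z_prime = torch.cat([z_t, l_g, l_m], dim=1)     # z'_t = concat(z_t, l_g, l_m)
    z_tilde = conv_in(z_prime)                      # channel adjustment for the U-Net
    cond = fusion(torch.cat([e_g, e_t], dim=1))     # C = M[concat(e_g, e_t)]
    return z_tilde, cond                            # U-Net query input and condition
```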

3.4 Exploration of auxiliary information

The pixel representation of textual information, especially ideograms like Chinese characters, differs greatly from the representation of natural objects. For example, the Chinese character "天" (sky) is just a two-dimensional structure composed of several strokes, whereas the natural image one associates with it is "a huge blue screen dotted with white clouds". Visual text is a very fine-grained feature, and even small shifts or deformations lead to wrong text rendering and hence unrealistic images. When embedding characters into natural image backgrounds, it is also necessary to precisely control the generation of text pixels without affecting adjacent natural image pixels. Therefore, to render flawless characters on realistic natural images without incongruity, we carefully design two key components in our diffusion-based synthesis model, namely position control and glyph control.


Position control: The distribution of latent features of character pixels is very different from that of natural image pixels. To prevent the model's learning from collapsing, we introduce fine-grained position-region control to decouple the distributions of different regions. Specifically, a binary mask feature map is generated at the resolution of the image latent features and concatenated to them. In the training stage, quadrilateral masks are extracted from OCR detection results. In the inference stage, since no reference image is available, the mask is generated by the mask prediction module during the early diffusion steps, which is discussed further in Section 3.6.
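The quadrilateral boxes returned by an OCR detector can be rasterized into such a binary mask as sketched below; OpenCV and the specific resolutions are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
import cv2

def quads_to_latent_mask(quads, image_size=512, latent_size=64):
    """Rasterize OCR quadrilaterals into the binary text mask l_m (illustrative)."""
    mask = np.zeros((image_size, image_size), dtype=np.uint8)
    for quad in quads:  # each quad: 4x2 array of (x, y) corners in image coordinates
        cv2.fillPoly(mask, [np.asarray(quad, dtype=np.int32)], color=1)
    # downsample to the latent resolution so it can be concatenated with z_t
    mask = cv2.resize(mask, (latent_size, latent_size), interpolation=cv2.INTER_NEAREST)
    return mask.astype(np.float32)[None]  # shape (1, latent_size, latent_size)
```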

Glyph control: Besides the position control described above, another important challenge is fine-grained control over the synthesis of character strokes. Given the complexity (often 1 to 20 two-dimensional strokes per character) and diversity (up to 10,000 commonly used characters) of Chinese characters, learning solely from large-scale image-text data without any explicit prior knowledge is extremely difficult. To generate Chinese characters accurately, we introduce explicit glyph images as additional conditional information in the diffusion process. Specifically, as shown in Figure 2, a pre-rendered glyph image containing only the Chinese characters, centered on a white background (e.g. "北戴河 (Beidai River)"), is injected into both the image latent part and the text embedding part. First, the grayscale glyph image $l_g$ produced by the glyph generator is concatenated with the noisy image latent feature $z_t$ and the binary text mask $l_m$ to form a new image latent feature $z'_t = \mathrm{concat}(z_t, l_g, l_m)$. After a channel adjustment by a convolutional layer, the feature $\tilde z_t = \mathrm{conv_{in}}(z'_t)$ is fed into the U-Net as the query component. For the conditional information, $C = M[\mathrm{concat}(e_g, e_t)]$ is produced by the fusion module $M$ from the glyph embedding $e_g = I_\theta(l_g)$ and the text embedding $e_t = \tau_\theta(y)$, where the glyph embedding is extracted by a frozen CLIP image encoder $I_\theta$ and the text embedding by the text encoder $\tau_\theta$. The basic objective of GlyphDraw is therefore:

$$\mathcal{L}_{GD_b}=\mathbb{E}_{\varepsilon(x_0),\,y,\,l_g,\,l_m,\,\epsilon\sim N(0,1),\,t}\left[\,\|\epsilon-\epsilon_{\theta}(z_t,t,y,l_g,l_m)\|^2_2\,\right]$$
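The glyph image $l_g$ itself can be produced by a simple renderer. Below is a minimal sketch using Pillow; the font file and sizing are arbitrary choices for illustration, since the paper does not specify its glyph generator here.

```python
from PIL import Image, ImageDraw, ImageFont

def render_glyph_image(text, size=512, font_path="NotoSansSC-Regular.otf"):
    """Render the target characters centered on a white background (illustrative)."""
    img = Image.new("L", (size, size), color=255)                # grayscale, white background
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, size=size // max(len(text), 4))
    left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
    x = int((size - (right - left)) / 2 - left)                  # center horizontally
    y = int((size - (bottom - top)) / 2 - top)                   # center vertically
    draw.text((x, y), text, fill=0, font=font)                   # black glyphs
    return img
```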

3.5 Training

During the learning phase, only the network parameters responsible for learning character generation are updated, while the remaining parameters are frozen to preserve the overall capability of the model. To let the U-Net use the position mask and glyph information as additional channels alongside the image latent, the U-Net input module `conv_in` is adapted to accommodate the extra channels and is updated during training. Similarly, the fusion module, which modifies the generation condition $C$ by integrating the glyph information with the text prompt embedding, also needs to be updated. Most importantly, when adapting the mapping from the given text to the image distribution, it is sufficient to update only $W^{(i)}_K$ and $W^{(i)}_V$ in each cross-attention block $i$, since the text features are the only input to the key and value projection matrices. By curating the parameters to be updated, the method effectively maintains the model's generative performance and achieves coherent text generation while updating only 3% of the total parameters, which greatly speeds up convergence.
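A parameter-selection routine in this spirit could look like the sketch below. The attribute names (`conv_in`, `attn2`, `to_k`, `to_v`) follow common Stable Diffusion implementations such as diffusers and are assumptions, not the released GlyphDraw code.

```python
import torch.nn as nn

def select_trainable_params(unet: nn.Module, fusion: nn.Module):
    """Freeze the U-Net, then unfreeze conv_in and the cross-attention K/V projections."""
    for p in unet.parameters():
        p.requires_grad_(False)

    trainable = list(fusion.parameters())           # fusion module M is fully trained
    trainable += list(unet.conv_in.parameters())    # adapted input convolution

    for name, module in unet.named_modules():
        # attn2 denotes the cross-attention blocks in diffusers' UNet2DConditionModel
        if "attn2" in name and name.endswith(("to_k", "to_v")):
            trainable += list(module.parameters())

    for p in trainable:
        p.requires_grad_(True)
    return trainable
```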

To further improve the visual text generation performance of the model, a weighting strategy is applied in the training objective to emphasize the learning of character generation. Specifically, using the location mask $l_m$, the final objective is reweighted as follows:
$$\mathcal{L}_{GD}=\mathbb{E}_{\varepsilon(x_0),\,y,\,l_g,\,l_m,\,\epsilon\sim N(0,1),\,t}\left[\,\|\epsilon-\epsilon_{\theta}(z_t,t,y,l_g,l_m)\|^2_2+\alpha\,\|(\epsilon-\epsilon_{\theta}(z_t,t,y,l_g,l_m))*(1-l_m)\|^2_2\,\right]$$
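A direct translation of this objective into code might look like the sketch below, assuming `l_m` is broadcastable to the noise tensors. Whether the extra term multiplies by $(1-l_m)$ or $l_m$ depends on the mask convention; this follows the formula as written.

```python
import torch.nn.functional as F

def glyphdraw_loss(eps, eps_pred, l_m, alpha=0.5):
    """Reweighted denoising loss (illustrative)."""
    base = F.mse_loss(eps_pred, eps)                  # standard denoising MSE
    masked_residual = (eps - eps_pred) * (1.0 - l_m)  # residual weighted by the mask term
    return base + alpha * masked_residual.pow(2).mean()
```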

3.6 Inference

During inference, the mask $l_m$ cannot be extracted directly by an OCR detector because the original image $x_0$ is not available. We therefore propose a mask prediction module (the red lines and boxes in Figure 2) to estimate coarse masks of arbitrary shape. As shown in Figure 3, we estimate the character mask during the first few diffusion steps ($t = \{T, T-1, \ldots, t_{early}\}$) using a simple pixel-wise MLP network trained with an MSE loss between the estimated mask and the ground-truth mask. After obtaining the predicted mask, we regenerate the image through the full diffusion process ($t = \{T, T-1, \ldots, rT, \ldots, 1\}$) with the DDIM sampling strategy. The first steps ($\{T, T-1, \ldots, rT+1\}$) are sampled with the GlyphDraw model, and the remaining steps ($\{rT, \ldots, 1\}$) with the pre-trained Stable Diffusion model, for which the glyph and position priors are discarded. Here $r \in [0,1]$ is a hyperparameter that trades off text rendering accuracy against open-domain generation ability.
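A minimal sketch of this two-stage sampling loop is given below; `glyphdraw_step` and `sd_step` are placeholder callables performing one DDIM update each, not a real API.

```python
import torch

@torch.no_grad()
def hybrid_ddim_sample(glyphdraw_step, sd_step, z_T, timesteps, r):
    """Two-stage DDIM sampling: GlyphDraw for steps {T..rT+1}, plain SD for {rT..1}."""
    z = z_T
    T = timesteps[0]                     # timesteps are ordered T, T-1, ..., 1
    for t in timesteps:
        if t > r * T:
            z = glyphdraw_step(z, t)     # keep glyph and position priors
        else:
            z = sd_step(z, t)            # drop the priors, fall back to pre-trained SD
    return z
```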

4. Experiment

Built on Stable Diffusion, GlyphDraw consists of a VAE, a U-Net, CLIP encoders and the fusion module, containing a total of 1.9 billion parameters, of which only 1 billion (the fusion module, the conv_in module, and the projection matrices $W^{(i)}_K$, $W^{(i)}_V$) are trainable. The VAE and U-Net are initialized from a Stable Diffusion checkpoint, and the CLIP image and text encoders are loaded from a pre-trained CLIP checkpoint. After the CLIP encoders, the token lengths of the image and text are 257 and 64, respectively. We adopt a Transformer with 6 layers, 8 attention heads and a hidden dimension of 1024 as the fusion module. We set the learning rate to 2e-5 and the weight-scale hyperparameter $\alpha$ in the loss to 0.5. The entire model is trained on 24 A100 GPUs with a batch size of 25 per GPU for a total of 20 epochs.
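For reference, the reported hyperparameters can be collected into a single configuration sketch; the key names below are made up for readability and do not correspond to any released config file.

```python
glyphdraw_config = {
    "trainable_modules": ["fusion", "conv_in", "cross_attention.to_k", "cross_attention.to_v"],
    "fusion_transformer": {"layers": 6, "heads": 8, "hidden_dim": 1024},
    "clip_token_lengths": {"image": 257, "text": 64},
    "learning_rate": 2e-5,
    "loss_alpha": 0.5,          # weight-scale hyperparameter in the reweighted loss
    "num_gpus": 24,             # A100
    "batch_size_per_gpu": 25,
    "epochs": 20,
}
```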
