UniControl: Unifying Conditionally Controllable Image Generation

This article comes from the Machine Heart editorial team.

Researchers from Salesforce AI, Northeastern University, and Stanford University propose an MOE-style Adapter and a Task-aware HyperNet that give UniControl the ability to generate images from multiple types of conditions. UniControl is trained on nine distinct C2I (condition-to-image) tasks and demonstrates strong visual generation quality and zero-shot generalization.


  • Paper address: https://arxiv.org/abs/2305.11147

  • Code address: https://github.com/salesforce/UniControl

  • Project homepage: https://shorturl.at/lmMX6

Introduction: Stable Diffusion demonstrates strong visual generation capabilities, but such models often underperform when images require spatial, structural, or geometric control. Works such as ControlNet [1] and T2I-Adapter [2] achieve controllable image generation for individual modalities, but adapting to a wide range of visual conditions within a single unified model remains an unsolved challenge. UniControl incorporates a variety of controllable condition-to-image (C2I) tasks within a single framework. To make UniControl capable of handling diverse visual conditions, the authors introduce a task-aware HyperNet that modulates the downstream conditional diffusion model, allowing it to adapt to different C2I tasks simultaneously. UniControl is trained on nine distinct C2I tasks and demonstrates strong visual generation quality and zero-shot generalization. The authors have open-sourced the model weights and inference code; the dataset and training code will also be released as soon as possible. Discussion and use are welcome.


Figure 1: UniControl covers multiple pre-training tasks and zero-shot tasks

Motivation: Existing controllable image generation models are each designed for a single modality. However, Taskonomy [3] and related work show that different visual modalities share features and information, so this paper argues that a unified multimodal model has great potential.

Solution: This paper proposes the MOE-style Adapter and the Task-aware HyperNet to realize multimodal conditional generation in UniControl. The authors also build a new dataset, MultiGen-20M, covering 9 major tasks with more than 20 million image-condition-prompt triplets at resolutions of at least 512×512.

Advantages: 1) a more compact model (1.4B parameters, 5.78 GB checkpoint) that covers multiple tasks with fewer parameters; 2) stronger visual generation quality and control accuracy; 3) zero-shot generalization to never-before-seen modalities.

1. Introduction

Generative foundation models are changing the way artificial intelligence interacts with the world in areas such as natural language processing, computer vision, audio processing, and robotic control. In natural language processing, generative foundation models such as InstructGPT and GPT-4 perform well on a wide variety of tasks, and this multi-task capability is one of their most attractive properties; they can also perform zero-shot or few-shot learning on unseen tasks.

However, this multi-task capability is far less prominent among generative models in the vision domain. While textual descriptions provide a flexible way to control the content of generated images, they are often insufficient for pixel-level spatial, structural, or geometric control. Recent popular work such as ControlNet and T2I-Adapter augments the Stable Diffusion model (SDM) to achieve precise control. However, unlike language prompts, which can be handled by a unified module such as CLIP, each ControlNet model can only handle the specific modality it was trained on.

To overcome the limitations of previous work, this paper proposes UniControl, a unified diffusion model that can handle language and a wide range of visual conditions simultaneously. The unified design improves training and inference efficiency and strengthens controllable generation; UniControl also benefits from the inherent connections between different visual conditions, which improve the generation quality under each condition.

UniControl's unified controllable generation relies on two components: the MOE-style Adapter and the Task-aware HyperNet. The MOE-style Adapter has about 70K parameters per task and learns low-level feature maps from the various modalities. The Task-aware HyperNet takes task instructions as natural-language prompts and outputs task embeddings that are injected into the downstream network, modulating the downstream model's parameters so it can adapt to different modal inputs.

The study pre-trained UniControl for multi-task and zero-shot learning on nine distinct tasks in five categories: edges (Canny, HED, Sketch), region maps (Segmentation, Object Bounding Box), skeleton (Human Skeleton), geometry (Depth, Surface Normal), and image editing (Image Outpainting). UniControl was then trained for over 5,000 GPU hours on NVIDIA A100 hardware (new models are still being trained), and it shows zero-shot adaptability to new tasks.

The contributions of this study can be summarized as follows:

  • This research proposes UniControl, a unified model (1.4B parameters, 5.78 GB checkpoint) that handles diverse visual conditions for controllable visual generation.

  • The study collected a new multi-condition visual generation dataset containing more than 20 million image-text-condition triplets, covering nine distinct tasks across five categories.

  • Experiments demonstrate that the unified UniControl model outperforms single-task controllable image generation, thanks to learning the intrinsic relationships between different visual conditions.

  • UniControl adapts to unseen tasks in a zero-shot manner, showing its potential for widespread use in open environments.

2. Model Design


Figure 2: Model structure. To adapt to multiple tasks, the study designs the MOE-style Adapter (about 70K parameters per task) and a Task-aware HyperNet (about 12M parameters) that modulates 7 zero-convolution layers. This structure realizes multi-task functionality in a single model, preserving per-task diversity while sharing the underlying parameters, and it significantly reduces the model size compared with stacking equivalent single-task models (about 1.4B parameters each).

The UniControl model design ensures two properties:

1) Overcoming the misalignment between low-level features from different modalities, which helps UniControl learn the necessary and unique information from every task. For example, 3D information may be ignored when the model takes a segmentation map as the visual condition.

2) Ability to learn meta-knowledge across tasks. This enables the model to understand the shared knowledge between tasks and the differences between them.

To provide these properties, the model introduces two novel modules: MOE-style Adapter and Task-aware HyperNet.

The MOE-style Adapter is a set of convolutional modules, one per modality, inspired by Mixture of Experts (MOE); UniControl uses it to capture the features of various low-level visual conditions. Each adapter block has only about 70K parameters and is extremely computationally efficient. The extracted visual features are then fed into the shared, unified network for processing.
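The following is a minimal PyTorch sketch of this idea, not the repository's implementation: each modality gets its own small convolutional adapter, and the task name routes the condition image to the matching expert. Module names and layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionAdapter(nn.Module):
    """A small per-modality convolutional adapter (illustrative sizes, ~70K params in the paper)."""
    def __init__(self, in_channels: int = 3, hidden: int = 32, out_channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class MOEStyleAdapter(nn.Module):
    """One adapter ('expert') per task; the task name selects which expert to run."""
    def __init__(self, task_names):
        super().__init__()
        self.adapters = nn.ModuleDict({name: ConditionAdapter() for name in task_names})

    def forward(self, condition: torch.Tensor, task_name: str) -> torch.Tensor:
        return self.adapters[task_name](condition)

# Usage: route a Canny-edge condition map through its dedicated adapter.
moe_adapter = MOEStyleAdapter(["canny", "hed", "depth", "segmentation"])
cond = torch.randn(1, 3, 512, 512)        # a dummy condition image
features = moe_adapter(cond, "canny")     # low-level features for the shared backbone
```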

The Task-aware HyperNet modulates ControlNet's zero-convolution modules conditioned on task instructions. The HyperNet first projects a task instruction into a task embedding, which the researchers then inject into the zero-convolution layers of ControlNet. The task embedding matches the size of the zero-convolution kernel matrix; similar to StyleGAN [4], the study multiplies the two directly to modulate the convolution parameters, and the modulated weights serve as the final convolution parameters. The modulated zero-convolution parameters therefore differ for each task, which ensures the model can adapt to each modality, while all underlying weights remain shared.
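A hedged sketch of this modulation, assuming a 1×1 zero-convolution and a simple linear HyperNet head; names and shapes are illustrative, not the paper's exact code.

```python
import torch
import torch.nn as nn

class ModulatedZeroConv(nn.Module):
    """A 1x1 'zero convolution' whose kernel is multiplied by a projection of the task embedding,
    in the spirit of StyleGAN-style weight modulation."""
    def __init__(self, channels: int, task_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)   # zero-initialised, as in ControlNet
        nn.init.zeros_(self.conv.bias)
        # HyperNet head: project the task embedding to the shape of the 1x1 kernel.
        self.to_kernel = nn.Linear(task_dim, channels * channels)

    def forward(self, x: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
        # task_emb: (task_dim,) -- one embedding per task instruction.
        mod = self.to_kernel(task_emb).view_as(self.conv.weight)  # (C, C, 1, 1)
        w = self.conv.weight * mod   # multiplicative modulation of the shared kernel
        return nn.functional.conv2d(x, w, self.conv.bias)

# Usage: the same shared weights, modulated differently per task instruction.
layer = ModulatedZeroConv(channels=64, task_dim=128)
task_emb = torch.randn(128)   # e.g. an embedding of the instruction "canny edge to image"
out = layer(torch.randn(1, 64, 32, 32), task_emb)
```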

3. Model Training

Unlike SDM or ControlNet, which are conditioned on a single language prompt or a single type of visual condition (e.g., Canny edges), UniControl must handle visual conditions from many different tasks along with language prompts. Its input therefore consists of four parts: noise, text prompt, visual condition, and task instruction, where the task instruction follows naturally from the modality of the visual condition.
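For illustration only, the mapping from modality to task instruction might look like the following; the exact instruction strings and the `build_model_input` helper are assumptions, not the official ones.

```python
# Hypothetical modality -> task-instruction mapping (instruction wording is assumed).
TASK_INSTRUCTIONS = {
    "canny":        "canny edge to image",
    "hed":          "hed edge to image",
    "sketch":       "sketch to image",
    "segmentation": "segmentation map to image",
    "bbox":         "bounding box to image",
    "skeleton":     "human skeleton to image",
    "depth":        "depth map to image",
    "normal":       "surface normal map to image",
    "outpainting":  "image outpainting",
}

def build_model_input(noise, text_prompt, condition, modality):
    """Assemble UniControl's four-part input: noise, text prompt,
    visual condition, and the task instruction implied by the modality."""
    return {
        "noise": noise,
        "text_prompt": text_prompt,
        "visual_condition": condition,
        "task_instruction": TASK_INSTRUCTIONS[modality],
    }
```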


With the training pairs generated in this way, the study employs DDPM [5] to train the model.
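A condensed sketch of such a training step, using the standard DDPM noise-prediction objective; `unicontrol` is a stand-in for the actual model, and the schedule values are generic defaults rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

# Standard linear beta schedule, as in DDPM [5].
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, noise, t):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def training_step(unicontrol, latents, text_prompt, condition, task_instruction):
    """One DDPM step: noise the latents at a random timestep and regress
    the model's noise prediction against the true noise (MSE)."""
    t = torch.randint(0, T, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = add_noise(latents, noise, t)
    # UniControl is conditioned on the text prompt, visual condition, and task instruction.
    pred = unicontrol(noisy, t, text_prompt, condition, task_instruction)
    return F.mse_loss(pred, noise)
```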

4. Experimental Results


Figure 6: Visual comparison results on the test set. The test data come from MSCOCO [6] and Laion [7]

Figure 6 shows the comparison with the official ControlNet models reproduced in this study; for more results, please refer to the paper.

5. Zero-shot Task Generalization

The study tests the model's zero-shot capability in two scenarios:

Hybrid task generalization: The study feeds UniControl a blend of two different visual conditions, e.g., a combination of a segmentation map and a human skeleton, and adds the keywords "background" and "foreground" to the text prompt. In addition, the hybrid task instruction is rewritten as a blend of the instructions of the two combined tasks, e.g., "segmentation map and human skeleton to image".

Generalization to new tasks: UniControl needs to generate controllable images for new, unseen visual conditions. The key is to estimate task weights based on the relationship between the unseen task and the pre-trained tasks. These weights can be estimated either by manual assignment or by computing similarity scores between task instructions in the embedding space. The MOE-style Adapters are then linearly assembled with the estimated weights to extract shallow features from the new, unseen visual condition, as sketched below.
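A hedged sketch of the similarity-based option, assuming a text encoder `embed_instruction` and the MOE-style adapter bank `moe_adapter` from the earlier sketch (both are placeholders, not the official API).

```python
import torch
import torch.nn.functional as F

def estimate_task_weights(new_instruction, seen_instructions, embed_instruction):
    """Softmax over cosine similarities between the unseen task instruction
    and the instructions of the pre-trained tasks."""
    new_emb = embed_instruction(new_instruction)                                # (D,)
    seen_embs = torch.stack([embed_instruction(s) for s in seen_instructions])  # (N, D)
    sims = F.cosine_similarity(seen_embs, new_emb.unsqueeze(0), dim=-1)         # (N,)
    return F.softmax(sims, dim=0)

def zero_shot_features(condition, weights, seen_tasks, moe_adapter):
    """Linearly assemble the pre-trained adapters with the estimated weights
    to extract shallow features for the unseen visual condition."""
    feats = [moe_adapter(condition, task) for task in seen_tasks]
    return sum(w * f for w, f in zip(weights, feats))
```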

The visualized results are shown in Figure 7. For more results, please refer to the paper.


Figure 7: Visualization of UniControl's results on zero-shot tasks

6. Summary

Overall, UniControl provides a new foundation model for controllable visual generation through the diversity of its controls. Such a model could open the way to higher levels of autonomy and human control in image generation tasks. The authors look forward to discussing and collaborating with more researchers to further advance this field.

More visual results are available in the paper and on the project page.

[1] Zhang, Lvmin, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." arXiv preprint arXiv:2302.05543 (2023).

[2] Mou, Chong, et al. "T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models." arXiv preprint arXiv:2302.08453 (2023).

[3] Zamir, Amir R., et al. "Taskonomy: Disentangling task transfer learning." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[4] Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.

[5] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.

[6] Lin, Tsung-Yi, et al. "Microsoft coco: Common objects in context." Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer International Publishing, 2014.

[7] Schuhmann, Christoph, et al. "Laion-400m: Open dataset of clip-filtered 400 million image-text pairs." arXiv preprint arXiv:2111.02114 (2021).
