Source: the Synced (机器之心, "Machine Heart") editorial department.
Researchers from Salesforce AI, Northeastern University, and Stanford University propose an MOE-style Adapter and a Task-aware HyperNet to realize multimodal conditional generation in UniControl. UniControl is trained on nine distinct condition-to-image (C2I) tasks and demonstrates strong visual generation and zero-shot generalization abilities.
Paper address: https://arxiv.org/abs/2305.11147
Code address: https://github.com/salesforce/UniControl
Project homepage: https://shorturl.at/lmMX6
Introduction: Stable Diffusion demonstrates strong visual generation capabilities, but it often underperforms when generating images that require spatial, structural, or geometric control. Works such as ControlNet [1] and T2I-adapter [2] achieve controllable image generation for individual modalities, but adapting to diverse visual conditions within a single unified model remains an open challenge. UniControl incorporates a wide range of controllable condition-to-image (C2I) tasks within a single framework. To make UniControl capable of handling diverse visual conditions, the authors introduce a task-aware HyperNet that modulates the downstream conditional diffusion model, making it adaptable to different C2I tasks simultaneously. UniControl is trained on nine distinct C2I tasks and demonstrates strong visual generation and zero-shot generalization abilities. The authors have open-sourced the model parameters and inference code; the dataset and training code will be released as soon as possible. Feedback and usage are welcome.
Figure 1: The UniControl model consists of multiple pre-training tasks and zero-shot tasks
Motivation: Existing controllable image generation models are each designed for a single modality. However, Taskonomy [3] and related works have shown that different visual modalities share features and information. This paper therefore argues that a unified multimodal model has great potential.
Solution: This paper proposes an MOE-style Adapter and a Task-aware HyperNet to realize multimodal conditional generation in UniControl. The authors also build a new dataset, MultiGen-20M, containing nine major tasks and more than 20 million image-condition-prompt triplets, with image resolution ≥ 512.
Advantages: 1) A more compact model (1.4B parameters, 5.78GB checkpoint) that achieves multiple tasks with fewer parameters. 2) Stronger visual generation ability and control accuracy. 3) Zero-shot generalization to never-before-seen modalities.
1 Introduction
Generative foundation models are changing the way artificial intelligence interacts with the world in areas such as natural language processing, computer vision, audio processing, and robotic control. In natural language processing, generative foundation models like InstructGPT and GPT-4 perform well on a wide variety of tasks, and this multitasking capability is one of their most attractive properties. They can also perform zero-shot or few-shot learning on unseen tasks.
However, this multitasking capability is far less prominent among generative models in the vision domain. While textual descriptions offer a flexible way to control the content of generated images, they are often insufficient for pixel-level spatial, structural, or geometric control. Recent popular work such as ControlNet and T2I-adapter augments the Stable Diffusion Model (SDM) to achieve precise control. However, unlike language prompts, which a unified module such as CLIP can handle, each ControlNet model can only process the specific modality it was trained on.
To overcome the limitations of previous work, this paper proposes UniControl, a unified diffusion model that can simultaneously handle language and various visual conditions. The unified design of UniControl improves training and inference efficiency while enhancing controllable generation. Moreover, UniControl benefits from the inherent connections between different visual conditions, which strengthen the generation quality under each individual condition.
The unified controllable generation ability of UniControl relies on two components: the "MOE-style Adapter" and the "Task-aware HyperNet". The MOE-style Adapter has about 70K parameters per task and can learn low-level feature maps from various modalities. The Task-aware HyperNet takes task instructions as natural-language prompts and outputs task embeddings that are injected into the downstream network, modulating the downstream model's parameters to adapt to different modal inputs.
The study pre-trained UniControl to obtain multi-task and zero-shot learning capabilities across nine different tasks in five categories: edges (Canny, HED, Sketch), region mapping (Segmentation Map, Object Bounding Box), skeleton (Human Skeleton), geometry (Depth, Surface Normal), and image editing (Image Outpainting). UniControl was then trained for over 5,000 GPU hours on NVIDIA A100 hardware (new models are still being trained), and it shows zero-shot adaptability to new tasks.
The contributions of this study can be summarized as follows:
This research proposes UniControl, a unified model (1.4B #params, 5.78GB checkpoint) that can handle various visual conditions, for controllable visual generation.
The study collected a new multi-condition visual generation dataset consisting of more than 20 million image-text-condition triplets, covering nine different tasks across five categories.
The study conducts experiments demonstrating that the unified model UniControl outperforms per-task controlled image generation, because it learns the intrinsic relationships between different visual conditions.
UniControl adapts to unseen tasks in a zero-shot manner, showing its potential for widespread use in open environments.
2. Model Design
Figure 2: Model structure. To adapt to multiple tasks, the study designs an MOE-style Adapter (about 70K parameters per task) and a Task-aware HyperNet (about 12M parameters) that modulates seven zero-convolution layers. This structure realizes multi-task capability in a single model, preserving both task diversity and underlying parameter sharing, and significantly reduces model size compared to stacking equivalent single-task models (about 1.4B parameters each).
The UniControl model design ensures two properties:
1) Overcome the misalignment between low-level features from different modalities. This helps UniControl learn the necessary and unique information from every task. For example, 3D information may be ignored if the model treats a segmentation map as its visual condition.
2) Ability to learn meta-knowledge across tasks. This enables the model to understand the shared knowledge between tasks and the differences between them.
To provide these properties, the model introduces two novel modules: MOE-style Adapter and Task-aware HyperNet.
The MOE-style Adapter is a set of convolutional modules, one per modality, inspired by Mixture of Experts (MOE); UniControl uses them to capture the features of various low-level visual conditions. Each adapter block has about 70K parameters and is extremely computationally efficient. The extracted visual features are then passed to the unified network for processing.
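The per-modality routing described above can be sketched in PyTorch. This is a minimal, hypothetical sketch: the module names, channel sizes, and expert depth are illustrative assumptions, not the released UniControl code.

```python
import torch
import torch.nn as nn

class MOEAdapter(nn.Module):
    """One lightweight conv expert per visual condition (MOE-style).

    Hypothetical sketch: expert architecture and channel sizes are
    illustrative, not the official UniControl implementation.
    """
    def __init__(self, tasks, in_ch=3, out_ch=16):
        super().__init__()
        # One small conv expert (on the order of tens of K params) per task.
        self.experts = nn.ModuleDict({
            t: nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.SiLU(),
                nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            )
            for t in tasks
        })

    def forward(self, cond_image, task):
        # Route the visual condition to the expert for its modality.
        return self.experts[task](cond_image)

tasks = ["canny", "hed", "sketch", "seg", "bbox",
         "skeleton", "depth", "normal", "outpaint"]
adapter = MOEAdapter(tasks)
feat = adapter(torch.randn(1, 3, 64, 64), "depth")
print(feat.shape)  # torch.Size([1, 16, 64, 64])
```

Only the selected expert runs per sample, which is why the adapters stay cheap even with nine tasks.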
The Task-aware HyperNet modulates ControlNet's zero-convolution modules conditioned on the task instruction. The HyperNet first projects a task instruction into a task embedding, which the researchers then inject into ControlNet's zero-convolution layers. The task embedding matches the size of the zero-convolution kernel matrix. Similar to StyleGAN [4], the study directly multiplies the two to modulate the convolution parameters, and the modulated parameters serve as the final convolution parameters. The modulated zero-convolution parameters therefore differ per task, ensuring the model's adaptability to each modality, while all other weights are shared.
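The multiplicative modulation above can be sketched as follows. This is a simplified sketch under stated assumptions: the projection layer, 1x1 kernel, and per-output-channel scaling are illustrative choices, not the exact official design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskModulatedZeroConv(nn.Module):
    """Zero-initialized conv whose weights are scaled by a task embedding.

    Illustrative sketch of the StyleGAN-style modulation described above;
    shapes and the projection head are assumptions, not the official code.
    """
    def __init__(self, channels, task_dim):
        super().__init__()
        # Zero-initialized 1x1 conv, as in ControlNet's zero convolutions.
        self.weight = nn.Parameter(torch.zeros(channels, channels, 1, 1))
        # Project the task instruction embedding to one scale per out-channel.
        self.to_scale = nn.Linear(task_dim, channels)

    def forward(self, x, task_emb):
        scale = self.to_scale(task_emb)            # (channels,)
        w = self.weight * scale.view(-1, 1, 1, 1)  # task-specific kernel
        return F.conv2d(x, w)

layer = TaskModulatedZeroConv(channels=8, task_dim=32)
out = layer(torch.randn(2, 8, 16, 16), torch.randn(32))
print(out.shape)  # torch.Size([2, 8, 16, 16])
```

Because the base kernel is zero-initialized, the layer contributes nothing at the start of training regardless of the task embedding, preserving ControlNet's zero-convolution property while still yielding a distinct effective kernel per task once trained.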
3. Model training
Unlike SDM or ControlNet, which are conditioned on a single language prompt or a single type of visual condition such as Canny edges, UniControl must handle visual conditions from many different tasks as well as language prompts. Its input therefore consists of four parts: noise, text prompt, visual condition, and task instruction, where the task instruction follows naturally from the modality of the visual condition.
With the training pairs thus generated, the study employs DDPM [5] to train the model.
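A single DDPM training step on such a four-part input can be sketched as below. This is a minimal sketch: `model` stands in for the UniControl denoiser and is assumed to accept all four conditioning inputs named in the text; the schedule handling is the standard epsilon-prediction DDPM objective [5], not the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, text_emb, cond_feat, task_emb, alphas_cumprod):
    """One DDPM training step on an (image, prompt, condition, task) tuple.

    Hypothetical sketch: `model(xt, t, text, cond, task)` is an assumed
    signature for the UniControl denoiser.
    """
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,))   # random timestep per sample
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    # Forward diffusion: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # Predict the added noise and regress it with the standard MSE loss.
    pred = model(xt, t, text_emb, cond_feat, task_emb)
    return F.mse_loss(pred, noise)
```

In practice the visual condition would first pass through the MOE-style Adapter, and the task instruction through the HyperNet, before reaching the denoiser.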
4. Experimental results
Figure 6: Test set visual comparison results. The test data comes from MSCOCO [6] and Laion [7]
A comparison against the official ControlNet, reproduced in this study, is shown in Figure 6. See the paper for more results.
5. Zero-shot Tasks Generalization
The model tests zero-shot capabilities in the following two scenarios:
Hybrid task generalization: The study feeds UniControl two different visual conditions at once, a blend of a segmentation map and a human skeleton, and adds the keywords "background" and "foreground" to the text prompt. The mixed-task instruction is written as a combination of the two tasks' instructions, e.g. "segmentation map and human skeleton to image".
Generalization to new tasks: UniControl must generate controllable images from new, unseen visual conditions. The key is to estimate task weights based on the relationship between the unseen task and the seen pre-training tasks. Task weights can be estimated either by manual assignment or by computing similarity scores between task instructions in the embedding space. The MOE-style Adapters are then linearly combined with the estimated task weights to extract shallow features from the unseen visual condition.
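The similarity-based weight estimation above can be sketched as follows. The embeddings here are placeholders (e.g. CLIP text features of the task instructions), and the softmax normalization is an illustrative assumption rather than the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def estimate_task_weights(new_instr_emb, seen_instr_embs):
    """Weights for blending seen-task adapters on an unseen condition.

    Sketch: cosine similarity between the new task's instruction embedding
    and each seen task's instruction embedding, normalized to sum to 1.
    """
    sims = torch.stack([
        F.cosine_similarity(new_instr_emb, e, dim=0) for e in seen_instr_embs
    ])
    return F.softmax(sims, dim=0)  # one weight per seen task, sums to 1

# Hypothetical usage: blend the nine pre-trained adapters' features.
new_emb = torch.randn(16)                       # unseen task instruction
seen_embs = [torch.randn(16) for _ in range(9)]  # nine pre-trained tasks
weights = estimate_task_weights(new_emb, seen_embs)
# feat = sum(w * adapter_i(cond) for w, adapter_i in zip(weights, adapters))
```

With manual assignment, `weights` would instead be set directly, e.g. putting most mass on the one or two most related seen tasks.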
The visualized results are shown in Figure 7. For more results, please refer to the paper.
Figure 7: Visualization results of UniControl on Zero-shot tasks
6. Summary
Overall, the UniControl model provides a new foundational model for controllable vision generation through its control diversity. Such a model could open up the possibility of achieving higher levels of autonomy and human control for image generation tasks. This study looks forward to discussing and cooperating with more researchers to further promote the development of this field.
[1] Zhang, Lvmin, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." arXiv preprint arXiv:2302.05543 (2023).
[2] Mou, Chong, et al. "T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models." arXiv preprint arXiv:2302.08453 (2023).
[3] Zamir, Amir R., et al. "Taskonomy: Disentangling task transfer learning." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[4] Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
[5] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
[6] Lin, Tsung-Yi, et al. "Microsoft coco: Common objects in context." Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer International Publishing, 2014.
[7] Schuhmann, Christoph, et al. "Laion-400m: Open dataset of clip-filtered 400 million image-text pairs." arXiv preprint arXiv:2111.02114 (2021).