Meta's blockbuster new work CM3leon: another breakthrough in multimodal model performance!


Reprinted from: Xinzhiyuan | Editors: Aeneas, So sleepy

【Introduction】 With Peking University alumni among the authors, Meta has released the first single multimodal model of its kind: the 7B model beats diffusion approaches, and the notorious problem of drawing hands is all but solved.

Meta is here again!

Just now, Meta launched CM3leon, a Transformer-based multimodal model that makes decisive breakthroughs in text-to-image generation and image understanding, and can fairly be called the best of its kind.

Moreover, this combination of multiple modalities into a single model is unprecedented in previously disclosed AI systems.


Clearly, this research from Meta defines a new standard for multimodal AI, pointing toward systems that can move freely among understanding, editing, and generating images, video, and text.

Meanwhile, the launch of CM3leon officially marks the first time an autoregressive model has matched the performance of leading generative diffusion models on key benchmarks.


Paper address: https://ai.meta.com/research/publications/scaling-autoregressive-multi-modal-models-pretraining-and-instruction-tuning/

Previously, the three models attracting the most attention in text-to-image generation were Stable Diffusion, DALL-E, and Midjourney, and all of them essentially rely on diffusion models.

But the revolutionary significance of CM3leon lies in its use of a completely different technique: a tokenizer-based autoregressive model.

The results show that this tokenizer-based autoregressive method is not only more effective than diffusion-based approaches, achieving SOTA in text-to-image generation, but also requires five times less training compute than previous Transformer-based methods!
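To make that contrast concrete, here is a minimal, hypothetical sketch of tokenizer-based autoregressive image generation: an image is represented as a sequence of discrete codebook tokens, and a decoder-only Transformer samples them one at a time after the text prompt. Every name and size below is an illustrative stand-in, not CM3leon's actual implementation.

```python
# Toy sketch of tokenizer-based autoregressive image generation (PyTorch).
# Nothing here is CM3leon's real code; sizes are kept tiny for illustration.
import torch
import torch.nn as nn

VOCAB_SIZE = 8192    # size of a discrete image-token codebook (assumption)
IMAGE_TOKENS = 64    # number of image tokens to sample (real grids are larger)

class TinyDecoder(nn.Module):
    """A toy decoder-only Transformer standing in for the real model."""
    def __init__(self, vocab, dim=128, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        n = tokens.size(1)
        # Causal mask: each position may only attend to earlier positions.
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        return self.head(self.blocks(self.embed(tokens), mask=mask))

@torch.no_grad()
def sample_image_tokens(model, prompt_tokens, temperature=1.0):
    """Extend the tokenized text prompt with image tokens, one at a time.
    This sequential sampling loop is the 'autoregressive' part."""
    tokens = prompt_tokens.clone()
    for _ in range(IMAGE_TOKENS):
        logits = model(tokens)[:, -1] / temperature   # logits for next token
        next_tok = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, prompt_tokens.size(1):]          # generated image codes

model = TinyDecoder(VOCAB_SIZE)
prompt = torch.randint(0, VOCAB_SIZE, (1, 16))  # stand-in for tokenized text
codes = sample_image_tokens(model, prompt)
print(codes.shape)  # (1, 64); a real system decodes these with the tokenizer
```

In a real system the sampled codes would be passed to the image tokenizer's decoder to reconstruct pixels; a diffusion model has no such token-by-token loop and instead iteratively denoises the whole image.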


Get ready, a wave of cool effects is coming

Raw performance metrics alone don't tell the whole story.

Where CM3leon really shines is in handling more complex prompts and image-editing tasks.

Accurately rendered images with stunning results

For example, it can accurately render an image from a prompt like "a small cactus in the Sahara wearing a straw hat and neon sunglasses."


Any prompt, edit images as you like

CM3leon also has a unique ability to edit existing images based on free-form text instructions, such as changing the color of the sky, or adding objects at specific locations.

These capabilities go far beyond what models such as DALL-E 2 can achieve.


Unprecedented multimodal single model

CM3leon's versatile architecture allows it to transition freely and smoothly between text, image and composition tasks.

In addition to text-to-image generation, CM3leon can generate captions for images, answer questions about image content, and even create images from text descriptions of bounding boxes and segmentation maps.

This combination of modalities into a single model is unprecedented in previously disclosed AI systems.

Prompt: What is the dog holding? Model answer: stick.

Prompt: Describe the given image in detail. Model answer: In this image, a dog holds a stick in its mouth. There is grass on the ground. The image has trees in the background.


Given a text description plus bounding-box or segmentation information indicating where, say, a pool or a mirror should appear, CM3leon can generate the corresponding image exactly as prompted.


Super-high resolution

A separate super-resolution platform can be integrated with the CM3leon output, resulting in a dramatic increase in resolution and detail.

Enter the prompt "a small circular island in the middle of the lake, with forests around the lake, high contrast":


Solving the problem of AI drawing hands

Even the long-standing problem of AI not being able to draw hands was easily solved by CM3leon.


Autoregressive model beats Diffusion for the first time?

In text-to-image generation, which has exploded in popularity in recent years, Midjourney, DALL-E 2, and Stable Diffusion all use diffusion techniques.

While diffusion produces stunning results, it is computationally intensive and expensive to run, and it often lacks the speed required for real-time applications.

Interestingly, OpenAI explored Transformers for image generation a few years ago with a model called Image GPT, but it eventually dropped the idea in favor of diffusion.

CM3leon takes a completely different approach. As a Transformer-based model, it leverages an attention mechanism to weigh the relevance of input data, whether text or images.

This architectural difference enables CM3leon to achieve faster training speed and better parallelization, thus being more efficient than traditional diffusion-based methods.
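For readers who want the mechanics, below is a minimal sketch of the scaled dot-product attention at the heart of any such Transformer; it is generic machinery, not CM3leon-specific code. A single matrix multiplication scores every pair of positions at once, which is where the training-time parallelism comes from.

```python
# Generic scaled dot-product attention; CM3leon's exact details are not public here.
import math
import torch

def attention(q, k, v, causal=True):
    # q, k, v: (batch, seq, dim). One matmul scores all position pairs at once.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if causal:
        # Generation still decodes left-to-right, so mask future positions.
        n = q.size(1)
        future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    weights = scores.softmax(dim=-1)   # relevance of every token to every other
    return weights @ v

x = torch.randn(1, 8, 16)  # 8 tokens (text or image codes) with 16-dim features
print(attention(x, x, x).shape)  # torch.Size([1, 8, 16])
```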

With only a single TPU, CM3leon was trained efficiently on the image dataset, achieving an FID score of 4.88 on the MS-COCO dataset and surpassing Google's text-to-image model Parti.
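For context, FID (Fréchet Inception Distance, lower is better) compares Gaussian fits of Inception-network features for real and generated images; the standard definition is:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature mean and covariance of the real and generated images, respectively.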

At the same time, CM3leon is more than five times as efficient as comparable Transformer architectures.


CM3leon's success can be attributed to its distinctive architecture and training method.

A key to its powerful performance is the technique of supervised fine-tuning (SFT).

SFT has previously been used to train text generation models like ChatGPT to good effect, but Meta argues that it can also be useful when applied to images.

In fact, instruction tuning improved CM3leon's performance not only in image generation but also in caption writing, enabling it to answer questions about images and to edit images by following text instructions such as "change the color of the sky to bright blue."

CM3leon uses a decoder-only Transformer architecture, similar to established text-based models, but adds the ability to process both text and images.

The training process involves retrieval augmentation, as well as instruction fine-tuning across various image and text generation tasks.

By applying cross-modal supervised fine-tuning techniques, Meta significantly improves the performance of CM3leon in image annotation, visual QA and text editing.

Although CM3leon is only trained on 3 billion text tokens, it matches or even surpasses the results of other models trained on up to 100 billion tokens.

As the first multimodal model tuned in a way similar to text-only language models, CM3leon combines a large-scale retrieval-augmented pretraining stage with a second stage of multi-task supervised fine-tuning (SFT).
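As a rough sketch of the retrieval-augmentation idea (the encoder, similarity measure, and document markup below are invented placeholders, not the paper's actual components): each training example is prefixed with its nearest neighbors from the corpus, so the ordinary next-token objective also teaches the model to exploit retrieved context.

```python
# Hedged sketch of retrieval-augmented pretraining data construction.
# The embedding function, tags, and corpus are toy stand-ins (assumptions).
from typing import List

def embed(doc: str) -> List[float]:
    # Stand-in encoder; a real system would use a trained multimodal encoder.
    return [float(sum(map(ord, doc)) % 97), float(len(doc))]

def cosine(a: List[float], b: List[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Return the k corpus documents most similar to the query."""
    qv = embed(query)
    return sorted(corpus, key=lambda d: cosine(qv, embed(d)), reverse=True)[:k]

def build_training_sequence(example: str, corpus: List[str]) -> str:
    # Retrieved neighbors are concatenated before the target example, so the
    # usual next-token loss also learns to condition on retrieved context.
    neighbors = retrieve(example, corpus)
    return "".join(f"<doc>{d}</doc>" for d in neighbors) + f"<doc>{example}</doc>"

corpus = ["a dog with a stick", "a cat on a mat", "mountains at sunset"]
print(build_training_sequence("a dog running with a stick", corpus))
```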

How does CM3leon perform?

With CM3leon's capabilities, image generation tools can produce more coherent images that follow input prompts more faithfully.

The following is a demonstration of CM3leon's capabilities in various tasks, all completed by a single model:

Text-to-image generation

Text-based image editing (e.g., "change the color of the sky to bright blue") is challenging because it requires the model to understand both textual instructions and visual content.

This is especially true when complex objects are involved, or when the prompt contains multiple constraints that must all be present in the output.

CM3leon can generate coherent images from prompt text with a highly compositional structure.


(1) Small cactus wearing a straw hat and neon sunglasses in the Sahara Desert

(2) Close-up of a human hand

(3) Raccoon main character in the anime preparing to engage in an epic battle wielding a katana

(4) A stop sign in fantasy style with the words "1991"

Text-based image editing

Given an image and a text prompt, CM3leon can edit the image according to the instructions in the text.

Thanks to the generality of CM3leon, this is achieved with the same model used for all other tasks, unlike previous models (such as InstructPix2Pix) that were fine-tuned only for text-based image editing.


At the same time, the CM3leon model can also generate short or long descriptions and answer questions about images based on a series of different prompts.

Here's an example.

- Prompt:

Question: What do these two objects have in common?

Context: Choose the best answer.

Options: (A) Shiny (B) Salty

Answer: Let's think step by step:

- CM3leon output:

Think about each object. Chips are salty. Pretzels are salty. So the answer is (B).


- Prompt:

Describe in detail all objects in a given image.

- CM3leon output:

A street sign sits on a metal pole. The logo is blue with white text on it. There is a red light on the pole. The sky is bright blue. There are cars parked on the street. There are trees on both sides of the street. There are buildings on both sides of the street.


Similarly, CM3leon can easily handle visual question answering like the following.


To probe CM3leon further, Meta empirically evaluated the instruction-tuned model on image captioning and visual question answering, comparing it with previous SOTA benchmarks.

Compared with Flamingo (100B) and OpenFlamingo (40B), CM3leon is trained on dramatically less text data (about 3B tokens).

Yet on MS-COCO captioning and VQA2 question answering, CM3leon matches zero-shot OpenFlamingo, and it even beats Flamingo by nearly 10 percentage points on the VizWiz task.


Structure-guided image editing

Structure-guided image editing aims to understand and interpret provided textual instructions along with structural or layout information.

This enables CM3leon to create visually consistent and contextually appropriate image edits while adhering to the given structural or layout instructions.

Given an image containing only a segmentation map (without text class labels), the model generates a new image; the input shown here is the image from which the segmentation was extracted.


Super-resolution

In addition, there is a common trick in the field of image generation: using a separately trained super-resolution stage to produce higher-resolution images from the original model's output.
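A minimal sketch of what such a two-stage pipeline can look like, with a toy upscaler standing in for the separately trained super-resolution model (none of this is CM3leon's actual stack):

```python
# Toy two-stage pipeline: base generator output -> learned super-resolution.
import torch
import torch.nn.functional as F

class ToyUpscaler(torch.nn.Module):
    """Stand-in SR stage: bicubic upsampling plus a learned residual refinement."""
    def __init__(self):
        super().__init__()
        self.refine = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, lowres):                            # (B, 3, H, W)
        up = F.interpolate(lowres, scale_factor=4, mode="bicubic")
        return up + self.refine(up)                       # add predicted detail

lowres = torch.rand(1, 3, 64, 64)   # pretend output of the base generator
highres = ToyUpscaler()(lowres)
print(highres.shape)                # torch.Size([1, 3, 256, 256])
```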

For this type of text-to-image generation task, CM3leon also performs very well.


(1) A cup of steaming coffee with mountains in the background, resting on the road

(2) At sunset, the beautiful and majestic highway

(3) A circular island in the center of the lake surrounded by forests

And some "fantasy"-style generations.


(1) Sea turtles swimming underwater

(2) Elephants swimming underwater

(3) A flock of sheep

How CM3leon was built


Architecture

In terms of architecture, CM3leon uses a decoder-only Transformer similar to established text-only models.

The difference is that CM3leon can take both text and images as input, and generate both as output.

Training

By adopting the retrieval-augmented training technique proposed in the paper "Retrieval-Augmented Multimodal Language Modeling," Meta greatly improved the efficiency and controllability of the CM3leon model.

At the same time, Meta also fine-tuned CM3leon on a wide variety of image and text generation tasks.


Left: common inputs for various tasks; right: corresponding model outputs.

During training, Meta concatenates model inputs and outputs and trains with the same objective as in the pre-training stage.
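A minimal sketch of that objective, with toy token IDs and a stand-in model (both assumptions): the input and output tokens are concatenated into one sequence, and the loss is ordinary next-token cross-entropy over the whole thing.

```python
# Hedged sketch of the concatenate-and-predict SFT objective.
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, output_ids):
    """Next-token prediction over the concatenated (input, output) sequence,
    i.e. the same objective as in the pretraining stage."""
    seq = torch.cat([input_ids, output_ids], dim=1)   # (B, T)
    logits = model(seq[:, :-1])                       # predict token t+1 from prefix
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # (B*(T-1), vocab)
        seq[:, 1:].reshape(-1),                       # shifted targets
    )

vocab = 100
model = torch.nn.Sequential(          # toy stand-in for the real decoder
    torch.nn.Embedding(vocab, 32),
    torch.nn.Linear(32, vocab),
)
x = torch.randint(0, vocab, (2, 6))   # tokenized instruction + image context
y = torch.randint(0, vocab, (2, 4))   # tokenized target output
print(sft_loss(model, x, y).item())
```

A common variant masks the loss on the input tokens so that only the output span is penalized; the sketch instead keeps the plain pretraining objective over the whole concatenation, matching the description above.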

As the AI industry continues to grow, generative models like CM3leon are becoming increasingly complex.

These models learn the relationship between vision and text by training on millions of example images, but they can also reflect biases present in the training data.

For this reason, Meta trained CM3leon on a licensed dataset.

The results also demonstrate that CM3leon still achieves strong performance even though its data distribution differs substantially from that of previous models.

In this regard, Meta hopes that through the community's joint efforts, more accurate and fairer models can be created.

Paving the way for multimodal language models

Overall, Meta believes that the excellent performance of CM3Leon on various tasks is an important step towards more realistic image generation and understanding.

And such a model can ultimately help enhance creativity and achieve better applications in the metaverse.

About the authors

Lili Yu, Bowen Shi and Ramakanth Pasunuru are co-authors of the paper.

Among them, Lili Yu received a bachelor's degree in physics from Peking University and a PhD in electrical engineering and computer science from MIT.


References:

https://ai.meta.com/blog/generative-ai-text-images-cm3leon/

https://www.maginative.com/article/meta-unveils-cm3leon-a-breakthrough-ai-model-for-advanced-text-to-image-generation-and-image-understanding/

https://techcrunch.com/2023/07/14/meta-generative-transformer-art-model/

 
  
