Midjourney's rival is here! Google's StyleDrop, a "customization master" trump card, has set the AI art world ablaze



  Xinzhiyuan Report  

Google's StyleDrop swept the Internet the moment it was released.

Given Van Gogh's Starry Night, the AI becomes Master Van Gogh himself: after developing a top-level grasp of that abstract style, it turns out countless paintings in the same vein.


Switch to a cartoon style, and the objects it draws come out much cuter.


It can even control details precisely and design logos in an original style.


The charm of StyleDrop is that it needs only a single reference image: no matter how intricate the art style, it can be deconstructed and reproduced.

Netizens are already saying it is the kind of AI tool that could put designers out of work.


StyleDrop, the research behind all this buzz, is the latest work from Google's research team.


Paper address: https://arxiv.org/pdf/2306.00983.pdf

Now, with tools like StyleDrop, not only can you paint with more control, but you can also do previously unimaginable fine work, such as drawing a logo.

Even Nvidia scientists called it a "phenomenal" result.


Master of Customization


According to the paper's authors, StyleDrop was inspired by the eyedropper, the color-picking tool.

In the same spirit, StyleDrop lets you quickly and effortlessly "pick up" a style from one or a few reference images and generate new images in that style.


A single sloth, rendered in 18 different styles:


A single panda, in 24 styles:


StyleDrop reproduces a child's watercolor painting with full control, restoring even the wrinkles in the paper.

You have to admit, it is seriously impressive.


StyleDrop can also design English letters in different reference styles:


Here is the same letter in Van Gogh's style.


It also handles line drawings. Line drawing is a highly abstract representation of an image and places strict demands on the coherence of the generated composition, so previous methods have struggled to succeed at it.


The brushstrokes used for the cheese's shading in the original image are carried over to the objects in each generated image.


Creations referencing the Android logo:


The researchers also extended StyleDrop's capabilities: combined with DreamBooth, it can customize not only the style but also the content.

For example, still in Van Gogh's style, it can generate similarly styled paintings of a corgi:


Here is another: the corgi below has the feel of the Sphinx beside the Egyptian pyramids.


How Does It Work?


StyleDrop is built on top of Muse and consists of two key parts:

One is parameter-efficient fine-tuning of a generative vision Transformer; the other is iterative training with feedback.

Afterwards, the researchers synthesized images from the two fine-tuned models.

Muse is a state-of-the-art text-to-image synthesis model based on a masked generative image Transformer. It contains two synthesis modules, one for base image generation (256 × 256) and one for super-resolution (512 × 512 or 1024 × 1024).


Each module consists of a text encoder T, a transformer G, a sampler S, an image encoder E and a decoder D.

T maps a text prompt t ∈ T to a continuous embedding space E. G processes a text embedding e ∈ E to generate logits l ∈ L over visual token sequences. S extracts a sequence of visual tokens v ∈ V from the logits by iterative decoding, which runs several steps of transformer inference conditioned on the text embedding e and the visual tokens decoded in previous steps.

Finally, D maps the discrete token sequence to the pixel space I. In general, given a text prompt t, an image I is synthesized as follows:

[Equation (1): the end-to-end synthesis pipeline composing T, G, S, and D]
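Spelled out step by step from the component roles above, the pipeline reads as follows (a reconstruction from those definitions, not necessarily the paper's exact notation):

```latex
% Reconstruction of the synthesis pipeline from the definitions of T, G, S, D above.
e = T(t), \qquad l = G(e), \qquad v = S(l \mid e), \qquad I = D(v)
```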

Figure 2 shows a simplified architecture of the Muse transformer layer, partially modified to support parameter-efficient fine-tuning (PEFT) with adapters.

The sequence of visual tokens (shown in green), conditioned on the text embedding e, is processed by an L-layer transformer. The learned parameters θ are used to construct the weights for adapter tuning.

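To make the adapter idea concrete, here is a minimal PyTorch-style sketch of a bottleneck adapter attached to a frozen transformer layer, with only the adapter parameters θ left trainable. The module names, layer signature, and bottleneck size are illustrative assumptions, not Muse's actual implementation.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)  # start close to an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class AdaptedLayer(nn.Module):
    """A frozen transformer layer followed by a small trainable adapter (the learned θ)."""
    def __init__(self, frozen_layer: nn.Module, dim: int):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad = False      # base generator weights stay fixed
        self.adapter = Adapter(dim)      # only these parameters are trained

    def forward(self, visual_tokens, text_emb):
        hidden = self.layer(visual_tokens, text_emb)  # tokens conditioned on e
        return self.adapter(hidden)
```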

To train θ, in many cases the researchers may be given only images as style references, with no accompanying text.

The researchers therefore attach text prompts manually. They propose a simple, templated approach: each prompt consists of a description of the content followed by a phrase describing the style.

For example, in Table 1 an object is described as "cat" and "watercolor painting" is appended as the style description.

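As a minimal sketch, prompt construction under this template might look like the following (the template wording and examples are illustrative, not the paper's exact strings):

```python
# Hypothetical content + style prompt template.
STYLE_DESCRIPTOR = "in watercolor painting style"

def build_prompt(content: str, style_descriptor: str = STYLE_DESCRIPTOR) -> str:
    """Compose a prompt from a content phrase followed by a style phrase."""
    return f"{content} {style_descriptor}"

print(build_prompt("a cat"))     # -> "a cat in watercolor painting style"
print(build_prompt("a banana"))  # -> "a banana in watercolor painting style"
```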

Including descriptions of both content and style in the text prompts is crucial, as it helps separate content from style, which is the researchers' main goal.

Figure 3 shows iterative training with feedback.

When trained on a single style reference image (orange box), some images generated by StyleDrop may exhibit content extracted from the style reference image (red box, image with a house similar to the style image in the background).

Other images (blue boxes) do a better job of separating style from content. Iterative training of StyleDrop on good examples (blue boxes) results in a better balance between style and text fidelity (green boxes).


Here the researchers also used two methods:

-CLIP score

The CLIP score measures the alignment between an image and a text. It can therefore assess the quality of generated images via the cosine similarity of their visual and textual CLIP embeddings.

The researchers select the generated images with the highest CLIP scores and call this method iterative training with CLIP feedback (CF); a selection sketch follows this list.

In experiments, the researchers found that using the CLIP score to assess the quality of synthetic images is an effective way to improve recall (i.e., text fidelity) without too much loss in style fidelity.

However, CLIP scores may not fully align with human intent, nor capture subtle stylistic attributes.

-HF

Human Feedback (HF) is a more straightforward way to directly inject user intent into synthetic image quality assessment.

HF has already proven powerful and effective in reinforcement-learning-based fine-tuning of LLMs.

HF can be used to compensate for the inability of CLIP scores to capture subtle stylistic attributes.
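As referenced above, here is a minimal sketch of the CLIP-feedback selection step: score the round-one samples against the prompt with CLIP and keep the top-scoring images for round-two training. The Hugging Face CLIP model used here is an assumed stand-in; the paper does not specify this tooling.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(images: list[Image.Image], prompt: str) -> torch.Tensor:
    """Cosine similarity between each image embedding and the prompt embedding."""
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).squeeze(-1)  # one score per image

def select_for_round_two(images, prompt, k=10):
    """Keep the k highest-scoring synthetic images as the next round's training set."""
    scores = clip_scores(images, prompt)
    top = torch.topk(scores, k=min(k, len(images))).indices.tolist()
    return [images[i] for i in top]
```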

At present, a large body of work has focused on personalizing text-to-image diffusion models so they can synthesize images that combine multiple personalized concepts.

The researchers showed how to combine DreamBooth and StyleDrop in a simple way, allowing both style and content to be personalized.

This is done by sampling from two modified generative distributions, guided respectively by θs for style and θc for content, the adapter parameters trained independently on the style and content reference images.

Unlike existing off-the-shelf approaches, the team's method does not require jointly training learnable parameters on multiple concepts. This gives it greater compositional power, since the adapters are pre-trained separately, each on a single subject or a single style.

The overall sampling process follows the iterative decoding of Equation (1), but the logits are computed differently at each decoding step.

Let t be the text prompt and c the text prompt without the style descriptor; the logits at step k are computed as follows:

[Equation: the step-k logits are a γ-weighted combination of the output of the style adapter θs (with prompt t) and the content adapter θc (with prompt c)]

Here γ balances StyleDrop and DreamBooth: with γ = 0 we recover StyleDrop, and with γ = 1 we recover DreamBooth.

By choosing γ appropriately, we can obtain an image that balances the two.
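A minimal sketch of how that blend might look inside the iterative decoding loop; the transformer call signature, sampler, and step count are assumptions for illustration, not Muse's actual interface:

```python
def blended_decode(G, sample_step, v_init, e_t, e_c, theta_s, theta_c,
                   gamma: float = 0.5, num_steps: int = 18):
    """Iterative decoding that mixes style-adapter and content-adapter logits.

    gamma = 0 -> pure StyleDrop (style adapter theta_s, full prompt t)
    gamma = 1 -> pure DreamBooth (content adapter theta_c, prompt c without the style descriptor)
    """
    v = v_init                                        # initial (fully masked) visual tokens
    for k in range(num_steps):
        logits_style = G(v, e_t, params=theta_s)      # style branch
        logits_content = G(v, e_c, params=theta_c)    # content branch
        logits = (1.0 - gamma) * logits_style + gamma * logits_content
        v = sample_step(logits, v, step=k)            # unmask / resample some tokens
    return v
```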

Experiment Settings

So far, style tuning of text-to-image generative models has not been studied extensively.

Therefore, the researchers proposed a new experimental protocol:

-Data collection

The researchers collected dozens of images in different styles, ranging from watercolor and oil painting, flat illustration, and 3D rendering to sculpture in various materials.

-Model configuration

The researchers tuned the Muse-based StyleDrop using adapters. For all experiments, the adapter weights were updated for 1,000 steps using the Adam optimizer with a learning rate of 0.00003. Unless otherwise stated, StyleDrop denotes the second-round model, trained on more than 10 synthetic images with human feedback.
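A minimal sketch of that fine-tuning setup, assuming a PyTorch-style training loop; the model, data iterator, and loss function are placeholders rather than the paper's actual code:

```python
import torch

LEARNING_RATE = 3e-5   # 0.00003, as reported
NUM_STEPS = 1000       # adapter updates per run

def finetune_adapter(model, adapter_params, data_iter, loss_fn):
    """Update only the adapter weights; the base generator stays frozen."""
    optimizer = torch.optim.Adam(adapter_params, lr=LEARNING_RATE)
    for step in range(NUM_STEPS):
        batch = next(data_iter)          # e.g., (style-image tokens, prompt embedding)
        loss = loss_fn(model, batch)     # e.g., masked-token prediction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return adapter_params
```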

-Evaluation

The report's quantitative evaluation is based on CLIP, which measures style consistency and text alignment. In addition, the researchers conducted a user preference study on the same two aspects.

The figure shows StyleDrop's results on the 18 images of different styles collected by the researchers.

As you can see, StyleDrop is able to capture the nuances of texture, shading, and structure in a variety of styles, enabling greater control over style than before.


For comparison, the researchers also present results for DreamBooth on Imagen, DreamBooth's LoRA implementation on Stable Diffusion, and Textual Inversion.


The detailed results are shown in the table: human scores (top) and CLIP scores (bottom) for image-text alignment (Text) and visual style alignment (Style).


Qualitative comparison of (a) DreamBooth, (b) StyleDrop, and (c) DreamBooth + StyleDrop:


Here the researchers applied the two CLIP-based metrics mentioned above: a text score and a style score.

For the text score, they measure the cosine similarity between the image embedding and the text embedding. For the style score, they measure the cosine similarity between the embeddings of the style reference image and the synthesized image.
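Written out, with E_img and E_txt denoting the CLIP image and text encoders and I_ref the style reference (a restatement of the description above, not the paper's notation):

```latex
\text{Text score}  = \cos\big(E_{\mathrm{img}}(I),\, E_{\mathrm{txt}}(t)\big), \qquad
\text{Style score} = \cos\big(E_{\mathrm{img}}(I),\, E_{\mathrm{img}}(I_{\mathrm{ref}})\big)
```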

The researchers generated a total of 1,520 images for 190 text prompts. Although higher scores would be welcome, these metrics are not perfect.

Iterative training (IT) improves the text score, which is in line with the researchers' goal.

As a trade-off, however, style scores drop relative to the first-round models, because the later rounds are trained on synthetic images whose styles may be skewed by selection bias.

DreamBooth on Imagen is inferior to StyleDrop in style score (0.644 vs. 0.694 for HF).

The researchers noticed that the increase in style score of DreamBooth on Imagen was not significant (0.569 → 0.644), while the increase of StyleDrop on Muse was more obvious (0.556 → 0.694).

The researchers conclude that style fine-tuning is more effective on Muse than on Imagen.

Plus, for fine-grained control, StyleDrop captures subtle stylistic differences, such as color shifts, layers, or sharp corners.


Hot Comments from Netizens


If designers had StyleDrop, their productivity would take off tenfold.


A day in AI is like ten years in the human world; AIGC is advancing at light speed, the kind that blinds you!


Tools simply follow the trend; those who were going to be replaced have already been replaced.


This tool is much better than Midjourney for making logos.


References:

https://styledrop.github.io/
