Li Mu's Paper Reading Series Four: CLIP and Improvement Works (LSeg, GroupViT, ViLD, GLIPv1, GLIPv2, CLIPasso)

1. CLIP


1.1 Introduction

1.1.1 Preface

  CLIP is a paper published by OpenAI in February 2021. Its full name is Contrastive Language-Image Pre-training, a pre-training method based on contrasting text-image pairs. CLIP uses text as the supervisory signal to train a transferable visual model, so that the final model's zero-shot performance is comparable to ResNet50, with very good generalization and many interesting applications.

  Zero-shot here means direct inference: using the features learned from seen pictures to judge the category of unseen pictures, without any fine-tuning on the downstream training set (equivalent to using the model as a feature extractor, without a classification head).
  The authors benchmark on more than 30 different computer vision datasets (covering OCR, action recognition in video, geo-localization, fine-grained object classification and many other tasks), and CLIP is usually comparable to the supervised baselines.
  For example, on ImageNet, without using any ImageNet training image, CLIP's final accuracy is on par with a supervised ResNet-50 (76.2% zero-shot accuracy on ImageNet, which was once considered impossible).

1.1.2 Model structure

Training process :
  As shown in the figure below, CLIP's input is a batch of paired picture-text pairs (for example, the input picture is a dog and the corresponding text also says it is a dog). The texts and pictures pass through the Text Encoder and the Image Encoder respectively to output the corresponding features, and contrastive learning is then performed on these text features and image features.
  If the model input is n picture-text pairs, then the n paired picture-text pairs are positive samples (the blue entries on the diagonal of the output feature matrix in the figure below), and the other $n^2-n$ pairs are all negative samples. The training objective is to maximize the similarity of the n positive pairs while minimizing the similarity of the $n^2-n$ negative pairs (e.g., with a batch of n = 8 pairs, the 8×8 similarity matrix has 8 positives on the diagonal and 56 negatives).

  The Text Encoder can be a commonly used text transformer from NLP; the Image Encoder can be a commonly used CNN, a vision transformer, or another vision model.
  The similarity is the cosine similarity between the text features and the image features.
  For training CLIP, OpenAI collected 400 million text-image pairs from the Internet, which the paper calls WIT (WebImageText). The quality of this dataset is very high and it is well cleaned, and its scale is comparable to JFT-300M, which is one of the reasons CLIP is so powerful (DALL-E was also later trained on WIT).

insert image description here
Classification
  CLIP can directly perform zero-shot image classification, that is, without any training or fine-tuning, which is also the highlight and strength of CLIP. Implementing zero-shot classification with CLIP requires only two simple steps:

  • Construct a description text for each category from the classification labels of the task, e.g. "A photo of {label}", and feed these texts to the Text Encoder to get the corresponding text features. If the number of categories is n, then n text features are obtained;
  • Feed the image to be predicted to the Image Encoder to get the image feature, then compute the scaled cosine similarity with the n text features (consistent with the training process), and select the category whose text has the largest similarity as the classification result. Further, these similarities can be regarded as logits, and after softmax the predicted probability of each category is obtained.

  We no longer need a pre-defined label (category) list; we directly pair the picture with different text sentences to know whether an object of interest is in the picture. That is, the multimodal nature of CLIP (using text as the supervision signal) builds a dynamic classifier for the specific task, so the model is no longer limited to pre-defined categories and is much more general and usable.

  For example, when a picture of a tricycle is given, you only need to add the category "tricycle" to the text part, and the model is likely to directly infer zero-shot that the picture belongs to the tricycle category. Previous models would never predict classes outside ImageNet's 1000 classes, which is the most attractive aspect of CLIP.
  There are two techniques, prompt engineering and prompt ensemble, for turning category words into sentences to further improve the accuracy of the model, which will be discussed later.

1.1.3 Model effect

1.1.3.1 Robustness to natural distribution shifts

  As shown in the figure below, the author also compares the performance of zero-shot CLIP with existing ImageNet models under natural distribution shift to verify its robustness.
insert image description here

  • In the left plot, the horizontal and vertical axes are accuracy on ImageNet and on the distribution-shifted datasets. The black dashed line is the ideally robust model, which is linear and proportional. Ordinary models cannot reach this ideal and their curves fall below the black dashed line. But zero-shot CLIP is clearly more robust than models trained on standard ImageNet.
  • ImageNetV2 is a new test set filtered from ImageNet data to be as close as possible to the original test set. Even so, the performance of ImageNet pre-trained models drops a lot on ImageNetV2 (76.2 → 64.3).
  • In the picture on the right, ImageNet Sketch consists of sketch images, and ImageNet-A contains many adversarial examples.

  CLIP and the ImageNet-supervised ResNet101 both achieve 76.2% on the ImageNet validation set, but on the remaining five datasets the performance of ResNet101 drops very sharply, while CLIP still maintains high accuracy. For example, on ImageNet-A the accuracy of ResNet101 is only 2.7%, while CLIP achieves 77.1%.
  This also shows that the visual features CLIP learns are strongly connected with language: whether it is a natural banana, a banana in an anime, a sketched banana, or an adversarial-example banana, CLIP knows the picture corresponds to the word "banana".

1.1.3.2 StyleCLIP

  As the name suggests, this is a CLIP + StyleGAN work, which can guide image generation through text. For example, in the examples below, entering "Mohawk hairstyle" changes Obama's hairstyle; entering "Without makeup" removes the makeup with one click; entering "Cute cat" makes the cat's eyes widen. CLIP can also understand various abstract makeup styles, such as smoky makeup and vampire makeup.

insert image description here

1.1.3.3 CLIPDraw

Paper: "CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders"

insert image description here

  This is also work that uses the pre-trained CLIP model to guide image generation. Without any training, CLIPDraw can synthesize simple stick-figure drawings from text by performing gradient descent on a set of RGBA Bézier curves (the objective is to minimize the cosine distance between the CLIP encodings of the generated image and of the text prompt).

It usually takes less than a minute to generate a stick figure on a normal GPU. The self in the last picture means a selfie
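  A schematic sketch of the optimization loop described above. For simplicity it optimizes raw pixels instead of Bézier curve parameters (CLIPDraw uses a differentiable vector-graphics rasterizer, which is omitted here), so this only illustrates the CLIP-guided objective, not the CLIPDraw implementation:

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()                       # keep everything in fp32 for stable gradients
for p in model.parameters():
    p.requires_grad_(False)                 # only the image is optimized

# encode the target prompt once
text = clip.tokenize(["a drawing of a cat"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# optimize raw pixels directly instead of Bezier curve parameters
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

# CLIP's normalization constants, applied as a differentiable "preprocessing" step
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(200):
    optimizer.zero_grad()
    img_feat = model.encode_image((image.clamp(0, 1) - mean) / std)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = 1 - (img_feat * text_feat).sum()  # cosine distance between image and prompt
    loss.backward()
    optimizer.step()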

1.1.3.4 zero-shot detection

Paper: "Open-vocabulary Object Detection via Vision and Language Knowledge Distillation" (ICLR 2022)

  CLIP can be applied to object detection to achieve zero-shot detection, that is, detecting categories not included in the training set. For example, one and a half months after CLIP appeared, Google proposed ViLD (see Chapter 3.1 of this article), an open-vocabulary object detector based on CLIP. Its main structure is shown below. Its basic idea is similar to zero-shot classification, except that here the similarity is computed between text features and ROI features.
insert image description here

  In the example below, a traditional object detector would only judge that these objects are toys, i.e. the base classes in blue in the figure. With CLIP, the detector is no longer limited to the base classes (open-vocabulary object detection) and can detect new classes (marked in red in the figure), such as colors and animal categories.
insert image description here
Meta AI's latest work, Detic, can detect 2000 classes, and CLIP is also used behind it.

1.1.3.5 CLIP Video Retrieval

  The johanmodin/clifs repository on GitHub demonstrates video retrieval using CLIP. You can directly find objects appearing in a video by entering text. For example, entering "a truck with odwalla" finds the truck in the video (CLIP turns the sentence into a text feature, treats each frame of the video as a visual feature, compares them with the text feature frame by frame, and picks out the frame with the highest similarity).
insert image description here
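  A minimal sketch of this frame-by-frame retrieval (not the clifs implementation; the video path and the unbatched per-frame encoding are placeholders for illustration):

import cv2
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# encode the text query once
query = clip.tokenize(["a truck with odwalla"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(query)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# encode every frame of the video and keep the normalized features
cap = cv2.VideoCapture("video.mp4")     # placeholder file name
frame_feats = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    with torch.no_grad():
        feat = model.encode_image(preprocess(img).unsqueeze(0).to(device))
        frame_feats.append(feat / feat.norm(dim=-1, keepdim=True))

frame_feats = torch.cat(frame_feats)                  # [num_frames, d]
similarity = (frame_feats @ text_feat.T).squeeze(1)   # cosine similarity per frame
best_frame = similarity.argmax().item()               # index of the best-matching frame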

1.1.4 Introduction

  Existing CV models are basically trained based on manually labeled datasets, and then used to predict a set of pre-defined object categories. This pre-defined set of labels will greatly simplify the problem itself (such as the fixed 1000 classes of ImageNet, the fixed 80 classes of the COCO dataset, etc.). But because of this, this restricted supervisory signal limits the generalization and usability of the model. For example, most models can only predict known image categories. For unseen image categories, additional information is required for recognition. In this way, every time some categories are added, data needs to be collected again to train a new model.

  Moreover, whether supervised or self-supervised (contrastive methods such as MoCo and SimCLR, or masked-image methods such as MAE and BeiT), these models still require supervised fine-tuning when transferred, for example fine-tuning a softmax classifier over a fixed set of categories, so they cannot do zero-shot transfer.

  The authors believe that obtaining supervision directly from natural language is a promising option because it covers a much wider range (as long as an object can be described in language, the visual model has a chance to recognize it). CLIP uses multi-modal contrastive learning, letting natural language guide the model to learn visual concepts, thereby achieving very flexible zero-shot transfer (turning the classification problem into a cross-modal retrieval problem).

  There is little work on image representation learning using natural language supervision, and the performance is often inferior to supervised models, mainly for two reasons:

  1. Early NLP models were not easy to learn from.
    For example, early n-gram models were complex and hard to use for cross-modal training. But with the rise of transformers, self-supervised models with contextual representations such as BERT and GPT became better and better, giving NLP an inexhaustible source of text supervision signals that are easy to use and generalize well, which paved the way for multimodal training.
  2. The dataset or model size was insufficient.
    For example, VirTex and ICMLM only trained on hundreds of thousands of pictures; ConVIRT is very similar to CLIP but was only pre-trained on medical images. In essence, CLIP is not very innovative: it simplifies the ConVIRT method and trains on a much larger text-image pair dataset. It can also be said that, compared with previous contrastive learning, CLIP merely replaces single-modal samples with multi-modal samples.

1.2 Method

1.2.1 Advantages of Natural Language Supervision

Using natural language supervision signals to train vision models has two most important advantages:

  • There is no need to use special label data, and the scalability is stronger.
    For example, ImageNet requires first defining 1,000 classes, downloading pictures for these classes, cleaning the dataset, and then labeling all the pictures, which is a complicated process. CLIP does not require this classic "machine-learning-compatible" annotation format; it only needs to download text-image pairs. And without an n-choose-1 label, the input and output freedom of the model is much greater.

  • What CLIP learns is a multimodal feature that links images with text, which enables flexible zero-shot transfer. With only a single-modal feature, whether MoCo-style or MAE-style, this is hard to do (zero-shot needs the text features).

1.2.2 Pre-training method (training efficiency is crucial)

  Models in the CV field are large and expensive to train. For example, Noisy Student, which long dominated the ImageNet leaderboard, requires roughly 33 TPUv3 core-years of training, and that is only pre-training on the 1000-class ImageNet and only learns visual features.
  Due to the large amount of training data and model computation, training efficiency becomes a crucial factor. The author made a lot of attempts, and finally chose contrastive learning:

  • VirTex-style model: predict the text, corresponding to the blue line (Transformer Language Model) in the figure below.
    • The Image Encoder uses a CNN and the Text Encoder uses a transformer; the two are trained from scratch together, and the task is to predict the caption of each picture (image captioning).
    • The training efficiency of this approach is too low, because there are too many possible textual descriptions of a picture; a picture can be described from many angles.
  • Bag of Words Prediction (orange line): the words no longer need to be predicted in order; only the set of words must be predicted. This relaxed constraint speeds up training by a factor of 3.
  • CLIP: a simplified version of ConVIRT, based on contrastive learning.
    • It only needs to judge whether a picture and a text match, which further simplifies the training task and increases training efficiency by another 4x (green line).
    • The training task is also more reasonable. Because the text-image pairs in the training data are collected from the Internet, they contain a certain amount of noise and do not match perfectly; appropriately relaxing the training objective leads to better convergence.

insert image description here

  OpenAI is a GPT-centric company: the GPT series, DALL-E, Image-GPT and so on are all GPT-based. Only for CLIP, because of efficiency, did they choose contrastive learning for training.

  In the end, a text transformer with 63M parameters was chosen as the Text Encoder, and two different architectures were tried for the Image Encoder. Although CLIP is a multimodal model, it is mainly used to train a transferable visual model.

  • Image Encoder architectures
    • ResNet: ResNet50, ResNet101, RN50x4, RN50x16 and RN50x64 (the last three are obtained by scaling up ResNet50 by 4x, 16x and 64x following the EfficientNet scaling rules)
    • ViT: ViT-B/32, ViT-B/16 and ViT-L/14.
  • All models are trained for 32 epochs with the AdamW optimizer and batch size = 32768.
  • Hyperparameter search is done by training ResNet50 for only one epoch, without further tuning for the larger models.
  • The largest ResNet, RN50x64, took 18 days to train on 592 V100 GPUs, and ViT-L/14 took 12 days on 256 V100 GPUs.
  • ViT-L/14 works best, so the authors fine-tune it for one additional epoch at 336-pixel resolution, denoted ViT-L/14@336px. Unless otherwise specified, the CLIP model used in the comparison experiments below refers to this one.
    insert image description here
  • training details
    • The dataset is very large and there is almost no overfitting, so neither the Image Encoder nor the Text Encoder needs to be initialized from pre-trained weights.
    • Only linear projection layers are used to map into the multimodal embedding space (linear vs. non-linear projection makes little difference).
    • Data augmentation uses only random crops of the images, because the dataset is very large.
    • The temperature hyperparameter τ in the objective function is set as a learnable scalar and optimized automatically during training, instead of being tuned by hand (again because the dataset is so large that training is very expensive); see the sketch below.
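  As a concrete illustration of the learnable temperature, here is a minimal sketch (the 1/0.07 initialization and the variable name logit_scale follow the official CLIP implementation; the rest is illustrative):

import numpy as np
import torch
import torch.nn as nn

# tau is stored as a learnable log-scale scalar, initialized to log(1/0.07),
# and optimized together with the rest of the model
logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

# during the forward pass the scaled cosine similarities are:
#   logits = logit_scale.exp() * image_embeds @ text_embeds.T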

  In addition, there are many training details that make CLIP really trainable. For training very large models, you can refer to the blog post from OpenAI: "How to Train Really Large Models on Many GPUs?" and the corresponding CSDN translation .

1.2.3 Pseudocode

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of input images
# T[n, l] - minibatch of input texts, l is the sequence length

# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter

# extract image features and text features separately
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]

# linearly project both features to the same dimension d_e and L2-normalize them
# multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities: [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n) # labels of the diagonal (positive) elements
loss_i = cross_entropy_loss(logits, labels, axis=0) # image loss
loss_t = cross_entropy_loss(logits, labels, axis=1) # text loss
loss = (loss_i + loss_t)/2 # symmetric objective

  In MoCo, the ground-truth labels are all 0, because the positive sample is always placed first, so the index of the positive sample is always 0; but in CLIP, the positive samples are on the diagonal, i.e. $(I_1, T_1), (I_2, T_2), \ldots$, so the ground-truth labels are np.arange(n).
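  For completeness, here is a PyTorch version of the pseudocode above; the function name and the assumption that the embeddings are already L2-normalized are mine:

import torch
import torch.nn.functional as F

def clip_loss(image_embeds, text_embeds, logit_scale):
    # image_embeds, text_embeds: [n, d], already L2-normalized
    logits = logit_scale * image_embeds @ text_embeds.t()        # [n, n] scaled cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # positives are on the diagonal
    loss_i = F.cross_entropy(logits, labels)                     # image -> text direction (rows)
    loss_t = F.cross_entropy(logits.t(), labels)                 # text -> image direction (columns)
    return (loss_i + loss_t) / 2                                 # symmetric objective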

1.3 Experiment

1.3.1 Zero-shot transfer

  Motivation for studying zero-shot transfer: previous self-supervised or supervised models (MoCo, DINO, etc.) mainly learn a generic feature representation, so supervised fine-tuning is still needed for downstream tasks, and many problems remain: downstream datasets are not always easy to collect, there are distribution shifts, and so on. But if text is used to guide the training of the visual model, the model can transfer zero-shot very well, and no further training or fine-tuning is needed.

How to implement zero-shot classification with CLIP?
  Here is an example based on CLIP (refer to the official notebook). There are 6 categories: "dog", "cat", "bird", "person", "mushroom", "cup". First we create a textual description for each category, then extract the text features:

import os
import clip
import numpy as np
import torch
from PIL import Image

# load the pre-trained CLIP model and its preprocessing pipeline
model, preprocess = clip.load("ViT-B/32")
model.cuda().eval()

# first build a text description for each category
labels = ["dog", "cat", "bird", "person", "mushroom", "cup"]
text_descriptions = [f"A photo of a {label}" for label in labels]
text_tokens = clip.tokenize(text_descriptions).cuda()

# extract the text features
with torch.no_grad():
    text_features = model.encode_text(text_tokens).float()
    text_features /= text_features.norm(dim=-1, keepdim=True)

  Then we read the images to be predicted, feed them to the Image Encoder to extract image features, and compute the cosine similarity with the text features:

# load the images
original_images = []
images = []
texts = []

for label in labels:
    image_file = os.path.join("images", label+".jpg")
    name = os.path.basename(image_file).split('.')[0]

    image = Image.open(image_file).convert("RGB")
    original_images.append(image)
    images.append(preprocess(image))
    texts.append(name)

image_input = torch.tensor(np.stack(images)).cuda()

# extract the image features
with torch.no_grad():
    image_features = model.encode_image(image_input).float()
    image_features /= image_features.norm(dim=-1, keepdim=True)

# compute the cosine similarity (unscaled)
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T

  Further, we can also calculate softmax on the obtained cosine similarity to get the probability value of each predicted category. Note that the similarity should be scaled here:

logit_scale = np.exp(model.logit_scale.data.item())
text_probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)
top_probs, top_labels = text_probs.cpu().topk(5, dim=-1)

  The resulting predicted probabilities are shown below: for all six images, the CLIP model gives the correct classification with near-absolute confidence:
insert image description here

1.3.2 Prompt Engineering and Ensembling

  1. Prompt Engineering

  The author also verified the effectiveness of using prompts in the text description (about 1.3% accuracy improvement). Simply put, the core of prompt learning is to construct a suitable prompt so that the pre-trained model can be applied directly to downstream tasks.

For inference, using only category labels as textual descriptions is not good enough for two reasons:

  1. Words are ambiguous
    If we directly use category labels as text descriptions, then many texts are just one word, lacking specific context, and cannot describe the content of the picture well.

    • For example, in object detection one category is "remote" (remote control). If this single word is fed to the text encoder, the model may well interpret it as "distant" instead.

    • The same word may have different meanings in different datasets. For example, in the Oxford-IIIT Pets dataset, boxer refers to a species of dog, and in other datasets it refers to boxers.

    • So during CLIP pre-training, the text used to describe the content of the picture is a sentence, for example A photo of {label}. The label here can only be a noun, which eliminates ambiguity to a certain extent.

  2. Make inference and pre-training consistent (eliminate distribution gap).

  In addition, this template can also be adjusted according to different data sets to improve the performance of zero-shot.
  For example, on the Oxford-IIIT Pets dataset (whose categories are all animals), the template can be written as "A photo of a {label}, a type of pet."; or for an OCR task, putting double quotation marks around the text or number to be found may tell the model that you are looking for the content inside the quotes.

  2. Prompt ensembling

  The author also tried ensembling multiple templates, i.e., ensembling over multiple zero-shot classifiers, each built from a different prompt template. Because the ensembling is done in embedding space rather than probability space, it adds no computational cost. On most datasets, prompt ensembling improves model performance.

  In the end, the author ensembles 80 templates, each using different wording to describe different situations.
insert image description here
  The abscissa in the figure above represents the compute of the model, and the ordinate represents the average score over multiple datasets. The green curve is the result of using prompt engineering and ensembling as in this paper, and the blue curve is the result of directly using class names without any prompt context.
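  To make the ensembling concrete, here is a minimal sketch in the spirit of the official zero-shot notebook: the text embeddings of several templates are averaged in embedding space, which is why no extra compute is needed at inference time. The templates listed here are just examples, not the 80 used in the paper:

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

templates = [
    "a photo of a {}.",
    "a bad photo of a {}.",
    "a sketch of a {}.",
    "a photo of a small {}.",
]
classnames = ["dog", "cat", "bird"]

with torch.no_grad():
    weights = []
    for name in classnames:
        texts = clip.tokenize([t.format(name) for t in templates]).to(device)
        feats = model.encode_text(texts)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        mean_feat = feats.mean(dim=0)                 # ensemble in embedding space
        weights.append(mean_feat / mean_feat.norm())
    zeroshot_classifier = torch.stack(weights, dim=1) # [d, num_classes]

# at inference: logits = 100.0 * image_features @ zeroshot_classifier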

1.3.3 Comparison of zero-shot classification effects (vs. ResNet-50)

  In order to test the effect of CLIP's zero-shot classification, the author made a comparison chart of the classification effect on 27 data sets. The following figure is the comparison between CLIP and Linear Probe based on ResNet-50.
insert image description here

  • Linear Probe on ResNet-50:
    • Linear Probe is to freeze the pre-trained model and only train the classifier of the last layer, which is equivalent to using the pre-trained model as a feature extractor.
    • ResNet50 is pre-trained on ImageNet in a supervised manner
  • compare results:
    • Green + indicates how much it has improved compared to ResNet-50, and blue - indicates how much it has decreased compared to ResNet-50.

    • Finally, in 27 datasets, CLIP surpassed the supervised trained ResNet-50 on 16 datasets.

    • For common object classification tasks, CLIP does zero-shot transfer very well, for example on cars, food, and CIFAR10. Since the images contain describable objects and the corresponding texts contain those descriptions, they can be matched well;

    • However, CLIP is relatively weak on more complex or abstract tasks, such as satellite image classification or lymph node tumor detection, which require domain-specific knowledge; CLIP has seen little such label information during pre-training.

1.3.4 Comparison of few-shot classification effects

  The author argues that for such particularly difficult tasks, giving no label information at all is somewhat unreasonable. So the paper also compares few-shot performance, i.e., fine-tuning with only a small number of samples. Three models are compared:

  • BiT-M (Big Transfer), trained on ImageNet-21k, which is a strong baseline.
  • A ResNet50 trained with SimCLRv2.
  • A ResNet50 trained with supervision.

insert image description here

  • Abscissa: the number of labeled samples per class in each dataset used to train the Linear Probe classifier. 0 corresponds to zero-shot.
  • The vertical axis is the average classification accuracy over 20 datasets (7 of the 27 datasets have fewer than 16 samples per class and are excluded).
  • With 16 training samples per class, the performance of BiT-M is only on par with zero-shot CLIP.
  • The purple curve (few-shot CLIP) shows that with only 1 or 2 training samples per class the result is worse than zero-shot CLIP; but when the number of samples per class grows to 8 or 16, it exceeds zero-shot CLIP. This shows that for some difficult datasets, having some training samples is still necessary.

  When doing CLIP Linear Probe, you need to throw away the text encoder part, and then add a layer of linear classifier after the image encoder, so the classification method is no longer based on the image feature being the closest to the text feature, but retraining a linear classifier .
  The added layer of linear classifiers is randomly initialized, so 1 labeled sample per class is not enough. This is why the performance will be relatively poor at the beginning, but as the number of training samples increases, the classification performance of the model will gradually improve.
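  As a small illustration of the linear-probe setup just described, here is a sketch assuming the CLIP image features for the train and test splits have already been extracted with the frozen encoder; using scikit-learn's LogisticRegression follows the example in the CLIP repository, and the C value here is just an illustrative hyperparameter:

from sklearn.linear_model import LogisticRegression

def linear_probe(train_features, train_labels, test_features, test_labels, C=0.316):
    # the image encoder stays frozen; only this logistic-regression head is trained
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(train_features, train_labels)
    return clf.score(test_features, test_labels)   # test accuracy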

1.3.5 Linear Probe CLIP comparison

  After comparing zero-shot and few-shot, the next step is to train on the full training sets of the downstream tasks and compare the results. The author chooses Linear Probe CLIP here.
  The reason for choosing linear probing instead of fine-tuning is that with a linear probe only the final FC layer is trained, so the learning space is small and less flexible than fine-tuning. If the pre-trained model is not well trained, training on a downstream task for a long time will not yield a particularly good result either, so a linear probe reflects the quality of the pre-trained model more accurately. Another reason is that linear probing requires almost no hyperparameter tuning (whereas fine-tuning has too many tunable parameters for the different datasets).
insert image description here

  • The abscissa indicates the amount of computation required for a forward pass on one image.
  • The vertical axis represents the average accuracy on multiple datasets.
  • The compared models include supervised EfficientNets, EfficientNet trained with pseudo-labels, weakly supervised models trained on Instagram data, self-supervised contrastive learning models, and some classic supervised baselines.
  • The closer the result is to the upper left corner, the better the performance of the model.
  • The left figure is the average result on 12 datasets, which are similar to ImageNet. Therefore, it is predictable that the supervised pre-trained model on ImageNet is better than CLIP.
  • The right figure is the average result on 27 datasets.

  As the figure shows, on the 12 datasets the ViT version of CLIP performs best, and the ResNet versions beat most of the other models; on the 27 datasets, CLIP beats all the other models. These results demonstrate the power of the CLIP model.

1.3.6 Comparison with Noisy Student EfficientNet-L2

  The authors also visualize the difference in performance between the CLIP model and EfficientNet trained with pseudo-labels (best on ImageNet) on 27 datasets.
  As the figure shows, CLIP outperforms EfficientNet on 21 of the datasets, and on many of them by a large margin. On the remaining 6 datasets where it loses to EfficientNet, CLIP is only slightly lower and the gap is small.
insert image description here

1.4 Differences with humans (omitted)

1.5 Data overlap analysis

  Since CLIP achieves such good zero-shot performance, one may suspect that CLIP's training set contains some samples of the test datasets, i.e., data leakage. To address this, the paper uses a duplicate detector to check the overlap with the evaluation datasets and finds that the median overlap rate is 2.2% and the average is 3.2%. For most datasets the performance before and after de-duplication changes little; the detailed results are shown below:
insert image description here

  • Left: although several datasets show differences of up to ±20% in zero-shot accuracy between the detected overlapping examples and the clean examples, only 5 of the 35 datasets have a 99.5% Clopper-Pearson confidence interval that excludes an accuracy difference of 0%. Two of the datasets actually perform worse on the overlapping data.
  • Right: since the percentage of detected overlapping examples is almost always in the single digits, the overall gain in test accuracy due to overlap is much smaller, with Birdsnap seeing the largest gain of only 0.6%. Likewise, only 6 datasets show statistically significant improvements in accuracy under a one-sided binomial test.

It can be concluded from this that such data overlap will not bring about a significant increase in accuracy.

1.6 Limitations

  1. The performance needs to be improved.
    On many datasets, CLIP is on average on par with ResNet-50 (76.2% ImageNet accuracy), but it is far from the best models (ViT-H/14, MAE, etc. reach around 90%), a gap of more than ten points. It is estimated that roughly 1000x more compute would be needed to close this gap, which is impossible with current hardware. So simply scaling up the data is not enough; further improvements in data and compute efficiency are needed.

  2. It is difficult to understand abstract/complex concepts.
    CLIP does not perform well in zero-shot on some more abstract or complex tasks. For example, count how many objects there are in the picture, or distinguish whether the current frame is abnormal or non-abnormal in the surveillance video, because CLIP cannot understand what is abnormal and safe. So in many cases, CLIP will not work.

  3. Poor out-of-distribution generalization
    CLIP is fairly robust to distribution shifts of natural images. But if the inference data is very different from the training data (out of distribution), CLIP generalizes poorly. For example, CLIP's accuracy on MNIST is only 88%, while even a simple baseline classifier can reach 99%; CLIP is still quite fragile here. (The authors found that among the 400 million samples there are almost none that resemble MNIST.)

  4. Although CLIP can do zero-shot classification tasks, it still chooses from the given categories and cannot directly generate image captions. The author said that the contrastive learning objective function and the generative objective function can be combined in the future, so that the model has both the efficiency of contrastive learning and the flexibility of generative learning.

  5. The use of data is not efficient enough.
    In the training process of this article, 400 million samples ran for 32 epochs, which is equivalent to passing 12.8 billion pictures. Consider using data augmentation, self-supervision, pseudo-labeling, etc. to reduce data usage.

  6. Introducing bias.
    This article has been using the ImageNet test set as a guide when developing CLIP, and has used the 27 data sets for testing many times, so it is necessary to adjust a lot of parameters to determine the network structure and hyperparameters. This is not a true zero-shot, and introduces bias invisibly.

  7. Social bias
    OpenAI's self-built data set has not been cleaned, because it is crawled from the Internet without filtering and reviewing, and the trained CLIP model is likely to contain some social biases, such as gender and skin color.

  8. It is necessary to improve the performance of few-shot.
    Many complex tasks or concepts cannot be described precisely with text, and then a few training samples must be provided to the model. But when a small number of training samples is given to CLIP, the result can be worse than direct zero-shot (see the few-shot comparison in section 1.3.4). Follow-up work should consider how to improve few-shot performance.

1.7 demo

The following is a piece of code copied from the official website of CLIP, using red envelope pictures (not in the 1000 classes of ImageNet) to conduct a simple test:
insert image description here

import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device) # load the base model

image = preprocess(Image.open("red_envelogp.png")).unsqueeze(0).to(device)
text = clip.tokenize(["plane", "dog", "a cat","bird"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text) # compute image-text similarity
    probs = logits_per_image.softmax(dim=-1).cpu().numpy() # the text most similar to the image gives its predicted class

print("Label probs:", probs) 

Label probs: [[0.3131486  0.3174914  0.08763372 0.28172636]]

  It can be seen that when all the given text categories are irrelevant to the image, the model is very confused.
  After adding the "red_envelogp" category below, the model makes the correct prediction.

text = clip.tokenize(["plane", "dog", "a cat","bird","red_envelogp"]).to(device)

Label probs: [[0.00437422 0.00443489 0.00122411 0.0039353  0.98603153]]

  Next, let's experiment: how does the model know this picture is a red envelope? Is it the color (red) or the envelope (envelogp)? Add these two classes below (for some reason, the result differs from what the teacher demonstrated):

text = clip.tokenize(["plane", "red", "envelogp","bird","red_envelogp"]).to(device)

Label probs: [[0.00259908 0.39436376 0.01481757 0.00233828 0.5858813 ]]

  Finally, see if CLIP can learn the relevant semantics. Red envelopes are usually unique to China, and money is stuffed inside during the New Year. Try changing it to these words:

text = clip.tokenize(["money", "new year","red","envelogp","china"]).to(device)

Label probs: [[0.01408994 0.015231   0.05491581 0.00206337 0.91369987]]

  It can be seen that the model did not choose red or envelopes, but chose the concept of China, which is closely integrated with red envelopes. It can be seen that the related semantics of the model are still well learned. But from the perspective of classification, should it be classified as an envelope?

2. CLIP Semantic Segmentation


  CLIP was published by OpenAI in February 2021, and in the year or so since then it has been applied in many areas:

  • Semantic Segmentation: Lseg, GroupViT
  • Object detection: ViLD, GLIP v1/v2
  • Video understanding: VideoCLIP, CLIP4clip, ActionCLIP
  • Image generation: VQGAN-CLIP, CLIPasso, CLIP Draw
  • Multimodal: VL Downstream
  • Others: depthCLIP, pointCLIP, audioCLIP (audio)

2.1 LSeg

Paper: "Language-driven Semantic Segmentation" , official website code

  Semantic segmentation can be regarded as pixel-level classification, so new techniques and ideas from classification can usually be applied directly. LSeg is a paper published at ICLR 2022 (January 10, 2022). Similar to how CLIP realizes zero-shot classification, LSeg realizes zero-shot semantic segmentation by feeding the category prompts as text and then computing similarities.

  The significance of LSeg is that a text branch is added to the traditional supervised segmentation pipeline, and the text and image are combined through matrix multiplication. During training the model learns language-aware visual features, so that at inference time a text prompt can produce whatever segmentation you want.

2.1.1 Model effect

  The figure below shows the results of LSeg: given a picture and the categories to be detected as a text prompt, the corresponding semantic segmentation is produced.
LSeg's zero-shot segmentation results

  • In the first picture, given the labels dog, tree, others, the dog and the tree are detected, and everything else becomes the background color.
  • To verify the fault tolerance of the model, a vehicle label is added, and no car outline appears in the output (there is no car in the image).
  • The model can also distinguish sub-classes from parent classes: if the label pet is given instead of dog, the outline of the dog is still segmented.
  • In the third picture, very similar targets such as chairs and walls, and even the floor and the ceiling, are also separated perfectly.

  It is worth mentioning that, since the CLIP class model essentially achieves classification or segmentation by calculating the similarity of images and texts, the category prompt text of the 'other' class can actually be any meaningless text, such as 'me' , 'a', 'an', etc., as long as they are not too close to the target category.

2.1.2 Model framework

insert image description here
  As shown in Figure 4 above, the overall model looks very similar to the CLIP model, except that the single image feature is replaced by the pixel-wise dense features needed for semantic segmentation.
  Apart from that, the whole network is exactly the same as a traditional supervised segmentation network, except that the text features extracted by the text encoder are multiplied with the dense image features to compute pixel-level image-text similarities.

  • The text encoder extracts $N \times C$ text features and the image encoder extracts $\tilde{H} \times \tilde{W} \times C$ dense image features; multiplying the two gives $\tilde{H} \times \tilde{W} \times N$ features, which then pass through the Spatial Regularization Blocks and are upsampled. Finally, the model output is trained with a cross-entropy loss against the ground-truth segmentation.
  • $N, C, \tilde{H}, \tilde{W}$ are the number of categories (variable), the number of channels, and the height and width of the feature map; C is typically 512 or 768.
  • Text Encoder: directly uses the model and weights of the CLIP text encoder, frozen during both training and inference. Because segmentation datasets are relatively small (one or two hundred thousand images), training it would not give good results.
  • Image Encoder: a DPT structure (a supervised semantic segmentation model using ViT; the structure is ViT + decoder). The backbone can be a ResNet or a ViT; if the latter is used, its parameters are initialized from ViT/DeiT pre-trained weights, since directly using CLIP's pre-trained image weights did not work very well.
  • Spatial Regularization Blocks is a module proposed in this paper: after the pixel-level image-text similarity is computed, some extra parameters continue to learn the fused text-image features. The module consists of some convolutions and depth-wise convolutions (performance improves with two such blocks but collapses with four; the author does not explain why):
    insert image description here

  The model is trained on 7 segmentation datasets with labeled segmentation masks, so it is trained in a supervised manner (the loss is cross-entropy rather than an unsupervised contrastive objective). During inference, any number of prompts with any content can be specified to perform zero-shot semantic segmentation.
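  A minimal sketch of the pixel-level image-text similarity described above (the shapes and the temperature are illustrative; LSeg additionally applies the Spatial Regularization Blocks and upsampling before the cross-entropy loss):

import torch

# dense_feats: [H, W, C] per-pixel image features from the image encoder (after projection)
# text_feats:  [N, C]    text features of the N category prompts (frozen CLIP text encoder)
def pixel_text_logits(dense_feats, text_feats, temperature=0.07):
    dense_feats = dense_feats / dense_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    # [H, W, N]: each pixel gets a similarity score for every category prompt
    return torch.einsum("hwc,nc->hwn", dense_feats, text_feats) / temperature

# per-pixel prediction = logits.argmax(dim=-1); training uses cross-entropy against the GT mask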

2.1.3 Experimental results

  The author splits the PASCAL VOC and COCO datasets into four folds by category. For example, COCO's 80 classes are divided into four folds of 20; the classes in one fold are held out as unseen while the rest are seen, so both zero-shot and few-shot evaluation can be done.
  Compared with other zero-shot methods, LSeg is indeed much better; but compared with few-shot or even one-shot methods there is still a large gap. Considering that LSeg uses a ViT structure, there is clearly still a lot of room for improvement.
insert image description here
Failure cases:
  For the example on the left of the figure below, the given labels are toy, grass. In the embedding space, the visual features of the dog are much closer to "toy" than to "grass", and no other label can explain its visual features, so the dog is segmented as toy. If the labels were face, grass, the dog would be segmented as a face.
  In other words, all the work based on the CLIP model computes feature similarity between the image and the texts and picks whichever is most similar; it is not really doing classification.
insert image description here

2.2 GroupViT

2.2.1 Preface

  Although LSeg in the previous section achieves zero-shot semantic segmentation, it is not trained with contrastive (unsupervised) learning and does not use text as the supervision signal. LSeg therefore still needs hand-labeled segmentation masks for training; the 7 datasets it uses add up to maybe one or two hundred thousand samples, far less than what other supervised and unsupervised training regimes use.

  GroupViT is a CVPR 2022 paper released on February 22, 2022. As the title suggests, its supervision signal comes from text rather than segmentation masks: GroupViT is trained with text self-supervision to achieve simple segmentation (no longer relying on segmentation masks).

2.2.2 Model structure

  The core idea of GroupViT is to reuse the idea of grouping from earlier unsupervised segmentation work. Simply put, starting from some cluster-center points, similar surrounding points are gradually absorbed into a group, and in the end each group corresponds to a segmentation mask (conceptually similar to DBSCAN).
  GroupViT's contribution is to add a Grouping Block computing unit to the existing ViT model, together with learnable group tokens. In this way the model can gradually group neighboring elements during training, eventually forming segmentation masks.

  For example, in the picture below, some colorful blocks were learned in the shallow layer, and the elephants, houses, grass, etc. have been segmented in the deep layer

  Let's take a look at the GroupViT model framework and specific training process:
insert image description here

  • Image Encoder
    • The structure is a Vision Transformer with 12 Transformer layers in total; its input is not only the patch embeddings of the original image but also learnable group tokens.
    • Assuming the input image size is 224×224 and ViT-Small/16 is used, the output patch embeddings have size 196×384, i.e. the tokens $\mathbf{s}_i^1$ in the figure; the input group tokens $\mathbf{g}_i^1$ have dimension 64×384.
    • The group tokens here play the role of the cls token in classification. The self-attention of the Transformer layers then learns which patches belong to which group tokens.

Since classification needs only one whole-image feature per image, a single cls token suffices. In semantic segmentation a picture contains multiple targets, so multiple features, i.e., multiple group tokens, are needed. Initially 64 group tokens (cluster centers) are used, which is neither too many nor too few, and they can be merged later.

  • Training:

    • After the first six Transformer layers the assignments have roughly been learned, so a Grouping Block is added to complete the grouping: the image patch tokens are assigned to the group tokens and merged into larger groups with more high-level semantic information, i.e. segment tokens (dimension 64×384, equivalent to a cluster assignment).
    • The structure of the Grouping Block is shown on the right of the figure above; its grouping method is similar to self-attention. It computes the similarity matrix (64×196) between the group tokens (64×384) and the image patch tokens (196×384), and assigns each patch token to the group token with the highest similarity (assignment to cluster centers). To overcome the non-differentiability of argmax, the differentiable Gumbel-softmax is used (a simplified sketch of this assignment appears after this list). After merging, $\mathbf{s}_i^2$ (64×384) is obtained.
    • The above process is repeated: new group tokens $\mathbf{g}_i^2$ (8×384) are added, and after 3 more Transformer layers the Grouping Block assigns again to obtain $\mathbf{s}_i^3$ (8×384).
    • To contrast with the text features, the sequence features (8×384) output by the last Transformer layer are global-average-pooled into a 1×384 image feature, which after an MLP layer becomes the image feature $\mathbf{z}^I$. Finally the contrastive loss is computed with the text feature $\mathbf{z}^T$.
  • Reasoning
    The text and the image pass through their respective encoders to obtain text features and image features, and then calculate the similarity to get the best matching image-text pair, and then you can know what class each group embedding corresponds to. The limitation is that the final cluster center (Group Tokens) has only 8 categories, so a maximum of eight targets can be segmented in one image.
    insert image description here
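  As mentioned in the training steps above, here is a simplified sketch of the hard assignment inside the Grouping Block; the real module also contains learned projections, normalization and MLP layers, which are omitted here:

import torch
import torch.nn.functional as F

# group_tokens: [g, d]  (e.g. 64 x 384), the learnable cluster centers
# image_tokens: [n, d]  (e.g. 196 x 384), the patch tokens to be assigned
def grouping_block_assign(group_tokens, image_tokens, tau=1.0):
    sim = image_tokens @ group_tokens.t()                  # [n, g] token-to-group similarities
    # hard one-hot assignment of each patch token to one group, kept differentiable
    # with the straight-through Gumbel-softmax trick mentioned in the text
    assign = F.gumbel_softmax(sim, tau=tau, hard=True)     # [n, g]
    counts = assign.sum(dim=0).clamp(min=1).unsqueeze(1)   # [g, 1] tokens per group
    segment_tokens = (assign.t() @ image_tokens) / counts  # [g, d] merged segment tokens
    return segment_tokens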

Summary: GroupViT does not add very complicated modules on top of ViT, and its objective function is the same as CLIP's, so it scales very well: the larger the model and the more data, the better its performance.

Other details:

  • In the paper, ViT-Small is selected, and the data set is 29 million image-text pairs.

  • Besides the text that is paired with the image, the nouns in the text are also extracted and turned into prompts in a CLIP-like way (e.g. "A photo of a {tree}."), and a contrastive loss is computed between them and the image feature; see Figure 3 of the original paper;
    insert image description here

  • In the ablation experiment, the combination of 64 and 8 for the number of Group tokens is the best
    insert image description here

2.2.3 Group tokens visualization

  To verify whether the added Grouping Blocks and group tokens actually work, i.e., whether the group tokens become cluster centers corresponding to certain categories, the author visualizes the attention regions of different group tokens; the results are as follows:
insert image description here

  • In the first stage, each token notices some semantically clear regions, such as group5 represents the eyes, group36 represents the limbs, and they are relatively small regions;
  • In the second stage, the semantic area noticed by each token is relatively large, such as face and body. This is in line with the effect of group grouping and merging that the author wants.

2.2.4 Experiment

  1. Comparison with Zero-Shot Baselines
    The following table compares the effects of some other Zero-Shot inference models:
    insert image description here
  2. Comparison with Fully-Supervised Transfer
  • On the PASCAL VOC 2012 dataset, Zero-Shot GroupViT (no fine-tuning) outperforms all self-supervised pretrained ViT variants (supervised fine-tuning)
  • On the PASCAL Context dataset, Zero-Shot GroupViT is also comparable to them.
    insert image description here
      GroupViT is the first work to achieve zero-shot semantic segmentation, which is significantly improved compared to other self-supervised semantic segmentation methods. But compared with the upper limit of the supervised training model, it is still very poor.

  DeepLabv3+ (Xception-65-JFT) has reached 89 mean IoU on PASCAL VOC, and ViT-Adapter-L (Mask2Former, BEiT pre-training) has reached 68.2 mean IoU on PASCAL Context.

2.2.5 Limitations

The current unsupervised semantic segmentation is still difficult to do, and the author also lists two limitations of GroupViT:

  • GroupViT is more of an image encoder and does not use dense-prediction tricks such as dilated convolutions, pyramid pooling or a U-Net structure to obtain more contextual and multi-scale information;
  • The background type interference problem is difficult to deal with.
    In the reasoning process, the maximum similarity may also be very low, such as 0.2; in order to improve the segmentation performance of the foreground class, the author set a similarity threshold, but because of the interference of the background class, this threshold is difficult to set.

  For example, on PASCAL VOC the similarity threshold is set to 0.9 or 0.95: when the maximum image-text similarity is greater than the threshold, the pixel is assigned to that class; otherwise it is treated as background. PASCAL VOC has few categories and the objects have clear semantics, so the background class causes little interference; but PASCAL Context and COCO have many categories, and the foreground similarities are generally low and not much higher than the background. If the threshold is set high, many objects are classified as background; if it is set low, misclassification is easy, i.e., the class with the highest similarity is not the true class.

  Because of this background-interference problem, the author found that the group tokens are actually learned well, but the final semantic assignment is error-prone. To verify this, the author ran an oracle comparison experiment.

insert image description here
  oracle mask mIoU: After the model is segmented, it does not use the category result predicted by the model, but calculates the IoU between each mask and GT mask, and directly takes the largest category label as the category result. This is equivalent to, as long as the model is segmented accurately, the prediction of the semantic category is definitely accurate.
  As the figure shows, the oracle result is 20 to 30 points higher than the original result; this shows that wrong semantic-category assignment is the major bottleneck of GroupViT.
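  To make the metric concrete, here is a rough sketch of the oracle relabeling described above: the predicted masks are kept, but each one is given the ground-truth class it overlaps most by IoU (the function and variable names are mine):

import numpy as np

# pred_masks: list of boolean arrays [H, W], one per predicted segment
# gt_mask:    integer array [H, W] holding the ground-truth class of every pixel
def oracle_relabel(pred_masks, gt_mask):
    relabeled = np.zeros_like(gt_mask)
    for mask in pred_masks:
        best_cls, best_iou = 0, 0.0
        for cls in np.unique(gt_mask):
            gt_cls = (gt_mask == cls)
            inter = np.logical_and(mask, gt_cls).sum()
            union = np.logical_or(mask, gt_cls).sum()
            iou = inter / union if union > 0 else 0.0
            if iou > best_iou:
                best_cls, best_iou = cls, iou
        relabeled[mask] = best_cls      # keep the mask, replace the predicted label
    return relabeled                    # the standard mIoU is then computed on this prediction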

  Conclusion: GroupViT segments the image well (the segmentation masks are generated well), but the semantic classification is not good enough. This is because CLIP-style contrastive training learns objects with clear semantics very well, but the background, which is semantically ambiguous (it can stand for many classes), is hard to handle. Possible improvements include setting a different threshold for each class, using learnable thresholds, changing the zero-shot inference procedure, or adding constraints and a notion of the background class during training.

2.3 Summary

  LSeg uses CLIP's pre-trained model and its general framework, combining text and image features for segmentation, but it is still a supervised learning process and still requires manually labeled datasets. GroupViT trains a segmentation model from scratch with the same contrastive objective as CLIP; one of its limitations is that the background class is not handled well enough.

3. CLIP target detection

Li Mu's Thesis Accuracy Series "CLIP Improvement Work Lectures"

3.1 ViLD

Paper "Open-vocabulary Object Detection via Vision and Language Knowledge Distillation" , tensorflow code

3.1.1 Introduction

  ViLD was uploaded to arXiv on April 28, 2021, just two months after CLIP was published, so the follow-up work appeared very quickly (it was trained for about 460 epochs). ViLD stands for Vision and Language Knowledge Distillation: it uses CLIP as a teacher network and distills from it to achieve zero-shot detection. Simply put, what ViLD wants to do is: train only on the base classes, and learn from the CLIP model through knowledge distillation, so that at inference time it can detect any new object categories (open-vocabulary object detection).

  In the example below, a traditional object detector would only judge that these objects are toys, i.e. the blue base classes in the figure, and could not detect finer categories. With CLIP, new classes (marked in red in the figure) can be detected on top of the existing detection boxes without additional annotation.
insert image description here

3.1.2 Model structure

  ViLD focuses on the second stage of two-stage object detection, i.e., after the proposal boxes are obtained. The idea is still the simplest one: extract text features and image features separately, then compute their similarity by dot product. Its model structure is shown in the figure below.
insert image description here

  • (a): Mask R-CNN framework.
    The proposals obtained in the first stage pass through the detection head to get region embeddings, and then the classification head predicts the bounding box and its category. The loss consists of a localization (regression) loss and a classification loss.
  • (b): ViLD-text branch
    • The N proposals are processed as in figure (a) to obtain N region embeddings (image features).
    • The object categories (base classes) are turned into prompt sentences, and these texts are fed to the text encoder to obtain the text embeddings (text features). As in LSeg, the text encoder weights are frozen and do not participate in training.
    • The object categories above are the base categories (also written CB, Class Base), the same base classes used for supervised Mask R-CNN training, so ViLD-text is still trained with supervision.
    • Because it is supervised training, an additional learnable background embedding is added (categories other than the base classes are all assigned to the background class).
    • The text embeddings, together with the learnable background embedding, are dot-multiplied with the region embeddings to compute the image-text similarities (logits), and the cross-entropy loss between the logits and the ground truth is used for training.
    • In the ViLD-text model, the text features are only associated with the image features (much like in LSeg), and the model can do zero-shot detection from text queries. However, since the model does not yet understand semantic content beyond the base classes CB (i.e. the novel categories CN), directly doing zero-shot will not work very well.
  • ViLD-text dot-product formula:
    insert image description here
  • IRepresents a picture, φ(I)represents the extracted image features, rand is proposals. φ(I)and calculated rtogether Rto get er e_rer( region embeddings, image features)
  • ebg e_{bg}ebgIndicates background embedding, t_1 to t ∣ CB ∣ t_{\left | CB \right |}tCBRepresents the text features of the base class CB Text Embeddings.
  • Image features er e_rerrespectively and the background feature ebg e_{bg}ebgDo the dot product with the text features to calculate the similarity, and finally get ViLD-textthe model output z(r)(logics).
  • z(r)After doing softmax and groud truthcalculating cross entropy to get this part of the loss.
  • Projection layers are introduced to unify the dimensions of the image and text features.
  • (c): ViLD-image branch: introduces the CLIP image features. This branch is only used for distillation during training and is removed at inference.
    Since CLIP's image encoder is well trained and closely aligned with text, the hope is that the region embeddings output by the ViLD image branch are as close as possible to the image embeddings output by CLIP, so that the detector inherits the open-world feature extraction ability of the CLIP image encoder. The simplest way to achieve this is knowledge distillation.
    • Right (teacher) branch: resize the M proposals to 224×224 and feed them into the pre-trained CLIP image encoder (frozen, not trained, so the extracted features stay as good as CLIP's) to obtain M image embeddings.
    • Left (student) branch: same structure as before the ViLD-text branch; the M proposals are fed in to obtain M region embeddings.
    • Compute an L1 loss between the region embeddings and the image embeddings for knowledge distillation, so the detector learns the features extracted by CLIP.
    • To speed up training, the CLIP image features of the region crops are extracted in advance and saved to disk, then simply loaded from disk during training.
    • The supervision signal of this branch is no longer manually labeled data but CLIP's image encoding, so it is not limited to the base classes CB: CLIP can extract image features for regions of any semantics. ViLD-image therefore greatly strengthens the ability to do open-vocabulary detection.

A limitation of the ViLD-image branch: it uses M pre-computed proposals loaded from disk, rather than the N proposals that change at every training iteration.

  • The input of this branch is M pre-computed proposals, purely to accelerate training.

  • In theory, the N proposals output by the first stage should be fed to both the text and image branches during training, but extracting CLIP features at every iteration would be far too slow. The CLIP model chosen by ViLD is large, so a single forward pass is expensive; with M = 1000, every iteration would need 1000 forward passes just to get the image features, making the training time prohibitively long.

  • The author's approach is to use the RPN to pre-extract the M proposals before ViLD-image training starts, and then compute the M image embeddings following the order in the figure. During ViLD-image training these embeddings only need to be loaded, so the loss computation is very fast and the distillation trains quickly.

  • (d): combination of ViLD-text and ViLD-image.
    For simplicity of training, the M pre-computed proposals and the N proposals are fed through the detection head together to get N+M embeddings, which are then split into N region embeddings and M region embeddings. The former are used to compute the ViLD-text cross-entropy loss, the latter the ViLD-image distillation L1 loss. A minimal code sketch of this combined loss is given right after this list.
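A minimal PyTorch-style sketch of the combined ViLD loss described above. This is only an illustration under assumed tensor shapes, an assumed temperature, and an assumed distillation weight, not the official implementation:

```python
import torch
import torch.nn.functional as F

def vild_loss(region_emb,          # [N, d]  N proposals -> region embeddings (ViLD-text branch)
              text_emb,            # [CB, d] frozen base-class text embeddings from CLIP
              bg_emb,              # [d]     learnable background embedding
              gt_labels,           # [N]     0 = background, 1..CB = base classes
              distill_region_emb,  # [M, d]  M pre-computed proposals -> region embeddings
              clip_image_emb,      # [M, d]  cached CLIP image embeddings of the same M crops
              temperature=0.01, distill_weight=0.5):
    # ViLD-text: cosine similarity against [background, base classes], then cross-entropy
    r = F.normalize(region_emb, dim=-1)
    t = F.normalize(torch.cat([bg_emb[None, :], text_emb], dim=0), dim=-1)  # [1+CB, d]
    logits = r @ t.t() / temperature                                        # [N, 1+CB]
    loss_text = F.cross_entropy(logits, gt_labels)

    # ViLD-image: L1 distillation towards the frozen CLIP image encoder
    loss_image = F.l1_loss(distill_region_emb, clip_image_emb)
    return loss_text + distill_weight * loss_image

# toy shapes just to show the call
N, M, CB, d = 8, 4, 20, 512
loss = vild_loss(torch.randn(N, d), torch.randn(CB, d), torch.randn(d),
                 torch.randint(0, CB + 1, (N,)), torch.randn(M, d), torch.randn(M, d))
```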

3.1.3 Model overview

Here's a quick overview of the model:
insert image description here

  • Training:

    • The image goes through the backbone and RPN to get proposals, which then pass through RoIAlign and a few convolutional layers to give the N region embeddings, i.e. $R_1$ and $R_2$ in the figure.
    • The base classes are turned into text via prompts, and the text encoder produces the text embeddings $B_1$ to $B_n$, which are combined with $R_1$, $R_2$ to compute the cross-entropy loss.
    • The pre-cropped M proposal regions (the dice, parking sign, etc. in the figure) are fed into the CLIP image encoder to obtain the features $I_1$, $I_2$, which are used to distill $R_1$, $R_2$ (via an L1 loss).
  • Inference (a minimal sketch follows the formulas below):

    • $image\ \overset{backbone+RPN}{\rightarrow}\ proposals\ \overset{RoIAlign+Conv}{\rightarrow}\ region\text{-}embeddings$
    • $CN+CB\ \overset{prompt}{\rightarrow}\ Text\ \overset{Text\text{-}Encoder}{\rightarrow}\ text\text{-}embeddings\ (B_1..B_n + N_1...N_k)$
    • $class = \arg\max(region\text{-}embeddings \cdot text\text{-}embeddings)$
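A tiny sketch of this inference step, assuming the region embeddings and the text embeddings of the base + novel classes (CB + CN) have already been computed:

```python
import torch
import torch.nn.functional as F

def vild_inference(region_emb, class_text_emb):
    """region_emb: [N, d]; class_text_emb: [CB + CN, d] -> predicted class index per proposal."""
    sims = F.normalize(region_emb, dim=-1) @ F.normalize(class_text_emb, dim=-1).t()
    return sims.argmax(dim=-1)   # class = argmax(region_embeddings · text_embeddings)

pred = vild_inference(torch.randn(5, 512), torch.randn(30, 512))
```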

3.1.4 Experiment

  1. Zero-shot comparison on the LVIS dataset

  The images of the LVIS dataset come from COCO, but its annotations are very long-tailed: among the 1203 annotated classes, many are labeled only a few times (rare classes). Each class is therefore assigned to frequent, common, or rare, with decreasing numbers of annotations, giving $AP_f$, $AP_c$, $AP_r$.
  In this experiment, the frequent and common classes ($AP_f$, $AP_c$) serve as the base classes seen by the model (866 classes in total), and the rare classes ($AP_r$) serve as the novel classes (337 classes the model has never seen, so zero-shot detection can be evaluated on them).
insert image description here

  It can be seen that ViLD's AP on the novel classes is clearly ahead of the supervised baseline Supervised-RFS (RFS repeatedly samples the tail classes to combat the long tail, so it is a strong baseline), even though ViLD is doing zero-shot detection there. This is somewhat expected: for rare classes with only one or two labeled samples, supervised training may even degrade as training proceeds, so it can be worse than direct zero-shot transfer.

  2. Zero-shot performance on other datasets
    The figure below shows the zero-shot transfer performance of ViLD pre-trained on LVIS when moved to PASCAL VOC and COCO. Compared with supervised models there is still some gap.
    insert image description here

insert image description here

  • First row: ViLD localizes correctly and recognizes novel categories (for clarity, only the detected novel classes are shown).
  • Second row: ViLD detects both base and novel classes without degrading the detection of base classes.
  • Last two rows: ViLD transfers directly to COCO and Objects365 without further fine-tuning.

3.1.5 Conclusion

  ViLD is the first model to do open-vocabulary object detection on a dataset as difficult as LVIS, which makes it a milestone work. It borrows both the ideas and the pre-trained parameters of CLIP, and the final results are strong.

3.2 GLIP v1

3.2.1 Preface

1. Research motivation: the need for Open-vocabulary Object Detection

  As with detection and segmentation in general, labeled datasets are very expensive. For corner-case classes and the endless stream of new classes, we cannot train a model to detect each of them well; we can only rely on an Open-vocabulary detection model to handle these corner cases.
  To train a strong Open-vocabulary detection model, you need a dataset on the scale of the hundreds of millions of pairs that CLIP used, and the model must learn both the image-text correspondence and localization well. The focus is therefore on using image-text data efficiently, because such data is easy to collect.

2. Solution: phrase grounding + object detection + pseudo-label training
  Among vision-language (image-text multimodal) tasks there is a localization task called visual grounding, which locates in the image the object that a piece of text (a phrase) refers to (phrase grounding). This is very similar to object detection: both find the position of target objects in an image.

  The starting point of the GLIP paper is to convert the detection problem into a phrase grounding problem, so that GLIP unifies the two tasks of object detection and grounding and can therefore use more datasets. Combined with pseudo-label techniques to amplify the data, the amount of training data reaches an unprecedented scale (3M manually labeled samples and 24M image-text pairs). The final GLIP-L model, evaluated directly in a zero-shot way on COCO and LVIS, reaches 49.8 and 26.9 mAP respectively, which shows very strong performance.

  • The input to a grounding model is the image, the phrase, and the boxes corresponding to the nouns in the phrase.
  • Object detection is turned into phrase grounding: the label names are converted into a phrase via a prompt. For example, COCO has 80 category labels; joining the 80 labels with commas and prepending "Detect:" forms a short sentence. This has two advantages:
    • Both object detection and phrase grounding datasets can be used for training.
    • Base classes and arbitrary other classes can all be written into the prompt phrase and detected together, which is more flexible and makes it easy to move to open-vocabulary detection.
  • Pseudo-label training (self training):
    • All the object detection and phrase grounding datasets (3M in total) are used for supervised training, yielding the GLIP-T(C) model.
    • This model is then run on 24M image-text pairs crawled from the Internet to produce bounding boxes. All these boxes are treated as ground truth (pseudo-labels), giving 24M "supervised" samples.
    • Finally, training continues on these 24M pseudo-labeled samples to obtain the final GLIP-L model. The whole of GLIP is thus trained in a supervised manner. A hedged sketch of this pseudo-labelling step follows this list.
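A hedged sketch of the pseudo-labelling (self-training) step; `glip_tc.predict`, `load_web_pairs`, and the score threshold are placeholders for illustration, not real GLIP APIs:

```python
def build_pseudo_labelled_set(glip_tc, web_pairs, score_thresh=0.5):
    """Run the teacher GLIP-T(C) on web image-text pairs and keep confident boxes as pseudo-GT."""
    pseudo = []
    for image, caption in web_pairs:                        # ~24M crawled pairs in the paper
        boxes, phrases, scores = glip_tc.predict(image, caption)   # hypothetical interface
        keep = [(b, p) for b, p, s in zip(boxes, phrases, scores) if s >= score_thresh]
        if keep:
            pseudo.append({"image": image, "caption": caption, "boxes": keep})
    return pseudo                                           # then treated as ordinary supervised data

# pseudo_data = build_pseudo_labelled_set(glip_tc, load_web_pairs())
# glip_l = train(student_model, detection_data + grounding_data + pseudo_data)
```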

3. Zero-shot inference examples:
  Either give the object categories directly and build a sentence as ViLD does (prompt: person. bicycle. car. motorcycle…), or write a phrase as in the phrase grounding task, e.g. "there are many pits on the road" (prompt: there are some holes on the road); in both cases the objects can be detected.
insert image description here

3.2.2 Loss calculation

  The loss function of object detection consists of a classification loss and a localization loss. For object detection and visual grounding the localization part is similar; the main difference is how the classification loss is computed, because a detection label is a one-hot category word while a grounding label is a sentence. The two classification losses therefore need to be unified under one framework:
$L = L_{cls} + L_{loc}$

  • Detection classification loss:
    $O=Enc_{I}(Img),\quad S_{cls}=OW^{T},\quad L_{cls}=loss(S_{cls};T)$

    1. $Enc_{I}$ denotes the image encoder (e.g. a Swin Transformer); processing the image yields N region embeddings, i.e. $O\in \mathbb{R}^{N\times d}$ (N bounding boxes, each of dimension d);
    2. The classification head cls-Head is a matrix $W\in \mathbb{R}^{c\times d}$; multiplying the N region embeddings by $W^{T}$ gives $S_{cls}\in \mathbb{R}^{N\times c}$;
    3. $L_{cls}=loss(S_{cls};T)$: NMS filters the bounding boxes, and the cross-entropy with the ground truth gives the classification loss.
  • Visual grounding classification loss (in fact exactly the same as the ViLD-text branch):
    $O=Enc_{I}(Img),\quad P=Enc_{L}(Prompt),\quad S_{ground}=OP^{T}$

  1. The image encoder $Enc_{I}$ processes the image into N region embeddings, i.e. $O\in \mathbb{R}^{N\times d}$;
  2. The text encoder $Enc_{L}$ (e.g. BERT) processes the prompt into text embeddings, i.e. $P\in \mathbb{R}^{M\times d}$;
  3. Multiplying the image features O with the text features P gives the similarity matrix $S_{ground}\in \mathbb{R}^{N\times M}$, i.e. the region-word alignment scores mentioned in the paper.

In the above formula, M (the number of sub-word tokens) is always greater than the number of phrases c, for four reasons:

  • a phrase always contains many words
  • A word can be divided into several subwords, such as toothbrush is divided into tooth#, #brush
  • There are also some added tokens, such as "Detect:", commas, etc., or special tokens in the language model
  • A [NoObj] token is added at the end of the tokenized sequence.

  During training, when a phrase is a positive match, all of its sub-words are also treated as positive matches, while the added tokens are negative matches (added tokens cannot match any image object). The label matrix is accordingly expanded from $T\in \{0,1\}^{N\times c}$ to $T\in \{0,1\}^{N\times M}$. At test time, the average probability over a phrase's tokens is used as the phrase's probability.

  With the losses unified this way, grounding-style pre-training can be used for the detection task, allowing GLIP to do zero-shot detection. The author verified the unified framework on the COCO dataset and found the metrics matched exactly, confirming the idea experimentally. A toy sketch of the unified classification loss is given below.
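A toy sketch of the unified (grounding-style) classification loss. The sub-word expansion of the target matrix is simplified, and binary cross-entropy stands in for the sigmoid/focal loss used in practice; all tensors are placeholders:

```python
import torch
import torch.nn.functional as F

def grounding_cls_loss(O, P, phrase_to_tokens, target_phrase):
    """
    O: [N, d] region embeddings; P: [M, d] (sub-)word token embeddings of the prompt.
    phrase_to_tokens: list of token-index lists, one per phrase (class).
    target_phrase: [N, C] 0/1 matrix matching regions to phrases.
    """
    S_ground = O @ P.t()                                   # [N, M] region-word alignment scores
    N, M = S_ground.shape
    # expand T in {0,1}^{N x C} to {0,1}^{N x M}: every sub-word of a matched phrase is positive,
    # added tokens ("Detect:", commas, [NoObj], ...) stay negative
    T = torch.zeros(N, M)
    for c, token_ids in enumerate(phrase_to_tokens):
        T[:, token_ids] = target_phrase[:, c].unsqueeze(1).expand(-1, len(token_ids))
    return F.binary_cross_entropy_with_logits(S_ground, T)

# toy example: 3 regions, 2 phrases spread over 5 prompt tokens
loss = grounding_cls_loss(torch.randn(3, 8), torch.randn(5, 8),
                          [[0, 1], [3]], torch.tensor([[1., 0.], [0., 1.], [0., 0.]]))
```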

3.2.3 Training dataset

  • Rows A, B, and C above show that GLIP can simultaneously use object detection datasets such as Objects365 and grounding datasets such as GoldG (a combination of several datasets, which is already quite large).
  • GLIP-L: the backbone is Swin-L, trained simultaneously on FourODs (all object detection datasets usable for supervised training), GoldG, and the Cap24M image-text pairs. At this point the combined dataset is very large, enough to train a very strong model.
    insert image description here

  Cap24M is the 24M pseudo-labeled data. The generated pseudo-labels inevitably contain errors, but experiments show that scaling up with this large amount of pseudo-labeled data still improves the performance of GLIP-L.

3.2.4 Model framework

1. General framework

  As shown in the figure below, since all datasets carry labels, the model is trained in a supervised manner. After the similarity between text features and image features is computed, the alignment loss can be calculated directly against the GT boxes (as in the ViLD-text branch). The localization loss is likewise an L1 loss computed directly against the GT boxes. In this way text and image features are aligned and zero-shot detection becomes possible.

  The Deep Fusion layers in the middle of the model play the same role as in LSeg: they let the image and text features interact further, so that the joint image-text embedding space is trained better (similar embeddings move closer, dissimilar ones farther apart). The image and text features become stronger and more correlated, so the similarity matrix computed afterwards works better.
insert image description here

2. Deep Fusion layer

  • The image encoder is DyHead (L layers); its image feature at layer 0 is denoted $O^0$.
  • The text encoder is a pre-trained BERT (L layers); its text feature at layer 0 is denoted $P^0$.
  • X-MHA denotes a cross-modal multi-head attention module.
  • As the structure diagram and formulas show, the image and text features $O^i, P^i$ output by layer i interact in X-MHA; the interacted features are added to the original ones and fed to the next layer, giving $O^{i+1}, P^{i+1}$.
    insert image description here
    Inside the X-MHA module, the image and text features attend to each other via cross-attention (a minimal sketch is given after this list):
    insert image description here
  • Segmentation and detection are both dense prediction tasks requiring classification and localization at the same time, so many techniques transfer between them; Deep Fusion could also be used for segmentation, e.g. in GroupViT. GroupViT only does a bit of contrastive learning at the very end of the image and text branches; adding some Deep Fusion before that might work even better.
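A rough sketch of one fusion step, assuming single-head dot-product cross-attention for brevity (the real X-MHA is multi-head with extra projection layers):

```python
import torch
import torch.nn.functional as F

def x_mha_fusion(O_i, P_i):
    """O_i: [N, d] image features, P_i: [M, d] text features at layer i."""
    scale = O_i.shape[-1] ** 0.5
    attn_i2t = F.softmax(O_i @ P_i.t() / scale, dim=-1)   # image queries attend to text
    attn_t2i = F.softmax(P_i @ O_i.t() / scale, dim=-1)   # text queries attend to image
    O_next = O_i + attn_i2t @ P_i                         # residual add, then fed to DyHead layer i+1
    P_next = P_i + attn_t2i @ O_i                         # residual add, then fed to BERT layer i+1
    return O_next, P_next

O1, P1 = x_mha_fusion(torch.randn(10, 256), torch.randn(6, 256))
```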

3. Inference examples
On the right side of the figure above, the author shows two very difficult cases:

  • Detect two syringes and a vial of vaccine. The categories "needle" and "vaccine" do not seem to exist in existing datasets, yet GLIP interprets the text on its own and returns detections for the vaccine and the needles.
  • A description of the picture was given: "In Playa Esmeralda, Cuba, overlooking the beach from above, and seeing the beautiful sea-green Caribbean Sea". These descriptions are some relatively abstract concepts, which are not quite like objects, but GLIP still does a good job.

3.2.5 Comparative experiment

  1. Comparison of COCO dataset results.
    Directly doing zero-shot detection, GLIP already reaches 49.8 mAP. After fine-tuning on COCO, GLIP can surpass some of the current best supervised methods. Of course, GLIP's pre-training data size and some tricks differ from other models, but this is enough to show its strength.
    insert image description here
  2. Comparison of LVIS dataset resultsinsert image description here

3.3 GLIPv2

Paper "GLIPv2: Unifying Localization and Vision-Language Understanding" , code

3.3.1 Introduction

insert image description here
  The architecture of GLIPv2 is basically the same as GLIPv1, but it integrates more tasks and datasets. As the paper title Unifying Localization and Vision-Language Understanding suggests, it unifies localization tasks (such as segmentation and detection) with vision-language tasks.

Vision-Language: Language-Vision tasks, including:

  • vision Caption: Image description generation, generating descriptive text based on a picture;
  • VQA: Given a picture and a natural language question related to the picture, the computer can generate a correct answer. Text QA is a plain text answer. In contrast, VQA replaces the material with pictures, so this is a typical multimodal question;
  • Vision grounding: Locate the corresponding object in the picture according to the phrase.

  As can be seen from the figure below, compared with GLIPv1, GLIPv2 adds more training tasks on the text-encoder side to enrich its representations. The localization tasks now include not only object detection but also instance segmentation, and the understanding tasks include vision grounding, vision captioning, and VQA.
  The image and text features then go through Deep Fusion, and the rest of the processing is the same. Bringing more tasks, more datasets, and more modalities under one unified framework is a current trend, e.g. last year's OFA and this year's Unified-IO.
insert image description here

3.3.2 Loss function

GLIPv2 improves the loss function by adding two kinds of losses on top of the original grounding loss:
$$L_{GLIPv2}=\underset{L_{ground}}{\underbrace{L_{loc}+L_{intra}}}+L_{inter}+L_{mlm}$$

  1. MLM loss $L_{mlm}$: this masked-language-modeling loss strengthens the language side of the model, allowing the trained model to extend to VQA / image captioning tasks.

  2. Inter-image contrastive loss $L_{inter}$.
    Originally each image-text pair only saw information inside that pair. For example, a data pair might be a photo of a person holding a cat plus its caption: under the original loss the "person" region is only pushed towards the word "person" and away from "cat" within this pair, but it has no way to be distinguished from the various entities in other images. Hence a cross-image contrastive loss is added.

How the contrastive loss is computed:

  • For all pairs in a batch, extract their pre-fusion (non-interacted) image and text features: $\overset{\circ}{O}=Enc_{V}(Img),\ \overset{\circ}{P}=Enc_{L}(Text)$.
  • Compute the cross-image similarities $S_{ground}^{batch}[i,j]=\overset{\circ}{O}{}^{i}(\overset{\circ}{P}{}^{j})^{T}$, so that every region/token sees many more negative samples through cross-image matching. In this way we model not only the post-fusion features but also the pre-fusion features of images and text.
  • When matching across samples, the object "person" in image A should also match the word "person" in the prompt of image B. (A simplified sketch follows this list.)
    insert image description here
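A simplified sketch of this cross-image idea: pre-fusion region and token features from the whole batch are matched against each other, so every region/token sees negatives from other images. The positive-label construction is reduced to a diagonal toy case here; GLIPv2 also treats same-label matches across images as positives:

```python
import torch
import torch.nn.functional as F

def inter_image_contrastive(O_batch, P_batch, temperature=0.07):
    """O_batch: [B*N, d] regions from the whole batch; P_batch: [B*M, d] tokens from the whole batch."""
    O = F.normalize(O_batch, dim=-1)
    P = F.normalize(P_batch, dim=-1)
    S = O @ P.t() / temperature                 # [B*N, B*M] cross-image alignment scores
    # toy targets: assume the i-th region matches the i-th token; in GLIPv2 the positives
    # also include same-category matches across different images in the batch
    targets = torch.arange(min(S.shape[0], S.shape[1]))
    return F.cross_entropy(S[:len(targets)], targets)

loss = inter_image_contrastive(torch.randn(12, 64), torch.randn(12, 64))
```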

3.3.3 Model structure

The model overview is as follows:
insert image description here

3.3.4 Model effect

  1. The table below compares GLIPv2 with current object detection and vision-language pre-trained models on 8 downstream tasks.
    Experimental results show that a single GLIPv2 model (weights shared by all models) achieves performance close to SoTA on various localization and understanding tasks. The model also demonstrates strong zero-shot and few-shot performance on open-vocabulary object detection tasks and excellent grounding ability on VL understanding tasks.
    insert image description here

SOTA: state-of-the-art, i.e. the current best model.

  2. The following compares direct inference with prompt tuning for GLIPv1/GLIPv2 models of different sizes (gray indicates the dataset was used during training, so zero-shot inference cannot be performed on it):
    insert image description here
  3. Ablation study
  • On the left, the x-axis is the number of downstream task samples used, and the y-axis is the average AP over 13 datasets;
  • The right side shows the ablation results on the ODinW data;
  • zero-shot GLIPv2-T (48.5) surpasses 5-shot DyHead-T (46.4);
  • one-shot GLIPv2-H (61.3) outperforms DyHead-T (60.8) fine-tuned with full supervision on all data (ALL).
    insert image description here

4. CLIP image generation

4.1 CLIPasso generates minimalist paintings

4.1.1 Preface: Why CLIP again?

  CLIPasso won the SIGGRAPH 2022 Best Paper Award. Its title is CLIPasso: Semantically-Aware Object Sketching. From the picture below, which contains Picasso's famous painting, and from the name CLIPasso itself (a blend of CLIP and Picasso), it is clear that this is a paper about generating stick-figure sketches from images.
insert image description here

Why did this paper choose CLIP again?

  1. Preserving semantic awareness (Semantically-Aware).
      The author wants to describe an object with the simplest possible sketch, just a few strokes, while keeping it recognizable to everyone, so the sketch must be faithful both semantically and structurally. This kind of sketching is very hard: it requires grasping the most critical features of the object, i.e. the ability to abstract that the paper's abstract emphasizes.
      The picture below is Picasso's famous bull series; going from the first drawing to the last took about a year. What the authors want is to input a picture and output the final stick figure directly, which shows how important and difficult abstraction is.
    insert image description here
  2. Getting rid of supervised sketch training datasets.
      There are some earlier related works, but they all collect sketch datasets for training, and their degree of abstraction is fixed. With such a data-driven approach, what the model can generate is bounded by what the dataset contains, so the form and style of the generated sketches are very limited, which defeats the original purpose of image generation.

  Sketch datasets are few, and their categories and styles are not rich enough. The figure below lists several sketch datasets: SketchyCOCO has only 9 object classes, all common animals; Google's more recent QuickDraw (collected from people's online doodles) has 50 million drawings but only a bit more than 300 categories. A model trained on them may fail to produce accurate sketches for objects outside these categories, and would need extra data collection and fine-tuning.
insert image description here

  The most direct way to achieve both points is CLIP. CLIP's image-text contrastive training makes it particularly sensitive to objects and very good at capturing their semantics; it also has excellent zero-shot ability and can be used directly without any fine-tuning on downstream tasks. Hence CLIPasso.

  In the blog post "Multimodal Neurons in Artificial Neural Networks", published in the visualization journal Distill, the authors analyzed the CLIP model thoroughly, covering adversarial attacks, OCR attacks, robustness, and more, which is well worth reading; it also covers CLIP's transfer to stick figures. CLIP had previously been shown to transfer very well to detection and segmentation on natural images, but the distribution of stick figures is completely different from that of natural images, so it was unclear whether CLIP would still work. The blog observed that regardless of an image's style, CLIP extracts the object's visual features very well, i.e. it is very robust, and this laid the foundation for CLIPasso. (In fact, CLIPasso also draws on CLIPDraw.)

4.1.2 Summary

  Because of the simple and minimal nature of line drawing, abstraction is at the heart of sketching. Abstraction requires identifying the essential visual properties of an object or scene, which calls for semantic understanding and prior knowledge of high-level concepts. Abstract drawing is therefore a challenge for artists, and even more so for machines.

  This paper proposes CLIPasso, an object sketching method that can achieve different levels of abstraction, guided by geometric and semantic simplification. While sketch generation methods usually rely on explicit sketch datasets for training, this paper leverages CLIP's ability to extract semantic concepts from both sketches and images, and defines a sketch as a set of Bezier curves whose parameters are optimized directly against a CLIP-based perceptual loss through a differentiable rasterizer.

  The degree of abstraction can be controlled by varying the number of strokes; the resulting sketches exhibit multiple levels of abstraction while remaining recognizable and preserving the basic structure and visual composition of the drawn object. The method generalizes to various categories and handles challenging levels of abstraction while maintaining the semantic visual cues needed for instance-level and category-level recognition.

4.1.3 Model structure

  The authors improve the training scheme, the choice of losses, and the initialization of the strokes to reach the final, very good results. For example, in the picture below, by setting different numbers of strokes the image can be abstracted at different levels:
insert image description here
1. Training process

  Rather than generating pixels directly, the stick figure is drawn with Bezier curves from computer graphics. A Bezier curve is defined by a few control points on the plane; in this paper each curve is determined by four points, each with its own (x, y) coordinates, namely $s_i=\{p_i^j\}_{j=1}^{4}=\{(x_i,y_i)^j\}_{j=1}^{4}$, where s stands for stroke and j ranges from 1 to 4 because each stroke is controlled by 4 points.
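To make this parameterization concrete, here is a small sketch of one stroke as 4 learnable control points evaluated as a cubic Bezier curve. CLIPasso itself renders strokes with a differentiable rasterizer (diffvg); this is only an illustration of the curve definition:

```python
import torch

def cubic_bezier(control_points, steps=32):
    """control_points: [4, 2] tensor of (x, y); returns [steps, 2] points along the curve."""
    t = torch.linspace(0, 1, steps).unsqueeze(1)          # [steps, 1]
    p0, p1, p2, p3 = control_points
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

stroke = torch.rand(4, 2, requires_grad=True)             # one stroke = 4 learnable control points
curve_xy = cubic_bezier(stroke)                           # gradients flow back to the control points
```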

  So the method of this article is to randomly initialize some Bezier curves, and then change the positions of these points after continuous training, thereby changing the Bezier curves to get the final stick figure. The training process is shown in the figure below:

  • Rasterizer: a standard graphics component that draws Bezier curves from their parameters, and it is differentiable, so this part is an existing method used without any changes.
  • The focus of this paper's research: how to choose a better initialization; and how to choose the appropriate loss for training.

insert image description here

  • First define some Bezier curves $s_1$ to $s_n$, then hand them to the rasterizer, which draws them on the 2D canvas as an image we can see.
  • The stroke parameters are trained according to the losses, giving the final output.


  2. Objective functions. The generated stick figure has two requirements: it must match the original image both semantically and geometrically. For example, a horse must stay a horse and a cow a cow; and we must not generate a horse whose head points the other way, or a horse that has gone from standing to lying down. In CLIPasso these two requirements are guaranteed by two loss functions: the semantic loss $L_s$ and the geometric distance loss $L_g$.

  • $L_s$: semantic loss, computed between the original image features and the stick-figure features, making the two as similar as possible.
      Like ViLD distilling CLIP, CLIPasso relies on the CLIP image encoder so that the features extracted from the sketch stay close to those extracted from the original image. This exploits the robustness of CLIP just mentioned: features are extracted well on both natural images and stick figures, so if the two depict the same object their encoded features carry the same semantics and must be similar.
  • $L_g$: geometric distance loss, computed between the shallow encoder features of the original image and of the stick figure.
      This idea is borrowed from low-level vision tasks. The early layers of the model learn relatively low-level geometric and texture information rather than high-level semantics, so they carry information about lengths and widths and are sensitive to geometric position. Constraining the shallow features therefore keeps the geometric outline of the stick figure close to that of the original image. (For example, if the CLIP pre-trained backbone is ResNet50, the features output by stages 2, 3, and 4 are used to compute the loss, instead of the 2048-dimensional pooled feature.) A hedged sketch of both losses follows this list.
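A hedged sketch of the two losses, assuming a helper `clip_visual_forward` that returns both the final CLIP embedding and the intermediate feature maps (this interface is an assumption for illustration, not the CLIP library API):

```python
import torch
import torch.nn.functional as F

def clipasso_losses(image, sketch, clip_visual_forward, geo_layers=(2, 3, 4)):
    """clip_visual_forward(x) -> (final_embedding [d], {layer_idx: feature_map}) -- assumed interface."""
    img_emb, img_feats = clip_visual_forward(image)    # original photo
    skt_emb, skt_feats = clip_visual_forward(sketch)   # rendered stick figure
    L_s = 1.0 - F.cosine_similarity(img_emb, skt_emb, dim=-1).mean()          # semantic loss
    L_g = sum(F.mse_loss(img_feats[l], skt_feats[l]) for l in geo_layers)     # geometric loss on shallow features
    return L_s + L_g
```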

3. Initialization
  The author found that if the Bezier curve parameters are initialized completely at random, training is very unstable: some generated stick figures are simple and good-looking, while others never recover the semantics no matter how long they are trained, or even end up a mess. A more stable initialization is therefore needed.

  Saliency-based initialization: feed the image into a ViT and take a weighted average of the final multi-head self-attention to obtain a saliency map. Points are then sampled from the more salient regions of this map to initialize the Bezier curve parameters, which makes training much more stable and generally improves the results.

  Sampling points in salient regions is like already knowing an object is there (the semantics are clearer), or even like drawing the Bezier curves roughly along the object's boundary, so the initial curves are already fairly close to the final stick-figure curves. A small sampling sketch is shown below.
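A small sketch of the idea: sample initial stroke locations from the saliency map so that strokes start in salient regions (the exact sampling procedure in the paper may differ):

```python
import torch

def init_stroke_points(saliency_map, num_strokes):
    """saliency_map: [H, W] non-negative attention weights -> [num_strokes, 2] (x, y) in [0, 1]."""
    H, W = saliency_map.shape
    probs = (saliency_map / saliency_map.sum()).flatten()
    idx = torch.multinomial(probs, num_strokes, replacement=True)   # sample pixels proportional to saliency
    ys = torch.div(idx, W, rounding_mode='floor')
    xs = idx % W
    return torch.stack([xs / (W - 1), ys / (H - 1)], dim=1)         # normalized (x, y) coordinates

pts = init_stroke_points(torch.rand(224, 224), num_strokes=16)
```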

  Figure a below shows the comparison between the generation result of saliency initialization (Proposed) and the generation result of random initialization (Random). It can be seen that the facial features of Proposed are closer to the original image, and the hair is simpler.
  Here the author also visualizes the self-attention graph and the final point distribution graph. It can be seen that the point distribution map is very close to the final stick figure image.
insert image description here
  Figure (b) shows a post-processing step: CLIPasso generates three stick figures for each image, computes each one's loss against the original image ($L_s+L_g$), and keeps the one with the lowest loss as the final result (blue box).

  This kind of post-processing is common in text-to-image generation, e.g. DALL-E: generate many images from the text, compute each image's CLIP similarity with the original text, and show the ones with the highest similarity, which often gives the best results.

4. Training Visualization

  Training generally takes 2000 iterations, though the rough outline is visible after about 100. The appendix notes that CLIPasso trains very fast: on a single V100 the 2000 iterations finish in about 6 minutes, so this kind of cross-disciplinary research is worth trying even with limited compute.
insert image description here

4.1.4 Experimental results

  1. Thanks to CLIP's zero-shot ability, CLIPasso can also generate stick figures for uncommon objects.
    Previous methods could only sketch objects present in their datasets and struggled with rare objects.
    insert image description here

  2. Arbitrarily control the level of abstraction
    insert image description here

  3. Comparison with other methods
    insert image description here

4.1.5 Limitations

  1. When the input image has a background, the generated effect is greatly reduced.
      The input works best when it is a single object on a plain white background, because only then is the self-attention map accurate and the initialization good; with a background, the self-attention becomes much messier.
      The author therefore first feeds the picture into U2Net to extract the object from the background and then generates the sketch. This is a two-stage rather than end-to-end process, so it is not an optimal design. Integrating the two stages into one framework, or even removing the influence of the background in the loss design, would make the model more widely applicable.
  2. Stick figures are generated simultaneously rather than sequentially.
    If the model could draw like a person, one stroke at a time, deciding the position of the next stroke based on the previous ones and optimizing as it goes, the results might be better.
  3. Objects with different levels of complexity require different degrees of abstraction.
    In CLIPasso the number of strokes, which controls the degree of abstraction, must be specified in advance; it would be better to make it a learnable parameter, so that objects of different complexity in different images could be abstracted well automatically. At present, every time a user inputs a picture they have to decide how many strokes to use.

4.1.6 Conclusion

  CLIPasso can adapt to input images of any semantic category, and is no longer limited to several categories inherent in the data set; and can achieve different degrees of abstraction of objects while maintaining semantic and structural consistency with the original image.

4.2 DALL-E2 (put it in another article, follow-up)

5. CLIP for video understanding (omitted)

  This part includes CLIP4clip and ActionCLIP, which are explained in "CLIP Improvement Work Series (Part 2)" . I am not interested in this part for the time being, so I won't write it.

6. Other directions

insert image description here

Origin blog.csdn.net/qq_56591814/article/details/127421979