Paper Reading_Segment_Anything

Paper information

name_en: Segment Anything
name_ch: Segment Anything
paper_addr: http://arxiv.org/abs/2304.02643
doi: 10.48550/arXiv.2304.02643
date_read: 2023-04-07
date_publish: 2023-04-05
tags: ['Deep Learning', 'multimodal']
author: Alexander Kirillov, Meta AI Research, FAIR
citation: none yet
demo: https://segment-anything.com

after reading

The paper proposes the Segment Anything (SA) project, whose model can perform image segmentation (cutout) from prompts, including text, without fine-tuning. SA builds on ViT (applying the Transformer to image processing), MAE (self-supervised image pre-training), and CLIP (mapping between text and images); see the earlier paper-reading notes on each. It can be seen as a classic example of bringing large models to the image domain. In previous image segmentation pipelines, to identify, say, the cat in a picture, you first had to create annotation data by marking the cat with a labeling tool, and then fine-tune a pre-trained model on those annotations. The SA paper addresses two problems: linking a textual description to the corresponding region of an image, and handling the zero-shot case without fine-tuning. In addition, one highlight of the paper is that more than a billion masks were annotated in an interactive-first, then automatic, workflow, so the annotation capability improves itself as the data grows.


Summary

Segment Anything (SA) aims to segment everything. The outcome of the paper is the released model SAM, which can segment any object in an image without fine-tuning and can be driven by prompts, including text, with results comparable to supervised learning. The paper also released SA-1B, a dataset of 11M images with more than 1B mask annotations.

Introduction

Prompt learning helps large language models handle zero-shot problems; the CLIP and ALIGN models likewise provide text-image alignment for downstream tasks, such as image generation in DALL-E. This paper focuses on image segmentation: cutting out regions of an image via prompts, including text prompts.

Specifically, the work is built from three interrelated components: task, model, and data.

Task

Prompt engineering has had a huge impact in natural language and vision modeling in recent years, and this paper proposes promptable image segmentation. As shown in Figure 1-(a), the desired region is segmented by providing an image together with various prompts. Prompts can include descriptive text, points in space (stars), and regions (boxes). When a prompt is ambiguous, it may refer to multiple objects (for example, a shirt and the person wearing it); the model should return a reasonable mask for at least one of them.

In the pre-training stage, tasks resembling how the model will actually be used are constructed for training, producing an image segmenter with enough generalization ability to handle the zero-shot setting. Later, it can serve as a component in a larger system, handling new and different tasks by combining prompts with downstream modules.

Model

The SAM model structure needs to support flexible prompts, real-time computation, and ambiguity awareness.
The concrete design is shown in Figure 1-(b): an image encoder produces an image embedding, a prompt encoder produces prompt embeddings, and a lightweight mask decoder combines the two to perform segmentation.

image encoder

The ViT-based image encoder runs once per input image to produce an image embedding. Once the embedding is computed, it can be reused with many different prompts, which saves compute; each prompt then only needs the lightweight prompt encoder and mask decoder, taking roughly 50 ms and meeting the needs of web interaction. For the ambiguity problem, the model is designed to output multiple candidate masks per prompt.
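
To make the amortization concrete, here is a minimal sketch using the released SamPredictor interface (the checkpoint path, image file, and point coordinates are placeholders): set_image runs the heavy ViT encoder once, and every later prompt only touches the lightweight prompt encoder and mask decoder.

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholder checkpoint path; download links are given further below.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                # heavy ViT image encoder runs once here

# Every additional prompt reuses the cached image embedding, so it is cheap.
for xy in [(200, 150), (400, 300)]:       # placeholder click positions
    masks, scores, _ = predictor.predict(
        point_coords=np.array([xy]),
        point_labels=np.array([1]),       # 1 marks a foreground click
    )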

prompt encoder

Two kinds of prompts are considered: sparse (points, boxes, text) and dense (masks). Sparse prompts are represented as positional encodings summed with learned embeddings for each prompt type; free-form text is embedded with the text encoder from CLIP (text-to-image mapping). Dense prompts (masks) are embedded with convolutions and summed element-wise with the image embedding.
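
The idea can be pictured with a toy sketch; this is an illustration only, not SAM's actual prompt encoder, and the class name and layer sizes are made up.

import torch
import torch.nn as nn

class ToyPromptEncoder(nn.Module):
    """Illustration only: sparse prompts = positional encoding + learned
    per-type embedding; dense mask prompts = conv embedding that is later
    summed element-wise with the image embedding."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.type_embed = nn.Embedding(2, embed_dim)   # e.g. foreground / background point
        self.pos_encode = nn.Linear(2, embed_dim)      # stand-in for a Fourier positional encoding
        self.mask_embed = nn.Sequential(               # downscale a low-res mask prompt
            nn.Conv2d(1, embed_dim // 4, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(embed_dim // 4, embed_dim, kernel_size=2, stride=2),
        )

    def forward(self, point_xy, point_label, mask=None):
        sparse = self.pos_encode(point_xy) + self.type_embed(point_label)
        dense = self.mask_embed(mask) if mask is not None else None
        return sparse, dense   # dense gets added element-wise to the image embedding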

decoder

The mask decoder maps the image embedding, prompt embeddings, and an output token to a mask. It modifies a Transformer decoder block and follows it with a dynamic mask prediction head. Self-attention over the prompts and cross-attention in both directions update all embeddings; the image embedding is then upsampled, and an MLP maps the output token to the weights of a dynamic linear classifier, which computes the probability that the mask is foreground at each image location.
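
As a rough picture of the dynamic mask prediction head (again an illustration under assumed shapes, not the real module), the output token is turned into per-prompt classifier weights and dotted with the upsampled image embedding at every location.

import torch
import torch.nn as nn

class ToyDynamicMaskHead(nn.Module):
    """Illustration only: MLP(token) -> linear classifier weights, applied
    pointwise to the upsampled image embedding to score foreground."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.token_to_weights = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        self.upsample = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)

    def forward(self, output_token, image_embedding):
        # output_token: (B, C); image_embedding: (B, C, H, W)
        weights = self.token_to_weights(output_token)         # per-prompt linear classifier
        feat = self.upsample(image_embedding)                 # (B, C, 4H, 4W)
        logits = torch.einsum("bc,bchw->bhw", weights, feat)  # dot product at each location
        return torch.sigmoid(logits)                          # per-pixel foreground probability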

ambiguity problem

If the given prompt is ambiguous, there are multiple valid masks. The model is therefore modified to predict several output masks for a single prompt; three mask outputs were found to be enough for most common cases (nested masks usually go at most three levels deep: whole, part, and subpart).
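
In the released SamPredictor (continuing the earlier sketch), this behavior is exposed through the multimask_output flag; a simple policy is to keep the candidate with the highest predicted score. Coordinates here are placeholders.

# Continuing from the earlier predictor sketch.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,            # ask for SAM's three candidate masks
)
best_mask = masks[scores.argmax()]    # whole / part / subpart: keep the top-scoring one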

data engine

Large models need a huge number of images and masks with diverse distributions for training, and existing datasets are not rich enough.
The paper therefore builds a data engine: the model labels data, and that data in turn trains the model, in a repeating cycle. It has three stages:

  • Assisted-manual: SAM assists annotators in labeling masks.
    Annotators were asked to label objects in order of prominence and encouraged to move on to the next image once a mask took more than 30 seconds. Training started from common public segmentation datasets, then switched to interactive labeling; the model was retrained 6 times in total, the average annotation time per mask dropped from 34 seconds to 14 seconds, and the average number of masks per image increased from 20 to 44.
  • Semi-automatic: SAM automatically generates confident masks, and annotators focus on labeling the remaining objects to improve mask diversity.
    The model was retrained 5 more times on the newly collected data. The remaining objects are harder to label, so the average annotation time went back up to 34 seconds, while the average number of masks per image (including the automatic ones) grew from 44 to 72.
  • Fully automatic: SAM annotates on its own, generating on average about 100 high-quality masks per image.
    At this stage the model is ambiguity-aware and predicts valid masks even in ambiguous situations. 99.1% of the final dataset comes from this fully automatic annotation.

The result is the SA-1B dataset: 11 million images with more than 1 billion masks. All masks were generated automatically by SAM, with an average of about 100 masks per image.
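
The fully automatic stage corresponds to the SamAutomaticMaskGenerator class in the released code; a minimal sketch, reusing the sam model and image from the earlier snippet, looks like this.

from segment_anything import SamAutomaticMaskGenerator

mask_generator = SamAutomaticMaskGenerator(sam)   # samples a grid of point prompts internally
masks = mask_generator.generate(image)            # list of dicts, often ~100 masks per image
print(len(masks), masks[0]["segmentation"].shape, masks[0]["predicted_iou"])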

model effect

You can try it yourself on Meta's demo site; it is accessible without a VPN.
https://segment-anything.com/demo
I uploaded a picture and tried it myself. It separated the hair from the face, the two hands from each other, and even flesh-colored clothes from skin, with fairly clean edges. The masked result looks a lot like an animation cel. I wonder what retouchers and illustrators make of it, and whether kids will still take up illustration and sketching. Then again, perhaps we should first ask whether Go classes suffered after AlphaGo came out.

local build environment

The source code is based on PyTorch. Judging from predictor_example, the interface is very simple; anyone who has done a bit of image modeling can follow it, and the mask regions are returned directly. I did not find the part that calls CLIP for image-text alignment, so I only tried the segmentation part.

download source code

git clone https://github.com/facebookresearch/segment-anything.git
Then install as instructed in the README.

Run based on docker

docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime

After entering the container, install Jupyter:

pip install jupyter_nbextensions_configurator jupyter_contrib_nbextensions
jupyter notebook --allow-root -y --no-browser --ip=0.0.0.0

I also installed the following tools in my environment:

apt-get update
apt-get install build-essential libgl1-mesa-glx libglib2.0-0
pip install matplotlib torchvision pycocotools onnx black isort opencv-python

Models with different parameter sizes can be tested:

ViT-B (base), ViT-L (large), ViT-H (huge).

ViT-H is used by default; the checkpoint download is about 2.4 GB, and GPU memory usage peaks at about 11 GB:

wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

ViT-B is a roughly 358 MB download and uses about 8 GB of GPU memory:

wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth

Comparing the two, the masks from the larger model are noticeably better.
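
To switch model sizes in code, the registry key just has to match the checkpoint you downloaded; a minimal sketch, with paths adjusted to where the files actually live:

from segment_anything import sam_model_registry, SamPredictor

# "vit_b" pairs with sam_vit_b_01ec64.pth; "vit_l" and "vit_h" with their own checkpoints.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth").to("cuda")
predictor = SamPredictor(sam)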

It is not a particularly large model; with a GPU at home the speed is acceptable. From now on you have your own little cutout helper.


Origin blog.csdn.net/xieyan0811/article/details/130043245