[Project Learning] Notes on segment-anything (SAM) and derived automatic labeling tools

This article consists of three parts:
1. Segment Anything Model (SAM) overview: my notes while learning the concept; feel free to skip it.
2. Use of SAM-derived annotation tools: notes from trying two open-source annotation tools built on SAM.
3. Problems encountered.

References:
1. Segment-anything official demo
2. Introduction to the SA foundation model for image segmentation
3. The segment-anything project

1. Overview of Segment Anything Model (SAM)

The Segment Anything Model (SAM) is the first foundation model dedicated to image segmentation.

Segmentation, that is, identifying which image pixels belong to an object, is one of the core tasks of computer vision.

The Segment Anything project is a new task, dataset, and model for image segmentation proposed by Meta AI. It released the general-purpose Segment Anything Model (SAM) and the Segment Anything 1-Billion mask dataset (SA-1B), the largest segmentation dataset to date.

At the heart of the Segment Anything project is reducing the need for task-specific modeling expertise, training compute, and custom data annotation for image segmentation. The goal is to build a foundation model for image segmentation: a promptable model trained on diverse data that can be adapted to a specific task, similar to the way prompts are used in natural language processing models. However, unlike the abundance of images, videos, and text on the Internet, the segmentation data needed to train such a model is not readily available online or elsewhere. So the Segment Anything project simultaneously developed a general, promptable segmentation model and used it to create a segmentation dataset of unprecedented scale.

SAM already has a general notion of what an object is, and it can generate masks for any object in any image or video, even for objects and image types it did not encounter during training. SAM is general enough to cover a broad range of use cases and can be used out of the box on new image "domains" (whether underwater photographs or cell microscopy) without additional training, a capability often referred to as zero-shot transfer.

In the future, SAM could help applications in numerous domains that need to find and segment any object in any image. For the AI research community and others, SAM can become a component of larger AI systems for more general multimodal understanding of the world, for example understanding both the visual and textual content of a web page. In AR/VR, SAM could select objects based on the user's gaze and then "lift" them into 3D. For content creators, SAM can improve creative applications such as extracting image regions for collages or video editing. SAM can also aid the scientific study of natural events on Earth or even in space, for example by locating animals or objects to study and track in video. The possibilities are wide-ranging.

Versatility

Previously, there were two categories of approaches to segmentation problems. The first, interactive segmentation, allows segmentation of objects of any class but requires a human to guide the method by iteratively refining the mask. The second, automatic segmentation, allows segmentation of specific object categories defined in advance (e.g., cats or chairs) but requires a large number of manually annotated examples for training (e.g., thousands or even tens of thousands of segmented cats), along with the compute resources and technical expertise to train the segmentation model. Neither approach provides a general, fully automatic solution to segmentation.

SAM is a generalization of these two types of methods. It is a single model that can easily perform both interactive and automatic segmentation. The model's promptable interface allows it to be used flexibly: a wide range of segmentation tasks can be accomplished simply by designing the right prompt (clicks, boxes, text, etc.). In addition, SAM is trained on a diverse, high-quality dataset containing more than 1 billion masks (collected as part of this project), which enables it to generalize to new types of objects and images beyond what it observed during training. This ability to generalize means that practitioners will, in many cases, no longer need to collect their own segmentation data and fine-tune a model for their use case.

These features enable SAM to generalize to new tasks and domains. This flexibility is the first of its kind in the field of image segmentation.

SAM function description

(1) Allows the user to segment an object with a single click, or by interactively clicking points to include and exclude from the object. The model can also be prompted with a bounding box.
(2) Can output multiple valid masks when the object to segment is ambiguous, an important and necessary ability for solving segmentation problems in the real world.
(3) Can automatically discover and mask all objects in an image (a minimal usage sketch follows this list).
(4) Can generate a segmentation mask for any prompt in real time once the image embedding has been precomputed, allowing real-time interaction with the model.
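As an illustration of (3), here is a minimal sketch using the segment-anything Python package's automatic mask generator (the checkpoint path, model type, and image path are placeholders; adjust them to your own files):

import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM checkpoint (path and model type are placeholders).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# OpenCV loads BGR; SAM expects an RGB uint8 array.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

# Each result is a dict containing a binary mask plus metadata such as area and a predicted IoU score.
masks = mask_generator.generate(image)
print(len(masks), "masks; first mask area:", masks[0]["area"])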

Introduction

Foundation models are a promising development: they can perform zero-shot and few-shot learning on new datasets and tasks by using prompting techniques.

Following the success of foundation models in NLP, exploration of foundation models has also begun in computer vision. For example, CLIP and ALIGN use contrastive training to align image and text encoders across the two modalities. Carefully designed text prompts then enable zero-shot generalization to novel visual concepts and data distributions. Such encoders can also be composed with other modules for downstream tasks such as image generation. While great progress has been made on vision-language encoders, computer vision encompasses a wide range of problems beyond this, and for many of them abundant training data does not exist.

**Task:** The goal is to build a foundation model for image segmentation: develop a promptable model and pre-train it on a broad dataset using a task that enables strong generalization. With such a model, prompt engineering can be used to solve a range of downstream segmentation problems on new data distributions.

A promptable segmentation task is proposed, where the goal is to return a valid segmentation mask given any segmentation prompt.


A prompt simply specifies what to segment in an image; for example, a prompt may contain spatial or textual information identifying an object. The requirement of a valid output mask means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt could indicate either the shirt or the person wearing it), the output should be a reasonable mask for at least one of those objects. The promptable segmentation task is used as the pre-training objective, and general downstream segmentation tasks are solved through prompt engineering.

**Model:** The promptable segmentation task and the goal of real-world use impose constraints on the model architecture. The model must support flexible prompts, must compute masks in amortized real time to allow interactive use, and must be ambiguity-aware.

A simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds the prompts, and the two sources of information are then combined in a lightweight mask decoder that predicts segmentation masks. This model is called the Segment Anything Model (SAM). By splitting SAM into an image encoder and a fast prompt encoder/mask decoder, the same image embedding can be reused (and its cost amortized) across different prompts.

Given a precomputed image embedding, the prompt encoder and mask decoder predict a mask from a prompt in about 50 ms in a web browser. The focus is on point, box, and mask prompts, with initial results also shown for free-form text prompts. To make SAM ambiguity-aware, it is designed to predict multiple masks for a single prompt, allowing SAM to handle ambiguity naturally, as in the shirt vs. person example.
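A minimal sketch of this embed-once, prompt-many-times workflow with the segment-anything Python package (the checkpoint path, image path, and click coordinates are placeholders):

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM and wrap it in a predictor (checkpoint path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Heavy step: run the image encoder once and cache the embedding.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Light step: each prompt reuses the cached embedding.
point_coords = np.array([[500, 375]])   # one foreground click (x, y), placeholder values
point_labels = np.array([1])            # 1 = positive click, 0 = negative click
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,              # several candidate masks for an ambiguous prompt
)
print(masks.shape, scores)              # e.g. (3, H, W) masks with confidence scores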

**Data Engine:** There are three stages: assisted-manual, semi-automatic, and fully automatic. In the first stage, SAM assists annotators in annotating masks, similar to the classic interactive-segmentation setup. In the second stage, SAM automatically generates masks for a subset of objects by being prompted with likely object locations, while annotators focus on annotating the remaining objects, which helps increase mask diversity. In the final stage, SAM is prompted with a regular grid of foreground points, yielding on average about 100 high-quality masks per image.

**Experiments:** SAM is evaluated extensively. First, on 23 diverse new segmentation datasets, SAM produces high-quality masks from a single foreground point, often only slightly below the manually annotated ground truth. Second, consistently strong quantitative and qualitative results are found on a variety of downstream tasks under a zero-shot transfer protocol using prompt engineering, including edge detection, object proposal generation, instance segmentation, and a preliminary exploration of text-to-mask prediction. These results show that SAM, used out of the box with prompt engineering, can solve a variety of tasks involving object and image distributions beyond SAM's training data.

Task

The promptable segmentation task is to return a valid segmentation mask given any prompt.

The goal of the promptable segmentation task is to produce a versatile model that can be adapted to many (though not all) existing and new segmentation tasks through prompt engineering. A model trained for promptable segmentation can then be used as a component in a larger system to perform a new, different task at inference time.

Prompting and composition are powerful tools that allow a single model to be used in scalable ways, potentially accomplishing tasks not known at the time the model was designed. Composable system designs, powered by techniques such as prompt engineering, are expected to enable a broader range of applications than systems trained exclusively for a fixed set of tasks.

The heavyweight image encoder outputs an image embedding, which can then be efficiently queried with various input prompts to generate object masks at amortized real-time speed. For ambiguous prompts corresponding to multiple objects, SAM can output multiple valid masks with associated confidence scores.

SAM has three components: an image encoder, a flexible prompt encoder, and a fast mask decoder.

Image Encoder: Motivated by scalability and powerful pre-training methods, a MAE pre-trained Vision Transformer (ViT) is used, minimally adapted to handle high-resolution inputs. The image encoder runs once per image and can be applied before the model is prompted.

Prompt Encoder: Two sets of prompts are considered: sparse (points, boxes, text) and dense (masks). Points and boxes are represented by positional encodings summed with learned embeddings for each prompt type, and free-form text is represented with an off-the-shelf text encoder from CLIP. Dense prompts (i.e., masks) are embedded with convolutions and summed element-wise with the image embedding.

Mask Decoder: The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. It uses a modified Transformer decoder block followed by a dynamic mask-prediction head. The modified decoder block updates all embeddings using prompt self-attention and cross-attention in both directions (prompt to image embedding and vice versa). After running two such blocks, the image embedding is upsampled and an MLP maps the output token to a dynamic linear classifier, which then computes the mask foreground probability at each image location.

The overall model design is largely driven by efficiency. Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU, in about 50 ms. This runtime performance enables seamless, real-time interactive prompting of the model.

The data engine has three stages: (1) a model-assisted manual annotation stage, (2) a semi-automatic stage mixing automatically predicted masks with model-assisted annotation, and (3) a fully automatic stage in which the model generates masks without annotator input.

2. Use of annotation tools derived from SAM

Projects tried:
1. https://github.com/haochenheheda/segment-anything-annotator
2. https://github.com/anuragxel/salt
3. https://github.com/zhouayi/SAM-Tool (similar to salt in item 2; not tried here)

Note: For both projects, opening the annotation tool on the server gave an error (see part 3), so I chose to use them on Windows.
Both projects were configured according to the environment instructions on their GitHub pages and can be operated by following the documented steps, so detailed setup is not shown here.

SAA project: https://github.com/haochenheheda/segment-anything-annotator

1. Start the labeling platform

python annotator.py --app_resolution 1000,1600 --model_type vit_b --keep_input_size True --max_size 720

The interface is as follows:
(screenshot of the annotator interface)

--model_type: vit_b, vit_l, or vit_h
--keep_input_size: True keeps the original image size for SAM; False resizes the input image to --max_size (saves GPU memory)

2. The category txt file

classes.txt contains the class labels of the dataset, one per line (screenshot of its contents). In the interface, click to select classes.txt (screenshot).

3. Specify the image folder and the label save folder (screenshot).

4. Load the SAM model

(screenshot)
After clicking, the terminal shows the model loading output (screenshot).

5. Other functions

After the operations above, the interface looks like this (screenshot):

Zoom in/out: press "CTRL" and the mouse wheel to resize.

Manual annotation (screenshot): add a mask manually by clicking along the object boundary; press the right mouse button and drag to draw an arc (screenshot).


Point prompt: click to generate mask proposals. The left/right mouse buttons are positive/negative clicks, respectively. Several mask proposals appear in the box below and can be selected by clicking them or with the shortcut keys 1, 2, 3, 4.
Operation: click the point prompt, choose one of proposals 1-4, and press "a" (screenshot).

Here proposal 3 is selected, then "a" is pressed to choose the class label (screenshots).
A mask proposal can also be generated with a box prompt (screenshot).

Accept (shortcut: a): accept the selected proposal and add it to the annotation dock.
Reject (shortcut: r): reject the proposal and clear the workspace.

Annotated objects can be modified: class labels or ids can be changed by double-clicking an object entry in the annotation dock; boundaries can be modified by dragging points on the boundary (screenshot).

Delete (shortcut: d): in Edit Mode, deletes the selected/highlighted object from the annotation dock.

In Edit Mode, if a polygon is too dense to edit, a button is provided to reduce the number of points on the selected polygon, at a slight cost in annotation quality.

Others:
Zoom in/out: press "CTRL" and the mouse wheel.
Class On/Off: if Class is turned on, a dialog box appears after accepting a mask to record the category and id; otherwise the category defaults to "Object".

After editing, you need to click Save (screenshot), otherwise the work will be lost.

After completion, a JSON file is generated in the save directory, with the following structure:

[
  # object 1
  {
    "label": ...,
    "group_id": ...,
    "shape_type": "polygon",
    "points": [[x1, y1], [x2, y2], [x3, y3], ...]
  },
  # object 2
  ...
]
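A minimal sketch for consuming this output, assuming the structure above (the file name and image size are placeholders), which converts each polygon into a binary mask with OpenCV:

import json

import cv2
import numpy as np

# Load the annotation file produced by the tool (file name is a placeholder).
with open("example.json", "r", encoding="utf-8") as f:
    objects = json.load(f)

height, width = 720, 1280  # placeholder image size
for obj in objects:
    polygon = np.array(obj["points"], dtype=np.int32)
    mask = np.zeros((height, width), dtype=np.uint8)
    cv2.fillPoly(mask, [polygon], 255)          # rasterize the polygon into a binary mask
    print(obj["label"], int(mask.sum() // 255), "pixels")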

SALT project https://github.com/anuragxel/salt

1. Convert the model to ONNX (screenshot):

The exported ONNX model (screenshot).
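For reference, the upstream segment-anything repository documents a script for exporting the mask decoder to ONNX; a typical invocation looks like the following (paths are placeholders, and SALT's own helper scripts may wrap this step differently):

python scripts/export_onnx_model.py --checkpoint models/vit_b.pth --model-type vit_b --output models/sam_vit_b.onnx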

2. Extract the embeddings of all images in the dataset

python helpers/extract_embeddings.py --checkpoint-path models/vit_b.pth --model_type vit_b --dataset-path /home/kunwang/project_2023/sam_kk/segment-anything-main/dataset/ 

Error: the Nvidia driver version is too old (screenshot of the error).

So modify the code to use the CPU as the device, in
"/home/kunwang/project_2023/sam_kk/segment-anything-main/helpers/extract_embeddings.py" (screenshot of the change).
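A minimal sketch of what such an embedding-extraction loop can look like when forced onto the CPU (the paths and output layout are assumptions; the actual salt helper may differ):

import os

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM on the CPU (checkpoint and dataset paths are placeholders).
sam = sam_model_registry["vit_b"](checkpoint="models/vit_b.pth")
sam.to(device="cpu")
predictor = SamPredictor(sam)

image_dir = "dataset/images"
out_dir = "dataset/embeddings"
os.makedirs(out_dir, exist_ok=True)

for name in os.listdir(image_dir):
    if not name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    image = cv2.cvtColor(cv2.imread(os.path.join(image_dir, name)), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)                                  # runs the heavy image encoder
    embedding = predictor.get_image_embedding().cpu().numpy()
    np.save(os.path.join(out_dir, os.path.splitext(name)[0] + ".npy"), embedding)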


As shown, the corresponding embeddings are generated under /home/kunwang/project_2023/sam_kk/segment-anything-main/dataset/embeddings/ (screenshot).

Note:

Because an error was encountered on the server (see part 3), the environment was configured on Windows instead (some dependencies differ from Linux; the corresponding win64 packages can be found).
For example, the environment.yaml in this project asks conda to install ncurses=6.4; on Windows, use conda to install the win-ncurses package instead, finding the corresponding version on the official site. If a conflict is reported, a close version can be installed.

Enter the salt-main directory and run the command:

python segment_anything_annotator.py --categories "nohat,hat,vest,other"

(screenshots of the SALT annotation interface)

Part of the generated JSON is shown in the screenshot.

In general, there are two steps to using this tool:
1. Convert the model to ONNX and generate embeddings for all images to be labeled in the dataset;
2. Open the labeling tool and annotate.

Advantages: bounding-box labels can also be generated.

Disadvantages (shared by both of the above tools): results on densely packed targets are mediocre, and annotation is time-consuming (due to SAM's own shortcomings).

3. Problems encountered

After installing requirements.txt for project 1 and project 2 and trying to start the annotation platform on the server, both report the following error:
Got keys from plugin meta data ("xcb")
QFactoryLoader::QFactoryLoader() checking directory path "/home/kunwang/.conda/envs/sam/bin/platforms" ...
loaded library "/home/kunwang/.conda/envs/sam/lib/python3.8/site-packages/cv2/qt/plugins/platforms/libqxcb.so"
QObject::moveToThread: Current thread (0x55c4efa86610) is not the object's thread (0x55c4f292d870).
Cannot move to target thread (0x55c4efa86610)

qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "/home/kunwang/.conda/envs/sam/lib/python3.8/site-packages/cv2/qt/plugins" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: xcb, eglfs, linuxfb, minimal, minimalegl, offscreen, vnc, wayland-egl, wayland, wayland-xcomposite-egl, wayland-xcomposite-glx, webgl.

Attempted solutions:
1. Reinstall opencv-python and downgrade its version; the error persists (screenshot).

I also do not have sudo permission, so this could not be resolved on the server; in the end I chose to run the annotation tools on Windows.

Searching related issues suggests the cause is a conflict between PyQt5 and OpenCV (still unresolved).

References:

1. http://www.manongjc.com/detail/63-wjnshoygnkrqpsf.html
2. https://www.cnblogs.com/isLinXu/p/15876688.html (the OpenCV version is too high and incompatible with PyQt5; did not solve it)
3. https://github.com/opencv/opencv-python/issues/46 (collection of related problems)

I tried the methods above but still could not solve it.

Useful:
4. https://zhuanlan.zhihu.com/p/471661231
5. https://blog.csdn.net/hxxjxw/article/details/115936461
6. https://my.visualstudio.com/Downloads?q=build%20tools
