Segment anything in images with natural language [lang-segment-anything]

In recent years, computer vision has made remarkable progress, especially in image segmentation and object detection. A major recent breakthrough is the Segment Anything Model (SAM), a versatile deep learning model designed to efficiently predict object masks from images and input prompts. By combining powerful encoders and decoders, SAM can handle a wide range of segmentation tasks, making it an invaluable tool for researchers and developers alike.

1. Introduction to SAM

SAM uses an image encoder, typically a vision transformer (ViT), to extract image embeddings that serve as the basis for mask prediction. The model also contains a prompt encoder that encodes various types of input prompts, such as point coordinates, bounding boxes, and low-resolution mask inputs. These encoded prompts are then fed into a mask decoder along with the image embeddings to generate the final object mask.

This architecture allows fast and lightweight prompting of an image that has already been encoded.

SAM is designed to handle a variety of prompts (a short usage sketch follows this list), including:

  • Mask: A coarse, low-resolution binary mask can be provided as an initial input to guide the model.
  • Point: The user can supply [x, y] coordinates together with a label (foreground or background) to help define object boundaries.
  • Box: A bounding box can be specified with coordinates [x1, y1, x2, y2] to tell the model the position and extent of an object.
  • Text: Text prompts can also be used to provide additional context or to designate objects of interest.
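
As a minimal sketch of how these prompts are used in practice, the snippet below relies on the official segment-anything package; the checkpoint path and the point/box coordinates are placeholders. The image is encoded once with set_image and can then be prompted repeatedly at low cost:

import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (the path is a placeholder) and wrap it in a predictor.
sam = sam_model_registry["vit_h"](checkpoint="./sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The image is encoded once; every prompt below reuses the same embedding.
image = np.asarray(Image.open("./assets/car.jpeg").convert("RGB"))
predictor.set_image(image)

# Point prompt: [x, y] coordinates plus a label (1 = foreground, 0 = background).
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return the three candidate masks described below
)

# Box prompt: [x1, y1, x2, y2].
masks, scores, logits = predictor.predict(
    box=np.array([425, 600, 700, 875]),
    multimask_output=False,
)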

Digging deeper into SAM's architecture, we can explore its key components:

  • Image encoder: The default image encoder of SAM is ViT-H, but ViT-L or ViT-B can be used instead depending on requirements (see the loading sketch after this list).
  • Downsampling: A series of convolutional layers reduces the resolution of binary mask prompts before they are embedded.
  • Prompt encoder: Positional embeddings encode the various input prompts, informing the model of the location and context of objects in the image.
  • Mask decoder: A modified transformer decoder block converts the encoded prompts and image embeddings into the final object masks.
  • Valid masks: For any given prompt, SAM generates the three most relevant masks, giving the user a range of options to choose from.
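
For reference, switching the image encoder backbone with the official segment-anything package only requires picking a different registry key and the matching checkpoint; the paths below are placeholders for the officially released weight files:

from segment_anything import sam_model_registry

# ViT-H is the default and most accurate; ViT-L and ViT-B trade accuracy for speed.
sam_h = sam_model_registry["vit_h"](checkpoint="./sam_vit_h.pth")
sam_l = sam_model_registry["vit_l"](checkpoint="./sam_vit_l.pth")
sam_b = sam_model_registry["vit_b"](checkpoint="./sam_vit_b.pth")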

The model is trained with a weighted combination of focal, dice, and IoU losses, using weights of 20, 1, and 1, respectively.
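
In PyTorch, that weighted objective can be sketched roughly as follows; these helpers are generic textbook implementations rather than Meta's training code, and true_iou stands for the measured IoU between each predicted mask and its ground truth:

import torch
import torch.nn.functional as F

def focal_loss(mask_logits, gt_masks, alpha=0.25, gamma=2.0):
    # Sigmoid focal loss over the raw mask logits.
    p = torch.sigmoid(mask_logits)
    ce = F.binary_cross_entropy_with_logits(mask_logits, gt_masks, reduction="none")
    p_t = p * gt_masks + (1 - p) * (1 - gt_masks)
    alpha_t = alpha * gt_masks + (1 - alpha) * (1 - gt_masks)
    return (alpha_t * ce * (1 - p_t) ** gamma).mean()

def dice_loss(mask_logits, gt_masks, eps=1.0):
    # Soft Dice loss over the predicted mask probabilities.
    p = torch.sigmoid(mask_logits).flatten(1)
    t = gt_masks.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def sam_mask_loss(mask_logits, gt_masks, pred_iou, true_iou):
    # Weighted 20:1:1 combination; the IoU term supervises the mask-quality
    # prediction head with a simple mean squared error.
    return (20.0 * focal_loss(mask_logits, gt_masks)
            + 1.0 * dice_loss(mask_logits, gt_masks)
            + 1.0 * F.mse_loss(pred_iou, true_iou))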

The strength of SAM lies in its adaptability and flexibility: it can use different prompt types to generate accurate segmentation masks. Much like large language models (LLMs), which serve as a powerful foundation for a variety of natural language processing applications, SAM provides a solid foundation for computer vision tasks. The model's architecture is designed to facilitate easy fine-tuning on downstream tasks, so it can be tailored to specific use cases or domains. By fine-tuning SAM on task-specific data, developers can enhance its performance and ensure it meets the unique requirements of their application.

This fine-tuning capability not only enables SAM to achieve impressive performance in various scenarios, but also facilitates a more efficient development process. With a pretrained model as a starting point, developers can focus on optimizing the model for their specific task, rather than starting from scratch. This approach not only saves time and resources, but also leverages the extensive knowledge encoded in the pre-trained model, making the system more robust and accurate.

2. Natural language prompts

The integration of text prompts with SAM enables the model to perform highly specific and context-aware object segmentation. By leveraging natural language prompts, SAM can be guided to segment objects of interest based on their semantic properties, attributes, or relationships with other objects in the scene.

To train SAM with text prompts, the largest publicly available CLIP model (ViT-L/14@336px) is used to compute text and image embeddings, which are normalized before being used in training.
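
For reference, computing normalized text and image embeddings with that CLIP model might look like the following, using OpenAI's clip package (the image path and caption are just examples):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

image = preprocess(Image.open("./assets/car.jpeg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a car"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text)

# Normalize both embeddings, as described above.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)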

To generate training prompts, the bounding box around each mask is first expanded by a random factor ranging from 1x to 2x. The expanded box is then cropped to a square to preserve its aspect ratio and resized to 336×336 pixels. Before the crop is fed to the CLIP image encoder, pixels outside the mask are zeroed out with a probability of 50%. Masked attention is used in the last layer of the encoder to keep the embedding focused on the object, restricting the attention of the output token to image locations inside the mask. The output token embedding is then used as the final prompt. During training, the CLIP-based prompt is provided first, followed by iterative point prompts to refine the prediction.
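
A rough reconstruction of that prompt-generation preprocessing is sketched below; make_clip_crop and its details are our own interpretation of the described procedure, not Meta's code:

import numpy as np
from PIL import Image

def make_clip_crop(image, mask, rng=None):
    # image: HxWx3 uint8 RGB array; mask: HxW boolean array for one object.
    if rng is None:
        rng = np.random.default_rng()
    ys, xs = np.nonzero(mask)
    x1, x2, y1, y2 = xs.min(), xs.max(), ys.min(), ys.max()

    # Expand the tight bounding box by a random factor between 1x and 2x.
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w = (x2 - x1) * rng.uniform(1.0, 2.0)
    h = (y2 - y1) * rng.uniform(1.0, 2.0)

    # Take a square region (here, the smallest square containing the expanded
    # box) so the later resize does not distort the aspect ratio.
    side = max(w, h)
    x1 = int(max(cx - side / 2, 0))
    y1 = int(max(cy - side / 2, 0))
    x2 = int(min(cx + side / 2, image.shape[1]))
    y2 = int(min(cy + side / 2, image.shape[0]))
    crop = image[y1:y2, x1:x2].copy()
    crop_mask = mask[y1:y2, x1:x2]

    # With probability 0.5, zero out pixels that fall outside the object mask.
    if rng.random() < 0.5:
        crop[~crop_mask] = 0

    # Resize to the 336x336 input resolution of ViT-L/14@336px.
    return np.asarray(Image.fromarray(crop).resize((336, 336)))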

For inference, the unmodified CLIP text encoder is used to create prompts for SAM. The model relies on the alignment between text and image embeddings provided by CLIP, which makes it possible to train without explicit text supervision while still accepting text prompts at inference time. This approach allows SAM to exploit natural language prompts effectively and achieve accurate, context-aware segmentation results.

Unfortunately, Meta has not released weights for SAM with text encoders.

3. lang-segment-anything

The lang-segment-anything library combines the strengths of GroundingDino and SAM to provide an innovative approach to object detection and segmentation.

Initially, GroundingDino performs zero-shot text-to-bounding-box object detection, efficiently identifying objects of interest in an image from natural language descriptions. These bounding boxes are then used as input prompts for the SAM model, which generates precise segmentation masks for the identified objects.

import numpy as np
from PIL import Image
from lang_sam import LangSAM
from lang_sam.utils import draw_image

model = LangSAM()
image_pil = Image.open('./assets/car.jpeg').convert("RGB")
text_prompt = 'car, wheel'
# GroundingDINO turns the text prompt into boxes; SAM turns the boxes into masks.
masks, boxes, phrases, logits = model.predict(image_pil, text_prompt)
labels = [f"{phrase} {logit:.2f}" for phrase, logit in zip(phrases, logits)]
image = draw_image(np.asarray(image_pil), masks, boxes, labels)

4. Lightning application

You can quickly deploy apps using the Lightning AI App Framework. We'll use the ServeGradio component to serve the model behind a UI. You can learn more about ServeGradio in the Lightning AI documentation.

import os

import gradio as gr
import lightning as L
import numpy as np
from lightning.app.components.serve import ServeGradio
from PIL import Image

from lang_sam import LangSAM
from lang_sam import SAM_MODELS
from lang_sam.utils import draw_image
from lang_sam.utils import load_image

# Lightning component that serves the LangSAM model behind a Gradio UI.
class LitGradio(ServeGradio):

    inputs = [
        gr.Dropdown(choices=list(SAM_MODELS.keys()), label="SAM model", value="vit_h"),
        gr.Slider(0, 1, value=0.3, label="Box threshold"),
        gr.Slider(0, 1, value=0.25, label="Text threshold"),
        gr.Image(type="filepath", label='Image'),
        gr.Textbox(lines=1, label="Text Prompt"),
    ]
    outputs = [gr.Image(type="pil", label="Output Image")]

    def __init__(self, sam_type="vit_h"):
        super().__init__()
        self.ready = False
        self.sam_type = sam_type

    def predict(self, sam_type, box_threshold, text_threshold, image_path, text_prompt):
        # Called for each Gradio request: rebuild SAM only if the selected
        # backbone changed, then run text-prompted detection and segmentation.
        print("Predicting... ", sam_type, box_threshold, text_threshold, image_path, text_prompt)
        if sam_type != self.model.sam_type:
            self.model.build_sam(sam_type)
        image_pil = load_image(image_path)
        masks, boxes, phrases, logits = self.model.predict(image_pil, text_prompt, box_threshold, text_threshold)
        labels = [f"{phrase} {logit:.2f}" for phrase, logit in zip(phrases, logits)]
        image_array = np.asarray(image_pil)
        image = draw_image(image_array, masks, boxes, labels)
        image = Image.fromarray(np.uint8(image)).convert("RGB")
        return image

    def build_model(self, sam_type="vit_h"):
        # Called once by ServeGradio to load LangSAM before serving requests.
        model = LangSAM(sam_type)
        self.ready = True
        return model

app = L.LightningApp(LitGradio())
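
Assuming the code above is saved as app.py (the filename is just an example) and the Lightning CLI is installed, the app can be started locally with:

lightning run app app.py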

That's it, the application is launched in the browser!

5. Conclusion

That concludes our introduction to the Segment Anything Model. It is clear that SAM is an invaluable tool for computer vision researchers and developers, capable of handling a wide range of segmentation tasks and adapting to different types of prompts. Its architecture is straightforward to build on, making it general enough to be tailored to specific use cases and domains. Overall, SAM has quickly become a great asset to the machine learning community and will certainly continue to make waves in the field.


Original link: Segmenting images with natural language - BimAnt
