【Paper Notes】FASTER SEGMENT ANYTHING: TOWARDS LIGHTWEIGHT SAM FOR MOBILE APPLICATIONS

FastSAM was released only a short while ago, and MobileSAM has already followed on its heels. In my previous paper notes I argued that FastSAM should really be regarded as an extension of YOLO, since it departs a long way from the native SAM architecture, and the MobileSAM introduction takes fairly direct aim at FastSAM. Let's take a look at MobileSAM today.

 

1 Introduction

1.1 Motivation

The SAM pipeline is computationally heavy mainly because of its huge image encoder. This work studies a lightweight SAM for resource-constrained mobile devices.

1.2 Challenges & Solutions

  • Challenge: The difficulty of retraining SAM mainly comes from the coupled optimization of the image encoder and the mask decoder.
  • Solution: Decouple the optimization of the image encoder and the mask decoder.

First, distill the knowledge from the default ViT-H image encoder into a tiny ViT.

Afterwards, the mask decoder of the original SAM can optionally be fine-tuned to better align with the distilled image encoder.

2 MobileSAM

2.1 Alleviating coupled distillation

To alleviate the optimization problem of coupled distillation:

(1) Semi-coupled distillation: the mask decoder is copied and frozen, and only the image encoder is optimized during distillation.

However, the choice of prompt is random, which makes the mask decoder's input variable and thus still increases the difficulty of optimization.

(2) Decoupled distillation: distill a small image encoder directly from the ViT-H in the original SAM.

A simple MSE loss suffices; there is no need to use the focal loss and dice loss for mask prediction as in the original SAM paper.

Decoupled distillation not only reduces the required computing resources but also achieves better performance than semi-coupled distillation.
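Concretely, the decoupled distillation stage can be sketched as below. This is only a minimal illustration under assumptions (the authors' training code has not been released): `teacher_encoder` stands for SAM's ViT-H image encoder, `student_encoder` for a TinyViT, and `dataloader` for a loader of images already preprocessed to SAM's 1024x1024 input; all three are placeholders.

```python
import torch
import torch.nn.functional as F

def distill_image_encoder(teacher_encoder, student_encoder, dataloader,
                          epochs=8, lr=1e-3, device="cuda"):
    """Decoupled distillation sketch: match the student's image embedding to the
    teacher's with a plain MSE loss; no mask decoder or mask losses are involved."""
    teacher_encoder.eval().to(device)
    student_encoder.train().to(device)
    optimizer = torch.optim.AdamW(student_encoder.parameters(), lr=lr)

    for epoch in range(epochs):
        for images in dataloader:                   # (B, 3, 1024, 1024) preprocessed images
            images = images.to(device)
            with torch.no_grad():
                target = teacher_encoder(images)    # teacher image embedding, e.g. (B, 256, 64, 64)
            pred = student_encoder(images)          # student embedding of the same shape
            loss = F.mse_loss(pred, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student_encoder
```

Since the teacher is frozen, its embeddings could in principle be precomputed once and cached, which would further cut the cost of this stage.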

2.2 Mask decoder fine-tuning

The image embedding produced by the student encoder can be close enough to that of the original teacher encoder, which makes fine-tuning the mask decoder in the second stage optional.
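If this optional fine-tuning is carried out, it amounts to freezing the distilled image encoder and updating only the mask decoder with SAM's original mask losses (focal + dice, combined at the 20:1 ratio reported in the SAM paper). The sketch below is an assumption about what such a step could look like, not released code; `forward_with_prompts` and `dataloader` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_logits, gt_masks, eps=1.0):
    """Dice loss on sigmoid mask logits."""
    pred = torch.sigmoid(pred_logits).flatten(1)
    gt = gt_masks.flatten(1)
    inter = (pred * gt).sum(-1)
    union = pred.sum(-1) + gt.sum(-1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def focal_loss(pred_logits, gt_masks, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss on mask logits."""
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt_masks, reduction="none")
    p = torch.sigmoid(pred_logits)
    p_t = p * gt_masks + (1 - p) * (1 - gt_masks)
    alpha_t = alpha * gt_masks + (1 - alpha) * (1 - gt_masks)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def finetune_mask_decoder(mobile_sam, dataloader, lr=1e-4, device="cuda"):
    """Freeze the distilled image encoder and update only the mask decoder."""
    mobile_sam.to(device).train()
    for p in mobile_sam.image_encoder.parameters():
        p.requires_grad_(False)
    optimizer = torch.optim.AdamW(mobile_sam.mask_decoder.parameters(), lr=lr)

    for images, prompts, gt_masks in dataloader:    # gt_masks: (B, 1, H, W) float masks in {0, 1}
        # `forward_with_prompts` is a hypothetical helper that runs the frozen
        # encoder, the prompt encoder, and the mask decoder to get mask logits.
        pred_logits = forward_with_prompts(mobile_sam, images.to(device), prompts)
        # 20:1 focal-to-dice weighting, as in the original SAM paper.
        loss = 20.0 * focal_loss(pred_logits, gt_masks.to(device)) \
               + dice_loss(pred_logits, gt_masks.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```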

2.3 Comparison with FastSAM

3 Code

3.1 Model code

Because MobileSAM is built on SAM and only replaces the image encoder with a lightweight TinyViT, the basic model architecture is largely unchanged.
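As a quick illustration of the unchanged interface, here is a minimal usage sketch, assuming the MobileSAM repository mirrors the `segment_anything`-style API (a model registry plus `SamPredictor`) with a `vit_t` TinyViT encoder and a `mobile_sam.pt` checkpoint; check the repository README for the exact names and paths.

```python
import numpy as np
import torch

# Assumes the MobileSAM package exposes the same interface as segment_anything.
from mobile_sam import sam_model_registry, SamPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = sam_model_registry["vit_t"](checkpoint="mobile_sam.pt")  # TinyViT image encoder
model.to(device).eval()

predictor = SamPredictor(model)
image = np.zeros((480, 640, 3), dtype=np.uint8)   # replace with a real RGB image (H, W, 3)
predictor.set_image(image)

# Single foreground point prompt at (x=320, y=240)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)
```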

 

3.2 Training code

Waiting for the training code....

Origin blog.csdn.net/weixin_50862344/article/details/131455431