RetinaTrack: Online Single Stage Joint Detection and Tracking

Main problem addressed: ID switches caused by occlusion

Related methods: CenterTrack, FairMOT, JDE, Tracktor++

https://arxiv.org/abs/2003.13870

RetinaNet predicts classification scores and box regression offsets for the k anchors at each location through its classification and regression branches, respectively.

RetinaTrack, like JDE and FairMOT, adds a 256-dimensional embedding branch. Each FPN feature map Fi is split into k branches, where k is the number of anchor shapes per location, so that each anchor has its own path into the classification and regression heads.

In plain RetinaNet, by contrast, the convolutional parameters of the classification and regression sub-networks are shared across all k anchors, so per-instance features are never explicitly extracted.

When the centers of the two cars in the figure overlap and both detection boxes are predicted from the same anchor location, it is difficult to obtain discriminative embeddings for them.


During tracking, occlusion often causes multiple objects to fall on the same anchor location, which triggers exactly this problem.

In the authors' words: "Our solution is to force the split among the anchors to occur earlier among the post-FPN prediction layers, allowing us to access intermediate level features that can still be uniquely associated with an anchor (and consequently a final detection)."


To capture instance-level features, RetinaTrack therefore splits into per-anchor branches immediately after the FPN, extracting more discriminative features for each anchor.
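A rough sketch of this per-anchor split, using numpy matrix multiplies as stand-ins for the real 1×1 conv layers (all shapes and weight names here are illustrative, not the paper's actual configuration):

```python
import numpy as np

def per_anchor_heads(fpn_feature, anchor_weights, embed_weights):
    """Sketch of RetinaTrack's post-FPN split: each of the k anchor
    shapes gets its own projection before the task heads, so the
    intermediate features remain uniquely tied to one anchor.

    fpn_feature:    (H*W, C) flattened FPN feature map
    anchor_weights: list of k (C, C) matrices, one per anchor shape
    embed_weights:  (C, D) shared projection into embedding space
    """
    outputs = []
    for w in anchor_weights:                  # separate branch per anchor
        feat = fpn_feature @ w                # per-anchor "conv"
        emb = feat @ embed_weights            # shared embedding head
        # L2-normalise so embeddings are comparable across anchors
        emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8)
        outputs.append(emb)
    return outputs

rng = np.random.default_rng(0)
fpn = rng.normal(size=(16, 8))                        # 4x4 map, 8 channels
anchors = [rng.normal(size=(8, 8)) for _ in range(3)]  # k = 3 anchor shapes
embed_w = rng.normal(size=(8, 4))                      # 4-dim embeddings
embs = per_anchor_heads(fpn, anchors, embed_w)
print(len(embs), embs[0].shape)
```

The point of the structure is that each of the k outputs was computed from features that never mixed with the other anchors, so two overlapping objects matched to different anchors get distinct embeddings.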

Training details

Sigmoid Focal Loss is used for classification, Huber Loss for box regression, and a triplet loss for the embeddings, using the BatchHard strategy to sample triplets.


Target assignment rules for the detection loss: (1) an anchor is assigned to a ground-truth box if their IoU >= 0.5; (2) each ground-truth box is additionally matched to its nearest anchor (the anchor with the largest IoU against it), even if that IoU is below the threshold. Rule 2 guarantees that every ground-truth box has at least one matching anchor, since under rule 1 alone a box may have all its IoUs below the threshold.
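The two rules can be sketched in a few lines of numpy (a simplified version; real detectors also handle ignore regions and ties, which are omitted here):

```python
import numpy as np

def iou(a, b):
    """IoU between one box a=(x1,y1,x2,y2) and an (N,4) array of boxes b."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def assign_anchors(anchors, gt_boxes, iou_thresh=0.5):
    """Rule 1: anchor -> gt if IoU >= iou_thresh.
    Rule 2: each gt also claims its best-IoU anchor, even below threshold.
    Returns one gt index per anchor (-1 = unassigned background)."""
    ious = np.stack([iou(g, anchors) for g in gt_boxes], axis=1)  # (A, G)
    assign = np.full(len(anchors), -1)
    best_gt = ious.argmax(axis=1)
    max_iou = ious.max(axis=1)
    assign[max_iou >= iou_thresh] = best_gt[max_iou >= iou_thresh]  # rule 1
    for g in range(len(gt_boxes)):                                  # rule 2
        assign[ious[:, g].argmax()] = g
    return assign

anchors = np.array([[0, 0, 10, 10], [100, 100, 110, 110]], float)
gts = np.array([[0, 0, 10, 10], [95, 95, 105, 105]], float)
result = assign_anchors(anchors, gts)
print(result)
```

In this toy example the second ground-truth box overlaps its best anchor with IoU ≈ 0.14, below the 0.5 threshold, yet rule 2 still assigns it so the box contributes to the loss.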

For the triplet loss, track identities are assigned to anchors with a stricter IoU >= 0.7 rule, which improves tracking results. Only anchors matched to a track identity are used to produce triplets, and triplets are always formed within the same clip.
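A compact numpy sketch of the BatchHard triplet strategy (from Hermans et al.'s "In Defense of the Triplet Loss"; the margin value below is illustrative, not taken from the RetinaTrack paper):

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, track_ids, margin=0.2):
    """BatchHard: for each anchor embedding, pick the hardest positive
    (farthest same-ID sample) and hardest negative (closest different-ID
    sample) in the batch, then apply the triplet margin loss."""
    # pairwise Euclidean distance matrix, shape (N, N)
    d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    same = track_ids[:, None] == track_ids[None, :]
    losses = []
    for i in range(len(embeddings)):
        pos = same[i].copy(); pos[i] = False   # positives, excluding self
        neg = ~same[i]
        if not pos.any() or not neg.any():
            continue                           # need both to form a triplet
        hardest_pos = d[i][pos].max()
        hardest_neg = d[i][neg].min()
        losses.append(max(hardest_pos - hardest_neg + margin, 0.0))
    return float(np.mean(losses)) if losses else 0.0

embs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
ids = np.array([0, 0, 1, 1])
print(batch_hard_triplet_loss(embs, ids))
```

Mining the hardest pair inside each batch, rather than averaging over all triplets, is what makes the loss informative even when most identity pairs are already well separated, as in the two-cluster example above, where the loss is zero.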

Training runs on Google TPUs (v3) [30] using SGD with momentum 0.9 and weight decay 0.0004. Images are resized to 1024 × 1024; to fit this resolution in TPU memory, all training runs use mixed-precision training with the bfloat16 type.

Except for the embedding branch, all other parts are pre-trained on the COCO dataset. The learning rate is warmed up linearly to a base rate of 0.001 over the first 1000 steps, then annealed with a cosine schedule for the remaining 9k steps. Data augmentation consists of random horizontal flipping and random cropping, and all batch-norm layers are updated independently during training.
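The warmup-plus-cosine schedule described above can be written as a small pure-Python function (assuming, as a reading of the text, that the cosine decays to zero over the 9k post-warmup steps):

```python
import math

def learning_rate(step, base_lr=0.001, warmup_steps=1000, total_steps=10000):
    """Linear warmup to base_lr over the first 1000 steps, then cosine
    annealing toward zero over the remaining 9k steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, `learning_rate(500)` is halfway up the ramp at 0.0005, `learning_rate(1000)` hits the base rate of 0.001, and the rate decays smoothly to zero by step 10000.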


Origin: blog.csdn.net/xihuanniNI/article/details/124686554