Object detection label assignment: A Dual Weighting Label Assignment Scheme

In object detection, to be compatible with NMS, a good detector should predict boxes with both high classification scores and precise locations. However, if all training samples are treated equally, a misalignment arises between the two heads: the position with the highest classification score is usually not the best position for regressing the object boundaries. This misalignment degrades detector performance, especially under high-IoU metrics. Soft label assignment addresses this by weighting the loss of each training sample in a soft manner, trying to improve the consistency between the cls and reg heads.
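
To make the misalignment concrete, here is a small hypothetical example (mine, not from the paper): when the highest-scoring box is not the best-localized one, NMS keeps it and suppresses the better box, so the final detection ends up with a lower IoU than it could have.

# Hypothetical toy example (not from the paper): NMS keeps the highest-scoring box,
# even though a lower-scoring box localizes the GT much better.
import torch
from torchvision.ops import nms, box_iou

gt = torch.tensor([[10., 10., 50., 50.]])
boxes = torch.tensor([[12., 12., 48., 58.],    # highest cls score, mediocre localization
                      [10., 10., 50., 51.]])   # lower cls score, near-perfect localization
scores = torch.tensor([0.9, 0.7])

keep = nms(boxes, scores, iou_threshold=0.6)   # the two boxes overlap, only one survives
print(keep)                                    # tensor([0]) -> the higher-scoring box wins
print(box_iou(boxes[keep], gt))                # ~0.72, while the suppressed box had ~0.98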

The paper introduced here starts from the loss weights: it proposes DW (Dual Weighting) label assignment, which constructs Wpos and Wneg in a finer-grained manner and thereby helps the network improve the consistency between cls score and IoU.

Paper address, GitHub

Motivation and Framework

The idea of this paper is relatively simple. The author argues that an anchor with a high cls score should also have a high IoU, and an anchor with a low IoU should accordingly have its cls score pulled down. If cls score and IoU are kept consistent in this way, the box output after NMS can have the optimal IoU. Previous papers mostly emphasized the weights of the positive sample points in the loss: Wpos is applied to the IoU and cls terms separately to influence the score, but some positive sample points still have low IoU. The author therefore argues that a more fine-grained Wneg should be designed to suppress the cls scores of these positive sample points.

To this end, the author designs a Positive Weighting Function, a Negative Weighting Function, Box Refinement, and a positive sample selection method.

Proposed Method

1. Positive Weighting Function
A high cls score and a high IoU are necessary and sufficient conditions for a pos prediction box. Anchors satisfying both conditions are more likely to be kept as pos predictions at test time, so they should carry higher importance during training. From this point of view, Wpos should be positively correlated with a ranking score built from IoU and cls. The author first defines a consistency measure t, as follows:
[formula image: consistency measure t]
where s denotes the cls score, IoU denotes the IoU between the predicted box and the GT, and β balances the two. To encourage larger Wpos gaps between different anchors, Wpos is then built from t:
[formula image: Wpos as a function of t]
where µ is a hyperparameter controlling the relative gap between different pos weights.
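
Since the formula images are not reproduced, here is a hedged LaTeX sketch of the two formulas as the code below implements them (my reading of the implementation, not necessarily the paper's exact notation):

t = s \cdot e^{-\beta\,(1-\mathrm{IoU})}, \qquad \beta = 5

W_{pos} \propto e^{\mu t}\cdot t \cdot C_{center} \quad \text{(normalized over the anchors of each GT)}

Here s is the cls score, C_{center} is the soft center prior (center_prior_weights in the code), and µ corresponds to the constant 5 in the numerator of the code snippet.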

# p_loc: localization quality term, exp(-beta * (1 - IoU)) with beta = 5
p_loc = torch.exp(-reg_loss * 5)
# p_cls: joint cls score of each anchor for the matched GT category
p_cls = (cls_score * objectness)[:, gt_labels]
p_pos = p_cls * p_loc  # consistency measure t

# Wpos: exp(mu * t) * t, gated by the soft center prior and normalized per GT
p_pos_weight = (torch.exp(5 * p_pos) * p_pos * center_prior_weights) / \
               (torch.exp(3 * p_pos) * p_pos *
                center_prior_weights).sum(0, keepdim=True).clamp(min=EPS)

The code is shown above. reg_loss is the IoU loss, reg_loss = 1 - iou, so p_loc is an exponential with base e: when reg_loss tends to 0 (i.e., IoU tends to 1), p_loc tends to 1, indicating that the anchor regresses well; β is set to 5. p_pos is the t in the paper, and p_pos_weight is Wpos. center_prior_weights marks the positive sample points: in the code a Gaussian kernel is set for each category, and after plugging in the GT of each instance, every anchor gets a score; the closer an anchor is to the GT center, the higher the score, decaying in a Gaussian manner. The Gaussian kernel is a constant and is not learned. center_prior_weights only assigns scores to anchors inside the GT box, and unlike FCOS it is soft. In addition, the code normalizes Wpos: (torch.exp(3*p_pos) * p_pos * center_prior_weights).sum(0, keepdim=True) sums over the anchors of each GT. The reason the numerator uses exp(5*p_pos) while the denominator uses exp(3*p_pos) is unknown; perhaps slightly enlarging Wpos beyond a strict normalization works better.
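
To make the tensor shapes concrete, here is a minimal self-contained sketch of the same Wpos computation on random inputs (the shapes are my assumptions; this is not the original mmdet code):

# Minimal sketch of the Wpos computation with dummy data (assumed shapes, not mmdet code).
import torch

EPS = 1e-12
num_anchors, num_classes, num_gts = 6, 80, 2

cls_score = torch.rand(num_anchors, num_classes)            # sigmoid cls scores
objectness = torch.rand(num_anchors, 1)                     # objectness/centerness branch
gt_labels = torch.randint(0, num_classes, (num_gts,))       # class id of each GT
reg_loss = torch.rand(num_anchors, num_gts)                 # 1 - IoU(anchor pred, GT)
center_prior_weights = torch.rand(num_anchors, num_gts)     # soft "inside GT" prior

p_loc = torch.exp(-reg_loss * 5)                            # -> 1 as IoU -> 1
p_cls = (cls_score * objectness)[:, gt_labels]              # (num_anchors, num_gts)
p_pos = p_cls * p_loc                                       # consistency measure t

p_pos_weight = (torch.exp(5 * p_pos) * p_pos * center_prior_weights) / \
               (torch.exp(3 * p_pos) * p_pos *
                center_prior_weights).sum(0, keepdim=True).clamp(min=EPS)
print(p_pos_weight.shape)                                   # torch.Size([6, 2])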

2. Negative Weighting Function
The author believes that although Wpos can force anchors to have high cls scores and large IoUs, it cannot distinguish the degree of inconsistency among anchors. To provide more discriminative supervision, the author factorizes Wneg into the Probability of being a Negative Sample and the Importance Conditioned on being a Negative Sample, giving Wneg a more fine-grained form.

  • Probability of being a Negative Sample
    A box with a high cls score may still be a false detection if its IoU is low. COCO evaluates AP over the IoU interval 0.5~0.95, so Pneg should satisfy the following rules:
    [formula image: constraints on Pneg as a function of IoU]
    Any monotonically decreasing function satisfies these constraints. For simplicity, the author instantiates Pneg with the following function (a hedged reconstruction is sketched after the code below):
    [formula image: Pneg as a function of IoU]
    The function passes through the two points (0.5, 1) and (0.95, 0); that is, Pneg = 1 when IoU < 0.5 and Pneg = 0 when IoU > 0.95, as shown in the figure below.
    [figure: Pneg curve over IoU]

  • Importance Conditioned on being a Negative Sample
    Neg prediction boxes with higher ranking scores are more important than those with lower ranking scores, because they are the hard samples for network optimization. The importance of a neg sample, denoted Ineg, should therefore be a function of the cls score:
    [formula image: Ineg as a function of the cls score s]
    where γ2 controls how much priority is given to higher-scoring negatives. The final Wneg is given by:
    [formula image: Wneg = Ineg * Pneg]
    The specific code for Wneg = Ineg * Pneg is given below. The lambda t implements the second line of the Pneg formula above; in the code k and b come out to -1.33 and 1.33 respectively, and γ2 = 2. The weight is then normalized per GT: for all anchors inside a GT, the IoU with that GT is mapped through t to x_, the minimum t1 and maximum t2 of x_ are found, and x_ is normalized to y. The purpose is to spread the gaps between the x_ values more evenly over 0~1. Finally, the result is multiplied by s^γ2.

alpha = 2
t = lambda x: 1 / (0.5 ** alpha - 1) * x ** alpha - 1 / (0.5 ** alpha - 1)

if num_gts > 0:
    def normalize(x):
        # map the IoUs of anchors inside one GT through t, then stretch them to [0, 1]
        x_ = t(x)
        t1 = x_.min()
        t2 = min(1., x_.max())
        y = (x_ - t1 + EPS) / (t2 - t1 + EPS)
        y[x < 0.5] = 1  # anchors with IoU < 0.5 are definite negatives: Pneg = 1
        return y

    for instance_idx in range(num_gts):
        idxs = inside_gt_bbox_mask[:, instance_idx]
        if idxs.any():
            neg_metrics[idxs, instance_idx] = normalize(ious[idxs, instance_idx])
    foreground_idxs = torch.nonzero(neg_metrics != -1, as_tuple=True)
    p_neg_weight[foreground_idxs[0],
                 gt_labels[foreground_idxs[1]]] = neg_metrics[foreground_idxs]

p_neg_weight = p_neg_weight.detach()
neg_avg_factor = (1 - p_neg_weight).sum()
p_neg_weight = p_neg_weight * joint_conf ** 2   # Wneg = Pneg * Ineg, with Ineg = s^2
neg_loss = p_neg_weight * F.binary_cross_entropy(
    joint_conf, torch.zeros_like(joint_conf), reduction='none')
neg_loss = neg_loss.sum()
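
Putting the pieces together, here is a hedged LaTeX sketch of the negative weight as the code above implements it (my reconstruction, since the formula images are missing):

P_{neg} = \begin{cases} 1, & \mathrm{IoU} < 0.5 \\ \dfrac{\mathrm{IoU}^{\gamma_1} - 1}{0.5^{\gamma_1} - 1}\ \text{(then min--max normalized per GT)}, & \mathrm{IoU} \ge 0.5 \end{cases}

I_{neg} = s^{\gamma_2}, \qquad W_{neg} = P_{neg} \cdot I_{neg}

with γ1 = alpha = 2 and γ2 = 2 in the code, where s is the joint cls score (joint_conf).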

3. Box Refinement
To regress the box coordinates more precisely, the author designs a learnable prediction module that generates an offset point for each edge and uses these offset points to refine the coordinates. In the code, reg_feat is the regression feature; the reg_offset convolution turns it into a reg_offset map of size (b, 8, h, w), which is combined with bbox_pred_d to form 4 offset points, and decoded_bbox_preds together with reg_offset are fed into deform_sampling. deform_sampling is a deformable convolution, but its weight is a constant, so here it only performs bilinear interpolation/sampling.

def deform_sampling(self, feat, offset):
    # This is really just a bilinear interpolation: the weight is constant, so there is
    # nothing to optimize; it only shifts feat along the offset directions.
    # feat: input, shape (N, Cin, Hin, Win); offset: coordinate offsets for the deformable
    # conv, shape (N, 2*Hf*Wf, Hout, Wout); weight: conv kernel, shape (Cout, Cin, Hf, Wf).
    b, c, h, w = feat.shape
    weight = feat.new_ones(c, 1, 1, 1)
    y = deform_conv2d(feat, offset, weight, 1, 0, 1, c, c)
    return y
        
if self.with_reg_refine:
    reg_dist = bbox_pred.permute(0, 2, 3, 1).reshape(-1, 4)
    points = self.prior_generator.single_level_grid_priors(
        (h, w), self.strides.index(stride), dtype=x.dtype, device=x.device)
    points = points.repeat(b, 1)
    # decode (l, t, r, b) distances into absolute boxes on the feature map
    decoded_bbox_preds = distance2bbox(points, reg_dist).reshape(b, h, w, 4).permute(0, 3, 1, 2)
    reg_offset = self.reg_offset(reg_feat)
    bbox_pred_d = bbox_pred / stride
    # build one (dy, dx) offset pair per box edge, anchored at the coarse edge positions
    reg_offset = torch.stack([reg_offset[:, 0], reg_offset[:, 1] - bbox_pred_d[:, 0],
                              reg_offset[:, 2] - bbox_pred_d[:, 1], reg_offset[:, 3],
                              reg_offset[:, 4], reg_offset[:, 5] + bbox_pred_d[:, 2],
                              reg_offset[:, 6] + bbox_pred_d[:, 3], reg_offset[:, 7]], 1)
    bbox_pred = self.deform_sampling(decoded_bbox_preds.contiguous(), reg_offset.contiguous())
    bbox_pred = F.relu(bbox2distance(points, bbox_pred.permute(0, 2, 3, 1).reshape(-1, 4))
                       .reshape(b, h, w, 4).permute(0, 3, 1, 2).contiguous())
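
As a small sanity check of the "constant-weight deformable conv is just sampling" point, here is a standalone sketch; note it uses torchvision's deform_conv2d instead of mmcv's so it runs on CPU, and the shapes only mirror the snippet above:

# Standalone sketch using torchvision.ops.deform_conv2d (swapped in for mmcv's op so it
# runs on CPU). With a constant 1x1 depthwise weight and zero offsets, the "convolution"
# only resamples the input, so the output should equal the input.
import torch
from torchvision.ops import deform_conv2d

b, c, h, w = 2, 4, 8, 8                       # c = 4: one channel per decoded box coordinate
feat = torch.randn(b, c, h, w)
offset = torch.zeros(b, 2 * c, h, w)          # one (dy, dx) pair per channel, all zeros here
weight = feat.new_ones(c, 1, 1, 1)            # depthwise 1x1 kernel, nothing to learn

out = deform_conv2d(feat, offset, weight)     # groups / offset groups inferred from shapes
print(torch.allclose(out, feat, atol=1e-5))   # expected: True -> pure bilinear sampling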

Ablation study

[image: ablation study results from the paper]


Origin blog.csdn.net/litt1e/article/details/129860735