About DiMP: Understanding the Pipeline of Learning Discriminative Model Prediction for Tracking

Paper: Learning Discriminative Model Prediction for Tracking (DiMP)

1. Data flow

The sampling method here is almost identical to the one in ATOM; refer to the ATOM write-up for details.

The data sampling setup lives in "/ltr/train_settings/bbreg/atom.py", where it is wrapped as dataset_train = sampler.ATOMSampler(*args).

dataset_train = sampler.ATOMSampler([lasot_train, got10k_train, trackingnet_train, coco_train],
                                   [1, 1, 1, 1],
                                   # samples_per_epoch = batch_size x n. With batch_size = 1, the training log shows
                                   # [train: num_epoch, x / batch_size * n] FPS: 0.3 (4.5)  ,  Loss/total: 43.69654  ,  Loss/segm: 43.69654  ,  Stats/acc: 0.56702
                                   # batch_size * n is the total number of TensorDict samples in one epoch; after LTRLoader packs them
                                   # into batches, "n" TensorDicts remain per epoch, which is 1000 here
                                   samples_per_epoch=1000*settings.batch_size,
                                   max_gap=50,
                                   processing=data_processing_train)

The class LTRLoader() inherits from torch.utils.data.dataloader.DataLoader and is consumed via for i, data in enumerate(loader, 1). Its job is to pack the sampled data according to certain rules and output tensors of shape [n_frames, batch, channels, H, W] (a small sketch after the loader configuration below illustrates the stacking).

loader_train = LTRLoader('train',
                         dataset_train,
                         training=True,
                         batch_size=settings.batch_size,
                         # number of data-loading worker threads
                         num_workers=settings.num_workers,
                         # in DDP mode this argument is not used
                         shuffle=True,
                         # e.g. with 99 samples and batch=30, the last 9 samples do not fill a batch and are dropped
                         drop_last=True,
                         # the samples drawn for one batch are stacked along dimension "1", which roughly corresponds to num_sequences
                         stack_dim=1)
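To see what stack_dim=1 means in practice, here is a minimal sketch that mimics the stacking (it is not the pytracking collate function itself; the shapes are illustrative):

import torch

# Hypothetical shapes: each sampled TensorDict holds 'train_images' of
# shape [n_frames, C, H, W]; here n_frames=3, C=3, H=W=288.
batch = [torch.randn(3, 3, 288, 288) for _ in range(8)]  # batch_size = 8

# stack_dim=1 -> concatenate the batch along dimension 1,
# giving [n_frames, batch, C, H, W] rather than [batch, n_frames, C, H, W].
stacked = torch.stack(batch, dim=1)
print(stacked.shape)  # torch.Size([3, 8, 3, 288, 288])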

The loaders are then wrapped into the trainer. The trainer's job is to feed the data to the network model, run the forward pass, compute the loss, backpropagate through the loss function, and update the network weights.

trainer = LTRTrainer(actor, [loader_train, loader_val], optimizer, settings, lr_scheduler)

The data loading modules are packed as [loader_train, loader_val] because "/ltr/trainers/ltr_trainer.py" needs to run validation every n epochs. The implementation is:

# self.loaders is [loader_train, loader_val]
for loader in self.loaders:
    if self.epoch % loader.epoch_interval == 0:
        # here for i, data in enumerate(loader, 1) feeds the data into the network
        self.cycle_dataset(loader)

2. Data Sampling Method

For DiMP, num_train_frames=3 and num_test_frames=3; ATOM uses only one template and one search frame, i.e. num_train_frames=1, num_test_frames=1. The data sampling function is in "/ltr/data/sampler.py": class ATOMSampler inherits from class TrackingSampler. In other words, the actual sampling logic lives in class TrackingSampler, and class ATOMSampler merely initializes some parameters for its parent class.

① Random extraction of a dataset

The sampler holds a list of weights, e.g. self.p_datasets = [4, 3, 2, 1] for 4 datasets, meaning a dataset is drawn with probability $\frac{4}{4+3+2+1}$, $\frac{3}{4+3+2+1}$, $\frac{2}{4+3+2+1}$ or $\frac{1}{4+3+2+1}$ respectively.

p_total = sum(p_datasets)
self.p_datasets = [x / p_total for x in p_datasets]
# self.datasets is the list [lasot_train, got10k_train, trackingnet_train, coco_train] assembled in ltr/train_settings/bbreg/atom.py
# each dataset is one of the classes under ltr/dataset, e.g. class Lasot(BaseVideoDataset) in lasot.py
dataset = random.choices(self.datasets, self.p_datasets)[0]

② Random extraction of a video sequence within a dataset

First, dataset.get_num_sequences() is used to find out how many video sequences the dataset contains,
and then one video sequence is drawn from them (a video sequence contains many frames):

seq_id = random.randint(0, dataset.get_num_sequences() - 1)

③ Sampling within a video sequence, e.g. the GOT-10k dataset

❶ interval sampling:


In the video sequence, one frame is randomly selected as base_frame;
within a range of max_gap=50 frames before and after base_frame (i.e. ±50), one train_frame and one test_frame are drawn;
if this fails, max_gap is increased by gap_increase and the range is expanded.

Why can drawing fail even within such a large ±50 range?

Because the drawn frames must contain a visible target; sometimes the frames in this range do not contain a visible target, and the search range then has to be enlarged.

❷ causal sampling:


The middle frame of the video sequence is taken as the reference frame, i.e. base_frame;
within a range of max_gap=50 frames before and after base_frame (i.e. ±50), one train_frame and one test_frame are drawn;
if this fails, max_gap is increased by gap_increase and the range is expanded.

❸ default sampling:

There is no reference base_frame; a train_frame is drawn at random from the video sequence;
within a range of max_gap=50 frames before and after train_frame (i.e. ±50), a test_frame is drawn at random;
train_frame and test_frame may coincide;
if this fails, max_gap is increased by gap_increase and the range is expanded (a simplified sketch of this sampling logic follows below).
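The sampling logic above can be summarized in a simplified sketch. This is only an illustration, not the actual TrackingSampler code: the sample_visible_ids helper and the toy visibility mask are assumptions.

import random

def sample_visible_ids(visible, num_ids, min_id, max_id):
    """Pick num_ids frame indices with a visible target inside [min_id, max_id)."""
    valid = [i for i in range(max(min_id, 0), min(max_id, len(visible))) if visible[i]]
    if len(valid) == 0:
        return None
    return random.choices(valid, k=num_ids)

def sample_interval(visible, max_gap=50, gap_increase=5):
    """'interval' mode: pick a base frame, then train/test frames within +/- max_gap."""
    train_ids = test_ids = None
    while train_ids is None or test_ids is None:
        base = sample_visible_ids(visible, 1, 0, len(visible))[0]
        train_ids = sample_visible_ids(visible, 1, base - max_gap, base + max_gap + 1)
        test_ids = sample_visible_ids(visible, 1, base - max_gap, base + max_gap + 1)
        max_gap += gap_increase   # could not sample -> widen the window and retry
    return train_ids, test_ids

visible = [True] * 30 + [False] * 40 + [True] * 30   # toy visibility mask
print(sample_interval(visible))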

④ Sampling in a non-video dataset, such as the COCO dataset

train_frame_ids = [1] * self.num_train_frames
test_frame_ids = [1] * self.num_test_frames
i.e. the same single image is used directly for both train and test frames.

3. Data processing method after sampling

The processing functions inherit from this base class:

class BaseProcessing:
    """
    A processing class is used to process the data before it is passed to the network, and returns the processed data.
    For example, it can crop a search region around the target object, or apply different data augmentations.
    """
    def __init__(self, transform=transforms.ToTensor(), train_transform=None, test_transform=None, joint_transform=None):
        """
        Args:
            transform       : a set of transforms applied to the images.
                              Only used when train_transform or test_transform is None.
            train_transform : a set of transforms applied to the training images.
                              If None, the 'transform' value is used instead.
            test_transform  : a set of transforms applied to the test images.
                              If None, the 'transform' value is used instead.
                              Note: although train_settings defines transform_val, it is passed as transform_test=transform_val,
                              so test_transform and transform_val are the same thing.
            joint_transform : a set of transforms applied 'jointly' to the training and test images.
                              For example, it can convert both the test and train images to grayscale.
        """
        self.transform = {'train': transform if train_transform is None else train_transform,
                          'test': transform if test_transform is None else test_transform,
                          'joint': joint_transform}

    def __call__(self, data: TensorDict):
        raise NotImplementedError

① self.transform['joint'] processing

That is, with a probability of 0.05 the images are converted to grayscale; this transform is applied jointly to the train and test images.

transform_joint = tfm.Transform(tfm.ToGrayscale(probability=0.05))
# self.transform['joint'] here refers to self.transform in the base class
data['train_images'], data['train_anno'] = self.transform['joint'](image=data['train_images'],
                                                                   bbox=data['train_anno'])

② self._get_jittered_box: jitter the bbox

_get_jittered_box generates a jittered bbox. The jitter is applied only to test_anno; train_anno is not perturbed. The amount of jitter is controlled by self.scale_jitter_factor and self.center_jitter_factor, with mode as the selector flag.

self.scale_jitter_factor = {'train': 0, 'test': 0.5}
self.center_jitter_factor = {'train': 0, 'test': 4.5}

The implementation is:

    def _get_jittered_box(self, box, mode):
        """
        Jitter the input box ([x, y, w, h]).
        Args:
            box : the input bbox
            mode: string 'train' or 'test', indicating training or test data
        Returns:
            torch.Tensor: jittered box
        """
        # torch.randn(2) draws two standard-normal samples; the first is used for w, the second for h
        # for 'train', scale_jitter_factor = 0, so jittered_size = box[2:4]
        jittered_size = box[2:4] * torch.exp(torch.randn(2) * self.scale_jitter_factor[mode])
        # the square root of w*h after jittering, multiplied by center_jitter_factor['train' or 'test'], is the maximum center offset
        # for 'train', center_jitter_factor = 0, so max_offset = 0
        max_offset = (jittered_size.prod().sqrt() * torch.tensor(self.center_jitter_factor[mode]).float())
        # jitter the center: [x + w/2 + max_offset * (torch.rand(2)[0] - 0.5),  y + h/2 + max_offset * (torch.rand(2)[1] - 0.5)]
        jittered_center = box[0:2] + 0.5 * box[2:4] + max_offset * (torch.rand(2) - 0.5)
        return torch.cat((jittered_center - 0.5 * jittered_size, jittered_size), dim=0)

Specific effect description:

The width and height $[w, h]$ of test_anno = [x, y, w, h] are randomly scaled by a factor lying roughly in $[\frac{1}{\sqrt{e}}, \sqrt{e}]$ (the factor is log-normally distributed: a factor of 1 is the most likely, while $\sqrt{e}$ and $\frac{1}{\sqrt{e}}$ are the least likely), giving the new width and height $[w_{jittered}, h_{jittered}]$.

The center point $[x+\frac{w}{2},\, y+\frac{h}{2}]$ of test_anno = [x, y, w, h] is then shifted by an offset drawn uniformly from $[-\frac{1}{2}\sqrt{w_{jittered} \times h_{jittered}} \times 4.5,\; +\frac{1}{2}\sqrt{w_{jittered} \times h_{jittered}} \times 4.5]$ in each direction (note that torch.rand is uniform, so every offset in this range is equally likely).

The end result is $[x_{jittered}, y_{jittered}, w_{jittered}, h_{jittered}]$.
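As a concrete illustration of the two jitter formulas above (the box values here are made up), the 'test' settings behave like this:

import torch

scale_jitter_factor = 0.5   # 'test' setting
center_jitter_factor = 4.5  # 'test' setting

box = torch.tensor([100., 150., 60., 80.])  # [x, y, w, h]

# width/height scaled by exp(N(0, 0.5)), i.e. typically within [1/sqrt(e), sqrt(e)]
jittered_size = box[2:4] * torch.exp(torch.randn(2) * scale_jitter_factor)

# maximum center offset proportional to sqrt(w_jittered * h_jittered)
max_offset = jittered_size.prod().sqrt() * center_jitter_factor

# uniform center shift in [-max_offset/2, +max_offset/2] per axis
jittered_center = box[0:2] + 0.5 * box[2:4] + max_offset * (torch.rand(2) - 0.5)

jittered_box = torch.cat((jittered_center - 0.5 * jittered_size, jittered_size))
print(jittered_box)   # e.g. tensor([ -20.3,  80.1,  45.7, 110.2]) -- varies per run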

③ Crop with prutils.jittered_center_crop according to the results above

The input image is cropped to the required size according to $[x_{jittered}, y_{jittered}, w_{jittered}, h_{jittered}]$, the ground-truth label, search_area_factor and output_size, and the coordinates of the bounding box within the cropped image are computed.
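A simplified sketch of the crop-and-remap idea follows. This is not the actual prutils.jittered_center_crop implementation; the function name, signature and the replicate padding are illustrative assumptions, and search_area_factor=5.0 / output_sz=288 are typical DiMP settings.

import torch
import torch.nn.functional as F

def center_crop_and_remap(im, jittered_box, gt_box, search_area_factor=5.0, output_sz=288):
    """im: [C, H, W] float tensor; boxes are [x, y, w, h] in image coordinates."""
    x, y, w, h = jittered_box.tolist()
    crop_sz = int(round((w * h) ** 0.5 * search_area_factor))      # square search area
    cx, cy = x + 0.5 * w, y + 0.5 * h
    x1, y1 = int(round(cx - crop_sz / 2)), int(round(cy - crop_sz / 2))

    # pad so the crop always fits, then cut it out
    pad = crop_sz
    im_pad = F.pad(im.unsqueeze(0), (pad, pad, pad, pad), mode='replicate').squeeze(0)
    crop = im_pad[:, y1 + pad:y1 + pad + crop_sz, x1 + pad:x1 + pad + crop_sz]

    # resize to the network input size
    crop = F.interpolate(crop.unsqueeze(0), size=(output_sz, output_sz),
                         mode='bilinear', align_corners=False).squeeze(0)

    # map the ground-truth box into crop coordinates
    scale = output_sz / crop_sz
    gt = gt_box.clone()
    gt[0:2] = (gt[0:2] - torch.tensor([float(x1), float(y1)])) * scale
    gt[2:4] = gt[2:4] * scale
    return crop, gt

im = torch.rand(3, 480, 640)
crop, box_in_crop = center_crop_and_remap(im, torch.tensor([200., 100., 60., 80.]),
                                          torch.tensor([190., 95., 60., 80.]))
print(crop.shape, box_in_crop)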

④ Another transform

The result of the above steps is transformed again; the parameters are configured in "/ltr/train_settings/bbreg/atom.py".

transform_train = tfm.Transform(tfm.ToTensorAndJitter(0.2),
                                tfm.Normalize(mean=settings.normalize_mean,
                                              std=settings.normalize_std))

tfm.ToTensorAndJitter(0.2) converts the image to a tensor and adjusts its brightness by a factor drawn uniformly from $[0.8, 1.2]$ (see roll() below, which uses np.random.uniform), so every factor in that range is equally likely.

class ToTensorAndJitter(TransformBase):
    """
    Inherits TransformBase. The transform_image and transform_mask methods are dispatched
    in TransformBase via transform_func = getattr(self, 'transform_' + var_name),
    which decides which concrete function is called.
    """
    def __init__(self, brightness_jitter=0.0, normalize=True):
        super().__init__()
        self.brightness_jitter = brightness_jitter
        self.normalize = normalize
    def roll(self):
        return np.random.uniform(max(0, 1 - self.brightness_jitter), 1 + self.brightness_jitter)
    def transform_image(self, img, brightness_factor):
        img = torch.from_numpy(img.transpose((2, 0, 1)))
        # brightness_factor is the random parameter, i.e. the return value of roll()
        return img.float().mul(brightness_factor / 255.0).clamp(0.0, 1.0)

tfm.Normalize(mean=settings.normalize_mean, std=settings.normalize_std) normalizes the image using the mean settings.normalize_mean = [0.485, 0.456, 0.406] and standard deviation settings.normalize_std = [0.229, 0.224, 0.225].

import torchvision.transforms.functional as tvisf

class Normalize(TransformBase):
    def __init__(self, mean, std, inplace=False):
        super().__init__()
        # settings.normalize_mean = [0.485, 0.456, 0.406]
        self.mean = mean
        # settings.normalize_std = [0.229, 0.224, 0.225]
        self.std = std
        # if inplace is False, normalization does not overwrite the input in place
        self.inplace = inplace

    def transform_image(self, image):
        return tvisf.normalize(image, self.mean, self.std, self.inplace)

⑤ Use self._generate_proposals() to add noise to data['test_anno']

data['test_anno'] here is the bbox produced by steps ① ② ③ ④ above.

self.proposal_params = {'min_iou': 0.1, 'boxes_per_frame': 16, 'sigma_factor': [0.01, 0.05, 0.1, 0.2, 0.3]}

    def _generate_proposals(self, box):
        """
        Generate proposals by adding noise to the input box.
        """
        # number of proposals
        num_proposals = self.proposal_params['boxes_per_frame']
        # .get(key, 'default') looks up 'key' and returns 'default' if it does not exist
        proposal_method = self.proposal_params.get('proposal_method', 'default')
        if proposal_method == 'default':
            proposals = torch.zeros((num_proposals, 4))
            gt_iou = torch.zeros(num_proposals)
            for i in range(num_proposals):
                proposals[i, :], gt_iou[i] = prutils.perturb_box(box,
                                                                 min_iou=self.proposal_params['min_iou'],
                                                                 sigma_factor=self.proposal_params['sigma_factor'])

        elif proposal_method == 'gmm':
            proposals, _, _ = prutils.sample_box_gmm(box,
                                                     self.proposal_params['proposal_sigma'],
                                                     num_samples=num_proposals)
            gt_iou = prutils.iou(box.view(1, 4), proposals.view(-1, 4))

        # map to [-1, 1]
        gt_iou = gt_iou * 2 - 1
        return proposals, gt_iou

❶ The first perturbation method (default):

The center representation $[x_{center}, y_{center}, w, h]$ is computed from data['test_anno'].

One sigma (e.g. 0.1) is drawn at random from 'sigma_factor': [0.01, 0.05, 0.1, 0.2, 0.3] and broadcast to all four coordinates (in the pytracking implementation it is additionally scaled by $\sqrt{w \times h}$), giving e.g. perturb_factor = [0.1, 0.1, 0.1, 0.1].

random.gauss(bbox[0], perturb_factor[0]) (and likewise for the other coordinates) draws Gaussian samples with mean $[x_{center}, y_{center}, w, h]$ and standard deviation perturb_factor; in plain words, the unperturbed $[x_{center}, y_{center}, w, h]$ is the most likely outcome, and the result is the perturbed box $[x_{perturbed}, y_{perturbed}, w_{perturbed}, h_{perturbed}]$.

The IoU between $[x_{perturbed}, y_{perturbed}, w_{perturbed}, h_{perturbed}]$ and $[x_{center}, y_{center}, w, h]$ is computed.

The perturbation factor is then reduced: perturb_factor *= 0.9.

This loop runs up to 100 times; as soon as box_iou > min_iou within those 100 attempts, one pair box_per, box_iou is returned.

The whole procedure is carried out 16 times, yielding 16 pairs box_per, box_iou, i.e. num_proposals.
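A hedged sketch of this perturbation loop is given below. It is simplified from what prutils.perturb_box does: the iou_xywh helper is written out explicitly, and the minimum-size handling of the real code is reduced to a simple clamp.

import random
import torch

def iou_xywh(a, b):
    """IoU of two [x, y, w, h] boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def perturb_box(box, min_iou=0.1, sigma_factor=(0.01, 0.05, 0.1, 0.2, 0.3)):
    s = random.choice(sigma_factor)                        # one sigma per proposal
    perturb = [s * (box[2] * box[3]) ** 0.5] * 4           # std scaled by sqrt(w*h)
    cx, cy = box[0] + 0.5 * box[2], box[1] + 0.5 * box[3]
    for _ in range(100):                                   # up to 100 attempts
        cx_p = random.gauss(cx, perturb[0])
        cy_p = random.gauss(cy, perturb[1])
        w_p = max(1.0, random.gauss(box[2], perturb[2]))
        h_p = max(1.0, random.gauss(box[3], perturb[3]))
        box_p = [cx_p - 0.5 * w_p, cy_p - 0.5 * h_p, w_p, h_p]
        box_iou = iou_xywh(box, box_p)
        if box_iou > min_iou:
            break
        perturb = [p * 0.9 for p in perturb]               # shrink the noise and retry
    return torch.tensor(box_p), box_iou

gt = [100.0, 150.0, 60.0, 80.0]
proposals = [perturb_box(gt) for _ in range(16)]           # 16 proposals per frame
print(proposals[0])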

❷ The second perturbation method (Gaussian mixture model):

A Gaussian mixture model approximates a probability distribution with several Gaussian components:
$$p_{GMM}(x) = \sum^{K}_{k=1} p(k)\, p(x|k) = \sum^{K}_{k=1}\alpha_k\, p(x\,|\,\mu_k, \Sigma_k)$$
Here $K$ is the number of components (loosely corresponding to num_proposals = 16), i.e. how many single Gaussians are used; $\alpha_k$ is the weight of the $k$-th Gaussian, with $\sum^{K}_{k=1}\alpha_k = 1$; and $p(x|\mu_k, \Sigma_k)$ is the density of a Gaussian with mean $\mu_k$ and covariance $\Sigma_k$.
Code implementation:
Covariance $\Sigma_k$ (from proposal_sigma):

# proposal_sigma = [[a, b], [c, d]]
center_std = torch.Tensor([s[0] for s in proposal_sigma])
sz_std = torch.Tensor([s[1] for s in proposal_sigma])
# after stacking, the shape is [4, 2]
std = torch.stack([center_std, center_std, sz_std, sz_std])
# 2
num_components = std.shape[-1]
# 4
num_dims = std.numel() // num_components
# (1, 4, 2)
std = std.view(1, num_dims, num_components)

Number of components $K$:

k = torch.randint(num_components, (num_samples,), dtype=torch.int64)
# output shape [16, 4]; std has shape [1, 4, 2]. Since k only takes the values 0 and 1, this indexing
# replicates the last dimension into 16 samples (index = 0 or 1 per sample)
std_samp = std[0, :, k].t()

The center coordinates of the bbox after GMM sampling (strictly speaking, the deviation of the center coordinates, i.e. $x_i - x$), from which $x_i$ is then recovered:

x_centered = std_samp * torch.randn(num_samples, num_dims)
# proposals_rel: the bbox in 'rel' representation (top-left corner plus log of width/height)
proposals_rel = x_centered + mean_box_rel
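Putting the pieces together, here is a minimal sketch that follows the idea of prutils.sample_box_gmm but is simplified: the 'rel' box representation is skipped, and the noise is added directly in [cx, cy, w, h] space, scaled by the box size.

import torch

def sample_box_gmm_simplified(box, proposal_sigma, num_samples=16):
    """box: [x, y, w, h] tensor; proposal_sigma: list of (center_sigma, size_sigma) pairs."""
    center_std = torch.tensor([s[0] for s in proposal_sigma])
    sz_std = torch.tensor([s[1] for s in proposal_sigma])
    std = torch.stack([center_std, center_std, sz_std, sz_std])   # [4, num_components]

    # pick one mixture component per sample, then draw Gaussian noise with that component's std
    k = torch.randint(std.shape[1], (num_samples,))
    std_samp = std[:, k].t()                                      # [num_samples, 4]
    noise = std_samp * torch.randn(num_samples, 4)

    # jitter directly in [cx, cy, w, h] space, with the noise scaled by the box size
    mean = torch.stack([box[0] + 0.5 * box[2], box[1] + 0.5 * box[3], box[2], box[3]])
    scale = torch.stack([box[2], box[3], box[2], box[3]])
    samples = mean + noise * scale

    # back to [x, y, w, h]
    return torch.stack([samples[:, 0] - 0.5 * samples[:, 2],
                        samples[:, 1] - 0.5 * samples[:, 3],
                        samples[:, 2], samples[:, 3]], dim=1)

proposals = sample_box_gmm_simplified(torch.tensor([100., 150., 60., 80.]),
                                      proposal_sigma=[(0.05, 0.05), (0.5, 0.5)])
print(proposals.shape)   # torch.Size([16, 4])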

⑥ Gaussian label generation (ATOM does not have this step)

data["train_anno"]It is used for the network training part, data["test_anno"]but for predcalculating the loss with the network output, that is, the Gaussian label is used as Ground_Truth.

import math
import torch

def gauss_1d(sz, sigma, center, end_pad=0, density=False):
    # k is the coordinate grid minus the center; it determines the output size (think of it as a meshgrid)
    k = torch.arange(-(sz - 1) / 2, (sz + 1) / 2 + end_pad).reshape(1, -1).cuda() - center.reshape(-1, 1)
    gauss = torch.exp(-1.0 / (2 * sigma ** 2) * k ** 2)
    if density:
        gauss /= math.sqrt(2 * math.pi) * sigma
    return gauss


def gauss_2d(sz, sigma, center, end_pad=(0, 0), density=False):
    return gauss_1d(sz[0].item(), sigma[0], center[:, 0], end_pad[0], density).reshape(center.shape[0], 1, -1) * gauss_1d(sz[1].item(), sigma[1], center[:, 1], end_pad[1], density).reshape(center.shape[0], -1, 1)


def gaussian_label_function(target_bb, end_pad_if_even=False, density=False, uni_bias=0):
    # parameters are hard-coded here for demonstration purposes
    uni_bias = torch.Tensor([uni_bias]).cuda()
    sigma_factor = torch.Tensor([0.25, 0.25]).cuda()
    feat_sz = torch.Tensor([18, 18]).cuda()
    image_sz = torch.Tensor([511, 511]).cuda()
    kernel_sz = (4, 4)  # filter (kernel) size; 4 is the default in the DiMP settings
    # target_bb = target_bb.view(-1, *target_bb.shape)

    # target_bb contains 3 bboxes, i.e. num_train_frames or num_test_frames
    # target_bb = [3, 4]    target_center = [3, 2]
    # note: target_bb is in image-crop coordinates and has to be normalized down to feat_sz = [18, 18]
    target_center = target_bb[:, 0:2] + 0.5 * target_bb[:, 2:4]
    target_center_norm = (target_center - image_sz / 2) / image_sz
    center = feat_sz * target_center_norm + 0.5 * torch.Tensor([(kernel_sz[0] + 1) % 2, (kernel_sz[1] + 1) % 2]).cuda()
    sigma = sigma_factor * feat_sz.prod().sqrt().item()

    if end_pad_if_even:
        end_pad = (1, 1)
    else:
        end_pad = (0, 0)
    gauss_label = gauss_2d(feat_sz, sigma, center, end_pad, density=density).float()
    if density:
        sz = (feat_sz + torch.Tensor(end_pad)).prod()
        label = (1.0 - uni_bias) * gauss_label + uni_bias / sz
    else:
        label = gauss_label + uni_bias

    return label
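A short usage sketch of the demo function above (it assumes a CUDA device, since the function hard-codes .cuda(), and uses made-up box coordinates):

# three bboxes in image-crop coordinates [x, y, w, h], e.g. for a 511x511 crop
target_bb = torch.Tensor([[200., 180., 60., 80.],
                          [210., 185., 60., 80.],
                          [190., 175., 60., 80.]]).cuda()
label = gaussian_label_function(target_bb)
print(label.shape)   # torch.Size([3, 18, 18]) -- one 18x18 Gaussian map per bbox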

The effect is as shown in the figure gaussian_label.

⑦ Combined output

data['test_images'] and data['train_images'] are single images; the batch is assembled later by LTRLoader.
data['test_anno'] is additionally passed through self._generate_proposals and turned into:
    data['test_proposals'], containing the 16 perturbed bboxes
    data['proposal_iou'], the IoU of each of the 16 perturbed bboxes with the original box
data['train_anno'] is passed through ⑥ self._generate_label_function(); the resulting label is fed to the network
data['test_anno'] is passed through ⑥ self._generate_label_function(); the resulting label is used to compute the loss

4. Network Model

The model structure in the paper (figure: Dimp_model) covers two parts, training and inference, and is incomplete: the IoUNet part inherited from ATOM is not shown. The backbone is still the ATOM structure that outputs layer2 and layer3 (the red box in the figure); the complete model is the green box. The model has two outputs, iou_pred and the filter's score_pred.
So what is the use of these two outputs?

First, let's recall the classifier in ATOM:
$$f(x; w) = \phi_{2}(w_{2} \ast \phi_{1}(w_1 \ast x))$$
In effect two convolution filters are trained, but the training happens during inference: the filters are convolved with the features to get score_raw, and this score is used to find the target position.
In DiMP, the optimization of $f(x; w)$ is moved into the training step. Put bluntly, it is the same classifier model, just with the module in a different place...
The filter in DiMP is theoretically still $f(x; w)$; in each epoch it is iterated 5 times to produce a set of filters.
At inference, this set of filters is convolved with the features to obtain the analogue of score_raw in ATOM; the classifier output is used in the same way, it is a one-for-one replacement.
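This can be made concrete with a tiny sketch: at inference DiMP simply convolves the predicted filter with the classification features to get a score map. The shapes are illustrative; this is the essential operation behind classify_target / apply_filter, while the real code wraps it in the classifier module and handles batching. The full track() method below shows where this step fits.

import torch
import torch.nn.functional as F

feat = torch.randn(1, 512, 22, 22)          # classification features of one search patch (illustrative shape)
target_filter = torch.randn(1, 512, 4, 4)   # the filter predicted/optimized by the DiMP model

# 'same'-style 2-D convolution: essentially what classify_target / apply_filter do
scores = F.conv2d(feat, target_filter, padding=target_filter.shape[-1] // 2)
print(scores.shape)   # torch.Size([1, 1, 23, 23]) -- a single-channel score map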

    def track(self, image, info: dict=None) -> dict:
        self.debug_info = {}
        self.frame_num += 1
        self.debug_info['frame_num'] = self.frame_num

        # convert the image from numpy format to a tensor
        im = numpy_to_torch(image)

        # ------ Localization ------

        # extract backbone features: calls extract_backbone_features in dimpnet, which returns the layer2 and layer3 feature maps
        backbone_feat, sample_coords, im_patches = self.extract_backbone_features(im,
                                                                                  self.get_centered_sample_pos(),
                                                                                  self.target_scale * self.params.scale_factors,
                                                                                  self.img_sample_sz)
        # extract classification features: calls extract_classification_feat in dimpnet, which returns the layer2/layer3 feature maps after the classification bottleneck
        test_x = self.get_classification_features(backbone_feat)

        # sample location, i.e. the bbox coordinates within the feature map
        sample_pos, sample_scales = self.get_sample_location(sample_coords)

        # compute the classification score. Unlike ATOM, where score_raw is computed from the w1 and w2 trained during inference,
        # here a set of filters has already been predicted by the network and is applied directly
        scores_raw = self.classify_target(test_x)
        # for comparison, ATOM does:
        # scores_raw = self.apply_filter(test_x)
        # self.apply_filter = operation.conv2d(sample_x, self.filter, mode='same')

        # localize the target
        translation_vec, scale_ind, s, flag = self.localize_target(scores_raw, sample_pos, sample_scales)
        new_pos = sample_pos[scale_ind, :] + translation_vec

        # update the position and scale factor
        if flag != 'not_found':
            if self.params.get('use_iou_net', True):
                update_scale_flag = self.params.get('update_scale_when_uncertain', True) or flag != 'uncertain'
                if self.params.get('use_classifier', True):
                    self.update_state(new_pos)
                # in ATOM, the pos computed with w1/w2 is refined with 5 IoUNet iterations to get an accurate bbox
                # in DiMP, the pos computed with the model-predicted filters is refined with 5 IoUNet iterations to get an accurate bbox
                self.refine_target_box(backbone_feat, sample_pos[scale_ind, :], sample_scales[scale_ind], scale_ind, update_scale_flag)
            elif self.params.get('use_classifier', True):
                self.update_state(new_pos, sample_scales[scale_ind])

        # ------ Update ------
        update_flag = flag not in ['not_found', 'uncertain']
        hard_negative = (flag == 'hard_negative')
        learning_rate = self.params.get('hard_negative_learning_rate', None) if hard_negative else None

        if update_flag and self.params.get('update_classifier', False):
            # get the training sample
            train_x = test_x[scale_ind:scale_ind+1, ...]
            # create the target_box and the label for this spatial sample
            target_box = self.get_iounet_box(self.pos, self.target_sz, sample_pos[scale_ind, :], sample_scales[scale_ind])
            # update the classification model
            """
            Use the DiMP network to compute the weights (filters) used to convolve the image.
                hard_negative: hard samples are added to the initial negative samples for training.
                    The resulting self.target_filter is then used in classify_target,
                        i.e. self.target_filter is convolved with the image features to obtain the scores.
            """
            self.update_classifier(train_x, target_box, learning_rate, s[scale_ind, ...])

        # set the tracker position to the IoUNet position
        if self.params.get('use_iou_net', True) and flag != 'not_found' and hasattr(self, 'pos_iounet'):
            self.pos = self.pos_iounet.clone()

        score_map = s[scale_ind, ...]
        max_score = torch.max(score_map).item()

        # visualization and debug flags
        self.search_area_box = torch.cat((sample_coords[scale_ind, [1, 0]], sample_coords[scale_ind, [3, 2]] - sample_coords[scale_ind, [1, 0]] - 1))
        self.debug_info['flag' + self.id_str] = flag
        self.debug_info['max_score' + self.id_str] = max_score
        if self.visdom is not None:
            self.visdom.register(score_map, 'heatmap', 2, 'Score Map' + self.id_str)
            self.visdom.register(self.debug_info, 'info_dict', 1, 'Status')
        elif self.params.debug >= 2:
            show_tensor(score_map, 5, title='Max Score = {:.2f}'.format(max_score))

        # compute the output bounding box; in other words, the pos used to track the next frame
        new_state = torch.cat((self.pos[[1, 0]] - (self.target_sz[[1, 0]] - 1) / 2, self.target_sz[[1, 0]]))

        if self.params.get('output_not_found_box', False) and flag == 'not_found':
            output_state = [-1, -1, -1, -1]
        else:
            output_state = new_state.tolist()

        out = {'target_bbox': output_state}
        return out
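For completeness, here is a hedged sketch of how track() is typically driven frame by frame. The run_on_video helper is made up for illustration, while initialize(frame, {'init_bbox': ...}) and track(frame) follow the pytracking tracker interface.

import cv2

def run_on_video(tracker, video_path, init_box):
    """init_box: [x, y, w, h] of the target in the first frame."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    tracker.initialize(frame, {'init_bbox': init_box})   # build the filter, IoUNet state, ...

    boxes = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        out = tracker.track(frame)                        # the method walked through above
        boxes.append(out['target_bbox'])
    cap.release()
    return boxes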

Loss function and filter update method

  • Loss function
    $$L(f) = \frac{1}{|S_{train}|} \sum_{(x,c) \in S_{train}} \| r(x \ast f, c) \|^2 + \| \lambda f \|^2$$
    where $r(x \ast f, c)$ is the residual, computed as
    $$r(x \ast f, c) = \nu_c \cdot \left[ m_c \cdot (x \ast f) + (1 - m_c)\max(0,\, x \ast f) - y_c \right]$$
    Here $\nu_c$ is the spatial weight and $m_c$ is a mask; both are predicted by convolutions over $dist_{map}$. (A small sketch after this list shows one optimization step on this loss.)
    • $dist_{map}$ is obtained by passing the Euclidean distance dist through a set of basis functions; dist takes the bounding-box center as the coordinate origin and spans the size of the feature map, as shown in the figure. The regression target is
    $$y_c(t) = \sum^{N-1}_{k=0} \phi^y_k\, \rho_k (\| t - c \|)$$
    where the basis function $\rho_k(\| t - c \|)$ is (writing $d = \| t - c \|$):
    $$\rho_k(d) = \begin{cases} \max\!\left(0,\, 1 - \frac{|d - k \Delta|}{\Delta}\right) & k < N-1 \\ \max\!\left(0,\, \min\!\left(1,\, 1 + \frac{d - k \Delta}{\Delta}\right)\right) & k = N-1 \end{cases}$$

    • $\nu_c$ calculation: feed $dist_{map}$ into a convolutional layer:

    num_dist_bins = 100
    self.spatial_weight_predictor = nn.Conv2d(num_dist_bins, 1, kernel_size=1, bias=False)
    spatial_weight = self.spatial_weight_predictor(dist_map)
    
    • $m_c$ calculation: feed $dist_{map}$ into a convolutional layer:

    num_dist_bins = 100
    mask_layers = [nn.Conv2d(num_dist_bins, 1, kernel_size=1, bias=False)]
    if mask_act == 'sigmoid':
        mask_layers.append(nn.Sigmoid())
        init_bias = 0.0
    elif mask_act == 'linear':
        init_bias = 0.5
    else:
        raise ValueError("Unknown activation")
    # the prediction is the m_c in the loss; it is constrained to [0, 1] by the sigmoid
    self.target_mask_predictor = nn.Sequential(*mask_layers)
    target_mask = self.target_mask_predictor(dist_map)
    
  • Gradient descent and update step
    Taylor-expand $L(f)$:
    $$L(f) \approx \tilde{L}(f) = \frac{1}{2}(f - f^{(i)})^T Q^{(i)}(f - f^{(i)}) + (f - f^{(i)})^T \nabla L(f^{(i)}) + L(f^{(i)})$$
    The step size $\alpha$ is
    $$\alpha = \frac{\nabla L(f^{(i)})^T\, \nabla L(f^{(i)})}{\nabla L(f^{(i)})^T\, Q^{(i)}\, \nabla L(f^{(i)})}$$
    and the update rule is
    $$f^{(i+1)} = f^{(i)} - \alpha\, \nabla L(f^{(i)})$$
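To make the update rule concrete, here is a small self-contained sketch of one steepest-descent / Gauss-Newton step on the filter. It is a simplified stand-in for DiMP's steepest-descent optimizer module: $\nu_c$, $m_c$ and $y_c$ are random placeholder tensors here, and $Q^{(i)} \approx 2 J^T J$ is handled through a Jacobian-vector product.

import torch
import torch.nn.functional as F
from torch.autograd.functional import jvp

# illustrative shapes: one training sample, 512 feature channels, a 22x22 map and a 4x4 filter
x = torch.randn(1, 512, 22, 22)        # classification features x of one train frame
y_c = torch.rand(1, 1, 23, 23)         # Gaussian regression target y_c
v_c = torch.ones_like(y_c)             # spatial weight nu_c (predicted from dist_map in DiMP)
m_c = torch.rand_like(y_c)             # target mask m_c in [0, 1] (predicted from dist_map in DiMP)
lam = 0.01                             # regularization lambda
f = torch.zeros(1, 512, 4, 4, requires_grad=True)

def residual(filt):
    s = F.conv2d(x, filt, padding=2)                            # x * f (correlation scores)
    r = v_c * (m_c * s + (1 - m_c) * F.relu(s) - y_c)           # hinge-like residual r(x*f, c)
    return torch.cat([r.flatten(), (lam * filt).flatten()])     # data term plus regularizer

for i in range(5):                                              # 5 optimizer iterations, as in DiMP
    r = residual(f)
    loss = (r ** 2).sum()
    (g,) = torch.autograd.grad(loss, f)                         # gradient of L(f)
    # Gauss-Newton: grad^T Q grad = 2 * ||J grad||^2, with J grad from a Jacobian-vector product
    _, Jg = jvp(residual, (f.detach(),), (g,))
    alpha = (g * g).sum() / (2 * (Jg * Jg).sum()).clamp(min=1e-8)
    f = (f - alpha * g).detach().requires_grad_(True)           # f_{i+1} = f_i - alpha * grad L(f_i)
    print(f'iteration {i}: loss {loss.item():.2f}')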

Source: blog.csdn.net/Soonki/article/details/129478349