PyTorch | YOWO: Principles and Code Explained (Part 1)

Recommended pre-reading: the YOWO paper translation.
YOWO is interesting and very practical, and I happen to need it right now, so I decided to study it. I have always believed that only by reading the source code can you grasp the many details and truly understand an algorithm. My abilities are limited, so if there are mistakes in this post, corrections and discussion are welcome.

To make debugging easier, I slightly modified train.py and saved it as myTrain.py. The code analysis starts from there, but a few setup steps need to be completed first.

1. Preparation before training

1.1 The UCF101-24 dataset

Download the UCF101-24 dataset. The paper uses two datasets, but this code analysis only uses ucf24.

1.2 Pretrained backbone models

There are two: the first is the 2D network, YOLOv2; the other is a 3D network (ResNeXt and ResNet variants). This code analysis uses "resnext-101-kinetics.pth".

1.3 Pretrained YOWO model

The author shares it on Baidu Cloud; password: 95mm.

1.4 Path configuration

Put the backbone weights into the "weights" folder. The ucf24 dataset can live anywhere, but remember to update the paths in ucf24.data accordingly, as follows:
[Figure: path settings in cfg/ucf24.data]

2. Getting ready to train

First, here is the complete code of myTrain.py:

from __future__ import print_function
import sys
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.backends.cudnn as cudnn
from torchvision import datasets, transforms
import dataset
import random
import math
import os
from opts import parse_opts
from utils import *
from cfg import parse_cfg
from region_loss import RegionLoss
from model import YOWO, get_fine_tuning_parameters
import argparse
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", type=str, default="ucf101-24", help="dataset")
    parser.add_argument("--data_cfg", type=str, default="cfg/ucf24.data ", help="data_cfg")
    parser.add_argument("--cfg_file", type=str, default="cfg/ucf24.cfg ", help="cfg_file")
    parser.add_argument("--n_classes", type=int, default=24, help="n_classes")
    parser.add_argument("--backbone_3d", type=str, default="resnext101", help="backbone_3d")
    parser.add_argument("--backbone_3d_weights", type=str, default="weights/resnext-101-kinetics.pth", help="backbone_3d_weights")
    parser.add_argument("--backbone_2d", type=str, default="darknet", help="backbone_3d_weights")
    parser.add_argument("--backbone_2d_weights", type=str, default="weights/yolo.weights", help="backbone_2d_weights")
    parser.add_argument("--freeze_backbone_2d", type=bool, default=True, help="freeze_backbone_2d")
    parser.add_argument("--freeze_backbone_3d", type=bool, default=True, help="freeze_backbone_3d")
    parser.add_argument("--evaluate", type=bool, default=False, help="evaluate")
    parser.add_argument("--begin_epoch", type=int, default=0, help="begin_epoch")
    parser.add_argument("--end_epoch", type=int, default=4, help="evaluate")
    opt = parser.parse_args()
    # opt = parse_opts()
    # which dataset to use
    dataset_use = opt.dataset
    assert dataset_use == 'ucf101-24' or dataset_use == 'jhmdb-21', 'invalid dataset'
    # path for dataset of training and validation
    datacfg = opt.data_cfg
    # path for cfg file
    cfgfile = opt.cfg_file
    data_options = read_data_cfg(datacfg)
    net_options = parse_cfg(cfgfile)[0]
    # obtain list for training and testing
    basepath = data_options['base']
    trainlist = data_options['train']
    testlist = data_options['valid']
    backupdir = data_options['backup']
    # number of training samples
    nsamples = file_lines(trainlist)
    gpus = data_options['gpus']  # e.g. 0,1,2,3
    ngpus = len(gpus.split(','))
    num_workers = int(data_options['num_workers'])
    batch_size = int(net_options['batch'])
    clip_duration = int(net_options['clip_duration'])
    max_batches = int(net_options['max_batches'])
    learning_rate = float(net_options['learning_rate'])
    momentum = float(net_options['momentum'])
    decay = float(net_options['decay'])
    steps = [float(step) for step in net_options['steps'].split(',')]
    scales = [float(scale) for scale in net_options['scales'].split(',')]
    # loss parameters
    loss_options = parse_cfg(cfgfile)[1]
    region_loss = RegionLoss()
    anchors = loss_options['anchors'].split(',')
    region_loss.anchors = [float(i) for i in anchors]
    region_loss.num_classes = int(loss_options['classes'])
    region_loss.num_anchors = int(loss_options['num'])
    region_loss.anchor_step = len(region_loss.anchors) // region_loss.num_anchors
    region_loss.object_scale = float(loss_options['object_scale'])
    region_loss.noobject_scale = float(loss_options['noobject_scale'])
    region_loss.class_scale = float(loss_options['class_scale'])
    region_loss.coord_scale = float(loss_options['coord_scale'])
    region_loss.batch = batch_size
    # Train parameters
    max_epochs = max_batches * batch_size // nsamples + 1
    use_cuda = True
    seed = int(time.time())
    eps = 1e-5
    best_fscore = 0  # initialize best fscore
    # Test parameters
    nms_thresh = 0.4
    iou_thresh = 0.5
    if not os.path.exists(backupdir):
        os.mkdir(backupdir)
    # set the random seed
    torch.manual_seed(seed)
    if use_cuda:
        os.environ['CUDA_VISIBLE_DEVICES'] = gpus
        torch.cuda.manual_seed(seed)
    # Create model
    model = YOWO(opt)
    model = model.cuda()
    model = nn.DataParallel(model, device_ids=None)  # in multi-gpu case
    model.seen = 0
    print(model)
    parameters = get_fine_tuning_parameters(model, opt)
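    # note: lr is scaled by 1/batch_size and weight decay by batch_size,
    # presumably because the region loss is summed (not averaged) over the batch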
    optimizer = optim.SGD(parameters, lr=learning_rate / batch_size, momentum=momentum, dampening=0,
                          weight_decay=decay * batch_size)
    kwargs = {'num_workers': num_workers, 'pin_memory': True} if use_cuda else {}
    # Load resume path if necessary
    # if opt.resume_path:
    #     print("===================================================================")
    #     print('loading checkpoint {}'.format(opt.resume_path))
    #     checkpoint = torch.load(opt.resume_path)
    #     opt.begin_epoch = checkpoint['epoch']
    #     best_fscore = checkpoint['fscore']
    #     model.load_state_dict(checkpoint['state_dict'])
    #     optimizer.load_state_dict(checkpoint['optimizer'])
    #     model.seen = checkpoint['epoch'] * nsamples
    #     print("Loaded model fscore: ", checkpoint['fscore'])
    #     print("===================================================================")
    region_loss.seen = model.seen
    processed_batches = model.seen // batch_size
    init_width = int(net_options['width'])
    init_height = int(net_options['height'])
    init_epoch = model.seen // nsamples
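    # step-wise LR schedule: lr is multiplied by scales[i] once `batch` reaches steps[i]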
    def adjust_learning_rate(optimizer, batch):
        lr = learning_rate
        for i in range(len(steps)):
            scale = scales[i] if i < len(scales) else 1
            if batch >= steps[i]:
                lr = lr * scale
                if batch == steps[i]:
                    break
            else:
                break
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr / batch_size
        return lr
    def train(epoch):
        global processed_batches
        t0 = time.time()
        cur_model = model.module
        region_loss.l_x.reset()
        region_loss.l_y.reset()
        region_loss.l_w.reset()
        region_loss.l_h.reset()
        region_loss.l_conf.reset()
        region_loss.l_cls.reset()
        region_loss.l_total.reset()
        train_loader = torch.utils.data.DataLoader(
            dataset.listDataset(basepath, trainlist, dataset_use=dataset_use, shape=(init_width, init_height),
                                shuffle=True,
                                transform=transforms.Compose([
                                    transforms.ToTensor(),
                                ]),
                                train=True,
                                seen=cur_model.seen,
                                batch_size=batch_size,
                                clip_duration=clip_duration,
                                num_workers=num_workers),
            batch_size=batch_size, shuffle=False, **kwargs)
        lr = adjust_learning_rate(optimizer, processed_batches)
        logging('training at epoch %d, lr %f' % (epoch, lr))
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            adjust_learning_rate(optimizer, processed_batches)
            processed_batches = processed_batches + 1
            if use_cuda:
                data = data.cuda()
            optimizer.zero_grad()
            output = model(data)
            region_loss.seen = region_loss.seen + data.data.size(0)
            loss = region_loss(output, target)
            loss.backward()
            optimizer.step()
            # from time to time, reset the average meters to see improvements
            if processed_batches % 500 == 0:
                region_loss.l_x.reset()
                region_loss.l_y.reset()
                region_loss.l_w.reset()
                region_loss.l_h.reset()
                region_loss.l_conf.reset()
                region_loss.l_cls.reset()
                region_loss.l_total.reset()
        t1 = time.time()
        logging('trained with %f samples/s' % (len(train_loader.dataset) / (t1 - t0)))
        print('')
    def test(epoch):
        def truths_length(truths):
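            # labels are zero-padded up to 50 rows; count ground truths until the first empty row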
            for i in range(50):
                if truths[i][1] == 0:
                    return i

        test_loader = torch.utils.data.DataLoader(
            dataset.listDataset(basepath, testlist, dataset_use=dataset_use, shape=(init_width, init_height),
                                shuffle=False,
                                transform=transforms.Compose([
                                    transforms.ToTensor()
                                ]), train=False),
            batch_size=batch_size, shuffle=False, **kwargs)

        num_classes = region_loss.num_classes
        anchors = region_loss.anchors
        num_anchors = region_loss.num_anchors
        conf_thresh_valid = 0.005
        total = 0.0
        proposals = 0.0
        correct = 0.0
        fscore = 0.0
        correct_classification = 0.0
        total_detected = 0.0
        nbatch = file_lines(testlist) // batch_size
        logging('validation at epoch %d' % (epoch))
        model.eval()
        for batch_idx, (frame_idx, data, target) in enumerate(test_loader):
            if use_cuda:
                data = data.cuda()
            with torch.no_grad():
                output = model(data).data
                all_boxes = get_region_boxes(output, conf_thresh_valid, num_classes, anchors, num_anchors, 0, 1)
                for i in range(output.size(0)):
                    boxes = all_boxes[i]
                    boxes = nms(boxes, nms_thresh)
                    if dataset_use == 'ucf101-24':
                        detection_path = os.path.join('ucf_detections', 'detections_' + str(epoch), frame_idx[i])
                        current_dir = os.path.join('ucf_detections', 'detections_' + str(epoch))
                        if not os.path.exists('ucf_detections'):
                            os.mkdir('ucf_detections')
                        if not os.path.exists(current_dir):
                            os.mkdir(current_dir)
                    else:
                        detection_path = os.path.join('jhmdb_detections', 'detections_' + str(epoch), frame_idx[i])
                        current_dir = os.path.join('jhmdb_detections', 'detections_' + str(epoch))
                        if not os.path.exists('jhmdb_detections'):
                            os.mkdir('jhmdb_detections')
                        if not os.path.exists(current_dir):
                            os.mkdir(current_dir)
                    with open(detection_path, 'w+') as f_detect:
                        for box in boxes:
                            x1 = round(float(box[0] - box[2] / 2.0) * 320.0)
                            y1 = round(float(box[1] - box[3] / 2.0) * 240.0)
                            x2 = round(float(box[0] + box[2] / 2.0) * 320.0)
                            y2 = round(float(box[1] + box[3] / 2.0) * 240.0)
                            det_conf = float(box[4])
                            for j in range((len(box) - 5) // 2):
                                cls_conf = float(box[5 + 2 * j].item())
                                if type(box[6 + 2 * j]) == torch.Tensor:
                                    cls_id = int(box[6 + 2 * j].item())
                                else:
                                    cls_id = int(box[6 + 2 * j])
                                prob = det_conf * cls_conf

                                f_detect.write(
                                    str(int(box[6]) + 1) + ' ' + str(prob) + ' ' + str(x1) + ' ' + str(y1) + ' ' + str(
                                        x2) + ' ' + str(y2) + '\n')
                    truths = target[i].view(-1, 5)
                    num_gts = truths_length(truths)
                    total = total + num_gts
                    for i in range(len(boxes)):
                        if boxes[i][4] > 0.25:
                            proposals = proposals + 1
                    for i in range(num_gts):
                        box_gt = [truths[i][1], truths[i][2], truths[i][3], truths[i][4], 1.0, 1.0, truths[i][0]]
                        best_iou = 0
                        best_j = -1
                        for j in range(len(boxes)):
                            iou = bbox_iou(box_gt, boxes[j], x1y1x2y2=False)
                            if iou > best_iou:
                                best_j = j
                                best_iou = iou

                        if best_iou > iou_thresh:
                            total_detected += 1
                            if int(boxes[best_j][6]) == box_gt[6]:
                                correct_classification += 1
                        if best_iou > iou_thresh and int(boxes[best_j][6]) == box_gt[6]:
                            correct = correct + 1
                precision = 1.0 * correct / (proposals + eps)
                recall = 1.0 * correct / (total + eps)
                fscore = 2.0 * precision * recall / (precision + recall + eps)
                logging(
                    "[%d/%d] precision: %f, recall: %f, fscore: %f" % (batch_idx, nbatch, precision, recall, fscore))
        classification_accuracy = 1.0 * correct_classification / (total_detected + eps)
        locolization_recall = 1.0 * total_detected / (total + eps)
        print("Classification accuracy: %.3f" % classification_accuracy)
        print("Locolization recall: %.3f" % locolization_recall)
        return fscore
    if opt.evaluate:
        logging('evaluating ...')
        test(0)
    else:
        for epoch in range(opt.begin_epoch, opt.end_epoch + 1):
            # Train the model for 1 epoch
            train(epoch)
            # Validate the model
            fscore = test(epoch)
            is_best = fscore > best_fscore
            if is_best:
                print("New best fscore is achieved: ", fscore)
                print("Previous fscore was: ", best_fscore)
                best_fscore = fscore
            # Save the model to backup directory
            state = {
                'epoch': epoch,
                'state_dict': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'fscore': fscore
            }
            save_checkpoint(state, is_best, backupdir, opt.dataset, clip_duration)
            logging('Weights are saved to backup directory: %s' % (backupdir))

2.1 Basic settings

    parser = argparse.ArgumentParser()
	......
    # path for dataset of training and validation
    datacfg = opt.data_cfg
    # path for cfg file
    cfgfile = opt.cfg_file

This sets up the dataset and cfg paths, selects which backbone networks to use, whether to run in evaluation mode, and so on.

2.2 Parsing the data/cfg files

    data_options = read_data_cfg(datacfg)
    net_options = parse_cfg(cfgfile)[0]

First, look at read_data_cfg(datacfg). Its complete code is as follows:

def read_data_cfg(datacfg):
    options = dict()
    options['gpus'] = '0'
    options['num_workers'] = '0'
    with open(datacfg, 'r') as fp:
        lines = fp.readlines()

    for line in lines:
        line = line.strip()
        if line == '':
            continue
        key,value = line.split('=')
        key = key.strip()
        value = value.strip()
        options[key] = value
    return options

This function parses cfg/ucf24.data to obtain the paths of the training and test lists (plus a few defaults). The parsed result looks like this:
[Figure: the parsed data_options dictionary]
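For reference, here is a plausible cfg/ucf24.data; the paths are illustrative and depend on where you placed the dataset, but the script expects at least the base, train, valid and backup keys (gpus and num_workers default to '0'):

base = /path/to/ucf24
train = /path/to/ucf24/trainlist.txt
valid = /path/to/ucf24/testlist.txt
backup = backup
gpus = 0
num_workers = 4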
The line net_options = parse_cfg(cfgfile)[0] parses the network configuration file, cfg/ucf24.cfg. The complete code of parse_cfg is as follows:

def parse_cfg(cfgfile):
    blocks = []
    fp = open(cfgfile, 'r')
    block =  None
    line = fp.readline()
    while line != '':
        line = line.rstrip()
        if line == '' or line[0] == '#':
            line = fp.readline()
            continue        
        elif line[0] == '[':
            if block:
                blocks.append(block)
            block = dict()
            block['type'] = line.lstrip('[').rstrip(']')
            # set default value
            if block['type'] == 'convolutional':
                block['batch_normalize'] = 0
        else:
            key,value = line.split('=')
            key = key.strip()
            if key == 'type':
                key = '_type'
            value = value.strip()
            block[key] = value
        line = fp.readline()

    if block:
        blocks.append(block)
    fp.close()
    return blocks

The parsed result is as follows:
[Figure: the blocks parsed from cfg/ucf24.cfg]
The parsed result should be viewed as two parts:

  • The first part is the training configuration of the whole network, such as the input size, the learning rate, and the learning-rate decay schedule.
  • The second part is mainly the YOLOv2 configuration, since its type is 'region'. The anchors are preset box sizes: 10 values in total, i.e., 5 anchors of 2 values each, matching num=5. The remaining object_scale, noobject_scale, class_scale and coord_scale should be the penalty weights of the loss function, which we will verify when we reach the loss code.

Since parse_cfg returns blocks as a list, net_options (index 0) holds the training configuration of the whole network, while index 1 holds the region-loss options. A sketch of what such a cfg file contains is shown below.
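For intuition, here is a plausible skeleton of cfg/ucf24.cfg covering only the keys that myTrain.py reads; the concrete values are illustrative, not the repository's actual settings:

[net]
batch=16
clip_duration=16
width=224
height=224
learning_rate=0.0001
momentum=0.9
decay=0.0005
steps=20000,40000
scales=0.5,0.5
max_batches=60000

[region]
anchors=0.70,1.23, 1.98,3.11, 3.27,5.41, 5.04,8.05, 8.67,10.23
classes=24
num=5
object_scale=5
noobject_scale=1
class_scale=1
coord_scale=1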

2.3 Saving the training and test configuration

    basepath = data_options['base']
    trainlist = data_options['train']
    testlist = data_options['valid']
    backupdir = data_options['backup']
    # number of training samples
    nsamples = file_lines(trainlist)
    gpus = data_options['gpus']  # e.g. 0,1,2,3
    ngpus = len(gpus.split(','))
    num_workers = int(data_options['num_workers'])

    batch_size = int(net_options['batch'])
    clip_duration = int(net_options['clip_duration'])
    max_batches = int(net_options['max_batches'])
    learning_rate = float(net_options['learning_rate'])
    momentum = float(net_options['momentum'])
    decay = float(net_options['decay'])
    steps = [float(step) for step in net_options['steps'].split(',')]
    scales = [float(scale) for scale in net_options['scales'].split(',')]

This part simply stores the values parsed from the data/cfg files above into local variables for the subsequent training and testing.

2.4 Loss function parameters

    # loss parameters
    loss_options = parse_cfg(cfgfile)[1]
    region_loss = RegionLoss()
	...
    region_loss.batch = batch_size

Here the main line to analyze is region_loss = RegionLoss(). RegionLoss is a class in region_loss.py, shown in full below.

class RegionLoss(nn.Module):
    # for our model anchors has 10 values and number of anchors is 5
    # parameters: 24, 10 float values, 24, 5
    def __init__(self, num_classes=0, anchors=[], batch=16, num_anchors=1):
        super(RegionLoss, self).__init__()
        self.num_classes = num_classes
        self.batch = batch
        self.anchors = anchors
        self.num_anchors = num_anchors
        self.anchor_step = len(anchors)//num_anchors    # each anchor has 2 parameters
        self.coord_scale = 1
        self.noobject_scale = 1
        self.object_scale = 5
        self.class_scale = 1
        self.thresh = 0.6
        self.seen = 0
        self.l_x = AverageMeter()
        self.l_y = AverageMeter()
        self.l_w = AverageMeter()
        self.l_h = AverageMeter()
        self.l_conf = AverageMeter()
        self.l_cls = AverageMeter()
        self.l_total = AverageMeter()


    def forward(self, output, target):
        # output : B*A*(4+1+num_classes)*H*W
		......
        return loss

Here we only analyze the initializer __init__; forward will be explained when we get to the loss computation. Most of these parameters have already been covered above. What self.seen = 0 means is not clear yet; it will be explained later. AverageMeter is a class in utils.py that stores the current value and keeps a running average, which means the x, y, w, h, conf, cls and total components of the loss are accumulated and averaged as training proceeds.
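For reference, a minimal sketch of what such an AverageMeter typically looks like; this is the common PyTorch pattern, and the exact fields in utils.py may differ slightly:

class AverageMeter(object):
    """Stores the most recent value and keeps a running average."""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0    # most recent value
        self.avg = 0    # running average
        self.sum = 0    # sum of all values seen so far
        self.count = 0  # number of values seen so far

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count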

2.5 Training/test parameter settings

 	# Train parameters
    max_epochs = max_batches * batch_size // nsamples + 1
    use_cuda = True
    seed = int(time.time())
    eps = 1e-5
    best_fscore = 0  # initialize best fscore
    # Test parameters
    nms_thresh = 0.4
    iou_thresh = 0.5
    if not os.path.exists(backupdir):
        os.mkdir(backupdir)
    # set the random seed
    torch.manual_seed(seed)
    if use_cuda:
        os.environ['CUDA_VISIBLE_DEVICES'] = gpus
        torch.cuda.manual_seed(seed)

The author never seems to use the max_epochs variable. Besides setting the random seed, this block also sets the thresholds used at test time by non-maximum suppression (nms_thresh) and IoU matching (iou_thresh).

2.6 Loading the model and setting up the optimizer

    # Create model
    model = YOWO(opt)
    model = model.cuda()
    model = nn.DataParallel(model, device_ids=None)  # in multi-gpu case
    model.seen = 0
    print(model)
    parameters = get_fine_tuning_parameters(model, opt)
    optimizer = optim.SGD(parameters, lr=learning_rate / batch_size, momentum=momentum, dampening=0,
                          weight_decay=decay * batch_size)
    kwargs = {'num_workers': num_workers, 'pin_memory': True} if use_cuda else {}
    # Load resume path if necessary
    # if opt.resume_path:
    #     print("===================================================================")
    #     print('loading checkpoint {}'.format(opt.resume_path))
    #     checkpoint = torch.load(opt.resume_path)
    #     opt.begin_epoch = checkpoint['epoch']
    #     best_fscore = checkpoint['fscore']
    #     model.load_state_dict(checkpoint['state_dict'])
    #     optimizer.load_state_dict(checkpoint['optimizer'])
    #     model.seen = checkpoint['epoch'] * nsamples
    #     print("Loaded model fscore: ", checkpoint['fscore'])
    #     print("===================================================================")
    region_loss.seen = model.seen
    processed_batches = model.seen // batch_size
    init_width = int(net_options['width'])
    init_height = int(net_options['height'])
    init_epoch = model.seen // nsamples

The key points here are: whether to train on multiple GPUs; using stochastic gradient descent (SGD) as the optimizer; and whether to resume training after an interruption (if opt.resume_path:). We can now see that model.seen records how much training has been done, counted in samples, so that training can resume exactly where it stopped: the epoch counter, region_loss and processed_batches all pick up seamlessly, as the small worked example below shows.
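A quick sanity check of that bookkeeping, with made-up numbers purely for illustration:

# hypothetical numbers, purely for illustration
nsamples = 10000          # lines in trainlist
batch_size = 16
model_seen = 25000        # samples processed before the interruption

processed_batches = model_seen // batch_size  # 1562 batches already done
init_epoch = model_seen // nsamples           # 2 full epochs already done
# so training resumes in epoch 2, at position 1562 of the LR schedule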
The other central line is model = YOWO(opt), which builds the YOWO model defined in model.py:

class YOWO(nn.Module):
    def __init__(self, opt):
        super(YOWO, self).__init__()
        self.opt = opt 
        ##### 2D Backbone #####
        if opt.backbone_2d == "darknet":
            self.backbone_2d = darknet.Darknet("cfg/yolo.cfg")
            num_ch_2d = 425 # Number of output channels for backbone_2d
        else:
            raise ValueError("Wrong backbone_2d model is requested. Please select\
                              it from [darknet]")
        if opt.backbone_2d_weights:# load pretrained weights on COCO dataset
            self.backbone_2d.load_weights(opt.backbone_2d_weights) 
        ##### 3D Backbone #####
        if opt.backbone_3d == "resnext101":
            self.backbone_3d = resnext.resnext101()
            num_ch_3d = 2048 # Number of output channels for backbone_3d
        elif opt.backbone_3d == "resnet18":
            self.backbone_3d = resnet.resnet18(shortcut_type='A')
            num_ch_3d = 512 # Number of output channels for backbone_3d
        elif opt.backbone_3d == "resnet50":
            self.backbone_3d = resnet.resnet18(shortcut_type='B')
            num_ch_3d = 2048 # Number of output channels for backbone_3d
        elif opt.backbone_3d == "resnet101":
            self.backbone_3d = resnet.resnet18(shortcut_type='B')
            num_ch_3d = 2048 # Number of output channels for backbone_3d
        elif opt.backbone_3d == "mobilenet_2x":
            self.backbone_3d = mobilenet.get_model(width_mult=2.0)
            num_ch_3d = 2048 # Number of output channels for backbone_3d
        elif opt.backbone_3d == "mobilenetv2_1x":
            self.backbone_3d = mobilenetv2.get_model(width_mult=1.0)
            num_ch_3d = 1280 # Number of output channels for backbone_3d
        elif opt.backbone_3d == "shufflenet_2x":
            self.backbone_3d = shufflenet.get_model(groups=3,   width_mult=2.0)
            num_ch_3d = 1920 # Number of output channels for backbone_3d
        elif opt.backbone_3d == "shufflenetv2_2x":
            self.backbone_3d = shufflenetv2.get_model(width_mult=2.0)
            num_ch_3d = 2048 # Number of output channels for backbone_3d
        else:
            raise ValueError("Wrong backbone_3d model is requested. Please select it from [resnext101, resnet101, \
                             resnet50, resnet18, mobilenet_2x, mobilenetv2_1x, shufflenet_2x, shufflenetv2_2x]")
        if opt.backbone_3d_weights:# load pretrained weights on Kinetics-600 dataset
            self.backbone_3d = self.backbone_3d.cuda()
            self.backbone_3d = nn.DataParallel(self.backbone_3d, device_ids=None) # because the pretrained backbone models were saved in DataParallel mode
            pretrained_3d_backbone = torch.load(opt.backbone_3d_weights)
            backbone_3d_dict = self.backbone_3d.state_dict()
            pretrained_3d_backbone_dict = {k: v for k, v in pretrained_3d_backbone['state_dict'].items() if k in backbone_3d_dict} # 1. filter out unnecessary keys
            backbone_3d_dict.update(pretrained_3d_backbone_dict) # 2. overwrite entries in the existing state dict
            self.backbone_3d.load_state_dict(backbone_3d_dict) # 3. load the new state dict
            self.backbone_3d = self.backbone_3d.module # remove the dataparallel wrapper

        ##### Attention & Final Conv #####
        self.cfam = CFAMBlock(num_ch_2d+num_ch_3d, 1024)
        self.conv_final = nn.Conv2d(1024, 5*(opt.n_classes+4+1), kernel_size=1, bias=False)
        self.seen = 0

    def forward(self, input):
        x_3d = input # Input clip
        x_2d = input[:, :, -1, :, :] # Last frame of the clip that is read

        x_2d = self.backbone_2d(x_2d)
        x_3d = self.backbone_3d(x_3d)
        x_3d = torch.squeeze(x_3d, dim=2)

        x = torch.cat((x_3d, x_2d), dim=1)
        x = self.cfam(x)

        out = self.conv_final(x)

        return out

YOWO consists of two parts, a 2D network and a 3D network, as shown below:
[Figure: YOWO architecture]
The main thing to pay attention to is the number of output channels of each backbone. In this example the 2D network is YOLOv2 and the 3D network is ResNeXt-101. YOLOv2 handles detection; its default output has 425 channels because 5 × (80 + 4 + 1): 5 anchors, 80 COCO classes, and 4 box coordinates plus 1 confidence per anchor. Under if opt.backbone_3d_weights:, a pretrained model is loaded if available; these 3D pretrained models were obtained on Kinetics-600. Any 2D backbone can be combined with any 3D backbone, so we won't go through every combination; a shape walkthrough of the forward pass is sketched below. Next comes one of the paper's innovations: the channel fusion and attention mechanism.
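To make the channel bookkeeping concrete, here is a shape walkthrough of forward, assuming a 16-frame 224×224 input clip and the darknet/resnext101 pairing; the 7×7 grid follows from YOLOv2's 32× downsampling, and the numbers should be treated as illustrative:

# input clip:                    (B, 3, 16, 224, 224)
x_2d = input[:, :, -1, :, :]   # (B, 3, 224, 224), the last frame of the clip
x_2d = self.backbone_2d(x_2d)  # (B, 425, 7, 7)       YOLOv2 features
x_3d = self.backbone_3d(x_3d)  # (B, 2048, 1, 7, 7)   ResNeXt-101 features
x_3d = torch.squeeze(x_3d, 2)  # (B, 2048, 7, 7)      drop the collapsed time axis
x = torch.cat((x_3d, x_2d), 1) # (B, 2473, 7, 7)      2048 + 425 channels
x = self.cfam(x)               # (B, 1024, 7, 7)      channel fusion & attention
out = self.conv_final(x)       # (B, 145, 7, 7)       5 * (24 + 4 + 1)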
Channel fusion and attention: self.cfam = CFAMBlock(num_ch_2d+num_ch_3d, 1024). CFAMBlock is defined in cfam.py; the complete code is as follows:

class CAM_Module(nn.Module):
    """ Channel attention module """
    def __init__(self, in_dim):
        super(CAM_Module, self).__init__()
        self.chanel_in = in_dim
        self.gamma = nn.Parameter(torch.zeros(1))
        self.softmax  = nn.Softmax(dim=-1)
    def forward(self,x):
        """
            inputs :
                x : input feature maps( B X C X H X W )
            returns :
                out : attention value + input feature
                attention: B X C X C
        """
        m_batchsize, C, height, width = x.size()
        proj_query = x.view(m_batchsize, C, -1)
        proj_key = x.view(m_batchsize, C, -1).permute(0, 2, 1)
        energy = torch.bmm(proj_query, proj_key)
        energy_new = torch.max(energy, -1, keepdim=True)[0].expand_as(energy)-energy
        attention = self.softmax(energy_new)
        proj_value = x.view(m_batchsize, C, -1)
        out = torch.bmm(attention, proj_value)
        out = out.view(m_batchsize, C, height, width)
        out = self.gamma*out + x
        return out

class CFAMBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(CFAMBlock, self).__init__()
        inter_channels = 1024
        self.conv_bn_relu1 = nn.Sequential(nn.Conv2d(in_channels, inter_channels, kernel_size=1, bias=False),
                                    nn.BatchNorm2d(inter_channels),
                                    nn.ReLU())  
        self.conv_bn_relu2 = nn.Sequential(nn.Conv2d(inter_channels, inter_channels, 3, padding=1, bias=False),
                                    nn.BatchNorm2d(inter_channels),
                                    nn.ReLU())
        self.sc = CAM_Module(inter_channels)
        self.conv_bn_relu3 = nn.Sequential(nn.Conv2d(inter_channels, inter_channels, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(inter_channels),
                                   nn.ReLU())
        self.conv_out = nn.Sequential(nn.Dropout2d(0.1, False), nn.Conv2d(inter_channels, out_channels, 1))
    def forward(self, x):
        x = self.conv_bn_relu1(x)
        x = self.conv_bn_relu2(x)
        x = self.sc(x)
        x = self.conv_bn_relu3(x)
        output = self.conv_out(x)
        return output

For the details of CFAM, see the YOWO paper translation. Here we use the structure diagram from the paper to guide the analysis:
[Figure: CFAM structure, from the paper]
CFAMBlock consists of four convolutional layers and one CAM_Module. The 2D and 3D outputs, concatenated along the channel dimension, serve as its input; two 2D convolutions first extract features, the result goes through CAM_Module, and two more 2D convolutions produce the final output. The spatial shape $H'\times W'$ stays unchanged, while the channels go from $C'+C''$ to $C^{*}$.
The convolutional layers before and after CAM_Module are easy to understand, so let's now look at CAM_Module and match it against the derivation in the paper step by step (a shape-checking sketch follows the list):

  • First, the output $B$ of the two convolutional layers is reshaped into $F\in \mathbb{R}^{C\times N}$, where $N=H\times W$, i.e., each channel's feature map is vectorized: $B\in \mathbb{R}^{C\times H\times W}\overset{vectorization}{\rightarrow}F\in \mathbb{R}^{C\times N}$. Corresponding code: proj_query = x.view(m_batchsize, C, -1).
  • Multiplying $F$ by its transpose (proj_key = x.view(m_batchsize, C, -1).permute(0, 2, 1)) yields the Gram matrix $G=F\times F^{T}$, with $G_{ij}=\sum_{k=1}^{N}F_{ik}\cdot F_{jk}$. Corresponding code: energy = torch.bmm(proj_query, proj_key).
  • As for energy_new = torch.max(energy, -1, keepdim=True)[0].expand_as(energy)-energy, I honestly couldn't work it out, and I found no explanation of it in the paper; if anyone knows, please enlighten me. (One plausible reading: subtracting every entry from its row maximum keeps the following softmax numerically stable, just like the max subtraction in a standard stable softmax, though here it also inverts the ordering of the similarities.)
  • Next, a softmax layer turns the Gram matrix into the channel attention map $M\in \mathbb{R}^{C\times C}$: $M_{ij}=\frac{\exp(G_{ij})}{\sum_{j=1}^{C}\exp(G_{ij})}$. Corresponding code: attention = self.softmax(energy_new).
  • To let the attention map act on the original features, $M$ and $F$ are multiplied: $F'=M\cdot F$. Corresponding code:
        proj_value = x.view(m_batchsize, C, -1)
        out = torch.bmm(attention, proj_value)
  • $F'$ is then reshaped back to the shape of the input tensor: $F'\in \mathbb{R}^{C\times N}\overset{reshape}{\rightarrow}F''\in \mathbb{R}^{C\times H\times W}$. Corresponding code: out = out.view(m_batchsize, C, height, width).
  • Finally, the output of the channel attention module, $C\in \mathbb{R}^{C\times H\times W}$, combines this result with the original input feature map $B$ via an element-wise sum weighted by a trainable scalar $\alpha$, which starts at 0 and is learned gradually: $C=\alpha\cdot F''+B$. Corresponding code: out = self.gamma*out + x, where self.gamma is initialized to zero.
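As referenced above, a minimal shape-checking sketch of these steps; the sizes are hypothetical, and the point is only to verify the tensor shapes against the formulas:

import torch

B_, C, H, W = 2, 1024, 7, 7                    # hypothetical sizes
x = torch.randn(B_, C, H, W)                   # input feature map B

F = x.view(B_, C, -1)                          # (B_, C, N) with N = H*W
G = torch.bmm(F, F.permute(0, 2, 1))           # Gram matrix, (B_, C, C)
G_new = torch.max(G, -1, keepdim=True)[0].expand_as(G) - G
M = torch.softmax(G_new, dim=-1)               # attention map, (B_, C, C)
F2 = torch.bmm(M, F)                           # F' = M·F, (B_, C, N)
out = F2.view(B_, C, H, W)                     # F'', back to (B_, C, H, W)
print(out.shape)                               # torch.Size([2, 1024, 7, 7])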
    So the following lines in the YOWO model:
        ##### Attention & Final Conv #####
        self.cfam = CFAMBlock(num_ch_2d+num_ch_3d, 1024)
        self.conv_final = nn.Conv2d(1024, 5*(opt.n_classes+4+1), kernel_size=1, bias=False)

are also easy to understand: the 2D and 3D feature maps are concatenated along the channel dimension, and the final output tensor has 5*(opt.n_classes+4+1) = 145 channels, i.e., for each of the 5 anchors, the localization values (x, y, w, h, conf) plus the scores over the 24 (opt.n_classes) action classes. A sketch of how these channels group by anchor is given below.
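For intuition only, here is how those 145 output channels can be regrouped per anchor; this mirrors the region-layer convention that get_region_boxes relies on, but the reshape itself is illustrative rather than the repository's code:

import torch

n_classes, n_anchors = 24, 5
out = torch.randn(2, n_anchors * (n_classes + 4 + 1), 7, 7)  # conv_final output

out = out.view(2, n_anchors, n_classes + 4 + 1, 7, 7)  # (B, 5, 29, 7, 7)
box_xywh = out[:, :, 0:4]  # (B, 5, 4, 7, 7)   raw x, y, w, h per grid cell
box_conf = out[:, :, 4]    # (B, 5, 7, 7)      objectness score
cls_score = out[:, :, 5:]  # (B, 5, 24, 7, 7)  class scores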
The line parameters = get_fine_tuning_parameters(model, opt) selects which parameters to fine-tune; usually only the last few layers are fine-tuned. For more background, see the article on model fine-tuning (模型finetune).


That concludes the analysis of the preparatory work; next comes the actual training. For details, see:
PyTorch | YOWO: Principles and Code Explained (Part 2)



Reprinted from blog.csdn.net/qq_24739717/article/details/104955554