Implementing the I3D Network with the MindSpore Framework

A runnable example is available in the Gitee repository: tutorials/classification/i3d · Yanlq/zjut_mindvideo (gitee.com)

I3D comes from the paper "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset", published at CVPR 2017. The paper re-evaluates state-of-the-art architectures on the new Kinetics dataset, which covers 400 human action categories with more than 400 clips per category, collected from realistic, challenging YouTube videos. The authors propose the Two-Stream Inflated 3D ConvNet (I3D), which inflates the convolution and pooling kernels of a very deep image classification network from 2D to 3D so that spatio-temporal features are learned seamlessly. After pre-training on Kinetics, I3D reaches 80.9% accuracy on the HMDB-51 benchmark and 98.0% on UCF-101.

Deep networks trained on ImageNet can be reused for other tasks, and their performance keeps improving as the architectures improve. In the video domain, however, it was an open question whether an action recognition network trained on a sufficiently large dataset brings a similar boost when applied to a different temporal task or dataset. To answer this, the authors reimplemented a number of representative architectures, pre-trained them on Kinetics, and then fine-tuned them on HMDB-51 and UCF-101 to analyze their transfer behavior. They found that pre-training brings a large improvement, but the size of the gain depends heavily on the architecture. Based on this finding they propose I3D, which performs very well after thorough pre-training on Kinetics. I3D builds on the best image network architectures, inflates their convolution and pooling kernels from 2D to 3D, and optionally bootstraps from their pretrained parameters, yielding very deep spatio-temporal classifiers. The authors also found that an I3D model based on InceptionV1 and pre-trained on Kinetics far outperforms I3D variants built on other state-of-the-art models.

How inflation is implemented

Take a 2D network, turn every 2D convolution kernel into a 3D kernel and every 2D pooling layer into a 3D pooling layer, and leave the rest of the architecture untouched. This lets us take any 2D architecture that works well on images (VGG, ResNet-50, etc.) and obtain a 3D architecture usable for video understanding. To bootstrap the weights, imagine a "boring" video made by repeatedly copying the same image: if the video has N frames and each frame input is x, then replicating the 2D filter w N times along the temporal dimension produces wNx, and rescaling by 1/N recovers wx, so the inflated network reproduces the activations of the 2D network. Although in principle a 3×3 pooling kernel inflates to 3×3×3, the implementation differs slightly: it is best not to downsample along the temporal dimension early in the network, so the first two max-pooling layers use a 1×3×3 kernel with stride 1×2×2, while later kernels and strides are inflated normally.
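
The weight bootstrapping described above can be sketched in a few lines of NumPy (an illustration only, not the repository's code; a (C_out, C_in, H, W) weight layout is assumed):

import numpy as np

def inflate_2d_weight(w2d: np.ndarray, time_dim: int) -> np.ndarray:
    # Replicate the 2D kernel (C_out, C_in, H, W) along a new temporal axis to
    # get (C_out, C_in, T, H, W), then rescale by 1/T so that a video made of
    # repeated frames yields exactly the 2D network's activations (wNx / N = wx).
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], time_dim, axis=2)
    return w3d / time_dim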

Dataset Loading

        The Kinetics400 dataset is a large-scale video action recognition dataset containing video clips from 400 action categories. Each clip is between 10 and 30 seconds long, at a resolution of 240x320 or 320x240. The Kinetic400 class below loads the dataset, using the ParseKinetic400 interface to parse the annotation files.


import os
import csv
import json

from src.data import transforms
from src.data.meta import ParseDataset
from src.data.video_dataset import VideoDataset

from src.utils.class_factory import ClassFactory, ModuleType

__all__ = ["Kinetic400", "ParseKinetic400"]


@ClassFactory.register(ModuleType.DATASET)
class Kinetic400(VideoDataset):
    """
    Args:
        path (string): Root directory of the Kinetic-400 dataset or inference image.
        split (str): The dataset split supports "train", "test" or "infer". Default: None.
        transform (callable, optional): A function transform that takes in a video. Default:None.
        target_transform (callable, optional): A function transform that takes in a label.
                Default: None.
        seq(int): The number of frames of captured video. Default: 16.
        seq_mode(str): The way of capturing video frames, "part" or "discrete". Default: "part".
        align(boolean): The video contains multiple actions. Default: False.
        batch_size (int): Batch size of dataset. Default: 16.
        repeat_num (int): The repeat num of dataset. Default:1.
        shuffle (bool, optional): Whether or not to perform shuffle on the dataset. Default:None.
        num_parallel_workers (int): Number of subprocess used to fetch the dataset in parallel.
                Default: 1.
        num_shards (int, optional): Number of shards that the dataset will be divided into.
                Default: None.
        shard_id (int, optional): The shard ID within num_shards. Default: None.
        download (bool): Whether to download the dataset. Default: False.
        frame_interval (int): Frame interval of the sample strategy. Default: 1.
        num_clips (int): Number of clips sampled in one video. Default: 1.
    Examples:
        >>> dataset = Kinetic400("./data/kinetic-400", split="train")
        >>> dataset = dataset.run()

    The directory structure of the Kinetic-400 dataset looks like:

        .
        |-kinetic-400
            |-- train
            |   |-- ___qijXy2f0_000011_000021.mp4      // video file
            |   |-- ___dTOdxzXY_000022_000032.mp4      // video file
            |    ...
            |-- test
            |   |-- __Zh0xijkrw_000042_000052.mp4       // video file
            |   |-- __zVSUyXzd8_000070_000080.mp4   // video file
            |-- val
            |   |-- __wsytoYy3Q_000055_000065.mp4       // video file
            |   |-- __vzEs2wzdQ_000026_000036.mp4   // video file
            |    ...
            |-- kinetics-400_train.csv             //training dataset label file.
            |-- kinetics-400_test.csv              //testing dataset label file.
            |-- kinetics-400_val.csv               //validation dataset label file.

            ...
    """

    def __init__(self,
                 path,
                 split=None,
                 transform=None,
                 target_transform=None,
                 seq=16,
                 seq_mode="part",
                 align=False,
                 batch_size=16,
                 repeat_num=1,
                 shuffle=None,
                 num_parallel_workers=1,
                 num_shards=None,
                 shard_id=None,
                 download=False,
                 frame_interval=1,
                 num_clips=1
                 ):
        load_data = ParseKinetic400(os.path.join(path, split)).parse_dataset
        super().__init__(path=path,
                         split=split,
                         load_data=load_data,
                         transform=transform,
                         target_transform=target_transform,
                         seq=seq,
                         seq_mode=seq_mode,
                         align=align,
                         batch_size=batch_size,
                         repeat_num=repeat_num,
                         shuffle=shuffle,
                         num_parallel_workers=num_parallel_workers,
                         num_shards=num_shards,
                         shard_id=shard_id,
                         download=download,
                         frame_interval=frame_interval,
                         num_clips=num_clips
                         )

    @property
    def index2label(self):
        """Get the mapping of indexes and labels."""
        csv_file = os.path.join(self.path, f"kinetics-400_{self.split}.csv")
        mapping = []
        cls = None
        with open(csv_file, "r") as f:
            f_csv = csv.DictReader(f)
            for row in f_csv:
                # The csv is grouped by label; record each label the first
                # time it appears.
                if row['label'] != cls:
                    cls = row['label']
                    mapping.append(cls)
        return mapping

    def download_dataset(self):
        """Download the HMDB51 data if it doesn't exist already."""
        raise ValueError("HMDB51 dataset download is not supported.")

    def default_transform(self):
        """Set the default transform for UCF101 dataset."""
        size = 256
        order = (3, 0, 1, 2)
        trans = [
            transforms.VideoShortEdgeResize(size=size, interpolation='linear'),
            transforms.VideoCenterCrop(size=(224, 224)),
            transforms.VideoRescale(shift=0),
            transforms.VideoReOrder(order=order),
            transforms.VideoNormalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ]

        return trans


class ParseKinetic400(ParseDataset):
    """
    Parse kinetic-400 dataset.
    """
    urlpath = "https://storage.googleapis.com/deepmind-media/Datasets/kinetics400.tar.gz"

    def load_cls_file(self):
        """Parse the category file."""
        base_path = os.path.dirname(self.path)
        csv_file = os.path.join(base_path, "kinetics-400_train.csv")
        cls2id = {}
        id2cls = []
        cls_file = os.path.join(base_path, "cls2index.json")
        # Reuse the cached class index file if it already exists.
        if os.path.isfile(cls_file):
            with open(cls_file, "r") as f:
                cls2id = json.load(f)
            id2cls = [*cls2id]
            return id2cls, cls2id
        with open(csv_file, "r") as f:
            f_csv = csv.DictReader(f)
            for row in f_csv:
                if row['label'] not in cls2id:
                    cls2id[row['label']] = len(cls2id)
                    id2cls.append(row['label'])
        with open(cls_file, "w") as f:
            f.write(json.dumps(cls2id))
        return id2cls, cls2id

    def parse_dataset(self, *args):
        """Traverse the HMDB51 dataset file to get the path and label."""
        parse_kinetic400 = ParseKinetic400(self.path)
        split = os.path.split(parse_kinetic400.path)[-1]
        video_label, video_path = [], []
        _, cls2id = self.load_cls_file()
        with open(os.path.join(os.path.dirname(parse_kinetic400.path),
                               f"kinetics-400_{split}.csv"), "rt")as f:
            f_csv = csv.DictReader(f)
            for row in f_csv:
                start = row['time_start'].zfill(6)
                end = row['time_end'].zfill(6)
                file_name = f"{row['youtube_id']}_{start}_{end}.mp4"
                video_path.append(os.path.join(self.path, file_name))
                video_label.append(cls2id[row['label']])
        return video_path, video_label
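
With these classes in place, loading a split takes only a few lines. A minimal usage sketch (the dataset root below is a placeholder path; the directory layout from the Kinetic400 docstring is assumed):

dataset = Kinetic400("/path/to/kinetic-400", split="train", seq=16, batch_size=16)
dataset_train = dataset.run()  # returns a batched MindSpore dataset
print(dataset_train.get_dataset_size())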

Building the Network

        Inception3dModule is a building block used in the I3D network to process video data with 3D convolutions. Unlike a plain convolutional block, it applies several convolution kernels of different sizes in parallel branches and concatenates their outputs, improving the module's feature extraction capability. The following code implements Inception3dModule:

from mindspore import nn
from mindspore import ops
from typing import Union, List, Tuple
from src.models.avgpool3d import AvgPool3D

from src.models.layers.unit3d import Unit3D
from src.models.builder import build_layer, build_model
from src.utils.class_factory import ClassFactory, ModuleType
__all__ = ['I3D']

@ClassFactory.register(ModuleType.LAYER)
class Inception3dModule(nn.Cell):
    """
    Inception3dModule definition.

    Args:
        in_channels (int):  The number of channels of input frame images.
        out_channels (int): The number of channels of output frame images.

    Returns:
        Tensor, output tensor.

    Examples:
        Inception3dModule(in_channels=3, out_channels=3)
    """

    def __init__(self, in_channels, out_channels):
        super(Inception3dModule, self).__init__()
        self.cat = ops.Concat(axis=1)
        self.b0 = Unit3D(
            in_channels=in_channels,
            out_channels=out_channels[0],
            kernel_size=(1, 1, 1))
        self.b1a = Unit3D(
            in_channels=in_channels,
            out_channels=out_channels[1],
            kernel_size=(1, 1, 1))
        self.b1b = Unit3D(
            in_channels=out_channels[1],
            out_channels=out_channels[2],
            kernel_size=(3, 3, 3))
        self.b2a = Unit3D(
            in_channels=in_channels,
            out_channels=out_channels[3],
            kernel_size=(1, 1, 1))
        self.b2b = Unit3D(
            in_channels=out_channels[3],
            out_channels=out_channels[4],
            kernel_size=(3, 3, 3))
        self.b3a = ops.MaxPool3D(
            kernel_size=(3, 3, 3),
            strides=(1, 1, 1),
            pad_mode="same")
        self.b3b = Unit3D(
            in_channels=in_channels,
            out_channels=out_channels[5],
            kernel_size=(1, 1, 1))

    def construct(self, x):
        b0 = self.b0(x)
        b1 = self.b1b(self.b1a(x))
        b2 = self.b2b(self.b2a(x))
        b3 = self.b3b(self.b3a(x))
        return self.cat((b0, b1, b2, b3))


@ClassFactory.register(ModuleType.LAYER)
class InceptionI3d(nn.Cell):
    """
    InceptionI3d architecture (the backbone of I3D).

    Args:
        in_channels (int): The number of channels of input frame images(default 3).
    Returns:
        Tensor, output tensor.

    Examples:
        >>> InceptionI3d(in_channels=3)
    """

    def __init__(self, in_channels=3):

        super(InceptionI3d, self).__init__()

        self.conv3d_1a_7x7 = Unit3D(
            in_channels=in_channels,
            out_channels=64,
            kernel_size=(7, 7, 7),
            stride=(2, 2, 2))
        self.maxpool3d_2a_3x3 = ops.MaxPool3D(
            kernel_size=(1, 3, 3),
            strides=(1, 2, 2),
            pad_mode="same")

        self.conv3d_2b_1x1 = Unit3D(
            in_channels=64,
            out_channels=64,
            kernel_size=(1, 1, 1))

        self.conv3d_2c_3x3 = Unit3D(
            in_channels=64,
            out_channels=192,
            kernel_size=(3, 3, 3))

        self.maxpool3d_3a_3x3 = ops.MaxPool3D(
            kernel_size=(1, 3, 3),
            strides=(1, 2, 2),
            pad_mode="same")

        self.mixed_3b = build_layer(
            {
                "type": "Inception3dModule",
                "in_channels": 192,
                "out_channels": [64, 96, 128, 16, 32, 32]})

        self.mixed_3c = build_layer(
            {
                "type": "Inception3dModule",
                "in_channels": 256,
                "out_channels": [128, 128, 192, 32, 96, 64]})

        self.maxpool3d_4a_3x3 = ops.MaxPool3D(
            kernel_size=(3, 3, 3),
            strides=(2, 2, 2),
            pad_mode="same")

        self.mixed_4b = build_layer(
            {
                "type": "Inception3dModule",
                "in_channels": 128 + 192 + 96 + 64,
                "out_channels": [192, 96, 208, 16, 48, 64]})

        self.mixed_4c = build_layer(
            {
                "type": "Inception3dModule",
                "in_channels": 192 + 208 + 48 + 64,
                "out_channels": [160, 112, 224, 24, 64, 64]})

        self.mixed_4d = build_layer(
            {
                "type": "Inception3dModule",
                "in_channels": 160 + 224 + 64 + 64,
                "out_channels": [128, 128, 256, 24, 64, 64]})

        self.mixed_4e = build_layer(
            {
                "type": "Inception3dModule",
                "in_channels": 128 + 256 + 64 + 64,
                "out_channels": [112, 144, 288, 32, 64, 64]})

        self.mixed_4f = build_layer(
            {
                "type": "Inception3dModule",
                "in_channels": 112 + 288 + 64 + 64,
                "out_channels": [256, 160, 320, 32, 128, 128]})

        self.maxpool3d_5a_2x2 = ops.MaxPool3D(
            kernel_size=(2, 2, 2),
            strides=(2, 2, 2),
            pad_mode="same")

        self.mixed_5b = build_layer(
            {
                "type": "Inception3dModule",
                "in_channels": 256 + 320 + 128 + 128,
                "out_channels": [256, 160, 320, 32, 128, 128]})

        self.mixed_5c = build_layer(
            {
                "type": "Inception3dModule",
                "in_channels": 256 + 320 + 128 + 128,
                "out_channels": [384, 192, 384, 48, 128, 128]})

    def construct(self, x):
        """Average pooling 3D construct."""
        x = self.conv3d_1a_7x7(x)
        x = self.maxpool3d_2a_3x3(x)
        x = self.conv3d_2b_1x1(x)
        x = self.conv3d_2c_3x3(x)
        x = self.maxpool3d_3a_3x3(x)
        x = self.mixed_3b(x)
        x = self.mixed_3c(x)
        x = self.maxpool3d_4a_3x3(x)
        x = self.mixed_4b(x)
        x = self.mixed_4c(x)
        x = self.mixed_4d(x)
        x = self.mixed_4e(x)
        x = self.mixed_4f(x)
        x = self.maxpool3d_5a_2x2(x)
        x = self.mixed_5b(x)
        x = self.mixed_5c(x)
        return x

Average 3D pooling is a common operation in 3D convolutional networks: it shrinks a feature map by averaging the values inside each 3D pooling window. As with 2D pooling, this reduces computation and feature-map size and improves the model's robustness and generalization. The following code implements average 3D pooling:

class AvgPooling3D(nn.Cell):
    """
    A module of average pooling for 3D video features.

    Args:
        kernel_size(Union[int, List[int], Tuple[int]]): The size of kernel window used to take the
            average value, Default: (1, 1, 1).
        strides(Union[int, List[int], Tuple[int]]): The distance of kernel moving. Default: (1, 1, 1).

    Inputs:
        x(Tensor): The input Tensor.

    Returns:
        Tensor, the pooled Tensor.
    """

    def __init__(self,
                 kernel_size: Union[int, List[int], Tuple[int]] = (1, 1, 1),
                 strides: Union[int, List[int], Tuple[int]] = (1, 1, 1),
                 ) -> None:
        super(AvgPooling3D, self).__init__()
        if isinstance(kernel_size, int):
            kernel_size = (kernel_size, kernel_size, kernel_size)
        kernel_size = tuple(kernel_size)
        if isinstance(strides, int):
            strides = (strides, strides, strides)
        strides = tuple(strides)

        self.pool = AvgPool3D(kernel_size, strides)

    def construct(self, x):
        x = self.pool(x)
        return x
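
As a quick sanity check of the neck used later in I3D, the sketch below pools a backbone-sized feature map. It assumes the wrapped AvgPool3D applies MindSpore's default "valid" padding, so the 7×7 spatial extent collapses to 1×1 and the temporal length shrinks by one:

import numpy as np
import mindspore as ms

pool = AvgPooling3D(kernel_size=(2, 7, 7))  # strides default to (1, 1, 1)
feat = ms.Tensor(np.ones([1, 1024, 4, 7, 7]), ms.float32)
print(pool(feat).shape)  # expected: (1, 1024, 3, 1, 1)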

With Inception3dModule and average 3D pooling in place, we can build the I3D head and assemble the complete network:

@ClassFactory.register(ModuleType.LAYER)
class I3dHead(nn.Cell):
    """
    I3dHead definition

    Args:
        in_channels (int): Input channel.
        num_classes (int): The number of classes. Default: 400.
        dropout_keep_prob (float): Keep probability for dropout. Default: 0.5.

    Returns:
        Tensor, output tensor.

    Examples:
        I3dHead(in_channels=2048, num_classes=400, dropout_keep_prob=0.5)
    """

    def __init__(self, in_channels, num_classes=400, dropout_keep_prob=0.5):
        super(I3dHead, self).__init__()
        self._num_classes = num_classes
        self.dropout = nn.Dropout(dropout_keep_prob)
        self.logits = Unit3D(
            in_channels=in_channels,
            out_channels=self._num_classes,
            kernel_size=(1, 1, 1),
            activation=None,
            norm=None,
            has_bias=True)
        self.mean_op = ops.ReduceMean()
        self.squeeze = ops.Squeeze(3)

    def construct(self, x):
        x = self.logits(self.dropout(x))
        x = self.squeeze(self.squeeze(x))
        x = self.mean_op(x, 2)
        return x


@ClassFactory.register(ModuleType.MODEL)
class I3D(nn.Cell):
    """
    I3D architecture: an InceptionI3d backbone, an average 3D pooling neck, and an I3dHead classification head.

    Args:
        in_channel(int): Number of channels of input data. Default: 3.
        num_classes(int): Number of classes, it is the size of classification score for every sample,
            i.e. :math:`CLASSES_{out}`. Default: 400.
        keep_prob(float): Probability of dropout for multi-dense-layer head, the number of probabilities equals
            the number of dense layers. Default: 0.5.
        backbone_output_channel(int): Number of channels output by the backbone. Default: 1024.

    Inputs:
        - **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

    Outputs:
        Tensor of shape :math:`(N, CLASSES_{out})`.

    Supported Platforms:
        ``GPU``

    Examples:
        >>> import numpy as np
        >>> import mindspore as ms
        >>> from mindvision.msvideo.models import i3d
        >>>
        >>> net = i3d()
        >>> x = ms.Tensor(np.ones([1, 3, 32, 224, 224]), ms.float32)
        >>> output = net(x)
        >>> print(output.shape)
        (1, 400)

    About i3d:

    I3D inflates the convolution and pooling kernels of a deep InceptionV1 image
    classification network from 2D to 3D so that it can learn spatio-temporal
    features for video recognition.

    Citation:

    .. code-block::

        @inproceedings{carreira2017quo,
            title={Quo vadis, action recognition? A new model and the kinetics dataset},
            author={Carreira, Joao and Zisserman, Andrew},
            booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
            year={2017}
        }
    """

    def __init__(self,
                 in_channel: int = 3,
                 num_classes: int = 400,
                 keep_prob: float = 0.5,
                 backbone_output_channel=1024):
        super(I3D, self).__init__()

        self.backbone = InceptionI3d(in_channels=in_channel)
        self.neck = AvgPooling3D(kernel_size=(2, 7, 7))
        self.head = I3dHead(in_channels=backbone_output_channel,
                            num_classes=num_classes,
                            dropout_keep_prob=keep_prob)

    def construct(self, x):
        x = self.backbone(x)
        x = self.neck(x)
        x = self.head(x)

        return x
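
A quick smoke test of the assembled network, mirroring the docstring example (a 32-frame 224×224 RGB clip and the default 400-class head are assumed):

import numpy as np
import mindspore as ms

net = I3D(num_classes=400)
x = ms.Tensor(np.ones([1, 3, 32, 224, 224]), ms.float32)
output = net(x)
print(output.shape)  # (1, 400)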

Model Training and Evaluation

        This section trains and evaluates the I3D network built above:

from mindspore import context, load_checkpoint, load_param_into_net
from mindspore.context import ParallelMode
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor
from mindspore.train import Model
from mindspore.nn.metrics import Accuracy
from mindspore.communication.management import init, get_rank, get_group_size

from src.utils.check_param import Validator, Rel
from src.utils.config import parse_args, Config
from src.loss.builder import build_loss
from src.schedule.builder import get_lr
from src.optim.builder import build_optimizer
from src.data.builder import build_dataset, build_transforms
from src.models import build_model

def main(pargs):
    # set config context
    config = Config(pargs.config)
    context.set_context(**config.context)

    # run distribute
    if config.train.run_distribute:
        if config.device_target == "Ascend":
            init()
        else:
            init("nccl")
        context.set_auto_parallel_context(device_num=get_group_size(),
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
        ckpt_save_dir = config.train.ckpt_path + "ckpt_" + str(get_rank()) + "/"
    else:
        ckpt_save_dir = config.train.ckpt_path

    # prepare dataset
    transforms = build_transforms(config.data_loader.train.map.operations)
    data_set = build_dataset(config.data_loader.train.dataset)
    data_set.transform = transforms
    dataset_train = data_set.run()
    Validator.check_int(dataset_train.get_dataset_size(), 0, Rel.GT)
    batches_per_epoch = dataset_train.get_dataset_size()

    # set network
    network = build_model(config.model)

    # set loss
    network_loss = build_loss(config.loss)
    # set lr
    lr_cfg = config.learning_rate
    lr_cfg.steps_per_epoch = int(batches_per_epoch / config.data_loader.group_size)
    lr = get_lr(lr_cfg)

    # set optimizer
    config.optimizer.params = network.trainable_params()
    config.optimizer.learning_rate = lr
    network_opt = build_optimizer(config.optimizer)

    if config.train.pre_trained:
        # load pretrain model
        param_dict = load_checkpoint(config.train.pretrained_model)
        load_param_into_net(network, param_dict)

    # set checkpoint for the network
    ckpt_config = CheckpointConfig(
        save_checkpoint_steps=config.train.save_checkpoint_steps,
        keep_checkpoint_max=config.train.keep_checkpoint_max)
    ckpt_callback = ModelCheckpoint(prefix=config.model_name,
                                    directory=ckpt_save_dir,
                                    config=ckpt_config)

    # init the whole Model
    model = Model(network,
                  network_loss,
                  network_opt,
                  metrics={"Accuracy": Accuracy()})

    # begin to train
    print('[Start training `{}`]'.format(config.model_name))
    print("=" * 80)
    model.train(config.train.epochs,
                dataset_train,
                callbacks=[ckpt_callback, LossMonitor()],
                dataset_sink_mode=config.dataset_sink_mode)
    print('[End of training `{}`]'.format(config.model_name))


if __name__ == '__main__':
    args = parse_args()
    main(args)

Visualizing Model Predictions

        This section runs inference with the trained model and visualizes the predicted label on the video:

import numpy as np
import cv2
import decord
import moviepy.editor as mpy

import mindspore as ms
from mindspore import context, nn, load_checkpoint, load_param_into_net, ops, Tensor
from mindspore.train import Model

from src.utils.check_param import Validator, Rel
from src.utils.config import parse_args, Config
from src.loss.builder import build_loss
from src.data.builder import build_dataset, build_transforms
from src.models import build_model


FONTFACE = cv2.FONT_HERSHEY_DUPLEX
FONTSCALE = 0.6
FONTCOLOR = (255, 255, 255)
BGBLUE = (0, 119, 182)
THICKNESS = 1
LINETYPE = 1


def infer_classification(pargs):
    # set config context
    config = Config(pargs.config)
    context.set_context(**config.context)
    # prepare dataset
    transforms = build_transforms(config.data_loader.eval.map.operations)
    data_set = build_dataset(config.data_loader.eval.dataset)
    data_set.transform = transforms
    dataset_infer = data_set.run()
    Validator.check_int(dataset_infer.get_dataset_size(), 0, Rel.GT)

    # set network
    network = build_model(config.model)

    # load pretrain model
    param_dict = load_checkpoint(config.infer.pretrained_model)
    load_param_into_net(network, param_dict)

    # set loss
    network_loss = build_loss(config.loss)

    # init the whole Model
    model = Model(network,
                  network_loss)
    expand_dims = ops.ExpandDims()

    # randomly pick one video from the eval set
    vis_num = len(data_set.video_path)
    vid_idx = np.random.randint(vis_num)
    video_path = data_set.video_path[vid_idx]
    video_reader = decord.VideoReader(video_path, num_threads=1)
    img_set = []

    for k in range(16):
        im = video_reader[k].asnumpy()
        img_set.append(im)
    video = np.stack(img_set, axis=0)
    for t in transforms:
        video = t(video)
    # Begin to eval.
    video = Tensor(video, ms.float32)
    video = expand_dims(video, 0)
    result = network(video)
    result = result.asnumpy()
    print("This is the {}-th category".format(result.argmax()))

    return result, video_path

def add_label(frame, label, BGCOLOR=BGBLUE):
    threshold = 30
    def split_label(label):
        label = label.split()
        lines, cline = [], ''
        for word in label:
            if len(cline) + len(word) < threshold:
                cline = cline + ' ' + word
            else:
                lines.append(cline)
                cline = word
        if cline != '':
            lines += [cline]
        return lines

    if len(label) > 30:
        label = split_label(label)
    else:
        label = [label]
    label = ['Action: '] + label

    sizes = []
    for line in label:
        sizes.append(cv2.getTextSize(line, FONTFACE, FONTSCALE, THICKNESS)[0])
    box_width = max([x[0] for x in sizes]) + 10
    text_height = sizes[0][1]
    box_height = len(sizes) * (text_height + 6)

    cv2.rectangle(frame, (0, 0), (box_width, box_height), BGCOLOR, -1)
    for i, line in enumerate(label):
        location = (5, (text_height + 6) * i + text_height + 3)
        cv2.putText(frame, line, location, FONTFACE, FONTSCALE, FONTCOLOR, THICKNESS, LINETYPE)
    return frame

if __name__ == '__main__':

    import json
    cls_file = '/home/publicfile/kinetics-400/cls2index.json'
    with open(cls_file, "r")as f:
        cls2id = json.load(f)

    className = {v:k for k, v in cls2id.items()}

    args = parse_args()
    result, video_path = infer_classification(args)

    label = className[int(result.argmax())]

    video = decord.VideoReader(video_path)
    frames = [x.asnumpy() for x in video]
    vid_frames = []
    for i in range(1, 50):
        vis_frame = add_label(frames[i], label)
        vid_frames.append(vis_frame)

    vid = mpy.ImageSequenceClip(vid_frames, fps=24)
    vid.write_gif('/home/i3d_mindspore-main/src/result.gif')

Training Performance

epoch: 1 step: 1, loss is 5.988250255584717
epoch: 1 step: 2, loss is 6.022036075592041
epoch: 1 step: 3, loss is 5.980734348297119
epoch: 1 step: 4, loss is 5.944761276245117
epoch: 1 step: 5, loss is 5.96290922164917
epoch: 1 step: 6, loss is 6.018253326416016
epoch: 1 step: 7, loss is 6.002189636230469
epoch: 1 step: 8, loss is 5.987124443054199
epoch: 1 step: 9, loss is 5.987508773803711
epoch: 1 step: 10, loss is 6.022692680358887
epoch: 1 step: 11, loss is 5.963132381439209
epoch: 1 step: 12, loss is 5.998828411102295
epoch: 1 step: 13, loss is 6.029492378234863
epoch: 1 step: 14, loss is 5.980365753173828
epoch: 1 step: 15, loss is 6.032135963439941
epoch: 1 step: 16, loss is 6.006852626800537
epoch: 1 step: 17, loss is 6.035465240478516
epoch: 1 step: 18, loss is 5.993041038513184
epoch: 1 step: 19, loss is 6.0041608810424805
epoch: 1 step: 20, loss is 6.035679340362549

Evaluation Performance

step:[ 1240/ 1242], metrics:['Top_1_Accuracy: 0.6719', 'Top_5_Accuracy: 0.8705'], loss:[0.861/1.585], time:14687.102 ms, 
step:[ 1241/ 1242], metrics:['Top_1_Accuracy: 0.6720', 'Top_5_Accuracy: 0.8706'], loss:[1.857/1.585], time:15122.631 ms, 
step:[ 1242/ 1242], metrics:['Top_1_Accuracy: 0.6719', 'Top_5_Accuracy: 0.8706'], loss:[2.374/1.586], time:13412.127 ms, 
Epoch time: 19056676.745 ms, per step time: 15343.540 ms, avg loss: 1.586
{'Top_1_Accuracy': 0.6717995169082126, 'Top_5_Accuracy': 0.8705213365539453}

Reprinted from blog.csdn.net/moeno/article/details/129017169