CVPR 2019 | Quick Look at 15 Papers (Covering Object Detection, Semantic Segmentation, Pose Estimation, and More)

[Overview] The CVPR 2019 accepted-paper list is out, but it only contains paper IDs, so there is no complete collection of the accepted papers yet. CVer has been collecting and organizing them; this post gives a quick look at 15 CVPR 2019 papers, covering object detection, semantic segmentation, pose estimation, and other directions.

Special thanks to the CV_arXiv_Daily WeChat account for providing the material. The papers introduced here have also been synced to:

https://github.com/zhengzhugithub/CV-arXiv-Daily

Pose Estimation

[1] CVPR 2019 pose estimation paper; currently state of the art, code released

Paper title: Deep High-Resolution Representation Learning for Human Pose Estimation

Authors: Ke Sun, Bin Xiao, Dong Liu, Jingdong Wang

Paper link: https://arxiv.org/abs/1902.09212

Code link: https://github.com/leoxiaobin/deep-high-resolution-net.pytorch

Abstract: This is an official pytorch implementation of Deep High-Resolution Representation Learning for Human Pose Estimation. In this work, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.

As soon as the paper Deep High-Resolution Representation Learning for Human Pose Estimation from researchers at Microsoft and the University of Science and Technology of China was released together with its code, it drew wide attention: in less than a day the GitHub repository had gathered nearly 50 stars.

Let's walk through what makes this paper interesting.

The algorithm was developed by first author Ke Sun during an internship at Microsoft Research Asia.

Core Idea

The authors observe that existing pose estimation networks typically first reduce the resolution of the feature maps and then recover a high-resolution output, as in the representative architectures below.

For reference, in the paper's diagrams of these four architectures, (a) through (d), feature maps on the same horizontal line share the same resolution, resolution decreases going downward, and the pose keypoints are computed from the final high-resolution heatmap.

(a) Hourglass

(b) Cascaded pyramid networks

(c) Simple baseline

(d) Combined with dilated convolutions


The authors instead want to avoid this resolution-recovery step entirely and to keep a high-resolution feature map at every stage of the network.

The architecture diagram in the paper expresses this idea succinctly: moving right through the network, depth keeps increasing; moving downward, feature maps are downsampled to progressively lower resolutions; and at the same depth, the high-resolution and low-resolution feature maps are repeatedly fused with one another.

The authors describe this structure as sub-networks of different resolutions advancing in parallel.

The keypoint heatmaps are computed on the final high-resolution feature map.

Fusion between the feature maps of the different-resolution sub-networks works as follows:

Downsampling is done mainly with strided 3×3 convolutions, and upsampling with 1×1 convolutions followed by upsampling back to the higher resolution.
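To make the fusion step concrete, here is a minimal PyTorch sketch of one exchange between a high-resolution branch and a low-resolution branch. The module name, channel widths, and the choice of nearest-neighbor upsampling are illustrative assumptions, not the official HRNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFusion(nn.Module):
    """Exchange information between a high-res and a low-res branch (sketch)."""
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        # high -> low: a strided 3x3 convolution halves the spatial resolution
        self.down = nn.Conv2d(c_high, c_low, kernel_size=3, stride=2, padding=1)
        # low -> high: a 1x1 convolution matches channels before upsampling by 2
        self.up = nn.Conv2d(c_low, c_high, kernel_size=1)

    def forward(self, x_high, x_low):
        # each branch adds the (resampled) features coming from the other branch
        low_to_high = F.interpolate(self.up(x_low), scale_factor=2, mode="nearest")
        high_to_low = self.down(x_high)
        return x_high + low_to_high, x_low + high_to_low

# toy usage: a 64x48 high-resolution map and a 32x24 low-resolution map
fusion = TwoBranchFusion()
h, l = fusion(torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24))
print(h.shape, l.shape)  # torch.Size([1, 32, 64, 48]) torch.Size([1, 64, 32, 24])
```

In the full network this exchange is repeated at every stage and across more than two resolutions, which is what produces the repeated multi-scale fusion described in the abstract.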

What are the benefits of doing this?

According to the authors:

1) A high-resolution representation is maintained throughout, so there is no need to recover resolution at the end.

2) The multi-resolution representations are fused repeatedly, enriching the high-resolution features.

Experimental Results

On the COCO keypoint validation set, the method beats the current state of the art on every metric; at the same input resolution, it improves on the previous best method by about 3 percentage points.

On COCO test-dev it is likewise well ahead of the field.

On the MPII test set it also achieves the best results.

The authors further compare parameter counts and computation with the previous best models: the proposed HRNet-W32 achieves the highest accuracy while requiring the lowest amount of computation.

On the PoseTrack2017 pose tracking dataset it again obtains the best results.

The paper also shows qualitative pose estimation examples.

Beyond Pose Estimation

On the project page, the authors point out that the deep high-resolution network is effective not only for pose estimation but can also be applied to other computer vision tasks such as semantic segmentation, face alignment, object detection, and image classification; more convincing results on these tasks are expected to be released.

Paper and Code

Paper:

http://cn.arxiv.org/pdf/1902.09212.pdf

Project page:

https://jingdongwang2017.github.io/Projects/HRNet/PoseEstimation.html

Code:

https://github.com/leoxiaobin/deep-high-resolution-net.pytorch

[Note] Deploying the algorithm in practice is still a concern: one branch keeps the full resolution and is convolved throughout the network, so the computational cost is very high.

Video Object Segmentation

[2] CVPR 2019 VOS paper

Paper title: FEELVOS: Fast End-to-End Embedding Learning for Video Object Segmentation

Authors: Paul Voigtlaender, Yuning Chai, Florian Schroff, Hartwig Adam, Bastian Leibe, Liang-Chieh Chen

Paper link: https://arxiv.org/abs/1902.09513

Abstract: Many of the recent successful methods for video object segmentation (VOS) are overly complicated, heavily rely on fine-tuning on the first frame, and/or are slow, and are hence of limited practical use. In this work, we propose FEELVOS as a simple and fast method which does not rely on fine-tuning. In order to segment a video, for each frame FEELVOS uses a semantic pixel-wise embedding together with a global and a local matching mechanism to transfer information from the first frame and from the previous frame of the video to the current frame. In contrast to previous work, our embedding is only used as an internal guidance of a convolutional network. Our novel dynamic segmentation head allows us to train the network, including the embedding, end-to-end for the multiple object segmentation task with a cross entropy loss. We achieve a new state of the art in video object segmentation without fine-tuning on the DAVIS 2017 validation set with a J&F measure of 69.1%.

Action Recognition

[3] CVPR 2019 action recognition paper

Paper title: An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition

Authors: Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, Tieniu Tan

Paper link: https://arxiv.org/abs/1902.09130

Abstract: Skeleton-based action recognition is an important task that requires the adequate understanding of movement characteristics of a human action from the given skeleton sequence. Recent studies have shown that exploring spatial and temporal features of the skeleton sequence is vital for this task. Nevertheless, how to effectively extract discriminative spatial and temporal features is still a challenging problem. In this paper, we propose a novel Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM) for human action recognition from skeleton data. The proposed AGC-LSTM can not only capture discriminative features in spatial configuration and temporal dynamics but also explore the co-occurrence relationship between spatial and temporal domains. We also present a temporal hierarchical architecture to increase temporal receptive fields of the top AGC-LSTM layer, which boosts the ability to learn the high-level semantic representation and significantly reduces the computation cost. Furthermore, to select discriminative spatial information, the attention mechanism is employed to enhance information of key joints in each AGC-LSTM layer. Experimental results on two datasets are provided: NTU RGB+D dataset and Northwestern-UCLA dataset. The comparison results demonstrate the effectiveness of our approach and show that our approach outperforms the state-of-the-art methods on both datasets.

Object Detection

[4] New CVPR 2019 detection paper

Paper title: Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression

Authors: Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, Silvio Savarese

Paper link: https://arxiv.org/abs/1902.09630

Abstract: Intersection over Union (IoU) is the most popular evaluation metric used in the object detection benchmarks. However, there is a gap between optimizing the commonly used distance losses for regressing the parameters of a bounding box and maximizing this metric value. The optimal objective for a metric is the metric itself. In the case of axis-aligned 2D bounding boxes, it can be shown that IoU can be directly used as a regression loss. However, IoU has a plateau making it infeasible to optimize in the case of non-overlapping bounding boxes. In this paper, we address the weaknesses of IoU by introducing a generalized version as both a new loss and a new metric. By incorporating this generalized IoU (GIoU) as a loss into the state-of-the-art object detection frameworks, we show a consistent improvement on their performance using both the standard, IoU based, and new, GIoU based, performance measures on popular object detection benchmarks such as PASCAL VOC and MS COCO.

极市 (Extreme Mart) has compiled all the papers on GitHub:

https://github.com/extreme-assistant/cvpr2019 (feel free to follow it). Below is a closer look at this object detection paper.

Original write-up | https://zhuanlan.zhihu.com/p/57863810

1. Motivation

Bounding box regression is one of the most fundamental modules in 2D/3D vision tasks: object detection, object tracking, and instance segmentation all rely on regressing a bounding box to obtain accurate localization. To get better detection performance, current deep-learning methods either use a stronger backbone or design better strategies for extracting features, while overlooking the L1/L2 losses used for bounding box regression as a point that can still be improved.

IoU is an important concept in object detection. In anchor-based methods it serves not only to assign positive and negative samples but also to measure how close a predicted box is to the ground truth, i.e. how accurate the prediction is. One nice property of IoU is that it is scale invariant.

For the regression task, the most direct measure of the distance between a predicted box and the ground truth is IoU, yet the loss actually used does not match it: as the paper's figure illustrates, boxes with the same loss value can have very different regression quality, so the loss fails to reflect how good the regression is, whereas IoU takes different values in these cases and reflects the regression quality most directly.

2. Method

The paper therefore proposes to let IoU, the direct metric, guide learning of the regression task: rather than supervising with a surrogate loss, it is better to optimize the metric itself. The loss function then becomes L_IoU = 1 - IoU.

However, using IoU directly as the loss raises two problems:

  • If the two boxes do not overlap, IoU = 0 by definition, which says nothing about how far apart they are (their degree of overlap). At the same time the loss becomes constant, so no gradient flows back and training cannot proceed.

  • IoU cannot precisely reflect how well two boxes align. As the paper's figure shows, three cases can share exactly the same IoU while their alignment clearly differs: the leftmost case is the best regression and the rightmost the worst.

To address these two shortcomings of IoU, the paper proposes a new metric, generalized IoU (GIoU):

GIoU = IoU - |C \ (A ∪ B)| / |C|, where C is the smallest enclosing box of the two boxes A and B.

The definition is simple: first compute the area of the smallest enclosing box of the two boxes, then compute the IoU, then compute the fraction of the enclosing box not covered by either box, and finally subtract that fraction from the IoU to obtain GIoU. GIoU has the following properties:

  • Like IoU, GIoU is a distance measure; used as a loss, L_GIoU = 1 - GIoU, it satisfies the basic requirements of a loss function.

  • GIoU is scale invariant.

  • GIoU is a lower bound on IoU; when the two boxes coincide exactly, IoU = GIoU = 1.

  • IoU takes values in [0, 1], while GIoU has the symmetric range [-1, 1]: it reaches its maximum of 1 when the two boxes coincide, and approaches its minimum of -1 when they are disjoint and infinitely far apart. This makes GIoU a very good distance measure.

  • Unlike IoU, which only looks at the overlapping region, GIoU also takes the non-overlapping regions into account and therefore better reflects how well two boxes align.

GIoU is not only simple to define but also simple to compute for 2D detection: the intersection is computed exactly as for IoU, and the smallest enclosing box is just the rectangle spanned by the element-wise min and max of the two boxes' corner coordinates.

The procedure for using GIoU and IoU as a loss is as follows (a minimal code sketch is given after the steps):

Steps:

  • Compute the areas of the ground-truth box and the predicted box.

  • Compute the area of their intersection.

  • Compute the area of the smallest enclosing box.

  • Compute IoU and GIoU.

  • Plug these into the formulas above to obtain the loss.
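The sketch below implements these steps in PyTorch for axis-aligned boxes given as (x1, y1, x2, y2). It follows the description above rather than the authors' reference implementation; the function name and the eps stabilizer are assumptions.

```python
import torch

def giou_loss(pred, gt, eps=1e-7):
    """GIoU loss for axis-aligned boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # areas of the predicted and ground-truth boxes
    area_p = (pred[:, 2] - pred[:, 0]).clamp(min=0) * (pred[:, 3] - pred[:, 1]).clamp(min=0)
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])

    # intersection area
    ix1 = torch.max(pred[:, 0], gt[:, 0]); iy1 = torch.max(pred[:, 1], gt[:, 1])
    ix2 = torch.min(pred[:, 2], gt[:, 2]); iy2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)

    # area of the smallest enclosing box
    cx1 = torch.min(pred[:, 0], gt[:, 0]); cy1 = torch.min(pred[:, 1], gt[:, 1])
    cx2 = torch.max(pred[:, 2], gt[:, 2]); cy2 = torch.max(pred[:, 3], gt[:, 3])
    enclose = (cx2 - cx1) * (cy2 - cy1)

    # IoU and GIoU
    union = area_p + area_g - inter
    iou = inter / (union + eps)
    giou = iou - (enclose - union) / (enclose + eps)

    # loss from the formula above (use 1 - iou for the plain IoU loss)
    return (1.0 - giou).mean()

# even for disjoint boxes the enclosing-box term still provides a gradient
pred = torch.tensor([[0., 0., 2., 2.]], requires_grad=True)
gt = torch.tensor([[4., 4., 6., 6.]])
giou_loss(pred, gt).backward()
print(pred.grad)  # non-zero, unlike the plain IoU loss in this case
```

For two disjoint boxes the plain IoU loss is flat, while the GIoU term still pulls the prediction toward the target, which addresses exactly the first problem listed above.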

3. Experiments

The GIoU loss can replace the bounding box regression loss in most object detectors. The paper evaluates it on three methods, Faster R-CNN, Mask R-CNN, and YOLOv3, with experiments on PASCAL VOC and MS COCO.

In the reported results, YOLOv3 shows a clear gain on COCO, but the improvement for the other models is modest; the authors attribute the limited gains for Faster R-CNN and Mask R-CNN to their dense anchors, which leave GIoU few opportunities to make a difference.

Overall, the paper has a good motivation: it points out the drawbacks of using L1/L2 as the regression loss and the flaws of using raw IoU as a loss, and it proposes a new metric to replace the L1/L2 losses and improve regression. The idea is simple and direct, but the scenarios in which it helps are rather limited.

Image Classification

[5] New CVPR 2019 classification paper

Paper title: Learning a Deep ConvNet for Multi-label Classification with Partial Labels

Authors: Thibaut Durand, Nazanin Mehrasa, Greg Mori

Paper link: https://arxiv.org/abs/1902.09720

Abstract: Deep ConvNets have shown great performance for single-label image classification (e.g. ImageNet), but it is necessary to move beyond the single-label classification task because pictures of everyday life are inherently multi-label. Multi-label classification is a more difficult task than single-label classification because both the input images and output label spaces are more complex. Furthermore, collecting clean multi-label annotations is more difficult to scale-up than single-label annotations. To reduce the annotation cost, we propose to train a model with partial labels i.e. only some labels are known per image. We first empirically compare different labeling strategies to show the potential for using partial labels on multi-label datasets. Then to learn with partial labels, we introduce a new classification loss that exploits the proportion of known labels per example. Our approach allows the use of the same training settings as when learning with all the annotations. We further explore several curriculum learning based strategies to predict missing labels. Experiments are performed on three large-scale multi-label datasets: MS COCO, NUS-WIDE and Open Images.

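The abstract above mentions a classification loss that exploits the proportion of known labels per example. The sketch below shows one way such a partially-labeled BCE loss could look in PyTorch; the encoding of unknown labels as -1 and the per-example normalization are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def partial_bce_loss(logits, targets):
    """Binary cross-entropy restricted to annotated labels.

    targets: 1 = positive, 0 = negative, -1 = unknown (missing annotation).
    Averaging over the known labels per example is one simple choice of
    normalization by the proportion of known labels.
    """
    known = (targets >= 0).float()                   # 1 where the label is annotated
    bce = F.binary_cross_entropy_with_logits(
        logits, targets.clamp(min=0).float(), reduction="none")
    per_example = (bce * known).sum(dim=1) / known.sum(dim=1).clamp(min=1.0)
    return per_example.mean()

# 2 images, 4 candidate labels; -1 marks labels that were never annotated
logits = torch.randn(2, 4, requires_grad=True)
targets = torch.tensor([[1, 0, -1, -1], [1, -1, -1, -1]])
partial_bce_loss(logits, targets).backward()
```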

3D Object Detection

[6] New CVPR 2019 3D detection paper

Paper title: Stereo R-CNN based 3D Object Detection for Autonomous Driving

Authors: Peiliang Li, Xiaozhi Chen, Shaojie Shen

Paper link: https://arxiv.org/abs/1902.09738

Abstract: We propose a 3D object detection method for autonomous driving by fully exploiting the sparse and dense, semantic and geometry information in stereo imagery. Our method, called Stereo R-CNN, extends Faster R-CNN for stereo inputs to simultaneously detect and associate objects in left and right images. We add extra branches after stereo Region Proposal Network (RPN) to predict sparse keypoints, viewpoints, and object dimensions, which are combined with 2D left-right boxes to calculate a coarse 3D object bounding box. We then recover the accurate 3D bounding box by a region-based photometric alignment using left and right RoIs. Our method does not require depth input and 3D position supervision, however, outperforms all existing fully supervised image-based methods. Experiments on the challenging KITTI dataset show that our method outperforms the state-of-the-art stereo-based method by around 30% AP on both 3D detection and 3D localization tasks. Code will be made publicly available.

3D Reconstruction

[7] New CVPR 2019 3D reconstruction paper

Paper title: Single-Image Piece-wise Planar 3D Reconstruction via Associative Embedding

Authors: Zehao Yu, Jia Zheng, Dongze Lian, Zihan Zhou, Shenghua Gao

Paper link: https://arxiv.org/abs/1902.09777

Code link: https://github.com/svip-lab/PlanarReconstruction

Abstract: Single-image piece-wise planar 3D reconstruction aims to simultaneously segment plane instances and recover 3D plane parameters from an image. Most recent approaches leverage convolutional neural networks (CNNs) and achieve promising results. However, these methods are limited to detecting a fixed number of planes with certain learned order. To tackle this problem, we propose a novel two-stage method based on associative embedding, inspired by its recent success in instance segmentation. In the first stage, we train a CNN to map each pixel to an embedding space where pixels from the same plane instance have similar embeddings. Then, the plane instances are obtained by grouping the embedding vectors in planar regions via an efficient mean shift clustering algorithm. In the second stage, we estimate the parameter for each plane instance by considering both pixel-level and instance-level consistencies. With the proposed method, we are able to detect an arbitrary number of planes. Extensive experiments on public datasets validate the effectiveness and efficiency of our method. Furthermore, our method runs at 30 fps at the testing time, thus could facilitate many real-time applications such as visual SLAM and human-robot interaction.

Point Cloud Segmentation

[8] New CVPR 2019 point cloud segmentation paper

Paper title: Associatively Segmenting Instances and Semantics in Point Clouds

Authors: Xinlong Wang, Shu Liu, Xiaoyong Shen, Chunhua Shen, Jiaya Jia

Paper link: https://arxiv.org/abs/1902.09852

Code link: https://github.com/WXinlong/ASIS

Abstract: A 3D point cloud describes the real scene precisely and intuitively. To date, how to segment diversified elements in such an informative 3D scene is rarely discussed. In this paper, we first introduce a simple and flexible framework to segment instances and semantics in point clouds simultaneously. Then, we propose two approaches which make the two tasks take advantage of each other, leading to a win-win situation. Specifically, we make instance segmentation benefit from semantic segmentation through learning semantic-aware point-level instance embedding. Meanwhile, semantic features of the points belonging to the same instance are fused together to make more accurate per-point semantic predictions. Our method largely outperforms the state-of-the-art method in 3D instance segmentation along with a significant improvement in 3D semantic segmentation.

3D Human Pose Estimation

[9] New CVPR 2019 3D human pose estimation paper

Paper title: RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation

Authors: Bastian Wandt, Bodo Rosenhahn

Paper link: https://arxiv.org/abs/1902.09868

Abstract: This paper addresses the problem of 3D human pose estimation from single images. While for a long time human skeletons were parameterized and fitted to the observation by satisfying a reprojection error, nowadays researchers directly use neural networks to infer the 3D pose from the observations. However, most of these approaches ignore the fact that a reprojection constraint has to be satisfied and are sensitive to overfitting. We tackle the overfitting problem by ignoring 2D to 3D correspondences. This efficiently avoids a simple memorization of the training data and allows for a weakly supervised training. One part of the proposed reprojection network (RepNet) learns a mapping from a distribution of 2D poses to a distribution of 3D poses using an adversarial training approach. Another part of the network estimates the camera. This allows for the definition of a network layer that performs the reprojection of the estimated 3D pose back to 2D which results in a reprojection loss function. Our experiments show that RepNet generalizes well to unknown data and outperforms state-of-the-art methods when applied to unseen data. Moreover, our implementation runs in real-time on a standard desktop PC.

3D Face

[10] New CVPR 2019 3D face paper

Paper title: Disentangled Representation Learning for 3D Face Shape

Authors: Zi-Hang Jiang, Qianyi Wu, Keyu Chen, Juyong Zhang

Paper link: https://arxiv.org/abs/1902.09887

Abstract: In this paper, we present a novel strategy to design disentangled 3D face shape representation. Specifically, a given 3D face shape is decomposed into identity part and expression part, which are both encoded and decoded in a nonlinear way. To solve this problem, we propose an attribute decomposition framework for 3D face mesh. To better represent face shapes which are usually nonlinear deformed between each other, the face shapes are represented by a vertex based deformation representation rather than Euclidean coordinates. The experimental results demonstrate that our method has better performance than existing methods on decomposing the identity and expression parts. Moreover, more natural expression transfer results can be achieved with our method than existing methods.

Video Captioning

[11] New CVPR 2019 video captioning paper

Paper title: Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning

Authors: Nayyer Aafaq, Naveed Akhtar, Wei Liu, Syed Zulqarnain Gilani, Ajmal Mian

Paper link: https://arxiv.org/abs/1902.10322

Abstract: Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recursive Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful designing of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying Short Fourier Transform to CNN features of the whole video. It additionally derives high level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish new state-of-the-art on MSVD and MSR-VTT datasets for METEOR and ROUGE_L metrics.

Semantic Segmentation

[12] New CVPR 2019 weakly supervised semantic segmentation paper

Paper title: FickleNet: Weakly and Semi-supervised Semantic Image Segmentation using Stochastic Inference

Authors: Jungbeom Lee, Eunji Kim, Sungmin Lee, Jangho Lee, Sungroh Yoon

Paper link: https://arxiv.org/abs/1902.10421

Abstract: The main obstacle to weakly supervised semantic image segmentation is the difficulty of obtaining pixel-level information from coarse image-level annotations. Most methods based on image-level annotations use localization maps obtained from the classifier, but these only focus on the small discriminative parts of objects and do not capture precise boundaries. FickleNet explores diverse combinations of locations on feature maps created by generic deep neural networks. It selects hidden units randomly and then uses them to obtain activation scores for image classification. FickleNet implicitly learns the coherence of each location in the feature maps, resulting in a localization map which identifies both discriminative and other parts of objects. The ensemble effects are obtained from a single network by selecting random hidden unit pairs, which means that a variety of localization maps are generated from a single image. Our approach does not require any additional training steps and only adds a simple layer to a standard convolutional neural network; nevertheless it outperforms recent comparable techniques on the Pascal VOC 2012 benchmark in both weakly and semi-supervised settings.

Video Processing

[13] New CVPR 2019 video processing paper

Paper title: Single-frame Regularization for Temporally Stable CNNs

Authors: Gabriel Eilertsen, Rafał K. Mantiuk, Jonas Unger

Paper link: https://arxiv.org/abs/1902.10424

Abstract: Convolutional neural networks (CNNs) can model complicated non-linear relations between images. However, they are notoriously sensitive to small changes in the input. Most CNNs trained to describe image-to-image mappings generate temporally unstable results when applied to video sequences, leading to flickering artifacts and other inconsistencies over time. In order to use CNNs for video material, previous methods have relied on estimating dense frame-to-frame motion information (optical flow) in the training and/or the inference phase, or by exploring recurrent learning structures. We take a different approach to the problem, posing temporal stability as a regularization of the cost function. The regularization is formulated to account for different types of motion that can occur between frames, so that temporally stable CNNs can be trained without the need for video material or expensive motion estimation. The training can be performed as a fine-tuning operation, without architectural modifications of the CNN. Our evaluation shows that the training strategy leads to large improvements in temporal smoothness. Moreover, in situations where the quantity of training data is limited, the regularization can help in boosting the generalization performance to a much larger extent than what is possible with naïve augmentation strategies.

Multi-view Geometry

[14] New CVPR 2019 multi-view geometry paper

Paper title: Recurrent MVSNet for High-resolution Multi-view Stereo Depth Inference

Authors: Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, Long Quan

Paper link: https://arxiv.org/abs/1902.10556

Code link: https://github.com/YoYo000/MVSNet

Abstract: Deep learning has recently demonstrated its excellent performance for multi-view stereo (MVS). However, one major limitation of current learned MVS approaches is the scalability: the memory-consuming cost volume regularization makes the learned MVS hard to be applied to high-resolution scenes. In this paper, we introduce a scalable multi-view stereo framework based on the recurrent neural network. Instead of regularizing the entire 3D cost volume in one go, the proposed Recurrent Multi-view Stereo Network (R-MVSNet) sequentially regularizes the 2D cost maps along the depth direction via the gated recurrent unit (GRU). This reduces dramatically the memory consumption and makes high-resolution reconstruction feasible. We first show the state-of-the-art performance achieved by the proposed R-MVSNet on the recent MVS benchmarks. Then, we further demonstrate the scalability of the proposed method on several large-scale scenarios, where previous learned approaches often fail due to the memory constraint.

Video Classification

[15] New CVPR 2019 video classification paper

Paper title: Efficient Video Classification Using Fewer Frames

Authors: Shweta Bhardwaj, Mukundhan Srinivasan, Mitesh M. Khapra

Paper link: https://arxiv.org/abs/1902.10640

Abstract: Recently, there has been a lot of interest in building compact models for video classification which have a small memory footprint (<1 GB). While these models are compact, they typically operate by repeated application of a small weight matrix to all the frames in a video. E.g. recurrent neural network based methods compute a hidden state for every frame of the video using a recurrent weight matrix. Similarly, cluster-and-aggregate based methods such as NetVLAD, have a learnable clustering matrix which is used to assign soft-clusters to every frame in the video. Since these models look at every frame in the video, the number of floating point operations (FLOPs) is still large even though the memory footprint is small. We focus on building compute-efficient video classification models which process fewer frames and hence have less number of FLOPs. Similar to memory efficient models, we use the idea of distillation albeit in a different setting. Specifically, in our case, a compute-heavy teacher which looks at all the frames in the video is used to train a compute-efficient student which looks at only a small fraction of frames in the video. This is in contrast to a typical memory efficient Teacher-Student setting, wherein both the teacher and the student look at all the frames in the video but the student has fewer parameters. Our work thus complements the research on memory efficient video classification. We do an extensive evaluation with three types of models for video classification, viz. (i) recurrent models (ii) cluster-and-aggregate models and (iii) memory-efficient cluster-and-aggregate models and show that in each of these cases, a see-it-all teacher can be used to train a compute efficient see-very-little student. We show that the proposed student network can reduce the inference time by 30% and the number of FLOPs by approximately 90% with a negligible drop in the performance.


Reposted from blog.csdn.net/zhuiqiuk/article/details/88095617