Person-reID: a collection of person re-identification papers

A summary of work on person re-identification from CVPR, ICCV, and NIPS 2017

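Person Re-identification Based on Fused Features (基于融合特征的行人再识别方法)

Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2017.3

Problem

  • Current person re-identification methods mostly focus on describing a pedestrian's appearance and on learning a distance metric between two images of the same pedestrian. Because brightness and camera angle vary across pedestrian images, appearance features that are invariant to these changes are hard to extract, so recognition rates on the public image datasets remain low

Method

  • Feature extraction based on fused features
    • The features comprise HSV color features, color histogram features, and histogram of oriented gradients (HOG) features.
    • Fusing the two color features (HSV and color histogram) makes the color information more discriminative.
    • The HOG features describe the relationships between pixels in local image regions

References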

http://kns.cnki.net/KCMS/detail/detail.aspx?filename=mssb201703010&dbname=CJFD&dbcode=CJFQ


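Person Re-identification Algorithm Based on Multi-confidence Re-ranking (多置信度重排序的行人再识别算法)

Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2017.11

Problem

  • To address the poor recognition results caused by similarity measurement errors in person re-identification, the paper proposes a multi-confidence re-ranking algorithm

Method

  • Extract descriptive features with ResNet50
  • Produce an initial ranking from the similarity between the target sample and the test samples
  • Build a set of similar samples from the initial ranking, compute each class's cluster center together with the minimum, maximum, and mean distance from the samples to that center, and define three confidence intervals with different confidence levels
  • Finally, re-rank the similarity between the target sample and the test samples using the Jaccard distance

Takeaways

  • The Jaccard distance measures the dissimilarity between two sets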
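
As a small illustration of the Jaccard distance mentioned above, the sketch below compares two sets of neighbor indices. The function name and the sample sets are hypothetical; the formula 1 - |A ∩ B| / |A ∪ B| is the standard definition, and the handling of two empty sets is a convention assumed here.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical neighbor sets of a probe image and a gallery image.
neighbors_probe = {1, 2, 3, 4}
neighbors_gallery = {3, 4, 5, 6}
print(jaccard_distance(neighbors_probe, neighbors_gallery))
```

In re-ranking, the two sets are typically the nearest-neighbor lists of the probe and of a gallery image, so a small distance means the two images share most of their neighbors.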

References

http://kns.cnki.net/KCMS/detail/detail.aspx?filename=mssb201711005&dbname=CJFD&dbcode=CJFQ


An Improved Deep Learning Architecture for Person Re-Identification

CVPR’15

Problem

  • A typical re-identification system takes as input two images, each of which usually contains a person’s full body, and outputs either a similarity score between the two images or a classification of the pair of images as same (if the two images depict the same person) or different (if the images are of different people)
  • In this paper, we follow this approach and use a novel deep learning network to assign similarity scores to pairs of images of human bodies

Method

  • See reference 1 below for a detailed explanation

References

Person retrieval: “An Improved Deep Learning Architecture for Person Re-Identification”

https://www.cv-foundation.org/openaccess/content_cvpr_2015/app/2B_062.pdf


Deep Feature Learning with Relative Distance Comparison for Person Re-identification

Pattern Recognition 2015

Problem

  • Although the effectiveness of the distance function has been demonstrated, it heavily relies on the quality of the features selected, and such selection requires deep domain knowledge and expertise

Method

  • Proposes a triplet loss
  • we train the network through a set of triplets. Each triplet contains three images, i.e. a query image, one matched reference (an image of the same person as that in the query image) and one mismatched reference
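
A minimal sketch of a triplet objective on one (query, matched, mismatched) triplet follows. It uses the common hinge form with squared Euclidean distances; the function name, the margin value, and the use of raw vectors in place of network outputs are assumptions for illustration, and the paper's exact objective may differ.

```python
import numpy as np


def triplet_loss(query, matched, mismatched, margin=1.0):
    """Hinge triplet loss on squared Euclidean distances (assumed form)."""
    d_pos = np.sum((query - matched) ** 2)      # distance to the same person
    d_neg = np.sum((query - mismatched) ** 2)   # distance to a different person
    return float(max(d_pos - d_neg + margin, 0.0))


query = np.array([0.0, 0.0])
matched = np.array([1.0, 0.0])
print(triplet_loss(query, matched, np.array([3.0, 0.0])))  # negative is far away
print(triplet_loss(query, matched, np.array([0.0, 1.0])))  # negative is as close as the positive
```

The loss is zero once the mismatched reference is farther from the query than the matched one by at least the margin, so only triplets that violate this ordering contribute gradients.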

References

https://arxiv.org/abs/1512.03622


Deep Transfer Learning for Person Re-identification

arXiv:1611

Problem

  • Person re-identification (Re-ID) poses a unique challenge to deep learning: how to learn a deep model with millions of parameters on a small training set of few or no labels

Method

  • First, a deep network architecture is designed which differs from existing deep Re-ID models in that (a) it is more suitable for transferring representations learned from large image classification datasets, and (b) classification loss and verification loss are combined, each of which adopts a different dropout strategy
  • Second, a two-stepped fine-tuning strategy is developed to transfer knowledge from auxiliary datasets.
  • Third, given an unlabelled Re-ID dataset, a novel unsupervised deep transfer learning model is developed based on co-training.

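Takeaways

  • Representation learning has become a very important baseline in the ReID field; these methods are fairly robust, train stably, and give results that are easy to reproduce
  • Representation learning tends to overfit to the domain of the dataset, and it loses effectiveness once the number of training IDs grows beyond a certain point

References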

https://arxiv.org/abs/1611.05244


A Discriminatively Learned CNN Embedding for Person Re-identification

TOMM 2017

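Problem

  • Person re-identification is generally tackled with two kinds of models: verification and identification
  • The two models have their respective advantages and limitations due to different loss functions
  • The authors aim to combine the strengths of both models to improve recognition accuracy

Method

  • identification loss + verification loss

Takeaways

  • Used for classification, the identification loss overfits easily (for example, if one person carries a backpack, the model may decide that anyone with a backpack is that person), so a regularizing term such as the verification loss is needed

References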

https://arxiv.org/abs/1611.05666


Person Re-Identification Using CNN Features Learned from Combination of Attributes

ICPR’16

Problem

  1. However, large disparity among the pre-trained task, i.e., ImageNet classification, and the target task, i.e., person image matching, limits performances of the CNN features for person re-identification
  2. Therefore, the discriminative power of CNN features solely fine-tuned on pedestrian attributes is typically insufficient

Method

  • For problem 1
    • we conduct a fine-tuning of CNN features on a pedestrian attribute dataset to bridge the gap of ImageNet classification and person re-identification
  • For problem 2
    • we focus on combinations of attributes for grouping similar people.

Takeaways

  • Re-training a pre-trained CNN for another task is called fine-tuning, which transfers the knowledge of pre-training data and significantly improves the performance on another task

References

https://pdfs.semanticscholar.org/0e80/10baaa8dd93b7077719d6c43629c070da6bf.pdf


Gated Siamese Convolutional Neural Network Architecture for Human Re-Identification

ECCV’16

Problem

  • several end-to-end deep Siamese CNN architectures have been proposed for human re-identification with the objective of projecting the images of similar pairs (i.e. same identity) to be closer to each other and those of dissimilar pairs to be distant from each other. However, current networks extract fixed representations for each image regardless of other images which are paired with it and the comparison with other images is done only at the final level

Method

  • we propose a gating function to selectively emphasize such fine common local patterns by comparing the mid-level features across pairs of images
  • The fundamental CNN architecture is modeled in a siamese fashion optimized by the contrastive loss function
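
For reference, a sketch of the contrastive loss in its widely used margin form is given below. The function name, the margin, and the toy feature vectors are assumptions; the gating function itself is not modeled, and the paper may use a variant of this loss.

```python
import numpy as np


def contrastive_loss(f1, f2, same, margin=1.0):
    """Contrastive loss for one pair of Siamese features (standard margin form)."""
    d = float(np.linalg.norm(f1 - f2))
    if same:
        # Similar pairs are pulled together.
        return 0.5 * d ** 2
    # Dissimilar pairs are pushed apart until they are at least `margin` away.
    return 0.5 * max(margin - d, 0.0) ** 2


f1 = np.array([0.0, 0.0])
f2 = np.array([3.0, 4.0])  # Euclidean distance 5 from f1
print(contrastive_loss(f1, f2, same=True))
print(contrastive_loss(f1, f2, same=False, margin=6.0))
```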

References

https://arxiv.org/abs/1607.08378


MARS: A Video Benchmark for Large-Scale Person Re-identification

ECCV’16

Problem

  • a few video re-id datasets exist [4, 15, 28, 36]. They are limited in scale: typically several hundred identities are contained, and the number of image sequences doubles
  • image sequences in these video re-id datasets are generated by hand-drawn bboxes. This process is extremely expensive, requiring intensive human labor
  • But in reality, pedestrian detectors will lead to part occlusion or misalignment which may have a non-ignorable effect on re-id accuracy
  • As a result, in practice one identity will have multiple probes and multiple sequences as ground truths. It remains unsolved how to make use of these visual cues

Method

  • collecting and annotating a new person re-identification dataset, named “Motion Analysis and Re-identification Set” (MARS)
  • instead of hand-drawn bboxes, we use the DPM detector [11] and GMMCP tracker [7] for pedestrian detection and tracking, respectively
  • Third, MARS includes a number of distractor tracklets produced by false detection or tracking result
  • the multiple-query and multiple-ground truth mode will enable future research in fields such as query re-formulation and search re-ranking

References

https://pdfs.semanticscholar.org/c038/7e788a52f10bf35d4d50659cfa515d89fbec.pdf


Unlabeled Samples Generated by GAN Improve the Person Re-identification Baseline in vitro

ICCV 2017

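Problem

  • Person re-identification datasets are relatively small

Method

  • Train a DCGAN on the training data (unsupervised learning), then use the generator to produce samples and train a convolutional network on them together with the real training data (semi-supervised learning)
  • The generated samples are unlabeled; the authors apply the LSRO method (label smoothing regularization for outliers) and give each of them a uniform label of 1/K over all classes (K is the total number of IDs)
    • This leaves room for improvement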
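
The labeling rule can be sketched as follows: real images keep a one-hot target, while generated images get the uniform 1/K target. The function names and the toy batch are hypothetical, and the cross-entropy helper is only there to show how such soft targets enter the loss.

```python
import numpy as np


def lsro_targets(labels, is_generated, num_ids):
    """Build target distributions: one-hot for real images, uniform 1/K for generated ones."""
    targets = np.zeros((len(labels), num_ids))
    for row, (label, generated) in enumerate(zip(labels, is_generated)):
        if generated:
            targets[row, :] = 1.0 / num_ids   # no identity is preferred
        else:
            targets[row, label] = 1.0         # ground-truth identity
    return targets


def cross_entropy(log_probs, targets):
    """Mean cross-entropy between soft targets and predicted log-probabilities."""
    return float(-(targets * log_probs).sum(axis=1).mean())


# One real image of identity 2 and one generated image (its label is ignored), K = 4.
targets = lsro_targets(labels=[2, -1], is_generated=[False, True], num_ids=4)
print(targets)
```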

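Takeaways

  • When data is scarce, a GAN can be used to fill the gap; even if the generated images are of poor quality, they help prevent overfitting to some extent

References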

https://arxiv.org/abs/1701.07717


Multi-pseudo Regularized Label for Generated Samples in Person Re-Identification

arXiv:1801

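Problem

  • Each ID has few images in person re-identification, and GAN-generated samples are unlabeled
  • The LSRO labeling scheme is unrealistic

Method

  • The authors propose the Multi-pseudo Regularized Label (MpRL) method
    • The label becomes α_k / K (K is the total number of IDs)
    • where α_k is the contribution from k-th pre-defined class in the dictionary α.

References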

https://arxiv.org/pdf/1801.06742.pdf


Learning Deep Feature Representations with Domain Guided Dropout for Person Re-identification

CVPR 2016

Problem

  • Learning generic and robust feature representations with data from multiple domains for the same problem is of great value, especially for the problems that have multiple datasets but none of them are large enough to provide abundant data variations
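  • In other words, learning generic, robust feature representations from multiple datasets for the same problem is very valuable, especially when many datasets exist but none is large enough on its own.

Method

  • Train on multiple training sets and propose Domain Guided Dropout (DGD)
  • Domain Guided Dropout: a simple yet effective method of muting non-related neurons for each domain.
  • While features are being learned, the method suppresses the neurons that are inactive for them and strengthens the ones that are active; to some extent this reduces the number of parameters being trained and improves performance

Takeaways

  • When training on multiple datasets, neurons are activated to different degrees for different features, and DGD can be used as a regularizer in this setting

References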

https://arxiv.org/abs/1604.07528


Person Re-identification in the Wild

CVPR’17

Problem

  • Our baselines address three issues: the performance of various combination of detectors and recognizers, mechanisms for pedestrian detection to help improve overall re-identification (re-ID) accuracy and assessing the effectiveness of different detectors for re-ID.
  • Current datasets lack annotations for such combined evaluation of person detection and re-ID.
  • person re-ID datasets, such as VIPeR [16] or CUHK03 [21], usually provide just cropped bounding boxes without the complete video frames, especially at a large scale
  • As a consequence, a large-scale dataset that evaluates both detection and overall re-ID is needed

Method

  • Proposes a new dataset, PRW

Takeaways

  • Detectors matter a great deal for re-ID

References

http://openaccess.thecvf.com/content_cvpr_2017/papers/Zheng_Person_Re-Identification_in_CVPR_2017_paper.pdf


Joint Detection and Identification Feature Learning for Person Search

CVPR 2017

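Problem

  • Although numerous person re-id datasets and methods have been proposed, there is still a big gap between the problem setting itself and real-world applications. In most benchmarks, the gallery only contains manually cropped pedestrian images, while in real applications, the goal is to find a target person in a gallery of whole scene images
  • In other words, many methods work with manually cropped images, whereas in practice pedestrians must first be detected against the scene background
  • Traditional pairwise or triplet distance loss functions are computationally too expensive
  • With a Softmax loss, training slows down as the number of identities grows and may even fail to converge

Method

  • Train a CNN made of two parts:
    • a pedestrian proposal net (Faster R-CNN) that produces bounding boxes of candidate pedestrians
    • an identification net that extracts features for comparison with the query target
    • during joint optimization the two parts adapt to each other, which removes problems that each would otherwise inherit from the other
  • Proposes the Online Instance Matching (OIM) loss function

Takeaways

  • Detection can be run first, before further processing
  • The OIM loss copes better with the case where there are too many identity classes while a mini-batch contains too few distinct samples, which otherwise makes the classifier impossible to train

References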

https://github.com/ShuangLI59/person_search

http://openaccess.thecvf.com/content_cvpr_2017/papers/Xiao_Joint_Detection_and_CVPR_2017_paper.pdf


In Defense of the Triplet Loss for Person Re-Identification

arXiv:1703

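Problem

  • Classification loss: when the number of identities is large, the number of network parameters grows sharply, and many of those parameters are discarded once training ends.
  • Verification loss: it can only judge the similarity of two images as a pair, so it is hard to apply to clustering and retrieval, because one-to-one comparison is too slow.
  • Triplet loss: without hard mining, training stalls and converges to poor results, while selecting triplets that are too hard makes training unstable and convergence difficult

Method

Proposes the triplet hard loss

  • Compares several triplet-style losses experimentally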
    • Large Margin Nearest Neighbor loss
    • FaceNet Triplet Loss
    • Batch All Triplet Loss
    • Batch Hard Triplet Loss
    • Lifted Embedding Loss

Takeaways

  • The triplet hard loss outperforms the other losses

References

Re-ID with Triplet Loss

https://arxiv.org/abs/1703.07737


Beyond triplet loss: a deep quadruplet network for person re-identification

CVPR’17

Problem

  • the triplet loss pays main attentions on obtaining correct orders on the training set
  • It still suffers from a weaker generalization capability from the training set to the testing set, thus resulting in inferior performance
  • In short, the triplet loss generalizes poorly

Method

  • we design a quadruplet loss, which can lead to the model output with a larger inter-class variation and a smaller intra-class variation compared to the triplet loss
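  • $L_{quad} = \sum_{i,j,k}^{N} [\, g(x_i, x_j)^2 - g(x_i, x_k)^2 + \alpha_1 \,]_+$
    • $+ \sum_{i,j,k,l}^{N} [\, g(x_i, x_j)^2 - g(x_l, x_k)^2 + \alpha_2 \,]_+$
    • $s_i = s_j,\ s_l \neq s_k,\ s_i \neq s_l,\ s_i \neq s_k$
  • The first term is the conventional triplet loss; the second term further reduces the intra-class variation
  • Because the first term matters more, the authors keep $\alpha_1 > \alpha_2$.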
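
The two hinge terms of the quadruplet loss can be evaluated for a single quadruplet as in the sketch below. In the paper g is a learned metric; here squared Euclidean distance stands in for g², and the function name, margin values, and sample points are assumptions for illustration.

```python
import numpy as np


def quadruplet_loss(x_i, x_j, x_k, x_l, alpha1=1.0, alpha2=0.5):
    """Loss for one quadruplet: x_i, x_j share an identity; x_k, x_l are two other identities."""
    def g2(a, b):
        # Stand-in for the learned metric g(., .)^2 (assumption).
        return float(np.sum((a - b) ** 2))

    # Triplet term: the positive pair must be closer than a negative pair sharing the anchor.
    triplet_term = max(g2(x_i, x_j) - g2(x_i, x_k) + alpha1, 0.0)
    # Extra term: the positive pair must also be closer than a negative pair
    # that does not involve the anchor at all.
    extra_term = max(g2(x_i, x_j) - g2(x_l, x_k) + alpha2, 0.0)
    return triplet_term + extra_term


x_i, x_j = np.array([0.0, 0.0]), np.array([1.0, 0.0])
x_k, x_l = np.array([2.0, 0.0]), np.array([2.0, 1.0])
print(quadruplet_loss(x_i, x_j, x_k, x_l))
```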

References

http://arxiv.org/abs/1704.01719


Improving Person Re-identification by Attribute and Identity Learning

arXiv:1703

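Problem

  • Attribute recognition focuses on local characteristics of a person
  • Person re-ID focuses on the person as a whole
  • The authors aim to combine the two

Method

  • Trains a single CNN that learns re-ID and predicts pedestrian attributes at the same time
  • This multi-task method integrates an ID classification loss and a number of attribute classification losses, and back-propagates the weighted sum of the individual losses

Takeaways

  • Identification can be constrained with attributes (although obtaining the data may be a problem)
  • Combining this with LSTM or attention mechanisms may be worth trying

References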

https://arxiv.org/abs/1703.07220


SVDNet for Pedestrian Retrieval

ICCV 2017

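Problem

  • When training a deep convolutional neural network (CNN) to extract pedestrian features for re-ID, as in any typical deep learning training, the learned weight vectors are usually "disorganized": weight vectors within the same layer tend to be strongly correlated (note that this is correlation, not linear dependence). This correlation can introduce unnecessary, or even harmful, redundancy into the feature representation

Method

  • The base network is ResNet-50

    Training has three steps, together called Restraint and Relaxation Iteration (RRI)

    1. Decorrelation: each time the model converges, apply singular value decomposition to the weight matrix W of the feature layer, i.e. W = USV', and replace W with US. The weight vectors of the new W are mutually orthogonal and are eigenvectors of WW' for the original W. After this decorrelation, the previously converged model moves away from its local optimum and the classification loss on the training set increases.
    2. Restraint: keep the W from step 1 fixed and train the other layers until the network converges again. Note that the network converges to a sub-optimal solution here, because the W of one layer is constrained. The constraint is therefore lifted in the next step and training continues.
    3. Relaxation: after step 2, unfreeze W. The network now finds a better solution for fitting the training samples, but only for that objective. Our experiments show that this model generalizes relatively poorly when applied to the test set (which contains entirely new IDs).

    After step 3, the weight vectors in W become correlated again. The three steps are therefore iterated, forming RRI, until final convergence.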
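
The decorrelation step can be checked numerically with a small NumPy sketch. The matrix shape and random values are hypothetical, and weight vectors are taken to be the columns of W; the code only verifies the algebra of replacing W with US and does not train anything.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature layer: 8 inputs, 5 weight vectors stored as columns.
W = rng.normal(size=(8, 5))

# W = U S V'; replacing W with U S drops the rotation V'.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_decorrelated = U * s  # same as U @ np.diag(s)

# Columns of U S are mutually orthogonal: the Gram matrix is diagonal.
gram = W_decorrelated.T @ W_decorrelated
off_diagonal = gram - np.diag(np.diag(gram))
print(np.abs(off_diagonal).max())

# W W' is unchanged by the replacement.
print(np.allclose(W_decorrelated @ W_decorrelated.T, W @ W.T))
```

Because WW' is preserved, the columns of US point along the eigenvectors of WW', which is the property stated in step 1.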

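Takeaways

  • Orthogonality can be imposed during training either with a "hard" decorrelation step or with a "soft" loss or regularization term
  • Orthogonalization removes correlation
  • Training alternates between restraint and relaxation

References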

https://zhuanlan.zhihu.com/p/29326061

https://arxiv.org/abs/1703.05693


Person Re-Identification by Deep Joint Learning of Multi-Loss Classification

IJCAI’17

Problem

  • Existing person re-identification (re-id) methods rely mostly on either localised or global feature representation alone. This ignores their joint benefit and mutual complementary effects

Method

  • Proposes Joint Learning Multi-Loss (JLML)
  • This JLML model consists of a two-branch CNN network:
    • One local branch of m streams of an identical structure with each stream learning the most discriminative local visual features for one of m local image regions of a person bounding box image;
    • Another global branch responsible for learning the most discriminative global level features from the entire person image
  • Sharing the low-level conv layer reduces the model parameter size and therefore the risk of model overfitting
  • the per-branch learning behaviour is conditioned independently on the respective feature representation
  • noise and data covariance between local and global representations
  • $l_{global} = l + \lambda_{global} \|W_G\|_{2,1}$, $\quad l_{local} = l + \lambda_{local} \|W_L\|_{1,2}$

References

https://arxiv.org/abs/1705.04724


HydraPlus-Net: Attentive Deep Features for Pedestrian Analysis

ICCV’17

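Problem

  • Current mainstream methods capture only global features, and semantic features with local responses are hard to obtain

Method

  • Proposes HP-Net, which consists of M-Net and AF-Net.
    • The main network (M-Net) is a CNN
    • The attentive feature network (AF-Net) has multiple branches of MDA (multi-directional attention) modules applied at different semantic feature levels
  • HP-Net not only captures local and global information but also assembles features according to the semantics of different layers

References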

http://openaccess.thecvf.com/content_ICCV_2017/papers/Liu_HydraPlus-Net_Attentive_Deep_ICCV_2017_paper.pdf


Margin Sample Mining Loss: A Deep Learning Based Method for Person Re-identification

arXiv:1710

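Problem

  • Compared with the triplet loss, the quadruplet loss takes the absolute distance between positive and negative pairs into account, while the TriHard loss introduces hard sample mining; MSML combines both advantages

Method

  • $L_{msml} = \left( \max_{a,p} d_{a,p} - \min_{m,n} d_{m,n} + \alpha \right)_+$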
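
A minimal sketch of this loss over one batch: take the single largest distance among same-identity pairs and the single smallest distance among different-identity pairs, then apply the hinge. Euclidean distance, the function name, the margin values, and the toy batch are assumptions for illustration.

```python
import numpy as np


def msml_loss(embeddings, labels, alpha=0.3):
    """Margin sample mining loss: hardest positive pair vs. hardest negative pair in the batch."""
    labels = np.asarray(labels)
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    hardest_pos = dist[same].max()    # max over pairs (a, p) with the same identity
    hardest_neg = dist[~same].min()   # min over pairs (m, n) with different identities
    return float(max(hardest_pos - hardest_neg + alpha, 0.0))


embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
labels = [0, 0, 1, 1]
print(msml_loss(embeddings, labels, alpha=0.3))
print(msml_loss(embeddings, labels, alpha=1.5))
```

Unlike a per-anchor triplet loss, the negative pair here need not share a sample with the positive pair, which is how the loss constrains absolute distances across the whole batch.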

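Takeaways

  • MSML is a new metric learning method [reference 1] that absorbs the strengths of several existing metric learning methods and further improves model generalization
  • The paper presents this loss for person re-identification, but it is a general metric learning method that applies to image retrieval at large.

References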

Margin Sample Mining Loss: A Deep Learning Based Method for Person Re-identification

https://arxiv.org/abs/1710.00478v3


Re-ranking Person Re-identification with k-reciprocal Encoding

CVPR’17

Problem

  • Our re-ranking method does not require any human interaction or any labeled data, so it is applicable to large-scale datasets.

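Method

  • Proposes a re-ranking method that encodes features to reduce computation, introduces the Jaccard distance, and combines it with the original distance in a weighted sum to compute the final rank list
  • The method needs no human interaction and no labeled data, so it can be applied on top of any re-ID method to improve its performance, and it scales to large datasets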
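
A simplified sketch of the idea: build k-reciprocal neighbor sets from a distance matrix, compute the Jaccard distance between those sets, and blend it with the original distance. The function names, k, and the weight `lam` are hypothetical; the paper's feature encoding and neighbor-set expansion are omitted here, so this shows the principle rather than the published algorithm.

```python
import numpy as np


def k_reciprocal_sets(dist, k):
    """For each sample, keep the k-nearest neighbors that also list it among their own k-nearest."""
    order = np.argsort(dist, axis=1)[:, :k]   # each sample is its own nearest neighbor
    knn = [set(row.tolist()) for row in order]
    return [{j for j in knn[i] if i in knn[j]} for i in range(len(knn))]


def rerank(dist, k=2, lam=0.3):
    """Final distance = (1 - lam) * Jaccard distance + lam * original distance."""
    sets = k_reciprocal_sets(dist, k)
    n = len(sets)
    jaccard = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = sets[i] | sets[j]
            jaccard[i, j] = 1.0 - len(sets[i] & sets[j]) / len(union)
    return (1.0 - lam) * jaccard + lam * dist


# Four samples on a line: two tight clusters far apart.
points = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(points[:, None] - points[None, :])
final = rerank(dist, k=2, lam=0.3)
print(final[0, 1], final[0, 2])
```

Only the distance matrix is needed, which is why the approach can sit on top of any re-ID feature without labels or user feedback.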

Takeaways

  • Re-ranking is an effective technique for boosting the performance of ReID

References

https://arxiv.org/abs/1701.08398


AlignedReID: Surpassing Human-Level Performance in Person Re-Identification

arXiv:1711

Problem

  • Traditional approaches have focused on low-level features such as colors, shapes, and local descriptors. With the renaissance of deep learning, the convolutional neural network (CNN) has dominated this field
  • In other words, traditional approaches relied on low-level features, whereas CNN-based methods now dominate the field
  • Many CNN-based approaches learn a global feature, without considering the spatial structure of the person. This has a few major drawbacks:
    • inaccurate person detection boxes might impact feature learning.
    • the pose change or non-rigid body deformation makes the metric learning difficult.
    • occluded parts of the human body might introduce irrelevant context into the learned feature.
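    • it is non-trivial to emphasize local differences in a global feature, especially when we have to distinguish two people with very similar appearances.
  • In other words, many CNN-based methods learn only a global feature and ignore the spatial structure of the human body, which causes the following problems:
    • inaccurate person detection boxes can affect feature learning;
    • pose changes and body deformation make metric learning difficult;
    • occluded body parts can introduce irrelevant context;
    • emphasizing local differences in a global feature is difficult, especially when distinguishing two people who look very similar
  • To address these problems, earlier work concentrated on part-based, local feature learning. Some studies split the body into a few fixed parts without considering the correspondence between those parts, which does not solve the problems above. Others use pose estimation to help align the parts, but that requires extra supervision and a pose estimation step.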

Method

  • In this paper, we propose a new approach, called AlignedReID, which still learns a global feature, but performs an automatic part alignment during the learning, without requiring extra supervision or explicit pose estimation.
  • That is, the proposed method still learns a global feature but aligns the parts automatically, with no extra supervision or explicit pose estimation.
  • In the local branch, we align local parts by introducing a shortest path loss.
  • That is, when learning local features, alignment is done by computing a shortest path.
  • In the inference stage, we discard the local branch and only extract the global feature.
  • That is, only the global feature is used at inference time, not the local features.
  • In other words, the global feature itself, with the aid of local features learning, can greatly address the drawbacks we mentioned above, in our new joint learning framework.
  • Put differently, the global feature learned with the help of local feature learning can address the four problems of CNN-based methods listed above.
  • In addition, the form of global feature keeps our approach attractive for the deployment of a large ReID system, without costly local features matching.
  • The authors add that, because the output is a global feature, the method remains practical for large-scale person re-identification without expensive local feature matching.
  • We also adopt a mutual learning approach in the metric learning setting, to allow two models to learn better representations from each other.
  • For metric learning, the authors adopt a mutual learning approach and report very good results.
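
The shortest-path alignment can be sketched with dynamic programming over a matrix of part-to-part distances, moving only right or down from the top-left to the bottom-right cell. How the local distance matrix is produced and normalized is not covered in these notes, so the matrix is taken as given; the function name and the sample values are hypothetical.

```python
import numpy as np


def shortest_path_distance(local_dist):
    """Cost of the cheapest monotone path from (0, 0) to (H-1, W-1), moving right or down."""
    h, w = local_dist.shape
    total = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            if i == 0 and j == 0:
                best_prev = 0.0
            elif i == 0:
                best_prev = total[i, j - 1]
            elif j == 0:
                best_prev = total[i - 1, j]
            else:
                best_prev = min(total[i - 1, j], total[i, j - 1])
            total[i, j] = best_prev + local_dist[i, j]
    return float(total[-1, -1])


# Entry (i, j) is the distance between horizontal stripe i of image A and stripe j of image B.
local_dist = np.array([[1.0, 2.0],
                       [3.0, 1.0]])
print(shortest_path_distance(local_dist))
```

Restricting the path to right and down moves keeps the top-to-bottom order of body parts, so a head stripe cannot be matched below a leg stripe.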

Takeaways

  • the end-to-end learning with structure prior is more powerful than a “blind” end-to-end learning.

References

Re-ID: AlignedReID: Surpassing Human-Level Performance in Person Re-Identification, paper walkthrough

https://arxiv.org/abs/1711.08184


Natural Language Object Retrieval

CVPR’16

Problem

  • The goal is to retrieve objects using natural language
  • Although both text-based image retrieval and natural language object retrieval involve jointly modeling images and text, they are different vision and language domains with domain shift from whole images to bounding boxes

Method

[Figure: overview of the method; the image is missing from the source]

References

https://arxiv.org/abs/1511.04164


Attention-based Natural Language Person Retrieval

CVPR’17 Workshop

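Problem

  • The goal is to retrieve people using natural-language descriptions of their attributes
  • No dataset existed for this task

Method

  • Data collection
    • Based on the CITYSCAPES dataset
    • Regions annotated with Fast R-CNN
    • Attributes annotated through Amazon crowdsourcing
  • Network (the architecture figure is missing from the source)

Takeaways

  • BLSTM outperforms LSTM
  • The method does not yet perform well, so there is room for further work

References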

https://arxiv.org/abs/1705.08923


Deep Subspace Clustering Networks

NIPS’17

Problem

  • Earlier subspace clustering methods are mostly linear
  • For nonlinear methods, the selection of different kernel types is largely empirical, and there is no clear reason to believe that the implicit feature space corresponding to a predefined kernel is truly well-suited to subspace clustering

Method

  • Proposes a new layer that encodes the notion of self-expressiveness

References

https://arxiv.org/abs/1709.02508


Pose Guided Person Image Generation

NIPS’17

Problem

  • It is difficult for a complete end-to-end framework to do this because it has to generate both correct poses and detailed appearance simultaneously.

Method

  • dividing the problem into two stages
    1. At stage-I, we explore different ways to model pose information
    2. At stage-II, a variant of Deep Convolutional GAN (DCGAN) model is used to further refine the initial generation result.

Takeaways

  • Can be used to augment datasets

References

http://papers.nips.cc/paper/6644-pose-guided-person-image-generation


Consistent-Aware Deep Learning for Person Re-identification in a Camera Network

CVPR’17

Problem

  • pairwise re-identification methods cannot obtain the globally optimal matching results for a whole camera network.

Method

  • we used a gradient descent algorithm to seek the globally optimal matching by maximizing the sum of all matching similarity for all camera pairs, while satisfying all the consistent constraints simultaneously

References

http://openaccess.thecvf.com/content_cvpr_2017/papers/Lin_Consistent-Aware_Deep_Learning_CVPR_2017_paper.pdf


Fast Person Re-identification via Cross-camera Semantic Binary Transformation

CVPR’17

Problem

  • Numerous methods have been proposed for person re-identification, most of which however neglect the matching efficiency.
  • Recently, several hashing based approaches have been developed to make re-identification more scalable for large-scale gallery sets. Despite their efficiency, these works ignore cross-camera variations, which severely deteriorate the final matching accuracy.

Method

  • Proposes CSBT, which aims to transform original high-dimensional feature vectors into compact identity-preserving binary codes
    • CSBT first employs a subspace projection to mitigate cross-camera variations, by maximizing intra-person similarities and inter-person discrepancies
    • Subsequently, a binary coding scheme is proposed via seamlessly incorporating both the semantic pairwise relationships and local affinity information
    • Finally, a joint learning framework is proposed for simultaneous subspace projection learning and binary coding based on discrete alternating optimization

Takeaways

  • Faster matching

References

http://openaccess.thecvf.com/content_cvpr_2017/papers/Chen_Fast_Person_Re-Identification_CVPR_2017_paper.pdf


Learning Deep Context aware Features over Body and Latent Parts for Person Reidentification

CVPR’17

Problem

There are still two problems:

  • First, for feature learning, current popular DCNN models typically stack single-scale convolution and max pooling layers to generate deep networks. With the increase of the number of layers, these DCNN models could easily miss some small scale visual cues, such as sunglasses and shoes. However, these fine-grained attributes are very useful to distinguish the pedestrian pairs with small inter-class variations. Thus these DCNN models are not the best choice for pedestrian feature learning.
  • Second, due to the pose variations and imperfect pedestrian detectors, the pedestrian image samples may be misaligned. Sometimes they may have some backgrounds or lack some parts, e.g. legs. In these cases, for part-based representation, the predefined rigid grids may fail to capture correct correspondence between two pedestrian images. Thus the rigid predefined grids are far from robust for effective part-based feature learning

Method

  • To solve the first problem, we propose a Multi-Scale Context-Aware Network (MSCAN) , for each convolutional layer of the MSCAN, we adopt multiple convolution kernels with different receptive fields to obtain multiple feature maps. Feature maps from different convolution kernels are concatenated as current layer’s output. To decrease the correlations among different convolution kernels, the dilated convolution [45] is used rather than general convolution kernels. Through this way, multi-scale context knowledge is obtained at the same layer. Thus the local visual cues for fine-grained discrimination is enhanced. In addition, through embedding contextual features layer-by-layer (convolution operation across layers), MSCAN can obtain more context-aware representation for input image.
  • To solve the second problem, instead of using rigid body parts, we propose to localize latent pedestrian parts through Spatial Transform Networks (STN) [13], which is originally proposed to learn image transformation. To adapt it to the pedestrian part localization task, we propose three new constraints on the learned transformation parameters. With these constraints, more flexible parts can be localized at the informative regions, so as to reduce the distraction of background contents.

References

http://openaccess.thecvf.com/content_cvpr_2017/papers/Li_Learning_Deep_Context-Aware_CVPR_2017_paper.pdf


Pedestrian Alignment Network for Large-scale Person Re-identification

arXiv:1707

Problem

  • person re-ID usually adopts automatic detectors to obtain cropped pedestrian images. However, this process suffers from two types of detector errors: excessive background and part missing. Both errors deteriorate the quality of pedestrian alignment and may compromise pedestrian matching due to the position and scale variances

Method

  • Proposes the pedestrian alignment network (PAN)

References

Pedestrian alignment + re-identification network

https://arxiv.org/abs/1707.00408


Scalable Person Re-identification on Supervised Smoothed Manifold

CVPR’17

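Problem

  • However, the underlying manifold which those images reside on is rarely investigated. That raises a problem that the learned metric is not smooth with respect to the local geometry structure of the data manifold
  • For why semi-supervised and unsupervised approaches are not used, see the detailed discussion in the fourth paragraph of the paper's Introduction

Method

  • Proposes the Supervised Smoothed Manifold (SSM)
  • Takes CNN features as the input of the algorithm
  • The core idea is to use a supervised manifold embedding to obtain a reduced-dimension representation of pedestrian image feature vectors,
  • while enforcing constraints on intra-class and inter-class distances during the reduction
  • Belongs to subspace learning (manifold learning)

References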

http://openaccess.thecvf.com/content_cvpr_2017/papers/Bai_Scalable_Person_Re-Identification_CVPR_2017_paper.pdf


Divide and Fuse: A Re-ranking Approach for Person Re-identification

BMVC’17

Problem

  • Feature diversity is important, but in many circumstances, only one type of pedestrian feature is available

Method

  • Proposes the “Divide and Fuse” re-ranking framework for person re-ID
  • It exploits the diversity from different parts of a high-dimensional feature vector for fusion-based re-ranking,
  • Features are divided into sub-vectors before re-encoded into a new vector. The new vectors are fused into one vector for ranking

References

https://arxiv.org/abs/1708.04169


Recurrent Convolutional Network for Video-based Person Re-Identification

CVPR’16

Problem

  • The problem of re-identification has been extensively explored for still images, however the video-based re-identification problem has not had the same attention
  • This is one of the early works on video-based re-ID

Method

![](Recurrent Convolutional Network for Video-based Person Re-Identification.png)

  • A CNN extracts features from each frame, and an RNN links the information across frames
  • Siamese Network
  • Joint Identification and Verification

References

https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/McLaughlin_Recurrent_Convolutional_Network_CVPR_2016_paper.pdf


Top-push Video-based Person Re-identification

CVPR’16

Problem

  • However, we find that when using video-based representation, some inter-class difference can be much more obscure than the one when using still-image-based representation, because different people could not only have similar appearance but also have similar motions and actions which are hard to align

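Method

  • For image sequences (videos), the paper extracts HOG3D and other features
  • For appearance, color histograms and LBP features are used
  • we propose a top-push distance learning model (TDL)
  • In TDL, we specially consider the optimization of the Mahalanobis distance:
    • $D(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$
    • Let $X_{i,j}$ denote the outer product of the difference vector with itself:
    • $X_{i,j} = (x_i - x_j)(x_i - x_j)^T$
    • The distance can then be written as:
    • $D(x_i, x_j) = \mathrm{tr}(M X_{i,j})$
  • TDL objectives
    • First, minimize the intra-class distance:
    • $\min \sum_{x_i, x_j,\, y_i = y_j} D(x_i, x_j)$
    • Second, keep every intra-class distance smaller, by a margin $\rho$, than the smallest inter-class distance:
    • $D(x_i, x_j) + \rho < \min_{y_k \neq y_i} D(x_i, x_k), \quad y_i = y_j$
    • Rewritten as a hinge term to minimize:
    • $\min \sum_{x_i, x_j,\, y_i = y_j} \max\{ D(x_i, x_j) - \min_{y_k \neq y_i} D(x_i, x_k) + \rho,\ 0 \}$
  • TDL loss function
    • $f(M) = (1 - \alpha) \sum_{x_i, x_j,\, y_i = y_j} \mathrm{tr}(M X_{i,j}) + \alpha \sum_{x_i, x_j,\, y_i = y_j} \max\{ D(x_i, x_j) - \min_{y_k \neq y_i} D(x_i, x_k) + \rho,\ 0 \}$
    • Differentiating with respect to $M$ gives the gradient:
    • $G_t = \frac{\partial f}{\partial M}\Big|_{M = M_t} = (1 - \alpha) \sum_{i,j} X_{i,j} + \alpha \sum_{(i,j,k) \in N(M_t)} (X_{i,j} - X_{i,k})$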
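
The step from the quadratic form of the Mahalanobis distance to the trace form can be verified numerically. The vectors and the matrix M below are random, hypothetical values; M is built as A A' only so that it is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(0)
x_i = rng.normal(size=4)
x_j = rng.normal(size=4)
A = rng.normal(size=(4, 4))
M = A @ A.T  # positive semi-definite by construction

diff = x_i - x_j
d_quadratic = diff @ M @ diff       # (x_i - x_j)' M (x_i - x_j)
X_ij = np.outer(diff, diff)         # (x_i - x_j)(x_i - x_j)'
d_trace = np.trace(M @ X_ij)        # tr(M X_ij)

print(d_quadratic, d_trace)
```

Writing the distance as a trace makes it linear in M, which is why the gradient of each distance term with respect to M is simply the corresponding outer-product matrix.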

References

Study notes: Top-push Video-based Person Re-identification

https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/You_Top-Push_Video-Based_Person_CVPR_2016_paper.pdf


See the Forest for the Trees: Joint Spatial and Temporal Recurrent Neural Networks for Video-based Person Re-identification

CVPR’17

Problem

  • Two steps are usually involved in previous approaches, namely feature learning and metric learning. But most of the existing approaches only focus on either feature learning or metric learning. Meanwhile, many of them do not take full use of the temporal and spatial information

Method

  • build an end-to-end deep neural network architecture to jointly learn features and metrics
  • It accepts a triplet of image sequences as the inputs. After extracting features by a CNN, we apply a temporal RNN to improve feature learning. Meanwhile, spatial RNNs are exploited to learn a good metric. Therefore, the proposed method jointly performs feature learning and metric learning, and integrates both temporal and spatial information at the same time.

Takeaways

  • The Spatial Recurrent Model (SRM) can be used for metric learning on video

References

http://openaccess.thecvf.com/content_cvpr_2017/papers/Zhou_See_the_Forest_CVPR_2017_paper.pdf


Person Re-identification by Local Maximal Occurrence Representation and Metric Learning

CVPR’15

Problem

  • Two fundamental problems are critical for person re-identification, feature representation and metric learning. An effective feature representation should be robust to illumination and viewpoint changes, and a discriminant metric should be learned to match various person images

Method

  • The authors apply the Retinex algorithm to handle illumination differences between cameras
  • With the Retinex images, we apply the HSV color histogram to extract color features.
  • In addition to color description, we also apply the Scale Invariant Local Ternary Pattern (SILTP) [26] descriptor for illumination invariant texture description
  • Proposes the LOMO feature extraction method for dealing with viewpoint changes

Takeaways

  • Not a deep learning method
  • The preprocessing approach is worth borrowing

References

https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liao_Person_Re-Identification_by_2015_CVPR_paper.pdf


A Siamese Long Short-Term Memory Architecture for Human Re-Identification

ECCV’16

Problem

  • In the existing works concentrating on feature extraction, representations are formed locally and independent of other regions.

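Method

  • Features are extracted with LOMO and Color Names, split into blocks, and fed into an LSTM
  • The outputs are fused together at the end

Takeaways

  • This block-wise approach demands good alignment; AlignedReID, however, offers an automatic alignment method

References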

https://arxiv.org/abs/1607.08381


Deeply-Learned Part-Aligned Representations for Person Re-Identification

ICCV’17

Problem

  • We propose a simple yet effective human part-aligned representation for handling the body part misalignment problem

Method

  • Our formulation, inspired by attention models, is a deep neural network modeling the three steps together, which is learnt through minimizing the triplet loss function without requiring body part labeling information
  • The part-aligned representation extractor, is a deep neural network, consisting of a fully convolutional neural network (FCN) whose output is an image feature map, followed by a part net which detects part maps and outputs the part features extracted over the parts

Takeaways

  • The part net in this paper is worth borrowing

References

http://openaccess.thecvf.com/content_ICCV_2017/papers/Zhao_Deeply-Learned_Part-Aligned_Representations_ICCV_2017_paper.pdf


Pose Invariant Embedding for Deep Person Re-identification

arXiv:1701

Problem

  • Pedestrian misalignment, which mainly arises from detector errors and pose variations, is a critical problem for a robust person re-identification (re-ID) system. With bad alignment, the background noise will significantly compromise the feature learning and matching process

Method

  • this paper introduces the pose invariant embedding (PIE) as a pedestrian descriptor
    • First, in order to align pedestrians to a standard pose, the PoseBox structure is introduced, which is generated through pose estimation followed by affine transformations
    • Second, to reduce the impact of pose estimation errors and information loss during PoseBox construction, we design a PoseBox fusion (PBF) CNN architecture that takes the original image, the PoseBox, and the pose estimation confidence as input
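  • A pedestrian is typically described by 14 keypoints, which divide the body structure into several regions.
  • To extract local features at different scales, the authors define three different PoseBox combinations.
  • The three PoseBox-rectified images and the original, unrectified image are then fed into the network together to extract a feature that contains both global and local information

References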

https://arxiv.org/abs/1701.07732


Spindle Net: Person Re-identification with Human Body Region Guided Feature Decomposition and Fusion

CVPR’17

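Problem

  • Moreover, the person body misalignment caused by detectors or pose variations is sometimes too severe for feature matching across images

Method

  • Proposes Spindle Net, based on human body region guided multi-stage feature decomposition and tree-structured competitive feature fusion
  • First, a skeleton keypoint extraction network extracts 14 body keypoints, which are then used to derive 7 body-structure ROIs.
  • All feature-extraction CNNs in the network (shown in orange in the paper's figure) share parameters, and this CNN is split into three sequential sub-networks: FEN-C1, FEN-C2 and FEN-C3. For an input pedestrian image, a pretrained skeleton keypoint extraction CNN (shown in blue) obtains the 14 body keypoints, from which the 7 ROIs are derived: three large regions (head, upper body, lower body) and four small limb regions. These 7 ROIs and the original image go through the same CNN to extract features.
  • The original image passes through the complete CNN to yield one global feature. The three large regions pass through the FEN-C2 and FEN-C3 sub-networks to yield three local features. The four limb regions pass through the FEN-C3 sub-network to yield four local features. These 8 features are then concatenated at different scales, as illustrated in the paper's figure, giving a final re-ID feature that fuses the global feature with local features at multiple scales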
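
The step from 14 keypoints to 7 ROIs can be sketched as bounding boxes around groups of keypoints. This is a hedged illustration only: the keypoint ordering, the grouping in `REGION_KEYPOINTS`, and the `pad` parameter are assumptions, not the grouping used in the paper.

```python
# Hypothetical ordering of the 14 keypoints; the real ordering depends on
# the pose estimator and is an assumption here.
REGION_KEYPOINTS = {
    "head": [0, 1],
    "upper_body": [1, 2, 3, 4, 5, 6, 7],
    "lower_body": [8, 9, 10, 11, 12, 13],
    "right_arm": [2, 3, 4],
    "left_arm": [5, 6, 7],
    "right_leg": [8, 9, 10],
    "left_leg": [11, 12, 13],
}


def region_box(keypoints, indices, pad=0.0):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around selected keypoints."""
    xs = [keypoints[i][0] for i in indices]
    ys = [keypoints[i][1] for i in indices]
    return (min(xs) - pad, min(ys) - pad, max(xs) + pad, max(ys) + pad)


def body_regions(keypoints, pad=0.0):
    """Map each of the 7 region names to its bounding box."""
    return {name: region_box(keypoints, idx, pad)
            for name, idx in REGION_KEYPOINTS.items()}
```

Each box would then be cropped from the image or feature map and passed to the shared CNN alongside the full image.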

References

Spindle Net reading notes

http://openaccess.thecvf.com/content_cvpr_2017/papers/Zhao_Spindle_Net_Person_CVPR_2017_paper.pdf


GLAD: Global-Local-Alignment Descriptor for Pedestrian Retrieval

ACM MM2017

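Problem

  • The huge variance of human pose and the misalignment of detected human images significantly increase the difficulty of person Re-Identification (Re-ID). Moreover, efficient Re-ID systems are required to cope with the massive visual data being produced by video surveillance systems

Method

  • Proposes the Global-Local-Alignment Descriptor (GLAD) and an efficient indexing and retrieval framework
  • Similar to Spindle Net, GLAD uses extracted body keypoints to divide the image into three parts: head, upper body and lower body
  • The whole image and the three part images are fed together into a parameter-sharing CNN, and the final feature fuses global and local features
  • To handle input images of different resolutions, the network uses global average pooling (GAP) to extract each feature
  • A slight difference from Spindle Net is that each of the four input images has its own loss, rather than all being fused into one feature with a single overall loss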
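
Why global average pooling copes with inputs of different resolution can be seen in a small sketch. The nested-list layout (channels, then rows, then columns) and the function name are assumptions made for illustration.

```python
def global_average_pool(feature_map):
    """Reduce a C x H x W feature map to a length-C vector.

    `feature_map` is a list of channels, each an H x W nested list.
    The output length depends only on the number of channels, not on H or W.
    """
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled
```

A 2 x 2 head crop and a 1 x 3 body crop with the same number of channels both produce vectors of the same length, so they can share the layers that follow.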

References

https://arxiv.org/abs/1709.04329


Region-based Quality Estimation Network for Large-scale Person Re-identification

AAAI’18

Problem

  • One of the major restrictions on the performance of video-based person re-id is partial noise caused by occlusion, blur and illumination. Since different spatial regions of a single frame have various quality, and the quality of the same region also varies across frames in a tracklet, a good way to address the problem is to effectively aggregate complementary information from all frames in a sequence, using better regions from other frames to compensate for the influence of an image region with poor quality.

Method

  • Contributes a dataset to alleviate the lack of clean large-scale person re-id datasets for the community
  • Proposes the Region-based Quality Estimation Network (RQEN)
  • When occlusion is severe, ordinary pooling works poorly, so RQEN lets the network estimate a weight for each frame
  • High-quality frames receive high weights, and the feature maps are then combined as a weighted linear sum
Takeaways

  • Work dealing with occlusion can borrow this idea
  • Other work can borrow it as well (as in SENet)

References

https://arxiv.org/abs/1711.08766


Camera Style Adaptation for Person Re-identification

arXiv:1711

Problem

  • Data augmentation

Method

  • Uses a GAN to transfer images from one camera to the style of another camera
  • With CycleGAN, labeled training images can be style-transferred to each camera, and, along with the original training samples, form the augmented training set
  • This method, while increasing data diversity against over-fitting, also incurs a considerable level of noise. In the effort to alleviate the impact of noise, the label smooth regularization (LSR) is adopted
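
Label smooth regularization can be sketched as cross-entropy against a softened target distribution. The function names and the default `epsilon=0.1` are assumptions for illustration; the value used for the style-transferred samples should be taken from the paper.

```python
import math


def smoothed_targets(label, num_classes, epsilon=0.1):
    """Target distribution that moves `epsilon` of the mass to a uniform prior."""
    uniform = epsilon / num_classes
    return [1.0 - epsilon + uniform if k == label else uniform
            for k in range(num_classes)]


def lsr_cross_entropy(probs, label, epsilon=0.1):
    """Cross-entropy between predicted probabilities and smoothed targets.

    `probs` must contain strictly positive probabilities.
    """
    targets = smoothed_targets(label, len(probs), epsilon)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))
```

Setting `epsilon=0.0` recovers the ordinary one-hot cross-entropy, which is the natural choice for trusted, unmodified samples.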

Takeaways

  • Data generated by CycleGAN is controllable, and the identity (ID) of each generated image is known

References

https://arxiv.org/abs/1711.10295


Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-identification

arXiv:1711

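Problem

  • Person re-identification (re-ID) models trained on one domain often fail to generalize well to another.
  • For each image, the ID information is important for recognition and must be preserved -> self-similarity
  • The identities in the source and target domains do not overlap, so a translated image should be dissimilar to every image in the target domain -> domain-dissimilarity

Method

  • Proposes a “learning via translation” framework: a GAN translates source-domain images to the target domain, and the translated images are used to train the re-ID model
  • The authors propose the Similarity Preserving GAN (SPGAN)

References

ReID paper reading: SPGAN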

https://arxiv.org/abs/1711.07027


Beyond Part Models: Person Retrieval with Refined Part Pooling (and a Strong Convolutional Baseline)

arXiv:1711

Problem

  • With uniform partitioning or any other fixed partitioning, the same part in different images may carry different semantic information because the images are not aligned

Method

  • A network named Part-based Convolutional Baseline (PCB). Given an image input, it outputs a convolutional descriptor consisting of several part-level features obtained with a uniform partition strategy

  • A refined part pooling (RPP) method. Uniform partition inevitably incurs outliers in each part, which are in fact more similar to other parts. RPP re-assigns these outliers to the parts they are closest to, resulting in refined parts with enhanced within-part consistency
  • Each vector before average pooling is matched to its nearest neighbour among the part vectors obtained after average pooling
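
Uniform partition and the nearest-neighbour check described above can be sketched as follows. The horizontal-stripe layout, the H x W x C nested-list format, and both function names are assumptions; the hard assignment shown here only illustrates the idea and is not the RPP training procedure.

```python
def uniform_part_pool(feature_map, num_parts):
    """Average-pool an H x W x C feature map into `num_parts` horizontal stripes.

    `feature_map[h][w]` is a length-C vector. Returns one C-dim vector per part.
    """
    height = len(feature_map)
    if height % num_parts != 0:
        raise ValueError("height must be divisible by num_parts")
    stripe = height // num_parts
    parts = []
    for p in range(num_parts):
        rows = feature_map[p * stripe:(p + 1) * stripe]
        cells = [cell for row in rows for cell in row]
        channels = len(cells[0])
        parts.append([sum(c[ch] for c in cells) / len(cells)
                      for ch in range(channels)])
    return parts


def nearest_part(cell, parts):
    """Index of the part vector closest (squared Euclidean) to one cell vector."""
    dists = [sum((a - b) ** 2 for a, b in zip(cell, part)) for part in parts]
    return dists.index(min(dists))
```

A cell whose `nearest_part` differs from the stripe it physically lies in is an outlier in the sense used above.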

References

[Paper notes] Person Retrieval with Refined Part Pooling

https://arxiv.org/abs/1711.09349


Cross-view Asymmetric Metric Learning for Unsupervised Person Re-identification

ICCV’17

Problem

  • While metric learning is important for person re-identification (RE-ID), a significant problem in visual surveillance for cross-view pedestrian matching, existing metric models for RE-ID are mostly based on supervised learning that requires quantities of labeled samples in all pairs of camera views for training. However, this limits their scalability to realistic applications, in which a large amount of data over multiple disjoint camera views is available but not labelled

Method

  • We propose unsupervised asymmetric metric learning for unsupervised RE-ID
  • K-means first provides initial cluster labels, and metric learning is then performed on them
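
The initial labelling step can be sketched with a plain K-means loop whose cluster indices serve as pseudo identity labels. This covers only the initialization; the asymmetric metric learning itself is not shown. The function names, the caller-supplied initial centers, and the fixed iteration count are assumptions.

```python
def assign_clusters(points, centers):
    """Label each point with the index of its nearest center."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels


def kmeans(points, centers, iterations=10):
    """Plain K-means from given initial centers; returns (labels, centers)."""
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        labels = assign_clusters(points, centers)
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign_clusters(points, centers), centers
```

The returned labels stand in for identity labels when no annotation is available.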

References

https://arxiv.org/abs/1708.08062


Group Re-Identification via Unsupervised Transfer of Sparse Features Encoding

ICCV’17

Problem

  • We believe that the additional information carried by neighboring individuals provides a relevant visual context that can be exploited to obtain a more robust match of single persons within the group. Despite this, re-identifying groups of people compounds the common single person re-identification problems by introducing changes in the relative position of persons within the group and severe self-occlusions.

Method

  • We propose a solution for group re-identification that builds on transferring knowledge from single person re-identification to group re-identification by exploiting sparse dictionary learning
    • First, a dictionary of sparse atoms is learned using patches extracted from single person images.
    • Then, the learned dictionary is exploited to obtain a sparsity-driven residual group representation, which is finally matched to perform the re-identification

References

https://arxiv.org/abs/1707.09173


Jointly Attentive Spatial-Temporal Pooling Networks for Video-based Person Re-Identification

ICCV’17

Problem

  • Existing methods typically build a representation for each video sequence and then exploit a distance function to judge their extent of matching. However, most of these approaches derive each sequence’s representation separately, rarely considering the impact of the other, and so neglect the mutual influence of the two video sequences in the context of the matching task

Method

  • We propose jointly Attentive Spatial-Temporal Pooling Networks (ASTPN), a powerful mechanism for learning the representation of video sequences by taking into account the interdependence among them
    • ASTPN first learns a similarity measure over the features extracted from recurrent-convolutional networks of the two input items, and uses the similarity scores between the features to compute attention vectors in both spatial (regions in each frame) and temporal (frames over sequences) dimensions.
    • Next, the attention vectors are used to perform pooling.
    • Finally, a Siamese network architecture is deployed over the attention vectors. The proposed architecture can be trained efficiently with the end-to-end training schema
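
The temporal half of the attentive pooling idea can be sketched as below. This is a rough illustration under assumptions, not the ASTPN architecture: a plain dot product stands in for the learned similarity measure, max over rows or columns followed by softmax stands in for the attention computation, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(frames, weights):
    """Attention-weighted sum of per-frame feature vectors."""
    return [sum(w * f[d] for w, f in zip(weights, frames))
            for d in range(len(frames[0]))]


def attentive_temporal_pool(probe, gallery):
    """Pool two sequences with attention derived from their mutual similarity."""
    # Assumed similarity: plain dot product between frame features.
    sim = [[sum(a * b for a, b in zip(p, g)) for g in gallery] for p in probe]
    probe_attn = softmax([max(row) for row in sim])          # per probe frame
    gallery_attn = softmax([max(col) for col in zip(*sim)])  # per gallery frame
    return weighted_sum(probe, probe_attn), weighted_sum(gallery, gallery_attn)
```

Because each sequence's weights depend on the other sequence, the pooled representation of a probe changes with the gallery item it is compared against.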

Takeaways

  • Temporal modelling can borrow methods and models from the NLP field

References

http://openaccess.thecvf.com/content_ICCV_2017/papers/Xu_Jointly_Attentive_Spatial-Temporal_ICCV_2017_paper.pdf


Pose-driven Deep Convolutional Model for Person Re-identification

ICCV’17

Problem

  • The large pose deformations and the complex view variations exhibited by the captured person images significantly increase the difficulty of learning and matching of the features from person images
  • Although these approaches have achieved remarkable results on mainstream person ReID datasets, most of them do not consider pose variation of human body

Method

  • The proposed PDC model learns the global representation depicting the whole body and local representations depicting body parts simultaneously.
    • The global representation is learned using the Softmax Loss with person ID labels on the whole input image.
    • For the learning of local representations, a novel Feature Embedding sub-Net (FEN) is proposed to learn and readjust human parts so that parts are affine transformed and re-located at more reasonable regions which can be easily recognizable through two different cameras.
    • In Feature Embedding sub-Net, each body part region is first automatically cropped. The cropped part regions are then transformed by a Pose Transformation Network (PTN) to eliminate the pose variations. The local representations are then learned on the transformed regions.
    • We further propose a Feature Weighting sub-Net (FWN) to learn the weights of global representations and local representations on different parts.
    • Therefore, more reasonable feature fusion is conducted to facilitate feature similarity measurement

References

http://openaccess.thecvf.com/content_ICCV_2017/papers/Su_Pose-Driven_Deep_Convolutional_ICCV_2017_paper.pdf


RGB-Infrared Cross-Modality Person Re-Identification

ICCV’17

Problem

  • Currently, most works focus on RGB-based Re-ID. However, in some applications, RGB images are not suitable, e.g. in a dark environment or at night. Infrared (IR) imaging becomes necessary in many visual systems. To that end, matching RGB images with infrared images is required, which are heterogeneous with very different visual characteristics.

Method

  • To explore the RGB-IR Re-ID problem, we evaluate existing popular cross-domain models, including three commonly used neural network structures (one-stream, two-stream and asymmetric FC layer) and analyse the relation between them.
  • We further propose deep zero-padding for training one-stream network towards automatically evolving domain-specific nodes in the network for cross-modality matching
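
One reading of deep zero-padding is that each modality occupies its own input channel and the other channel is filled with zeros, so a single one-stream network sees a modality-specific input layout. The sketch below follows that reading and should be treated as an assumption: the two-channel layout, the grayscale conversion of RGB images beforehand, and the function name are not taken from the notes above.

```python
def zero_pad_input(image, modality):
    """Build a two-channel input from a single-channel H x W image.

    Assumed layout: channel 0 carries the (grayscale-converted) RGB image,
    channel 1 carries the infrared image, and the unused channel is zeros.
    """
    zeros = [[0.0] * len(row) for row in image]
    if modality == "rgb":
        return [image, zeros]
    if modality == "ir":
        return [zeros, image]
    raise ValueError("modality must be 'rgb' or 'ir'")
```

Since the zero channel contributes nothing to the first layer, some first-layer weights effectively respond to one modality only, which is one way domain-specific nodes can emerge.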

References

http://openaccess.thecvf.com/content_ICCV_2017/papers/Wu_RGB-Infrared_Cross-Modality_Person_ICCV_2017_paper.pdf


Neural Person Search Machines

ICCV’17

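Problem

  • The authors survey person re-ID work in real outdoor scenes and find that most of it runs detection and re-ID as two separate steps; this paper proposes NPSM to do both in one step

Method

  • NPSM builds on LSTM and attention ideas, progressively shrinking the region of interest (ROI) attended to in the original image until a precise ROI remains, which is the target person to search for

References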

http://openaccess.thecvf.com/content_ICCV_2017/papers/Liu_Neural_Person_Search_ICCV_2017_paper.pdf


Machine Learning

On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes

NIPS 2001

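  • This paper discusses discriminative and generative models, using logistic regression (LR) and naive Bayes (NB) as examples and analysing them theoretically

Takeaways

  • A generative model estimates the joint probability distribution p(x,y) and then applies Bayes' rule to obtain p(y|x)
    • p(x,y) can also be used for other purposes
    • For example, you could use p(x,y) to generate likely (x,y) pairs.
  • A discriminative model directly models the conditional probability distribution p(y|x)

  • When the dataset is small, naive Bayes is the better choice: it needs only O(log n) training samples to approach its asymptotic error, where n is the number of features

  • When the dataset is large, logistic regression is the better choice: it needs O(n) training samples to approach its asymptotic error, which is no higher than that of naive Bayes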
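
The generative route described above, going from p(x,y) to p(y|x) via Bayes' rule, can be sketched on a small discrete table. The dictionary representation and the function name are assumptions for illustration.

```python
def posterior_from_joint(joint, x):
    """Compute p(y | x) from a joint table.

    `joint` maps (x, y) pairs to p(x, y). The evidence p(x) is obtained by
    summing the joint over y, then each p(x, y) is divided by it.
    """
    evidence = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / evidence for (xi, y), p in joint.items() if xi == x}
```

A discriminative model skips the joint table and fits p(y|x) directly, which is why it cannot be used to generate (x,y) pairs.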

References

https://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-discriminative-algorithm

http://cs229.stanford.edu/notes/cs229-notes2.pdf

http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=jcyxx-zxk201512275

http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf


Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

JMLR 2014

Problem

  • As the title asks

Method

  • We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbors, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods)
  • We use 121 data sets, which represent the whole UCI data base (excluding the large-scale problems) and some of our own real problems

Takeaways

  • The random forest is clearly the best family of classifiers (3 out of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).

References

http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf

Reposted from blog.csdn.net/u013982164/article/details/79608100