Learning Memory-guided Normality for Anomaly Detection: notes on a paper about memory mechanisms in anomaly detection

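Previous methods do not explicitly consider the diversity of normal patterns. The paper presents an unsupervised learning approach to anomaly detection that uses a memory module with an update scheme, where items in the memory record prototypical patterns of normal data.

The paper also proposes novel feature compactness and separateness losses to train the memory module, improving the discriminative power of both the memory items and the deeply learned features for normal data. Experimental results on standard benchmarks demonstrate the effectiveness and efficiency of the method, which outperforms the state of the art.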

    We address the problem of anomaly detection, that is, detecting anomalous events in a video sequence. Anomaly detection methods based on convolutional neural networks (CNNs) typically leverage proxy tasks, such as reconstructing input video frames, to learn models describing normality without seeing anomalous samples at training time, and quantify the extent of abnormalities using the reconstruction error at test time. The main drawbacks of these approaches are that they do not consider the diversity of normal patterns explicitly, and that the powerful representation capacity of CNNs allows them to reconstruct abnormal video frames. To address this problem, we present an unsupervised learning approach to anomaly detection that considers the diversity of normal patterns explicitly, while lessening the representation capacity of CNNs. To this end, we propose to use a memory module with a new update scheme where items in the memory record prototypical patterns of normal data. We also present novel feature compactness and separateness losses to train the memory, boosting the discriminative power of both memory items and deeply learned features from normal data. Experimental results on standard benchmarks demonstrate the effectiveness and efficiency of our approach, which outperforms the state of the art.

1. Introduction

        The problem of detecting abnormal events in video sequences, e.g., vehicles on sidewalks, has attracted significant attention over the last decade, and it is particularly important for surveillance and fault detection systems. It is extremely challenging for a number of reasons. First, anomalous events are determined differently according to circumstances. Namely, the same activity could be normal or abnormal (e.g., holding a knife in the kitchen or in the park). Manually annotating anomalous events is in this context labor intensive. Second, collecting anomalous datasets requires a lot of effort, as anomalous events rarely happen in real-life situations. Anomaly detection is thus typically deemed to be an unsupervised learning problem, aiming at learning a model describing normality without anomalous samples. At test time, events and activities not described by the model are then considered as anomalies.

        There are many attempts to model normality in video sequences using unsupervised learning approaches. At training time, given normal video frames as inputs, they typically extract feature representations and try to reconstruct the inputs again. The video frames with large reconstruction errors are then treated as anomalies at test time. This assumes that abnormal samples are not reconstructed well, as the models have never seen them during training. Recent methods based on convolutional neural networks (CNNs) exploit an autoencoder (AE) [1, 17]. The powerful representation capacity of CNNs allows extracting better feature representations. The CNN features from abnormal frames, on the other hand, are likely to be reconstructed by combining those of normal ones [22, 8]. In this case, abnormal frames have low reconstruction errors, which often happens when most of an abnormal frame is normal (e.g., pedestrians in a park). In order to lessen the capacity of CNNs, a video prediction framework [22] is introduced that minimizes the difference between a predicted future frame and its ground truth. The drawback of these methods [1, 17, 22] is that they do not detect anomalies directly [35]. They instead leverage proxy tasks for anomaly detection, e.g., reconstructing input frames [1, 17] or predicting future frames [22], to extract general feature representations rather than normal patterns. To overcome this problem, Deep SVDD [35] exploits the one-class classification objective to map normal data into a hypersphere. Specifically, it minimizes the volume of the hypersphere such that normal samples are mapped closely to the center of the sphere. Although a single center of the sphere represents a universal characteristic of normal data, this does not consider various patterns of normal samples.

       We present in this paper an unsupervised learning approach to anomaly detection in video sequences considering the diversity of normal patterns. We assume that a single prototypical feature is not enough to represent various patterns of normal data. That is, multiple prototypes (i.e., modes or centroids of features) exist in the feature space of normal video frames (Fig. 1). To implement this idea, we propose a memory module for anomaly detection, where individual items in the memory correspond to prototypical features of normal patterns. We represent video frames using the prototypical features in the memory items, lessening the capacity of CNNs. To reduce the intra-class variations of CNN features, we propose a feature compactness loss, mapping the features of a normal video frame to the nearest item in the memory and encouraging them to be close. Simply updating memory items and extracting CNN features alternately gives a degenerate solution, where all items are similar and thus all features are mapped closely in the embedding space. To address this problem, we propose a feature separateness loss. It minimizes the distance between each feature and its nearest item, while maximizing the discrepancy between the feature and the second nearest one, separating individual items in the memory and enhancing the discriminative power of the features and memory items. We also introduce an update strategy to prevent the memory from recording features of anomalous samples at test time. To this end, we propose a weighted regular score measuring how many anomalies exist within a video frame, such that the items are updated only when the frame is determined to be normal. Experimental results on standard benchmarks, including UCSD Ped2 [21], CUHK Avenue [24] and ShanghaiTech [26], demonstrate the effectiveness and efficiency of our approach, outperforming the state of the art.
[Fig. 1: multiple prototypes in the feature space of normal video frames]

The main contributions of this paper can be summarized as follows:
• We propose to use multiple prototypes to represent the diverse patterns of normal video frames for unsupervised anomaly detection. To this end, we introduce a memory module recording prototypical patterns of normal data on the items in the memory.
• We propose feature compactness and separateness losses to train the memory, ensuring the diversity and discriminative power of the memory items. We also present a new update scheme of the memory, when both normal and abnormal samples exist at test time.
• We achieve a new state of the art on standard benchmarks for unsupervised anomaly detection in video sequences. We also provide an extensive experimental analysis with ablation studies.

2. Related work

Anomaly detection.

       Many works formulate anomaly detection as an unsupervised learning problem, where anomalous data are not available at training time. They typically adopt reconstructive or discriminative approaches to learn models describing normality. Reconstructive models encode normal patterns using representation learning methods such as an AE [48, 36], sparse dictionary learning [6, 49, 24], and a generative model [43]. Discriminative models characterize the statistical distributions of normal samples and obtain decision boundaries around the normal instances, e.g., using a Markov random field (MRF) [15], a mixture of dynamic textures (MDT) [28], Gaussian regression [4], and one-class classification [39, 27, 14]. These approaches, however, often fail to capture the complex distributions of high-dimensional data such as images and videos [3].
       CNNs have allowed remarkable advances in anomaly detection over the last decade. Many anomaly detection methods leverage reconstructive models [9, 26, 5, 33] exploiting feature representations from, e.g., a convolutional AE (Conv-AE) [9], a 3D Conv-AE [50], a recurrent neural network (RNN) [29, 26, 25], and a generative adversarial network (GAN) [33]. Although CNN-based methods outperform classical approaches by large margins, they even reconstruct anomalous samples with a combination of normal ones, mainly due to the representation capacity of CNNs. This problem can be alleviated by using predictive or discriminative models [22, 35]. The work of [22] assumes that anomalous frames in video sequences are unpredictable, and trains a network for predicting future frames rather than the input itself [22]. It achieves a remarkable performance gain over reconstructive models, but at the cost of runtime for estimating optical flow between video frames. It also requires ground-truth optical flow to train a sub-network for computing flow fields. Deep SVDD [35] leverages CNNs as mapping functions that transform normal data into the center of the hypersphere, while forcing anomalous samples to fall outside the sphere, using the one-class classification objective. Our method also lessens the representation capacity of CNNs, but in a different way. We reconstruct or predict a video frame with a combination of items in the memory, rather than using CNN features directly from an encoder, while considering various patterns of normal data. In the case of future frame prediction, our model does not require computing optical flow, and thus it is much faster than the current method [22]. DeepCascade [37] detects various normal patches using cascaded deep networks. In contrast, our method leverages memory items to record the normal patterns explicitly, even in test sequences. Concurrent to our method, Gong et al. introduce a memory-augmented autoencoder (MemAE) for anomaly detection [8]. It also uses CNN features, but relies on a 3D Conv-AE to retrieve relevant memory items that record normal patterns, where the items are updated during training only. Unlike this approach, our model better records diverse and discriminative normal patterns by separating memory items explicitly using feature compactness and separateness losses, which lets us use far fewer items (10 vs. 2,000 for MemAE). We also update the memory at test time, while discriminating anomalies simultaneously, suggesting that our model also memorizes normal patterns of test data.

Memory networks

       There are a number of attempts to capture long-term dependencies in sequential data. Long short-term memory (LSTM) [11] addresses this problem using local memory cells, where hidden states of the network partially record past information. The memorization performance is, however, limited, as the size of the cell is typically small and the knowledge in the hidden state is compressed. To overcome this limitation, memory networks [45] have recently been introduced. They use a global memory that can be read and written to, and perform memorization tasks better than classical approaches. The memory networks, however, require layer-wise supervision to learn models, making it hard to train them using standard backpropagation. More recent works use continuous memory representations [40] or key-value pairs [30] to read and write memories, allowing the memory networks to be trained end-to-end. Several works adopt memory networks for computer vision tasks including visual question answering [19, 7], one-shot learning [38, 13, 2], image generation [51], and video summarization [20]. Our work also exploits a memory module, but for anomaly detection and with a different memory updating strategy. We record various patterns of normal data to individual items in the memory, and consider each item as a prototypical feature.

3. Approach

       We show in Fig. 2 an overview of our framework. We reconstruct input frames or predict future ones for unsupervised anomaly detection. Following [22], we input four successive video frames to predict the fifth one for the prediction task. As the prediction can be considered as a reconstruction of the future frame using previous ones, we use almost the same network architecture with the same losses for both tasks. We describe hereafter our approach for the reconstruction task in detail.
[Fig. 2: overview of the framework]
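The four-in, one-out setup of the prediction task can be sketched with plain index arithmetic. This is a minimal illustration rather than the authors' data loader: the function name `prediction_windows`, the `context` parameter, and the zero-based indexing are assumptions made for the example.

```python
def prediction_windows(num_frames, context=4):
    """Yield (input_indices, target_index) pairs for future frame prediction.

    Each pair uses `context` successive frames as input and the frame that
    immediately follows them as the prediction target (zero-based indices).
    """
    for target in range(context, num_frames):
        yield list(range(target - context, target)), target


# A 6-frame clip gives two training samples.
windows = list(prediction_windows(6))
```

With zero-based indices the first target is frame 4, i.e. the fifth frame of the clip.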

       Our model mainly consists of three components: an encoder, a memory module, and a decoder. The encoder inputs a normal video frame and extracts query features. The features are then used to retrieve prototypical normal patterns in the memory items and to update the memory. We feed the query features and the aggregated (i.e., read) memory items to the decoder for reconstructing the input video frame. We train our model end-to-end using reconstruction, feature compactness, and feature separateness losses. At test time, we use a weighted regular score in order to prevent the memory from being updated by abnormal video frames. We compute the discrepancies between the input frame and its reconstruction and the distances between the query feature and the nearest item in the memory to quantify the extent of abnormalities in a video frame.

3.1. Network architecture

3.1.1 Encoder and decoder

       We exploit the U-Net architecture [34], widely used for the tasks of reconstruction and future frame prediction [22], to extract feature representations from input video frames and to reconstruct the frames from the features. Unlike the standard architecture, we remove the last batch normalization [12] and ReLU layers [18] in the encoder, as the ReLU cuts off negative values, restricting diverse feature representations. We instead add an L2 normalization layer to make the features have a common scale. Skip connections in the U-Net architecture may not be able to extract useful features from the video frames, especially for the reconstruction task, and our model may learn to copy the inputs for the reconstruction. We thus remove the skip connections for the reconstruction task, while retaining them for predicting future frames. We denote by $I_t$ and $q_t$ a video frame and a corresponding feature (i.e., a query) from the encoder at time $t$, respectively. The encoder inputs the video frame $I_t$ and gives the query map $q_t$ of size $H \times W \times C$, where $H$, $W$, $C$ are height, width, and the number of channels, respectively. We denote by $q_t^k \in \mathbb{R}^C$ ($k = 1, \dots, K$), where $K = H \times W$, individual queries of size $1 \times 1 \times C$ in the query map $q_t$. The queries are then inputted to the memory module to read the items in the memory or to update the items, such that they record prototypical normal patterns. The detailed descriptions of the memory module are presented in the following section. The decoder inputs the queries and retrieved memory items and reconstructs the video frame $\hat I_t$.
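To make the shapes concrete, the sketch below splits a query map of size H × W × C into K = H × W queries and L2-normalizes each of them. It assumes the normalization is applied per query along the channel dimension, which the text does not state explicitly; `to_queries` and `eps` are hypothetical names used only for this example.

```python
import numpy as np


def to_queries(feature_map, eps=1e-12):
    """Flatten an (H, W, C) feature map into K = H * W unit-norm queries.

    Assumption: the L2 normalization acts on each C-dimensional query.
    """
    height, width, channels = feature_map.shape
    queries = feature_map.reshape(height * width, channels)
    norms = np.linalg.norm(queries, axis=1, keepdims=True)
    # Guard against an all-zero feature vector.
    return queries / np.maximum(norms, eps)


rng = np.random.default_rng(0)
queries = to_queries(rng.standard_normal((4, 4, 8)))  # K = 16, C = 8
```

Because every query has unit length, a dot product between a query and a unit-length memory item equals their cosine similarity.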

3.1.2 Memory

       The memory module contains M items recording various prototypical patterns of normal data. We denote by $p_m \in \mathbb{R}^C$ ($m = 1, \dots, M$) the item in the memory. The memory supports two operations, reading and updating the items (Fig. 3).
[Fig. 3: reading and updating the memory items]

Read

       To read the items, we compute the cosine similarity between each query $q_t^k$ and all memory items $p_m$, resulting in a 2-dimensional correlation map of size $M \times K$. We then apply a softmax function along the vertical direction, and obtain matching probabilities $w_t^{k,m}$ as follows:
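$$
w_t^{k,m} = \frac{\exp\left((p_m)^\top q_t^k\right)}{\sum_{m'=1}^{M} \exp\left((p_{m'})^\top q_t^k\right)}. \tag{1}
$$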

       For each query $q_t^k$, we read the memory by a weighted average of the items $p_m$ with the corresponding weights $w_t^{k,m}$, and obtain the feature $\hat p_t^k \in \mathbb{R}^C$ as follows:
$$
\hat p_t^k = \sum_{m'=1}^{M} w_t^{k,m'} p_{m'}. \tag{2}
$$
       Using all items instead of the closest item allows our model to understand diverse normal patterns, taking into account the overall normal characteristics. That is, we represent the query $q_t^k$ with a combination of the items $p_m$ in the memory. We apply the reading operator to individual queries, and obtain a transformed feature map $\hat p_t \in \mathbb{R}^{H \times W \times C}$ (i.e., aggregated items). We concatenate it with the query map $q_t$ along the channel dimension, and input them to the decoder. This enables the decoder to reconstruct the input frame using normal patterns in the items, lessening the representation capacity of CNNs, while understanding the normality.
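The read operation described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: queries and items are taken to be L2-normalized so that a dot product equals the cosine similarity, the correlation map is stored as K × M (the transpose of the M × K map in the text), and the names `softmax`, `read_memory`, and `decoder_input` are made up for the example.

```python
import numpy as np


def softmax(x, axis):
    """Softmax along `axis`, shifted by the maximum for numerical stability."""
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)


def read_memory(queries, items):
    """Read the memory for every query.

    queries: (K, C) array, items: (M, C) array, both assumed L2-normalized.
    Returns the matching probabilities of shape (K, M) and the read
    features of shape (K, C).
    """
    corr = queries @ items.T     # (K, M) similarities
    w = softmax(corr, axis=1)    # normalize over the M items for each query
    read = w @ items             # weighted average of the items
    return w, read


def decoder_input(queries, items, height, width):
    """Concatenate queries and read features along the channel dimension."""
    _, read = read_memory(queries, items)
    stacked = np.concatenate([queries, read], axis=1)   # (K, 2C)
    return stacked.reshape(height, width, -1)


rng = np.random.default_rng(0)
q = rng.standard_normal((6, 4))
q /= np.linalg.norm(q, axis=1, keepdims=True)
p = rng.standard_normal((3, 4))
p /= np.linalg.norm(p, axis=1, keepdims=True)
w, read = read_memory(q, p)
```

Each row of `w` sums to one, so every read feature is a convex combination of the memory items.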

Update

       For each memory item, we select all queries for which that item is the nearest one, using the matching probabilities in (1). Note that multiple queries can be assigned to a single item in the memory. See, for example, Fig. 5 in Sec. 4.3. We denote by $U_t^m$ the set of indices for the corresponding queries for the m-th item in the memory. We update the item using only the queries indexed by the set $U_t^m$ as follows:
$$
p_m \leftarrow f\Big(p_m + \sum_{k \in U_t^m} v_t^{\prime\, k,m}\, q_t^k\Big), \tag{3}
$$

       where $f(\cdot)$ is the L2 norm. By using a weighted average of the queries, rather than summing them up, we can concentrate more on the queries near the item. To this end, we compute matching probabilities $v_t^{k,m}$ similar to (1), but by applying the softmax function to the correlation map of size $M \times K$ along the horizontal direction as
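$$
v_t^{k,m} = \frac{\exp\left((p_m)^\top q_t^k\right)}{\sum_{k'=1}^{K} \exp\left((p_m)^\top q_t^{k'}\right)}, \tag{4}
$$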
       and renormalize it to consider the queries indexed by the set $U_t^m$ as follows:
$$
v_t^{\prime\, k,m} = \frac{v_t^{k,m}}{\max_{k' \in U_t^m} v_t^{k',m}}. \tag{5}
$$
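A possible NumPy sketch of the update step is given below. It assumes the update adds the weighted sum of the assigned queries to the item and then L2-normalizes the result, with weights taken from a softmax over the queries and rescaled by their maximum within the assigned set. The function name `update_memory` and the `eps` guard are hypothetical, and the details should be checked against the paper before relying on them.

```python
import numpy as np


def update_memory(queries, items, eps=1e-12):
    """Update each memory item with the queries that are nearest to it.

    queries: (K, C) array, items: (M, C) array, both assumed L2-normalized.
    Returns a new (M, C) array; items without assigned queries are unchanged.
    """
    queries = np.asarray(queries, dtype=float)
    items = np.asarray(items, dtype=float)
    corr = queries @ items.T                      # (K, M) similarities
    nearest = corr.argmax(axis=1)                 # nearest item per query
    # Softmax over the K queries (one column per item).
    v = np.exp(corr - corr.max(axis=0, keepdims=True))
    v /= v.sum(axis=0, keepdims=True)
    new_items = items.copy()
    for m in range(items.shape[0]):
        assigned = np.flatnonzero(nearest == m)   # index set for item m
        if assigned.size == 0:
            continue
        # Rescale so the largest weight in the assigned set equals one.
        v_prime = v[assigned, m] / v[assigned, m].max()
        updated = items[m] + v_prime @ queries[assigned]
        new_items[m] = updated / max(np.linalg.norm(updated), eps)
    return new_items


items = np.array([[1.0, 0.0], [0.0, 1.0]])
queries = np.array([[0.6, 0.8]])                  # nearest to the second item
new_items = update_memory(queries, items)
```

In this toy case only the second item moves, and it is pulled toward the query while staying on the unit sphere.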
       We update memory items recording prototypical features at both training and test time, since normal patterns in training and test sets may be different and they could vary with various factors, e.g., illumination and occlusion. As both normal and abnormal frames are available at test time, we propose to use a weighted regular score to prevent the memory items from recording patterns in the abnormal frames. Given a video frame $I_t$, we use the weighted reconstruction error between $I_t$ and $\hat I_t$ as the regular score $\varepsilon_t$:
$$
\varepsilon_t = \sum_{i,j} W_{ij}(\hat I_t, I_t)\, \big\| \hat I_t^{ij} - I_t^{ij} \big\|_2, \tag{6}
$$
where the weight function $W_{ij}(\cdot)$ is
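$$
W_{ij}(\hat I_t, I_t) = \frac{1 - \exp\left(-\big\| \hat I_t^{ij} - I_t^{ij} \big\|_2\right)}{\sum_{i,j} \left(1 - \exp\left(-\big\| \hat I_t^{ij} - I_t^{ij} \big\|_2\right)\right)}, \tag{7}
$$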
       and $i$ and $j$ are spatial indices. When the score $\varepsilon_t$ is higher than a threshold $\gamma$, we regard the frame $I_t$ as an abnormal sample, and do not use it for updating memory items. Note that we use this score only when updating the memory. The weight function allows focusing more on the regions of large reconstruction errors, as abnormal activities typically appear within small parts of the video frame.
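The gating logic can be illustrated with a short NumPy sketch. It assumes a per-pixel weight of the form 1 − exp(−error), normalized to sum to one over the frame, so that pixels with large errors dominate the score. The names `weighted_regular_score` and `should_update` are hypothetical, and the threshold value is left to the caller.

```python
import numpy as np


def weighted_regular_score(pred, target):
    """Weighted reconstruction error between two (H, W, C) frames.

    Assumption: each pixel is weighted by 1 - exp(-error), normalized so
    the weights sum to one over all spatial positions.
    """
    err = np.linalg.norm(pred - target, axis=-1)   # per-pixel L2 error
    weight = 1.0 - np.exp(-err)
    total = weight.sum()
    if total == 0.0:                               # perfect reconstruction
        return 0.0
    return float((weight / total * err).sum())


def should_update(pred, target, gamma):
    """Update the memory only when the score does not exceed gamma."""
    return weighted_regular_score(pred, target) <= gamma


pred = np.zeros((2, 2, 1))
target = np.zeros((2, 2, 1))
target[0, 0, 0] = 2.0                              # one badly reconstructed pixel
score = weighted_regular_score(pred, target)
```

Here the single erroneous pixel receives all of the weight, so the score equals its error even though the other three pixels are reconstructed perfectly. An unweighted mean would dilute it by a factor of four.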

3.2. Training loss

       We exploit the video frames as a supervisory signal to discriminate normal and abnormal samples. To train our model, we use reconstruction, feature compactness, and feature separateness losses ($L_{rec}$, $L_{compact}$, and $L_{separate}$, respectively), balanced by the parameters $\lambda_c$ and $\lambda_s$ as follows:
$$
L = L_{rec} + \lambda_c L_{compact} + \lambda_s L_{separate}. \tag{8}
$$

Reconstruction loss

       Specifically, we minimize the L2 distance between the decoder output $\hat I_t$ and the ground truth $I_t$:

$$
L_{rec} = \sum_{t}^{T} \big\| \hat I_t - I_t \big\|_2, \tag{9}
$$

       where $T$ denotes the total length of a video sequence. We set the first time step to 1 and 5 for the reconstruction and prediction tasks, respectively.

Feature compactness loss.

       The feature compactness loss encourages the queries to be close to the nearest item in the memory, reducing intra-class variations. It penalizes the discrepancies between them in terms of the L2 norm as:
$$
L_{compact} = \sum_{t}^{T} \sum_{k}^{K} \big\| q_t^k - p_p \big\|_2, \tag{10}
$$

       where $p$ is an index of the nearest item for the query $q_t^k$, defined as
$$
p = \arg\max_{m \in \{1, \dots, M\}} w_t^{k,m}. \tag{11}
$$

       Note that the feature compactness loss and the center loss [44] are similar, as the memory item $p_p$ corresponds to the center of deep features in the center loss. They are different in that the item in (10) is retrieved from the memory, and it is updated without any supervisory signals, while the cluster center in the center loss is computed directly using the features learned from ground-truth class labels. Note also that our method can be considered as an unsupervised learning of joint clustering and feature representations. In this task, degenerate solutions are likely to be obtained [44, 47]. As will be seen in our experiments, training our model using the feature compactness loss only makes all items similar, and thus all queries are mapped closely in the embedding space, losing the capability of recording diverse normal patterns.

Feature separateness loss.

       Similar queries should be allocated to the same item in order to reduce the number of items and the memory size. The feature compactness loss in (10) makes all queries and memory items close to each other, as we extract the features (i.e., queries) and update the items alternately, with the result that all items become similar. The items in the memory, however, should be far enough apart from each other to consider various patterns of normal data. To prevent this problem while obtaining compact feature representations, we propose a feature separateness loss, defined with a margin of $\alpha$ as follows:
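$$
L_{separate} = \sum_{t}^{T} \sum_{k}^{K} \Big[ \big\| q_t^k - p_p \big\|_2 - \big\| q_t^k - p_n \big\|_2 + \alpha \Big]_+, \tag{12}
$$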
       where we set the query $q_t^k$, its nearest item $p_p$ and the second nearest item $p_n$ as an anchor, and positive and hard negative samples, respectively. We denote by $n$ an index of the second nearest item for the query $q_t^k$:
$$
n = \arg\max_{m \in \{1, \dots, M\},\, m \neq p} w_t^{k,m}. \tag{13}
$$
       Note that this is different from the typical use of the triplet loss, which requires ground-truth positive and negative samples for the anchor. Our loss encourages the query and the second nearest item to be distant, while keeping the query and the nearest one close. This has the effect of placing the items far away from each other. As a result, the feature separateness loss allows updating the item nearest to the query while discarding the influence of the second nearest item, separating all items in the memory and enhancing the discriminative power.
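The two memory losses can be sketched for a single frame as below. The sketch assumes the nearest and second nearest items are picked by similarity, which gives the same ranking as the matching probabilities in (1), and that the separateness term is a hinge on the difference of the two L2 distances plus the margin. The function names and the default values of `alpha`, `lambda_c`, and `lambda_s` are placeholders, not values taken from the text.

```python
import numpy as np


def compact_and_separate_losses(queries, items, alpha=1.0):
    """Feature compactness and separateness losses for one frame.

    queries: (K, C) array, items: (M, C) array with M >= 2.
    """
    corr = queries @ items.T                   # (K, M) similarities
    order = np.argsort(-corr, axis=1)          # items sorted by similarity
    p, n = order[:, 0], order[:, 1]            # nearest and second nearest
    d_pos = np.linalg.norm(queries - items[p], axis=1)
    d_neg = np.linalg.norm(queries - items[n], axis=1)
    l_compact = float(d_pos.sum())
    # Hinge: zero once the second nearest item is at least alpha farther away.
    l_separate = float(np.maximum(d_pos - d_neg + alpha, 0.0).sum())
    return l_compact, l_separate


def total_loss(l_rec, l_compact, l_separate, lambda_c=0.1, lambda_s=0.1):
    """Weighted sum of the reconstruction and the two memory losses."""
    return l_rec + lambda_c * l_compact + lambda_s * l_separate


queries = np.array([[1.0, 0.0]])
items = np.array([[1.0, 0.0], [0.0, 1.0]])
l_compact, l_separate = compact_and_separate_losses(queries, items, alpha=2.0)
```

In this example the query coincides with its nearest item, so the compactness loss vanishes, and the separateness loss is the margin minus the distance to the second item.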

3.3. Abnormality score

       We quantify the extent of normalities or abnormalities in a video frame at test time. We assume that the queries obtained from a normal video frame are similar to the memory items, as they record prototypical patterns of normal data. We compute the L2 distance between each query and the nearest item as follows:
$$
D(q_t, p_m) = \frac{1}{K} \sum_{k}^{K} \big\| q_t^k - p_p \big\|_2. \tag{14}
$$
       We also exploit the memory items implicitly to compute the abnormality score. We measure how well the video frame is reconstructed using the memory items. This assumes that anomalous patterns in the video frame are not reconstructed by the memory items. Following [22], we compute the PSNR between the input video frame and its reconstruction:
$$
P(\hat I_t, I_t) = 10 \log_{10} \frac{\max(\hat I_t)}{\big\| \hat I_t - I_t \big\|_2^2 / N}, \tag{15}
$$

       where $N$ is the number of pixels in the video frame. When the frame $I_t$ is abnormal, we obtain a low value of PSNR and vice versa. Following [22, 8, 26], we normalize each error in (14) and (15) to the range [0, 1] by a min-max normalization [22]. We define the final abnormality score $S_t$ for each video frame as the sum of two metrics, balanced by the parameter $\lambda$, as follows:
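$$
S_t = \lambda \left(1 - g\big(P(\hat I_t, I_t)\big)\right) + (1 - \lambda)\, g\big(D(q_t, p_m)\big), \tag{16}
$$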
where we denote by $g(\cdot)$ the min-max normalization [22] over whole video frames, e.g.,
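$$
g\big(D(q_t, p_m)\big) = \frac{D(q_t, p_m) - \min_t D(q_t, p_m)}{\max_t D(q_t, p_m) - \min_t D(q_t, p_m)}. \tag{17}
$$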
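Putting the pieces of this section together, a frame-level score could be computed along the following lines. This is a sketch under assumptions: the PSNR uses the maximum of the reconstruction as the peak value and the mean squared error over all array elements, the normalization runs over all frames of one video, and the names `psnr`, `query_item_distance`, `min_max`, and `abnormality_scores` as well as the default `lam` are hypothetical.

```python
import numpy as np


def psnr(pred, target, eps=1e-12):
    """PSNR between a reconstruction and its input frame.

    Assumes pred.max() > 0; eps avoids a division by zero when the
    reconstruction is perfect.
    """
    mse = np.mean((pred - target) ** 2)
    return float(10.0 * np.log10(pred.max() / (mse + eps)))


def query_item_distance(queries, items):
    """Mean L2 distance between each query and its nearest memory item."""
    corr = queries @ items.T
    nearest = items[corr.argmax(axis=1)]
    return float(np.linalg.norm(queries - nearest, axis=1).mean())


def min_max(values):
    """Min-max normalization of per-frame values to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0.0:
        return np.zeros_like(values)
    return (values - values.min()) / span


def abnormality_scores(psnr_per_frame, dist_per_frame, lam=0.5):
    """Combine normalized PSNR and query-item distance into one score."""
    return (lam * (1.0 - min_max(psnr_per_frame))
            + (1.0 - lam) * min_max(dist_per_frame))


scores = abnormality_scores([30.0, 20.0, 10.0], [0.1, 0.2, 0.3], lam=0.5)
```

A frame with low PSNR and a large query-item distance ends up with a score near one, and a well reconstructed frame whose queries sit on the memory items ends up near zero.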

       

Related links

Paper and code links
[Paper notes] Memory Anomaly Detection @ CVPR 2020
Reducing the representation capacity of CNNs

Reposted from blog.csdn.net/ResumeProject/article/details/120719125