Anomaly Detection - Defect Detection - Close Reading of the PaDiM Paper

Abstract

We propose a new patch distribution modeling framework, PaDiM, to simultaneously detect and localize anomalies in images in a one-class learning setting. PaDiM uses a pre-trained convolutional neural network (CNN) for patch embedding and multivariate Gaussian distributions to obtain a probabilistic representation of the normal class. It also exploits the correlations between the different semantic levels of the CNN to better localize anomalies. PaDiM outperforms current state-of-the-art methods for both anomaly detection and localization on the MVTec AD and STC datasets. To match real-world visual industrial inspection, we extend the evaluation protocol to assess the performance of anomaly localization algorithms on non-aligned datasets. The state-of-the-art performance and low complexity of PaDiM make it a good candidate for many industrial applications.

Introduction

Humans are able to detect heterogeneous or unexpected patterns in a set of homogeneous natural images. This task is known as anomaly detection and has numerous applications, including visual industrial inspection. However, anomalies are very rare events on production lines, and detecting them manually is cumbersome. Automating anomaly detection therefore enables continuous quality control, because it avoids the lapses caused by reduced operator attention and makes the operators' work easier. In this paper, we focus on anomaly detection and, in particular, on anomaly localization in an industrial inspection context. In computer vision, anomaly detection consists of assigning an anomaly score to an image. Anomaly localization is a more complex task that assigns an anomaly score to each pixel, or to each patch of pixels, to output an anomaly map. Anomaly localization therefore yields more precise and interpretable results. Figure 1 shows examples of anomaly maps produced by our method on images from the MVTec Anomaly Detection (MVTec AD) dataset.

Figure 1. Image samples from MVTec AD. Left column: normal images of the transistor, capsule and wood classes. Middle column: images of the same classes, with ground truth anomalies highlighted in yellow. Right column: anomaly heatmaps obtained by our PaDiM model. Yellow areas correspond to detected anomalies, while blue areas indicate normal zones.

Anomaly detection is a binary classification between normal and abnormal classes. However, since we are often short of examples of anomalies, and anomalies may have unexpected patterns, it is impossible to train a model for this task under full supervision. Therefore, anomaly detection models are usually estimated under the one-class learning setting, i.e., the training dataset only contains images of the normal class. At test time, samples that differ from the normal training dataset are classified as abnormal samples.

Recently, several methods have been proposed to combine anomaly localization and detection in the one-class learning setting. However, they either require training a deep neural network, which can be cumbersome, or run a K-nearest-neighbor (K-NN) algorithm over the entire training dataset at test time. Because the complexity of K-NN is linear in the size of the training set, its time and space costs grow as the training dataset grows. These two scalability issues may hinder the deployment of anomaly localization algorithms in industrial settings.

To address the above issues, we propose a new anomaly detection and localization method, named PaDiM, for Patch Distribution Modeling. It utilizes a pre-trained Convolutional Neural Network (CNN) for embedding extraction with the following two properties:

  1. Each patch position is described by a multivariate Gaussian distribution;
  2. PaDiM takes into account the correlations between the different semantic levels of a pre-trained CNN.

With this new and effective approach, PaDiM outperforms existing state-of-the-art anomaly localization and detection methods on MVTec AD. In addition, its time and space complexity at test time is low and does not depend on the size of the training dataset, which is an asset for industrial applications. We also extend the evaluation protocol to assess model performance under more realistic conditions, namely on non-aligned datasets.

Related Work

Anomaly detection and localization methods can be divided into reconstruction-based methods and embedding similarity-based methods.

Reconstruction-based methods

Reconstruction-based methods are widely used for anomaly detection and localization. Neural network architectures such as autoencoders (AE), variational autoencoders (VAE) or generative adversarial networks (GAN) are trained to reconstruct only normal training images. Anomalous images can therefore be spotted because they are not well reconstructed. At the image level, the simplest approach is to take the reconstruction error as the anomaly score, but additional information from the latent space, intermediate activations or a discriminator can help to better recognize anomalous images. To localize anomalies, reconstruction-based methods can take the pixel-wise reconstruction error or structural similarity as the anomaly score. Alternatively, the anomaly map can be a visual attention map generated from the latent space. Although reconstruction-based methods are intuitive and interpretable, their performance is limited because an AE can sometimes also reconstruct anomalous images well.

Embedding similarity-based methods

Embedding similarity-based methods use deep neural networks to extract meaningful vectors that describe an entire image for anomaly detection, or an image patch for anomaly localization. Methods that only perform anomaly detection give promising results but often lack interpretability, since it is not possible to know which part of an anomalous image is responsible for its high anomaly score. In this case, the anomaly score is the distance between the embedding vector of the test image and reference vectors representing normality in the training dataset. The normal reference can be the center of an n-sphere containing the embeddings of normal images, the parameters of a Gaussian distribution, or the entire set of normal embedding vectors. The last option is used by SPADE, which reports the best results for anomaly localization. However, at test time it runs a K-NN algorithm on the set of normal embedding vectors, so its inference complexity grows linearly with the size of the training dataset. This may hinder the industrial deployment of the method.

Patch Distribution Modeling

Embedding extraction

A pretrained CNN is able to output relevant features for anomaly detection. Therefore, we choose to use a pretrained CNN to generate patch embedding vectors, thus avoiding tedious neural network optimization. The process of embedding patches in PaDiM is similar to that in SPADE, as shown in Figure 2. During the training phase, each patch of a normal image is associated with its spatially corresponding activation vector in the pre-trained CNN activation map.

Activation vectors from different layers are then concatenated to obtain embedding vectors that carry information from different semantic levels and resolutions, encoding both fine-grained and global context. Since the activation maps have a lower resolution than the input image, many pixels share the same embedding and form pixel patches that do not overlap at the original image resolution. The input image can therefore be divided into a grid of positions $(i,j) \in [1,W] \times [1,H]$, where $W \times H$ is the resolution of the largest activation map used to generate the embeddings. Finally, each patch position $(i,j)$ in this grid is associated with the embedding vector $x_{ij}$ computed as described above.
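The concatenation step can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the activation maps are random arrays shaped like the first three ResNet18 stages for a 224x224 input (an assumption), and nearest-neighbour upsampling is assumed as the way to bring the smaller maps to the resolution of the largest one.

```python
import numpy as np

def upsample_nearest(fmap, factor):
    """Repeat each spatial cell `factor` times along height and width."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)

def build_patch_embeddings(feature_maps):
    """Concatenate (C_l, H_l, W_l) maps into one (sum of C_l, H, W) embedding map.

    H x W is the resolution of the largest map. Smaller maps are assumed to be
    square and smaller by an integer factor, and are upsampled to match.
    """
    target_h = max(f.shape[1] for f in feature_maps)
    aligned = [upsample_nearest(f, target_h // f.shape[1]) for f in feature_maps]
    return np.concatenate(aligned, axis=0)

rng = np.random.default_rng(0)
# Hypothetical activation maps (channels, height, width), not real CNN outputs.
layer1 = rng.standard_normal((64, 56, 56))
layer2 = rng.standard_normal((128, 28, 28))
layer3 = rng.standard_normal((256, 14, 14))

embedding = build_patch_embeddings([layer1, layer2, layer3])
print(embedding.shape)  # (448, 56, 56)
```

With this layout, `embedding[:, i, j]` plays the role of the vector $x_{ij}$: it stacks the activations of all three layers for the same patch position.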

The generated patch embedding vectors may carry redundant information, so we experimentally study the possibility of reducing their size (Section V-A). We observe that randomly selecting a few dimensions is more efficient than the classical principal component analysis (PCA) algorithm. This simple random dimensionality reduction significantly decreases the complexity of our models at both training and testing time while maintaining state-of-the-art performance. Finally, the patch embedding vectors of test images are used to output an anomaly map, together with the learned parameters of the normal class described in the next subsection.
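A minimal sketch of this random dimensionality reduction is shown below. The helper name, the seed and the array sizes are assumptions made for illustration. The one point that matters is that the index set is drawn once and then reused for every training and test embedding.

```python
import numpy as np

def select_random_dims(total_dims, kept_dims, seed=0):
    """Draw `kept_dims` distinct channel indices once, for reuse at train and test time."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(total_dims, size=kept_dims, replace=False))

# Hypothetical sizes: 448-dimensional embeddings reduced to 100 dimensions.
idx = select_random_dims(total_dims=448, kept_dims=100)

rng = np.random.default_rng(1)
train_embeddings = rng.standard_normal((8, 448, 14, 14))  # (N, D, H, W)
test_embedding = rng.standard_normal((448, 14, 14))       # (D, H, W)

train_reduced = train_embeddings[:, idx]  # same channels kept for every image
test_reduced = test_embedding[idx]
print(train_reduced.shape, test_reduced.shape)  # (8, 100, 14, 14) (100, 14, 14)
```

Unlike PCA, nothing is fitted here: the reduction is a plain index lookup, which is why it adds almost no cost at training or test time.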

Learning of the normality

To learn the characteristics of normal images at position $(i,j)$, we first compute the set of patch embedding vectors at $(i,j)$ from the $N$ normal training images, $X_{ij} = \{x_{ij}^{k}, k \in [1,N]\}$, as shown in Figure 2. To summarize the information carried by this set, we assume that $X_{ij}$ is generated by a multivariate Gaussian distribution $\mathcal{N}(\mu_{ij}, \Sigma_{ij})$, where $\mu_{ij}$ is the sample mean of $X_{ij}$ and the sample covariance $\Sigma_{ij}$ is estimated as follows:
$$\mu_{ij} = \frac{1}{N} \sum_{k=1}^{N} x_{ij}^{k}$$
$$\Sigma_{ij} = \frac{1}{N-1} \sum_{k=1}^{N} \left(x_{ij}^{k} - \mu_{ij}\right)\left(x_{ij}^{k} - \mu_{ij}\right)^{T} + \epsilon I$$
where the regularization term $\epsilon I$ makes the sample covariance matrix $\Sigma_{ij}$ full rank and invertible. Finally, each possible patch position is associated with a multivariate Gaussian distribution, as shown in Figure 2, through the matrix of Gaussian parameters.
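The estimation of the mean and the regularized covariance can be sketched as follows. This is a minimal NumPy sketch under assumed array layouts (`(N, D, H, W)` for the training embeddings), with a hypothetical function name and toy sizes. It deliberately uses fewer images than dimensions to show why the regularization term is needed.

```python
import numpy as np

def fit_patch_gaussians(embeddings, eps=0.01):
    """Estimate one Gaussian per patch position.

    embeddings: (N, D, H, W) patch embeddings of N normal training images.
    Returns the means, shape (H, W, D), and covariances, shape (H, W, D, D).
    """
    n, d, h, w = embeddings.shape
    x = embeddings.transpose(2, 3, 0, 1)       # (H, W, N, D)
    mean = x.mean(axis=2)                      # (H, W, D)
    centered = x - mean[:, :, None, :]         # (H, W, N, D)
    # Sum of outer products over the N samples, divided by N - 1.
    cov = np.einsum("hwnd,hwne->hwde", centered, centered) / (n - 1)
    cov += eps * np.eye(d)                     # regularization term eps * I
    return mean, cov

rng = np.random.default_rng(0)
train = rng.standard_normal((5, 16, 4, 4))     # toy sizes with N=5 < D=16
mean, cov = fit_patch_gaussians(train)
print(mean.shape, cov.shape)                   # (4, 4, 16) (4, 4, 16, 16)
print(np.linalg.matrix_rank(cov[0, 0]))        # 16
```

Without the `eps * np.eye(d)` term, a covariance estimated from 5 samples in 16 dimensions would have rank at most 4 and could not be inverted.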

Our patch embedding vectors carry information from different semantic levels. Therefore, each estimated multivariate Gaussian distribution $\mathcal{N}(\mu_{ij}, \Sigma_{ij})$ also captures information from different levels, and $\Sigma_{ij}$ contains the inter-level correlations. We show experimentally (Section V-A) that modeling the relationships between the different semantic levels of a pre-trained CNN helps improve anomaly localization performance.

Inference : computation of the anomaly map

Inspired by "Modeling the distribution of normal data in pre-trained deep features for anomaly detection" and "A simple unified framework for detecting out-of-distribution samples and adversarial attacks", we use the Mahalanobis distance $M(x_{ij})$ to give an anomaly score to the patch at position $(i,j)$ of a test image. $M(x_{ij})$ can be interpreted as the distance between the test patch embedding $x_{ij}$ and the learned distribution $\mathcal{N}(\mu_{ij}, \Sigma_{ij})$, and is computed as follows:

$$M(x_{ij}) = \sqrt{\left(x_{ij} - \mu_{ij}\right)^{T} \Sigma_{ij}^{-1} \left(x_{ij} - \mu_{ij}\right)} \tag{7}$$

Therefore, the matrix of Mahalanobis distances that forms the anomaly map can be computed:
$$M = \left(M(x_{ij})\right)_{1 \leq i \leq W,\ 1 \leq j \leq H}$$

The final anomaly score of the entire image is the maximum value of the anomaly map M.
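The inference step can be sketched as below, assuming per-position means of shape `(H, W, D)` and covariances of shape `(H, W, D, D)`. The function name and the toy values are hypothetical. The identity covariance is chosen only so that the result is easy to check by hand, since the Mahalanobis distance then reduces to the Euclidean norm.

```python
import numpy as np

def mahalanobis_map(test_embedding, mean, cov):
    """Mahalanobis distance of each test patch to the Gaussian learned at its position.

    test_embedding: (D, H, W); mean: (H, W, D); cov: (H, W, D, D). Returns (H, W).
    """
    diff = test_embedding.transpose(1, 2, 0) - mean          # (H, W, D)
    # Solve cov @ y = diff per position instead of forming an explicit inverse.
    solved = np.linalg.solve(cov, diff[..., None])[..., 0]   # (H, W, D)
    squared = np.einsum("hwd,hwd->hw", diff, solved)
    return np.sqrt(np.maximum(squared, 0.0))                 # guard against round-off

h, w, d = 2, 2, 3
mean = np.zeros((h, w, d))
cov = np.tile(np.eye(d), (h, w, 1, 1))   # identity covariance at every position
test = np.zeros((d, h, w))
test[:, 1, 0] = [3.0, 4.0, 0.0]          # one patch far from its mean

amap = mahalanobis_map(test, mean, cov)
image_score = amap.max()                 # image-level score: maximum of the map
print(amap[1, 0], image_score)           # 5.0 5.0
```

The cost per test image depends only on the grid size and the embedding dimension, not on the number of training images, which is the scalability argument made in the text.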

Finally, at test time, our method does not suffer from the scalability problem of K-NN based methods, since we do not need to compute and sort a large number of distance values to get a patch's anomaly score.

Experiments

Datasets and metrics

Datasets

We first evaluate our models on MVTec AD, which is designed to test anomaly localization algorithms for industrial quality control in a one-class learning setting. It contains 15 classes of approximately 240 images each. The original image resolution is between 700x700 and 1024x1024. There are 10 object classes and 5 texture classes. Objects are always centered and aligned in the same way across the dataset, as shown by the transistor and capsule classes in Figure 1. In addition to the original dataset, to assess the performance of anomaly localization models in a more realistic context, we create a modified version of MVTec AD, called RdMVTec AD, in which we apply a random rotation (−10°, +10°) and a random crop (from 256x256 to 224x224) to both the training and test sets. This modified version of MVTec AD better reflects real use cases of anomaly localization for quality control, where the objects of interest are not always centered and aligned in the image.

For further evaluation, we also test PaDiM on the Shanghai Tech Campus (STC) dataset [8], which simulates video surveillance from a static camera. It contains 274515 training frames and 42883 testing frames, divided into 13 scenes. The original image resolution is 856x480. The training videos consist of normal sequences, while the test videos contain anomalies such as vehicles in pedestrian areas or people fighting.

Metrics

To evaluate localization performance, we compute two threshold-independent metrics. We use the area under the receiver operating characteristic curve (AUROC), where the true positive rate is the percentage of pixels correctly classified as abnormal. Since AUROC is biased towards large anomalies, we also use a per-region overlap score (PRO-score). It consists of plotting, for each connected component, a mean curve of the rate of correctly classified pixels as a function of the false positive rate between 0 and 0.3. PRO-score is the normalized integral of this curve. A high PRO-score means that both large and small anomalies are well localized.
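To make the first metric concrete, the sketch below computes a pixel-level AUROC from an anomaly map and a binary ground truth mask using the rank-sum formulation, where the AUROC is the probability that a randomly chosen anomalous pixel scores higher than a randomly chosen normal one. The function name and the toy arrays are hypothetical, and the PRO-score is not covered here.

```python
import numpy as np

def pixel_auroc(score_map, gt_mask):
    """AUROC over pixels, computed from average ranks (ties count as half a win)."""
    scores = np.asarray(score_map, dtype=float).ravel()
    labels = np.asarray(gt_mask).ravel().astype(bool)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(scores.size, dtype=float)
    i = 0
    while i < scores.size:
        j = i
        # Extend j to the end of the run of tied scores starting at i.
        while j + 1 < scores.size and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1   # average 1-based rank of the run
        i = j + 1
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2   # Mann-Whitney U statistic
    return float(u / (n_pos * n_neg))

scores = np.array([[0.1, 0.2], [0.8, 0.9]])
perfect = pixel_auroc(scores, np.array([[0, 0], [0, 1]]))   # anomalous pixel ranked first
partial = pixel_auroc(scores, np.array([[0, 0], [1, 0]]))   # one normal pixel scores higher
print(perfect, round(partial, 3))  # 1.0 0.667
```

Because every pixel counts equally, a few large anomalous regions can dominate this number, which is the bias that motivates the additional per-region metric.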

Experimental setups

We train PaDiM with different backbones: ResNet18 (R18), Wide ResNet-50-2 (WR50) and EfficientNet-B5, all pre-trained on ImageNet. When the backbone is a ResNet, patch embedding vectors are extracted from the first three layers in order to combine information from different semantic levels while keeping a resolution high enough for the localization task. Following this idea, when using EfficientNet-B5 we extract patch embedding vectors from layer 7 (level 2), layer 20 (level 4) and layer 26 (level 5). We also apply random dimensionality reduction (Rd) (see Section III-A and Section V-A). Our model names indicate the backbone and the dimensionality reduction method used, if any. For example, PaDiM-R18-RD100 is a PaDiM model with a ResNet18 backbone that uses 100 randomly selected dimensions for the patch embedding vectors. By default we use $\epsilon = 0.01$ in Equation 1.

We reimplemented SPADE as described in the original publication, with Wide ResNet-50-2 (WR50) as the backbone. For SPADE and PaDiM, we use the same preprocessing as in . We resize the images from MVTec AD to 256x256 and center crop them to 224x224. For images from STC, we only resize to 256x256. We resize the images and localization maps using bicubic interpolation and apply a Gaussian filter with parameter σ=4 to the anomaly maps, as in .

We also implement our own VAE as a reconstruction-based baseline, with ResNet18 as the encoder and an 8x8 convolutional latent variable. It is trained on each MVTec AD class with the following data augmentation operations: random rotation (−2°, +2°), resizing to 292x292, random cropping to 282x282, and finally center cropping to 256x256. Training runs for 100 epochs using the Adam optimizer with an initial learning rate of $10^{-4}$ and a batch size of 32 images. The anomaly map used for localization is the pixel-wise L2 reconstruction error.

Origin blog.csdn.net/weixin_45755332/article/details/128532057