Deep Learning Paper: A Zero-/Few-Shot Anomaly Classification and Segmentation Method for CVPR 2023 VAND Workshop

A Zero-/Few-Shot Anomaly Classification and Segmentation Method for CVPR 2023 VAND Workshop Challenge Tracks 1&2: 1st Place on Zero-shot AD and 4th Place on Few-shot AD
PDF: https://arxiv.org/pdf/2305.17382.pdf
PyTorch code: https://github.com/shanglianlm0525/CvPytorch
PyTorch code: https://github.com/shanglianlm0525/PyTorch-Networks

1 Overview

To address the wide diversity of product types in industrial visual inspection, we build a single model that can quickly adapt to numerous categories and requires no or only a few normal reference images, providing a more efficient solution for industrial inspection. This post summarizes the proposed solution for the zero-/few-shot tracks of the 2023 VAND challenge.

1) In the zero-shot task, the proposed solution adds an additional linear layer to the CLIP model to map image features into the joint embedding space, enabling comparison with text features to generate anomaly maps.
2) When a reference image is available (few-shot), the proposed solution utilizes multiple memory banks to store the reference image features and compare them with the query image at test time.

In this challenge, our method won first place in the zero-shot track, performing particularly well on segmentation with an F1-score improvement of 0.0489 over the second-place competitor. In the few-shot track, we achieved 4th place overall and 1st place in classification F1 score.

Core points:

  • Use an ensemble of state-level and template-level prompts to create the text prompts.
  • To localize abnormal regions, an additional linear layer is introduced to map the image features extracted from the CLIP image encoder into the joint embedding space where the text features reside.
  • The similarity between the mapped image features and the text features is computed to obtain the corresponding anomaly maps.
  • In the few-shot setting, the extra linear layers from the zero-shot stage are retained with their weights unchanged. Additionally, the image encoder is used during the testing phase to extract features of the reference images and save them to memory banks for comparison with the features of the test image.
  • To make full use of both shallow and deep features, features from different stages of the image encoder are utilized simultaneously.

2 Methodology

Overall, we adopt the CLIP framework for zero-shot classification and use a combination of state and template ensembles to build our text prompts. To localize abnormal regions in images, we introduce additional linear layers that map the image features extracted from the CLIP image encoder into the joint embedding space where the text features reside. We then compare the similarity between the mapped image features and the text features to obtain the corresponding anomaly maps. For the few-shot case, we retain the extra linear layers of the zero-shot stage and keep their weights unchanged. Furthermore, we use the image encoder to extract features of the reference images and save them into memory banks, which are compared with the features of the test image during the testing phase. Note that, in order to fully utilize both shallow and deep features, we use features from different stages of the encoder in both the zero-shot and few-shot settings.

[Figure 1: Overall framework. The blue path is the zero-shot anomaly map branch; the yellow path is the few-shot memory-bank branch.]

2-1 Zero-shot AD

Anomaly Classification
Anomaly classification is based on the WinCLIP framework. We propose a text prompt ensemble strategy that significantly improves the baseline's anomaly classification accuracy without resorting to complex multi-scale window strategies. The ensemble strategy contains two parts, state-level and template-level:
1) The state-level text prompts use generic text to describe normal or abnormal targets (such as "flawless", "damaged"), rather than overly specific descriptions like "chip around edge and corner";
2) The template-level text prompts are screened from the 85 templates defined for ImageNet in CLIP, removing templates such as "a photo of the weird [obj.]" that are not applicable to anomaly detection tasks.
The two kinds of text prompts are encoded by CLIP's text encoder into the final text features $F_{t} \in R^{2 \times C}$:

# Assumes `torch` and a CLIP-style tokenizer are in scope, e.g. `clip.tokenize` or `open_clip.tokenize`.
def encode_text_with_prompt_ensemble(model, texts, device):
    prompt_normal = ['{}', 'flawless {}', 'perfect {}', 'unblemished {}', '{} without flaw', '{} without defect', '{} without damage']
    prompt_abnormal = ['damaged {}', 'broken {}', '{} with flaw', '{} with defect', '{} with damage']
    prompt_state = [prompt_normal, prompt_abnormal]
    prompt_templates = ['a bad photo of a {}.', 
                        'a low resolution photo of the {}.', 
                        'a bad photo of the {}.', 
                        'a cropped photo of the {}.', 
                        'a bright photo of a {}.', 
                        'a dark photo of the {}.', 
                        'a photo of my {}.', 
                        'a photo of the cool {}.', 
                        'a close-up photo of a {}.', 
                        'a black and white photo of the {}.', 
                        'a bright photo of the {}.', 
                        'a cropped photo of a {}.', 
                        'a jpeg corrupted photo of a {}.', 
                        'a blurry photo of the {}.', 
                        'a photo of the {}.', 
                        'a good photo of the {}.', 
                        'a photo of one {}.', 
                        'a close-up photo of the {}.', 
                        'a photo of a {}.', 
                        'a low resolution photo of a {}.', 
                        'a photo of a large {}.', 
                        'a blurry photo of a {}.', 
                        'a jpeg corrupted photo of the {}.', 
                        'a good photo of a {}.', 
                        'a photo of the small {}.', 
                        'a photo of the large {}.', 
                        'a black and white photo of a {}.', 
                        'a dark photo of a {}.', 
                        'a photo of a cool {}.', 
                        'a photo of a small {}.', 
                        'there is a {} in the scene.', 
                        'there is the {} in the scene.', 
                        'this is a {} in the scene.', 
                        'this is the {} in the scene.', 
                        'this is one {} in the scene.']

    text_features = []
    for i in range(len(prompt_state)):
        prompted_state = [state.format(texts[0]) for state in prompt_state[i]]
        prompted_sentence = []
        for s in prompted_state: # each formatted state phrase for the current class (normal or abnormal)
            for template in prompt_templates:
                prompted_sentence.append(template.format(s))
        prompted_sentence = tokenize(prompted_sentence).to(device)
        class_embeddings = model.encode_text(prompted_sentence)
        class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
        class_embedding = class_embeddings.mean(dim=0)
        class_embedding /= class_embedding.norm()
        text_features.append(class_embedding)
    text_features = torch.stack(text_features, dim=1).to(device).t()

    return text_features
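
For reference, a minimal usage sketch of this function (assuming a CLIP-style `model` is already loaded and the object name "bottle" is used; the variable names are illustrative):

# text_features has shape (2, C): row 0 is the averaged normal embedding, row 1 the abnormal one.
text_features = encode_text_with_prompt_ensemble(model, ['bottle'], device)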

The corresponding image features from the image encoder are $F_{c} \in R^{1 \times C}$.
The state-level and template-level ensembles are implemented by extracting text features with the CLIP text encoder and averaging the normal and abnormal features separately. The averaged normal and abnormal features are then compared with the image features, and the abnormal-class probability obtained after softmax serves as the classification score:
$s = \mathrm{softmax}(F_{c} F_{t}^{T})$
Finally, the second element of $s$ (the abnormal probability) is taken as the result of the anomaly classification problem.

# Compare image and text features; index [0][1] is the abnormal-class probability.
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
results['pr_sp'].append(text_probs[0][1].cpu().item())

Anomaly Segmentation
A natural idea is to extend the image-level anomaly classification approach to anomaly segmentation by measuring the similarity between the features extracted at different levels of the backbone and the text features. However, CLIP is designed as a classification model: apart from the abstract image features used for classification, its other image features are not mapped into the joint image/text space. We therefore propose a simple but effective solution: additional linear layers map the image features at different levels into the joint image/text embedding space; that is, the linear layers project the patch tokens, and each projected patch token is compared with the text features to obtain an anomaly map (see the blue Zero-shot Anomaly Map path in the figure above). Specifically, the features at each level are transformed by a linear layer into the joint embedding space, the transformed features are compared with the text features to obtain per-level anomaly maps, and the per-level maps are simply summed to obtain the final result.

# Assumes `numpy as np`, `torch`, and `torch.nn.functional as F` are imported;
# `linearlayer` is the trained projection layer(s) and `patch_tokens` the multi-level CLIP patch features.
patch_tokens = linearlayer(patch_tokens)
anomaly_maps = []
for layer in range(len(patch_tokens)):
    patch_tokens[layer] /= patch_tokens[layer].norm(dim=-1, keepdim=True)
    anomaly_map = (100.0 * patch_tokens[layer] @ text_features.T)  # similarity to (normal, abnormal) text features
    B, L, C = anomaly_map.shape
    H = int(np.sqrt(L))  # patch tokens form an H x H grid
    anomaly_map = F.interpolate(anomaly_map.permute(0, 2, 1).view(B, 2, H, H),
                                size=img_size, mode='bilinear', align_corners=True)
    anomaly_map = torch.softmax(anomaly_map, dim=1)[:, 1, :, :]  # keep the abnormal channel
    anomaly_maps.append(anomaly_map.cpu().numpy())
anomaly_map = np.sum(anomaly_maps, axis=0)  # sum the per-level maps

The linear layers are trained with focal loss and dice loss, while the parameters of the CLIP backbone remain frozen.
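
A minimal training-step sketch under these losses is shown below, using torchvision's sigmoid_focal_loss as a stand-in focal loss and an inline dice loss; `clip_model`, `linearlayer`, `map_logits` (per-pixel abnormal logits), `gt_mask` (binary defect mask), and all hyperparameters are assumptions for illustration, and the authors' actual loss formulation may differ (e.g., it may operate on the two-channel softmax maps).

import torch
from torchvision.ops import sigmoid_focal_loss

def dice_loss(pred, target, eps=1.0):
    # pred: predicted probabilities (B, H, W); target: binary mask (B, H, W)
    inter = (pred * target).sum(dim=(1, 2))
    union = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

# CLIP is frozen; only the extra linear layers are optimized.
for p in clip_model.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(linearlayer.parameters(), lr=1e-3)  # lr is an assumption

loss = sigmoid_focal_loss(map_logits, gt_mask.float(), reduction='mean') \
       + dice_loss(torch.sigmoid(map_logits), gt_mask.float())
optimizer.zero_grad()
loss.backward()
optimizer.step()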

2-2 Few-shot AD

Anomaly Classification
In the few-shot setting, the anomaly prediction for an image comes from two parts. The first part is the same as in the zero-shot setting. The second part follows the conventional approach used by many AD methods and takes the maximum value of the anomaly map. The proposed scheme adds these two parts together as the final anomaly score, as sketched below.
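
A minimal sketch of this combination, assuming `text_probs` comes from the zero-shot classification branch and `anomaly_map` is the final pixel-level anomaly map (variable names are illustrative):

# Image-level few-shot score: zero-shot text score plus the peak of the pixel-level map.
anomaly_score = text_probs[0][1].cpu().item() + anomaly_map.max()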

Anomaly Segmentation
The few-shot segmentation task uses memory banks, shown with a yellow background in Figure 1.
In essence, the cosine similarity between the query sample and the support samples stored in the memory banks is computed, the result is reshaped into an anomaly map, and this map is added to the zero-shot anomaly map to obtain the final segmentation prediction; a sketch follows below.
In addition, in the few-shot task the linear layers mentioned above do not need to be fine-tuned; the weights trained in the zero-shot task are used directly.
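
A minimal sketch of the memory-bank comparison, assuming `mem_tokens[layer]` holds the stored reference patch features (M, C) for each level, `query_tokens[layer]` the query patch features (1, L, C), and `anomaly_map` the zero-shot map; all names and shapes are illustrative rather than the authors' exact implementation:

import numpy as np
import torch
import torch.nn.functional as F

few_shot_maps = []
for layer in range(len(query_tokens)):
    q = F.normalize(query_tokens[layer][0], dim=-1)   # (L, C) query patch features
    m = F.normalize(mem_tokens[layer], dim=-1)        # (M, C) memory-bank features
    dist = 1.0 - q @ m.T                              # cosine distance to every stored normal patch
    score = dist.min(dim=-1).values                   # distance to the closest normal patch
    H = int(np.sqrt(score.shape[0]))                  # patch tokens form an H x H grid
    score_map = F.interpolate(score.view(1, 1, H, H),
                              size=img_size, mode='bilinear', align_corners=True)
    few_shot_maps.append(score_map.squeeze().cpu().numpy())
# Final prediction: memory-bank map summed over levels, plus the zero-shot anomaly map.
final_map = np.sum(few_shot_maps, axis=0) + anomaly_map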

3 Experiments

[Figure: comparison of zero-shot and few-shot results]
In short, on simpler images the zero-shot and few-shot settings give similar results, but on harder cases the few-shot setting brings a clear improvement.

Source: blog.csdn.net/shanglianlm/article/details/132276540