NeurIPS 2019: "Cross Attention Network for Few-shot Classification"

Published at NeurIPS 2019.
Paper link: https://proceedings.neurips.cc/paper/2019/file/01894d6f048493d2cacde3c579c315a3-Paper.pdf
Code link: https://github.com/blue-blue272/fewshot-CAN

1. Motivation

(Figure 1: example test images and the regions highlighted by their extracted features)

Although existing metric-based approaches are promising, few of them pay enough attention to the discriminability of the extracted features. They usually extract features from support classes and unlabeled query samples independently, so the features are not discriminative enough. On the one hand, the test images in the support/query set come from unseen classes, so their features can hardly focus on the target objects. Specifically, for test images containing multiple objects, the extracted features tend to focus on objects from seen classes, which have abundant labeled samples in the training set, while ignoring objects from unseen classes. As shown in Figure 1(c) and (d) above, for two images from the unseen test class (curtain), the extracted features only capture information about objects related to the training classes, such as the people or chairs in Figure 1(a) and (b). On the other hand, the low-data problem means the feature of each test class cannot represent the true class distribution, because it is obtained from very few labeled support samples. In summary, independent feature representations may fail in few-shot classification.

2. Contribution

In this work, a new Cross Attention Network (CAN) is proposed to improve feature discriminability for few-shot classification.
1) First, the Cross Attention Module (CAM) is introduced to address the unseen-class problem. The idea of cross attention is inspired by human few-shot learning behavior: to recognize a sample from an unseen class, humans tend to first locate the most relevant regions between the pair of labeled and unlabeled samples. Similarly, given a class feature map and a query sample feature map, CAM generates a cross attention map for each feature to highlight the target object. To achieve this, correlation estimation and meta-fusion are employed. The target object in the test sample thus gains attention, and the features weighted by the cross attention map are more discriminative. As shown in Figure 1(e), the features extracted with CAM can roughly locate the target object region.
2) Second, a transductive inference algorithm is introduced that utilizes the entire unlabeled query set to alleviate the low-data problem. The algorithm iteratively predicts labels for the query samples and selects pseudo-labeled query samples to augment the support set. The more support samples each class has, the more representative the resulting class features are, thus alleviating the low-data problem.
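The iterative pseudo-labeling idea in 2) can be sketched in a few lines. Below is a minimal NumPy illustration assuming nearest-prototype classification with cosine similarity; the candidate-selection rule and all function names (`transductive_augment`, `top_t`) are simplifications for illustration, not the paper's exact procedure:

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between (n, d) and (m, d) arrays."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return a @ b.T

def transductive_augment(support, support_labels, queries,
                         n_classes, n_iters=2, top_t=2):
    """Sketch of transductive inference via iterative pseudo-labeling.

    support: (n_s, d) features; queries: (n_q, d) features.
    Each iteration builds class prototypes from the (augmented)
    support set, classifies the queries, and moves the top_t most
    confident queries per class into the support set.
    """
    feats, labels = list(support), list(support_labels)
    for _ in range(n_iters):
        # class prototypes from the current augmented support set
        protos = np.stack([
            np.mean([f for f, y in zip(feats, labels) if y == k], axis=0)
            for k in range(n_classes)])
        sims = cosine_sim(queries, protos)      # (n_q, C)
        preds = sims.argmax(axis=1)
        conf = sims.max(axis=1)
        # reset to the real support set, then add confident pseudo-labels
        feats, labels = list(support), list(support_labels)
        for k in range(n_classes):
            idx = [i for i in range(len(queries)) if preds[i] == k]
            idx.sort(key=lambda i: conf[i], reverse=True)
            for i in idx[:top_t]:
                feats.append(queries[i])
                labels.append(k)
    # final prediction using the augmented prototypes
    protos = np.stack([
        np.mean([f for f, y in zip(feats, labels) if y == k], axis=0)
        for k in range(n_classes)])
    return cosine_sim(queries, protos).argmax(axis=1)
```

With more pseudo-labeled samples per class, each prototype is estimated from more points and better approximates the class distribution, which is exactly the low-data argument above.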

3. Method

3.1 Problem definition

Few-shot classification usually involves a training set, a support set, and a query set. The training set contains a large number of classes with many labeled samples. The support set of a few labeled samples and the query set of unlabeled samples share the same label space, which is disjoint from that of the training set. The goal of few-shot classification is to classify the unlabeled query samples given the training set and the support set. If the support set consists of C classes with K labeled samples each, the problem is called C-way K-shot.
Following existing practice, this paper also adopts the episodic training mechanism, which has proven to be an effective few-shot learning method. The episodes used in training simulate the test setting. Each episode is composed of a support set $\mathcal{S} = \{ (x^s_a, y^s_a)\}^{n_s}_{a=1}$ ($n_s = C \times K$), formed by randomly sampling $C$ classes with $K$ labeled samples each, and a query set $\mathcal{Q} = \{ (x^q_b, y^q_b)\}^{n_q}_{b=1}$, formed by a small fraction of the remaining samples of those $C$ classes. We denote by $\mathcal{S}^k$ the support subset of the $k$-th class. How to represent each support class $\mathcal{S}^k$ and each query sample $x^q_b$, and how to measure the similarity between them, are the key issues in few-shot classification.
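The episode construction above is mechanical enough to sketch directly. A minimal stdlib-only example, assuming the dataset is a dict mapping class names to sample lists (the function name `sample_episode` and the `n_query` parameter are illustrative choices, not from the paper):

```python
import random

def sample_episode(dataset, c_way=5, k_shot=1, n_query=15, seed=None):
    """Sample one C-way K-shot episode from {class_name: [samples]}.

    Returns a support set of K labeled samples per class and a query
    set drawn from the remaining samples of the same C classes, so
    support and query never overlap within an episode.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), c_way)   # C random classes
    support, query = [], []
    for label, cls in enumerate(classes):
        samples = rng.sample(dataset[cls], k_shot + n_query)
        support += [(x, label) for x in samples[:k_shot]]  # n_s = C * K
        query += [(x, label) for x in samples[k_shot:]]    # n_q = C * n_query
    return support, query
```

Training on many such episodes forces the model to classify under exactly the C-way K-shot conditions it will face at test time.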

3.2 Cross Attention Module

In this work, appropriate feature representations for each pair of support class and query sample are obtained through metric learning. The proposed Cross Attention Module (CAM) models the semantic correlation between class features and query features, thereby drawing attention to the target object and facilitating the subsequent matching.
(Figure: overview of CAN; panel (a) shows the Cross Attention Module)
The CAM is shown in (a) above. The class feature map $P^k \in \mathbb{R}^{c \times h \times w}$ is obtained from the support subset $\mathcal{S}^k$ ($k \in \{ 1, 2, \cdots, C\}$), while the query feature map $Q^b \in \mathbb{R}^{c \times h \times w}$ comes from the query sample $x^q_b$ ($b \in \{ 1, 2, \cdots, n_q\}$), where $c$, $h$, and $w$ are the number of channels, height, and width of the feature map, respectively. CAM generates a cross attention map $A^p$ ($A^q$) for $P^k$ ($Q^b$), and then uses $A^p$ ($A^q$) to weight the feature map, yielding a more discriminative feature representation $\bar{P}^k_b$ ($\bar{Q}^b_k$).
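To make the correlation-then-weight idea concrete, here is a parameter-free NumPy sketch: it computes cosine correlations between all spatial positions of the two feature maps, then derives an attention map for each side by pooling the correlations. Note the paper's learned meta-fusion layer is replaced here by simple mean pooling plus softmax, so this is an illustration of the data flow, not the actual module:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_attention(P, Q):
    """Simplified CAM sketch for two (c, h, w) feature maps.

    Builds an (m, m) cosine correlation matrix over the m = h*w
    spatial positions, pools it into one attention map per side,
    and applies residual attention weighting.
    """
    c, h, w = P.shape
    m = h * w
    p = P.reshape(c, m)                              # columns = positions
    q = Q.reshape(c, m)
    p_n = p / (np.linalg.norm(p, axis=0, keepdims=True) + 1e-8)
    q_n = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    R = p_n.T @ q_n                                  # R[i, j]: pos i of P vs pos j of Q
    A_p = softmax(R.mean(axis=1)).reshape(h, w)      # how relevant each P position is to Q
    A_q = softmax(R.mean(axis=0)).reshape(h, w)      # how relevant each Q position is to P
    P_bar = P * (1.0 + A_p)                          # residual weighting keeps the
    Q_bar = Q * (1.0 + A_q)                          # original signal intact
    return P_bar, Q_bar, A_p, A_q
```

Positions of the query that correlate strongly with the class feature map (and vice versa) receive higher weights, which is how the mutually attended features end up emphasizing the shared target object.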

Origin blog.csdn.net/weixin_43994864/article/details/123349370