Reproducing Relation Net (CVPR 2018) for Few-Shot Learning with PaddlePaddle

Relation Net was published at CVPR 2018; the paper is available at https://arxiv.org/abs/1711.06025.
Deep learning has achieved great success in visual recognition tasks, but as the authors point out, training these models requires a large number of labeled images and many iterations to fit the parameters. Every time a new object category is added, labeling takes time, and for emerging or rare categories a large pool of labeled images may simply not exist. Humans, by contrast, can learn new concepts from very few examples (few-shot learning, FSL) or even from none (zero-shot learning, ZSL). The authors give an example: once a child has seen a zebra in a picture or a book, or merely heard a zebra described as a "striped horse", they can recognize one without difficulty.

Motivated by this human ability, and by the fact that deep models classify poorly when trained on only a handful of samples, few-shot learning has regained attention. Fine-tuning can help when samples are relatively scarce, but with only one or a few samples per class, overfitting remains a problem even with data augmentation and regularization. Because the inference mechanisms of many existing few-shot methods are rather complicated, the authors propose Relation Net, a model with a simple structure that can be trained end to end.
In an FSL task, the data is generally split into three sets: a training set, a support set, and a testing set. The support set and testing set share the same label space, while the training set contains none of their labels. If the support set contains C different categories with K labeled samples each, the task is called C-way K-shot. During training, a sample set and a query set are drawn from the training set to play the roles of the support set and testing set; the exact procedure is described in the training strategy section below.
Relation Network consists of an embedding model and a relation model. The core idea is to first extract feature maps of the support-set and test-set images with the embedding model, concatenate the two feature maps along the channel dimension to form a new feature map, and then feed this new feature map into the relation model to compute a relation score, which represents the similarity of the two images.
The figure below shows the network structure and data flow for a single query sample in the 5-way 1-shot setting. The 5 images in the sample set and the 1 image in the query set are passed through the embedding model, their features are concatenated pairwise to obtain 5 new feature maps, and these are fed into the relation model to compute relation scores, yielding a one-hot-like vector in which the highest score indicates the predicted category.

[Figure: Relation Network architecture for the 5-way 1-shot case]
The loss function used for training is also simple: the mean squared error between the predicted relation scores and the ground truth. In the formula below, r_{i,j} is the relation score between sample-set image i and query-set image j, and y_i and y_j are their true labels; the regression target is 1 when the labels match and 0 otherwise:

L = Σ_i Σ_j ( r_{i,j} − 1(y_i == y_j) )²
For the definition of the Relation Network model structure in the PaddlePaddle reproduction, see:
https://github.com/txyugood/paddle_RN_FSL/blob/master/RelationNet.py
Below I will share the technical details of the reproduction with developers.
1. Build a Relation Network
The model consists of two parts: the embedding model and the relation model. Both networks are mainly built from [Conv + BN + ReLU] modules, so we first define a BaseNet class that implements a conv_bn_layer method. The code is as follows:

import math

import numpy as np
import paddle.fluid as fluid
from paddle.fluid.param_attr import ParamAttr


class BaseNet:
    def conv_bn_layer(self,
                      input,
                      num_filters,
                      filter_size,
                      stride=1,
                      groups=1,
                      padding=0,
                      act=None,
                      name=None,
                      data_format='NCHW'):
        n = filter_size * filter_size * num_filters  # fan-out used for He-style weight initialization
        conv = fluid.layers.conv2d(
            input=input,
            num_filters=num_filters,
            filter_size=filter_size,
            stride=stride,
            padding=padding,
            groups=groups,
            act=None,
            param_attr=ParamAttr(name=name + "_weights", initializer=fluid.initializer.Normal(0,math.sqrt(2. / n))),
            bias_attr=ParamAttr(name=name + "_bias",
                                initializer=fluid.initializer.Constant(0.0)),
            name=name + '.conv2d.output.1',
            data_format=data_format)

        bn_name = "bn_" + name

        return fluid.layers.batch_norm(
            input=conv,
            act=act,
            momentum=1,
            name=bn_name + '.output.1',
            param_attr=ParamAttr(name=bn_name + '_scale',
                                 initializer=fluid.initializer.Constant(1)),
            bias_attr=ParamAttr(bn_name + '_offset',
                                initializer=fluid.initializer.Constant(0)),
            moving_mean_name=bn_name + '_mean',
            moving_variance_name=bn_name + '_variance',
            data_layout=data_format)

PaddlePaddle supports two ways of defining a network: static graph and dynamic graph; here I use the static graph mode. The code above defines the conv_bn layer, the module that appears most frequently in a convolutional neural network. Note that the momentum of the batch_norm layer is set to 1. The moving statistics are updated as moving_mean = momentum * moving_mean + (1 - momentum) * batch_mean, so with momentum = 1 they never change from their initial values, which achieves the effect of not recording a global mean and variance.
The meanings of specific parameters are as follows:

  • input: the input tensor to be convolved
  • num_filters: the number of convolution kernels (the number of channels of the output feature map)
  • filter_size: convolution kernel size
  • stride: convolution stride
  • groups: number of groups for grouped convolution
  • padding: padding size; 0 here means no padding is applied
  • act: activation function applied after the BN layer; None means no activation is used
  • name: the name of the operator in the graph
  • data_format: the layout of the input data, NCHW here

Then we define the embedding model part of the Relation Network.

class EmbeddingNet(BaseNet):
    def net(self,input):
        conv = self.conv_bn_layer(
            input=input,
            num_filters=64,
            filter_size=3,
            padding=0,
            act='relu',
            name='embed_conv1')
        conv = fluid.layers.pool2d(
            input=conv,
            pool_size=2,
            pool_stride=2,
            pool_type='max')
        conv = self.conv_bn_layer(
            input=conv,
            num_filters=64,
            filter_size=3,
            padding=0,
            act='relu',
            name='embed_conv2')
        conv = fluid.layers.pool2d(
            input=conv,
            pool_size=2,
            pool_stride=2,
            pool_type='max')
        conv = self.conv_bn_layer(
            input=conv,
            num_filters=64,
            filter_size=3,
            padding=1,
            act='relu',
            name='embed_conv3')
        conv = self.conv_bn_layer(
            input=conv,
            num_filters=64,
            filter_size=3,
            padding=1,
            act='relu',
            name='embed_conv4')
        return conv

First we create an EmbeddingNet class that inherits from BaseNet and therefore gets the conv_bn_layer method. Its net method takes the input image tensor and builds the static graph for the network. The input first passes through a [Conv + BN + ReLU] module to obtain the feature map embed_conv1, followed by a max-pooling operation; pooling shrinks the feature map while preserving the important features, and the subsequent convolution and pooling layers serve the same purpose. The feature map output by embed_conv4 has shape [-1, 64, 19, 19], i.e. four dimensions. The first dimension is the batch size; it is not known when the static graph is built, so -1 indicates that it can take any value. The second dimension is the number of channels, which is 64 after the embedding model. The third and fourth dimensions are the height and width of the feature map, 19 x 19 here.
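As a quick sanity check, the 19 x 19 spatial size can be derived from the 84 x 84 input with a few lines of arithmetic. The following standalone sketch (not part of the repo code) traces the feature-map size through the four conv layers and two pooling layers defined above:

def conv_out(size, kernel=3, stride=1, padding=0):
    # standard convolution output-size formula
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    # max-pool output-size formula (floor mode, the pool2d default)
    return (size - kernel) // stride + 1

s = 84
s = pool_out(conv_out(s, padding=0))  # embed_conv1 + pool -> 41
s = pool_out(conv_out(s, padding=0))  # embed_conv2 + pool -> 19
s = conv_out(s, padding=1)            # embed_conv3 -> 19
s = conv_out(s, padding=1)            # embed_conv4 -> 19
print(s)  # 19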
The code for the relation model is as follows:

class RelationNet(BaseNet):
    def net(self, input, hidden_size):
        conv = self.conv_bn_layer(
            input=input,
            num_filters=64,
            filter_size=3,
            padding=0,
            act='relu',
            name='rn_conv1')
        conv = fluid.layers.pool2d(
            input=conv,
            pool_size=2,
            pool_stride=2,
            pool_type='max')
        conv = self.conv_bn_layer(
            input=conv,
            num_filters=64,
            filter_size=3,
            padding=0,
            act='relu',
            name='rn_conv2')
        conv = fluid.layers.pool2d(
            input=conv,
            pool_size=2,
            pool_stride=2,
            pool_type='max')
        fc = fluid.layers.fc(conv,size=hidden_size,act='relu',
                             param_attr=ParamAttr(name='fc1_weights',
                                                  initializer=fluid.initializer.Normal(0,0.01)),
                             bias_attr=ParamAttr(name='fc1_bias',
                                                 initializer=fluid.initializer.Constant(1)),
                             )
        fc = fluid.layers.fc(fc, size=1,act='sigmoid',
                             param_attr=ParamAttr(name='fc2_weights',
                                                  initializer=fluid.initializer.Normal(0,0.01)),
                             bias_attr=ParamAttr(name='fc2_bias',
                                                 initializer=fluid.initializer.Constant(1)),
                             )
        return fc

We then create a RelationNet class, which also inherits from BaseNet and its conv_bn_layer method. In its net method, the first few layers use [Conv + BN + ReLU] modules for feature extraction, just like the embedding model, and two fully connected layers at the end map the features to a scalar relation score representing the similarity of the two images.

During training, the images in the sample set and the query set are passed through the embedding model to obtain feature maps of shape [-1, 64, 19, 19], which must be paired and concatenated before being fed into the relation model. This part of the code is a bit involved, so I will explain it in sections.

sample_image = fluid.layers.data('sample_image', shape=[3, 84, 84], dtype='float32')
query_image = fluid.layers.data('query_image', shape=[3, 84, 84], dtype='float32')
         
sample_query_image = fluid.layers.concat([sample_image, query_image], axis=0)
sample_query_feature = embed_model.net(sample_query_image)

This part of the code concatenates the sample-image and query-image tensors along the batch dimension (axis 0) to obtain sample_query_image, and sends the result through the embedding model to extract sample_query_feature.

sample_batch_size = fluid.layers.shape(sample_image)[0]
query_batch_size = fluid.layers.shape(query_image)[0]

This part of the code takes dimension 0 of each image tensor as its batch size.

sample_feature = fluid.layers.slice(
    sample_query_feature,
    axes=[0],
    starts=[0],
    ends=[sample_batch_size])
if k_shot > 1:
    # few-shot: merge the K shots of each class into one feature map
    sample_feature = fluid.layers.reshape(sample_feature, shape=[c_way, k_shot, 64, 19, 19])
    sample_feature = fluid.layers.reduce_sum(sample_feature, dim=1)
query_feature = fluid.layers.slice(
    sample_query_feature,
    axes=[0],
    starts=[sample_batch_size],
    ends=[sample_batch_size + query_batch_size])

Since the sample and query images were concatenated earlier, after feature extraction sample_query_feature has to be sliced along dimension 0 (the batch dimension) to recover sample_feature and query_feature. If K-shot is greater than 1, sample_feature is first reshaped so that the K shots of each class sit on their own dimension, and then reduce_sum sums over that dimension and removes it, leaving sample_feature with shape [C-way, 64, 19, 19]. At this point the effective sample batch size is C-way.
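To see what this reshape-and-sum does to the shapes, here is a minimal NumPy sketch with the 5-way 5-shot sizes (illustrative only, not part of the repo):

import numpy as np

c_way, k_shot = 5, 5
# fake embedded support features: C*K images, each 64x19x19
sample_feature = np.random.rand(c_way * k_shot, 64, 19, 19).astype('float32')

# group the K shots of each class together, then sum them away,
# leaving one aggregated feature map per class
sample_feature = sample_feature.reshape(c_way, k_shot, 64, 19, 19).sum(axis=1)
print(sample_feature.shape)  # (5, 64, 19, 19)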

sample_feature_ext = fluid.layers.unsqueeze(sample_feature, axes=0)
query_shape = fluid.layers.concat(
       [query_batch_size, fluid.layers.assign(np.array([1, 1, 1,1]).astype('int32'))])
sample_feature_ext = fluid.layers.expand(sample_feature_ext, query_shape)

Because each query image has to be compared with the features of all C classes, a new dimension is first added to sample_feature with unsqueeze. The expand interface needs the replication factors as a tensor, so a query_shape tensor is built and sample_feature is replicated query_batch_size times, producing a tensor of shape [query_batch_size, sample_batch_size, 64, 19, 19].

query_feature_ext = fluid.layers.unsqueeze(query_feature, axes=0)
if k_shot > 1:
    # the K shots per class were summed earlier, so the effective sample batch size is C-way
    sample_batch_size = sample_batch_size / float(k_shot)
sample_shape = fluid.layers.concat(
    [sample_batch_size, fluid.layers.assign(np.array([1, 1, 1, 1]).astype('int32'))])
query_feature_ext = fluid.layers.expand(query_feature_ext, sample_shape)

As with the sample features, the query features also get an extra dimension and are then replicated sample_batch_size times. Note that if k-shot is greater than 1, the K shots of each class were already summed with reduce_sum, so sample_batch_size has to be divided by k-shot to obtain the effective number of classes. The expansion yields a tensor of shape [sample_batch_size, query_batch_size, 64, 19, 19].

query_feature_ext = fluid.layers.transpose(query_feature_ext, [1, 0, 2, 3, 4])
relation_pairs = fluid.layers.concat([sample_feature_ext, query_feature_ext], axis=2)
relation_pairs = fluid.layers.reshape(relation_pairs, shape=[-1, 128, 19, 19])

Finally, transpose rearranges query_feature_ext so that its shape matches sample_feature_ext, the two features are concatenated along the channel dimension, and the result is reshaped into a tensor relation_pairs of shape [query_batch_size x sample_batch_size, 128, 19, 19].
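The net effect of the unsqueeze/expand/transpose/concat sequence is easiest to see with a small NumPy sketch using the 5-way 1-shot sizes (illustrative only; the real graph does this with fluid ops):

import numpy as np

c_way, q = 5, 15  # 5 class features, 15 query features
sample_feature = np.random.rand(c_way, 64, 19, 19).astype('float32')
query_feature = np.random.rand(q, 64, 19, 19).astype('float32')

# tile the class features for every query image: [q, c_way, 64, 19, 19]
sample_ext = np.broadcast_to(sample_feature[None], (q, c_way, 64, 19, 19))
# tile the query features for every class, then swap the first two axes
# so both tensors line up as [q, c_way, 64, 19, 19]
query_ext = np.broadcast_to(query_feature[None], (c_way, q, 64, 19, 19))
query_ext = query_ext.transpose(1, 0, 2, 3, 4)

# concatenate along the channel axis and flatten the pair axes, as in relation_pairs
relation_pairs = np.concatenate([sample_ext, query_ext], axis=2)
relation_pairs = relation_pairs.reshape(-1, 128, 19, 19)
print(relation_pairs.shape)  # (75, 128, 19, 19) == (q * c_way, 128, 19, 19)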

relation = RN_model.net(relation_pairs, hidden_size=8)
relation = fluid.layers.reshape(relation, shape=[-1, c_way])	

Finally, the concatenated pairs are fed into the relation model, which first produces a vector of length query_batch_size x sample_batch_size. Reshaping it gives a tensor of shape [query_batch_size, sample_batch_size] (sample_batch_size here is in fact C-way), so each row is a length-C-way vector of relation scores that expresses, in one-hot-like form, the predicted category of the corresponding query image.
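At inference time, the predicted class of each query image is simply the column with the highest relation score. Below is a small sketch of how accuracy can be computed from the fetched scores (the variable and function names are assumptions for illustration, not taken from the repo):

import numpy as np

def episode_accuracy(relation_scores, query_labels):
    # relation_scores: [query_batch_size, c_way], query_labels: [query_batch_size]
    predictions = np.argmax(relation_scores, axis=1)  # highest relation score wins
    return float(np.mean(predictions == query_labels))

# example with random scores for a 5-way episode with 15 query images
scores = np.random.rand(15, 5).astype('float32')
labels = np.random.randint(0, 5, size=15)
print(episode_accuracy(scores, labels))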

The code of the loss function is as follows:

one_hot_label = fluid.layers.one_hot(query_label, depth=c_way)
loss = fluid.layers.square_error_cost(relation, one_hot_label)
loss = fluid.layers.reduce_mean(loss)

First the query image labels query_label are converted to one-hot form; the relation scores computed above are laid out the same way, so the loss is simply the MSE between relation and one_hot_label.
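To make the objective concrete, here is a tiny NumPy check of the same computation for a single query image (illustrative only):

import numpy as np

c_way = 5
relation = np.array([[0.9, 0.1, 0.2, 0.0, 0.3]], dtype='float32')  # relation scores for 1 query image
one_hot = np.eye(c_way, dtype='float32')[[0]]                      # its true class is 0
loss = np.mean((relation - one_hot) ** 2)
print(loss)  # small when the correct class scores near 1 and the others near 0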
2. Training strategy
In an FSL task, you could train only on the support set and still run inference on the test set, but because the support set contains so few samples, such a classifier generally performs poorly. The training set is therefore used for training, which yields a much better classifier. An effective way to do this is called episode-based training.
The implementation steps of episode based training are as follows:

  • Training iterates over N episodes. In each episode, C categories are randomly selected from the training set and K samples are drawn from each of them to form the sample set; C and K correspond to the C-way K-shot setting of the support set, so the sample set contains C x K samples.
  • Then several samples are randomly selected from the remaining images of those C categories to form the query set (see the sketch after the batch-size details below).

For 5-way 1-shot learning, the batch size of the sample set is 5 and the batch size of the query set is 15. For 5-way 5-shot learning, the batch size of the sample set is 25 (5 images per category) and the batch size of the query set is 10.
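A minimal sketch of how one training episode could be sampled (the function name, the dict-based dataset layout, and the per-class query count are my own assumptions for illustration, not the repo's actual data loader):

import numpy as np

def sample_episode(images_by_class, c_way=5, k_shot=1, query_per_class=3):
    # images_by_class: dict mapping class id -> list of image paths (or arrays)
    classes = np.random.choice(list(images_by_class.keys()), c_way, replace=False)
    sample_set, query_set = [], []
    for label, cls in enumerate(classes):
        idx = np.random.permutation(len(images_by_class[cls]))
        # first K images form the sample (support) set, the next few form the query set
        sample_set += [(images_by_class[cls][i], label) for i in idx[:k_shot]]
        query_set += [(images_by_class[cls][i], label) for i in idx[k_shot:k_shot + query_per_class]]
    return sample_set, query_set

# fake dataset: 64 training classes with 600 images each
fake_data = {c: ['img_%d_%d.jpg' % (c, i) for i in range(600)] for c in range(64)}
support, query = sample_episode(fake_data, c_way=5, k_shot=1, query_per_class=3)
print(len(support), len(query))  # 5 15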
The Adam optimizer is used for training, with the learning rate set to 0.001.
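Wiring the optimizer into the static graph looks roughly like this (a minimal sketch; the exact call site in the repo may differ):

# attach the Adam optimizer to the MSE loss defined above
optimizer = fluid.optimizer.Adam(learning_rate=0.001)
optimizer.minimize(loss)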
For data augmentation, AutoAugment is applied to both sample-set and query-set images while the data is being read, to increase the diversity of the data.

3. Model reproduction effect
For evaluation, only the miniImageNet dataset used in the paper's experiments is considered: 100 categories with 600 images each, split into training/validation/testing sets of 64, 16, and 20 categories respectively.
The paper reports the following accuracies on the miniImageNet test set:

[Table: few-shot classification accuracies on miniImageNet, from the paper]

Relation Net achieves roughly 50.44% for 5-way 1-shot and 65.32% for 5-way 5-shot.
The Relation Net reproduced with PaddlePaddle was evaluated on the same miniImageNet test set.

[Figure: 5-way 1-shot test accuracy of the reproduction]

[Figure: 5-way 5-shot test accuracy of the reproduction]

The results are consistent with the accuracies reported in the paper, so the reproduction is complete.
Code address: https://github.com/txyugood/paddle_RN_FSL

Original post: https://blog.csdn.net/txyugood/article/details/111008891