FitNets: Hints for Thin Deep Nets (paper notes)

Paper: https://arxiv.org/abs/1412.6550
GitHub: https://github.com/adri-romsor/FitNets

This paper proposes a way to set a network's initial parameters. At the time, training many networks required starting from pre-trained parameters. To train a thin but deep network, the authors propose a knowledge-distillation method that distills the output of an intermediate layer of a larger teacher network into the student network and uses the result as a pre-trained initialization.

Motivation

The top-performing networks at the time (the paper was published at ICLR 2015) were usually very deep and wide, which makes the number of parameters large, training difficult, and inference slow. Depth, however, clearly helps: deeper networks fit features better. The authors therefore propose a method for training networks that are thin but deep.

Methods

The paper takes the softmax-temperature knowledge distillation proposed by Hinton et al. as its basis and additionally introduces the output of an intermediate layer to guide the training of the student network, which is similar to feature-map-based knowledge distillation. The overall framework is shown in the figure below:
(Figure: overall framework of the FitNets training procedure.)
First, the intermediate layers to be distilled are selected (i.e., the teacher's hint layer and the student's guided layer), shown as the green and red boxes in the figure. Since their output sizes may differ, an additional convolutional layer (a regressor) is added after the guided layer so that its output size matches that of the teacher's hint layer.
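As a concrete illustration, here is a minimal PyTorch sketch of such an adapter, assuming the guided layer outputs `c_student` channels and the hint layer outputs `c_teacher` channels at the same spatial resolution (the channel counts and names are hypothetical, not taken from the official code):

```python
import torch.nn as nn

# Hypothetical channel counts; in practice they come from the chosen layers.
c_student, c_teacher = 32, 64

# Convolutional regressor added after the student's guided layer so that its
# output matches the shape of the teacher's hint layer. A 1x1 kernel is used
# here for simplicity; the kernel size/stride would be chosen so that the
# spatial dimensions also match the hint layer.
regressor = nn.Conv2d(c_student, c_teacher, kernel_size=1)
```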

Then all layers of the student network up to and including the guided layer are trained by distillation, so that the student's intermediate layer learns to reproduce the output of the teacher's hint layer. The loss is the squared L2 norm between the output of the added convolutional regressor and the output of the hint layer:
$$\mathcal{L}_{HT}(\mathbf{W}_{Guided}, \mathbf{W}_r) = \frac{1}{2}\,\big\| u_h(\mathbf{x}; \mathbf{W}_{Hint}) - r\big(v_g(\mathbf{x}; \mathbf{W}_{Guided}); \mathbf{W}_r\big) \big\|^2$$
where $u_h$ is the teacher up to its hint layer, $v_g$ is the student up to its guided layer, and $r$ is the added convolutional regressor.
When selecting the intermediate layer, the authors suggest not choosing one that is too deep: the deeper the layer, the more task-specific information it carries, and forcing the student to reproduce it acts as a very strong constraint that can over-regularize the student network.
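A minimal sketch of this hint-based first stage, assuming `student_guided` and `teacher_hint` are the feature maps of the chosen layers and `regressor` is the adapter from the sketch above (all names are mine, not the authors'):

```python
import torch

def hint_loss(student_guided, teacher_hint, regressor):
    """Squared L2 distance between the regressed student features and the
    teacher's hint features, averaged over the batch.

    student_guided: output of the student's guided layer, shape (N, c_student, H, W)
    teacher_hint:   output of the teacher's hint layer,   shape (N, c_teacher, H, W)
    """
    # The teacher is frozen in this stage, so its features carry no gradient.
    diff = regressor(student_guided) - teacher_hint.detach()
    return 0.5 * diff.pow(2).sum(dim=(1, 2, 3)).mean()
```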

After the layers up to the guided layer have been trained this way, the resulting parameters are used to initialize the student network, and then all of the student's parameters are trained with knowledge distillation so that the student learns to match the teacher's output. Because the teacher's predictions on easy examples are very confident, almost one-hot in a classification task, the authors soften the predicted output so that it carries more information, using the softmax temperature proposed by Hinton et al.: a scaling factor $\tau$ is introduced before the softmax, and the pre-softmax outputs (logits) of both teacher and student are divided by $\tau$. The loss function of this stage is:
$$\mathcal{L}_{KD}(\mathbf{W}_S) = \mathcal{H}\big(y_{\mathrm{true}},\, P_S\big) + \lambda\, \mathcal{H}\big(P_T^{\tau},\, P_S^{\tau}\big)$$
The first term is the cross-entropy between the student's output and the ground truth, and the second term is the cross-entropy between the temperature-softened softmax outputs of the teacher and the student; $\lambda$ adjusts the relative weight of the two cross-entropies.
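And a sketch of this second-stage objective, assuming `student_logits` and `teacher_logits` are the pre-softmax outputs; the temperature `tau` and weight `lam` are illustrative values, not the ones tuned in the paper:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, tau=4.0, lam=0.5):
    """Hard-label cross-entropy plus cross-entropy between the
    temperature-softened teacher and student distributions."""
    # First term: standard cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    # Soften both distributions by dividing the logits by tau.
    soft_teacher = F.softmax(teacher_logits.detach() / tau, dim=1)
    log_soft_student = F.log_softmax(student_logits / tau, dim=1)
    # Second term: H(P_T^tau, P_S^tau), weighted by lambda.
    soft = -(soft_teacher * log_soft_student).sum(dim=1).mean()
    return hard + lam * soft
```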

Experiments

Datasets: CIFAR-10, CIFAR-100, SVHN, MNIST, AFLW
Networks: teacher is a maxout convolutional network; student is a FitNet

Results

(Result tables from the paper on the benchmarks listed above.)

Thoughts

This paper is relatively old, but I picked up a few tricks from it. First, distilling feature maps is equivalent to adding a regularization constraint, so to avoid over-constraining the student it is better to choose an earlier (shallower) layer for distillation. Second, when distilling the softmax output it is best to soften it with a temperature. Finally, the multi-stage training scheme should be useful for complex training setups.

Original post: blog.csdn.net/qq_43812519/article/details/105332565