PRN(20200908): Frosting Weights for Better Continual Training

Zhu, Xiaofeng, et al. “Frosting Weights for Better Continual Training.” 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA). IEEE, 2019.

1. Problem background

The problem addressed is still catastrophic forgetting in deep learning. For background, see the earlier introduction to catastrophic forgetting and PRN (20200816): A Hierarchical Deep Convolutional Neural Network for Incremental Learning [Tree-CNN].

However, in the fifth paragraph of the introduction, the authors point out explicitly that this work should be regarded as a special case of continual learning, since both the paper's focus and its data setting differ from standard continual learning.

In continual learning and sequential learning, only the new data is available for model training.

The authors state that their goal differs from continual learning and sequential learning: here, the retrained model should perform well on both old and new data. Therefore, when new data arrives, the network is retrained on the new data together with part of the old data. To keep the amount of computation from growing as new data accumulates, the paper uses a fixed-size training set, i.e. the total number of new plus old samples is held constant. How the old samples are selected is not specified in the paper. A number of doubts about this will be raised below.

The authors' approach to alleviating catastrophic forgetting belongs to the ensemble-model family. The main contribution is two ensemble models (BoostNet and FrostNet) for continual training tasks.

2. Three ways to solve catastrophic forgetting

  • Parameter regularization: these methods balance learning on new and old tasks by restricting changes to the weights that are important for historical tasks. The inherent difficulty is how to estimate a weight's importance to the historical tasks; synaptic intelligence, for example, falls into this category (see the sketch after this list).
  • Model ensembling: this can be understood through the lens of ensemble learning. Catastrophic forgetting is avoided by training additional sub-networks on top of the pre-trained network to handle different tasks. The drawback is that storage consumption grows in proportion to the number of training stages; put simply, each batch of new data requires building a new sub-network to train on it.
  • Memory consolidation: this addresses catastrophic forgetting at the data level. By learning to store different data patterns or memories, it regenerates synthetic training data that consolidates historical knowledge and continually stimulates the network. Representative methods include knowledge distillation, self-refreshing memory approaches, and GAN-based methods.
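To make the first category concrete, here is a minimal sketch (my own, not from the paper) of an importance-weighted quadratic penalty in the spirit of synaptic intelligence / EWC. `old_params` and `importance` are assumed to have been recorded after training on the historical task; `lam` is a hypothetical trade-off coefficient.

```python
import torch

def regularized_loss(task_loss, model, old_params, importance, lam=0.1):
    """Penalize changes to weights that were important for the historical task."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        # Quadratic penalty, weighted by the estimated importance of each weight.
        penalty = penalty + (importance[name] * (p - old_params[name]) ** 2).sum()
    return task_loss + lam * penalty
```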

The goal of this paper is to continually learn a more generalizable network, that is, to correct the errors of the historically trained model by acquiring knowledge from the new data.

3. Method

To better understand the method, here is the paper's training process:

  • Use data set 1 to pre-train a teacher network (data set 1 is called the historical data, and data set 2 is called the new data);
  • Then combine half of data set 1 with half of the new data set 2 into a new training set and train the new network on it (this constructed set is called the retraining data set; a sketch of its construction follows this list).
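The paper does not say how the old samples are screened, so this minimal sketch simply samples them at random; `old_data`, `new_data`, and `total_size` are hypothetical names for the historical set, the new set, and the fixed budget.

```python
import random

def build_retraining_set(old_data, new_data, total_size):
    """Fixed-size retraining set: roughly half old samples, half new samples."""
    half = total_size // 2
    old_subset = random.sample(old_data, min(half, len(old_data)))
    new_subset = random.sample(new_data, min(total_size - len(old_subset), len(new_data)))
    return old_subset + new_subset
```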

3.1 BoostNet

  • Take a network trained on the historical data set (shown in the dashed box A in the figure below); call it network A;
  • Combine the new data with a subset of the old data to form the retraining data set;
  • Feed the retraining data set into network A and compute the model's outputs and residuals (Residuals), where a residual is the difference between the true value and the predicted value; it measures how much network A must improve to perform well on the retraining data. Normally the residuals on the old samples in the retraining set are close to zero, while the residuals on the new samples are large;
  • Keep the parameters and structure of network A unchanged, and train network B using the residuals computed above as its training targets;
  • Finally, at inference (the prediction stage), the final output is the sum of network A's output and network B's output.
    [Figure: BoostNet — pre-trained network A (dashed box) plus network B trained on the residuals]

The idea is simple, but the problem with BoostNet is that every time new data arrives, a new network must be built to fit the previous network's residuals.
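Below is a minimal PyTorch-style sketch of the BoostNet procedure as described above, not the authors' implementation. It assumes regression-style targets (for classification, the residual would instead be taken between the label and network A's softmax output); `net_a`, `net_b`, and `retrain_loader` are hypothetical placeholders for the frozen pre-trained network, a fresh network with the same output shape, and the mixed old+new data loader.

```python
import torch
import torch.nn as nn

def train_boostnet(net_a, net_b, retrain_loader, epochs=10, lr=1e-3):
    # Network A stays fixed; only network B is trained.
    net_a.eval()
    for p in net_a.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(net_b.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x, y in retrain_loader:
            with torch.no_grad():
                residual = y - net_a(x)        # how far network A is off
            loss = mse(net_b(x), residual)     # network B fits the residual
            opt.zero_grad()
            loss.backward()
            opt.step()

def boostnet_predict(net_a, net_b, x):
    # Inference: the final prediction is the sum of the two networks' outputs.
    return net_a(x) + net_b(x)
```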

3.2 FrostNet (updated on 2020.10.13)

For the pre-trained network, when new data arrives, a frosting layer with its own parameters is inserted between every pair of parameterized layers, taking the previous layer's output as its input. The frosting layers are then trained on the mixed retraining set (half old data, half new data). After training, the frosting-layer parameters are multiplied with the network's own parameters to obtain the network's new parameters. In effect, what is trained are coefficients on the network's parameters: training the new network amounts to training a network augmented with these extra frosting-parameter layers.

[Figure: FrostNet structure]
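Below is a minimal sketch of my reading of FrostNet for a single fully connected layer: the pre-trained weight `W` is frozen, a trainable frosting coefficient matrix `S` of the same shape is introduced, the effective weight during retraining is the element-wise product `S * W`, and after training `S` is folded back into `W`. This is an interpretation, not the authors' code.

```python
import torch
import torch.nn as nn

class FrostedLinear(nn.Module):
    """A frozen pre-trained linear layer wrapped with trainable frosting coefficients."""

    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        # Freeze the original weights and bias as buffers (not trained).
        self.register_buffer("w", pretrained.weight.detach().clone())
        self.register_buffer("b", pretrained.bias.detach().clone()
                             if pretrained.bias is not None else None)
        # Trainable per-weight frosting coefficients, initialized to 1 (identity).
        self.scale = nn.Parameter(torch.ones_like(self.w))

    def forward(self, x):
        # Effective weight = frosting coefficients * original weights (element-wise).
        return nn.functional.linear(x, self.scale * self.w, self.b)

    def fold(self) -> nn.Linear:
        """Multiply the frosting coefficients into the weights to get the updated layer."""
        out_f, in_f = self.w.shape
        folded = nn.Linear(in_f, out_f, bias=self.b is not None)
        with torch.no_grad():
            folded.weight.copy_(self.scale * self.w)
            if self.b is not None:
                folded.bias.copy_(self.b)
        return folded
```

Under this reading, only the frosting coefficients are updated on the retraining set, and the original weights are never overwritten until the fold step.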

Finally: thanks to Mr. Li (added on October 13, 2020)

At first, because I had such high expectations for the paper, the word "Frosting" seemed magical to me, and no matter how I tried I could not understand how this frosting network works. Below is the content of the earlier version of this post; isn't it full of a knowledge seeker's thirst and helplessness? . . =^=

Since I didn't understand it, I posted the original text (please tell me if you understand it):
[Figures: excerpts from the original paper's description of FrostNet]
Since I don't understand it, I won't comment on the quality of FrostNet. But isn't the paper's writing too terse? The important things are simply not stated.
To preempt anyone saying that I did not read the paper carefully, here is the email exchange I had with the author, as proof of how much I wanted to figure out how the frosting network actually frosts:
My email to the author (please forgive my crappy English):
[Figure: my email to the author]
Author's reply:
[Figure: the author's reply]
First of all, thanks to the author for the reply; he was very kind. However, it didn't really answer the question. Alas, I didn't want to linger on it any longer, so I replied "Thank you very much" and withdrew.

The above is the content of the earlier post.

Last night, my graduate advisor, while looking up material, came across my blog:
[Figure: screenshot of the message from my advisor]
Well, I actually didn't want to believe this:

[Figure: screenshot]
Thanks again, teacher!

That's all for this section. Thank you!


by windSeS
2020.10.13

Origin blog.csdn.net/u013468614/article/details/108464647