Group Normalization Paper Notes

Paper authors: Yuxin Wu and Kaiming He, Facebook AI Research (FAIR)

Email: {yuxinwu,kaiminghe}@fb.com

Paper link: https://arxiv.org/abs/1803.08494

Code link: https://github.com/shaohua0116/Group-Normalization-Tensorflow

1. Introduction:

The introduction of Batch Normalization (BN) was a milestone in the development of deep learning. However, BN has serious problems with small batches: as the training batch gets smaller, the error increases. This shortcoming limits the use of BN in computer vision tasks that can only afford small batches because of insufficient memory, such as semantic segmentation and video classification.

In this paper, the authors propose Group Normalization (GN) as an "improved" version of BN. BN computes the mean and variance over the entire batch and normalizes the features to zero mean and unit variance; GN instead divides the channels into groups and computes the mean and variance within each group. GN's computation is therefore independent of the batch size, and its accuracy remains stable whether the batch size is large or small.
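To make the difference concrete, here is a minimal NumPy sketch (not code from the paper; the toy shapes and variable names are my own) showing which axes the statistics are computed over, assuming NCHW layout:

```python
import numpy as np

N, C, H, W = 2, 8, 4, 4   # toy sizes
G = 4                     # number of groups; each group has C // G channels
x = np.random.randn(N, C, H, W)

# BN: one mean/variance per channel, computed over the batch and spatial axes.
bn_mean = x.mean(axis=(0, 2, 3), keepdims=True)            # shape (1, C, 1, 1)
bn_var = x.var(axis=(0, 2, 3), keepdims=True)
x_bn = (x - bn_mean) / np.sqrt(bn_var + 1e-5)

# GN: split channels into G groups; one mean/variance per (sample, group),
# computed over the group's channels and the spatial axes; no batch axis involved.
xg = x.reshape(N, G, C // G, H, W)
gn_mean = xg.mean(axis=(2, 3, 4), keepdims=True)           # shape (N, G, 1, 1, 1)
gn_var = xg.var(axis=(2, 3, 4), keepdims=True)
x_gn = ((xg - gn_mean) / np.sqrt(gn_var + 1e-5)).reshape(N, C, H, W)
```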

Experimental results: training ResNet-50 on ImageNet with a batch size of 2, GN's error rate is 10.6% lower than BN's. With the commonly used larger batch sizes, GN's accuracy is comparable to BN's. GN also transfers naturally to fine-tuning from pre-trained models: it outperforms its BN counterparts on COCO object detection and segmentation and on Kinetics video classification. These results show that GN can replace BN in many deep learning tasks, and GN is very easy to implement in code.


Layer Normalization (LN) and Instance Normalization (IN) are normalization methods proposed after BN. Like GN, they do not operate over the batch dimension, but LN and IN have mainly been effective for RNN/LSTM or GAN models and have had limited success in computer vision. GN, in contrast, works well for computer vision tasks, and the authors note that it could also be studied for RNNs and GANs.

2. Related work on normalization:

Normalizing the input data is known to make training faster. For hidden features, however, normalization schemes typically rest on assumptions about the feature distribution that can become invalid as training evolves, making training inefficient or even ineffective. Before BN, deep networks such as AlexNet used Local Response Normalization (LRN), which, unlike today's methods, normalizes only over a small neighborhood of each pixel. BN normalizes over the whole batch and therefore depends heavily on the batch size; at inference time the batch is often 1, so BN must fall back on statistics pre-computed during training, which may not match the data actually seen. This strongly limits BN. Several newer methods have been proposed to address BN's dependence on the batch size, but they bring only modest improvements (often with extra requirements on computation or on the data) and do not solve the problem.

Group-wise computation is an established idea in network design. Group convolutions split the channels into groups and convolve each group separately; AlexNet already used them to distribute the model across GPUs, and MobileNet and Xception push them to the extreme of channel-wise (depth-wise) convolution, where the number of groups equals the number of channels. ShuffleNet adds a channel shuffle operation that permutes channels across groups. GN is related to these methods in that it also groups channels, but with a big difference: GN is a generic normalization layer, not a form of convolution.

3. Detailed explanation of GN:

Generally speaking, the channels of a visual representation are not completely independent. Classical hand-crafted features such as SIFT, HOG, and GIST, as well as higher-level encodings such as VLAD and Fisher Vectors (FV), are group-wise representations in which groups of channels are treated jointly to form higher-level features. Deep features, likewise, are not unstructured vectors. Taking the first convolutional layer of a network as an example, its filters often come in related variants (for example, a filter and its horizontally flipped counterpart), and such related channels can reasonably be normalized together. Combining related channels yields more advanced detectors; for example, a horizontal-line detector and a vertical-line detector can together indicate a rectangle, and this effect is even more pronounced in higher layers. Inspired by these observations, the paper proposes a new, generic group-wise normalization layer.

3.1 GN algorithm concept


GN, LN, and IN are all independent of the batch size, and they are closely related: if the number of groups is set to G = 1, GN becomes LN, and if G equals the number of channels C (one channel per group), GN becomes IN. LN assumes that all channels of a layer contribute equally, so it cannot exploit groups of related channels, which is why LN tends to perform worse than GN; IN, at the other extreme, cannot exploit any cross-channel dependence.
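In the paper's notation, BN, LN, IN, and GN all share the same normalization formula and differ only in the set $\mathcal{S}_i$ of elements over which the mean and variance are computed ($\gamma$ and $\beta$ are a per-channel scale and shift learned as in BN):

```latex
\hat{x}_i = \frac{1}{\sigma_i}(x_i - \mu_i), \qquad y_i = \gamma \hat{x}_i + \beta,
\qquad \text{where} \quad
\mu_i = \frac{1}{m}\sum_{k \in \mathcal{S}_i} x_k, \quad
\sigma_i = \sqrt{\frac{1}{m}\sum_{k \in \mathcal{S}_i} (x_k - \mu_i)^2 + \epsilon}.

% For GN, S_i contains the elements of the same sample (same N index) that lie
% in the same group of C/G consecutive channels:
\mathcal{S}_i = \Big\{ k \;\Big|\; k_N = i_N,\ \Big\lfloor \tfrac{k_C}{C/G} \Big\rfloor = \Big\lfloor \tfrac{i_C}{C/G} \Big\rfloor \Big\}
```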

3.2 GN code implementation

The implementation of GN is very simple and takes only a few lines of code. Take TensorFlow as an example:

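The following sketch stays close to the snippet given in the paper (TensorFlow 1.x style; in TF 2.x, tf.nn.moments takes keepdims instead of keep_dims). The input x is assumed to be in NCHW layout:

```python
import tensorflow as tf

def GroupNorm(x, gamma, beta, G, eps=1e-5):
    # x: input features with shape [N, C, H, W]
    # gamma, beta: trainable scale and offset, each with shape [1, C, 1, 1]
    # G: number of groups for GN
    N, C, H, W = x.shape
    x = tf.reshape(x, [N, G, C // G, H, W])
    # per-(sample, group) mean and variance over the group's channels and H, W
    mean, var = tf.nn.moments(x, [2, 3, 4], keep_dims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    x = tf.reshape(x, [N, C, H, W])
    return x * gamma + beta
```

Note that, unlike BN, no moving averages are needed: the same computation is used at training and inference time.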

4. Experimental results


5. Work summary

The GN proposed in this paper is shown to keep network performance stable when the batch size is small, but its full potential has not yet been explored; future architectures designed with GN in mind may achieve even better results. The authors also suggest that GN could be worth investigating for RNNs/LSTMs, GANs, and reinforcement learning tasks.

6. Personal Summary

Personally, I think GN is currently best suited to the following scenarios:

1. Tasks that are forced to use a small batch size by limited computing resources (insufficient GPU memory).

2. Relatively shallow networks such as VGG. The authors' experiments show that with VGG-16 on the ImageNet dataset, GN performs about 0.4% better than BN. For deeper networks, however, if the batch size can be kept large enough, BN is still the recommended choice.

7. Questions that can be researched

1. Whether combining GN with group convolutions (e.g. MobileNet, ShuffleNet) can further improve results.

2. Could BN and GN be used together for better results, e.g. CONV1 → BN → CONV2 → GN? A hypothetical sketch of such a mixed block follows below.
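For question 2, here is a minimal, hypothetical sketch of such a mixed block, assuming a recent TensorFlow/Keras version where tf.keras.layers.GroupNormalization is available (older versions need the GroupNormalization layer from tensorflow_addons); the layer sizes and group count are arbitrary:

```python
import tensorflow as tf

def mixed_norm_block(filters, groups=32):
    # Hypothetical block mixing the two normalizations: CONV1 -> BN -> CONV2 -> GN.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(filters, 3, padding="same", use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(filters, 3, padding="same", use_bias=False),
        tf.keras.layers.GroupNormalization(groups=groups),
        tf.keras.layers.ReLU(),
    ])

block = mixed_norm_block(64)
y = block(tf.random.normal([2, 56, 56, 64]))  # runs even with a batch size of 2
print(y.shape)
```

Whether such mixing actually helps would have to be verified experimentally.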

