A Comparison of Normalization Methods


Introduction


In what follows, assume the input is an image tensor of shape [N, C, H, W] (batch, channels, height, width).

Batch Norm normalizes over the batch: for each channel it computes statistics over N, H, W, so every channel is normalized across the whole mini-batch; it becomes ineffective when the batch size is small;
Layer Norm normalizes in the channel direction, over C, H, W, i.e. over the full depth of each individual input; its effect is most obvious in RNNs;
Instance Norm normalizes over the pixels of a single image, i.e. over H and W for each sample and channel; it is used in style transfer;
Group Norm divides the channels into groups, somewhat like LN; it is essentially LN with the channels further partitioned, normalizing within each group;
Switchable Norm combines BN, LN, and IN with learned weights, letting the network itself learn which normalization method each layer should use. A sketch of all five follows this list.
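
To make the variants concrete, here is a minimal PyTorch sketch (the shapes are illustrative assumptions) showing which axes each method computes its mean over:

```python
import torch

N, C, H, W = 8, 16, 32, 32
x = torch.randn(N, C, H, W)

# The methods differ only in which axes the statistics are taken over:
bn_mean = x.mean(dim=(0, 2, 3), keepdim=True)  # BN: over N, H, W -> one stat per channel
ln_mean = x.mean(dim=(1, 2, 3), keepdim=True)  # LN: over C, H, W -> one stat per sample
in_mean = x.mean(dim=(2, 3), keepdim=True)     # IN: over H, W    -> per sample and channel

G = 4  # GN: split the C channels into G groups, then normalize within each group
gn_mean = x.view(N, G, C // G, H, W).mean(dim=(2, 3, 4), keepdim=True)
```

(Variances are computed over the same axes; SN has no fixed axes, since it learns a weighted mix of the BN, LN, and IN statistics.)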

BN

Before training, we normalize the input data so that it lies within a predefined range (for example, [0, 1] or [-1, 1]), which removes the adverse influence of outlier samples. But even though the data entering the input layer has been normalized, the distribution of the data entering each later layer keeps changing: every parameter update in the earlier layers shifts the input distribution of the layers behind them, and small changes at the front of the network accumulate and amplify with depth. This changing distribution of the intermediate layers' data during training is called "Internal Covariate Shift". BN was proposed precisely to address it: by applying BN to the data entering each layer and pulling it back toward a standard distribution, the layer inputs keep a consistent distribution and vanishing gradients are avoided.
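
For reference, the standard BN transform first standardizes each activation with the mini-batch statistics and then re-scales it with learned parameters:

$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

where $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$ are the mean and variance of the current mini-batch, $\epsilon$ is a small constant for numerical stability, and $\gamma$, $\beta$ are the learned scale and shift.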
Note that internal covariate shift and covariate shift are two different things: the former happens inside the network, while the latter concerns the input data, e.g. the normalization preprocessing we apply to the training data. Also note that BN hurts performance when the batch size is small, and gives poor results on binary classification tasks with imbalanced class distributions. The reason is that with a small batch, the batch mean and variance deviate from those of the full dataset, so they are not adequate stand-ins for the statistics of the entire data distribution; the same problem appears when the class distribution is imbalanced.
In practice, BN requires computing and storing the batch mean, variance, and other statistics for each layer of the network. For feedforward networks of fixed depth (DNNs, CNNs) this is convenient. For RNNs, however, sequence lengths vary, i.e. the effective depth of the RNN is not fixed: different time steps need their own saved statistics, and there may be test sequences longer than any sequence seen in training, which makes the bookkeeping during training troublesome.
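
A small PyTorch sketch of the bookkeeping described above (the layer sizes are arbitrary assumptions): in training mode BN normalizes with the current batch's statistics and updates its running estimates, while in eval mode it falls back on the stored running mean and variance.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=16)

# Training mode: statistics come from the current mini-batch, and the
# running estimates are updated. A tiny batch gives noisy statistics.
bn.train()
_ = bn(torch.randn(2, 16, 8, 8))

# Eval mode: the stored running mean/variance are used instead, so their
# quality depends entirely on the batches seen during training.
bn.eval()
_ = bn(torch.randn(1, 16, 8, 8))

print(bn.running_mean.shape)  # torch.Size([16]) -- one statistic per channel
```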

Batch Norm is a very important technique in deep learning. It not only makes training deeper networks easier and accelerates convergence, but also has a certain regularization effect that helps prevent overfitting. It is widely used in CNN-based classification tasks.
But in some recent experiments on image super-resolution and image generation, I found that Batch Norm performs poorly on this kind of task: adding Batch Norm actually slowed training down, made it unstable, and eventually caused divergence.
What follows is my personal take on this phenomenon; it is not conclusive and needs further testing.

First, in image super-resolution the network's output must be consistent with the input in color, contrast, and brightness, changing only the resolution and some details. Batch Norm, however, acts like a kind of contrast stretch on the image: after passing through Batch Norm, an image's color distribution is normalized away, which is to say its original contrast information is destroyed, so adding Batch Norm ends up hurting the quality of the network's output. Although Batch Norm's scale and shift parameters could in principle undo the normalization, that only adds training difficulty and time; it is simpler to drop a layer that was never needed. There is, however, one kind of architecture where it can work: residual networks (Residual Net), where Batch Norm is used only inside the residual blocks, as in SRResNet, a residual network for image super-resolution. Why can such a network use Batch Norm? My personal understanding is that the image's contrast information can be transmitted directly through the skip connections, so there is no need to worry about Batch Norm destroying it.

The same idea also explains, from another angle, why Batch Norm is so effective for image classification. Classification does not need to preserve an image's contrast information; the structural information of the image is enough for classifying it. So normalizing the images with Batch Norm costs nothing but reduces the difficulty of training, and some salient structures are even highlighted after Batch Norm (the contrast is stretched open).

As for picture style transfer, why can it use Batch Norm? The reason is that after stylization, the image's color, contrast, and brightness are all unrelated to the original image; only the style matters, and the original image contributes only its structural information to the final generated image. So it is not surprising that style-transfer networks use Batch Norm or Instance Norm; moreover, Instance Norm normalizes each single image directly, which is more direct than Batch Norm, and can even drop the scale and shift parameters altogether.

Putting it more broadly: Batch Norm discards the absolute differences between image pixels (or features), since the mean is zeroed and the variance normalized, and keeps only the relative differences. So for tasks that do not need absolute differences (such as classification), it is icing on the cake; for a task like image super-resolution that does need absolute differences, Batch Norm only adds to the trouble.

LN

Unlike BN, LN normalizes across all the neurons within a layer for a single sample. The differences from BN:

In LN, all neurons in the same input layer share the same mean and variance, while different input samples have different means and variances;
In BN, different neurons have different means and variances, while the same neuron shares its statistics across the samples of a batch.
LN does not depend on the batch size or on the depth of the input sequence, so it can be used with a batch size of 1 and with RNNs to normalize variable-length input sequences.
In practice, LN is most often used in RNN-style networks!
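
As a sketch of this batch-size and length independence (the dimensions are illustrative assumptions), PyTorch's nn.LayerNorm normalizes each position over its feature dimension, so a batch size of 1 and any sequence length work identically:

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(normalized_shape=64)  # normalize over the 64 features

x = torch.randn(1, 37, 64)              # batch of 1, sequence length 37
y = ln(x)                               # each time step normalized independently
print(y.mean(dim=-1).abs().max())       # ~0: per-position mean is zero
```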

IN

BN normalizes each batch to ensure a consistent data distribution, because the result of a discriminative model depends on the overall distribution of the data. In image style transfer, however, the generated image depends primarily on a single instance, so normalizing over the entire batch is not appropriate in this case; instead, normalization should be done over the pixels of each image, which accelerates the model's convergence and keeps each image instance independent of the others!
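
A minimal sketch with PyTorch's nn.InstanceNorm2d (the shapes are illustrative assumptions): each (sample, channel) plane is normalized over H and W on its own, so the images in a batch never influence one another.

```python
import torch
import torch.nn as nn

inorm = nn.InstanceNorm2d(num_features=3, affine=True)
x = torch.randn(4, 3, 64, 64)
y = inorm(x)

# Each instance is independent: normalizing an image alone gives the
# same result as normalizing it as part of a batch.
assert torch.allclose(y[0], inorm(x[:1])[0], atol=1e-6)
```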

GN

Batch Normalization performs poorly mainly when the batch size is small. GN divides the channels into groups and normalizes within each group, computing the mean and variance over the (C // G, H, W) values of each group, so it is independent of the batch size and not constrained by it.
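
A quick sketch with PyTorch's nn.GroupNorm (the group count and shapes are illustrative assumptions), showing that it works even with a single sample:

```python
import torch
import torch.nn as nn

# Statistics are computed per sample over each group of C // G channels,
# so the result is the same whether the batch holds 1 image or 64.
gn = nn.GroupNorm(num_groups=4, num_channels=16)
x = torch.randn(1, 16, 32, 32)  # batch size 1 is no problem
y = gn(x)
```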

SN

  1. While normalization improves a model's generalization, the normalization layers themselves are hand-designed. In practice, different normalization operations are designed for different problems, and no single normalization method solves every application;
  2. A deep neural network often contains dozens of normalization layers, and these layers typically all use the same operation; manually designing a separate normalization operation for each layer would require a great deal of experimentation.

Therefore, the authors propose a self-adaptive normalization method, Switchable Normalization (SN), to solve this problem. Unlike reinforcement-learning approaches, SN uses differentiable learning to determine the appropriate normalization operation for each layer of a deep network.
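
The core idea can be sketched as follows. This is a simplified, training-mode-only illustration of the SN idea, not the paper's reference implementation (which, among other things, also maintains running BN statistics for inference): the layer computes IN, LN, and BN statistics and blends them with softmax-normalized learned weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchNorm2d(nn.Module):
    """Simplified sketch of Switchable Normalization: a learned,
    softmax-weighted mix of IN, LN, and BN statistics."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_features, 1, 1))
        # Learnable importance weights over the three kinds of statistics.
        self.mean_w = nn.Parameter(torch.ones(3))
        self.var_w = nn.Parameter(torch.ones(3))

    def forward(self, x):
        mean_in = x.mean((2, 3), keepdim=True)                 # IN stats
        var_in = x.var((2, 3), keepdim=True, unbiased=False)
        mean_ln = x.mean((1, 2, 3), keepdim=True)              # LN stats
        var_ln = x.var((1, 2, 3), keepdim=True, unbiased=False)
        mean_bn = x.mean((0, 2, 3), keepdim=True)              # BN stats
        var_bn = x.var((0, 2, 3), keepdim=True, unbiased=False)

        mw = F.softmax(self.mean_w, dim=0)
        vw = F.softmax(self.var_w, dim=0)
        mean = mw[0] * mean_in + mw[1] * mean_ln + mw[2] * mean_bn
        var = vw[0] * var_in + vw[1] * var_ln + vw[2] * var_bn

        x = (x - mean) / torch.sqrt(var + self.eps)
        return x * self.weight + self.bias
```

Because the mixing weights are ordinary parameters, they are trained by backpropagation together with the rest of the network, which is what makes SN differentiable rather than a search procedure.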

Conclusion

To sum up: BN normalizes over the batch and suits large-batch discriminative tasks such as classification; LN normalizes within each sample and suits RNNs and variable-length inputs; IN normalizes each image individually and suits style transfer and image generation; GN groups channels to stay robust at small batch sizes; and SN lets the network learn which combination to use. The right choice depends on the task and on the batch size you can afford.

Source: www.cnblogs.com/icodeworld/p/11887788.html