Artificial intelligence learning (2) Batch Normalization



Foreword


Reference: Batch Normalization principle and practice



1. Why we need Batch Normalization

1.1 Internal covariate shift

Concept: covariate shift means that the input distribution of a learning system changes.
In neural networks this leads to the following problems:
1. Since we mostly train with MBGD (Mini-Batch Gradient Descent), we have to tune the model hyperparameters carefully, especially the learning rate and the initial parameter values. On top of that, training is complicated by the fact that the input of each layer is affected by the parameters of all preceding layers, so small changes in the network parameters are amplified as the network gets deeper.
This is the so-called vanishing gradient problem: if the network is too deep, the gradient that reaches the earlier layers becomes very small, which makes training very difficult.

2. For deep learning, i.e., network structures with many hidden layers, the parameters of every layer keep changing during training, so each hidden layer faces its own covariate shift; this is the so-called Internal Covariate Shift. In other words, as the parameters of the preceding layers change, the input distribution of a hidden layer keeps changing as well. It is therefore beneficial for that input distribution x to stay fixed for a while: the hidden layer then does not have to keep re-learning how to adapt to a shifting distribution of x, which effectively speeds up training.

1.2 Prior research

Previous studies have shown that whitening the input images makes a neural network converge faster.
The authors therefore apply a normalization operation before the input of every layer (a simplified whitening operation).

2. Batch Normalization

The Batch Normalization operation keeps the input of each layer in roughly the same distribution. For the sigmoid activation in particular, it also pulls most of the data into the central region of the function, which reduces the problems of vanishing and exploding gradients (and allows a larger learning rate to be used).

But simple normalization also has problems. We do not always want the data to have zero mean and unit variance, because that can change what a layer is able to represent: for example, it confines the sigmoid to the nearly linear region around 0, so the activation behaves largely like a linear function, and a composition of linear functions is still linear, so the point of depth is lost (it becomes difficult to increase the expressive power of the network).
Therefore, the authors introduced two learnable parameters, γ and β.
$$y_i = \gamma \hat{x}_i + \beta$$ (where $\hat{x}_i$ denotes the normalized activation)
Specifically, when γ is learned to equal the batch standard deviation (and β the batch mean), the scale-and-shift cancels the normalization exactly, so Batch Normalization reduces to the identity mapping and effectively "does nothing". The concrete parameter values are learned by the network itself.
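A quick toy check of this point (the numbers below are made up for illustration): setting γ to the batch standard deviation and β to the batch mean exactly undoes the normalization.

```python
import numpy as np

# Toy activations of one feature over a mini-batch of 4 samples
x = np.array([1.0, 2.0, 3.0, 4.0])
eps = 1e-5
mu, var = x.mean(), x.var()

x_hat = (x - mu) / np.sqrt(var + eps)     # normalized activations
gamma, beta = np.sqrt(var + eps), mu      # choose γ, β so that BN is undone
y = gamma * x_hat + beta                  # scale and shift

print(np.allclose(y, x))                  # True: BN can recover the identity mapping
```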

At test time, we use the mean and variance estimates saved during training, i.e., the moving averages of the batch statistics, as the mean and variance for BN on every test sample.
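A minimal sketch of how those estimates can be maintained (the momentum value, class name, and shapes here are illustrative assumptions, not from the original post): an exponential moving average of the batch statistics is updated during training and reused for every test sample.

```python
import numpy as np

class RunningStats:
    """Track moving mean/variance during training; reuse them at test time."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def update(self, batch):
        # Exponential moving average of the per-feature batch statistics
        self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * batch.mean(axis=0)
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * batch.var(axis=0)

    def normalize_at_test_time(self, x):
        # At inference, every sample is normalized with the same saved statistics
        return (x - self.running_mean) / np.sqrt(self.running_var + self.eps)
```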

The specific formulas are as follows, for a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$:

$$\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_\mathcal{B}\right)^2$$

$$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$
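Putting the four steps above together, a minimal NumPy sketch of the training-mode forward pass could look like this (the function and parameter names are assumptions; γ and β would be learnable parameters of the layer):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode BN for an input of shape (batch_size, num_features)."""
    mu = x.mean(axis=0)                     # (1) mini-batch mean
    var = x.var(axis=0)                     # (2) mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # (3) normalize
    return gamma * x_hat + beta             # (4) scale and shift with learnable γ, β
```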

Supplement:
Although the data in each mini-batch is sampled from the overall population, different mini-batches have different means and variances, which adds random noise to the learning process; this is similar to the noise that Dropout injects into network training, and to some extent it has a regularizing effect on the model.
In computer vision, BN normalizes each channel over the mini-batch: for example, the mean and variance of channel 1 are computed across all samples (and spatial positions) in the batch, and that channel is then normalized with them.
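As a sketch of that per-channel computation for an (N, C, H, W) tensor (the layout and epsilon value are assumptions for illustration): the statistics are taken over the batch and spatial dimensions, separately for each channel.

```python
import numpy as np

def batch_norm_2d_train(x, gamma, beta, eps=1e-5):
    """BN for images: x has shape (N, C, H, W); gamma and beta have shape (C,).

    Each channel gets one mean/variance, computed over all samples and all
    spatial positions in the mini-batch.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # per-channel mean, shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)   # per-channel variance, shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Broadcast the per-channel learnable parameters over batch and spatial dims
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```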


Origin blog.csdn.net/weixin_43869415/article/details/121820580