Scientific knowledge
ICML is the abbreviation of the International Conference on Machine Learning, which has grown into a top annual international conference on machine learning hosted by the International Machine Learning Society (IMLS).
# Preface
In the last deep learning theory article, we studied GoogLeNet, a representative deep network that builds greater depth through its proposed Inception module. Today we step into one of the most representative deep networks in deep learning: ResNet. How does it differ from previous network structures, and why was it proposed? Read on to find out.
ResNet network
The title of the paper shared here is Deep Residual Learning for Image Recognition. As soon as the network was proposed, it completely refreshed the computer vision field's understanding of deep networks, and residual-network variants have since been derived in many fields; its far-reaching influence continues to this day. There is an exaggerated saying that in the field of image recognition, people know only the residual network and no other. Although a bit overstated, it reflects how popular the residual network is among researchers.
Screenshot of the paper:
Paper address: https://arxiv.org/pdf/1512.03385.pdf
1. Why the residual network was proposed
The two figures above show the derivation of backpropagation and the gradient descent algorithm covered previously. For details, see: Deep Learning Theory (5) -- Mathematical Derivation of the Gradient Descent Algorithm, and Deep Learning Theory (7) -- Backpropagation.
The authors state the motivation for the residual network at the start of the abstract: the deeper a neural network is, the harder it is to train. But why? Didn't we say that deeper networks extract more features, and that deeper layers represent richer information? Generally speaking, yes, but only within a certain range of depth. How large is that range? Roughly the depth of the earlier VGG and GoogLeNet networks; those are already deep enough, and if made much deeper, the network may fail to train at all. Why?

This must be explained from the perspective of gradients. Recall from our earlier deep learning articles that network parameters are updated by the backpropagation algorithm, and that backpropagation relies on gradient descent. So what does network depth have to do with gradient descent? Backpropagation computes derivatives through the whole network by the chain rule, and here the problem arises. Within a certain depth the chain rule works fine, but the deeper the network, the more factors are multiplied together, and these per-layer gradient factors are often floating-point numbers with magnitude below 1. Multiplying more and more of them drives the final gradient toward 0; this is the vanishing gradient problem. Once the gradient vanishes, the gradient descent update formula leaves the parameters essentially unchanged, so the network parameters stop updating and training cannot proceed. Therefore, as the network deepens, the vanishing gradient problem appears.
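The chain-rule argument can be made concrete with a small numerical sketch (a toy illustration, not code from the paper): multiply together many per-layer gradient factors whose magnitude is below 1, and the product collapses toward zero as the depth grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def chained_gradient(depth, scale=0.8):
    """Simulate the chain rule through `depth` layers: the overall gradient
    is the product of one per-layer factor per layer. Each factor here is a
    random number of magnitude below 1 (between 0.4 and 0.8)."""
    factors = scale * rng.uniform(0.5, 1.0, size=depth)
    return np.prod(factors)

# The deeper the chain, the closer the product gets to 0.
for depth in (5, 20, 50, 100):
    print(f"depth {depth:3d}: gradient magnitude ~ {chained_gradient(depth):.3e}")
```

At a depth of 100 the product is on the order of 0.8^100 (about 1e-10): numerically indistinguishable from zero for the purpose of a parameter update, which is exactly the training stall described above.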
This is why deeper networks are harder to train, and it is usually given as the motivation for the residual network.
2. Network structure
The figure above shows the residual learning block proposed in the paper, which is also the basic module of the residual network. Looking closely, how does it differ from previous networks? The change is in fact very simple: there is one extra skip connection, which links the input directly to the output, and which is easily overlooked. Adding the input x to the output gives an ordinary network two advantages:
1. High-level information is fused with low-level information, making the feature representation richer.
2. Because of the added input x, during backpropagation the gradient always contains one extra term: the derivative of the identity path with respect to x, which is 1. This keeps the gradient from ever becoming vanishingly small, alleviates the vanishing gradient problem, and allows deeper networks to be trained.
The network structure configurations in the paper range from 18 layers to 152 layers.
One of the examples: the 34-layer residual network structure
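As a sanity check on those configurations, the depths in the paper's table follow directly from the per-stage block counts: ResNet-18/34 stack two-layer basic blocks, while ResNet-50/101/152 stack three-layer bottleneck blocks, plus the initial 7x7 convolution and the final fully connected layer. A small sketch (block counts taken from the paper's Table 1):

```python
# Depth = (layers per block) * (total blocks) + 2,
# where "+2" is the initial 7x7 conv layer and the final fully connected layer.
CONFIGS = {
    # name: (layers per block, blocks in each of the 4 stages)
    "ResNet-18":  (2, [2, 2, 2, 2]),   # basic blocks: two 3x3 convs
    "ResNet-34":  (2, [3, 4, 6, 3]),
    "ResNet-50":  (3, [3, 4, 6, 3]),   # bottleneck blocks: 1x1, 3x3, 1x1
    "ResNet-101": (3, [3, 4, 23, 3]),
    "ResNet-152": (3, [3, 8, 36, 3]),
}

def depth(layers_per_block, stage_blocks):
    """Total weighted-layer count of a ResNet configuration."""
    return layers_per_block * sum(stage_blocks) + 2

for name, (lpb, stages) in CONFIGS.items():
    print(f"{name}: {depth(lpb, stages)} layers")
```

Each computed depth matches the number in the network's name, which is a handy way to remember how the five configurations relate to one another.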
Since the basic residual block is relatively simple, we will not explain the structure of each layer in detail here; we will cover it thoroughly in the hands-on article.
END
# Epilogue
This is the end of this issue's sharing. The emergence of the residual network advanced the progress of deep learning, and it also teaches us something: a better solution is not necessarily a more complicated one. Sometimes a small change brings a great improvement. We need to start from the basic principles in order to go further.
See you in the next issue!
Editor: Layman Yueyi|Review: Layman Xiaoquanquan
Advanced IT Tour
Past review
Deep Learning Theory (16) -- GoogLeNet's Re-exploration of the Mystery of Depth
Deep Learning Theory (15) -- VGG's initial exploration of the mystery of depth
Deep Learning Theory (14) -- AlexNet's next level
What have we done in the past year:
[Year-end Summary] Saying goodbye to the old and welcoming the new, 2020, let's start again
[Year-end summary] 2021, bid farewell to the old and welcome the new