Foreword
This article compares two regularization methods: drop_out and drop_path.
1. drop_out
1.1. Reasons
In a machine learning model, if the model has too many parameters and too few training samples, the trained model is prone to overfitting. Overfitting is often encountered when training neural networks. It shows up as follows: on the training data the loss is small and the prediction accuracy is high, but on the test data the loss is relatively large and the prediction accuracy is lower.
Overfitting is a common problem in machine learning, and an overfitted model is hardly usable. To address it, model ensembling is often adopted, that is, several models are trained and combined. The time cost then becomes a major problem: not only is it time-consuming to train multiple models, it is also time-consuming to test them.
Regularization methods such as drop_out can effectively alleviate both problems (overfitting and the time cost of ensembling).
1.2. Concept
drop_out is a regularization method proposed to alleviate overfitting in deep neural networks such as CNNs; it is also known as random deactivation.
drop_out can be used as a trick for training deep neural networks. In each training batch, overfitting can be significantly reduced by ignoring half of the feature detectors (setting the values of half of the hidden nodes to 0). This reduces the interaction between feature detectors (hidden-layer nodes), where interaction means that some detectors can only function by relying on other detectors.
Put simply, during forward propagation in the training phase, the activation of each neuron is set to 0 with a certain probability.
drop_out does effectively alleviate overfitting, but it may slow down convergence, because only part of the parameters are updated in each iteration, which can make gradient descent slower.
1.3. Working principle
The working principle of drop_out is introduced through a simple three-layer neural network with input $X$ and output $Y$. The normal training process first passes the input $X$ forward through the network and then backpropagates the error to determine how to update the parameters. With drop_out, the procedure becomes:
- Traverse the nodes of each layer of the network and set a node retention probability keep_prob, that is, each node in the layer is kept with probability keep_prob. keep_prob takes a value between 0 and 1; assume keep_prob = 0.5
- Because any node in the layer may be removed, the network cannot rely too heavily on a single node, so no node's weight grows too large, which reduces overfitting
- Delete the dropped nodes, together with the connections between them and the rest of the network
- Input the samples and train with the simplified network
- Pass the input $X$ forward through the new network with some neurons deactivated (as shown on the right of the figure above), then compute the loss and backpropagate it. After a mini-batch of samples has gone through this process, update the parameters by gradient descent
- Repeat this process over and over:
  - Restore the deactivated neurons
  - Randomly deactivate neurons again with probability p = 1 - keep_prob (the neurons deactivated this time are not necessarily the same as last time)
  - Pass the input forward through the new network with some neurons deactivated, then compute the loss and backpropagate it. After the new batch of samples has gone through this process, update the parameters by gradient descent
Note that drop_out is generally used only in the training phase of the network, not in the testing phase. Using drop_out during testing could make the predictions change randomly (since drop_out deactivates nodes at random). Moreover, in implementations that divide the retained activations by keep_prob during training (so-called inverted dropout), the expected value of the output is already unchanged, so nothing further is needed in the testing phase.
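The procedure in this section can be sketched in PyTorch roughly as follows. This is only a minimal illustration under stated assumptions: the class name `TinyNet`, the layer sizes, the random data and the learning rate are all made up for the example, and the mask is drawn by hand (with inverted-dropout rescaling) rather than with a library layer so that each step is visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyNet(nn.Module):
    """Hypothetical three-layer network (input -> hidden -> output)."""
    def __init__(self, keep_prob=0.5):
        super().__init__()
        self.keep_prob = keep_prob
        self.hidden = nn.Linear(8, 16)
        self.out = nn.Linear(16, 1)
        self.last_mask = None  # stored only so the masks can be inspected

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        if self.training:
            # Fresh 0/1 mask: each hidden node is kept with probability keep_prob.
            mask = (torch.rand_like(h) < self.keep_prob).to(h.dtype)
            self.last_mask = mask
            # Divide by keep_prob so the expected activation is unchanged.
            h = h * mask / self.keep_prob
        return self.out(h)

net = TinyNet(keep_prob=0.5)
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
X, Y = torch.randn(64, 8), torch.randn(64, 1)  # made-up data

net.train()
masks = []
for step in range(3):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(net(X), Y)  # forward pass through the thinned network
    loss.backward()                           # dropped hidden nodes receive zero gradient
    optimizer.step()                          # update the parameters
    masks.append(net.last_mask.clone())       # a different mask is drawn at every step

net.eval()                                    # testing phase: no nodes are dropped
with torch.no_grad():
    pred_a = net(X)
    pred_b = net(X)                           # identical to pred_a, no randomness
```

Because the mask is redrawn inside `forward`, "restoring" and "re-deactivating" neurons happens implicitly on every batch; in evaluation mode the branch is skipped and repeated predictions agree.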
1.4. Scale matching problem
As mentioned above, drop_out is generally used only in the training phase of the network, not in the testing phase. That is, forward propagation uses only the neurons that were not deactivated during training, while all neurons are used during testing. As a result, the scale of the data differs between the training and testing phases.
So, if no rescaling is applied during training, all weight parameters $W$ must be multiplied by $1 - p$ at test time to keep the scale consistent between training and testing.
For example, when drop_out is not used, suppose the value of the first neuron in the first hidden layer can be expressed as:
$$Z_1^1 = \sum_{i=1}^{100} w_1^i x_1^i$$
For simplicity, assume $w_1^i x_1^i = a$; then $Z_1^1 = 100a$.
When drop_out is used in the training phase with deactivation rate $p = 0.3$, it can be understood that only 70 neurons are active, so $Z_1^1 = \sum_{i=1}^{70} w_1^i x_1^i = 70a$. In the testing phase drop_out is not used and all neurons contribute, giving $100a$. It is easy to see that $30a$ is missing when drop_out is used; this is the scale inconsistency between the training and testing phases.
To keep the scale consistent, all weight parameters $W$ must be multiplied by $1 - p$ at test time, i.e. $Z_1^1 = \sum_{i=1}^{100} (0.7\, w_1^i)\, x_1^i = 70a$, so that training with drop_out and testing without drop_out have the same scale. drop_out therefore behaves differently in the training and testing phases.
Pay attention to this when implementing the code: call the train() function to indicate that the model is entering the training phase, where drop_out works normally, and call the eval() function to indicate that the model is entering the testing phase, where drop_out stops working.
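The arithmetic of this section can be checked with a few lines of PyTorch. The sketch below assumes $a = 0.5$ and, as in the text, pretends that exactly 70 of the 100 inputs stay active. The last part shows how `nn.Dropout` handles the same issue: it divides the kept values by $1 - p$ during training, so that `eval()` can simply pass the input through unchanged.

```python
import torch
import torch.nn as nn

p = 0.3                        # deactivation rate from the example
a = 0.5                        # assumed value of each product w_1^i * x_1^i
w = torch.full((100,), a)      # weights chosen so that w * x = a
x = torch.ones(100)

# Training with drop_out: exactly 70 of the 100 inputs stay active (idealized).
mask = torch.zeros(100)
mask[:70] = 1.0
z_train = (w * x * mask).sum()            # 70a

# Testing without drop_out: all 100 inputs contribute.
z_test = (w * x).sum()                    # 100a

# Testing with the weights multiplied by 1 - p: the scale matches training.
z_test_scaled = ((1 - p) * w * x).sum()   # 70a

# nn.Dropout rescales during training instead of at test time.
drop = nn.Dropout(p=p)
drop.train()
y_train = drop(x)              # entries are either 0 or 1 / (1 - p)
drop.eval()
y_eval = drop(x)               # identical to x
```

Both conventions keep the expected value the same in the two phases; they differ only in whether the correction is applied at test time (multiply by $1 - p$) or during training (divide by $1 - p$).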
1.5. Reasons for effectively alleviating overfitting
- drop_out deactivates some nodes, which simplifies the structure of the neural network and thus acts as regularization
- Averaging is usually an effective way to prevent overfitting. Dropping different hidden neurons is similar to training different networks: randomly deleting some hidden neurons yields a different network structure, so the whole drop_out process is equivalent to averaging over many different neural networks
- drop_out randomly deactivates the nodes of the neural network, so the network will not put too much weight on any single node during training, and its nodes will not depend on any single input feature
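The averaging argument can be illustrated numerically for the simplest case, a single linear unit. The sketch below uses made-up inputs and weights and is not part of the original method: it samples many randomly thinned copies of the unit and compares their average output with the output of the full unit whose weights are scaled by the keep probability. For a linear unit the two agree in expectation; for deep nonlinear networks the correspondence is only approximate.

```python
import torch

torch.manual_seed(0)

p = 0.5                          # drop rate
x = torch.randn(100)             # made-up input
w = torch.randn(100)             # made-up weights of one output unit

# Sample many "thinned" networks: each keeps a random subset of the inputs.
num_networks = 20000
masks = (torch.rand(num_networks, 100) >= p).float()
thinned_outputs = (masks * w * x).sum(dim=1)

# Average prediction of all sampled sub-networks.
ensemble_average = thinned_outputs.mean()

# One full network with its weights scaled by the keep probability 1 - p.
scaled_full_output = ((1 - p) * w * x).sum()

gap = (ensemble_average - scaled_full_output).abs().item()
```

Here `gap` is small compared with the spread of the individual `thinned_outputs`, which is the sense in which one scaled network stands in for the whole ensemble.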
1.6. Code implementation
nn.Dropout and nn.functional.dropout are two concrete implementations:
import torch.nn as nn
import torch.nn.functional as F

class Dropout1(nn.Module):
    def __init__(self):
        super(Dropout1, self).__init__()
        self.fc = nn.Linear(100, 20)

    def forward(self, input):
        out = self.fc(input)
        # Pass training=self.training explicitly so that dropout follows train()/eval()
        out = F.dropout(out, p=0.5, training=self.training)
        return out

# Writing F.dropout(out, p=0.5) without the training argument does not behave as intended.
# F.dropout is a plain external function, so changes to the model's training state do not
# reach it, and its training argument defaults to True: dropout would stay active even
# after Net.eval().
Net = Dropout1()
Net.train()

# Alternatively, use nn.Dropout() directly. nn.Dropout() is essentially a wrapper around
# F.dropout that passes self.training automatically; the two do not differ in substance.
class Dropout2(nn.Module):
    def __init__(self):
        super(Dropout2, self).__init__()
        self.fc = nn.Linear(100, 20)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, input):
        out = self.fc(input)
        out = self.dropout(out)
        return out

Net = Dropout2()
Net.train()
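The effect of the `training` argument can be checked directly. In PyTorch, `F.dropout` has `training=True` as its default, so a module that omits the argument keeps dropping values even in evaluation mode. The two module names below are hypothetical and exist only for this comparison.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlwaysDrops(nn.Module):
    """Hypothetical module that forgets to pass the training flag."""
    def forward(self, x):
        return F.dropout(x, p=0.5)  # training defaults to True

class FollowsMode(nn.Module):
    """Hypothetical module that forwards self.training."""
    def forward(self, x):
        return F.dropout(x, p=0.5, training=self.training)

x = torch.ones(1000)

bad = AlwaysDrops().eval()
good = FollowsMode().eval()

bad_out = bad(x)    # still contains zeros although the module is in eval mode
good_out = good(x)  # identical to x
```

Using `nn.Dropout` as a submodule avoids the issue altogether, since it reads the module's own `training` flag.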
2. drop_path
2.1. Differences from drop_out
drop_path randomly "disables" branches of a multi-branch structure in a deep learning model, whereas drop_out randomly "disables" neurons. In other words, drop_out randomly closes point-to-point paths, while drop_path randomly closes point-to-layer paths.
Suppose a Linear layer has 4 input nodes and 5 output nodes, so there are 20 point-to-point paths in total. drop_out randomly closes some of these paths, while drop_path randomly selects an input node and closes all 5 paths connected to it.
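In terms of tensors, the difference shows up in the shape of the random mask. The sketch below is an illustration rather than the article's implementation: it assumes a made-up batch of shape [B, C, H, W] and the per-sample behaviour of the drop_path function given in section 2.4, and builds both kinds of mask by hand.

```python
import torch

torch.manual_seed(0)

x = torch.ones(4, 3, 2, 2)   # made-up batch: B=4 samples, C=3, H=W=2
p = 0.5                      # drop probability
keep_prob = 1 - p

# drop_out style: one independent keep/drop decision per element.
element_mask = (torch.rand_like(x) < keep_prob).float()

# drop_path style: one decision per sample, broadcast over C, H and W.
sample_mask = (torch.rand(x.shape[0], 1, 1, 1) < keep_prob).float()

dropped_elements = x * element_mask / keep_prob   # zeros scattered inside each sample
dropped_samples = x * sample_mask / keep_prob     # each sample is all zeros or all kept
```

With drop_path a sample's branch output is either removed completely or kept completely, whereas drop_out removes individual values inside every sample.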
2.2. Working principle
The mathematical principle of drop_path is in fact similar to that of drop_out: draw a random value rand in the range $(0, 1)$; for a given drop_rate, compute $p = \text{rand} + (1 - \text{drop\_rate})$
- Rounding $p$ down gives data that follows a 0-1 distribution (1 with probability 1 - drop_rate, 0 with probability drop_rate). Multiplying the nodes by these values closes a fraction drop_rate of the nodes
- However, during propagation the number of nodes counted does not change (closed nodes are still included), so the mean of the input data, $u = \mathrm{sum}(x)/N$, changes
Suppose the original data is $X$, with $N$ nodes and mean $u$
- After a drop operation with ratio $r$, $n = N \cdot r$ nodes in the data are set to 0
- The new mean is therefore $u' = (N - n) \cdot u / N$; clearly the mean has changed, and the data distribution and gradients change with it
- To keep the data consistent, the mean must be pulled back: $u' \div \frac{N - n}{N}$, i.e. $u' \div (1 - r)$
- However, the output of drop_path is only an adjustment of the original data; the drop itself is completed through the activation function
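The rounding trick and the mean correction can be verified with a short experiment. The sketch assumes the form used by the implementation in section 2.4, where the keep probability `keep_prob = 1 - drop_rate` is added to the random value before rounding down; the data values and the sample count are made up.

```python
import torch

torch.manual_seed(0)

drop_rate = 0.2
keep_prob = 1 - drop_rate
N = 100_000

rand = torch.rand(N)                     # uniform values in [0, 1)
mask = torch.floor(rand + keep_prob)     # 1 with probability keep_prob, otherwise 0

x = torch.full((N,), 3.0)                # made-up data with mean u = 3
dropped = x * mask                       # mean shrinks to about keep_prob * u
rescaled = dropped / keep_prob           # mean is pulled back to about u

kept_fraction = mask.mean().item()
mean_after_drop = dropped.mean().item()
mean_after_rescale = rescaled.mean().item()
```

`rand + keep_prob` lies in [keep_prob, 1 + keep_prob), so rounding down gives 1 exactly when `rand >= drop_rate`, which happens with probability `keep_prob`.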
2.3. Application in the network
Suppose you have the following code in the forward pass:
x = x + self.drop_path( self.conv(x) )
Then, in the drop_path branch, each sample in a batch has probability drop_prob that its self.conv(x) will not be "executed" and will be passed on as 0.
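If $x$ is an input tensor of shape $[B, C, H, W]$, then drop_path means that, within a batch of $B$ samples, each sample has probability drop_prob of skipping the branch, so that only the identity mapping of $x$ is passed on for that sample.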
Note that drop_path cannot be applied directly to the main path like this, because without the residual addition a dropped sample would become all zeros:
x = self.drop_path(x)
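The residual usage can be illustrated with a small block. Everything below is a sketch under assumptions: `ResidualBlock`, its channel count and the drop probability are hypothetical, and a compact `DropPath` is written out only so that the example is self-contained.

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Minimal per-sample drop_path, included so the example runs on its own."""
    def __init__(self, drop_prob=0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1 - self.drop_prob
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)   # one decision per sample
        mask = torch.floor(keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device))
        return x / keep_prob * mask

class ResidualBlock(nn.Module):
    """Hypothetical block: identity shortcut plus a conv branch that may be dropped."""
    def __init__(self, channels=3, drop_prob=0.5):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.drop_path = DropPath(drop_prob)

    def forward(self, x):
        return x + self.drop_path(self.conv(x))

torch.manual_seed(0)
block = ResidualBlock()
x = torch.randn(8, 3, 4, 4)

block.train()
y_train = block(x)   # a dropped sample keeps only the shortcut: y_train[i] == x[i]

block.eval()
y_eval = block(x)    # the branch is always active and is not rescaled
```

Because of the shortcut, a dropped sample still carries its input forward; applying `DropPath` without the addition would replace that sample with zeros.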
2.4. Code implementation
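import torch
import torch.nn as nn

def drop_path(x, drop_prob: float = 0., training: bool = False):
    # No-op when nothing is dropped or when the model is not in training mode
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    # One random value per sample, broadcast over all remaining dimensions
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()  # binarize: 1 with probability keep_prob, otherwise 0
    # Divide by keep_prob so the expected value of the output is unchanged
    output = x.div(keep_prob) * random_tensor
    return output

class DropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
    """
    def __init__(self, drop_prob: float = 0.):
        # Default is 0. rather than None: None would fail at 1 - drop_prob in training mode
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)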