CV【3】: drop_out & drop_path


Foreword

This article mainly compares two regularization methods: drop_out and drop_path


1. drop_out

1.1. Reasons

In a machine learning model, if the model has too many parameters and too few training samples, the trained model is prone to overfitting. Overfitting problems are often encountered when training neural networks. Overfitting shows up as follows: the model's loss function on the training data is small and its prediction accuracy is high, but its loss function on the test data is relatively large and its prediction accuracy is relatively low.

Overfitting is a common problem in many machine learning tasks. If the model overfits, the resulting model is hardly usable. To deal with overfitting, model ensembling is often adopted, i.e. several models are trained and combined. At that point the time cost of training becomes a big problem: not only is it time-consuming to train multiple models, it is also time-consuming to test them.

Regularization can effectively alleviate the above two problems

1.2. Concept

drop_out is a regularization method proposed to alleviate overfitting in convolutional neural networks (CNNs), and is also known as random deactivation.

drop_out can be used as a trick when training deep neural networks. Overfitting can be reduced significantly by ignoring a fraction of the feature detectors in each training batch (for example setting half of the hidden nodes to 0). This reduces the interaction between feature detectors (hidden-layer nodes), where "interaction" means that some detectors can only function by relying on other detectors.

Put simply, during forward propagation in the training phase, the activation values of some neurons stop working with a certain probability.

drop_out can indeed effectively alleviate overfitting, but it may slow down model convergence, because only part of the parameters are updated in each iteration, which can make gradient descent slower.

1.3. Working principle

The working principle of drop_out is introduced through a simple three-layer neural network, with input $X$ and output $Y$. The normal training process is to first pass the input $X$ through the network by forward propagation, and then backpropagate the error to decide how to update the parameters.

(Figure: a simple three-layer fully connected neural network)

  • Traverse the nodes of each layer of the neural network and set a node retention probability keep_prob, i.e. each node in that layer is kept with probability keep_prob; keep_prob lies between 0 and 1, and here we assume keep_prob = 0.5
    • By setting a retention probability for the nodes of a layer, the neural network cannot become biased towards any particular node (because that node may be deleted), so no node's weight becomes too large, which reduces overfitting
  • Delete the selected nodes of the neural network, and delete the connections between the network and the removed nodes
    (Figure: the network after the deleted nodes and their connections have been removed)
  • Feed in the input samples and train with the simplified network
    • Let the input $X$ propagate forward through the new network with some neurons deactivated (as shown in the figure on the right above), then compute the loss and backpropagate it. After a mini-batch of samples has gone through this process, update the parameters according to the gradient descent algorithm
  • Repeat this process over and over again:
    • Restore the deactivated neurons
    • Deactivate neurons again with the same probability p (the neurons deactivated this time are not necessarily the same as last time)
    • Let the input propagate forward through the new network with some neurons deactivated, then compute the loss and backpropagate it. After the new batch of samples has gone through this process, update the parameters according to the gradient descent algorithm (one such training step is sketched in the code below)
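
The following is a minimal sketch of one such training step with a hand-rolled inverted drop_out mask; the layer sizes, keep_prob and the loss are all illustrative and not taken from the figure above:

import torch

keep_prob = 0.5
x = torch.randn(16, 100)                            # a mini-batch of input samples
w1 = torch.randn(100, 50, requires_grad=True)       # first layer weights
w2 = torch.randn(50, 1, requires_grad=True)         # second layer weights

h = torch.relu(x @ w1)                              # hidden activations
mask = (torch.rand_like(h) < keep_prob).float()     # keep each node with probability keep_prob
h = h * mask / keep_prob                            # deactivate nodes, rescale to keep the expectation
y_hat = h @ w2                                      # network output

loss = ((y_hat - torch.randn(16, 1)) ** 2).mean()   # dummy regression loss
loss.backward()                                     # backpropagate; only surviving nodes pass gradient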

It should be noted that drop_out is generally only used in the training phase of the network, and not in the testing phase. This is because using drop_out at test time would make the predictions change randomly (since drop_out deactivates nodes at random). Moreover, with the inverted form of drop_out the retained activations are already divided by keep_prob during training so that the expected value of the output stays unchanged, so there is no need to use drop_out in the testing phase

1.4. Scale matching problem

As mentioned above, drop_out is generally only used in the training phase of the network and not in the testing phase. That is to say, during training only the neurons that are not deactivated take part in forward propagation, while during testing all neurons are used. The scale of the data is therefore different between the training and the testing phase.

So at test time, all weight parameters $W$ must be multiplied by $1 - p$ to ensure that the scale stays consistent between training and testing.

For example, assume that without drop_out the value of the first neuron in the first hidden layer can be expressed as:

$Z_1^1 = \displaystyle\sum_{i=1}^{100} \omega_1^i x_1^i$

Without loss of generality, let $\omega_1^i x_1^i = a$ for every $i$; then $Z_1^1 = 100a$

  • When drop_out is used in the training phase with a deactivation rate of $p = 0.3$, this can be understood as only 70 neurons being active, and at this time $Z_1^1 = \displaystyle\sum_{i=1}^{70} \omega_1^i x_1^i = 70a$
  • But drop_out is not used at test time, where all 100 neurons are used, giving $100a$. It is not difficult to see that the test-time value is $30a$ larger than the $70a$ seen with drop_out during training; this is the inconsistency of the data scale between the training and the testing phase

In order to keep the scale consistent, all weight parameters $W$ must be multiplied by $1 - p$, i.e. $Z_1^1 = \displaystyle\sum_{i=1}^{100} (0.7\,\omega_1^i) x_1^i = 70a$, so that the scale of the training phase, which uses drop_out, matches the scale of the test phase, which does not.
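
As a quick numerical check of this argument (not part of the original derivation; random activations stand in for the constant $a$):

import torch

torch.manual_seed(0)
p = 0.3                                     # deactivation rate, as in the example above
w = torch.randn(100)
x = torch.randn(100)

full_sum = (w * x).sum()                    # test time: all 100 terms contribute

# Training time: drop each term with probability p, averaged over many trials
trials = torch.stack([((torch.rand(100) > p).float() * w * x).sum() for _ in range(10000)])

print(full_sum * (1 - p), trials.mean())    # the two values should be close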

Pay attention to this point when implementing the code: calling the train() function puts the model into the training phase, where drop_out works normally, while calling the eval() function puts the model into the testing phase, where drop_out stops working (see the short check below)
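
A short check of this behaviour on a bare nn.Dropout layer (an illustrative snippet, not from the original post):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()                 # training phase: roughly half the values are zeroed,
print(drop(x))               # and the survivors are scaled by 1 / keep_prob = 2

drop.eval()                  # testing phase: dropout becomes an identity mapping
print(drop(x))               # prints the input unchanged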

1.5. Reasons for effectively alleviating overfitting

  • Using drop_out deactivates some nodes, which simplifies the structure of the neural network and therefore acts as regularization
  • The strategy of averaging can usually prevent overfitting effectively. Dropping different hidden neurons with drop_out is similar to training different networks: randomly deleting some hidden neurons produces a different network structure each time, so the whole drop_out procedure is equivalent to averaging over many different neural networks
  • Using drop_out randomly deactivates the nodes of the neural network, so during training the network never puts too much weight on any single node and its nodes do not come to depend on any particular input feature

1.6. Code implementation

There are two concrete ways to implement it, nn.Dropout and nn.functional.dropout:

import torch.nn as nn
import torch.nn.functional as F

class Dropout1(nn.Module):
    def __init__(self):
        super(Dropout1, self).__init__()
        self.fc = nn.Linear(100, 20)

    def forward(self, input):
        out = self.fc(input)
        # training=self.training must be passed explicitly here
        out = F.dropout(out, p=0.5, training=self.training)
        return out

# Calling F.dropout(out, p=0.5) without training=self.training is not tied to the model's
# state in any way: F.dropout is just a stateless function, so switching the model between
# train() and eval() does not change its behaviour. Passing self.training links the
# functional call to the module's current mode.
Net = Dropout1()
Net.train()

# Alternatively, use nn.Dropout() directly (nn.Dropout() is essentially a wrapper around
# F.dropout that passes in self.training automatically, so there is no essential difference)
class Dropout2(nn.Module):
    def __init__(self):
        super(Dropout2, self).__init__()
        self.fc = nn.Linear(100, 20)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, input):
        out = self.fc(input)
        out = self.dropout(out)
        return out

Net = Dropout2()
Net.train()
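
As a usage note (an illustrative addition), the same Net can then be switched between the two modes:

import torch

x = torch.randn(4, 100)

Net.train()                              # drop_out is active: repeated calls give different outputs
out_train = Net(x)

Net.eval()                               # drop_out is disabled: the forward pass is deterministic
print(torch.equal(Net(x), Net(x)))       # True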

2. drop_path

2.1. Differences from drop_out

drop_path randomly "deactivates" whole branches of a multi-branch structure in a deep learning model, whereas drop_out randomly "deactivates" individual neurons. In other words, drop_out closes random point-to-point connections, while drop_path closes a whole path at once.

Suppose there is a Linear layer with 4 input nodes and 5 output nodes, giving 20 point-to-point paths in total. drop_out randomly closes individual nodes and hence the paths that run through them, whereas drop_path closes the entire layer for a given sample, i.e. all 20 paths at once (compare the two masks in the sketch below)
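
A small sketch of the difference in mask shapes (illustrative shapes; a batch of feature maps is used instead of the Linear example so that it matches the drop_path code further below):

import torch

torch.manual_seed(0)
B, C, H, W = 4, 8, 16, 16                     # an illustrative batch of feature maps
x = torch.randn(B, C, H, W)
p = 0.5

# drop_out: an element-wise mask, every single activation can be zeroed independently
elem_mask = (torch.rand(B, C, H, W) > p).float()

# drop_path: a per-sample mask, each sample's whole branch output is either kept or zeroed
path_mask = (torch.rand(B, 1, 1, 1) > p).float()

print(elem_mask.mean().item())                # roughly 0.5 of the individual activations survive
print(path_mask.view(-1))                     # each sample is either fully kept (1.) or fully dropped (0.)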

2.2. Working principle

The mathematical principle of drop_path is in fact similar to that of drop_out: draw a random value rand uniformly from $(0, 1)$ and, given a drop_rate, add keep_prob = 1 - drop_rate to it

  • Taking the floor of keep_prob + rand then yields a value that follows a 0-1 distribution: it is 1 with probability keep_prob and 0 with probability drop_rate. Multiplying a point by this binary value closes a drop_rate fraction of the nodes
  • However, during propagation the total number of points does not change (the closed nodes are still counted), so the mean of the input data, $u = \mathrm{sum}(x)/N$, shrinks

Suppose the original data is $X$, with $N$ nodes and mean $u$

  • After a drop operation with ratio $r$, a total of $n = N \cdot r$ nodes are set to 0
  • So the new mean is $u' = (N - n) \cdot u / N = (1 - r)\,u$; obviously the mean changes, and the data distribution and the gradients change with it
  • In order to keep the data consistent, the mean has to be pulled back, i.e. divided by $(N - n)/N = 1 - r$ (the keep_prob); this is exactly the x.div(keep_prob) in the code below
  • Note that drop_path adjusts the original branch output as a whole: the actual drop is carried out by the multiplication with the 0-1 mask (a quick check of this rescaling follows below)
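
A quick check of this rescaling, assuming drop_rate = 0.3 (an illustrative value):

import torch

torch.manual_seed(0)
drop_rate = 0.3
keep_prob = 1 - drop_rate

x = torch.randn(100000) + 5.0                        # branch outputs with a clearly non-zero mean
mask = torch.floor(keep_prob + torch.rand_like(x))   # 1 with probability keep_prob, otherwise 0

dropped = x * mask                                   # mean shrinks to roughly keep_prob * x.mean()
rescaled = x.div(keep_prob) * mask                   # dividing by keep_prob pulls the mean back

print(x.mean().item(), dropped.mean().item(), rescaled.mean().item())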

2.3. Application in the network

Suppose you have the following code in the forward pass:

x = x + self.drop_path( self.conv(x) )

Then on the drop_path branch, each sample in the batch has probability drop_prob that its self.conv(x) output is not "executed", i.e. it is passed on directly as 0.

If $x$ is an input tensor of shape $[B, C, H, W]$, drop_path means that within this batch of $B$ samples, each sample's branch output is zeroed independently with probability drop_prob, so that sample effectively keeps only the identity term $x$.

It should be noted that drop_path cannot be used directly on the main tensor like this, because without a parallel identity branch the dropped samples would lose all of their information:

x = self.drop_path(x)
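
A minimal residual-style block showing where drop_path normally sits (the conv layer and the drop probability are illustrative; DropPath is the module defined in the next section):

import torch
import torch.nn as nn

class ToyResidualBlock(nn.Module):
    """Illustrative block: drop_path is applied to the residual branch only, never to the identity shortcut."""
    def __init__(self, channels, drop_prob=0.1):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.drop_path = DropPath(drop_prob)      # defined in the code of section 2.4

    def forward(self, x):
        return x + self.drop_path(self.conv(x))

# block = ToyResidualBlock(64)
# y = block(torch.randn(8, 64, 32, 32))           # during training, some samples skip the conv branch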

2.4. Code implementation

import torch
import torch.nn as nn


def drop_path(x, drop_prob: float = 0., training: bool = False):
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()  # binarize
    output = x.div(keep_prob) * random_tensor
    return output


class DropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
    """
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)
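
A quick check of the per-sample behaviour (illustrative values; the exact pattern depends on the random seed):

import torch

torch.manual_seed(0)
dp = DropPath(drop_prob=0.5)
dp.train()

x = torch.ones(4, 2, 2)             # a batch of 4 samples
print(dp(x).view(4, -1))            # dropped samples are all 0, kept samples are scaled to 1 / keep_prob = 2

dp.eval()
print(torch.equal(dp(x), x))        # True: drop_path is the identity at test time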
