drop path vs. Dropout

Dropout

tutorial

Understanding the definition

torch.nn.Dropout(p=0.5, inplace=False)

Dropout and p

The following is from the official PyTorch documentation:

During training, randomly zeroes some of the elements of the input tensor with probability p using samples from a Bernoulli distribution. Each channel will be zeroed out independently on every forward call.


In short: each element of the input tensor becomes 0 with probability p, and the remaining elements are divided by (1-p), i.e. scaled up so that the overall expected value stays the same.
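A quick sanity check of this scaling (a minimal illustrative snippet):

import torch
import torch.nn as nn

torch.manual_seed(0)
m = nn.Dropout(p=0.5)
x = torch.ones(1_000_000)
y = m(x)
print(y.unique())  # tensor([0., 2.]) -- survivors are scaled by 1/(1-p) = 2
print(y.mean())    # close to 1.0, so the overall average is preserved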

Notice:

  • The p in PyTorch is the probability that an input element is zeroed, i.e. the fraction of inputs that are dropped; the larger p is, the more elements fail.
  • Dropout acts on the neurons, i.e. the input tensor data, not on the model's parameters.

Principle (easier to follow together with the examples below):

Dropout generates a mask tensor of exactly the same size as the input, filled with 0 and 1/(1-p), and multiplies it with the input element by element. Some values become 0, which is equivalent to invalidating those neurons so they no longer contribute. For example, suppose three same-colored blocks are each weighted by the kernel, biased, and summed to produce a new block. If one of the three is invalidated (say dropout with p set to 0.33) by multiplying it by 0, while the remaining two are multiplied by 1/(1-p) and then summed, the result stays in the same range as it would in ordinary training.
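The mask described above can be sketched by hand roughly as follows (an illustrative snippet, not PyTorch's actual implementation):

import torch

p = 0.33
x = torch.randn(4, 3)
# mask entries are either 0 (dropped) or 1/(1-p) (kept and rescaled)
mask = (torch.rand_like(x) > p).float() / (1 - p)
out = x * mask  # element-wise multiplication invalidates some values and scales the rest
print(out)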

Originally three features were combined to compute one output block; now some features are randomly invalidated, which prevents the model from relying too heavily on particular features, reduces the risk of over-fitting, and improves generalization.

"Invalidating some neurons" is the more vivid way to describe dropout. Note that it is not the kernel that is disabled; the "neurons" here are the values of the input tensor.

Note also that these zeroed neurons stay invalid during the subsequent back-propagation that updates the parameters; they do not take part in the parameter update.
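This can be seen from the gradients: where the output was zeroed, the gradient flowing back to the input is also 0 (a small illustrative check):

import torch
import torch.nn as nn

x = torch.randn(5, 3, requires_grad=True)
m = nn.Dropout(p=0.4)
y = m(x)
y.sum().backward()
print(x.grad)  # 0 where the element was dropped, 1/(1-p) where it was kept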

In addition, keep in mind that the evidence that dropout reduces over-fitting comes mainly from practice; in my view the theoretical explanations are after-the-fact guesses and not entirely convincing.

example

Create a new Dropout layer:

import torch
import torch.nn as nn

m = nn.Dropout(p=0.4)        # each element is zeroed with probability 0.4
input = torch.randn(5, 3)
print(input)
output = m(input)            # surviving elements are scaled by 1/(1 - 0.4)
print(output)

The first print call shows the input tensor, the second shows the output after dropout.

It can be seen that 7 of the 15 numbers have become 0, a fraction close to p = 0.4.

It can also be seen that the surviving values are scaled: the 1.46 in row 2, column 1 is divided by (1 - 0.4) to give 2.43.
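A quick arithmetic check of that scaling:

print(1.46 / (1 - 0.4))  # ~2.433, matching the value in the output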

drop path

# constructor signature of timm's DropPath module
def __init__(self, drop_prob: float = 0., scale_by_keep: bool = True):

Referenced from

"Analysis" regularized DropPath_timm droppath_ViatorSun's blog - CSDN blog ,

effect:

In short: with probability drop_prob, an entire sample (one row of data) is dropped outright, and all of its values become 0.

In a branch wrapped by drop_path, for each batch a fraction drop_prob of the samples are not "executed" through the branch; their branch output is replaced by 0 and passed on directly.
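In practice, drop_path is typically wrapped around a residual branch; a minimal sketch (the `block` sub-module here is hypothetical, and DropPath is taken from timm as referenced above):

import torch.nn as nn
from timm.models.layers import DropPath

class ResidualBlock(nn.Module):
    def __init__(self, block: nn.Module, drop_prob: float = 0.1):
        super().__init__()
        self.block = block              # hypothetical sub-module, e.g. attention or MLP
        self.drop_path = DropPath(drop_prob)

    def forward(self, x):
        # for a dropped sample the branch contributes 0, leaving only the skip connection
        return x + self.drop_path(self.block(x))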

If x is the input tensor with shape [B, C, H, W], drop_path means that within one batch of B samples, roughly B*drop_prob randomly chosen samples are set to 0 entirely.

The parameter scale_by_keep controls whether the surviving samples are divided by (1 - drop_prob) at the same time, so that the expected value stays the same.
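Putting it together, the core of drop_path looks roughly like this (a sketch along the lines of the timm implementation referenced above):

import torch

def drop_path(x, drop_prob: float = 0., training: bool = False, scale_by_keep: bool = True):
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    # one Bernoulli draw per sample: mask shape is (B, 1, 1, ..., 1)
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    random_tensor = x.new_empty(shape).bernoulli_(keep_prob)
    if keep_prob > 0. and scale_by_keep:
        random_tensor.div_(keep_prob)  # scale survivors so the expected value is unchanged
    return x * random_tensor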

example

import torch
from timm.models.layers import DropPath

drop_path = DropPath(0.3)                               # drop each sample with probability 0.3
x = torch.arange(15, dtype=torch.float).reshape(5, 3)   # 5 samples of 3 values each
print(x)
output = drop_path(x)
print(output)

 Original tensor

tensor([[ 0.,  1.,  2.],
        [ 3.,  4.,  5.],
        [ 6.,  7.,  8.],
        [ 9., 10., 11.],
        [12., 13., 14.]], device='cuda:0')

 The tensor output by the drop path

tensor([[ 0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000],
        [ 8.5714, 10.0000, 11.4286],
        [12.8571, 14.2857, 15.7143],
        [17.1429, 18.5714, 20.0000]], device='cuda:0')

The first two rows become 0, and the last three rows are divided by (1 - drop_prob), i.e. by 0.7.

The difference between drop path and Dropout

The basic unit that Dropout sets to 0 is one neuron; a tensor of shape [B, C, W, H] contains B*C*W*H such units.

The basic unit that drop_path sets to 0 is one sample (one piece of data); in [B, C, W, H] there are B such units, and when a sample is dropped, all of its [C, W, H] neurons become 0 together.
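The difference is easiest to see in the shape of the random mask (an illustrative snippet):

import torch

B, C, H, W = 4, 3, 8, 8
p = 0.5
x = torch.randn(B, C, H, W)

# Dropout: one Bernoulli draw per element -> mask of shape [B, C, H, W]
dropout_mask = (torch.rand(B, C, H, W) > p).float() / (1 - p)

# drop_path: one Bernoulli draw per sample -> mask of shape [B, 1, 1, 1], broadcast over C, H, W
droppath_mask = (torch.rand(B, 1, 1, 1) > p).float() / (1 - p)

print((x * dropout_mask).shape, (x * droppath_mask).shape)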


Origin: blog.csdn.net/zxyOVO/article/details/130046398