From the perspective of optimization itself, batch size and learning rate (together with the learning-rate decay strategy) are the training parameters that most affect how well a deep model converges.
The learning rate directly affects the convergence state of the model, while the batch size mainly affects its generalization performance. The two are coupled, roughly as numerator and denominator of the SGD noise scale, and can also influence each other.
Article Directory
- 1 The effect of batch size on training results (same number of epochs)
- Comparative results
- 1. Alexnet 2080s train_batchsize=32, val_batchsize=64, lr=0.01 GHIM-yousan
- 2. Alexnet 2080 train_batchsize=64, val_batchsize=64, lr=0.01 GHIM-me 14k iterations
- 3. Alexnet 2080s train_batchsize=64, val_batchsize=64, lr=0.02 GHIM-yousan
- 4 Squeezenet 2080s train_bs=64 val_bs=64 lr 0.01 GHIM-yousan
- 4- Repeat experiment: Squeezenet 2080s train_bs=64 val_bs=64 lr 0.01 GHIM-me, the result is still overfitting, 87% acc
- 5 ==mobilenetv1 2080s t-bs: 64, v-bs: 64 lr: 0.01, GHIM-me== overfitting
- 5 ==mobilenetv2 2080s t-bs: 64, v-bs: 64 lr: 0.01, GHIM-me== overfitting
- 6 mobilenetv1 2080 t-bs: 64 v-bs: 64 lr: 0.01 GHIM-me fitted
- 6 mobilenetv2 2080 t-bs: 64 v-bs: 64 lr: 0.01 GHIM-me fitted
- 10 What predecessors say about batch size
- 1 **Within a certain range, generally speaking, the larger the Batch_Size, the more accurate the descent direction it determines, and the smaller the training oscillation.**
- 2 ==The performance drop with a large batchsize comes from training not being long enough; it is not essentially a problem of the batchsize itself. With the same number of epochs there are fewer parameter updates, so more iterations are required.==
- 3 ==A large batchsize converges to sharp minima, while a small batchsize converges to flat minima, which generalize better.==
- 4 When batchsize increases, the learning rate should increase accordingly
- 5 Increasing the batchsize is equivalent to decaying the learning rate
- Conclusions
- 1 If the learning rate is increased, the batch size had better increase too, so that convergence is more stable.
- 2 Try to use a large learning rate: many studies show that a larger learning rate helps generalization. If you really need decay, try other methods first, such as increasing the batch size; the learning rate has a large impact on convergence, so adjust it carefully.
- 3 The drawback of BN is that the batch size cannot be too small, otherwise the mean and variance estimates are biased. So in practice use as large a batch as GPU memory allows. Besides, when a model is actually deployed, the data distribution and preprocessing matter far more: if the data is bad, no amount of tricks will help.
1 The effect of batch size on training results (same number of epochs)
Here we use GHIM-20: 20 categories with 500 images each, 10,000 images in total (9,000 train, 1,000 val),
trained for a total of 100 epochs (equivalent to traversing the 9,000 training images 100 times), with accuracy evaluated on the 1,000-image val_list.
AlexNet is used here, with train_batchsize = 32 and train_batchsize = 64 respectively.
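The iteration counts quoted in the experiments below follow directly from this setup. As a quick sanity check in plain Python (assuming the last partial batch of each epoch is run, hence the ceiling):

```python
import math

def iters_per_epoch(num_train, batch_size):
    """Weight updates in one pass over the training set."""
    return math.ceil(num_train / batch_size)

def total_iters(num_train, batch_size, epochs):
    """Total weight updates over the whole run."""
    return iters_per_epoch(num_train, batch_size) * epochs

# 9000 training images, 100 epochs, as in the experiments below
print(total_iters(9000, 32, 100))  # 28200 -> the "nearly 28k iterations" at batchsize 32
print(total_iters(9000, 64, 100))  # 14100 -> the "14k iterations" at batchsize 64
```

Halving the updates per epoch is exactly why the batchsize-64 runs finish in roughly half the iterations for the same 100 epochs.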
Comparative Results
1. Alexnet 2080s train_batchsize=32, val_batchsize=64, lr=0.01 GHIM-yousan
- Training on the 2080s took 4.68 h, nearly 28k iterations (== 100 epochs × 9000/32); each train batch records a loss (dark blue)
- each val batch records a val loss (light blue)
Question: has the loss converged?
The val loss is always greater than the train loss, which shows this is an overfitting problem.
I checked some answers:
1. Theoretically it does not converge, i.e. there is a problem with the network you designed. This is also the first factor to check: whether the gradient exists, that is, whether backpropagation is broken;
2. Theoretically it converges, but:
- The learning rate is set unreasonably (most cases): too large causes non-convergence, too small makes convergence very slow;
- The batchsize is too large, so training falls into a local optimum and cannot continue to converge;
- Insufficient network capacity: the loss of a shallow network on a complex task stops decreasing because the design is too simple. In general, the more layers and nodes a network has, the stronger its fitting ability; with too few layers and nodes it cannot fit complex cases, which also causes non-convergence.
The base of the learning-rate step decay is not too large, and batchsize = 32 is not large either, so it has converged?? (After all, loss = 0.0008 is quite small.)
The acc curve is as follows: from the plot, the model is basically fitted after 15k iterations (i.e., epoch = 15000 × 32 / 9000 ≈ 53). What is interesting is that this is exactly when the learning rate drops from 0.01 to 0.001.
Question: why does the val acc stall at 84%?
Good performance on the training set with a gap on the test set is overfitting, indicating the learned features are still not general enough.
At which layer are the learned features not good enough?
Overfitting solutions
1. Causes: overfitting arises when the training set does not match the model:
- the training set is too small for the model's complexity;
- the feature distributions of the training set and test set are inconsistent;
- there is noisy data in the samples ...
2. Solutions
(a simpler model structure, data augmentation, regularization, dropout, early stopping, ensembling, re-cleaning the data)
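Of these remedies, early stopping is the one most directly suggested by the curves above (val loss flattening while train loss keeps dropping). A minimal framework-free sketch of the usual patience rule; the function name and the patience default are my own illustrative choices:

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the epoch at which training should stop: the first epoch
    after which the val loss has failed to improve for `patience`
    consecutive epochs (the weights you keep are from the best epoch)."""
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# val loss bottoms out at epoch 2, then creeps up -> stop at epoch 7
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76]
print(early_stopping_epoch(losses))  # 7
```

In a real training loop you would checkpoint the model at each new best val loss and restore that checkpoint when the rule fires.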
2. Alexnet 2080 train_batchsize=64, val_batchsize=64, lr=0.01 GHIM-me 14k iterations
With train_batchsize = 64 on the 2080, the model is basically fitted by the 4.5k-th iteration (4500 × 64 / 9000 = 32 epochs), and both train and val results are better; the fit is good.
3. Alexnet 2080s train_batchsize=64, val_batchsize=64, lr=0.02 GHIM-yousan
2080s: is this non-convergence or overfitting???
4 Squeezenet 2080s train_bs=64 val_bs=64 lr 0.01 GHIM-yousan
4- Repeat experiment: Squeezenet 2080s train_bs=64 val_bs=64 lr 0.01 GHIM-me, the result is still overfitting, 87% acc
5 mobilenetv1 2080s t-bs: 64, v-bs: 64 lr: 0.01, GHIM-me, overfitting
5 mobilenetv2 2080s t-bs: 64, v-bs: 64 lr: 0.01, GHIM-me, overfitting
6 mobilenetv1 2080 t-bs: 64 v-bs: 64 lr: 0.01 GHIM-me fitted
6 mobilenetv2 2080 t-bs: 64 v-bs: 64 lr: 0.01 GHIM-me fitted
10 What predecessors say about batch size
1 Within a certain range, generally speaking, the larger the Batch_Size, the more accurate the descent direction it determines, and the smaller the training oscillation.
- As Batch_Size increases, more epochs are needed to reach the same accuracy.
- As Batch_Size increases, the same amount of data is processed faster.
- Because these two factors conflict, there is some Batch_Size at which the time to reach a given accuracy is optimal.
- Since the final convergence can fall into different local minima, there is also some Batch_Size at which the final convergence accuracy is optimal.
2 The performance drop with a large batchsize is because training is not long enough; it is not essentially a problem of the batchsize itself. With the same number of epochs there are fewer parameter updates, so more iterations are required.
In the reported experiments, the error rate starts rising once the batchsize exceeds 8k.
3 A large batchsize converges to sharp minima, while a small batchsize converges to flat minima, which generalize better.
The difference between the two lies in the convergence trend, one fast and one slow, as shown above; the main cause of this phenomenon is that the gradient noise induced by a small batchsize helps training escape sharp minima.
4 When batchsize increases, the learning rate should increase accordingly
Usually, when we increase the batchsize to N times the original, then to keep the weight update after the same number of samples unchanged, the linear scaling rule says the learning rate should be increased to N times the original [5]. If instead the variance of the weight update is to be kept constant, the learning rate should be scaled by sqrt(N) [7]. Both strategies have been studied; the former is used more often.
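The two scaling rules can be written down in a few lines (a sketch; the function name is my own):

```python
import math

def scaled_lr(base_lr, base_bs, new_bs, rule="linear"):
    """Scale the learning rate when the batch size changes.
    'linear': lr * N       (keeps the weight update per sample equal)
    'sqrt':   lr * sqrt(N) (keeps the update variance constant)
    where N = new_bs / base_bs."""
    n = new_bs / base_bs
    if rule == "linear":
        return base_lr * n
    if rule == "sqrt":
        return base_lr * math.sqrt(n)
    raise ValueError(f"unknown rule: {rule}")

# Going from batchsize 32 at lr 0.01 (experiment 1) to batchsize 64:
print(scaled_lr(0.01, 32, 64))          # 0.02
print(scaled_lr(0.01, 32, 64, "sqrt"))  # ~0.0141
```

Incidentally, experiment 3 above (batchsize 64, lr 0.02) uses exactly the numbers the linear rule would give starting from experiment 1 (batchsize 32, lr 0.01).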
5 Increasing the batchsize is equivalent to decaying the learning rate
In fact, from the SGD weight-update formula it can be seen that the two are indeed equivalent, and this is verified with extensive experiments in the paper.
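A toy schedule makes the equivalence concrete: at each milestone you can either divide the learning rate by some factor or multiply the batch size by it, and the SGD noise scale (roughly lr / batchsize) comes out the same either way. The milestones and factor below are illustrative choices, not values from this post:

```python
def schedule(epoch, base_lr=0.01, base_bs=64, milestones=(30, 60), factor=5,
             mode="increase_bs"):
    """At each milestone, either divide lr by `factor` (classic step decay)
    or multiply the batch size by `factor` (the equivalent alternative).
    Returns (lr, batch_size) for the given epoch."""
    k = sum(epoch >= m for m in milestones)  # milestones passed so far
    if mode == "decay_lr":
        return base_lr / factor**k, base_bs
    return base_lr, base_bs * factor**k

lr1, bs1 = schedule(45, mode="decay_lr")     # (0.002, 64)
lr2, bs2 = schedule(45, mode="increase_bs")  # (0.01, 320)
print(lr1 / bs1, lr2 / bs2)  # same noise scale, ~3.1e-05 for both
```

The practical appeal of the batch-size branch is fewer, larger steps late in training, i.e. the same trajectory with less wall-clock time, memory permitting.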
Conclusions
1 If the learning rate is increased, the batch size had better increase too, so that convergence is more stable.
2 Try to use a large learning rate: many studies show that a larger learning rate helps generalization. If you really need decay, try other methods first, such as increasing the batch size; the learning rate has a large impact on convergence, so adjust it carefully.
3 The drawback of BN is that the batch size cannot be too small, otherwise the mean and variance estimates are biased. So in practice use as large a batch as GPU memory allows. Besides, when a model is actually deployed, the data distribution and preprocessing matter far more: if the data is bad, no amount of tricks will help.
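The point about noisy BN statistics is easy to demonstrate without any deep-learning framework: estimate the mean from mini-batches of a fixed synthetic activation stream and compare how much the estimate fluctuates for a tiny batch versus a comfortable one (batch sizes 4 and 64 are my choices for illustration):

```python
import random
import statistics

random.seed(0)
# Synthetic "activations": 6400 samples from a unit Gaussian.
data = [random.gauss(0.0, 1.0) for _ in range(6400)]

def batch_means(data, bs):
    """Per-batch mean estimates, as BatchNorm would compute them."""
    return [statistics.fmean(data[i:i + bs]) for i in range(0, len(data), bs)]

spread_small = statistics.pstdev(batch_means(data, 4))   # 1600 noisy estimates
spread_large = statistics.pstdev(batch_means(data, 64))  # 100 stable estimates
print(spread_small, spread_large)  # the batch-4 estimates fluctuate far more
```

The batch-4 estimates scatter roughly 4x as widely, since the stddev of a batch mean scales as 1/sqrt(batchsize); that fluctuation is exactly what hurts BN at small batch sizes.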
Reference reading
https://zhuanlan.zhihu.com/p/29247151
https://zhuanlan.zhihu.com/p/64864995