Possible scenarios for deep learning convergence problems

1. A training set that is too small does not by itself cause non-convergence: as long as you keep training, the loss on the training set will eventually go down (barring sheer bad luck). On the contrary, failure to converge is usually because the samples carry too much information for the network to fit the whole sample space. Too few samples mainly bring overfitting instead. So first check: does the loss on your training set converge? If only the validation set fails to converge, that is overfitting, and it is time for the usual anti-overfitting tricks: dropout, SGD with momentum, more minibatches, fewer nodes in the fc layer, finetuning, and so on (the first sketch after this list shows a couple of them).
2. If the learning rate is set too large, the loss will "fly" (suddenly become very large). This is the most common beginner problem: why does the network look like it is converging and then suddenly blow up? The most likely cause is using ReLU as the activation function together with softmax, or another loss containing exp, in the classification layer. When some training pass reaches the last layer and a node is over-activated (say to 100), exp(100) = Inf in single precision, the value overflows, every weight reached by backprop becomes NaN, and the weights then stay NaN, so the loss takes off (the overflow is sketched in code after this list). To simulate this, I reproduced a failed experiment of mine from a year ago and captured the loss curve of the whole run:
In the figure, red is the loss and green is the accuracy. You can see it flew once around iteration 2300; fortunately the lr was not set very large, so it was pulled back. If the lr is too high, it flies away and never comes back, and if you stop at that point and inspect the weights of any layer, they are very likely all NaN. In that case, try a coarse binary search over the learning rate between 0.1 and 0.0001 (see the sweep sketch after this list); different models and tasks have different optimal lr.
3. Collect as much data as possible. One approach is to crawl Flickr for celebrity tags and then do a little manual culling to get a good sample set. In sample collection, what matters is not the count but the hardness: 10 pictures of a person with different expressions are worth more than 40 pictures of the same person with essentially the same pose and expression. In an earlier experiment, a model trained on 50 high-variance pictures per person came out half a point better than one trained on more than 300 near-duplicate pictures per person.
4. Try a smaller model. If there is too little data, reduce model complexity: consider fewer layers or fewer kernels per layer (see the last sketch below).
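
A minimal sketch of some of the anti-overfitting tricks from item 1: dropout, a smaller fc layer, and SGD with momentum plus weight decay. PyTorch is an assumption here (the original names no framework), and the layer sizes are purely illustrative.

```python
import torch
import torch.nn as nn

class SmallClassifier(nn.Module):
    def __init__(self, in_features=512, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 128),   # fewer nodes in the fc layer
            nn.ReLU(),
            nn.Dropout(p=0.5),             # dropout between fc layers
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.net(x)

model = SmallClassifier()
# SGD with momentum and weight decay (L2 regularization)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
```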
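To make the exp-overflow mechanism from item 2 concrete, here is a small NumPy sketch in single precision (the precision most deep learning frameworks train in): a single over-activated logit of 100 makes a naive softmax produce NaN, while the standard max-subtraction trick keeps it finite. This is also why cross-entropy losses in common frameworks fold in a log-softmax for numerical stability. The logit values are illustrative.

```python
import numpy as np

# single precision, as in most DL frameworks; exp overflows above ~88.7
logits = np.array([100.0, 1.0, -2.0], dtype=np.float32)

def naive_softmax(z):
    e = np.exp(z)        # exp(100) -> inf in float32
    return e / e.sum()   # inf / inf -> nan, which then poisons backprop

def stable_softmax(z):
    z = z - z.max()      # shift so the largest logit is 0
    e = np.exp(z)
    return e / e.sum()

print(naive_softmax(logits))   # [nan, 0., 0.] plus an overflow warning
print(stable_softmax(logits))  # [~1.0, ~0.0, ~0.0], all finite
```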
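The coarse learning-rate search between 0.1 and 0.0001 suggested in item 2 can be done as a simple log-scale sweep. The sketch below is only an illustration: `train_for_a_while` is a hypothetical stand-in for "run a few thousand iterations and return the final training loss (NaN/Inf if it flew)", and the NaN check assumes a PyTorch model.

```python
import math

import torch


def weights_have_nan(model):
    """The symptom described above: after the loss flies and never comes back,
    the layer weights are typically all NaN (assumes a PyTorch module)."""
    return any(torch.isnan(p).any().item() for p in model.parameters())


def pick_lr(train_for_a_while,
            lrs=(0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001)):
    """Coarse log-scale search: keep the largest rate whose short run stays finite."""
    for lr in lrs:                    # from aggressive to conservative
        loss = train_for_a_while(lr)  # hypothetical short training run
        if math.isfinite(loss):       # reject runs whose loss became NaN/Inf
            return lr
    return None
```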
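For item 4, a sketch (again assuming PyTorch, with illustrative channel counts) of shrinking a conv net for a small dataset by dropping one block and halving the number of kernels per layer.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

# larger model: three conv blocks with 64/128/256 kernels
big = nn.Sequential(conv_block(3, 64), conv_block(64, 128), conv_block(128, 256))

# reduced model: one block fewer, half the kernels in each layer
small = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
```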
========================
When I saw this question, I happened to be training a 10,000-class face identification network. With a bit of luck it seems to have converged; here is the accuracy curve (I started training GoogLeNet from random initialization and did a finetune during the 100k iterations of training).
