CNN Hyperparameter Tuning and Reference Material

Q: When first deciding on the network architecture (for example, how many convolutional layers and how many pooling layers), how should this be determined? When I pick up a problem and want to solve it with a CNN, I have no idea where to start.

A: If you just need something that works, there is an almost "universal" recipe: for small-scale problems, copy the LeNet architecture; for larger-scale problems, copy the VGG architecture, and then gradually improve from there.
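As a concrete starting point, here is a minimal LeNet-style network sketched in PyTorch. It assumes 32x32 single-channel inputs and 10 classes; the layer sizes follow the classic LeNet layout and are meant as a template to modify, not something prescribed by the original answer.

```python
# A minimal LeNet-style starting point (PyTorch); layer sizes are illustrative.
import torch
import torch.nn as nn

class LeNetLike(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNetLike()
print(model(torch.randn(2, 1, 32, 32)).shape)  # torch.Size([2, 10])
```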




A Deep Learning Overview and Practical Advice from Ilya Sutskever, Google Scientist and Student of Hinton
===================

Here is a summary of the community’s knowledge of what’s important and what to look after:
Get the data: Make sure that you have a high-quality dataset of input-output examples that is large, representative, and has relatively clean labels. Learning is completely impossible without such a dataset.

Data: make sure you have high-quality data that is large, representative, and has relatively clean labels. Without such data it is impossible to learn a good model.

Preprocessing: it is essential to center the data so that its mean is zero and so that the variance of each of its dimensions is one. Sometimes, when the input dimension varies by orders of magnitude, it is better to take the log(1 + x) of that dimension. Basically, it’s important to find a faithful encoding of the input with zero mean and sensibly bounded dimensions. Doing so makes learning work much better. This is the case because the weights are updated by the formula: Δw_{ij} ∝ x_i · ∂L/∂y_j (w denotes the weights from layer x to layer y, and L is the loss function). If the average value of the x’s is large (say, 100), then the weight updates will be very large and correlated, which makes learning bad and slow. Keeping things zero-mean and with small variance simply makes everything work much better.

Preprocessing: it is very important that every dimension of the data has zero mean and unit variance.
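A small NumPy sketch of the preprocessing above: center each input dimension to zero mean and unit variance, and optionally compress dimensions that span orders of magnitude with log(1 + x). The function name and the choice of which dimensions to log-transform are illustrative.

```python
import numpy as np

def standardize(X, log_dims=()):
    """X: (num_examples, num_dims). log_dims: indices of non-negative
    dimensions that vary over orders of magnitude."""
    X = X.astype(np.float64)
    for d in log_dims:
        X[:, d] = np.log1p(X[:, d])      # log(1 + x) for wide-ranging dims
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-8           # avoid division by zero
    return (X - mean) / std, mean, std   # keep mean/std to reuse at test time

X = np.random.rand(1000, 3) * np.array([1.0, 100.0, 1e6])
X_std, mu, sigma = standardize(X, log_dims=(2,))
print(X_std.mean(axis=0).round(3), X_std.std(axis=0).round(3))
```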

Minibatches: Use minibatches. Modern computers cannot be efficient if you process one training case at a time. It is vastly more efficient to train the network on minibatches of 128 examples, because doing so will result in massively greater throughput. It would actually be nice to use minibatches of size 1, and they would probably result in improved performance and lower overfitting; but the benefit of doing so is outweighed by the massive computational gains provided by minibatches. But don’t use very large minibatches because they tend to work less well and overfit more. So the practical recommendation is: use the smallest minibatch that runs efficiently on your machine.

Minibatches: moderately sized minibatches are the most effective. Training on minibatches of 128 examples works well because of the much higher throughput. A minibatch of size 1 might slightly improve performance and reduce overfitting, but the computational gains of larger minibatches outweigh that benefit. Do not use very large minibatches, though: they tend to work less well and overfit more.

Gradient normalization: Divide the gradient by minibatch size. This is a good idea because of the following pleasant property: you won’t need to change the learning rate (not too much, anyway), if you double the minibatch size (or halve it).

Gradient normalization: divide the gradient by the minibatch size, so that doubling (or halving) the minibatch size does not require changing the learning rate much.
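A sketch of this convention in NumPy: sum the per-example gradients and divide by the minibatch size, so the update scale stays roughly the same when the batch size is doubled or halved. The array shapes here are made up for illustration.

```python
import numpy as np

def minibatch_gradient(per_example_grads):
    """per_example_grads: (batch_size, num_params) array of gradients."""
    batch_size = per_example_grads.shape[0]
    return per_example_grads.sum(axis=0) / batch_size

grads_128 = np.random.randn(128, 10)
grads_256 = np.vstack([grads_128, np.random.randn(128, 10)])
# Both averaged gradients live on the same scale, so the learning rate
# does not need to change (much) with the batch size.
print(np.linalg.norm(minibatch_gradient(grads_128)),
      np.linalg.norm(minibatch_gradient(grads_256)))
```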

Learning rate schedule: Start with a normal-sized learning rate (LR) and reduce it towards the end.
A typical value of the LR is 0.1. Amazingly, 0.1 is a good value of the learning rate for a large number of neural network problems. Learning rates frequently tend to be smaller but rarely much larger.
Use a validation set (a subset of the training set on which we don’t train) to decide when to lower the learning rate and when to stop training (e.g., when error on the validation set starts to increase).

A practical suggestion for a learning rate schedule: if you see that you stopped making progress on the validation set, divide the LR by 2 (or by 5), and keep going. Eventually, the LR will become very small, at which point you will stop your training. Doing so helps ensure that you won’t be (over-)fitting the training data to the detriment of validation performance, which happens easily and often. Also, lowering the LR is important, and the above recipe provides a useful approach to controlling it via the validation set.
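Here is a minimal sketch of that schedule. `train_one_epoch` and `validation_error` are placeholder callables standing in for whatever training loop is in use; the patience and minimum LR are illustrative choices, not values from the text.

```python
def lr_schedule(train_one_epoch, validation_error,
                lr=0.1, factor=2.0, min_lr=1e-5, patience=3):
    """Start at LR 0.1, divide the LR when the validation error stalls,
    and stop once the LR has become very small."""
    best_err, bad_epochs = float("inf"), 0
    while lr > min_lr:
        train_one_epoch(lr)
        err = validation_error()
        if err < best_err:
            best_err, bad_epochs = err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # no progress on the validation set
                lr /= factor             # divide the LR by 2 (or 5)
                bad_epochs = 0
    return best_err
```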

But most importantly, worry about the learning rate. One useful idea used by some researchers (e.g., Alex Krizhevsky) is to monitor the ratio between the update norm and the weight norm. This ratio should be at around 10^-3. If it is much smaller, then learning will probably be too slow, and if it is much larger, then learning will be unstable and will probably fail.
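A sketch of monitoring that ratio for one layer's parameters; the weight and gradient values below are made up for illustration.

```python
import numpy as np

def update_ratio(weights, update):
    """weights, update: arrays of the same shape (one layer's parameters)."""
    return np.linalg.norm(update) / (np.linalg.norm(weights) + 1e-12)

W = 0.02 * np.random.randn(256, 128)
lr, grad = 0.1, np.random.randn(256, 128) * 1e-3
ratio = update_ratio(W, -lr * grad)
print(f"update/weight ratio: {ratio:.1e}")   # aim for roughly 1e-3
```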

Weight initialization. Worry about the random initialization of the weights at the start of learning.
If you are lazy, it is usually enough to do something like 0.02 * randn(num_params). A value at this scale tends to work surprisingly well over many different problems. Of course, smaller (or larger) values are also worth trying.
If it doesn’t work well (say your neural network architecture is unusual and/or very deep), then you should initialize each weight matrix with init_scale / sqrt(layer_width) * randn. In this case init_scale should be set to 0.1 or 1, or something like that.
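A NumPy sketch of the two initializations just described. Here I take layer_width to mean the layer's input width (fan-in), which is an assumption; the function names are mine.

```python
import numpy as np

def simple_init(fan_in, fan_out, scale=0.02):
    # the "lazy" option: a flat scale times a standard Gaussian
    return scale * np.random.randn(fan_in, fan_out)

def scaled_init(fan_in, fan_out, init_scale=1.0):
    # init_scale / sqrt(layer_width) * randn, with layer_width taken as fan_in
    return init_scale / np.sqrt(fan_in) * np.random.randn(fan_in, fan_out)

W1 = simple_init(784, 512)
W2 = scaled_init(512, 512, init_scale=1.0)
print(W1.std(), W2.std())   # ~0.02 and ~1/sqrt(512)
```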

Random initialization is super important for deep and recurrent nets. If you don’t get it right, then it’ll look like the network doesn’t learn anything at all. But we know that neural networks learn once the conditions are set.

Fun story: researchers believed, for many years, that SGD cannot train deep neural networks from random initializations. Every time they would try it, it wouldn’t work. Embarrassingly, they did not succeed because they used the “small random weights” for the initialization, which works great for shallow nets but simply doesn’t work for deep nets at all. When the nets are deep, the many weight matrices all multiply each other, so the effect of a suboptimal scale is amplified.
But if your net is shallow, you can afford to be less careful with the random initialization, since SGD will just find a way to fix it.
You’re now informed. Worry and care about your initialization. Try many different kinds of initialization. This effort will pay off. If the net doesn’t work at all (i.e., never “gets off the ground”), keep applying pressure to the random initialization. It’s the right thing to do.

If you are training RNNs or LSTMs, use a hard constraint over the norm of the gradient (remember that the gradient has been divided by batch size). Something like 15 or 5 works well in practice in my own experiments. Take your gradient, divide it by the size of the minibatch, and check if its norm exceeds 15 (or 5). If it does, then shrink it until it is 15 (or 5). This one little trick makes a huge difference in the training of RNNs and LSTMs, where otherwise the exploding gradient can cause learning to fail and force you to use a puny learning rate like 1e-6, which is too small to be useful.
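A NumPy sketch of that hard norm constraint, treating the gradient as one flattened vector for simplicity.

```python
import numpy as np

def clip_gradient(grad, batch_size, max_norm=15.0):
    g = grad / batch_size              # gradient normalized by minibatch size
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)      # shrink so the norm equals max_norm
    return g

g = np.random.randn(10000) * 50.0
print(np.linalg.norm(clip_gradient(g, batch_size=128)))  # <= 15
```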

Numerical gradient checking: If you are not using Theano or Torch, you’ll probably be implementing your own gradients. It is easy to make a mistake when implementing a gradient, so it is absolutely critical to use numerical gradient checking. Doing so will give you complete peace of mind and confidence in your code. You will know that you can invest effort in tuning the hyperparameters (such as the learning rate and the initialization) and be sure that your efforts are channeled in the right direction.
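A minimal numerical gradient check: compare the analytic gradient against a centered finite difference, one parameter at a time. The toy loss at the end is just an example.

```python
import numpy as np

def check_gradient(loss_fn, grad_fn, params, eps=1e-5):
    analytic = grad_fn(params)
    numeric = np.zeros_like(params)
    for i in range(params.size):
        p_plus, p_minus = params.copy(), params.copy()
        p_plus.flat[i] += eps
        p_minus.flat[i] -= eps
        numeric.flat[i] = (loss_fn(p_plus) - loss_fn(p_minus)) / (2 * eps)
    rel_err = np.abs(analytic - numeric) / (np.abs(analytic) + np.abs(numeric) + 1e-12)
    return rel_err.max()   # should be tiny, e.g. < 1e-6 in float64

# Example: L(w) = 0.5 * ||w||^2, so dL/dw = w.
w = np.random.randn(20)
print(check_gradient(lambda p: 0.5 * np.sum(p**2), lambda p: p, w))
```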

If you are using LSTMs and you want to train them on problems with very long range dependencies, you should initialize the biases of the forget gates of the LSTMs to large values. By default, the forget gates are the sigmoids of their total input, and when the weights are small, the forget gate is set to 0.5, which is adequate for some but not all problems. This is the one non-obvious caveat about the initialization of the LSTM.
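A sketch of setting the forget-gate bias in PyTorch. It relies on PyTorch's gate ordering inside the 4*hidden bias vector (input, forget, cell, output), and note that both bias_ih and bias_hh contribute, so the effective forget bias ends up at twice the value set here.

```python
import torch
import torch.nn as nn

def init_forget_bias(lstm: nn.LSTM, value: float = 1.0):
    hidden = lstm.hidden_size
    for name, param in lstm.named_parameters():
        if name.startswith("bias"):                    # bias_ih_l*, bias_hh_l*
            with torch.no_grad():
                param[hidden:2 * hidden].fill_(value)  # the forget-gate slice

lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2)
init_forget_bias(lstm, value=1.0)
```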

Data augmentation: be creative, and find ways to algorithmically increase the number of training cases at your disposal. If you have images, then you should translate and rotate them; if you have speech, you should combine clean speech with all types of random noise; etc. Data augmentation is an art (unless you’re dealing with images). Use common sense.
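For images, a simple augmentation pipeline can be composed from standard torchvision transforms; the exact list and parameters below are illustrative.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # random translation via padded crop
    transforms.RandomHorizontalFlip(),      # mirror images left/right
    transforms.RandomRotation(10),          # small random rotation (degrees)
    transforms.ToTensor(),
])
```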

Dropout. Dropout provides an easy way to improve performance. It’s trivial to implement and there’s little reason not to do it. Remember to tune the dropout probability, and do not forget to turn off dropout at test time and to multiply the weights by the keep probability (namely, 1 minus the dropout probability). Also, be sure to train the network for longer. Unlike normal training, where the validation error often starts increasing after prolonged training, dropout nets keep getting better and better the longer you train them. So be patient.
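A NumPy sketch of that bookkeeping: drop units with probability p during training, and scale by the keep probability at test time so expected activations match. Scaling the activations here has the same effect as scaling the weights, which is the form described above.

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    if train:
        mask = (np.random.rand(*x.shape) >= p)   # keep each unit with prob 1 - p
        return x * mask
    return x * (1.0 - p)                          # test time: scale, don't drop

h = np.random.randn(4, 8)
h_train = dropout_forward(h, p=0.5, train=True)
h_test = dropout_forward(h, p=0.5, train=False)
```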

Ensembling. Train 10 neural networks and average their predictions. It’s a fairly trivial technique that results in easy, sizeable performance improvements. One may be mystified as to why averaging helps so much, but there is a simple reason for the effectiveness of averaging. Suppose that two classifiers each have an accuracy of 70%. Then, when they agree, they are right. But when they disagree, one of them is often right, so the average prediction will place much more weight on the correct answer. The effect will be especially strong whenever the network is confident when it’s right and unconfident when it’s wrong.
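A sketch of the averaging step; `models` and `predict_proba` are placeholders for whatever framework holds the trained networks.

```python
import numpy as np

def ensemble_predict(models, predict_proba, x):
    """Average the predicted class probabilities of several trained models."""
    probs = np.mean([predict_proba(m, x) for m in models], axis=0)
    return probs.argmax(axis=-1)   # class with the highest averaged probability
```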

I am pretty sure that I haven’t forgotten anything. The above 13 points cover literally everything that’s needed in order to train LDNNs successfully.
So, to Summarize:

LDNNs are powerful.
LDNNs are trainable if we have a very fast computer.
So if we have a very large high-quality dataset, we can find the best LDNN for the task.
Which will solve the problem, or at least come close to solving it.


Getting a CNN to work: here are all the secrets of hyperparameter tuning

- Collect high-quality labeled data.

- Normalize the inputs and outputs to avoid numerical problems; something like PCA/whitening does the job.

- Parameter initialization matters a lot. If the weights are too small, the parameters barely move. A zero-mean Gaussian with standard deviation 0.01 is an almost universal choice for the weights; if that does not work, try a larger value. Biases can simply be initialized to zero.

- Use SGD with a minibatch size of 128, or a smaller size, though that reduces throughput and computational efficiency.

- Use SGD with momentum; second-order methods can be skipped.

- The step size of the gradient update (the learning rate) matters a lot. 0.1 is an almost universal starting value. Tuning it improves results; the practical approach is manual supervision: watch the error on a separate validation set, and once it stops decreasing, halve the step size or cut it even more.

- Gradient normalization: divide by the minibatch size, so the update does not depend explicitly on the minibatch size.

- Cap the maximum size of the weights to keep them from blowing up. A common rule is that no row norm should exceed 2 or 4; if it does, rescale the row proportionally down to that value (see the sketch after this list).

- Each update should change the parameters by roughly one part in a thousand; if the ratio is far from that, adjust the learning rate.

- Always use dropout.

- Always use ReLU.

If you have tried all of these and it still does not work, all that is left is to question your luck...
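The row-norm cap mentioned in the list above, sketched in NumPy; the cap of 4 is one of the values suggested there.

```python
import numpy as np

def max_norm_constraint(W, max_norm=4.0):
    """If the L2 norm of any row exceeds max_norm, scale that row back down."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / (row_norms + 1e-12))
    return W * scale

W = np.random.randn(256, 512) * 0.5
W = max_norm_constraint(W, max_norm=4.0)
print(np.linalg.norm(W, axis=1).max())   # <= 4.0
```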


Original answer below:

Better to teach a man to fish than to give him a fish. For CNN tuning, the single best reference paper is the NIPS 2012 paper that applied a CNN to ImageNet, bar none. The dropout paper makes the best supplement.

Someone above also recommended the "Tricks of the Trade" book... honestly it is not useful: among the chapters not about pre-training, apart from the first one written by LeCun, you can read the rest and just shrug.



Reposted from blog.csdn.net/darlingwood2013/article/details/64135917