CS231n
Lecture 6: Training Neural Networks, Part I
Review
Reviewing earlier lectures: we covered backpropagation for training neural networks and the structure of CNNs, so a CNN can be trained with backpropagation as follows:
1. Sample a mini-batch of data
2. Forward propagate it through the network to get the loss
3. Backpropagate to compute the gradients
4. Update the parameters using the gradients
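A minimal sketch of these four steps, using a linear softmax classifier on synthetic data; the sizes and hyperparameters are illustrative, not from the lecture:

```python
import numpy as np

N, D, C = 1000, 20, 5                      # examples, features, classes (assumed)
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
W = 0.01 * np.random.randn(D, C)           # weights to be learned

batch_size, lr = 64, 1e-1
for it in range(200):
    # 1. sample a mini-batch
    idx = np.random.choice(N, batch_size)
    Xb, yb = X[idx], y[idx]
    # 2. forward pass: compute the softmax loss
    scores = Xb.dot(W)
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(batch_size), yb]).mean()
    # 3. backward pass: gradient of the loss w.r.t. W
    dscores = probs.copy()
    dscores[np.arange(batch_size), yb] -= 1
    dW = Xb.T.dot(dscores) / batch_size
    # 4. update the parameters with the gradient (vanilla SGD)
    W -= lr * dW
```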
Training Neural Networks
Activation Functions
sigmoid, leaky ReLU, tanh, maxout, ReLU, ELU, …
Traditional neural networks used the sigmoid, but it has serious problems:
1. Saturated neurons kill the gradients (the same holds for tanh)
2. Its outputs are not zero-centered
3. exp() is relatively expensive to compute
ReLU, by contrast, has major advantages:
1. Does not saturate (in the positive region)
2. Very cheap to compute
3. Converges much faster in practice
4. More plausible from a neurobiological point of view
Minor flaw: it is not zero-centered, and it does not activate when $x < 0$ (no gradient flows there).
There are also variants such as leaky ReLU, skipped here since they are used less often; a small sketch of these activations follows.
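A sketch of the activations mentioned above, written elementwise with numpy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # saturates for large |x|, not zero-centered

def tanh(x):
    return np.tanh(x)                      # zero-centered, but still saturates

def relu(x):
    return np.maximum(0, x)                # cheap, non-saturating for x > 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope for x < 0 avoids "dead" units
```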
Data Preprocessing
Zero-centering and normalization:
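A minimal numpy sketch of both steps, assuming a data matrix `X` of shape (N, D) (the synthetic `X` below is only a stand-in):

```python
import numpy as np

X = np.random.randn(100, 3072) * 5 + 10   # stand-in for the data matrix, shape (N, D)
X -= np.mean(X, axis=0)                   # zero-center: subtract the per-feature mean
X /= np.std(X, axis=0)                    # normalize: divide by the per-feature std
```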
Weight Initialization
Plain Gaussian initialization (small random numbers) does not work: in deep networks the activations either collapse toward zero or saturate, and the gradients vanish.
Xavier initialization: sample weights with variance $1/n_{\text{in}}$, i.e. scale a unit Gaussian by $1/\sqrt{n_{\text{in}}}$; this does not work well with ReLU.
He et al.: use variance $2/n_{\text{in}}$ instead (scale by $\sqrt{2/n_{\text{in}}}$), which compensates for ReLU zeroing half the activations.
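A sketch comparing the three initializations on a single ReLU layer; the layer size and the unit-Gaussian input are illustrative assumptions:

```python
import numpy as np

fan_in, fan_out = 512, 512                 # illustrative layer size
x = np.random.randn(1000, fan_in)          # unit-Gaussian inputs to the layer

# Small random Gaussian: activations shrink toward zero as depth grows
W_gauss  = 0.01 * np.random.randn(fan_in, fan_out)
# Xavier: variance 1/fan_in -- good for tanh, but ReLU halves the variance again
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
# He et al.: variance 2/fan_in -- compensates for ReLU zeroing half the inputs
W_he     = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

for name, W in [('gauss', W_gauss), ('xavier', W_xavier), ('he', W_he)]:
    h = np.maximum(0, x.dot(W))            # one ReLU layer
    print(name, 'activation std:', h.std())
```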
Batch Normalization
With batch normalization, a plain Gaussian initialization can be used directly.
BN is usually inserted between the CONV/FC layer and the ReLU, i.e. [CONV+BN+ReLU+pool] or [FC+BN+ReLU].
Problem: do we necessarily want a unit gaussian input to a tanh layer?
A: Not necessarily. Most of the mass of a unit Gaussian lies within roughly $[-2, 2]$, but by $|x| \approx 2$ the gradient of $\tanh$ has already started to vanish, so it can be better to shrink the normalized input further.
In practice BN therefore applies a learnable scale and shift, $y = \gamma \hat{x} + \beta$, and $\gamma, \beta$ are updated iteratively during training (the network can even learn to undo the normalization).
Benefits:
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
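A sketch of the BN forward pass at training time (at test time, running averages of the mean and variance are used instead); the mini-batch below is illustrative:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) activations from a FC/CONV layer for one mini-batch
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to roughly unit Gaussian
    return gamma * x_hat + beta            # learnable scale and shift

x = 3.0 * np.random.randn(64, 100) + 2.0   # illustrative mini-batch
out = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(), out.std())               # ~0 and ~1
```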
Babysitting the Learning Process
- Preprocess the data
- Choose the architecture
- Double check that the loss is reasonable
- Try training… Make sure that you can overfit a very small portion of the training data; start with small regularization and find a learning rate that makes the loss go down (a small sanity-check sketch follows).
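A sanity-check sketch of the overfitting test; `model`, `X_train`, and `y_train` are assumed placeholders, not an API from the lecture:

```python
# A correctly wired model should reach ~100% training accuracy on a tiny subset.
X_tiny, y_tiny = X_train[:20], y_train[:20]
model.train(X_tiny, y_tiny, reg=0.0, learning_rate=1e-3, num_iters=200)
acc = (model.predict(X_tiny) == y_tiny).mean()
print('tiny-set training accuracy:', acc)   # expect this to approach 1.0
```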
Hyperparameter Optimization
Do cross-validation in stages: first a coarse search over wide ranges with only a few epochs, then a finer search over the promising region with longer training.
If the cost is ever > 3 * original cost, break out early
It’s best to optimize hyperparameters such as the learning rate and regularization strength in log space.
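A sketch of random hyperparameter search in log space; `train_and_evaluate` is an assumed helper that trains briefly and returns validation accuracy, and the sampling ranges are illustrative:

```python
import numpy as np

results = []
for trial in range(100):
    lr  = 10 ** np.random.uniform(-6, -3)   # learning rate, sampled log-uniformly
    reg = 10 ** np.random.uniform(-5, 5)    # regularization strength, log-uniform
    val_acc = train_and_evaluate(lr=lr, reg=reg)
    results.append((val_acc, lr, reg))
print(sorted(results, reverse=True)[:5])    # inspect the best settings
```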
Q: But this best cross-validation result is worrying. Why?
A: There is a big gap between training accuracy and validation/test accuracy, which indicates overfitting (so the regularization strength may need to be increased).