Lecture 6: Training Neural Networks, Part I

CS231n

Review

Recapping earlier lectures: we covered how to train neural networks with backpropagation and the structure of CNNs, so a CNN can be trained with backpropagation as follows:
1. Sample a mini-batch of data
2. Forward propagate it through the network to get the loss
3. Backpropagate the gradients from the loss
4. Update the parameters using the gradients
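As a concrete illustration of these four steps, here is a minimal, self-contained sketch: mini-batch SGD on a toy softmax linear classifier with random data (an assumed example, not the course's actual training code).

```python
import numpy as np

# Toy data and a linear softmax classifier (assumed setup for illustration).
N, D, C = 1000, 32, 10                     # samples, features, classes
X_train = np.random.randn(N, D)
y_train = np.random.randint(C, size=N)
W = 0.01 * np.random.randn(D, C)
learning_rate, batch_size = 1e-2, 256

for step in range(100):
    # 1. sample a mini-batch
    idx = np.random.choice(N, batch_size)
    X, y = X_train[idx], y_train[idx]
    # 2. forward pass: scores -> softmax probabilities -> cross-entropy loss
    scores = X.dot(W)
    scores -= scores.max(axis=1, keepdims=True)         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(batch_size), y]).mean()
    # 3. backward pass: gradient of the loss with respect to W
    dscores = probs.copy()
    dscores[np.arange(batch_size), y] -= 1
    dW = X.T.dot(dscores) / batch_size
    # 4. parameter update (vanilla SGD)
    W -= learning_rate * dW
```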

Training Neural Networks

Activation Functions

sigmoid, leaky ReLU, tanh, maxout, ReLU, ELU, …
Traditional neural networks used sigmoid, but it has serious problems:
1. Saturated neurons kill the gradient (the same applies to tanh)
2. Its outputs are not zero-centered
3. exp() is relatively expensive to compute
ReLU, on the other hand, has major advantages:
1. It does not saturate (in the positive region)
2. It is very cheap to compute
3. It converges faster in practice
4. It is more plausible from a neuroscience perspective
Minor drawbacks: it is not zero-centered, and it does not activate for x < 0
Other variants such as leaky ReLU exist but are used less often, so they are skipped here
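For reference, here are minimal NumPy versions of the activations mentioned above (a sketch to make the saturation and zero-centering points concrete, not code from the lecture):

```python
import numpy as np

def sigmoid(x):
    # Saturates for large |x|; outputs lie in (0, 1), so they are not zero-centered.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered, but still saturates at both ends.
    return np.tanh(x)

def relu(x):
    # Does not saturate for x > 0 and is cheap; outputs 0 ("dies") for x < 0.
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Small negative slope keeps a gradient flowing for x < 0.
    return np.where(x > 0, x, alpha * x)
```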

Data Preprocessing

Zero-centering and normalization: $y = \frac{x - \mu}{\sigma}$
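A minimal sketch of this preprocessing, assuming the data is a matrix X of shape (N, D); in practice the mean and standard deviation are computed on the training set and reused for validation and test data:

```python
import numpy as np

X = np.random.randn(100, 3072)        # toy data standing in for real inputs
mu = X.mean(axis=0)                   # per-feature mean
sigma = X.std(axis=0)                 # per-feature standard deviation
X_norm = (X - mu) / (sigma + 1e-8)    # y = (x - mu) / sigma; epsilon avoids division by zero
```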

Weight Initialization

Plain Gaussian initialization does not work well for deep networks
Xavier initialization: $w \sim N(\mu, \sigma^2)$ with shape $fan_{in} \times fan_{out}$, $\mu = 0$, $\sigma = \frac{1}{\sqrt{fan_{in}}}$; it does not work well with ReLU
He et al.: $w \sim N(\mu, \sigma^2)$ with shape $fan_{in} \times fan_{out}$, $\mu = 0$, $\sigma = \sqrt{\frac{2}{fan_{in}}}$
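In NumPy the two schemes above amount to scaling a standard Gaussian (the layer sizes here are assumed for illustration):

```python
import numpy as np

fan_in, fan_out = 512, 256   # example layer sizes

# Xavier initialization: std = 1 / sqrt(fan_in); works poorly with ReLU
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# He et al. initialization: std = sqrt(2 / fan_in); recommended with ReLU
W_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```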

Batch Normalization

With batch normalization, plain Gaussian initialization can be used directly

$\hat{x} = \frac{x - E(x)}{\sqrt{Var(x)}}$

It is usually placed between the CONV/FC layer and the ReLU layer, i.e. [CONV+BN+ReLU+pool] or [FC+BN+ReLU+pool]
Problem: do we necessarily want a unit gaussian input to a tanh layer?
A: Most of the mass of $N(0, 1)$ lies within $[-3, 3]$, but $\tanh(3) = 0.995$ and $\tanh(2) = 0.964$, so the gradient is already starting to vanish there; it is therefore better to shrink $\sigma$ further
In practice a learnable scale and shift are applied and updated iteratively during training:
$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$
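A minimal sketch of the training-time forward pass for the formula above, for an input of shape (N, D); the running averages used at test time are omitted:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch, then apply the learnable
    # scale (gamma) and shift (beta).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```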

Benefits

  • Improves gradient flow through the network
  • Allows higher learning rates
  • Reduces the strong dependence on initialization
  • Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe

Babysitting the Learning Process

  1. Preprocess the data
  2. Choose the architecture
  3. Double check that the loss is reasonable
  4. Try training… Make sure that you can overfit a very small portion of the training data; then start with small regularization and find a learning rate that makes the loss go down
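For the loss sanity check in step 3, an untrained softmax classifier over C classes (small weights, no regularization) should start near a loss of -log(1/C); a tiny sketch of that check:

```python
import numpy as np

# With 10 classes, the initial softmax loss should be close to -log(1/10) ≈ 2.302;
# turning regularization on should make it slightly larger.
num_classes = 10
print(-np.log(1.0 / num_classes))
```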

Hyperparameter Optimization

  • Search coarse → fine in stages
  • If the cost is ever > 3 * original cost, break out early
  • It's best to optimize in log space
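A sketch of random search with log-space sampling (the ranges below are assumptions, not values from the lecture):

```python
import numpy as np

for trial in range(100):
    lr = 10 ** np.random.uniform(-6, -3)   # learning rate sampled in log space
    reg = 10 ** np.random.uniform(-5, 5)   # regularization strength sampled in log space
    # train for a few epochs with (lr, reg) and record the validation accuracy,
    # then narrow the ranges around the best results (coarse -> fine)
```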
Q: But this best cross-validation result is worrying. Why?
A: A big gap between training accuracy and testing accuracy → overfitting


Reposted from blog.csdn.net/qq_36356761/article/details/80074372