Neural Networks and Deep Learning

3.6 Activation Function


sigmoid: $a = \frac{1}{1 + e^{-z}}$
Its output lies in (0, 1). Except for the output layer of a binary classifier, it is generally not chosen, because tanh performs better than sigmoid.
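A minimal NumPy sketch of the sigmoid formula above (the function name `sigmoid` and the sample inputs are illustrative, not from the original post):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: a = 1 / (1 + e^(-z)); output lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```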

tanh: $a = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
Its output lies in (-1, 1), which has a data-centering effect that makes the network easier to train, so it performs better than sigmoid.
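For comparison, a sketch of tanh under the same assumptions (it simply wraps `np.tanh`; outputs are roughly zero-centered, which is the centering effect mentioned above):

```python
import numpy as np

def tanh(z):
    """tanh activation: a = (e^z - e^(-z)) / (e^z + e^(-z)); output lies in (-1, 1)."""
    return np.tanh(z)

print(tanh(np.array([-2.0, 0.0, 2.0])))  # ~[-0.964, 0.0, 0.964]
```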

When z is very large or very small, the derivatives of sigmoid(z) and tanh(z) are both close to 0, which slows training.
One of the downsides of both the sigmoid function and the tanh function is that if z is either very large or very small, the gradient (the slope of the function) becomes very small, ending up close to zero, and this can slow down gradient descent.
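A quick numerical check of this vanishing-gradient effect (illustrative values only, not from the post): both derivatives peak at z = 0 and shrink rapidly as |z| grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # maximum 0.25 at z = 0

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2  # maximum 1.0 at z = 0

for z in [0.0, 5.0, 10.0]:
    print(z, sigmoid_grad(z), tanh_grad(z))
# At z = 10 both derivatives are below 5e-5, so a gradient-descent
# step W := W - alpha * dW barely moves the weights.
```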

ReLU: $a = \max(0, z)$
When you are not sure which activation to choose, use ReLU (rectified linear unit); a short sketch of ReLU and Leaky ReLU follows the next definition.

Leaky ReLU: $a = \max(0.01z, z)$
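A minimal sketch of both rectified units using the definitions above (the 0.01 negative-side slope follows the Leaky ReLU formula; the sample inputs are made up):

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z)."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: max(alpha * z, z); the small slope alpha keeps the
    gradient nonzero even when z < 0."""
    return np.maximum(alpha * z, z)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z))        # [0.  0.  0.  2.]
print(leaky_relu(z))  # [-0.03  -0.005  0.     2.   ]
```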

The advantage of ReLU is that over much of the range of z its slope is far from zero, so the network trains faster.
For a lot of the space of z, the derivative (slope) of the ReLU activation function is very different from zero, so in practice a neural network using ReLU will often learn much faster than one using tanh or sigmoid. For half of the range of z the slope of ReLU is zero, but in practice enough of your hidden units will have z greater than zero, so learning can still be quite fast for most training examples.
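A small illustrative check of the "enough hidden units have z greater than zero" claim, using randomly generated pre-activations (purely hypothetical data, not from the post): roughly half the units see a positive z, and for those the ReLU derivative is exactly 1 rather than near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 100))  # hypothetical pre-activations: 1000 examples x 100 units

relu_grad = (z > 0).astype(float)     # derivative of max(0, z): 1 where z > 0, else 0
print(relu_grad.mean())               # ~0.5, so about half the units pass a full-strength gradient
```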


Reprinted from blog.csdn.net/XindiOntheWay/article/details/82285657