Improving Deep Neural Networks

1.10 Vanishing/Exploding Gradients

One of the problems of training neural networks, especially very deep ones, is that of vanishing and exploding gradients. This means that when you're training a very deep network, your derivatives (slopes) can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult.

Suppose the activation function is $g(z)=z$, i.e. the activation is linear, and there is no bias term $b$:

$\begin{aligned} &z^{[1]}=W^{[1]}x, \quad a^{[1]}=z^{[1]}\\ &z^{[2]}=W^{[2]}a^{[1]}=W^{[2]}W^{[1]}x \\ &\vdots \\ &\hat{y}=W^{[L]}W^{[L-1]}W^{[L-2]}\cdots W^{[2]}W^{[1]}x \end{aligned}$

If the weights $W^{[l]}$ are all just a little bit bigger than one, or more precisely a little bit bigger than the identity matrix, then with a very deep network the activations can explode.
And if $W^{[l]}$ is just a little bit less than the identity, say with 0.9 and 0.9 on the diagonal, then with a very deep network the activations will decrease exponentially.
Even though this argument was made in terms of activations increasing or decreasing exponentially as a function of $L$, a similar argument can be used to show that the derivatives or gradients you compute will also increase or decrease exponentially as a function of the number of layers.
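
The argument above can be checked numerically. The following is a minimal sketch, assuming every layer shares the same 2×2 weight matrix (a scaled identity), a linear activation, and no bias. The depth of 50 and the scale 1.5 are illustrative assumptions, while 0.9 is the value mentioned above; `forward_linear` is a hypothetical helper name.

```python
import numpy as np

def forward_linear(x, weight, num_layers):
    """Apply the same weight matrix num_layers times with g(z) = z and b = 0."""
    a = x
    for _ in range(num_layers):
        a = weight @ a  # z = W a, and a = g(z) = z
    return a

x = np.ones((2, 1))  # illustrative input
num_layers = 50      # assumed depth, not from the notes

for scale in (1.5, 0.9):
    W = scale * np.eye(2)  # slightly larger / smaller than the identity
    y_hat = forward_linear(x, W, num_layers)
    print(f"scale={scale}: norm of y_hat = {np.linalg.norm(y_hat):.3e}")
```

With a scale above 1 the norm of the output grows like `scale ** num_layers`, and with a scale below 1 it shrinks at the same exponential rate, which is the behavior the notes describe.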

1.11 Weight initialization for deep networks


It turns out that a partial solution to vanishing/exploding gradients, one that doesn't solve the problem entirely but helps a lot, is a better, more careful choice of the random initialization for your neural network.

So in order to keep $z$ from blowing up or becoming too small, notice that the larger $n$ is, the smaller you want each $w_i$ to be. That is because $z$ is the sum of the terms $w_i x_i$, so if you're adding up a lot of these terms you want each of them to be smaller. One reasonable thing to do is to set the variance of $w_i$ equal to $\frac{1}{n}$, where $n$ is the number of input features going into a neuron.
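
To see why a variance of 1 over $n$ is a reasonable choice, here is a small simulation sketch. It assumes the inputs $x_i$ and weights $w_i$ are independent, zero-mean Gaussians, which the notes do not state explicitly; the sample count, seed, and the helper name `z_variance` are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

def z_variance(n, weight_var, num_samples=2000):
    """Estimate Var(z) for z = sum_i w_i * x_i with x_i ~ N(0, 1)."""
    x = rng.standard_normal((num_samples, n))
    w = rng.standard_normal((num_samples, n)) * np.sqrt(weight_var)
    z = np.sum(w * x, axis=1)  # one value of z per sample
    return z.var()

for n in (10, 100, 500):
    unscaled = z_variance(n, weight_var=1.0)
    scaled = z_variance(n, weight_var=1.0 / n)
    print(f"n={n}: Var(z) with Var(w)=1 is {unscaled:.1f}, with Var(w)=1/n is {scaled:.2f}")
```

Under these assumptions the variance of $z$ grows roughly in proportion to $n$ when the weights have unit variance, and stays close to 1 when the weight variance is scaled by 1 over $n$.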

$\begin{aligned} &\text{ReLU:}\\ &\mathrm{Var}(W)=\frac{2}{n^{[l-1]}}\\ &W^{[l]}=\text{np.random.randn(shape)} * \text{np.sqrt}\left(\frac{2}{n^{[l-1]}}\right)\\ &n^{[l-1]} \text{ denotes the number of units in layer } l-1\\ &\\ &\tanh:\\ &\mathrm{Var}(W)=\frac{1}{n^{[l-1]}} \quad \text{(Xavier initialization)} \end{aligned}$

So in practice, you can set the weight matrix $W^{[l]}$ for a given layer to $np.random.randn$ of whatever the shape of the matrix is, times the square root of 1 over the number of features fed into each neuron in layer $l$ (2 over that number for ReLU, as in the formulas above). That number is $n^{[l-1]}$, because that is the number of units feeding into each of the units in layer $l$.

So if the input features or activations have roughly mean 0 and variance 1, then this causes $z$ to take on a similar scale. This doesn't solve the vanishing/exploding gradients problem, but it definitely helps reduce it, because it tries to set each of the weight matrices $W$ so that it is not too much bigger than 1 and not too much less than 1, so the activations don't explode or vanish too quickly. There are also some other variants of this initialization.
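
The initialization described in this section can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `initialize_weights`, the `layer_dims` list, the zero-initialized biases, and the use of `np.random.default_rng` (instead of `np.random.randn`, purely for reproducible seeding) are all assumptions made here.

```python
import numpy as np

def initialize_weights(layer_dims, activation="relu", seed=0):
    """Scaled random initialization for each layer of a network.

    layer_dims: list of layer sizes [n^[0], n^[1], ..., n^[L]] (hypothetical format).
    activation: "relu" uses Var(W) = 2 / n^[l-1]; anything else uses 1 / n^[l-1].
    """
    rng = np.random.default_rng(seed)
    numerator = 2.0 if activation == "relu" else 1.0
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        # Standard normal samples scaled so that Var(W) = numerator / n^[l-1]
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(numerator / n_prev)
        params[f"b{l}"] = np.zeros((n_curr, 1))  # assumption: biases start at zero
    return params

params = initialize_weights([400, 300, 1], activation="relu")
print(params["W1"].shape, params["W1"].var())
```

For the first layer in this example the empirical variance of the weights should land close to 2 / 400, matching the ReLU formula above; passing `activation="tanh"` switches to the 1 over $n^{[l-1]}$ scaling.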