Machine Learning - Neural Networks Learning: Cost Function and Backpropagation

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/iracer/article/details/51227081

This series of articles contains my study notes for "Machine Learning" by Prof. Andrew Ng, Stanford University. This article covers week 5, Neural Networks Learning, and discusses the cost function and the backpropagation algorithm.


Cost Function and Backpropagation


Neural networks are one of the most powerful learning algorithms that we have today. In this and the next few sections, we're going to talk about a learning algorithm for fitting the parameters of a neural network given a training set. As with most of our learning algorithms, we begin with the cost function used to fit the parameters of the network.


1. Cost function


I'm going to focus on the application of neural networks to classification problems. Suppose we have a network like the one shown in the picture, and a training set of $m$ training examples, $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$.



$L$ = total no. of layers in the network; here $L = 4$.

$s_l$ = no. of units (not counting the bias unit) in layer $l$; here $s_1 = 3$, $s_2 = 5$, $s_4 = s_L = 4$.

Binary classification

The first is binary classification, where the labels y are either 0 or 1. The network shown above has 4 output units, but for binary classification we would have only one output unit, which computes h(x). The output h(x) of the neural network is then a single real number.

y = 0 or 1

Multi-class classification (K classes)


K output units


Cost function 


Logistic regression

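The cost function we use for the neural network is going to be a generalization of the one that we use for logistic regression. For logistic regression we minimized the cost function

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

that is, $-1/m$ times the sum of the per-example log-likelihood terms, plus an extra regularization term. The regularization sum runs from $j = 1$ through $n$ because we do not regularize the bias term $\theta_0$.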
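The regularized logistic regression cost described above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library; the function names `sigmoid` and `logistic_cost` and the list-based data layout are assumptions made for this example, not something taken from the course materials.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Regularized logistic regression cost J(theta).

    theta: list of n + 1 parameters, theta[0] is the bias term.
    X:     list of m examples, each a list of n features (no leading 1).
    y:     list of m labels, each 0 or 1.
    lam:   regularization strength lambda.
    """
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = theta[0] + sum(t * v for t, v in zip(theta[1:], x_i))
        h = sigmoid(z)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    # The regularization sum starts at j = 1, so theta[0] is skipped.
    reg = lam / (2 * m) * sum(t * t for t in theta[1:])
    return -total / m + reg
```

With all parameters set to zero, every prediction is 0.5 and the cost equals log 2 regardless of the labels, which is a convenient sanity check.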


Neural network

For a neural network, our cost function is going to be a generalization of this. Instead of having just one logistic regression output unit, we may have K of them.
Our network now outputs vectors in $\mathbb{R}^K$, where K may equal 1 if we have a binary classification problem. I'm going to use the notation $(h_\Theta(x))_i$ to denote the i-th output. That is, $h_\Theta(x)$ is a K-dimensional vector, and the subscript i selects the i-th element of the vector output by the neural network. The cost function $J(\Theta)$ is now the following:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left(h_\Theta(x^{(i)})\right)_k + (1 - y_k^{(i)}) \log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{ji}^{(l)}\right)^2$$

The first term is $-1/m$ times a sum similar to the one we had for logistic regression, except that there is an additional sum from k = 1 through K. This summation runs over the K output units. So if the final layer of the neural network has four output units, this is a sum from k = 1 through 4 of the logistic regression cost function, applied to each of the four output units in turn.
Finally, the second term is the regularization term, similar to what we had for logistic regression. This summation looks complicated, but all it does is sum the squares of the terms $\Theta_{ji}^{(l)}$ over all values of j, i and l, except that we don't sum over the terms corresponding to the bias values (the terms with i = 0), just as we didn't for logistic regression.
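To make the double sum and the regularization term concrete, here is a minimal sketch of the neural network cost in plain Python. It assumes the network outputs have already been computed; the name `nn_cost`, the argument layout, and the convention that column 0 of each weight matrix holds the bias weights are assumptions made for this illustration.

```python
import math

def nn_cost(H, Y, Thetas, lam):
    """Regularized neural network cost J(Theta).

    H:      m x K list of network outputs, H[i][k] = (h(x^(i)))_k, each in (0, 1).
    Y:      m x K list of labels (one row per example, entries 0 or 1).
    Thetas: list of weight matrices; Thetas[l][j][i] connects unit i of one
            layer to unit j of the next, and column i = 0 is the bias weight.
    lam:    regularization strength lambda.
    """
    m = len(H)
    data_term = 0.0
    for h_i, y_i in zip(H, Y):            # sum over the m examples
        for h_k, y_k in zip(h_i, y_i):    # sum over the K output units
            data_term += y_k * math.log(h_k) + (1 - y_k) * math.log(1 - h_k)
    reg = 0.0
    for Theta in Thetas:                  # sum over layers
        for row in Theta:
            reg += sum(w * w for w in row[1:])  # skip the bias column
    return -data_term / m + lam / (2 * m) * reg
```

For K = 1 and a single weight matrix, this reduces to the logistic regression cost, which matches the claim that the neural network cost is a generalization of it.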

2. Backpropagation algorithm


In the previous section, we talked about a cost function for the neural network. In this section, let's start to talk about an algorithm for minimizing that cost function. In particular, we'll talk about the backpropagation algorithm.

Gradient computation 


Here's the cost function that we wrote down in the previous section:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left(h_\Theta(x^{(i)})\right)_k + (1 - y_k^{(i)}) \log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{ji}^{(l)}\right)^2$$

What we'd like to do is find parameters $\Theta$ that minimize $J(\Theta)$, using either gradient descent or one of the advanced optimization algorithms.

Need code to compute:

$J(\Theta)$ and $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$

What we need to do, therefore, is write code that takes as input the parameters $\Theta$ and computes $J(\Theta)$ and these partial derivative terms. Remember that the parameters of the neural network are the values $\Theta_{ij}^{(l)}$, each of which is a real number, so these are the partial derivative terms we need to compute. To compute the cost function $J(\Theta)$ we just use the formula above, so for most of this section I want to focus on how we can compute the partial derivative terms.

Given one training example ( x, y)

Let's start with the case where we have only one training example, so that our entire training set consists of a single pair (x, y).


Let's step through the sequence of calculations we would do with this one training example. The first thing we do is apply forward propagation in order to compute what our hypothesis actually outputs given the input.

Forward propagation


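$a^{(1)} = x$
$z^{(2)} = \Theta^{(1)} a^{(1)}$
$a^{(2)} = g(z^{(2)})$ (add $a_0^{(2)}$)
$z^{(3)} = \Theta^{(2)} a^{(2)}$
$a^{(3)} = g(z^{(3)})$ (add $a_0^{(3)}$)
$z^{(4)} = \Theta^{(3)} a^{(3)}$
$a^{(4)} = h_\Theta(x) = g(z^{(4)})$

This is our vectorized implementation of forward propagation; it lets us compute the activation values for all of the neurons in our neural network.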
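Forward propagation for a network of any depth might be sketched as follows. This is a plain-Python illustration rather than a truly vectorized implementation; the function name `forward_propagate` and the convention that column 0 of each weight matrix multiplies the bias unit are assumptions made for this example.

```python
import math

def sigmoid(z):
    """Logistic activation g(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward_propagate(Thetas, x):
    """Compute the activations of every layer for a single input x.

    Thetas: list of weight matrices; Thetas[l] has one row per unit of the
            next layer and one column per unit of the current layer,
            with column 0 reserved for the bias unit.
    x:      list of input features (without the bias entry).

    Returns (activations, zs). Every activation vector except the output
    has the bias value 1.0 prepended as its first entry.
    """
    a = [1.0] + list(x)          # a^(1) = x, plus the bias unit
    activations = [a]
    zs = []
    for idx, Theta in enumerate(Thetas):
        # z^(l+1) = Theta^(l) a^(l), one weighted sum per row of Theta
        z = [sum(w * v for w, v in zip(row, a)) for row in Theta]
        zs.append(z)
        a = [sigmoid(v) for v in z]
        if idx < len(Thetas) - 1:
            a = [1.0] + a        # add the bias unit to hidden layers only
        activations.append(a)
    return activations, zs
```

The last entry of `activations` is the hypothesis output h(x); the intermediate values are kept because backpropagation needs them later.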


Gradient computation: Back propagation algorithm 

Next, in order to compute the derivatives, we're going to use an algorithm called backpropagation. The intuition of the backpropagation algorithm is that for each node we compute a term $\delta_j^{(l)}$ that in some sense represents the error of node j in layer l.

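Intuition: $\delta_j^{(l)}$ = "error" of node j in layer l.

For each output unit (layer L = 4)

$\delta_j^{(4)} = a_j^{(4)} - y_j$

If you think of δ, a and y as vectors, you can also write a vectorized version of this, which is just $\delta^{(4)} = a^{(4)} - y$

Here each of $\delta^{(4)}$, $a^{(4)}$ and y is a vector whose dimension is equal to the number of output units in our network.

What we do next is compute the delta terms for the earlier layers in our network:

$\delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} \mathbin{.*} g'(z^{(3)})$
$\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \mathbin{.*} g'(z^{(2)})$

Here .* is the element-wise multiplication operation that we know from MATLAB, and $g'(z^{(l)}) = a^{(l)} \mathbin{.*} (1 - a^{(l)})$ is the derivative of the sigmoid activation function. There is no $\delta^{(1)}$ term, because the first layer is the input layer and has no error associated with it.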
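The delta computations for a single training example can be sketched as below. The sketch assumes the activations have already been produced by forward propagation, with a bias value of 1.0 stored as the first entry of every layer except the output; the function name `backprop_deltas` and this data layout are assumptions made for the illustration.

```python
def backprop_deltas(Thetas, activations, y):
    """Compute the delta vectors for layers 2..L for one training example.

    Thetas:      list of weight matrices, column 0 holds the bias weights.
    activations: list of activation vectors, one per layer; every layer
                 except the output starts with the bias value 1.0.
    y:           list of target values, one per output unit.

    Returns a list whose first entry is delta^(2) and whose last entry is
    delta^(L). Bias components are dropped.
    """
    # Output layer: delta^(L) = a^(L) - y
    delta = [a - t for a, t in zip(activations[-1], y)]
    deltas = [delta]
    # Hidden layers, walking from layer L-1 back to layer 2
    for l in range(len(Thetas) - 1, 0, -1):
        Theta = Thetas[l]
        a = activations[l]  # hidden-layer activations, a[0] is the bias
        # (Theta^T delta): one entry per unit of the hidden layer
        back = [sum(Theta[i][j] * delta[i] for i in range(len(delta)))
                for j in range(len(a))]
        # Element-wise product with g'(z) = a .* (1 - a); skip bias j = 0
        delta = [back[j] * a[j] * (1 - a[j]) for j in range(1, len(a))]
        deltas.insert(0, delta)
    return deltas
```

The loop stops before the input layer, which is why no delta is produced for layer 1.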


Backpropagation algorithm 


Training set: $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
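Putting the pieces together over a whole training set, the algorithm can be sketched roughly as follows. This is a sketch under stated assumptions, not the course's reference code: it accumulates $a_j^{(l)} \delta_i^{(l+1)}$ into an accumulator for every example, then divides by m and adds the regularization term $(\lambda / m)\,\Theta_{ij}^{(l)}$ for the non-bias weights. The function name `backprop_gradients` and the list-based layout (column 0 of each matrix is the bias weight) are hypothetical choices for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_gradients(Thetas, X, Y, lam):
    """Return the partial derivatives of J(Theta), one matrix per layer."""
    m = len(X)
    # Accumulators with the same shapes as the weight matrices, all zero.
    Deltas = [[[0.0] * len(row) for row in Theta] for Theta in Thetas]
    for x, y in zip(X, Y):
        # Forward propagation; hidden layers get a leading bias of 1.0.
        acts = [[1.0] + list(x)]
        for idx, Theta in enumerate(Thetas):
            a = [sigmoid(sum(w * v for w, v in zip(row, acts[-1])))
                 for row in Theta]
            acts.append(a if idx == len(Thetas) - 1 else [1.0] + a)
        # Error of the output layer: delta^(L) = a^(L) - y
        delta = [a - t for a, t in zip(acts[-1], y)]
        for l in range(len(Thetas) - 1, -1, -1):
            a = acts[l]
            # Accumulate: Delta_ij := Delta_ij + a_j * delta_i
            for i in range(len(delta)):
                for j in range(len(a)):
                    Deltas[l][i][j] += a[j] * delta[i]
            if l > 0:
                # Propagate the error one layer back, dropping the bias.
                back = [sum(Thetas[l][i][j] * delta[i]
                            for i in range(len(delta)))
                        for j in range(len(a))]
                delta = [back[j] * a[j] * (1 - a[j])
                         for j in range(1, len(a))]
    # Average, and regularize every weight except the bias column j = 0.
    grads = []
    for Theta, Delta in zip(Thetas, Deltas):
        grads.append([[d / m + (lam / m * t if j > 0 else 0.0)
                       for j, (t, d) in enumerate(zip(t_row, d_row))]
                      for t_row, d_row in zip(Theta, Delta)])
    return grads
```

The returned matrices have the same shapes as the weight matrices, so they can be handed directly to gradient descent or another optimizer. In practice it is worth comparing them against a numerical estimate of the gradient before trusting them.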




3. Backpropagation intuition


Backpropagation is, perhaps unfortunately, a less mathematically clean and less simple algorithm than linear regression or logistic regression. I've used backpropagation pretty successfully for many years, and even today I sometimes feel I don't have a very good sense of just what it's doing. For those of you doing the programming exercises, they will at least step you mechanically through how to implement backprop, so you'll be able to get it to work for yourself. What I want to do in this section is look a little more closely at the mechanical steps of backpropagation and give you some intuition about what those steps are doing, to hopefully convince you that it's at least a reasonable algorithm.

Forward Propagation 

In order to better understand backpropagation, let's take a closer look at what forward propagation is doing. Here's a neural network with two input units (not counting the bias unit), two hidden units in this layer, two hidden units in the next layer and, finally, one output unit. Again, these counts of two, two, two do not include the bias units on top.


In order to illustrate forward propagation, I'm going to draw this network a little differently. In particular, I'm going to draw the nodes as very fat ellipses, so that I can write text in them. When performing forward propagation, we have some particular example, say $(x^{(i)}, y^{(i)})$, and it is this $x^{(i)}$ that we feed into the input layer.

When we forward propagate to the first hidden layer, we compute $z_1^{(2)}$ and $z_2^{(2)}$, which are the weighted sums of the input units. Then we apply the sigmoid (logistic) activation function to the z values, which gives us the activation values $a_1^{(2)}$ and $a_2^{(2)}$. We then forward propagate again to get $z_1^{(3)}$ and apply the activation function to it to get $a_1^{(3)}$, and similarly for the other unit, until we get $z_1^{(4)}$. Applying the activation function gives us $a_1^{(4)}$, which is the final output value of the neural network.

Concretely, the way we compute the value $z_1^{(3)}$ is

$z_1^{(3)} = \Theta_{10}^{(2)} \times 1 + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)}$

What is backpropagation doing? 


What backpropagation does is very similar to forward propagation, except that instead of the computations flowing from the left to the right of the network, they flow from the right to the left, using a very similar computation.
Cost function of the neural network is

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left(h_\Theta(x^{(i)})\right)_k + (1 - y_k^{(i)}) \log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{ji}^{(l)}\right)^2$$

Focusing on a single example $x^{(i)}, y^{(i)}$, the case of 1 output unit (K = 1), and ignoring regularization (λ = 0), the cost function can be written as follows:

$\mathrm{cost}(i) = -\left[ y^{(i)} \log h_\Theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\Theta(x^{(i)})\right) \right]$

This cost function plays a role similar to the squared error. So rather than looking at this complicated expression, you can think of cost(i) as being approximately the squared difference between what the neural network outputs and the actual value.
Think of

$\mathrm{cost}(i) \approx \left( h_\Theta(x^{(i)}) - y^{(i)} \right)^2$

i.e. how well is the network doing on example i?



More formally, the delta terms are the partial derivatives of the cost function with respect to $z_j^{(l)}$, the weighted sums of inputs that we were computing as the z terms:

$\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} \mathrm{cost}(i)$

Concretely, the cost function is a function of the label y and of the value $h_\Theta(x)$ that the neural network outputs. If we could go inside the neural network and change those $z_j^{(l)}$ values a little bit, that would affect the values the network outputs, and that in turn would change the cost function.
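For the output unit this definition can be checked numerically. The sketch below assumes a single sigmoid output unit with the cross-entropy cost for one example, and compares the analytic value a − y with a central finite-difference estimate of the derivative of the cost with respect to z. The names `cost` and `numeric_delta` and the sample values of z and y are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(z, y):
    """Cross-entropy cost of one example as a function of the output z."""
    h = sigmoid(z)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

def numeric_delta(z, y, eps=1e-6):
    """Central-difference estimate of d cost / d z."""
    return (cost(z + eps, y) - cost(z - eps, y)) / (2 * eps)

z, y = 0.3, 1.0                  # arbitrary sample values
analytic = sigmoid(z) - y        # delta = a - y for the output unit
numeric = numeric_delta(z, y)
print(analytic, numeric)         # the two values should agree closely
```

The same kind of finite-difference comparison, applied to the weights instead of z, is a common way to check a full backpropagation implementation.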

We don't compute the bias term

By the way, so far I've been writing the delta values only for the hidden units, excluding the bias units. Depending on how you define or implement the backpropagation algorithm, you may end up computing delta values for the bias units as well. The bias units always output the value +1, and there is no way for us to change that value. So, in the way I usually implement backprop, I do end up computing these delta values, but we just discard them and don't use them.
