Logistic Regression - Cost function

Abstract: This is a transcript of Lesson 49, "Cost Function," from Chapter 7, "Logistic Regression," of Andrew Ng's Machine Learning course. I recorded the subtitles while watching the videos and lightly edited them for concision and readability, for my own future reference; I'm sharing them here in case they help others. If you spot any mistakes, corrections are very welcome and sincerely appreciated!

In this video we'll talk about how to fit the parameters \theta for logistic regression. In particular, I'd like to define the optimization objective or the cost function that we'll use to fit the parameters. Here's the supervised learning problem of fitting a logistic regression model.

 

We have a training set of m training examples. As usual, each of our examples is represented by a feature vector that's n+1 dimensional, and as usual we have x_{0}=1: our zeroth feature is always equal to 1. Because this is a classification problem, our training set has the property that every label y is either 0 or 1. This is the hypothesis, and the parameters of the hypothesis are this \theta over here. The question I want to talk about is: given this training set, how do we choose, or fit, the parameters \theta?
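As a minimal sketch of the hypothesis above (the function names here are my own, not from the course), h_{\theta }(x)=\frac{1}{1+e^{-\theta ^{T}x}} can be written as:

```python
import math

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real z into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    # Hypothesis h_theta(x) = g(theta^T x); x includes the bias feature x_0 = 1
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Example: one real feature plus the constant feature x_0 = 1
theta = [0.0, 1.0]
x = [1.0, 0.0]      # theta^T x = 0, so h = 0.5
print(h(theta, x))  # 0.5
```

Because the sigmoid output is always strictly between 0 and 1, it can be read as an estimated probability that y = 1.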

Back when we were developing the linear regression model, we used the following cost function. I've written this slightly differently: instead of \frac{1}{2m}, I've taken the \frac{1}{2} and put it inside the summation. Now I want to use an alternative way of writing out this cost function: instead of writing out the squared error here, let's write cost(h_{\theta }(x^{(i)}),y^{(i)}), and define that term as cost(h_{\theta }(x^{(i)}),y^{(i)})=\frac{1}{2}(h_{\theta }(x^{(i)})-y^{(i)})^{2}. It's just equal to \frac{1}{2} of the squared error. So now we can see more clearly that the cost function is \frac{1}{m} times the sum over the training set of this cost term. And to simplify the equation a little bit more, it's going to be convenient to get rid of those superscripts, so just define cost(h_{\theta }(x),y)=\frac{1}{2}(h_{\theta }(x)-y)^{2}. The interpretation of this cost function is that it is the cost I want my learning algorithm to pay if its prediction is h_{\theta }(x) and the actual label is y. And, no surprise, for linear regression that cost is \frac{1}{2} times the squared difference between what we predicted and the actual value we observed for y. Now, this cost function worked fine for linear regression, but here we're interested in logistic regression. If we could minimize this cost function plugged into J here, that would work okay. But it turns out that if we use this particular cost function, J would be a non-convex function of the parameters \theta. Here's what I mean by non-convex. We have some cost function J(\theta ), and for logistic regression the function h_{\theta }(x) has a nonlinearity, since h_{\theta }(x)=\frac{1}{1+e^{-\theta ^{T}x}}, so it's a pretty complicated nonlinear function.
And if you take the sigmoid function and plug it in here, and then take this cost function and plug it in there, and plot what J(\theta ) looks like, you find that J(\theta ) can look like a function with many local optima; the formal term for this is a non-convex function. And you can kind of tell that if you run gradient descent on this sort of function, it is not guaranteed to converge to the global minimum. In contrast, what we would like is a cost function J(\theta ) that is convex, a single bowl-shaped function, so that if we run gradient descent, we are guaranteed it converges to the global minimum. The problem with using the squared cost function is that, because of the very nonlinear sigmoid function that appears in the middle, J(\theta ) ends up being non-convex. So what we'd like to do instead is come up with a different cost function that is convex, so that we can apply an algorithm like gradient descent and be guaranteed to find the global minimum.
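To make the convexity difference concrete, here is a small numerical check of my own (not from the lecture), using a single training example with one feature x = 1 and label y = 1. A convex function must satisfy the midpoint inequality f((a+b)/2) \le (f(a)+f(b))/2 for every a and b; the squared-error cost of the sigmoid violates it, while the log cost introduced below satisfies it at the same points:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_cost(theta, y=1.0):
    # (1/2)(h - y)^2 with a single feature x = 1, so h = sigmoid(theta)
    return 0.5 * (sigmoid(theta) - y) ** 2

def log_cost(theta):
    # -log(h) for the y = 1 case
    return -math.log(sigmoid(theta))

a, b = -10.0, 0.0
mid = (a + b) / 2.0

# Squared cost: midpoint value EXCEEDS the chord average -> not convex
print(squared_cost(mid) > (squared_cost(a) + squared_cost(b)) / 2)  # True
# Log cost: midpoint value stays below the chord average here
print(log_cost(mid) <= (log_cost(a) + log_cost(b)) / 2)             # True
```

A single violated midpoint inequality is enough to prove non-convexity, which is exactly what happens to the squared cost once the sigmoid is plugged in.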

Here's the cost function that we're going to use for logistic regression. We define the cost as the penalty that the algorithm pays if it outputs a value h_{\theta }(x), which is some number like 0.7, and the actual label turns out to be y. The cost is -log(h_{\theta }(x)) if y=1, and -log(1-h_{\theta }(x)) if y=0. This looks like a pretty complicated function, but let's plot it to gain some intuition about what it's doing. Let's start with the case of y=1. If y is equal to 1, then the cost is -log(h_{\theta }(x)). Say the horizontal axis is h_{\theta }(x). We know the hypothesis outputs a value between 0 and 1, so h_{\theta }(x) varies between 0 and 1. If you plot this cost function, you find that it looks like this. First, notice that if y=1 and h_{\theta }(x)=1, in other words, if the hypothesis predicts exactly h_{\theta }(x)=1 and y is exactly equal to what it predicted, then the cost is 0. But notice also that as h_{\theta }(x) approaches 0, the cost blows up and goes to infinity. This captures the intuition that if the hypothesis outputs 0, it is saying the chance of y=1 is 0. It's like going to our medical patient and saying, "The probability that you have a malignant tumor, the probability that y=1, is 0", that is, it's absolutely impossible for your tumor to be malignant. If it turns out that the patient's tumor actually is malignant, so y is equal to 1 even though we said the probability of that was 0, then we told them this with total certainty and turned out to be wrong, so we penalize the learning algorithm with a very, very large cost. That's captured by having the cost go to infinity as h_{\theta }(x) approaches 0 when y=1.
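Numerically, the y=1 branch behaves just as described (a quick sketch of my own, not from the lecture):

```python
import math

# Cost for the y = 1 branch: -log(h).
# As h -> 1 the cost shrinks toward 0; as h -> 0 it blows up.
for h in [0.999, 0.9, 0.5, 0.1, 0.001]:
    print(h, -math.log(h))
```

The cost is tiny for confident correct predictions and grows without bound as the predicted probability of y=1 shrinks toward 0.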
That covers the case of y=1. Let's look at what the cost function looks like for y=0.

If y is equal to 0, then the cost is this expression over here, -log(1-h_{\theta }(x)). If you plot this function, with the horizontal axis again going from 0 to 1, you find that the cost looks like this: it blows up, going to plus infinity, as h_{\theta }(x) goes to 1. That's because if y turns out to be equal to 0, but we predicted y=1 with almost complete certainty, with probability 1, then we end up paying a very large cost. Conversely, if h_{\theta }(x) is equal to 0 and y equals 0, then the hypothesis nailed it: it predicted y=0, and y really is 0, so at this point the cost is 0.
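Putting both branches together, the per-example cost can be sketched as follows (a minimal illustration; the function name is my own):

```python
import math

def cost(h, y):
    # Per-example logistic regression cost:
    #   -log(h)      if y = 1  (blows up as h -> 0)
    #   -log(1 - h)  if y = 0  (blows up as h -> 1)
    if y == 1:
        return -math.log(h)
    else:
        return -math.log(1.0 - h)

print(cost(0.9999, 1))  # near 0: confident, correct prediction for y = 1
print(cost(0.0001, 0))  # near 0: confident, correct prediction for y = 0
print(cost(0.9999, 0))  # large: confidently predicted y = 1, but y = 0
```

Each branch penalizes confident wrong answers without bound and charges almost nothing for confident right ones.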

In this video, we have defined the cost function for a single training example. The topic of convexity analysis is beyond the scope of this course, but it is possible to show that with this particular choice of cost function we get a convex optimization problem: the overall cost function J(\theta ) will be convex and free of local optima. In the next video, we'll take these ideas of the cost function for a single training example, develop them further, and define the cost function for the entire training set; we'll also figure out a simpler way to write it than we have been using so far. Based on that, we'll work out gradient descent, and that will give us the logistic regression algorithm.

<end>



Reposted from blog.csdn.net/edward_wang1/article/details/104742400