Linear Regression with multiple variables - Normal equation

Abstract: This article is the transcript of Lesson 33, "Normal Equation", from Chapter 5, "Linear Regression with Multiple Variables", of Andrew Ng's Machine Learning course. I wrote it down while working through the videos and lightly edited it to make it more concise and easier to read, so that I can refer back to it later, and I am sharing it here. If you find any errors, corrections are sincerely welcome, and I hope it is helpful for your own study.
 

In this video (article), we'll talk about the normal equation, which, for some linear regression problems, will give us a much better way to solve for the optimal value of the parameters \theta.

Concretely, so far the algorithm that we've been using for linear regression is gradient descent, where in order to minimize the cost function J(\theta ), we would take this iterative algorithm that takes many steps, multiple iterations of gradient descent, to converge to the global minimum. In contrast, the normal equation gives us a method to solve for \theta analytically, so that rather than needing to run this iterative algorithm, we can instead just solve for the optimal value of \theta all in one go, so that in basically one step you get to the optimal value right there. It turns out that the normal equation method has some advantages and some disadvantages, but before we get to that and talk about when you should use it, let's get some intuition about what this method does.

For this explanatory example, let's take a very simplified cost function J(\theta ) that's just a function of a real number \theta. So, for now, imagine that \theta is just a scalar value, just a real number rather than a vector. Imagine that we have a cost function J that's a quadratic function of this real-valued parameter \theta, so J(\theta ) looks like that. Well, how do you minimize a quadratic function? For those of you that know a little bit of calculus, you may know that the way to minimize a function is to take derivatives and to set the derivatives equal to zero. So we take the derivative of J with respect to the parameter \theta, we get some formula which I am not going to derive, we set that derivative to zero, and this allows us to solve for the value of \theta that minimizes J(\theta ). That was the simpler case, where \theta was just a real number. In the problem that we are interested in, \theta is no longer just a real number, but instead is this (n+1)-dimensional parameter vector, and the cost function J is a function of this vector value, that is, of \theta _{0} through \theta _{n}. And the cost function looks like this, the squared-error cost function on the right. How do you minimize this cost function J? Calculus actually tells us that one way to do so is to take the partial derivative of J with respect to every parameter \theta _{j} in turn, and then to set all of these to 0. If you do that, and you solve for the values of \theta _{0}, \theta _{1} up to \theta _{n}, then this will give you the value of \theta that minimizes the cost function J. Now, if you actually work through the calculus and solve for the parameters \theta _{0}, \theta _{1} up to \theta _{n}, the derivation ends up being somewhat involved. What I'm going to do in this video (article) is not go through that derivation, which is kind of long and kind of involved, but just tell you what you need to know in order to implement this process, so you can solve for the values of \theta that correspond to where the partial derivatives are equal to zero, or, equivalently, the values of \theta that minimize the cost function J(\theta ). I realize that some of the comments I made may have made sense only to those of you who are a little more familiar with calculus. But if you're less familiar with calculus, don't worry about it. I'm just going to tell you what you need to know in order to implement this algorithm and get it to work.
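As a small worked illustration of the calculus step just described (the specific quadratic below is my own example, not one from the lecture): if J(\theta )=a\theta ^{2}+b\theta +c with a>0, then setting the derivative \frac{\mathrm{d} }{\mathrm{d} \theta }J(\theta )=2a\theta +b equal to zero gives the minimizer \theta =-\frac{b}{2a}. In the multivariate case we instead require \frac{\partial }{\partial \theta _{j}}J(\theta )=0 for every j=0,1,\dots ,n, and the normal equation introduced below is precisely the closed-form solution of this system of equations.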

For the example that I want to use as a running example, let's say that I have m=4 training examples. In order to implement this normal equation method, what I'm going to do is the following. I'm going to take my dataset, so here are my four training examples, and in this case let's assume that these four examples are all the data I have. What I am going to do is take my dataset and add an extra column that corresponds to my extra feature x_{0}, which always takes on the value of 1. What I'm then going to do is construct a matrix called X, a matrix that basically contains all of the features from my training data. So, concretely, here are my features, and we're going to take all those numbers and put them into this matrix "X", okay? We just copy the data over one column at a time, and then I am going to do something similar for the ys. I am going to take the values that I'm trying to predict and construct a vector, like so, and call that vector y. So X is going to be an m\times (n+1)-dimensional matrix, and y is going to be an m-dimensional vector, where m is the number of training examples and n is the number of features; it's n+1 because of this extra feature x_{0} that I added. Finally, if you take your matrix X and your vector y and you just compute \theta =(X^{T}X)^{-1}X^{T}y, this would give you the value of \theta that minimizes your cost function. There was a lot that happened on this slide, and I worked through it using one specific example of one dataset. Let me just write this out in a slightly more general form, and then later on in this video (article) let me explain this equation a little bit more, in case it is not yet entirely clear how to do this.
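Here is a minimal Octave sketch of exactly this construction. The feature and label values are made-up placeholders for illustration only, not the dataset from the lecture slide:

```octave
% Hypothetical m = 4 training examples with a single feature (placeholder numbers).
feature = [3; 5; 8; 12];        % one feature value per training example
labels  = [7; 11; 17; 25];      % the values we are trying to predict

m = length(labels);             % number of training examples, m = 4
X = [ones(m, 1), feature];      % design matrix: extra column x0 = 1, so X is m x (n+1) = 4 x 2
y = labels;                     % y is an m-dimensional column vector
```

With X and y in hand, all that remains is the single computation \theta =(X^{T}X)^{-1}X^{T}y from the paragraph above; the Octave command that evaluates it is shown a little further down, where the article introduces the pinv notation.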

In the general case, let us say we have m training examples, so (x^{(1)},y^{(1)}) up to (x^{(m)},y^{(m)}), and n features. So each of the training examples x^{(i)} may look like a vector like this, an (n+1)-dimensional feature vector. The way I'm going to construct the matrix "X", which is also called the design matrix, is as follows. Each training example gives me a feature vector like this, an (n+1)-dimensional vector, and I'm going to construct the matrix like this. I take the first training example, which is a vector, take its transpose so it ends up being this long flat thing, and make (x^{(1)})^{T} the first row of my design matrix X. Then I take my second training example x^{(2)}, take the transpose of that and put it as the second row of X, and so on, down until my last training example: take the transpose of that, and that's the last row of my matrix X. And so this makes my matrix X an m\times (n+1)-dimensional matrix. As a concrete example, let's say I have only one feature, really, only one feature other than x_{0}, which is always equal to 1. So my feature vectors x^{(i)} are equal to this: 1, which is x_{0}, and then some real feature, like maybe the size of the house. Then my design matrix X would be equal to this: for the first row, I basically take this and take its transpose, so I end up with 1 and then x^{(1)}_{1}. For the second row we end up with 1 and then x^{(2)}_{1}, and so on, down to 1 and x^{(m)}_{1}. And this will be an m\times 2-dimensional matrix. So that's how to construct the matrix X. As for the vector y, sometimes I might write an arrow on top to denote that it is a vector, but very often I'll just write it as y, either way. The vector y is obtained by taking all the labels, all the correct prices of houses in my training set, and just stacking them up into an m-dimensional vector, and that's y. Finally, having constructed the matrix X and the vector y, we then just compute \theta =(X^{T}X)^{-1}X^{T}y.
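Written out explicitly, in the notation this article has been using (this display is my own rendering of the construction just described, not copied from the slide), the design matrix, the label vector, and the normal equation are

X=\begin{bmatrix} (x^{(1)})^{T}\\ (x^{(2)})^{T}\\ \vdots \\ (x^{(m)})^{T} \end{bmatrix},\qquad y=\begin{bmatrix} y^{(1)}\\ y^{(2)}\\ \vdots \\ y^{(m)} \end{bmatrix},\qquad \theta =(X^{T}X)^{-1}X^{T}y,

and, in the one-feature example above,

X=\begin{bmatrix} 1 & x^{(1)}_{1}\\ 1 & x^{(2)}_{1}\\ \vdots & \vdots \\ 1 & x^{(m)}_{1} \end{bmatrix},

which is indeed m\times 2.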

I just want to make sure that this equation makes sense to you and you know how to implement it. So, concretely, what is this (X^{T}X)^{-1}? Well, it is the inverse of the matrix X^{T}X. Concretely, if you were to set A=X^{T}X, so X^{T} is a matrix and X^{T}X gives you another matrix, we call that matrix A. Then (X^{T}X)^{-1} is just taking this matrix A and inverting it, giving, let's say, A^{-1}. And so that's how you compute this thing: you compute X^{T}X and then you compute its inverse. We haven't yet talked about Octave; we'll do so in a later set of videos (articles), but in the Octave programming language (and the MATLAB programming language, which is very similar), the command to compute this quantity, (X^{T}X)^{-1}X^{T}y, is as follows. In Octave, X^{'} (X prime) is the notation that you use to denote X^{T}. And so the expression that's boxed in red, X^{'}*X, is computing X^{T}X. pinv is a function for computing the inverse of a matrix, so pinv(X^{'}*X) computes (X^{T}X)^{-1}, and then you multiply that by X^{T}, and you multiply that by y. So you end up computing that formula, which I didn't prove, but it is possible to show mathematically, even though I'm not going to do so here, that this formula gives you the optimal value of \theta, in the sense that if you set \theta equal to this, that's the value of \theta that minimizes the cost function J(\theta ) for linear regression. One last detail: in an earlier video (article), I talked about feature scaling and the idea of getting features to be on similar ranges of values to each other. If you are using this normal equation method, then feature scaling isn't actually necessary; it is actually okay if, say, some feature x_{1} is between 0 and 1, some feature x_{2} ranges from 0 to 1000, and some feature x_{3} ranges from 0 to 10^{-5}. If you are using the normal equation method, this is okay and there is no need to do feature scaling, although of course if you're using gradient descent, then feature scaling is still important.
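Putting the command quoted above into runnable form (the prediction at the end is my own added illustration and assumes X and y were built as in the earlier sketch, with a single feature):

```octave
% Normal equation, exactly the Octave expression described above:
theta = pinv(X' * X) * X' * y;   % computes (X^T X)^{-1} X^T y

% Added illustration: use the learned theta to predict on a new example.
% x_new is an (n+1)-dimensional column vector whose first entry is x0 = 1.
x_new      = [1; 10];            % hypothetical new input with feature value 10
prediction = theta' * x_new;     % hypothesis h_theta(x) = theta^T x
```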

Finally, when should you use gradient descent and when should you use the normal equation method? Here are some of their advantages and disadvantages. Let's say you have m training examples and n features. One disadvantage of gradient descent is that you need to choose the learning rate \alpha. And often this means running it a few times with different learning rates \alpha and seeing what works best, so that is sort of extra work and extra hassle. Another disadvantage of gradient descent is that it needs many more iterations, so, depending on the details, that could make it slower, although there's more to the story, as we'll see in a second. As for the normal equation, you don't need to choose any learning rate \alpha, so that, you know, makes it really convenient, makes it simple to implement; you just run it and it usually just works. And you don't need to iterate, so you don't need to plot J(\theta ), check for convergence, or take all those extra steps. So far, the balance seems to favor the normal equation. Here are some disadvantages of the normal equation, and some advantages of gradient descent. Gradient descent works pretty well even when you have a very large number of features. So even if you have millions of features, you can run gradient descent and it will be reasonably efficient; it will do something reasonable. In contrast, with the normal equation, in order to solve for the parameters \theta, we need to compute this term, (X^{T}X)^{-1}. This matrix X^{T}X is an n\times n matrix if you have n features, because if you look at the dimension of X^{T} and the dimension of X, you can figure out what the dimension of the product is: the matrix X^{T}X is an n\times n matrix, where n is the number of features. And for most implementations, the cost of inverting a matrix grows roughly as the cube of the dimension of the matrix, so computing this inverse costs roughly on the order of n^{3} time. Sometimes it's slightly faster than n^{3}, but that's close enough for our purposes. So if n, the number of features, is very large, then computing this quantity can be slow, and the normal equation method can actually be much slower. In that case, if n is large, I might usually use gradient descent, because we don't want to pay this order-n^{3} time. But if n is relatively small, then the normal equation might give you a better way to solve for the parameters. What do small and large mean? Well, if n is on the order of a hundred, then inverting a hundred-by-hundred matrix is no problem by modern computing standards. If n is a thousand, I would still use the normal equation method; inverting a thousand-by-thousand matrix is actually really fast on a modern computer. If n is ten thousand, then I might start to wonder: inverting a ten-thousand-by-ten-thousand matrix starts to get kind of slow, and I might then start to lean in the direction of gradient descent, but maybe not quite; with n equal to ten thousand, you can still invert a ten-thousand-by-ten-thousand matrix. But if it gets much bigger than that, then I would probably use gradient descent. So if n equals 10^{6}, with a million features, then inverting a million-by-million matrix is going to be very expensive, and I would definitely favor gradient descent if you have that many features. So exactly how large the number of features has to be before you switch to gradient descent is hard to pin down with a strict number.
But for me it is usually around ten thousand that I might start to consider switching over to gradient descent, or maybe some other algorithms that we'll talk about later in this class. To summarize, so long as the number of features is not too large, the normal equation gives us a great alternative method to solve for the parameter \theta. Concretely, so long as the number of features is less than 1000, I would usually just use the normal equation method rather than gradient descent.
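As a rough illustration of the trade-off described above, here is a sketch (my own, not from the lecture) that fits the same hypothetical data from the earlier example both ways: batch gradient descent, which needs a learning rate \alpha and many iterations, and the one-shot normal equation. The feature is scaled here so that a single \alpha works well for gradient descent; with that, the two results should agree closely.

```octave
% Same hypothetical X (m x (n+1), first column of ones) and y as in the earlier sketch.
m = length(y);

% Scale the feature column for a fair side-by-side run (gradient descent benefits from this;
% the normal equation does not need it, but works on scaled features just as well).
Xs = X;
Xs(:, 2) = (X(:, 2) - mean(X(:, 2))) / std(X(:, 2));

% --- Normal equation: one shot, no learning rate, no iterations ---
theta_ne = pinv(Xs' * Xs) * Xs' * y;

% --- Batch gradient descent: choose alpha, then iterate many times ---
alpha      = 0.1;                     % learning rate, picked by trial and error
iterations = 1000;
theta_gd   = zeros(size(Xs, 2), 1);   % initialize theta to zeros

for iter = 1:iterations
  gradient = (1 / m) * Xs' * (Xs * theta_gd - y);   % vector of partial derivatives of J(theta)
  theta_gd = theta_gd - alpha * gradient;           % simultaneous update of all theta_j
end

disp([theta_ne, theta_gd]);   % the two columns should be (nearly) identical
```

For a problem this small, both approaches are instantaneous; the n^{3} cost of the matrix inversion only becomes the deciding factor when the number of features grows into the thousands or beyond, as discussed above.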

To preview some ideas that we'll talk about later in this course: as we get to more complex learning algorithms, for example when we talk about classification algorithms like the logistic regression algorithm, we'll see that for those more sophisticated learning algorithms the normal equation method actually doesn't work, and we will have to resort to gradient descent. So gradient descent is a very useful algorithm to know, both for linear regression when we have a large number of features and for some of the other algorithms that we'll see in this course, because for them the normal equation method just doesn't apply and doesn't work. But for this specific model of linear regression, the normal equation can give you an alternative that can be much faster than gradient descent. So, depending on the details of your algorithm, depending on the details of the problem and how many features you have, both of these algorithms are well worth knowing about.

<end>



Reposted from blog.csdn.net/edward_wang1/article/details/103820201