Machine Learning Andrew Ng - 2. Linear regression with one variable

2.1 Model representation (模型描述)

[figure]

In supervised learning, we have a data set and this data set is called a training set (训练集).

[figure]

$(x, y)$ : one training example

$(x^{(i)}, y^{(i)})$ : the $i^{th}$ training example

$x^{(1)} = 2104$

$x^{(2)} = 1416$

$y^{(1)} = 460$

$y^{(2)} = 232$

[figure]

Hypothesis (假设函数): $h_\theta(x)$

How do we go about implementing this model?
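
As used in the derivation in Section 2.7, the hypothesis for linear regression with one variable is the linear function

$$h_\theta(x) = \theta_0 + \theta_1 x$$

with parameters $\theta_0$ (intercept) and $\theta_1$ (slope).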

2.2 Cost function

How do we fit the best possible straight line to our data?

[figure]

With different choices of the parameters $\theta_0$ and $\theta_1$, we get different hypotheses, i.e. different hypothesis functions.

[figure]

In linear regression we have a training set, and what we want to do is come up with values for the parameters $\theta_0$ and $\theta_1$ so that the straight line we get out of this somehow fits the data well.

How do we come up with values for $\theta_0$ and $\theta_1$?

In linear regression, what we're going to do is solve a minimization problem: we try to minimize the squared difference between the output of the hypothesis and the actual price of the house.

$$\min_{\theta_0,\theta_1} \sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)^2$$

That is, minimize this squared error: the squared difference between the predicted price of the house and the price it will actually sell for.

$m$ is the size of the training set.

Define a cost function.
$$J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)^2$$

$$\underset{\theta_0,\,\theta_1}{\operatorname{minimize}}\ J(\theta_0,\theta_1)$$

The cost function is also called the squared error function (平方误差函数), or sometimes the squared error cost function (平方误差代价函数).
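
A minimal Python sketch of this cost function (illustrative, not from the course; assumes `x` and `y` are equal-length sequences of numbers):

```python
def compute_cost(x, y, theta0, theta1):
    """Squared error cost J(theta0, theta1) for linear regression with one variable."""
    m = len(x)
    # Sum of squared differences between predictions h(x) = theta0 + theta1*x and targets y.
    total = sum((theta0 + theta1 * xi - yi) ** 2 for xi, yi in zip(x, y))
    return total / (2 * m)
```

For the toy data used in Section 2.3 below, `compute_cost([1, 2, 3], [1, 2, 3], 0, 1)` returns `0.0`, and `compute_cost([1, 2, 3], [1, 2, 3], 0, 0.5)` returns roughly `0.58`.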

[figure]

2.3 Cost function intuition I

[figure]

For $\theta_1 = 1$, we have

[figure]

$$J(\theta_1)=\frac{1}{2m}\sum_{i = 1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)^2 =\frac{1}{2m}\sum_{i = 1}^{m}\bigl(\theta_1 x^{(i)}-y^{(i)}\bigr)^2=\frac{1}{2m}(0^2+0^2+0^2)=0$$

For $\theta_1 = 0.5$, we have

[figure]

$$J(0.5)= \frac{1}{2m}\bigl[(0.5-1)^2+(1-2)^2+(1.5-3)^2\bigr]=\frac{1}{2\cdot 3}\cdot 3.5=\frac{3.5}{6}\thickapprox 0.58$$
For $\theta_1 = 0$, we have

[figure]
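
Using the same three training examples implied by the computation above ($(1,1)$, $(2,2)$, $(3,3)$), the cost for $\theta_1 = 0$ works out to:

$$J(0)=\frac{1}{2\cdot 3}\bigl[(0-1)^2+(0-2)^2+(0-3)^2\bigr]=\frac{14}{6}\thickapprox 2.3$$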

For different values of $\theta_1$, we can compute a range of values of $J(\theta_1)$ and get something like this:

[figure]

Each value of $\theta_1$ corresponds to a different hypothesis, or to a different straight-line fit on the left.

For each value of $\theta_1$ we can then derive a different value of $J(\theta_1)$.

[figure]

We want to choose the value of $\theta_1$ that minimizes $J(\theta_1)$; this is our objective function for linear regression.
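
A small self-contained sketch of this idea: sweep $\theta_1$ over a range of candidate values and pick the one that minimizes $J(\theta_1)$, with $\theta_0$ fixed at 0 as in this section (toy data, for illustration only):

```python
# Toy training set from this section: (1, 1), (2, 2), (3, 3).
x = [1, 2, 3]
y = [1, 2, 3]
m = len(x)

def J(theta1):
    # Cost with theta0 fixed at 0, so h(x) = theta1 * x.
    return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

candidates = [i / 100 for i in range(-50, 251)]   # theta1 from -0.50 to 2.50
best_theta1 = min(candidates, key=J)
print(best_theta1, J(best_theta1))                # 1.0 0.0 -- the line y = x fits exactly
```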

2.4 Cost function intuition II

[figure]

When we have two parameters, it turns out the cost function also has a similar sort of bowl shape. In fact, depending on the training set, we might get a cost function that looks something like this:

[figure]

This is a 3-D surface plot where the axes are labeled $\theta_0$ and $\theta_1$. As you vary the two parameters $\theta_0$ and $\theta_1$, you get different values of the cost function $J(\theta_0, \theta_1)$, and the height of the surface above a particular point $(\theta_0, \theta_1)$ indicates the value of $J(\theta_0, \theta_1)$.

Contour plots (等高线图) are also called contour figures.

[figure]

The axes are $\theta_0$ and $\theta_1$. Each of these ovals (椭圆形), or ellipses, shows a set of points that take on the same value of $J(\theta_0,\theta_1)$.
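
A sketch of how such surface and contour plots can be produced (illustrative only; the data here is made up, not the course's housing data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical toy data, roughly linear.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 2.0, 3.5, 4.0])
m = len(x)

# Evaluate J(theta0, theta1) on a grid of parameter values.
theta0_vals = np.linspace(-2, 4, 100)
theta1_vals = np.linspace(-1, 3, 100)
T0, T1 = np.meshgrid(theta0_vals, theta1_vals)
J = np.zeros_like(T0)
for i in range(m):
    J += (T0 + T1 * x[i] - y[i]) ** 2
J /= 2 * m

fig = plt.figure(figsize=(10, 4))
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax1.plot_surface(T0, T1, J, cmap="viridis")   # bowl-shaped surface
ax2 = fig.add_subplot(1, 2, 2)
ax2.contour(T0, T1, J, levels=30)             # elliptical level sets
ax2.set_xlabel("theta0")
ax2.set_ylabel("theta1")
plt.show()
```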

[figures]

2.5 Gradient descent (梯度下降)

Gradient descent is used not only in linear regression. It’s actually used all over the place in machine learning.

We use gradient descent for minimizing some arbitrary function $J$.

Problem:

[figure]
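
For reference, the setup shown here (restated, since the figure itself is not reproduced) is: we have some function $J(\theta_0, \theta_1)$ and we want

$$\min_{\theta_0,\theta_1} J(\theta_0,\theta_1).$$

The outline of gradient descent: start with some initial $\theta_0, \theta_1$ (say, both zero), and keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0,\theta_1)$ until we hopefully end up at a minimum.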

A property (性质) of gradient descent: starting at one point, we end up at one local optimum (局部最优); but if we had started at a slightly different location, we could have wound up at a very different local optimum.

[figures]

The notation $:=$ denotes assignment (赋值). $a := b$ means: take the value in $b$ and use it to overwrite whatever value is in $a$; that is, set $a$ to be equal to the value of $b$.

$a = b$, on the other hand, is a truth assertion (真假判定).

$\alpha$ is called the learning rate. What $\alpha$ does is basically control how big a step we take downhill with gradient descent. If $\alpha$ is very large, that corresponds to a very aggressive gradient descent procedure, where we take huge steps downhill. If $\alpha$ is very small, we take little baby steps downhill.
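
The update rule being described here, applied repeatedly until convergence (for $j = 0$ and $j = 1$), is

$$\theta_j := \theta_j - \alpha\,\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$$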

How do we set $\alpha$? We will discuss this later.

Simultaneously update $\theta_0$ and $\theta_1$.
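
A minimal sketch of what the simultaneous update means in code. Here `dJ_dtheta` is a hypothetical helper that returns $\partial J/\partial\theta_j$ for a given $j$ at the current parameters:

```python
def gradient_descent_step(theta0, theta1, alpha, dJ_dtheta):
    # Compute both updates from the *current* values of theta0 and theta1 ...
    temp0 = theta0 - alpha * dJ_dtheta(0, theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta(1, theta0, theta1)
    # ... and only then overwrite them, so the two parameters are updated simultaneously.
    return temp0, temp1
```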

[figure]

2.6 Gradient descent intuition

[figure]

In order to convey these intuitions, we use a slightly simpler example where we minimize a function of just one parameter:

$$\min_{\theta_1} J(\theta_1), \qquad \theta_1\in\mathbb{R}$$

[figure]

What if the parameter $\theta_1$ is already at a local minimum?

[figure]

At a local minimum, the derivative is equal to zero, so the gradient descent update leaves $\theta_1$ unchanged.
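
Concretely, with the one-parameter update rule, a zero derivative means the update does nothing:

$$\theta_1 := \theta_1 - \alpha\,\frac{d}{d\theta_1}J(\theta_1) = \theta_1 - \alpha\cdot 0 = \theta_1$$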

[figure]

2.7 Gradient descent for linear regression

We now put gradient descent together with our cost function, and that gives us an algorithm for linear regression, for fitting a straight line to our data.

[figure]

$$
\begin{aligned}
\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)
&= \frac{\partial}{\partial\theta_j}\,\frac{1}{2m}\sum_{i=1}^{m}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)^2\\
&= \frac{\partial}{\partial\theta_j}\,\frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta_0+\theta_1 x^{(i)}-y^{(i)}\bigr)^2
\end{aligned}
$$

For $\theta_0$ ($j=0$):

$$\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^{m}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)$$

For $\theta_1$ ($j=1$):

$$\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^{m}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)\cdot x^{(i)}$$
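
Plugging these derivatives back into the update rule gives the gradient descent algorithm for linear regression. Repeat until convergence, updating both parameters simultaneously:

$$
\begin{aligned}
\theta_0 &:= \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)\\
\theta_1 &:= \theta_1 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)\,x^{(i)}
\end{aligned}
$$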

[figure]

It turns out that the cost function for linear regression is always going to be a bowl-shaped function, which is called a convex function (凸函数).

A convex function doesn't have any local optima, except for the one global optimum.

[figure]
We get this:

[figures]

"Batch" Gradient Descent Algorithm

“Batch” means that each step of gradient descent uses all the training examples.
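
A compact sketch of batch gradient descent for linear regression with one variable (illustrative; the learning rate, iteration count, and example data are arbitrary choices, not from the course):

```python
def batch_gradient_descent(x, y, alpha=0.1, num_iters=1000):
    """Fit h(x) = theta0 + theta1*x by batch gradient descent."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        # "Batch": each step sums the errors over ALL m training examples.
        errors = [theta0 + theta1 * xi - yi for xi, yi in zip(x, y)]
        grad0 = sum(errors) / m
        grad1 = sum(e * xi for e, xi in zip(errors, x)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Example: batch_gradient_descent([1, 2, 3], [1, 2, 3]) approaches (0.0, 1.0),
# i.e. the line y = x.
```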

The normal equations method (正规方程组法): solves for the minimum of the cost function $J$ without needing to use an iterative (迭代) algorithm like gradient descent.

Gradient descent will scale better to larger data sets than the normal equations method.
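
For comparison, a minimal sketch of the normal equation approach using the standard closed-form least-squares solution $\theta = (X^\top X)^{-1}X^\top y$ (stated here without derivation; NumPy-based, not the course's Octave code):

```python
import numpy as np

def normal_equation(x, y):
    """Closed-form least-squares fit for h(x) = theta0 + theta1*x."""
    X = np.column_stack([np.ones_like(x), x])   # prepend a column of 1s for theta0
    # Solve (X^T X) theta = X^T y; np.linalg.solve avoids forming an explicit inverse.
    theta0, theta1 = np.linalg.solve(X.T @ X, X.T @ y)
    return theta0, theta1

# Example: normal_equation(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0]))
# returns approximately (0.0, 1.0).
```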


Reposted from blog.csdn.net/qq_41664688/article/details/103987586