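Machine Learning, Week 3: Logistic Regression (Classification), Decision Boundary, Logistic Regression Cost Function, Multiclass Classification, and Regularization (for Logistic and Linear Regression)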

Classification

Linear regression is a poor fit for classification problems: its output is not confined to \([0, 1]\).

Instead, we can use logistic regression, which is a classification algorithm.

To get \(0\le h_{\theta}(x) \le 1\), we only need the sigmoid function (also called the logistic function):
\[ \large h_\theta(x) = g(\theta^Tx), \quad\text{where}\;g(z) =\frac{1}{1+e^{-z}} \]
Interpretation: \(h_\theta(x)\) is the estimated probability that \(y = 1\) on input \(x\).

Note: when \(z=0\), \(g(z)\) is exactly 0.5.
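As a minimal sketch of the hypothesis (in Python rather than the Octave used in the course; the names `sigmoid` and `hypothesis` are illustrative, not from the lecture), the function can be written so that large negative inputs do not overflow:

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must already include x_0 = 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

print(sigmoid(0))  # 0.5
```

The output always lies in \([0, 1]\), which is what lets it be read as a probability.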

Decision Boundary

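\(h_\theta(x) = P(y=1 \mid x;\theta)\) (\(P\) is the predicted probability)

In the lecture example, predict \(y=1\) if \(h_\theta(x) \ge 0.5\); otherwise predict \(y=0\).

Suppose \(\theta = \begin{bmatrix}-3\\ 1\\ 1 \end{bmatrix}\); then \(h_\theta(x)=g(-3+x_1+x_2)\).

Since \(g(z) \ge 0.5\) exactly when \(z \ge 0\): "\(y=1\)" \(\Leftrightarrow\) "\(h_\theta(x) \ge 0.5\)" \(\Leftrightarrow\) "\(\theta^Tx \ge 0\)" \(\Leftrightarrow\) "\(-3+x_1+x_2 \ge 0\)"

This gives "\(y=1\)" \(\Leftrightarrow\) "\(x_1+x_2 \ge 3\)".

Whether \(x_1+x_2\) reaches \(3\) determines the value of \(y\); the line \(x_1+x_2=3\) is the decision boundary.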
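A small Python sketch of this rule, using the example parameters \(\theta = [-3, 1, 1]\) from the text (the function name `predict` is an illustrative choice). It tests the sign of \(\theta^Tx\) directly, which is equivalent to thresholding \(h_\theta(x)\) at 0.5:

```python
def predict(theta, x):
    """Return 1 if theta^T x >= 0 (i.e. h_theta(x) >= 0.5), else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]                 # example parameters from the text
print(predict(theta, [1.0, 1.0, 1.0]))   # x1 + x2 = 2 < 3  -> 0
print(predict(theta, [1.0, 2.0, 2.0]))   # x1 + x2 = 4 >= 3 -> 1
```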

Extending to non-linear decision boundaries:

We can also have: predict "\(y=1\)" if \(-1+x_1^2+x_2^2 \ge 0\) (\(\theta = \begin{bmatrix}-1\\ 0\\ 0 \\ 1\\ 1 \end{bmatrix},\;x = \begin{bmatrix}x_0\\ x_1\\ x_2\\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix}1\\ x_1\\ x_2\\ x_1^2 \\ x_2^2 \end{bmatrix}\))

Different choices of \(\theta\) and different constructions of \(x\) yield decision boundaries of all kinds of shapes.

The decision boundary is a property of the parameters \(\theta\), not of the training set.

The training set is only used to fit the parameters \(\theta\).
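The circular boundary can be sketched in Python as follows. The helper names `map_features` and `predict_circle` are hypothetical; the parameters are the ones from the example, \(\theta = [-1, 0, 0, 1, 1]\):

```python
def map_features(x1, x2):
    """Build x = [1, x1, x2, x1^2, x2^2] as in the example."""
    return [1.0, x1, x2, x1 * x1, x2 * x2]

def predict_circle(x1, x2, theta=(-1.0, 0.0, 0.0, 1.0, 1.0)):
    """Predict y = 1 when theta^T x >= 0, i.e. x1^2 + x2^2 >= 1."""
    z = sum(t * f for t, f in zip(theta, map_features(x1, x2)))
    return 1 if z >= 0 else 0

print(predict_circle(0.5, 0.5))  # inside the unit circle  -> 0
print(predict_circle(1.0, 1.0))  # outside the unit circle -> 1
```

The model is still linear in \(\theta\); only the feature vector changed, which is why the same fitting procedure applies.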

Cost Function
\[ \begin{align} &J(\theta) =\frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})\end{align} \]
In linear regression, the cost used earlier was \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x)-y)^2\).

That choice does not carry over. Once the hypothesis \(h_\theta(x)\) is no longer a linear function, using \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x)-y)^2\) makes \(J(\theta)\) non-convex with many local optima, instead of the convex function we want.

Logistic Regression Cost Function
\[ Cost(h_\theta(x),y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1-h_\theta(x)) & \text{if } y = 0 \end{cases} \]
When \(h_\theta(x)=y\), \(Cost(h_\theta(x),y)=0\).

When \(y=1\) and \(h_\theta(x)\rightarrow0\), \(Cost \rightarrow \infty\); this happens as \(\theta^Tx \rightarrow -\infty\).

When \(y=0\) and \(h_\theta(x)\rightarrow1\), \(Cost \rightarrow \infty\); this happens as \(\theta^Tx \rightarrow \infty\).

This guarantees that adjusting \(\theta\) to lower the cost pushes \(h_\theta(x)\) toward \(y\), so the predictions agree better with the actual labels.

The \(Cost\) function above can also be written as:
\[ Cost(h_\theta(x),y) = -y\cdot \log(h_\theta(x))-(1-y)\cdot \log(1-h_\theta(x)) \]
This is equivalent to the cases form above.
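The equivalence can be checked by substituting the only two values \(y\) can take; in each case one of the two terms vanishes:
\[ \begin{align} y=1:&\quad -1\cdot \log(h_\theta(x)) - 0\cdot \log(1-h_\theta(x)) = -\log(h_\theta(x)) \\ y=0:&\quad -0\cdot \log(h_\theta(x)) - 1\cdot \log(1-h_\theta(x)) = -\log(1-h_\theta(x)) \end{align} \]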

So:
\[ \begin{align} J(\theta) &=\frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})\\ &= -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\cdot \log(h_\theta(x^{(i)}))+(1-y^{(i)})\cdot \log(1-h_\theta(x^{(i)}))\right] \end{align} \]
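A minimal Python sketch of \(J(\theta)\), assuming each row of `X` already contains \(x_0 = 1\). The toy data is made up for illustration, and this simple `sigmoid` is only suitable for moderate \(|z|\):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)).

    X is a list of feature rows that already include x_0 = 1;
    y is a list of 0/1 labels.
    """
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / m

# Toy data (made up for illustration): one feature plus the bias term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
print(cost([0.0, 0.0], X, y))   # every h is 0.5, so J = log(2)
```

With \(\theta = 0\) every prediction is 0.5, so the cost is \(\log 2\); parameters that separate the two classes give a lower value.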
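The general form of the gradient descent update is the same as for linear regression (it differs once \(h_\theta(x)\) is expanded, since here \(h_\theta(x)=g(\theta^Tx)\)):
\[ \begin{align}&\text{Repeat\{} \\ &\qquad\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \qquad \text{(simultaneously update all } \theta_j \text{)}\\ &\} \end{align} \]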
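The update rule can be sketched in Python as batch gradient descent with the \(\frac{1}{m}\) factor that follows from differentiating \(J(\theta)\). The learning rate, iteration count and toy data are illustrative assumptions, not values from the lecture:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent for logistic regression.

    X rows include x_0 = 1. alpha and iterations are illustrative defaults.
    """
    m, n = len(y), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
                  for xi, yi in zip(X, y)]
        # Compute every partial derivative first, then update simultaneously.
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
                for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta = gradient_descent(X, y)
predictions = [1 if sigmoid(theta[0] + theta[1] * xi[1]) >= 0.5 else 0 for xi in X]
print(predictions)
```

Building the whole `grad` list before touching `theta` is what makes the update simultaneous.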

Other Optimization Algorithms

  • Conjugate gradient
  • BFGS (Broyden–Fletcher–Goldfarb–Shanno algorithm)
  • L-BFGS (Limited-memory BFGS)

Advantages:

  • No need to manually pick \(\alpha\)
  • Often faster than gradient descent

Disadvantage:

  • More complex

Implementing these yourself is not recommended, but you can simply call a library routine:

%{
% Template for the cost function: returns the cost J(theta) in 'jVal'
% and the vector of partial derivatives in 'gradient'.
function [jVal, gradient] = costFunction(theta)
    jVal = [code to compute J(theta)];
    gradient = zeros(n+1, 1);
    gradient(1) = [code to compute ∂J(theta)/∂theta_0];
    gradient(2) = [code to compute ∂J(theta)/∂theta_1];
    ...
    gradient(n+1) = [code to compute ∂J(theta)/∂theta_n];   % Octave indices start at 1
end
%}

% 'GradObj' 'on' tells fminunc that costFunction also returns the gradient;
% 'MaxIter' takes a number, not a string.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);   % n+1 = 2 parameters in this example
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

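Multiclass Classification:

Use the one-vs-all (one-vs-rest) idea.

For each class, split the data into "this class" and "all remaining classes combined", then fit a classifier for this class with the binary classification method from the earlier lectures.

(A classifier here is just a hypothesis.)

This yields \(n\) classifiers, where \(n\) is the total number of classes and \(y\) is the class label:
\[ h_\theta^{(i)}(x) = P(y=i|x;\theta)\qquad (i=1,2,3,\dots,n) \]
That is, given \(x\) and \(\theta\), \(h_\theta^{(i)}(x)\) computes the probability that the class is \(i\).

Then, for a new input \(x\), predict the class \(i\) that maximizes \(h_\theta^{(i)}(x)\), i.e. \(\arg\max_i h_\theta^{(i)}(x)\).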
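The prediction step can be sketched in Python as below. The three parameter vectors are hypothetical, hand-picked values standing in for classifiers that would normally be fitted one class at a time; `predict_one_vs_all` is an illustrative name:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Return the class (1..n) whose classifier gives the highest probability.

    thetas[i-1] holds the parameters of classifier i; x includes x_0 = 1.
    """
    probs = [sigmoid(sum(t * v for t, v in zip(theta, x))) for theta in thetas]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best + 1

# Hypothetical, hand-picked parameters (not fitted) for three classes.
thetas = [
    [-1.0, 2.0, -1.0],   # class 1: fires when x1 is large
    [-1.0, -1.0, 2.0],   # class 2: fires when x2 is large
    [1.0, -1.0, -1.0],   # class 3: fires when both are small
]
print(predict_one_vs_all(thetas, [1.0, 3.0, 0.0]))  # 1
print(predict_one_vs_all(thetas, [1.0, 0.0, 3.0]))  # 2
print(predict_one_vs_all(thetas, [1.0, 0.0, 0.0]))  # 3
```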

Regularization

Regularization addresses overfitting, a problem also described as high variance.

Overfitting is caused by too many features combined with too little training data.

If we have too many features, the learned hypothesis may fit the training set very well (\(J(\theta) \approx 0\)) but fail to generalize to new examples.

Generalize: how well a hypothesis applies even to new examples.

Options to address overfitting:

  • Reduce the number of features:
    • Manually select which features to keep
    • Model selection algorithm
  • Regularization:
    • Keep all features, but reduce the magnitude/values of the parameters \(\theta_j\)
    • Works well when we have a lot of features, each of which contributes a bit to predicting \(y\)

Regularized Linear Regression

The idea behind regularization:

Small values for parameters \(\theta_0, \theta_1,\dots,\theta_n\):

  • "Simpler" hypothesis
  • Less prone to overfitting

That is, make the overly influential \(\theta_j\) very small. For example, with \(\theta_3, \theta_4 \approx 0\): \(\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4 \approx \theta_0 + \theta_1x + \theta_2x^2\)

Gradient Descent

The regularization is not applied inside \(h_\theta(x)\) but in the cost function \(J(\theta)\):
\[ \large J(\theta) =\frac{1}{2m} \left[\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 + \lambda\sum_{j=1}^{n}\theta_j^2 \right] \]
Note that the added term (the regularization term) starts from \(j=1\): it shrinks every parameter except \(\theta_0\). \(\lambda\) is called the regularization parameter and controls the trade-off between two different goals.

The two \(\sum\) terms in this cost function represent those two goals:

  • Fit the training data well
  • Keep the parameters small

Smaller parameter values give a simpler hypothesis, which helps avoid overfitting.

Note: \(\lambda\) must not be too large; otherwise \(\theta_1,\dots ,\theta_n \approx 0\) and the hypothesis fails to fit even the training set (too high bias, i.e. underfitting).
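A short Python sketch of the regularized cost for linear regression, on made-up toy data, shows how the penalty adds to the cost even when the fit is exact (`regularized_cost` and `lam` are illustrative names):

```python
def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [sum of squared errors + lam * sum_{j>=1} theta_j^2]."""
    m = len(y)
    sq_err = sum((sum(t * v for t, v in zip(theta, xi)) - yi) ** 2
                 for xi, yi in zip(X, y))
    penalty = lam * sum(t * t for t in theta[1:])   # theta_0 is not penalized
    return (sq_err + penalty) / (2 * m)

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta = [0.0, 2.0]                          # fits the toy data exactly
print(regularized_cost(theta, X, y, 0.0))   # 0.0
print(regularized_cost(theta, X, y, 3.0))   # 2.0
```

With \(\lambda = 3\) the penalty is \(3 \cdot 2^2 = 12\), and dividing by \(2m = 6\) gives 2.0.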

\[ \begin{align} &\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\ &\qquad \theta_{0}\; \text{:= } \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\ &\qquad \theta_{j}\; \text{:= } \theta_{j} - \alpha[\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j] \qquad (j = 1,2...,n)\\ &\} \end{align} \]
Equivalently, collecting the \(\theta_j\) terms:
\[ \begin{align} &\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\ &\qquad \theta_{0}\; \text{:= } \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\ &\qquad \theta_{j}\; \text{:= } \theta_{j}(1-\alpha\frac{\lambda}{m}) - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}\qquad (j = 1,2...,n)\\ &\} \end{align} \]
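One step of this update can be sketched in Python. The factor \(1-\alpha\frac{\lambda}{m}\) is slightly below 1, so each step first shrinks \(\theta_j\) and then applies the usual gradient step. The function name and the values of `alpha` and `lam` are illustrative only:

```python
def step_regularized(theta, X, y, alpha, lam):
    """One regularized gradient-descent step for linear regression."""
    m = len(y)
    errors = [sum(t * v for t, v in zip(theta, xi)) - yi for xi, yi in zip(X, y)]
    new_theta = []
    for j, tj in enumerate(theta):
        grad_j = sum(e * xi[j] for e, xi in zip(errors, X)) / m
        if j == 0:
            new_theta.append(tj - alpha * grad_j)        # theta_0: no shrinkage
        else:
            shrink = 1 - alpha * lam / m                 # multiplicative shrink factor
            new_theta.append(tj * shrink - alpha * grad_j)
    return new_theta

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
# The data is fitted exactly, so only the shrink factor (0.5 here) acts.
print(step_regularized([0.0, 2.0], X, y, alpha=0.5, lam=3.0))  # [0.0, 1.0]
```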
Normal Equation

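Review: the earlier normal equation is \(\theta = (X^TX)^{-1}X^Ty\)

With regularization it becomes \(\theta = (X^TX+\lambda \small{\begin{bmatrix}0 \\&1 \\ &&1\\&&&\ddots\\&&&&1 \end{bmatrix}})^{-1}X^Ty,\quad \large\text{if }\lambda \gt 0\), where the matrix is \((n+1)\times(n+1)\).

On the non-invertible/degenerate matrix issue: without regularization \(X^TX\) may be singular, and Octave's pinv() can still be used to compute a pseudo-inverse.

However, as long as \(\lambda\) is strictly greater than 0, the sum of the two matrices inside the parentheses can be shown to be invertible.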
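A self-contained Python sketch of the regularized normal equation follows. It solves the linear system with a small Gaussian-elimination helper instead of forming an explicit inverse; that is a substitution made for this example, not the lecture's Octave code. The toy data has fewer examples than parameters, so \(X^TX\) alone is singular:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lam * L)^(-1) X^T y, with L = diag(0, 1, ..., 1)."""
    n = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    for j in range(1, n):          # skip index 0 so theta_0 is not penalized
        A[j][j] += lam
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    return solve(A, b)

# Toy case with m = 2 examples and n + 1 = 3 parameters.
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 4.0]]
y = [1.0, 2.0]
print(regularized_normal_equation(X, y, 1.0))
```

For this toy case the exact solution is \(\theta = [3/7,\, 1/7,\, 2/7]\); with \(\lambda = 0\) the same data would lead to a division by zero in `solve`, reflecting the singular \(X^TX\).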

Regularized Logistic Regression

Review: \(J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\, \log\,h_\theta(x^{(i)})+(1-y^{(i)})\, \log\,(1-h_\theta(x^{(i)}))\right]\)

The treatment is the same as for linear regression: append a regularization term \(\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2\) to the cost (the sum runs over the \(n\) parameters, not the \(m\) examples):
\[ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\, \log\,h_\theta(x^{(i)})+(1-y^{(i)})\, \log\,(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2 \]
Gradient descent (the general form is the same as for linear regression; the only difference is again \(h_\theta(x^{(i)})\)):
\[ \begin{align} &\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\ &\qquad \theta_{0}\; \text{:= } \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\ &\qquad \theta_{j}\; \text{:= } \theta_{j} - \alpha[\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j] \qquad (j = 1,2...,n)\\ &\} \end{align} \]
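In Octave the earlier code template still applies; just remember to add the partial derivative of the regularization term, \(\frac{\lambda}{m}\theta_j\), when computing \(\frac{\partial J(\theta)}{\partial \theta_j}\;(\small j=1,2,\dots,n)\).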

%{
% Same template as before: returns the regularized cost J(theta) in 'jVal'
% and the vector of partial derivatives in 'gradient'.
function [jVal, gradient] = costFunction(theta)
    jVal = [code to compute J(theta), including the regularization term];
    gradient = zeros(n+1, 1);
    gradient(1) = [code to compute ∂J(theta)/∂theta_0];   % no regularization for theta_0
    gradient(2) = [code to compute ∂J(theta)/∂theta_1];   % includes (lambda/m)*theta_1
    ...
    gradient(n+1) = [code to compute ∂J(theta)/∂theta_n];   % Octave indices start at 1
end
%}

% 'MaxIter' takes a number, not a string.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);   % n+1 = 2 parameters in this example
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
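For comparison, here is a Python sketch of what the placeholders in such a cost function would compute for regularized logistic regression. It returns the cost and gradient as a pair, mirroring `jVal` and `gradient`; the signature with `X`, `y` and `lam` passed explicitly is an assumption made so the example is self-contained:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_function(theta, X, y, lam):
    """Return (j_val, gradient) for regularized logistic regression."""
    m = len(y)
    h = [sigmoid(sum(t * v for t, v in zip(theta, xi))) for xi in X]
    j_val = -sum(yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
                 for hi, yi in zip(h, y)) / m
    j_val += lam / (2 * m) * sum(t * t for t in theta[1:])
    gradient = []
    for j, tj in enumerate(theta):
        g = sum((hi - yi) * xi[j] for hi, yi, xi in zip(h, y, X)) / m
        if j >= 1:                      # theta_0 is not regularized
            g += lam / m * tj
        gradient.append(g)
    return j_val, gradient

# Toy data (made up for illustration); rows include x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
j_val, gradient = cost_function([0.0, 0.0], X, y, lam=1.0)
print(gradient)   # [0.0, -0.75]
```

A quick sanity check for any such function is to compare the returned gradient with finite differences of the returned cost.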

Reposted from www.cnblogs.com/khunkin/p/10199384.html