Optimization Methods for Machine Learning [2] -- Newton's Method

1. Newton's Method

Newton's method is also an iterative optimization method. Compared with gradient descent, its search direction uses not only the first-order gradient but also second-order (curvature) information, so it generally converges in fewer iterations.
Suppose we have a function $f(x)$ with $x\in\mathbb{R}$, and let $x_k$ be the current estimate of the minimizer. Taking the second-order Taylor expansion of $f$ at $x_k$:
$$f(x)=f(x_k)+f^\prime(x_k)(x-x_k)+\frac{1}{2}f^{\prime\prime}(x_k)(x-x_k)^2$$
Differentiating both sides with respect to $x$:
$$f^\prime(x)=f^\prime(x_k)+f^{\prime\prime}(x_k)(x-x_k)$$
x x 取极值点的时候, f ( x ) = 0 f^\prime(x)=0 , 因此:
f ( x k ) + f ( x k ) ( x x k ) = 0 f^\prime(x_k)+f^{\prime\prime}(x_k)(x-x_k)=0 可以推导得到:
$$x=x_k-\frac{f^\prime(x_k)}{f^{\prime\prime}(x_k)}$$
Therefore, given an initial point $x_0$, we can construct the following iteration:
$$x_{k+1}=x_k-\frac{f^\prime(x_k)}{f^{\prime\prime}(x_k)},\qquad k=0,1,2,\dots$$
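To make the iteration concrete, here is a minimal Python sketch of this one-dimensional Newton update; the function name, the test function $f(x)=x^4-3x^3+2$, its derivatives, and the starting point are illustrative choices, not from the original post.

```python
# One-dimensional Newton iteration: x_{k+1} = x_k - f'(x_k) / f''(x_k).
def newton_1d(df, d2f, x0, tol=1e-8, max_iter=100):
    x = x0
    for _ in range(max_iter):
        g = df(x)
        if abs(g) < tol:          # stop when the derivative is (numerically) zero
            break
        x = x - g / d2f(x)
    return x

# Example: f(x) = x^4 - 3x^3 + 2, so f'(x) = 4x^3 - 9x^2 and f''(x) = 12x^2 - 18x.
x_min = newton_1d(lambda x: 4 * x**3 - 9 * x**2,
                  lambda x: 12 * x**2 - 18 * x,
                  x0=3.0)
print(x_min)  # converges to 2.25, a local minimizer of f
```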
Generalizing to multiple dimensions, suppose $\mathbf{x}\in\mathbb{R}^N$. The second-order Taylor expansion at $\mathbf{x}_k$ is:
$$f(\mathbf{x})=f(\mathbf{x}_k)+\nabla f(\mathbf{x}_k)^{\mathbf{T}}(\mathbf{x}-\mathbf{x}_k)+\frac{1}{2}(\mathbf{x}-\mathbf{x}_k)^{\mathbf{T}}\nabla^2 f(\mathbf{x}_k)(\mathbf{x}-\mathbf{x}_k)$$
where $\nabla f$ is the gradient vector of $f(\mathbf{x})$ and $\nabla^2 f$ is its Hessian matrix:
$$\nabla f=\left[\begin{matrix}\frac{\partial f}{\partial x_1}\\ \frac{\partial f}{\partial x_2}\\ \vdots\\ \frac{\partial f}{\partial x_N}\end{matrix}\right],\qquad \nabla^2 f=\left[\begin{matrix}\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_N}\\ \frac{\partial^2 f}{\partial x_2\partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2\partial x_N}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_N\partial x_1} & \frac{\partial^2 f}{\partial x_N\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_N^2}\end{matrix}\right]$$
As before, we take the gradient of both sides with respect to $\mathbf{x}$ and set $\nabla f(\mathbf{x})=0$, which gives
$$\nabla f(\mathbf{x}_k)+\nabla^2 f(\mathbf{x}_k)(\mathbf{x}-\mathbf{x}_k)=0$$
and therefore:
$$\mathbf{x}=\mathbf{x}_k-\left(\nabla^2 f(\mathbf{x}_k)\right)^{-1}\nabla f(\mathbf{x}_k)$$
Writing $\nabla f(\mathbf{x}_k)=\mathbf{g}_k$ and $\nabla^2 f(\mathbf{x}_k)=\mathbf{H}_k$, we obtain the iteration:
$$\mathbf{x}_{k+1}=\mathbf{x}_k-\mathbf{H}_k^{-1}\mathbf{g}_k$$
This is the original (pure) Newton's method, and $\mathbf{d}_k=-\mathbf{H}_k^{-1}\mathbf{g}_k$ is called the Newton direction. Pure Newton's method has no step-size factor, however, so a steady decrease of the objective cannot be guaranteed. This motivates the damped Newton method: the search direction is still $\mathbf{d}_k$, but in each iteration a line search along that direction finds the best step size, i.e. before computing $\mathbf{x}_{k+1}$ we first obtain
$$\lambda_k=\arg\min_{\lambda\in\mathbb{R}} f(\mathbf{x}_k+\lambda\mathbf{d}_k)$$
and only then update $\mathbf{x}_{k+1}=\mathbf{x}_k+\lambda_k\mathbf{d}_k$, i.e. $\mathbf{x}_{k+1}=\mathbf{x}_k-\lambda_k\mathbf{H}_k^{-1}\mathbf{g}_k$.
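Below is a minimal sketch of this damped Newton method, assuming the gradient and Hessian are available as functions. The one-dimensional line search uses scipy.optimize.minimize_scalar as a stand-in for the exact minimization over $\lambda$, and the quadratic test problem is an illustrative choice, not from the original post.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def damped_newton(f, grad_f, hess_f, x0, tol=1e-8, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(hess_f(x), g)               # Newton direction d_k = -H_k^{-1} g_k
        lam = minimize_scalar(lambda t: f(x + t * d)).x  # lambda_k = argmin_t f(x_k + t d_k)
        x = x + lam * d                                  # x_{k+1} = x_k + lambda_k d_k
    return x

# Example: f(x, y) = (x - 1)^2 + 10 (y - 2)^2, minimized at (1, 2).
f      = lambda x: (x[0] - 1)**2 + 10 * (x[1] - 2)**2
grad_f = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] - 2)])
hess_f = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])
print(damped_newton(f, grad_f, hess_f, [0.0, 0.0]))      # ~ [1. 2.]
```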

2. Quasi-Newton Methods

Newton's method converges quickly, but it has two drawbacks:
1. It requires the second-order partial derivatives of the objective, which makes each iteration computationally expensive.
2. The Hessian is not guaranteed to remain positive definite; when it is singular or indefinite, its inverse cannot be used and Newton's method breaks down.
This motivates quasi-Newton methods: instead of computing second-order partial derivatives, they construct a positive-definite symmetric matrix that approximates the Hessian (or its inverse) and use it to optimize the objective. Different constructions naturally give different quasi-Newton methods.
Suppose the iteration has reached $\mathbf{x}_{k+1}$, and take the second-order Taylor expansion of the objective at $\mathbf{x}_{k+1}$:
$$f(\mathbf{x})\approx f(\mathbf{x}_{k+1})+\nabla f(\mathbf{x}_{k+1})^{\mathbf{T}}(\mathbf{x}-\mathbf{x}_{k+1})+\frac{1}{2}(\mathbf{x}-\mathbf{x}_{k+1})^{\mathbf{T}}\nabla^2 f(\mathbf{x}_{k+1})(\mathbf{x}-\mathbf{x}_{k+1})$$
Taking the gradient of both sides:
$$\nabla f(\mathbf{x})\approx\nabla f(\mathbf{x}_{k+1})+\nabla^2 f(\mathbf{x}_{k+1})(\mathbf{x}-\mathbf{x}_{k+1})$$
Setting $\mathbf{x}=\mathbf{x}_k$ gives
$$\nabla f(\mathbf{x}_k)\approx\nabla f(\mathbf{x}_{k+1})+\nabla^2 f(\mathbf{x}_{k+1})(\mathbf{x}_k-\mathbf{x}_{k+1})$$
that is:
$$\mathbf{g}_{k+1}-\mathbf{g}_k\approx\mathbf{H}_{k+1}(\mathbf{x}_{k+1}-\mathbf{x}_k)$$
Introducing the notation $\mathbf{y}_k=\mathbf{g}_{k+1}-\mathbf{g}_k$ and $\mathbf{s}_k=\mathbf{x}_{k+1}-\mathbf{x}_k$, we obtain:
$$\mathbf{y}_k\approx\mathbf{H}_{k+1}\mathbf{s}_k\qquad\text{or}\qquad\mathbf{s}_k\approx\mathbf{H}_{k+1}^{-1}\mathbf{y}_k$$
If we use $\mathbf{B}_{k+1}$ to approximate the Hessian $\mathbf{H}_{k+1}$ and $\mathbf{D}_{k+1}$ to approximate its inverse $\mathbf{H}_{k+1}^{-1}$, then we require:
$$\mathbf{y}_k=\mathbf{B}_{k+1}\mathbf{s}_k\qquad\text{or}\qquad\mathbf{s}_k=\mathbf{D}_{k+1}\mathbf{y}_k$$
This requirement is known as the quasi-Newton (secant) condition.
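As a small sanity check (not in the original post): for a quadratic objective the Hessian is a constant matrix, so $\mathbf{y}_k=\mathbf{H}\mathbf{s}_k$ holds exactly rather than approximately. The matrix $A$, vector $b$ and the two points below are arbitrary illustrative choices.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # constant, symmetric positive-definite Hessian
b = np.array([1.0, -1.0])
grad = lambda x: A @ x + b               # gradient of f(x) = 1/2 x^T A x + b^T x

x_k, x_k1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
s_k = x_k1 - x_k                         # s_k = x_{k+1} - x_k
y_k = grad(x_k1) - grad(x_k)             # y_k = g_{k+1} - g_k
print(np.allclose(y_k, A @ s_k))         # True: the secant condition y_k = H s_k
```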

DFP Algorithm

The core idea of DFP is to approximate $D_{k+1}$ (i.e. $\mathbf{H}_{k+1}^{-1}$) iteratively:
$$D_{k+1}=D_k+\Delta D_k$$
$D_0$ is usually taken to be the identity matrix, so the key question is how to construct $\Delta D_k$. Here we posit the form:
$$\Delta D_k=\alpha\,\mathbf{u}\mathbf{u}^{\mathbf{T}}+\beta\,\mathbf{v}\mathbf{v}^{\mathbf{T}}$$
which guarantees that $\Delta D_k$ is symmetric. Starting from $\mathbf{s}_k=D_{k+1}\mathbf{y}_k$, we derive:
$$\mathbf{s}_k=(D_k+\Delta D_k)\mathbf{y}_k=D_k\mathbf{y}_k+\alpha\,\mathbf{u}\mathbf{u}^{\mathbf{T}}\mathbf{y}_k+\beta\,\mathbf{v}\mathbf{v}^{\mathbf{T}}\mathbf{y}_k$$
$$\mathbf{s}_k=D_k\mathbf{y}_k+\mathbf{u}\,(\alpha\,\mathbf{u}^{\mathbf{T}}\mathbf{y}_k)+\mathbf{v}\,(\beta\,\mathbf{v}^{\mathbf{T}}\mathbf{y}_k)$$
Here $\alpha\,\mathbf{u}^{\mathbf{T}}\mathbf{y}_k$ and $\beta\,\mathbf{v}^{\mathbf{T}}\mathbf{y}_k$ are scalars, so we may simply set:
$$\alpha\,\mathbf{u}^{\mathbf{T}}\mathbf{y}_k=1,\qquad\beta\,\mathbf{v}^{\mathbf{T}}\mathbf{y}_k=-1$$
which gives $\alpha=\frac{1}{\mathbf{u}^{\mathbf{T}}\mathbf{y}_k}$ and $\beta=-\frac{1}{\mathbf{v}^{\mathbf{T}}\mathbf{y}_k}$.
Next we determine how to construct $\mathbf{u}$ and $\mathbf{v}$. Substituting $\alpha\,\mathbf{u}^{\mathbf{T}}\mathbf{y}_k=1$ and $\beta\,\mathbf{v}^{\mathbf{T}}\mathbf{y}_k=-1$ back into the expression for $\mathbf{s}_k$ above gives:
$$\mathbf{s}_k=D_k\mathbf{y}_k+\mathbf{u}-\mathbf{v}$$
$$\mathbf{u}-\mathbf{v}=\mathbf{s}_k-D_k\mathbf{y}_k$$
To make this hold we can simply take $\mathbf{u}=\mathbf{s}_k$ and $\mathbf{v}=D_k\mathbf{y}_k$, which yields:
$$\alpha=\frac{1}{\mathbf{s}_k^{\mathbf{T}}\mathbf{y}_k},\qquad\beta=-\frac{1}{\mathbf{y}_k^{\mathbf{T}}D_k^{\mathbf{T}}\mathbf{y}_k}$$
Finally we obtain the update formula for $\Delta D_k$:
$$\Delta D_k=\frac{\mathbf{s}_k\mathbf{s}_k^{\mathbf{T}}}{\mathbf{s}_k^{\mathbf{T}}\mathbf{y}_k}-\frac{D_k\mathbf{y}_k\mathbf{y}_k^{\mathbf{T}}D_k^{\mathbf{T}}}{\mathbf{y}_k^{\mathbf{T}}D_k^{\mathbf{T}}\mathbf{y}_k}$$
Thus we initialize $D_0=I$; in each iteration the search direction is $-D_k\mathbf{g}_k$, the optimal step size is found by a line search as in the damped Newton method, the new iterate $\mathbf{x}_{k+1}$ is computed, and $D_{k+1}$ is updated by the formula above before the next iteration. A sketch of this loop follows.
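Here is a minimal sketch of that DFP loop, assuming the objective and its gradient are given as functions; the line search again uses scipy.optimize.minimize_scalar, and the quadratic test problem is an illustrative choice rather than anything from the original post.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dfp(f, grad_f, x0, tol=1e-8, max_iter=200):
    x = np.asarray(x0, dtype=float)
    D = np.eye(x.size)                        # D_0 = I approximates the inverse Hessian
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -D @ g                            # search direction -D_k g_k
        lam = minimize_scalar(lambda t: f(x + t * d)).x
        s = lam * d                           # s_k = x_{k+1} - x_k
        x = x + s
        g_new = grad_f(x)
        y = g_new - g                         # y_k = g_{k+1} - g_k
        # Delta D_k = s s^T / (s^T y) - D y y^T D / (y^T D y)   (D is symmetric)
        D = D + np.outer(s, s) / (s @ y) - (D @ np.outer(y, y) @ D) / (y @ D @ y)
        g = g_new
    return x

f      = lambda x: (x[0] - 1)**2 + 10 * (x[1] - 2)**2
grad_f = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] - 2)])
print(dfp(f, grad_f, [0.0, 0.0]))             # ~ [1. 2.]
```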

BFGS Algorithm

Compared with DFP, BFGS performs better in practice and is currently one of the most widely used methods for unconstrained nonlinear optimization.
Unlike DFP, BFGS constructs $B_{k+1}$ to approximate the Hessian $\mathbf{H}_{k+1}$ directly. Likewise:
$$B_{k+1}=B_k+\Delta B_k$$
Following the same derivation as for DFP (omitted here), we obtain the correction matrix $\Delta B_k$:
$$\Delta B_k=\frac{\mathbf{y}_k\mathbf{y}_k^{\mathbf{T}}}{\mathbf{y}_k^{\mathbf{T}}\mathbf{s}_k}-\frac{B_k\mathbf{s}_k\mathbf{s}_k^{\mathbf{T}}B_k^{\mathbf{T}}}{\mathbf{s}_k^{\mathbf{T}}B_k^{\mathbf{T}}\mathbf{s}_k}$$
Notice that this is just the DFP update for $\Delta D_k$ with the roles of $\mathbf{s}_k$ and $\mathbf{y}_k$ (and of $D_k$, $B_k$) exchanged.
In each iteration BFGS takes the search direction $\mathbf{d}_k=-B_k^{-1}\mathbf{g}_k$. To avoid an explicit matrix inversion, the more common approach is to use the Sherman-Morrison formula to express $B_{k+1}^{-1}$ directly in terms of $B_k^{-1}$ at every iteration:
$$B_{k+1}^{-1}=\left(I-\frac{\mathbf{s}_k\mathbf{y}_k^{\mathbf{T}}}{\mathbf{y}_k^{\mathbf{T}}\mathbf{s}_k}\right)B_k^{-1}\left(I-\frac{\mathbf{y}_k\mathbf{s}_k^{\mathbf{T}}}{\mathbf{y}_k^{\mathbf{T}}\mathbf{s}_k}\right)+\frac{\mathbf{s}_k\mathbf{s}_k^{\mathbf{T}}}{\mathbf{y}_k^{\mathbf{T}}\mathbf{s}_k}$$
For how this formula is derived from the Sherman-Morrison identity, see this article: https://blog.csdn.net/langb2014/article/details/48915425, whose appendix works through the derivation in detail.
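Below is a minimal sketch of BFGS that maintains $D_k=B_k^{-1}$ directly with the update above, so no matrix is ever inverted explicitly; the line search and the quadratic test problem are illustrative assumptions, not from the original post.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def bfgs(f, grad_f, x0, tol=1e-8, max_iter=200):
    x = np.asarray(x0, dtype=float)
    n = x.size
    I = np.eye(n)
    D = np.eye(n)                             # approximation of the inverse Hessian B_k^{-1}
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -D @ g                            # d_k = -B_k^{-1} g_k
        lam = minimize_scalar(lambda t: f(x + t * d)).x
        s = lam * d
        x = x + s
        g_new = grad_f(x)
        y = g_new - g
        rho = 1.0 / (y @ s)
        # D_{k+1} = (I - rho s y^T) D_k (I - rho y s^T) + rho s s^T
        D = (I - rho * np.outer(s, y)) @ D @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        g = g_new
    return x

f      = lambda x: (x[0] - 1)**2 + 10 * (x[1] - 2)**2
grad_f = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] - 2)])
print(bfgs(f, grad_f, [0.0, 0.0]))            # ~ [1. 2.]
```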

L-BFGS

BFGS needs to store the $N\times N$ matrix $D_k$, which requires a lot of memory when $N$ is large. This is the motivation for L-BFGS (Limited-memory BFGS). Its basic idea is not to store $D_k$ itself but to store the vectors $\{\mathbf{s}_i\}$ and $\{\mathbf{y}_i\}$ generated during the iterations, and to recover the effect of $D_k$ from them whenever it is needed. Moreover, only a fixed number $m$ of the pairs $\{\mathbf{s}_i\}$, $\{\mathbf{y}_i\}$ are kept, so the memory cost drops from $O(N^2)$ to $O(mN)$. The derivation is as follows.
Recall the BFGS update:
$$B_{k+1}^{-1}=\left(I-\frac{\mathbf{s}_k\mathbf{y}_k^{\mathbf{T}}}{\mathbf{y}_k^{\mathbf{T}}\mathbf{s}_k}\right)B_k^{-1}\left(I-\frac{\mathbf{y}_k\mathbf{s}_k^{\mathbf{T}}}{\mathbf{y}_k^{\mathbf{T}}\mathbf{s}_k}\right)+\frac{\mathbf{s}_k\mathbf{s}_k^{\mathbf{T}}}{\mathbf{y}_k^{\mathbf{T}}\mathbf{s}_k}$$
Since $D_k=B_k^{-1}$, this becomes:
$$D_{k+1}=\left(I-\frac{\mathbf{s}_k\mathbf{y}_k^{\mathbf{T}}}{\mathbf{y}_k^{\mathbf{T}}\mathbf{s}_k}\right)D_k\left(I-\frac{\mathbf{y}_k\mathbf{s}_k^{\mathbf{T}}}{\mathbf{y}_k^{\mathbf{T}}\mathbf{s}_k}\right)+\frac{\mathbf{s}_k\mathbf{s}_k^{\mathbf{T}}}{\mathbf{y}_k^{\mathbf{T}}\mathbf{s}_k}$$
ρ k = 1 y k T s k \rho _k=\frac{1}{\bf y_k^Ts_k} , V k = I ρ k y k s k T V_k=I-\rho _k{\bf y_ks_k^T} , 则有:
$$D_{k+1}=V_k^{\mathbf{T}}D_kV_k+\rho_k\mathbf{s}_k\mathbf{s}_k^{\mathbf{T}}$$
Given an initial $D_0$, we have:
$$D_1=V_0^{\mathbf{T}}D_0V_0+\rho_0\mathbf{s}_0\mathbf{s}_0^{\mathbf{T}}$$
$$\begin{aligned}
D_2 &= V_1^{\mathbf{T}}D_1V_1+\rho_1\mathbf{s}_1\mathbf{s}_1^{\mathbf{T}}\\
&= V_1^{\mathbf{T}}V_0^{\mathbf{T}}D_0V_0V_1+V_1^{\mathbf{T}}(\rho_0\mathbf{s}_0\mathbf{s}_0^{\mathbf{T}})V_1+\rho_1\mathbf{s}_1\mathbf{s}_1^{\mathbf{T}}\\
D_3 &= V_2^{\mathbf{T}}D_2V_2+\rho_2\mathbf{s}_2\mathbf{s}_2^{\mathbf{T}}\\
&= V_2^{\mathbf{T}}\left(V_1^{\mathbf{T}}V_0^{\mathbf{T}}D_0V_0V_1+V_1^{\mathbf{T}}(\rho_0\mathbf{s}_0\mathbf{s}_0^{\mathbf{T}})V_1+\rho_1\mathbf{s}_1\mathbf{s}_1^{\mathbf{T}}\right)V_2+\rho_2\mathbf{s}_2\mathbf{s}_2^{\mathbf{T}}\\
&= V_2^{\mathbf{T}}V_1^{\mathbf{T}}V_0^{\mathbf{T}}D_0V_0V_1V_2+V_2^{\mathbf{T}}V_1^{\mathbf{T}}(\rho_0\mathbf{s}_0\mathbf{s}_0^{\mathbf{T}})V_1V_2+V_2^{\mathbf{T}}(\rho_1\mathbf{s}_1\mathbf{s}_1^{\mathbf{T}})V_2+\rho_2\mathbf{s}_2\mathbf{s}_2^{\mathbf{T}}\\
&\;\;\vdots\\
D_{k+1} &= (V_k^{\mathbf{T}}V_{k-1}^{\mathbf{T}}\cdots V_1^{\mathbf{T}}V_0^{\mathbf{T}})\,D_0\,(V_0V_1\cdots V_{k-1}V_k)\\
&\quad+(V_k^{\mathbf{T}}V_{k-1}^{\mathbf{T}}\cdots V_1^{\mathbf{T}})(\rho_0\mathbf{s}_0\mathbf{s}_0^{\mathbf{T}})(V_1\cdots V_{k-1}V_k)\\
&\quad+(V_k^{\mathbf{T}}V_{k-1}^{\mathbf{T}}\cdots V_2^{\mathbf{T}})(\rho_1\mathbf{s}_1\mathbf{s}_1^{\mathbf{T}})(V_2\cdots V_{k-1}V_k)\\
&\quad+\cdots\\
&\quad+V_k^{\mathbf{T}}(\rho_{k-1}\mathbf{s}_{k-1}\mathbf{s}_{k-1}^{\mathbf{T}})V_k\\
&\quad+\rho_k\mathbf{s}_k\mathbf{s}_k^{\mathbf{T}}
\end{aligned}$$
Therefore, to compute $D_{k+1}$ we only need to store the vectors $\mathbf{s}_i$ and $\mathbf{y}_i$ from the previous iterations. However, the number of iterations needed for convergence is not known in advance; if it is very large, we would still end up storing a huge number of vectors. To keep memory bounded, some vectors must be discarded, and naturally the oldest ones go first, starting from $\mathbf{s}_0$ and $\mathbf{y}_0$. Hence, if we only allow $m$ stored pairs, $D_{k+1}$ can be approximated by:
$$\begin{aligned}
D_{k+1} &\approx (V_k^{\mathbf{T}}V_{k-1}^{\mathbf{T}}\cdots V_{k-m+2}^{\mathbf{T}}V_{k-m+1}^{\mathbf{T}})\,D_0\,(V_{k-m+1}V_{k-m+2}\cdots V_{k-1}V_k)\\
&\quad+(V_k^{\mathbf{T}}V_{k-1}^{\mathbf{T}}\cdots V_{k-m+2}^{\mathbf{T}})(\rho_{k-m+1}\mathbf{s}_{k-m+1}\mathbf{s}_{k-m+1}^{\mathbf{T}})(V_{k-m+2}\cdots V_{k-1}V_k)\\
&\quad+(V_k^{\mathbf{T}}V_{k-1}^{\mathbf{T}}\cdots V_{k-m+3}^{\mathbf{T}})(\rho_{k-m+2}\mathbf{s}_{k-m+2}\mathbf{s}_{k-m+2}^{\mathbf{T}})(V_{k-m+3}\cdots V_{k-1}V_k)\\
&\quad+\cdots\\
&\quad+(V_k^{\mathbf{T}}V_{k-1}^{\mathbf{T}})(\rho_{k-2}\mathbf{s}_{k-2}\mathbf{s}_{k-2}^{\mathbf{T}})(V_{k-1}V_k)\\
&\quad+V_k^{\mathbf{T}}(\rho_{k-1}\mathbf{s}_{k-1}\mathbf{s}_{k-1}^{\mathbf{T}})V_k\\
&\quad+\rho_k\mathbf{s}_k\mathbf{s}_k^{\mathbf{T}}
\end{aligned}$$
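In practice this expansion is never multiplied out explicitly. What an optimizer actually needs is the product $D_{k+1}\mathbf{g}_{k+1}$ (the next search direction up to sign), and that product can be computed directly from the stored pairs with the standard L-BFGS two-loop recursion, which is equivalent to the expression above when $D_0$ is taken as a scaled identity. A minimal sketch, assuming the pairs are kept oldest-to-newest in Python lists; the function name and scaling choice $D_0=\gamma I$ are illustrative:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Return -D_{k+1} @ grad from the last m pairs (s_i, y_i), stored oldest to newest."""
    q = grad.copy()
    alphas = []
    # First loop: walk backwards through the history (newest pair first).
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q = q - a * y
        alphas.append((rho, a))
    # Apply the initial matrix D_0 = gamma * I, a common scaling choice.
    if s_list:
        s, y = s_list[-1], y_list[-1]
        q = ((s @ y) / (y @ y)) * q
    # Second loop: walk forwards through the history (oldest pair first).
    for (s, y), (rho, a) in zip(zip(s_list, y_list), reversed(alphas)):
        beta = rho * (y @ q)
        q = q + (a - beta) * s
    return -q                                 # search direction d_k = -D_k g_k
```

In a full L-BFGS loop one would append the new pair $(\mathbf{s}_k,\mathbf{y}_k)$ after each line-search step and drop the oldest pair once more than $m$ are stored.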


Reposted from blog.csdn.net/tangdawei2010/article/details/84932692