Reading Notes on Statistical Learning Methods (9): The EM Algorithm and Its Extensions

Index of all notes in this series: Reading Notes on Statistical Learning Methods (index post)

Free PDF download: 《统计学习方法(第二版)》 (Statistical Learning Methods, 2nd edition)

The EM algorithm is used for maximum likelihood estimation, or maximum a posteriori estimation, of the parameters of probabilistic models that contain hidden variables. Each iteration of the EM algorithm consists of two steps: the E-step, which computes an expectation, and the M-step, which performs a maximization.

1. Introduction to the EM Algorithm

The Three-Coin Model

Suppose there are three coins, denoted A, B, and C, whose probabilities of coming up heads are $\pi$, $p$, and $q$ respectively. Run the following experiment: first toss coin A; if it shows heads, select coin B, and if tails, select coin C. Then toss the selected coin and record the result, writing 1 for heads and 0 for tails. Repeat the experiment independently $n$ times (here $n=10$), giving the observations:

1, 1, 0, 1, 0, 0, 1, 0, 1, 1

Suppose we can only observe the final results of the tosses, not the tossing process itself. How do we estimate the probability of heads for each coin, i.e., the parameters $\pi, p, q$ of the three-coin model?

The observed-data likelihood is $P(Y|\theta)=\prod_{j=1}^n\big[\pi p^{y_j}(1-p)^{1-y_j}+(1-\pi)q^{y_j}(1-q)^{1-y_j}\big]$, so what we want is $\hat\theta=\arg\max_\theta\log P(Y|\theta)$, and the EM iteration for it is:
E-step: compute $\mu_j^{(i+1)}$, the probability that observation $y_j$ came from coin B:
$$\mu_j^{(i+1)}=\frac{\pi^{(i)}(p^{(i)})^{y_j}(1-p^{(i)})^{1-y_j}}{\pi^{(i)}(p^{(i)})^{y_j}(1-p^{(i)})^{1-y_j}+(1-\pi^{(i)})(q^{(i)})^{y_j}(1-q^{(i)})^{1-y_j}}$$
M-step: compute new estimates of the model parameters:
$$\pi^{(i+1)}=\frac1n\sum_{j=1}^n\mu_j^{(i+1)},\qquad p^{(i+1)}=\frac{\sum_{j=1}^n\mu_j^{(i+1)}y_j}{\sum_{j=1}^n\mu_j^{(i+1)}},\qquad q^{(i+1)}=\frac{\sum_{j=1}^n(1-\mu_j^{(i+1)})y_j}{\sum_{j=1}^n(1-\mu_j^{(i+1)})}$$
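Before the derivation, here is a minimal NumPy sketch of the iteration (my own illustration, not code from the book). The data are the ten observations above; with the symmetric start $(\pi,p,q)=(0.5,0.5,0.5)$ used in the book's Example 9.1, EM settles at $(0.5,0.6,0.6)$.

```python
import numpy as np

y = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1], dtype=float)  # observed tosses

def em_step(y, pi, p, q):
    # E-step: mu_j = probability that trial j used coin B (coin A was heads)
    from_B = pi * p**y * (1 - p)**(1 - y)
    from_C = (1 - pi) * q**y * (1 - q)**(1 - y)
    mu = from_B / (from_B + from_C)
    # M-step: the closed-form re-estimates of pi, p, q
    return mu.mean(), (mu * y).sum() / mu.sum(), ((1 - mu) * y).sum() / (1 - mu).sum()

pi, p, q = 0.5, 0.5, 0.5
for _ in range(20):
    pi, p, q = em_step(y, pi, p, q)
print(pi, p, q)  # -> 0.5 0.6 0.6 for this symmetric start
```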

These E-step and M-step updates are the final results; next we derive them. Notation:

$y_j$ is the observation from the $j$-th trial;
$Z$ is the hidden variable, the outcome of tossing coin A, which takes only two values (0/1);
$z_j$ is the outcome of tossing coin A in the $j$-th trial, with $z_j=1$ meaning heads;
$\theta$ is the parameter collection $\pi, p, q$;
$\theta^{(i)}$ is the estimate of $\pi, p, q$ at the $i$-th iteration.

E-Step:
The complete-data log-likelihood is
$$\log P(Y,Z|\theta)=\log\prod_{j=1}^n p(y_j,z_j|\theta)=\sum_{j=1}^n\log p(y_j,z_j|\theta)$$
Its expectation under the posterior of $Z$ is
$$E_{Z|Y,\theta^{(i)}}[\log P(Y,Z|\theta)]=\sum_{j=1}^n\sum_{z_j}p(z_j|y_j,\theta^{(i)})\log p(y_j,z_j|\theta)\\=\sum_{j=1}^n\Big[p(z_j=1|y_j,\theta^{(i)})\log p(y_j,z_j=1|\theta)+p(z_j=0|y_j,\theta^{(i)})\log p(y_j,z_j=0|\theta)\Big]$$
For the posterior $p(z_j|y_j,\theta^{(i)})$, we can write $\mu_j^{(i+1)}=\frac{P(y_j,z_j=1|\theta^{(i)})}{P(y_j|\theta^{(i)})}$. Think of it this way: conditioned on $\theta^{(i)}$, the probability that $z_j=1$ is simply $\pi^{(i)}$; since $\theta^{(i)}$ is fixed throughout, we may suppress it from the conditioning while computing the posterior. Therefore,
$$p(z_j=1|y_j;\theta^{(i)})=\frac{p(y_j,z_j=1)}{p(y_j)}=\frac{p(y_j|z_j=1)p(z_j=1)}{\sum_{z_j}p(y_j,z_j)}=\frac{p(y_j|z_j=1)p(z_j=1)}{p(y_j,z_j=1)+p(y_j,z_j=0)}\\=\frac{(p^{(i)})^{y_j}(1-p^{(i)})^{1-y_j}\cdot\pi^{(i)}}{(p^{(i)})^{y_j}(1-p^{(i)})^{1-y_j}\cdot\pi^{(i)}+(q^{(i)})^{y_j}(1-q^{(i)})^{1-y_j}\cdot(1-\pi^{(i)})}=\mu_j^{(i+1)}$$
and $p(z_j=0|y_j;\theta^{(i)})=1-\mu_j^{(i+1)}$.
For the joint probabilities,
$$p(y_j,z_j=1|\theta)=p(y_j|z_j=1,\theta)\,p(z_j=1|\theta)=\pi p^{y_j}(1-p)^{1-y_j}\\p(y_j,z_j=0|\theta)=p(y_j|z_j=0,\theta)\,p(z_j=0|\theta)=(1-\pi)q^{y_j}(1-q)^{1-y_j}$$
Therefore,
$$Q(\theta,\theta^{(i)})=E_{Z|Y,\theta^{(i)}}[\log P(Y,Z|\theta)]\\=\sum_{j=1}^n\Big[p(z_j=1|y_j,\theta^{(i)})\log p(y_j,z_j=1|\theta)+p(z_j=0|y_j,\theta^{(i)})\log p(y_j,z_j=0|\theta)\Big]\\=\sum_{j=1}^n\Big[\mu_j^{(i+1)}\log\big(\pi p^{y_j}(1-p)^{1-y_j}\big)+(1-\mu_j^{(i+1)})\log\big((1-\pi)q^{y_j}(1-q)^{1-y_j}\big)\Big]$$

M-Step:
Note: when taking the partial derivatives, treat $\mu_j^{(i+1)}$ as a constant.
Method: differentiate $Q$ with respect to each of $\pi$, $p$, and $q$, set the derivative to zero, and solve for the estimate.

  • Differentiate with respect to $\pi$ and set to zero:
    $$\frac{\partial Q}{\partial\pi}=\sum_{j=1}^n\Big[\mu_j^{(i+1)}\cdot\frac1\pi-(1-\mu_j^{(i+1)})\cdot\frac1{1-\pi}\Big]=\sum_{j=1}^n\frac{\mu_j^{(i+1)}-\pi}{\pi(1-\pi)}=\frac{\sum_{j=1}^n\mu_j^{(i+1)}-n\pi}{\pi(1-\pi)}=0$$
    so the estimate of $\pi$ is $\pi=\frac1n\sum_{j=1}^n\mu_j^{(i+1)}$.
  • Differentiate with respect to $p$ and set to zero:
    $$\frac{\partial Q}{\partial p}=\sum_{j=1}^n\mu_j^{(i+1)}\,\frac{\pi\big[y_jp^{y_j-1}(1-p)^{1-y_j}-p^{y_j}(1-y_j)(1-p)^{-y_j}\big]}{\pi p^{y_j}(1-p)^{1-y_j}}\\=\sum_{j=1}^n\mu_j^{(i+1)}\cdot\frac{y_j(1-p)-(1-y_j)p}{p(1-p)}=\sum_{j=1}^n\mu_j^{(i+1)}\,\frac{y_j-p}{p(1-p)}=0$$
    so the estimate of $p$ is $p=\frac{\sum_{j=1}^n\mu_j^{(i+1)}y_j}{\sum_{j=1}^n\mu_j^{(i+1)}}$.
  • Differentiate with respect to $q$ and set to zero:
    $$\frac{\partial Q}{\partial q}=\sum_{j=1}^n(1-\mu_j^{(i+1)})\,\frac{(1-\pi)\big[y_jq^{y_j-1}(1-q)^{1-y_j}-q^{y_j}(1-y_j)(1-q)^{-y_j}\big]}{(1-\pi)q^{y_j}(1-q)^{1-y_j}}\\=\sum_{j=1}^n(1-\mu_j^{(i+1)})\,\frac{y_j(1-q)-(1-y_j)q}{q(1-q)}=\sum_{j=1}^n(1-\mu_j^{(i+1)})\,\frac{y_j-q}{q(1-q)}=0$$
    so the estimate of $q$ is $q=\frac{\sum_{j=1}^n(1-\mu_j^{(i+1)})y_j}{\sum_{j=1}^n(1-\mu_j^{(i+1)})}$.
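As a quick numerical sanity check on these three closed forms (a sketch of my own, not from the book): fix any responsibilities $\mu_j^{(i+1)}$, evaluate $Q(\theta,\theta^{(i)})$ directly, and verify that random parameter triples never beat the closed-form estimates.

```python
import numpy as np

y = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1], dtype=float)
mu = np.full(10, 0.3)  # any fixed responsibilities from a previous E-step

def Q(pi, p, q):
    # Q(theta, theta^(i)) as derived above, with mu held constant
    return np.sum(mu * np.log(pi * p**y * (1 - p)**(1 - y))
                  + (1 - mu) * np.log((1 - pi) * q**y * (1 - q)**(1 - y)))

# Closed-form M-step estimates
pi_hat = mu.mean()
p_hat = (mu * y).sum() / mu.sum()
q_hat = ((1 - mu) * y).sum() / (1 - mu).sum()

rng = np.random.default_rng(0)
for _ in range(10_000):
    pi, p, q = rng.uniform(0.01, 0.99, size=3)
    assert Q(pi, p, q) <= Q(pi_hat, p_hat, q_hat) + 1e-9
print("no random theta improved on the closed-form M-step")
```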

2. Convergence of the EM Algorithm

Again take the three-coin model above as the example.

From the derivation in Section 9.1.2 of the book we have $L(\theta)\ge B(\theta,\theta^{(i)})$, with equality at $\theta=\theta^{(i)}$.

Convergence argument: $L(\theta^{(i+1)})\ge B(\theta^{(i+1)},\theta^{(i)})\ge B(\theta^{(i)},\theta^{(i)})=L(\theta^{(i)})$, where the middle inequality holds because $\theta^{(i+1)}$ is chosen to maximize $B(\theta,\theta^{(i)})$ over $\theta$. Hence $L(\theta^{(i+1)})\ge L(\theta^{(i)})\ge\cdots\ge L(\theta^{(0)})$, i.e., the sequence is monotonically non-decreasing.
Moreover, $L(\theta)$ is bounded above: here every $P(y_j|\theta)\le1$, so $L(\theta)\le\log 1=0$. A monotone sequence that is bounded above converges.
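This monotone, bounded behaviour is easy to watch numerically. In the sketch below (my own illustration; the starting point $(0.4,0.6,0.7)$ is, as I recall, the alternative initialization tried in the book's Example 9.1), the incomplete-data log-likelihood never decreases and stays below 0:

```python
import numpy as np

y = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1], dtype=float)

def log_likelihood(pi, p, q):
    # log P(Y|theta) = sum_j log[pi p^y (1-p)^(1-y) + (1-pi) q^y (1-q)^(1-y)]
    mix = pi * p**y * (1 - p)**(1 - y) + (1 - pi) * q**y * (1 - q)**(1 - y)
    return np.log(mix).sum()

pi, p, q = 0.4, 0.6, 0.7
prev = -np.inf
for i in range(10):
    ll = log_likelihood(pi, p, q)
    assert prev - 1e-12 <= ll <= 0.0   # monotone and bounded above by log 1 = 0
    print(f"iter {i}: L = {ll:.6f}")
    prev = ll
    # one EM step (the updates from Section 1)
    from_B = pi * p**y * (1 - p)**(1 - y)
    mu = from_B / (from_B + (1 - pi) * q**y * (1 - q)**(1 - y))
    pi, p, q = mu.mean(), (mu * y).sum() / mu.sum(), ((1 - mu) * y).sum() / (1 - mu).sum()
```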

3. The EM Algorithm for Learning Gaussian Mixture Models

See this article for a very detailed derivation: 白板推导系列笔记(十一)-高斯混合模型 (Whiteboard Derivation Series, Note 11: Gaussian Mixture Models).

4. Extensions of the EM Algorithm

(1) The Maximization-Maximization Algorithm for the $F$ Function

The $F$ function is written as
$$F(\hat P(Z),\theta)=E_{\hat P(Z)}[\log P(Y,Z|\theta)]-E_{\hat P(Z)}[\log\hat P(Z)]$$
where the second term is the entropy of $\hat P$. The algorithm alternates two maximizations:

  • With $\theta$ fixed, take the first maximum: search over distributions of the hidden variable for the $\hat P(Z)$ that maximizes $F_\theta(\hat P(Z))$; this maximum is attained exactly at $\hat P(Z)=P(Z|Y,\theta)$, as worked out below.
  • With $\hat P(Z)$ fixed, take the second maximum over $\theta$.
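To see why the first maximization lands exactly on $P(Z|Y,\theta)$ (the book proves this with two lemmas in its Section 9.4; the Lagrange-multiplier computation below is my own compressed version), maximize $F$ over $\hat P$ subject to $\sum_Z\hat P(Z)=1$:

$$\begin{aligned}
\mathcal{L}&=E_{\hat P(Z)}[\log P(Y,Z|\theta)]-E_{\hat P(Z)}[\log\hat P(Z)]+\lambda\Big(1-\sum_Z\hat P(Z)\Big)\\
\frac{\partial\mathcal{L}}{\partial\hat P(Z)}&=\log P(Y,Z|\theta)-\log\hat P(Z)-1-\lambda=0\;\Longrightarrow\;\hat P(Z)\propto P(Y,Z|\theta)\;\Longrightarrow\;\hat P_\theta(Z)=P(Z|Y,\theta)\\
F(\hat P_\theta,\theta)&=E_{\hat P_\theta}\big[\log P(Y,Z|\theta)-\log P(Z|Y,\theta)\big]=E_{\hat P_\theta}[\log P(Y|\theta)]=\log P(Y|\theta)
\end{aligned}$$

So alternating the two maximizations reproduces EM: the first maximization is the E-step (it recovers the posterior over $Z$), and because $F$ at that point equals $\log P(Y|\theta)$, the second maximization pushes the log-likelihood up exactly like the M-step.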

(2) The GEM Algorithm

The GEM (generalized EM) algorithm is based on the $F$ function: instead of exactly maximizing $Q(\theta,\theta^{(i)})$ in the M-step, each iteration only needs to find a $\theta^{(i+1)}$ that increases it, which is still enough to keep the likelihood non-decreasing. A sketch of one such variant follows.
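The sketch below is my own illustration of the idea, not pseudocode from the book: keep the E-step unchanged, but replace the exact M-step with a backtracking gradient step on $Q$ (using the partial derivatives worked out in Section 1), accepting any step that increases $Q$:

```python
import numpy as np

y = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1], dtype=float)

def Q(theta, mu):
    pi, p, q = theta
    return np.sum(mu * np.log(pi * p**y * (1 - p)**(1 - y))
                  + (1 - mu) * np.log((1 - pi) * q**y * (1 - q)**(1 - y)))

def gem_step(theta):
    pi, p, q = theta
    # E-step: the same responsibilities as in Section 1
    from_B = pi * p**y * (1 - p)**(1 - y)
    mu = from_B / (from_B + (1 - pi) * q**y * (1 - q)**(1 - y))
    # Gradient of Q: the partial derivatives derived in Section 1
    grad = np.array([((mu - pi) / (pi * (1 - pi))).sum(),
                     (mu * (y - p) / (p * (1 - p))).sum(),
                     ((1 - mu) * (y - q) / (q * (1 - q))).sum()])
    # Backtrack until Q strictly increases (or the step vanishes): GEM only
    # asks for an increase, not for the exact maximizer.
    step = 0.1
    while step > 1e-12:
        new = np.clip(theta + step * grad, 1e-6, 1 - 1e-6)
        if Q(new, mu) > Q(theta, mu):
            return new
        step /= 2
    return theta

theta = np.array([0.4, 0.6, 0.7])
for _ in range(100):
    theta = gem_step(theta)
print(theta)
```

Because each accepted step increases $Q(\theta,\theta^{(i)})$, the same argument as in Section 2 shows that the log-likelihood sequence is still non-decreasing.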

Next chapter: Reading Notes on Statistical Learning Methods (10): Hidden Markov Models


Reposted from blog.csdn.net/qq_41485273/article/details/112862578