Index of all notes in this series: 统计学习方法读书笔记汇总贴
Free PDF download: 《统计学习方法(第二版)》
The EM algorithm is used for maximum likelihood estimation, or maximum a posteriori estimation, of the parameters of probabilistic models that contain hidden variables. Each EM iteration consists of two steps: the E-step computes an expectation, and the M-step performs a maximization.
1. Introduction to the EM Algorithm
The Three-Coin Model
Suppose there are three coins, denoted A, B, and C, with probabilities of heads $\pi$, $p$, and $q$ respectively. Run the following experiment: first toss coin A; if it shows heads, select coin B, otherwise select coin C. Then toss the selected coin and record the result, writing heads as 1 and tails as 0. Repeat the experiment independently $n$ times (here $n=10$). The observations are:
1,1,0,1,0,0,1,0,1,1
Suppose we can observe only the results of the final tosses, not the process (i.e., which coin was tossed). How do we estimate the probabilities of heads of the three coins, i.e., the parameters $\pi, p, q$ of the three-coin model?
So what we want to solve is: $\hat\theta=\arg\max_\theta \log P(Y\mid\theta)$
E-step: compute the probability $\mu_j^{(i+1)}$ that observation $y_j$ came from tossing coin B:
$$\mu_j^{(i+1)}=\frac{\pi^{(i)}(p^{(i)})^{y_j}(1-p^{(i)})^{1-y_j}}{\pi^{(i)}(p^{(i)})^{y_j}(1-p^{(i)})^{1-y_j}+(1-\pi^{(i)})(q^{(i)})^{y_j}(1-q^{(i)})^{1-y_j}}$$
M-step: compute the new parameter estimates:
$$\pi^{(i+1)}=\frac1n\sum_{j=1}^n\mu_j^{(i+1)},\qquad p^{(i+1)}=\frac{\sum_{j=1}^n\mu_j^{(i+1)}y_j}{\sum_{j=1}^n\mu_j^{(i+1)}},\qquad q^{(i+1)}=\frac{\sum_{j=1}^n\bigl(1-\mu_j^{(i+1)}\bigr)y_j}{\sum_{j=1}^n\bigl(1-\mu_j^{(i+1)}\bigr)}$$
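These update rules can be sketched in code (a minimal sketch of one EM iteration on the observations above; the initial values $\pi=p=q=0.5$ are arbitrary):

```python
def em_step(y, pi, p, q):
    """One EM iteration for the three-coin model."""
    # E-step: mu_j = posterior probability that coin A came up heads
    # (i.e., that coin B produced observation y_j)
    mu = [pi * p**yj * (1 - p)**(1 - yj)
          / (pi * p**yj * (1 - p)**(1 - yj)
             + (1 - pi) * q**yj * (1 - q)**(1 - yj))
          for yj in y]
    # M-step: the closed-form updates
    n = len(y)
    pi_new = sum(mu) / n
    p_new = sum(m * yj for m, yj in zip(mu, y)) / sum(mu)
    q_new = sum((1 - m) * yj for m, yj in zip(mu, y)) / sum(1 - m for m in mu)
    return pi_new, p_new, q_new

y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]  # the observations above
print(em_step(y, 0.5, 0.5, 0.5))    # → (0.5, 0.6, 0.6)
```

With the symmetric start every $\mu_j$ equals 0.5, so one iteration lands on $\pi=0.5$ and $p=q=0.6$, the sample mean of the observations.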
These are the final results; now let us derive them.
Notation:
- $y_j$: the observation from the $j$-th trial;
- $Z$: the hidden variable, the outcome of tossing coin A, which takes only two values (0/1);
- $z_j$: the outcome of tossing coin A in the $j$-th trial, where $z_j=1$ means heads;
- $\theta$: the parameter set $\{\pi, p, q\}$;
- $\theta^{(i)}$: the estimate of $\pi, p, q$ at the $i$-th iteration.
E-Step:
The complete-data log-likelihood is:
$$\log P(Y,Z\mid\theta)=\log\prod_{j=1}^n p(y_j,z_j\mid\theta)=\sum_{j=1}^n\log p(y_j,z_j\mid\theta)$$
Its expectation under the posterior $P(Z\mid Y,\theta^{(i)})$ is:
$$E_{Z\mid Y,\theta^{(i)}}\bigl[\log P(Y,Z\mid\theta)\bigr]=\sum_{j=1}^n\sum_{z_j}p(z_j\mid y_j,\theta^{(i)})\log p(y_j,z_j\mid\theta)\\=\sum_{j=1}^n\Bigl[p(z_j=1\mid y_j,\theta^{(i)})\log p(y_j,z_j=1\mid\theta)+p(z_j=0\mid y_j,\theta^{(i)})\log p(y_j,z_j=0\mid\theta)\Bigr]$$
For the posterior probability $p(z_j=1\mid y_j,\theta^{(i)})$, write $\mu_j^{(i+1)}=\frac{P(y_j,z_j=1\mid\theta^{(i)})}{P(y_j\mid\theta^{(i)})}$. Think of it this way: given $\theta^{(i)}$, the probability that $z_j=1$ is $\pi^{(i)}$, i.e., $P(z_j=1\mid\theta^{(i)})=\pi^{(i)}$. Since $\theta^{(i)}$ is fixed throughout, the conditioning on $\theta^{(i)}$ can be suppressed while computing the posterior. Therefore:
$$p(z_j=1\mid y_j;\theta^{(i)})=\frac{p(y_j,z_j=1)}{p(y_j)}\\=\frac{p(y_j\mid z_j=1)p(z_j=1)}{\sum_{z_j}p(y_j,z_j)}\\=\frac{p(y_j\mid z_j=1)p(z_j=1)}{p(y_j,z_j=1)+p(y_j,z_j=0)}\\=\frac{(p^{(i)})^{y_j}(1-p^{(i)})^{1-y_j}\cdot\pi^{(i)}}{(p^{(i)})^{y_j}(1-p^{(i)})^{1-y_j}\cdot\pi^{(i)}+(q^{(i)})^{y_j}(1-q^{(i)})^{1-y_j}\cdot(1-\pi^{(i)})}\\=\mu_j^{(i+1)}$$
and correspondingly $p(z_j=0\mid y_j;\theta^{(i)})=1-\mu_j^{(i+1)}$.
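The Bayes computation above can be checked numerically (a sketch; the parameter values $\pi=0.4$, $p=0.6$, $q=0.7$ are arbitrary):

```python
pi, p, q = 0.4, 0.6, 0.7  # arbitrary parameter values

def posterior_heads(yj):
    """P(z_j = 1 | y_j) via Bayes: joint / marginal."""
    joint1 = pi * p**yj * (1 - p)**(1 - yj)        # p(y_j, z_j = 1)
    joint0 = (1 - pi) * q**yj * (1 - q)**(1 - yj)  # p(y_j, z_j = 0)
    return joint1 / (joint1 + joint0)

print(round(posterior_heads(1), 4))  # → 0.3636  (= 0.24 / 0.66)
print(round(posterior_heads(0), 4))  # → 0.4706  (= 0.16 / 0.34)
```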
For the joint probabilities:
$$p(y_j,z_j=1\mid\theta)=p(y_j\mid z_j=1,\theta)\,p(z_j=1\mid\theta)=\pi p^{y_j}(1-p)^{1-y_j}$$
$$p(y_j,z_j=0\mid\theta)=p(y_j\mid z_j=0,\theta)\,p(z_j=0\mid\theta)=(1-\pi)q^{y_j}(1-q)^{1-y_j}$$
Therefore:
$$Q(\theta,\theta^{(i)})=E_{Z\mid Y,\theta^{(i)}}\bigl[\log P(Y,Z\mid\theta)\bigr]\\=\sum_{j=1}^n\Bigl[p(z_j=1\mid y_j,\theta^{(i)})\log p(y_j,z_j=1\mid\theta)+p(z_j=0\mid y_j,\theta^{(i)})\log p(y_j,z_j=0\mid\theta)\Bigr]\\=\sum_{j=1}^n\Bigl[\mu_j^{(i+1)}\log\bigl(\pi p^{y_j}(1-p)^{1-y_j}\bigr)+\bigl(1-\mu_j^{(i+1)}\bigr)\log\bigl((1-\pi)q^{y_j}(1-q)^{1-y_j}\bigr)\Bigr]$$
M-Step
Note: when taking the partial derivatives, treat $\mu_j^{(i+1)}$ as a constant.
Method: take the partial derivative of $Q(\theta,\theta^{(i)})$ with respect to each of $\pi, p, q$, set it to zero, and solve for the estimate.
- Take the partial derivative with respect to $\pi$ and set it to zero:
$$\frac{\partial Q}{\partial\pi}=\sum_{j=1}^n\Bigl[\mu_j^{(i+1)}\cdot\frac1\pi-\bigl(1-\mu_j^{(i+1)}\bigr)\cdot\frac1{1-\pi}\Bigr]\\=\sum_{j=1}^n\frac{\mu_j^{(i+1)}-\pi}{\pi(1-\pi)}\\=\frac{\sum_{j=1}^n\mu_j^{(i+1)}-n\pi}{\pi(1-\pi)}=0$$
So the estimate of $\pi$ is $\pi=\frac1n\sum_{j=1}^n\mu_j^{(i+1)}$.
- Take the partial derivative with respect to $p$ and set it to zero (the factor $\pi$ cancels):
$$\frac{\partial Q}{\partial p}=\sum_{j=1}^n\mu_j^{(i+1)}\frac{y_jp^{y_j-1}(1-p)^{1-y_j}-p^{y_j}(1-y_j)(1-p)^{-y_j}}{p^{y_j}(1-p)^{1-y_j}}\\=\sum_{j=1}^n\mu_j^{(i+1)}\Bigl[\frac{y_j}{p}-\frac{1-y_j}{1-p}\Bigr]\\=\sum_{j=1}^n\mu_j^{(i+1)}\frac{y_j-p}{p(1-p)}=0$$
So the estimate of $p$ is $p=\frac{\sum_{j=1}^n\mu_j^{(i+1)}y_j}{\sum_{j=1}^n\mu_j^{(i+1)}}$.
- Take the partial derivative with respect to $q$ and set it to zero (the factor $1-\pi$ cancels):
$$\frac{\partial Q}{\partial q}=\sum_{j=1}^n\bigl(1-\mu_j^{(i+1)}\bigr)\frac{y_jq^{y_j-1}(1-q)^{1-y_j}-q^{y_j}(1-y_j)(1-q)^{-y_j}}{q^{y_j}(1-q)^{1-y_j}}\\=\sum_{j=1}^n\bigl(1-\mu_j^{(i+1)}\bigr)\Bigl[\frac{y_j}{q}-\frac{1-y_j}{1-q}\Bigr]\\=\sum_{j=1}^n\bigl(1-\mu_j^{(i+1)}\bigr)\frac{y_j-q}{q(1-q)}=0$$
So the estimate of $q$ is $q=\frac{\sum_{j=1}^n(1-\mu_j^{(i+1)})y_j}{\sum_{j=1}^n(1-\mu_j^{(i+1)})}$.
2. Convergence of the EM Algorithm
Again using the three-coin model above as an example:
From the derivation in Section 9.1.2 of the book, $L(\theta)\ge B(\theta,\theta^{(i)})$, with equality at $\theta=\theta^{(i)}$.
Proof of monotonicity: $L(\theta^{(i+1)})\ge B(\theta^{(i+1)},\theta^{(i)})\ge B(\theta^{(i)},\theta^{(i)})=L(\theta^{(i)})$, so $L(\theta^{(0)})\le L(\theta^{(1)})\le\cdots\le L(\theta^{(i)})\le\cdots$
Moreover, $L(\theta)$ is bounded above by $\log 1=0$ (a probability is at most 1). A monotonically increasing sequence with an upper bound converges.
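This monotonicity is easy to verify numerically for the three-coin model (a sketch; `em_step` repeats the E/M updates from the first section, and the starting point is arbitrary):

```python
import math

def em_step(y, pi, p, q):
    # E-step: posterior that coin A was heads for each observation
    mu = [pi * p**t * (1 - p)**(1 - t)
          / (pi * p**t * (1 - p)**(1 - t) + (1 - pi) * q**t * (1 - q)**(1 - t))
          for t in y]
    # M-step: closed-form updates
    return (sum(mu) / len(y),
            sum(m * t for m, t in zip(mu, y)) / sum(mu),
            sum((1 - m) * t for m, t in zip(mu, y)) / sum(1 - m for m in mu))

def log_likelihood(y, pi, p, q):
    # L(theta) = sum_j log[ pi p^y (1-p)^(1-y) + (1-pi) q^y (1-q)^(1-y) ] <= 0
    return sum(math.log(pi * p**t * (1 - p)**(1 - t)
                        + (1 - pi) * q**t * (1 - q)**(1 - t)) for t in y)

y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
theta = (0.46, 0.55, 0.67)  # arbitrary starting point
lls = []
for _ in range(50):
    lls.append(log_likelihood(y, *theta))
    theta = em_step(y, *theta)

# L(theta^(i)) never decreases across iterations and stays below 0
assert all(b >= a - 1e-12 for a, b in zip(lls, lls[1:]))
assert lls[-1] <= 0.0
print("final log-likelihood:", round(lls[-1], 4))
```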
3. EM for Learning Gaussian Mixture Models
For a very detailed derivation, see this article: 白板推导系列笔记(十一)-高斯混合模型.
4. Generalizations of the EM Algorithm
(1) The Maximization-Maximization Algorithm for the $F$ Function
The $F$ function is written as:
$$F(\hat P(Z),\theta)=E_{\hat P(Z)}\bigl[\log P(Y,Z\mid\theta)\bigr]-E_{\hat P(Z)}\bigl[\log \hat P(Z)\bigr]$$
- Fix $\theta$ and perform the first maximization: search over distributions of the hidden variable for the $\hat P(Z)$ that maximizes $F_\theta(\hat P(Z))$; this maximum is attained exactly at $\hat P(Z)=P(Z\mid Y,\theta)$.
- Fix $\hat P(Z)$ and perform the second maximization, over $\theta$.
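To see why the two maximizations reproduce EM, note the standard identity obtained by plugging the first-step maximizer into $F$ (a sketch of the computation):

```latex
% Substituting \hat P(Z) = P(Z|Y,\theta) collapses F to the log-likelihood,
% so the second maximization is an ordinary M-step.
\begin{aligned}
F\bigl(P(Z\mid Y,\theta),\theta\bigr)
  &= E_{Z\mid Y,\theta}\bigl[\log P(Y,Z\mid\theta)\bigr]
   - E_{Z\mid Y,\theta}\bigl[\log P(Z\mid Y,\theta)\bigr] \\
  &= E_{Z\mid Y,\theta}\left[\log \frac{P(Y,Z\mid\theta)}{P(Z\mid Y,\theta)}\right] \\
  &= E_{Z\mid Y,\theta}\bigl[\log P(Y\mid\theta)\bigr] \\
  &= \log P(Y\mid\theta) = L(\theta)
\end{aligned}
```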
(2) The GEM Algorithm
Based on the $F$ function: instead of requiring the M-step to maximize exactly, GEM only requires finding a $\theta^{(i+1)}$ that increases the objective over its value at $\theta^{(i)}$.
Next chapter: 统计学习方法读书笔记(十)-隐马尔可夫模型