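This post mainly follows the whiteboard-derivation video by shuhuai008 on Bilibili: 玻尔兹曼机_147min (Boltzmann Machine, 147 min).
Index of all notes in this series: 机器学习-白板推导系列笔记 (Machine Learning: Whiteboard Derivation Series Notes).
Reference: Section 20.1 of the "flower book" (Deep Learning, Goodfellow et al.).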
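1. Introduction
In a Boltzmann machine every node is a discrete binary random variable, and the nodes are fully connected. The model was proposed to address the problem of getting stuck in local minima.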
$$
\begin{aligned}
&v\in\{0,1\}^D,\qquad h\in\{0,1\}^P\\
&L=\Big[L_{ij}\Big]_{D\times D}\\
&J=\Big[J_{ij}\Big]_{P\times P}\\
&W=\Big[w_{ij}\Big]_{D\times P}
\end{aligned}
$$
Here $v$ are the visible units and $h$ the hidden units; $L$, $J$ and $W$ hold the visible-visible, hidden-hidden and visible-hidden weights. $L$ and $J$ are symmetric with zero diagonal, which the derivations below rely on.
$$
\begin{cases}
p(v,h)=\frac1Z\exp\{-E(v,h)\}\\
E(v,h)=-\left(v^TWh+\frac12v^TLv+\frac12h^TJh\right)
\end{cases}
\qquad\theta=\{W,L,J\}
$$
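To make the definition concrete, here is a minimal NumPy sketch that evaluates $E(v,h)$ and normalises $\exp\{-E\}$ by brute-force enumeration. The sizes `D = 3`, `P = 2`, the random weights and the helper names (`energy`, `random_symmetric_zero_diag`) are illustrative assumptions, not part of the original notes; enumeration is only feasible for toy models like this one.

```python
import itertools
import numpy as np

def energy(v, h, W, L, J):
    """E(v, h) = -(v^T W h + 1/2 v^T L v + 1/2 h^T J h)."""
    return -(v @ W @ h + 0.5 * v @ L @ v + 0.5 * h @ J @ h)

def random_symmetric_zero_diag(rng, n):
    """Symmetric coupling matrix with zero diagonal (no self-connections)."""
    A = rng.normal(size=(n, n))
    A = (A + A.T) / 2.0
    np.fill_diagonal(A, 0.0)
    return A

rng = np.random.default_rng(0)
D, P = 3, 2  # toy sizes so that all 2^(D+P) states can be enumerated
W = rng.normal(size=(D, P))
L = random_symmetric_zero_diag(rng, D)
J = random_symmetric_zero_diag(rng, P)

# Enumerate every joint state (v, h) and normalise exp(-E) by brute force.
states = [
    (np.array(v, dtype=float), np.array(h, dtype=float))
    for v in itertools.product([0, 1], repeat=D)
    for h in itertools.product([0, 1], repeat=P)
]
unnormalised = np.array([np.exp(-energy(v, h, W, L, J)) for v, h in states])
Z = unnormalised.sum()            # partition function
joint = unnormalised / Z          # joint[k] = p(v, h) for states[k]
```

The cost of computing `Z` this way doubles with every added unit, which is why the later sections turn to sampling and variational approximations.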
2. Gradient of the log-likelihood
Training set: $V$, with $|V|=N$.
$$
\begin{aligned}
&P(v)=\sum_hp(v,h)\\
&\frac1N\sum_{v\in V}\log P(v)\quad\leftarrow\ \text{log-likelihood}\\
&\frac\partial{\partial\theta}\frac1N\sum_{v\in V}\log P(v)=\frac1N\sum_{v\in V}{\color{blue}\frac{\partial\log P(v)}{\partial\theta}}\quad\leftarrow\ \text{gradient of the log-likelihood}
\end{aligned}
$$
$$
\frac{\partial\log P(v)}{\partial\theta}=\sum_v\sum_hp(v,h)\cdot\frac{\partial E(v,h)}{\partial\theta}-\sum_hp(h|v)\cdot\frac{\partial E(v,h)}{\partial\theta}
$$
In the double sum, $\sum_v$ runs over all $2^D$ visible configurations and is unrelated to the fixed sample $v$ on the left-hand side. For $\theta=W$ we have $\frac{\partial E(v,h)}{\partial W}=-vh^T$, so
$$
\begin{aligned}
\frac{\partial\log P(v)}{\partial W}&=\sum_v\sum_hp(v,h)\cdot(-vh^T)-\sum_hp(h|v)\cdot(-vh^T)\\
&=\sum_hp(h|v)\cdot vh^T-\sum_v\sum_hp(v,h)\cdot vh^T
\end{aligned}
$$
Therefore,
$$
\begin{aligned}
\frac1N\sum_{v\in V}\frac{\partial\log P(v)}{\partial W}&=\frac1N\sum_{v\in V}\sum_hp(h|v)\cdot vh^T-\frac1N\sum_{v\in V}\sum_v\sum_hp(v,h)\cdot vh^T\\
&=\frac1N\sum_{v\in V}\sum_hp(h|v)\cdot vh^T-\sum_v\sum_hp(v,h)\cdot vh^T\\
&=E_{P_{Data}}\Big[vh^T\Big]-E_{P_{model}}\Big[vh^T\Big]
\end{aligned}
$$
where
$$
\begin{aligned}
P_{Data}&=P_{Data}(v)P_{model}(h|v)\\
P_{model}&=P_{model}(h,v)=P_{model}(v)P_{model}(h|v)
\end{aligned}
$$
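As a sanity check on the per-sample gradient $\frac{\partial\log P(v)}{\partial W}=E_{p(h|v)}[vh^T]-E_{p(v,h)}[vh^T]$, the following sketch compares it with a central finite difference of $\log P(v)$ on a toy model. Everything is computed by enumeration; the sizes, the seed and the function names are assumptions made for illustration.

```python
import itertools
import numpy as np

def neg_energy(v, h, W, L, J):
    """-E(v, h) = v^T W h + 1/2 v^T L v + 1/2 h^T J h."""
    return v @ W @ h + 0.5 * v @ L @ v + 0.5 * h @ J @ h

def binary_states(n):
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

def log_p_v(v, W, L, J):
    """log P(v) = log sum_h exp(-E(v, h)) - log Z, by enumeration."""
    D, P = W.shape
    num = sum(np.exp(neg_energy(v, h, W, L, J)) for h in binary_states(P))
    Z = sum(np.exp(neg_energy(u, h, W, L, J))
            for u in binary_states(D) for h in binary_states(P))
    return np.log(num) - np.log(Z)

def grad_log_p_v_W(v, W, L, J):
    """E_{p(h|v)}[v h^T] - E_{p(v,h)}[v h^T], by enumeration."""
    D, P = W.shape
    hs = binary_states(P)
    w_pos = np.array([np.exp(neg_energy(v, h, W, L, J)) for h in hs])
    w_pos /= w_pos.sum()                       # p(h | v)
    positive = sum(p * np.outer(v, h) for p, h in zip(w_pos, hs))
    pairs = [(u, h) for u in binary_states(D) for h in hs]
    w_neg = np.array([np.exp(neg_energy(u, h, W, L, J)) for u, h in pairs])
    w_neg /= w_neg.sum()                       # p(v, h)
    negative = sum(p * np.outer(u, h) for p, (u, h) in zip(w_neg, pairs))
    return positive - negative

def finite_difference_grad_W(v, W, L, J, eps=1e-5):
    """Central finite difference of log P(v) with respect to each W[i, j]."""
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            grad[i, j] = (log_p_v(v, Wp, L, J) - log_p_v(v, Wm, L, J)) / (2 * eps)
    return grad

def symmetric_zero_diag(rng, n):
    A = rng.normal(size=(n, n))
    A = (A + A.T) / 2.0
    np.fill_diagonal(A, 0.0)
    return A

rng = np.random.default_rng(1)
D, P = 3, 2
W = rng.normal(size=(D, P))
L = symmetric_zero_diag(rng, D)
J = symmetric_zero_diag(rng, P)
v = np.array([1.0, 0.0, 1.0])

analytic = grad_log_p_v_W(v, W, L, J)
numeric = finite_difference_grad_W(v, W, L, J)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

Averaging `analytic` over the training set gives the difference of expectations derived above.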
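3. Stochastic gradient ascent based on MCMC
Applying the same derivation to each parameter matrix gives the updates below, where $\alpha$ is the learning rate: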
$$
\begin{aligned}
\Delta W&=\alpha\Big(E_{P_{Data}}\Big[vh^T\Big]-E_{P_{model}}\Big[vh^T\Big]\Big)\\
\Delta L&=\alpha\Big(E_{P_{Data}}\Big[vv^T\Big]-E_{P_{model}}\Big[vv^T\Big]\Big)\\
\Delta J&=\alpha\Big(E_{P_{Data}}\Big[hh^T\Big]-E_{P_{model}}\Big[hh^T\Big]\Big)
\end{aligned}
$$
The gradient-ascent step for $W$ is
$$
W^{(t+1)}=W^{(t)}+\Delta W
$$
and element-wise,
$$
\Delta w_{ij}=\alpha\Big(\underset{\text{positive phase}}{\underbrace{E_{P_{Data}}\Big[v_ih_j\Big]}}-\underset{\text{negative phase}}{\underbrace{E_{P_{model}}\Big[v_ih_j\Big]}}\Big)
$$
However, both the positive phase and the negative phase are intractable to compute exactly.
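For a model small enough to enumerate, both phases can still be computed exactly, which makes the update rule easy to see in action. The sketch below runs plain gradient ascent on $W$ only, with $L$ and $J$ held at zero; the toy data set (`data_idx`, indices into the list of visible configurations), the learning rate `alpha = 0.1` and the function names are assumptions for illustration, and this exact computation does not scale beyond toy sizes.

```python
import itertools
import numpy as np

def binary_states(n):
    return np.array(list(itertools.product([0.0, 1.0], repeat=n)))

def joint_table(W, L, J):
    """p(v, h) for every configuration, as an array of shape (2^D, 2^P)."""
    D, P = W.shape
    Vs, Hs = binary_states(D), binary_states(P)
    neg_E = (Vs @ W @ Hs.T
             + 0.5 * np.einsum("ai,ij,aj->a", Vs, L, Vs)[:, None]
             + 0.5 * np.einsum("bi,ij,bj->b", Hs, J, Hs)[None, :])
    p = np.exp(neg_E)
    return p / p.sum(), Vs, Hs

def avg_log_likelihood(data_idx, W, L, J):
    """(1/N) sum over the data of log P(v)."""
    p, _, _ = joint_table(W, L, J)
    return np.mean(np.log(p.sum(axis=1)[data_idx]))

def delta_W(data_idx, W, L, J, alpha):
    """alpha * (E_data[v h^T] - E_model[v h^T]), computed exactly."""
    p, Vs, Hs = joint_table(W, L, J)
    p_h_given_v = p / p.sum(axis=1, keepdims=True)
    # Positive phase: average over the data of sum_h p(h|v) v h^T.
    positive = np.mean(
        [np.outer(Vs[a], p_h_given_v[a] @ Hs) for a in data_idx], axis=0)
    # Negative phase: sum over all (v, h) of p(v, h) v h^T.
    negative = Vs.T @ p @ Hs
    return alpha * (positive - negative)

rng = np.random.default_rng(2)
D, P = 3, 2
W = 0.1 * rng.normal(size=(D, P))
L = np.zeros((D, D))
J = np.zeros((P, P))
data_idx = np.array([5, 5, 2, 7, 5])   # rows of binary_states(D), e.g. 5 -> (1, 0, 1)
alpha = 0.1

history = [avg_log_likelihood(data_idx, W, L, J)]
for t in range(50):
    W = W + delta_W(data_idx, W, L, J, alpha)   # W^(t+1) = W^(t) + Delta W
    history.append(avg_log_likelihood(data_idx, W, L, J))
```

With a small step size the average log-likelihood in `history` does not decrease from one step to the next; in realistic models the two expectations are replaced by sample-based estimates.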
They are therefore approximated by MCMC (Gibbs sampling), which only needs the conditional distribution of a single unit given all the others:
$$
\begin{aligned}
p(v_i=1|h,v_{-i})&=\sigma\Big(\sum_{j=1}^Pw_{ij}h_j+\sum_{k=1,\,k\neq i}^DL_{ik}v_k\Big)\\
p(h_j=1|v,h_{-j})&=\sigma\Big(\sum_{i=1}^Dw_{ij}v_i+\sum_{m=1,\,m\neq j}^PJ_{jm}h_m\Big)
\end{aligned}
$$
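A minimal sketch of how these two conditionals drive a Gibbs sampler, and how the resulting chain can estimate the negative phase $E_{P_{model}}[vh^T]$. The sweep order, the chain length, the burn-in and the function names are illustrative assumptions rather than settings from the video.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h, W, L, J, rng):
    """One systematic-scan Gibbs sweep: every visible unit, then every hidden unit."""
    v, h = v.copy(), h.copy()
    for i in range(v.size):
        # L[i, i] == 0, so L[i] @ v equals the sum over k != i.
        p_on = sigmoid(W[i] @ h + L[i] @ v)
        v[i] = float(rng.random() < p_on)
    for j in range(h.size):
        # J[j, j] == 0, so J[j] @ h equals the sum over m != j.
        p_on = sigmoid(W[:, j] @ v + J[j] @ h)
        h[j] = float(rng.random() < p_on)
    return v, h

def estimate_negative_phase(W, L, J, rng, n_sweeps=2000, burn_in=200):
    """Monte Carlo estimate of E_model[v h^T] from a single Gibbs chain."""
    D, P = W.shape
    v = rng.integers(0, 2, size=D).astype(float)
    h = rng.integers(0, 2, size=P).astype(float)
    total = np.zeros((D, P))
    for t in range(burn_in + n_sweeps):
        v, h = gibbs_sweep(v, h, W, L, J, rng)
        if t >= burn_in:
            total += np.outer(v, h)
    return total / n_sweeps

rng = np.random.default_rng(3)
W = 0.5 * rng.normal(size=(3, 2))
L = np.zeros((3, 3))
J = np.zeros((2, 2))
negative_phase = estimate_negative_phase(W, L, J, rng)
```

The positive phase can be estimated the same way by clamping `v` to a training sample and resampling only the hidden units.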
For comparison, in an RBM (shown in a figure in the original post, with three hidden units) there are no hidden-hidden connections, so the hidden units are conditionally independent given $v$:
$$
\begin{aligned}
p(h|v)&=\prod_{j=1}^3p(h_j|v)\\
p(h_j=1|v)&=p(h_j=1|v,h_{-j})=\sigma\Big(\sum_{i=1}^Dw_{ij}v_i\Big)
\end{aligned}
$$
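The factorisation can be checked numerically: for an RBM, the exact posterior $p(h|v)$ obtained by enumerating $h$ should match the product of per-unit sigmoids. The sketch below assumes $J=0$ and uses made-up sizes and weights; the function names are hypothetical.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_hidden_probs(v, W):
    """p(h_j = 1 | v) for every j at once; valid when there are no hidden-hidden links."""
    return sigmoid(v @ W)

def sample_hidden(v, W, rng):
    """Draw all hidden units in one step, since they are independent given v."""
    return (rng.random(W.shape[1]) < rbm_hidden_probs(v, W)).astype(float)

def exact_posterior(v, W):
    """p(h | v) by enumerating h; terms that do not involve h cancel in the ratio."""
    hs = np.array(list(itertools.product([0.0, 1.0], repeat=W.shape[1])))
    weights = np.exp(hs @ (v @ W))      # exp(v^T W h) for each configuration h
    return hs, weights / weights.sum()

rng = np.random.default_rng(4)
D, P = 4, 3
W = rng.normal(size=(D, P))
v = np.array([1.0, 0.0, 1.0, 1.0])

probs = rbm_hidden_probs(v, W)
hs, posterior = exact_posterior(v, W)
# Product of per-unit Bernoulli probabilities for each configuration h.
factorised = np.prod(np.where(hs == 1.0, probs, 1.0 - probs), axis=1)
h_sample = sample_hidden(v, W, rng)
```

This is why the positive phase of an RBM needs no Markov chain, whereas a general Boltzmann machine with $J\neq0$ does.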
4. Derivation of the conditional probabilities
We want to show that
$$
\begin{aligned}
p(v_i=1|h,v_{-i})&=\sigma\Big(\sum_{j=1}^Pw_{ij}h_j+\sum_{k=1,\,k\neq i}^DL_{ik}v_k\Big)\\
p(h_j=1|v,h_{-j})&=\sigma\Big(\sum_{i=1}^Dw_{ij}v_i+\sum_{m=1,\,m\neq j}^PJ_{jm}h_m\Big)
\end{aligned}
$$
$$
\begin{aligned}
p(v_i|h,v_{-i})&=\frac{p(v,h)}{p(h,v_{-i})}\\
&=\frac{\frac1Z\exp\{-E(v,h)\}}{\sum_{v_i}\frac1Z\exp\{-E(v,h)\}}\\
&=\frac{\exp\{v^TWh+\frac12v^TLv+\frac12h^TJh\}}{\sum_{v_i}\exp\{v^TWh+\frac12v^TLv+\frac12h^TJh\}}\\
&=\frac{\exp\{v^TWh+\frac12v^TLv\}}{\sum_{v_i}\exp\{v^TWh+\frac12v^TLv\}}\\
&=\frac{\exp\{v^TWh+\frac12v^TLv\}}{\exp\{v^TWh+\frac12v^TLv\}\Big|_{v_i=0}+\exp\{v^TWh+\frac12v^TLv\}\Big|_{v_i=1}}
\end{aligned}
$$
Therefore,
$$
p(v_i=1|h,v_{-i})=\frac{\exp\{v^TWh+\frac12v^TLv\}\Big|_{v_i=1}}{\exp\{v^TWh+\frac12v^TLv\}\Big|_{v_i=0}+\exp\{v^TWh+\frac12v^TLv\}\Big|_{v_i=1}}
$$
Let $\Delta=\exp\{v^TWh+\frac12v^TLv\}$.
Then
$$
p(v_i=1|h,v_{-i})=\frac{\Delta_{v_i=1}}{\Delta_{v_i=0}+\Delta_{v_i=1}}
$$
Expanding the quadratic forms and separating out the terms that contain $v_i$ (using $L_{ii}=0$ to drop the diagonal term, and the symmetry of $L$ to merge the two cross terms):
$$
\begin{aligned}
\Delta_{v_i}&=\exp\Big\{v^TWh+\frac12v^TLv\Big\}\\
&=\exp\Big\{\sum_{\hat i=1}^D\sum_{j=1}^Pv_{\hat i}w_{\hat ij}h_j+\frac12\sum_{\hat i=1}^D\sum_{k=1}^Dv_{\hat i}L_{\hat ik}v_k\Big\}\\
&=\exp\Big\{\sum_{\hat i=1,\,\hat i\neq i}^D\sum_{j=1}^Pv_{\hat i}w_{\hat ij}h_j+\sum_{j=1}^Pv_iw_{ij}h_j+\frac12\Big(\sum_{\hat i=1,\,\hat i\neq i}^D\sum_{k=1,\,k\neq i}^Dv_{\hat i}L_{\hat ik}v_k+\sum_{\hat i=1,\,\hat i\neq i}^Dv_{\hat i}L_{\hat ii}v_i+\sum_{k=1,\,k\neq i}^Dv_iL_{ik}v_k\Big)\Big\}\\
&=\exp\Big\{\sum_{\hat i=1,\,\hat i\neq i}^D\sum_{j=1}^Pv_{\hat i}w_{\hat ij}h_j+\sum_{j=1}^Pv_iw_{ij}h_j+\frac12\Big(\sum_{\hat i=1,\,\hat i\neq i}^D\sum_{k=1,\,k\neq i}^Dv_{\hat i}L_{\hat ik}v_k+2\sum_{k=1,\,k\neq i}^Dv_iL_{ik}v_k\Big)\Big\}\\
&=\exp\Big\{v_i\Big(\sum_{j=1}^Pw_{ij}h_j+\sum_{k=1,\,k\neq i}^DL_{ik}v_k\Big)+\sum_{\hat i=1,\,\hat i\neq i}^D\sum_{j=1}^Pv_{\hat i}w_{\hat ij}h_j+\frac12\sum_{\hat i=1,\,\hat i\neq i}^D\sum_{k=1,\,k\neq i}^Dv_{\hat i}L_{\hat ik}v_k\Big\}
\end{aligned}
$$
Only the first term depends on $v_i$, so
$$
\Delta_{v_i=0}=\exp\Big\{\sum_{\hat i=1,\,\hat i\neq i}^D\sum_{j=1}^Pv_{\hat i}w_{\hat ij}h_j+\frac12\sum_{\hat i=1,\,\hat i\neq i}^D\sum_{k=1,\,k\neq i}^Dv_{\hat i}L_{\hat ik}v_k\Big\}
$$
$$
\Delta_{v_i=1}=\exp\Big\{\sum_{j=1}^Pw_{ij}h_j+\sum_{k=1,\,k\neq i}^DL_{ik}v_k+\sum_{\hat i=1,\,\hat i\neq i}^D\sum_{j=1}^Pv_{\hat i}w_{\hat ij}h_j+\frac12\sum_{\hat i=1,\,\hat i\neq i}^D\sum_{k=1,\,k\neq i}^Dv_{\hat i}L_{\hat ik}v_k\Big\}
$$
Therefore, dividing the numerator and denominator by $\Delta_{v_i=0}$,
$$
\begin{aligned}
p(v_i=1|h,v_{-i})&=\frac{\Delta_{v_i=1}}{\Delta_{v_i=0}+\Delta_{v_i=1}}\\
&=\frac{\exp\Big\{\sum_{j=1}^Pw_{ij}h_j+\sum_{k=1,\,k\neq i}^DL_{ik}v_k\Big\}}{1+\exp\Big\{\sum_{j=1}^Pw_{ij}h_j+\sum_{k=1,\,k\neq i}^DL_{ik}v_k\Big\}}\\
&=\frac1{1+\exp\Big\{-\Big(\sum_{j=1}^Pw_{ij}h_j+\sum_{k=1,\,k\neq i}^DL_{ik}v_k\Big)\Big\}}\\
&=\sigma\Big(\sum_{j=1}^Pw_{ij}h_j+\sum_{k=1,\,k\neq i}^DL_{ik}v_k\Big)
\end{aligned}
$$
By the same argument,
$$
p(h_j=1|v,h_{-j})=\sigma\Big(\sum_{i=1}^Dw_{ij}v_i+\sum_{m=1,\,m\neq j}^PJ_{jm}h_m\Big)
$$
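The closed forms can be verified against the defining ratio $\Delta_{v_i=1}/(\Delta_{v_i=0}+\Delta_{v_i=1})$ on random weights. This is only a numerical spot check under the stated assumptions ($L$ and $J$ symmetric with zero diagonal); sizes, seed and function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unnormalised_p(v, h, W, L, J):
    """exp(-E(v, h)); the partition function cancels in every ratio below."""
    return np.exp(v @ W @ h + 0.5 * v @ L @ v + 0.5 * h @ J @ h)

def visible_conditional_by_ratio(i, v, h, W, L, J):
    """p(v_i = 1 | h, v_{-i}) straight from the definition."""
    v_on, v_off = v.copy(), v.copy()
    v_on[i], v_off[i] = 1.0, 0.0
    on = unnormalised_p(v_on, h, W, L, J)
    off = unnormalised_p(v_off, h, W, L, J)
    return on / (on + off)

def hidden_conditional_by_ratio(j, v, h, W, L, J):
    """p(h_j = 1 | v, h_{-j}) straight from the definition."""
    h_on, h_off = h.copy(), h.copy()
    h_on[j], h_off[j] = 1.0, 0.0
    on = unnormalised_p(v, h_on, W, L, J)
    off = unnormalised_p(v, h_off, W, L, J)
    return on / (on + off)

def symmetric_zero_diag(rng, n):
    A = rng.normal(size=(n, n))
    A = (A + A.T) / 2.0
    np.fill_diagonal(A, 0.0)
    return A

rng = np.random.default_rng(6)
D, P = 4, 3
W = rng.normal(size=(D, P))
L = symmetric_zero_diag(rng, D)
J = symmetric_zero_diag(rng, P)
v = rng.integers(0, 2, size=D).astype(float)
h = rng.integers(0, 2, size=P).astype(float)

# Zero diagonals make the full dot products equal to the sums that skip i (or j).
visible_closed_form = sigmoid(W @ h + L @ v)
hidden_closed_form = sigmoid(v @ W + J @ h)
visible_ratio = np.array(
    [visible_conditional_by_ratio(i, v, h, W, L, J) for i in range(D)])
hidden_ratio = np.array(
    [hidden_conditional_by_ratio(j, v, h, W, L, J) for j in range(P)])
```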
5. Variational inference based on mean-field theory
The evidence lower bound (ELBO), written $\mathcal{L}$ here to avoid a clash with the weight matrix $L$, is
$$
\mathcal{L}=\mathrm{ELBO}=\log p_\theta(v)-KL\big(q_\phi(h|v)\,\|\,p_\theta(h|v)\big)=\sum_hq_\phi(h|v)\log p_\theta(v,h)+H[q]
$$
The mean-field assumption factorises $q_\phi$ over the hidden units:
$$
q_\phi(h|v)=\prod_{j=1}^Pq_\phi(h_j|v),\qquad q_\phi(h_j=1|v)=\phi_j,\qquad\phi=\{\phi_j\}_{j=1}^P
$$
$$
\begin{aligned}
\hat\phi_j&=\arg\max_{\phi_j}\mathcal{L}\\
&=\arg\max_{\phi_j}\sum_hq_\phi(h|v)\Big[-\log Z+v^TWh+\frac12v^TLv+\frac12h^TJh\Big]+H[q]\\
&=\arg\max_{\phi_j}\sum_hq_\phi(h|v)\Big[-\log Z+\frac12v^TLv\Big]+\sum_hq_\phi(h|v)\Big[v^TWh+\frac12h^TJh\Big]+H[q]\\
&=\arg\max_{\phi_j}\sum_hq_\phi(h|v)\Big[v^TWh+\frac12h^TJh\Big]+H[q]\\
&=\arg\max_{\phi_j}\underset{①}{\underbrace{\sum_hq_\phi(h|v)\cdot v^TWh}}+\underset{②}{\underbrace{\frac12\sum_hq_\phi(h|v)\cdot h^TJh}}+\underset{③}{\underbrace{H[q]}}
\end{aligned}
$$
Here $\mathcal{L}$ denotes the ELBO. The terms $-\log Z$ and $\frac12v^TLv$ do not depend on $h$, and $\sum_hq_\phi(h|v)=1$, so they are constant in $\phi$ and drop out of the arg max.
$$
\begin{aligned}
①&=\sum_hq_\phi(h|v)\cdot\sum_{i=1}^D\sum_{j=1}^Pv_iw_{ij}h_j\\
&=\sum_h\prod_{\hat j=1}^Pq_\phi(h_{\hat j}|v)\cdot\sum_{i=1}^D\sum_{j=1}^Pv_iw_{ij}h_j
\end{aligned}
$$
Take a single term, for example $i=1$, $j=2$. Summing out every hidden unit except $h_2$ gives 1, so
$$
\begin{aligned}
\sum_h\prod_{\hat j=1}^Pq_\phi(h_{\hat j}|v)\cdot v_1w_{12}h_2&=\sum_{h_2}q_\phi(h_2|v)\cdot v_1w_{12}h_2\cdot\sum_{h\setminus h_2}\prod_{\hat j=1,\,\hat j\neq2}^Pq_\phi(h_{\hat j}|v)\\
&=\sum_{h_2}q_\phi(h_2|v)\cdot v_1w_{12}h_2\\
&=q_\phi(h_2=1|v)\cdot v_1w_{12}\\
&=\phi_2v_1w_{12}
\end{aligned}
$$
Therefore,
$$
①=\sum_{i=1}^D\sum_{\hat j=1}^P\phi_{\hat j}v_iw_{i\hat j}
$$
Similarly, keeping the factor $\frac12$ from the definition of ② and using $J_{\hat j\hat j}=0$,
$$
②=\frac12\sum_{\hat j=1}^P\sum_{m=1,\,m\neq\hat j}^P\phi_{\hat j}\phi_mJ_{\hat jm}
$$
and the entropy of the factorised distribution is
$$
③=-\sum_{j=1}^P\Big[\phi_j\log\phi_j+(1-\phi_j)\log(1-\phi_j)\Big]
$$
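The three closed forms can be compared with the defining sums over all $2^P$ hidden configurations. A small sketch, assuming $J$ is symmetric with zero diagonal; the values of `phi`, the sizes and the function names are made up for the check.

```python
import itertools
import numpy as np

def mean_field_terms(phi, v, W, J):
    """Closed forms of terms (1), (2), (3)."""
    term1 = v @ W @ phi
    term2 = 0.5 * phi @ J @ phi   # zero diagonal of J: only pairs with m != j contribute
    term3 = -np.sum(phi * np.log(phi) + (1 - phi) * np.log(1 - phi))
    return term1, term2, term3

def brute_force_terms(phi, v, W, J):
    """The same three terms as explicit sums over every hidden configuration h."""
    t1 = t2 = t3 = 0.0
    for bits in itertools.product([0.0, 1.0], repeat=phi.size):
        h = np.array(bits)
        q = np.prod(np.where(h == 1.0, phi, 1.0 - phi))   # q(h|v) = prod_j q(h_j|v)
        t1 += q * (v @ W @ h)
        t2 += q * 0.5 * (h @ J @ h)
        t3 -= q * np.log(q)                               # entropy H[q]
    return t1, t2, t3

rng = np.random.default_rng(7)
D, P = 3, 4
W = rng.normal(size=(D, P))
A = rng.normal(size=(P, P))
J = (A + A.T) / 2.0
np.fill_diagonal(J, 0.0)
v = np.array([1.0, 1.0, 0.0])
phi = rng.uniform(0.1, 0.9, size=P)

closed = np.array(mean_field_terms(phi, v, W, J))
brute = np.array(brute_force_terms(phi, v, W, J))
```

The closed forms cost $O(DP+P^2)$ instead of $O(2^P)$, which is the point of the mean-field factorisation.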
Take the partial derivative of ①, ② and ③ with respect to $\phi_j$:
$$
\frac{\partial①}{\partial\phi_j}=\sum_{i=1}^Dv_iw_{ij}
$$
$$
\frac{\partial②}{\partial\phi_j}=\sum_{m=1,\,m\neq j}^P\phi_mJ_{jm}
$$
(each pair $(j,m)$ appears twice in ② because $J$ is symmetric, which cancels the factor $\frac12$), and
$$
\frac{\partial③}{\partial\phi_j}=-\log\frac{\phi_j}{1-\phi_j}
$$
Setting
$$
\frac{\partial\Big[①+②+③\Big]}{\partial\phi_j}=0
$$
gives
$$
\phi_j=\sigma\Big(\sum_{i=1}^Dv_iw_{ij}+\sum_{m=1,\,m\neq j}^P\phi_mJ_{jm}\Big)
$$
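This is a fixed-point equation, since $\phi_j$ depends on the other $\phi_m$. It is solved by coordinate ascent: update one $\phi_j$ at a time with the others held fixed, and repeat until convergence, which yields
$$
\hat\phi=\{\hat\phi_j\}^P_{j=1}
$$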
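A minimal sketch of the coordinate-ascent iteration for the fixed-point equation. The initial value $\phi_j=0.5$, the stopping tolerance, the sweep limit and the small random couplings are assumptions chosen for illustration; with couplings this small the update map is a contraction, so the loop settles quickly, which is not guaranteed for arbitrary $J$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(phi, v, W, J):
    """Terms (1) + (2) + (3): the part of the ELBO that depends on phi."""
    entropy = -np.sum(phi * np.log(phi) + (1 - phi) * np.log(1 - phi))
    return v @ W @ phi + 0.5 * phi @ J @ phi + entropy

def mean_field(v, W, J, tol=1e-10, max_sweeps=500):
    """Coordinate ascent on phi; J is assumed symmetric with zero diagonal."""
    phi = np.full(W.shape[1], 0.5)
    for _ in range(max_sweeps):
        old = phi.copy()
        for j in range(phi.size):
            # J[j, j] == 0, so J[j] @ phi is the sum over m != j.
            phi[j] = sigmoid(v @ W[:, j] + J[j] @ phi)
        if np.max(np.abs(phi - old)) < tol:
            break
    return phi

rng = np.random.default_rng(5)
D, P = 5, 4
W = rng.normal(size=(D, P))
A = rng.uniform(-1.0, 1.0, size=(P, P))
J = 0.25 * (A + A.T)          # symmetric, entries in [-0.5, 0.5]
np.fill_diagonal(J, 0.0)
v = rng.integers(0, 2, size=D).astype(float)

phi_init = np.full(P, 0.5)
phi_hat = mean_field(v, W, J)
# How far phi_hat is from satisfying the fixed-point equation.
residual = np.max(np.abs(phi_hat - sigmoid(v @ W + J @ phi_hat)))
```

Each single-coordinate update maximises the objective in that coordinate, so `objective(phi_hat, v, W, J)` is never below its value at the starting point.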
Next chapter: 白板推导系列笔记(二十九)-深度玻尔兹曼机 (Whiteboard Derivation Series Notes, Part 29: Deep Boltzmann Machine)