[Machine Learning] The Expectation Maximization (EM) Algorithm

1. Introduction

The EM algorithm's full name is the Expectation Maximization Algorithm. It is an iterative algorithm for maximum likelihood estimation, or maximum a posteriori estimation, of probabilistic parametric models that contain hidden (latent) variables.

We often estimate model parameters from observed sample data, and the most common approach is to maximize the log-likelihood function of the model distribution.

  • However, in some cases the observed data are accompanied by unobserved hidden data. Since neither the hidden data nor the model parameters are known, we cannot directly maximize the log-likelihood function to obtain the parameters of the model distribution.

How to do it? This is where the EM algorithm comes in handy.

  • The EM algorithm tackles this problem with a heuristic iterative method. Since we cannot obtain the model distribution parameters directly, we first guess the hidden data (the E step of the EM algorithm), and then maximize the log-likelihood of the observed data together with the guessed hidden data to solve for the model parameters (the M step of the EM algorithm).
  • Since the hidden data were guessed, the model parameters obtained this way are generally not yet the ones we want. But it does not matter: based on the current model parameters, we guess the hidden data again (the E step), then maximize the log-likelihood again and update the model parameters (the M step). We iterate in this way until the model distribution parameters essentially stop changing; the algorithm has then converged and suitable model parameters have been found.

As this description shows, the EM algorithm iteratively solves for a maximum, and each iteration is divided into two steps: the E step and the M step. The hidden data and the model distribution parameters are updated round by round until convergence, at which point we have obtained the model parameters we need.

  • One of the most intuitive illustrations of the EM idea is the K-Means algorithm (see the earlier post on the principle of the K-Means clustering algorithm). In K-Means clustering, the cluster assignment of each sample plays the role of the hidden data.
  • We first assume K initial centroids. Assigning each sample to its nearest centroid corresponds to the E step of the EM algorithm; recomputing each centroid from the samples assigned to it corresponds to the M step. Repeating the E step and M step until the centroids no longer change completes the K-Means clustering.

Of course, the K-Means algorithm is relatively simple, but practical problems are often not. Next, we describe the EM algorithm precisely in mathematical language.
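To make the analogy concrete, here is a minimal, hypothetical sketch of 1-D K-Means written as an explicit E/M loop (the data, function name, and seed are all made up for illustration):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy 1-D K-Means viewed as EM: the E step guesses the hidden cluster
    assignments; the M step re-optimizes the centroids given those guesses."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # E step: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[j].append(p)
        # M step: move each centroid to the mean of its assigned points.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sorted(centroids)

print(kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], k=2))
```

On this toy data the two centroids settle near 1.0 and 9.0 regardless of which two points happen to be drawn as the initial guess.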

2. Mathematical description of EM algorithm

For $M$ observed data samples $X = \{x_1, x_2, x_3, \dots, x_M\}$, we seek the model parameters $\theta$ that maximize the log-likelihood of the model distribution:

$$\theta^{*} = \mathop{\arg\max}_{\theta} \sum_{i=1}^{M} \log P_{\theta}(x_i) \tag{1}$$

Now assume the observed data are accompanied by unobserved hidden data $C = \{c_1, c_2, c_3, \dots, c_K\}$. The maximization problem becomes:

$$\theta^{*} = \mathop{\arg\max}_{\theta} \sum_{i=1}^{M} \log P_{\theta}(x_i) = \mathop{\arg\max}_{\theta} \sum_{i=1}^{M} \log \sum_{k=1}^{K} P_{\theta}(x_i, c_k) \tag{2}$$
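As a concrete illustration, the inner sum over hidden states in equation (2) can be evaluated numerically for a hypothetical two-component Gaussian mixture, where $P_\theta(x_i, c_k) = \pi_k \, N(x_i; \mu_k, \sigma_k)$; all parameter values below are made up:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical mixture parameters theta = (pis, mus, sigmas).
pis    = [0.4, 0.6]
mus    = [0.0, 5.0]
sigmas = [1.0, 1.0]
data   = [0.5, 1.2, 4.7, 5.3]

# Incomplete-data log-likelihood: sum_i log sum_k P_theta(x_i, c_k).
loglik = sum(
    math.log(sum(p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas)))
    for x in data
)
print(loglik)  # the quantity EM seeks to maximize over (pis, mus, sigmas)
```

Because the sum over $c_k$ sits inside the logarithm, this objective has no closed-form maximizer in $\theta$, which is exactly the difficulty the EM algorithm addresses.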
There is no way to solve for $\theta$ directly from equation (2), so some special technique is needed. We first bound the expression as follows:
$$\begin{aligned} \sum_{i=1}^{M} \log \sum_{k=1}^{K} P_{\theta}(x_i, c_k) &= \sum_{i=1}^{M} \log \sum_{k=1}^{K} Q(c_k) \frac{P_{\theta}(x_i, c_k)}{Q(c_k)} &\text{(3)} \\ &\geq \sum_{i=1}^{M} \sum_{k=1}^{K} Q(c_k) \log \frac{P_{\theta}(x_i, c_k)}{Q(c_k)} &\text{(4)} \end{aligned}$$
Equation (3) introduces a new, as-yet-unknown distribution $Q(c)$; equation (4) uses Jensen's inequality:
$$\log \sum_j q_j x_j \geq \sum_j q_j \log x_j \tag{5}$$
Specifically, since the logarithm is a concave function, we have:

$$f(\mathbb{E}(x)) \geq \mathbb{E}(f(x)) \quad \text{if } f(x) \text{ is concave (with } \leq \text{ instead if } f(x) \text{ is convex)} \tag{6}$$
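A quick numeric sanity check of inequality (5), with arbitrary made-up weights and values:

```python
import math

# Arbitrary toy values: q must be a distribution, x must be positive.
q = [0.2, 0.5, 0.3]
x = [1.0, 4.0, 9.0]

lhs = math.log(sum(qj * xj for qj, xj in zip(q, x)))  # log of the expectation
rhs = sum(qj * math.log(xj) for qj, xj in zip(q, x))  # expectation of the log

# Concavity of log: lhs >= rhs, with equality only when all x_j are equal.
print(lhs >= rhs)  # → True
```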
For the equality sign in Jensen's inequality (4) to hold, we need:
$$\frac{P_{\theta}(x_i, c_k)}{Q(c_k)} = t, \quad t \text{ a constant} \tag{7}$$
Since $Q(c)$ is a distribution, we also have:

$$\sum_{k=1}^{K} Q(c_k) = 1 \tag{8}$$
Combining the two equations above gives:

$$Q(c_k) = \frac{P_{\theta}(x_i, c_k)}{\sum_{k=1}^{K} P_{\theta}(x_i, c_k)} = \frac{P_{\theta}(x_i, c_k)}{P_{\theta}(x_i)} = P_{\theta}(c_k|x_i) \tag{9}$$
Therefore, if $Q(c_k) = P_{\theta}(c_k|x_i)$, then equation (4) is a lower bound on our log-likelihood with hidden data, and maximizing this lower bound pushes up the log-likelihood itself. That is, we need to maximize the following formula:
$$\mathop{\arg\max}_{\theta} \sum_{i=1}^{M} \sum_{k=1}^{K} Q(c_k) \log \frac{P_{\theta}(x_i, c_k)}{Q(c_k)} \tag{10}$$
Removing the part that is constant with respect to $\theta$ (since $Q(c_k)$ is held fixed during this maximization, the $-Q(c_k)\log Q(c_k)$ terms are constants), the lower bound on the log-likelihood that we need to maximize becomes:

$$\mathop{\arg\max}_{\theta} \sum_{i=1}^{M} \sum_{k=1}^{K} Q(c_k) \log P_{\theta}(x_i, c_k) \tag{11}$$
The formula above is the M step of the EM algorithm. What about the E step? Note that $Q(c_k)$ in the formula is a distribution, so $\sum_{k=1}^{K} Q(c_k) \log P_{\theta}(x_i, c_k)$ can be understood as the expectation of $\log P_{\theta}(x_i, c_k)$ under the conditional probability distribution $Q(c_k)$.

So far, we understand the specific mathematical meaning of the E step and M step in the EM algorithm.
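The E-step choice $Q(c_k) = P_\theta(c_k|x_i)$ from equation (9) is straightforward to compute. A minimal sketch for a hypothetical two-component 1-D Gaussian mixture (all parameter values are assumptions for illustration):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical current parameters theta = (weights, means, stds).
weights = [0.5, 0.5]
means   = [0.0, 5.0]
stds    = [1.0, 1.0]

def responsibilities(x):
    """E step for one sample: Q(c_k) = P(x, c_k) / sum_k P(x, c_k) = P(c_k | x)."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

q = responsibilities(1.0)
print(q)          # a valid distribution over the two components
print(sum(q))     # sums to (numerically) 1 by construction
```

A sample near 1.0 is far more plausible under the first component, so almost all of the responsibility mass lands on it.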

3. EM algorithm process

  • Input
    • Observed data: $X = \{x_1, x_2, x_3, \dots, x_M\}$
    • Joint distribution: $P_\theta(x, c)$
    • Conditional distribution: $P_\theta(c \mid x)$
    • Maximum number of iterations: $S$
  • Main process
    • 1) Randomly initialize the model parameters $\theta$ with an initial value $\theta^0$.
    • 2) Iterate the EM steps (for $s = 0, 1, \dots, S-1$):
      • E step: with the current parameters $\theta^s$, compute
        $$Q(c_k) = P_{\theta^s}(c_k|x_i) \tag{12}$$
        $$L(\theta, \theta^s) = \sum_{i=1}^{M} \sum_{k=1}^{K} Q(c_k) \log P_{\theta}(x_i, c_k) \tag{13}$$
      • M step: maximize $L(\theta, \theta^s)$ to obtain $\theta^{s+1}$:
        $$\theta^{s+1} = \mathop{\arg\max}_{\theta} L(\theta, \theta^s) \tag{14}$$
      • If $\theta^{s+1}$ has converged, the algorithm ends; otherwise, continue iterating.
  • Output
    • Model parameters $\theta$
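The process above can be sketched end to end for a two-component 1-D Gaussian mixture, where both the E step and the M step have closed forms (the data, function names, and initial values are all hypothetical):

```python
import math

def pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm(data, mus, sigmas, pis, max_iter=100, tol=1e-6):
    """EM for a K-component 1-D Gaussian mixture.

    E step: responsibilities Q(c_k) = P(c_k | x_i) under the current parameters.
    M step: closed-form weighted updates of the means, variances, and weights.
    """
    K = len(mus)
    for _ in range(max_iter):
        # E step: compute Q(c_k | x_i) for every sample.
        resp = []
        for x in data:
            joint = [pis[k] * pdf(x, mus[k], sigmas[k]) for k in range(K)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M step: maximize sum_i sum_k Q(c_k) log P_theta(x_i, c_k) in closed form.
        new_mus, new_sigmas, new_pis = [], [], []
        for k in range(K):
            nk = sum(r[k] for r in resp)
            mu = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, data)) / nk
            new_mus.append(mu)
            new_sigmas.append(max(math.sqrt(var), 1e-6))  # guard against collapse
            new_pis.append(nk / len(data))
        converged = max(abs(a - b) for a, b in zip(mus, new_mus)) < tol
        mus, sigmas, pis = new_mus, new_sigmas, new_pis
        if converged:  # parameters essentially stopped changing
            break
    return mus, sigmas, pis

data = [0.9, 1.1, 1.0, 0.8, 5.2, 4.8, 5.0, 5.1]
mus, sigmas, pis = em_gmm(data, mus=[0.0, 6.0], sigmas=[1.0, 1.0], pis=[0.5, 0.5])
print(sorted(mus))  # the two means settle near the clusters around 1 and 5
```

The convergence test here compares successive parameter estimates, mirroring step 2's stopping rule; in practice one can equally monitor the log-likelihood itself.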

4. Two questions

4.1 How does the EM algorithm ensure convergence?

To prove that the EM algorithm converges, we need to show that the value of the log-likelihood function never decreases during the iterations. That is:
$$\sum_{i=1}^{M} \log P_{\theta^{s+1}}(x_i) \geq \sum_{i=1}^{M} \log P_{\theta^{s}}(x_i) \tag{15}$$
Define:

$$L(\theta, \theta^s) = \sum_{i=1}^{M} \sum_{k=1}^{K} P_{\theta^s}(c_k|x_i) \log P_{\theta}(x_i, c_k), \qquad \text{i.e. equation (13) with } Q(c_k) = P_{\theta^s}(c_k|x_i) \tag{16}$$
Similarly, define:

$$H(\theta, \theta^s) = \sum_{i=1}^{M} \sum_{k=1}^{K} P_{\theta^s}(c_k|x_i) \log P_{\theta}(c_k|x_i) \tag{17}$$
Subtracting the two gives:

$$\begin{aligned} L(\theta, \theta^s) - H(\theta, \theta^s) &= \sum_{i=1}^{M} \sum_{k=1}^{K} P_{\theta^s}(c_k|x_i) \log \frac{P_{\theta}(x_i, c_k)}{P_{\theta}(c_k|x_i)} &\text{(18)} \\ &= \sum_{i=1}^{M} \sum_{k=1}^{K} P_{\theta^s}(c_k|x_i) \log P_{\theta}(x_i) &\text{(19)} \\ &= \sum_{i=1}^{M} \log P_{\theta}(x_i) &\text{(20)} \end{aligned}$$

Equation (19) uses $P_{\theta}(x_i, c_k) / P_{\theta}(c_k|x_i) = P_{\theta}(x_i)$, and equation (20) uses the fact that $P_{\theta^s}(c_k|x_i)$ sums to 1 over $k$.
Evaluating the final expression at $\theta = \theta^{s+1}$ and at $\theta = \theta^{s}$ and subtracting, we get:

$$\begin{aligned} & \sum_{i=1}^{M} \log P_{\theta^{s+1}}(x_i) - \sum_{i=1}^{M} \log P_{\theta^{s}}(x_i) &\text{(21)} \\ = \; & [L(\theta^{s+1}, \theta^s) - L(\theta^{s}, \theta^s)] - [H(\theta^{s+1}, \theta^s) - H(\theta^{s}, \theta^s)] &\text{(22)} \end{aligned}$$
To prove the convergence of the EM algorithm, we only need to show that the right-hand side of the equation above is non-negative. Since $\theta^{s+1}$ maximizes $L(\theta, \theta^s)$, we have:

$$L(\theta^{s+1}, \theta^s) - L(\theta^{s}, \theta^s) \geq 0 \tag{23}$$
For the second bracket:

$$\begin{aligned} H(\theta^{s+1}, \theta^s) - H(\theta^{s}, \theta^s) &= \sum_{i=1}^{M} \sum_{k=1}^{K} P_{\theta^s}(c_k|x_i) \log \frac{P_{\theta^{s+1}}(c_k|x_i)}{P_{\theta^{s}}(c_k|x_i)} &\text{(24)} \\ &\leq \sum_{i=1}^{M} \log \sum_{k=1}^{K} P_{\theta^s}(c_k|x_i) \frac{P_{\theta^{s+1}}(c_k|x_i)}{P_{\theta^{s}}(c_k|x_i)} &\text{(25)} \\ &= \sum_{i=1}^{M} \log \sum_{k=1}^{K} P_{\theta^{s+1}}(c_k|x_i) = 0 &\text{(26)} \end{aligned}$$
Here, equation (25) uses Jensen's inequality, this time in the opposite direction from its use in Section 2, and equation (26) uses the property that a probability distribution sums to 1. Putting these together, we get:
$$\sum_{i=1}^{M} \log P_{\theta^{s+1}}(x_i) - \sum_{i=1}^{M} \log P_{\theta^{s}}(x_i) \geq 0 \tag{27}$$
This proves the convergence of the EM algorithm.
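The convergence argument can also be checked empirically: running EM steps on a hypothetical 1-D Gaussian mixture, the observed-data log-likelihood should never decrease from one iteration to the next (the data and initial values below are made up):

```python
import math

def pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(data, mus, sigmas, pis):
    """Observed-data log-likelihood: sum_i log sum_k pi_k N(x_i; mu_k, sigma_k)."""
    return sum(math.log(sum(p * pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas)))
               for x in data)

def em_step(data, mus, sigmas, pis):
    """One E step plus one M step for a K-component 1-D Gaussian mixture."""
    K = len(mus)
    resp = []
    for x in data:                       # E step: responsibilities
        joint = [pis[k] * pdf(x, mus[k], sigmas[k]) for k in range(K)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    mus2, sigmas2, pis2 = [], [], []
    for k in range(K):                   # M step: closed-form updates
        nk = sum(r[k] for r in resp)
        mu = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, data)) / nk
        mus2.append(mu)
        sigmas2.append(max(math.sqrt(var), 1e-6))
        pis2.append(nk / len(data))
    return mus2, sigmas2, pis2

data = [0.9, 1.1, 1.0, 5.2, 4.8, 5.0]
mus, sigmas, pis = [0.0, 6.0], [1.0, 1.0], [0.5, 0.5]
lls = [log_likelihood(data, mus, sigmas, pis)]
for _ in range(10):
    mus, sigmas, pis = em_step(data, mus, sigmas, pis)
    lls.append(log_likelihood(data, mus, sigmas, pis))

# Each iteration's log-likelihood is no smaller than the previous one.
print(all(b >= a - 1e-9 for a, b in zip(lls, lls[1:])))  # → True
```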

4.2 If the EM algorithm converges, is it guaranteed to converge to the global maximum?

It is not difficult to see from the derivation above that the EM algorithm is guaranteed to converge to a stationary point, but not necessarily to the global maximum, so in general it is a local optimization algorithm. Of course, if our optimization objective $L(\theta, \theta^s)$ is concave, then the EM algorithm does converge to the global maximum, just as with iterative algorithms such as gradient descent.

5. Summary

If we think about the EM algorithm from the perspective of algorithmic thinking, we can find:

  • What is known in the algorithm is the observed data;
  • What is unknown are the hidden data and the model parameters;
  • In the E step, we fix the values of the model parameters and optimize the distribution of the hidden data;
  • In the M step, we fix the hidden-data distribution and optimize the values of the model parameters.

Comparing with other machine learning algorithms, many use a similar alternating idea: for example, the SMO algorithm (see Support Vector Machine Principle (4): the SMO Algorithm) and coordinate descent (see Lasso Regression: a Summary of the Coordinate Descent and Least Angle Regression Methods) solve their problems in much the same spirit.

Origin blog.csdn.net/qq_51392112/article/details/133220955