HMM - the parameter learning problem (estimating the parameters)

        First, if we know an observation sequence and its corresponding state sequence  \left \{ \left (o _{1},i_{1} \right ),\left (o _{2},i_{2} \right ),...,\left (o _{T},i_{T} \right ) \right \}, we can estimate the initial state probabilities (π), the state transition probabilities (A), and the observation (emission) probabilities (B) directly from frequencies:

a_{ij}=\frac{A_{ij}}{\sum_{n=1}^{N}A_{in}}

b_{i}\left ( x \right )=\frac{B_{i}\left ( x \right )}{\sum_{m=1}^{M}B_{i}\left ( m \right )}

        Take  a_{ij} as an example: count  A_{ij} , the number of transitions from state i to state j in the state sequence, and divide by the total number of transitions out of state i, \sum_{n=1}^{N}A_{in}, to obtain the corresponding transition probability.
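        As a sketch, the frequency-based estimates can be computed by direct counting (a minimal example; the integer-coded states and observations below are made up for illustration):

```python
import numpy as np

# Hypothetical labeled data: one sequence of (o_t, i_t) pairs,
# with N hidden states and M observation symbols, integer-coded.
N, M = 3, 2
states = [0, 0, 1, 2, 1, 1, 2, 0]   # i_1 .. i_T
obs    = [0, 1, 1, 0, 0, 1, 1, 0]   # o_1 .. o_T

# Count transitions A_ij and emissions B_i(x).
A_count = np.zeros((N, N))
B_count = np.zeros((N, M))
for t in range(len(states) - 1):
    A_count[states[t], states[t + 1]] += 1
for i_t, o_t in zip(states, obs):
    B_count[i_t, o_t] += 1

# Normalize each row by its total count to get the probabilities.
a = A_count / A_count.sum(axis=1, keepdims=True)   # a_ij
b = B_count / B_count.sum(axis=1, keepdims=True)   # b_i(x)
```

Each row of `a` and `b` sums to 1, since the counts are divided by the row totals.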

        However, in the HMMs used for speech recognition we usually do not have the hidden state sequence corresponding to the observations. How should we then estimate the parameters \lambda =\left ( \pi ,A,B \right )?

        There are two methods: the Viterbi learning algorithm (hard alignment) and the Baum-Welch learning algorithm (soft alignment).

         Given an observation sequence O=\left (o _{1},o _{2},...,o _{T} \right ), solve for the parameters \lambda =\left ( \pi ,A,B \right ) that maximize P\left ( O|\lambda \right ).

        Compared with the problem at the beginning of the article, the state sequence is now unknown. This is a parameter-estimation problem with hidden (latent) variables, which calls for the EM algorithm.

        The full name of the EM algorithm is Expectation-Maximization. EM is often listed among the top ten algorithms of machine learning. It was formally proposed by Arthur P. Dempster, Nan Laird and Donald Rubin in 1977, and it is an iterative algorithm for estimating unknown parameters in models with unobserved variables.

The algorithm flow of EM is relatively simple, and its steps are as follows:

step1: Initialize distribution parameters

step2: Repeat steps E and M until convergence:

a)  E-step (expectation): using the current parameter values, compute the expected values of the unknown (hidden) variables and use them in place of the missing values.

b)  M-step (maximization): using those expected values, compute the maximum likelihood estimate of the parameters.
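The two steps can be illustrated on a toy problem outside the HMM setting: fitting a two-component 1-D Gaussian mixture, where the hidden variable is which component generated each point. All data and initial values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: a mixture of two 1-D Gaussians at -2 and 3.
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

# step1: initialize distribution parameters (weights, means, variances)
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):  # step2: repeat E and M until convergence
    # E-step: posterior probability (responsibility) of each component
    dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximum-likelihood update of the parameters given r
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
```

After convergence the estimated means land near the true component centers, even though no point was ever labeled with its component.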

1. Viterbi learning algorithm (hard alignment)

        First, initialize an HMM model parameter set \lambda =\left ( \pi ,A,B \right ).

        Algorithm steps:

        1) Find the optimal state path with the Viterbi algorithm (hard alignment).

        2) This yields an aligned sequence of observations and states \left \{ \left (o _{1},i_{1} \right ),\left (o _{2},i_{2} \right ),...,\left (o _{T},i_{T} \right ) \right \}, from which new model parameters \lambda =\left ( \pi ,A,B \right ) can be computed by counting, as above.

        3) Repeat 1) and 2) until convergence.
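        The loop above can be sketched as follows (a minimal single-sequence version, assuming integer-coded observations and a random initial model; the add-one smoothing on the counts is an implementation choice to keep probabilities nonzero, not part of the algorithm as stated):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path for obs under (pi, A, B), in the log domain."""
    T, N = len(obs), len(pi)
    logd = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)      # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):               # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def viterbi_train(obs, N, M, n_iter=10, seed=0):
    """Hard-alignment (Viterbi) re-estimation for a single sequence."""
    rng = np.random.default_rng(seed)
    pi = np.full(N, 1.0 / N)
    A = rng.dirichlet(np.ones(N), size=N)
    B = rng.dirichlet(np.ones(M), size=N)
    for _ in range(n_iter):
        path = viterbi(obs, pi, A, B)           # step 1: best path (hard alignment)
        # step 2: re-estimate by counting, with add-one smoothing
        Ac = np.ones((N, N)); Bc = np.ones((N, M)); pc = np.ones(N)
        pc[path[0]] += 1
        for t in range(len(path) - 1):
            Ac[path[t], path[t + 1]] += 1
        for s, o in zip(path, obs):
            Bc[s, o] += 1
        pi = pc / pc.sum()
        A = Ac / Ac.sum(axis=1, keepdims=True)
        B = Bc / Bc.sum(axis=1, keepdims=True)
    return pi, A, B
```

Because each iteration commits to a single best path, the re-estimation step is exactly the counting procedure from the beginning of the article.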

2. Baum-Welch learning algorithm (soft alignment)

        For a given hidden state sequence I and observation sequence O, the log-likelihood function of the parameters \lambda is:

L\left ( \lambda \right ) = logP\left ( I,O|\lambda \right )

        Given the model \lambda and the observations O, the conditional probability of the hidden state sequence I is:

P=P\left ( I|O,\lambda \right )

        Then define the Q function: the expectation of the log-likelihood of the complete data, taken with respect to the conditional distribution of the hidden variables given the observations and the current parameter estimate (the E-step); the M-step then simply maximizes it.

        Q\left ( \lambda, \overline{\lambda } \right )=\sum_{I}^{}logP\left ( O,I|\lambda \right )P\left ( I|O,\overline{\lambda } \right )

        The first  \lambda is the parameter over which we maximize, and the second,  \overline{\lambda }, is the current estimate of the parameter. Each iteration seeks the maximum of the Q function.

        For the second factor:

        P\left ( I|O,\overline{\lambda } \right )=\frac{P\left ( I,O|\overline{\lambda } \right )}{P\left ( O|\overline{\lambda } \right )}

        Here the denominator is a constant with respect to the parameters being estimated, so it can be dropped.

        The equivalent Q function (up to that constant factor) is:

       Q\left ( \lambda, \overline{\lambda } \right )=\sum_{I}^{}logP\left ( O,I|\lambda \right )P\left ( I,O|\overline{\lambda } \right )

        where:

        P\left ( O,I|\lambda \right )=\pi _{i_{1}}\prod_{t=1}^{T-1}a_{i_{t}i_{t+1}}\prod_{t=1}^{T}b_{i_{t}}\left ( o_{t} \right )

        The logarithmic expansion of the Q function gives:

        Q\left ( \lambda, \overline{\lambda } \right )= \sum_{I}\left (log\pi _{i_{1}}+\sum_{t=1}^{T-1}loga_{i_{t}i_{t+1}}+\sum_{t=1}^{T}logb_{i_{t}}\left ( o_{t} \right ) \right )P\left ( I,O|\overline{\lambda } \right )

        It can be found by observation that the three items in the brackets are the parameters we want to solve, and they are additive to each other, so to find the overall maximum value, we only need to maximize them separately (M-step).

        Take the first item as an example:

        \sum_{I}log\pi _{i_{1}}P\left ( I,O|\overline{\lambda } \right )=\sum_{n=1}^{N}log\pi _{n}P\left ( O,i_{1}=n| \overline{\lambda }\right ),\quad \sum_{n=1}^{N}\pi _{n}=1

        This is a constrained extremum problem, solved with the Lagrange multiplier method.
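        For \pi the multiplier step can be sketched as follows (a standard derivation, written out here for completeness; \gamma denotes the multiplier). The Lagrangian is:

        \mathcal{L}=\sum_{n=1}^{N}log\pi _{n}P\left ( O,i_{1}=n|\overline{\lambda } \right )+\gamma \left ( \sum_{n=1}^{N}\pi _{n}-1 \right )

        Setting \partial \mathcal{L}/\partial \pi _{n}=0 gives \pi _{n}\propto P\left ( O,i_{1}=n|\overline{\lambda } \right ); summing over n fixes \gamma =-P\left ( O|\overline{\lambda } \right ), so:

        \pi _{n}=\frac{P\left ( O,i_{1}=n|\overline{\lambda } \right )}{P\left ( O|\overline{\lambda } \right )}=P\left ( i_{1}=n|O,\overline{\lambda } \right )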

        The updated parameters are then substituted back into the Q function for the next iteration. When the parameters change very little, the iteration stops and \lambda =\left ( \pi ,A,B \right ) is returned.

        The Viterbi learning algorithm is called hard alignment because, during EM, each frame is assigned to exactly one state: the state occupancy is either 0 or 1 (the computed state sequence is fixed). The Baum-Welch algorithm does not commit to a specific state; it computes posterior probabilities, so each frame belongs to each state with some probability, which is why it is called soft alignment.
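        The full Baum-Welch re-estimation can be sketched with a scaled forward-backward pass (a minimal single-sequence version; the names gamma and xi for the state and transition posteriors follow common usage and are not defined in the text above):

```python
import numpy as np

def baum_welch(obs, pi, A, B, n_iter=20):
    """Soft-alignment (Baum-Welch / EM) re-estimation for one sequence.
    obs: integer-coded observations; pi, A, B: initial HMM parameters."""
    T, N = len(obs), len(pi)
    for _ in range(n_iter):
        # E-step: scaled forward-backward recursions
        alpha = np.zeros((T, N)); beta = np.zeros((T, N)); c = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
        gamma = alpha * beta                     # gamma[t, i] = P(i_t = i | O)
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = np.zeros((N, N))                    # expected transition counts
        for t in range(T - 1):
            xi += alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / c[t + 1]
        # M-step: closed-form maximizers of the Q function
        pi = gamma[0]
        A = xi / xi.sum(axis=1, keepdims=True)
        B = np.zeros_like(B)
        for t in range(T):
            B[:, obs[t]] += gamma[t]
        B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```

        Note how the M-step mirrors the counting formulas at the top of the article, with hard counts replaced by the expected (soft) counts gamma and xi.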


Origin blog.csdn.net/weixin_43284996/article/details/127336621