A Brief Note on the Hidden Markov Model (HMM)

Thanks to B. H. Juang at the Georgia Institute of Technology.
Thanks to Wayne Ward at Carnegie Mellon University.

Introduction

Problem Formulation

Let us now talk about the Hidden Markov Model. Well, what is an HMM used for? Consider the following problem:

Given an unknown observation $O$, recognize it as one of $N$ classes with the minimum probability of error.

So how do we define the error and the error probability?
Conditional Error: Given $O$, the risk associated with deciding that it is a class $i$ event is

$$R(S_i \mid O) = \sum_{j=1}^{N} e_{ij}\, P(S_j \mid O)$$

where $P(S_j \mid O)$ is the probability that $O$ is a class $S_j$ event and $e_{ij}$ is the cost of classifying a class $j$ event as a class $i$ event, with $e_{ij} > 0$ and $e_{ii} = 0$.
Expected Error:

$$E = \int R(S(O) \mid O)\, p(O)\, dO$$

where $S(O)$ is the decision made on $O$ based on a policy. The question can then be stated as:

How should $S(O)$ be chosen to achieve the minimum error probability, i.e. so that $P(S(O) \mid O)$ is maximized?

Bayes Decision Theory

If we institute the policy $S(O) = S_i = \arg\max_{S_j} P(S_j \mid O)$, then $R(S(O) \mid O) = \min_{S_j} R(S_j \mid O)$. This is the so-called Maximum A Posteriori (MAP) decision. But how do we know $P(S_j \mid O)$, $j = 1, 2, \ldots, N$, for any $O$?
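As a toy illustration of the MAP rule (the posterior values below are invented, since we have not yet said where $P(S_j \mid O)$ comes from), the decision is just an argmax over the posteriors:

```python
import numpy as np

# Hypothetical posteriors P(S_j | O) for N = 4 classes; the numbers are made up.
posteriors = np.array([0.1, 0.5, 0.3, 0.1])

# MAP decision: choose the class with the largest posterior probability.
decision = np.argmax(posteriors)
print(f"Decide class S_{decision} (posterior {posteriors[decision]:.2f})")
```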

Markov Model

States: $S = \{S_0, S_1, S_2, \ldots, S_N\}$
Transition probabilities: $P(q_t = S_i \mid q_{t-1} = S_j)$
Markov Assumption:

$$P(q_t = S_i \mid q_{t-1} = S_j, q_{t-2} = S_k, \ldots) = P(q_t = S_i \mid q_{t-1} = S_j) = a_{ji}, \qquad a_{ji} \ge 0, \quad \sum_{i=1}^{N} a_{ji} = 1 \ \ \forall j$$

(Figure: Markov model state diagram)
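A row-stochastic transition matrix is all that is needed to simulate such a chain. Here is a minimal sketch with an invented 3-state matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state chain: A[j, i] = a_{ji} = P(q_t = S_i | q_{t-1} = S_j); each row sums to 1.
A = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.7, 0.2],
              [0.0, 0.3, 0.7]])

def sample_chain(A, start_state=0, length=10):
    """Sample a state sequence q_0, q_1, ... of the given length."""
    states = [start_state]
    for _ in range(length - 1):
        # Markov assumption: the next state depends only on the current state.
        states.append(rng.choice(len(A), p=A[states[-1]]))
    return states

print(sample_chain(A))
```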

Hidden Markov Model

States: $S = \{S_0, S_1, S_2, \ldots, S_N\}$
Transition probabilities: $P(q_t = S_i \mid q_{t-1} = S_j) = a_{ji}$
Output probability distributions (at state $j$ for symbol $k$): $P(y_t = O_k \mid q_t = S_j) = b_j(k, \lambda_j)$, parameterized by $\lambda_j$.

(Figure: Hidden Markov model)
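Concretely, these parameters can be held in two arrays. The toy model below (all numbers invented for illustration) is reused in the sketches that follow:

```python
import numpy as np

# A toy discrete-output HMM lambda = (A, B).
# States: S_0 (the start state), S_1, S_2 (taken as the end state); output symbols: O_0, O_1.
A = np.array([[0.0, 0.7, 0.3],   # A[j, i] = a_{ji} = P(q_t = S_i | q_{t-1} = S_j)
              [0.0, 0.6, 0.4],
              [0.0, 0.5, 0.5]])
B = np.array([[0.5, 0.5],        # B[j, k] = b_j(k) = P(y_t = O_k | q_t = S_j)
              [0.9, 0.1],
              [0.2, 0.8]])
obs = [0, 1, 1]                  # an observation sequence o_1 o_2 o_3, as symbol indices
```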

HMM Problems and Solutions

  1. Evaluation: Given a model, compute the probability of an observation sequence.
  2. Decoding: Find the state sequence that maximizes the probability of the observation sequence.
  3. Training: Adjust the model parameters to maximize the probability of observed sequences.

Evaluation

Compute the probability of an observation sequence $O = o_1 o_2 \cdots o_T$, given the HMM parameters $\lambda$:

$$P(O \mid \lambda) = \sum_{Q} P(O, Q \mid \lambda) = \sum_{Q} a_{q_0 q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T), \qquad Q = q_0 q_1 q_2 \cdots q_T$$

This is not practical, since the number of paths is $O(N^T)$. How do we deal with it?

Forward Algorithm

$$\alpha_t(j) = P(o_1 o_2 \cdots o_t,\, q_t = S_j \mid \lambda)$$

Compute $\alpha$ recursively:

$$\alpha_0(j) = \begin{cases} 1, & \text{if } S_j \text{ is the start state} \\ 0, & \text{otherwise} \end{cases}$$
$$\alpha_t(j) = \left[ \sum_{i=0}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t), \qquad t > 0$$

Then

$$P(O \mid \lambda) = \alpha_T(N)$$

The computation is $O(N^2 T)$.
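A minimal sketch of the forward pass on the toy parameters defined in the Hidden Markov Model section (assuming $S_0$ is the start state and the last state is the end state):

```python
import numpy as np

# Toy parameters from the earlier sketch (invented for illustration).
A = np.array([[0.0, 0.7, 0.3],
              [0.0, 0.6, 0.4],
              [0.0, 0.5, 0.5]])
B = np.array([[0.5, 0.5],
              [0.9, 0.1],
              [0.2, 0.8]])
obs = [0, 1, 1]

def forward(A, B, obs, start=0, end=None):
    """alpha[t, j] = P(o_1 ... o_t, q_t = S_j | lambda); returns P(O | lambda)."""
    N, T = A.shape[0], len(obs)
    end = N - 1 if end is None else end
    alpha = np.zeros((T + 1, N))
    alpha[0, start] = 1.0                        # alpha_0: all probability mass on the start state
    for t, o in enumerate(obs, start=1):
        # alpha_t(j) = [sum_i alpha_{t-1}(i) a_{ij}] b_j(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, o]
    return alpha[T, end]                         # P(O | lambda) = alpha_T(N)

print(forward(A, B, obs))
```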

Backward Algorithm

$$\beta_t(i) = P(o_{t+1} o_{t+2} \cdots o_T \mid q_t = S_i, \lambda)$$

Compute $\beta$ recursively:

$$\beta_T(i) = \begin{cases} 1, & \text{if } S_i \text{ is the end state} \\ 0, & \text{otherwise} \end{cases}$$
$$\beta_t(i) = \sum_{j=0}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad t < T$$

Then

$$P(O \mid \lambda) = \beta_0(0)$$

The computation is $O(N^2 T)$.
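The backward pass is the mirror image; on the same toy parameters it returns the same $P(O \mid \lambda)$ as the forward sketch:

```python
import numpy as np

# Toy parameters from the earlier sketch (invented for illustration).
A = np.array([[0.0, 0.7, 0.3],
              [0.0, 0.6, 0.4],
              [0.0, 0.5, 0.5]])
B = np.array([[0.5, 0.5],
              [0.9, 0.1],
              [0.2, 0.8]])
obs = [0, 1, 1]

def backward(A, B, obs, start=0, end=None):
    """beta[t, i] = P(o_{t+1} ... o_T | q_t = S_i, lambda); returns P(O | lambda)."""
    N, T = A.shape[0], len(obs)
    end = N - 1 if end is None else end
    beta = np.zeros((T + 1, N))
    beta[T, end] = 1.0                                # beta_T: 1 only at the end state
    for t in range(T - 1, -1, -1):
        # beta_t(i) = sum_j a_{ij} b_j(o_{t+1}) beta_{t+1}(j); obs[t] is o_{t+1} (obs is 0-indexed)
        beta[t] = A @ (B[:, obs[t]] * beta[t + 1])
    return beta[0, start]                             # P(O | lambda) = beta_0(0)

print(backward(A, B, obs))                            # matches the forward result
```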

Decoding

Find the state sequence $Q$ that maximizes $P(O, Q \mid \lambda)$.

Viterbi Algorithm

$$VP_t(i) = \max_{q_0 q_1 \cdots q_{t-1}} P(o_1 o_2 \cdots o_t,\, q_t = S_i \mid \lambda)$$

Compute $VP$ recursively:

$$VP_t(j) = \max_{i=0,1,\ldots,N} VP_{t-1}(i)\, a_{ij}\, b_j(o_t), \qquad t > 0$$

Then the probability of the best path is

$$\max_{Q} P(O, Q \mid \lambda) = VP_T(N)$$

Save each maximizing $i$ for the backtrace at the end.
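A sketch of the Viterbi recursion on the same toy parameters; the backpointer table stores each maximizing $i$ so that the best state sequence can be read off by a backtrace:

```python
import numpy as np

# Toy parameters from the earlier sketch (invented for illustration).
A = np.array([[0.0, 0.7, 0.3],
              [0.0, 0.6, 0.4],
              [0.0, 0.5, 0.5]])
B = np.array([[0.5, 0.5],
              [0.9, 0.1],
              [0.2, 0.8]])
obs = [0, 1, 1]

def viterbi(A, B, obs, start=0, end=None):
    """Return the best path q_0 ... q_T and its probability max_Q P(O, Q | lambda)."""
    N, T = A.shape[0], len(obs)
    end = N - 1 if end is None else end
    vp = np.zeros((T + 1, N))
    back = np.zeros((T + 1, N), dtype=int)        # back[t, j] = maximizing i in the recursion
    vp[0, start] = 1.0
    for t, o in enumerate(obs, start=1):
        scores = vp[t - 1][:, None] * A            # scores[i, j] = VP_{t-1}(i) * a_{ij}
        back[t] = scores.argmax(axis=0)
        vp[t] = scores.max(axis=0) * B[:, o]       # VP_t(j) = max_i VP_{t-1}(i) a_{ij} b_j(o_t)
    path = [end]                                   # backtrace from the end state
    for t in range(T, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1], vp[T, end]

print(viterbi(A, B, obs))
```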

Training

To tune $\lambda$ so as to maximize $P(O \mid \lambda)$, there is no efficient algorithm that finds the global optimum; nonetheless, an efficient iterative algorithm that finds a local optimum exists.

Baum-Welch Reestimation

Define the probability of transiting from $S_i$ to $S_j$ at time $t$, given $O$, as[^1]

$$\xi_t(i,j) = P(q_t = S_i,\, q_{t+1} = S_j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$$

Let

$$\bar{a}_{ij} = \frac{\text{Expected number of transitions from } S_i \text{ to } S_j}{\text{Expected number of transitions from } S_i} = \frac{\sum_{t=0}^{T-1} \xi_t(i,j)}{\sum_{t=0}^{T-1} \sum_{j=0}^{N} \xi_t(i,j)}$$
$$\bar{b}_j(k) = \frac{\text{Expected number of times in } S_j \text{ with symbol } k}{\text{Expected number of times in } S_j} = \frac{\sum_{t:\, o_{t+1} = k} \sum_{i=0}^{N} \xi_t(i,j)}{\sum_{t=0}^{T-1} \sum_{i=0}^{N} \xi_t(i,j)}$$

Training Algorithm:

  1. Initialize $\lambda = (A, B)$.
  2. Compute $\alpha$, $\beta$, and $\xi$.
  3. Estimate $\bar{\lambda} = (\bar{A}, \bar{B})$ from $\xi$.
  4. Replace $\lambda$ with $\bar{\lambda}$.
  5. If not converged, go to step 2.
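Putting steps 2-4 together, a single reestimation sweep over one observation sequence might look like the following sketch (same invented toy parameters; the start state $S_0$ is never re-entered here, so its emission row is simply left at zero):

```python
import numpy as np

# Toy parameters from the earlier sketches (invented for illustration).
A = np.array([[0.0, 0.7, 0.3],
              [0.0, 0.6, 0.4],
              [0.0, 0.5, 0.5]])
B = np.array([[0.5, 0.5],
              [0.9, 0.1],
              [0.2, 0.8]])
obs = np.array([0, 1, 1])
N, K, T = A.shape[0], B.shape[1], len(obs)

# Step 2: full forward/backward passes (same recursions as above), keeping every time step.
alpha = np.zeros((T + 1, N)); alpha[0, 0] = 1.0
for t, o in enumerate(obs, start=1):
    alpha[t] = (alpha[t - 1] @ A) * B[:, o]
beta = np.zeros((T + 1, N)); beta[T, N - 1] = 1.0
for t in range(T - 1, -1, -1):
    beta[t] = A @ (B[:, obs[t]] * beta[t + 1])
prob = alpha[T, N - 1]                               # P(O | lambda)

# xi[t, i, j] = P(q_t = S_i, q_{t+1} = S_j | O, lambda) for t = 0 .. T-1 (obs[t] is o_{t+1}).
xi = np.zeros((T, N, N))
for t in range(T):
    xi[t] = alpha[t][:, None] * A * B[:, obs[t]] * beta[t + 1] / prob

# Step 3: reestimate from expected counts, following the formulas above.
A_new = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
occ = xi.sum(axis=1)                                 # occ[t, j] = sum_i xi_t(i, j)
denom = occ.sum(axis=0)
B_new = np.zeros_like(B)
for k in range(K):
    num = occ[obs == k].sum(axis=0)
    B_new[:, k] = np.divide(num, denom, out=np.zeros(N), where=denom > 0)

# Step 4: replace (A, B) with (A_new, B_new) and repeat until convergence.
print(A_new); print(B_new)
```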

Reference

Need more details? Refer to:

  1. “An Introduction to Hidden Markov Models”, by Rabiner and Juang.
  2. “Hidden Markov Models: Continuous Speech Recognition”, by Kai-Fu Lee.

[^1]: Forward-Backward Algorithm. (Figure: the forward-backward algorithm)


Reprinted from blog.csdn.net/philthinker/article/details/78630999