English Translation Notes: A Tutorial on Hidden Markov Models and Applications

Real-world processes generally produce observable outputs which can be characterized as signals.

The signals can be discrete in nature (e.g., characters from a finite alphabet, quantized vectors from a codebook, etc.), or continuous in nature (e.g., speech samples, temperature measurements, music, etc.).

The signal source can be stationary (i.e., its statistical properties do not vary with time), or nonstationary (i.e., the signal properties vary over time).

The signals can be pure (i.e., coming strictly from a single source), or can be corrupted by other signal sources (e.g., noise) or by transmission distortions, reverberation, etc.

A problem of fundamental interest is characterizing such real-world signals in terms of signal models.

There are several reasons why one is interested in applying signal models.

First of all, a signal model can provide the basis for a theoretical description of a signal processing system which can be used to process the signal so as to provide a desired output.

For example, if we are interested in enhancing a speech signal corrupted by noise and transmission distortion, we can use the signal model to design a system which will optimally remove the noise and undo the transmission distortion.

A second reason why signal models are important is that they are potentially capable of letting us learn a great deal about the signal source (i.e., the real-world process which produced the signal) without having to have the source available.

This property is especially important when the cost of getting signals from the actual source is high.

In this case, with a good signal model, we can simulate the source and learn as much as possible via simulations.

Finally, the most important reason why signal models are important is that they often work extremely well in practice, and enable us to realize important practical systems, e.g., prediction systems, recognition systems, identification systems, etc., in a very efficient manner.

There are several possible choices for what type of signal model is used for characterizing the properties of a given signal.

Broadly, one can dichotomize the types of signal models into the class of deterministic models, and the class of statistical models.

Deterministic models generally exploit some known specific properties of the signal, e.g., that the signal is a sine wave, or a sum of exponentials, etc.

In these cases, specification of the signal model is generally straightforward; all that is required is to determine (estimate) values of the parameters of the signal model (e.g., amplitude, frequency, and phase of a sine wave, amplitudes and rates of exponentials, etc.).

The second broad class of signal models is the set of statistical models in which one tries to characterize only the statistical properties of the signal.

Examples of such statistical models include Gaussian processes, Poisson processes, Markov processes, and hidden Markov processes, among others.

The underlying assumption of the statistical model is that the signal can be well characterized as a parametric random process, and that the parameters of the stochastic process can be determined (estimated) in a precise, well-defined manner.

For the applications of interest, namely speech processing, both deterministic and stochastic signal models have had good success.

In this paper we will concern ourselves strictly with one type of stochastic signal model, namely the hidden Markov model (HMM). These models are referred to as Markov sources or probabilistic functions of Markov chains in the communications literature.

We will first review the theory of Markov chains and then extend the ideas to the class of hidden Markov models using several simple examples.

We will then focus our attention on the three fundamental problems for HMM design, namely: the evaluation of the probability (or likelihood) of a sequence of observations given a specific HMM; the determination of a best sequence of model states; and the adjustment of model parameters so as to best account for the observed signal.

We will show that once these three fundamental problems are solved, we can apply HMMs to selected problems in speech recognition.

The second reason was that the original applications of the theory to speech processing did not provide sufficient tutorial material for most readers to understand the theory and to be able to apply it to their own research.

As a result, several tutorial papers were written which provided a sufficient level of detail for a number of research labs to begin work using HMMs in individual speech processing applications.

The paper combines results from a number of original sources and hopefully provides a single source for acquiring the background required to pursue further this fascinating area of research.

The organization of this paper is as follows.

In Section II we review the theory of discrete Markov chains and show how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively.

We illustrate the theory with two simple examples, namely coin tossing, and the classic balls-in-urns system.

In Section IV we discuss the various types of HMMs that have been studied, including ergodic as well as left-right models.

We also discuss the state duration density, and the optimization criterion for choosing optimal HMM parameter values.

In Section V we discuss the issues that arise in implementing HMMs, including the topics of scaling, initial parameter estimates, model size, model form, missing data, and multiple observation sequences.

In Section VI we describe an isolated word speech recognizer, implemented with HMM ideas, and show how it performs as compared to alternative implementations.

In Section VII we extend the ideas presented in Section VI to the problem of recognizing a string of spoken words based on concatenating individual HMMs of each word in the vocabulary.

In Section VIII we briefly outline how the ideas of HMM have been applied to a large vocabulary speech recognizer, and in Section IX we summarize the ideas discussed throughout the paper.

Discrete Markov Processes

Consider a system which may be described at any time as being in one of a set of N distinct states, S1, S2, ..., SN, as illustrated in Fig. 1 (where N = 5 for simplicity).

At regularly spaced discrete times, the system undergoes a change of state (possibly back to the same state) according to a set of probabilities associated with the state.

We denote the time instants associated with state changes as t = 1, 2, ..., and we denote the actual state at time t as q_t.

A full probabilistic description of the above system would, in general, require specification of the current state (at time t), as well as all the predecessor states.

Question: why this expression? (p. 2)

For the special case of a discrete, first order, Markov chain, this probabilistic description is truncated to just the current and the predecessor state, i.e.,

P[q_t = S_j | q_{t−1} = S_i, q_{t−2} = S_k, ...] = P[q_t = S_j | q_{t−1} = S_i]

Question: I did not understand this sentence. (p. 2)

Furthermore, we only consider those processes in which the right-hand side of the above equation is independent of time, thereby leading to the set of state transition probabilities a_ij of the form below, with the state transition coefficients obeying standard stochastic constraints.

a_ij = P[q_t = S_j | q_{t−1} = S_i],  1 ≤ i, j ≤ N

a_ij ≥ 0,  Σ_{j=1}^{N} a_ij = 1

The above stochastic process could be called an observable Markov model since the output of the process is the set of states at each instant of time, where each state corresponds to a physical (observable) event.

To set ideas, consider a simple 3-state Markov model of the weather.

We assume that once a day (e.g., at noon), the weather is observed as being one of the following: State 1: rain or snow; State 2: cloudy; State 3: sunny.

We postulate that the weather on day t is characterized by a single one of the three states above, and that the matrix A of state transition probabilities is

A = {a_ij} =
  | 0.4  0.3  0.3 |
  | 0.2  0.6  0.2 |
  | 0.1  0.1  0.8 |

Question: why does the formula look like this?

Given that the weather on day 1 (t = 1) is sunny (state 3), we can ask the question: what is the probability (according to the model) that the weather for the next 7 days will be "sun-sun-rain-rain-sun-cloudy-sun ..."?

Stated more formally, we define the observation sequence O as O = {S3, S3, S3, S1, S1, S3, S2, S3}, corresponding to t = 1, 2, ..., 8, and we wish to determine the probability of O, given the model. This probability can be expressed (and evaluated) as

P(O | Model) = P[S3, S3, S3, S1, S1, S3, S2, S3 | Model]
  = P[S3] · P[S3|S3] · P[S3|S3] · P[S1|S3] · P[S1|S1] · P[S3|S1] · P[S2|S3] · P[S3|S2]
  = π_3 · a_33 · a_33 · a_31 · a_11 · a_13 · a_32 · a_23
  = 1 · (0.8)(0.8)(0.1)(0.4)(0.3)(0.1)(0.2)
  = 1.536 × 10⁻⁴

where we use the notation π_i = P[q_1 = S_i], 1 ≤ i ≤ N, to denote the initial state probabilities.
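To make the arithmetic concrete, here is a minimal sketch in Python (function and variable names are ours) that evaluates the probability of an observed state sequence under the weather model; it reproduces the value computed above.

```python
# Minimal sketch: probability of an observed state sequence in the
# observable 3-state weather Markov chain described above.
A = [  # state transition probabilities a_ij; each row sums to 1
    [0.4, 0.3, 0.3],  # from state 1 (rain or snow)
    [0.2, 0.6, 0.2],  # from state 2 (cloudy)
    [0.1, 0.1, 0.8],  # from state 3 (sunny)
]

def sequence_probability(states, A, pi):
    """P(O | Model) for a fully observable state sequence (states are 1-indexed)."""
    p = pi[states[0] - 1]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev - 1][cur - 1]
    return p

# O = {S3,S3,S3,S1,S1,S3,S2,S3}; day 1 is given as sunny, so pi_3 = 1.
O = [3, 3, 3, 1, 1, 3, 2, 3]
print(sequence_probability(O, A, [0.0, 0.0, 1.0]))  # -> 1.536e-04
```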

Given that the model is in a known state, what is the probability that it stays in that state for exactly d days? This probability can be evaluated as the probability of the observation sequence O = {S_i, S_i, S_i, ..., S_i, S_j ≠ S_i}, given the model, which is

p_i(d) = (a_ii)^(d−1) (1 − a_ii)

The quantity p_i(d) is the (discrete) probability density function of duration d in state i. This exponential duration density is characteristic of the state duration in a Markov chain.

Question: why this formula?

Based on p_i(d), we can readily calculate the expected number of observations (duration) in a state, conditioned on starting in that state, as

d̄_i = Σ_{d=1}^{∞} d · p_i(d) = 1 / (1 − a_ii)

Thus the expected number of consecutive days of sunny weather, according to the model, is 1/(0.2) = 5; for cloudy it is 2.5; for rain it is 1.67.
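As a quick numerical check of the duration density (a sketch of ours, not from the paper), the truncated sum of d · p_i(d) converges to the closed form 1/(1 − a_ii):

```python
# Sketch: state duration density p_i(d) = a_ii^(d-1) * (1 - a_ii) and its mean.
def expected_duration(a_ii, max_d=10000):
    # truncated sum of d * p_i(d); converges to 1 / (1 - a_ii)
    return sum(d * a_ii ** (d - 1) * (1 - a_ii) for d in range(1, max_d))

for label, a_ii in [("sunny", 0.8), ("cloudy", 0.6), ("rain", 0.4)]:
    print(label, expected_duration(a_ii), 1 / (1 - a_ii))
# sunny: 5.0, cloudy: 2.5, rain: ~1.67 -- matching the text
```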

So far we have considered Markov models in which each state corresponded to an observable (physical) event. This model is too restrictive to be applicable to many problems of interest. In this section we extend the concept of Markov models to include the case where the observation is a probabilistic function of the state, i.e., the resulting model (which is called a hidden Markov model) is a doubly embedded stochastic process with an underlying stochastic process that is not observable (it is hidden), but can only be observed through another set of stochastic processes that produce the sequence of observations. To fix ideas, consider the following model of some simple coin tossing experiments.

This model is depicted in the figure.

In this case the Markov model is observable, and the only issue for complete specification of the model would be to decide on the best value for the bias.

Interestingly, an equivalent HMM to that of the figure would be a degenerate 1-state model, where the state corresponds to the single biased coin, and the unknown parameter is the bias of the coin.

Each state is characterized by a probability distribution of heads and tails, and transitions between states are characterized by a state transition matrix.

The physical mechanism which accounts for how state transitions are selected could itself be a set of independent coin tosses, or some other probabilistic event.

Thus, with the greater degrees of freedom, the larger HMMs would seem to inherently be more capable of modeling a series of coin tossing experiments than would equivalently smaller models.

Although this is theoretically true, we will see later in this paper that practical considerations impose some strong limitations on the size of models that we can consider.

Furthermore, it might just be the case that only a single coin is being tossed. Then using the 3-coin model of the figure would be inappropriate, since the actual physical event would not correspond to the model being used, i.e., we would be using an underspecified system.

It should be obvious that the simplest HMM that corresponds to the urn-and-ball process is one in which each state corresponds to a specific urn, and for which a (ball) color probability is defined for each state. The choice of urns is dictated by the state transition matrix of the HMM.

Hence, in the coin tossing experiments, each state corresponded to a distinct biased coin. In the urn-and-ball model, the states corresponded to the urns. Generally the states are interconnected in such a way that any state can be reached from any other state (e.g., an ergodic model); however, we will see later in this paper that other possible interconnections of states are often of interest.

M, the number of distinct observation symbols per state, i.e., the discrete alphabet size.

Otherwise, terminate the procedure.

It can be seen from the above discussion that a complete specification of an HMM requires specification of two model parameters (N and M), specification of observation symbols, and the specification of the three probability measures A, B, and π. For convenience, we use the compact notation λ = (A, B, π) to indicate the complete parameter set of the model.

How do we choose a corresponding state sequence which is optimal in some meaningful sense (i.e., best "explains" the observations)?

We can also view the problem as one of scoring how well a given model matches a given observation sequence.

We usually use an optimality criterion to solve this problem as best as possible.

Unfortunately, as we will see, there are several reasonable optimality criteria that can be imposed, and hence the choice of criterion is a strong function of the intended use for the uncovered state sequence. Typical uses might be to learn about the structure of the model, to find optimal state sequences for continuous speech recognition, or to get average statistics of individual states, etc.

The observation sequence used to adjust the model parameters is called a training sequence since it is used to "train" the HMM. The training problem is the crucial one for most applications of HMMs, since it allows us to optimally adapt model parameters to observed training data, i.e., to create best models for real phenomena.

We represent the speech signal of a given word as a time sequence of coded spectral vectors.

We assume that the coding is done using a spectral codebook with M unique spectral vectors; hence each observation is the index of the spectral vector closest (in some spectral sense) to the original speech signal. Thus, for each vocabulary word, we have a training sequence consisting of a number of repetitions of sequences of codebook indices of the word (by one or more talkers).

To develop an understanding of the physical meaning of the model states, we use the solution to Problem 2 to segment each of the word training sequences into states, and then study the properties of the spectral vectors that lead to the observations occurring in each state. The goal here would be to make refinements on the model (e.g., more states, different codebook size, etc.) so as to improve its capability of modeling the spoken word sequences. Finally, once the set of W HMMs has been designed and optimized and thoroughly studied, recognition of an unknown word is performed by using the solution to Problem 1 to score each word model based upon the given test observation sequence, and selecting the word whose model score is highest (i.e., the highest likelihood).

The most straightforward way of doing this is through enumerating every possible state sequence of length T (the number of observations).

The joint probability of O and Q, i.e., the probability that O and Q occur simultaneously, is simply the product of the above two terms, i.e.,

The interpretation of the computation in the above equation is the following.

The clock changes from time t to t + 1 and we make a transition to state q_2 from state q_1 with probability a_{q1 q2}, and generate symbol O_2 with probability b_{q2}(O_2).

A little thought should convince the reader that the calculation of P(O|λ), according to its direct definition, involves on the order of 2T · N^T calculations, since at every t = 1, 2, ..., T there are N possible states which can be reached (i.e., there are N^T possible state sequences).

This calculation is computationally infeasible, even for small values of N and T; e.g., for N = 5 (states) and T = 100 (observations), there are on the order of 2 · 100 · 5^100 ≈ 10^72 computations!

Summing this product over all the N possible states S_i, 1 ≤ i ≤ N, at time t results in the probability of S_j at time t + 1 with all the accompanying previous partial observations.

Once this is done and S_j is known, it is easy to see that α_{t+1}(j) is obtained by accounting for observation O_{t+1} in state j, i.e., by multiplying the summed quantity by the probability b_j(O_{t+1}).

The computation of α_t(j) is performed for all states j, 1 ≤ j ≤ N, for a given t; the computation is then iterated for t = 1, 2, ..., T − 1.

Finally, Step 3 gives the desired calculation of P(O|λ) as the sum of the terminal forward variables α_T(i).

This is a savings of about 69 orders of magnitude.

The forward probability calculation is, in effect, based upon the lattice (or trellis) structure shown in the figure.

The key is that since there are only N states (nodes at each time slot in the lattice), all the possible state sequences will remerge into these N nodes, no matter how long the observation sequence. At time t = 1 (the first time slot in the lattice), we need to calculate values of α_1(i), 1 ≤ i ≤ N. At times t = 2, 3, ..., T, we only need to calculate values of α_t(j), 1 ≤ j ≤ N, where each calculation involves only the N previous values of α_{t−1}(i), because each of the N grid points is reached from the same N grid points at the previous time slot.
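The recursion just described is short in code. Below is a minimal, unscaled sketch of the forward procedure for a discrete-symbol HMM (array and function names are ours), following the initialization, induction, and termination steps discussed above:

```python
import numpy as np

def forward(A, B, pi, O):
    """Unscaled forward procedure.
    A:  (N, N) state transition matrix, A[i, j] = a_ij
    B:  (N, M) observation matrix, B[j, k] = b_j(k)
    pi: (N,) initial state distribution
    O:  observation index sequence of length T
    Returns alpha (T, N) and P(O | model)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                      # initialization
    for t in range(1, T):                           # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    return alpha, alpha[-1].sum()                   # termination
```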

It should be noted that the Viterbi algorithm is similar (except for the backtracking step) in implementation to the forward calculation of (19)-(21). The major difference is the maximization over previous states, which is used in place of the summing procedure in (20). It also should be clear that a lattice (or trellis) structure efficiently implements the computation of the Viterbi procedure.
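A matching sketch of the Viterbi procedure: the maximization replaces the forward recursion's sum, and backpointers support the backtracking step. Working in the log domain (a common variant of ours, not required by the text) avoids underflow:

```python
import numpy as np

def viterbi(A, B, pi, O):
    """Most likely state sequence (log domain). Shapes as in forward()."""
    T, N = len(O), len(pi)
    logA = np.log(A)
    delta = np.zeros((T, N))           # best log score ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = np.log(pi) + np.log(B[:, O[0]])
    for t in range(1, T):
        cand = delta[t - 1][:, None] + logA        # cand[i, j]: come from i, go to j
        psi[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + np.log(B[:, O[t]])
    # backtracking step
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()
```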

The third, and by far the most difficult, problem of HMMs is to determine a method to adjust the model parameters (A, B, π) to maximize the probability of the observation sequence given the model. There is no known way to analytically solve for the model which maximizes the probability of the observation sequence.

In fact, given any finite observation sequence as training data, there is no optimal way of estimating the model parameters.

We can, however, choose λ = (A, B, π) such that P(O|λ) is locally maximized using an iterative procedure such as the Baum-Welch method (or equivalently the EM (expectation-maximization) method [23]), or using gradient techniques [14]. In this section we discuss one iterative procedure, based primarily on the classic work of Baum and his colleagues, for choosing model parameters.

where the numerator term is just P(O, q_t = S_i | λ), and the division by P(O|λ) gives the desired probability measure.

If we sum γ_t(i) over the time index t, we get a quantity which can be interpreted as the expected (over time) number of times that state S_i is visited, or equivalently, the expected number of transitions made from state S_i (if we exclude the time slot t = T from the summation).

Similarly, summation of ξ_t(i, j) over t (from t = 1 to t = T − 1) can be interpreted as the expected number of transitions from state S_i to state S_j.

We have found a new model λ̄ from which the observation sequence is more likely to have been produced.

Eventually the likelihood function converges to a critical point.

Thus the Baum-Welch reestimation equations are essentially identical to the EM steps for this particular problem.
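As a sketch of how these expected frequencies become reestimates, here is one Baum-Welch iteration for a discrete-symbol HMM, reusing forward() from the sketch above (unscaled and for a single training sequence, so it favors clarity over numerical robustness):

```python
import numpy as np

def baum_welch_step(A, B, pi, O):
    """One reestimation iteration; returns (A_bar, B_bar, pi_bar)."""
    T, N = len(O), len(pi)
    alpha, P = forward(A, B, pi, O)          # forward pass from the sketch above
    beta = np.ones((T, N))                   # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    gamma = alpha * beta / P                 # gamma[t, i] = P(q_t = S_i | O)
    # xi[t, i, j] = P(q_t = S_i, q_t+1 = S_j | O)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, O[1:]].T * beta[1:])[:, None, :]) / P
    pi_bar = gamma[0]                        # expected frequency in S_i at t = 1
    # expected transitions i -> j over expected transitions out of i
    A_bar = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # expected count of symbol k in state j over expected time in state j
    B_bar = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_bar[:, k] = gamma[np.array(O) == k].sum(axis=0)
    B_bar /= gamma.sum(axis=0)[:, None]
    return A_bar, B_bar, pi_bar
```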

By looking at the parameter estimation problem as a constrained optimization of P (subject to the constraints of (43)), the techniques of Lagrange multipliers can be used to find the values of π_i, a_ij, and b_j(k) which maximize P (we use the notation P as shorthand for P(O|λ) in this section).

By appropriate manipulation of (44), the right-hand sides of each equation can be readily converted to be identical to the right-hand sides of each part of (40a)-(40c), thereby showing that the reestimation formulas are indeed exactly correct at critical points of P.

In fact the form of (44) is essentially that of a reestimation formula in which the left-hand side is the reestimate and the right-hand side is computed using the current values of the variables.

Finally, we note that since the entire problem can be set up as an optimization problem, standard gradient techniques can be used to solve for "optimal" values of the model parameters [14]. Such procedures have been tried and have been shown to yield solutions comparable to those of the standard reestimation procedures.

Until now, we have only considered the special case of ergodic or fully connected HMMs in which every state of the model could be reached (in a single step) from every other state of the model. (Strictly speaking, an ergodic model has the property that every state can be reached from every other state in a finite number of steps.) As shown in Fig. 7(a), for an N = 4 state model, this type of model has the property that every a_ij coefficient is positive. Hence for the example of Fig. 7(a) we have

Clearly the left-right type of HMM has the desirable property that it can readily model signals whose properties change over time, e.g., speech.

No transitions are allowed to states whose indices are lower than the current state.

Often, with left-right models, additional constraints are placed on the state transition coefficients to make sure that large changes in state indices do not occur; hence a constraint of the form a_ij = 0 for j > i + Δ is often used.
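For illustration, here is a sketch (our own construction, not from the paper) that builds a random left-right transition matrix obeying both constraints, a_ij = 0 for j < i and for j > i + Δ:

```python
import numpy as np

def left_right_A(N, delta=2, seed=0):
    """Random left-right transition matrix: a_ij = 0 unless i <= j <= i + delta."""
    rng = np.random.default_rng(seed)
    A = np.zeros((N, N))
    for i in range(N):
        j_max = min(i + delta, N - 1)
        A[i, i:j_max + 1] = rng.random(j_max - i + 1)
        A[i] /= A[i].sum()          # enforce the stochastic constraint
    return A

print(left_right_A(5))  # note the final state is absorbing: a_NN = 1
```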

Although it is possible to quantize such continuous signals via codebooks, etc., there might be serious degradation associated with such quantization. Hence it would be advantageous to be able to use HMMs with continuous observation densities.

In order to use a continuous observation density, some restrictions have to be placed on the form of the model probability density function to insure that the parameters of the pdf can be reestimated in a consistent way.

The most general representation of the pdf, for which a reestimation procedure has been formulated [24]-[26], is a finite mixture of the form

b_j(O) = Σ_{m=1}^{M} c_jm · N(O, μ_jm, U_jm),  1 ≤ j ≤ N

where O is the vector being modeled, c_jm is the mixture coefficient for the mth mixture in state j, and N is any log-concave or elliptically symmetric density [24] (e.g., Gaussian), with mean vector μ_jm and covariance matrix U_jm for the mth mixture component in state j.

The mixture gains c_jm satisfy the stochastic constraint

Σ_{m=1}^{M} c_jm = 1 (1 ≤ j ≤ N),  c_jm ≥ 0

so that the pdf is properly normalized, i.e., ∫ b_j(x) dx = 1, 1 ≤ j ≤ N.

The pdf of (49) can be used to approximate, arbitrarily closely, any finite, continuous density function. Hence it can be applied to a wide range of problems.
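A small sketch of evaluating the mixture density of (49) with Gaussian components and diagonal covariances (function and variable names are ours):

```python
import numpy as np

def gaussian_mixture_pdf(o, c, mu, var):
    """b_j(o) = sum_m c[m] * N(o; mu[m], diag(var[m])).
    o: (D,) observation vector; c: (M,) mixture gains summing to 1;
    mu, var: (M, D) means and diagonal covariance terms."""
    diff = o - mu                                        # (M, D)
    log_norm = -0.5 * (np.log(2 * np.pi * var) + diff ** 2 / var).sum(axis=1)
    return float(c @ np.exp(log_norm))

# usage: a 2-component mixture over 3-dimensional vectors
c = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
var = np.ones((2, 3))
print(gaussian_mixture_pdf(np.zeros(3), c, mu, var))
```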

It can be shown [24]-[26] that the reestimation formulas for the coefficients of the mixture density, i.e., c_jk, μ_jk, and U_jk, are of the form

where prime denotes vector transpose and where γ_t(j, k) is the probability of being in state j at time t with the kth mixture component accounting for O_t, i.e.,

(The term γ_t(j, k) generalizes to γ_t(j) of (26) in the case of a simple mixture, or a discrete density.) The reestimation formula for a_ij is identical to the one used for discrete observation densities. The interpretation of (52)-(54) is fairly straightforward. The reestimation formula for c_jk is the ratio between the expected number of times the system is in state j using the kth mixture component, and the expected number of times the system is in state j.

Similarly, the reestimation formula for the mean vector μ_jk weights each numerator term of (52) by the observation, thereby giving the expected value of the portion of the observation vector accounted for by the kth mixture component.

A similar interpretation can be given for the reestimation term for the covariance matrix U_jk.

Autoregressive HMMs

Although the general formulation of continuous density HMMs is applicable to a wide range of problems, there is one other very interesting class of HMMs that is particularly applicable to speech processing.

To be more specific, consider the observation vector O with components (x_0, x_1, x_2, ..., x_{K−1}). Since the basis probability density function for the observation vector is Gaussian autoregressive (of order p), the components of O are related by

x_k = −Σ_{i=1}^{p} a_i x_{k−i} + e_k

where e_k, k = 0, 1, ..., K−1, are Gaussian, independent, identically distributed random variables with zero mean and variance σ², and the a_i are the autoregression or predictor coefficients.

Variants on HMM Structures: Null Transitions and Tied States

arcs of the model

Where null arcs have been successfully utilized

with a large number of states in which it is possible to omit transitions between any pair of states.

Hence it is possible to generate observation sequences with as few as one observation and still account for a path which begins in state 1 and ends in state N.

The example of Fig. 8 is a finite state network representation of a word in terms of linguistic unit models (i.e., the sound on each arc is itself an HMM).

For this model the null transition gives a compact and efficient way of describing alternate word pronunciations (i.e., symbol deletions).

Finally, the FSN of Fig. 8 shows how the ability to insert a null transition into a grammar network allows a relatively simple network to generate arbitrarily long word (digit) sequences. In the example shown in Fig. 8, the null transition allows the network to generate digit sequences of arbitrary length by returning to the initial state after each individual digit is produced.

Another interesting variation in the HMM structure is the concept of parameter tying. Basically the idea is to set up an equivalence relation between HMM parameters in different states.

In this manner the number of independent parameters in the model is reduced and the parameter estimation becomes somewhat simpler.

Parameter tying is used in cases where the observation density (for example) is known to be the same in two or more states. Such cases occur often in characterizing speech sounds. The technique is especially appropriate in the case where there is insufficient training data to estimate, reliably, a large number of model parameters. For such cases it is appropriate to tie model parameters so as to reduce the number of parameters (i.e., the size of the model), thereby making the parameter estimation problem somewhat simpler. We will discuss this method later in this paper.

Inclusion of Explicit State Duration Density in HMMs

Perhaps the major weakness of conventional HMMs is the modeling of state duration. Earlier we showed in (5) that the inherent duration probability density p_i(d) associated with state S_i, with self transition coefficient a_ii, was of the form p_i(d) = (a_ii)^(d−1)(1 − a_ii).

A duration d_1 is chosen according to the state duration density p_{q1}(d_1). (For expedience and ease of implementation, the duration density p_q(d) is truncated at a maximum duration value D.)

Clearly this is a requirement since we assume that, in state q_1, exactly d_1 observations occur.

The importance of incorporating state duration densities is reflected in the observation that, for some problems, the quality of the modeling is significantly improved when explicit state duration densities are used.

To alleviate this type of problem, at least two alternatives to the standard maximum likelihood optimization procedure for estimating HMM parameters have been proposed.

The first alternative is based on the idea that several HMMs are to be designed and we wish to design them all at the same time in such a way as to maximize the discrimination power of each model (i.e., each model's ability to distinguish between observation sequences generated by the correct model and those generated by alternative models).

The proposed alternative design criterion is the maximum mutual information (MMI) criterion, in which the average mutual information I between the observation sequence O and the complete set of models λ = (λ_1, λ_2, ..., λ_V) is maximized.

There are various theoretical reasons why analytical (or reestimation type) solutions to (86) cannot be realized. Thus the only known way of actually solving (86) is via general optimization procedures like the steepest descent methods.

The second alternative philosophy is to assume that the signal to be modeled was not necessarily generated by a Markov source, but does obey certain constraints (e.g., a positive definite correlation function).

The goal of the design procedure is therefore to choose HMM parameters which minimize the discrimination information (DI), or the cross entropy, between the set of valid (i.e., satisfying the measurements) signal probability densities (call this set Q) and the set of HMM probability densities (call this set P), where the DI between Q and P can generally be written in the form

D(Q ‖ P) = ∫ q(y) log [q(y) / p(y)] dy

where q and p are the probability density functions corresponding to Q and P.

Techniques for minimizing (87) (thereby giving an MDI solution) for the optimum values of λ = (A, B, π) are highly nontrivial; however, they use a generalized Baum algorithm as the core of each iteration, and thus are efficiently tailored to hidden Markov modeling [33].

It has been shown that the ML, MMI, and MDI approaches can all be uniformly formulated as MDI approaches. The three approaches differ in either the probability density attributed to the source being modeled, or in the model effectively being used. None of the approaches, however, assumes that the source has the probability distribution of the model.

Several interpretations of (88) exist in terms of cross entropy, or divergence, or discrimination information.

For some of these implementation issues we can prescribe exact analytical solutions; for other issues we can only provide some seat-of-the-pants experience gained from working with HMMs over the last several years.

For sufficiently large t (e.g., 100 or more), the dynamic range of the α computation will exceed the precision range of essentially any machine (even in double precision).

Hence the only reasonable way of performing the computation is by incorporating a scaling procedure.

The scaling coefficients are canceled out exactly.

The only real change to the HMM procedure because of scaling is the procedure for computing P(O|λ).

We cannot merely sum up the α_T(i) terms since these are scaled already.
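Instead, under the convention c_t = 1/Σ_i α_t(i) (so that the product of the scale factors times P(O|λ) equals 1), the log-likelihood is recovered from the scale factors as log P(O|λ) = −Σ_t log c_t. A sketch of the scaled forward pass (names are ours):

```python
import numpy as np

def forward_scaled(A, B, pi, O):
    """Scaled forward pass; returns log P(O | model).
    Uses the convention c_t = 1 / sum_i alpha_t(i), so log P = -sum_t log c_t."""
    T = len(O)
    log_p = 0.0
    alpha = pi * B[:, O[0]]
    for t in range(T):
        if t > 0:
            alpha = (alpha @ A) * B[:, O[t]]
        c = 1.0 / alpha.sum()       # scaling coefficient for frame t
        alpha = alpha * c           # scaled alphas sum to 1 at every t
        log_p -= np.log(c)          # accumulate -log c_t
    return log_p
```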

Since the reestimation formulas are based on frequencies of occurrence of various events, the reestimation formulas for multiple observation sequences are modified by adding together the individual frequencies of occurrence for each sequence.

In this manner, for each sequence O, the same scale factors will appear in each term of the sum over t as appear in the P term, and hence will cancel exactly. Thus using the scaled values of the alphas and betas results in unscaled reestimates of the model parameters.

In theory, the reestimation equations should give values of the HMM parameters which correspond to a local maximum of the likelihood function. A key question is therefore how we choose initial estimates of the HMM parameters so that the local maximum is the global maximum of the likelihood function.

Experience has shown that either random (subject to the stochastic and the nonzero value constraints) or uniform initial estimates of the π and A parameters are adequate for giving useful reestimates of these parameters in almost all cases. However, for the B parameters, experience has shown that good initial estimates are helpful in the discrete symbol case, and are essential (when dealing with multiple mixtures) in the continuous distribution case. Such initial estimates can be obtained in a number of ways, including manual segmentation of the observation sequence into states with averaging of observations within states, maximum likelihood segmentation of observations with clustering, etc. We discuss such segmentation techniques later in this paper.

Effects of Insufficient Training Data

Often this is impractical.

A second possible solution is to reduce the size of the model (e.g., number of states, number of symbols per state, etc.). Although this is always possible, often there are physical reasons why a given model is used and therefore the model size cannot be changed.

The way in which the smaller model is chosen is by tying one or more sets of parameters of the initial model to create the smaller model.

A modified version of this training procedure, called the method of deleted interpolation, iterates the above procedure through multiple partitions of the training set. For example, one might consider a partition of the training set such that T1 is 90 percent of T and T2 is the remaining 10 percent of T. There are a large number of ways in which such a partitioning can be accomplished, but one particularly simple one is to cycle T2 through the data, i.e., the first partition uses the last 10 percent of the data as T2, the second partition uses the next-to-last 10 percent of the data as T2, etc.

The constraints can be applied as a postprocessor to the reestimation equations such that if a constraint is violated, the relevant parameter is manually corrected, and all remaining parameters are rescaled so that the densities obey the required stochastic constraints. Such postprocessor techniques have been applied to several problems in speech processing with good success. It can be seen that this procedure is essentially equivalent to a simple form of deleted interpolation in which the second model is a uniform distribution model, and the interpolation value is chosen as the fixed constant (1 − α).

As such, we will not strive to be as thorough or as complete in our descriptions of what was done as we were in describing the theory of HMMs.

A spectral and/or temporal analysis of the speech signal is performed to give observation vectors which can be used to train the HMMs which characterize various speech sounds.

First, a choice of speech recognition unit must be made. Possibilities include linguistically based sub-word units such as phones (or phone-like units), diphones, demisyllables, and syllables, as well as derivative units such as fenemes, fenones, and acoustic units. Other possibilities include whole word units, and even units which correspond to a group of two or more words (e.g., and an, in the, of a, etc.).

Generally, the less complex the unit (e.g., phones), the fewer of them there are in the language, and the more complicated (variable) their structure in continuous speech. For large vocabulary speech recognition (involving 1000 or more words), the use of sub-word speech units is almost mandatory, as it would be quite difficult to record an adequate training set for designing HMMs for units of the size of words or larger.

However, for specialized applications (e.g., small vocabulary, constrained task), it is both reasonable and practical to consider the word as a basic speech unit. We will consider such systems exclusively in this and the following section. Independent of the unit chosen for recognition, an inventory of such units must be obtained via training. Typically each such unit is characterized by some type of HMM whose parameters are estimated from a training set of speech data. The unit matching system provides the likelihoods of a match of all sequences of speech recognition units to the unknown input speech. Techniques for providing such match scores, and in particular determining the best match score (subject to lexical and syntactic constraints of the system), include the stack decoding procedure, various forms of frame synchronous path decoding, and a lexical access scoring procedure.

This process places constraints on the unit matching system so that the paths investigated are those corresponding to sequences of speech units which are in a word dictionary (a lexicon). This procedure implies that the speech recognition word vocabulary must be specified in terms of the basic units chosen for recognition. Such a specification can be deterministic (e.g., one or more finite state networks for each word in the vocabulary) or statistical (e.g., probabilities attached to the arcs in the finite state representation of words). In the case where the chosen units are words (or word combinations), the lexical decoding step is essentially eliminated and the structure of the recognizer is greatly simplified.

This process, much like lexical decoding, places further constraints on the unit matching system so that the paths investigated are those corresponding to speech units which comprise words (lexical decoding) and for which the words are in a proper sequence as specified by a word grammar.

Such a word grammar can again be represented by a deterministic finite state network (in which all word combinations which are accepted by the grammar are enumerated), or by a statistical grammar (e.g., a trigram word model in which probabilities of sequences of three words in a specified order are given). For some command and control tasks, only a single word from a finite set of equiprobable words is required to be recognized, and therefore the grammar is either trivial or unnecessary. Such tasks are often referred to as isolated word speech recognition tasks. For other applications (e.g., digit sequences) very simple grammars are often adequate (e.g., any digit can be spoken and followed by any other digit).

Semantic Analysis

Depending on the recognizer state, certain syntactically correct input strings are eliminated from consideration.

This again serves to make the recognition task easier and leads to higher performance of the system.

There is one additional factor that has a significant effect on the implementation of a speech recognizer, and that is the problem of separating background silence from the input speech.

Explicitly detecting the presence of speech via techniques which discriminate background from speech on the basis of signal energy and signal durations.

Such methods have been used for template-based approaches because of their inherent simplicity and their success in low to moderate noise backgrounds.

Build a model of the background silence, e.g., a statistical model, and represent the incoming signal as an arbitrary sequence of speech and background, where the silence part of the signal is optional in that it may not be present before or after the speech.

Further assume that for each word in the vocabulary we have a training set of K occurrences of each spoken word (spoken by one or more talkers), where each occurrence of the word constitutes an observation sequence, and where the observations are some appropriate representation of the (spectral and/or temporal) characteristics of the word.

Fortunately, a great deal of work has gone into devising an excellent iterative procedure for designing codebooks based on having a representative training sequence of vectors.

The procedure basically partitions the training vectors into M disjoint sets (where M is the size of the codebook), represents each such set by a single vector, which is generally the centroid of the vectors in the training set assigned to the mth region, and then iteratively optimizes the partition and the codebook (i.e., the centroids of each partition).
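A bare-bones sketch of that partition/centroid iteration (a plain Lloyd-style k-means loop of ours; production codebook design, e.g., the LBG algorithm, adds codeword splitting and distortion-threshold stopping):

```python
import numpy as np

def design_codebook(train, M, iters=20, seed=0):
    """train: (n, D) training vectors; returns an (M, D) codebook."""
    train = np.asarray(train, dtype=float)
    rng = np.random.default_rng(seed)
    codebook = train[rng.choice(len(train), M, replace=False)]
    for _ in range(iters):
        # partition: assign each vector to its nearest codeword
        d = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)
        # update: each codeword becomes the centroid of its region
        for m in range(M):
            members = train[nearest == m]
            if len(members):
                codebook[m] = members.mean(axis=0)
    return codebook
```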

Associated with VQ is a distortion penalty, since we are representing an entire region of the vector space by a single vector. Clearly it is advantageous to keep the distortion penalty as small as possible. However, this implies a large size codebook, and that leads to problems in implementing HMMs with a large number of parameters.

Fig. 14 illustrates the tradeoff of quantization distortion versus M (on a log scale). Although the distortion steadily decreases as M increases, it can be seen from Fig. 14 that only small decreases in distortion accrue beyond a value of M = 32. Hence HMMs with codebook sizes of from M = 32 to 256 vectors have been used in speech recognition experiments using HMMs.

Furthermore, we can envision the physical meaning of the model states as distinct sounds (e.g., phonemes, syllables) of the word being modeled.

Also, for the continuous models, we have found that it is preferable to use diagonal covariance matrices with several mixtures, rather than fewer mixtures with full covariance matrices. The reason for this is simple, namely the difficulty in performing reliable reestimation of the off-diagonal components of the covariance matrix from the necessarily limited training data. To illustrate the need for using mixture densities for modeling LPC observation vectors (i.e., eighth-order cepstral vectors with log energy appended as the ninth vector component), Fig. 16 shows a comparison of marginal distributions b_j(O) against a histogram of the actual observations within a state (as determined by a maximum likelihood segmentation of all the training observations into states). The observation vectors are ninth order, and the model density uses M = 5 mixtures. The covariance matrices are constrained to be diagonal for each individual mixture. The results of Fig. 16 are for the first model state of the word "zero". The need for values of M > 1 is clearly seen in the histogram of the first parameter (the first cepstral component), which is inherently multimodal; similarly the second, fourth, and eighth cepstral parameters show the need for more than a single Gaussian component to provide good fits to the empirical data.

Another experimentally verified fact about the HMM is that it is important to limit some of the parameter estimates in order to prevent them from becoming too small. For example, for the discrete symbol models, the constraint that b_j(k) be greater than or equal to some minimum value ε is necessary to insure that even when the kth symbol never occurred in some state j in the training observation set, there is always a finite probability of its occurrence when scoring an unknown observation set.

A plot of average word error rate versus the parameter ε (on a log scale), for a standard word recognition experiment, shows that over a very broad range the average error rate remains at about a constant value; however, when ε is set to 0, the error rate increases sharply.
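A sketch of one way to apply such a floor as a postprocessor (assumed behavior, consistent with the constraint above: entries below ε are raised to ε and the remaining entries of each row are rescaled so each distribution still sums to 1):

```python
import numpy as np

def floor_b(B, eps=1e-3):
    """Set b_j(k) = eps wherever it fell below eps, and rescale the
    remaining entries of each row so the row still sums to 1.
    Assumes each row has some probability mass above eps."""
    B = B.copy()
    low = B < eps
    for j in range(B.shape[0]):
        deficit = eps * low[j].sum()
        keep = ~low[j]
        B[j, keep] *= (1.0 - deficit) / B[j, keep].sum()
        B[j, low[j]] = eps
    return B
```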

Segmental k-Means Segmentation into States

The procedure is a variant on the well-known k-means iterative procedure for clustering data.

In the case where we are using continuous observation densities, a segmental k-means procedure is used to cluster the observation vectors within each state S_j into a set of M clusters (using a Euclidean distortion measure), where each cluster represents one of the M mixtures of the b_j density. From the clustering, an updated set of model parameters is derived as follows:

Incorporation of State Duration into the HMM

A postprocessor then increments the log-likelihood score of the Viterbi algorithm.

The incremental cost of the postprocessor for duration is essentially negligible, and experience has shown that recognition performance is essentially as good as that obtained using the theoretically correct duration model.
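A sketch of such a postprocessor (log_dur is a hypothetical per-state table of log duration probabilities, truncated at a maximum duration D as described earlier; the weighting constant is likewise an assumption):

```python
from itertools import groupby

def add_duration_score(log_likelihood, state_path, log_dur, weight=1.0):
    """Increment a Viterbi log score with state duration log-probabilities.
    state_path: decoded state sequence, e.g. [0, 0, 1, 1, 1, 2]
    log_dur[q][d]: log p_q(d), a hypothetical per-state duration table."""
    for state, run in groupby(state_path):
        d = len(list(run))                       # duration spent in this state
        log_likelihood += weight * log_dur[state][d]
    return log_likelihood
```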

HMM Performance on Isolated Word Recognition

100 occurrences of each digit.

A conventional template-based recognizer using dynamic time warping alignment.

A block diagram of the overall level-building connected digit recognizer is given in Fig. 21. There are essentially three steps in the recognition process:

The candidate digit strings are subjected to further validity tests (e.g., duration) to eliminate unreasonable (unlikely) candidates. The postprocessor chooses the most likely digit string from the remaining (valid) candidate strings.

Each new level begins with the initial best probability at the preceding frame on the preceding level and increments the Viterbi score by matching the word models beginning at the new initial frame. This process is repeated through a number of levels equivalent to the maximum expected number of digits in any string. At the end of each level, a best string of size l words with probability P is obtained by backtracking, using the backpointer array F to give the words in the string. The overall best string is the maximum of P over all possible levels l.

The key to success in connected word recognition is to derive word models from representative connected word strings. We have found that although the formal reestimation procedures developed in this paper work well, they are costly in terms of computation, and equivalently good parameter estimates can be obtained using a segmental k-means procedure of the type discussed in Section VI. The only difference in the procedure, from the one discussed earlier, is that the training connected word strings are first segmented into individual digits via a Viterbi alignment procedure; then each set of digits is segmented into states, and the vectors within each state are clustered into the best M-cluster solution.

The segmental k-means reestimation of the HMM parameters is about an order of magnitude faster than the Baum-Welch reestimation procedure, and all our experimentation indicates that the resulting parameter estimates are essentially identical in that the resulting HMMs have essentially the same likelihood values. As such, the segmental k-means procedure was used to give all the results presented later in this section.

Duration Modeling for Connected Digits

This leads to an expanded network with an astronomical number of equivalent states.

In another attempt to apply HMMs to continuous speech recognition, an ergodic HMM was used in which each state represented an acoustic-phonetic unit. Hence about 40-50 states are required to represent all the sounds of English.

The model incorporated the variable duration feature in each state to account for the fact that vowel-like sounds have vastly different durational characteristics than consonant-like sounds.

In this approach, lexical access was used in conjunction with a standard pronouncing dictionary to determine the best matching word sequence from the output of the sub-word HMM. Again, the details of this recognition system are beyond the scope of this paper. The purpose of this brief discussion is to point out the vast potential of HMMs for characterizing the basic processes of speech production; hence their applicability to problems in large vocabulary speech recognition.



Reposted from blog.csdn.net/lhm1019/article/details/79864010