Paper reading notes 64: Architectures for deep neural network based acoustic models defined over windowed speech waveforms (INTERSPEECH 2015)

 

Paper link: https://pdfs.semanticscholar.org/eeb7/c037e6685923c76cafc0a14c5e4b00bcf475.pdf

Summary

      This paper studies deep neural network (DNN) acoustic models for automatic speech recognition (ASR) whose input is a windowed speech waveform (WSW). It first demonstrates that such networks automatically learn representations with characteristics similar to the Mel spectrum (background on the Mel spectrum: https://blog.csdn.net/qq_28006327/article/details/59129110 ), and then exploits the structural properties of WSW-based DNNs. First, an improved bottleneck DNN structure is proposed to capture the dynamic spectral information in the time domain that these models otherwise represent poorly. Second, the redundant information inside the WSW-based DNN is taken into account. WSW-based acoustic models are compared with models based on Mel filterbank spectral coefficient (MFSC) features on the Wall Street Journal (WSJ) corpus. The results show that on the WSJ corpus the WSW-based model has a word error rate (WER) 3.0% higher than the best MFSC-based model. However, when the WSW and MFSC features are combined, WER drops by 4.1% compared with the best single MFSC-based DNN model.

Keywords: speech recognition, deep neural networks, bottleneck features, speech waveform

Introduction

      Several studies have shown that automatic speech recognition can be performed with deep neural networks whose input is a windowed speech waveform (WSW). Most of this work uses multilayer network structures, and evaluations across different task domains indicate that the word error rates (WERs; see https://zhuanlan.zhihu.com/p/59252804 ) obtained from WSW features can reasonably approximate those of the more common MFSC-based features. However, WSW-based acoustic models still cannot match MFSC-based models: their WERs are generally 15% to 20% higher. This paper addresses these problems by building effective network structures and learning algorithms so that the performance of WSW-based DNN acoustic models can approach that of MFSC-based models.

      The waveform-based deep learning approach to ASR described here has three parts. The first is an analysis of DNN acoustic models that take WSW features as input. The goal is to clarify whether this type of model better represents static or dynamic spectral information, and how robust it is to changes across noise sources. Experiments show that the deep network can derive, from the WSW input, secondary features similar to the MFSC Mel-spectrum representation; these features can arise in both fully connected and deep convolutional network structures. The second part analyzes the network weights, showing that the early layers of a fully connected DNN trained on the Wall Street Journal corpus learn representations similar to Mel spectral features, even though the corpus used is relatively small.

      The second line of research investigates alternative network structures to capture speech information that is not learned automatically from the windowed speech waveform. The focus here is the ability of WSW-based DNNs to model the dynamics of the speech spectrum. Spectral dynamics play an important role both in human speech perception and in ASR acoustic models: the rate-of-change information contained in short segments carries phonetically relevant information, and in acoustic models this segment-level information is described through transformations of the MFSC spectral features.

      In MFSC-based acoustic models, dynamic spectral information is captured either by concatenating multiple spectral vectors or by appending difference (delta) coefficients to the static coefficients. Such spectral representations can capture dynamics spanning 150 ms to 250 ms. Even when the time interval of the WSW input window is increased, these features are difficult for a WSW-based DNN to learn directly from the waveform. To capture them, Section 3 incorporates a bottleneck layer whose frame outputs can be spliced to cover up to 250 ms.
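      To make this concrete, here is a small NumPy sketch (mine, not the paper's) of the two standard ways an MFSC front end encodes these dynamics: appending delta and delta-delta coefficients, and splicing neighboring frames. The shapes and window widths are assumptions.

```python
# Hypothetical illustration: delta features and frame splicing over MFSC.
import numpy as np

def deltas(feats: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-based difference coefficients over a +/-width frame window."""
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = np.zeros_like(feats)
    for t in range(1, width + 1):
        num += t * (padded[width + t:width + t + len(feats)]
                    - padded[width - t:width - t + len(feats)])
    return num / (2 * sum(t * t for t in range(1, width + 1)))

def splice(feats: np.ndarray, context: int = 7) -> np.ndarray:
    """Concatenate each frame with its +/-context neighbors (edge-padded)."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

mfsc = np.random.randn(100, 40)  # dummy data: 100 frames of 40-dim MFSC
with_deltas = np.hstack([mfsc, deltas(mfsc), deltas(deltas(mfsc))])  # (100, 120)
spliced = splice(mfsc)           # (100, 600): 15 frames x 40 dims, i.e. 150 ms
```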

      The third part argues that even if a WSW-based DNN acoustic model reaches the performance of the best MFSC-based system, it does so with additional computational complexity and redundancy. This raises the question of whether the trained weights of the fully or partially connected deep network could be replaced by a simpler, well-designed Mel-like filterbank. For the fully connected DNN, the first layer alone involves over 1,000,000 operations per frame. Section 4 considers the redundancy in the automatically trained filterbank.

      Section 2 analyzes WSW-based DNNs through the trained weights of the first hidden layer. Section 3 describes a stacked-bottleneck WSW-based DNN that provides an improved model of spectral dynamics.

Analysis of the WSW-based DNN

      This section examines the intermediate-layer representations of WSW-based DNNs trained on the Wall Street Journal corpus: first the DNN structure, then the corpus used in the experiments, and finally an analysis of the trained first-layer weights.

      Acoustic model structure and training: the input frame of the WSW-based DNN is a 150 ms segment of the sampled speech waveform, which at wideband 16 kHz sampling corresponds to 2400 samples. The analysis window advances 10 ms, or 160 samples, per frame. The fully connected DNN contains three hidden layers of 1024 nodes each, with a ReLU nonlinearity after each layer. The softmax output layer has 2019 nodes, each representing one context-dependent (CD) hidden Markov model state.
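      The architecture above is simple to write down. Below is a hedged PyTorch sketch of the fully connected WSW-based DNN: the sizes (2400 inputs, three 1024-node ReLU layers, 2019-way softmax) come from the text, while the names and initialization are my own assumptions.

```python
# Sketch of the fully connected WSW DNN described in the text.
import torch
import torch.nn as nn

class WswDnn(nn.Module):
    def __init__(self, n_in: int = 2400, n_hidden: int = 1024, n_states: int = 2019):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_states),  # logits; softmax is applied in the loss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = WswDnn()
frames = torch.randn(8, 2400)  # a batch of 150 ms windowed-waveform frames
logits = model(frames)         # shape (8, 2019), one logit per CD state
```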

      Corpus and model training: the Wall Street Journal corpus is used for all acoustic model training and evaluation in this article. It contains recordings of read newspaper text in a high signal-to-noise-ratio environment. The WSJ0/WSJ1 SI-284 set is used for all HMM and DNN model training; it includes 80 hours of speech and 37,961 utterances from 284 speakers. Test-Dev93 (515 utterances) serves as the validation set and Test-Eval92 (330 utterances) as the test set; the open-vocabulary 20,000-word language model test condition is used for all evaluations. The ASR decoder is based on continuous-density Gaussian-mixture HMMs (HMM-GMMs), which align speech frames with CD HMM states using MFCC (Mel-frequency cepstral coefficient) features; the 2019 CD states are obtained by automatic clustering of HMM-GMM context-dependent states, implemented with the Kaldi toolkit. These models are trained on MFCC features transformed by LDA and maximum likelihood linear transform (MLLT), and speaker-adaptive training is also applied. The decoder assigns one of the 2019 CD-state labels to each speech frame, and these labels serve as the supervision for DNN cross-entropy training.
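      To make the supervision concrete, the sketch below runs one cross-entropy training step on frame/CD-state pairs. The random tensors stand in for real WSW frames and Kaldi forced-alignment labels; loading real alignments is not shown.

```python
# Hypothetical single training step with alignment-derived CD-state targets.
import torch
import torch.nn as nn

model = nn.Sequential(                       # same shape as the WSW DNN above
    nn.Linear(2400, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 2019),
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()            # applies log-softmax internally

frames = torch.randn(256, 2400)              # stand-in WSW frames
labels = torch.randint(0, 2019, (256,))      # stand-in CD-state ids from alignment

optimizer.zero_grad()
loss = criterion(model(frames), labels)      # CE against the aligned CD states
loss.backward()
optimizer.step()
```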

      Baseline WSW-based DNN evaluation: to assess the performance of the baseline HMM-DNN ASR systems, the network structure described above was trained on several feature sets, including WSW, MFSC, and MFCC features. Table 1 below compares the HMM-GMM and HMM-DNN systems. Comparing the first two rows, the HMM-DNN yields a substantial WER reduction over the HMM-GMM. Compared with MFCC features, MFSC features give a 3% relative WER reduction. The WSW features in the bottom row give a WER 15.5% higher than the MFCC features.

      This shows that an acoustic model defined directly on raw speech samples, with no specially designed features, can reach a WER below 9% on the WSJ corpus. Some insight can be drawn by analyzing the information the network has captured in its estimated parameters. The rows of the first-layer weight matrix W of the WSW-based DNN encode Mel-spectrum-like shapes: computing the magnitude spectrum of the values along each row shows that many rows approximate the response of a bandpass filter.

      Figure 1 below summarizes the information contained in the 1024 rows of the weight matrix W. Row i of the figure shows the smoothed log magnitude spectrum of row i of W. The smoothed log magnitude spectrum is obtained by zero-padding each row w, computing its fast Fourier transform, and smoothing the result with a Gaussian kernel. The rows of W are then indexed by the peak frequency of each smoothed spectrum and plotted in that order. The figure shows that the DNN has learned a feature representation similar to the Mel spectrum.

[Figure 1: smoothed log magnitude spectra of the 1024 first-layer weight rows, ordered by peak frequency]
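      The analysis is easy to reproduce in outline. The sketch below (my reconstruction under assumptions, with a random matrix standing in for the trained 1024 x 2400 weights) zero-pads each row, takes the FFT log magnitude, smooths it with a Gaussian kernel, and orders the rows by spectral peak, as in Figure 1.

```python
# Sketch of the first-layer weight-spectrum analysis behind Figure 1.
import numpy as np
from scipy.ndimage import gaussian_filter1d

W = np.random.randn(1024, 2400)                   # stand-in for trained weights
nfft = 4096                                       # zero-pad each 2400-sample row
spectra = np.abs(np.fft.rfft(W, n=nfft, axis=1))  # row-wise magnitude spectra
log_spectra = gaussian_filter1d(np.log(spectra + 1e-8), sigma=8, axis=1)

peak_bin = np.argmax(log_spectra, axis=1)         # peak frequency of each row
order = np.argsort(peak_bin)                      # sort rows by peak frequency
image = log_spectra[order]                        # the rows plotted in Figure 1
peak_hz = peak_bin * 16000 / nfft                 # FFT bin -> Hz at 16 kHz
```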

Stacked bottleneck architecture

      This section describes applying a bottleneck DNN to the WSW-based DNN acoustic model. The modified model can be seen as a mechanism for splicing consecutive low-dimensional bottleneck frames so that inter-frame spectral dynamics can be modeled. Many BN-DNN structures have been proposed for ASR. Their general form is shown in the figure below: a BN-DNN is built by cascading several high-dimensional nonlinear hidden layers with a low-dimensional hidden layer. The original motivation for this design was dimensionality reduction in a nonlinear space.

     As shown in the figure above, the input dimension is 2400 (covering 150 ms) and the bottleneck is 40-dimensional. Some BN-DNNs are augmented with local spectral information obtained by splicing the bottleneck-layer outputs of neighboring frames. When MFSC features are used with a BN-DNN, the bottleneck reduces WER only slightly. Combining bottleneck spectral information is an interesting question for WSW-based DNNs, because there is no simple way to exploit spectral information hierarchically at the feature level. It is therefore expected that a BN-DNN over WSW features that splices bottleneck outputs can have a larger impact on ASR WER.

     The BN-DNN structure is designed as follows: 2400 input nodes corresponding to the 2400-sample WSW, two 1024-node hidden layers, and a 40-dimensional bottleneck layer, with each layer followed by a ReLU. The bottleneck outputs, together with their first- and second-order difference coefficients, are spliced over 15 frames into an 1800-node vector representing the spectral dynamics over 150 ms. During decoding, the concatenated bottleneck outputs are fed to a network of three 1024-node hidden layers and a 2019-node softmax output layer, where the softmax outputs correspond to the context-dependent (CD) HMM states.
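     A hedged PyTorch sketch of this pipeline: a truncated BN-DNN produces 40-dimensional bottleneck frames, rough first- and second-order differences bring each frame to 120 dimensions, and splicing 15 frames gives the 1800-dimensional classifier input. The difference operator and all names are my assumptions, not the paper's implementation.

```python
# Sketch of the stacked-bottleneck pipeline described above.
import torch
import torch.nn as nn

bn_dnn = nn.Sequential(                  # trained with CE, then truncated here
    nn.Linear(2400, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 40),                 # 40-dim bottleneck activations
)
classifier = nn.Sequential(              # consumes the spliced bottleneck frames
    nn.Linear(1800, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 2019),               # CD-state logits
)

wsw = torch.randn(15, 2400)              # 15 consecutive 150 ms WSW frames
bn = bn_dnn(wsw)                         # (15, 40) bottleneck outputs
d1 = torch.gradient(bn, dim=0)[0]        # crude first-order differences
d2 = torch.gradient(d1, dim=0)[0]        # crude second-order differences
stacked = torch.cat([bn, d1, d2], dim=1).reshape(1, -1)  # (1, 15 * 120) = (1, 1800)
logits = classifier(stacked)             # (1, 2019)
```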

     The lower half of Figure 3 above shows the DNN layers of the BN-DNN that are trained separately; the upper half shows the HMM/DNN. The BN-DNN is trained with the cross-entropy (CE) criterion; after training, the layers above the bottleneck are removed and the bottleneck-layer activations are retained as the BN-DNN output.

     The WER results of BN-DNNs based on WSW and MFSC features are shown below. Comparing rows 1 and 3, adding the stacked bottleneck to the MFSC-based model changes WER very little. This is because the 1800-dimensional input to the BN network is already formed by splicing 15 MFSC frames. Comparing rows 2 and 4, the BN-DNN reduces the WER of the WSW features by 14.2%, which is already close to the best MFSC-based WER.

     For the combined WSW/MFSC features, for each 10 ms windowed input the 40-dimensional bottleneck output of the WSW BN-DNN is concatenated with the 40-dimensional MFSC vector. The resulting 80-dimensional vectors are spliced with +/-7 context frames to form the network input. The last row of the table above shows the result: a 4% WER reduction relative to the MFSC features.
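     A small sketch of this combination, with dummy arrays standing in for the real bottleneck and MFSC streams:

```python
# Hypothetical WSW-bottleneck + MFSC feature combination with +/-7 splicing.
import numpy as np

wsw_bn = np.random.randn(100, 40)     # bottleneck outputs, one per 10 ms frame
mfsc = np.random.randn(100, 40)       # MFSC vectors for the same frames
combined = np.hstack([wsw_bn, mfsc])  # (100, 80)

ctx = 7
padded = np.pad(combined, ((ctx, ctx), (0, 0)), mode="edge")
spliced = np.hstack([padded[i:i + 100] for i in range(2 * ctx + 1)])  # (100, 1200)
```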

Structured initialization for WSW-based DNN training

      The earlier analysis of the first-layer weight matrix suggests that training a fully connected WSW-based DNN yields a network with identifiable structure. That structure is probably important for classification performance, but it is hard to characterize from anecdotal observation. One approach would be to impose a filterbank-like structure, choosing a parameterization similar to Mel filterbank analysis. The work here instead asks whether this structure can be discovered by training a fully connected network, and which part of the network can be isolated to improve performance and efficiency over successive iterations. Adjacent rows of the first-layer weight matrix in Figure 1 mostly show the responses of filters with similar center frequencies but different phases and gains. Based on this observation, the question is whether this layer can be isolated into a structure that lets the DNN train more efficiently.

      A two-step procedure was designed to approximate the rows of the first-layer weight matrix by delayed and scaled versions of a small number of "basis rows". In the first step, the rows of the matrix associated with bandpass filters whose center frequencies are close to the Mel filter center frequencies are identified; these serve as the basis rows, denoted h_i. In the second step, each weight-matrix row whose associated filter center frequency is closest to that of basis row h_i is treated as a scaled and delayed version of that basis row. That is, row w_j of the weight matrix is approximated as

      w_j[n] ≈ a_{i,j} h_i[n − d_{i,j}],

where a_{i,j} and d_{i,j} denote the scale and delay of w_j relative to h_i, the basis row whose Fourier transform has the most similar bandpass center frequency.
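     One plausible implementation of the second step is sketched below: the delay d_{i,j} comes from the cross-correlation peak and the scale a_{i,j} from least squares. The Gabor-like toy signal is only for demonstration; this is my reading of the procedure, not the paper's exact code.

```python
# Sketch: approximate a weight row as a scaled, delayed basis row.
import numpy as np

def approximate_row(w_j: np.ndarray, h_i: np.ndarray):
    """Return (a_ij, d_ij) such that w_j ~= a_ij * h_i shifted by d_ij."""
    xcorr = np.correlate(w_j, h_i, mode="full")           # best alignment search
    d_ij = int(np.argmax(np.abs(xcorr))) - (len(h_i) - 1)
    h_shift = np.roll(h_i, d_ij)                          # delayed basis row
    a_ij = float(h_shift @ w_j / (h_shift @ h_shift))     # least-squares scale
    return a_ij, d_ij

n = np.arange(2400)
h = np.exp(-0.5 * ((n - 1200) / 60.0) ** 2) * np.sin(2 * np.pi * 0.05 * n)
w = 0.7 * np.roll(h, 12)              # a scaled, delayed copy of the basis row
a, d = approximate_row(w, h)          # recovers a ~= 0.7, d == 12
```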

      The first-layer weight matrix formed from rows of the above form is used to initialize the training of a new WSW-based DNN. The figure below shows the validation-set frame accuracy (FAC) at each training epoch under this initialization. Compared with the FAC obtained from randomly initialized DNN parameters, the structured initialization of the first-layer weight matrix yields consistently higher FAC. In addition, structured initialization gives a small WER reduction, from 7.64% to 7.51%, and also reduces the average approximation error between the basis rows and the remaining rows of the first-layer weight matrix.

 


Source: www.cnblogs.com/fourmi/p/10955012.html