Notes on the speech enhancement paper "A Hybrid Approach for Speech Enhancement Using MoG Model and Neural Network Phoneme Classifier"

Recently I carefully read a paper on speech noise reduction: "A Hybrid Approach for Speech Enhancement Using MoG Model and Neural Network Phoneme Classifier". It is a hybrid noise-reduction method that combines a generative model (a mixture of Gaussians, MoG) with a discriminative model (a neural network, NN). This article works through the principles based on my own understanding.

 

The paper builds on the MixMax model proposed in "Speech Enhancement Using a Mixture-Maximum Model". Assume the noise is additive: if the clean speech is x(t) and the noise is y(t), the noisy speech in the time domain is z(t) = x(t) + y(t). Applying the short-time Fourier transform (STFT) to z(t) gives Z(k), and taking the logarithm gives the log-spectrum Z_k, where k indexes the k-th frequency band (frequency bin) of the log-spectrum. If the STFT uses L samples per frame, the log-spectrum has L/2 + 1 dimensions. X_k and Y_k are obtained from x(t) and y(t) in the same way. The MixMax model approximates the value Z_k in each frequency bin of the noisy speech by the larger of the corresponding X_k and Y_k, i.e. Z_k ≈ max(X_k, Y_k).
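The approximation is easy to sanity-check numerically: magnitudes add in the linear domain, so the log-spectrum of the mixture is close to the elementwise maximum whenever one source dominates a bin. A small sketch with made-up magnitudes (not from the paper):

```python
import math

# One frequency bin with hypothetical linear magnitudes: speech dominates noise.
speech_mag = 10.0
noise_mag = 1.0

Xk = math.log(speech_mag)                    # clean-speech log-spectrum in bin k
Yk = math.log(noise_mag)                     # noise log-spectrum in bin k
Zk_exact = math.log(speech_mag + noise_mag)  # log-spectrum of the mixture
Zk_mixmax = max(Xk, Yk)                      # MixMax: Z_k ~ max(X_k, Y_k)

print(Zk_exact, Zk_mixmax)                   # the two values differ by < 0.1
```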

 

Speech x is composed of phonemes, and each phoneme can be modeled by a Gaussian. Assuming there are m phonemes, the density function f(x) of clean speech can be expressed as:

f(x) = Σ_{i=1..m} c_i · f_i(x)

f_i(x) is the density function of the i-th phoneme. Since x is a multidimensional log-spectral vector whose dimensions are assumed independent of each other, f_i(x) factorizes into the product of the per-dimension densities f_{i,k}(x_k), each of which is a univariate Gaussian:

f_i(x) = Π_k f_{i,k}(x_k),    f_{i,k}(x_k) = 1/(√(2π) · σ_{i,k}) · exp(−(x_k − μ_{i,k})² / (2σ²_{i,k}))

μ_{i,k} is the mean and σ²_{i,k} the variance in dimension k. c_i is the weight of phoneme i, and the weights must sum to 1 (Σ_i c_i = 1).
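The mixture density above can be sketched directly; the two "phonemes" and their parameters below are toy values, not estimates from any corpus:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mog_density(x, means, sigmas, weights):
    """f(x) = sum_i c_i * prod_k N(x_k; mu_ik, sigma_ik^2)  (diagonal covariance)."""
    total = 0.0
    for mu_i, sig_i, c_i in zip(means, sigmas, weights):
        p = c_i
        for xk, mk, sk in zip(x, mu_i, sig_i):
            p *= gauss_pdf(xk, mk, sk)
        total += p
    return total

# Toy model: 2 hypothetical phonemes over a 2-dimensional log-spectrum.
means = [[0.0, 0.0], [3.0, 3.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
weights = [0.6, 0.4]
print(mog_density([0.0, 0.0], means, sigmas, weights))
```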

 

The noise y is modeled by a single Gaussian. Like speech, y is a multidimensional log-spectral vector, and its density function can be expressed as:

g(y) = Π_k g_k(y_k)

Similarly, g_k(y_k) is expressed as:

g_k(y_k) = 1/(√(2π) · σ_{Y,k}) · exp(−(y_k − μ_{Y,k})² / (2σ²_{Y,k}))

For each dimension of y, the cumulative distribution function G_k(y) corresponding to this density is:

G_k(y) = 1/2 · [1 + erf((y − μ_{Y,k}) / (√2 · σ_{Y,k}))]

where erf(·) is the error function:

erf(x) = (2/√π) · ∫₀^x e^(−t²) dt
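Since erf is in most standard libraries, G_k can be written in one line; a minimal sketch in Python:

```python
import math

def gaussian_cdf(y, mu, sigma):
    """G_k(y) = 1/2 * (1 + erf((y - mu) / (sqrt(2) * sigma)))."""
    return 0.5 * (1.0 + math.erf((y - mu) / (math.sqrt(2.0) * sigma)))

print(gaussian_cdf(0.0, 0.0, 1.0))   # 0.5 at the mean of the Gaussian
```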

In the same way, the per-dimension cumulative distribution function of each phoneme of the clean speech is:

F_{i,k}(x) = 1/2 · [1 + erf((x − μ_{i,k}) / (√2 · σ_{i,k}))]

For the noisy speech Z, when the phoneme is given (i.e., when I = i is given), the cumulative distribution function H_{i,k}(z) of the k-th log-spectral component Z_k is:

H_{i,k}(z) = P(max(X_k, Y_k) ≤ z | I = i) = F_{i,k}(z) · G_k(z)

The above formula is the conditional distribution given I = i: since X and Y are independent, the CDF of the maximum is the product of the CDFs in dimension k. Differentiating H_{i,k}(z) gives the density function h_{i,k}(z):

h_{i,k}(z) = f_{i,k}(z) · G_k(z) + F_{i,k}(z) · g_k(z)
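The product rule behind this density is easy to verify numerically; the phoneme and noise parameters below are made up for illustration:

```python
import math

def pdf(z, mu, s):
    return math.exp(-(z - mu) ** 2 / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)

def cdf(z, mu, s):
    return 0.5 * (1 + math.erf((z - mu) / (math.sqrt(2) * s)))

mu_x, s_x = 1.0, 1.0   # hypothetical phoneme parameters in bin k
mu_y, s_y = 0.0, 1.0   # hypothetical noise parameters in bin k

def H(z):
    """H_{i,k}(z) = F_{i,k}(z) * G_k(z)."""
    return cdf(z, mu_x, s_x) * cdf(z, mu_y, s_y)

def h(z):
    """h_{i,k}(z) = f_{i,k}(z) G_k(z) + F_{i,k}(z) g_k(z)  (product rule)."""
    return pdf(z, mu_x, s_x) * cdf(z, mu_y, s_y) + cdf(z, mu_x, s_x) * pdf(z, mu_y, s_y)

# Sanity check: h should match the numerical derivative of H.
eps = 1e-6
print(h(0.5), (H(0.5 + eps) - H(0.5 - eps)) / (2 * eps))
```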

So the density function h(z) of z is:

h(z) = Σ_{i=1..m} c_i · Π_k h_{i,k}(z_k)

The noisy speech Z is observed, and our goal is to estimate the clean speech X from it, that is, to find the conditional expectation of X given Z. Under the MMSE criterion, the estimate of X is:

x̂ = E[X | Z = z] = Σ_{i=1..m} q(i | z) · E[X | Z = z, I = i]

The conditional expectation of X thus becomes a weighted sum of per-phoneme conditional expectations. The posterior q(i | z) follows from the total-probability (Bayes) formula:

q(i | z) = c_i · Π_k h_{i,k}(z_k) / ( Σ_{j=1..m} c_j · Π_k h_{j,k}(z_k) )

The conditional expectation for each phoneme factorizes over dimensions, and for each dimension k of the log-spectrum of phoneme i it is:

E[X_k | Z = z, I = i] = ρ_{i,k} · z_k + (1 − ρ_{i,k}) · x̃_{i,k}

Here ρ_{i,k} is the probability that bin k is dominated by speech rather than noise, and x̃_{i,k} = E[X_k | X_k ≤ z_k, I = i] is the mean of the speech Gaussian truncated at z_k.

where:

ρ_{i,k} = f_{i,k}(z_k) · G_k(z_k) / h_{i,k}(z_k),    x̃_{i,k} = μ_{i,k} − σ²_{i,k} · f_{i,k}(z_k) / F_{i,k}(z_k)

Defining ρ_k = Σ_{i=1..m} q(i | z) · ρ_{i,k}, the estimate of each dimension of the log-spectrum of x can be written as:

x̂_k = Σ_{i=1..m} q(i | z) · [ ρ_{i,k} · z_k + (1 − ρ_{i,k}) · x̃_{i,k} ]

An alternative based on spectral subtraction replaces the term x̃_{i,k} (the estimate used when a bin is judged noise-dominated) with z_k − β, where β controls the depth of the denoising. ρ_k can be regarded as the probability that the bin contains clean speech, so:

x̂_k = ρ_k · z_k + (1 − ρ_k) · (z_k − β)

Cancelling the matching positive and negative terms gives:

x̂_k = z_k − β · (1 − ρ_k)

The above formula is the expression for each dimension of the log-spectrum of the denoised speech. z_k comes directly from the noisy speech and β requires tuning, so the estimate of x_k is available once ρ_k is known. Inverse-transforming the estimated log-spectral vector (typically reusing the phase of the noisy STFT) yields the denoised time-domain signal.
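The per-bin rule is tiny; a minimal sketch (the ρ_k values and β are illustrative, and β must be tuned in practice):

```python
# x_hat_k = z_k - beta * (1 - rho_k): bins judged speech (rho_k near 1) pass
# almost unchanged; bins judged noise (rho_k near 0) are attenuated by up to
# beta in the log domain.
def enhance_bin(z_k, rho_k, beta):
    return z_k - beta * (1.0 - rho_k)

print(enhance_bin(2.0, 0.9, 1.5))   # speech-like bin: about 1.85
print(enhance_bin(2.0, 0.1, 1.5))   # noise-like bin: about 0.65
```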

 

As given above, ρ_k = Σ_i p_i · ρ_{i,k}, where p_i = p(I = i | Z = z) is the probability of each phoneme given the observed frame of noisy speech. p_i can be computed from the MoG by the total-probability formula. But for each language the total number of phonemes is known (for example, English is commonly described with 39 phonemes), so finding the probability that a frame belongs to each phoneme is a typical classification problem. Neural networks (NN) outperform the traditional computation on classification problems, so an NN can be trained and used at inference time to output the per-frame phoneme probabilities p_i; these are multiplied by the ρ_{i,k} (computed from the MoG model) and accumulated to obtain ρ_k = Σ_i p_i · ρ_{i,k}. With ρ_k, the estimate of x_k follows. The role of the NN model is therefore to replace the traditional computation of p_i with a more accurate one.
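How the NN output plugs in can be sketched as follows; the logits and the per-phoneme ρ_{i,k} values are stand-ins (in the paper the classifier is trained on MFCC features):

```python
import math

def softmax(logits):
    """Turn per-phoneme scores into probabilities p_i that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical 3-phoneme output-layer scores for one frame:
p = softmax([2.0, 1.0, 0.1])        # p_i = p(I = i | frame)

# Per-phoneme speech-presence values rho_{i,k} for one bin (stand-ins; in the
# real algorithm these come from the MoG and the noise Gaussian):
rho_ik = [0.9, 0.5, 0.2]
rho_k = sum(p_i * r for p_i, r in zip(p, rho_ik))   # rho_k = sum_i p_i * rho_ik
print(rho_k)
```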

 

The Gaussian model of clean speech is not trained with the usual EM algorithm; instead it is estimated from a phoneme-annotated corpus (the authors use the TIMIT corpus). Each frame is mapped to its phoneme, all frames belonging to a phoneme are pooled into one class, and the mean and variance of each log-spectral dimension are computed over that class to obtain the density of that dimension:

μ_{i,k} = (1/N_i) · Σ_{t: i_t = i} x_{t,k},    σ²_{i,k} = (1/N_i) · Σ_{t: i_t = i} (x_{t,k} − μ_{i,k})²

where N_i is the number of frames belonging to phoneme i. Multiplying the per-dimension densities gives the density of the phoneme, and the weight of the phoneme is the fraction of all frames that belong to it (c_i = N_i / N, where N is the total number of frames). This establishes the Gaussian mixture model of clean speech.
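This supervised estimation is just per-class sample statistics; a toy sketch with 1-dimensional "log-spectra" and two hypothetical phoneme labels:

```python
# Supervised MoG estimation from phoneme-labeled frames (toy data).
frames = [(0.9, "a"), (1.1, "a"), (3.0, "b"), (3.2, "b"), (2.8, "b")]

by_phoneme = {}
for x, label in frames:
    by_phoneme.setdefault(label, []).append(x)

n_total = len(frames)
params = {}
for label, xs in by_phoneme.items():
    n_i = len(xs)
    mu = sum(xs) / n_i                          # per-phoneme mean
    var = sum((x - mu) ** 2 for x in xs) / n_i  # per-phoneme variance
    params[label] = (mu, var, n_i / n_total)    # weight c_i = N_i / N

print(params)
```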

 

For non-stationary noise, the noise parameters (μ_{Y,k} and σ_{Y,k}) should be adapted over time. Their initial values can be estimated from the first 250 milliseconds of each utterance (under the assumption that those first 250 ms are noise only), with the same sample statistics used for the clean-speech model:

μ_{Y,k} = (1/N₀) · Σ_{t=1..N₀} z_{t,k},    σ²_{Y,k} = (1/N₀) · Σ_{t=1..N₀} (z_{t,k} − μ_{Y,k})²

where N₀ is the number of frames in the first 250 ms.

The noise parameters are then updated frame by frame with a first-order recursive smoothing of the form:

μ_{Y,k} ← α · μ_{Y,k} + (1 − α) · z_{t,k},    σ²_{Y,k} ← α · σ²_{Y,k} + (1 − α) · (z_{t,k} − μ_{Y,k})²

where α is a smoothing coefficient with 0 < α < 1, which also requires tuning. Whenever the noise parameters (μ_{Y,k} and σ_{Y,k}) are updated, G_k(y) and g_k(y_k) are updated, hence h_{i,k}(z) is updated, and thus the ρ_{i,k} are updated as well.
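Recursive noise tracking of this kind can be sketched as follows; the smoothing rule here is a common form, and the exact update in the paper may differ in detail:

```python
# First-order recursive update of the noise Gaussian for one bin.
def update_noise(mu, var, z_k, alpha):
    mu_new = alpha * mu + (1 - alpha) * z_k
    var_new = alpha * var + (1 - alpha) * (z_k - mu_new) ** 2
    return mu_new, var_new

# Feed a few hypothetical noise-dominated log-spectral values through the tracker:
mu, var = 0.0, 1.0
for z in [0.2, -0.1, 0.3]:
    mu, var = update_noise(mu, var, z, alpha=0.9)
print(mu, var)
```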

 

In summary, the noise-reduction algorithm based on the hybrid generative/discriminative model is as follows:

1) Training stage

Input:

From the phoneme-annotated corpus, obtain the log-spectral vectors z1,…,zn of the clean training speech (for estimating the MoG), the MFCC vectors v1,…,vn (for NN training), and the phoneme label i1,…,in of each frame.

 

MoG model training:

Estimate the MoG of clean speech from the log-spectral vectors z1,…,zn.

 

NN model training:

Train a multi-class phoneme classifier on the pairs (v1, i1), …, (vn, in).

 

2) Inference stage

Input:

The log-spectral vector and MFCC vector of each frame of the noisy speech

 

Output:

The denoised speech

 

Calculation steps (per frame):

1) Compute the log-spectrum z and the MFCC vector of the noisy frame.

2) Feed the MFCC vector to the NN to obtain the phoneme probabilities p_i.

3) Update the noise Gaussian and compute ρ_{i,k} from the MoG and the noise model.

4) Combine them: ρ_k = Σ_i p_i · ρ_{i,k}.

5) Estimate each bin as x̂_k = z_k − β · (1 − ρ_k).

6) Inverse-transform the estimated log-spectrum (reusing the noisy phase) back to the time domain.
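The steps above can be put together into a minimal per-frame sketch; every parameter here (phoneme Gaussians, noise Gaussian, NN posteriors, β) is a stand-in for what the trained system would supply:

```python
import math

def pdf(z, mu, s):
    return math.exp(-(z - mu) ** 2 / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)

def cdf(z, mu, s):
    return 0.5 * (1 + math.erf((z - mu) / (math.sqrt(2) * s)))

# Hypothetical 2-phoneme, single-bin models (stand-ins for the trained system):
phonemes = [(2.5, 1.0), (0.5, 1.0)]   # (mu_ik, sigma_ik) per phoneme for this bin
mu_y, s_y = 0.0, 1.0                  # current noise Gaussian for this bin
p = [0.8, 0.2]                        # NN phoneme posteriors for this frame (stand-in)
beta = 1.5                            # spectral-subtraction depth, tuned in practice

def enhance_bin(z_k):
    rho_k = 0.0
    for (mu, s), p_i in zip(phonemes, p):
        # h_{i,k}(z) = f G + F g; rho_{i,k} = f G / h  (speech-dominance probability)
        h = pdf(z_k, mu, s) * cdf(z_k, mu_y, s_y) + cdf(z_k, mu, s) * pdf(z_k, mu_y, s_y)
        rho_ik = pdf(z_k, mu, s) * cdf(z_k, mu_y, s_y) / h
        rho_k += p_i * rho_ik
    return z_k - beta * (1.0 - rho_k)  # x_hat_k = z_k - beta * (1 - rho_k)

print(enhance_bin(2.0))    # bin likely speech: little attenuation
print(enhance_bin(-1.0))   # bin likely noise: stronger attenuation
```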


Origin blog.csdn.net/david_tym/article/details/116904596