Paper translation: 2021_A New Real-Time Noise Suppression Algorithm for Far-Field Speech Communication Based on ...

Paper address: A new real-time noise suppression algorithm for far-field voice communication based on recurrent neural network

Citation: Chen B, Zhou Y, Ma Y, et al. A New Real-Time Noise Suppression Algorithm for Far-Field Speech Communication Based on Recurrent Neural Network[C]//2021 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC). IEEE, 2021: 01-05.


Abstract

  In teleconferencing scenarios, speech is often corrupted by background noise, which reduces speech clarity and quality, so enhancing speech in noisy environments is essential. This paper investigates a far-field real-time speech enhancement method based on a modified recurrent neural network (RNN) built from gated recurrent units (GRUs). The ideal amplitude mask (IAM) of the reverberant target speech is used as the RNN training target. We also employ feature normalization and the proposed sub-band normalization technique to reduce feature variance, which further helps the RNN learn long-term patterns. Meanwhile, to further suppress the pseudo-stationary residual noise between speech harmonics caused by sub-band partitioning, we combine the RNN with the optimally modified log-spectral amplitude (OMLSA) algorithm. Experimental results show that the method improves speech quality, reduces distortion, and has low real-time computational complexity.

Keywords: speech enhancement; recurrent neural network; ideal amplitude mask; optimally modified log-spectral amplitude (OMLSA)

1 Introduction

   In speech technology and its practical applications, speech is often distorted by background noise and reverberation, degrading the speech communication experience and lowering automatic speech/speaker recognition rates [1]. Speech enhancement has become a key means of combating noise and reverberation. In recent years, deep learning-based speech enhancement methods [2] have received extensive attention and achieved notable success, showing advantages over traditional signal processing methods. A major benefit is the relative ease of integrating complex learning objectives, which helps push enhanced speech toward better quality and intelligibility [3]. However, due to the large scale and high computational complexity of most neural networks, real-time noise suppression and dereverberation with neural networks remains a challenging task.

  In indoor voice communication scenarios, the room acoustics introduce a certain amount of reverberation. In a moderately reverberant environment, speech intelligibility does not drop noticeably. Therefore, noise removal alone is crucial to improving speech quality and intelligibility. Moreover, a certain degree of reverberation helps hearing comfort and clarity [4]. In this paper, we therefore ignore the dereverberation problem and focus on developing an improved method with low complexity and real-time processing capability to remove noise in far-field environments.

  A major challenge in optimizing speech enhancement algorithms is to suppress noise in far-field environments while preserving the perceived quality of speech as much as possible. Classical speech enhancement methods include spectral subtraction, Wiener filtering, statistical-model-based methods [5], and so on. However, most of these methods rely on estimating the noise spectrum and on assumptions about prior information. While they work well in many common noise environments, they fail to perform as expected with non-stationary and diffuse noise. Therefore, researchers increasingly turn to deep learning techniques for more effective solutions. Recently, Valin proposed a low-complexity method based on a recurrent neural network (RNN) that combines deep learning and signal processing techniques for real-time processing of audio sampled at 48 kHz [6]. However, the speech intelligibility and quality degradation introduced by this method hinders its direct application.

  Inspired by [6], this paper proposes an improved low-complexity RNN approach for real-time, high-sampling-rate (48 kHz) speech enhancement systems in noisy and moderately reverberant environments. First, the features and the feature/sub-band normalization techniques are analyzed. Then, a parallel processing scheme is proposed in which a classical signal processing algorithm and the RNN compute gains separately. The goal is to further remove residual noise between speech harmonics, since the fine structure of the spectrum cannot be modeled under sub-band partitioning. The results show that this method avoids large computational complexity and further improves speech quality and intelligibility.

  The rest of the paper is organized as follows: Section II presents the RNN-based algorithm proposed in this paper. Section III presents the experimental setup and results along with corresponding evaluations. Section IV summarizes the conclusions of this paper.

2 System architecture and method

A. Signal model

   Let $y(n)$, $x(n)$ and $u(n)$ denote the time-domain noisy signal, reverberant speech signal, and noise signal, respectively:

$$y(n)=x(n)+u(n) \tag{1}$$

Analysis and synthesis use the same window, defined as

$$w(n)=\sin\left[\frac{\pi}{2}\sin^{2}\left(\frac{\pi n}{N}\right)\right] \tag{2}$$

where $N$ is the window length. After windowing, a fast Fourier transform (FFT) is applied, giving the following frequency-domain expression:

$$Y(l,k)=X(l,k)+U(l,k) \tag{3}$$

where $Y(l,k)$, $X(l,k)$ and $U(l,k)$ denote the FFTs of the above time-domain signals at the $l$-th time frame and the $k$-th frequency bin, respectively.
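The window in Eq. (2) is power-complementary: with 50% overlap (a 20 ms window and 10 ms shift), the squared overlapped windows sum to one, so windowed analysis followed by windowed overlap-add reconstructs the signal exactly. A minimal numpy check, assuming $N = 960$ (20 ms at 48 kHz):

```python
import numpy as np

N = 960  # 20 ms window at 48 kHz (assumed frame size)
n = np.arange(N)
# Eq. (2): w(n) = sin[(pi/2) * sin^2(pi*n / N)]
w = np.sin((np.pi / 2) * np.sin(np.pi * n / N) ** 2)

# Power-complementarity at 50% overlap: w(n)^2 + w(n + N/2)^2 = 1,
# which makes analysis + synthesis windowing with overlap-add exact.
ola = w[: N // 2] ** 2 + w[N // 2 :] ** 2
print(np.allclose(ola, 1.0))  # True
```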

  The block diagram of the system is shown in Figure 1. Most of the noise suppression is achieved by the RNN computing gains on a low-resolution spectrum rather than on individual frequency bins. Consequently, some noise residue, such as pseudo-stationary noise, inevitably remains between speech harmonics. Combined with the gains calculated by the improved OMLSA algorithm, a finer suppression step further attenuates this noise between the speech harmonics.

Figure 1 The framework diagram of the proposed system

B. Band structure and feature representation

  When training a neural network, the most important step is finding suitable features, which greatly affects the training efficiency and final inference performance of the network [7]. However, in the methods introduced in [2], many neural networks estimate masks or magnitudes per frequency bin, which typically requires millions of weights to process speech sampled at 48 kHz. Networks of this type are hard to deploy in systems that require low power consumption and real-time processing. Considering the sub-band signal processing method described in [8], we choose to partition the spectrum on the Mel scale, where the Mel sub-bands are dense at low frequencies and sparse at high frequencies. On the Mel scale, the frequency bins are divided into 48 bands. Let $w_m(k)$ be the amplitude of band $m$ at bin $k$; the transfer function of band $m$ is defined as

$$w_{m}(k)=\left\{\begin{array}{ll}
0, & k<f(m-1) \\
\frac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1) \leq k \leq f(m) \\
\frac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)<k \leq f(m+1) \\
0, & k>f(m+1)
\end{array}\right. \tag{4}$$

where

$$f(m)=\left(\frac{N}{f_{s}}\right) F_{\text{mel}}^{-1}\left(F_{\text{mel}}\left(f_{l}\right)+m \frac{F_{\text{mel}}\left(f_{h}\right)-F_{\text{mel}}\left(f_{l}\right)}{M+1}\right) \tag{5}$$

$$F_{\text{mel}}(f)=1125 \ln\left(1+\frac{f}{700}\right) \tag{6}$$

where $M=48$ is the number of bands, audio is processed at a 48 kHz sampling rate, $f_h=24000$, $f_l=0$, $f_s=48000$, and $F^{-1}_{\text{mel}}(f)$ is the inverse of $F_{\text{mel}}(f)$. The 481-dimensional spectral features are thus compressed to 48 dimensions.
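The band layout of Eqs. (4)-(6) can be sketched in numpy as follows; the helper names (`mel`, `mel_inv`) and the final example spectrum are illustrative, not from the paper:

```python
import numpy as np

def mel(f):
    # Eq. (6), with the natural log paired with the 1125 constant
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_inv(m):
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

N, fs, M = 960, 48000, 48
f_l, f_h = 0.0, 24000.0
K = N // 2 + 1  # 481 frequency bins

# Eq. (5): band edges in (fractional) bin units, m = 0 .. M+1
m = np.arange(M + 2)
f_edges = (N / fs) * mel_inv(mel(f_l) + m * (mel(f_h) - mel(f_l)) / (M + 1))

# Eq. (4): triangular weight of band m at bin k
W = np.zeros((M, K))
k = np.arange(K)
for b in range(1, M + 1):
    lo, c, hi = f_edges[b - 1], f_edges[b], f_edges[b + 1]
    rising = (k - lo) / (c - lo)
    falling = (hi - k) / (hi - c)
    W[b - 1] = np.clip(np.minimum(rising, falling), 0.0, None)

# A 481-bin power spectrum is compressed to 48 band values:
band_energy = W @ np.ones(K)  # shape (48,)
```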

  Here, we propose a sub-band normalization (SN) technique to normalize the sub-band amplitude coefficients. Its purpose is to reduce the sub-band energy differences caused by the unequal band partitioning. The sub-band normalization for band $m$ is described as

$$w'_m(k)=\frac{w_m(k)}{\sum_k w_m(k)},\quad \sum_k w'_m(k)=1 \tag{7}$$

  For the transformed signal $Y(k)$, the log energy of band $m$ is computed as:

$$E_{Y}(m)=\log\left(\max\left(\sum_{k} w_{m}^{\prime}(k)|Y(k)|^{2}, \alpha\right)\right) \tag{8}$$

where $\alpha=10^{-6}$ floors the sub-band energy at -60 dB. This yields 48 Mel-scale sub-band energies. In addition, we introduce the spectral flatness (SF) to help detect speech and assist the VAD estimation. The SF of the $l$-th frame is:

$$SF(l)=10 \times \log _{10}\left(\frac{\exp \left(\frac{1}{K} \sum_{k} \ln (Y(l, k))\right)}{\frac{1}{K} \sum_{k} Y(l, k)}\right) \tag{9}$$

The smoothed spectral flatness is obtained as:

$$SF_{\text{smooth}}(l)=\gamma SF_{\text{smooth}}(l-1)+(1-\gamma) SF(l) \tag{10}$$

where $K=400$ is the total number of frequency bins used, $\beta=10^{-3}$, and $\gamma$ is the smoothing parameter. A total of 49 input features are used.
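Spectral flatness (Eq. 9) is the ratio of the geometric to the arithmetic mean of the spectrum in dB: near 0 dB for noise-like frames, strongly negative for harmonic (voiced) frames. A small sketch, in which flooring the power spectrum with `beta` is our assumption about how $\beta=10^{-3}$ enters, and `gamma = 0.9` is an assumed smoothing value:

```python
import numpy as np

def spectral_flatness_db(power, beta=1e-3):
    # Eq. (9): 10*log10(geometric mean / arithmetic mean).
    # beta floors the power spectrum so log(0) never occurs
    # (assumption: this is how the paper's beta = 1e-3 is used).
    p = np.maximum(power, beta)
    geo = np.exp(np.mean(np.log(p)))
    arith = np.mean(p)
    return 10.0 * np.log10(geo / arith)

def smooth_sf(sf_prev, sf_now, gamma=0.9):
    # Eq. (10): first-order recursive smoothing (gamma value assumed).
    return gamma * sf_prev + (1.0 - gamma) * sf_now

flat = spectral_flatness_db(np.ones(400))             # noise-like: ~0 dB
peaky = spectral_flatness_db(np.eye(400)[0] * 400.0)  # tonal: strongly negative
```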

C. Feature normalization

  After extracting the Mel-scale band energy features, we find that the energy difference between high- and low-frequency sub-bands is relatively large; normalizing them uniformly may cause features with smaller energy scales to be ignored. Therefore, we introduce an online normalization technique and combine it with the above features. This method first removes the effect of feature scale differences and then further helps the RNN learn long-term patterns. In this normalization, we smooth the running mean and variance with the following exponential decay:

$$\mu_{E(Y)}(l, m)=\lambda \mu_{E(Y)}(l-1, m)+(1-\lambda) E_{Y}(l, m) \tag{11}$$

$$\sigma_{E(Y)}^{2}(l, m)=\lambda \sigma_{E(Y)}^{2}(l-1, m)+(1-\lambda) E_{Y}^{2}(l, m) \tag{12}$$

$$E_{Y}^{\prime}(l, m)=\frac{E_{Y}(l, m)-\mu_{E(Y)}(l, m)}{\sqrt{\sigma_{E(Y)}^{2}(l, m)-\mu_{E(Y)}^{2}(l, m)}} \tag{13}$$

where $\lambda=\exp(-\frac{\Delta t}{\tau})$, $\Delta t$ is the frame shift in seconds, and $\tau=1\,\mathrm{s}$ is a time constant controlling the adaptation speed. Every 2000 frames, we re-initialize the running mean and variance of the current frame with zeros and then resume the computation for subsequent frames. This further improves the robustness of the network.
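Eqs. (11)-(13) amount to an exponentially weighted running standardization; a per-band numpy sketch, in which the class name and the variance floor `eps` are our additions:

```python
import numpy as np

class OnlineNorm:
    """Running mean/variance normalization of Eqs. (11)-(13)."""

    def __init__(self, num_bands=48, frame_shift=0.01, tau=1.0,
                 reset_every=2000, eps=1e-8):
        self.lam = np.exp(-frame_shift / tau)  # lambda = exp(-dt/tau)
        self.reset_every = reset_every
        self.eps = eps  # variance floor (our addition, avoids divide-by-zero)
        self.num_bands = num_bands
        self.reset()

    def reset(self):
        # Re-initialize the running statistics with zeros (every 2000 frames).
        self.mu = np.zeros(self.num_bands)
        self.ms = np.zeros(self.num_bands)  # running mean of E^2, Eq. (12)
        self.frame = 0

    def __call__(self, e):
        if self.frame > 0 and self.frame % self.reset_every == 0:
            self.reset()
        lam = self.lam
        self.mu = lam * self.mu + (1 - lam) * e        # Eq. (11)
        self.ms = lam * self.ms + (1 - lam) * e ** 2   # Eq. (12)
        var = np.maximum(self.ms - self.mu ** 2, self.eps)
        self.frame += 1
        return (e - self.mu) / np.sqrt(var)            # Eq. (13)
```

Each frame's 48-dimensional log-energy vector is pushed through the object in streaming order, so no future context is needed.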

D. Learning machine and training setup

  RNNs have shown excellent performance in real-time speech enhancement (SE) tasks [7]. Although both the LSTM and the GRU are dominant gated architectures that update sequence information while avoiding exponential weight decay or explosion, the GRU outperforms the LSTM in computational efficiency with excellent performance [9]. Therefore, in the proposed method, we stack three GRU layers followed by one fully connected layer to estimate the 48-dimensional sub-band gains and the VAD simultaneously. The training label, based on the ideal ratio mask (IRM), is defined as

$$g_{birm}(m)=\max\left(\sqrt{\frac{E_{X}(m)}{E_{Y}(m)}},10^{-3}\right) \tag{14}$$

where $E_X(m)$ and $E_Y(m)$ are the band energies of the clean reverberant speech and the noisy reverberant speech, respectively. Moreover, when a band contains neither speech nor noise, the ground-truth gain is explicitly marked as undefined, and the sub-band gain of the corresponding frame is ignored during training to avoid degrading the overall training of the network. In addition, we use a loss function combining the mean squared error (MSE) and the scale-invariant signal-to-distortion ratio (SI-SDR) [10], which provides better performance. It can be written as

$$L=\varsigma\left(\sqrt{\hat{g}_{birm}(m)}-\sqrt{g_{birm}(m)}\right)^{2}+(1-\varsigma)\left(-10 \log _{10}\frac{\left\|\frac{\hat{x}^{T} x}{\|x\|^{2}} x\right\|^{2}}{\left\|\frac{\hat{x}^{T} x}{\|x\|^{2}} x-\hat{x}\right\|^{2}}\right) \tag{15}$$

where $x$ and $\hat{x}$ denote the ground-truth and predicted time-domain speech waveforms, respectively. In a series of experiments, we found that $\varsigma = 0.4$ gives a good trade-off between noise suppression and speech distortion.
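The SI-SDR term in Eq. (15) projects the estimate onto the target before measuring distortion, making it invariant to the output scale. A numpy sketch of the combined loss; the function names are ours, and averaging the band-gain MSE over bands is our reading of the first term:

```python
import numpy as np

def si_sdr_db(x_hat, x):
    # SI-SDR [10]: project x_hat onto x, then compare the target
    # component with the residual.
    s = (np.dot(x_hat, x) / np.dot(x, x)) * x
    return 10.0 * np.log10(np.dot(s, s) / np.dot(s - x_hat, s - x_hat))

def combined_loss(g_hat, g, x_hat, x, zeta=0.4):
    # Eq. (15): zeta * MSE on sqrt-compressed band gains
    #           + (1 - zeta) * negative SI-SDR.
    mse = np.mean((np.sqrt(g_hat) - np.sqrt(g)) ** 2)
    return zeta * mse + (1.0 - zeta) * (-si_sdr_db(x_hat, x))

# Example: a target sine plus a small orthogonal error gives ~20 dB SI-SDR.
t = np.arange(1000) / 1000.0
x = np.sin(2 * np.pi * t)
x_hat = x + 0.1 * np.cos(2 * np.pi * t)
print(round(si_sdr_db(x_hat, x), 1))  # -> 20.0
```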

E. Strategy for combining the RNN with the improved OMLSA

  The output of the proposed RNN architecture is based on the Mel-scale criterion, since the spectrum of each frame is partitioned into 48-dimensional sub-bands, which reduces the spectral resolution. Its main drawback is that finer details in the spectrum cannot be modeled. We also found that some residual noise remains between speech harmonics. Therefore, we introduce an optimized OMLSA algorithm and combine it with the RNN to handle noise reduction and residual-noise suppression in parallel.

  In practical applications, more accurate speech detection benefits the noise update and improves the robustness of the algorithm in speech signal processing. Therefore, we use the VAD value predicted frame by frame by the RNN to replace the traditional OMLSA VAD decision, which facilitates speech activity detection, assists the noise update, and reduces computational complexity. Due to space limitations, details of the other functional modules in the improved OMLSA are omitted. The gains at 400 frequency bins computed by OMLSA are denoted $g_{omlsa}$, while the gains estimated by the RNN are 48-dimensional; we use an interpolation matrix (IM) to transform the band IRM $g_{birm}$ from 48 to 400 dimensions, and denote the interpolated gains by $g_{irm}$. The final gain $g$ is thus given by

$$g=\min(g_{omlsa},g_{irm}) \tag{16}$$

  Taking the minimum of the gains computed by the two different methods not only preserves the advantages of the network, but also further removes the residual noise between speech harmonics introduced by the sub-band partitioning.
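The 48-to-400 expansion and the minimum combination of Eq. (16) can be sketched as follows; building the interpolation matrix from linearly spaced band centers with `np.interp` is our assumption, since the paper does not spell out the IM (its actual layout follows the Mel bands):

```python
import numpy as np

K, M = 400, 48
# Assumed band centers: 48 points spread over the 400 bins (illustrative only).
band_centers = np.linspace(0, K - 1, M)

def interpolate_gains(g_birm):
    # Expand 48 band gains to 400 per-bin gains by linear interpolation.
    return np.interp(np.arange(K), band_centers, g_birm)

def final_gain(g_omlsa, g_birm):
    # Eq. (16): elementwise minimum of the OMLSA gains and the interpolated
    # RNN gains keeps the stronger suppression of the two at every bin.
    return np.minimum(g_omlsa, interpolate_gains(g_birm))
```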

3 Experiments and evaluation

A. Experimental setup and performance evaluation

   Training and evaluating the RNN requires clean reverberant speech and noisy reverberant speech. The training data are constructed artificially by adding noise to reverberant speech, and the clean reverberant speech is obtained by convolving clean speech with room impulse responses (RIRs). For the clean speech and noise data, we use the publicly available DNS Challenge 3 speech database (Chinese, German, and Spanish), the UK-and-Ireland English dialect speech database [11], and the McGill TSP speech database [6]. Various noise sources are used, including computer fan, car, office, street, babble, and train noise. The RIRs are generated with an RIR generator based on the image method [12][13], simulating different room sizes with randomly chosen receiver and speaker positions. The distance between the receiver and the speaker is randomly drawn from the range 1 m to 5 m, and the reverberation time (T60) is randomly selected within [0.1, 0.6] seconds.
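The data construction step above (reverberant target plus noise mixed at a chosen SNR) can be sketched as follows; the SNR-scaling helper is our own, not code from the paper:

```python
import numpy as np

def make_noisy_reverberant(speech, rir, noise, snr_db):
    # Clean reverberant target: clean speech convolved with an RIR.
    x = np.convolve(speech, rir)[: len(speech)]
    # Scale the noise so 10*log10(||x||^2 / ||a*noise||^2) equals snr_db.
    noise = noise[: len(x)]
    a = np.sqrt(np.sum(x ** 2) / (np.sum(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    y = x + a * noise  # noisy reverberant mixture, Eq. (1)
    return x, y
```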

  The experiments use a cuDNN GRU implementation and run on an Nvidia GPU with the Keras backend. All training data are split into sequences of 2000 consecutive frames (as in RNNoise) and fed into the network sequentially. The network is trained for 60 epochs with a batch size of 64, using the Adam optimizer to minimize the loss function. All signals are sampled at 48 kHz; the frame window length is 20 ms with a frame shift of 10 ms.

  We compare the proposed method with RNNoise and the MSE-based noise suppression method in the SpeexDSP library. STOI and PESQ are used to evaluate the intelligibility and quality of the speech enhanced by the different methods; higher scores indicate better enhanced speech.

B. Evaluation results and discussion

  In the experiments, different clean reverberant speech signals and five noise types (car, office, street, babble, and stationary noise) were used to generate a total of 8 hours of test corpus at various SNRs. Neither the noise nor the RIRs of the test data were used during training. The improvement of each method's criteria was computed under the five noise types. Tables 1 and 2 show the average STOI and PESQ scores of the test corpus for different noise types under the same SNR and the same reverberation time, respectively. From the scores in Tables 1 and 2, although all three methods show the same speech enhancement trend, the proposed method clearly outperforms the other two baselines. Moreover, the performance at high SNR is better than at low SNR.

   Figure 2 shows an example of the noise suppression results on a corpus recorded in a real room scenario; the red rectangles indicate that the proposed method suppresses noise in non-speech segments better than the other two baseline algorithms. In speech segments, the method significantly reduces speech distortion and further removes residual noise between speech harmonics. The proposed method also has some limitations: first, some residual noise remains at high frequencies; second, differential features and the pitch period are not used, which degrades performance when suppressing transient noise present in speech segments. We may consider introducing these features to further train the network. The overall results show that the method improves speech quality and intelligibility in various far-field scenarios.

Figure 2. Spectrograms of a corpus recorded in a real room scenario and the processing results of the MSE-based method, RNNoise, and the proposed method, respectively.


C. Complexity analysis

  Computational complexity is an important issue for real-time implementation. The executable stores a total of 92,109 weights (about 70 KB), learned by the 217 neurons of the neural network. Each frame is enhanced in real time through multiply-add operations with these weights. Processing one frame requires about 35 MFLOPS in total, covering the FFT/IFFT, feature extraction, and the neural network computation. On a laptop with an Intel(R) Core(TM) i5-9300H CPU @ 2.40 GHz, the proposed method takes 0.5 ms on average to process each speech frame, compared with about 0.8 ms for RNNoise, a 60% per-frame speed-up.

4 Conclusion

  This paper proposed a real-time noise suppression method combining an RNN with an improved OMLSA algorithm for far-field environments. The method targets noise removal only, under normal reverberation conditions. The system was evaluated under various noise and T60 conditions, and the experimental results show that the method not only improves noise suppression but also reduces speech distortion. Moreover, the resulting low complexity and fast processing make the method suitable for embedded devices and video conferencing systems.

References

[1] Y. Zhao, D. L. Wang, I. Merks, and T. Zhang, "DNN-based enhancement of noisy and reverberant speech," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 6525-6529.

[2] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702-1726, 2018.

[3] Y. Xia, S. Braun, C. K. A. Reddy, H. Dubey, R. Cutler, and I. Tashev, "Weighted speech distortion losses for neural-network-based real-time speech enhancement," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 871-875.

[4] J. S. Bradley, H. Sato, and M. Picard, "On the importance of early reflections for speech in rooms," Journal of the Acoustical Society of America, vol. 113, pp. 3233-3244, 2003.

[5] Y. Hu and P. Loizou, "Evaluation of objective measures for speech enhancement," in Proc. Interspeech, 2006, pp. 1447-1450.

[6] J.-M. Valin, "A hybrid DSP/deep learning approach to real-time full-band speech enhancement," in 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), 2018, pp. 1-5.

[7] J. Chen, Y. Wang, and D. Wang, "A feature study for classification-based speech separation at very low signal-to-noise ratio," in Proc. ICASSP, 2014, pp. 7059-7063.

[8] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.

[9] C. K. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan, and J. Gehrke, "A scalable noisy speech dataset and online subjective test framework," in Proc. Interspeech, 2019, pp. 1816-1820.

[10] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR - half-baked or well done?" in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 626-630.

[11] I. Demirsahin, O. Kjartansson, A. Gutkin, and C. Rivera, "Open-source multi-speaker corpora of the English accents in the British Isles," in Proceedings of the 12th Language Resources and Evaluation Conference (LREC), Marseille, France, May 2020, pp. 6532-6541.

[12] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," Journal of the Acoustical Society of America, vol. 65, pp. 943-950, 1979.

[13] E. A. P. Habets, "Room impulse response generator," Technische Universiteit Eindhoven, Tech. Rep. 2.2.4, 2006.


Origin blog.csdn.net/qq_34218078/article/details/126549066