RNNoise: Learning Noise Suppression

Table of contents

1. Introduction to RNNoise

2. Noise suppression

3. Deep Learning and Recurrent Neural Networks

4. A Hybrid Approach

5. Deep Neural Network Layer Architecture

6. About the dataset

7. From Python to C

8. Additional resources


RNNoise: Learning Noise Suppression

Original address: RNNoise: Learning Noise Suppression (jmvalin.ca)

1. Introduction to RNNoise

This article shows how deep learning can be applied to noise suppression. The main idea is to combine traditional signal processing with machine learning to obtain a speech noise-suppression algorithm that has a small memory footprint, runs fast, and needs no GPU; it can even run on a Raspberry Pi. The result is better than traditional noise suppression systems, and the algorithm is easier to tune.

2. Noise suppression

Noise suppression is a perpetual topic in speech signal processing. As the name suggests, the goal is to take a noisy signal and remove as much noise as possible while causing minimal distortion to the speech of interest.

This is a block diagram of a traditional noise suppression algorithm.

(1) A voice activity detection (VAD) module detects when the signal contains speech and when it is just noise

(2) A noise spectrum estimation module estimates the spectral characteristics of the noise (the power in each frequency)

(3) Spectral subtraction, i.e. "subtracting" the estimated noise from the input audio (a rough sketch of these three steps follows the list)
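As a rough illustration of the three steps above, here is a minimal spectral-subtraction sketch in Python/NumPy. The frame handling, window, smoothing factor `alpha` and spectral floor are illustrative assumptions, not the parameters of any particular library:

```python
import numpy as np

def suppress_frame(frame, noise_psd, alpha=0.95, floor=0.05):
    """One frame of naive spectral subtraction (illustrative only)."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    power = np.abs(spec) ** 2
    # (2) running estimate of the noise power spectrum
    # (in a real system this update would be gated by the VAD from step (1))
    noise_psd = alpha * noise_psd + (1 - alpha) * power
    # (3) "subtract" the noise: attenuate each bin, with a spectral floor
    gain = np.maximum(1.0 - noise_psd / np.maximum(power, 1e-12), floor)
    return np.fft.irfft(gain * spec), noise_psd
```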

From the block diagram above, noise suppression looks quite simple: only three straightforward tasks are needed, and writing code for such an algorithm is easy. The hard part is making it work well all the time, for every kind of noise, in every situation. That requires very careful tuning and study of every part of the algorithm, plus a lot of testing on strange signals and special cases; there is always some odd signal that breaks the algorithm. Even well-known open-source DSP libraries cannot claim to work 100% of the time, for example speexdsp:

Speex: a free codec for free speech

3. Deep Learning and Recurrent Neural Networks

Term: RNN (recurrent neural network).

The emergence of deep learning has promoted the development and application of speech signal processing. The progress in recent years includes:

(1) Networks are deeper, with more than two hidden layers

(2) Recurrent neural networks have longer memory

(3) The datasets used for training are richer

An RNN can model time sequences instead of treating each input and output independently. This is especially important for noise suppression, because estimating the noise takes time. For quite a while, the capabilities of RNNs were severely limited because they could not hold information over long periods of time and because the gradients involved in backpropagation through time are very inefficient (they tend to vanish). Both problems were solved by the invention of gated units, such as the LSTM (long short-term memory) and the GRU (gated recurrent unit), among many other variants.

Note: The GRU (gated recurrent unit) is a type of recurrent neural network. Like the LSTM (long short-term memory), it was proposed to address long-term memory and the gradient problems of backpropagation.

RNNoise uses the gated recurrent unit (GRU) because it performs slightly better than the LSTM on this task and requires fewer resources (both CPU time and memory). Compared with a simple recurrent unit, the GRU has two extra gates:

1) The reset gate: controls whether the current state is used in computing the new state

2) The update gate: controls how much the state changes based on the new input (how much the input is allowed to contribute)

The update gate (when closed) makes it possible, and easy, for a GRU to remember information over a long time, which is why GRUs (and LSTMs) perform better than simple recurrent units.

Terms:

feed-forward unit: a unit with no recurrent connections

simple recurrent unit: a basic recurrent unit without gates

As shown below:

A simple recurrent unit compared with a GRU: the difference lies in the GRU's r and z gates, which make long-term memory and learning possible. Both gates are soft switches (their values lie between 0 and 1), computed from the previous state of the whole layer and the new input, using a sigmoid activation function. When the update gate z is closed, the state can remain unchanged over a long period of time, until some condition causes z to open, at which point the state changes.
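To make the role of the two gates concrete, below is a minimal NumPy sketch of a single GRU time step. It follows the standard GRU equations (the weight names `W`, `U`, `b` are placeholders), not RNNoise's actual C implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b):
    """One GRU time step; W, U, b are dicts of weight matrices/biases
    for the update (z), reset (r) and candidate-state (h) paths."""
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev + b["z"])    # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev + b["r"])    # reset gate
    h_cand = np.tanh(W["h"] @ x + U["h"] @ (r * h_prev) + b["h"])
    # z near 0 keeps the old state (long-term memory),
    # z near 1 replaces it with the new candidate state
    return (1.0 - z) * h_prev + z * h_cand
```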

4. A Hybrid Approach

Thanks to the success of deep learning, it is now popular to throw deep neural networks at the entire problem. Such methods are called end-to-end approaches: everything is done by neurons. End-to-end approaches have been applied to speech recognition and speech synthesis. On the one hand, these systems demonstrate the power of deep neural networks; on the other hand, they can be suboptimal and wasteful of resources. For example, some noise suppression methods use thousands of neurons and tens of millions of weights per layer just for noise suppression. The drawback is not only the computational cost of running such a deep network, but also the size of the model itself: a thousand lines of code plus tens of megabytes of neuron weights.

Here we propose a different approach: keep the basic signal processing pipeline (rather than having a neural network try to emulate it), and let the neural network handle the tricky part that otherwise requires endless tuning alongside the signal processing (i.e. parameter tuning). Another difference from existing deep-learning noise suppression methods is that the goal is real-time operation: unlike speech recognition, where latency requirements are not as strict, we cannot accept long model inference times.

To avoid a very large number of outputs (which would require a large number of neurons), we decided not to work directly on samples or on the spectrum. Instead we use frequency bands, laid out on a scale that matches the way we perceive sound. In total we use 22 bands instead of the 480 (complex) spectral values.

Terms:

Bark scale: a psychoacoustic frequency scale that approximates the frequency resolution of human hearing

Comparing the Opus band layout with the actual Bark scale: for RNNoise, we use a band layout similar to that of Opus. Since we overlap the bands, the boundaries between the Opus bands become the centres of the overlapping RNNoise bands. At higher frequencies the bands are wider, because the ear's frequency resolution is poorer there; at lower frequencies the bands are narrower, though not as narrow as the Bark scale would give, because then we would not have enough data to make good estimates of the noise.

Rather than reconstructing the signal directly from these 22 bands, we compute a gain for each band and apply it to that band. You can think of this as a 22-band equalizer that rapidly changes the level of each band, with the goal of attenuating the noise while letting the speech through.
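A minimal sketch of what "applying band gains" could look like; the band centres and the linear interpolation between them are illustrative assumptions rather than the exact Opus/RNNoise band layout:

```python
import numpy as np

def apply_band_gains(spec, band_gains, band_centres):
    """Scale an FFT spectrum like a 22-band equalizer.
    band_centres: FFT-bin index of each band centre (same length as band_gains).
    Gains are interpolated between centres so neighbouring bins change smoothly."""
    bins = np.arange(len(spec))
    gain_per_bin = np.interp(bins, band_centres, band_gains)
    return spec * gain_per_bin
```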

Calculating the gain for each frequency band has several advantages:

(1) First, the speech is described per band, and the number of bands is small, which keeps the model simple

(2) Second, it avoids so-called "musical noise" artifacts, which occur when a single tone is let through while its neighbouring frequencies are suppressed. Musical noise is very common in noise suppression and is very annoying. With a wide band, the band is either passed or attenuated as a whole, never carved into isolated tones.

(3) Third, the model is easier to optimize: the gains are bounded between 0 and 1, so they can be computed directly with a sigmoid activation function, which can never add noise.

Terms:

ReLU (rectified linear unit): an activation function commonly used in artificial neural networks

MFCC: Mel-frequency cepstral coefficients

For the output, a rectified linear activation could equally be used to compute an attenuation between 0 and infinity (in dB). During training, to better optimize the gains, the loss function is the mean squared error (MSE) applied to the gain raised to the power α. We found that α = 0.5 produces the best results perceptually. As α tends to 0, this amounts to minimizing the log-spectral distance, which is problematic because the optimal gain may be very close to zero.
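As a small sketch, the loss described above (MSE on the gains raised to the power α, with α = 0.5) could be written as:

```python
import numpy as np

def gain_loss(g_true, g_pred, alpha=0.5):
    """MSE between gains raised to the power alpha.
    alpha = 0.5 was found to work best perceptually; as alpha -> 0 this
    approaches a log-spectral distance, which is problematic when the
    optimal gain is close to zero."""
    return np.mean((g_true ** alpha - g_pred ** alpha) ** 2)
```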

The main disadvantage of the lower resolution we get from using bands is that the resolution is not fine enough to suppress noise between pitch harmonics. Fortunately, this is not so important, and there is even a simple trick for it (see the pitch filtering discussion below).

Since the output is based on 22 bands, it makes little sense to give the input higher frequency resolution, so we use the same 22 bands to feed spectral information to the neural network. Because audio has a huge dynamic range, it is much better to compute the logarithm of the energy than to feed the energy in directly. The resulting data is a Bark-scale cepstrum, which is closely related to the Mel-frequency cepstral coefficients (MFCCs) widely used in speech recognition.
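A rough sketch of these input features (log of the per-band energies followed by a DCT, analogous to MFCC computation); the band masks and the small offset added before the log are assumptions for illustration:

```python
import numpy as np
from scipy.fftpack import dct

def band_cepstrum(power_spectrum, band_masks):
    """Cepstral-like coefficients from per-band energies (22 bands here)."""
    band_energy = np.array([np.sum(power_spectrum * m) for m in band_masks])
    log_energy = np.log10(band_energy + 1e-10)    # log tames the dynamic range
    return dct(log_energy, type=2, norm='ortho')  # DCT -> cepstral coefficients
```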

In addition to the cepstral coefficients, the features include:

1) The first and second derivatives of the first 6 cepstral coefficients across time

2) The pitch period (1 / pitch frequency)

3) The pitch gain (voicing strength) in 6 bands

4) A special non-stationarity value, useful for detecting speech (not used in this example)

This provides a total of 42 input features to the neural network.
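Counting these out, a sketch of how such a 42-dimensional feature vector could be assembled (the ordering and helper names here are hypothetical, not the layout used in RNNoise's C code):

```python
import numpy as np

def assemble_features(ceps, d_ceps, dd_ceps, pitch_period, pitch_gains, nonstat):
    """22 cepstral coeffs + 6 first derivatives + 6 second derivatives
    + 1 pitch period + 6 pitch gains + 1 non-stationarity value = 42."""
    feats = np.concatenate([
        ceps,            # 22 Bark-scale cepstral coefficients
        d_ceps[:6],      # first derivative of the first 6 coefficients
        dd_ceps[:6],     # second derivative of the first 6 coefficients
        [pitch_period],  # 1 value
        pitch_gains,     # 6 per-band pitch gains
        [nonstat],       # 1 value (not used in this example)
    ])
    assert feats.shape == (42,)
    return feats
```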

5. Deep Neural Network Layer Architecture

The deep architecture we designed is inspired by the traditional approach to noise suppression. Most of the work is done by three GRU layers. The figure below shows the layers used to compute the band gains and how the architecture maps to the steps of traditional noise suppression.

Each box represents a layer of neurons, and the numbers in parentheses indicate the number of units. Dense layers are fully connected and non-recurrent. One output of the network is the set of gains to apply at the different frequency bands; the other output is voice activity detection, which is not part of noise suppression itself but is a useful by-product of the network.
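For illustration, a simplified Keras sketch of this kind of layout is shown below. The layer sizes and the way the layers are wired together are simplifications and assumptions here; the actual architecture is defined in the training script in the RNNoise repository:

```python
from keras.models import Model
from keras.layers import Input, Dense, GRU, concatenate

# 42 input features per frame, a small dense layer, three stacked GRU layers,
# one output head for the 22 band gains and one for voice activity detection.
frames = Input(shape=(None, 42))                 # sequence of feature frames
x = Dense(24, activation='tanh')(frames)
vad_gru = GRU(24, return_sequences=True)(x)
vad = Dense(1, activation='sigmoid', name='vad')(vad_gru)
noise_gru = GRU(48, return_sequences=True)(concatenate([x, vad_gru]))
denoise_gru = GRU(96, return_sequences=True)(concatenate([x, noise_gru]))
gains = Dense(22, activation='sigmoid', name='gains')(denoise_gru)

model = Model(inputs=frames, outputs=[gains, vad])
model.compile(optimizer='adam',
              loss={'gains': 'mse', 'vad': 'binary_crossentropy'})
```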

6. About the dataset

Deep neural networks can also be downright stupid at times. They are very good at what they know, but they can make truly egregious mistakes on inputs that are too far from what they have seen. A neural network is like a lazy student: it will exploit any loophole in the training data to avoid learning the harder thing, and then the training will not give the expected result. This is why the training data is so critical.

A typical failure example: a long time ago, some army researchers tried to train a neural network to recognize tanks camouflaged in trees. They took photos of trees with and without tanks, then trained a neural network to tell them apart. The results were far from what was expected. The reason was that the photos with tanks had been taken on cloudy days while those without had been taken on sunny days, so what the network really learned was how to distinguish a cloudy day from a sunny one!

For noise suppression, we cannot simply collect input/output pairs for supervised learning, because we rarely get clean speech and noisy speech of the same recording at the same time. Instead the data has to be constructed artificially by adding noise to clean speech. The tricky part is gathering a wide variety of noise data and combining it with the speech. We also have to make sure we cover all kinds of recording conditions: for example, early versions trained only on full-band audio (0-20 kHz) did not work on audio low-pass filtered at 8 kHz.

Unlike common practice in speech recognition, we do not apply cepstral mean normalization to the features, and we keep the first cepstral coefficient, which represents the energy. Instead, the data is kept at all the levels found in the real world, and random filters are applied to the audio so the system becomes robust to a wide range of microphone frequency responses.
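A minimal sketch of this kind of data construction (with `rng` being, say, `np.random.default_rng()`); the SNR range, the random first-order filter standing in for a microphone response, and the helper name are illustrative assumptions:

```python
import numpy as np
from scipy.signal import lfilter

def make_training_pair(clean, noise, rng):
    """Mix clean speech with noise at a random level and apply a random
    filter, simulating varied recording conditions (sketch only)."""
    snr_db = rng.uniform(-5, 25)           # random signal-to-noise ratio
    gain = 10 ** (-snr_db / 20) * np.std(clean) / (np.std(noise) + 1e-9)
    noisy = clean + gain * noise[:len(clean)]
    b = [1.0, rng.uniform(-0.3, 0.3)]      # gentle random spectral tilt
    return lfilter(b, [1.0], noisy), clean
```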

Since the frequency resolution of our bands is too coarse to filter out noise between pitch harmonics, we do that with basic signal processing; this is another part of the hybrid approach. When you have several measurements of the same variable, the simplest way to improve precision (reduce noise) is to average them. Obviously, just averaging adjacent audio samples is not what we want, since that would amount to low-pass filtering. But when the signal is periodic (as voiced speech is), we can average samples offset by the pitch period. This gives a comb filter that lets the harmonics of the pitch through while attenuating the frequencies between them, which is where the noise lives. To avoid distorting the signal, the comb filter is applied independently in each band, and its strength depends on the pitch correlation and on the band gain computed by the neural network.

Currently an FIR filter is used for the pitch filtering, but an IIR filter could also be used, which would give stronger noise attenuation at the risk of higher distortion if made too aggressive.
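A minimal sketch of the feed-forward (FIR) comb idea: average each sample with the sample one pitch period earlier, with a strength control. The per-band application and the way RNNoise actually derives the filter strength are omitted here:

```python
import numpy as np

def pitch_comb_filter(frame, past, pitch_period, strength):
    """Average each sample with the sample one pitch period earlier.
    'past' must hold at least pitch_period previous samples;
    strength in [0, 1] controls how much filtering is applied."""
    buf = np.concatenate([past, frame])
    start = len(past) - pitch_period
    delayed = buf[start : start + len(frame)]
    # harmonics of the pitch add coherently, noise in between is attenuated
    return (frame + strength * delayed) / (1.0 + strength)
```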

7. From Python to C

All of the design and training of the neural network was done in Python using the Keras deep learning library. Since Python is usually not the language of choice for real-time systems, the runtime code had to be implemented in C. Fortunately, running a neural network is much simpler than training one, so all we needed was to implement the forward pass through the GRU layers and output the 22 band gains. To keep the weights within a reasonable range, they were constrained to +/- 0.5 during training, which makes it easy to store them as 8-bit values. The resulting model fits in just 85 kB (instead of the 340 kB needed to store the weights as 32-bit floats).
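A sketch of that quantization step, rounding weights constrained to ±0.5 into signed 8-bit integers; the exact scale factor used by the actual weight exporter is an assumption here:

```python
import numpy as np

def quantize_weights(w, scale=256):
    """Map float weights in [-0.5, 0.5] to signed 8-bit integers."""
    return np.clip(np.round(w * scale), -128, 127).astype(np.int8)

def dequantize_weights(q, scale=256):
    """Recover approximate float weights for the C forward pass."""
    return q.astype(np.float32) / scale
```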
The C code is available under a BSD license. It runs about 60x faster than real time on an x86 CPU, and about 7x faster than real time on a Raspberry Pi 3. With good vectorization (SSE/AVX), it should be possible to make it about 4x faster than it currently is.

8. Additional resources

(1) Code: Xiph.Org/rnnoise GitLab

(2) Paper: A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement (arxiv.org)

(3) Example project address: jmvalin.dreamwidth.org
