Voice wake-up tool: WeKWS

1 Introduction

This article is a translation and summary of the October 2022 paper "WEKWS: A PRODUCTION FIRST SMALL-FOOTPRINT END-TO-END KEYWORD SPOTTING TOOLKIT". The authors are the team of Professor Zhang Xiaolei from the School of Navigation of Northwestern Polytechnical University, the team of Professor Xie Lei from the Audio, Speech and Language Processing group of Northwestern Polytechnical University, and the WeNet open-source community.

WeKWS is a production-ready, easy-to-build, easy-to-apply end-to-end (E2E) keyword spotting toolkit. Keyword spotting (KWS) refers to recognizing predefined keywords in a continuous speech stream; wake-up word (WuW) detection is a specific kind of KWS.

Open source address: GitHub - wenet-e2e/wekws: Production First and Production Ready End-to-End Keyword Spotting Toolkit

Voice wake-up on devices such as Internet of Things (IoT) hardware requires a system with a small memory footprint, low computing cost, low latency, and high accuracy. Existing tools such as Kaldi, Fairseq, and Honk are too complex for this purpose. To this end, we built WeKWS, which has the following characteristics:

  • Alignment-free: it does not need automatic speech recognition (ASR) or speech activity detection (SAD) to provide keyword alignments or keyword end timestamps, which simplifies KWS training.
  • Production ready: it bridges the gap between research and production. Models can be exported with Torch Just In Time (JIT) compilation and converted to the Open Neural Network Exchange (ONNX) format, making them easy to adopt in multiple deployment environments. (TorchScript and ONNX are the two commonly used inference acceleration paths for PyTorch models.)
  • Lightweight: the toolkit depends only on PyTorch.
  • High accuracy.

2 WeKWS

2.1 System design

The figure below shows three layers.

2.1.1 Layer 1: Data preparation and on-the-fly feature extraction and augmentation

The data preparation module prepares the speech list and utterance-level keyword labels for model training. WeKWS uses on-the-fly feature extraction: each utterance is first resampled to a target sampling rate, followed by speed perturbation and Mel-filter bank feature extraction. Feature-level SpecAugment is applied to the input as data augmentation. Compared with an offline pipeline, this on-the-fly approach saves disk space, enriches the diversity of training samples, and improves the robustness of the model.
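
Below is a minimal sketch of such an on-the-fly pipeline built with torchaudio; the function names, parameters, and augmentation policy here are illustrative assumptions, not the actual WeKWS internals.

```python
import torch
import torchaudio
import torchaudio.compliance.kaldi as kaldi

def load_and_extract(path, target_sr=16000, speed=1.0):
    """Load one utterance, resample, apply speed perturbation, compute Fbank."""
    wav, sr = torchaudio.load(path)                 # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)             # mix down to mono
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    if speed != 1.0:
        # Speed perturbation approximated by resampling and reinterpreting the rate.
        wav = torchaudio.functional.resample(wav, target_sr, int(target_sr / speed))
    # 40-dim Mel-filter bank features, 25 ms window, 10 ms shift (as in the paper).
    return kaldi.fbank(wav, num_mel_bins=40, frame_length=25.0,
                       frame_shift=10.0, sample_frequency=target_sr)

def spec_augment(feats, num_t_masks=2, num_f_masks=2, max_t=50, max_f=10):
    """Feature-level SpecAugment: random time and frequency masks set to zero."""
    feats = feats.clone()
    num_frames, num_bins = feats.shape
    for _ in range(num_t_masks):
        start = torch.randint(0, max(1, num_frames - max_t), (1,)).item()
        width = torch.randint(1, max_t + 1, (1,)).item()
        feats[start:start + width, :] = 0.0
    for _ in range(num_f_masks):
        start = torch.randint(0, max(1, num_bins - max_f), (1,)).item()
        width = torch.randint(1, max_f + 1, (1,)).item()
        feats[:, start:start + width] = 0.0
    return feats
```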

2.1.2 Layer 2: Model training and testing

Model training supports a variety of popular KWS backbone networks and a refined max-pooling objective function. The backbone can be an RNN, a temporal convolutional network (TCN), a multi-scale depthwise temporal convolution (MDTC) network, and so on.

2.1.3 Layer 3: Model export and deployment

The trained model supports TorchScript and ONNX export, so it can easily be deployed on different platforms. Three mainstream platforms are currently supported: x86, Android, and Raspberry Pi. Both float32 and quantized int8 models are supported; the quantized int8 model improves inference speed on embedded devices such as ARM Android phones and the Raspberry Pi.
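
The following sketch shows a typical PyTorch export flow of this kind; the placeholder model, file names, and input shape are illustrative assumptions rather than the WeKWS export scripts.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for a trained KWS model (illustrative only);
# it maps (batch, frames, 40) Fbank features to a per-frame keyword probability.
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
model.eval()
dummy = torch.randn(1, 100, 40)

# 1) TorchScript (JIT) export, loadable from the x86 / Android / Raspberry Pi runtimes.
torch.jit.script(model).save("kws.zip")

# 2) ONNX export for runtimes such as ONNX Runtime.
torch.onnx.export(model, dummy, "kws.onnx",
                  input_names=["feats"], output_names=["probs"],
                  dynamic_axes={"feats": {1: "num_frames"}})

# 3) Post-training dynamic int8 quantization, smaller and faster on ARM CPUs.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)
torch.jit.script(quantized).save("kws_int8.zip")
```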

2.2 Model structure

As shown in the figure above, the model consists of four parts: a global cepstral mean and variance normalization (CMVN) layer, a linear layer that converts the input feature dimension to the dimension required by the backbone, the backbone network, and multiple binary classifiers. Each binary classifier uses a sigmoid to predict the posterior probability of one keyword, so multiple binary classifiers support multiple keywords.
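
A compact PyTorch sketch of this four-part structure is shown below; the class and layer choices (e.g. a GRU as a stand-in backbone) are assumptions for illustration, not the WeKWS code.

```python
import torch
import torch.nn as nn

class KWSModel(nn.Module):
    """Global CMVN -> linear projection -> backbone -> per-keyword sigmoid heads."""

    def __init__(self, feat_dim, hidden_dim, num_keywords, cmvn_mean, cmvn_istd):
        super().__init__()
        # Global CMVN statistics precomputed over the training set.
        self.register_buffer("cmvn_mean", cmvn_mean)      # (feat_dim,)
        self.register_buffer("cmvn_istd", cmvn_istd)      # (feat_dim,) inverse std
        self.proj = nn.Linear(feat_dim, hidden_dim)       # Fbank dim -> backbone dim
        self.backbone = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_keywords)

    def forward(self, feats):                             # (batch, frames, feat_dim)
        x = (feats - self.cmvn_mean) * self.cmvn_istd     # global CMVN
        x, _ = self.backbone(self.proj(x))
        # Independent sigmoid per keyword: per-frame posterior for each wake word.
        return torch.sigmoid(self.classifier(x))          # (batch, frames, num_keywords)
```

Because each keyword has its own sigmoid output, adding another wake word only adds one more binary classifier instead of changing the rest of the network.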

The WeKWS backbone currently supports the following three families: 1) RNN, or its improved variant LSTM; 2) TCN, or its lightweight depthwise separable variant DS-TCN; 3) MDTC.

In all convolution-based neural networks, we use causal convolutions.
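
A small sketch of a causal, depthwise separable convolution block (the idea behind DS-TCN/MDTC layers) is shown below; it is an illustrative implementation, not the WeKWS layer. Padding is applied only on the left, so frame t never depends on future frames, which keeps streaming latency low.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDSConv1d(nn.Module):
    """Causal depthwise separable 1-D convolution over (batch, channels, frames)."""

    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation       # pad only past frames
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   dilation=dilation, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        x = F.pad(x, (self.left_pad, 0))                   # left padding => causal
        return self.pointwise(self.depthwise(x))
```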

2.3 Refined max-pooling KWS objective function
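
The loss is computed per utterance. A reconstruction of the refined max-pooling objective in the notation described below (a sketch; the exact form should be checked against the paper):

$$
\mathcal{L}_i =
\begin{cases}
-\log \max\limits_{m \le t \le N_i} p_i^t, & \text{if utterance } i \text{ contains the keyword,} \\
-\log \Big(1 - \max\limits_{1 \le t \le N_i} p_i^t\Big), & \text{otherwise,}
\end{cases}
$$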

where $p_i^t$ is the predicted posterior probability of the keyword at frame $t$ of the $i$-th utterance, $m$ is the minimum duration of the keyword in frames (estimated statistically on the training set), and $N_i$ is the number of frames of the $i$-th utterance.

By using the max-pooling loss, the model automatically learns the end timestamp of the keyword, so training does not rely on keyword alignments or keyword end timestamps. Specifically, for positive samples, the max-pooling loss only optimizes the frame with the highest posterior probability and ignores the other frames. For negative samples, it minimizes the highest posterior probability, which implicitly keeps the posteriors of all frames of negative samples low.
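
A minimal PyTorch sketch of this loss for a single keyword, assuming per-frame posteriors from the sigmoid head and a 0/1 label per utterance (the names and the duration guard are illustrative):

```python
import torch

def max_pooling_loss(probs, is_keyword, min_dur, eps=1e-8):
    """probs: (batch, frames) keyword posteriors; is_keyword: (batch,) 1.0/0.0;
    min_dur: minimum keyword duration m in frames (estimated on the training set)."""
    min_dur = min(min_dur, probs.size(1) - 1)              # guard for short utterances

    # Positive utterances: push up the best-scoring frame, but only among frames
    # at or after the minimum duration, so the keyword cannot "end" too early.
    pos_loss = -torch.log(probs[:, min_dur:].max(dim=1).values + eps)

    # Negative utterances: push down the highest-scoring frame; minimizing the max
    # implicitly keeps the posteriors of all frames low.
    neg_loss = -torch.log(1.0 - probs.max(dim=1).values + eps)

    loss = is_keyword * pos_loss + (1.0 - is_keyword) * neg_loss
    return loss.mean()
```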

3 Experiments

3.1 Experimental setup

We evaluate WeKWS on the Mobvoi (SLR87), Snips, and Google Speech Commands (GSC) datasets.

Mobvoi is a Mandarin wake-up corpus. It has two keywords, each with about 36K keyword utterances, plus about 183K non-keyword utterances.

Snips is a crowdsourced wake-word corpus with the keyword "Hey Snips", containing about 11K keyword utterances and 86.5K non-keyword utterances.

Google Speech Commands includes 64,721 one-second recordings of 30 words spoken by 1,881 different speakers.

We use 40-dimensional Mel-filter bank (Fbank) features as model input, with a 25 ms window and a 10 ms window shift.

We use the Adam optimizer with a batch size of 128 and train for 80 epochs.
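
For completeness, an illustrative training loop tying together the model and loss sketches above; train_loader is a hypothetical DataLoader yielding (feats, is_keyword) batches of size 128, and the learning rate and minimum duration are assumed values.

```python
import torch

model = KWSModel(feat_dim=40, hidden_dim=64, num_keywords=1,
                 cmvn_mean=torch.zeros(40), cmvn_istd=torch.ones(40))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # lr is illustrative

for epoch in range(80):                         # 80 epochs, as in the paper
    for feats, is_keyword in train_loader:      # hypothetical DataLoader, batch size 128
        probs = model(feats)[:, :, 0]           # (batch, frames) posteriors for keyword 0
        loss = max_pooling_loss(probs, is_keyword, min_dur=20)  # m=20 is illustrative
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```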

3.2 Experimental results

The table below compares WeKWS with LF-MMI-based methods (which rely on graph-based decoding algorithms). The false rejection rate (FRR) is the percentage of keyword utterances that are wrongly rejected. WeKWS achieves a lower FRR, i.e., better performance.

Tables 2 and 3 below compare WeKWS with the other two end-to-end methods.

3.3 Ablation experiment

The ablation study shows that the refined max-pooling objective performs better than the other objective functions compared.

Among the supported backbones, MDTC performs best.


Origin blog.csdn.net/zephyr_wang/article/details/130439261