SLT2021: LIGHTSPEECH: LIGHTWEIGHT NON-AUTOREGRESSIVE MULTI-SPEAKER TEXT-TO-SPEECH

0. Title

LIGHTSPEECH: LIGHTWEIGHT NON-AUTOREGRESSIVE MULTI-SPEAKER TEXT-TO-SPEECH


1. Summary

With the development of deep learning, end-to-end neural text-to-speech (TTS) systems have made significant progress toward high-quality speech synthesis. However, most of these systems are attention-based autoregressive models, which leads to slow synthesis and a large number of model parameters. In this article, we propose a new lightweight non-autoregressive multi-speaker speech synthesis system called LightSpeech, which uses a lightweight feedforward neural network to accelerate synthesis and reduce the number of parameters. By embedding multi-speaker vectors, LightSpeech can synthesize speech from multiple speakers very quickly. Experiments on the LibriTTS dataset show that, compared with FastSpeech, our smallest LightSpeech model achieves a 9.27x speedup in mel-spectrogram generation on CPU, while the model size and number of parameters are compressed by 37.06x and 37.36x, respectively.

Keywords: End-to-end, multi-speaker speech synthesis, non-autoregressive, lightweight neural network

2. Introduction

In recent years, end-to-end text-to-speech (TTS) systems have surpassed traditional multi-stage, hand-engineered pipelines, simplifying the overall process and providing high-quality synthesized speech. Compared with traditional statistical parametric speech synthesis [1-5], end-to-end TTS [6-11] learns the text-to-speech mapping directly with a purely neural network, without complex text front-end processing, various linguistic feature extraction steps, or extensive domain expertise. However, current mainstream end-to-end TTS systems mostly use attention mechanisms to implicitly learn the alignment between text and speech, which incurs a large computational cost. At the same time, these systems generate speech autoregressively, requiring the speech frame produced at the previous time step as input to the next time step, so they suffer from low training efficiency and cannot synthesize speech in parallel.

 

To accelerate end-to-end speech synthesis, researchers have proposed a series of alternatives to the attention mechanism for learning text-to-speech alignment and to autoregressive generation. DurIAN [12] uses the forced alignment widely used in speech recognition to obtain the alignment between text and speech, so an attention mechanism is no longer required; however, DurIAN still generates speech autoregressively. A non-autoregressive architecture was first adopted in ParaNet [13] to generate speech in parallel, but it still requires an attention mechanism to learn the alignment between text and speech. Neither of these two TTS systems completely abandons the attention mechanism (used to obtain the alignment) and autoregressive generation, which limits the achievable speedup. Recently, a system called FastSpeech [14] extracts phoneme duration sequences from the attention alignment matrix of a pre-trained autoregressive Transformer TTS [8] model to train a duration predictor, so no attention mechanism is needed to learn the text-to-speech alignment. In addition, FastSpeech uses a feedforward Transformer structure to synthesize speech in parallel, which greatly speeds up speech synthesis.

 

Although FastSpeech's feedforward Transformer structure speeds up speech synthesis, the computational complexity of the self-attention layer is quadratic in the length of the input sequence, which requires a large amount of memory. In this article, we propose a new lightweight non-autoregressive speech synthesis system called LightSpeech, which reduces both computational complexity and model parameters. LightSpeech uses convolutional neural networks to achieve this goal. However, the parameters and computational complexity of a conventional convolutional structure are still relatively large. Compared with conventional convolution, depthwise convolution [15] reduces the number of parameters by convolving each channel independently, making the computational complexity linear. To further reduce the parameters of LightSpeech, lightweight convolution [16] and dynamic convolution [16] are adopted. Built on depthwise convolution, they reduce the number of parameters by sharing convolution kernel weights between groups. Dynamic convolution is a variant of lightweight convolution that dynamically predicts the convolution weights through an additional linear layer at each time step. Unlike self-attention, these convolutional structures only attend to a limited context, which may reduce model performance. In view of this, we use convolution and self-attention to extract local and global features respectively, and merge them to improve the model's representation ability.
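To make the parameter savings of depthwise convolution concrete, here is a minimal PyTorch sketch (not from the paper; the channel width and kernel size are illustrative) comparing a standard 1-D convolution with its depthwise counterpart:

```python
import torch
import torch.nn as nn

d, k = 256, 3  # illustrative channel width and kernel size, not the paper's exact values

# Standard 1-D convolution: every output channel mixes all input channels.
standard = nn.Conv1d(d, d, kernel_size=k, padding=k // 2)

# Depthwise convolution: groups=d convolves each channel independently,
# so the weight count drops from d*d*k to d*k (plus biases).
depthwise = nn.Conv1d(d, d, kernel_size=k, padding=k // 2, groups=d)

def num_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(8, d, 100)                   # (batch, channels, time)
assert standard(x).shape == depthwise(x).shape

print("standard :", num_params(standard))    # 196,864
print("depthwise:", num_params(depthwise))   # 1,024
```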

 

In recent years, multi-speaker speech synthesis has been a hot research topic [17-22]. However, these architectures are still designed on top of the autoregressive Tacotron 2 [7] model, so their synthesis speed is slow. Our LightSpeech is designed to quickly synthesize multi-speaker speech in a non-autoregressive way. For speakers unseen in the training dataset, the end-to-end TTS model can adapt to the target speaker's voice in a zero-shot manner using only the speaker embedding, without fine-tuning the model. Therefore, we introduce the x-vector [23] (an advanced speaker representation) into LightSpeech for multi-speaker TTS. In addition, the pitch of different speakers' voices is used as additional prosodic information to synthesize expressive speech.

 

Generally, neural network models with better performance have more parameters, which leads to huge computation and memory consumption. Therefore, to reduce the number of parameters and enable the model to be deployed on devices with limited resources (such as embedded systems), designing lightweight neural networks is essential. In this article, we use knowledge distillation to compress LightSpeech while maintaining good performance. To our knowledge, this is the first application of lightweight neural networks to speech synthesis models. The lightweight architecture allows LightSpeech to be easily deployed on mobile devices, so speech synthesis is no longer limited to the cloud.

3. Other notes (easy to understand)

First, an expressive multi-speaker teacher model is trained and used to guide LightSpeech, which generates the target mel-spectrogram in parallel with compressed model parameters. We design a lightweight feedforward structure that does not need an attention mechanism to learn the text-to-speech alignment.

In addition, to reduce the phoneme or word repetition, skipping, and mispronunciation caused by unstable and incorrect alignment, an end-to-end automatic speech recognition (ASR) model is introduced as an auxiliary constraint to reconstruct the input text from the generated mel-spectrogram, yielding more precise alignment.
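The post does not give the exact formulation of this auxiliary constraint. The sketch below is one plausible reading, assuming a hypothetical ASR head (the names AuxASR and asr_constraint_loss are illustrative) that maps generated mel frames to token logits and is trained with a CTC loss to recover the input text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxASR(nn.Module):
    """Hypothetical auxiliary ASR head: mel frames -> per-frame token logits."""
    def __init__(self, n_mels=80, hidden=256, vocab_size=100):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, vocab_size)

    def forward(self, mel):                  # mel: (batch, frames, n_mels)
        h, _ = self.rnn(mel)
        return self.proj(h)                  # (batch, frames, vocab)

def asr_constraint_loss(asr, mel_pred, text_ids, mel_lens, text_lens, blank=0):
    """CTC loss asking the ASR head to recover the input text from generated mels."""
    logits = asr(mel_pred)                                      # (B, T, V)
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)   # (T, B, V) for ctc_loss
    return F.ctc_loss(log_probs, text_ids, mel_lens, text_lens, blank=blank)
```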

 

The x-vector is an advanced speaker representation that extracts speaker embeddings through a Time Delay Neural Network (TDNN) [25]; different speaker embeddings represent different speakers. We introduce the x-vector into our teacher model to achieve multi-speaker speech synthesis. Specifically, to eliminate the text information carried by the x-vector, we train the TDNN model with a text-independent training strategy. We then concatenate the x-vector with each encoder output as an additional condition to control the decoder to generate speech for different speakers.
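A minimal sketch of this conditioning step: the x-vector is broadcast along the time axis and concatenated with every encoder output frame. The dimensions and the projection back to the model width are assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

enc_dim, xvec_dim = 256, 512     # illustrative dimensions

# Fold the concatenation back to the model width (an assumption for a dimension-consistent sketch).
proj = nn.Linear(enc_dim + xvec_dim, enc_dim)

def condition_on_speaker(encoder_out, x_vector):
    # encoder_out: (batch, text_len, enc_dim); x_vector: (batch, xvec_dim)
    expanded = x_vector.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
    return proj(torch.cat([encoder_out, expanded], dim=-1))

enc = torch.randn(4, 37, enc_dim)
xv = torch.randn(4, xvec_dim)
print(condition_on_speaker(enc, xv).shape)   # torch.Size([4, 37, 256])
```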

 

In fact, the voices of different speakers carry different prosodic information. A fixed-length vector (such as an x-vector) is not sufficient to model speaker and prosody information simultaneously. To further decompose speaker and prosody information, the x-vector is used only as a condition that distinguishes different speakers, without controlling prosody. Inspired by Mellotron [26], we extract the pitch of different voices through a pitch pre-net composed of one-dimensional convolutional layers to obtain high-level pitch features, and then combine the pitch features with each decoding time step before passing them to the PostNet to control prosody.
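A rough sketch of such a pitch pre-net, assuming a frame-aligned F0 contour as input and an additive combination with the decoder states (the layer sizes and the additive merge are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class PitchPreNet(nn.Module):
    """1-D convolutional stack mapping a frame-level pitch contour to pitch features."""
    def __init__(self, hidden=256, kernel=3, n_layers=2):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(n_layers):
            layers += [nn.Conv1d(in_ch, hidden, kernel, padding=kernel // 2), nn.ReLU()]
            in_ch = hidden
        self.net = nn.Sequential(*layers)

    def forward(self, f0):                  # f0: (batch, frames)
        h = self.net(f0.unsqueeze(1))       # (batch, hidden, frames)
        return h.transpose(1, 2)            # (batch, frames, hidden)

pitch_net = PitchPreNet()
decoder_out = torch.randn(4, 120, 256)      # (batch, frames, hidden) decoder states
f0 = torch.randn(4, 120)                    # frame-aligned pitch contour
conditioned = decoder_out + pitch_net(f0)   # combined per decoding step, then fed to the PostNet
```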

Lightweight convolution (LC) and dynamic convolution (DC) reduce the number of parameters through depthwise convolution and group convolution, allowing different feature channels to be computed in parallel and resulting in linear computational complexity.

4. Other notes (not easy to understand)

Here K is the kernel size of the lightweight convolution, d is the word-embedding dimension, and H is the number of groups. Lightweight convolution divides the embedded input sentence into different groups along the channel dimension (in the paper's figure, different colors indicate different groups). In group convolution the weights within a group are shared, and each group can be computed in parallel, which effectively reduces the model parameters. On top of group convolution, lightweight convolution introduces depthwise convolution within each group, so that different channels in the same group can be computed in parallel, further reducing parameters and computational complexity.
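A minimal sketch of the weight-sharing idea, assuming H groups of contiguous channels that all reuse the same softmax-normalized kernel (a simplified functional version, not the paper's or fairseq's exact implementation):

```python
import torch
import torch.nn.functional as F

def lightweight_conv(x, weight, H):
    """x: (batch, d, time); weight: (H, K), one kernel shared within each of H channel groups."""
    B, d, T = x.shape
    K = weight.size(1)
    # Softmax over the kernel dimension, as in lightweight convolution.
    w = F.softmax(weight, dim=-1)
    # Each group of d // H channels reuses the same kernel -> only H * K weights in total.
    w = w.repeat_interleave(d // H, dim=0).unsqueeze(1)   # (d, 1, K)
    return F.conv1d(x, w, padding=K // 2, groups=d)

d, H, K = 256, 8, 3            # illustrative sizes: 256 channels, 8 groups, kernel 3
weight = torch.randn(H, K)     # 24 parameters instead of d*K = 768 for plain depthwise conv
x = torch.randn(2, d, 100)
print(lightweight_conv(x, weight, H).shape)   # torch.Size([2, 256, 100])
```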

 

Dynamic convolution inherits all the advantages of lightweight convolution and, in addition, dynamically predicts the weights of each convolution kernel from the embedding of the current input word. This dynamic prediction of kernel weights is similar to the computation of attention scores in self-attention (SA), except that dynamic convolution only attends to a limited context.
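A rough sketch of dynamic convolution under the same grouping assumptions as above: a linear layer predicts H softmax-normalized kernels of size K from the current input at every time step, and channels within a head share that predicted kernel (again a simplified version, not the paper's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv(nn.Module):
    """Minimal dynamic convolution: a linear layer predicts a kernel per time step and head."""
    def __init__(self, d, H=8, K=3):
        super().__init__()
        self.d, self.H, self.K = d, H, K
        self.weight_proj = nn.Linear(d, H * K)   # predicts H kernels of size K from the current input

    def forward(self, x):                        # x: (batch, time, d)
        B, T, d = x.shape
        H, K, R = self.H, self.K, d // self.H
        w = F.softmax(self.weight_proj(x).view(B, T, H, K), dim=-1)   # (B, T, H, K)
        # Gather a sliding window of K neighbouring frames around each position.
        pad = K // 2
        xp = F.pad(x, (0, 0, pad, pad)).unfold(1, K, 1)               # (B, T, d, K)
        xp = xp.reshape(B, T, H, R, K)
        # Channels within a head share the predicted kernel at that time step.
        out = torch.einsum('bthrk,bthk->bthr', xp, w)
        return out.reshape(B, T, d)

dc = DynamicConv(d=256)
print(dc(torch.randn(2, 50, 256)).shape)   # torch.Size([2, 50, 256])
```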

The proposed LightSpeech inherits all the advantages of FastSpeech, including fast synthesis, controllability of the generated speech, and few word-skipping problems. The structure of LightSpeech is shown in Figure 3. The encoder and decoder of LightSpeech are built by stacking our lightweight feedforward blocks. Given the excellent performance of the Transformer, our lightweight feedforward block is modified from the Transformer encoder. To reduce computational complexity and model parameters, two feedforward architectures are designed. The first uses lightweight convolution or dynamic convolution to completely replace self-attention; but since convolution only attends to a limited context, this architecture may reduce the quality of the synthesized speech. The second first splits the input text feature channels into two parts to simplify the computation, then uses lightweight convolution or dynamic convolution to extract local context and self-attention to extract global context, and finally merges them. Because of the mask operation on the feature channels, the self-attention computation here is also reduced, and this architecture (DC-SA, LC-SA) can fuse local and global context information to improve the model's representation ability. To further reduce the parameters, we replace the feedforward network in the Transformer structure with depthwise convolution and group convolution in both of the above designs. In addition, a duration prediction network is introduced in LightSpeech to predict the duration of each character; it is used to expand the encoder output to the same length as the mel-spectrogram through a length regulator, without an attention mechanism to learn the text-to-speech alignment.
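As a rough illustration of the second (LC-SA / DC-SA) design, here is a minimal PyTorch sketch of the channel-split idea, with a plain depthwise convolution standing in for lightweight/dynamic convolution; the merge projection and residual connection are assumptions, not the paper's exact block:

```python
import torch
import torch.nn as nn

class SplitLocalGlobalBlock(nn.Module):
    """Sketch of the LC-SA idea: half the channels go through a convolution (local
    context), the other half through self-attention (global context), then merge."""
    def __init__(self, d=256, heads=2, kernel=3):
        super().__init__()
        assert d % 2 == 0
        self.half = d // 2
        # Plain depthwise conv stands in for the paper's lightweight/dynamic convolution.
        self.local = nn.Conv1d(self.half, self.half, kernel, padding=kernel // 2, groups=self.half)
        self.attn = nn.MultiheadAttention(self.half, heads, batch_first=True)
        self.merge = nn.Linear(d, d)

    def forward(self, x):                         # x: (batch, time, d)
        a, b = x[..., :self.half], x[..., self.half:]
        local = self.local(a.transpose(1, 2)).transpose(1, 2)
        global_, _ = self.attn(b, b, b)
        return x + self.merge(torch.cat([local, global_], dim=-1))   # residual connection

block = SplitLocalGlobalBlock()
print(block(torch.randn(2, 37, 256)).shape)   # torch.Size([2, 37, 256])
```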

Sequence-level knowledge distillation
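The post ends with this heading and no body. For context, sequence-level knowledge distillation in FastSpeech-style systems usually means training the student on mel-spectrograms generated by the teacher rather than (or in addition to) the ground-truth mels. The sketch below illustrates that idea under the assumption that teacher and student are callables mapping text to mel-spectrograms; the loss choice and training-loop details are assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def distillation_targets(teacher, text_batches):
    """Run the teacher once over the corpus and cache its mel outputs as student targets."""
    teacher.eval()
    targets = []
    with torch.no_grad():
        for text in text_batches:
            targets.append(teacher(text))      # teacher-generated mel-spectrogram
    return targets

def student_step(student, optimizer, text, teacher_mel):
    """One training step: the student regresses the teacher's mel-spectrogram."""
    optimizer.zero_grad()
    pred = student(text)
    loss = F.l1_loss(pred, teacher_mel)        # L1 vs. MSE is an assumption
    loss.backward()
    optimizer.step()
    return loss.item()
```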


Origin blog.csdn.net/u013625492/article/details/112973655