Motivation: VITS already produces very high-quality speech. This work aims to keep that quality while using a smaller model and achieving faster inference.
Contribution:
The most time-consuming part of VITS is the waveform generation module of the decoder (HiFi-GAN, abbreviated HFG). It is replaced with iSTFTNet, which performs the conversion from the frequency domain to the time domain;
Multi-band generation: each iSTFT module generates one sub-band signal, and the sub-band signals are summed to produce the full-band target waveform.
Multi-stream generation: a trainable synthesis filter is applied to the sub-band signals.
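
The sub-band combination described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the common filter-bank layout in which each sub-band is zero-insertion upsampled by the number of bands, passed through its own FIR synthesis filter, and then summed. The function names, the filter length, and the random placeholder filter values are all hypothetical. In the multi-band case the filters would be fixed; in the multi-stream case they would be learned parameters of the same shape.

```python
import numpy as np


def upsample_zero_insert(x, factor):
    """Insert (factor - 1) zeros between consecutive samples."""
    y = np.zeros(len(x) * factor)
    y[::factor] = x
    return y


def combine_sub_bands(sub_bands, synthesis_filters):
    """Combine N sub-band signals into one full-band waveform.

    sub_bands:         shape (N, T), each row sampled at 1/N of the target rate
    synthesis_filters: shape (N, K), one FIR filter per band (assumed layout)
    """
    sub_bands = np.asarray(sub_bands, dtype=float)
    synthesis_filters = np.asarray(synthesis_filters, dtype=float)
    n_bands = sub_bands.shape[0]
    full_band = np.zeros(sub_bands.shape[1] * n_bands)
    for band, fir in zip(sub_bands, synthesis_filters):
        upsampled = upsample_zero_insert(band, n_bands)
        # Filter each upsampled band, then accumulate into the full-band signal.
        full_band += np.convolve(upsampled, fir, mode="same")
    return full_band


# Hypothetical sizes; the filter values are random placeholders, not a real filter bank.
rng = np.random.default_rng(0)
n_bands, n_samples, n_taps = 4, 16, 9
sub_bands = rng.standard_normal((n_bands, n_samples))
placeholder_filters = rng.standard_normal((n_bands, n_taps))
waveform = combine_sub_bands(sub_bands, placeholder_filters)
```

With 4 bands of 16 samples each, the output has 64 samples, so each iSTFT module only has to produce a quarter of the final sample rate.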
Result:
Real-time factor (RTF) of 0.066 on an Intel Core i7 CPU, 4.1x faster than VITS.
Compared with distillation-based models (e.g. Nix-TTS, which uses a teacher-student setup to obtain a smaller model), the generation quality is better at the same model size, because an end-to-end structure loses less quality than distillation does.
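
To make the speed figure concrete, here is a small sketch of how a real-time factor is computed and what the two reported numbers imply. It assumes the usual definition (synthesis time divided by the duration of the generated audio). The baseline RTF below is only derived from 0.066 and 4.1x, not a separately reported measurement, and the function name is illustrative.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Numbers taken from the notes above.
rtf_proposed = 0.066
speedup_over_vits = 4.1

# Baseline RTF implied by the two numbers above (derived, not measured here).
rtf_vits_implied = rtf_proposed * speedup_over_vits

# Time needed to synthesize 10 seconds of audio at each RTF.
audio_seconds = 10.0
seconds_proposed = rtf_proposed * audio_seconds
seconds_vits_implied = rtf_vits_implied * audio_seconds
```

An RTF below 1 means synthesis runs faster than playback; at 0.066, ten seconds of audio take well under one second to generate.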
Method:
The VITS decoder uses the HFG (HiFi-GAN) structure. It upsamples the latent z up to the waveform sample rate through several convolutional upsampling layers, which costs a relatively large amount of computation;
Inspired by iSTFTNet, this process is replaced by an inverse short-time Fourier transform (iSTFT).
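
As a rough illustration of why the iSTFT is cheap, the sketch below reconstructs a waveform from per-frame magnitude and phase using an inverse FFT plus overlap-add. It is a plain NumPy sketch under assumed settings (a periodic Hann window and hypothetical `n_fft` and `hop` values), not the network from the paper. In the actual model a neural network would predict the magnitude and phase; here they are simply passed in as arrays.

```python
import numpy as np


def istft(magnitude, phase, n_fft, hop):
    """Inverse STFT by overlap-add.

    magnitude, phase: arrays of shape (n_fft // 2 + 1, n_frames)
    Returns a waveform of length n_fft + hop * (n_frames - 1).
    """
    spectrum = magnitude * np.exp(1j * phase)
    # One inverse FFT per frame turns each spectrum column into n_fft samples.
    frames = np.fft.irfft(spectrum, n=n_fft, axis=0)
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window (assumed choice)
    n_frames = frames.shape[1]
    length = n_fft + hop * (n_frames - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    for t in range(n_frames):
        start = t * hop
        signal[start:start + n_fft] += frames[:, t] * window
        norm[start:start + n_fft] += window ** 2
    # Divide by the summed squared window to undo the analysis/synthesis windowing.
    return signal / np.maximum(norm, 1e-8)


# Hypothetical sizes: 10 frames with hop 4 become 52 samples in one pass,
# with no learned upsampling layers involved in this final step.
n_fft, hop, n_frames = 16, 4, 10
magnitude = np.ones((n_fft // 2 + 1, n_frames))
phase = np.zeros((n_fft // 2 + 1, n_frames))
waveform = istft(magnitude, phase, n_fft, hop)
```

Each frame expands to `hop` new output samples through a fixed transform, so the upsampling that remains for the convolutional layers is correspondingly smaller.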