AI digital human: Chinese speech generation training based on the VITS model

1 Introduction to the VITS model

        VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is a highly expressive speech synthesis model that combines variational inference, normalizing flows, and adversarial training.

        The VITS model was proposed in June 2021 by researchers at Kakao Enterprise and KAIST. VITS connects the acoustic model and the vocoder in speech synthesis through latent variables instead of spectrograms. The latent variables are modeled stochastically, and a stochastic duration predictor is used to improve the diversity of the synthesized speech: given the same input text, it can synthesize speech with different tones and rhythms.
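To make the diversity point concrete, here is a minimal conceptual sketch of why sampling both the latent variables and the durations yields different prosody for the same text. The parameter names noise_scale / noise_scale_w mirror the knobs exposed by VITS-style inference code, but the tensors below are stand-ins, not the repo's actual API:

    import torch

    h = torch.randn(1, 192, 50)  # stand-in for the encoded input text (batch, channels, length)

    def synthesize(h, noise_scale=0.667, noise_scale_w=0.8):
        # 1) Stochastic duration predictor: log-durations are *sampled*, so the
        #    rhythm differs between runs even for identical text.
        logw = torch.randn(1, 1, h.size(2)) * noise_scale_w  # stand-in for the duration flow
        durations = torch.clamp(torch.exp(logw).round(), min=1)
        # 2) Latent prior: z is sampled around the text-conditioned mean, adding
        #    variation in tone on top of the variation in rhythm.
        z = h + torch.randn_like(h) * noise_scale
        return int(durations.sum().item()), z

    print(synthesize(h)[0])  # total frame count; differs from run to run
    print(synthesize(h)[0])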

        Paper address: https://arxiv.org/abs/2106.06103

2 VITS model structure

VITS mainly consists of three parts (a loss-level sketch follows this list):

  • A conditional variational autoencoder (VAE)
  • Alignment estimation derived from variational inference (monotonic alignment search, MAS)
  • Adversarial training
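Roughly, these three parts correspond to three training signals. The sketch below is simplified (the real model reconstructs mel-spectrograms, refines the prior with normalizing flows, and uses HiFi-GAN-style discriminators), but it shows how the losses combine:

    import torch
    import torch.nn.functional as F

    def vits_generator_loss(mel_hat, mel, m_q, logs_q, m_p, logs_p, disc_outputs):
        # 1) VAE reconstruction: L1 between generated and ground-truth mel-spectrograms
        loss_recon = F.l1_loss(mel_hat, mel)
        # 2) KL divergence between the audio posterior q and the text-conditioned prior p
        #    (closed form for diagonal Gaussians; logs_* are log standard deviations)
        kl = logs_p - logs_q - 0.5 + 0.5 * (torch.exp(2 * logs_q) + (m_q - m_p) ** 2) * torch.exp(-2 * logs_p)
        loss_kl = kl.mean()
        # 3) Adversarial (LSGAN-style): push discriminator outputs on fakes toward 1
        loss_adv = sum(torch.mean((1 - d) ** 2) for d in disc_outputs)
        return loss_recon + loss_kl + loss_adv

    # toy call with random tensors, just to show the interface
    mel = torch.randn(1, 80, 100); mel_hat = torch.randn_like(mel)
    stats = [torch.randn(1, 192, 100) for _ in range(4)]
    print(vits_generator_loss(mel_hat, mel, *stats, disc_outputs=[torch.rand(1, 1)]))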

VITS is a milestone toward fully end-to-end (E2E) TTS; its main breakthroughs are as follows:

(1) It is the first fully E2E model whose naturalness surpasses the two-stage SOTA architectures: a MOS of 4.43, only 0.03 below the ground-truth (GT) recording. The authors claimed it was the best-performing publicly reported system at the time.

(2) Building on image-generation research that introduced normalizing flows into VAEs to improve sample quality, it successfully applies a Flow-VAE to the fully E2E TTS task.

(3) Training is simple and fully E2E. There is no need to add extra features such as pitch and energy as in the FastSpeech family of models, nor, as in most two-stage architectures, to fine-tune the vocoder on the acoustic model's output to reach the best results.

(4) It abandons the predefined acoustic spectrogram as the feature linking the acoustic model and the vocoder, and instead successfully uses the latent representation learned by the VAE to connect the two modules end to end.

(5) Naturalness does not degrade in the multi-speaker setting, unlike other models, whose MOS scores tend to fall away from the GT recording.

3 Using the VITS model for Chinese speech synthesis training

3.1 GitHub project download:

git clone https://github.com/PlayVoice/vits_chinese

3.2 Set up the runtime environment:

For details on setting up the Anaconda environment, see: Anaconda installation and use

conda create -n vits python==3.9

conda activate vits

cd vits_chinese

pip install -r requirements.txt

cd monotonic_align

python setup.py build_ext --inplace
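To confirm the Cython extension compiled correctly, the following one-liner (run from the vits_chinese directory; monotonic_align is the repo's own alignment-search module) should print without an ImportError:

python -c "import monotonic_align; print('monotonic_align OK')"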

3.3 Dataset download:

Download the Biaobei male voice dataset (sampling rate: 22050 Hz) from the following addresses:

Biaobei male voice dataset (part 1)

Biaobei male voice dataset (part 2)

Annotated data of Biaobei male voice dataset

After the download completes, decompress the audio dataset into the "vits_chinese/data/waves" directory, and put the annotation data under the "vits_chinese/data" directory.
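A small Python check (run from the vits_chinese directory; the file names listed are simply whatever the dataset archives contain) can confirm the data landed where the training scripts expect it:

    import glob, os

    n = len(glob.glob("data/waves/*.wav"))
    print(f"found {n} wav files under data/waves")
    print("contents of data/:", sorted(os.listdir("data")))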

3.4 Pre-trained model download:

Prosodic Model Download: Prosodic Model

After the download completes, move it to the "vits_chinese/bert/" directory.

3.5 Data preprocessing:

Modify the configuration file: vi configs/bert_vits.json

    "max_wav_value": 32768.0,
    "sampling_rate": 22050,
    "filter_length": 1024,
python vits_prepare.py -c ./configs/bert_vits.json
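Before preprocessing, it is worth spot-checking that the audio actually matches the configured sampling_rate; a minimal check using only the Python standard library:

    import glob, wave

    for path in glob.glob("data/waves/*.wav")[:5]:  # spot-check the first few files
        with wave.open(path, "rb") as w:
            assert w.getframerate() == 22050, f"{path}: {w.getframerate()} Hz"
    print("sample-rate check passed (22050 Hz)")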

3.6 Start training

python train.py -c configs/bert_vits.json -m bert_vits

3.7 Inference after training

python vits_infer.py --config ./configs/bert_vits.json --model logs/bert_vits/G_700000.pth

Here, G_700000.pth is a trained model checkpoint; specify the checkpoint for inference according to your actual training progress.
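Checkpoints follow the upstream VITS convention (G_<step>.pth for the generator, D_<step>.pth for the discriminator, written into logs/bert_vits). A small snippet to inspect one, assuming vits_chinese keeps the upstream checkpoint dictionary layout:

    import torch

    ckpt = torch.load("logs/bert_vits/G_700000.pth", map_location="cpu")
    print("keys:", list(ckpt.keys()))  # upstream VITS saves: model, iteration, optimizer, learning_rate
    print("saved at iteration:", ckpt.get("iteration"))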

4 Display of training results

The speech generated after 1,000 epochs of training can be heard here:

https://download.csdn.net/download/lsb2002/87832170d

5 Pre-trained models

Using the Biaobei male voice data on a Tesla V100 GPU, the model was trained for 700,000 steps (G_700000.pth); a new speaker can then be fine-tuned on top of this model (secondary training) to achieve rapid convergence. Pre-trained model download address

After downloading, store the model in the "/vits_chinese/logs/bert_vits/" directory and start the secondary training; training will resume from this checkpoint, as sketched below.
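This works because upstream VITS's train.py automatically resumes from the newest G_*.pth / D_*.pth found in the model directory; a sketch of that selection logic (assuming vits_chinese keeps the upstream behavior):

    import glob, os, re

    def latest_checkpoint(dir_path, pattern="G_*.pth"):
        # pick the checkpoint with the highest step number in its file name
        files = glob.glob(os.path.join(dir_path, pattern))
        if not files:
            return None
        return max(files, key=lambda f: int(re.findall(r"\d+", os.path.basename(f))[0]))

    print(latest_checkpoint("logs/bert_vits"))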

        


Origin blog.csdn.net/lsb2002/article/details/130904876