Professional Practice Record II: End-to-end cross-language timbre transfer speech synthesis

0. Description

This record covers the work done between 2020-12-16 and 2021-01-16.

1. Engineering

1.1. Improving the commercial mixed-language synthesis system

Continuing last month's work: given a bilingual corpus, synthesize speech from mixed-language text.

The system built in this part is named Fantasy Mix-Lingual Tacotron.

1.1.1. Experimental details

Training used two corpora: the laboratory's "standard shell" bilingual dataset and Ping An Technology's Chunchun virtual bilingual corpus.

The following model variants were tried:

  • Fantasy Mix-Lingual Tacotron Version 2: grapheme input, Language ID kept, VAE module used
  • Fantasy Mix-Lingual Tacotron Version 4: phoneme input, Language ID kept, VAE module used
  • Fantasy Mix-Lingual Tacotron Version 5: phoneme input, Language ID kept, VAE module removed
  • Fantasy Mix-Lingual Tacotron Version 6: phoneme input, Language ID removed, VAE module used
  • Fantasy Mix-Lingual Tacotron Version 7: phoneme input, Language ID removed, VAE module removed
  • Fantasy Mix-Lingual Tacotron Version 4 (revised): the Language ID is concatenated earlier, at the text encoding (TXT Encoding) stage; everything else is unchanged
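
The revised Version 4 concatenates the Language ID at the text-encoding stage. The exact layer is not described here, so the following is only a minimal sketch of per-token concatenation in plain Python. The toy embedding tables (`PHONEME_EMB`, `LANG_EMB`) and the helper name `embed_with_language_id` are hypothetical and are not taken from the actual system.

```python
from typing import Dict, List

# Hypothetical toy embedding tables; a real model would learn these vectors.
PHONEME_EMB: Dict[str, List[float]] = {
    "n": [0.1, 0.2],
    "i": [0.3, 0.4],
    "h": [0.5, 0.6],
    "ay": [0.7, 0.8],
}
LANG_EMB: Dict[str, List[float]] = {"zh": [1.0, 0.0], "en": [0.0, 1.0]}


def embed_with_language_id(phonemes: List[str], langs: List[str]) -> List[List[float]]:
    """Concatenate each phoneme embedding with the embedding of its language ID."""
    if len(phonemes) != len(langs):
        raise ValueError("each phoneme needs exactly one language ID")
    return [PHONEME_EMB[p] + LANG_EMB[lang] for p, lang in zip(phonemes, langs)]


# A mixed-language utterance: two Chinese phonemes followed by two English ones.
encoder_inputs = embed_with_language_id(
    ["n", "i", "h", "ay"], ["zh", "zh", "en", "en"]
)
```

Each encoder input frame then carries both the phoneme identity and the language it belongs to, so later layers do not have to infer the language from the phoneme alone.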

1.1.2. Experimental phenomena and conclusions

  • The revised version of Fantasy Mix-Lingual Tacotron Version 4 performs best and synthesizes mixed-language text normally.
  • Phoneme input works much better than grapheme input.
  • The Language ID must be retained, regardless of whether the output side distinguishes the input representations of the different languages.
  • The VAE module is not yet well understood, and it has not been tested enough to judge its effect. Intuitively, though, each part of the synthesized mixed-language speech sounds more natural.

1.1.3. Future work

Package the revised version of Fantasy Mix-Lingual Tacotron Version 4 and launch a web version.

1.2. The cross-language timbre conversion architecture proposed by Alibaba

1.2.1. Tacotron-based mapping from PPG (phonetic posteriorgram) to mel spectrogram

  • PPG downsampling
  • Experiments on where to freeze the model during fine-tuning
  • Experiments on how many layers to fine-tune
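
The downsampling method used for the PPGs is not stated here. One common option, shown as a hedged sketch only, is to average every `factor` consecutive frames, which shortens the sequence while keeping each frame a valid posterior distribution. The function name `downsample_ppg` and the toy values are illustrative assumptions.

```python
from typing import List


def downsample_ppg(ppg: List[List[float]], factor: int) -> List[List[float]]:
    """Average each group of `factor` consecutive frames.

    A trailing group with fewer than `factor` frames is averaged over the
    frames it actually contains.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    reduced = []
    for start in range(0, len(ppg), factor):
        group = ppg[start:start + factor]
        dim = len(group[0])
        reduced.append([sum(frame[d] for frame in group) / len(group) for d in range(dim)])
    return reduced


# Five frames over two toy phonetic classes, reduced by a factor of 2.
toy_ppg = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
reduced_ppg = downsample_ppg(toy_ppg, 2)
```

Because every input frame sums to 1, each averaged frame also sums to 1.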

1.2.2. Code implementation

  • Compare Alibaba's structural changes relative to Tacotron
  • PyTorch implementation based on r9y9's code

1.2.3. Future work

  • Implement the Alibaba-structure PPG-TTS with the best fine-tuning setup

1.3. AutoVC reproduction

Reproduce the AutoVC paper and explore the conditions that affect the experimental results:

  • Similar Loss: derived from AutoVC's content loss; discussion of the influence of the autoencoder structure
  • The influence of different acoustic feature extraction hyperparameters on the experimental results
  • The roles of the bottleneck dimensionality and the downsampling proposed by AutoVC
  • The difference between one-hot speaker codes and a Speaker Encoder
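
Two of the items above can be illustrated with a small sketch. AutoVC narrows its content code in time by downsampling and then restoring the length by repetition, and its content loss compares the content code of the input with that of the reconstruction using an L1 distance. The code below is a simplified plain-Python illustration, not the AutoVC implementation: it uses a mean absolute difference rather than an L1 norm, the names `temporal_bottleneck` and `l1_code_distance` are hypothetical, and the exact form of Similar Loss used in this work is not given here.

```python
from typing import List

Frames = List[List[float]]


def temporal_bottleneck(codes: Frames, factor: int) -> Frames:
    """Keep every `factor`-th frame, then repeat it to restore the original length."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    kept = codes[::factor]
    restored = [frame for frame in kept for _ in range(factor)]
    return restored[:len(codes)]


def l1_code_distance(a: Frames, b: Frames) -> float:
    """Mean absolute difference between two equally shaped code sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    count = 0
    for frame_a, frame_b in zip(a, b):
        for x, y in zip(frame_a, frame_b):
            total += abs(x - y)
            count += 1
    return total / count


# Toy one-dimensional content codes over five frames.
codes = [[1.0], [2.0], [3.0], [4.0], [5.0]]
narrowed = temporal_bottleneck(codes, 2)
distance = l1_code_distance(codes, narrowed)
```

A larger `factor` discards more temporal detail, which is the knob AutoVC uses, together with the code dimensionality, to limit how much speaker information can pass through the content code.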

The experimental conclusions were applied in a colleague's paper.

2. Research

2.1. Voice Transfer: a cross-language synthesis scheme

2.1.1. Ideas

  • Extract timbre information with a timbre-encoder-based scheme
  • Train the acoustic model without any source-language corpus
  • Use only the target-language corpus to train the acoustic model
  • Do not use the target speaker's corpus to train the model
  • Train the model only on multiple source speakers, relying on them to build a well-covered timbre feature space
  • Rely mainly on a good Speaker Encoder module to relate the target timbre to the multiple source timbres
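
To make the last point concrete: if the Speaker Encoder maps each speaker to an embedding vector, the relation between an unseen target timbre and the source timbres can be inspected with cosine similarity. This is an illustrative sketch only. The embeddings, speaker names, and the helper `rank_source_speakers` are hypothetical, and the source does not say that this measure was actually used.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def rank_source_speakers(target: List[float], sources: Dict[str, List[float]]) -> List[str]:
    """Return source speaker names ordered from most to least similar to the target."""
    return sorted(
        sources,
        key=lambda name: cosine_similarity(target, sources[name]),
        reverse=True,
    )


# Hypothetical two-dimensional speaker embeddings.
source_embeddings = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0], "spk_c": [1.0, 1.0]}
target_embedding = [0.9, 0.1]
ranking = rank_source_speakers(target_embedding, source_embeddings)
```

A target whose embedding falls between well-separated source speakers is the case this scheme depends on, which is why coverage of the timbre space by the source speakers matters.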

2.1.2. Experimental results

  • Cross-language synthesis quality is much better than with the previous scheme
  • Limited by timbre modeling and information conflict, timbre similarity and synthesis stability are still not good enough

2.1.3. Future work

  • Improve the Voice Transfer cross-language synthesis scheme, with reference to the National Taiwan University paper, to make synthesis stable

2.2. The role of Similar Loss in PPG autoencoding TTS

2.2.1. Ideas

  • CopyVC: a structure that takes PPGs trained with Similar Loss as input, built on the Google-19 Tacotron cross-language synthesis framework

2.2.2. Future work

  • Refine the CopyVC idea and implement it

3. Next stage tasks

  • Package the revised version of Fantasy Mix-Lingual Tacotron Version 4 and launch a web version
  • Implement the Alibaba-structure PPG-TTS with the best fine-tuning setup
  • Improve the Voice Transfer cross-language synthesis scheme, with reference to the National Taiwan University paper, to make synthesis stable
  • Summarize information-disentanglement methods from AutoVC and related papers, such as Similar Loss, and apply them to cross-language synthesis
  • Refine and implement the CopyVC idea based on PPG autoencoding

Origin blog.csdn.net/u013625492/article/details/113393773