2020-12-27-HCSI Group Meeting

1. Jackie: Cotatron

1.1. Transcription-guided

  • Borrows the attention module from a pre-trained Tacotron
  • The mel enters in two places: concatenated with the text for attention, and as residual information sent separately to the decoder
  • Only the attention needs to be learned in order to align the mel with the expanded text sequence
  • L = matmul(A, Encoder(T)) is the linguistic feature, similar to a PPG. It is particularly well decoupled because the attention mechanism is used
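
The relation above can be sketched in NumPy (shapes and names here are illustrative, not taken from the paper):

```python
import numpy as np

# Hypothetical shapes: T_mel mel frames, T_txt text tokens, d encoder dim.
T_mel, T_txt, d = 6, 4, 8
rng = np.random.default_rng(0)

enc_T = rng.normal(size=(T_txt, d))      # Encoder(T): one vector per text token
A = rng.random(size=(T_mel, T_txt))
A = A / A.sum(axis=1, keepdims=True)     # rows sum to 1, like attention weights

# L aligns text features to the mel time axis: one text-derived vector per frame.
L = A @ enc_T                            # shape (T_mel, d)
```

Because each row of L is a convex combination of text-encoder vectors, L carries only linguistic content expanded to the mel timeline, which is why it behaves like a PPG.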

1.2. Tacotron + speaker encoder

The reference encoder is borrowed from the style-token work. Why use it instead of a one-hot speaker embedding?

1.3. Residual Encoder

L can only provide text information; the residual encoder supplies everything else

  • The structure is a bottleneck, fairly complete
  • Dimensionality reduction + downsampling
  • Instance norm, tanh
  • Hann-window smoothing

In the end it is reduced to a single vector, similar to the residual part of a VAE, but the structural design is very close to AutoVC

But what kind of information does it carry? For example, is it similar to F0? An ablation experiment is needed
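
The bullets above can be sketched as a single function. This is only a stand-in (a random projection replaces the real conv layers; all names are illustrative), showing the order of operations: bottleneck, instance norm, tanh, Hann smoothing, then pooling to one vector:

```python
import numpy as np

def residual_encoder(mel, d_low=4, win=5):
    """Hypothetical sketch: project mel to a low-dim bottleneck,
    instance-normalize each channel over time, squash with tanh,
    smooth along time with a Hann window, then average-pool to
    a single vector."""
    rng = np.random.default_rng(1)
    T, d_mel = mel.shape
    W = rng.normal(size=(d_mel, d_low)) / np.sqrt(d_mel)  # stand-in for conv layers
    h = mel @ W                                           # (T, d_low) bottleneck
    h = (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-5)     # instance norm over time
    h = np.tanh(h)
    hann = np.hanning(win)
    hann /= hann.sum()                                    # normalized Hann kernel
    h = np.stack([np.convolve(h[:, c], hann, mode="same")
                  for c in range(d_low)], axis=1)         # temporal smoothing
    return h.mean(axis=0)                                 # collapse to one vector

z = residual_encoder(np.random.default_rng(2).normal(size=(20, 80)))
```

The tanh plus smoothing keeps the residual code bounded and slowly varying, which limits how much speaker/content detail can leak through it.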

1.4. VC decoder

The purpose is simple, but the structure used is fairly advanced

  • GBlock
  • Conditional batch norm
  • Speaker id uses one-hot again
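
The core idea of conditional normalization with a one-hot speaker id can be sketched as follows (a simplification: real conditional batch norm works per mini-batch and per channel; the table names are illustrative):

```python
import numpy as np

def conditional_norm(x, spk_onehot, gamma_table, beta_table):
    """Hypothetical sketch: normalize the features, then apply a
    scale/shift selected by the speaker one-hot vector."""
    x_hat = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-5)
    gamma = spk_onehot @ gamma_table   # per-speaker scale, shape (d,)
    beta = spk_onehot @ beta_table     # per-speaker shift, shape (d,)
    return gamma * x_hat + beta

x = np.array([[1., 2.], [3., 4.]])
y = conditional_norm(x, np.array([1., 0.]),
                     gamma_table=np.ones((2, 2)),
                     beta_table=np.zeros((2, 2)))
```

The one-hot simply indexes a row of the scale/shift tables, so each speaker gets its own affine parameters after normalization.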

1.5. Cotatron Loss

Trained in two stages, so there are two losses

  • One loss more than Tacotron 2: the speaker-id loss
  • During VC training, a reconstruction loss
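
A minimal sketch of the two-stage objectives, with illustrative names (not the paper's exact formulation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Stage 1 trains the Tacotron-style model with an extra speaker-ID loss;
# stage 2 trains the VC decoder with a reconstruction loss.
def stage1_loss(mel_pred, mel_true, spk_logits, spk_id):
    recon = np.abs(mel_pred - mel_true).mean()            # mel reconstruction (L1)
    spk_ce = -np.log(softmax(spk_logits)[spk_id] + 1e-9)  # speaker-ID cross-entropy
    return recon + spk_ce

def stage2_loss(mel_converted, mel_true):
    return np.abs(mel_converted - mel_true).mean()        # VC reconstruction
```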

1.6. ASR Any-to-Many

  • Improves the mel → txt → multi-speaker TTS pipeline
  • But it also borrows the synthesis approach of VC: (1) the decoder is relatively simple; (2) does the residual module correct ASR errors?
  • Still entangled with the use of speaker-id embeddings
  • Why not use a PPG? What is the difference? There are two main kinds of PPG — one from traditional ASR and one from end-to-end ASR — and features like this L are becoming more common

1.7. Data volume

VCTK is used to train the one-hot multi-speaker model, with 400 sentences per speaker

2. 思磐: FastPitch & FastSpeech 2

2.1. Application of Alignment

  • Hard alignment yields the duration of each phoneme
  • It also yields the pitch of each phoneme
  • This should suit TTS tasks better than durations obtained by forced alignment from the ASR field
  • Hard alignment allows a refinement, e.g. predicting Gaussian parameters, which is finer-grained than an integer duration
  • MFA (Montreal Forced Aligner) can also be considered
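
The duration-to-pitch step above can be sketched directly: given a frame-level F0 track and per-phoneme durations (in frames) from hard alignment, average F0 within each phoneme's span (the numbers are made up for illustration):

```python
import numpy as np

f0 = np.array([100., 102., 98., 150., 155., 148., 152., 90., 92.])
durations = np.array([3, 4, 2])   # frames per phoneme; must sum to len(f0)

bounds = np.concatenate([[0], np.cumsum(durations)])
phoneme_pitch = np.array([f0[a:b].mean()
                          for a, b in zip(bounds[:-1], bounds[1:])])
# → [100.0, 151.25, 91.0]
```

This also makes the objection in the next subsection concrete: each phoneme gets one averaged pitch value, which smooths away within-phoneme F0 movement.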

2.2. Text Prediction Pitch

Predicting one averaged pitch value per phoneme is a bit too smooth

  • But it is true that traditional TTS has a prosody-prediction module that predicts F0

2.3. Encoder inheritance and knowledge distillation

  • Worth checking why FastSpeech did not dare to drop knowledge distillation, while later work could
  • Hypothesis: replacing the traditional pipeline of text analysis, then padding, then LSTM with a Transformer structure works very well: no autoregression, and strong modeling capacity


Origin blog.csdn.net/u013625492/article/details/111797125