Speech synthesis and voice cloning - building a speech dataset for training

Foreword

1. PaddleSpeech is an easy-to-use, all-in-one speech toolkit that supports speech processing tasks such as speech recognition, speech synthesis, voiceprint recognition, audio classification, speech translation, and voice wake-up, and can be used for application development.

Only speech synthesis and voice cloning are used here. The synthesis pipeline consists of three main modules: the text frontend, the acoustic model, and the vocoder. The workflow is as follows:

  • Raw text is converted into characters/phonemes by the text frontend module.
  • Characters/phonemes are converted into acoustic features by the acoustic model, such as a linear spectrogram, mel spectrogram, or LPC features.
  • Acoustic features are converted into a waveform by the vocoder.
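The three-module flow above can be sketched as plain Python stubs. Note that the function bodies here are toy placeholders to show the data flow, not the real PaddleSpeech API:

```python
# Minimal sketch of the text frontend -> acoustic model -> vocoder pipeline.
# All function bodies are illustrative stand-ins, not PaddleSpeech code.

def text_frontend(text: str) -> list:
    """Text frontend: convert raw text to a phoneme sequence (toy rule)."""
    return list(text.lower().replace(" ", ""))

def acoustic_model(phonemes: list) -> list:
    """Acoustic model: map phonemes to acoustic features (here, a dummy
    80-dimensional 'mel frame' per phoneme)."""
    return [[float(ord(p) % 10)] * 80 for p in phonemes]

def vocoder(mels: list) -> list:
    """Vocoder: convert acoustic features into a waveform (here, simply
    flattening the frames into a dummy sample stream)."""
    return [s for frame in mels for s in frame]

phonemes = text_frontend("Hello world")
mels = acoustic_model(phonemes)
wav = vocoder(mels)
print(len(phonemes), len(mels), len(wav))  # 10 10 800
```

Each stage narrows the representation: text becomes symbols, symbols become frame-level features, and features become samples.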

2. The whole project can be roughly divided into the following steps:

  • Collecting and processing voice data.
  • Fine-tuning the speech synthesis and voice cloning models.
  • Deploying the model offline in an application.

Dataset preparation

1. If you want to train on your own voice, record yourself with a recording device. Both Chinese and English are acceptable. The recording environment should be as noise-free as possible, and the more recorded audio you have, the better.

2. If you use audio data from the internet, both video and audio files will work.

3. For this demonstration I use the voice of an uploader on Bilibili. I clipped about 10 videos, each more than 5 minutes long. Because the videos contain background music, it is removed here to improve training results. There are many ways to remove background music: audio professionals like to use Adobe Audition, but it has a steep learning curve. Instead, we can use deep learning to remove the background music.

Ultimate Vocal Remover (UVR) is a very easy-to-use tool for separating vocals from accompaniment. After installation, you can use UVR to split a track into accompaniment and vocals. Its main options are as follows:

Basic options (casual users generally only need these):

  • VR Architecture
  • MDX-Net
  • Demucs v3
  • Ensemble Mode (ensemble options)
  • Manual Ensemble

4. After removing the background music, cut the audio into segments 2 to 10 seconds long (no more than 10 seconds). The cutting is done with Adobe Audition. After installing Adobe Audition, open the prepared video or audio file, then click the file name —> Insert into Multitrack —> create a new multitrack session, as shown below:

Give the session to be edited a name:

Afterwards, use the razor tool in the Adobe Audition interface. When slicing, try to cut in the parts without speech, that is, where there is no waveform. If there is a long stretch without any waveform, cut it out and delete it. Keep each slice no shorter than 2 seconds and no longer than 10 seconds.
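The slicing rule above (cut at silent stretches, keep only segments of 2 to 10 seconds) can also be automated. The following is a toy sketch using a synthetic NumPy signal, not part of the Adobe Audition workflow; for real files you would load samples with a library such as soundfile:

```python
# Split audio at silent stretches and keep segments between 2 and 10 s.
import numpy as np

def split_on_silence(samples, sr, threshold=0.01, min_len=2.0, max_len=10.0):
    """Return (start, end) sample indices of non-silent segments whose
    duration falls within [min_len, max_len] seconds."""
    voiced = np.abs(samples) > threshold
    segments = []
    start = None
    # Append a trailing False so a segment ending at the array edge closes.
    for i, v in enumerate(np.append(voiced, False)):
        if v and start is None:
            start = i
        elif not v and start is not None:
            duration = (i - start) / sr
            if min_len <= duration <= max_len:
                segments.append((start, i))
            start = None
    return segments

sr = 1000                  # toy sample rate for the demo
sig = np.zeros(10 * sr)
sig[1000:4000] = 0.5       # a 3 s voiced stretch -> kept
sig[6000:6500] = 0.5       # a 0.5 s stretch -> too short, dropped
segs = split_on_silence(sig, sr)
print(segs)  # [(1000, 4000)]
```

A simple amplitude threshold stands in for real voice-activity detection here; it mirrors the manual rule of cutting where the waveform is flat.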

After cutting the entire audio, drag each segment onto its own audio track and delete the unused segments:

Then select all the cut segments (Ctrl+A) and click File —> Export —> Edit All:

In the export dialog, change the sample rate to 24000 Hz and export all files:
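If a clip was exported at a different sample rate, it can be resampled to 24000 Hz programmatically. This is a minimal linear-interpolation sketch with NumPy; for production quality you would use a proper resampler such as librosa.resample:

```python
# Resample a 1-D signal to 24000 Hz using linear interpolation.
import numpy as np

def resample(samples, sr_in, sr_out=24000):
    """Linearly resample a 1-D signal from sr_in to sr_out."""
    n_out = int(round(len(samples) * sr_out / sr_in))
    x_old = np.linspace(0.0, 1.0, num=len(samples), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, samples)

one_second = np.ones(44100)       # 1 s of audio at 44.1 kHz
out = resample(one_second, 44100)
print(len(out))  # 24000 samples, i.e. 1 s at 24 kHz
```

Linear interpolation is enough to show the idea, but it does not low-pass filter before downsampling, so a dedicated audio resampler avoids aliasing artifacts.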

When exporting the audio clips, if a file name contains Chinese characters, change it to an English or numeric file name.
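The renaming step can be scripted. The sketch below renames every .wav whose name contains non-ASCII characters (e.g. Chinese) to a sequential numeric name; it runs on a temporary directory with made-up file names for demonstration:

```python
# Rename non-ASCII .wav file names to sequential numeric names.
import os
import tempfile

def rename_to_numeric(folder):
    """Rename every .wav whose name is not ASCII to 0001.wav, 0002.wav, ..."""
    renamed = []
    idx = 1
    for name in sorted(os.listdir(folder)):
        if name.endswith(".wav") and not name.isascii():
            new_name = f"{idx:04d}.wav"
            os.rename(os.path.join(folder, name),
                      os.path.join(folder, new_name))
            renamed.append(new_name)
            idx += 1
    return renamed

with tempfile.TemporaryDirectory() as d:
    for fname in ["片段一.wav", "片段二.wav", "clip3.wav"]:
        open(os.path.join(d, fname), "w").close()
    result = rename_to_numeric(d)
    print(result)  # ['0001.wav', '0002.wav'] -- clip3.wav is left untouched
```

Zero-padded numbers (0001, 0002, ...) keep the files in order when sorted, which makes later dataset listing straightforward.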

Related software downloads

1. Audio clip cutting software download. This is a portable ("green") version; you can find it on Taobao, or send me a private message and I will share it via a network disk.

2. Accompaniment/vocal extraction tool download:

https://download.csdn.net/download/matt45m/88033228


Origin: blog.csdn.net/matt45m/article/details/131502867