Speech synthesis is an important part of voice interaction, and its technology is developing rapidly. In recent years, interest in and demand for emotional speech synthesis have grown. Emotional speech synthesis allows a machine to communicate with us like a real person: it can express different emotions, such as angry, happy, and sad voices, and even the same emotion at different intensities.
Emotional speech conversion technology converts speech from one emotional state to another while keeping the speaker's identity and the linguistic content unchanged. Simply put, it transfers the emotional expression of an emotional source speaker onto a target speaker while preserving the target speaker's timbre.
Emotional Speech Synthesis Technology
An emotional speech synthesis system can use a speaker-and-emotion embedding scheme. Emotion is treated as a label: an emotion label is added on top of the original network, and the network learns the corresponding emotional information.
Speaker embedding obtains a speaker vector through a neural network, which requires a multi-speaker database of a certain scale for training.
Emotion embedding combines emotional data with the speaker vector to build an emotional speech synthesis model, so high-quality, multi-emotion data is required.
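To make the scheme concrete, here is a minimal sketch, assuming PyTorch; the module names, dimensions, and additive fusion are illustrative choices, not any specific system's design.

```python
# Minimal sketch of speaker + emotion embedding conditioning, assuming
# PyTorch. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class SpeakerEmotionConditioner(nn.Module):
    """Adds learned speaker and emotion embeddings to TTS encoder states."""

    def __init__(self, num_speakers: int, num_emotions: int, hidden_dim: int = 256):
        super().__init__()
        # One learned vector per speaker ID and per emotion label.
        self.speaker_table = nn.Embedding(num_speakers, hidden_dim)
        self.emotion_table = nn.Embedding(num_emotions, hidden_dim)
        # Project the concatenated conditioning back to the encoder width.
        self.proj = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, encoder_out, speaker_id, emotion_id):
        # encoder_out: (batch, time, hidden_dim) phoneme encoder states.
        cond = torch.cat([self.speaker_table(speaker_id),
                          self.emotion_table(emotion_id)], dim=-1)
        cond = self.proj(cond).unsqueeze(1)  # (batch, 1, hidden_dim)
        # Broadcast the same conditioning vector across all time steps.
        return encoder_out + cond

# Usage: condition encoder states on speaker 3 speaking with emotion label 2.
conditioner = SpeakerEmotionConditioner(num_speakers=22, num_emotions=6)
states = torch.randn(1, 50, 256)
out = conditioner(states, torch.tensor([3]), torch.tensor([2]))
print(out.shape)  # torch.Size([1, 50, 256])
```

The decoder then generates acoustic features from the conditioned states; the learned emotion vector is what lets a single network render the same text in several emotional styles.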
Emotional Voice Conversion Technology
For example, cross-speaker emotion transfer can use emotion and timbre perturbation to learn speaker-related and emotion-related spectra separately, providing explicit emotion features for the final speech generation. The speaker-related branch preserves the target speaker's timbre, while the emotion-related branch captures the source speaker's emotional expression. Joint training therefore needs both multi-speaker emotional data and multi-speaker non-emotional data.
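The sketch below illustrates the perturbation idea under stated assumptions: a crude random warp of the mel-frequency axis stands in for real pitch/formant perturbation, and the encoders are deliberately simple. All names and shapes are hypothetical.

```python
# Sketch of perturbation-based disentanglement, assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

def perturb_timbre(mel: torch.Tensor) -> torch.Tensor:
    # mel: (batch, time, n_mels). Randomly warp the frequency axis as a
    # crude stand-in for formant/pitch perturbation: it distorts speaker
    # timbre while largely preserving emotion-related prosody.
    n_mels = mel.size(-1)
    scale = 1.0 + 0.3 * (torch.rand(1).item() - 0.5)
    warped = F.interpolate(mel, scale_factor=scale, mode="linear",
                           align_corners=False)
    return F.interpolate(warped, size=n_mels, mode="linear",
                         align_corners=False)

class UtteranceEncoder(nn.Module):
    """GRU that pools a mel spectrogram into one utterance embedding."""
    def __init__(self, n_mels: int = 80, dim: int = 128):
        super().__init__()
        self.gru = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mel):
        _, h = self.gru(mel)
        return h[-1]  # (batch, dim)

emotion_enc = UtteranceEncoder()  # sees perturbed mels -> emotion features
speaker_enc = UtteranceEncoder()  # sees clean mels -> timbre features

src_mel = torch.randn(1, 200, 80)  # emotional source speaker
tgt_mel = torch.randn(1, 200, 80)  # target speaker, neutral recording
emotion_vec = emotion_enc(perturb_timbre(src_mel))  # source emotion
speaker_vec = speaker_enc(tgt_mel)                  # target timbre
cond = torch.cat([emotion_vec, speaker_vec], dim=-1)
# `cond` would condition a decoder that generates the transferred speech.
```

Because the emotion branch never sees unperturbed timbre, it is pushed toward emotion-related prosody, which is the separation the transfer relies on.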
Emotional Voice Application Scenarios
Avatars: give virtual characters a degree of emotional expressiveness.
Short-video dubbing: dub short-video content to make it more lively and engaging.
Game characters: give players a more immersive in-game experience.
Film and animation: deliver vivid, expressive narration.
Intelligent customer service: improve the human-computer interaction experience and make interactions more engaging.
Datatang Emotional Voice Database Recommendation
01 Single-Person Emotional Speech Database
Recorded by a single speaker in a professional recording studio.
13.3-Hour Chinese Female Emotional Speech Synthesis Library
Recorded by a gentle, kind young woman, the corpus covers six emotions: happiness, anger, sadness, surprise, fear, and disgust. Phoneme coverage of the corpus is balanced, and professional phoneticians take part in the annotation. Text annotation accuracy is no less than 99.9%, phoneme annotation accuracy no less than 99%, and prosodic annotation accuracy no less than 98%.
02 Multi-Person Emotional Speech Database
Recorded by multiple speakers in a professional recording studio.
22-Person Chinese Emotional Speech Synthesis Library
The ratio of male to female speakers is balanced, covering children, young adults, and the elderly. Each speaker records six emotions (happiness, anger, sadness, surprise, fear, and disgust), with 20 minutes of speech per emotion. The text style is natural and colloquial, phoneme coverage of the corpus is balanced, and professional phoneticians take part in the annotation. Text annotation accuracy is no less than 99.9%, phoneme annotation accuracy no less than 99%, and prosodic annotation accuracy no less than 98%.
The 22 speakers in this database are selected from the Datatang finished-product database "100-Person Chinese General Average Timbre Synthesis Library". Used together, the two databases support technologies such as emotional speech synthesis and cross-speaker emotion transfer.
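As an illustration of what such combined use could look like in training code, here is a toy sketch, assuming PyTorch; the tensor shapes, label scheme, and speaker-ID offset are assumptions made for the example.

```python
# Toy sketch of pooling an emotional corpus with a neutral corpus, assuming
# PyTorch. Shapes, label IDs, and the speaker-ID offset are illustrative.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

NEUTRAL = 6  # reserve one extra label for the emotion-free recordings

# Stand-ins for real features: (clips, frames, n_mels) mel spectrograms.
emo_mels = torch.randn(10, 200, 80)
emo_spk = torch.randint(0, 22, (10,))        # 22 emotional speakers
emo_lab = torch.randint(0, 6, (10,))         # six emotion labels

neu_mels = torch.randn(30, 200, 80)
neu_spk = torch.randint(0, 100, (30,)) + 22  # offset so IDs don't collide
neu_lab = torch.full((30,), NEUTRAL)

joint = ConcatDataset([TensorDataset(emo_mels, emo_spk, emo_lab),
                       TensorDataset(neu_mels, neu_spk, neu_lab)])
for mel, spk, emo in DataLoader(joint, batch_size=8, shuffle=True):
    pass  # feed (features, speaker ID, emotion label) to the joint model
```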
20-Person Chinese Emotional Speech Synthesis Library
The ratio of male to female speakers is balanced, covering teenagers, young adults, the middle-aged, and the elderly. Each speaker records seven emotions (happiness, anger, sadness, surprise, fear, disgust, and neutral), with 20 minutes of speech per emotion. The texts are all in a novel style, phoneme coverage of the corpus is balanced, and professional phoneticians take part in the annotation. Text annotation accuracy is no less than 99.9%, phoneme annotation accuracy no less than 99%, and prosodic annotation accuracy no less than 98%.
03 Multi-Speaker Average Model Library
Recorded by multiple speakers in a professional recording studio.
100-Person Chinese General Average Timbre Synthesis Library
The corpus covers news, daily spoken language, audiobooks, poetry, advertisements, news broadcasting, entertainment, and other categories; the languages include Chinese, English, and mixed Chinese-English reading. There are 50 male and 50 female speakers, covering children, adults, and the elderly, and each speaker records 600 to 700 sentences. The recordings are annotated with text, phonemes, four-level prosody, and phoneme boundaries.
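For a sense of what such annotations can look like in code, here is an illustrative record; the field names, the #1-#4 prosodic break scheme, and the sample values are assumptions, not Datatang's actual delivery format.

```python
# Illustrative annotation record; fields and values are assumptions.
from dataclasses import dataclass

@dataclass
class AnnotatedUtterance:
    text: str                              # orthographic transcript
    phonemes: list[str]                    # phoneme sequence
    prosody: list[int]                     # break level per syllable (#1-#4)
    boundaries: list[tuple[float, float]]  # phoneme (start, end) in seconds

utt = AnnotatedUtterance(
    text="你好",
    phonemes=["n", "i3", "h", "ao3"],
    prosody=[1, 4],  # e.g. minor break after 你, sentence-final after 好
    boundaries=[(0.00, 0.08), (0.08, 0.21), (0.21, 0.30), (0.30, 0.52)],
)
```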
As a world-leading artificial intelligence data service provider, Datatang offers customers rich emotional voice data. AI models trained on this data can synthesize speech that is richer in emotion and expression, making synthesized voices more natural and realistic and better suited to different application scenarios.