Speech Neuroscience—02.Speech synthesis from neural decoding of spoken sentences

Speech synthesis from neural decoding of spoken sentences

Terminology

speech synthesis
brain-computer interface (BCI)
bidirectional long short-term memory (bLSTM) network
ventral sensorimotor cortex (vSMC)
superior temporal gyrus (STG)
inferior frontal gyrus (IFG)
acoustic features
mel-frequency cepstral coefficients (MFCCs)
mean Mel-Cepstral Distortion (MCD)
Pearson's correlation coefficient
Electromagnetic Articulography (EMA)
spectral features
auditory feedback
spectrograms
Kullback-Leibler (KL) divergence
electrode
Principal Component Analysis (PCA)
formants
spectral envelope

Overview

The researchers designed a neural decoder that explicitly exploits kinematic and acoustic representations encoded in human cerebral cortex activity to synthesize audible speech.

Background

Neurological diseases that cause the loss of the ability to communicate are devastating for patients. Some existing methods let patients use communication devices to select letters and spell out words, but their speed is limited. Spelling is the sequential concatenation of discrete letters, whereas speech is a highly efficient form of communication produced through fluid, multi-articulated vocal tract movements, so higher-rate communication methods need to be studied. The authors' goal is to use neuroscience and signal-processing techniques to decode brain signals into speech. By analyzing the patterns of brain activity related to speech production, the researchers hope to accurately reproduce intelligible speech output. A biomimetic approach that focuses on vocal tract movements and the sounds they produce may be the only way to achieve communication rates comparable to natural speech, and it may also be the most intuitive way for users to learn.

Speech decoder design

The figure below shows a two-stage decoding method:
Insert image description here
Stage 1: A bidirectional long short-term memory (bLSTM) recurrent neural network decodes articulatory kinematic features from neural activity (the high-gamma amplitude envelope and low-frequency components) recorded in the ventral sensorimotor cortex (vSMC), superior temporal gyrus (STG) and inferior frontal gyrus (IFG) (Fig. 1a, b).
Stage 2: A separate bLSTM decodes acoustic features (fundamental frequency F0, mel-frequency cepstral coefficients (MFCCs), voicing and glottal excitation strengths) from the articulatory kinematic features decoded in the first stage (Fig. 1c). An audio signal is then synthesized from the decoded acoustic features (Fig. 1d). To integrate the two stages, the second stage (articulation-to-acoustics) is trained directly on the output of the first stage (brain-to-articulation), so that it learns not only the mapping from movement to sound but also how to correct articulatory estimation errors made in the first stage.
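To make the data flow concrete, here is a minimal sketch of such a two-stage bLSTM pipeline (this is not the authors' code; the layer sizes, feature dimensions and variable names are illustrative assumptions):

```python
# Minimal sketch of a two-stage decoder: ECoG features -> articulatory
# kinematics -> acoustic features. Shapes and sizes are hypothetical.
import torch
import torch.nn as nn

class StageBLSTM(nn.Module):
    """One decoding stage: a stacked bidirectional LSTM with a linear readout."""
    def __init__(self, in_dim, out_dim, hidden=100, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        self.readout = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):                  # x: (batch, time, in_dim)
        h, _ = self.lstm(x)
        return self.readout(h)             # (batch, time, out_dim)

# Stage 1: neural activity (e.g., high-gamma per electrode) -> 33 kinematic features.
brain_to_articulation = StageBLSTM(in_dim=256, out_dim=33)
# Stage 2: kinematic features -> 32 acoustic features (MFCCs, F0, voicing, excitation).
articulation_to_acoustics = StageBLSTM(in_dim=33, out_dim=32)

ecog = torch.randn(1, 500, 256)            # one sentence, 500 time steps
kinematics = brain_to_articulation(ecog)   # decoded articulatory trajectories
acoustics = articulation_to_acoustics(kinematics)  # features fed to a vocoder
```

In this sketch the second network consumes the first network's output, which mirrors how the second stage is trained on the (imperfect) output of the first.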

Synthesis performance

Audio spectrograms

The figure below compares audio spectrograms of speech decoded from brain activity with those of the original spoken sentences. Two sentences are shown, one on the left and one on the right; the upper row is the original speech spectrogram and the lower row is the decoded speech spectrogram.

Q: What is a spectrogram? What do the original and decoded speech spectrograms represent?
A: A spectrogram is a representation of a speech signal in time and frequency. It shows how the energy of the signal is distributed across frequencies and how that distribution changes over time. The decoded spectrogram is obtained by performing spectral analysis on the decoded speech signal.
Tip: the spectrogram of the original spoken sentence shows the acoustic energy distribution of the recorded speech across frequency and time, while the decoded spectrogram shows the energy distribution of the speech reconstructed from brain activity.
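For intuition, a spectrogram can be computed from a waveform like this (a minimal SciPy sketch with made-up parameters, not the paper's processing pipeline):

```python
# Minimal sketch: compute a spectrogram with SciPy (parameters are illustrative).
import numpy as np
from scipy.signal import spectrogram

sr = 16000                                    # assumed sampling rate in Hz
t = np.arange(0, 1.0, 1 / sr)
audio = np.sin(2 * np.pi * 220 * t)           # stand-in for a recorded sentence

freqs, times, power = spectrogram(audio, fs=sr, nperseg=512, noverlap=384)
log_power = 10 * np.log10(power + 1e-10)      # dB scale, as usually plotted
print(log_power.shape)                        # (frequency bins, time frames)
```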

Insert image description here
As the figure shows, the decoded spectrograms retain the salient energy patterns of the original spectrograms, meaning they capture the important acoustic features and energy distribution of the original speech. This matters for speech reconstruction: preserving the energy pattern of the original spectrogram helps the decoded speech sound more natural and accurate.

Q: What are the acoustic characteristics and energy distribution of the original speech?
A: The acoustic characteristics and energy distribution of the original speech can be represented and analyzed with audio spectrograms. Some of the relevant acoustic features are listed below (a small measurement sketch follows the list):

  • Pitch: the fundamental frequency of the speech signal, i.e. how high or low the voice sounds. In a spectrogram, pitch appears as periodic peaks or a periodic structure in the frequency distribution.
  • Intensity (volume): the strength or energy of the sound. In a spectrogram, intensity corresponds to the amount of energy at each frequency, usually shown by color brightness.
  • Timbre: the texture or quality of a sound, which makes different sources or instruments uniquely identifiable. Timbre is reflected in the shape and distribution of the spectral envelope.
  • Noise components: besides the fundamental frequency and its harmonics, speech may contain noise such as breath sounds or environmental noise. These components usually appear as regions with a broad, diffuse energy distribution in the spectrogram.
  • Formants: frequency regions of the speech signal corresponding to resonances of the vocal tract (such as the mouth and pharynx). Formants appear as frequency bands with higher energy in the spectrogram.
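For a rough idea of how two of these quantities can be measured from a waveform, here is a minimal NumPy sketch (the frame sizes, pitch range and autocorrelation-based F0 estimate are illustrative assumptions, not the paper's feature extraction):

```python
# Minimal sketch: per-frame intensity (RMS energy) and pitch (autocorrelation
# peak) estimated from an audio signal.
import numpy as np

def frame_features(audio, sr=16000, frame=1024, hop=256, fmin=60, fmax=400):
    pitches, intensities = [], []
    for start in range(0, len(audio) - frame, hop):
        x = audio[start:start + frame] * np.hanning(frame)
        intensities.append(np.sqrt(np.mean(x ** 2)))         # RMS intensity
        ac = np.correlate(x, x, mode="full")[frame - 1:]      # autocorrelation
        lo, hi = int(sr / fmax), int(sr / fmin)               # plausible pitch lags
        lag = lo + np.argmax(ac[lo:hi])
        pitches.append(sr / lag)                              # crude F0 estimate
    return np.array(pitches), np.array(intensities)

sr = 16000
t = np.arange(0, 0.5, 1 / sr)
audio = np.sin(2 * np.pi * 150 * t)          # stand-in for a voiced segment
f0, rms = frame_features(audio, sr)
print(f0.mean(), rms.mean())                 # roughly 150 Hz, constant intensity
```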

Median spectrograms

Q: What is a median spectrogram?
A: A "median spectrogram" is a spectrogram obtained by taking the median over a set of spectrograms.
A spectrogram is a time-frequency representation of a speech signal that shows how its energy distribution across frequencies changes over time. For a set of speech signals (for example, all tokens of the same phoneme), each signal can be converted into a spectrogram and the spectrograms stacked together.
Within this set, the energy value at each time-frequency point is obtained by taking the median of the values at the corresponding position across the set. The resulting median energy values form the median spectrogram.
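A minimal sketch of the idea (synthetic stand-in data; the real analysis stacks time-aligned spectrograms of actual phoneme tokens):

```python
# Minimal sketch: a median spectrogram over a set of aligned spectrograms
# (e.g., all tokens of one phoneme). Shapes and data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
# 40 tokens of a phoneme, each a (freq bins x time frames) spectrogram,
# already time-aligned / resampled to the same length.
tokens = rng.random((40, 128, 20))

median_spec = np.median(tokens, axis=0)   # element-wise median across tokens
print(median_spec.shape)                  # (128, 20)
```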

The figure below shows reconstruction quality at the phoneme level. Median spectrograms of original and synthesized phonemes show that typical spectro-temporal patterns are preserved in the decoded samples (for example, formants F1-F3 for the vowels /iː/ and /æ/, and the key spectral patterns of the consonants /z/ and /p/, appearing as mid-band energy and a broadband burst, respectively).
The preservation and reconstruction of these patterns can be used to evaluate the accuracy and effectiveness of the decoding algorithm at the phoneme level.
Insert image description here

Q: What is a spectro-temporal pattern?
A: A "spectro-temporal pattern" is a specific pattern or characteristic of the sound signal across frequency and time. It describes how the energy of the signal is distributed and how it changes at different frequencies and time points, i.e. the specific shapes, peaks or concentrations of energy that can be observed in a spectrogram.
For example, for vowel phonemes the spectro-temporal pattern appears as a spectral shape with clear formants, which correspond to the resonance characteristics of the vocal tract. For consonant phonemes, the pattern appears as specific peaks or frequency regions where energy is concentrated, corresponding to characteristics of plosives, fricatives and other consonant classes.

Auditory tasks

In the auditory tasks, listeners were asked to listen to the synthesized speech and respond based on what they heard. The tasks were divided into word-level and sentence-level transcription. At the word level, listeners heard single words excised from the synthesized sentences and tried to identify them correctly; 325 words were evaluated, and the effects of word length (number of syllables) and word-pool size (10, 20 or 50 words) on intelligibility were quantified. The observed result was that listeners identified words more accurately as the number of syllables increased, and less accurately as the size of the word pool increased, which is consistent with natural speech perception. At the sentence level, listeners heard an entire synthesized sentence and transcribed it as accurately as possible by selecting words from a closed vocabulary (a pool of 25 or 50 words) containing the target words plus random words from the test set. The results, shown below as the mean word error rate (WER) per sentence, give a median WER of 31% for the 25-word pool and 53% for the 50-word pool.
Word level


Insert image description here
Sentence level


Insert image description here
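WER is the edit distance between the transcribed and target word sequences, normalized by the length of the target sentence. A minimal sketch (a hypothetical helper, not the study's evaluation code):

```python
# Minimal sketch of word error rate (WER) via edit distance between word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```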

Quantification of decoding performance

Feature level
The researchers conducted a feature-level quantitative assessment of decoding performance for all participants. In speech synthesis, mean Mel-Cepstral Distortion (MCD) is commonly used to report the spectral distortion between synthesized and real speech; the mel frequency bands emphasize distortion in the perceptually relevant parts of the spectrum.

Q: What is Mel-Cepstral Distortion?
A: Mel-Cepstral Distortion is based on mel-frequency cepstral coefficients (MFCCs), a feature representation commonly used in speech signal processing. MFCCs convert the speech signal into cepstral coefficients computed on a mel-spaced frequency scale that approximates the human ear's perception of audio, and are often used to represent the spectral features of speech.
Mean Mel-Cepstral Distortion measures the distance between the MFCCs of the synthesized speech and those of the target speech. It is obtained by computing the Euclidean (or another) distance between the two sets of MFCCs frame by frame and then averaging over all frames.
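A minimal sketch of the computation (using the scaling constant conventionally used in the speech-synthesis literature; the data here is a random stand-in, not the study's):

```python
# Minimal sketch of mean Mel-Cepstral Distortion between two aligned MFCC
# sequences of shape (frames x coefficients).
import numpy as np

def mean_mcd(mfcc_ref, mfcc_syn):
    # The 0th coefficient (overall energy) is typically excluded.
    diff = mfcc_ref[:, 1:] - mfcc_syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return per_frame.mean()                 # in dB; lower is better

rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 25))            # 200 frames, 25 MFCCs
syn = ref + rng.normal(scale=0.1, size=ref.shape)
print(mean_mcd(ref, syn))
```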

As shown below, the MCD of the neurally synthesized speech (Decoded) is compared with reference synthesis based on articulatory kinematics (Reference) and chance-level decoding (Shuffled); a lower MCD indicates better performance. Reference synthesis simulates perfect neural decoding of the articulatory kinematics. Across the 5 participants, the median MCD of the decoded speech ranged from 5.14 dB to 6.58 dB. The researchers also computed the correlation between the original and decoded acoustic features: for each sentence and each feature, the Pearson correlation coefficient was calculated over all samples of that feature (at a sampling rate of 200 Hz).
Insert image description here
Correlation between original and decoded acoustic features

Q: What is the Pearson correlation coefficient?
A: The Pearson correlation coefficient is a statistic used to measure the degree of linear correlation between two variables. It quantifies the strength and direction of their linear relationship.
The Pearson correlation coefficient ranges from -1 to 1, where:

  • A coefficient of 1 means the two variables are perfectly positively correlated: when one increases, the other increases proportionally.
  • A coefficient of -1 means the two variables are perfectly negatively correlated: when one increases, the other decreases proportionally.
  • A coefficient near 0 means there is almost no linear relationship between the two variables.
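A minimal sketch of the per-sentence, per-feature correlation computation described above (random stand-in data; the feature count and durations are illustrative):

```python
# Minimal sketch: Pearson correlation between original and decoded acoustic
# feature trajectories, computed feature by feature for one sentence.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 400, 32             # e.g., 2 s of features at 200 Hz
original = rng.normal(size=(n_samples, n_features))
decoded = 0.7 * original + 0.3 * rng.normal(size=original.shape)

r_per_feature = np.array(
    [np.corrcoef(original[:, k], decoded[:, k])[0, 1] for k in range(n_features)]
)
print(r_per_feature.mean())                 # average correlation across features
```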

The figure below plots, averaged across participants, the sentence-level correlations between the decoded and original acoustic features (including intensity, MFCCs, excitation strengths and voicing) and the inferred kinematics. Decoding correlations for prosodic features such as pitch (F0), the speech envelope and voicing were significantly above chance level.
Insert image description here
In addition, the authors also studied the decoding performance of other related features:

  • Decoding performance of kinematic features
    The figure below shows the correlation between all 33 decoded articulatory kinematic features and their ground-truth values. The kinematic features are obtained with electromagnetic articulography (EMA) and represent vocal tract trajectories as the X and Y coordinates, in the midsagittal plane, of articulators (points on the lips, the jaw and the tongue). In addition, there are manner features, which are complementary to EMA and further describe acoustically relevant movements. The boxplot shows the distribution of correlations across these features.
    Insert image description here

Q: What is EMA?
A: EMA stands for Electromagnetic Articulography. It is a technique used to study and record articulator movements during human speech production. EMA uses sensors and magnetic fields to capture and measure the position and trajectory of mouth and vocal tract movements. Subjects wear small sensor coils, usually placed on articulators such as the tongue, lips and jaw; these sensors measure their position and orientation through magnetic field induction, and a measuring device generates and controls the magnetic field and records the sensors' position information.

Q: What is manner?
A: In phonetics, manner (manner of articulation) is a term that describes how consonants are produced. It refers to how the airflow passes through the mouth and how different phonemes are produced using different articulators.
Manner describes the characteristics of consonant articulation. Some common manner classes are:

  • Stops (plosives): produced when the articulators completely block the airflow and then release it. For example, /p/ and /b/ are stops, pronounced by closing the lips and then suddenly releasing them.
  • Fricatives: the articulators partially block the airflow, producing friction noise. For example, /s/ and /f/ are fricatives, created by forcing air through a narrow channel.
  • Affricates: combine characteristics of stops and fricatives; they begin with a stop closure and then release into a fricative. For example, /tʃ/ (as in English "ch") is an affricate.
  • Nasals: the oral passage is blocked and air escapes through the nasal cavity. For example, /m/, /n/ and /ŋ/ (as in English "ng") are nasals.
  • Laterals: the tongue tip blocks the central channel and air flows out along the sides of the tongue. For example, /l/ is a lateral.
  • Semivowels: the articulators take a shape similar to a vowel, but the sound functions as a consonant. For example, /j/ (as in English "y") and /w/ are semivowels.
    These manner classes describe the characteristics and mechanisms of different articulation patterns, and are important for the study of speech production and phonetics.
  • Decoding performance of spectral features
    The figure below shows the correlation between all 32 decoded spectral features and their ground-truth values. The spectral features are MFCCs (mel-frequency cepstral coefficients), 25 coefficients describing the power of perceptually relevant frequency bands. In addition, there are synthesis features, which describe the glottal excitation weights required for speech synthesis. The boxplot shows the distribution of correlations across these features.
    Insert image description here

Decoder characteristics

The decoder design diagram was shown earlier and is reproduced below. The decoder has two stages: it first decodes articulatory kinematic features from the cortical (ECoG) signals, then decodes acoustic features, and finally synthesizes speech. Because the decoder is intended for clinical use, several key factors that affect its performance need to be considered.
Insert image description here

  • Point 1: data
    For patients with severe paralysis or limited speech function, it is difficult to collect a large amount of data for training the decoder.
    A direct decoder decodes the acoustic features straight from the cortical signals, without first converting them into kinematic features through the articulatory kinematics decoder. As shown in the figure below, reliable decoding performance is already obtained with only 25 minutes of speech data, and performance keeps improving as the amount of data increases. However, without the intermediate articulatory step, decoding directly from ECoG to acoustic features performs 0.54 dB worse in MCD (a difference of 0.2 dB is perceptually noticeable).
    Insert image description here
    The authors therefore designed a two-stage decoder, and found that explicitly modeling articulatory kinematics as an intermediate representation has clear advantages over decoding acoustic features directly from the cortical electrode (ECoG) signals. The gap between the two decoders is also visible in the figure above: articulatory kinematics directly reflects the movements underlying speech production and is more directly related to the acoustic features, so decoding the kinematics first and then mapping kinematics to acoustics recovers the speech features more accurately.
  • Point 2: preserved phonetic properties
    When decoding speech from brain signals, it is important to ensure that the synthesized speech preserves the phonetic characteristics of natural speech: the decoded speech should sound like the intended words or utterance and should be properly understood and recognized by listeners.
    The authors therefore evaluated the similarity between synthesized and natural speech to understand which phonetic properties were preserved; this can include analyzing and comparing aspects such as the accuracy of phonemes (the basic units of speech), intonation, pitch, speaking rate and prosody. They used the Kullback-Leibler (KL) divergence to compare the spectral feature distribution of each decoded phoneme with the distribution of the corresponding real phoneme and quantify their similarity. (For an intuitive introduction to KL divergence, see the blog post "Machine Learning - Intuitive Understanding of KL Divergence + Code"; a small sketch is also given after this list.)
    As shown in the figure below, the authors computed an acoustic similarity matrix comparing the acoustic properties of the decoded phonemes and the originally articulated phonemes. Similarity was calculated by first estimating a Gaussian kernel density for each phoneme (decoded and original) and then computing the KL divergence between the decoded and original phoneme distributions. Each row compares the acoustic properties of a decoded phoneme with those of the originally articulated phonemes (columns). Hierarchical clustering was performed on the resulting similarity matrix. Data are from P1.
    Insert image description here
  • Point 3: electrode placement
    Figure f below shows the electrode placement over the three regions involved in speech production (IFG, STG, vSMC).
    Figure g shows the decoding performance after removing the electrodes of one of the regions. Removing the electrodes of any region degrades decoding performance, and removing the vSMC electrodes causes the most severe drop, with MCD worsening by 1.13 dB.
    Insert image description here
  • Point 4: generalizability
    The authors then checked whether a decoder trained on a fixed set of sentences can be used on new sentences. They compared two decoders: one trained on all sentences, including those in the test set, and one trained with the test-set sentences held out. The experimental results, shown in the figure below, indicate that the two decoders show no significant difference in MCD or in spectral feature correlation, whether evaluated on the same sentences or on new ones.
    This shows that the decoder generalizes to arbitrary words and sentences on which it was never trained.

Insert image description here
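Returning to Point 2, the sketch below illustrates the KL-divergence comparison on toy one-dimensional data (the paper's analysis uses multi-dimensional spectral features and kernel densities per phoneme; everything here is illustrative):

```python
# Minimal sketch: KL divergence between two phoneme feature distributions,
# each modeled with a Gaussian kernel density estimate and compared on a grid.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
decoded_feat = rng.normal(loc=0.2, scale=1.0, size=500)    # a feature of a decoded phoneme
original_feat = rng.normal(loc=0.0, scale=1.0, size=500)   # the same feature of the original phoneme

p = gaussian_kde(decoded_feat)       # density of the decoded phoneme
q = gaussian_kde(original_feat)      # density of the original phoneme

grid = np.linspace(-6, 6, 2001)
dx = grid[1] - grid[0]
p_pdf = p(grid) + 1e-12
q_pdf = q(grid) + 1e-12
p_pdf /= p_pdf.sum() * dx            # renormalize on the grid
q_pdf /= q_pdf.sum() * dx

kl_pq = np.sum(p_pdf * np.log(p_pdf / q_pdf)) * dx
print(kl_pq)                         # near 0 when the distributions are similar
```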

Synthesizing mimed speech

Q: What is mimed speech? What is synthesizing mimed speech?
A: "Mimed speech" refers to producing speech-like articulatory gestures without actually phonating, i.e. making the movements of the articulators (such as the lips, tongue and jaw) without producing audible sound.
"Synthesizing mimed speech" is the process of generating audible speech from these silent articulations: the movements and gestures performed during silent miming are converted into understandable, recognizable speech sounds.

When a person speaks aloud, auditory feedback is generated. This feedback is transmitted to the cerebral cortex and contributes to its electrical activity. Since that cortical activity (the ECoG signal) is the input to the decoder, we need to test whether the decoder relies on signals driven by auditory feedback.

Q: What is auditory feedback?
A: Auditory feedback refers to the process by which the sounds a person produces while speaking are transmitted back to the brain through the auditory system. When we speak, sound is radiated through the vocal tract (throat, mouth and nasal cavity) as acoustic waves; these waves reach our ears, are converted into neural signals in the inner ear and the auditory nerve, and are finally relayed through neural pathways to the auditory cortex.

The authors therefore designed a silent-miming test of the decoder. A held-out set of 58 sentences was used, in which one participant (P1) first read each sentence aloud and then silently mimed the same sentence, making the same articulatory movements but producing no sound. The figure below shows the original speech (a), the speech decoded in the audible condition (b) and the speech decoded in the mimed condition (c). Although the decoder was never trained on mimed sentences, the spectrograms of the speech synthesized from mimed trials show spectral patterns similar to those of the speech synthesized from audible trials of the same sentence. The authors then computed the MCD and spectral feature correlations between the synthesized and original speech under the different conditions, as shown in the following figure: synthesis from mimed speech is somewhat worse than synthesis from audible speech, but it demonstrates that silent speech synthesis is possible.
Insert image description here

Insert image description here

State-space of decoded speech articulation

Q: What does the title mean?
A: "Decoded speech articulation" refers to the articulatory movements recovered by the decoder, i.e. speech information restored from its encoded form. "State space" is a mathematical model used to describe how the state of a system changes over time. So the "state-space of decoded speech articulation" means describing the time course of the decoded articulation with a state-space model. Such a model can include state variables such as the positions, velocities and accelerations of the lips, tongue and jaw, representing how the articulators change over time. With this state-space view we can analyze the temporal and spatial characteristics of the decoded articulation and better understand the speech production process.

Because modeling the underlying kinematics improves decoding performance, the next goal is to better understand the nature of the kinematic features decoded from population neural activity.
The authors performed principal component analysis (PCA) on the articulatory kinematic features and projected them into the resulting state space to obtain low-dimensional kinematic state-space trajectories. Of the 33 components, the first 10 principal components (PCs) explained 85% of the variance, and the first two PCs explained 35%. The results are shown in the figure below: for the kinematics and acoustics of each participant's speech, a PCA was computed and the cumulative variance explained by each additional principal component was plotted.
Insert image description here

Q: What is PCA?
A: PCA stands for Principal Component Analysis. It is a commonly used statistical method and dimensionality-reduction technique for analyzing and processing multidimensional data sets.
The main goal of PCA is to transform high-dimensional data into a low-dimensional representation through a linear transformation while retaining as much of the original variance as possible. By finding a new set of orthogonal variables, called principal components, PCA represents the original data in a new coordinate system. Each principal component is a linear combination of the original variables, and the components are ordered by decreasing variance, from the largest to the smallest source of variability in the data.

Q: Why does PCA retain the maximum variance of the data?
A: When we reduce dimensionality or extract features, we want to retain as much useful information as possible while removing redundancy and noise. Variance can be seen as a measure of useful information: a larger variance means the data varies more along that direction, and vice versa.
By retaining as much variance as possible, PCA captures the most significant patterns of variation in the data. Directions with large variance correspond to important features and variability, while directions with small variance tend to correspond to noise or redundant information. Therefore, by selecting the principal components with the largest variances, PCA extracts the most salient structure in the data and reduces its dimensionality while keeping the key information.
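A minimal sketch of this analysis on synthetic stand-in data (the real input would be the decoded 33-dimensional kinematic trajectories):

```python
# Minimal sketch: PCA of kinematic feature trajectories and the cumulative
# variance explained by each additional component.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for decoded kinematics: time samples x 33 articulatory features,
# generated from a few latent dimensions so that a low-rank structure exists.
latents = rng.normal(size=(5000, 5))
mixing = rng.normal(size=(5, 33))
kinematics = latents @ mixing + 0.1 * rng.normal(size=(5000, 33))

pca = PCA(n_components=33)
scores = pca.fit_transform(kinematics)             # state-space trajectories
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var[:10])                                # variance explained by the first 10 PCs
trajectory_pc1_pc2 = scores[:, :2]                 # projection used in the figures
```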

In panels a and b below, the kinematic trajectory of one sentence is projected onto the first two principal components; these trajectories are decoded well. (Panel e shows a median r > 0.75 for all participants except P5, where r is the average correlation over the first two principal components.) In addition, mimed speech was also decoded well in this state space, with a median r = 0.6.
Moreover, these trajectories appear to reflect the dynamics of syllabic patterning in continuous speech: the trajectories for consonants (grey lines) and vowels (blue lines) fall on troughs and peaks, respectively. The study found that the kinematic trajectories show a biphasic pattern between vowels and consonants, i.e. transitions from a "high" state to a "low" state (white) and vice versa (black). In panels c and d below, every vowel-to-consonant transition (n = 22453) and consonant-to-vowel transition (n = 22453) was sampled and the average PC1 and PC2 trajectories over 500 ms were plotted. The researchers found that PC1 and PC2 preserved the biphasic trajectories for vowel and consonant states but showed little specificity for particular phonemes. This suggests that PC1 and PC2 describe not just the opening and closing of the jaw but a global opening-and-closing configuration of the vocal tract. These findings are consistent with theoretical accounts of human speech behavior, which argue that the high-dimensional acoustics of speech can be explained in a lower-dimensional articulatory kinematic state space; in other words, human speech behavior can be represented and understood with a small number of kinematic features.
Insert image description here
Insert image description here

Insert image description here

By projecting productions of the same sentence by different participants into their respective kinematic state spaces and performing a correlation analysis, the similarity of the decoded state-space trajectories was assessed. The state-space trajectories turned out to be very similar, with correlation coefficients greater than 0.8. This suggests that the decoder most likely relies on a representation that is shared across speakers. Such a shared representation is crucial for generalization, because it means the decoder can build on a common representation across different speakers.
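A rough sketch of such a cross-speaker comparison on toy data (the real analysis uses the participants' decoded kinematics and proper temporal alignment; the resampling here is only a crude stand-in for alignment):

```python
# Minimal sketch: project two speakers' kinematics for the "same sentence"
# into their own PCA spaces, resample to a common length, correlate the PCs.
import numpy as np
from sklearn.decomposition import PCA

def pc_trajectory(kinematics, n_pcs=2):
    return PCA(n_components=n_pcs).fit_transform(kinematics)

def resample(traj, n_points=200):
    old = np.linspace(0, 1, len(traj))
    new = np.linspace(0, 1, n_points)
    return np.stack([np.interp(new, old, traj[:, k]) for k in range(traj.shape[1])], axis=1)

rng = np.random.default_rng(0)
shared = np.sin(np.linspace(0, 20, 480))[:, None] * rng.normal(size=(1, 33))  # shared sentence dynamics
speaker1 = shared[:400] + 0.1 * rng.normal(size=(400, 33))   # same sentence, different duration
speaker2 = shared[:480] + 0.1 * rng.normal(size=(480, 33))

t1 = resample(pc_trajectory(speaker1))
t2 = resample(pc_trajectory(speaker2))
r_pc1 = np.corrcoef(t1[:, 0], t2[:, 0])[0, 1]
print(abs(r_pc1))    # the sign of a PC is arbitrary, so compare |r|
```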

For people who cannot speak, it may be more intuitive and faster to first learn to use the kinematics decoder (the first stage), while using a kinematics-to-acoustics decoder (the second stage) that was already trained on independently collected speech data; that is, the two stages are trained on data from different people.
The figure below shows the synthesis performance when the second stage is transferred from a source participant (P1) to a target participant (P2). The acoustic transfer performed well, although slightly worse than when both stages were trained only on the target participant (P2), possibly because the MCD metric is sensitive to speaker identity.
Insert image description here

Summary

  • For patients with severe paralysis or speech impairment, an efficient communication method is needed to help them communicate. The fastest approach at present is to decode speech directly from the neural activity of the cerebral cortex.
  • The authors propose a two-stage decoder: it first decodes an articulatory kinematic representation from the cortical (ECoG) signals, then decodes acoustic features from that kinematic representation, and finally synthesizes the speech signal.
  • The authors ran many experiments to analyze the decoder, such as the spectrogram and median spectrogram comparisons, and designed auditory tasks to test the synthesized speech. They then analyzed the four key factors that affect the decoder's performance.
  • To rule out the influence of auditory feedback on the decoder, the authors also ran a comparison experiment on synthesizing mimed (silent) speech, which still achieved good decoding results and brings more possibilities for clinical applications.
  • The authors also used principal component analysis to characterize the kinematic features decoded from population neural activity. The experimental results show that speech synthesis needs only a small number of articulatory kinematic features to achieve good results.
  • Projecting productions of the same sentence by different participants into their respective kinematic state spaces showed that the state-space trajectories were very similar, suggesting that the decoder relies on representations shared across speakers. The mimed-speech and cross-participant transfer experiments further support this.
