Review of End-to-End Speech Translation in 2023 (Recent Advances in Direct Speech-to-text Translation)

The content of this review comes from the literature:

[1] Xu C, Ye R, Dong Q, et al. Recent Advances in Direct Speech-to-text Translation[J]. arXiv preprint arXiv:2306.11646, 2023.

Table of contents

1 Introduction

2 Tackling Modeling Burden

2.1 Transformer and Variants

Speech-Transformer

Conformer

SSL-Transformer

2.2 Multitask Frameworks

Decoupled Decoder

Decoupled Encoder

Two-stream Encoder

2.3 Non-autoregressive Modeling

3 Tackling Data Scarcity

3.1 Data Augmentation

Expanding ST data

Speech augmentation

3.2 Pre-training

3.3 Knowledge Distillation

3.4 Multilingual Training

4 Tackling Application Issues

Real-time

Segmentation

Named entity

Code-switching

Gender bias

5 Future

LLM (Large Language Model)

Multimodality


1 Introduction

Glossary:

  • Error accumulation: in a multi-step pipeline, errors made in an earlier step propagate into later steps, so the quality of the final result gradually degrades. This typically happens in pipelines that combine a speech-to-text (Automatic Speech Recognition, ASR) system with a text-to-text (machine translation or text processing) system: the audio signal is first transcribed into text, and the text is then translated into the target language or otherwise processed. Any errors made during transcription are passed on to the later steps and degrade the quality of the final translation or transcription.

  • Autoregressive: in an E2E ST (End-to-End Speech Translation) model, "autoregressive" means the model generates the translated text one word or subword at a time, with each step conditioned on what was generated at previous time steps. This is a step-by-step, serial generation process. Typical autoregressive models include recurrent neural networks (RNN), long short-term memory networks (LSTM), and the Transformer.

  1. Early speech-to-text translation (ST) solutions split the task into multiple subtasks handled by a cascade system.

    • For example, the speech is first transcribed into text through the ASR (Automatic Speech Recognition) system, and then the text is translated into another language using the MT (Machine Translation) system.

    • For such cascade systems, research mainly focuses on mitigating error accumulation.

  2. End-to-end speech translation (E2E ST) has the following benefits:

    • Reduced error accumulation

    • Lower latency

    • Richer contextual modeling

    • Applicability to unwritten languages

  3. Basic modeling:

    • An ST corpus usually contains triples of speech s, source-language transcription x, and translation y

    • The basic E2E ST model framework is based on the Encoder-Decoder architecture

    • However, training an E2E ST model is not easy; its performance only approaches that of cascade systems and is not yet the best-performing approach. (A minimal form of the standard training objective is sketched below.)
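For reference, a minimal sketch of the standard autoregressive training objective on a triple (s, x, y), written in assumed notation (the transcription x is used only by auxiliary tasks such as those in Section 2.2):

```latex
% Minimal sketch (assumed notation): s = speech input, y = target translation of length T,
% \theta = parameters of the encoder-decoder model.
\mathcal{L}_{\mathrm{ST}}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_{<t},\, s\right)
```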

  4. At present, the research directions of the E2E ST model are mainly:

    • Modeling Burden:

      • The model must handle the cross-modal (speech to text) and cross-lingual (source language to target language) mappings at the same time, which makes modeling very complicated.

      • This makes convergence difficult and hurts performance.

    • Data scarcity:

      • ASR and MT corpora are plentiful, and some of them are very large.

      • However, ST corpora are much harder to annotate, so ST data is scarce.

    • Application issues:

      • Issues in practical applications need to be considered, such as real-time translation, long-form audio segmentation, etc.

  5. Based on the above problems and the corresponding solutions, the survey organizes the literature into a classification diagram.

The following sections discuss these three aspects:

  • Section 2 describes how to mitigate modeling burden challenges in the existing literature. Modeling methods can be divided into three categories: Transformers and their variants, multi-task frameworks, and non-autoregressive modeling.

  • Section 3 summarizes approaches to address the data scarcity problem, including data augmentation, pre-training, knowledge distillation, and multilingual training.

  • Section 4 briefly introduces practical application issues.

  • Section 5 predicts some promising directions for future ST research.

2 Tackling Modeling Burden

In response to the problems raised above, this section covers three aspects:

  • Section 2.1: for long-sequence inputs such as speech signals, high-capacity end-to-end models are used, usually the Transformer and its variant architectures.

  • Section 2.2: to ease the modeling burden, a multi-task learning framework is usually used to modify the original Transformer-based model.

  • Section 2.3: to improve decoding efficiency, non-autoregressive models are used to speed up decoding.

2.1 Transformer and Variants

ST is usually modeled with a Seq2Seq Encoder-Decoder architecture, and the Transformer stands out among this class of models. Several Transformer variants are described below.

Speech-Transformer
  • Based on text-to-text Transformer

  • The main change is that the acoustic features are first compressed by convolutional layers (usually two layers with stride 2, reducing the sequence length by a factor of 4), followed by a normalization layer, before entering the self-attention encoder (a minimal sketch follows).
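A minimal sketch of this convolutional subsampling front-end (illustrative layer sizes, not the exact configuration of any specific paper):

```python
import torch
import torch.nn as nn

class ConvSubsampler(nn.Module):
    """Two stride-2 convolutions: (B, T, F) filter-bank features -> (B, T/4, d_model)."""
    def __init__(self, feat_dim: int = 80, d_model: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.proj = nn.Linear(d_model * ((feat_dim + 3) // 4), d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = self.conv(feats.unsqueeze(1))           # (B, C, T/4, F/4)
        b, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(b, t, c * f)  # (B, T/4, C*F/4)
        return self.norm(self.proj(x))              # (B, T/4, d_model)

# e.g. 3 s of 10 ms frames: 300 frames are compressed to 75 frames
print(ConvSubsampler()(torch.randn(2, 300, 80)).shape)  # torch.Size([2, 75, 256])
```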

Conformer
  • The main change is that, in each encoder block, a convolution module is inserted between the multi-head self-attention module and the feed-forward layer.

  • Each block sandwiches the attention and convolution components between two Macaron-net-style feed-forward layers, with residual connections around the modules (a simplified sketch of the ordering follows).
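A minimal sketch of the module ordering inside one Conformer-style block (simplified: relative positional attention, gating, and dropout are omitted):

```python
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    """Macaron-style ordering: 1/2 FFN -> self-attention -> convolution module -> 1/2 FFN,
    each wrapped in a residual connection."""
    def __init__(self, d: int = 256, heads: int = 4, kernel: int = 15):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
        self.norm_attn = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d)
        self.depthwise = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.ffn2 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
        self.norm_out = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (B, T, d)
        x = x + 0.5 * self.ffn1(x)                              # first half-step feed-forward
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]       # multi-head self-attention
        c = self.norm_conv(x).transpose(1, 2)                   # (B, d, T) for Conv1d
        x = x + self.depthwise(c).transpose(1, 2)               # depthwise convolution module
        x = x + 0.5 * self.ffn2(x)                              # second half-step feed-forward
        return self.norm_out(x)

print(ConformerBlockSketch()(torch.randn(2, 75, 256)).shape)    # torch.Size([2, 75, 256])
```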

SSL-Transformer
  • This is a speech representation model combined with self-supervised learning (SSL)

  • SSL has been successfully applied to the task of extracting speech features

  • The SSL-Transformer feeds the raw audio waveform into the self-supervised model, which processes it through multiple convolutional and encoder layers to extract speech features.

  • In the SSL-Transformer setup, the self-supervised model can be integrated into the ST model in two ways: as a standalone encoder, or as a speech feature extractor whose outputs are then fed into a full Transformer model (a minimal sketch of the second option follows).
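A minimal sketch of the feature-extractor variant, assuming the Hugging Face `transformers` package is available and using `facebook/wav2vec2-base-960h` purely as an example checkpoint (downloaded on first use):

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # assumes the Hugging Face `transformers` package

class SSLTransformerST(nn.Module):
    """A pre-trained SSL model used as a (frozen) speech feature extractor in front of a
    standard Transformer encoder-decoder for translation (sketch)."""
    def __init__(self, vocab_size: int = 8000, d_model: int = 768):
        super().__init__()
        self.ssl = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.ssl.requires_grad_(False)              # one option: keep the SSL features fixed
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, waveform: torch.Tensor, prev_tokens: torch.Tensor) -> torch.Tensor:
        feats = self.ssl(waveform).last_hidden_state        # (B, T', 768) speech representations
        dec = self.transformer(feats, self.embed(prev_tokens))
        return self.out(dec)                                # (B, L, vocab) translation logits

# waveform: (B, num_samples) raw 16 kHz audio; prev_tokens: (B, L) shifted target tokens
```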

2.2 Multitask Frameworks

To ease the modeling burden, the core idea of multi-task learning is to use auxiliary tasks, such as ASR and MT, to assist the target task. Some parameters of the main-task modules and the auxiliary modules can be shared, which is what makes the auxiliary tasks feasible. There are currently three types of multitask frameworks:

Decoupled Decoder

Additional decoders are used to guide the model to learn the text transcription while the model is still trained in an end-to-end manner. There are two main ideas: one is to better support translation with the generated transcription, for example with a two-pass decoder; the other is to generate the transcription and the translation at the same time (dual decoder). A minimal sketch of the two-pass idea follows the list.

  • Two-pass decoder: the acoustic features are first decoded into a transcription, and the second pass then combines the transcription and the first decoder's states to produce the translation. However, because the two passes run sequentially, the inherent low-latency advantage is lost, so some works decode the first pass non-autoregressively.

  • Dual decoder: interactive decoding uses two decoders to generate the transcript and the translation simultaneously, with an additional cross-attention module exchanging information between the two decoders. A wait-k policy lets the transcription tokens be predicted slightly ahead, providing more useful information for decoding the translation tokens.
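A minimal sketch of the two-pass idea (hypothetical module names and sizes; a dual decoder would instead run both decoders in parallel and exchange information via cross-attention):

```python
import torch
import torch.nn as nn

class TwoPassDecoderSketch(nn.Module):
    """Pass 1 decodes the transcription; pass 2 decodes the translation while attending to
    both the speech encoder states and the pass-1 decoder states (masks omitted for brevity)."""
    def __init__(self, vocab: int = 8000, d: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.asr_decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d, 4, batch_first=True), 2)
        self.st_decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d, 4, batch_first=True), 2)
        self.asr_out = nn.Linear(d, vocab)
        self.st_out = nn.Linear(d, vocab)

    def forward(self, enc: torch.Tensor, prev_src: torch.Tensor, prev_tgt: torch.Tensor):
        h_asr = self.asr_decoder(self.embed(prev_src), enc)   # pass 1 over encoder output (B, T, d)
        memory = torch.cat([enc, h_asr], dim=1)               # pass 2 sees encoder + pass-1 states
        h_st = self.st_decoder(self.embed(prev_tgt), memory)
        return self.asr_out(h_asr), self.st_out(h_st)

enc = torch.randn(2, 75, 256)
asr_logits, st_logits = TwoPassDecoderSketch()(enc, torch.zeros(2, 10, dtype=torch.long),
                                               torch.zeros(2, 12, dtype=torch.long))
```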

Decoupled Encoder

For decoupled decoders, design and latency issues can arise from the multiple inference passes. An alternative is to both recognize and understand the semantics of the raw speech input with a decoupled encoder: a low-level speech (acoustic) encoder first encodes the acoustic information from the speech input, and a semantic encoder then learns the semantic representation needed by the translation decoder. A minimal sketch follows the list.

  • Each stage of encoding can be supervised with transcribed information

  • Transcription also provides speech alignment, which can alleviate the encoding burden
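A minimal sketch of the decoupled encoder with CTC-style transcription supervision on the acoustic stage (illustrative layer counts and sizes):

```python
import torch
import torch.nn as nn

class DecoupledEncoderSketch(nn.Module):
    """An acoustic encoder (supervised with a CTC loss on the transcription) followed by a
    semantic encoder whose output is consumed by the translation decoder."""
    def __init__(self, d: int = 256, src_vocab: int = 8000):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d, 4, batch_first=True)
        self.acoustic = nn.TransformerEncoder(make_layer(), num_layers=4)
        self.semantic = nn.TransformerEncoder(make_layer(), num_layers=4)
        self.ctc_head = nn.Linear(d, src_vocab)      # predicts transcription labels for the CTC loss

    def forward(self, feats: torch.Tensor):
        h_ac = self.acoustic(feats)                  # low-level acoustic representation
        ctc_logits = self.ctc_head(h_ac)             # supervised with the transcription (CTC)
        h_sem = self.semantic(h_ac)                  # semantic representation for the decoder
        return h_sem, ctc_logits

h_sem, ctc_logits = DecoupledEncoderSketch()(torch.randn(2, 75, 256))
```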

Two-stream Encoder

ASR data can be used to strengthen individual components; can MT data be used as well? During training, the model can take speech and text input at the same time, each with its own encoder, followed by a shared encoder. This structure is usually optimized with multi-task training losses, such as negative log-likelihood (NLL) losses for speech translation (ST) and machine translation (MT); a minimal training-step sketch follows the list below. The advantage is that, by sharing with the MT encoder, a better semantic representation can be learned, improving translation performance.

In the inference process, the speech data is input, passes through the speech encoder, shared encoder, and decoder, and finally generates the translated text.

  • Speech encoder: it must be able to extract acoustic features from the speech input on its own. Pre-trained speech models such as Wav2vec2 can serve as the speech encoder for better ST performance.

  • Text encoder: the text encoder can be a text embedding layer or a few layers of a text Transformer encoder. Phonemes can also be used instead of the original transcription as the text input, which reduces the modality gap between the two inputs.

  • Interaction: there are also many variants of how the speech encoder and the text encoder interact.

    • Some use contrastive learning to reduce the representation gap between speech and text.

    • The Chimera model has been proposed to align the length of speech and text expressions.

    • Other methods consider both the representation and the length differences by adding a cross-attentive regularization module after the shared encoder. The module first applies self-attention or cross-attention to the text or speech encoder outputs to generate two reconstructed sequences of the same length, and then minimizes the L2 distance between them. (I find this approach appealing.)
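A minimal sketch of one two-stream training step, with the encoders and decoder passed in as callables (hypothetical; the interaction losses above would be added on top):

```python
import torch.nn.functional as F

def two_stream_step(speech_enc, text_enc, shared_enc, decoder,
                    speech_feats, src_tokens, prev_tgt, gold_tgt, pad_id: int = 0):
    """One multi-task step: NLL(ST) from the speech stream plus NLL(MT) from the text stream,
    both passing through the shared encoder and the same decoder (simplified sketch)."""
    h_speech = shared_enc(speech_enc(speech_feats))   # speech -> shared semantic space
    h_text = shared_enc(text_enc(src_tokens))         # transcription/phonemes -> same space
    st_logits = decoder(prev_tgt, h_speech)           # (B, L, V)
    mt_logits = decoder(prev_tgt, h_text)             # (B, L, V)
    nll = lambda logits: F.cross_entropy(logits.transpose(1, 2), gold_tgt, ignore_index=pad_id)
    return nll(st_logits) + nll(mt_logits)            # optionally + contrastive/regularization terms
```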

2.3 Non-autoregressive Modeling

Compared with cascade systems of similar capacity, the end-to-end model already reduces computation latency considerably, but with autoregressive decoding the output is still generated token by token, which limits decoding speed. Research on non-autoregressive ST follows two routes:

  • Non-autoregressive speech translation models are developed with reference to methods from automatic speech recognition (ASR) and machine translation (MT) tasks, such as conditional masking language models and rescoring techniques.

  • Explore more efficient architectures that rely purely on CTC (Connectionist Temporal Classification) for prediction to increase speed. CTC is a loss function for sequence labeling tasks that trains a model to map input sequences to output sequences (a minimal usage sketch follows this list).
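For reference, a minimal sketch of training a non-autoregressive output layer with PyTorch's built-in CTC loss (generic setup, not any specific paper's configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Encoder outputs projected to target-vocabulary logits, shaped (T, B, V); blank index 0.
T, B, V = 75, 2, 8000
log_probs = F.log_softmax(torch.randn(T, B, V), dim=-1)
targets = torch.randint(1, V, (B, 20))                   # target translation token ids (no blanks)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)   # all tokens predicted in parallel
print(loss.item())
```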

3 Tackling Data Scarcity

Compared with MT or ASR, ST has very little training data. There are two main lines of work:

  • Extended datasets and data augmentation: introduced in 3.1

  • Mining useful information from MT or ASR data:

    • Pre-training: introduced in 3.2

    • Knowledge distillation: introduced in 3.3

3.1 Data Augmentation

This is the most straightforward approach when training data is very sparse.

Expanding ST data
  • Directly use a high-quality MT model to translate the transcripts of large amounts of ASR data, producing pseudo ST triples. This method is also called "pseudo-labeling" or sequence-level knowledge distillation (SeqKD); a minimal pipeline sketch follows this list.

  • There is also bidirectional SeqKD, which combines forward SeqKD and reverse SeqKD and is very useful for bilingual end-to-end speech translation (bilingual E2E-ST) models.

  • Augmentation in the reverse direction, i.e., augmenting the speech side, is also possible: a text-to-speech (TTS) model converts the source-language text of MT corpora into synthetic speech.
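A minimal sketch of the pseudo-labeling pipeline, with a hypothetical `mt_translate` callable standing in for any high-quality MT system:

```python
from typing import Callable, Iterable

def expand_st_data(asr_corpus: Iterable[tuple[str, str]],
                   mt_translate: Callable[[str], str]) -> list[tuple[str, str, str]]:
    """Turn (speech_path, transcription) ASR pairs into (speech_path, transcription, pseudo_translation)
    ST triples by translating each transcription with an MT teacher (sequence-level KD)."""
    st_corpus = []
    for speech_path, transcription in asr_corpus:
        pseudo_translation = mt_translate(transcription)   # the MT model acts as the teacher
        st_corpus.append((speech_path, transcription, pseudo_translation))
    return st_corpus

# Toy usage with a dummy "MT system":
print(expand_st_data([("utt1.wav", "hello world")], lambda s: f"<de> {s}"))
```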

Speech augmentation
  • SpecAugment: operates on the filter-bank coefficients of the speech input, including time warping and masking of blocks of frequency channels and time steps (a minimal masking sketch follows this list).

  • SkinAugment: uses autoencoding speaker conversion to transform the original speaker's voice into another speaker's voice, which helps the model adapt to different speakers.

  • Data diversity: the value of the original speech translation data can be increased by segmenting and recombining it in various ways.
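A minimal sketch of the masking part of SpecAugment (time warping omitted; the mask-size parameters are illustrative assumptions):

```python
import torch

def spec_augment(feats: torch.Tensor, freq_mask: int = 27, time_mask: int = 40,
                 n_freq: int = 2, n_time: int = 2) -> torch.Tensor:
    """Zero out random frequency-channel blocks and time-step blocks of a (T, F) filter-bank matrix."""
    x = feats.clone()
    T, F = x.shape
    for _ in range(n_freq):                                   # frequency masking
        f = int(torch.randint(0, freq_mask + 1, (1,)))
        f0 = int(torch.randint(0, max(1, F - f), (1,)))
        x[:, f0:f0 + f] = 0.0
    for _ in range(n_time):                                   # time masking
        t = int(torch.randint(0, time_mask + 1, (1,)))
        t0 = int(torch.randint(0, max(1, T - t), (1,)))
        x[t0:t0 + t, :] = 0.0
    return x

augmented = spec_augment(torch.randn(300, 80))   # 300 frames x 80 filter-bank channels
```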

3.2 Pre-training

Pre-training has delivered very good results on many tasks in the AI field, and the current state-of-the-art E2E ST models almost all involve pre-training. It falls into two categories:

  • Separate pre-training: pre-training some of the model parameters, or pre-training different sub-modules on different tasks. Earlier work explored better pre-training methods to strengthen the encoder's semantic understanding, for example curriculum learning, the self-supervised Masked Acoustic Modeling (MAM) method, and the MAM-based FAT.

  • Joint pre-training: the model (including all encoder and decoder modules) participates in pre-training as a whole. Joint pre-training usually uses a multi-task learning framework (i.e., what is introduced in 2.2). Building a unified model in multi-task pre-training and then fine-tuning it on specific tasks can improve performance on multiple speech- and text-related tasks while reducing the cost of data annotation.

3.3 Knowledge Distillation

Knowledge Distillation is a technique for training deep neural networks in which one neural network (usually a large, complex model) teaches its knowledge to another neural network (usually a small, simple model). The goal of this process is to transfer the complexity and performance of the large model to the small model, allowing the small model to achieve similar performance as the large model while having lower computational and memory requirements.

Knowledge distillation (KD) is often used for model compression: the output of a larger, usually better-performing teacher model guides the learning of a student model, with the expectation that the student reaches similar performance. With limited data, how can ST performance be brought close to that of an MT teacher? There are several methods:

  • Use the ST model and the MT model to predict translation tokens separately, and use the MT model's predicted probabilities as a teacher to guide the ST output (a minimal word-level KD sketch follows this list).

  • Use the two-stream encoder framework (mentioned in 2.2; it bridges the representation gap between speech and text) and distill knowledge from its mixed speech-text sequence into the speech-to-text translation module. This helps the module better understand and translate the speech input.
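A minimal sketch of the first idea, word-level knowledge distillation against an MT teacher (generic formulation; temperature scaling omitted):

```python
import torch
import torch.nn.functional as F

def word_level_kd_loss(st_logits: torch.Tensor, mt_teacher_logits: torch.Tensor,
                       gold_tgt: torch.Tensor, pad_id: int = 0, alpha: float = 0.5) -> torch.Tensor:
    """Mix the usual NLL against the gold translation with a KL term pulling the ST student's
    token distributions toward the MT teacher's. Logits: (B, L, V); gold_tgt: (B, L)."""
    nll = F.cross_entropy(st_logits.transpose(1, 2), gold_tgt, ignore_index=pad_id)
    kd = F.kl_div(F.log_softmax(st_logits, dim=-1),
                  F.softmax(mt_teacher_logits, dim=-1).detach(),
                  reduction="batchmean")
    return (1 - alpha) * nll + alpha * kd

loss = word_level_kd_loss(torch.randn(2, 12, 8000), torch.randn(2, 12, 8000),
                          torch.randint(1, 8000, (2, 12)))
```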

3.4 Multilingual Training

Multilingual translation is a separate research category. As with MT, adding language indicators (e.g. <2de>, <2fr>) to the decoder is the most direct and efficient way to evolve from a bilingual ST to a multilingual ST. In fact, with limited data in each translation direction, training a many-to-many multilingual ST model is better than training a bilingual ST model alone, because the multilingual model can capture more pronunciation similarities between languages.
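A minimal sketch of the language-indicator trick (where exactly the tag is inserted, e.g. on the decoder side or the source side, varies across systems; this is only an illustration):

```python
def add_language_tag(target_tokens: list[str], tgt_lang: str) -> list[str]:
    """Prepend a target-language indicator such as <2de> or <2fr>, so that a single decoder
    (with the tags added to its vocabulary) can serve many target languages."""
    return [f"<2{tgt_lang}>"] + target_tokens

print(add_language_tag(["Hallo", "Welt"], "de"))            # ['<2de>', 'Hallo', 'Welt']
print(add_language_tag(["Bonjour", "le", "monde"], "fr"))   # ['<2fr>', 'Bonjour', 'le', 'monde']
```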

Current research on multilingual ST mainly focuses on:

  • In terms of pre-training, such as how to build a unified multi-language speech and text pre-training model and how to design various effective pre-training tasks

  • Efficient fine-tuning, such as

    • Fine-tuning only the parameters of the layer-norm and attention layers is more effective than fine-tuning all parameters; tuning just these specific layers can improve system performance.

    • Freeze the pre-trained ASR encoder and the mBART decoder, and fine-tune only the language-specific adapter modules to perform one-to-many speech translation, on the basis of a multilingual system with a parameter scale of only tens of millions. This has also proven effective for improving the performance of multilingual speech translation systems (a minimal parameter-freezing sketch follows this list).
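A minimal sketch of such parameter-freezing recipes (the keyword strings are hypothetical and depend on the actual model's parameter names):

```python
import torch.nn as nn

def freeze_except(model: nn.Module, trainable_keywords: tuple[str, ...]) -> None:
    """Freeze every parameter whose name does not contain one of the given keywords."""
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keywords)

# Recipe 1 (hypothetical parameter names): fine-tune only layer-norm and attention parameters.
# freeze_except(st_model, ("layer_norm", "self_attn", "encoder_attn"))

# Recipe 2: freeze the pre-trained ASR encoder and mBART decoder, train only the adapters.
# freeze_except(st_model, ("adapter",))
```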

4 Tackling Application Issues

Most current research is still conducted with manually segmented audio in noise-free environments, but the requirements of practical applications also need to be addressed.

Real-time

Real-time (simultaneous) translation is achieved by trading off quality against latency. The key decision is whether to wait (READ) for more of the audio or to translate (WRITE) some tokens first; a minimal READ/WRITE policy sketch follows the list below. Specific techniques include:

  • Speech segmenter: segments speech in real time based on the CTC criterion.

  • Continuous Integrate-and-Fire (CIF) module: used to implement adaptive policies and make WRITE decisions at each firing step.

  • Cross Attention Augmented Transducer: Extended from RNN-T, jointly optimizes decoding strategy and translation quality by considering all possible READ and WRITE action paths
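As background for the READ/WRITE decisions above, a minimal sketch of a fixed wait-k policy (a generic illustration rather than one of the three techniques listed; `translate_prefix` is a hypothetical incremental decoder):

```python
from typing import Callable, Iterable, Iterator

def wait_k_policy(speech_chunks: Iterable[str], k: int,
                  translate_prefix: Callable[[list[str], int], str]) -> Iterator[str]:
    """READ the first k speech chunks, then alternate: after every further READ, WRITE one token."""
    seen: list[str] = []
    written = 0
    for chunk in speech_chunks:
        seen.append(chunk)                                # READ action
        if len(seen) >= k:
            written += 1
            yield translate_prefix(seen, written)         # WRITE action: emit the next target token
    # When the source ends, keep WRITING until the translation is complete (omitted here).

# Toy usage: "translate" by emitting placeholder tokens.
for tok in wait_k_policy(["c1", "c2", "c3", "c4"], k=2, translate_prefix=lambda s, n: f"y{n}"):
    print(tok)   # y1, y2, y3
```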

Segmentation

The ST model cannot handle very long speech sequences (e.g., movies), so the audio needs to be segmented into short pieces first. Specific techniques include:

  • Supervised Hybrid Audio Segmentation (SHAS): Uses Wav2vec2 and trains a classifier to predict segmentation locations supervised by manual segmentation information.

Named entity

That is, the translation of named entities. This is a critical requirement in real-world scenarios. Specific studies include:

  • One study found that the key cause of failures in person-name translation is the nationality (and hence pronunciation) of the referenced person, so a multilingual model was proposed to improve robustness to different pronunciations.

  • Two methods have also been designed to perform ST translation and NE recognition at the same time: an inline method that generates NE tags interleaved with the translation tokens, and a parallel method that predicts NE labels alongside the tokens.

Code-switching

Code-switching refers to speech translation where the input mixes different languages (usually two or more). For example, if a speaker uses both English and French in a conversation, the E2E ST model must handle this language mixture and convert it into text or speech output in a single target language (such as English or French).

  • Current work builds corpora for the CS task and explores the performance differences between cascaded systems and end-to-end models on it.

  • A unified Language Agnostic E2E ST model (LAST) has also been proposed.

Gender bias

This means addressing gender bias in translation and ensuring that speech recognition and translation systems do not introduce inequality or bias due to gender.

5 Future

This section discusses some topics for future research.

LLM (Large Language Model)

LLMs such as ChatGPT and Bloom have very powerful capabilities, so it is worth studying how to integrate the powerful generation abilities of LLMs into ST tasks, and how to incorporate speech data into LLM training.

  • As a first step, we can optimize the representation of speech so that it is comparable to the representation of text.

    • Treating discrete speech representations as a pseudo-language is a promising direction.

  • In addition, pre-training large-scale acoustics-aware LLMs is also a promising direction.

Multimodality

The explosion of multi-modal information such as text, images, speech, and video generated with artificial intelligence has pushed the ST field to explore more complex human-computer interaction (HCI) scenarios, such as speech-to-speech translation and video translation.

The explosive growth of multi-modal data has also made In-Context Learning (ICL) on multi-modal data a promising research direction: better understanding and exploiting the correlations between data of different modalities can enable more accurate and comprehensive multi-modal analysis and applications.

Multimodal pre-training has also been proven to be effective in many fields.

The information interaction and correlation between modalities also need to be explored, for example between a character's voice in a video and the image frames and prosodic context of that character in the same time span: cues such as tone, pitch, volume, speaking rate, and pauses convey the emotion and attitude of the language.

Origin blog.csdn.net/m0_56942491/article/details/134035089