Multimodal Pre-trained Large Models

This paper presents a comprehensive review of large-scale multimodal pretrained models (MM-PTMs).

This article introduces a survey on multimodal pre-trained large models published in the journal Machine Intelligence Research (MIR): Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, Wen Gao. Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey. Machine Intelligence Research. https://doi.org/10.1007/s11633-022-1410-8. The authors are from Pengcheng Laboratory (Shenzhen), Anhui University, Peking University, and Sichuan University.

Paper link: https://www.mi-research.net/article/doi/10.1007/s11633-022-1410-8

Github link: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey

With the urgent need for general-purpose deep models, many large-scale pre-trained models have been proposed, such as Bidirectional Encoder Representations from Transformers (BERT), Vision Transformer (ViT), and Generative Pre-trained Transformer (GPT). Inspired by the success of these models in single domains (such as computer vision and natural language processing), multimodal pre-trained large models have also received increasing attention in recent years. In this paper, we provide a comprehensive survey of these models, hoping to offer new insights and help new researchers keep track of cutting-edge work. Specifically, we first introduce the background of multimodal pre-training by reviewing conventional pre-training work in deep learning, natural language processing, computer vision, and speech. We then introduce the task definition, key challenges, and advantages of multimodal pre-trained models, and focus on the data, objectives, network architectures, and knowledge-enhanced pre-training of multimodal pre-trained large models. Next, we introduce the downstream tasks used to validate large-scale multimodal pre-trained models, including generation, classification, and regression tasks. We also visualize and analyze model parameters and results on representative downstream tasks. Finally, we point out related research directions that may help future work. Additionally, we maintain a continuously updated list of papers on multimodal pre-trained large models:

https://github.com/wangxiao5791509/MultiModal_BigModels_Survey

With AlexNet's breakthrough recognition performance in the ImageNet competition [1], artificial intelligence has developed rapidly. Many representative deep neural networks have been proposed, such as VGG, ResNet, Inception, and LSTM (long short-term memory) networks. Researchers typically collect and annotate samples for their specific tasks and train models on top of backbones pre-trained on large-scale datasets (such as ImageNet in computer vision, and GloVe and Skip-thought vectors in natural language processing). Compared with traditional handcrafted features, this end-to-end approach can solve many tasks well, such as object detection, segmentation, and recognition. However, the generalization ability of the resulting deep models is still limited. While these issues can be addressed to some extent by collecting and labeling larger datasets, this process is expensive and cumbersome. To address this problem, Vaswani et al. proposed the Transformer network, which achieved new state-of-the-art (SOTA) performance on machine translation tasks. After that, self-supervised pre-training on large corpora followed by fine-tuning on downstream tasks attracted more and more researchers' attention. Many large-scale pre-trained models have been proposed under this paradigm, such as Bidirectional Encoder Representations from Transformers (BERT), Generative Pre-trained Transformer (GPT), T5, and XLNet, which in turn sparked a new wave of pre-training research in computer vision. An increasing number of large-scale natural language processing (NLP) and computer vision models demonstrate powerful results under the pre-training and fine-tuning paradigm, including Vision Transformer (ViT) and Swin Transformer. Although these advances have brought new impetus to the development of artificial intelligence, the limitations of single-modality data remain difficult to overcome. Researchers have therefore attempted to bridge the data gap by introducing more modalities into deep models, and many multimodal fusion tasks have been explored in traditional deep learning, involving modalities such as visible light, depth, natural language, point clouds, audio, pulse signals, and event streams. Many multimodal pre-trained large models have since been proposed one after another, continually refreshing the state of the art on individual tasks, as shown in Figure 1.

This article provides a comprehensive overview of this line of work, aiming to help new researchers interested in the field quickly understand its history and recent developments.

Figure 1. Timeline of multimodal pretrained large models from 2019 to June 2022, including multimodal datasets and representative models. Purple font indicates that the dataset contains Chinese text (other datasets contain English text). Models highlighted in wine red were trained using more than two modalities.

This survey differs from existing related review papers. Although two surveys on multimodal pre-training have already been proposed, the differences between our survey and existing ones can be summarized as follows:

  • Scope of the review. Existing multimodal surveys focus only on vision and language, whereas multimodal information is a much broader research topic. This survey is more comprehensive, covering additional modalities such as audio, video, and tables.

  • Timeliness of the review. This survey introduces the latest multimodal pre-training datasets and algorithms proposed from 2019 to 2022. It is a long-form review, whereas previous reviews are short papers. The authors also keep track of the latest research results on the GitHub page.

  • Latest insights on multimodal pre-trained large models. By classifying and analyzing existing multimodal pre-training work from different perspectives, this survey helps readers grasp cutting-edge methods and techniques and understand them from both detailed and high-level viewpoints. In addition, the research directions proposed for multimodal pre-trained large models are carefully considered and intended to provide new leads for follow-up research.

Figure 2. Overview framework of multimodal pre-trained large models.

Multimodal pre-training:

Task Definition and Key Challenges:

In general, deep neural networks are trained on large-scale datasets; for example, the widely used residual networks are pre-trained on the ImageNet dataset with a classification task. In contrast, multimodal pre-trained large models are usually trained on even larger multimodal datasets. These data are usually unlabeled, because they are too large to annotate. On the other hand, the number of model parameters needs to reach a certain scale. As shown in Figure 3, multimodal data, large models, and computing power are closely linked. In short, backed by sufficient computing power, multimodal pre-training usually denotes the task of pre-training large multimodal models with a huge number of parameters on large amounts of multimodal data in an unsupervised manner.

Figure 3. The relationship between multimodal data, models, and computing power.

Given the above, realizing a multimodal pre-trained large model with a huge number of parameters is very challenging. More specifically, the key challenges can be summarized as follows:

  • Acquisition and cleaning of large-scale multimodal data. Multimodal data is one of the most important elements of multimodal pre-trained large models. Collecting multimodal data is much more difficult than collecting single-modality data due to the lack of multimodal imaging equipment. Commonly used multimodal cameras usually cover only two modalities, such as RGB-depth, RGB-thermal, RGB-radar, and RGB-event cameras. Most current multimodal pre-trained large models are vision-language models, because image and text data are easy to obtain from the internet. However, additional cleaning of these data is also necessary due to sample noise.

  • Network architecture design for large-scale multimodal pre-training. Network architecture is another key component of multimodal pre-training. The networks that encode features of the various input modalities need to be carefully tailored, because different modalities have their own characteristics and may require specific networks; for example, Transformers or CNNs suit image and text modalities, while spiking neural networks can be used for event streams. Another issue is the design of multimodal fusion or cross-modal matching modules (a minimal cross-attention fusion sketch is given after this list). Whether modules designed for small-scale multimodal tasks are also applicable to large-scale pre-trained models remains to be verified.

  • Design of the pre-training objectives. Due to the large amount of unlabeled multimodal data, pre-training usually needs to be performed in an unsupervised manner. Many current works adopt masked region prediction for each modality as their learning objective. Obviously, objectives for multimodal tasks can be borrowed directly from unimodal pre-training, but objectives designed specifically for multimodal tasks are also necessary, intuitive, and effective. The widely used contrastive learning, modality matching, and modality translation are all effective and meaningful attempts. How to design new multimodal pre-training objectives is one of the most challenging problems for multimodal pre-trained large models.

  • Support for large-scale computing power. Training traditional deep neural networks can be carried out on servers with a limited number of GPUs. In contrast, pre-training multimodal large models requires far more computing power due to the large-scale multimodal data and the very large number of model parameters. Therefore, a supercomputing platform must be prepared first, and the subsequent model training also consumes a large amount of electricity.

  • Parameter tuning techniques. Considering the aforementioned challenges, training an effective large-scale model is never a simple task, and the techniques used to train the neural networks are also very important. Although the research and techniques for small-scale pre-training are relatively mature, there is much less experience with large-scale pre-training.
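To make the fusion-module question in the architecture bullet above more concrete, here is a minimal sketch of a cross-attention fusion block in PyTorch, in which text tokens attend to image tokens. It only illustrates the general technique; the dimensions, layer layout, and module name are assumptions for illustration, not the design of any specific model in the survey.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Toy cross-attention block: text tokens query image tokens."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text_tokens, image_tokens):
        # Queries come from the text stream; keys/values come from the image stream.
        fused, _ = self.attn(query=text_tokens, key=image_tokens, value=image_tokens)
        x = self.norm1(text_tokens + fused)   # residual connection
        return self.norm2(x + self.ffn(x))    # feed-forward + residual

# Toy usage: batch of 2 samples, 16 text tokens, 49 image patches, hidden size 256.
text = torch.randn(2, 16, 256)
image = torch.randn(2, 49, 256)
print(CrossModalFusion()(text, image).shape)  # torch.Size([2, 16, 256])
```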

Advantages of pre-trained large models:

Compared with single-modality pre-trained large models, multimodal pre-trained large models are better suited to practical application scenarios. Specifically, problems such as multimodal collaborative generation, modality completion, and cross-domain retrieval can be handled well by multimodal pre-trained large models. In addition, multimodal data contain more information, which can compensate for the shortcomings of a single modality, so multimodal pre-trained large models can help extract features common to multiple modalities. Many recent works have shown that multimodal pre-training of large models indeed brings additional prior knowledge. Compared with small multimodal models, the generalization ability of multimodal pre-trained large models obtained through self-supervised/unsupervised learning can be significantly better. This is because some prior knowledge is contained only in massive data, whereas a small amount of manually selected labeled data is biased, making it difficult for small models to grasp such knowledge.

Pre-training datasets:

As shown in Table 1, many large-scale multimodal datasets have been proposed for pre-training tasks. For a more detailed introduction to each dataset, please refer to the original survey and the corresponding original papers.

Table 1. Overview of large-scale multimodal pre-training datasets

Pre-training objectives:

How to design pre-training objectives is one of the core problems of multimodal pre-training. The pre-training objectives introduced in this survey are shown in Figure 4; for more details, please refer to the original text.

Figure 4. Representative pre-training objectives in pre-trained large models.
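As a concrete illustration of one widely used objective, contrastive image-text learning (the family that CLIP and ALIGN, mentioned later in this article, belong to), here is a minimal sketch of a symmetric contrastive loss in PyTorch. The embedding size, temperature value, and the random features standing in for real encoder outputs are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize embeddings so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature
    # Matched image-text pairs lie on the diagonal.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random features standing in for encoder outputs.
images = torch.randn(8, 512)
texts = torch.randn(8, 512)
print(contrastive_loss(images, texts))
```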

Pre-training network architectures:

This survey reviews 89 typical multimodal pre-trained large models, as shown in Table 2, covering the modalities involved, backbone network architectures, pre-training objectives, characteristics of the large models, parameter counts, and corresponding open-source code links. For more details, please refer to the original review.

Table 2. Overview of the multimodal pre-trained large models covered in this survey

Knowledge-enhanced pre-trained large models:

Traditional pre-trained models suffer from poor logical reasoning ability and a lack of interpretability. To address these problems, knowledge is introduced into pre-trained models; such models are also called knowledge-enhanced pre-trained models (KEPTMs), as shown in Figure 5. Knowledge representation learning, which learns symbolic knowledge (usually represented in the form of entities and relations), enables neural network-based models to fuse knowledge and improve their reasoning ability. Similarity-based models and graph neural network (GNN) models are the two main approaches to knowledge representation learning.

Figure 5. Classification of knowledge-enhanced pre-trained models (KEPTMs).
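To give a flavor of what knowledge representation learning looks like in practice, here is a minimal sketch of a TransE-style scoring function, a classic embedding model for (head, relation, tail) triples. It is only an illustrative example of one knowledge-embedding family, not a model prescribed by the survey, and the entity counts and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """Scores a (head, relation, tail) triple as -||h + r - t||; plausible triples score higher."""
    def __init__(self, num_entities, num_relations, dim=100):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)

    def score(self, head, relation, tail):
        h, r, t = self.ent(head), self.rel(relation), self.ent(tail)
        return -torch.norm(h + r - t, p=2, dim=-1)

# Toy usage: score the triple (entity 0) --relation 3--> (entity 7).
kge = TransE(num_entities=100, num_relations=10)
print(kge.score(torch.tensor([0]), torch.tensor([3]), torch.tensor([7])))
```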

Downstream tasks for pre-trained models:

After the pre-training stage is completed, researchers usually test the model on downstream tasks to verify its capabilities. Specifically, generation, classification, and regression tasks are used for validation. As a new learning paradigm, prompt learning, which modifies downstream tasks so that they better fit pre-trained large models, has attracted increasing attention, and some representative prompt learning algorithms are also reviewed in this section. An overview of these downstream tasks is shown in Figure 6.

Figure 6. Overview of downstream tasks for multimodal pre-trained large models.
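To make the idea of prompting concrete, here is a minimal zero-shot classification sketch in the spirit of CLIP, where hand-written prompt templates recast classification as the image-text matching problem used during pre-training. It assumes the open-source openai/CLIP reference implementation is installed; the image path and label set are placeholders, and the learned (continuous) prompts used by more advanced prompt-learning methods are not shown.

```python
import torch
import clip  # reference implementation from github.com/openai/CLIP (assumed installed)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Prompt engineering: wrap each candidate label in a natural-language template
# so classification looks like the image-text matching used during pre-training.
labels = ["cat", "dog", "bicycle"]                                       # placeholder label set
prompts = clip.tokenize([f"a photo of a {c}" for c in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)    # placeholder path

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```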

Experimental analysis:

Considering the complexity and number of multimodal pre-trained large models, it is almost impossible to reproduce the pre-training experiments in a short time, so this survey omits experiments and analysis of the pre-training stage itself. To still provide readers with a more complete review, however, the experimental results on the corresponding downstream tasks are extracted from the original papers and compared on shared benchmark datasets, as shown in Figure 6 and Figure 7.

Figure 7. Experimental results of selected multimodal pre-trained large models on three mainstream tasks: zero-shot image retrieval (Rank-1, Rank-5), image captioning (BLEU, METEOR, CIDEr, SPICE), and visual question answering (Test-std).

Future research directions:

Although multimodal pre-trained large models have developed greatly, this is still a young research direction, and many problems and opportunities await researchers. In this section, we summarize several research directions worth exploring.

  • Pre-training on more modalities. Existing large-scale pre-trained models are usually pre-trained on two modalities, such as vision and language. The lack of large amounts of aligned multimodal data may be a key reason. As the old saying goes, "sharpening the axe will not delay the cutting of firewood": acquiring realistic multimodal data is the most important prerequisite for large-scale pre-training. As shown in Figure 8, such data include visual images, text, audio, radar, event streams, depth images, thermal images, etc. To our knowledge, few devices can capture so many modalities simultaneously, so the development of multimodal imaging devices could be of significant interest. Pre-trained large-scale models built on such data may have wider application potential.

Figure 8. Examples of commonly used mainstream modalities.

  • Pre-training based on incremental learning. Currently, pre-trained large-scale models are adapted to downstream tasks through feature fine-tuning or prompt learning. This standard deep learning pipeline works well in the short term, but pre-training is an expensive process: data collection and cleaning, the cost of pre-training, and hardware all consume a lot of human and material resources. When another set of data is collected, re-pre-training on the combined data is expensive, redundant, and not environmentally friendly. However, few studies have considered developing incremental learning algorithms for large models, and it is unclear whether incremental learning algorithms developed for traditional deep learning are applicable to large models. Beyond such data-incremental learning, there are other ways to exploit multimodal pre-trained large models. For example, class-incremental learning is a classic machine learning problem. Another interesting question is how to introduce new modalities into an already pre-trained multimodal model: because new sensors (modalities) will appear at some indeterminate time in the future, a well-designed multimodal large-scale model should be flexible enough to handle this situation.

  • Knowledge-augmented multimodal pre-training. From the survey of multimodal pre-trained large models, we can see that knowledge-assisted pre-training is still in its infancy. Current work only adopts external knowledge graphs or knowledge bases in the pre-training stage, but these are usually unimodal, independent of the multimodal data, and limited to improving the model's understanding of the data. While commonsense knowledge is more general, it is also more abstract and introduces ambiguity, which poses challenges when it is applied to specific data. Therefore, we believe knowledge-augmented multimodal pre-training deserves further investigation. First, knowledge specific to multimodal data needs to be collected or extracted via self-supervised learning. Second, more general knowledge fusion methods for multimodal data need to be designed, going beyond the visual and language modalities. Third, a dedicated knowledge evaluation task is needed to examine knowledge enhancement at an early stage, since pre-training is the first stage of the whole training process while the downstream tasks are still undetermined.

  • Fine-grained multimodal pre-training. Most existing multimodal pre-trained large models are pre-trained from a global perspective; for example, researchers use the matching between an entire image and a sentence as the pre-training supervision signal. Representative works include CLIP and ALIGN. It should be noted that fine-grained local information mining or instance-level pre-training may further improve the overall performance of multimodal pre-training. Some researchers have explored the possibility of fine-grained pre-training strategies, and we hope more researchers will focus on this direction to further advance the results.

  • Multimodal pre-trained models based on prompt learning. Current pre-trained large-scale models usually follow the "pre-training then fine-tuning" approach, that is, users initialize their models with pre-trained weights and then fine-tune them on downstream tasks. While this works for many tasks, fine-tuning may not be the most direct approach, because current multimodal large-scale models are pre-trained with modality matching, masked token prediction, and similar objectives, while downstream tasks are usually classification and regression tasks; there is therefore a gap between multimodal pre-training and fine-tuning. Recently, a new framework (called prompt learning) was developed for downstream tasks based on large models, which reformulates downstream tasks to be consistent with the pre-training setup. Many works have demonstrated its effectiveness on CV and NLP tasks. Research in this direction is also interesting and has great potential.

  • Technology transfer from small-scale model development. Small-scale multimodal models have been developed for many years, and many representative models have been proposed for deep multimodal tasks. Among these works, diffusion, cross-attention, and dynamic neural networks are very useful for specific multimodal tasks. Some of these techniques have been applied to vision-language pre-trained models, such as ViLBERT, which is based on cross-attention. Many other algorithms and tricks have not yet been explored for large models, and we believe the transfer from small-scale to large-scale pre-trained models is worth investigating.

  • Coupling and decoupling in cross-modal pre-trained models. Coupling involves establishing correlations between different modalities, and only through such correlations can "cross-modal" behavior be achieved. Decoupling allows modalities to be extended dynamically. From a framework design perspective, it is worth investigating how to provide viable solutions to these two problems.

Summary:

This paper presents a comprehensive review of large-scale multimodal pre-trained models (MM-PTMs). First, we introduce the background of MM-PTMs, focusing on conventional pre-training in deep learning, NLP, CV, and speech. Then, the task definition, key challenges, and benefits of MM-PTMs are discussed. Next, we dive into the review of MM-PTMs, discussing pre-training data, objectives, networks, knowledge-enhanced pre-training, etc. We review downstream tasks, including generation, classification, and regression tasks, and summarize the model parameters of MM-PTMs and the hardware used for pre-training. We also discuss and visualize experimental results on several representative tasks. Finally, we point out some noteworthy research directions and conclude the paper. We hope this review can provide useful insights for research on MM-PTMs.

 
