The great unification is really here: Meta-Transformer with multi-modal shared parameters

Producer: Towhee Technical Team

Author: Zhang Chen

Among the many possible directions toward general artificial intelligence, multimodal large language models (MLLMs) have become an important and closely watched line of research. With the impact of GPT-4 on image-and-text understanding, extending understanding to ever more modalities has become a hot topic in academia. Is this era really coming?

A research team from the Multimedia Laboratory of the Chinese University of Hong Kong and the Shanghai Artificial Intelligence Laboratory has proposed a unified multimodal learning framework, Meta-Transformer. By learning from multiple modalities in a unified way, the model can understand 12 modalities with a shared set of network parameters, without additional training.

[Figure] The modalities supported by Meta-Transformer, and a comparison with ImageBind

This paper explores the potential of the transformer architecture to process 12 modalities: image, natural language, point cloud, audio spectrogram, video, infrared, hyperspectral, X-ray, IMU, tabular, graph, and time-series data, as shown in the figure above.

This paper discusses the transformer learning process for each modality, addresses the challenges of unifying them into a single framework, and proposes a novel unified framework for multimodal learning called Meta-Transformer. Meta-Transformer is the first framework to encode data from more than a dozen modalities simultaneously using the same set of parameters, allowing a more cohesive approach to multimodal learning. Meta-Transformer consists of three simple yet effective components: a modality-specific expert for data-to-sequence tokenization, a modality-shared encoder for extracting representations across modalities, and task-specific heads for downstream tasks.
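To make the three-component design concrete, here is a minimal PyTorch sketch for a single image modality. The class names (`ImageTokenizer`, `MetaTransformerSketch`), the patch-embedding tokenizer, and all dimensions are illustrative assumptions and are not taken from the official repository.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: names and sizes are assumptions, not the official implementation.

class ImageTokenizer(nn.Module):
    """Modality-specific tokenizer: turns an image into a token sequence via patch embedding."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                          # x: (B, 3, H, W)
        tokens = self.proj(x)                      # (B, D, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, D) token sequence

class MetaTransformerSketch(nn.Module):
    """Tokenizer -> modality-shared encoder -> task-specific head."""
    def __init__(self, tokenizer, embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        self.tokenizer = tokenizer                 # one tokenizer per modality
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # shared across modalities
        self.head = nn.Linear(embed_dim, num_classes)                  # task-specific head

    def forward(self, x):
        tokens = self.tokenizer(x)                 # modality-specific tokenization
        feats = self.encoder(tokens)               # shared representation extraction
        return self.head(feats.mean(dim=1))        # pool tokens, then predict
```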

Specifically, Meta-Transformer first transforms multimodal data into sequences of tokens that share a common manifold space. Then, a modality-shared encoder with frozen parameters extracts representations, which are further adapted to individual tasks by updating only the parameters of the downstream task head and the lightweight tokenizer. Finally, task-specific and general modality representations can be learned efficiently by this simple framework. Meta-Transformer heralds the great promise of using transformers to develop unified multimodal intelligence.
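The adaptation recipe described above (frozen shared encoder, trainable tokenizer and head) can be sketched as follows. It reuses the hypothetical `MetaTransformerSketch` class from the previous snippet; the optimizer choice and learning rate are assumptions for illustration only.

```python
# Continues the previous snippet. Freeze the shared encoder; train only tokenizer and head.
model = MetaTransformerSketch(ImageTokenizer(), num_classes=10)

for param in model.encoder.parameters():
    param.requires_grad = False                    # keep the pre-trained shared encoder frozen

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # illustrative hyperparameters

def train_step(images, labels):
    logits = model(images)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()                                # gradients reach only tokenizer and head
    optimizer.step()
    return loss.item()
```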

This paper conducts extensive experiments on benchmarks spanning the 12 modalities. Pre-trained exclusively on images from the LAION-2B dataset, Meta-Transformer demonstrates superior performance in processing data from multiple modalities, consistently outperforming state-of-the-art methods across different multimodal learning tasks.

[Figure] For each modality, a feature-sequence construction method is designed according to the characteristics of that modality's data; the resulting token sequence is fed into an encoder whose parameters are frozen after pre-training, and the extracted representations are used to solve multiple downstream tasks across modalities.

The paper also notes some limitations of Meta-Transformer:

  • Complexity: Meta-Transformer is computationally intensive. The high memory cost and heavy computational burden make it difficult to scale up the model size and the data size.
  • Methodology: Compared with the axial attention mechanisms in TimeSformer and Graphormer, Meta-Transformer lacks temporal and structural awareness. This limitation may affect its overall performance in tasks where temporal and structural modeling play a key role, such as video understanding, visual tracking, or social network prediction.
  • Application: Meta-Transformer mainly shows its advantages in multimodal perception. Its ability to generate across modalities remains unknown.

Overall, this paper explores the potential of plain transformers for unified multimodal learning, highlighting the promising trend of using transformer backbones to develop unified multimodal intelligence. To some extent, it supports the dominance of transformers in next-generation networks. Importantly, CNNs and MLPs are not left behind: they still play an important role in data tokenization and representation projection. This reflects the lineage of neural network architectures and the continuous evolution of artificial intelligence.

Related links:

Code address: https://github.com/invictus717/MetaTransformer

Paper address: https://arxiv.org/pdf/2307.10802v1.pdf



Origin: blog.csdn.net/weixin_44839084/article/details/131942609