[AI Theory Learning] An Introduction to Multimodality and Current Research Directions

What is multimodality?

Multimodality refers to information that comes in different modalities, such as text, images, video, and audio. As the name suggests, multimodal research studies how to fuse these different types of data.

Most current work deals only with image and text data: video is converted into frames (images) and audio is transcribed into text. Multimodal research therefore draws heavily on both the vision and NLP fields.

What are the multimodal tasks and datasets?

Most multimodal research focuses on vision-language problems, with tasks such as image-text classification, question answering, matching, ranking, and grounding.
For example, given an image, the following tasks can be defined (a sketch of their input/output signatures is given after the list):

  1. VQA (Visual Question Answering)
    Input: an image and a question posed in natural language
    Output: an answer (a word or short phrase)
  2. Image Captioning
    Input: an image
    Output: a natural-language description of the image (one sentence)
  3. Referring Expression Comprehension
    Input: an image and a sentence in natural language
    Output: a judgment of whether the sentence correctly describes the image content (correct or incorrect)
  4. Visual Dialogue
    Input: an image
    Output: multiple rounds of dialogue between two agents about the image
  5. VCR (Visual Commonsense Reasoning)
    Input: an image, a question, 4 candidate answers, and 4 candidate rationales
    Output: the correct answer and the supporting rationale
  6. NLVR (Natural Language for Visual Reasoning)
    Input: 2 images and a natural-language statement
    Output: true or false (whether the statement holds for the image pair)
  7. Visual Entailment
    Input: an image and a piece of text
    Output: probabilities for 3 labels: entailment, neutral, contradiction
  8. Image-Text Retrieval
    There are 3 settings:
    1) Text retrieval from an image: input an image, output matching text
    2) Image retrieval from text: input text, output matching images
    3) Image retrieval from an image: input an image, output similar images
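
To make the inputs and outputs above concrete, here is a minimal sketch of a few of these tasks written as Python type signatures. The function names, the array-based Image alias, and the return types are illustrative placeholders, not the API of any real benchmark.

```python
"""Illustrative type signatures for a few vision-language tasks (a sketch only)."""
from typing import Tuple

import numpy as np

Image = np.ndarray  # assumed H x W x 3 RGB array


def vqa(image: Image, question: str) -> str:
    """VQA: image + natural-language question -> short answer (word or phrase)."""
    raise NotImplementedError


def image_caption(image: Image) -> str:
    """Image captioning: image -> one descriptive sentence."""
    raise NotImplementedError


def nlvr(image_a: Image, image_b: Image, statement: str) -> bool:
    """NLVR: two images + a statement -> whether the statement is true."""
    raise NotImplementedError


def visual_entailment(image: Image, text: str) -> Tuple[float, float, float]:
    """Visual entailment: -> probabilities of (entailment, neutral, contradiction)."""
    raise NotImplementedError
```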

What are the ways to integrate multiple modalities?

A pre-trained NLP model provides an embedded representation of the text, and a pre-trained vision model provides an embedded representation of the image. How, then, are the two combined to solve the tasks above? Two cross-modal fusion approaches are commonly used:

  1. Dot product or direct addition
    In this approach, the text and the image are embedded separately, and the resulting vectors are then combined by concatenation, addition, or a dot product. The advantage is that it is simple and computationally cheap (see the dot-product sketch after this list).
    (figure: dot-product modality fusion, as used in the ALIGN model)
  2. The other, more recently popular, way of crossing modalities is the Transformer.
    The advantage is that the Transformer architecture yields richer representations of image and text features; the disadvantages are the larger memory footprint and higher computational cost (see the cross-attention sketch after this list).
    (figure: Transformer-based multimodal fusion)
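
Below is a minimal PyTorch sketch of both fusion styles. It assumes the unimodal encoders already output fixed-size features (the widths 512 and 768, the shared dimension 256, and the class names are illustrative assumptions, not taken from a specific paper).

```python
"""A minimal sketch of dot-product fusion vs. Transformer cross-attention fusion."""
import torch
import torch.nn as nn
import torch.nn.functional as F


class DotProductFusion(nn.Module):
    """Embed each modality separately, then score pairs with a dot product
    (the ALIGN-style fusion referred to in the figure above)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.img_proj = nn.Linear(512, dim)  # 512 = assumed image-encoder width
        self.txt_proj = nn.Linear(768, dim)  # 768 = assumed text-encoder width

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.img_proj(img_feat), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return img @ txt.t()  # (num_images, num_texts) similarity matrix


class CrossAttentionFusion(nn.Module):
    """Transformer-style fusion: text tokens attend to image patch tokens."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, txt_tokens: torch.Tensor, img_tokens: torch.Tensor) -> torch.Tensor:
        # txt_tokens: (batch, num_text_tokens, dim); img_tokens: (batch, num_patches, dim)
        fused, _ = self.attn(query=txt_tokens, key=img_tokens, value=img_tokens)
        return self.ffn(fused + txt_tokens)  # text tokens enriched with visual context


if __name__ == "__main__":
    sim = DotProductFusion()(torch.randn(8, 512), torch.randn(8, 768))
    out = CrossAttentionFusion()(torch.randn(8, 16, 256), torch.randn(8, 49, 256))
    print(sim.shape, out.shape)  # torch.Size([8, 8]) torch.Size([8, 16, 256])
```

The dot-product variant only interacts at the final score, which is why it is cheap; the cross-attention variant lets every text token see every image patch, which is where the extra memory and compute go.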

What are the research directions for multimodal tasks?

  1. Multimodal representation learning: exploit the complementarity and redundancy among modalities to represent and summarize multimodal data.
  • Joint representation: the different unimodal features are projected into a shared subspace and fused (a sketch contrasting joint and coordinated representations follows this list).
  • Coordinated representation: each modality learns its own representation, but the representations are coordinated through constraints, e.g. adversarial training or similarity constraints on the modality-encoded features.
  • Encoder-decoder: maps one modality to another in translation-style tasks. The encoder maps the source modality to an intermediate vector, from which a representation in the target modality is generated.
  • Modality translation (mapping): transform data from one modality into another.
    • Example-based: retrieval models are the simplest form of multimodal translation; they find the closest example in a dictionary and use it as the translation. Retrieval can be performed in the unimodal space or in an intermediate semantic space.
    • Generative:
      • Grammar-based: simplifies the task by using a grammar to restrict the target domain. High-level semantics are first detected in the source modality (e.g. objects in images, actions in videos), and these detections are then combined with a predefined grammar-based generation process to produce the target modality.
      • Encoder-decoder: the source modality is encoded into a latent vector representation, from which the decoder generates the target modality in a single-pass pipeline.
      • Continuous generative models: generate the target modality continuously from a stream of source-modality input, producing output at each time step in an online fashion. They are useful for sequence-to-sequence conversion such as text-to-speech, speech-to-text, and video-to-text.
  2. Modality alignment: find correspondences between sub-components of different modalities.
  • Implicit alignment: alignment acts as an intermediate (often latent) step within another task; e.g. image retrieval from a textual description can include an alignment step between words and image regions.
  • Explicit alignment: mainly achieved through similarity measures; most methods rely on measuring the similarity between sub-components of different modalities as the basic building block.
  3. Multimodal fusion: fusion methods can be categorized in two ways. By how the sub-networks are combined:
  • Aggregation-based: aggregation-based methods merge the multimodal sub-networks into a single network through operations such as averaging, concatenation, or self-attention.
  • Alignment-based: alignment-based fusion methods apply a regularization loss to align the feature embeddings of all sub-networks, while each sub-network keeps its own parameter propagation.
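
To make the first two bullets of the representation-learning item concrete, here is a minimal PyTorch sketch contrasting a joint representation (both modalities projected into one shared subspace and fused) with a coordinated representation (separate embeddings tied together by a similarity constraint); the same similarity-constraint idea also underlies the alignment-based fusion methods just listed. The feature widths and the choice of a contrastive loss are illustrative assumptions, not taken from a particular method.

```python
"""Joint vs. coordinated multimodal representations: a minimal sketch."""
import torch
import torch.nn as nn
import torch.nn.functional as F

IMG_DIM, TXT_DIM, SHARED_DIM = 512, 768, 256  # assumed encoder output widths


class JointRepresentation(nn.Module):
    """Project both modalities into one shared subspace and fuse the features."""

    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(IMG_DIM, SHARED_DIM)
        self.txt_proj = nn.Linear(TXT_DIM, SHARED_DIM)
        self.fuse = nn.Linear(2 * SHARED_DIM, SHARED_DIM)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.fuse(joint)  # one fused vector per (image, text) pair


def coordinated_similarity_loss(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Coordinated representations: the two embeddings stay separate but are
    constrained to agree, here via a symmetric contrastive similarity loss."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0))  # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```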

  Fusion can also be categorized by the stage at which it happens (a sketch of the first two follows this list):
  • Early fusion: fuses low-level features and can learn to exploit the correlations and interactions among the features of each modality. For example, one line of work uses polynomial feature fusion to recursively propagate local correlations to global ones.
  • Late fusion: fuses unimodal decision values with a mechanism such as averaging, voting, weighting based on channel noise and signal variance, or a learned fusion model.
  • Hybrid fusion: tries to combine the advantages of both approaches in a single framework, and has been used successfully in multimodal speaker identification and multimedia event detection.
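
Below is a minimal sketch of early versus late fusion, assuming a 512-d image feature and a 768-d text feature are already available; the late-fusion mechanism shown is plain averaging of the unimodal decision scores, one of the options listed above. All dimensions and layer choices are illustrative assumptions.

```python
"""Early (feature-level) vs. late (decision-level) fusion: a minimal sketch."""
import torch
import torch.nn as nn


class EarlyFusionClassifier(nn.Module):
    """Concatenate low-level features first, so the classifier can exploit
    cross-modal correlations and interactions."""

    def __init__(self, img_dim: int = 512, txt_dim: int = 768, n_classes: int = 10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256), nn.ReLU(), nn.Linear(256, n_classes)
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))


class LateFusionClassifier(nn.Module):
    """Make a separate unimodal decision per modality, then average the scores."""

    def __init__(self, img_dim: int = 512, txt_dim: int = 768, n_classes: int = 10):
        super().__init__()
        self.img_head = nn.Linear(img_dim, n_classes)
        self.txt_head = nn.Linear(txt_dim, n_classes)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        return (self.img_head(img_feat) + self.txt_head(txt_feat)) / 2
```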

  4. Co-learning
    Co-learning helps train models for a modality with scarce resources by transferring knowledge extracted from the data of another modality. It can be divided into three types according to the kind of data resources available.
  • Parallel: observations from one modality in the training set are directly linked to observations from the other modality, i.e. the multimodal observations come from the same instances.
  • Non-parallel: no direct link is required between observations from different modalities; joint learning is typically achieved through overlap in data categories.
  • Hybrid: in the hybrid setting, two non-parallel modalities are bridged by a shared modality or dataset.

