Paper reading: Multimodal Graph Transformer for Multimodal Question Answering


Paper name: Multimodal Graph Transformer for Multimodal Question Answering
Paper link

Summary

Despite the success of Transformer models in vision-and-language tasks, they often learn knowledge implicitly from large amounts of data and cannot directly exploit structured input data. On the other hand, structured learning approaches, such as graph neural networks (GNNs) that integrate prior information, can hardly compete with Transformer models.

In this work, we aim to benefit from both worlds and propose a novel Multimodal Graph Transformer for the question answering task, which requires reasoning across multiple modalities. We introduce a plug-and-play graph-involved quasi-attention mechanism that incorporates multimodal graph information, obtained from textual and visual data, as an effective prior into vanilla self-attention.
Specifically, we construct text graphs, dense region graphs, and semantic graphs to generate adjacency matrices, and then combine them with the input visual and linguistic features for downstream inference.

This method of normalizing self-attention with graph information significantly improves reasoning ability and helps to align features from different modalities. We validate the effectiveness of the Multimodal Graph Transformer over its Transformer baseline on the GQA, VQAv2 and MultiModalQA datasets.
Figure 1: Overview of the Multimodal Graph Transformer. It takes visual features, textual features, and their correspondingly generated graphs as input. The generated graphs are first converted into adjacency matrices to derive the mask matrix G; the modified quasi-attention scores in the Transformer are then computed to infer the answer. Here, G is the graph-induced matrix formed by concatenating the adjacency matrices from the visual and language sides, and a trainable bias term is learned alongside it. Input features from the different modalities are fused with the graph information for downstream reasoning.

1 Contribution

To make up for the shortcomings of existing methods, this paper proposes a plug-and-play graph-based multimodal question answering method. Our approach is called Multimodal Graph Transformer because it builds on the well-established Transformer (Vaswani et al., 2017a) backbone, despite several key fundamental differences.
First, we introduce a systematic scheme to convert text graphs, dense region graphs, and semantic graphs from vision and language tasks into adjacency matrices for use in our method.
Second, instead of computing the attention score directly, we learn a newly proposed quasi-attention score with the graph-induced adjacency matrix at its core, highlighting that learning relative importance serves as an effective inductive bias for computing the quasi-attention score.
Third, unlike previous Transformer methods that learn self-attention entirely from data, we introduce graph structure information into the self-attention computation to guide Transformer training, as shown in Figure 1.

The main contributions are summarized as follows:

• We propose a novel multimodal graph transformer learning framework that combines multimodal graph learning from unstructured data with Transformer models.

• We introduce a modular plug-and-play graph-involved quasi-attention mechanism with a trainable bias term to guide the information flow during training.

• The effectiveness of the proposed method is empirically verified on GQA, VQA-v2 and MultiModalQA tasks.

3 Multimodal Graph Transformer

3.1 Background on Transformers

The Transformer layer (Vaswani et al., 2017b) consists of two modules: multi-head attention and feed-forward network (FFN).

Specifically, each head is parameterized by four main matrices: a query projection $W^Q$, a key projection $W^K$, a value projection $W^V$, and an output projection $W^O$.

With queries $Q = XW^Q$, keys $K = XW^K$, and values $V = XW^V$ computed from the input features $X$, the output of attention is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the keys.
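To make this background concrete, here is a minimal NumPy sketch of a single attention head under the standard formulation above (dimension and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, W_q, W_k, W_v, W_o):
    """One attention head: X is (n, d); W_q/W_k/W_v are (d, d_k); W_o is (d_k, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) scaled dot-product scores
    A = softmax(scores, axis=-1)      # attention weights, each row sums to 1
    return (A @ V) @ W_o              # (n, d) output, projected back to model dim

# Tiny usage example with random inputs.
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
W_o = rng.normal(size=(d_k, d))
out = single_head_attention(X, W_q, W_k, W_v, W_o)  # shape (4, 8)
```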

3.2 Framework overview

The whole framework of the proposed Multimodal Graph Transformer method is shown in Figure 2. Without loss of generality, we assume that the ultimate task discussed below is VQA, while noting that our framework can be applied to other vision-language tasks, such as multimodal question answering.

Figure 2: The overall framework of our Multimodal Graph Transformer. Inputs from different modalities are processed and converted into the corresponding graphs, which are then turned into masks and, together with the input features, fed into the Transformer for downstream reasoning. The semantic graph is generated by a scene graph generation method, the dense region graph is extracted as a densely connected graph, and the text graph is generated by parsing.

Given an input image and a question, the framework first builds three graphs: a semantic graph, a dense region graph, and a text graph, each described in more detail in the following sections. Each graph G = (V, E), where V is the set of nodes and E is the set of edges connecting them, is fed into the Transformer to guide the training process.

3.3 Multimodal graph construction

We build three types of graphs and feed them into the Transformer: the text graph, the semantic graph, and the dense region graph.

Text graph

The task of visual question answering involves the combination of an image, a question, and the corresponding answer. To handle this, we extract entities and create a textual graph representation; we then construct the graph G = (V, E), shown on the left in Figure 2, where the node set V represents entities and the edge set E represents the relationships between entity pairs (see the short construction sketch after the list below). This results in:

  • A collection of N entities, each represented by a labeled vector embedding, constitutes the nodes of the graph.
  • A set of pairwise relationships between entities that form the edges of a text graph. The relationship between entities i and j is represented by a vector e_ij that encodes the relative relationship.
Figure 3: A simple demonstration of converting a semantic graph into an adjacency matrix. Blue cells indicate elements of the graph matrix that are "0", while white cells indicate "-inf". This matrix is used as the mask when computing the quasi-attention.
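Returning to the text graph: a minimal sketch of this construction, assuming entities and pairwise relations have already been extracted from the question (the triples and helper names below are hypothetical, for illustration only):

```python
# Build a text graph G = (V, E) from extracted entities and pairwise relations.
triples = [
    ("man", "holding", "umbrella"),
    ("umbrella", "above", "dog"),
]

# Node set V: unique entities, each of which would be mapped to a labeled embedding.
V = sorted({e for subj, _, obj in triples for e in (subj, obj)})
index = {entity: i for i, entity in enumerate(V)}

# Edge set E: one entry per related entity pair; the relation label would be
# encoded as the edge vector e_ij in the paper's formulation.
E = [(index[subj], index[obj], rel) for subj, rel, obj in triples]

print(V)  # ['dog', 'man', 'umbrella']
print(E)  # [(1, 2, 'holding'), (2, 0, 'above')]
```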

Semantic graph

In tasks such as multimodal question answering, additional input may be provided in the form of tables or long paragraphs. To process these inputs, a linear representation of the table can be created and a semantic graph constructed using a similar approach. These inputs are processed with a scene graph parser (Zhong et al., 2021), which converts text sentences into a graph of entities and relations, as shown in Figure 3 (a toy illustration of the parsed output follows the list below). The output of the scene graph parser includes:

  • A collection of N words that make up the semantic graph nodes, where N is the number of parsed words in the text.
  • A set of possible pairwise relations between words, such as "left" and "on" in Figure 3, which form the edges of the graph. An edge connecting words i and j is denoted by e_ij.
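A toy stand-in for the parsed output and its mapping to graph edges (the dictionary format and example words are assumptions for illustration; the actual parser of Zhong et al., 2021 is not reproduced here):

```python
# Hypothetical parsed output: words plus pairwise relations, as described above.
parsed = {
    "words": ["plate", "table", "cup"],
    "relations": [
        ("plate", "on", "table"),   # an "on" edge, as in Figure 3
        ("cup", "left", "plate"),   # a "left" edge, as in Figure 3
    ],
}

word_index = {w: i for i, w in enumerate(parsed["words"])}

# Semantic-graph edges e_ij between word nodes i and j, labeled by the relation.
edges = [(word_index[a], word_index[b], rel) for a, rel, b in parsed["relations"]]
print(edges)  # [(0, 1, 'on'), (2, 0, 'left')]
```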

Dense region graph

Visual features are extracted by splitting the input image into small patches and flattening them, following the method described in (Kim et al., 2021). The dense region graph G = (V, E), where V is the set of extracted visual features and E is the set of edges connecting the feature nodes, is then converted into a mask. This results in an almost fully connected graph.
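A rough sketch of this step, assuming a square image split into non-overlapping patches (the patch size and shapes are illustrative; the exact feature extractor of Kim et al., 2021 is not reproduced here):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    patches = (image[:rows * patch, :cols * patch]
               .reshape(rows, patch, cols, patch, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(rows * cols, patch * patch * C))
    return patches  # (num_patches, patch*patch*C) flattened visual features

image = np.zeros((224, 224, 3))
V = patchify(image)          # nodes: one per patch, here 14 * 14 = 196
N = V.shape[0]
# Dense region graph: (almost) fully connected, so every pair of distinct nodes is an edge.
E = [(i, j) for i in range(N) for j in range(N) if i != j]
```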


The three resulting graphs are then converted into adjacency matrices, whose elements are either -∞ or zero.
Figure 3 illustrates the conversion process using the semantic graph as an example. These adjacency matrices are used to control the information flow inside the scaled dot-product attention by masking out (setting to -∞) the corresponding values.
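A minimal sketch of this conversion under the description above (the edge list and node count are placeholders; the self-loop and undirected-edge handling are assumptions):

```python
import numpy as np

def graph_to_mask(num_nodes, edges):
    """Convert an edge list into a mask matrix: 0 where attention is allowed
    (an edge exists, or a node attends to itself), -inf everywhere else."""
    mask = np.full((num_nodes, num_nodes), -np.inf)
    np.fill_diagonal(mask, 0.0)     # assumption: nodes always see themselves
    for i, j in edges:
        mask[i, j] = 0.0
        mask[j, i] = 0.0            # assumption: edges are treated as undirected
    return mask

# e.g. 3 word nodes with edges (0, 1) and (2, 0), as in the toy semantic graph above
print(graph_to_mask(3, [(0, 1), (2, 0)]))
```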

Graph-involved quasi-attention

To effectively utilize structured graph knowledge in the self-attention computation, we incorporate the graph as an additional constraint for each attention head by converting it into an adjacency matrix. The graph matrix, denoted G, is composed of multiple masks; Figure 4 shows this process. The visual mask is generated from the dense region graph, and the text mask is derived from the text graph. In addition, the cross-modal mask is set to an all-zero matrix to encourage the model to learn cross-attention between visual and textual features, thereby facilitating alignment across the different modalities.
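A sketch of how such a combined mask might be assembled, assuming a visual mask over the image tokens and a text mask over the text tokens (the block layout and function names are assumptions based on the description, not the paper's code):

```python
import numpy as np

def combine_masks(vis_mask, txt_mask):
    """Stack per-modality masks into one graph matrix G for the joint sequence
    [visual tokens; text tokens]; cross-modal blocks are all zeros so that
    attention between modalities is left unconstrained."""
    n_v, n_t = vis_mask.shape[0], txt_mask.shape[0]
    G = np.zeros((n_v + n_t, n_v + n_t))
    G[:n_v, :n_v] = vis_mask        # visual mask from the dense region graph
    G[n_v:, n_v:] = txt_mask        # text mask from the text graph
    return G                        # off-diagonal (cross-modal) blocks stay 0

vis_mask = np.zeros((4, 4))                         # dense region graph: fully connected
txt_mask = np.full((3, 3), -np.inf)
np.fill_diagonal(txt_mask, 0.0)
G = combine_masks(vis_mask, txt_mask)               # (7, 7) combined mask
```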

When graph information is added, with the visual and text graph masks concatenated and aligned with the image and text features, we argue that it is beneficial to have a more flexible masking mechanism rather than keeping a single constant mask matrix inside the Softmax operation. Drawing on insights from Liu et al. (2021), who add a relative positional bias to each head when computing similarity, we likewise parameterize a trainable bias term and incorporate it into the training process. Finally, the quasi-attention is computed by adding the graph-induced mask G and the trainable bias to the scaled dot-product scores inside the Softmax.
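A minimal sketch of this quasi-attention computation, assuming the trainable bias (called G_bias here, a placeholder name) is simply added to the scores alongside the mask; the exact formula is given in the paper and may differ:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def quasi_attention(Q, K, V, G, G_bias):
    """Scaled dot-product attention with a graph-induced mask G (0 / -inf entries)
    and a learnable bias G_bias added to the scores before the Softmax.
    A sketch consistent with the description above, not the paper's exact formula."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + G + G_bias
    return softmax(scores, axis=-1) @ V
```

In a full model, G_bias would be a trainable parameter (per head), updated by backpropagation together with the other weights, while G stays fixed by the constructed graphs.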


Summary



Origin blog.csdn.net/weixin_44845357/article/details/130577459