Deep Bidirectional Language-Knowledge Graph Pretraining paper reading

Deep Bidirectional Language-Knowledge Graph Pretraining

GitHub code

Summary

Recent work has shown that knowledge graphs (KGs) can complement textual data, providing structured background knowledge and a useful scaffold for reasoning. However, these works are not pre-trained to learn a deep fusion of the two modalities at scale, which limits the potential to obtain fully joint representations of text and KG. Here we propose DRAGON (Deep Bidirectional Language-Knowledge Graph Pre-training), a self-supervised approach to pre-training a deeply joint language-knowledge model at scale from text and KG. Specifically, our model takes pairs of text segments and associated KG subgraphs as input and bidirectionally fuses information from both modalities. We pre-train this model by unifying two self-supervised reasoning tasks, masked language modeling and KG link prediction. DRAGON outperforms existing LM and LM+KG models on various downstream tasks, including general-domain and biomedical question answering, with an average absolute gain of +5%. In particular, DRAGON achieves strong performance on complex reasoning about language and knowledge (+10% on questions involving long context or multi-step reasoning) and on low-resource question answering (+8% on OBQA and RiddleSense), and attains new state-of-the-art results on various BioNLP tasks.

Challenge

Effectively combining text and knowledge for pre-training is an open problem.

Given text and a KG, we need:

  • A deep bidirectional model of the interaction between the two modalities

  • Self-supervised objectives for learning joint reasoning over text and KG at scale

Existing methods:

  • Fuse text and KG in a shallow or unidirectional way
  • Focus on fine-tuning with labeled downstream tasks rather than performing self-supervised learning

These choices limit their potential to model and learn deep interactions between text and KG.

Proposed approach

We propose DRAGON (Deep Bidirectional Language-Knowledge Graph Pre-training), a method for deep bidirectional, self-supervised pre-training of a joint language-knowledge model from text and KG.

Core components:

  • A cross-modal model for bidirectional fusion of text and KG
  • A bidirectional self-supervised objective for learning joint reasoning over text and KG

[Figure 1: DRAGON overview. (text, local KG) input pairs are encoded by a cross-modal model and pre-trained with joint MLM and link prediction.]

Specifically, a text corpus and a KG serve as raw data; model inputs are created by sampling text segments from the corpus and extracting relevant subgraphs from the KG via entity linking, yielding (text, local KG) pairs.

This input is encoded into fused representations by a cross-modal model, in which each layer encodes the text with an LM layer, encodes the KG with a graph neural network (GNN) layer, and exchanges information between the two through a bidirectional modality interaction module.

The model is pre-trained by unifying two self-supervised reasoning tasks:

(1) Masked language modeling (MLM), which masks tokens in the input text and predicts them.

(2) Link prediction, which holds out edges of the input KG and predicts them.

Deep bidirectional language knowledge graph pre-training (DRAGON)

We take a text corpus and a large knowledge graph as raw data and coarsely align (text segment, local KG) pairs by sampling. To learn the interaction between text and KG, DRAGON uses a cross-modal encoder (GreaseLM) that bidirectionally fuses the input text-KG pairs, and a pre-training objective that provides bidirectional self-supervision on the text-KG input. The objective combines masked language modeling (MLM) and KG link prediction (LinkPred), so that text and KG inform each other and the model learns to reason jointly over both. Finally, we describe how the pre-trained DRAGON model is fine-tuned for downstream tasks.

Input representation

We construct the input as (text segment W, local KG G) pairs. The text and KG of each pair should be (roughly) semantically aligned, so that they can inform each other and the model can learn interactive reasoning between the two modalities. Specifically, for each text segment W sampled from the corpus, we extract a relevant local KG G from the full KG through the following KG retrieval procedure (a minimal sketch follows the list):

  • KG retrieval: entity mentions in the text segment are linked to KG nodes, and the linked nodes together with their nearby neighbors form the local subgraph G.

  • Modality interaction token/node: a special interaction token is added to the text and a special interaction node is added to the local KG, serving as the interface through which the two modalities exchange information.
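A minimal, self-contained sketch of this pairing step is shown below, assuming a toy triple list and naive string-matching entity linking. The real pipeline uses an off-the-shelf entity linker over large KGs such as ConceptNet, and the names here (`KG_TRIPLES`, `retrieve_local_kg`, the 2-hop radius) are illustrative assumptions rather than DRAGON's actual code.

```python
# Toy sketch: build a (text segment, local KG) pair by naive entity linking + 2-hop retrieval.
from collections import defaultdict

KG_TRIPLES = [
    ("round_brush", "is_a", "art_supply"),
    ("round_brush", "at", "hair"),
    ("art_supply", "used_for", "painting"),
    ("hair", "part_of", "head"),
]

def build_adjacency(triples):
    """Undirected adjacency over KG entities."""
    adj = defaultdict(set)
    for h, _, t in triples:
        adj[h].add(t)
        adj[t].add(h)
    return adj

def link_entities(text, entities):
    """Naive entity linking: an entity is 'mentioned' if its surface form appears in the text."""
    text_l = text.lower()
    return {e for e in entities if e.replace("_", " ") in text_l}

def retrieve_local_kg(text, triples, num_hops=2):
    """Return the subgraph (triples) within `num_hops` of the linked entities."""
    adj = build_adjacency(triples)
    frontier = link_entities(text, adj.keys())
    nodes = set(frontier)
    for _ in range(num_hops):
        frontier = {nbr for n in frontier for nbr in adj[n]} - nodes
        nodes |= frontier
    return [(h, r, t) for h, r, t in triples if h in nodes and t in nodes]

text_segment = "A round brush is an art supply that can be used on hair."
local_kg = retrieve_local_kg(text_segment, KG_TRIPLES)
print(local_kg)  # the (text, local KG) pair fed to the cross-modal encoder
```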

Cross-modal encoder

To model the interaction between text and KG, we use a bidirectional sequence-graph encoder that takes text tokens and KG nodes as input and exchanges information across multiple layers to produce fused representations for each token and node.

For a controlled comparison with existing work, we adopt the existing high-performance sequence-graph architecture GreaseLM, which combines Transformers and graph neural networks (GNNs) to fuse the text-KG input.

Specifically, GreaseLM first uses N Transformer language model (LM) layers to map the input text to initial token representations, and uses KG node embeddings to map the input KG nodes to initial node representations. It then applies M text-KG fusion layers to encode these token/node features into the final fused token and node representations.

Each fusion layer applies an LM layer to the token representations and a GNN layer to the node representations; the GNN induces graph-structure-aware representations of the KG nodes. A modality interaction module (MInt) then exchanges information between a special interaction token (text side) and a special interaction node (KG side) by passing their concatenation through an MLP.
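Below is a simplified PyTorch sketch of one such fusion layer. It is only illustrative: the real GreaseLM layer uses a GAT-style GNN with relation and node-type embeddings, while the GNN here is reduced to a single linear update over a normalized adjacency, and the names (`FusionLayer`, the two-layer `mint` MLP) are assumptions.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.lm_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.gnn_linear = nn.Linear(dim, dim)          # simplified GNN update
        self.mint = nn.Sequential(                     # modality interaction module (MInt)
            nn.Linear(2 * dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, 2 * dim)
        )

    def forward(self, tokens, nodes, adj):
        # tokens: (B, T, d) with the interaction token at position 0
        # nodes:  (B, J, d) with the interaction node at position 0
        # adj:    (B, J, J) row-normalized adjacency of the local KG
        tokens = self.lm_layer(tokens)                      # LM layer on the text side
        nodes = torch.relu(self.gnn_linear(adj @ nodes))    # GNN layer on the KG side
        # MInt: exchange information between the interaction token and interaction node
        fused = self.mint(torch.cat([tokens[:, 0], nodes[:, 0]], dim=-1))
        int_tok, int_node = fused.chunk(2, dim=-1)
        tokens = torch.cat([int_tok.unsqueeze(1), tokens[:, 1:]], dim=1)
        nodes = torch.cat([int_node.unsqueeze(1), nodes[:, 1:]], dim=1)
        return tokens, nodes

# Example: B=2 pairs, T=16 tokens, J=8 KG nodes, hidden size 64
B, T, J, d = 2, 16, 8, 64
layer = FusionLayer(d)
adj = torch.softmax(torch.randn(B, J, J), dim=-1)   # stand-in for a normalized adjacency
tokens, nodes = layer(torch.randn(B, T, d), torch.randn(B, J, d), adj)
print(tokens.shape, nodes.shape)
```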

Pre-training objective

Our goal is to pre-train the DRAGON model so that it learns joint reasoning over text and KG. To ensure that text and KG inform each other and the model learns a bidirectional information flow, we unify two self-supervised reasoning tasks: masked language modeling and KG link prediction.

**MLM:** MLM is a common pre-training task for language models: some tokens in the input text are masked and the model is trained to predict them. This task makes the model reason about masked tokens from the unmasked context; in particular, since our method takes joint text-KG pairs as input, we expect MLM to encourage the model to use the structured knowledge in the KG together with the textual context to predict the masked tokens (e.g., in example 1 of Figure 1, besides the textual context, recognizing the KG path from "round_brush" to "art_supply" helps predict the masked span "art supply").

Specifically, to perform the MLM task, we mask a subset of tokens M ⊆ W in the input text with a special token [MASK], and let the task head f_head be a linear layer that takes the contextualized token vectors {H_i} and predicts the original tokens. The objective is the cross-entropy loss:

L_MLM = Σ_{w ∈ M} CrossEntropy(f_head(H_w), w)
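A minimal sketch of this masking and loss computation, with toy tensor sizes standing in for the encoder outputs and vocabulary, might look as follows.

```python
import torch
import torch.nn as nn

vocab_size, d, T = 1000, 64, 16
MASK_ID = 3
input_ids = torch.randint(4, vocab_size, (2, T))   # a toy batch of token ids
labels = input_ids.clone()

# Mask ~15% of the tokens; only masked positions contribute to the loss.
mask = torch.rand(input_ids.shape) < 0.15
mask[0, 0] = True                                  # ensure at least one masked position
input_ids[mask] = MASK_ID
labels[~mask] = -100                               # ignored by cross-entropy

H = torch.randn(2, T, d)                           # stand-in for contextualized token vectors
f_head = nn.Linear(d, vocab_size)                  # linear MLM task head
logits = f_head(H)
loss_mlm = nn.functional.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100
)
print(loss_mlm.item())
```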

Link Prediction (LinkPred): While the MLM task makes predictions on the text side, link prediction holds out some edges of the input KG and predicts them. Link prediction is a fundamental task on KGs that makes the model use the KG structure to reason (e.g., using the compositional path "X's mother's husband is Y" to infer the missing link "X's father is Y"). In particular, since our method takes joint text-KG pairs as input, we expect link prediction to encourage the model to use the KG structure jointly with the textual context to infer missing links in the KG (e.g., in Figure 1, besides the KG structure, recognizing "a round brush can be used on hair" in the text helps predict the held-out edge (round_brush, at, hair)).

Specifically, to perform the link prediction task, we hold out a subset of edge triplets S = {(h, r, t)} ⊆ E from the input KG. For the task head f_head, we adopt a KG representation learning framework: each entity node (h or t) and relation r is mapped to a vector h, t, r, and a scoring function φ_r(h, t) models how likely a triplet is positive or negative. Concretely, we let h = V_h, t = V_t, and r = R_r, where {V_j} are the contextualized node vectors from the encoder and R = {r_1, ..., r_{|R|}} are learnable relation embeddings. We consider KG triplet scoring functions φ_r(h, t) such as DistMult:

φ_r(h, t) = ⟨h, r, t⟩ = Σ_i h_i · r_i · t_i   (DistMult)

Here ⟨·, ·, ·⟩ denotes the trilinear dot product and ⊙ denotes the Hadamard (element-wise) product; the higher φ_r(h, t) is, the more likely (h, r, t) is a positive triplet (an existing edge) rather than a negative one (no edge).
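The sketch below shows standard, plain-tensor implementations of such scoring functions (DistMult, TransE, RotatE); the dimensions and the half-size phase vector for RotatE are illustrative choices, not values from the paper.

```python
import torch

def distmult(h, r, t):
    # trilinear dot product <h, r, t> = sum_i h_i * r_i * t_i
    return (h * r * t).sum(dim=-1)

def transe(h, r, t):
    # higher score = more plausible, so negate the translation distance ||h + r - t||
    return -(h + r - t).norm(p=2, dim=-1)

def rotate(h, r_phase, t):
    # RotatE: view h, t as complex vectors and rotate h by unit-modulus phases e^{i * r_phase}
    h_c = torch.view_as_complex(h.reshape(-1, 2))
    t_c = torch.view_as_complex(t.reshape(-1, 2))
    rot = torch.polar(torch.ones_like(r_phase), r_phase)
    return -(h_c * rot - t_c).abs().sum(dim=-1)

d = 64
h, r, t = torch.randn(d), torch.randn(d), torch.randn(d)
print(distmult(h, r, t).item(), transe(h, r, t).item())
print(rotate(h, torch.randn(d // 2), t).item())   # RotatE uses a phase vector of half the dimension
```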

The optimization objective is:

[Equation: margin-based loss over the held-out positive triplets (h, r, t) ∈ S, n sampled negative triplets (h'_i, r, t'_i), and margin γ]

Here (h, r, t) ∈ S are the held-out positive triplets, (h'_i, r, t'_i) are the n sampled negative triplets (obtained by corrupting the head or tail entity), and γ is the margin. This objective trains the model to predict the held-out edges in S as positive and other random triplets as negative.
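Since the blog does not reproduce the exact formula, the sketch below uses a standard negative-sampling margin loss (in the style of RotatE's loss) with DistMult scores as one plausible instantiation; the tensor shapes and the tail-corruption scheme are assumptions for illustration and may differ from the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distmult(h, r, t):
    return (h * r * t).sum(dim=-1)

def linkpred_loss(score_pos, score_neg, gamma=1.0):
    # score_pos: (P,)   scores of the held-out positive triplets (h, r, t) in S
    # score_neg: (P, n) scores of n corrupted (negative) triplets per positive
    loss_pos = -F.logsigmoid(score_pos - gamma)               # push positives above the margin
    loss_neg = -F.logsigmoid(gamma - score_neg).mean(dim=-1)  # push negatives below the margin
    return (loss_pos + loss_neg).mean()

# Toy setup: 20 KG nodes, 5 relations, 4 held-out edges, 8 negatives per edge
P, n, d = 4, 8, 64
node_vecs = torch.randn(20, d)              # contextualized node vectors V from the encoder
rel_emb = torch.randn(5, d)                 # learnable relation embeddings R
heads = torch.randint(0, 20, (P,))
tails = torch.randint(0, 20, (P,))
rels = torch.randint(0, 5, (P,))

score_pos = distmult(node_vecs[heads], rel_emb[rels], node_vecs[tails])
neg_tails = torch.randint(0, 20, (P, n))    # negatives: corrupt the tail entity
score_neg = distmult(node_vecs[heads].unsqueeze(1), rel_emb[rels].unsqueeze(1),
                     node_vecs[neg_tails])
print(linkpred_loss(score_pos, score_neg).item())
```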

Joint training: To pre-train DRAGON, we jointly optimize the MLM and LinkPred objectives:
L = L_MLM + L_LinkPred

This joint objective unifies the effects of MLM and LinkPred: it encourages the model to simultaneously ground the text in the KG structure and contextualize the KG with the text, facilitating a bidirectional information flow between text and KG for reasoning. Subsequent experiments show that the joint objective yields better-performing models than either objective alone.
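A tiny runnable sketch of one joint pre-training step on dummy tensors is given below; the 1:1 sum of the two losses mirrors the stated objective, while the toy MLM head, relation table, single held-out edge, and omission of negative sampling are simplifications for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, vocab_size, T, J = 64, 1000, 16, 8
mlm_head = nn.Linear(d, vocab_size)                         # MLM task head
rel_emb = nn.Parameter(torch.randn(5, d))                   # relation embeddings for LinkPred
optimizer = torch.optim.AdamW(list(mlm_head.parameters()) + [rel_emb], lr=1e-4)

# Stand-ins for the cross-modal encoder outputs of one (text, local KG) pair
H = torch.randn(T, d)                                       # contextualized token vectors
V = torch.randn(J, d)                                       # contextualized node vectors
mlm_labels = torch.full((T,), -100, dtype=torch.long)
mlm_labels[2] = 7                                           # one masked position to predict
h_idx, r_idx, t_idx = 0, 1, 3                               # one held-out edge

loss_mlm = F.cross_entropy(mlm_head(H), mlm_labels, ignore_index=-100)
score = (V[h_idx] * rel_emb[r_idx] * V[t_idx]).sum()        # DistMult score of the held-out edge
loss_linkpred = -F.logsigmoid(score)                        # simplified: no negative samples here
loss = loss_mlm + loss_linkpred                             # joint objective L = L_MLM + L_LinkPred
optimizer.zero_grad(); loss.backward(); optimizer.step()
print(loss.item())
```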

Experimental results

[Table 1: performance on 9 downstream commonsense reasoning benchmarks]

Table 1 shows performance on 9 downstream commonsense reasoning tasks. DRAGON consistently outperforms the existing LM (RoBERTa) and KG-augmented QA models (QAGNN, GreaseLM) on all tasks, e.g., a +7% absolute accuracy gain over RoBERTa and +5% over GreaseLM on OBQA. These gains demonstrate DRAGON's advantage over RoBERTa (from KG reasoning) and over GreaseLM (from pre-training). The gain is especially significant on datasets with small amounts of training data (such as ARC, RiddleSense, and OBQA) and on datasets that require complex reasoning (such as CosmosQA and HellaSwag).

Analysis

The role of the knowledge graph

The first key contribution of DRAGON is its use of the KG, which the authors find significantly improves the model's robustness and complex reasoning.

**Quantitative analysis.** In Table 2, the authors examine DRAGON's downstream performance on questions that involve complex reasoning, using several proxies to identify complex questions:

(i) presence of negation (e.g., no, never); (ii) presence of conjunctions (e.g., and, but); (iii) presence of hedges (e.g., sometimes, maybe); (iv) number of prepositional phrases; (v) number of entity mentions.

Negation and conjunctions indicate logical, multi-step reasoning; more prepositional phrases or entity mentions indicate that more reasoning steps or constraints are involved; and hedges indicate complex textual nuance. A minimal heuristic for these proxies is sketched below.
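The following is a minimal sketch of such surface-level proxies; the word lists and counting rules are illustrative guesses rather than the ones used in the paper.

```python
import re

NEGATION = {"no", "not", "never"}
CONJUNCTION = {"and", "but", "or"}
HEDGE = {"sometimes", "maybe", "perhaps", "might"}
PREPOSITIONS = {"in", "on", "at", "of", "with", "for", "from", "to"}

def complexity_proxies(question, entity_mentions):
    tokens = re.findall(r"[a-z']+", question.lower())
    return {
        "has_negation": any(t in NEGATION for t in tokens),
        "has_conjunction": any(t in CONJUNCTION for t in tokens),
        "has_hedge": any(t in HEDGE for t in tokens),
        "num_prep_phrases": sum(t in PREPOSITIONS for t in tokens),  # crude proxy
        "num_entity_mentions": len(entity_mentions),
    }

q = "If you never clean a round brush, what might happen to the hair on it?"
print(complexity_proxies(q, entity_mentions=["round_brush", "hair"]))
```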

DRAGON significantly outperforms the baseline LM (RoBERTa) in all of these categories (e.g., +14% accuracy on questions with negation), confirming that language-knowledge pre-training improves reasoning. DRAGON also consistently outperforms existing KG-augmented QA models (QAGNN, GreaseLM). The authors found that QAGNN and GreaseLM show only modest improvements over RoBERTa in certain categories, such as conjunctions or many prepositional phrases (2 or 3), whereas DRAGON provides substantial improvements. This shows that through self-supervised pre-training on larger and more diverse data, DRAGON learns more general reasoning ability than fine-tuning-only models such as GreaseLM.

Qualitative analysis. Using the CSQA dataset, the authors further conducted a case study of DRAGON's KG reasoning component, visualizing how the attention weights change under different variations of a question (Figure 2 below). DRAGON demonstrates an ability to extrapolate and reason robustly. Since these questions are more complex than those typically seen in the CSQA training set, the authors' takeaway is that KG-augmented pre-training (DRAGON) helps the model acquire generalizable reasoning ability that extrapolates to harder test examples, whereas a plain LM (RoBERTa) and fine-tuning alone (GreaseLM) are limited in learning such complex reasoning.

[Figure 2: attention-weight visualizations of DRAGON's KG reasoning component under different question variations on CSQA]

The role of pre-training

Downstream tasks with limited data. In Table 1, DRAGON provides significant improvements over GreaseLM on downstream tasks with limited fine-tuning data, such as ARC (3K training instances; +4% accuracy), RiddleSense (3K instances; +4% accuracy), and OBQA (5K instances; +5% accuracy). For the other tasks, a low-resource setting using 10% of the fine-tuning data was also tested (Table 3); here too DRAGON achieves significant gains over GreaseLM (+5% accuracy on PIQA), indicating improved data efficiency.

Complex downstream tasks. In Table 1, DRAGON provides greater gains than GreaseLM on downstream tasks involving more complex reasoning, such as CosmosQA and HellaSwag, where the input has longer context and more entities (hence larger local KGs). On these tasks, GreaseLM improves very little over RoBERTa (+0.1% on CosmosQA), but DRAGON provides a considerable boost (+1.8%). The takeaway is that through self-supervised pre-training on larger and more diverse data, DRAGON learns richer text-KG interactions than GreaseLM and can solve more complex downstream tasks. DRAGON also achieves large gains over GreaseLM on complex questions containing negations, conjunctions, and prepositional phrases (Table 2), and handles questions more complex than those in the training set (Figure 2).

Increased model capacity. Table 4 examines the downstream performance of GreaseLM and DRAGON when model capacity is increased, i.e., the number of text-KG fusion layers is raised from 5 to 7. As reported in the original GreaseLM paper, the added capacity does not help the fine-tuning-only model (GreaseLM), but it does help with pre-training (DRAGON). This shows that increased model capacity becomes beneficial when combined with pre-training, and suggests that DRAGON's promise will scale further.

DRAGON's design choices

Pre-training objectives (Table 5, top). The first important design choice of DRAGON is the joint pre-training objective, MLM + LinkPred (§2.3 of the paper). The joint objective outperforms MLM or LinkPred alone (+5% accuracy on OBQA), showing that posing bidirectional self-supervised tasks over text and KG helps the model fuse the two modalities for reasoning.

Choice of link prediction head (Table 5, middle). KG representation learning is an active research area, and various KG triplet scoring models have been proposed (cf. the scoring functions φ_r above). The authors therefore experimented with different scoring models for DRAGON's link prediction head and found that, while DistMult had a slight edge, all variants they tried (DistMult, TransE, RotatE) were effective and outperformed a baseline without LinkPred ("MLM only"). This result demonstrates DRAGON's generality and its promise for combination with various KG representation learning techniques.

Cross-modal encoder (Table 5, middle). Another core component of DRAGON is the cross-modal encoder with bidirectional text-KG fusion layers. The authors found that removing these layers and simply concatenating the text and KG representations at the end drops performance dramatically, showing that deep bidirectional fusion is crucial for modeling the interaction between text and KG.

KG structure (Table 5, bottom). The final key design choice of DRAGON is exploiting the graph structure of the KG through the sequence-graph encoder and the link prediction objective. Here the authors try an alternative pre-training method that removes the graph structure: triples in the local KG are converted into sentences using templates, appended to the main text input, and used for ordinary MLM pre-training. DRAGON performs far better than this variant (+2% accuracy on OBQA), suggesting that the graph structure of the KG helps the model reason. A toy sketch of such template verbalization follows.
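The relation templates and relation names below are illustrative, not those used in the paper; the sketch only shows the general idea of flattening the local KG into text for plain MLM.

```python
# Toy sketch of the "remove graph structure" ablation: verbalize each local-KG triple
# with a relation template and append the sentences to the text input.
TEMPLATES = {
    "is_a": "{h} is a kind of {t}.",
    "at": "{h} can be found at {t}.",
    "used_for": "{h} is used for {t}.",
}

def verbalize(triples):
    sents = [TEMPLATES.get(r, "{h} is related to {t}.").format(h=h.replace("_", " "),
                                                               t=t.replace("_", " "))
             for h, r, t in triples]
    return " ".join(sents)

local_kg = [("round_brush", "is_a", "art_supply"), ("round_brush", "at", "hair")]
text_segment = "A round brush is an art supply that can be used on hair."
mlm_input = text_segment + " " + verbalize(local_kg)   # fed to an ordinary MLM objective
print(mlm_input)
```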


Origin blog.csdn.net/gary101818/article/details/130317801