HiCLRE: A Hierarchical Contrastive Learning Framework for Distantly Supervised Relation Extraction


Paper: https://aclanthology.org/2022.findings-acl.202.pdf

Code: https://github.com/MatNLP/HiCLRE

Journal/Conference: Findings of ACL 2022

Summary

Distant supervision assumes that any sentence containing the same entity pair expresses the same relation. Previous distantly supervised relation extraction (DSRE) methods usually focus on sentence-level or bag-level denoising independently, ignoring explicit interactions across levels. This paper proposes a Hierarchical Contrastive Learning framework (HiCLRE) for distantly supervised relation extraction that reduces the impact of noisy sentences by integrating global structural information and local fine-grained interactions. Specifically, a three-level hierarchical learning framework lets the levels interact with each other to generate denoised, context-aware representations by adapting multi-head self-attention, a mechanism called Multi-Granularity Recontextualization. At the same time, a dynamic gradient-based data augmentation strategy, dynamic gradient adversarial perturbation, constructs pseudo-positive samples at each specific level for contrastive learning. Experiments show that HiCLRE significantly outperforms strong baselines on various mainstream DSRE datasets.

1. Introduction

Distantly Supervised Relation Extraction (DSRE) alleviates the cost and sparsity of manual annotation by automatically generating training samples. However, distant supervision introduces noisy data, which can hurt model performance. To address this, Multiple Instance Learning (MIL) groups sentences into bags and assumes that "at least one" sentence in each bag expresses the correct relation triple.

Previous DSRE methods mainly fall into two lines: sentence-level and bag-level denoising. As shown in Figure 1, both levels rely heavily on entity-level semantic information, yet there is a large gap between the semantic information captured at the different levels.

To overcome the above challenges, we propose a hierarchical contrastive learning framework for distantly supervised relation extraction (HiCLRE), which facilitates semantic interactions within specific levels and across levels:

(1) Multi-Granularity Recontextualization: To capture cross-level structural information, we adapt the multi-head self-attention mechanism to three granularities: entity level, sentence level, and bag level. The contextual features of each level are aligned separately to the inputs of the attention mechanism, and the attention scores aggregated from the other two levels select a refined, recontextualized representation carrying interaction semantics for the corresponding level.

(2) Dynamic Gradient Adversarial Perturbation: To obtain more accurate level-specific representations, we use gradient-based contrastive learning (Hadsell et al., 2006; van den Oord et al., 2018) to pull the constructed pseudo-positive samples closer and push the negative samples apart. Specifically, we compute a dynamic perturbation from two components: the normalized gradient of the task loss, and a time-weighted memory similarity between the representations of the previous and current epochs.

The main contributions of this paper:

  • We propose a Hierarchical Contrastive Learning framework (HiCLRE) for the DSRE task, which takes full advantage of semantic interactions within specific levels and across levels, reducing the impact of noisy data.
  • Multi-Granularity Recontextualization is proposed to enhance cross-level interactions, and dynamic gradient adversarial perturbation learns better representations within three specific levels.
  • Extensive experiments show that our model outperforms strong baselines on DSRE datasets, and detailed analyses show that each module is effective.

2. Related work

2.1 Relation Extraction with Distant Supervision

Work in this area can be divided into two categories: methods based on hand-crafted features and methods based on neural network representations.

2.2 Contrastive learning

Loss functions: NCE distinguishes clean samples from noisy samples through a probability density function; InfoNCE builds on NCE to maximize the similarity of positive pairs while minimizing that of negative pairs.

Data augmentation: augmentation via simple text operations, e.g., EDA (synonym replacement, insertion, deletion) and CIL (using TF-IDF to insert/replace unimportant words in an instance to construct positive samples); augmentation in embedding space, e.g., ConSERT and SimCSE; and augmentation with external knowledge, e.g., ERICA.

These augmentation methods all operate at the data level and ignore signals from the model training process itself. The model proposed in this paper captures global structural information and lets the different levels interact and refine one another.

3. Method

3.1 Model overview

The overall structure of the model is shown in Figure 2. HiCLRE consists of two main parts: Multi-Granularity Recontextualization integrates cross-level importance to decide which valuable representations should be extracted into the target level, and Dynamic Gradient Adversarial Perturbation enhances intra-level semantics by constructing pseudo-positive samples for each specific level.

In HiCLRE, the input sample is a sentence $S_{ij}=(t_{i1},t_{i2},\ldots,t_{ik})$, where $S_{ij}$ denotes the $i$-th sentence in bag $B_j$, $k$ is the number of tokens in $S_{ij}$, and $j$ is the index of the bag. $e_{i1}$ and $e_{i2}$ are the head and tail entities in sentence $S_{ij}$. Each bag contains $n$ sentences: $B_j=(S_{1j},S_{2j},\ldots,S_{nj})$. The goal of the model is to predict the specific relation $r_j$ of bag $B_j$ out of $|r|$ relation types. $d$ denotes the hidden dimension of the pretrained language model.
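To make the notation concrete, here is a minimal sketch of how the input might be organized (our own illustration, not the authors' code; the field names are hypothetical):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Sentence:
    """One sentence S_ij: token ids plus head/tail entity positions."""
    token_ids: List[int]       # (t_i1, ..., t_ik)
    head_pos: int              # index of head entity e_i1
    tail_pos: int              # index of tail entity e_i2

@dataclass
class Bag:
    """One bag B_j: n sentences sharing the same entity pair and relation label."""
    sentences: List[Sentence]  # (S_1j, ..., S_nj)
    relation_id: int           # r_j, one of |r| relation types
```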

3.2 Hierarchical Learning Modeling

3.2.1 Sentence representation

Specifically, the input to the sentence encoder is the sentence $S_{ij}$ together with the token sequences of its head entity $e_{i1}$ and tail entity $e_{i2}$. The text encoder sums each token's token embedding, segment embedding, and position embedding to obtain its input embedding, and then computes a context-aware hidden representation $H=\{h_{t_{i1}},h_{t_{i2}},\ldots,h_{e_{i1}},\ldots,h_{e_{i2}},\ldots,h_{t_{ik}}\}$:

$$H=\mathcal{F}(\{t_{i1},t_{i2},\ldots,t_{ik}\}) \tag{1}$$

where $\mathcal{F}$ is a pretrained language model encoder and $H \in \mathbb{R}^{k \times d}$. The sentence embedding is computed from the head entity, tail entity, and [CLS] representations:

$$h_{S_{ij}}=\sigma([h_{e_{i1}} \parallel h_{e_{i2}} \parallel h_{[CLS]}] \cdot W_S)+b_S \tag{2}$$

where $\parallel$ is the concatenation operation, $W_S \in \mathbb{R}^{3d \times d}$ and $b_S$ are the weight matrix and bias respectively, and $\sigma$ is a non-linear activation function.
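A rough PyTorch sketch of Eqs. (1)-(2), assuming a BERT-style encoder (our own illustration; `tanh` as $\sigma$ and the bias placement inside the linear layer are simplifications):

```python
import torch
import torch.nn as nn
from transformers import BertModel

class SentenceEncoder(nn.Module):
    def __init__(self, d: int = 768):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")  # F in Eq. (1)
        self.W_S = nn.Linear(3 * d, d)  # W_S (bias b_S folded in) from Eq. (2)

    def forward(self, input_ids, attention_mask, head_pos, tail_pos):
        # H in Eq. (1): context-aware hidden states, shape (batch, k, d)
        H = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        idx = torch.arange(H.size(0))
        h_head = H[idx, head_pos]   # h_{e_i1}
        h_tail = H[idx, tail_pos]   # h_{e_i2}
        h_cls = H[:, 0]             # h_[CLS]
        # Eq. (2): concatenate, project, and apply a non-linearity
        return torch.tanh(self.W_S(torch.cat([h_head, h_tail, h_cls], dim=-1)))
```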

3.2.2 Bag representation

A sentence-level attention mechanism is used to generate a combined bag representation. $h_{B_j} \in \mathbb{R}^d$ denotes the representation of the bag, computed from the sentence attention weights $\alpha_{ij}$ and the hidden representations $h_{S_{ij}}$:

$$h_{B_j}=\sum_{i=1}^{n} \alpha_{ij} h_{S_{ij}} \tag{3}$$

To avoid simply treating every sentence in the bag equally, the selective attention mechanism assigns importance weights for denoising. Each weight $\alpha_{ij}$ is generated by a query-based function:

$$\alpha_{ij}=\frac{\exp (f_{ij})}{\sum_{n} \exp (f_{ij})} \tag{4}$$

where $f_{ij}$ measures how well the input sentence $S_{ij}$ matches the predicted relation $r_j$:

$$f_{ij}=h_{S_{ij}} \mathbf{A_j} \mathbf{r_j} \tag{5}$$

where $\mathbf{A_j}\in \mathbb{R}^{d \times d}$ is a weighted diagonal matrix and $\mathbf{r_j} \in \mathbb{R}^{d}$ is a query representation mapped from the relation label. The relation type of bag $B_j$ is then predicted by:

$$p(r_j|h_{B_j},\theta)=\frac{\exp (O_r)}{\sum_{p=1}^{|r|} \exp (O_p)} \tag{6}$$

$$O_r=\sigma(W_r \cdot h_{B_j})+b_r \tag{7}$$

where $W_r \in \mathbb{R}^{|r| \times d}$ is a trainable transformation matrix, $b_r \in \mathbb{R}^{|r|}$ is the bias, and $\theta$ denotes the parameters of the bag model. $O_r \in \mathbb{R}^{|r|}$ is the final output of the model, scoring all relation types. The relation classification objective of the DSRE task is defined as:

$$L_{task}=-\sum_{j=1}^{|r|} \log p(r_j|h_{B_j},\theta) \tag{8}$$
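A minimal PyTorch sketch of the selective attention in Eqs. (3)-(8) (again our own illustration, not the released code; the diagonal matrix $\mathbf{A_j}$ is folded into a single learned relation query here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveAttention(nn.Module):
    def __init__(self, d: int, num_rel: int):
        super().__init__()
        self.rel_query = nn.Embedding(num_rel, d)  # A_j * r_j folded into one query
        self.classifier = nn.Linear(d, num_rel)    # W_r, b_r in Eq. (7)

    def forward(self, h_sent, rel_id):
        # h_sent: (n, d) sentence representations of one bag; rel_id: label r_j
        q = self.rel_query(rel_id)                 # (d,)
        f = h_sent @ q                             # Eq. (5): matching scores f_ij
        alpha = F.softmax(f, dim=0)                # Eq. (4): attention weights
        h_bag = (alpha.unsqueeze(-1) * h_sent).sum(0)  # Eq. (3): bag representation
        logits = self.classifier(h_bag)            # Eq. (7): scores O over |r| relations
        return h_bag, logits

# Eq. (8) is then the cross-entropy over bags, e.g.:
# loss = F.cross_entropy(logits.unsqueeze(0), rel_id.unsqueeze(0))
```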

3.3 Multi-Granularity Recontextualization

The hierarchical learning process above ignores explicit interactions across levels that could refine the representations at each level. Therefore, after the PLM-generated hidden representations are updated, HiCLRE recontextualizes the enhanced representation at each level. This is achieved with a modified Transformer layer (Vaswani et al., 2017) that replaces multi-head self-attention with multi-head attention between the target level's representation and those of the other two levels.

The multi-head attention mechanism is defined as:

$$Att(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{9}$$

Taking the bag level as an example, the query, key, and value come from the three different levels:

$$h_{B_j}'=MLP(Att(h_e,h_{S_{ij}},H_{B_j})) \tag{10}$$

where MLP is a multi-layer linear function.

We concatenate the enhanced target-level representation with the original hierarchical hidden state to obtain an interaction-aware representation for that level:

$$h_{B_{att_j}}=\sigma([h_{B_j} \parallel h_{B_j}'] \cdot W_{att})+b_{att} \tag{11}$$

The three enhanced representations $h_{e_{att_j}}$, $h_{S_{att_j}}$, and $h_{B_{att_j}}$ then replace the corresponding hierarchical hidden representations.
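A sketch of the bag-level recontextualization in Eqs. (9)-(11), using `nn.MultiheadAttention` as the attention module (an assumption on our part; the paper adapts its own Transformer layer):

```python
import torch
import torch.nn as nn

class Recontextualizer(nn.Module):
    """Cross-level attention for the bag level: query from the entity level,
    key from the sentence level, value from the bag level, as in Eq. (10)."""
    def __init__(self, d: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.mlp = nn.Linear(d, d)        # the MLP after attention
        self.W_att = nn.Linear(2 * d, d)  # W_att, b_att in Eq. (11)

    def forward(self, h_entity, h_sent, h_bag):
        # All inputs: (batch, seq, d); this sketch assumes the three levels have
        # been broadcast to a common sequence length, as attention requires.
        ctx, _ = self.attn(query=h_entity, key=h_sent, value=h_bag)  # Eqs. (9)-(10)
        h_bag_prime = self.mlp(ctx)
        # Eq. (11): fuse original and recontextualized representations
        return torch.tanh(self.W_att(torch.cat([h_bag, h_bag_prime], dim=-1)))
```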

3.4 Dynamic Gradient Adversarial Perturbation

3.4.1 Gradient perturbation

The continuous gradient perturbation $pt_{adv}$ is computed from the gradient $g$ of the task loss with respect to the parameter $V$:

$$g_j=\nabla_V L_{task}(h_{B_j};\theta) \tag{12}$$

where $V$ is the sentence representation of the bag; the sentence-level perturbation is generated analogously by differentiating with respect to the entities, and the entity-level perturbation with respect to the tokens.

$$pt_{adv_j} = \epsilon \cdot \frac{g_j}{\parallel g_j \parallel} \tag{13}$$

where $\parallel g_j \parallel$ is the norm of the gradient of the loss function, and $\epsilon$ is a hyperparameter controlling the degree of perturbation.

As the number of training epochs increases, we use temporal information at different granularities to further improve the robustness of the internal semantics. Specifically, we add an inertia weight term (Shi and Eberhart, 1998) to the perturbation, which exploits the representation difference between the previous and current epochs. The inertia weight information is expressed as:

$$I_w =\frac{T-u}{T} sim(rep_{(u)},rep_{(u-1)}) \tag{14}$$

where $T$ is the total number of training epochs, $u$ is the current epoch index, and $rep_{(u)}$ denotes the entity, sentence, or bag representation at epoch $u$. $rep$ is an embedding matrix that stores semantic memory in element-index order and is updated from the second epoch onward during training. We then combine the inertia weight information with the bag-level gradient perturbation:

$$pt_{adv_j} =\epsilon \frac{g_j}{\parallel g_j \parallel} + \frac{T-u}{T} sim(rep_{(u)},rep_{(u-1)}) \tag{15}$$
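A hedged sketch of Eqs. (12)-(15); taking `sim` to be cosine similarity, and with the representation memory `rep_prev` as our own bookkeeping assumption rather than the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def dynamic_perturbation(h_bag, task_loss, rep_prev, epoch, total_epochs, eps=1.0):
    """Build a pseudo-positive sample h_bag + pt_adv as in Eq. (15)."""
    # Eq. (12): gradient of the task loss w.r.t. the bag representation
    (g,) = torch.autograd.grad(task_loss, h_bag, retain_graph=True)
    # Eq. (13): normalized gradient perturbation
    pt = eps * g / (g.norm() + 1e-12)
    # Eq. (14): inertia weight from the previous epoch's stored representation
    if rep_prev is not None:
        i_w = (total_epochs - epoch) / total_epochs * F.cosine_similarity(
            h_bag.flatten(), rep_prev.flatten(), dim=0)
        pt = pt + i_w  # Eq. (15): combine the two terms
    return (h_bag + pt).detach()  # pseudo-positive sample for contrastive learning
```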
Finally, the InfoNCE loss function is applied, taking the perturbed representation $h_{B_j}'$ as the positive sample:

$$\mathcal{L}_{\text{bag}}^{\text{info}}=-\log \frac{\exp\left(\cos\left(h_{B_{j}}, h_{B_{j}}^{\prime}\right) / \tau\right)}{\sum_{k=1}^{m} \mathbb{1}_{[k \neq j]} \exp\left(\cos\left(h_{B_{j}}, h_{B_{kj}}\right) / \tau\right)} \tag{16}$$

where $\tau$ is a temperature hyperparameter and the other bag representations serve as negatives.
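For reference, Eq. (16) as a small PyTorch function (our own sketch; in-batch bag representations serve as the negatives):

```python
import torch
import torch.nn.functional as F

def info_nce(h, h_pos, tau=0.05):
    """h: (m, d) bag representations; h_pos: (m, d) their pseudo-positive views."""
    pos = F.cosine_similarity(h, h_pos, dim=-1) / tau  # cos(h_Bj, h'_Bj) / tau
    h_n = F.normalize(h, dim=-1)
    neg = (h_n @ h_n.T) / tau                          # cos(h_Bj, h_Bk) / tau
    neg.fill_diagonal_(float("-inf"))                  # indicator 1[k != j]
    denom = torch.logsumexp(neg, dim=-1)
    return (denom - pos).mean()  # -log( exp(pos) / sum_{k != j} exp(neg_k) )
```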

3.5 Training Objectives

The overall training objective combines the three level-specific contrastive losses with the task loss:

$$L_{total}=\lambda_1 L_{ent}^{info}+\lambda_2 L_{sen}^{info}+\lambda_3 L_{bag}^{info}+\lambda_4 L_{task} \tag{17}$$

where each $\lambda_l$ is a hyperparameter and $\sum_{l=1}^{4} \lambda_l=1$.
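Putting the pieces together, Eq. (17) is a convex combination of the four losses (the weights below are placeholders, not the paper's tuned values):

```python
def total_loss(l_ent, l_sen, l_bag, l_task, lambdas=(0.1, 0.2, 0.3, 0.4)):
    """Eq. (17): weighted sum of the three InfoNCE losses and the task loss."""
    assert abs(sum(lambdas) - 1.0) < 1e-6  # the lambdas must sum to 1
    return (lambdas[0] * l_ent + lambdas[1] * l_sen
            + lambdas[2] * l_bag + lambdas[3] * l_task)
```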

4. Experiment

6. Summary

In this paper, we propose HiCLRE, a hierarchical contrastive learning framework for distantly supervised relation extraction. HiCLRE's Multi-Granularity Recontextualization module adapts multi-head self-attention to transfer information across the three levels. The Dynamic Gradient Adversarial Perturbation module combines gradient perturbation with inertia memory information to construct better pseudo-positive samples for contrastive learning. Experiments demonstrate the effectiveness of HiCLRE over strong baselines on various DSRE datasets.
