ACL2022 Document-Level Event Argument Extraction via Optimal Transport

Document-Level Event Argument Extraction via Optimal Transport

Paper: https://aclanthology.org/2022.findings-acl.130/

Code: -

Journal/Conference: Findings of ACL 2022

Summary

Event Argument Extraction (EAE) is a subtask of event extraction that aims to identify the role each entity plays with respect to a specific event trigger word. Although previous work has achieved success in sentence-level EAE, document-level EAE has been less explored. In particular, while the syntactic structure of sentences has been shown to be effective for sentence-level EAE, previous document-level EAE models completely ignore the syntactic structure of documents. In this work, we therefore investigate the importance of syntactic structure for document-level EAE. Specifically, we propose to use Optimal Transport (OT) to induce a document structure, tailored to the EAE task, from sentence-level syntactic structures. Furthermore, we propose a novel regularization technique to explicitly constrain the contributions of irrelevant context words to the final EAE predictions. Extensive experiments on RAMS, a benchmark dataset for document-level EAE, show that our model achieves state-of-the-art performance. Moreover, experiments on the ACE 2005 dataset reveal the effectiveness of the proposed model for sentence-level EAE, establishing new state-of-the-art results there as well.

1 Introduction

Better methods are needed to prune dependency-based document structures so that important words are preserved and noisy words are excluded. Unlike previous work that employs simple syntax-based rules, i.e., distances to the dependency path (Zhang et al., 2018), we argue that the pruning operation should also be aware of the semantics of the words. In other words, two criteria should be considered: syntactic and semantic relevance. Specifically, a word is retained in the document structure for document-level EAE if it is close to the event trigger/argument words in the dependency structure (i.e., syntactic importance) and it is semantically related to a word on the dependency path (i.e., semantic importance). Note that the semantic similarity between words can be obtained from model-induced word representations. A key challenge for this idea lies in the different natures of syntactic and semantic distances, which complicates combining them to determine the importance of words for the structure. Furthermore, the decision to retain a word should also take into account the potential contributions of other words to the document structure for EAE. Thus, treating the dependency path as the anchor for pruning the document structure, we propose to cast the joint consideration of syntactic and semantic distances as finding an optimal alignment between non-DP and DP words. This optimal alignment is obtained with an Optimal Transport (OT) approach, in which both the syntactic and semantic distances of words to the words along the dependency path are modeled in a joint optimization problem. OT is an established mechanism for efficiently finding the optimal transportation plan (i.e., alignment) between two sets of points. For our document-level EAE problem, we propose to leverage the semantic similarity of words to obtain their transportation costs, while exploiting their syntactic distances to the event trigger/argument words to compute the word mass distributions for OT. Finally, to prune the document structure, a non-DP word is considered important for the document structure (and thus preserved) if the OT solution aligns it with a DP word. The pruned document structure is then used to learn representation vectors for the input document and perform argument role prediction with Graph Convolutional Networks (GCNs) (Kipf and Welling, 2017).

Although the OT-based pruning method can help exclude words irrelevant to EAE from the document structure, their noisy information may still be encoded in the representations of the relevant words due to the contextualization in the input encoder (e.g., BERT). Therefore, to improve representation learning, we propose to explicitly constrain the impact of irrelevant words on representation learning via a novel regularization technique based on the pruned document structure. In particular, we add the irrelevant words back into the pruned structure (thus restoring the original tree) and ensure that the changes in the representation vectors caused by this addition are minimized. Concretely, in addition to the pruned structure, we apply the GCN model to the original dependency structure to obtain another set of representation vectors for the words. Finally, we add the difference between the representation vectors obtained from the pruned structure and those from the original structure to the final loss function to enforce the contribution constraint on irrelevant words. In our experiments, we evaluate our model on both sentence-level and document-level EAE benchmark datasets, demonstrating the effectiveness of the proposed model by establishing new state-of-the-art results in both settings.

2. Model

Problem Definition: The goal of EAE is to identify the roles of entity mentions with respect to a specific event trigger word. The task can be formulated as a multi-class classification problem. Formally, given a document $D=[w_1,w_2,\ldots,w_n]$, a trigger word $w_t$, and a candidate argument $w_a$, the goal is to predict the role that $w_a$ plays in the event evoked by the trigger $w_t$, choosing from a label set $L=\{l_1,l_2,\ldots,l_m\}$. The label set $L$ contains a special label None to indicate that there is no relation between the argument $w_a$ and the trigger $w_t$.
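To make the task setup concrete, here is a minimal sketch of how a single EAE instance could be represented in code; the field names (`tokens`, `trigger_index`, `argument_index`, `role`) are illustrative assumptions, not the paper's data format.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EAEInstance:
    """One EAE example: predict the role a candidate argument plays
    with respect to a given event trigger, or None if unrelated."""
    tokens: List[str]           # the document D = [w_1, ..., w_n]
    trigger_index: int          # position t of the trigger word w_t
    argument_index: int         # position a of the candidate argument w_a
    role: Optional[str] = None  # gold label from the label set L ("None" = no relation)

# toy example: is "soldiers" the Attacker of the event triggered by "attacked"?
example = EAEInstance(
    tokens=["The", "soldiers", "attacked", "the", "village", "yesterday", "."],
    trigger_index=2,
    argument_index=1,
    role="Attacker",
)
```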

Model Overview: The proposed model consists of four main components: 1) Input Encoder, which represents the words of the document with high-dimensional vectors; 2) Dependency Pruning, which uses Optimal Transport (OT) to prune the dependency structure of the document; 3) Regularization, which explicitly minimizes the contribution of irrelevant words to representation learning; and 4) Prediction, which uses the induced word representations of the document to make the final prediction.

2.1 Input Encoder

Each word $w_i \in D$ is represented by a high-dimensional vector $x_i$, formed by concatenating the following vectors: (a) Contextualized word embedding: we feed the sequence [CLS] $w_1 w_2 \ldots w_n$ [SEP] into the BERT-base model (Devlin et al., 2019) and use the hidden state of $w_i$ in the last layer as its contextualized word embedding; for words split into multiple word pieces, we average their piece representations. (b) Distance embeddings: we use high-dimensional vectors, obtained from a randomly initialized distance-embedding table, to represent the relative distances of $w_i$ to the trigger and argument words (i.e., $|i-t|$ and $|i-a|$); the distance-embedding table is updated during training. In our experiments, we also found it more helpful to keep the BERT parameters fixed. Therefore, to tailor the vectors $x_i$ to the EAE task, we feed $X=[x_1,x_2,\ldots,x_n]$ into a bidirectional long short-term memory network (BiLSTM). The hidden states obtained from the BiLSTM, $H=[h_1,h_2,\ldots,h_n]$, are consumed by the subsequent components.
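As a rough illustration of the encoder described above, here is a minimal PyTorch sketch that concatenates (frozen) BERT last-layer states with trainable distance embeddings relative to the trigger and the argument, and feeds the result through a BiLSTM. The dimensions and the assumption that word-piece vectors are already averaged per word are simplifications, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class InputEncoder(nn.Module):
    def __init__(self, bert_dim=768, dist_dim=30, max_dist=500, lstm_dim=200):
        super().__init__()
        # randomly initialized distance-embedding table, updated during training
        self.dist_emb = nn.Embedding(max_dist + 1, dist_dim)
        self.max_dist = max_dist
        self.bilstm = nn.LSTM(bert_dim + 2 * dist_dim, lstm_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, bert_states, trigger_idx, arg_idx):
        # bert_states: [batch, n, bert_dim] last-layer BERT states (parameters kept fixed),
        # with word-piece vectors already averaged into one vector per word
        n = bert_states.size(1)
        pos = torch.arange(n, device=bert_states.device).unsqueeze(0)          # [1, n]
        d_t = (pos - trigger_idx.unsqueeze(1)).abs().clamp(max=self.max_dist)  # |i - t|
        d_a = (pos - arg_idx.unsqueeze(1)).abs().clamp(max=self.max_dist)      # |i - a|
        x = torch.cat([bert_states, self.dist_emb(d_t), self.dist_emb(d_a)], dim=-1)
        H, _ = self.bilstm(x)   # H = [h_1, ..., h_n], used by the later components
        return H
```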

2.2 Dependency Pruning

To use the syntactic structure of the input document $D$, we exploit the dependency trees of its sentences. Here we use the undirected versions of the dependency trees produced by the Stanford CoreNLP parser. To join the sentence-level dependency trees into a single dependency graph for $D$, similar to (Gupta et al., 2019), we add an edge between the roots of the dependency trees of each pair of consecutive sentences in $D$. The resulting syntactic tree of $D$, called $T$, contains all words $w_i \in D$. The full tree $T$ may contain words that are both relevant and irrelevant to predicting the argument role of $w_a$ with respect to the trigger $w_t$. It is therefore necessary to prune the tree to keep only the relevant words, preventing the potential noise that irrelevant words introduce into representation learning. Inspired by the effectiveness of dependency paths for sentence-level EAE in previous work (Li et al., 2013), we use the dependency path (DP) between the trigger word $w_t$ and the candidate argument $w_a$ in $T$ as the anchor for pruning irrelevant words. In particular, besides the words on the DP (which alone may miss some context words important for prediction), we aim to retain only the non-DP words of $T$ that are syntactically and semantically close to the DP words (i.e., to align non-DP words with DP words). We propose to use Optimal Transport (OT) to jointly consider syntax and semantics for this word alignment. In the following, we first formally describe OT and then provide details on how we leverage it to realize our idea.
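Below is a sketch of how the document graph $T$ and the trigger-argument dependency path (DP) could be assembled from pre-parsed sentences; the use of networkx and the input format (per-sentence head indices plus token offsets) are illustrative choices, not the paper's implementation.

```python
import networkx as nx

def build_document_graph(sent_heads, sent_offsets):
    """sent_heads[s][i] : head index of token i in sentence s (-1 for the sentence root)
       sent_offsets[s]  : document-level index of the first token of sentence s"""
    G = nx.Graph()
    roots = []
    for heads, off in zip(sent_heads, sent_offsets):
        for i, h in enumerate(heads):
            if h < 0:
                G.add_node(off + i)
                roots.append(off + i)            # remember the sentence root
            else:
                G.add_edge(off + i, off + h)     # undirected dependency edge
    for r1, r2 in zip(roots, roots[1:]):         # connect roots of consecutive sentences
        G.add_edge(r1, r2)
    return G                                     # the document tree T

def dependency_path(G, trigger_idx, arg_idx):
    # the DP between w_t and w_a, used as the anchor for pruning
    return nx.shortest_path(G, trigger_idx, arg_idx)
```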

OT is an established method for finding the cheapest plan to transform (i.e., transport) one distribution into another. Formally, given probability distributions $p(x)$ and $q(y)$ over the domains $\mathcal{X}$ and $\mathcal{Y}$, and a cost/distance function $C(x,y): \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$, OT finds the optimal joint distribution $\pi^*(x,y)$ (over $\mathcal{X} \times \mathcal{Y}$) with marginals $p(x)$ and $q(y)$, i.e., the cheapest transportation from $p(x)$ to $q(y)$, by solving:

$$\pi^*(x,y)=\min_{\pi \in \Pi(x,y)} \int_{\mathcal{Y}} \int_{\mathcal{X}} \pi(x,y)\,C(x,y)\,dx\,dy \quad \text{s.t.}\ \ x \sim p(x),\ y \sim q(y)$$

where $\Pi(x,y)$ is the set of all joint distributions with marginals $p(x)$ and $q(y)$. Note that if the distributions $p(x)$ and $q(y)$ are discrete, the integrals above are replaced by sums, and the joint distribution $\pi^*(x,y)$ is represented by a matrix whose entry $(x,y)$ ($x \in \mathcal{X}, y \in \mathcal{Y}$) indicates the probability of transforming the data point $x$ into $y$ when converting the distribution $p(x)$ into $q(y)$. Also note that, to obtain a hard alignment between the data points of $\mathcal{X}$ and $\mathcal{Y}$, each $x \in \mathcal{X}$ can be aligned to the column of $\pi^*(x,y)$ with the highest probability, i.e., $y^* = \arg\max_{y \in \mathcal{Y}} \pi^*(x,y)$, where $y^*$ is the data point in $\mathcal{Y}$ aligned with the data point $x \in \mathcal{X}$.
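As a minimal numpy sketch of the discrete version of this problem, the snippet below solves OT with entropy-regularized Sinkhorn iterations (a common solver choice, assumed here; this summary does not specify the paper's solver). Given masses `p` over $\mathcal{X}$, `q` over $\mathcal{Y}$, and a cost matrix `C`, it returns the transport plan, from which the hard alignment is the row-wise argmax.

```python
import numpy as np

def sinkhorn_ot(p, q, C, eps=0.1, n_iters=200):
    """Approximate the optimal transport plan between discrete distributions
    p (over X) and q (over Y) under the cost matrix C of shape [|X|, |Y|]."""
    K = np.exp(-C / eps)          # Gibbs kernel derived from the costs
    u = np.ones_like(p)
    for _ in range(n_iters):      # alternately rescale to match the two marginals
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan pi* with marginals ~ (p, q)

# hard alignment: each x in X is aligned to the y in Y receiving most of its mass
# pi = sinkhorn_ot(p, q, C); alignment = pi.argmax(axis=1)
```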

The most important and useful property of OT for our problem is that it finds the cheapest transport (i.e., alignment) between two sets of data points according to two criteria: 1) the distances between the data points, and 2) the differences between their probability masses. These two criteria can capture exactly the semantic and syntactic similarities our model needs to find the correspondence between non-DP and DP words. Specifically, we use the words on the DP as the data points of the domain $\mathcal{Y}$ and the words outside the DP as the data points of the domain $\mathcal{X}$. To compute the distributions $p(x)$ and $q(y)$ for $x \in \mathcal{X}, y \in \mathcal{Y}$ (i.e., the probability masses of the data points), we use syntax-based importance scores. Formally, for a word $w_i$, we compute its distances to the trigger word and the candidate argument in the dependency tree (i.e., the lengths of the dependency paths), namely $d^t_i$ and $d^a_i$. The probability mass of the word $x = w_i \in \mathcal{X}$ is then computed as the minimum of the two distances, i.e., $p(x) = \min(d^t_i, d^a_i)$. The distribution $q(y)$ is computed similarly; $p(x)$ and $q(y)$ are normalized with a softmax over their corresponding sets to obtain proper distributions. To obtain the distance/transportation cost $C(x,y)$ for each pair of words $(x,y) \in \mathcal{X} \times \mathcal{Y}$, we exploit their semantic information via their representation vectors $h_x$ and $h_y$ in $H$: $C(x,y) = \| h_x - h_y \|$.
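Continuing the sketch, the masses and costs described above could be computed as follows: tree distances to the trigger/argument give the softmax-normalized masses, and distances between BiLSTM vectors give the cost matrix. The helper assumes the networkx graph from the earlier sketch; variable names are assumptions.

```python
import numpy as np
import networkx as nx

def masses_and_costs(G, H, dp_nodes, non_dp_nodes, trigger_idx, arg_idx):
    """Masses p over non-DP words (X), q over DP words (Y), and cost matrix C."""
    d_t = nx.shortest_path_length(G, source=trigger_idx)   # tree distance to w_t
    d_a = nx.shortest_path_length(G, source=arg_idx)       # tree distance to w_a
    score = lambda i: min(d_t[i], d_a[i])                   # syntax-based importance

    def softmax(s):
        e = np.exp(s - np.max(s))
        return e / e.sum()

    p = softmax(np.array([score(i) for i in non_dp_nodes], dtype=float))
    q = softmax(np.array([score(i) for i in dp_nodes], dtype=float))

    # semantic transportation cost: C(x, y) = ||h_x - h_y||
    C = np.array([[np.linalg.norm(H[x] - H[y]) for y in dp_nodes]
                  for x in non_dp_nodes])
    return p, q, C
```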

With this setup, solving the equation above returns the optimal alignment $\pi^*(x,y)$, which can be used to align every data point in $\mathcal{X}$ with a data point in $\mathcal{Y}$. However, in our problem we seek only a subset of the data points in $\mathcal{X}$ to align with the data points in $\mathcal{Y}$, so that only this subset is preserved in the dependency structure of $D$. Therefore, we add an extra data point "NULL" to $\mathcal{Y}$, whose representation is the average of the representations of all data points in $\mathcal{X}$ and whose probability mass is the average of the probability masses of the data points in $\mathcal{X}$. Alignment with this data point of $\mathcal{Y}$ serves as the null alignment, indicating that the aligned data points of $\mathcal{X}$, i.e., words far from the DP, should not be preserved in the pruned tree. The other words of $\mathcal{X}$ with non-null alignments, denoted by $I$ ($I \subset \mathcal{X}$), are retained in the pruned tree of $D$. Removing the words aligned to NULL from $T$ yields a new graph that contains the words most important for argument role prediction in $D$; we also keep any word on the dependency paths between the trigger/argument words and the words in $I$, resulting in a new graph $T'$ that represents $D$ with its important context words.
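A sketch of the pruning step under these definitions: a NULL point (mean representation and mean mass of $\mathcal{X}$) is appended to $\mathcal{Y}$, the OT plan is solved, non-DP words aligned to NULL are dropped, and the kept words plus their paths to the trigger/argument form $T'$. The helper `sinkhorn_ot` and the graph `G` come from the earlier sketches; renormalizing `q` after appending NULL is an assumption.

```python
import numpy as np
import networkx as nx

def prune_tree(G, H, p, q, C, dp_nodes, non_dp_nodes, trigger_idx, arg_idx):
    # NULL point in Y: representation = mean over X, mass = mean mass of X
    h_null = np.mean([H[x] for x in non_dp_nodes], axis=0)
    c_null = np.array([np.linalg.norm(H[x] - h_null) for x in non_dp_nodes])
    C_null = np.concatenate([C, c_null[:, None]], axis=1)
    q_null = np.append(q, p.mean())
    q_null = q_null / q_null.sum()            # renormalize masses over Y + NULL (assumption)

    pi = sinkhorn_ot(p, q_null, C_null)       # OT plan, from the earlier sketch
    null_col = C_null.shape[1] - 1
    kept = [x for x, row in zip(non_dp_nodes, pi) if row.argmax() != null_col]

    keep = set(dp_nodes)                      # DP words are always retained
    for x in kept:                            # plus I and its paths to w_t / w_a
        keep.update(nx.shortest_path(G, trigger_idx, x))
        keep.update(nx.shortest_path(G, arg_idx, x))
    return G.subgraph(keep)                   # the pruned graph T'
```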

In the next step, we feed $T'$ into a graph convolutional network (GCN) (Kipf and Welling, 2017; Nguyen and Grishman, 2018), using the BiLSTM-induced vectors in $H$ as input, to learn more abstract representation vectors for the words in $T'$. We denote the hidden vectors produced in the last layer of the GCN model as $H' = h'_{i_1},\ldots,h'_{i_m} = \text{GCN}(H, T')$, where $m$ is the number of words in $T'$ ($m < n$) and $h'_{i_k}$ is the vector of the word $w_{i_k}$ (i.e., the $k$-th word in $T'$).
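A minimal GCN layer in this spirit, aggregating neighbors over the adjacency matrix of the (pruned) graph; this dense mean-aggregation form is a simplification of the Kipf and Welling formulation, and the layer sizes in the usage comment are assumptions.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, H, adj):
        # H: [num_words, in_dim] BiLSTM vectors; adj: [num_words, num_words]
        # adjacency matrix of the graph (with self-loops added beforehand)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.linear(adj @ H / deg))   # average neighbors, then transform

# H_prime = GCNLayer(400, 200)(H, adj_pruned)   # H' = GCN(H, T')
```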

2.3 Regularization

By computing $H'$ over the pruned tree $T'$, we expect these vectors to explicitly: (i) encode the relevant/important context words, and (ii) exclude potential noisy information from words irrelevant to the role prediction for $w_a$. However, due to the contextualization in the BERT-based input encoder, noisy information from irrelevant words may still be included in the representations $H$ of the words selected for the pruned tree $T'$ and thus be propagated by the GCN into the representations $H'$. Therefore, to further limit the contribution of irrelevant words to representation learning, we introduce a new regularization technique that encourages the representations obtained from all words of $D$ to be similar to those obtained from $T'$ alone (i.e., adding the irrelevant words back should not change the representations significantly). Since the output vectors of the GCN model are used for role prediction, we implement this regularization on the GCN-derived representation vectors. Formally, we first feed $H$ and the full dependency tree $T$ of $D$ into the same GCN model, i.e., $H'' = \text{GCN}(H, T)$. We then compute aggregate vectors $\overline{h'}$ and $\overline{h''}$ by max-pooling over the vectors of $H'$ (based on $T'$) and $H''$ (based on $T$), i.e., $\overline{h'} = \text{MAX-POOL}(h'_{i_1},\ldots,h'_{i_m})$ and $\overline{h''} = \text{MAX-POOL}(h''_{i_1},\ldots,h''_{i_m})$. Finally, we enforce the similarity of $\overline{h'}$ and $\overline{h''}$ by adding their L2 distance to the overall loss function: $L_{reg} = \| \overline{h'} - \overline{h''} \|$.
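A sketch of this regularization term: the same GCN is applied with the pruned adjacency ($T'$) and the full adjacency ($T$), both outputs are max-pooled over the words of $T'$, and the L2 distance between the pooled vectors is the loss term. `gcn` and the adjacency tensors are assumed to follow the earlier sketches.

```python
import torch

def regularization_loss(gcn, H, adj_pruned, adj_full, kept_idx):
    H_prime = gcn(H, adj_pruned)        # H'  = GCN(H, T')
    H_double = gcn(H, adj_full)         # H'' = GCN(H, T)
    pooled_prime = H_prime[kept_idx].max(dim=0).values     # MAX-POOL over words of T'
    pooled_double = H_double[kept_idx].max(dim=0).values
    return torch.norm(pooled_prime - pooled_double, p=2)   # L_reg
```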

2.4 Prediction

To predict the argument role for $w_a$ and $w_t$, we form the overall vector $V = [h'_t, h'_a, \overline{h'}]$, where $h'_t$ and $h'_a$ are the representation vectors of $w_t$ and $w_a$ in $H'$. $V$ is then fed into a two-layer feed-forward network to obtain the distribution $P(\cdot|D, w_t, w_a)$ over the possible argument roles. To train the model, we use the negative log-likelihood loss $L_{pred} = -\log P(l|D, w_t, w_a)$, where $l$ is the correct label. The overall loss function of our model is therefore $L = L_{pred} + \beta L_{reg}$, where $\beta$ is a trade-off parameter.
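Finally, a sketch of the prediction head and the combined loss: the trigger and argument vectors from $H'$ are concatenated with the max-pooled vector and passed through a two-layer feed-forward network, with beta weighting the regularization term. The hidden size and the use of `cross_entropy` (equivalent to the negative log-likelihood over the role distribution) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RolePredictor(nn.Module):
    def __init__(self, dim, hidden, num_roles):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(3 * dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_roles))

    def forward(self, h_trigger, h_arg, pooled):
        V = torch.cat([h_trigger, h_arg, pooled], dim=-1)   # V = [h'_t, h'_a, max-pool(H')]
        return self.ffn(V)                                   # logits over argument roles

# training step for one instance (beta balances the two loss terms):
# logits = predictor(h_t, h_a, pooled_prime)
# loss = F.cross_entropy(logits.unsqueeze(0), gold_role) + beta * l_reg
```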

3. Experiment

Experimental results:

Ablation experiments:

4. Summary

In this work, we propose a new document-structure-aware model for document-level EAE. Our model adopts the dependency trees of sentences and introduces a new technique based on optimal transport to prune the resulting document dependency tree for the EAE task. Furthermore, we introduce a novel regularization method to explicitly constrain the contribution of irrelevant words to representation learning. Our extensive experiments demonstrate the effectiveness of the proposed model. In the future, we plan to apply our model to other IE tasks.
