【Paper Notes】Improving the Applicability of Knowledge-Enhanced Dialogue Generation Systems by MHKD

Improving the Applicability of Knowledge-Enhanced Dialogue Generation Systems by Using Heterogeneous Knowledge from Multiple Sources



Task : Knowledge Augmented Dialogue Generation

Conference : WSDM 2022

Original text : Paper address

Source code : project address

Abstract

To deal with the problem that traditional dialogue systems tend to produce meaningless replies, researchers often enhance dialogue generation by fusing external knowledge. Although such methods have achieved remarkable results, relying on a single knowledge source tends to degrade existing knowledge-enhanced methods to traditional models in real-world scenarios, because the knowledge coverage of a single source is insufficient.

To improve the applicability of knowledge augmentation methods, this paper proposes two novel frameworks that use multi-source heterogeneous knowledge.

  • The first framework is MHKD-Seq2Seq , which handles different heterogeneous knowledge by modeling knowledge behaviors at the abstract level, and uses a Diffuse-Aggregate mechanism to process multiple knowledge sources simultaneously and produce a unified result;
  • The second framework is MHKD-ARPLM , which uses knowledge linearization techniques combined with the advantages of pre-trained language models.

In the experiments, this paper collects previously released public datasets and builds a multi-source knowledge-aligned dataset, TriKE-Weibo , which combines three knowledge sources: commonsense knowledge, knowledge texts, and infobox tables. The effectiveness of the proposed methods is verified through a rich series of experiments.

1. Introduction

1.1 Motivation

Introducing external knowledge to enhance dialogue generation has achieved certain results, but in real-world datasets only a small fraction of dialogues can be linked to any single knowledge source. When no knowledge is available, knowledge-enhanced dialogue models tend to degenerate into traditional models, which greatly limits their applicability.

The three leftmost bars in the figure below show the coverage of each single knowledge source, and the last bar shows the coverage after combining the three sources, i.e. the proportion of dialogues linked to at least one knowledge source.

(Figure: knowledge coverage of single sources vs. the combined sources)

Therefore, improving knowledge coverage is the key to improving the applicability of knowledge-enhanced dialogue. An intuitive approach is to directly extend the existing knowledge bases, but this is impractical because:

  • Existing knowledge bases such as ConceptNet and Freebase are published and maintained by specific organizations, and it is difficult for third parties to extend them;
  • Knowledge coverage and knowledge base capacity are not linearly related. As revealed by Zipf's law, long-tailed distributions can be observed in many natural language scenarios; dialogues that cannot be linked to knowledge bases often involve low-frequency knowledge, and collecting low-frequency knowledge is costly.

1.2 Solution

This paper proposes to utilize multi-source heterogeneous knowledge to improve the applicability of knowledge augmentation methods, i.e. knowledge coverage. Its main advantages are:

  • It is easier to collect more knowledge bases than to expand existing knowledge bases;
  • It can combine the unique advantages of different types of knowledge and avoid the inherent limitations of any specific type. For example, unstructured knowledge texts contain richer semantic information, while knowledge in structured graphs is better organized;
  • Different knowledge sources may have different topic distributions/preferences, so fusing multiple sources can effectively alleviate the long-tail distribution problem: a low-frequency topic in one knowledge source may be more popular in another. Therefore, multi-source knowledge can significantly improve knowledge coverage.

Therefore, this paper proposes Multi-source Heterogeneous Knowledge-enhanced Dialogue generation (MHKD) and tries to solve the two main challenges in using multi-source heterogeneous knowledge:

  • Since different heterogeneous knowledge sources have their own characteristics, how can heterogeneous knowledge be used without being affected by these differences?
  • How to flexibly and efficiently integrate knowledge from multiple sources into a dialogue system?

This paper first proposes a non-pretrained MHKD-Seq2Seq framework. To be compatible with different heterogeneous knowledge sources, it models knowledge sources at the abstract level (Abstract-Level Behavior Modeling). To use multiple knowledge sources simultaneously, it further proposes a two-step Diffuse-Aggregate mechanism for MHKD-Seq2Seq.

In addition, this paper proposes a pre-trained MHKD-ARPLM framework. MHKD-ARPLM uses knowledge linearization technology to linearize different knowledge into a unified format, namely plain text. The linearized knowledge texts are then concatenated and added to the dialogue context. With this paradigm, a multi-source knowledge-enhanced system can be built on top of autoregressive pre-trained language models (such as GPT-2).

To evaluate the effectiveness of dialogue response generation enhanced by multi-source heterogeneous knowledge, this paper provides implementations of the two frameworks above, TriKE-Dial and TriKE-DialGPT . It also collects and constructs TriKE-Weibo, a multi-source knowledge-aligned 1M Chinese dialogue dataset containing three knowledge sources: a commonsense knowledge graph (from ConceptNet), knowledge texts (from Wikipedia), and infobox tables (from Wikipedia). In both manual and automatic evaluation, the proposed methods significantly outperform the baselines, and extensive analysis also shows that using heterogeneous knowledge from multiple sources can indeed improve the applicability of knowledge-enhanced dialogue systems.

2. MHKD-SEQ2SEQ

This framework can be applied to most scenarios. The task is defined as modeling $P(Y \mid X, K_i)$, where $X$ is the query, $Y$ is the response, and $K_i = \{k_{i,j}\} \in \mathcal{K}$ denotes the $i$-th knowledge source in the multi-source knowledge set $\mathcal{K}$, each source being a collection of knowledge items.

2.1 Abstract-Level Behavior Modeling

To minimize complexity, this paper proposes to model multi-source heterogeneous knowledge at the abstract level: it first identifies the roles knowledge can play in a dialogue system and then generalizes them into abstract-level behaviors. In this way, MHKD-Seq2Seq can model arbitrary types of knowledge at an abstract level without being affected by the specific knowledge type. The approach is summarized in three abstract behaviors:

Representing

Each knowledge source $K_i$ first needs to be encoded into embeddings $\mathbf{K}_i$. This behavior is defined as $\mathbf{K}_i = \{\mathbf{k}_{i,j}\} = Rep_{K_i}(K_i = \{k_{i,j}\})$, where each $Rep_{K_i}$ is implemented by an encoder.

Accessing

An important role of knowledge sources is to enrich the understanding of the dialogue context by providing relevant knowledge . This can be viewed as gathering relevant information from the encoded knowledge according to the current context, i.e. knowledge selection . The behavior $Access_{K_i}$ is defined as an attention function $Access_{K_i}(query, key=\mathbf{K}_i, val=\mathbf{K}_i)$, where the query is the current state and the key and value are the encoded knowledge.

Copying

Out-of-vocabulary words and boring, generic replies are two current difficulties in dialogue generation. Copying informative words directly from knowledge can effectively alleviate both . Therefore, MHKD-Seq2Seq explicitly introduces a point-then-copy mechanism and defines the behavior $Copy_{K_i}$: based on knowledge $\mathbf{K}_i$, at each time step $t$ it produces a probability distribution $P^{K_i}_t \in \mathbb{R}^{|K_i|}$ over the knowledge items $k_{i,j}$, i.e. $P^{K_i}_t = Copy_{K_i}(context_{decoder}, \mathbf{K}_i)$.
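To make the three abstract behaviors concrete, below is a minimal PyTorch sketch of the interface a knowledge-source processor could expose; the class and method names, and the particular attention/pointer choices, are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class KnowledgeSourceProcessor(nn.Module):
    """Illustrative wrapper exposing the three abstract behaviors
    (Representing / Accessing / Copying) for a single knowledge source."""

    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder                      # source-specific Rep_{K_i}
        self.attn = nn.MultiheadAttention(hidden_size, num_heads=1,
                                          batch_first=True)

    def represent(self, items):
        # Rep_{K_i}: encode raw knowledge items into embeddings K_i
        return self.encoder(items)                  # (batch, |K_i|, hidden)

    def access(self, state, encoded):
        # Access_{K_i}: attention with the current state as query and the
        # encoded knowledge as key/value -> one knowledge context vector
        ctx, _ = self.attn(state, encoded, encoded)  # state: (batch, 1, hidden)
        return ctx

    def copy(self, decoder_context, encoded):
        # Copy_{K_i}: pointer distribution P_t^{K_i} over the knowledge items
        scores = torch.bmm(decoder_context, encoded.transpose(1, 2))
        return torch.softmax(scores, dim=-1)         # (batch, 1, |K_i|)
```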

2.2 Diffuse-Aggregate Scheme

Another challenge is how to use all knowledge sources simultaneously. To maximize flexibility and scalability, this paper proposes a Diffuse-Aggregate mechanism. In the Diffuse step, for each abstract behavior, MHKD-Seq2Seq uses multiple knowledge-specific processors to process the data flows of the individual knowledge sources in parallel. Apart from the knowledge itself, all processors share the same input and use the same output format. The data flows are then merged in the Aggregate step: the partial results output by the knowledge-specific processors are combined by an aggregation processor/function, which can be a parameterless method such as Max, Mean, or Sum, or a trainable network.
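As a minimal sketch of the Diffuse-Aggregate scheme (assuming processors with the access behavior sketched in Section 2.1), the Diffuse step simply fans the shared input out to every knowledge-specific processor, and the Aggregate step merges the partial results; a parameterless mean/max stands in for the paper's aggregation processors here.

```python
import torch

def diffuse_aggregate(state, processors, encoded_sources, aggregate="mean"):
    """Diffuse: run every knowledge-specific processor on the shared input.
    Aggregate: merge their partial results into one unified output."""
    partial = [p.access(state, enc)                   # diffuse step, in parallel
               for p, enc in zip(processors, encoded_sources)]
    stacked = torch.stack(partial, dim=0)             # (n_sources, batch, 1, hidden)
    if aggregate == "mean":                           # parameterless aggregators
        return stacked.mean(dim=0)
    if aggregate == "max":
        return stacked.max(dim=0).values
    raise ValueError("a trainable aggregation network would be plugged in here")
```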

2.3 MHKD-Seq2Seq

Algorithm 1 gives the algorithm flow of the MHKD-Seq2Seq framework, where,

$z_t$ denotes the decoder hidden state at time step $t$; $H$ denotes the representation obtained by encoding the query; $\mathbf{K}_i$ denotes the representation obtained by the knowledge encoder for source $i$; and $c_t$ denotes the dynamic context computed by attention at time step $t$.

  1. Encode the query to obtain $H$, and encode each knowledge source to obtain $\mathbf{K}_i$; (lines 1-4)
  2. Initialize the decoder with the last hidden state $h_n$ of $H$ to obtain the decoder's initial state; (line 5)
  3. Using the hidden state $z_{t-1}$ of the previous time step as the query, perform attention over $H$ and each $\mathbf{K}_i$ to obtain the context vector $c_t$ and the knowledge vectors $c_t^{K_i}$; this is the Diffuse step of the Access behavior; (lines 6-10)
  4. Aggregate them through a state aggregator to obtain the state vector $g_t^s$; (line 11)
  5. Update the decoder state to obtain $z_t$, based on the previous state, the context vector, the knowledge vectors, the embedding of the previously decoded token, and the aggregated state vector; (line 12)
  6. Compute the probability distribution over the vocabulary, the distribution for copying from the query, and the distributions for copying from knowledge; (lines 13-17)
  7. Through another aggregator, obtain the state vector $g_t^m$ for the final aggregation; (line 18)
  8. Generate the next token by aggregating the probability distributions from step 6. (line 19)

(Algorithm 1: the MHKD-Seq2Seq generation procedure)

2.4 TriKE-Dial

This paper introduces a specific implementation of MHKD-Seq2Seq: TriKE-Dial.

TriKE-Dial uses three kinds of knowledge:

  • Commonsense knowledge (CSK)

  • Text knowledge (Knowledge texts, TXT)

  • Infobox table knowledge (IBT)

2.4.1 Representing Stage

Four different encoders are used to encode Query and heterogeneous knowledge from different sources to better capture semantic information.

  • Context encoder: the query is a short text and a Transformer easily overfits on it, so a BiGRU is used;
  • Commonsense knowledge encoder: TransE is used to encode the knowledge triples and learn knowledge embeddings;
  • Text knowledge encoder: since knowledge texts are long sequences, a two-layer Transformer encoder is used;
  • Infobox table encoder: the knowledge is treated as a series of key-value pairs, where the key is a name phrase and the value is a short text. It is first serialized into key-word pairs; since the serialized sequence is long and there is no strong sequential relationship between key-word pairs, a two-layer Transformer is used to encode them (see the sketch after this list).
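To make the key-word serialization concrete, here is a minimal Python sketch of one plausible way to flatten an infobox into key-word pairs; the function name and the exact pairing scheme are illustrative assumptions, not the paper's released preprocessing code.

```python
def linearize_infobox(infobox):
    """Flatten {key: value-text} pairs into (key, word) tokens, so each
    value word keeps its key and the table structure survives serialization."""
    tokens = []
    for key, value in infobox.items():
        for word in value.split():
            tokens.append((key, word))   # one key-word pair per value word
    return tokens

# e.g. {"birthplace": "Beijing China"} -> [("birthplace", "Beijing"), ("birthplace", "China")]
```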

2.4.2 Step-Wise Response Generation

Traditional decoders maintain only one set of parameters and output a single state, which makes it hard to deal with multi-source heterogeneous knowledge. Considering this, this paper proposes a Multi-Processor GRU (MP-GRU) unit as the decoder. It is a layered unit in which each processor is a source-specific GRU, and parameters are not shared between processors.

On top of these GRUs, a Global State Manager is used for aggregation and a Global Token Predictor is used to predict the next word.

MP-GRU Decoder
  1. At each time step $t$, perform the Diffuse step of Access and use each local processor to compute the local states $z_t^c$ and $z_t^{K_i}$:

(Equation: local state updates of the source-specific GRU processors)

Here $c_t$ and $c_t^{K_i}$ are attention representation vectors computed by attention networks, with the global state $z_{t-1}$ as the query gathering the required information from the context and the knowledge (as keys and values):

(Equation: attention over the query context and each knowledge source)

  2. The global state $z_t$ is the weighted sum of the local states (query and knowledge) , where $f_{state}$ is a two-layer MLP + Softmax used for the Aggregate step of Access; in effect it learns a weight for each local state and uses these weights for the weighted summation (a minimal sketch of this aggregation follows the equation below):

(Equation: global state $z_t$ as the weighted sum of local states)
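Below is a minimal sketch of what such an aggregation could look like, assuming $f_{state}$ is a two-layer MLP followed by a Softmax that scores the local states; the dimensions and module layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class StateAggregator(nn.Module):
    """f_state (assumed form): score every local state, softmax the scores,
    and return the weighted sum as the global state z_t."""

    def __init__(self, hidden_size: int, n_states: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_states * hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, n_states),
        )

    def forward(self, local_states):                          # list of (batch, hidden)
        stacked = torch.stack(local_states, dim=1)            # (batch, n, hidden)
        weights = torch.softmax(
            self.mlp(stacked.flatten(start_dim=1)), dim=-1)   # (batch, n)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)   # (batch, hidden)
```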

Global Token Predictor
  1. The vocabulary probability distribution is computed by a two-layer MLP + Softmax from the global state, the query and knowledge attention representation vectors, and the embedding of the previously output word:

(Equation: vocabulary distribution)

  2. Based on an attention network $\alpha^C$, with the context encoding vectors as keys and values and the current global state as the query, compute the probability distribution for copying a word from the query (the input $X$):

(Equation: copy-from-query distribution)

  3. Based on an attention network $\alpha^{K_i}$, with the knowledge encoding vectors and the global state of the previous time step (why the previous time step? possibly a typo in the paper, or a reuse of the attention distribution already computed with $z_{t-1}$), compute the probability distribution for copying a knowledge item from the knowledge:

(Equation: copy-from-knowledge distributions)

  4. Aggregate the above probability distributions, where $f_{mode}$ is a two-layer MLP + Softmax used to aggregate the distributions; in effect it learns mode weights, and the final distribution is the weighted sum of the per-mode distributions (a minimal sketch of this mixing follows the equation below):

(Equation: final mixed distribution $P_t$)
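As a hedged illustration, the mode mixing can be written as a weighted concatenation of the per-mode distributions into one index space of size $|V|+|X|+\sum_i |K_i|$; the function below assumes the mode weights come from $f_{mode}$ and that the blocks are laid out in this order, which is an assumption rather than the released implementation.

```python
import torch

def mix_distributions(mode_weights, p_vocab, p_copy_query, p_copy_knowledge):
    """Weight each per-mode distribution by its mode probability and
    concatenate the blocks into the final distribution P_t."""
    # mode_weights: (batch, 2 + n_sources), output of f_mode (MLP + Softmax)
    blocks = [p_vocab, p_copy_query] + list(p_copy_knowledge)
    weighted = [w.unsqueeze(-1) * p                     # (batch, 1) * (batch, |block|)
                for w, p in zip(mode_weights.unbind(dim=1), blocks)]
    return torch.cat(weighted, dim=-1)                  # sums to 1 over all indices
```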

Training

The final distribution satisfies $P_t \in \mathbb{R}^{|V|+|X|+\sum_i |K_i|}$, i.e. it spans the vocabulary, the query tokens, and all knowledge items.

(Equation: training objective)
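The concrete objective is not reproduced in these notes; presumably training uses the standard negative log-likelihood of the gold response under the mixed distribution $P_t$, which in its usual form would read:

```latex
\mathcal{L} = -\sum_{t=1}^{|Y|} \log P_t\big(y_t \mid y_{<t}, X, \{K_i\}\big)
```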

3. MHKD-ARPLM

3.1 TriKE-DialGPT

Knowledge linearization is used to turn multi-source knowledge into token sequences, and the linearized pieces of knowledge are then concatenated. An autoregressive PLM is fine-tuned to generate responses conditioned on the dialogue history and the linearized knowledge.

This paper uses CDialGPT as the backbone; the linearization scheme used is shown in the figures below.

(Figures: the knowledge linearization scheme used by TriKE-DialGPT)
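A minimal sketch of how such an input could be assembled for the autoregressive PLM is given below; the separator tokens ([CSK], [TXT], [IBT], [SEP]) and the helper name are illustrative assumptions, while the 256-token limit matches the experimental setting reported later.

```python
def build_plm_input(history, csk_triples, knowledge_text, infobox, tokenizer,
                    max_len=256):
    """Linearize each knowledge source into plain text, concatenate the pieces
    with the dialogue history, and truncate to the PLM's input limit."""
    csk = " ".join(f"{h} {r} {t}" for h, r, t in csk_triples)     # triples -> text
    ibt = " ".join(f"{k} : {v}" for k, v in infobox.items())       # key-value pairs -> text
    knowledge = f"[CSK] {csk} [TXT] {knowledge_text} [IBT] {ibt}"  # assumed separators
    text = knowledge + " [SEP] " + " [SEP] ".join(history)
    return tokenizer(text, truncation=True, max_length=max_len)   # HuggingFace-style call
```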

3.2 Why do we propose two frameworks?

Intuitively, it is easier to develop a model based on MHKD-ARPLM than on the non-pretrained MHKD-Seq2Seq. However, in practical applications, MHKD-ARPLM has several limitations:

  • The pre-training model consumes a lot of data and computing resources during the pre-training phase, which makes the pre-training model not work in some scenarios;

  • Unlike MHKD-Seq2Seq, which can explicitly identify and model knowledge behaviors, MHKD-ARPLM uses knowledge in a black-box manner, so it is not easy to control or explain how the model uses the knowledge;

  • Pre-trained models usually have limits on input length (for example, CDialGPT2 limits the number of tokens to 512), while linearized knowledge texts are often very long:

    • The rich information of knowledge often means a large number of tokens;
    • In the linearization part, new specific tokens are often introduced to represent the knowledge structure.

    Although some pre-trained models such as Longformer can handle long texts by reducing the quadratic dependence on sequence length to linear via sparse attention, these models are autoencoding rather than autoregressive and are not suitable for text generation.

Based on this, this paper proposes two frameworks suitable for different scenarios.

4. Evaluation

4.1 Experiment Methodology

Dataset: TriKE-Weibo

First, 3.67M Chinese query-response candidate pairs are collected from three publicly released Weibo datasets, and jieba is used as the word-segmentation tokenizer.

Chinese ConceptNet is used as the source of commonsense knowledge, 1M+ texts are collected from Chinese Wikipedia as the text knowledge base, and 1M+ infobox tables are collected from Chinese Wikipedia as the infobox table knowledge base.

It is worth noting that after dialogue-knowledge alignment, not all knowledge source entries are used in the final data. The training/validation/test splits contain 1M/40K/40K instances respectively. In each subset, about 80% of the dialogues are knowledge-aligned, and the remaining 20% are used to verify performance in the scenario where no knowledge can be matched.

Models

There are two groups of models: non-pretrained and pretrained. The non-pretrained group includes many baselines, divided into single-source and multi-source knowledge enhancement. The pretrained group fine-tunes CDialGPT on TriKE-Weibo, with the maximum input length limited to 256 tokens.

Metrics

BLEU-1~4, ROUGE-L, the embedding-based Embed-A/G/X (Average/Greedy/Extreme), DIST-A2 (the number of distinct 2-grams among all generated words), DIST-B2 (the number of distinct 2-grams in the last candidate beam), Ent-4 (4-gram entropy), and entity score (the number of knowledge words generated per sentence).
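For reference, the diversity metrics can be computed roughly as below; this is a standard implementation sketch over tokenized outputs, not the evaluation script released with the paper.

```python
from collections import Counter
import math

def distinct_n(sentences, n=2):
    """DIST-n as described above: the number of distinct n-grams over all
    generated sentences (a ratio form would divide by the total n-gram count)."""
    ngrams = {tuple(s[i:i + n]) for s in sentences for i in range(len(s) - n + 1)}
    return len(ngrams)

def entropy_n(sentences, n=4):
    """Ent-n: entropy of the n-gram frequency distribution of the outputs."""
    counts = Counter(tuple(s[i:i + n]) for s in sentences
                     for i in range(len(s) - n + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())
```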

4.2 Evaluation Results

Among the non-pretrained models, TriKE-Dial achieves the best performance on relevance metrics such as BLEU and strikes a good balance on the diversity and novelty metrics . The novelty metric mainly measures how novel the response is compared with the query.

In terms of pre-training models, the performance of TriKE-DialGPT far exceeds that of CDial-GPT, which verifies the effectiveness of the method in this paper.

Comparing TriKE-Dial and TriKE-DialGPT, the main advantage of TriKE-DialGPT lies in the relevance metrics, which may come from its large amount of pre-training data, whereas TriKE-Dial models multi-source knowledge in a finer-grained way for dialogue generation. Therefore, even with only 1M training dialogues, TriKE-Dial performs roughly on par with TriKE-DialGPT on the diversity, novelty and knowledge metrics.

In addition, manual evaluation shows that TriKE-DialGPT performs better than CDial-GPT; comparing the two proposed methods, each is slightly better than the other on different criteria, so neither clearly dominates.

4.3 Knowledge Ablation Study

Ablation experiments verified:

  • The validity of single-source knowledge is verified;
  • It also verifies that the multi-source knowledge enhancement model in this paper can greatly improve the performance;
  • Compared with the no-knowledge baseline, the sum of the performance gains achieved by each single-source enhancement is approximately equal to the gain achieved by multi-source enhancement, which verifies the scalability of the framework;
  • The proposed model with single-source knowledge + Copy achieves performance roughly on par with other single-source knowledge-enhanced baselines;
  • When using the same knowledge source, the proposed model can surpass the corresponding SOTA method, which shows that the improvement of TriKE-Dial comes not only from multi-source knowledge but also from the advantages of the method itself.

The Necessity of Improving Knowledge Coverage

The test samples are divided into four groups: None/CSK/TXT/IBT, and the figure below reports each group's performance relative to the setting without knowledge alignment. Significant gains can be observed for the knowledge-aligned groups, while the performance of the non-aligned group degrades to the level of methods without knowledge enhancement. This demonstrates the necessity of improving knowledge coverage.

(Figure: per-group performance of knowledge-aligned vs. non-aligned test samples)

4.4 Case Study

The case study finds that TriKE-DialGPT often generates informative but repetitive sentences, whereas the sentences generated by TriKE-Dial are more fluent.
