【Paper Notes】Text Detoxification using Large Pre-trained Neural Models



Conference: EMNLP 2021

Task: Text Detoxification

Source code: link

Abstract

The contributions of this paper include:

  • Two novel unsupervised methods for text detoxification based on pre-trained language models: ParaGeDi (paraphrasing GeDi) and CondBERT (conditional BERT), both achieving SOTA results;
  • A comparison of these two models with many SOTA models on text detoxification and sentiment transfer tasks, together with the release of a small-scale detoxification dataset;
  • A small-scale parallel English corpus constructed by retrieving toxic-neutral sentence pairs from ParaNMT.

Motivation

Text style transfer usually means rewriting a sentence so that one or several of its attributes change while the semantic content stays the same. In many cases, however, changing an attribute causes a large change in the meaning of the sentence, so in practice the goal of many style transfer models is to turn a sentence into a similar sentence on the same topic with a different style. The authors argue that text detoxification needs to preserve the original semantics better than other style transfer tasks do, and should therefore be treated differently. Accordingly, they propose two text detoxification models with additional controls for content preservation.

Related Work

Pointwise Editing Models

  • Retrieval-based approaches. They operate only on style-marked words, which preserves the original semantics.

  • Approaches based on a masked language model (MLM). Searching for replacement words conditioned on the context and a style tag is similar to the CondBERT method proposed in this paper, but CondBERT adds extra control over content preservation and can perform multi-word replacements.

End-to-end Architectures

End-to-end architectures, often based on Seq2Seq models, encode the source sentence and then manipulate the resulting representation to incorporate the new attribute. Some methods decompose the latent representation into content and style parts, while others make the encoder produce a style-independent content representation. There are also various ways to impose additional control on the style of the text. STRAP treats the style transfer task as a paraphrasing task: it uses a pre-trained general-purpose paraphraser to rewrite style-tagged text into neutral text, creating a pseudo-parallel dataset, and then trains a sequence-to-sequence model on it. The ParaGeDi model proposed in this paper is conceptually similar to these methods, except that the style is injected not into the model or the sentence representation but into the generation process by a separate model.

Detoxification

PPLM and GeDi explore a related problem: preventing a language model from generating toxic text. They do not consider preserving the semantics of a given text, but their idea of using a discriminator to steer the generation of a language model can be applied to style transfer tasks.

Main idea and Framework

ParaGeDi (paraphrasing GeDi)

Background

The model builds on two ideas:

  • External control is imposed on the output of a generative model through a class-conditional language model (Class-Conditional LM, CC-LM);
  • Treating text style transfer as a paraphrasing task, so that the original semantics are well preserved by the paraphrase model.

The GeDi model generates text under the guidance of a language model that knows specific text attributes, such as style or topic. This paper extends GeDi by applying it to paraphrasing the input text. GeDi consists of two main parts: a generative model, GPT-2, and a generative discriminator, a CC-LM (also a GPT-2) trained on data with additional sentence-level style labels, which lets it learn label-conditioned word distributions.

Framework

This paper uses a paraphraser instead of a conventional language model. The input text is $x$, the generated text $y$ has length $T$, and the desired attribute is $c$. The ParaGeDi model then follows this probability distribution:

[Equation: token-level factorization of $p(y_t \mid y_{<t}, x, c)$ into a paraphraser term and a class-conditional discriminator term]

The last step is an approximation that lets the two components be decoupled: the paraphrase generator and the style transfer model can be trained independently. Moreover, as long as the paraphraser and the CC-LM share a vocabulary, any paraphraser can be plugged in. The third (optional) part of the model is a ranker, a toxicity classifier that selects the least toxic hypothesis among the texts generated by the model. The figure below describes the workflow of the model.

[Figure: ParaGeDi workflow (paraphraser, class-conditional LM, and toxicity-classifier reranker)]
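To make this combination concrete, here is a minimal sketch of one decoding step: the paraphraser's next-token distribution is multiplied by the discriminator's class probability raised to a power $w$. The function and argument names are mine, and the uniform class prior is an assumption; the authors' implementation differs in details such as length normalization.

```python
import torch

def paragedi_step(para_logits: torch.Tensor,
                  cc_logprob_safe: torch.Tensor,
                  cc_logprob_toxic: torch.Tensor,
                  w: float = 4.0) -> torch.Tensor:
    """One decoding step of a ParaGeDi-style model (illustrative).

    para_logits      -- [vocab] next-token logits from the paraphraser
    cc_logprob_safe  -- [vocab] log p(y_t | y_<t, c="safe") from the CC-LM
    cc_logprob_toxic -- [vocab] log p(y_t | y_<t, c="toxic") from the CC-LM
    """
    # Bayes rule over the two classes (uniform prior assumed):
    # p(safe | y_t, y_<t) = p(y_t | safe) / (p(y_t | safe) + p(y_t | toxic))
    both = torch.stack([cc_logprob_safe, cc_logprob_toxic])
    log_p_safe = cc_logprob_safe - torch.logsumexp(both, dim=0)
    # Combine: p(y_t) is proportional to p_para(y_t) * p(safe | y_t)^w,
    # then renormalize over the vocabulary.
    combined = torch.log_softmax(para_logits, dim=-1) + w * log_p_safe
    return torch.softmax(combined, dim=-1)
```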

Loss

The final loss function is a linear combination of generative loss and discriminative loss.

[Equation: linear combination of the generative and discriminative losses]
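The paper shows this loss as an image; based on the description here and on the $\lambda$ ablation below (where $\lambda = 1$ corresponds to no generative loss), a plausible form is:

```latex
\mathcal{L} = (1-\lambda)\,\mathcal{L}_{g} + \lambda\,\mathcal{L}_{d},
\quad
\mathcal{L}_{g} = -\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, c),
\quad
\mathcal{L}_{d} = -\log p(c \mid y)
```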

Inference heuristics

A series of heuristics is used to improve the model's content preservation and style transfer accuracy.

  • The heuristics of the original GeDi

[Equation: GeDi's original class-probability weighting heuristic]

  • Smoothing of probabilities

    A small constant $\alpha$ is added to the probability distributions generated by the CC-LM, preventing extreme corrections for words that have very small probability under all classes.

[Equation: additive smoothing of the class-conditional probabilities]

  • Asymmetric lower and upper bounds

[Equation: asymmetric lower and upper bounds on the discriminator correction]

For the detoxification task, lowering the upper bound $u$ encourages the model to remove existing toxic words rather than insert new polite words: the bound caps how strongly the discriminator can reward "safe" tokens, while toxic tokens can still be heavily penalized.
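A small sketch of how these two heuristics could be applied to the per-token class probabilities (the values of alpha and the bounds are illustrative, not the paper's tuned settings):

```python
import torch

def smooth_and_clip(p_tok_given_safe: torch.Tensor,
                    p_tok_given_toxic: torch.Tensor,
                    alpha: float = 1e-3,
                    lower: float = 0.2,
                    upper: float = 0.8) -> torch.Tensor:
    """Smoothing plus asymmetric bounds on p(c="safe" | y_t, y_<t)."""
    # Smoothing: the additive constant keeps tokens that are rare under
    # *both* classes from producing extreme class ratios.
    safe = p_tok_given_safe + alpha
    toxic = p_tok_given_toxic + alpha
    p_safe = safe / (safe + toxic)  # Bayes rule, uniform class prior
    # Asymmetric bounds: a low upper bound caps how much a "polite" token
    # can be boosted, so the model prefers deleting toxic words over
    # inserting new polite ones; the lower bound still allows strong
    # suppression of toxic tokens.
    return p_safe.clamp(min=lower, max=upper)
```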

CondBERT (conditional BERT)

Following a pointwise editing mechanism, BERT is used to replace the toxic fragments found in a sentence with non-toxic synonyms. Semantic similarity is maintained by showing BERT the original text and reranking its hypotheses by the similarity between the original word and its replacement.

Select toxic words

A bag-of-words logistic regression classifier is trained to label sentences as toxic or neutral, using the words of a sentence as features. During training, each word receives a weight that roughly corresponds to its importance for the classification, and words with high weights tend to be toxic. This paper uses the classifier's regularized weights as toxicity scores.

For each word in the sentence, its toxicity score is first computed; words are then marked as toxic according to an adaptive threshold

$$t = \max(t_{min},\ \max(s_1, s_2, \ldots, s_n)/2)$$

where $s_1, s_2, \ldots, s_n$ are the toxicity scores of the words in the sentence and $t_{min} = 0.2$ is the minimum threshold. This adaptive threshold balances the share of words marked as toxic, so that neither too many nor zero words are flagged.
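A toy sketch of this scoring scheme (the data and the threshold application are mine; the paper trains on Jigsaw and uses the classifier's regularized weights):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: (sentence, is_toxic) pairs.
sentences = ["you are a genius", "you are an idiot"]
labels = [0, 1]

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(sentences), labels)

# Per-word coefficients serve as toxicity scores.
tox_score = dict(zip(vec.get_feature_names_out(), clf.coef_[0]))

def mark_toxic(words, t_min=0.2):
    """Adaptive threshold: t = max(t_min, max(s_i) / 2)."""
    scores = [tox_score.get(w, 0.0) for w in words]
    t = max(t_min, max(scores) / 2)
    return [w for w, s in zip(words, scores) if s >= t]
```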

Replace the masked words and rerank

To preserve the semantics of the replaced words, two content-preserving heuristics are employed:

  • The toxic words are kept visible to BERT instead of being masked before replacement;
  • The replacement words predicted by BERT are reranked by the similarity of their embeddings to the original word's embedding (a sketch follows below).
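A rough sketch of the second heuristic with Hugging Face transformers (my simplifications: the target word is a single token, and candidates are ordered by embedding similarity alone; CondBERT combines this similarity with the MLM score):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")

def rerank_candidates(text: str, target_word: str, k: int = 10):
    """Rerank BERT's top-k fillers for `target_word` (kept unmasked)
    by cosine similarity to the original word's embedding."""
    inputs = tok(text, return_tensors="pt")
    target_id = tok.convert_tokens_to_ids(target_word)
    pos = (inputs.input_ids[0] == target_id).nonzero()[0].item()
    with torch.no_grad():
        logits = mlm(**inputs).logits[0, pos]      # predictions at `pos`
    top = logits.topk(k).indices                   # k most likely fillers
    emb = mlm.get_input_embeddings().weight        # [vocab, dim]
    sims = torch.cosine_similarity(emb[top], emb[target_id].unsqueeze(0))
    order = sims.argsort(descending=True)
    return [tok.convert_ids_to_tokens(top[i].item()) for i in order]
```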

Penalize the toxic tokens

Although conditional BERT uses class-specific sentence embeddings, it still often predicts toxic words, apparently paying more attention to the context than to the desired class embedding. To force the model to generate non-toxic words, the toxicity of every token in BERT's vocabulary is computed, and the predicted probabilities of tokens with positive toxicity are penalized.
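The penalty itself can be a one-liner over the vocabulary; a sketch (the weight beta and the exact form are assumptions):

```python
import torch

def penalize_toxic_logits(logits: torch.Tensor,
                          vocab_toxicity: torch.Tensor,
                          beta: float = 1.5) -> torch.Tensor:
    """Subtract a penalty proportional to each vocabulary token's
    toxicity score from the MLM logits; only positive scores count."""
    return logits - beta * vocab_toxicity.clamp(min=0.0)
```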

Replace a single [MASK] token with multiple tokens

Finally, BERT is made to replace a [MASK] token with multiple words. The words are generated incrementally by beam search, and each multi-word sequence is scored by the harmonic mean of its word probabilities.
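The scoring function is a direct reading of that description (not the authors' exact code):

```python
def harmonic_score(word_probs: list[float]) -> float:
    """Harmonic mean of per-word probabilities, used to compare
    beam-search hypotheses of different lengths on an equal footing."""
    return len(word_probs) / sum(1.0 / p for p in word_probs)

# e.g. harmonic_score([0.6, 0.3]) is approximately 0.4
```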

[Figure: the CondBERT workflow]

Experiments and Analysis

Toxicity Classifier

The English data of three Jigsaw toxicity datasets were merged and split into two parts, on which two RoBERTa-based toxicity classifiers were fine-tuned: one for reranking ParaGeDi's hypotheses and one for evaluating detoxification performance.

Dataset

The toxicity classifier is applied to the Jigsaw-1 dataset to split its sentences into a toxic set and a non-toxic set.

Metrics

Since the three metrics of style transfer accuracy (ACC), content preservation (SIM), and fluency (FL) are negatively correlated, the joint metric $J$, defined as the product of the three, is used to balance them.
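In the style transfer literature (e.g., STRAP), this joint metric is usually computed as the sentence-level product averaged over the test set, which matches the description above:

```latex
J = \frac{1}{n} \sum_{i=1}^{n} \mathrm{ACC}(x_i) \cdot \mathrm{SIM}(x_i) \cdot \mathrm{FL}(x_i)
```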

Implementation Details

A random subset of ParaNMT is sampled, and a T5-based model is fine-tuned on it as the paraphraser. GPT2-medium is fine-tuned on the Jigsaw-1 training set with two control codes; before fine-tuning, its vocabulary is replaced with T5's and its embeddings are changed accordingly, and it is trained with the generative and discriminative losses. Beam search with size 10 is used, and the trained toxicity classifier selects the least toxic candidate among the 10 generated paraphrases.

Competing Methods

  • Machine Translation: English → pivot → English with the Google Translate API, relying on the intermediate translation to eliminate toxicity.
  • Detoxifying GPT-2: 200 sentences randomly sampled from Jigsaw and rewritten by hand to build a small parallel training set.
  • Paraphraser: a general-purpose paraphrase generator.

Detoxification results and Analysis

[Table: automatic evaluation of the detoxification models (ACC, SIM, FL, J)]

ParaGeDi and CondBERT far outperform the other models. ParaGeDi's similarity is slightly lower but its fluency slightly higher, because it is a generative model that rewrites the whole sentence rather than editing it pointwise. The closest competitor is Mask&Infill, whose principle is similar to CondBERT's, but some of its engineering decisions (such as masking all toxic words at once) cause a significant drop in FL and some drop in ACC.

Many advanced models perform worse than the simple DRG baselines TemplateBased and RetrieveOnly. TemplateBased achieves high SIM because it keeps most of the initial sentence unchanged, and RetrieveOnly achieves high ACC and FL because it retrieves real non-toxic sentences from the training data. DLSM and SST regenerate the text and have low FL, since their encoders are trained from scratch on a small dataset. Conversely, STRAP, which is also generative, uses a much larger pseudo-parallel dataset and therefore reaches high FL.

For MT, En→Ig→En detoxifies 37% of the sentences, but its SIM and FL scores are very low. Conversely, En→Fr→En retains most features of the original sentence, including its toxicity, as does the T5 paraphraser. On the other hand, the GPT-2 model achieves some detoxification even with a very small number of parallel sentences (200 in this experiment); although its performance is lower than the other models', the paper argues that training on a larger parallel dataset would greatly improve it.

Ablation Study

CondBERT ablation study

CondBERT's ablation experiments show that multi-word replacement and the toxicity penalty play the key roles: the former maintains high text fluency, and the latter provides the stronger style control.

Two heuristics are used to improve content preservation: keeping toxic words unmasked and reranking the replacement words. The experiments show that removing either of them leads to lower SIM and higher ACC, confirming that the two metrics are anti-correlated, while J stays unchanged. In addition, removing the multi-word replacement mechanism reduces ACC and FL even though it yields higher SIM, because the output sentence contains fewer new words; this lowers the model's J score. Finally, the biggest impact comes from removing the toxicity penalty, which makes ACC drop sharply; even though the other two metrics become slightly higher, they cannot compensate for the low J caused by the very low ACC.

[Table: CondBERT ablation results]

ParaGeDi ablation study

The ParaGeDi hyperparameter $\lambda$ controls the strength of the discriminative loss. The experiments show that its value has only a marginal impact on overall performance: $J$ is significantly lower only at $\lambda = 1$ (no generative loss).

Furthermore, the style strength control affects ACC, the upper bound on the word probability distribution increases SIM, and the absence of beam search decreases FL. The reranker, the beam size, and the smoothing have a smaller effect on model performance.

[Tables: ParaGeDi ablation results]

Reducing the beam size lowers ACC and FL because it leaves the reranker fewer options to choose from. Removing the reranker lowers ACC and improves SIM and FL. Removing probability smoothing slightly reduces SIM and FL. Removing the upper bound on the discriminator correction yields nearly 100% ACC, but the resulting sentences have very low similarity to the original, as the model starts to hallucinate text. Lowering the parameter $w$, which controls the strength of the style transfer, reduces ACC but increases SIM and FL.

The loss weights have less impact on performance than the inference heuristics. With only the discriminative loss the model reaches 77% ACC, and with only the generative loss 90% ACC. **The latter shows that a model equipped with style labels can distinguish styles even without being explicitly trained to discriminate.** On the other hand, removing the generative loss leads to a large drop in FL: even though the CC-LM in ParaGeDi is a GPT-2, the lack of generative fine-tuning still degrades the quality of the generated text.

Mining a Parallel Detoxifying Corpus

This experiment verifies that detoxifying sentence pairs can be mined from a large-scale parallel paraphrase dataset.

Specifically, a pre-trained toxicity classifier is applied to the sentences of the ParaNMT dataset, and 500,000 sentence pairs in which one sentence is much more toxic than the other are selected. Two paraphrasers are compared: a regular paraphraser fine-tuned on a randomly sampled subset of ParaNMT, and a mined paraphraser fine-tuned on the constructed toxic/safe parallel paraphrase corpus. Both are plugged into ParaGeDi. The experimental results are shown in the table below.

[Table: regular vs. mined paraphraser within ParaGeDi]

Neither paraphraser detoxifies well on its own, but the mined one is clearly better. Using the mined paraphraser instead of the regular one inside ParaGeDi improves ACC and FL without losing SIM, and J reaches 0.54, the highest score in the detoxification experiments of this paper. This approach is supervised, so it was not compared in the main experiment. The result shows that the general-purpose ParaNMT corpus contains a large number of toxic/safe paraphrase pairs; the paper therefore argues that mining parallel training sets from large corpora is a fruitful direction compared with unsupervised style transfer methods.
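The mining step described above is simple to express; a sketch (the gap delta is an illustrative choice, and `toxicity` stands for any sentence-level classifier returning a score in [0, 1]):

```python
def mine_toxic_neutral_pairs(pairs, toxicity, delta=0.5):
    """Keep paraphrase pairs where one side is much more toxic than
    the other, oriented so the toxic sentence is the source."""
    mined = []
    for a, b in pairs:
        ta, tb = toxicity(a), toxicity(b)
        if ta - tb >= delta:
            mined.append((a, b))   # detoxifying direction: toxic -> safe
        elif tb - ta >= delta:
            mined.append((b, a))
    return mined
```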

Human Evaluation of Detoxification

Unsupervised automatic evaluation is fast and cheap but often unreliable: the assessment of toxicity and fluency is imperfect and produces errors, and measuring sentence similarity by embedding distance correlates only weakly with human judgments. Therefore, a human evaluation is also performed.

The metrics use the same ternary evaluation as the automatic evaluation: {0,0.5,1}.

[Table: human evaluation of detoxification]

The human evaluation confirms the strong performance of the proposed models, but finds no significant difference between the two.

This paper also studies the agreement between automatic and human evaluation by computing Spearman's $\rho$ between them. The correlation is clearly lowest for SIM. ACC-soft denotes the confidence returned by the toxicity classifier; using this style transfer confidence works better than a binary style classifier, and using a linguistic acceptability classifier for FL works better than perplexity.

[Table: Spearman correlation between automatic and human evaluation]

Qualitative Analysis

In fact, the actual success rate of detoxification is far lower than the automatic estimates suggest. Define a "perfect" output as a detoxified sentence given the highest score in all three dimensions by all three annotators. By this definition, only 20% of ParaGeDi's outputs are perfect, 15% of CondBERT's, and 1.5% of Mask&Infill's. As Table 4 shows, the main reason is the loss of semantics. The cases where the models cause semantic loss or distortion are as follows:

For ParaGeDi, the model tends to either over-transform or under-transform the semantics. Typical behaviors are:

  • replacing toxic words with similar, less toxic words of a different meaning;
  • changing the meaning of the sentence;
  • removing the toxic or difficult parts. In some cases, however, ParaGeDi obscures or rewrites the toxic parts of the message while still retaining its general meaning.

CondBERT generally preserves the sentence structure, but often makes inappropriate substitutions, frequently with antonyms.

DLSM and Template-based DRG usually preserve semantics by keeping the toxic words, so their ACC is low. Retrieve-only DRG preserves almost no semantics. Mask&Infill seems to overfit: it often replaces toxic words with unrelated non-toxic words. These properties make Mask&Infill unsuitable for the detoxification task without adaptation; CondBERT can be viewed as such an adaptation.

The typical mistakes of both ParaGeDi and CondBERT can be attributed to insufficient semantic understanding: they often replace toxic words with semantically related words of a different (often opposite) meaning, or simply with words that look similar. The paper conjectures that training the paraphraser on a larger corpus (only 2% of ParaNMT is used here) or on harder samples would improve ParaGeDi's ability to preserve semantics.

Overall, the detoxified texts produced by the models are not yet good enough for practical use, but they can serve as suggestions for human rewriters or be used to detoxify chatbot output, where mistakes are less costly.

Sentiment Transfer Experiments

The detoxification task alone is imperfect for validating the models, so an experiment is also carried out in a different domain: sentiment transfer.

[Table: sentiment transfer results]

The experimental results show that ParaGeDi performs best; it is the only model that combines a pre-trained model with a full regeneration mechanism. The poor performance of CondBERT on this task shows that style transfer in other domains and detoxification require different techniques.

The human reference rewrites also cast doubt on the unsupervised metrics. First, ACC is limited by the classifier's performance: since it scores only 0.81 on hand-rewritten sentences that are almost 100% detoxified, small differences in ACC should not be considered significant, and ACC values above 0.81 are not reliable. Overall, ParaGeDi can still be considered a strong style transfer model, given how close its scores are to those of the human answers and of Mask&Infill. More precise comparison should be done by humans, since automatic metrics cannot differentiate models at this level.

Toxification of Texts

The detoxification task naturally suggests the reverse task: toxification. In principle, any style transfer model can be run in the opposite direction. With CondBERT, however, the quality of this transfer is poor, and the toxified sentences can hardly be regarded as genuinely toxic. The reason lies in the structure of toxic data.

One of the main characteristics of the toxic style is the presence of its lexical markers (rude or obscene words). These markers:

  • carry almost all the style information of the sentence;
  • have synonyms that are free of this style information.

Both proposed methods rely strongly on these properties: they identify toxic words and replace them with non-toxic synonyms. For the reverse direction, these properties cannot be exploited. First, there are no non-toxic words that strongly indicate a neutral style. Second, it is nearly infeasible to identify the non-toxic words that have toxic synonyms and substitute them appropriately. Therefore, CondBERT is not suitable for the toxification task.

Conclusion

This paper proposes two style transfer models for the text detoxification task, both combining a high-quality pre-trained language model with additional style guidance. The models achieve SOTA performance under both automatic and human evaluation.

  • However, both models preserve semantics poorly, which indicates that current detoxification models lack semantic understanding and their outputs easily lose meaning. Training on larger detoxification parallel datasets is likely to alleviate this, and is a promising direction for improvement.
  • In addition, applying the task to detoxifying dialogue generation may be more meaningful, more effective, and less costly; this is a worthwhile research direction.
  • The authors also point out that automatic evaluation has limitations and is unreliable, and that the three common metrics are negatively correlated. Evaluation for this task remains an open problem; it would be better to design an evaluation specific to the characteristics of this task rather than simply borrowing one from other style transfer tasks.
  • Many heuristics are also used at inference time; they are very detailed and somewhat hard to follow.
