[Paper reading] AugGPT: Leveraging ChatGPT for Text Data Augmentation

1. Paper information

Paper title: AugGPT: Leveraging ChatGPT for Text Data Augmentation

Year published: 2023-arXiv

Paper link: https://arxiv.org/abs/2302.13007

Author information: Haixing Dai* (University of Georgia, USA), Zhengliang Liu*, Wenxiong Liao*, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin Zhao, Shaochen Xu, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Lichao Sun, Quanzheng Li, Dinggang Shen, Tianming Liu, and Xiang Li

Remarks: This note focuses on data augmentation methods in NLP, the baseline methods used in the authors' experiments, and the metrics used for model evaluation. The remaining parts of the paper may be supplemented and updated later...

2. The content of the paper

Abstract

In many natural language processing tasks, text data augmentation is an effective strategy to overcome the challenge of limited samples. This challenge is especially acute in few-shot learning scenarios, where the data in the target domain is usually much scarcer and of lower quality. A natural and widely used strategy to alleviate such challenges is to perform data augmentation to better capture data invariance and increase the sample size. However, existing text data augmentation methods either fail to guarantee correct annotation of the generated data (lack of faithfulness), or fail to guarantee sufficient diversity of the generated data (lack of compactness), or both. Inspired by the recent success of large language models, especially the development of ChatGPT, this paper proposes a ChatGPT-based text data augmentation method (AugGPT). AugGPT rephrases each sentence in the training set into multiple conceptually similar but semantically different samples. The augmented samples can then be used for downstream model training. Experimental results on few-shot text classification tasks show that, compared with current mainstream text data augmentation methods, AugGPT performs better in terms of test accuracy and the distribution of augmented samples.

2. Related work

2.1. Data Augmentation

Data augmentation, i.e. artificially generating new text through transformations, is widely used in text classification to improve model training. In NLP, existing data augmentation methods are divided into different levels of granularity: characters, words, sentences, and documents.

Data augmentation at the character level:

  • Randomly inserting, swapping, and deleting characters: randomly insert, swap, replace, or delete characters in the text to improve the robustness of NLP models to noise.
  • Data Augmentation for Optical Character Recognition (OCR): generate new text by simulating the errors that occur when OCR tools recognize text from images. For example, "0" (the digit zero), "o" (lowercase O), and "O" (uppercase O) are difficult for OCR to distinguish, so these recognition errors can be simulated to generate new text.
  • Spelling Augmentation: intentionally misspell words that are frequently misspelled.
  • Keyboard Augmentation: simulate random typing errors by replacing a selected key with another key close to it on a QWERTY keyboard. For example, the keys near "s" are "a", "w", "d", "z", and "x", so any of them can replace the original "s" to simulate a random typing error. A code sketch of these character-level operations follows this list.
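
The character-level operations above are available in off-the-shelf augmentation libraries. Below is a minimal sketch using nlpaug (one of the libraries the authors cite for their baselines); the example sentence and augmenter settings are illustrative assumptions, not the paper's exact configuration.

```python
import nlpaug.augmenter.char as nac

text = "Data augmentation improves model robustness"

random_aug = nac.RandomCharAug(action="swap")  # also supports "insert", "substitute", "delete"
ocr_aug = nac.OcrAug()                         # simulates OCR confusions such as 0 / o / O
keyboard_aug = nac.KeyboardAug()               # simulates QWERTY typos, e.g. "s" -> "a"/"w"/"d"/"z"/"x"

for aug in (random_aug, ocr_aug, keyboard_aug):
    # Recent nlpaug versions return a list of augmented strings.
    print(aug.augment(text))
```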

Data augmentation at the word level:

  • Randomly swap, delete words: Randomly swap two words in the text, and randomly delete some words in the text [24].
  • Synonym Augmentation: use the PPDB thesaurus [25] to replace randomly selected words [26], or use the WordNet thesaurus [27].
  • Word embedding data augmentation: [28] proposed a word-embedding-based data augmentation method that replaces words with their top-n most similar words to create new sentences, considering different pre-trained word embeddings (e.g., GoogleNews [29]). The idea behind this approach is that words close to each other in the embedding space tend to appear in similar contexts, which helps maintain grammatical consistency. However, embedding-based methods have a serious flaw: words that are close in the embedding space are not necessarily semantically similar, and such semantic changes can affect classification results. For example, "hot" and "cold" often appear in similar contexts, so their word embeddings are close, yet their meanings are opposite. Counter-fitted embedding augmentation [30], [31] addresses this problem by adjusting the initial word embeddings with synonym and antonym dictionaries: the distance between embeddings of synonyms is reduced, while the distance between embeddings of antonyms is increased.
  • Contextual augmentation [32-33]: use masked language models (MLMs) such as BERT [34-35], DistilBERT [36], and RoBERTa [37] to generate new text based on context. Specifically, insert the <mask> token at some positions in the text, or replace some words with <mask>, and then let the MLM predict which words should fill the masked positions. Since MLMs are pre-trained on large amounts of text, contextual augmentation can often generate meaningful new text. For example, given "She is a pretty <mask>.", where <mask> hides a word, an MLM can predict completions such as "She is a pretty student.", "She is a pretty girl.", or "She is a pretty teacher."... A code sketch of these word-level methods follows this list.
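
The word-level methods above are also available in nlpaug. The following sketch shows synonym replacement, embedding-based replacement, and contextual (MLM) augmentation; the model paths and example sentence are assumptions for illustration (the GoogleNews vectors must be downloaded separately).

```python
import nlpaug.augmenter.word as naw

text = "The movie was surprisingly good and the acting felt natural"

# Synonym replacement via the WordNet thesaurus (PPDB requires a downloaded model file).
syn_aug = naw.SynonymAug(aug_src="wordnet")

# Embedding-based replacement: swap words with their top-n neighbours in a word2vec space.
emb_aug = naw.WordEmbsAug(
    model_type="word2vec",
    model_path="GoogleNews-vectors-negative300.bin",  # assumed local path to the vectors
    action="substitute",
)

# Contextual augmentation: mask positions and let a pre-trained MLM fill them in.
ctx_aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="substitute")

for aug in (syn_aug, emb_aug, ctx_aug):
    print(aug.augment(text))  # recent nlpaug versions return a list of strings
```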

Data Augmentation at Sentence and Text Levels:

  • Back-translation [38] (sentence and text level): data augmentation using translation models. The text is translated into another language and then translated back into the original language. Because of the randomness of the translation process, the augmented text differs from the original text while maintaining semantic consistency (a code sketch follows this list).
  • Document Paraphrase: Gangal et al. [39] proposed a method to paraphrase the entire document to maintain document-level consistency.
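
Back-translation can likewise be sketched with nlpaug's wrapper around pre-trained translation models; the English-German model pair below is an illustrative choice, not necessarily the one used in the paper.

```python
import nlpaug.augmenter.word as naw

# Back-translation: English -> German -> English. Randomness in decoding yields
# a paraphrase that differs from the input while preserving its meaning.
back_translation = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",
    to_model_name="facebook/wmt19-de-en",
)
print(back_translation.augment("The weather was lovely, so we walked to the park"))
```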

In general, regardless of the granularity level or the text generation backbone (rule-based or language-model-based), the goal of data augmentation is to generate plausible and diverse new samples while maintaining semantic consistency.

4. Method

4.1. Overall framework

[Figure: AugGPT framework overview]

AugGPT framework. a (top): ChatGPT is first used for data augmentation. Samples from all categories are fed into ChatGPT, which is prompted to generate new samples that are semantically consistent with the existing labeled instances. b (bottom): the authors then train a BERT-based sentence classifier on the few-shot samples together with the generated samples, and evaluate the classification performance of the model.
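
The augmentation step in panel a can be sketched roughly as follows. This is a minimal, hypothetical sketch: the prompt wording, the model name ("gpt-3.5-turbo"), and the helper name augment_with_chatgpt are illustrative assumptions and not taken from the paper; the paper itself prompts ChatGPT to rephrase each training sentence into multiple new samples.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def augment_with_chatgpt(sentence: str, n_variants: int = 6, model: str = "gpt-3.5-turbo"):
    """Ask ChatGPT to rephrase one labeled sentence into several new samples (hypothetical prompt)."""
    prompt = (
        f"Please rephrase the following sentence into {n_variants} different sentences "
        f"that keep the same meaning (and therefore the same label) but use different wording:\n"
        f"{sentence}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    # One candidate per line; real post-processing may need to strip list numbering.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

The returned sentences, together with the original few-shot data, would then be used to fine-tune the BERT classifier in panel b.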

4.4. Baseline method

In the experimental section, the authors compare AugGPT with other popular data augmentation methods. For these baselines, the authors used implementations from open-source libraries, including nlpaug [83] and textattack [84].

  • InsertCharAugmentation: Inserts random characters at random positions in the text.

  • SubstituteCharAugmentation: Randomly replaces selected characters.

  • SwapCharAugmentation[22]: Randomly swap two characters.

  • DeleteCharAugmentation: Randomly delete characters.

  • OCRAugmentation: Simulates OCR errors for data augmentation, e.g., replacing "I" with "1" and "O" with "0".

  • SpellingAugmentation[23]: Introduces deliberate spelling mistakes, e.g., changing "because" to a common misspelling such as "becasue".

  • KeyboardAugmentation[22]: Simulates keyboard typing errors, e.g., replacing "s" with nearby keys such as "w", "a", "z", "x", "d", "q", or "e".

  • SwapWordAug[24]: Randomly swaps words in the text; this is a sub-method of the Easy Data Augmentation (EDA) approach proposed by Wei et al.

  • DeleteWordAug: Randomly deletes words in text.

  • PPDBSynonymAug[26]: PPDB thesaurus for synonym replacement.

  • WordNetSynonymAug: WordNet thesaurus for synonym replacement.

  • SubstituteWordByGoogleNewsEmbeddings[28]: Replaces words with one of their top-n most similar words in the word embedding space (the word embeddings were pre-trained on the GoogleNews corpus).

  • InsertWordByGoogleNewsEmbeddings [83]: It randomly selects words from the vocabulary of the GoogleNews corpus and inserts them at random positions in the text.

  • CounterFittedEmbeddingAug: Replaces words with their neighbors in the counter-fitted embedding space. Compared with the GoogleNews word vectors used above, counter-fitted embeddings introduce synonym and antonym constraints: embeddings of synonyms are pulled closer together, while embeddings of antonyms are pushed apart.

  • ContextualWordAugUsingBert(Insert): Uses BERT to insert words according to the context, i.e., a <mask> token is added at a random position in the input text, and BERT predicts the token at that position.

  • ContextualWordAugUsingDistilBERT(Insert): This method uses DistilBERT instead of BERT for prediction, and the rest is the same as ContextualWordAugUsingBert(Insert).

  • ContextualWordAugUsingRoBERTa(Insert): Uses RoBERTa instead of BERT for prediction; otherwise the same as ContextualWordAugUsingBert(Insert).

  • ContextualWordAugUsingBert(Substitute): This method [32-33] uses BERT to perform word replacement based on the context, that is, replace a randomly selected word in the text with <mask>, and then let BERT predict the content of the position.

  • ContextualWordAugUsingDistilBERT(Substitute): Uses DistilBERT instead of BERT for prediction; otherwise the same as ContextualWordAugUsingBert(Substitute).

  • ContextualWordAugUsingRoBERTa(Substitute): Uses RoBERTa instead of BERT for prediction; otherwise the same as ContextualWordAugUsingBert(Substitute).

  • BackTranslationAug[38]: Translates the text first into German and then back into English, producing a new text that differs from the original but has the same semantics.

4.6. Evaluation metrics

The authors use cosine similarity and TransRate [86] as metrics to evaluate the faithfulness of the augmented data (i.e., whether the generated samples are close to the original samples) and its compactness (i.e., whether the samples of each category are compact enough to be well separated).

4.6.1. Cosine similarity

To evaluate the semantic similarity between the samples generated by a data augmentation method and the real samples, the embedding similarity between the generated samples and the real samples of the test dataset is used. Common similarity measures include Euclidean distance, cosine similarity, and dot-product similarity. In this study, the authors choose cosine similarity to capture the distance relationship in the latent space. Cosine similarity measures the cosine of the angle between two vectors; it approaches 1 as the two vectors become more similar (in general it ranges from -1 to 1). Since a pre-trained language model without fine-tuning struggles to capture sentence semantics, the authors use the BERT-flow [87] method to fine-tune the pre-trained BERT on the base dataset, and then apply the fine-tuned BERT to obtain the sample embeddings. The cosine similarity measure is commonly used in NLP [88], and the authors follow this convention.
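
For reference, the cosine similarity between two sentence embeddings $u$ and $v$ is:

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{i} u_i v_i}{\sqrt{\sum_{i} u_i^2}\,\sqrt{\sum_{i} v_i^2}}
$$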

4.6.2. TransRate

TransRate is a metric that quantifies transferability based on the mutual information between the features extracted by a pre-trained model and their labels, requiring only a single pass over the target data. The measure reaches its minimum value when the data covariance matrices of all classes are identical, in which case the classes cannot be distinguished and no classifier can do better than random guessing. Therefore, a higher TransRate may indicate better learnability of the data.
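
As a rough reference (my reading of the TransRate definition in [86], not a formula reproduced from this paper), TransRate approximates the mutual information between features $Z$ and labels $Y$ via coding rates:

$$
\mathrm{TrR}(Z, Y) \;=\; H(Z) - H(Z \mid Y) \;\approx\; R(Z, \epsilon) - R(Z, \epsilon \mid Y),
\qquad
R(Z, \epsilon) \;=\; \frac{1}{2}\log\det\!\Big(I_d + \frac{d}{n\epsilon^2} Z^{\top} Z\Big)
$$

where $Z \in \mathbb{R}^{n \times d}$ are the (centered) features of $n$ samples and $R(Z, \epsilon \mid Y)$ is the analogous rate computed per class and averaged over classes.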

6. Summary and Discussion

This paper proposes a new data augmentation method for few-shot classification. Unlike other methods, this model extends the limited data at the semantic level to enhance data consistency and robustness, thus achieving better performance than most current text data augmentation methods. With the development of LLMs and their nature as multi-task learners [77], a range of NLP tasks can be enhanced or even replaced in a similar way.

Although AugGPT has shown good results in data augmentation, it has certain limitations. For example, in medical text recognition and augmentation, AugGPT may produce incorrect augmentation results because ChatGPT lacks domain knowledge. In future work, the authors will study adapting general-domain large language models (such as ChatGPT) to domain-specific data, such as medical texts, through model fine-tuning, in-context learning (prompt engineering), knowledge distillation, style transfer, and so on.

AugGPT shows that augmented data can effectively improve the performance of downstream classification tasks. A promising direction for future research is to study AugGPT on a wider range of downstream tasks. For example, ChatGPT has strong key-point extraction and sentence understanding abilities, which can be used for tasks such as text summarization. Specifically, ChatGPT may be valuable for summarizing domain-specific scientific papers [90] and clinical reports [91]. Publicly available datasets of domain-specific scientific paper abstracts and clinical reports are rare and usually small, due to privacy concerns and the need for expert knowledge. However, ChatGPT can address this challenge by generating augmented summary samples in different presentation styles. The data generated by ChatGPT is usually very concise, which is valuable for further enhancing the generalization ability of the trained model.

The rapid rise of generative image models such as DALL-E 2 [92] and Stable Diffusion [93] provides an opportunity to apply AugGPT to few-shot learning tasks in computer vision. For example, precise language descriptions can be used to guide generative models to generate images from text, or to serve as a data augmentation method for few-shot learning tasks, especially in combination with efficient fine-tuning methods [94], [95] such as LoRA for Stable Diffusion. Thus, prior knowledge from large language models can facilitate faster domain adaptation and better few-shot learning of generative models in computer vision.

Recent studies have shown that large language models (LLMs), such as GPT-3 and ChatGPT, are capable of solving theory-of-mind (ToM) tasks, previously thought to be unique to humans [96]. While the LLMs' ToM-like abilities may be an unintended byproduct of improved performance, the potential connection between LLMs, cognitive science, and the human brain is an area ripe for exploration. Advances in cognitive and brain science can also inspire and optimize the design of LLMs. For example, it has been suggested that the activation patterns of neurons in the BERT model and those in human brain networks may have similarities and can be coupled together [97]. This provides a promising new direction for developing LLMs using prior knowledge from brain science. As researchers continue to study the connection between LLMs and the human brain, new ways to enhance the performance and capabilities of AI systems may be discovered, leading to exciting breakthroughs in the field.
