New research exposes a shortcoming of GPT-4: it still can't fully resolve ambiguity in language!

Xi Xiaoyao Technology Says | Original
Author | IQ dropped a place, Python
Natural Language Inference (NLI) is an important task in natural language processing: given a premise and a hypothesis, the goal is to judge whether the hypothesis can be inferred from the premise. Ambiguity, however, is an intrinsic feature of natural language, and handling it is an essential part of human language understanding. Because human language is so flexible and varied, resolving ambiguity has become one of the key difficulties in natural language reasoning. Natural language processing systems are already applied in question answering, speech recognition, machine translation, and natural language generation, yet even with these technologies, fully resolving ambiguity remains a very challenging task.

Large language models such as GPT-4 do face challenges on NLI tasks. One problem is that ambiguity makes it hard for a model to pin down the intended meaning of a sentence. In addition, because natural language is flexible and diverse, many different relationships can hold between two texts, which makes NLI datasets extremely complex and poses a formidable challenge to a model's generality and ability to generalize. Handling ambiguous language will therefore be crucial to the future success of large models, which are already widely used in areas such as conversational interfaces and writing aids. Dealing with ambiguity helps a model adapt to different contexts, communicate more clearly, and recognize misleading or deceptive speech.

The title of this paper on ambiguity in large models is itself a pun: "We're Afraid..." expresses the authors' concern that language models do not model ambiguity accurately, while also being exactly the kind of ambiguous construction the paper studies. The paper also shows that researchers are working to develop new benchmarks that genuinely challenge powerful new large models, pushing them toward more accurate understanding and generation of natural language and toward new breakthroughs.

Paper Title
We're Afraid Language Models Aren't Modeling Ambiguity

Paper link
https://arxiv.org/abs/2304.14399

Code and data address
https://github.com/alisawuffles/ambient


Quick Overview

The authors set out to investigate the ability of pre-trained large models to recognize and distinguish sentences with multiple possible interpretations, evaluating how well a model tells different readings apart. However, existing benchmark data usually does not contain ambiguous examples, so the authors had to construct their own benchmark and experiments to explore this question.

The traditional NLI three-way labeling scheme is an annotation method for natural language inference (NLI) tasks in which annotators choose one of three labels to describe the relationship between the premise and the hypothesis. These three labels are usually "entailment", "neutral", and "contradiction".
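As a concrete illustration (the sentences below are invented for this article, not taken from any dataset), a single-label NLI instance under this scheme can be represented as follows:

```python
# A minimal illustration of the traditional three-way NLI labeling scheme.
# The example sentences are invented for illustration only.

NLI_LABELS = {"entailment", "neutral", "contradiction"}

example = {
    "premise": "The cat left the house and never came back.",
    "hypothesis": "The cat is gone.",
    "label": "entailment",  # exactly one label is chosen per annotator
}

assert example["label"] in NLI_LABELS
```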

The authors run their experiments in the format of an NLI task, taking a functional approach that characterizes ambiguity in the premise or hypothesis by its effect on entailment relations. They propose a benchmark called AMBIENT (Ambiguity in Entailment) that covers a variety of lexical, syntactic, and pragmatic ambiguities, and more broadly covers sentences that may convey multiple distinct messages.

As shown in Figure 1, ambiguity can cause an unintended misunderstanding (top of Figure 1) or be used deliberately to mislead the audience (bottom of Figure 1). For example, if a cat becomes disoriented after leaving the house, it is "lost" in the sense that it cannot find its way home (the entailment reading); if its owners cannot find it, it is "lost" in another sense (the neutral reading).

▲Figure 1 An example of ambiguity, illustrated with the "lost cat" sentence

Introduction to the AMBIENT dataset

Curated examples

The authors provide 1,645 sentence examples covering multiple types of ambiguity, including hand-written examples and examples drawn from existing NLI datasets and linguistics textbooks. Each example in AMBIENT contains a set of labels corresponding to the possible readings, along with a disambiguating rewrite for each reading, as shown in Table 1.

▲Table 1 Premise–hypothesis pairs from the curated examples
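To make the structure of Table 1 concrete, an AMBIENT example can be thought of as a premise–hypothesis pair carrying a set of labels plus one disambiguating rewrite per reading. The sketch below is our own schematic rendering; the field names and sentences are illustrative and are not taken from the released data files:

```python
# Schematic rendering of one AMBIENT example: a set of labels, each tied to a
# disambiguating rewrite of the ambiguous sentence. Field names and sentences
# are illustrative only, not the dataset's actual format.

ambient_example = {
    "premise": "The cat was lost after leaving the house.",
    "hypothesis": "The cat could not find its way back home.",
    "labels": ["entailment", "neutral"],  # one label per possible reading
    "disambiguations": [
        {
            "rewrite": "The cat was unable to find its way back after leaving the house.",
            "label": "entailment",
        },
        {
            "rewrite": "The cat could not be found after leaving the house.",
            "label": "neutral",
        },
    ],
}
```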

Generated examples

The researchers also employ overgeneration and filtering to build a large corpus of unlabeled NLI examples that covers different kinds of ambiguity more comprehensively. Inspired by previous work, they automatically identify premise–hypothesis pairs that share an inference pattern and improve corpus quality by encouraging the creation of new examples with the same pattern.
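A rough sketch of what such an overgenerate-and-filter step could look like is shown below. The prompt wording and the `complete()` call are placeholders for whatever instruction-tuned model is used (the paper uses InstructGPT); they are assumptions for illustration, not the authors' exact pipeline.

```python
# Sketch of overgeneration: prompt an instruction-tuned LM with a few existing
# premise-hypothesis pairs that share an inference pattern, and ask it to
# produce new pairs in the same pattern. The filtering step is only hinted at.

def build_prompt(seed_pairs):
    """Format seed NLI pairs that share a pattern into a few-shot prompt."""
    lines = ["Write a new premise and hypothesis with the same kind of ambiguity.\n"]
    for premise, hypothesis in seed_pairs:
        lines.append(f"Premise: {premise}\nHypothesis: {hypothesis}\n")
    lines.append("Premise:")
    return "\n".join(lines)

def complete(prompt: str) -> str:
    """Placeholder for a call to an instruction-tuned LM (e.g. InstructGPT)."""
    raise NotImplementedError("plug in your model API of choice here")

def overgenerate(seed_pairs, n=10):
    prompt = build_prompt(seed_pairs)
    candidates = [complete(prompt) for _ in range(n)]
    # Filter: keep non-empty outputs that are not verbatim copies of the seeds.
    seen = {p for p, _ in seed_pairs} | {h for _, h in seed_pairs}
    return [c for c in candidates if c.strip() and c.strip() not in seen]
```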

Annotation and Validation

The examples obtained in the previous steps still need labels and disambiguations. Each example was annotated by two experts, consolidated and validated by one expert, and further validated by some of the authors. In addition, 37 linguistics students selected a set of labels for each example and provided disambiguating rewrites. After screening and validation, 1,503 final examples remained.

The specific process is shown in Figure 2: unlabeled examples are first created with InstructGPT, then annotated independently by two linguists, and finally consolidated by one author into the final labels and disambiguations.

▲Figure 2 Annotation process for generating examples in AMBIENT

In addition, the authors examine the consistency of annotations across annotators, as well as the types of ambiguity present in the AMBIENT dataset. They randomly selected 100 samples as a development set and used the remaining samples as the test set. Figure 3 shows the distribution of set labels; each sample carries a corresponding inference-relation label set. The study finds that even on ambiguous inputs, the annotations of multiple annotators are consistent, and combining the judgments of multiple annotators improves labeling accuracy.

▲Figure 3 Distribution of set labels in AMBIENT

Does ambiguity explain "disagreement"?

This study analyzes how annotators behave when labeling ambiguous inputs under the traditional three-way NLI annotation scheme. It finds that annotators are aware of ambiguity and that ambiguity is a major source of label disagreement, challenging the popular assumption that disagreement merely reflects noise or uncertainty in the examples.

In the study, 9 crowdworkers were recruited to label each ambiguous example from the AMBIENT dataset.

The task is divided into three steps:

  1. Label the ambiguous example

  2. Identify possible alternative interpretations

  3. Label the disambiguated examples

In step 2, the three candidate interpretations consist of two possible meanings plus one similar but not identical sentence. Finally, in step 3, each candidate interpretation is substituted back into the original example to obtain three new NLI examples, and the annotator assigns a label to each of them (a rough sketch of this substitution is given below).
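The sketch below illustrates the substitution step. It assumes the ambiguity lies in the premise (the hypothesis case is symmetric); the function and field names are our own, invented for illustration.

```python
# Sketch of step 3: substitute each candidate interpretation into the original
# example, producing new NLI instances that annotators label independently.

def disambiguated_instances(premise, hypothesis, interpretations):
    """Yield one new NLI instance per candidate reading of the premise."""
    for reading in interpretations:
        yield {"premise": reading, "hypothesis": hypothesis, "label": None}

instances = list(disambiguated_instances(
    premise="The cat was lost after leaving the house.",
    hypothesis="The cat could not find its way back home.",
    interpretations=[
        "The cat was unable to find its way back after leaving the house.",
        "The cat could not be found after leaving the house.",
        "The cat wandered off after leaving the house.",  # similar but distinct sentence
    ],
))
# Each of the three instances is then given a single
# entailment/neutral/contradiction label by the annotator.
```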

The results support the hypothesis: under the single-label scheme, the original ambiguous examples produce highly inconsistent annotations, i.e., people tend to make different judgments about ambiguous sentences, which leads to disagreement. However, when a disambiguation step is added to the task, annotators can generally identify and validate the multiple possible readings of a sentence, and the inconsistency largely disappears. Disambiguation is therefore an effective way to reduce the influence of annotator subjectivity on the results.

Evaluating the performance of large models

Q1. Can models directly generate disambiguations?

This part tests whether language models can learn, in context, to directly generate disambiguations and the corresponding labels. To this end, the authors construct a natural prompt and validate model performance with both automatic and human evaluation, as shown in Table 2.

▲Table 2 Few-shot template for the disambiguation task when the premise is ambiguous

In testing, each example is accompanied by 4 other test examples as in-context demonstrations, and performance is measured with the EDIT-F1 metric and human evaluation. The results are shown in Table 3: GPT-4 performed best, achieving an EDIT-F1 score of 18.0% and human-judged correctness of 32.0%. The authors also observe that during disambiguation, large models often adopt the strategy of adding extra context that directly confirms or rejects the hypothesis. Note, however, that human evaluation may overestimate a model's ability to accurately report the source of ambiguity.

▲Table 3 Performance of large models on AMBIENT
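As a rough sketch of how an edit-based F1 can be computed, our reading of the metric is that a disambiguation is represented by the words it adds to and removes from the ambiguous sentence, and F1 is taken between the predicted and reference edit sets. This is an approximation for illustration, not the paper's exact implementation:

```python
# Approximate sketch of an EDIT-F1 style metric: represent a disambiguation by
# the words it adds to / removes from the ambiguous sentence, then compute F1
# between the predicted and reference edit sets.

from collections import Counter

def edits(original: str, rewrite: str) -> Counter:
    """Words added and removed when going from `original` to `rewrite`."""
    orig, new = Counter(original.lower().split()), Counter(rewrite.lower().split())
    added, removed = new - orig, orig - new
    return Counter({("+", w): c for w, c in added.items()}) + \
           Counter({("-", w): c for w, c in removed.items()})

def edit_f1(original: str, predicted: str, reference: str) -> float:
    pred, ref = edits(original, predicted), edits(original, reference)
    if not pred or not ref:
        return float(pred == ref)      # both empty -> 1.0, one empty -> 0.0
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```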

Q2. Can models recognize whether an interpretation is plausible?

This part studies how well large models recognize ambiguous sentences. The researchers created a series of true/false statement templates and tested the models zero-shot, assessing how well they choose between the true and false versions. The best model is again GPT-4; however, once ambiguity is taken into account, GPT-4 performs worse than random guessing on ambiguous interpretations across all four templates. In addition, the large models show a consistency problem: for different interpretation pairs of the same ambiguous sentence, a model's answers may contradict each other.

These findings suggest that we need to further study how to improve large models' understanding of ambiguous sentences and how to better evaluate their performance.
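The templates below are hypothetical stand-ins for the paper's true/false statements, intended only to show the shape of this zero-shot probe; they are not the paper's exact wording.

```python
# Hypothetical true/false templates for probing whether a model accepts a
# candidate interpretation of an ambiguous sentence (illustrative wording only).

TEMPLATES = [
    "The sentence \"{ambiguous}\" can mean \"{reading}\". True or False?",
    "\"{reading}\" is a possible interpretation of \"{ambiguous}\". True or False?",
]

def build_probes(ambiguous: str, reading: str):
    """Return one prompt per template for a given sentence/interpretation pair."""
    return [t.format(ambiguous=ambiguous, reading=reading) for t in TEMPLATES]

# A model answering these prompts should say "True" for genuine readings and
# "False" for distractor readings; consistency across templates (and across
# readings of the same sentence) is part of what is being measured.
```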

Q3. Modeling different interpretations through open-ended continuation

This part studies ambiguity understanding through the lens of language modeling. Given a context, the researchers test the language model by comparing its predicted continuations under the different possible interpretations. To measure how a model handles ambiguity, they use KL divergence to quantify the model's "surprise", comparing the continuation distribution the model produces given the ambiguous context with the distribution it produces given each disambiguated context; they also introduce "distractor sentences", in which a noun is randomly replaced, to further test the model.
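A minimal sketch of the KL-divergence comparison is shown below, assuming a causal LM from the `transformers` library (GPT-2 here is only a stand-in; the paper evaluates other models) and comparing only the next-token distribution after each context rather than full continuations.

```python
# Sketch: compare the model's next-token distribution after an ambiguous
# context with the distribution after one disambiguated context, via KL
# divergence. GPT-2 is a stand-in model; the paper's setup scores continuations.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def next_token_log_probs(context: str) -> torch.Tensor:
    ids = tokenizer(context, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]   # logits after the last context token
    return torch.log_softmax(logits, dim=-1)

def kl_surprise(ambiguous_ctx: str, disambiguated_ctx: str) -> float:
    """KL(P_disambiguated || P_ambiguous) over the next-token distribution."""
    log_p = next_token_log_probs(disambiguated_ctx)
    log_q = next_token_log_probs(ambiguous_ctx)
    return float((log_p.exp() * (log_p - log_q)).sum())
```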

The experimental results show that FLAN-T5 achieves the highest accuracy, but results are inconsistent across the different test suites (LS involves synonym substitution, PC involves spelling correction, SSD involves grammatical-structure correction) and across models, indicating that ambiguity remains a serious challenge for these models.

Multi-label NLI model experiment

As shown in Table 4, NLI models fine-tuned on existing data with label variation still leave considerable room for improvement, especially on the multi-label NLI task.

a0496d411f6a8c7cc1a86e72274484ee.png
▲Table 4 Performance of multi-label NLI model on AMBIENT
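One common way to adapt an NLI classifier to the multi-label setting is sketched below. The paper fine-tunes several existing multi-label and distributional models; the sigmoid head and threshold here are our own assumptions for illustration.

```python
# Sketch of multi-label NLI prediction: instead of a softmax over the three
# labels, score each label independently (e.g. with a sigmoid) and keep every
# label above a threshold, so an ambiguous example can receive several labels.

import torch

LABELS = ["entailment", "neutral", "contradiction"]

def predict_label_set(logits: torch.Tensor, threshold: float = 0.5):
    """`logits` holds 3 per-label scores from a multi-label NLI head."""
    probs = torch.sigmoid(logits)
    chosen = [label for label, p in zip(LABELS, probs.tolist()) if p >= threshold]
    # Fall back to the single most probable label if nothing clears the threshold.
    return chosen or [LABELS[int(probs.argmax())]]

print(predict_label_set(torch.tensor([2.1, 0.3, -3.0])))  # ['entailment', 'neutral']
```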

Detecting Misleading Political Speech

This experiment examines the different ways political speech can be understood and demonstrates that a model sensitive to different readings can be put to effective use. The results are shown in Table 5: for ambiguous sentences, some interpretive paraphrases naturally disambiguate them, because a paraphrase either preserves the ambiguity or clearly expresses one specific meaning.

▲Table 5 Political claims flagged as ambiguous by the detection method in this paper

Furthermore, interpreting these predictions can reveal the source of the ambiguity. By further analyzing false positives, the authors also uncovered many ambiguities that fact-checks had not mentioned, indicating that such tools have great potential for preventing misinterpretation.
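Purely as an illustration of how such a detector might be wired up (this is our hedged reading of the setup, not the authors' exact procedure): run a multi-label NLI model with the political claim against each candidate paraphrase, and flag the claim when different paraphrases receive conflicting or multiple labels.

```python
# Illustrative sketch: flag a political claim as ambiguous when its candidate
# paraphrases receive conflicting entailment predictions from a multi-label
# NLI model. `nli_label_set` is a placeholder for such a fine-tuned model.

def nli_label_set(premise: str, hypothesis: str) -> set:
    """Placeholder: return the label set predicted by a multi-label NLI model."""
    raise NotImplementedError

def looks_ambiguous(claim: str, paraphrases: list) -> bool:
    label_sets = [nli_label_set(claim, p) for p in paraphrases]
    # If different paraphrases get different label sets, or any single
    # paraphrase gets multiple labels, the claim admits more than one reading.
    return len({frozenset(s) for s in label_sets}) > 1 or any(len(s) > 1 for s in label_sets)
```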

Summary

As the paper points out, ambiguity in natural language will be a key challenge for model optimization. We expect that, as the technology develops, natural language understanding models will identify context and emphasis in text more accurately and show higher sensitivity when handling ambiguous text. Although this paper establishes a benchmark for evaluating whether NLP models can identify ambiguity, and gives us a better understanding of models' limitations in this area, it remains a very challenging task.

We look forward to the emergence of more sophisticated and accurate natural language understanding models that help machines understand human language more comprehensively and make AI more widely useful. Looking ahead, chatbots may become true companions, better at understanding our needs and providing more intelligent recommendations and answers~



Origin blog.csdn.net/xixiaoyaoww/article/details/130538075