GLUE leaderboard

Natural language processing (NLP) mainly comprises natural language understanding (NLU) and natural language generation (NLG). The nine tasks of GLUE (General Language Understanding Evaluation) cover natural language inference, textual entailment, sentiment analysis, and semantic similarity. Well-known models such as BERT, XLNet, RoBERTa, ERNIE, and T5 are evaluated on this benchmark.

Leaderboard address: GLUE Benchmark (https://gluebenchmark.com/)

CoLA: whether a sentence is grammatically acceptable; binary classification, 0 = ungrammatical, 1 = grammatical.

SST-2: sentiment classification; binary classification of positive vs. negative sentiment.

MRPC: whether two sentences are semantically equivalent; binary classification.

STS-B: sentence similarity scoring; regression over [0, 5].

QQP: whether two questions are semantically equivalent; binary classification.

MNLI: relation between a premise sentence and a hypothesis sentence; three-way classification into entailment, contradiction, and neutral.

QNLI: whether a sentence contains the answer to a question; binary classification.

RTE: whether a sentence pair exhibits entailment; binary classification.

WNLI: whether a sentence pair exhibits entailment; binary classification.
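For quick reference, the overview above can be captured in a small Python mapping. The task-type and metric strings below are informal labels for this summary, not identifiers from any particular library.

```python
# Convenience summary of the nine GLUE tasks: task -> (type, metric).
GLUE_TASKS = {
    "CoLA":  ("single-sentence classification", "Matthews corr."),
    "SST-2": ("single-sentence classification", "accuracy"),
    "MRPC":  ("sentence-pair classification",   "accuracy / F1"),
    "STS-B": ("sentence-pair regression",       "Pearson / Spearman corr."),
    "QQP":   ("sentence-pair classification",   "accuracy / F1"),
    "MNLI":  ("sentence-pair classification",   "matched / mismatched accuracy"),
    "QNLI":  ("sentence-pair classification",   "accuracy"),
    "RTE":   ("sentence-pair classification",   "accuracy"),
    "WNLI":  ("sentence-pair classification",   "accuracy"),
}

for task, (kind, metric) in GLUE_TASKS.items():
    print(f"{task:6s} {kind:35s} {metric}")
```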

1. CoLA dataset

CoLA (The Corpus of Linguistic Acceptability) is a single-sentence classification task whose corpus comes from books and journal articles on linguistic theory.

Number of samples: 8,551 in the training set, 1,043 in the development set, and 1,063 in the test set.

Task: judge whether a sentence is grammatical; 0 means ungrammatical, 1 means grammatical.

Evaluation metric: MCC (Matthews correlation coefficient), a binary-classification metric suited to cases where the positive and negative classes are highly imbalanced.
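For reference, the standard definition of MCC in terms of the binary confusion-matrix counts (TP, TN, FP, FN) is:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}$$

An MCC of 1 indicates perfect prediction, 0 is no better than chance, and -1 indicates total disagreement.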

train.tsv and dev.tsv share the same format and contain 4 columns. The first column (e.g. gj04, bc01) identifies the source of each sentence, i.e. the publication code; the second column, 0 or 1, indicates whether the sentence is grammatical, with 0 meaning incorrect and 1 meaning correct; the third column is the original author's acceptability annotation (a "*" marks a sentence judged unacceptable), which conveys the same information as the second column; the fourth column is the sentence itself.

The data in test.tsv is divided into 2 columns: the first column is the index of each example, and the second column is the sentence used for testing.
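As a minimal sketch (assuming the headerless 4-column layout described above and a local path such as CoLA/train.tsv), the training file can be loaded with pandas and MCC computed with scikit-learn:

```python
import pandas as pd
from sklearn.metrics import matthews_corrcoef

# Load CoLA train.tsv, assuming the headerless 4-column layout described above:
# source code, acceptability label (0/1), original author's mark, sentence.
cols = ["source", "label", "author_mark", "sentence"]
train = pd.read_csv("CoLA/train.tsv", sep="\t", header=None, names=cols,
                    quoting=3)  # 3 = csv.QUOTE_NONE, keep raw quotes in sentences

print(train["label"].value_counts())  # inspect the class imbalance

# MCC compares gold labels against model predictions; a trivial majority-class
# baseline stands in for a real model here (it yields an MCC of 0).
majority_pred = [train["label"].mode()[0]] * len(train)
print("Majority-baseline MCC:", matthews_corrcoef(train["label"], majority_pred))
```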

2. SST-2 dataset

SST-2 (The Stanford Sentiment Treebank) is a single-sentence classification task built from movie-review sentences with human-annotated sentiment.

Number of samples: 67,350 in the training set, 873 in the development set, and 1,821 in the test set.

Task: binary sentiment classification: positive sentiment (labeled 1) vs. negative sentiment (labeled 0).

Evaluation metric: accuracy.

train.tsv is divided into 2 columns: the first column is the review text; the second column, 0 or 1, indicates whether the review is negative (0) or positive (1).

The data in test.tsv is divided into 2 columns: the first column is the index of each example, and the second column is the sentence used for testing.

3. MRPC dataset

MRPC (The Microsoft Research Paraphrase Corpus) is a corpus of sentence pairs automatically extracted from online news sources and manually annotated for whether the two sentences in each pair are semantically equivalent. The classes are imbalanced, with 68% positive samples.

Number of samples: 3,668 in the training set, 408 in the development set, and 1,725 in the test set.

Task: binary classification of whether the two sentences are semantically equivalent.

Evaluation metrics: accuracy and F1 score.

train.tsv is divided into 5 columns: the first column, 0 or 1, indicates whether the two sentences have the same meaning, with 0 meaning different and 1 meaning the same; the second and third columns are the ids of the two sentences; the fourth and fifth columns are the sentences themselves.

The data in test.tsv is divided into 5 columns: the first column is the index of each example; the remaining columns have the same meaning as in train.tsv.
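A minimal sketch of how accuracy and F1 are typically computed with scikit-learn; the label lists below are toy values for illustration, not real MRPC data:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy gold labels and predictions (1 = semantically equivalent, 0 = not).
y_true = [1, 0, 1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))  # F1 of the positive ("equivalent") class
```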

4. STS-B dataset

STS-B (The Semantic Textual Similarity Benchmark) is a collection of sentence pairs drawn from news headlines, video captions, image captions, and natural language inference data; each pair is annotated with a similarity score, a floating-point number from 0 to 5.

Number of samples: 5,749 in the training set, 1,379 in the development set, and 1,377 in the test set.

Task: a regression task, predicting a floating-point similarity score in the range [0, 5].

Evaluation metrics: Pearson and Spearman correlation coefficients (Pearson-Spearman Corr).

The data in train.tsv is divided into 10 columns: the first column is the data index; the second column is the source/genre of each sentence pair (e.g. main-captions means it comes from captions); the third column is the name of the specific source file; the fourth column is the year of appearance; the fifth column is the index in the original data; the sixth and seventh columns are the original sources of the two sentences; the eighth and ninth columns are the sentence pair whose similarity is to be judged; the tenth column is the similarity score of the pair, ranging from low to high within [0, 5].

The data in test.tsv is divided into 9 columns, with the same meaning as the first 9 columns of train.tsv.
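A minimal sketch of the Pearson-Spearman evaluation using scipy; the gold and predicted scores below are made-up values for illustration:

```python
from scipy.stats import pearsonr, spearmanr

# Made-up gold similarity scores (in [0, 5]) and model predictions.
gold = [4.8, 2.5, 0.0, 3.2, 1.7, 4.1]
pred = [4.5, 2.9, 0.4, 3.0, 2.1, 3.8]

pearson = pearsonr(gold, pred)[0]
spearman = spearmanr(gold, pred)[0]
print(f"Pearson: {pearson:.3f}  Spearman: {spearman:.3f}  "
      f"Average: {(pearson + spearman) / 2:.3f}")
```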

5. QQP dataset

QQP (The Quora Question Pairs) is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent. Like MRPC, QQP has imbalanced classes, but here negative samples account for 63% and positive samples for 37%. GLUE uses a standard test set whose labels were obtained privately from the authors, and this test set is distributed differently from the training set.

Number of samples: 363,870 in the training set, 40,431 in the development set, and 390,965 in the test set.

Task: determine whether the two questions in a pair are semantically equivalent; a binary classification task.

Evaluation metrics: accuracy and F1 score.

The data in train.tsv is divided into 6 columns: the first column is the example index; the second and third columns are the ids of question 1 and question 2; the fourth and fifth columns are the question pair to be judged for duplication; the sixth column is the label indicating whether the questions are duplicates, with 0 meaning not duplicates and 1 meaning duplicates.

The data in test.tsv is divided into 3 columns: the first column is the index of each example; the second and third columns are the question pair used for testing.

6. MNLI dataset

MNLI (The Multi-Genre Natural Language Inference Corpus) is a crowdsourced collection of sentence pairs annotated with textual entailment information. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts it (contradiction), or neither (neutral).

Number of samples: 392,702 in the training set; 9,815 in the matched development set (dev-matched) and 9,832 in the mismatched development set (dev-mismatched); 9,796 in the matched test set (test-matched) and 9,847 in the mismatched test set (test-mismatched). Because MNLI gathers text from many different genres, it is split into matched and mismatched evaluation sets: matched means the evaluation data comes from the same sources as the training set, while mismatched means it comes from different sources.

Task: sentence pairs consisting of a premise and a hypothesis. The relation between premise and hypothesis can be one of three cases: entailment, contradiction, or neutral. A three-way sentence-pair classification problem.

Evaluation metrics: matched accuracy / mismatched accuracy.

The data in train.tsv is divided into 12 columns: the first column is the example index; the second and third columns are ids of different types for the sentence pair; the fourth column is the genre of the sentence pair; the fifth and sixth columns are the sentence pair with syntactic parse annotations; the seventh and eighth columns are the sentence pair with syntactic parse and part-of-speech annotations; the ninth and tenth columns are the original sentence pair; the eleventh and twelfth columns are the labels produced by different annotation procedures, which are always identical here. There are three labels: neutral means the two sentences are neither contradictory nor entailing, entailment means the premise entails the hypothesis, and contradiction means the two sentences contradict each other.

The data in test_matched.tsv is divided into 10 columns, with the same meaning as the first 10 columns of train.tsv.
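A minimal sketch of the matched/mismatched evaluation; the label-to-id mapping and the toy label lists are assumptions for illustration, not an official convention:

```python
from sklearn.metrics import accuracy_score

# Assumed label-to-id mapping for MNLI's three classes (conventions vary by codebase).
LABEL2ID = {"entailment": 0, "neutral": 1, "contradiction": 2}

def mnli_accuracy(gold_labels, pred_labels):
    gold = [LABEL2ID[label] for label in gold_labels]
    pred = [LABEL2ID[label] for label in pred_labels]
    return accuracy_score(gold, pred)

# MNLI is reported as two numbers, one per evaluation split (toy labels below).
matched_acc = mnli_accuracy(["entailment", "neutral", "contradiction"],
                            ["entailment", "contradiction", "contradiction"])
mismatched_acc = mnli_accuracy(["neutral", "neutral", "entailment"],
                               ["neutral", "entailment", "entailment"])
print(f"matched accuracy: {matched_acc:.2f} / mismatched accuracy: {mismatched_acc:.2f}")
```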

7. QNLI dataset

QNLI (Question-answering NLI, Question Answering Natural Language Inference) is converted from the Stanford Question Answering Dataset (SQuAD).

Number of samples: 104,743 in the training set, 5,463 in the development set, and 5,461 in the test set.

Task: judge whether the sentence contains the answer to the question; a binary classification into entailment and not entailment.

Evaluation metric: accuracy.

The data in train.tsv is divided into 4 columns: the first column is the example index; the second and third columns are the question and sentence to be judged for entailment; the fourth column indicates whether the two have an entailment relation, with 0/not_entailment meaning no entailment and 1/entailment meaning entailment.

The data in test.tsv is divided into 3 columns: the first column is the index of each example; the second and third columns are the question-sentence pair to be judged for entailment.
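Since the labels in the QNLI files are strings, a typical preprocessing step maps them to integers before training a binary classifier. The sketch below follows the 0/not_entailment, 1/entailment convention mentioned above; the local path and the presence of a header row are assumptions about the downloaded GLUE copy.

```python
import csv

PATH = "QNLI/train.tsv"  # assumed local path to the GLUE QNLI training file
LABEL_MAP = {"not_entailment": 0, "entailment": 1}

examples = []
with open(PATH, encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header row, assuming one is present
    for row in reader:
        # Columns as described above: index, question, sentence, label.
        question, sentence, label = row[1], row[2], row[3]
        examples.append((question, sentence, LABEL_MAP[label]))

print(len(examples), "training examples loaded")
```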

8. RTE dataset

RTE (The Recognizing Textual Entailment datasets) is a natural language inference task built from a series of datasets collected from the annual Recognizing Textual Entailment challenges.

Number of samples: 2,491 in the training set, 277 in the development set, and 3,000 in the test set.

Task: judge whether the first sentence entails the second; a binary classification task.

Evaluation metric: accuracy.

The formats of train.tsv and test.tsv are essentially the same as in the QNLI dataset.

9. WNLI dataset

WNLI (Winograd NLI, Winograd Natural Language Inference) is a natural language inference task whose data are converted from a competition dataset (the Winograd Schema Challenge). The two classes are balanced in the training set, but the test set is imbalanced, with 65% of examples labeled not entailment.

Number of samples: 635 in the training set, 71 in the development set, and 146 in the test set.

Task: judge whether the sentence pair exhibits entailment; a binary classification task.

Evaluation metric: accuracy.

The formats of train.tsv and test.tsv are essentially the same as in the QNLI dataset.
