SuperGLUE Tasks

1. COPA: Choice of Plausible Alternatives

This dataset represents a causal reasoning task: the system is given a premise sentence and two possible alternatives, and must choose the alternative that has the more plausible causal relationship with the premise. The method used to construct the alternatives ensures that causal reasoning is required to solve the task. Each example asks either for the plausible cause or for the plausible effect of the premise, with a simple question telling the model which of the two instance types it must disambiguate between.

Premise: I knocked on my neighbor's door.
What happened as a result?

  • Alternative 1: My neighbor invited me in.
  • Alternative 2: My neighbor left his house.
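A common way to handle a COPA example is to join the premise with each alternative using a causal connective and let any sentence-scoring model rank the two candidates. A minimal sketch, assuming the field names of the SuperGLUE jsonl release (`premise`, `choice1`, `choice2`, `question`, `label`); the scoring model itself is left out:

```python
# Build the two candidate sequences for a COPA example: "because" links
# a possible cause, "so" links a possible effect. The downstream model
# would then score the two sequences and pick the more plausible one.

def build_candidates(example):
    """Return the two premise+alternative sequences to be scored."""
    connective = "because" if example["question"] == "cause" else "so"
    stem = example["premise"].rstrip(".")
    return [f"{stem} {connective} {choice}"
            for choice in (example["choice1"], example["choice2"])]

example = {
    "premise": "I knocked on my neighbor's door.",
    "choice1": "My neighbor invited me in.",
    "choice2": "My neighbor left his house.",
    "question": "effect",
    "label": 0,
}

candidates = build_candidates(example)
print(candidates[0])
```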

2. BoolQ: Boolean Questions

BoolQ is a question-answering dataset of 15,942 yes/no questions. The questions occur naturally: they were produced in unprompted and unconstrained settings. Each example is a (question, passage, answer) triple, with the page title available as optional additional context.

{"question": "is windows movie maker part of windows essentials",
 "passage": "Windows Movie Maker – Windows Movie Maker (formerly known
as Windows Live Movie Maker in Windows 7) is a discontinued video
editing software by Microsoft. It is a part of Windows Essentials
software suite and offers the ability to create and edit videos as
well as to publish them on OneDrive, Facebook, Vimeo, YouTube, and
Flickr.", "idx": 2, "label": true}

  • train.jsonl: 9,427 labeled training examples
  • dev.jsonl: 3,270 labeled development examples
  • test.jsonl: 3,245 unlabeled test examples
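Because each line of these files is a standalone JSON object, a split loads with the standard library alone. A minimal sketch, with `io.StringIO` standing in for an open file:

```python
# Parse a jsonl split: one JSON object per line, blank lines skipped.
import io
import json

def load_jsonl(lines):
    """Parse an iterable of jsonl lines into a list of example dicts."""
    return [json.loads(line) for line in lines if line.strip()]

sample_file = io.StringIO(
    '{"question": "is windows movie maker part of windows essentials",'
    ' "idx": 2, "label": true}\n'
)
examples = load_jsonl(sample_file)
print(examples[0]["question"], examples[0]["label"])
```

In practice the same helper would be called as `load_jsonl(open("train.jsonl", encoding="utf-8"))`.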

3. CB: Commitment Bank

CB is a corpus of short texts in which at least one sentence contains an embedded clause. Each embedded clause is annotated with the expected degree of commitment to its truth. The resulting task is framed as three-class textual entailment, with examples drawn from the Wall Street Journal, fiction from the British National Corpus, and Switchboard. Each example consists of a premise containing an embedded clause, and the corresponding hypothesis is an extraction of that clause. SuperGLUE uses a subset of the dataset in which inter-annotator agreement exceeds 0.85. The data is imbalanced (neutral examples are relatively rare), so the evaluation metrics are accuracy and F1 score, where the multi-class F1 is the unweighted average of the per-class F1 scores.
In practice, CB is a textual entailment task: after processing the premise, the model checks whether a hypothesis based on that premise is neutral, entailed, or contradictory.

{"premise": "The Susweca. It means ''dragonfly'' in Sioux, you know.
Did I ever tell you that's where Paul and I met?",
 "hypothesis": "Susweca is where she and Paul met,",
 "label": "entailment", "idx": 77}
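The multi-class F1 described above can be sketched with a few lines of plain Python; the toy gold/predicted labels below are illustrative, not taken from the dataset:

```python
# Multi-class F1 as used for CB: compute per-class F1, then take the
# unweighted mean over all classes that occur in gold or predictions.

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the labels that occur."""
    scores = []
    for c in sorted(set(gold) | set(pred)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)

gold = ["entailment", "contradiction", "neutral", "entailment"]
pred = ["entailment", "contradiction", "entailment", "entailment"]
print(macro_f1(gold, pred))  # (0.8 + 1.0 + 0.0) / 3 = 0.6
```

Because the average is unweighted, the rare neutral class counts as much as the frequent ones, which is exactly why this metric suits the imbalanced CB data.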

4. MultiRC: Multi-Sentence Reading Comprehension

MultiRC is a true/false question-answering task. Each example consists of a context paragraph, a question about that paragraph, and a list of possible answers to the question, each of which must be labeled "true" or "false". Question answering is a common problem with many existing datasets.
MultiRC was chosen here because:
(1) each question can have multiple possible correct answers, so every question-answer pair must be evaluated independently of the others; (2) the questions are designed so that answering each one requires drawing facts from multiple context sentences; and (3) the question-answer pair format matches the APIs of the other SuperGLUE tasks better than span-based extractive QA does.
The paragraphs are drawn from seven domains, including news, fiction, and historical text. The evaluation metrics are the macro-average F1 over each question's set of correct answers (F1m) and the binary F1 over all answer options (F1a). For example, given the text:

"text": "The rally took place on October 17, the shooting on
February 29. Again, standard filmmaking techniques are interpreted as
smooth distortion: “Moore works by depriving you of context and
guiding your mind to fill the vacuum – with completely false ideas.
It is brilliantly, if unethically, done.” As noted above, the “from
my cold dead hands” part is simply Moore’s way to introduce Heston.
Did anyone but Moore’s critics view it as anything else? He certainly
does not “attribute it to a speech where it was not uttered” and, as
noted above, doing so twice would make no sense whatsoever if Moore
was the mastermind deceiver that his critics claim he is. Concerning
the Georgetown Hoya interview where Heston was asked about Rolland,
you write: “There is no indication that [Heston] recognized Kayla
Rolland’s case.” This is naive to the extreme – Heston would not be
president of the NRA if he was not kept up to date on the most
prominent cases of gun violence. Even if he did not respond to that
part of the interview, he certainly knew about the case at that point.
Regarding the NRA website excerpt about the case and the highlighting
of the phrase “48 hours after Kayla Rolland is pronounced dead”:
This is one valid criticism, but far from the deliberate distortion
you make it out to be; rather, it is an example for how the facts can
sometimes be easy to miss with Moore’s fast pace editing. The reason
the sentence is highlighted is not to deceive the viewer into
believing that Heston hurried to Flint to immediately hold a rally
there (as will become quite obvious), but simply to highlight the
first mention of the name “Kayla Rolland” in the text, which is in
this paragraph. "

and the answers:

"question": "When was Kayla Rolland shot?"
"answers":
[{"text": "February 17", "idx": 168, "label": 0},
 {"text": "February 29", "idx": 169, "label": 1},
 {"text": "October 29", "idx": 170, "label": 0},
 {"text": "October 17", "idx": 171, "label": 0},
 {"text": "February 17", "idx": 172, "label": 0}], "idx": 26},
{"question": "Who was president of the NRA on February 29?",
 "answers":
 [{"text": "Charleton Heston", "idx": 173, "label": 1},
  {"text": "Moore", "idx": 174, "label": 0},
  {"text": "George Hoya", "idx": 175, "label": 0},
  {"text": "Rolland", "idx": 176, "label": 0},
  {"text": "Hoya", "idx": 177, "label": 0},
  {"text": "Kayla", "idx": 178, "label": 0}], "idx": 27},

In effect, MultiRC asks the model to process a passage and, for each of several questions, pick the correct answers from a list of candidates.
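The F1a metric over all answer options reduces to ordinary binary F1. A small sketch; the gold labels follow the two questions above, and the predictions are illustrative:

```python
# F1a for MultiRC: binary F1 computed jointly over every answer option
# of every question (label 1 = "true", label 0 = "false").

def binary_f1(labels, preds):
    """Binary F1 over all answer options."""
    tp = sum(l == 1 and p == 1 for l, p in zip(labels, preds))
    fp = sum(l == 0 and p == 1 for l, p in zip(labels, preds))
    fn = sum(l == 1 and p == 0 for l, p in zip(labels, preds))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# gold labels of the 11 answer options in the two questions above
labels = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
preds  = [0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0]  # one spurious "true"
print(binary_f1(labels, preds))
```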

5. ReCoRD: Reading Comprehension with Commonsense Reasoning Dataset

ReCoRD is a reading-comprehension task that requires commonsense reasoning; the dataset contains 120,000 queries extracted from 70,000 news articles. A sample follows:

"source": "Daily mail"
A passage contains the text together with indications of where the entities are located. A passage begins with the text:
"passage": {
"text": "A Peruvian tribe once revered by the Inca's for
their fierce hunting skills and formidable warriors are clinging on to
their traditional existence in the coca growing valleys of South
America, sharing their land with drug traffickers, rebels and illegal
loggers. Ashaninka Indians are the largest group of indigenous people
in the mountainous nation's Amazon region, but their settlements are
so sparse that they now make up less than one per cent of Peru's 30
million population. Ever since they battled rival tribes for territory
and food during native rule in the rainforests of South America, the
Ashaninka have rarely known peace.\n@highlight\nThe Ashaninka tribe
once shared the Amazon with the like of the Incas hundreds of years
ago\n@highlight\nThey have been forced to share their land after
years of conflict forced rebels and drug dealers into the
forest\n@highlight\n. Despite settling in valleys rich with valuable
coca, they live a poor pre-industrial existence",

together with the locations of the entity mentions:

"entities": [{"start": 2, "end": 9}, …, {"start": 711, "end": 715}]

The task is to determine what the placeholder refers to.

{"query": "Innocence of youth: Many of the @placeholder's younger generations have turned their backs on tribal life and moved to the cities where living conditions are better",
 "answers": [{"start": 263, "end": 271, "text": "Ashaninka"},
             {"start": 601, "end": 609, "text": "Ashaninka"},
             {"start": 651, "end": 659, "text": "Ashaninka"}],
 "idx": 9}], "idx": 3}
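The start/end offsets appear to be inclusive character indices into the passage text (271 − 263 + 1 = 9 characters, the length of "Ashaninka"), so a mention can be recovered with a slice. A sketch under that assumption, using a shortened passage:

```python
# Recover entity mention strings from (start, end) character offsets,
# treating end as inclusive.

def entity_mentions(text, entities):
    """Slice each entity mention out of the passage text."""
    return [text[e["start"]: e["end"] + 1] for e in entities]

text = "Ashaninka Indians are the largest group of indigenous people."
print(entity_mentions(text, [{"start": 0, "end": 8}]))
```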

6. RTE: Recognizing Textual Entailment

A textual entailment task: the model processes the premise sentence, examines the hypothesis, and predicts the entailment label, i.e. whether or not the hypothesis is entailed by the premise.

{"premise": "U.S. crude settled $1.32 lower at $42.83 a barrel.",
 "hypothesis": "Crude the light American lowered to the closing 1.32 dollars, to 42.83 dollars the barrel.", "label": "not_entailment", "idx": 19}

7. WiC: Words in Context

Words in context: decide whether a target word has the same meaning in the two sentences under analysis.
For example, the target word: "word": "place"

"sentence1": "Do you want to come over to my place later?",
"sentence2": "A political system with no place for the less prominent
groups."
train.jsonl additionally gives the label and the position of the word:

"idx": 0,
"label": false,
"start1": 31,
"start2": 27,
"end1": 36,
"end2": 32,
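These start/end fields are character offsets with an exclusive end (end1 − start1 = 5 matches the length of "place"), so the target word can be sliced straight out of each sentence. A small sketch using the example above:

```python
# Extract the target word from both sentences of a WiC example using
# the (start, end) character offsets, where end is exclusive.

example = {
    "sentence1": "Do you want to come over to my place later?",
    "sentence2": "A political system with no place for the less prominent groups.",
    "start1": 31, "end1": 36,
    "start2": 27, "end2": 32,
    "label": False,
}

def target_words(ex):
    """Return the target word as it appears in each sentence."""
    return (ex["sentence1"][ex["start1"]: ex["end1"]],
            ex["sentence2"][ex["start2"]: ex["end2"]])

print(target_words(example))
```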

8. WSC: The Winograd Schema Challenge

The Winograd Schema Challenge is a reading-comprehension task in which a system must read a sentence containing a pronoun and select, from a list of options, the referent of that pronoun.
GLUE already includes a WSC task; it remains quite difficult, with considerable room for improvement.
In SuperGLUE, the WSC dataset is recast in its coreference form, and the task is framed as binary classification rather than an N-way multiple choice. The aim is to isolate the model's ability to understand the coreference links within a sentence, without the other strategies that can come into play in a multiple-choice setting.

{"text": "I poured water from the bottle into the cup until it was full.",

  • WSC asks the model to find the target pronoun at token number 10 (counting from 0):
    "target": {"span2_index": 10,
  • Then it asks the model to determine whether "it" refers to "the cup":
    "span1_index": 7,
    "span1_text": "the cup",
    "span2_text": "it"},
    For sample index #4, the label is true: "idx": 4, "label": true}
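In this example, span1_index and span2_index are 0-based token positions under plain whitespace tokenization, so both spans can be recovered from the text. A sketch under that assumption:

```python
# Recover WSC spans from their token indices: split the sentence on
# whitespace and join back as many tokens as the span text contains.

example = {
    "text": "I poured water from the bottle into the cup until it was full.",
    "span1_index": 7, "span1_text": "the cup",
    "span2_index": 10, "span2_text": "it",
}

def span_at(text, index, span_text):
    """Recover a span from its token index by whitespace tokenization."""
    tokens = text.split()
    width = len(span_text.split())
    return " ".join(tokens[index: index + width])

print(span_at(example["text"], example["span1_index"], example["span1_text"]))
print(span_at(example["text"], example["span2_index"], example["span2_text"]))
```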


Reposted from blog.csdn.net/weixin_39754630/article/details/119146018