CoNaLa: The Code/Natural Language Challenge
Welcome to the site of CMU CoNaLa, the Code/Natural Language Challenge, a joint project of the Carnegie Mellon University NeuLab and STRUDEL Lab! This challenge was designed to test systems for generating program snippets from natural language. For example, if the input is `sort list x in reverse order`, then the system would be required to output `x.sort(reverse=True)` in Python.
Dataset Information
We have released a dataset crawled from Stack Overflow, automatically filtered, then curated by annotators, split into 2,379 training and 500 test examples (read more about the process here). We also provide a large automatically-mined dataset with 600k examples, and links to other similar datasets. These datasets can be used for the CoNaLa challenge, or for any other research on the intersection of code and natural language.
- Download: CoNaLa Corpus v1.1
We describe the data briefly below, and you can find more detail in our MSR 2018 paper, which we’d appreciate you citing if you use the corpus in your research or participate in the challenge:
```
@inproceedings{yin2018mining,
  author = {Yin, Pengcheng and Deng, Bowen and Chen, Edgar and Vasilescu, Bogdan and Neubig, Graham},
  title = {Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow},
  booktitle = {International Conference on Mining Software Repositories},
  series = {MSR},
  pages = {476--486},
  year = {2018},
  publisher = {ACM},
  doi = {https://doi.org/10.1145/3196398.3196408},
}
```
Manually Curated Data
The manually curated CoNaLa dataset contains high-quality natural language intent and source code snippet pairs in Python, split into the `conala-train` and `conala-test` datasets.
The train/test splits are stored in JSON format. Some examples from the dataset are:
```
{
  "question_id": 36875258,
  "intent": "copying one file's contents to another in python",
  "rewritten_intent": "copy the content of file 'file.txt' to file 'file2.txt'",
  "snippet": "shutil.copy('file.txt', 'file2.txt')"
}
{
  "intent": "How do I check if all elements in a list are the same?",
  "rewritten_intent": "check if all elements in list `mylist` are the same",
  "snippet": "len(set(mylist)) == 1",
  "question_id": 22240602
}
{
  "intent": "Iterate through words of a file in Python",
  "rewritten_intent": "get a list of words `words` of a file 'myfile'",
  "snippet": "words = open('myfile').read().split()",
  "question_id": 7745260
}
```
Here is the description of each field in an example:
Field | Description |
---|---|
question_id | Id of the Stack Overflow question |
intent | Natural Language intent (i.e., the title of a Stack Overflow question) |
rewritten_intent | Crowdsourced revised intents that try to better reflect the full meaning of the code, typically done by incorporating variable names and function arguments that appeared in the code into the intent. This is the input to be used by systems in the CoNaLa challenge. |
snippet | A code snippet that implements the intent. This is the output of systems in the challenge. |
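As a quick illustration of working with these fields, here is a minimal sketch that loads the curated training split and prints a few input/output pairs. It assumes the split is stored as a single JSON array in a file named `conala-train.json` (the file name is an assumption about the archive layout), and that `rewritten_intent` may be missing for some examples:

```python
import json

# Assumes the curated split is one JSON array in 'conala-train.json'.
with open('conala-train.json') as f:
    examples = json.load(f)

for ex in examples[:3]:
    # Prefer the crowdsourced rewrite; fall back to the raw intent
    # when no rewrite is available.
    intent = ex.get('rewritten_intent') or ex['intent']
    print(f"{ex['question_id']}: {intent!r} -> {ex['snippet']!r}")
```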
Other Data Sources
In the CoNaLa challenge, you are allowed to use other data sources to improve your system accuracy as long as you exclude any information from the specific Stack Overflow questions that are included in the test set. We provide links to a number of data sources below, but other sources may be used as well:
Automatically Mined Intent/Snippet Pairs
The above archive includes 598,237 candidate intent/snippet pairs mined by our system, in the `conala-mined` dataset. The file is stored in JSON Lines format. A description of each field is:
Field | Description |
---|---|
question_id | Id of the Stack Overflow question |
parent_answer_post_id | Id of the answer post from which the candidate snippet is extracted |
intent | The natural language intent |
snippet | The extracted code snippet |
id | Unique id for this intent/snippet pair |
prob | Probability given by the mining model |
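Since the mined pairs are noisy candidates, a common first step is to filter them by the mining model's probability. A minimal sketch (the file name `conala-mined.jsonl` is an assumption about the archive layout, and the 0.5 threshold is arbitrary):

```python
import json

high_confidence = []
with open('conala-mined.jsonl') as f:   # assumed file name
    for line in f:
        pair = json.loads(line)         # one JSON object per line
        # Keep only pairs the mining model scored above a chosen threshold.
        if pair['prob'] > 0.5:
            high_confidence.append((pair['intent'], pair['snippet']))

print(f"kept {len(high_confidence)} of the mined pairs")
```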
External Datasets
You may also use data from other external sources such as:
- Django Dataset
- StaQC: note that this is mined from Stack Overflow, so you must ensure that you do not use the questions included in the CoNaLa test set.
- Code Docstring Corpus
Training Systems
To participate in the CoNaLa challenge, you should use the `conala-train` and/or `conala-mined` datasets to train a system, take the `rewritten_intent` field of the `conala-test` dataset as input, and generate output from it. More details, along with example scripts that perform preprocessing and train a baseline sequence-to-sequence model, can be found in the conala-baseline GitHub repository.
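As a rough sketch of this interface, assuming the test split is a single JSON array in a file named `conala-test.json` (an assumption about the archive layout; `generate_snippet` is a hypothetical placeholder for your trained system, not part of the released code):

```python
import json

def generate_snippet(intent: str) -> str:
    # Hypothetical stand-in for your trained model; replace with real
    # inference. Returning 'pass' just keeps the sketch executable.
    return "pass"

with open('conala-test.json') as f:  # assumed file name
    test_examples = json.load(f)

# The challenge input is the rewritten_intent field; fall back to the
# raw intent if the rewrite is missing.
predictions = [
    generate_snippet(ex.get('rewritten_intent') or ex['intent'])
    for ex in test_examples
]
```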
Submitting Results
The results are submitted by creating a zip file containing a single file, `answer.txt`, which is in JSON array format with one line being one code snippet. An example of how to create this file can also be found in the conala-baseline directory.
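A minimal sketch of packaging a submission might look like the following; the `predictions` list here is a hypothetical stand-in for your system's outputs, one generated snippet per test example, in test-set order:

```python
import json
import zipfile

# Hypothetical predictions; in practice this is your system's output.
predictions = ["x.sort(reverse=True)", "len(set(mylist)) == 1"]

# Write answer.txt as a JSON array with one snippet per line
# (indent=0 puts each array element on its own line).
with open('answer.txt', 'w') as f:
    json.dump(predictions, f, indent=0)

# The submission is a zip archive containing only answer.txt.
with zipfile.ZipFile('submission.zip', 'w') as zf:
    zf.write('answer.txt')
```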
Once you have created this file, you can submit it to the leaderboard on CodaLab. The results are evaluated according to BLEU score after tokenization, as detailed in the scripts in the baseline GitHub repository. The official results are on the leaderboard, but we’ll also be maintaining a (potentially outdated) copy here for easy browsing:
Date | Team | Name | Description | BLEU |
---|---|---|---|---|
6/18/2018 | Organizers | seq2seq annot+mine | A baseline sequence-to-sequence model trained on both annotated and 100k mined data. | 14.26 |
6/18/2018 | Organizers | seq2seq annot | A baseline sequence-to-sequence model trained on only annotated data. | 10.58 |
Organizers
- Contact: Pengcheng Yin, Edgar Chen, Bogdan Vasilescu, Graham Neubig
Acknowledgement
The development and maintenance of the CoNaLa corpus are supported in part by the National Science Foundation under Grant No. 1815287, “Open-domain, Data-driven Code Synthesis from Natural Language”. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.