CoNaLa: The Code/Natural Language Challenge
Welcome to the site of CMU CoNaLa, the Code/Natural Language Challenge, a joint project of the Carnegie Mellon University NeuLab and STRUDEL Lab! This challenge was designed to test systems for generating program snippets from natural language. For example, if the input is `sort list x in reverse order`, then the system would be required to output `x.sort(reverse=True)` in Python.
Dataset Information
We have released a dataset crawled from Stack Overflow, automatically filtered, then curated by annotators, split into 2,379 training and 500 test examples (read more about the process here). We also provide a large automatically-mined dataset with 600k examples, and links to other similar datasets. These datasets can be used for the CoNaLa challenge, or for any other research on the intersection of code and natural language.
- Download: CoNaLa Corpus v1.1
We describe the data briefly below, and you can find more detail in our MSR 2018 paper, which we’d appreciate you citing if you use the corpus in your research or participate in the challenge:
```
@inproceedings{yin2018mining,
  author = {Yin, Pengcheng and Deng, Bowen and Chen, Edgar and Vasilescu, Bogdan and Neubig, Graham},
  title = {Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow},
  booktitle = {International Conference on Mining Software Repositories},
  series = {MSR},
  pages = {476--486},
  year = {2018},
  publisher = {ACM},
  doi = {https://doi.org/10.1145/3196398.3196408},
}
```
Manually Curated Data
The manually curated CoNaLa dataset contains high-quality natural language intent and source code snippet pairs in Python, split into the `conala-train` and `conala-test` datasets.
The train/test splits are stored in JSON format. Some examples from the dataset are:
```
{
  "question_id": 36875258,
  "intent": "copying one file's contents to another in python",
  "rewritten_intent": "copy the content of file 'file.txt' to file 'file2.txt'",
  "snippet": "shutil.copy('file.txt', 'file2.txt')"
}
{
  "intent": "How do I check if all elements in a list are the same?",
  "rewritten_intent": "check if all elements in list `mylist` are the same",
  "snippet": "len(set(mylist)) == 1",
  "question_id": 22240602
}
{
  "intent": "Iterate through words of a file in Python",
  "rewritten_intent": "get a list of words `words` of a file 'myfile'",
  "snippet": "words = open('myfile').read().split()",
  "question_id": 7745260
}
```
Here is the description of each field in an example:
Field | Description |
---|---|
question_id | Id of the Stack Overflow question |
intent | Natural Language intent (i.e., the title of a Stack Overflow question) |
rewritten_intent | Crowdsourced revised intents that try to better reflect the full meaning of the code, typically done by incorporating variable names and function arguments that appeared in the code into the intent. This is the input to be used by systems in the CoNaLa challenge. |
snippet | A code snippet that implements the intent. This is the output of systems in the challenge. |
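As a quick illustration of working with these fields, here is a minimal sketch that loads the curated training split and prints a few input/output pairs. It assumes the split is stored as a single JSON array in a file named `conala-train.json` (the file name is an assumption about the archive layout), and that `rewritten_intent` may be missing for some examples:

```python
import json

# Assumes the curated split is one JSON array in 'conala-train.json'.
with open('conala-train.json') as f:
    examples = json.load(f)

for ex in examples[:3]:
    # Prefer the crowdsourced rewrite; fall back to the raw intent
    # when no rewrite is available.
    intent = ex.get('rewritten_intent') or ex['intent']
    print(f"{ex['question_id']}: {intent!r} -> {ex['snippet']!r}")
```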
Other Data Sources
In the CoNaLa challenge, you are allowed to use other data sources to improve your system accuracy as long as you exclude any information from the specific Stack Overflow questions that are included in the test set. We provide links to a number of data sources below, but other sources may be used as well:
Automatically Mined Intent/Snippet Pairs
The above archive includes 598,237 candidate intent/snippet pairs mined by our system, in the `conala-mined` dataset. The file is stored in JSON Lines format. A description of each field is:
Field | Description |
---|---|
question_id | Id of the Stack Overflow question |
parent_answer_post_id | Id of the answer post from which the candidate snippet is extracted |
intent | The natural language intent |
snippet | The extracted code snippet |
id | Unique id for this intent/snippet pair |
prob | Probability given by the mining model |
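Since the mined pairs are noisy candidates, a common first step is to filter them by the mining model's probability. A minimal sketch (the file name `conala-mined.jsonl` is an assumption about the archive layout, and the 0.5 threshold is arbitrary):

```python
import json

high_confidence = []
with open('conala-mined.jsonl') as f:   # assumed file name
    for line in f:
        pair = json.loads(line)         # one JSON object per line
        # Keep only pairs the mining model scored above a chosen threshold.
        if pair['prob'] > 0.5:
            high_confidence.append((pair['intent'], pair['snippet']))

print(f"kept {len(high_confidence)} of the mined pairs")
```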
External Datasets
You may also use data from other external sources such as:
- Django Dataset
- StaQC: note that this is mined from Stack Overflow, so you must ensure that you do not use the questions included in the CoNaLa test set.
- Code Docstring Corpus
Training Systems
To participate in the CoNaLa challenge, you should use the `conala-train` and/or `conala-mined` datasets to train a system, take the `rewritten_intent` field of the `conala-test` dataset as input, and generate output from it. More details, along with example scripts that perform preprocessing and train a baseline sequence-to-sequence model, can be found in the conala-baseline GitHub repository.
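As a rough sketch of this interface, assuming the test split is a single JSON array in a file named `conala-test.json` (an assumption about the archive layout; `generate_snippet` is a hypothetical placeholder for your trained system, not part of the released code):

```python
import json

def generate_snippet(intent: str) -> str:
    # Hypothetical stand-in for your trained model; replace with real
    # inference. Returning 'pass' just keeps the sketch executable.
    return "pass"

with open('conala-test.json') as f:  # assumed file name
    test_examples = json.load(f)

# The challenge input is the rewritten_intent field; fall back to the
# raw intent if the rewrite is missing.
predictions = [
    generate_snippet(ex.get('rewritten_intent') or ex['intent'])
    for ex in test_examples
]
```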
Submitting Results
The results are submitted by creating a zip file containing a single file, `answer.txt`, which is in JSON array format with one line being one code snippet. An example of how to create this file can also be found in the conala-baseline directory.
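A minimal sketch of packaging a submission might look like the following; the `predictions` list here is a hypothetical stand-in for your system's outputs, one generated snippet per test example, in test-set order:

```python
import json
import zipfile

# Hypothetical predictions; in practice this is your system's output.
predictions = ["x.sort(reverse=True)", "len(set(mylist)) == 1"]

# Write answer.txt as a JSON array with one snippet per line
# (indent=0 puts each array element on its own line).
with open('answer.txt', 'w') as f:
    json.dump(predictions, f, indent=0)

# The submission is a zip archive containing only answer.txt.
with zipfile.ZipFile('submission.zip', 'w') as zf:
    zf.write('answer.txt')
```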
Once you have created this file, you can submit it to the leaderboard on CodaLab. The results are evaluated according to BLEU score after tokenization, as detailed in the scripts in the baseline GitHub repository. The official results are on the leaderboard, but we’ll also be maintaining a (potentially outdated) copy here for easy browsing:
Date | Team | Name | Description | BLEU |
---|---|---|---|---|
6/18/2018 | Organizers | seq2seq annot+mine | A baseline sequence-to-sequence model trained on both annotated and 100k mined data. | 14.26 |
6/18/2018 | Organizers | seq2seq annot | A baseline sequence-to-sequence model trained on only annotated data. | 10.58 |
Organizers
- Contact: Pengcheng Yin, Edgar Chen, Bogdan Vasilescu, Graham Neubig
Acknowledgement
The development and maintenance of the CoNaLa corpus are supported in part by the National Science Foundation under Grant No. 1815287, “Open-domain, Data-driven Code Synthesis from Natural Language”. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.