Merged
63 commits
02800cf
Microsoft Code X Glue datasets.
madlag Dec 9, 2020
3529e93
Fix in READMEs.
madlag Dec 9, 2020
b2595ea
Changing language type to "code"
madlag Dec 14, 2020
56792f9
Changing the dataset to original (=not in the datasets repository)
madlag Dec 14, 2020
9d843c0
Remove template script
May 12, 2021
33514d3
Revert changes to dummy_data.py script
May 12, 2021
80dbab6
Remove additional readme template
May 12, 2021
b1082dd
Add contribution subsection to readmes
May 12, 2021
bc2e755
Fix camel case
May 12, 2021
29c03d3
Update desc and cites to use global vars
May 13, 2021
7284a61
Fix typos
May 13, 2021
5afdbe0
Merge branch 'huggingface:master' into microsoft-codexglue-code-to-co…
May 13, 2021
8f8d7e2
Remove extra lines and update contributions
May 13, 2021
c37ae3a
Fix typos and camel case
May 13, 2021
5392aa7
Fix styling
May 13, 2021
67afd5f
Merge branch 'microsoft-codexglue-code-to-code-trans' of https://gith…
May 13, 2021
22d2da5
Update datasets/code_x_glue_cc_clone_detection_poj_104/generated_defi…
May 14, 2021
a6cca2f
Add encodings to all open calls
May 14, 2021
f720467
Convert clone detection poj dataset to use yield instead of writing t…
May 17, 2021
f451cc1
Fix styling
May 17, 2021
38e4d76
Remove marker file being written
May 18, 2021
4f2e277
Fix styling
May 18, 2021
20c96c6
Merge branch 'huggingface:master' into microsoft-codexglue-code-to-co…
May 18, 2021
18104c4
Update datasets/code_x_glue_cc_clone_detection_big_clone_bench/README.md
May 27, 2021
55149c0
Update datasets/code_x_glue_cc_clone_detection_poj_104/README.md
May 27, 2021
a78cba1
Update datasets/code_x_glue_cc_clone_detection_poj_104/README.md
May 27, 2021
b4a130c
Update datasets/code_x_glue_cc_cloze_testing_all/README.md
May 27, 2021
2b0e821
Update datasets/code_x_glue_cc_cloze_testing_maxmin/README.md
May 27, 2021
f86a919
Update datasets/code_x_glue_tc_text_to_code/README.md
May 27, 2021
cc4bc03
Update datasets/code_x_glue_tc_text_to_code/README.md
May 27, 2021
b87aed6
Update datasets/code_x_glue_ct_code_to_text/README.md
May 27, 2021
96187d7
Update datasets/code_x_glue_ct_code_to_text/README.md
May 27, 2021
61c34ee
Update datasets/code_x_glue_tt_text_to_text/README.md
May 27, 2021
2fc0beb
Merge remote-tracking branch 'origin/master' into microsoft-codexglue…
May 27, 2021
8246dae
Add new TOC outline
May 27, 2021
3d77085
Fill in new README sections for big clone bench
May 30, 2021
29e91e9
Fill in new README sections for POJ
May 30, 2021
d1c5214
Fill in new README sections for the Cloze Test benchmarks
May 30, 2021
8454b03
Remove extra square bracket
May 30, 2021
9678ecd
Fill in new README sections for the Code Completion benchmarks
May 30, 2021
60628a9
Fill in new README sections for the Code Refinement dataset
May 30, 2021
60a618a
Fill in new README sections for the Code Translation dataset
May 30, 2021
13194f1
Fill in new README sections for the Code Defect Detection dataset
May 30, 2021
2fe2177
Change lang tag to code
May 30, 2021
32d952f
Fill in new README sections for the Code Docstring Generation dataset
May 30, 2021
b6efa8a
Update task tag
May 30, 2021
b6a38e7
Fill in new README sections for the Code Search dataset
May 30, 2021
67826ad
Update task tags
May 30, 2021
ef41c23
Fill in new README sections for the Code Generation dataset
May 30, 2021
70ba739
Fill in new README sections for the Code Documentation Translation da…
May 30, 2021
1920c9c
Fix heading format, update task tags, and add missing spaces for cont…
May 30, 2021
27c1c93
Rename sections to proper names
Jun 5, 2021
2825aaf
Add additional source data subsubsections and add new data on source …
Jun 5, 2021
3437146
Add missing subsubsections of Annotations
Jun 5, 2021
4186048
Fix missing source_data yaml tag
Jun 5, 2021
95fa8ee
Update source_data tag to valid one
Jun 5, 2021
5e10c18
Fix additional information subsection heading
Jun 5, 2021
8d91fb1
Fix heading format
Jun 5, 2021
905ce59
Fix language tag with code
Jun 5, 2021
98ed142
Hopefully fix codec issue
Jun 5, 2021
1805d08
Moving code_x_glue_cc_clone_detection_poj_104 to code_x_glue_cc_clone…
madlag Jun 7, 2021
020899b
Merge branch 'huggingface:master' into microsoft-codexglue-code-to-co…
Jun 7, 2021
b95a4a9
Fix headings and remove special chars
Jun 8, 2021
185 changes: 185 additions & 0 deletions datasets/code_x_glue_cc_clone_detection_big_clone_bench/README.md
@@ -0,0 +1,185 @@
---
annotations_creators:
- found
language_creators:
- found
languages:
- code
licenses:
- other-C-UDA
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- semantic-similarity-classification
---
# Dataset Card for "code_x_glue_cc_clone_detection_big_clone_bench"

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench

### Dataset Summary

CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench

Given two code snippets as input, the task is binary classification (0/1), where 1 stands for semantic equivalence and 0 for anything else. Models are evaluated by F1 score.
The dataset is BigCloneBench, filtered following the paper Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree.
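
The F1 evaluation mentioned above can be sketched in a few lines. This is a minimal illustration, assuming F1 is computed for the positive (clone) class only; the official CodeXGLUE evaluator may use a different averaging, and the name `binary_f1` is hypothetical.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 score of the positive (clone) class. Hypothetical helper, not the official evaluator."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)

    # With no true positives, precision and recall are both zero (or undefined).
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Here `y_true` would hold the `label` values of a split and `y_pred` the model's predictions for the same pairs.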

### Supported Tasks and Leaderboards

- `semantic-similarity-classification`: The dataset can be used to train a model for classifying whether two given Java methods are clones of each other.

### Languages

- Java **programming** language

## Dataset Structure

### Data Instances

An example of 'test' looks as follows.
```
{
"func1": " @Test(expected = GadgetException.class)\n public void malformedGadgetSpecIsCachedAndThrows() throws Exception {\n HttpRequest request = createCacheableRequest();\n expect(pipeline.execute(request)).andReturn(new HttpResponse(\"malformed junk\")).once();\n replay(pipeline);\n try {\n specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n fail(\"No exception thrown on bad parse\");\n } catch (GadgetException e) {\n }\n specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n }\n",
"func2": " public InputStream getInputStream() throws TGBrowserException {\n try {\n if (!this.isFolder()) {\n URL url = new URL(this.url);\n InputStream stream = url.openStream();\n return stream;\n }\n } catch (Throwable throwable) {\n throw new TGBrowserException(throwable);\n }\n return null;\n }\n",
"id": 0,
"id1": 2381663,
"id2": 4458076,
"label": false
}
```

### Data Fields

Each data field is explained below for each config. The data fields are the same among all splits.

#### default

|field name| type | description |
|----------|------|---------------------------------------------------|
|id |int32 | Index of the sample |
|id1 |int32 | The first function id |
|id2 |int32 | The second function id |
|func1 |string| The full text of the first function |
|func2 |string| The full text of the second function |
|label |bool | True if the functions are semantically equivalent, False otherwise|
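
As an illustration of the schema above, a record can be mirrored by a small typed container. This is only a sketch: `ClonePair` and `from_record` are hypothetical names and are not part of the dataset script.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ClonePair:
    """Hypothetical typed view of one record, mirroring the fields in the table."""

    id: int      # Index of the sample
    id1: int     # The first function id
    id2: int     # The second function id
    func1: str   # The full text of the first function
    func2: str   # The full text of the second function
    label: bool  # True if the two functions are semantically equivalent

    @classmethod
    def from_record(cls, record: Mapping[str, Any]) -> "ClonePair":
        # Coerce each field to the Python type matching the declared dtype.
        return cls(
            id=int(record["id"]),
            id1=int(record["id1"]),
            id2=int(record["id2"]),
            func1=str(record["func1"]),
            func2=str(record["func2"]),
            label=bool(record["label"]),
        )
```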

### Data Splits

| name |train |validation| test |
|-------|-----:|---------:|-----:|
|default|901028| 415416|415416|

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

Data was mined from the IJaDataset 2.0 dataset.
[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

Potential clones were identified automatically using search heuristics and then manually labeled by three judges.
[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

Most of the clones are type 1 and 2 with type 3 and especially type 4 being rare.

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

https://github.com/microsoft, https://github.com/madlag

### Licensing Information

Computational Use of Data Agreement (C-UDA) License.

### Citation Information

```
@inproceedings{svajlenko2014towards,
title={Towards a big data curated benchmark of inter-project code clones},
author={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},
booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
pages={476--480},
year={2014},
organization={IEEE}
}

@inproceedings{wang2020detecting,
title={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},
author={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},
booktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
pages={261--271},
year={2020},
organization={IEEE}
}
```

### Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
@@ -0,0 +1,95 @@
from typing import List

import datasets

from .common import TrainValidTestChild
from .generated_definitions import DEFINITIONS


_DESCRIPTION = """Given two codes as the input, the task is to do binary classification (0/1), where 1 stands for semantic equivalence and 0 for others. Models are evaluated by F1 score.
The dataset we use is BigCloneBench and filtered following the paper Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree."""

_CITATION = """@inproceedings{svajlenko2014towards,
title={Towards a big data curated benchmark of inter-project code clones},
author={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},
booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
pages={476--480},
year={2014},
organization={IEEE}
}

@inproceedings{wang2020detecting,
title={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},
author={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},
booktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
pages={261--271},
year={2020},
organization={IEEE}
}"""


class CodeXGlueCcCloneDetectionBigCloneBenchImpl(TrainValidTestChild):
    _DESCRIPTION = _DESCRIPTION
    _CITATION = _CITATION

    _FEATURES = {
        "id": datasets.Value("int32"),  # Index of the sample
        "id1": datasets.Value("int32"),  # The first function id
        "id2": datasets.Value("int32"),  # The second function id
        "func1": datasets.Value("string"),  # The full text of the first function
        "func2": datasets.Value("string"),  # The full text of the second function
        "label": datasets.Value("bool"),  # True if the functions are semantically equivalent, False otherwise
    }

    _SUPERVISED_KEYS = ["label"]

    def generate_urls(self, split_name):
        # Each split has its own index file; the function bodies are shared.
        yield "index", f"{split_name}.txt"
        yield "data", "data.jsonl"

    def _generate_examples(self, split_name, file_paths):
        import json

        # Map each function id to its source text.
        js_all = {}

        with open(file_paths["data"], encoding="utf-8") as f:
            for idx, line in enumerate(f):
                entry = json.loads(line)
                js_all[int(entry["idx"])] = entry["func"]

        # Each index line holds "id1<TAB>id2<TAB>label".
        with open(file_paths["index"], encoding="utf-8") as f:
            for idx, line in enumerate(f):
                line = line.strip()
                idx1, idx2, label = [int(i) for i in line.split("\t")]
                func1 = js_all[idx1]
                func2 = js_all[idx2]

                yield idx, dict(id=idx, id1=idx1, id2=idx2, func1=func1, func2=func2, label=(label == 1))


CLASS_MAPPING = {
    "CodeXGlueCcCloneDetectionBigCloneBench": CodeXGlueCcCloneDetectionBigCloneBenchImpl,
}


class CodeXGlueCcCloneDetectionBigCloneBench(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = datasets.BuilderConfig
    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name=name, description=info["description"]) for name, info in DEFINITIONS.items()
    ]

    def _info(self):
        name = self.config.name
        info = DEFINITIONS[name]
        if info["class_name"] in CLASS_MAPPING:
            self.child = CLASS_MAPPING[info["class_name"]](info)
        else:
            raise RuntimeError(f"Unknown python class for dataset configuration {name}")
        ret = self.child._info()
        return ret

    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
        return self.child._split_generators(dl_manager=dl_manager)

    def _generate_examples(self, split_name, file_paths):
        return self.child._generate_examples(split_name, file_paths)
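
The join performed in `_generate_examples` above (function bodies from `data.jsonl`, pairs and labels from the per-split index file) can be tried without downloading anything. The following is a minimal sketch on in-memory lines; `join_pairs` and the sample data are hypothetical and only mirror the file layout the script reads.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple


def join_pairs(
    data_lines: Iterable[str], index_lines: Iterable[str]
) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Hypothetical stand-in for the example generator, working on lines instead of files."""
    # Map each function id to its source text.
    functions = {}
    for line in data_lines:
        entry = json.loads(line)
        functions[int(entry["idx"])] = entry["func"]

    # Each index line holds "id1<TAB>id2<TAB>label".
    for row, line in enumerate(index_lines):
        idx1, idx2, label = (int(part) for part in line.strip().split("\t"))
        yield row, {
            "id": row,
            "id1": idx1,
            "id2": idx2,
            "func1": functions[idx1],
            "func2": functions[idx2],
            "label": label == 1,
        }


# Made-up sample data in the same shape as data.jsonl and train.txt.
data_lines = [
    json.dumps({"idx": "10", "func": "int a() { return 1; }"}),
    json.dumps({"idx": "20", "func": "int b() { return 1; }"}),
]
index_lines = ["10\t20\t1\n", "20\t10\t0\n"]
examples = list(join_pairs(data_lines, index_lines))
```

Because every pair references functions by id, the function text is stored once in `data.jsonl` and repeated only in the generated examples, which is why the generated dataset is much larger than the download.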
75 changes: 75 additions & 0 deletions datasets/code_x_glue_cc_clone_detection_big_clone_bench/common.py
@@ -0,0 +1,75 @@
from typing import List

import datasets


# Citation, taken from https://github.com/microsoft/CodeXGLUE
_DEFAULT_CITATION = """@article{CodeXGLUE,
title={CodeXGLUE: A Benchmark Dataset and Open Challenge for Code Intelligence},
year={2020},}"""


class Child:
    _DESCRIPTION = None
    _FEATURES = None
    _CITATION = None
    SPLITS = {"train": datasets.Split.TRAIN}
    _SUPERVISED_KEYS = None

    def __init__(self, info):
        self.info = info

    def homepage(self):
        return self.info["project_url"]

    def _info(self):
        # This is the description that will appear on the datasets page.
        return datasets.DatasetInfo(
            description=self.info["description"] + "\n\n" + self._DESCRIPTION,
            features=datasets.Features(self._FEATURES),
            homepage=self.homepage(),
            citation=self._CITATION or _DEFAULT_CITATION,
            supervised_keys=self._SUPERVISED_KEYS,
        )

    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
        SPLITS = self.SPLITS
        _URL = self.info["raw_url"]
        urls_to_download = {}
        for split in SPLITS:
            if split not in urls_to_download:
                urls_to_download[split] = {}

            for key, url in self.generate_urls(split):
                if not url.startswith("http"):
                    url = _URL + "/" + url
                urls_to_download[split][key] = url

        downloaded_files = {}
        for k, v in urls_to_download.items():
            downloaded_files[k] = dl_manager.download_and_extract(v)

        return [
            datasets.SplitGenerator(
                name=SPLITS[k],
                gen_kwargs={"split_name": k, "file_paths": downloaded_files[k]},
            )
            for k in SPLITS
        ]

    def check_empty(self, entries):
        all_empty = all([v == "" for v in entries.values()])
        all_non_empty = all([v != "" for v in entries.values()])

        if not all_non_empty and not all_empty:
            raise RuntimeError("Parallel data files should have the same number of lines.")

        return all_empty


class TrainValidTestChild(Child):
    SPLITS = {
        "train": datasets.Split.TRAIN,
        "valid": datasets.Split.VALIDATION,
        "test": datasets.Split.TEST,
    }
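
The URL handling in `_split_generators` above treats any name that does not start with `http` as relative to `raw_url`. A minimal sketch of that rule in isolation, assuming a placeholder base URL; `resolve_urls` is a hypothetical helper, not part of `common.py`.

```python
from typing import Dict, Iterable, Tuple


def resolve_urls(raw_url: str, entries: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Hypothetical helper: prefix relative file names with raw_url, keep absolute URLs."""
    resolved = {}
    for key, url in entries:
        if not url.startswith("http"):
            url = raw_url + "/" + url
        resolved[key] = url
    return resolved


# Placeholder base URL, used for illustration only.
RAW_URL = "https://example.com/dataset"
urls = resolve_urls(
    RAW_URL,
    [("index", "train.txt"), ("data", "https://example.org/data.jsonl")],
)
```

A child class therefore only has to yield short file names from `generate_urls`, while still being able to point at a file hosted elsewhere by yielding a full URL.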
@@ -0,0 +1 @@
{"default": {"description": "CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench\n\nGiven two codes as the input, the task is to do binary classification (0/1), where 1 stands for semantic equivalence and 0 for others. Models are evaluated by F1 score.\nThe dataset we use is BigCloneBench and filtered following the paper Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree.", "citation": "@inproceedings{svajlenko2014towards,\ntitle={Towards a big data curated benchmark of inter-project code clones},\nauthor={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},\nbooktitle={2014 IEEE International Conference on Software Maintenance and Evolution},\npages={476--480},\nyear={2014},\norganization={IEEE}\n}\n\n@inproceedings{wang2020detecting,\ntitle={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},\nauthor={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},\nbooktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},\npages={261--271},\nyear={2020},\norganization={IEEE}\n}", "homepage": "https://github.com/madlag/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench", "license": "", "features": {"id": {"dtype": "int32", "id": null, "_type": "Value"}, "id1": {"dtype": "int32", "id": null, "_type": "Value"}, "id2": {"dtype": "int32", "id": null, "_type": "Value"}, "func1": {"dtype": "string", "id": null, "_type": "Value"}, "func2": {"dtype": "string", "id": null, "_type": "Value"}, "label": {"dtype": "bool", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": {"input": "label", "output": ""}, "task_templates": null, "builder_name": "code_x_glue_cc_clone_detection_big_clone_bench", "config_name": "default", "version": {"version_str": "0.0.0", "description": 
null, "major": 0, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 2888035757, "num_examples": 901028, "dataset_name": "code_x_glue_cc_clone_detection_big_clone_bench"}, "validation": {"name": "validation", "num_bytes": 1371399694, "num_examples": 415416, "dataset_name": "code_x_glue_cc_clone_detection_big_clone_bench"}, "test": {"name": "test", "num_bytes": 1220662901, "num_examples": 415416, "dataset_name": "code_x_glue_cc_clone_detection_big_clone_bench"}}, "download_checksums": {"https://raw.githubusercontent.com/madlag/CodeXGLUE/main/Code-Code/Clone-detection-BigCloneBench/dataset/train.txt": {"num_bytes": 17043552, "checksum": "29119bfa94673374249c3424809fbe6baaa1f0e87a13e3c727bbd6cdf1224b77"}, "https://raw.githubusercontent.com/madlag/CodeXGLUE/main/Code-Code/Clone-detection-BigCloneBench/dataset/data.jsonl": {"num_bytes": 15174797, "checksum": "d8bc51e62deddcc45bd26c5b57f5add2a2cf377f13b9f6c2fb656fbc8fca4dd2"}, "https://raw.githubusercontent.com/madlag/CodeXGLUE/main/Code-Code/Clone-detection-BigCloneBench/dataset/valid.txt": {"num_bytes": 7861019, "checksum": "e59e8c1321df59b6ab0143165cb603030c55800c00e2d782e06810517b8de1e4"}, "https://raw.githubusercontent.com/madlag/CodeXGLUE/main/Code-Code/Clone-detection-BigCloneBench/dataset/test.txt": {"num_bytes": 7876506, "checksum": "a6c0cf79be34e582fdc64007aa894ed094e4f9ff2e5395a8d2b5c39eeef2737a"}}, "download_size": 47955874, "post_processing_size": null, "dataset_size": 5480098352, "size_in_bytes": 5528054226}}
Binary file not shown.
@@ -0,0 +1,12 @@
DEFINITIONS = {
    "default": {
        "class_name": "CodeXGlueCcCloneDetectionBigCloneBench",
        "dataset_type": "Code-Code",
        "description": "CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench",
        "dir_name": "Clone-detection-BigCloneBench",
        "name": "default",
        "project_url": "https://github.com/madlag/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench",
        "raw_url": "https://raw.githubusercontent.com/madlag/CodeXGLUE/main/Code-Code/Clone-detection-BigCloneBench/dataset",
        "sizes": {"test": 415416, "train": 901028, "validation": 415416},
    }
}