huggingface · lhoestq · Jun 8, 2021 · Dec 9, 2020 · Dec 9, 2020 · Dec 14, 2020
diff --git a/datasets/code_x_glue_cc_clone_detection_big_clone_bench/README.md b/datasets/code_x_glue_cc_clone_detection_big_clone_bench/README.md
@@ -0,0 +1,185 @@
+---
+annotations_creators:
+- found
+language_creators:
+- found
+languages:
+- code
+licenses:
+- other-C-UDA
+multilinguality:
+- monolingual
+size_categories:
+- 1M<n<10M
+source_datasets:
+- original
+task_categories:
+- text-classification
+task_ids:
+- semantic-similarity-classification
+---
+# Dataset Card for "code_x_glue_cc_clone_detection_big_clone_bench"
+
+## Table of Contents
+- [Dataset Description](#dataset-description)
+  - [Dataset Summary](#dataset-summary)
+  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
+  - [Languages](#languages)
+- [Dataset Structure](#dataset-structure)
+  - [Data Instances](#data-instances)
+  - [Data Fields](#data-fields)
+  - [Data Splits](#data-splits)
+- [Dataset Creation](#dataset-creation)
+  - [Curation Rationale](#curation-rationale)
+  - [Source Data](#source-data)
+  - [Annotations](#annotations)
+  - [Personal and Sensitive Information](#personal-and-sensitive-information)
+- [Considerations for Using the Data](#considerations-for-using-the-data)
+  - [Social Impact of Dataset](#social-impact-of-dataset)
+  - [Discussion of Biases](#discussion-of-biases)
+  - [Other Known Limitations](#other-known-limitations)
+- [Additional Information](#additional-information)
+  - [Dataset Curators](#dataset-curators)
+  - [Licensing Information](#licensing-information)
+  - [Citation Information](#citation-information)
+  - [Contributions](#contributions)
+
+## Dataset Description
+
+- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench
+
+### Dataset Summary
+
+CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench
+
+Given two code snippets as input, the task is binary classification (0/1), where 1 stands for semantic equivalence and 0 for anything else. Models are evaluated by F1 score.
+The dataset is BigCloneBench, filtered following the paper Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree.
+
+### Supported Tasks and Leaderboards
+
+- `semantic-similarity-classification`: The dataset can be used to train a model for classifying whether two given Java methods are clones of each other.
+
+### Languages
+
+- Java **programming** language
+
+## Dataset Structure
+
+### Data Instances
+
+An example of 'test' looks as follows.
+```
+{
+    "func1": "    @Test(expected = GadgetException.class)\n    public void malformedGadgetSpecIsCachedAndThrows() throws Exception {\n        HttpRequest request = createCacheableRequest();\n        expect(pipeline.execute(request)).andReturn(new HttpResponse(\"malformed junk\")).once();\n        replay(pipeline);\n        try {\n            specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n            fail(\"No exception thrown on bad parse\");\n        } catch (GadgetException e) {\n        }\n        specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n    }\n", 
+    "func2": "    public InputStream getInputStream() throws TGBrowserException {\n        try {\n            if (!this.isFolder()) {\n                URL url = new URL(this.url);\n                InputStream stream = url.openStream();\n                return stream;\n            }\n        } catch (Throwable throwable) {\n            throw new TGBrowserException(throwable);\n        }\n        return null;\n    }\n", 
+    "id": 0, 
+    "id1": 2381663, 
+    "id2": 4458076, 
+    "label": false
+}
+```
+
+### Data Fields
+
+Each data field is explained below for each config. The data fields are the same among all splits.
+
+#### default
+
+|field name| type |                    description                    |
+|----------|------|---------------------------------------------------|
+|id        |int32 | Index of the sample                               |
+|id1       |int32 | The first function id                             |
+|id2       |int32 | The second function id                            |
+|func1     |string| The full text of the first function               |
+|func2     |string| The full text of the second function              |
+|label     |bool  | True if the functions are equivalent, else False  |
+
+### Data Splits
+
+| name  |train |validation| test |
+|-------|-----:|---------:|-----:|
+|default|901028|    415416|415416|
+
+## Dataset Creation
+
+### Curation Rationale
+
+[More Information Needed]
+
+### Source Data
+
+#### Initial Data Collection and Normalization
+
+Data was mined from the IJaDataset 2.0 dataset.
+[More Information Needed]
+
+#### Who are the source language producers?
+
+[More Information Needed]
+
+### Annotations
+
+#### Annotation process
+
+Potential clones were identified automatically using search heuristics and then manually labeled by three judges.
+[More Information Needed]
+
+#### Who are the annotators?
+
+[More Information Needed]
+
+### Personal and Sensitive Information
+
+[More Information Needed]
+
+## Considerations for Using the Data
+
+### Social Impact of Dataset
+
+[More Information Needed]
+
+### Discussion of Biases
+
+Most of the clones are type 1 and 2 with type 3 and especially type 4 being rare.
+
+[More Information Needed]
+
+### Other Known Limitations
+
+[More Information Needed]
+
+## Additional Information
+
+### Dataset Curators
+
+https://github.com/microsoft, https://github.com/madlag
+
+### Licensing Information
+
+Computational Use of Data Agreement (C-UDA) License.
+
+### Citation Information
+
+```
+@inproceedings{svajlenko2014towards,
+  title={Towards a big data curated benchmark of inter-project code clones},
+  author={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},
+  booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
+  pages={476--480},
+  year={2014},
+  organization={IEEE}
+}
+
+@inproceedings{wang2020detecting,
+  title={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},
+  author={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},
+  booktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
+  pages={261--271},
+  year={2020},
+  organization={IEEE}
+}
+```
+
+### Contributions
+
+Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
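
The dataset card says models are evaluated by F1 score on a boolean `label`. As a rough illustration only, the sketch below computes F1 for the positive (clone) class on hand-made labels. The function name `binary_f1` and the sample lists are hypothetical, and the official CodeXGLUE evaluator may average the score differently.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Return the F1 score of the positive (True, i.e. clone) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # clones predicted as clones
    fp = sum(1 for t, p in pairs if not t and p)    # non-clones predicted as clones
    fn = sum(1 for t, p in pairs if t and not p)    # clones that were missed
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels and predictions, not taken from the dataset.
labels = [True, False, True, True, False]
predictions = [True, True, False, True, False]
score = binary_f1(labels, predictions)
```

With the dataset loaded, `labels` would be the `label` column of a split and `predictions` the boolean outputs of a model.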
diff --git a/...glue_cc_clone_detection_big_clone_bench/code_x_glue_cc_clone_detection_big_clone_bench.py b/...glue_cc_clone_detection_big_clone_bench/code_x_glue_cc_clone_detection_big_clone_bench.py
@@ -0,0 +1,95 @@
+from typing import List
+
+import datasets
+
+from .common import TrainValidTestChild
+from .generated_definitions import DEFINITIONS
+
+
+_DESCRIPTION = """Given two codes as the input, the task is to do binary classification (0/1), where 1 stands for semantic equivalence and 0 for others. Models are evaluated by F1 score.
+The dataset we use is BigCloneBench and filtered following the paper Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree."""
+
+_CITATION = """@inproceedings{svajlenko2014towards,
+title={Towards a big data curated benchmark of inter-project code clones},
+author={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},
+booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
+pages={476--480},
+year={2014},
+organization={IEEE}
+}
+
+@inproceedings{wang2020detecting,
+title={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},
+author={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},
+booktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
+pages={261--271},
+year={2020},
+organization={IEEE}
+}"""
+
+
+class CodeXGlueCcCloneDetectionBigCloneBenchImpl(TrainValidTestChild):
+    _DESCRIPTION = _DESCRIPTION
+    _CITATION = _CITATION
+
+    _FEATURES = {
+        "id": datasets.Value("int32"),  # Index of the sample
+        "id1": datasets.Value("int32"),  # The first function id
+        "id2": datasets.Value("int32"),  # The second function id
+        "func1": datasets.Value("string"),  # The full text of the first function
+        "func2": datasets.Value("string"),  # The full text of the second function
+        "label": datasets.Value("bool"),  # True if the functions are equivalent, False otherwise
+    }
+
+    _SUPERVISED_KEYS = ["label"]
+
+    def generate_urls(self, split_name):
+        yield "index", f"{split_name}.txt"
+        yield "data", "data.jsonl"
+
+    def _generate_examples(self, split_name, file_paths):
+        import json
+
+        js_all = {}
+
+        with open(file_paths["data"], encoding="utf-8") as f:
+            for line in f:
+                entry = json.loads(line)
+                js_all[int(entry["idx"])] = entry["func"]
+
+        with open(file_paths["index"], encoding="utf-8") as f:
+            for idx, line in enumerate(f):
+                line = line.strip()
+                idx1, idx2, label = [int(i) for i in line.split("\t")]
+                func1 = js_all[idx1]
+                func2 = js_all[idx2]
+
+                yield idx, dict(id=idx, id1=idx1, id2=idx2, func1=func1, func2=func2, label=(label == 1))
+
+
+CLASS_MAPPING = {
+    "CodeXGlueCcCloneDetectionBigCloneBench": CodeXGlueCcCloneDetectionBigCloneBenchImpl,
+}
+
+
+class CodeXGlueCcCloneDetectionBigCloneBench(datasets.GeneratorBasedBuilder):
+    BUILDER_CONFIG_CLASS = datasets.BuilderConfig
+    BUILDER_CONFIGS = [
+        datasets.BuilderConfig(name=name, description=info["description"]) for name, info in DEFINITIONS.items()
+    ]
+
+    def _info(self):
+        name = self.config.name
+        info = DEFINITIONS[name]
+        if info["class_name"] in CLASS_MAPPING:
+            self.child = CLASS_MAPPING[info["class_name"]](info)
+        else:
+            raise RuntimeError(f"Unknown python class for dataset configuration {name}")
+        ret = self.child._info()
+        return ret
+
+    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
+        return self.child._split_generators(dl_manager=dl_manager)
+
+    def _generate_examples(self, split_name, file_paths):
+        return self.child._generate_examples(split_name, file_paths)
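
The join performed by `_generate_examples` above can be tried without downloading anything. The sketch below reproduces the same steps on in-memory files: it builds an id-to-function lookup from JSON lines, then resolves each tab-separated `id1 id2 label` row against it. The helper name `pair_examples` and the sample contents are hypothetical; only the file layout is taken from the script.

```python
import io
import json


def pair_examples(data_file, index_file):
    """Yield (row_number, example) pairs the way the loading script does."""
    # Map each function id to its source text.
    functions = {}
    for line in data_file:
        entry = json.loads(line)
        functions[int(entry["idx"])] = entry["func"]

    # Each index row holds two function ids and a 0/1 label, tab-separated.
    for row, line in enumerate(index_file):
        id1, id2, label = (int(field) for field in line.strip().split("\t"))
        yield row, {
            "id": row,
            "id1": id1,
            "id2": id2,
            "func1": functions[id1],
            "func2": functions[id2],
            "label": label == 1,  # True means the pair is a clone
        }


# Hypothetical stand-ins for data.jsonl and one of the split index files.
data_jsonl = io.StringIO(
    '{"idx": "10", "func": "int a() { return 1; }"}\n'
    '{"idx": "20", "func": "int b() { return 1; }"}\n'
)
index_txt = io.StringIO("10\t20\t1\n20\t10\t0\n")

examples = list(pair_examples(data_jsonl, index_txt))
```

Because every split shares one `data.jsonl`, the lookup table is rebuilt for each split; only the index file differs.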
diff --git a/datasets/code_x_glue_cc_clone_detection_big_clone_bench/common.py b/datasets/code_x_glue_cc_clone_detection_big_clone_bench/common.py
@@ -0,0 +1,75 @@
+from typing import List
+
+import datasets
+
+
+# Citation, taken from https://github.com/microsoft/CodeXGLUE
+_DEFAULT_CITATION = """@article{CodeXGLUE,
+         title={CodeXGLUE: A Benchmark Dataset and Open Challenge for Code Intelligence},
+         year={2020},}"""
+
+
+class Child:
+    _DESCRIPTION = None
+    _FEATURES = None
+    _CITATION = None
+    SPLITS = {"train": datasets.Split.TRAIN}
+    _SUPERVISED_KEYS = None
+
+    def __init__(self, info):
+        self.info = info
+
+    def homepage(self):
+        return self.info["project_url"]
+
+    def _info(self):
+        # This is the description that will appear on the datasets page.
+        return datasets.DatasetInfo(
+            description=self.info["description"] + "\n\n" + self._DESCRIPTION,
+            features=datasets.Features(self._FEATURES),
+            homepage=self.homepage(),
+            citation=self._CITATION or _DEFAULT_CITATION,
+            supervised_keys=self._SUPERVISED_KEYS,
+        )
+
+    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
+        SPLITS = self.SPLITS
+        _URL = self.info["raw_url"]
+        urls_to_download = {}
+        for split in SPLITS:
+            if split not in urls_to_download:
+                urls_to_download[split] = {}
+
+            for key, url in self.generate_urls(split):
+                if not url.startswith("http"):
+                    url = _URL + "/" + url
+                urls_to_download[split][key] = url
+
+        downloaded_files = {}
+        for k, v in urls_to_download.items():
+            downloaded_files[k] = dl_manager.download_and_extract(v)
+
+        return [
+            datasets.SplitGenerator(
+                name=SPLITS[k],
+                gen_kwargs={"split_name": k, "file_paths": downloaded_files[k]},
+            )
+            for k in SPLITS
+        ]
+
+    def check_empty(self, entries):
+        all_empty = all([v == "" for v in entries.values()])
+        all_non_empty = all([v != "" for v in entries.values()])
+
+        if not all_non_empty and not all_empty:
+            raise RuntimeError("Parallel data files should have the same number of lines.")
+
+        return all_empty
+
+
+class TrainValidTestChild(Child):
+    SPLITS = {
+        "train": datasets.Split.TRAIN,
+        "valid": datasets.Split.VALIDATION,
+        "test": datasets.Split.TEST,
+    }
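
To make the URL handling in `_split_generators` easier to follow, here is a minimal sketch of the same resolution step without the `datasets` dependency. The function `resolve_urls` and the placeholder base URL are assumptions for illustration; `generate_urls` mirrors the method defined in the main script.

```python
def generate_urls(split_name):
    """Mirror of the child's generate_urls: relative file names per split."""
    yield "index", f"{split_name}.txt"
    yield "data", "data.jsonl"


def resolve_urls(raw_url, splits, url_generator):
    """Build {split: {key: absolute_url}}, prefixing relative names with raw_url."""
    urls_to_download = {}
    for split in splits:
        urls_to_download[split] = {}
        for key, url in url_generator(split):
            if not url.startswith("http"):
                url = raw_url + "/" + url
            urls_to_download[split][key] = url
    return urls_to_download


# Placeholder base URL, not the real raw_url from generated_definitions.py.
RAW_URL = "https://example.invalid/dataset"
urls = resolve_urls(RAW_URL, ["train", "valid", "test"], generate_urls)
```

Note that the split key is `valid` while the resulting split is named `validation`, which is why `TrainValidTestChild.SPLITS` maps one to the other.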
diff --git a/datasets/code_x_glue_cc_clone_detection_big_clone_bench/dataset_infos.json b/datasets/code_x_glue_cc_clone_detection_big_clone_bench/dataset_infos.json
@@ -0,0 +1 @@
+{"default": {"description": "CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench\n\nGiven two codes as the input, the task is to do binary classification (0/1), where 1 stands for semantic equivalence and 0 for others. Models are evaluated by F1 score.\nThe dataset we use is BigCloneBench and filtered following the paper Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree.", "citation": "@inproceedings{svajlenko2014towards,\ntitle={Towards a big data curated benchmark of inter-project code clones},\nauthor={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},\nbooktitle={2014 IEEE International Conference on Software Maintenance and Evolution},\npages={476--480},\nyear={2014},\norganization={IEEE}\n}\n\n@inproceedings{wang2020detecting,\ntitle={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},\nauthor={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},\nbooktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},\npages={261--271},\nyear={2020},\norganization={IEEE}\n}", "homepage": "https://github.com/madlag/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench", "license": "", "features": {"id": {"dtype": "int32", "id": null, "_type": "Value"}, "id1": {"dtype": "int32", "id": null, "_type": "Value"}, "id2": {"dtype": "int32", "id": null, "_type": "Value"}, "func1": {"dtype": "string", "id": null, "_type": "Value"}, "func2": {"dtype": "string", "id": null, "_type": "Value"}, "label": {"dtype": "bool", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": {"input": "label", "output": ""}, "task_templates": null, "builder_name": "code_x_glue_cc_clone_detection_big_clone_bench", "config_name": "default", "version": {"version_str": "0.0.0", "description": null, "major": 0, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 2888035757, "num_examples": 901028, "dataset_name": "code_x_glue_cc_clone_detection_big_clone_bench"}, "validation": {"name": "validation", "num_bytes": 1371399694, "num_examples": 415416, "dataset_name": "code_x_glue_cc_clone_detection_big_clone_bench"}, "test": {"name": "test", "num_bytes": 1220662901, "num_examples": 415416, "dataset_name": "code_x_glue_cc_clone_detection_big_clone_bench"}}, "download_checksums": {"https://raw.githubusercontent.com/madlag/CodeXGLUE/main/Code-Code/Clone-detection-BigCloneBench/dataset/train.txt": {"num_bytes": 17043552, "checksum": "29119bfa94673374249c3424809fbe6baaa1f0e87a13e3c727bbd6cdf1224b77"}, "https://raw.githubusercontent.com/madlag/CodeXGLUE/main/Code-Code/Clone-detection-BigCloneBench/dataset/data.jsonl": {"num_bytes": 15174797, "checksum": "d8bc51e62deddcc45bd26c5b57f5add2a2cf377f13b9f6c2fb656fbc8fca4dd2"}, "https://raw.githubusercontent.com/madlag/CodeXGLUE/main/Code-Code/Clone-detection-BigCloneBench/dataset/valid.txt": {"num_bytes": 7861019, "checksum": "e59e8c1321df59b6ab0143165cb603030c55800c00e2d782e06810517b8de1e4"}, "https://raw.githubusercontent.com/madlag/CodeXGLUE/main/Code-Code/Clone-detection-BigCloneBench/dataset/test.txt": {"num_bytes": 7876506, "checksum": "a6c0cf79be34e582fdc64007aa894ed094e4f9ff2e5395a8d2b5c39eeef2737a"}}, "download_size": 47955874, "post_processing_size": null, "dataset_size": 5480098352, "size_in_bytes": 5528054226}}
diff --git a/datasets/code_x_glue_cc_clone_detection_big_clone_bench/dummy/default/0.0.0/dummy_data.zip b/datasets/code_x_glue_cc_clone_detection_big_clone_bench/dummy/default/0.0.0/dummy_data.zip
diff --git a/datasets/code_x_glue_cc_clone_detection_big_clone_bench/generated_definitions.py b/datasets/code_x_glue_cc_clone_detection_big_clone_bench/generated_definitions.py
@@ -0,0 +1,12 @@
+DEFINITIONS = {
+    "default": {
+        "class_name": "CodeXGlueCcCloneDetectionBigCloneBench",
+        "dataset_type": "Code-Code",
+        "description": "CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench",
+        "dir_name": "Clone-detection-BigCloneBench",
+        "name": "default",
+        "project_url": "https://github.com/madlag/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench",
+        "raw_url": "https://raw.githubusercontent.com/madlag/CodeXGLUE/main/Code-Code/Clone-detection-BigCloneBench/dataset",
+        "sizes": {"test": 415416, "train": 901028, "validation": 415416},
+    }
+}
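
The builder's `_info` picks its implementation class by looking up the config's `class_name` in `CLASS_MAPPING`. A minimal sketch of that dispatch, using a stand-in class and a trimmed-down definitions dict (both hypothetical, not the real generated values), might look like this:

```python
class ExampleImpl:
    """Stand-in for an implementation class such as the *Impl child above."""

    def __init__(self, info):
        self.info = info


# Trimmed, hypothetical versions of the two lookup tables.
CLASS_MAPPING = {"Example": ExampleImpl}
DEFINITIONS = {
    "default": {"class_name": "Example", "name": "default"},
    "broken": {"class_name": "Missing", "name": "broken"},
}


def build_child(config_name):
    """Instantiate the implementation class registered for a config name."""
    info = DEFINITIONS[config_name]
    class_name = info["class_name"]
    if class_name not in CLASS_MAPPING:
        raise RuntimeError(f"Unknown python class for dataset configuration {config_name}")
    return CLASS_MAPPING[class_name](info)


child = build_child("default")
```

Keeping the definitions in a generated module means a new config only needs an entry there plus a class registered in the mapping.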