Conversation

@madlag
Contributor

@madlag madlag commented Dec 2, 2020

Datasets from https://github.com/microsoft/CodeXGLUE

This contains 13 datasets:

code_x_glue_cc_clone_detection_big_clone_bench
code_x_glue_cc_clone_detection_poj_104
code_x_glue_cc_cloze_testing_all
code_x_glue_cc_cloze_testing_maxmin
code_x_glue_cc_code_completion_line
code_x_glue_cc_code_completion_token
code_x_glue_cc_code_refinement
code_x_glue_cc_code_to_code_trans
code_x_glue_cc_defect_detection
code_x_glue_ct_code_to_text
code_x_glue_tc_nl_code_search_adv
code_x_glue_tc_text_to_code
code_x_glue_tt_text_to_text
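
For reference, once merged each of these should be loadable through the standard datasets API; a minimal sketch (illustrative only, some of the datasets also need a config name):

from datasets import load_dataset

# Some datasets expose several configurations (e.g. per programming language);
# pass the config name as the second argument in that case.
ds = load_dataset("code_x_glue_cc_clone_detection_big_clone_bench")
print(ds)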

@lhoestq
Member

lhoestq commented Dec 2, 2020

#978 is working on adding code refinement

Maybe we should keep the CodeXGlue benchmark (as with glue) and not merge the code_refinement dataset proposed in #978?

cc @reshinthadithyan

@madlag madlag force-pushed the microsoft-codexglue-code-to-code-trans branch 2 times, most recently from 366774d to 02800cf on December 9, 2020 at 15:40
@madlag madlag requested a review from lhoestq December 10, 2020 10:14

Member

@lhoestq lhoestq left a comment

I started to take a look at this very impressive PR !

I only added comments about the first dataset since the comments may apply to the others as well.

My main concern here is the dataset names. They are too long and complicated. Could you simplify that ?

For readability, can you add comments and docstrings?

And please run make style to format the code.

- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

Member

Can you re-add all the sections that were removed ?

You can leave their content with [More Information Needed]

Contributor Author

@madlag madlag Dec 14, 2020

For the names, I took the original names, but I understand.
For this one, would 'cxgcc_clone_detect_big' be OK? (cxg for CodeXGlue, cc for code-to-code, and a shortened version of the dataset name), and the same for the others.
As for "make style", it was part of my post-processing script, but there may be some subtlety somewhere.

Contributor Author

(and I will re-add the missing sections)

Contributor Author

(I just ran make style and there are no changes to the code; can you confirm that there is an issue with the code style?)

Contributor Author

Here is a proposed list of names:

cxgcc_defect_detect
cxgcc_clone_detect_big
cxgcc_refine
cxgcc_complete_token
cxgcc_code_to_code
cxgcc_code_complete_line
cxgcc_clone_detect_poj
cxgcc_cloze_test_maxmin
cxgcc_cloze_test_all
cxgct_code_to_text
cxgtt_text_to_text
cxgtc_search_web_query
cxgtc_text_to_code
cxgtc_search_adv

Member

Looks good!
You can even make the names clearer with

code_xglue_<tt|cc|ct|tc>_<subdataset_name>

(just change the beginning from cxg to code_xglue_)

Member

The remaining code style issues are these:

src/datasets/commands/dummy_data.py:132:25: E722 do not use bare 'except'
src/datasets/commands/dummy_data.py:141:21: F841 local variable 'e' is assigned to but never used

So all the code in your datasets is fine; the only errors come from dummy_data.py.
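
For reference, an illustrative sketch of how E722 and F841 are usually resolved (not the actual dummy_data.py code):

def read_optional_file(path):
    """Illustrative only; not the real dummy_data.py logic."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except OSError as e:  # E722: catch a specific exception instead of a bare `except:`
        # F841: use the bound exception (e.g. log it), or drop the `as e` binding
        print(f"Skipping {path}: {e}")
        return None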

Comment on lines 10 to 35
_DESCRIPTION = """Given two codes as the input, the task is to do binary classification (0/1), where 1 stands for semantic equivalence and 0 for others. Models are evaluated by F1 score.
The dataset we use is BigCloneBench and filtered following the paper Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree."""

_CITATION = """@inproceedings{svajlenko2014towards,
title={Towards a big data curated benchmark of inter-project code clones},
author={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},
booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
pages={476--480},
year={2014},
organization={IEEE}
}
@inproceedings{wang2020detecting,
title={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},
author={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},
booktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
pages={261--271},
year={2020},
organization={IEEE}
}"""

Member

Can you have the citation and description as global variables ?

Contributor Author

It will be a bit tricky, as everything is generated and the base class handles this. Why is it better to have a global variable?

Member

The template for dataset scripts we're using has these variables as global variables. It's the same for all the datasets. Moreover, we're parsing the Python scripts in moon landing to extract the description and citation and show them on the huggingface.co dataset pages, and the parsing currently expects them to be global variables.

So it's possible to support non-global variables, but it may require changes on the moon landing side. We probably want to keep the parsing as simple as possible, though it's not a major issue.

What you can do is have them as both global variables and class attributes:

_CITATION = """\
insert citation here
"""

class MyDataset(datasets.GeneratorBasedBuilder):
    _CITATION = _CITATION

This way the main class has access to the citation, and the citation is also defined as a global variable.

What do you think ?
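
For illustration, here is a slightly fuller sketch of that pattern (MyDataset and the single text feature are placeholders, not the actual script):

import datasets

_DESCRIPTION = """Short description of the dataset."""

_CITATION = """\
insert citation here
"""


class MyDataset(datasets.GeneratorBasedBuilder):
    # Class-level reference, so generated base classes can still reach it
    _CITATION = _CITATION

    def _info(self):
        # _split_generators / _generate_examples omitted for brevity
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            features=datasets.Features({"text": datasets.Value("string")}),
        )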

}


class CodeXGlueCCCloneDetectionBigCloneBenchMain(datasets.GeneratorBasedBuilder):

Member

The name of the class should be the camel-case version of the dataset script name, so

CodeXGlueCcCloneDetectionBigCloneBench

Also note that after doing changes in dataset class names or dataset script names, you have to regenerate the dataset_infos.json file.
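
For example, with the script name code_x_glue_cc_clone_detection_big_clone_bench from the list above, the convention gives a skeleton like this (the builder body is of course the one in this PR, not shown here):

# datasets/code_x_glue_cc_clone_detection_big_clone_bench/code_x_glue_cc_clone_detection_big_clone_bench.py
import datasets


class CodeXGlueCcCloneDetectionBigCloneBench(datasets.GeneratorBasedBuilder):
    """The class name is the camel-case version of the script name."""
    # _info / _split_generators / _generate_examples omitted for brevity

And dataset_infos.json is then regenerated the usual way, with something like datasets-cli test datasets/<dataset_folder> --save_infos --all_configs.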

Contributor Author

Yes, I have a single script that runs the whole thing, so no problem.
Maybe we should enforce this constraint in the dataset loading code, so that it fails if the class does not have the right name?

Member

We should indeed have something to check this constraint. It doesn't raise an error, but it's better to have matching names for debugging and to get consistent cache file names.
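
A rough sketch of what such a check could look like (a hypothetical helper, not existing library code):

import re


def camelcase_to_snakecase(name: str) -> str:
    # e.g. "CodeXGlueCcCloneDetectionBigCloneBench" -> "code_x_glue_cc_clone_detection_big_clone_bench"
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def check_builder_name(builder_cls, script_name: str) -> None:
    # Warn or raise when the builder class name does not match the script name
    expected = camelcase_to_snakecase(builder_cls.__name__)
    if expected != script_name:
        raise ValueError(
            f"Builder class {builder_cls.__name__!r} should be the camel-case "
            f"version of the script name {script_name!r}"
        )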

Comment on lines 89 to 90
def _generate_examples(self, split_name, file_pathes):
    return self.child._generate_examples(split_name, file_pathes)

Member

Suggested change
- def _generate_examples(self, split_name, file_pathes):
-     return self.child._generate_examples(split_name, file_pathes)
+ def _generate_examples(self, split_name, file_paths):
+     return self.child._generate_examples(split_name, file_paths)

typo

Member

@lhoestq lhoestq left a comment

Also, it would be nice to separate the addition of the datasets and of all the tools you developed to make things easier into different PRs, so that we can review them separately!

The tool to make the dataset card looks very promising. Looking forward to being able to generate everything with one click ^^

@@ -0,0 +1,464 @@
#!/usr/bin/env python3

Member

Can you move this script to a separate PR, if you don't mind?

Contributor Author

Sorry about that; I actually moved the code to an external repository, and I should have removed this. We will work with @yjernite to merge some features into the datasets-tagging app.

@@ -0,0 +1,59 @@
# Dataset Card for "{{dataset_name}}"

Member

Same for this one.

# Line by line text file (txt, csv etc.)
if is_line_by_line_text_file:
    Path(dst_path).parent.mkdir(exist_ok=True, parents=True)
    with open(src_path, "r", encoding=encoding) as src_file:

Member

Can you remove these changes ?

If you feel like they can be useful, please open another PR

Contributor Author

Yes, sorry, I did not intend to add this in the same PR; I was waiting to add it in another one, and it still lacks some checks on the exception. I must have done a "commit -am" instead of a "commit -m" somewhere. Adding all the dummy_data files took a very long time, and I must have let my guard down at some point...

Member

@lhoestq lhoestq left a comment

Hi :)
Looks like we can merge this one pretty soon !
Can you just remove the changes in dummy_data.py and maybe remove README.template.md as well as generate_dataset_card.py ?
Could you also update this branch by merging master into it?

@@ -0,0 +1,171 @@
--

Member

Suggested change
- --
+ ---

language_creators:
- found
languages:
- C

Member

Suggested change
- - C
+ - code

language_creators:
- found
languages:
- C++

Member

Suggested change
- - C++
+ - code

Comment on lines 7 to 8
- java
- C#

Member

Suggested change
- - java
- - C#
+ - code

@ncoop57
Copy link

ncoop57 commented May 12, 2021

Hi @madlag and @lhoestq, I am extremely interested in getting this dataset into HF's library, as I do a lot of research in this area. I see that it hasn't been updated in a while, but it is very close to being finished. If no one is currently working on this, I'd be happy to do any final touches needed to get this merged.

@lhoestq
Member

lhoestq commented May 12, 2021

Hi @ncoop57! Thanks for your interest, and sorry for the inactivity on this PR.
Sure, feel free to create another PR to continue this one! It was really close to being merged, so I think it won't require many changes. In addition to my previous comments, there should also be a "Contributions" subsection (see the template of the README here).

@madlag madlag force-pushed the microsoft-codexglue-code-to-code-trans branch from 56792f9 to 91d92fc on June 3, 2021 at 09:29
@madlag
Contributor Author

madlag commented Jun 8, 2021

Superseded by #2357 .

@madlag madlag closed this Jun 8, 2021