Conversation

@madlag
Contributor

@madlag madlag commented Dec 2, 2020

Datasets from https://github.com/microsoft/CodeXGLUE

This contains 13 datasets:

code_x_glue_cc_clone_detection_big_clone_bench
code_x_glue_cc_clone_detection_poj_104
code_x_glue_cc_cloze_testing_all
code_x_glue_cc_cloze_testing_maxmin
code_x_glue_cc_code_completion_line
code_x_glue_cc_code_completion_token
code_x_glue_cc_code_refinement
code_x_glue_cc_code_to_code_trans
code_x_glue_cc_defect_detection
code_x_glue_ct_code_to_text
code_x_glue_tc_nl_code_search_adv
code_x_glue_tc_text_to_code
code_x_glue_tt_text_to_text
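
For reference, once merged each of these should be loadable through the standard datasets API; a minimal sketch (illustrative only, some of the datasets also need a config name):

from datasets import load_dataset

# Some datasets expose several configurations (e.g. per programming language);
# pass the config name as the second argument in that case.
ds = load_dataset("code_x_glue_cc_clone_detection_big_clone_bench")
print(ds)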

@lhoestq
Member

lhoestq commented Dec 2, 2020

#978 is working on adding code refinement

Maybe we should keep the CodeXGlue benchmark (as with glue) and not merge the code_refinement dataset proposed in #978?

cc @reshinthadithyan

@madlag madlag force-pushed the microsoft-codexglue-code-to-code-trans branch 2 times, most recently from 366774d to 02800cf on December 9, 2020 at 15:40
@madlag madlag requested a review from lhoestq December 10, 2020 10:14

Member

@lhoestq lhoestq left a comment

I started to take a look at this very impressive PR !

I only added comments about the first dataset since the comments may apply to the others as well.

My main concern here is the dataset names. They are too long and complicated. Could you simplify that ?

For readability, can you add comments and docstrings?

And please run make style to format the code.

- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

Member

Can you re-add all the sections that were removed ?

You can leave their content with [More Information Needed]

Contributor Author

@madlag madlag Dec 14, 2020

For the names, I took the original names, but I understand.
For this one, would 'cxgcc_clone_detect_big' be OK? (cxg for CodeXGlue, cc for code-to-code, and a shortened version of the dataset name), and the same for the others.
As for "make style", it was part of my post-processing script, but there may be some subtlety somewhere.

Contributor Author

(and I will re-add the missing sections)

Contributor Author

(I just ran make style and there are no changes to the code; can you confirm that there is an issue with the code style?)

Contributor Author

Here is a proposed list of names:

cxgcc_defect_detect
cxgcc_clone_detect_big
cxgcc_refine
cxgcc_complete_token
cxgcc_code_to_code
cxgcc_code_complete_line
cxgcc_clone_detect_poj
cxgcc_cloze_test_maxmin
cxgcc_cloze_test_all
cxgct_code_to_text
cxgtt_text_to_text
cxgtc_search_web_query
cxgtc_text_to_code
cxgtc_search_adv

Member

Looks good!
You can even make the names clearer with

code_xglue_<tt|cc|ct|tc>_<subdataset_name>

(just change the beginning from cxg to code_xglue_)

Member

The remaining code style issues are these:

src/datasets/commands/dummy_data.py:132:25: E722 do not use bare 'except'
src/datasets/commands/dummy_data.py:141:21: F841 local variable 'e' is assigned to but never used

So all the code in your datasets is fine; the only errors come from dummy_data.py.
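
For reference, an illustrative sketch of how E722 and F841 are usually resolved (not the actual dummy_data.py code):

def read_optional_file(path):
    """Illustrative only; not the real dummy_data.py logic."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except OSError as e:  # E722: catch a specific exception instead of a bare `except:`
        # F841: use the bound exception (e.g. log it), or drop the `as e` binding
        print(f"Skipping {path}: {e}")
        return None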

Comment on lines 10 to 35
_DESCRIPTION = """Given two codes as the input, the task is to do binary classification (0/1), where 1 stands for semantic equivalence and 0 for others. Models are evaluated by F1 score.
The dataset we use is BigCloneBench and filtered following the paper Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree."""

_CITATION = """@inproceedings{svajlenko2014towards,
title={Towards a big data curated benchmark of inter-project code clones},
author={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},
booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
pages={476--480},
year={2014},
organization={IEEE}
}
@inproceedings{wang2020detecting,
title={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},
author={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},
booktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
pages={261--271},
year={2020},
organization={IEEE}
}"""

Member

Can you have the citation and description as global variables ?

Contributor Author

It will be a bit tricky, as everything is generated and the base class handles this. Why is it better to have a global variable?

Member

The template for dataset scripts we're using has these variables as global variables. It's the same for all the datasets. Moreover, we're parsing the Python scripts in moon landing to extract the description and citation and show them on the huggingface.co dataset pages, and the parsing currently expects them to be global variables.

So it's possible to support non-global variables, but it may require changes on the moon landing side. We probably want to keep the parsing as simple as possible, though it's not a major issue.

What you can do is have them as both global variables and class attributes:

_CITATION = """\
insert citation here
"""

class MyDataset(datasets.GeneratorBasedBuilder):
    _CITATION = _CITATION

This way the main class has access to the citation, and the citation is also defined as a global variable.

What do you think ?
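
For illustration, here is a slightly fuller sketch of that pattern (MyDataset and the single text feature are placeholders, not the actual script):

import datasets

_DESCRIPTION = """Short description of the dataset."""

_CITATION = """\
insert citation here
"""


class MyDataset(datasets.GeneratorBasedBuilder):
    # Class-level reference, so generated base classes can still reach it
    _CITATION = _CITATION

    def _info(self):
        # _split_generators / _generate_examples omitted for brevity
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            features=datasets.Features({"text": datasets.Value("string")}),
        )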

}


class CodeXGlueCCCloneDetectionBigCloneBenchMain(datasets.GeneratorBasedBuilder):

Member

The name of the class should be the camel-case version of the dataset script name, so

CodeXGlueCcCloneDetectionBigCloneBench

Also note that after doing changes in dataset class names or dataset script names, you have to regenerate the dataset_infos.json file.
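
For example, with the script name code_x_glue_cc_clone_detection_big_clone_bench from the list above, the convention gives a skeleton like this (the builder body is of course the one in this PR, not shown here):

# datasets/code_x_glue_cc_clone_detection_big_clone_bench/code_x_glue_cc_clone_detection_big_clone_bench.py
import datasets


class CodeXGlueCcCloneDetectionBigCloneBench(datasets.GeneratorBasedBuilder):
    """The class name is the camel-case version of the script name."""
    # _info / _split_generators / _generate_examples omitted for brevity

And dataset_infos.json is then regenerated the usual way, with something like datasets-cli test datasets/<dataset_folder> --save_infos --all_configs.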

Contributor Author

Yes, I have a single script that runs the whole thing, so no problem.
Maybe we should enforce this constraint in the dataset loading code, so that it fails if the class does not have the right name?

Member

We should indeed have something to check this constraint. It doesn't raise an error, but it's better to have matching names for debugging and to get consistent cache file names.
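
A rough sketch of what such a check could look like (a hypothetical helper, not existing library code):

import re


def camelcase_to_snakecase(name: str) -> str:
    # e.g. "CodeXGlueCcCloneDetectionBigCloneBench" -> "code_x_glue_cc_clone_detection_big_clone_bench"
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def check_builder_name(builder_cls, script_name: str) -> None:
    # Warn or raise when the builder class name does not match the script name
    expected = camelcase_to_snakecase(builder_cls.__name__)
    if expected != script_name:
        raise ValueError(
            f"Builder class {builder_cls.__name__!r} should be the camel-case "
            f"version of the script name {script_name!r}"
        )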

Comment on lines 89 to 90
def _generate_examples(self, split_name, file_pathes):
    return self.child._generate_examples(split_name, file_pathes)

Member

Suggested change
- def _generate_examples(self, split_name, file_pathes):
-     return self.child._generate_examples(split_name, file_pathes)
+ def _generate_examples(self, split_name, file_paths):
+     return self.child._generate_examples(split_name, file_paths)

typo

Member

@lhoestq lhoestq left a comment

Also, it would be nice to separate the addition of the datasets and of all the tools you developed to make things easier into different PRs, so that we can review them separately!

The tool to make the dataset card looks very promising. Looking forward to being able to generate everything with one click ^^

@@ -0,0 +1,464 @@
#!/usr/bin/env python3

Member

Can you move this script to a separate PR, if you don't mind?

Contributor Author

Sorry about that; I actually moved the code to an external repository, and I should have removed this. We will work with @yjernite to merge some features into the datasets-tagging app.

@@ -0,0 +1,59 @@
# Dataset Card for "{{dataset_name}}"

Member

Same for this one.

# Line by line text file (txt, csv etc.)
if is_line_by_line_text_file:
    Path(dst_path).parent.mkdir(exist_ok=True, parents=True)
    with open(src_path, "r", encoding=encoding) as src_file:

Member

Can you remove these changes ?

If you feel like they can be useful, please open another PR

Contributor Author

Yes, sorry, I did not intend to add this in the same PR; I was waiting to add it in another one, and it still lacks some checks on the exception. I must have done a "commit -am" instead of a "commit -m" somewhere. Adding all the dummy_data files took a very long time, and I must have let my guard down at some point...

Member

@lhoestq lhoestq left a comment

Hi :)
Looks like we can merge this one pretty soon !
Can you just remove the changes in dummy_data.py and maybe remove README.template.md as well as generate_dataset_card.py ?
Could you also update this branch by merging master into it?

@@ -0,0 +1,171 @@
--

Member

Suggested change
- --
+ ---

language_creators:
- found
languages:
- C

Member

Suggested change
- - C
+ - code

language_creators:
- found
languages:
- C++

Member

Suggested change
- - C++
+ - code

Comment on lines 7 to 8
- java
- C#

Member

Suggested change
- - java
- - C#
+ - code

@ncoop57
Copy link

ncoop57 commented May 12, 2021

Hi @madlag and @lhoestq, I am extremely interested in getting this dataset into HF's library, as I do a lot of research in this area. I see that it hasn't been updated in a while, but it is very close to being finished. If no one is currently working on this, I'd be happy to do any final touches needed to get this merged.

@lhoestq
Member

lhoestq commented May 12, 2021

Hi @ncoop57! Thanks for your interest, and sorry for the inactivity on this PR.
Sure, feel free to create another PR to continue this one! It was really close to being merged, so I think it won't require many changes. In addition to my previous comments, there should also be a "Contributions" subsection (see the template of the README here).

@madlag madlag force-pushed the microsoft-codexglue-code-to-code-trans branch from 56792f9 to 91d92fc on June 3, 2021 at 09:29
@madlag
Contributor Author

madlag commented Jun 8, 2021

Superseded by #2357 .

@madlag madlag closed this Jun 8, 2021