Microsoft CodeXGlue #997
Conversation
366774d to 02800cf (force-pushed)
I started to take a look at this very impressive PR!
I only added comments about the first dataset since the comments may apply to the others as well.
My main concern here is the dataset names: they are too long and complicated. Could you simplify them?
For readability, can you add comments and docstrings?
And run `make style` to format the code.
| - [Additional Information](#additional-information)
|   - [Dataset Curators](#dataset-curators)
|   - [Licensing Information](#licensing-information)
|   - [Citation Information](#citation-information)
Can you re-add all the sections that were removed? You can leave their content as [More Information Needed].
For the names, I took the original names, but I understand.
For this one, would 'cxgcc_clone_detect_big' be OK? (cxg for CodeXGlue, cc for code-to-code, and a shortened version of the dataset name; same thing for the others.)
For the `make style`, it was part of my post-processing script, but there may be some subtlety somewhere.
(and I will re-add the missing sections)
(I just ran `make style`, and there are no changes to the code; can you confirm that there is an issue with the code style?)
Here is a proposed list of names:
cxgcc_defect_detect
cxgcc_clone_detect_big
cxgcc_refine
cxgcc_complete_token
cxgcc_code_to_code
cxgcc_code_complete_line
cxgcc_clone_detect_poj
cxgcc_cloze_test_maxmin
cxgcc_cloze_test_all
cxgct_code_to_text
cxgtt_text_to_text
cxgtc_search_web_query
cxgtc_text_to_code
cxgtc_search_adv
Looks good!
You can even make the names clearer with
code_xglue_<tt|cc|ct|tc>_<subdataset_name>
(just change the beginning from cxg to code_xglue_).
The remaining code style issues are these ones:
src/datasets/commands/dummy_data.py:132:25: E722 do not use bare 'except'
src/datasets/commands/dummy_data.py:141:21: F841 local variable 'e' is assigned to but never used
So all the code in your datasets is fine, the only errors come from dummy_data.py
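For reference, the two flake8 findings correspond to patterns like the following. These are hypothetical minimal reproductions, not the actual dummy_data.py code:

```python
# E722 flags a bare `except:`; F841 flags a variable that is bound but never
# used. Both are shown below next to the fixed form.

def parse_int_bad(value):
    try:
        return int(value)
    except:  # E722: bare except also swallows SystemExit/KeyboardInterrupt
        return None

def parse_int_good(value):
    try:
        return int(value)
    except (TypeError, ValueError):  # catch only the errors we expect
        return None

def read_bad(value):
    try:
        return int(value)
    except ValueError as e:  # F841: `e` is assigned to but never used
        return 0

def read_good(value):
    try:
        return int(value)
    except ValueError:  # no unused binding
        return 0
```

Both "bad" variants run fine; flake8 only complains about style and robustness, which is why the datasets themselves pass.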
datasets/code_x_glue_cc_clone_detection_big_clone_bench/README.md (outdated; comments resolved)
| _DESCRIPTION = """Given two codes as the input, the task is to do binary classification (0/1), where 1 stands for semantic equivalence and 0 for others. Models are evaluated by F1 score.
| The dataset we use is BigCloneBench and filtered following the paper Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree."""
|
| _CITATION = """@inproceedings{svajlenko2014towards,
|   title={Towards a big data curated benchmark of inter-project code clones},
|   author={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},
|   booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
|   pages={476--480},
|   year={2014},
|   organization={IEEE}
| }
| @inproceedings{wang2020detecting,
|   title={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},
|   author={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},
|   booktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
|   pages={261--271},
|   year={2020},
|   organization={IEEE}
| }"""
Can you have the citation and description as global variables?
It will be a bit tricky, as everything is generated and the base class handles this. Why is it better to have a global variable?
The template for dataset scripts we're using has these variables as global variables. It's the same for all the datasets. Moreover, we're parsing the Python scripts to extract the description and citation in moon landing, to show them on the huggingface.co dataset pages, and the parsing currently expects them to be global variables.
So it's possible to support them being non-global, but it may require changes on the moon landing side. We probably want to keep the parsing as simple as possible, but it's not a major issue.
What you can do is have them as both a global variable and a class attribute:
_CITATION = """\
insert citation here
"""

class MyDataset(datasets.GeneratorBasedBuilder):
    _CITATION = _CITATION

This way the main class has access to the citation, and the citation is also defined as a global variable.
What do you think?
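To illustrate why module-level globals are convenient here: a static parser can read them with `ast`, without importing or executing the dataset script. This is a rough sketch, not the actual moon-landing parser:

```python
# Hypothetical sketch: extract a module-level string assignment statically.
import ast

def extract_global_string(source: str, name: str):
    """Return the value of a top-level `name = "..."` assignment, or None."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if (
                    isinstance(target, ast.Name)
                    and target.id == name
                    and isinstance(node.value, ast.Constant)
                    and isinstance(node.value.value, str)
                ):
                    return node.value.value
    return None

script = '_CITATION = """some bibtex entry"""\n'
print(extract_global_string(script, "_CITATION"))  # prints: some bibtex entry
```

If the citation only lived as a class attribute computed at runtime, this kind of static extraction would not see it.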
| }
|
| class CodeXGlueCCCloneDetectionBigCloneBenchMain(datasets.GeneratorBasedBuilder):
The name of the class should be the camel-case version of the dataset script name, so
CodeXGlueCcCloneDetectionBigCloneBench
Also note that after changing dataset class names or dataset script names, you have to regenerate the dataset_infos.json file.
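As an illustration of the naming rule, this conversion can be sketched in a few lines (the helper below is not part of the datasets library, just a sketch):

```python
def script_name_to_class_name(script_name: str) -> str:
    """Camel-case a dataset script name, e.g.
    'code_x_glue_cc_clone_detection_big_clone_bench'
    -> 'CodeXGlueCcCloneDetectionBigCloneBench'."""
    # Capitalize each underscore-separated part and join them.
    return "".join(part.capitalize() for part in script_name.split("_"))

print(script_name_to_class_name("code_x_glue_cc_clone_detection_big_clone_bench"))
# prints: CodeXGlueCcCloneDetectionBigCloneBench
```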
Yes, I have a single script that runs the whole thing, so no problem.
Maybe we should enforce this constraint in the dataset loading code, so that it does not work if the class does not have the right name?
We should indeed have something to check this constraint. It doesn't raise an error, but it's better to have names that match, both for debugging and for consistent cache file names.
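Such a check could look roughly like this. Everything here is a sketch: `check_builder_name` is hypothetical, not an actual function in the datasets loading code:

```python
# Hypothetical sketch of enforcing the class-name constraint at load time.

def expected_class_name(script_name: str) -> str:
    """Camel-case a script name: 'my_dataset' -> 'MyDataset'."""
    return "".join(part.capitalize() for part in script_name.split("_"))

def check_builder_name(builder_cls: type, script_name: str) -> None:
    """Raise if the builder class name does not match the script name."""
    expected = expected_class_name(script_name)
    if builder_cls.__name__ != expected:
        raise ValueError(
            f"Builder class {builder_cls.__name__!r} does not match "
            f"dataset script {script_name!r} (expected {expected!r})"
        )

class MyDataset:  # stands in for a GeneratorBasedBuilder subclass
    pass

check_builder_name(MyDataset, "my_dataset")  # passes silently
```

A loader could call such a helper right after importing the script, turning the silent mismatch into an early, explicit error.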
| def _generate_examples(self, split_name, file_pathes):
|     return self.child._generate_examples(split_name, file_pathes)
Suggested change (typo: `file_pathes` → `file_paths`):
| def _generate_examples(self, split_name, file_paths):
|     return self.child._generate_examples(split_name, file_paths)
lhoestq left a comment
Also, it would be nice to separate the addition of the datasets and all the tools you developed to make things easier into different PRs, so that we can review them separately!
The tool to make the dataset card looks very promising. Looking forward to being able to generate everything with one click ^^
| @@ -0,0 +1,464 @@
| #!/usr/bin/env python3
Can you move this script to a separate PR, if you don't mind?
Sorry for that, I actually moved the code to an external repository, and I should have removed this. We will work with @yjernite to merge some features into the datasets-tagging app.
| @@ -0,0 +1,59 @@
| # Dataset Card for "{{dataset_name}}"
Same for this one.
| # Line by line text file (txt, csv etc.)
| if is_line_by_line_text_file:
|     Path(dst_path).parent.mkdir(exist_ok=True, parents=True)
|     with open(src_path, "r", encoding=encoding) as src_file:
Can you remove these changes?
If you feel they could be useful, please open another PR.
Yes, sorry, it was not intended to add this in the same PR; I was waiting to put it in another one, and it still lacks some checks on the exception. I must have done a `commit -am` instead of a `commit -m` somewhere. It had taken a long, long time to add all the dummy_data files, and I must have lowered my guard at some point...
lhoestq left a comment
Hi :)
Looks like we can merge this one pretty soon!
Can you just remove the changes in dummy_data.py, and maybe remove README.template.md as well as generate_dataset_card.py?
Could you also update this branch by merging master into it?
| @@ -0,0 +1,171 @@
| --
Suggested change: `--` → `---`
| language_creators:
| - found
| languages:
| - C
Suggested change: `- C` → `- code`
| language_creators:
| - found
| languages:
| - C++
Suggested change: `- C++` → `- code`
| - java
| - C#
Suggested change: keep `- java` and `- C#`, and add `- code`
Hi @madlag and @lhoestq, I am extremely interested in getting this dataset into HF's library, as I do a lot of research in this area. I see that it hasn't been updated in a while, but it is very close to being finished. If no one is currently working on this, I'd be happy to do any final touches that might be needed to get this merged.

Hi @ncoop57! Thanks for your interest, and sorry for the inactivity on this PR.
Co-authored-by: Quentin Lhoest <[email protected]>
56792f9 to 91d92fc (force-pushed)
Superseded by #2357.
Datasets from https://github.com/microsoft/CodeXGLUE
This contains 13 datasets:
code_x_glue_cc_clone_detection_big_clone_bench
code_x_glue_cc_clone_detection_poj_104
code_x_glue_cc_cloze_testing_all
code_x_glue_cc_cloze_testing_maxmin
code_x_glue_cc_code_completion_line
code_x_glue_cc_code_completion_token
code_x_glue_cc_code_refinement
code_x_glue_cc_code_to_code_trans
code_x_glue_cc_defect_detection
code_x_glue_ct_code_to_text
code_x_glue_tc_nl_code_search_adv
code_x_glue_tc_text_to_code
code_x_glue_tt_text_to_text
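The 13 names above follow a `code_x_glue_<family>_<task>` scheme, where the family prefix encodes the task direction (cc = code-to-code, ct = code-to-text, tc = text-to-code, tt = text-to-text). A quick sanity sketch; the commented `load_dataset` call needs network access and is shown only as an illustration:

```python
# The 13 dataset script names from this PR, as a checkable list.
DATASETS = [
    "code_x_glue_cc_clone_detection_big_clone_bench",
    "code_x_glue_cc_clone_detection_poj_104",
    "code_x_glue_cc_cloze_testing_all",
    "code_x_glue_cc_cloze_testing_maxmin",
    "code_x_glue_cc_code_completion_line",
    "code_x_glue_cc_code_completion_token",
    "code_x_glue_cc_code_refinement",
    "code_x_glue_cc_code_to_code_trans",
    "code_x_glue_cc_defect_detection",
    "code_x_glue_ct_code_to_text",
    "code_x_glue_tc_nl_code_search_adv",
    "code_x_glue_tc_text_to_code",
    "code_x_glue_tt_text_to_text",
]

# The 4th underscore-separated component is the task family prefix.
families = {name.split("_")[3] for name in DATASETS}
print(len(DATASETS), sorted(families))  # prints: 13 ['cc', 'ct', 'tc', 'tt']

# Loading one of them would look like this (requires network access):
# from datasets import load_dataset
# ds = load_dataset("code_x_glue_cc_defect_detection")
```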