Adding Microsoft CodeXGlue Datasets #2357
Conversation
Co-authored-by: Quentin Lhoest <[email protected]>
Oh, one other thing. The PR mentioned that I would need to regenerate the dataset_infos.json once the camel casing was done. However, I am unsure why this is the case, since there is no reference to any object names in the dataset_infos.json file. If it needs to be rerun, I can try to do it on my own machine, but I've had memory issues with a previous dataset due to my compute constraints, so I'd prefer to avoid it altogether if regenerating isn't necessary.
…ub.com/ncoop57/datasets into microsoft-codexglue-code-to-code-trans
lhoestq
left a comment
Thank you !
To fix the CI you just have to fix the class_name of the definitions file of code_x_glue_cc_clone_detection_poj_104 (see my first comment), and also add the encoding= parameter to all the calls to open (for example open(..., encoding="utf-8")).
Regarding the camel case conversion: the dataset_infos.json contains the name of the dataset (builder_name field). The builder name is the snake case version of the dataset builder class. It must be equal to the dataset script name, which is also snake case.
For example
CodeXGlueCcCloneDetectionBigCloneBenchMain -> code_x_glue_cc_clone_detection_big_clone_bench_main
Since the class name changed, we must update all the builder_name fields of the dataset_infos.json. To do so you can try to regenerate it, or maybe you can just do it by hand if you don't want to wait (but in this case please be careful not to make any typos ^^).
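For illustration, here is a minimal sketch of that camel-to-snake conversion (my own regex approach, not necessarily the exact implementation in the library):

```python
import re

def camel_to_snake(name: str) -> str:
    # Break before an uppercase letter that starts a new word,
    # then before any uppercase following a lowercase or digit.
    s = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s).lower()

# -> code_x_glue_cc_clone_detection_big_clone_bench_main
print(camel_to_snake("CodeXGlueCcCloneDetectionBigCloneBenchMain"))
```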
Finally I also added a comment about a file that is written on disk in code_x_glue_cc_code_completion_token. It would be nice to not have this file written since it won't work with the dataset streaming feature that we're implementing.
datasets/code_x_glue_cc_clone_detection_poj_104/generated_definitions.py
```python
with open(os.path.join(root_path, f"{split_name}.jsonl"), "w") as f:
    for i in range(*range_info):
        items = files(os.path.join(root_path, "ProgramData/{}".format(i)))
        for item in items:
            js = {}
            js["label"] = item.split("/")[1]
            js["index"] = str(cont)
            js["code"] = open(item, encoding="latin-1").read()
            f.write(json.dumps(js) + "\n")
            cont += 1
```
It would be nice to not have to write new files. This loop should be used to yield examples instead of writing to a jsonl file.
What if this was all moved to the _split_generators function, since this is essentially doing some data wrangling to get it into an easier-to-use format?
I'd move it to the _generate_examples method actually. It is the one that yields examples.
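A minimal sketch of what that could look like, reusing the loop from the snippet above inside _generate_examples (assuming os is imported and files is the existing helper in the script; the exact signature is up to the builder):

```python
def _generate_examples(self, root_path, range_info):
    # Yield examples directly instead of materializing an
    # intermediate jsonl file on disk.
    cont = 0
    for i in range(*range_info):
        items = files(os.path.join(root_path, "ProgramData/{}".format(i)))
        for item in items:
            with open(item, encoding="latin-1") as f:
                code = f.read()
            yield cont, {
                "label": item.split("/")[1],
                "index": str(cont),
                "code": code,
            }
            cont += 1
```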
…nitions.py Co-authored-by: Quentin Lhoest <[email protected]>
Was just reviewing the
If it's already in this format then it's fine, thanks! It's all good then. To fix the CI you just need to add the
@lhoestq I think everything should be good to go besides the code styling, which seems to be due to missing or unsupported metadata tags for the READMEs. Is this something I should worry about, since all the other datasets seem to be failing as well?
Hi ! Yes we have to fix all the dataset card validation issues. I did most of them (see comments).
Moreover we updated the structure of the dataset cards to add more sections.
The new table of contents looks like this:
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

Can you update all the tables of contents? And also add the missing sections or subsections in the code of each dataset card?
Finally feel free to merge master into this branch to get the latest updates regarding dataset card validation.
EDIT: changed to `Supported Tasks and Leaderboards` and `Data Splits`
datasets/code_x_glue_cc_clone_detection_big_clone_bench/README.md
Co-authored-by: Quentin Lhoest <[email protected]>
Hey @lhoestq, just finalizing the READMEs and testing them against the automated test. For the non-WIN tests, it seems like there is some dependency issue that doesn't have to do with the new datasets. For the WIN tests, it looks like some of the headings are mislabeled, such as "Supported Tasks and Leaderboards" -> "Supported Tasks" in the TOC you posted. Should I base my TOC on the one you posted or on the one that the test script is using? Also, it throws errors for some of the fields being empty, such as "Source Data" in the
Yes you're right, it is `Supported Tasks and Leaderboards`. I also noticed the same for the splits section: we have to use `Data Splits`.
Some subsections are also missing. You can see the template of the readme here:
Sounds good, as long as they all share a prefix! Maybe... I don't think we currently have
We don't use
Hi guys, I just started working on #997 this morning and just realized that you were finishing it... You may want to get the dataset cards from https://github.com/madlag/datasets, and maybe some code too, as I did a few things like moving _CITATION and _DESCRIPTION to globals.
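For context, a minimal sketch of that globals pattern as it commonly appears in dataset scripts (the class name and placeholder strings here are illustrative):

```python
import datasets

# Citation and description hoisted to module-level globals.
_CITATION = """\
(bibtex entry for the dataset paper goes here)
"""

_DESCRIPTION = """\
(short description of the dataset goes here)
"""

class CodeXGlueTcTextToCode(datasets.GeneratorBasedBuilder):
    def _info(self):
        # Only the fields relevant to this sketch are shown.
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
        )
```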
…language producers where I could
I am renaming the main classes to match the dataset names, for example: CodeXGlueTcTextToCodeMain -> CodeXGlueTcTextToCode. And I am regenerating the dataset_infos.json accordingly.
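If it helps, regenerating the metadata should be a one-liner per dataset with the datasets-cli test command from the contributing guide (the dataset path below is illustrative):

```
datasets-cli test datasets/code_x_glue_tc_text_to_code --save_infos --all_configs
```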
…_detection_poj104 (to have consistent names in dataset_infos.json) Changing class names for the same reason. Regenerated dataset_infos.json and checked names coherency.
Thanks for renaming the classes and updating the dataset_infos.json ! This looks all clean now :) This PR looks all good to me :) One just needs to merge master into this branch to make sure the CI is green with the latest changes. It should also fix the current CI issues that are not related to this PR.
Thanks @ncoop57 for your contribution! It will be really cool to see those datasets used as soon as they are released !
Hi there, this is a new pull request to get the CodeXGlue datasets into the awesome HF datasets lib. Most of the work has been done in PR #997 by the awesome @madlag. However, that PR has been stale for a while now, so I spoke with @lhoestq about finishing up the final mile and he told me to open a new PR with the final changes 😄.
I believe I've addressed all of the remaining changes from the old PR, except for the change to the languages. I believe the READMEs should list the different programming languages used rather than just the tag "code": when searching for datasets, SE researchers may be looking specifically for a particular programming language, so being able to quickly filter will be very valuable. Let me know what you think of that, or if you still believe it should be the "code" tag @lhoestq.