@ncoop57 ncoop57 commented May 13, 2021

Hi there, this is a new pull request to get the CodeXGlue datasets into the awesome HF datasets lib. Most of the work has been done in this PR #997 by the awesome @madlag. However, that PR has been stale for a while now and so I spoke with @lhoestq about finishing up the final mile and so he told me to open a new PR with the final changes 😄.

I believe I've met all of the changes still left in the old PR to do, except for the change to the languages. I believe the READMEs should include the different programming languages used rather than just using the tag "code" as when searching for datasets, SE researchers may specifically be looking only for what type of programming language and so being able to quickly filter will be very valuable. Let me know what you think of that or if you still believe it should be the "code" tag @lhoestq.


ncoop57 commented May 13, 2021

Oh one other thing. Mentioned in the PR was that I would need to regenerate the dataset_infos.json once the camel casing was done. However, I am unsure why this is the case since there is no reference to any object names in the dataset_infos.json file.

If it needs to be reran, I can try it do it on my own machine, but I've had a memory issues with a previous dataset due to my compute constraints so I'd prefer to hopefully avoid it all together if not necessary to regenerate.

@lhoestq lhoestq left a comment

Thank you!

To fix the CI you just have to fix the class_name in the definitions file of code_x_glue_cc_clone_detection_poj_104 (see my first comment), and also add the encoding= parameter to all the calls to open (for example open(..., encoding="utf-8")).

Regarding the camel case conversion: the dataset_infos.json contains the name of the dataset (builder_name field). The builder name is the snake case version of the dataset builder class. It must be equal to the dataset script name, which is also snake case.

For example

CodeXGlueCcCloneDetectionBigCloneBenchMain -> code_x_glue_cc_clone_detection_big_clone_bench_main

Since the class names changed, we must update all the builder_name fields of the dataset_infos.json. To do so you can try to regenerate it, or you can just do it by hand if you don't want to wait (but in that case please be careful not to introduce any typos ^^).
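For illustration, the camel-to-snake conversion described above can be sketched in a few lines of Python. The regex approach and the function name here are assumptions for the sketch, not the code that the datasets library uses internally:

```python
import re


def class_name_to_builder_name(class_name):
    # Insert "_" before every capital letter except the first, then
    # lower-case the result: "CodeXGlueCc..." -> "code_x_glue_cc_...".
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


print(class_name_to_builder_name("CodeXGlueCcCloneDetectionBigCloneBenchMain"))
# -> code_x_glue_cc_clone_detection_big_clone_bench_main
```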

Finally, I also added a comment about a file that is written to disk in code_x_glue_cc_code_completion_token. It would be nice not to write this file, since doing so won't work with the dataset streaming feature we're implementing.

Comment on lines 52 to 61
```python
with open(os.path.join(root_path, f"{split_name}.jsonl"), "w") as f:
    for i in range(*range_info):
        items = files(os.path.join(root_path, "ProgramData/{}".format(i)))
        for item in items:
            js = {}
            js["label"] = item.split("/")[1]
            js["index"] = str(cont)
            js["code"] = open(item, encoding="latin-1").read()
            f.write(json.dumps(js) + "\n")
            cont += 1
```
lhoestq (Member):

It would be nice not to have to write new files. This loop should yield examples instead of writing them to a jsonl file.

ncoop57 (Author):

What about moving all of this to the _split_generators function, since it is essentially doing some data wrangling to get the data into an easier-to-use format?

lhoestq (Member):

I'd actually move it to the _generate_examples method. It is the one that yields examples.


ncoop57 commented May 14, 2021

Was just reviewing the builder_names of each dataset and it seems like it is already following this format:

CodeXGlueCcCloneDetectionBigCloneBenchMain -> code_x_glue_cc_clone_detection_big_clone_bench_main

Is there a location I am missing?


lhoestq commented May 17, 2021

> Was just reviewing the builder_names of each dataset and it seems like it is already following this format:
>
> CodeXGlueCcCloneDetectionBigCloneBenchMain -> code_x_glue_cc_clone_detection_big_clone_bench_main
>
> Is there a location I am missing?

If it's already in this format then it's fine, thanks! It's all good then.

To fix the CI you just need to add the encoding= parameters to the open() calls.
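The change being asked for is mechanical: pass an explicit `encoding=` so reads and writes behave the same on every platform instead of depending on the locale default. A tiny illustration (the file name is hypothetical):

```python
# Without encoding=, open() falls back to the platform/locale default
# encoding, which can differ between Linux CI and Windows CI.
with open("example.txt", "w", encoding="utf-8") as f:
    f.write("café")

# Reading back with the same explicit encoding round-trips the content.
with open("example.txt", encoding="utf-8") as f:
    assert f.read() == "café"
```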


ncoop57 commented May 18, 2021

@lhoestq I think everything should be good to go besides the code styling, which seems to be due to missing or unsupported metadata tags in the READMEs. Is this something I should worry about, given that all the other datasets seem to be failing as well?

@lhoestq lhoestq left a comment

Hi! Yes, we have to fix all the dataset card validation issues. I did most of them (see comments).

Moreover we updated the structure of the dataset cards to add more sections.
The new table of content looks like this:

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

Can you update all the tables of contents? And also add the missing sections or subsections to each dataset card?

Finally feel free to merge master into this branch to get the latest updates regarding dataset card validation.

EDIT: changed the section names above to "Supported Tasks and Leaderboards" and "Data Splits".


ncoop57 commented May 30, 2021

Hey @lhoestq, just finalizing the READMEs and testing them against the automated test. For the non-WIN tests, it seems like there is some dependency issue that doesn't have to do with the new datasets. For the WIN tests, it looks like some of the headings are mislabeled, such as "Supported Tasks and Leaderboards" -> "Supported Tasks" in the TOC you posted. Should I base my TOC on the one you posted or on the one that the test script is using?

Also, it throws errors for some of the fields being empty, such as "Source Data" in the code_x_glue_tt_text_to_text dataset. However, I am not familiar with this dataset, so I put the [More Information Needed] stub there, as in the other sections I couldn't easily answer. For sections like "Source Data", is this info required?


lhoestq commented May 31, 2021

Yes, you're right, it is "Supported Tasks and Leaderboards" that we need to use, sorry about that.

I also noticed the same for the splits section: we have to use "Data Splits" (not "Data Splits Sample Size").


lhoestq commented May 31, 2021

Some subsections are also missing: "Initial Data Collection and Normalization" and "Who are the source language producers?".
If you are interested you can fill those sections as well, or leave them empty for now.
This will also fix the error regarding "Source Data".

You can see the template of the readme here:
https://github.com/huggingface/datasets/blob/9d8bf36fdb861d9b2922d7c782fb58f9f542997c/templates/README.md


yjernite commented Jun 1, 2021

> > Also, I see that we are having to only use the code tag instead of individual langs, and I get that this is required for indexing or showing available tags on the datasets hub. However, as a future feature, it might be good to add tags for individual programming languages to make it easier to search.
>
> Yes I agree. We'll be able to reuse the tags per programming language from this PR when we allow this feature.
>
> cc @yjernite what do you think about extending our languages taxonomy to programming languages?

Sounds good, as long as they all share a prefix! Maybe code_cpp, code_java, etc.?

I don't think we currently have _ in language codes/names, but I also don't see what it would break a priori.


lhoestq commented Jun 1, 2021

We don't use _, but there are some languages that use -, like en-US. Let's use - then, to match the same hierarchy pattern?
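Purely as an illustration of the hyphenated scheme under discussion, a dataset card's YAML front matter might then tag programming languages like this. The tag names below are hypothetical, not an adopted taxonomy:

```yaml
# Hypothetical tags following the hierarchical "-" pattern (cf. en-US);
# illustrative only, not an adopted taxonomy.
languages:
- code-cpp
- code-java
- code-python
```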


madlag commented Jun 4, 2021

Hi guys, I started working on #997 this morning and just realized that you were finishing it... You may want to get the dataset cards from https://github.com/madlag/datasets, and maybe some code too, as I did a few things like moving _CITATION and _DESCRIPTION to globals.


madlag commented Jun 7, 2021

I am renaming the main classes to match the dataset names, for example: CodeXGlueTcTextToCodeMain -> CodeXGlueTcTextToCode. And I am regenerating the dataset_infos.json accordingly.

…_detection_poj104 (to have consistent names in dataset_infos.json)

Changing class names for the same reason.
Regenerated dataset_infos.json and checked names coherency.

lhoestq commented Jun 7, 2021

Thanks for renaming the classes and updating the dataset_infos.json! This all looks clean now :)

This PR looks all good to me :) One just needs to merge master into this branch to make sure the CI is green with the latest changes. That should also fix the current CI issues that are not related to this PR.


ncoop57 commented Jun 8, 2021

Woot woot 🚀! All green, looks like it is ready for showtime. Thank you both @lhoestq and especially @madlag, I think these datasets are going to be a great new addition to 🤗 datasets and I can't wait to use them in my research 🤓.


madlag commented Jun 8, 2021

Thanks @ncoop57 for your contribution! It will be really cool to see those datasets used as soon as they are released!

@lhoestq lhoestq left a comment

Good job! Thanks a lot for adding all these datasets @madlag and @ncoop57 :)

@lhoestq lhoestq merged commit ed98957 into huggingface:master Jun 8, 2021
@madlag madlag mentioned this pull request Jun 8, 2021