@ncoop57 ncoop57 commented May 13, 2021

Hi there, this is a new pull request to get the CodeXGlue datasets into the awesome HF datasets lib. Most of the work has been done in this PR #997 by the awesome @madlag. However, that PR has been stale for a while now and so I spoke with @lhoestq about finishing up the final mile and so he told me to open a new PR with the final changes 😄.

I believe I've met all of the changes still left in the old PR to do, except for the change to the languages. I believe the READMEs should include the different programming languages used rather than just using the tag "code" as when searching for datasets, SE researchers may specifically be looking only for what type of programming language and so being able to quickly filter will be very valuable. Let me know what you think of that or if you still believe it should be the "code" tag @lhoestq.


ncoop57 commented May 13, 2021

Oh one other thing. Mentioned in the PR was that I would need to regenerate the dataset_infos.json once the camel casing was done. However, I am unsure why this is the case since there is no reference to any object names in the dataset_infos.json file.

If it needs to be reran, I can try it do it on my own machine, but I've had a memory issues with a previous dataset due to my compute constraints so I'd prefer to hopefully avoid it all together if not necessary to regenerate.

@lhoestq lhoestq left a comment

Thank you!

To fix the CI you just have to fix the class_name in the definitions file of code_x_glue_cc_clone_detection_poj_104 (see my first comment), and also add the encoding= parameter to all the calls to open (for example open(..., encoding="utf-8")).

Regarding the camel case conversion: the dataset_infos.json contains the name of the dataset (builder_name field). The builder name is the snake case version of the dataset builder class. It must be equal to the dataset script name, which is also snake case.

For example

CodeXGlueCcCloneDetectionBigCloneBenchMain -> code_x_glue_cc_clone_detection_big_clone_bench_main

Since the class names changed, we must update all the builder_name fields of the dataset_infos.json. To do so you can try to regenerate it, or you can just do it by hand if you don't want to wait (but in that case please be careful not to introduce any typos ^^).
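For illustration, the camel-to-snake conversion described above can be sketched in a few lines of Python. The regex approach and the function name here are assumptions for the sketch, not the code that the datasets library uses internally:

```python
import re


def class_name_to_builder_name(class_name):
    # Insert "_" before every capital letter except the first, then
    # lower-case the result: "CodeXGlueCc..." -> "code_x_glue_cc_...".
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


print(class_name_to_builder_name("CodeXGlueCcCloneDetectionBigCloneBenchMain"))
# -> code_x_glue_cc_clone_detection_big_clone_bench_main
```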

Finally, I also added a comment about a file that is written to disk in code_x_glue_cc_code_completion_token. It would be nice not to write this file, since doing so won't work with the dataset streaming feature we're implementing.

Comment on lines 52 to 61
```python
with open(os.path.join(root_path, f"{split_name}.jsonl"), "w") as f:
    for i in range(*range_info):
        items = files(os.path.join(root_path, "ProgramData/{}".format(i)))
        for item in items:
            js = {}
            js["label"] = item.split("/")[1]
            js["index"] = str(cont)
            js["code"] = open(item, encoding="latin-1").read()
            f.write(json.dumps(js) + "\n")
            cont += 1
```
lhoestq (Member):

It would be nice not to have to write new files. This loop should yield examples instead of writing them to a jsonl file.

ncoop57 (Author):

What about moving all of this to the _split_generators function, since it is essentially doing some data wrangling to get the data into an easier-to-use format?

lhoestq (Member):

I'd actually move it to the _generate_examples method. It is the one that yields examples.


ncoop57 commented May 14, 2021

Was just reviewing the builder_names of each dataset and it seems like it is already following this format:

CodeXGlueCcCloneDetectionBigCloneBenchMain -> code_x_glue_cc_clone_detection_big_clone_bench_main

Is there a location I am missing?


lhoestq commented May 17, 2021

> Was just reviewing the builder_names of each dataset and it seems like it is already following this format:
>
> CodeXGlueCcCloneDetectionBigCloneBenchMain -> code_x_glue_cc_clone_detection_big_clone_bench_main
>
> Is there a location I am missing?

If it's already in this format then it's fine, thanks! It's all good then.

To fix the CI you just need to add the encoding= parameters to the open() calls.
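The change being asked for is mechanical: pass an explicit `encoding=` so reads and writes behave the same on every platform instead of depending on the locale default. A tiny illustration (the file name is hypothetical):

```python
# Without encoding=, open() falls back to the platform/locale default
# encoding, which can differ between Linux CI and Windows CI.
with open("example.txt", "w", encoding="utf-8") as f:
    f.write("café")

# Reading back with the same explicit encoding round-trips the content.
with open("example.txt", encoding="utf-8") as f:
    assert f.read() == "café"
```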


ncoop57 commented May 18, 2021

@lhoestq I think everything should be good to go besides the code styling, which seems to be due to missing or unsupported metadata tags in the READMEs. Is this something I should worry about, given that all the other datasets seem to be failing as well?

@lhoestq lhoestq left a comment

Hi! Yes, we have to fix all the dataset card validation issues. I did most of them (see comments).

Moreover we updated the structure of the dataset cards to add more sections.
The new table of content looks like this:

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

Can you update all the tables of contents? And also add the missing sections or subsections to each dataset card?

Finally feel free to merge master into this branch to get the latest updates regarding dataset card validation.

EDIT: changed the section names above to "Supported Tasks and Leaderboards" and "Data Splits".


ncoop57 commented May 30, 2021

Hey @lhoestq, just finalizing the READMEs and testing them against the automated test. For the non-WIN tests, it seems like there is some dependency issue that doesn't have to do with the new datasets. For the WIN tests, it looks like some of the headings are mislabeled, such as "Supported Tasks and Leaderboards" -> "Supported Tasks" in the TOC you posted. Should I base my TOC on the one you posted or on the one that the test script is using?

Also, it throws errors for some of the fields being empty, such as "Source Data" in the code_x_glue_tt_text_to_text dataset. However, I am not familiar with this dataset, so I put the [More Information Needed] stub there, as in the other sections I couldn't easily answer. For sections like "Source Data", is this info required?


lhoestq commented May 31, 2021

Yes, you're right, it is "Supported Tasks and Leaderboards" that we need to use, sorry about that.

I also noticed the same for the splits section: we have to use "Data Splits" (not "Data Splits Sample Size").


lhoestq commented May 31, 2021

Some subsections are also missing: "Initial Data Collection and Normalization" and "Who are the source language producers?".
If you are interested you can fill those sections as well, or leave them empty for now.
This will also fix the error regarding "Source Data".

You can see the template of the readme here:
https://github.com/huggingface/datasets/blob/9d8bf36fdb861d9b2922d7c782fb58f9f542997c/templates/README.md


yjernite commented Jun 1, 2021

> > Also, I see that we are having to only use the code tag instead of individual langs, and I get that this is required for indexing or showing available tags on the datasets hub. However, as a future feature, it might be good to add tags for individual programming languages to make it easier to search.
>
> Yes I agree. We'll be able to reuse the tags per programming language from this PR when we allow this feature.
>
> cc @yjernite what do you think about extending our languages taxonomy to programming languages?

Sounds good, as long as they all share a prefix! Maybe code_cpp, code_java, etc.?

I don't think we currently have _ in language codes/names, but I also don't see what it would break a priori.


lhoestq commented Jun 1, 2021

We don't use _, but there are some languages that use -, like en-US. Let's use - then, to match the same hierarchy pattern?
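Purely as an illustration of the hyphenated scheme under discussion, a dataset card's YAML front matter might then tag programming languages like this. The tag names below are hypothetical, not an adopted taxonomy:

```yaml
# Hypothetical tags following the hierarchical "-" pattern (cf. en-US);
# illustrative only, not an adopted taxonomy.
languages:
- code-cpp
- code-java
- code-python
```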


madlag commented Jun 4, 2021

Hi guys, I started working on #997 this morning and just realized that you were finishing it... You may want to get the dataset cards from https://github.com/madlag/datasets, and maybe some code too, as I did a few things like moving _CITATION and _DESCRIPTION to globals.


madlag commented Jun 7, 2021

I am renaming the main classes to match the dataset names, for example: CodeXGlueTcTextToCodeMain -> CodeXGlueTcTextToCode. And I am regenerating the dataset_infos.json accordingly.

…_detection_poj104 (to have consistent names in dataset_infos.json)

Changing class names for the same reason.
Regenerated dataset_infos.json and checked names coherency.

lhoestq commented Jun 7, 2021

Thanks for renaming the classes and updating the dataset_infos.json! This all looks clean now :)

This PR looks all good to me :) One just needs to merge master into this branch to make sure the CI is green with the latest changes. That should also fix the current CI issues that are not related to this PR.


ncoop57 commented Jun 8, 2021

Woot woot 🚀! All green, looks like it is ready for showtime. Thank you both @lhoestq and especially @madlag, I think these datasets are going to be a great new addition to 🤗 datasets and I can't wait to use them in my research 🤓.


madlag commented Jun 8, 2021

Thanks @ncoop57 for your contribution! It will be really cool to see those datasets used as soon as they are released!

@lhoestq lhoestq left a comment

Good job! Thanks a lot for adding all these datasets @madlag and @ncoop57 :)

@lhoestq lhoestq merged commit ed98957 into huggingface:master Jun 8, 2021
@madlag madlag mentioned this pull request Jun 8, 2021