@thomasw21 thomasw21 commented Jul 28, 2021

We correct the expected validation values of the en subset (both num_bytes and num_examples).

Instead of having:

{"name": "validation", "num_bytes": 828589180707, "num_examples": 364868892, "dataset_name": "c4"}

We replace with correct values:

{"name": "validation", "num_bytes": 825767266, "num_examples": 364608, "dataset_name": "c4"}

The validation values of the other subsets are still wrong, but I can't download and unzip all the files to check the correct number of bytes. (If you have a fast way to obtain those values for the other subsets, I can fix them in this PR; otherwise I can't spend those resources.)
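
For reference, a minimal sketch of how num_examples could be checked without unzipping anything to disk. It assumes the validation shards are gzip-compressed JSON-lines files (one example per line); the helper name `count_examples` is hypothetical and not part of the datasets library. It does not reproduce num_bytes, which is not simply the size of the raw files.

```python
import gzip
import json


def count_examples(paths):
    """Count JSON-lines records across gzip-compressed shards, streaming each file."""
    num_examples = 0
    for path in paths:
        # gzip.open in text mode decompresses on the fly; nothing is written to disk.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    json.loads(line)  # fail loudly on a malformed record
                    num_examples += 1
    return num_examples
```

It could be called on a list of local shard paths, for example `count_examples(sorted(glob.glob("c4-validation.*.json.gz")))`, where the file pattern is an assumption about how the shards are named locally.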

@lhoestq lhoestq left a comment

Thanks! I also fixed the validation metadata of the other configurations.

@lhoestq lhoestq merged commit 0a0227f into huggingface:master Jul 28, 2021