Update tagging_app.py #11
Add more size categories in the tagging app.
Cool! Should we change 1T to 1B?
EDIT: actually, it looks like 1B is missing, so we can add it and keep 1T for trillion.
@lhoestq 🙈 I missed it.
lhoestq left a comment
Thanks :)
cc @yjernite can you take a look at these changes?
Looks good to me, although I'm not sure how many datasets we have that fall into any of these categories (the size here is the total number of rows in the table for a single config). @gchhablani, did you have specific datasets in mind?
Hi @yjernite, among the existing datasets there are some above a billion rows and some above 10 million, so I thought finer categories would be better for them. You can check the PR I made on datasets: a lot of the datasets (configs) are between 1M and 10M, and some are above 10M, where they were previously labelled "n>1M". I don't have anything specific in mind for 1T, but I think we can keep it there, just in case something comes along. Up to you and @lhoestq to decide :) I'll change everything back to "n>1M" on the PR if you'd like.
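For reference, here is a rough sketch of how a row count could be mapped onto finer-grained size categories like the ones discussed above. It is not taken from tagging_app.py: the `SIZE_BOUNDS` list, the `size_category` helper, and the exact label strings are assumptions modelled on the "n>1M" style quoted in this thread, and the handling of exact boundaries (for example, exactly 1,000 rows) is an arbitrary choice.

```python
# Hypothetical upper bounds and labels; the real list in the tagging app
# may differ. Labels mimic the "n>1M" style mentioned in the discussion.
SIZE_BOUNDS = [
    (1_000, "n<1K"),
    (10_000, "1K<n<10K"),
    (100_000, "10K<n<100K"),
    (1_000_000, "100K<n<1M"),
    (10_000_000, "1M<n<10M"),
    (100_000_000, "10M<n<100M"),
    (1_000_000_000, "100M<n<1B"),
    (10_000_000_000, "1B<n<10B"),
    (100_000_000_000, "10B<n<100B"),
    (1_000_000_000_000, "100B<n<1T"),
]


def size_category(num_rows: int) -> str:
    """Return a size label for the total number of rows in a single config."""
    if num_rows < 0:
        raise ValueError("num_rows must be non-negative")
    # Pick the first bucket whose (exclusive) upper bound exceeds the count.
    for upper_bound, label in SIZE_BOUNDS:
        if num_rows < upper_bound:
            return label
    # Anything at or above one trillion rows lands in the open-ended bucket.
    return "n>1T"


if __name__ == "__main__":
    for rows in (500, 5_000_000, 2_000_000_000):
        print(rows, size_category(rows))
```

With buckets like these, a config with 5 million rows gets its own "1M<n<10M" label instead of being lumped into a single "n>1M" category.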
* basic validation
* ci script and test change
* color is better
* check all option
* validate size cats & multiling, point to reference file urls on error
* add validation to ci and rename files
* spurrious change to trigger CI
* add qa reqs
* disallow empty lists
* better error msg: show all invalid values rather than first one
* some code shuffling & better error msg for langcodes
* add pyyaml to qa reqs
* fix package file loading
* include json resources
* reflect changes to size cats from huggingface/datasets-tagging#11
* trying another format for package_data
* ci works! fixing the readme like a good citizen 🤗
* escape validation everywhere it's allowed in the tagging app
* code review: more json files, conditional import
* pointers to integrate readme metadata in class (wip)
* no pydantic
* fix docs?
* Revert "fix docs?" This reverts commit ab82a6c.
* remove pointers to add readme to loader
* Get rid of langcodes, some refactor
* Update languages.json
* Refactor, add tests
* I said, tests!!

Co-authored-by: theo <[email protected]>
Co-authored-by: SBrandeis <[email protected]>