Do not sort splits in dataset info #5201
It would be consistent with huggingface/dataset-viewer#614 (comment)
I think we started working on this issue nearly at the same time... 😅
Related issue:
@albertvillanova yeah, I noticed it right after opening the PR 😄 Thank you! Your fix to the dataset info YAML fixes the tests on CI, but in general the order of splits in the YAML determines the order in which they are displayed in the viewer, if I understand correctly. So I suggest not sorting splits in the YAML in the first place, to avoid this for other datasets in the future. I think this change should do that. The changes to the tests here can probably be reverted, since the order in the YAML now matches the one in the tests, thanks to your change to the dataset info.
albertvillanova
left a comment
Thanks for the fix, @polinaeterna.
I agree we should not sort splits alphabetically, but keep them in their original order.
However, I disagree that we should add sorted to our tests: I think we have to test the returned order (see comment below).
tests/test_inspect.py
Outdated
  info = infos[expected_config]
  assert info.config_name == expected_config
- assert list(info.splits.keys()) == expected_splits_in_first_config
+ assert sorted(info.splits.keys()) == sorted(expected_splits_in_first_config)
I'm not sure we want to avoid testing the order: as you, @polinaeterna, and @severo already discussed, splits are not sorted alphabetically.
Therefore, it makes sense to test that the order returned by get_dataset_infos is the expected one.
Anyway, if we do finally decide to sort them, we should also do it in test_get_dataset_config_info, which was failing as well, alongside test_get_dataset_info and test_get_dataset_split_names.
yes, agree with you! reverted sorting in tests a063c6f
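To illustrate the point of this thread, here is a minimal sketch of why wrapping both sides of the assertion in sorted would hide an ordering regression. The split names and the two lists are illustrative assumptions, not values taken from the actual test suite.

```python
# Illustrative values only: the expected order and a result that was
# reordered alphabetically somewhere along the way.
expected_splits = ["train", "validation", "test"]
returned_splits = ["test", "train", "validation"]

# Order-insensitive comparison: passes even though the order differs.
order_insensitive_ok = sorted(returned_splits) == sorted(expected_splits)

# Order-sensitive comparison: detects the reordering.
order_sensitive_ok = list(returned_splits) == expected_splits

print(order_insensitive_ok)  # True
print(order_sensitive_ok)    # False
```

Only the order-sensitive form would fail if the returned order of the splits changed.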
Hehe, @polinaeterna, we make comments nearly at the same time as well... 😆
This reverts commit fe51b19.
albertvillanova
left a comment
Thanks for the suggested improvement.
Just a comment below.
src/datasets/splits.py
Outdated
  def get_list_sliced_split_info(self):
-     return list(sorted(self._splits.values(), key=lambda x: x.split_info.name))
+     return self._splits.values()
I think this is not used anywhere else... But, just in case, would it be better to return a list here?
didn't find any usages either... applied your suggestion 6e10c33, thank you!
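For context on the "return a list" suggestion, the sketch below shows how a dict values view differs from a list built from it. SplitCollection and FakeSplit are hypothetical stand-ins written for this example; they are not the classes in src/datasets/splits.py.

```python
from collections import namedtuple

# Hypothetical stand-in for a split object; only the name is modeled.
FakeSplit = namedtuple("FakeSplit", ["name"])


class SplitCollection:
    """Hypothetical container keeping splits in insertion order."""

    def __init__(self):
        self._splits = {}  # dicts preserve insertion order (Python 3.7+)

    def add(self, split):
        self._splits[split.name] = split

    def as_view(self):
        # Returns a live dict view: not indexable, tracks later changes.
        return self._splits.values()

    def as_list(self):
        # Returns an independent list snapshot in insertion order.
        return list(self._splits.values())


collection = SplitCollection()
for name in ("train", "validation", "test"):
    collection.add(FakeSplit(name))

view = collection.as_view()
snapshot = collection.as_list()
collection.add(FakeSplit("extra"))  # mutate after both were obtained

snapshot_names = [s.name for s in snapshot]
print(len(view))        # 4, the view reflects the later insertion
print(snapshot_names)   # ['train', 'validation', 'test']
```

A list is a stable, indexable snapshot, which is usually what callers of a method with "list" in its name would expect.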
I suggest not sorting splits by name in dataset_info in the README, so that they are displayed in the order specified in the loading script. Otherwise the test split is displayed first; see this repo: https://huggingface.co/datasets/paws
What do you think?
But I added sorting in tests to fix CI (for the same dataset).
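A small sketch of the effect described above, assuming a loading script that declares its splits in the order train, validation, test (the names are illustrative):

```python
# Illustrative split names, in the order a loading script might declare them.
declared_order = ["train", "validation", "test"]

# Sorting by name moves "test" to the front, which is why it would be
# displayed first.
sorted_order = sorted(declared_order)

print(declared_order)  # ['train', 'validation', 'test']
print(sorted_order)    # ['test', 'train', 'validation']
```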