Skip to content

Conversation

@polinaeterna
Copy link
Contributor

@polinaeterna polinaeterna commented Nov 4, 2022

I suggest not to sort splits by their names in dataset_info in README so that they are displayed in the order specified in the loading script. Otherwise test split is displayed first, see this repo: https://huggingface.co/datasets/paws
What do you think?

But I added sorting in tests to fix CI (for the same dataset).

@polinaeterna polinaeterna changed the title Do not sort splits in Splits info Do not sort splits in dataset info Nov 4, 2022
@polinaeterna polinaeterna requested a review from lhoestq November 4, 2022 10:48
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Nov 4, 2022

The documentation is not available anymore as the PR was closed or merged.

@severo
Copy link
Collaborator

severo commented Nov 4, 2022

It would be coherent with huggingface/dataset-viewer#614 (comment)

@albertvillanova
Copy link
Member

albertvillanova commented Nov 4, 2022

I think we started working on this issue nearly at the same time... 😅

Related issue:

@polinaeterna
Copy link
Contributor Author

@albertvillanova yeah I noticed it right after the PR 😄 thank you! the fix of the dataset info yaml fixes tests on CI, but in general order of splits in yaml influences the order in which they are displayed in the viewer, if I understand it correctly. So I suggest not to sort splits in yaml initially to avoid this for other datasets in the future. I think this change should work for it.

Changes to tests here maybe can be reverted considering that order in yaml now corresponds to the one in tests, thanks to your change in the dataset info.

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix, @polinaeterna.

I agree we should not sort splits alphabetically, but keep them in their original order.

However, I disagree we should add sorted to our tests: I think we have to test the returned order (see comment below).

info = infos[expected_config]
assert info.config_name == expected_config
assert list(info.splits.keys()) == expected_splits_in_first_config
assert sorted(info.splits.keys()) == sorted(expected_splits_in_first_config)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want to avoid testing the order: as already discussed by you @polinaeterna and @severo, splits are not alphabetically sorted.

Therefore, it makes sense to test that the order returned by get_dataset_infos is the expected one.

Anyway, if finally we decide to sort them, we should do it also in test_get_dataset_config_info, which was also failing besides test_get_dataset_info and test_get_dataset_split_names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, agree with you! reverted sorting in tests a063c6f

@albertvillanova
Copy link
Member

Hehe, @polinaeterna, we make comments nearly at the same time as well... 😆

This reverts commit fe51b19.
Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggested improvement.

Just a comment below.


def get_list_sliced_split_info(self):
return list(sorted(self._splits.values(), key=lambda x: x.split_info.name))
return self._splits.values()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not used anywhere else... But, just in case, better returning a list here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't find any usages either... applied your suggestion 6e10c33, thank you!

@polinaeterna polinaeterna marked this pull request as ready for review November 4, 2022 14:03
@polinaeterna polinaeterna merged commit 18e1e14 into huggingface:main Nov 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants