The order of data split names is nondeterministic #5728

@albertvillanova

After this CI error: https://github.com/huggingface/datasets/actions/runs/4639528358/jobs/8210492953?pr=5718

FAILED tests/test_data_files.py::test_get_data_files_patterns[data_file_per_split4] - AssertionError: assert ['random', 'train'] == ['train', 'random']
  At index 0 diff: 'random' != 'train'
  Full diff:
  - ['train', 'random']
  + ['random', 'train']

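I have checked locally and found that the order of the data split names is nondeterministic.

This is caused by the use of a `set` for sharded splits: with Python's string hash randomization, the iteration order of a set of split names can differ from one run to the next.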
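
To illustrate the cause, here is a minimal sketch, not the actual `datasets` code. The helper names and the shard file names are hypothetical, and the split name is assumed to be the part of the file name before the first `-`. It contrasts deduplicating through a `set` with an order-preserving alternative based on `dict.fromkeys`:

```python
def split_names_with_set(filenames):
    # Hypothetical helper: deduplicates split names through a set, so the
    # resulting order depends on string hashing and may change between runs.
    return list({name.split("-")[0] for name in filenames})


def split_names_ordered(filenames):
    # dicts preserve insertion order (Python 3.7+), so duplicates are dropped
    # while the first-seen order of the split names is kept.
    return list(dict.fromkeys(name.split("-")[0] for name in filenames))


# Hypothetical sharded data files for two splits.
shards = [
    "train-00000-of-00002.txt",
    "train-00001-of-00002.txt",
    "random-00000-of-00002.txt",
    "random-00001-of-00002.txt",
]

unordered = split_names_with_set(shards)
ordered = split_names_ordered(shards)

print(sorted(unordered))  # ['random', 'train'] (only stable after sorting)
print(ordered)            # ['train', 'random']
```

Running the set-based variant under different `PYTHONHASHSEED` values may produce either order, which would match the flaky assertion above. The order-preserving variant is one possible way to keep the order stable; whether it suits the library's own code is an assumption.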

Labels

bug (Something isn't working)
