Dataset order is not deterministic with ZIP archives and iter_files #5145

@fxmarty

Describe the bug

For the beans dataset (I did not try others), the order of samples differs between machines. I tested on my local laptop, a GitHub Actions runner, and an EC2 instance, and each of the three yields a different order.

Steps to reproduce the bug

In a clean Docker container or conda environment with `datasets==2.6.1`, run:

from datasets import load_dataset
from pprint import pprint

data = load_dataset("beans", split="validation")

pprint(data["image_file_path"])
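
To compare machines without reading the printed lists side by side, the order can be reduced to a short digest. This is only a sketch: `order_fingerprint` is a hypothetical helper, not part of `datasets`, and it assumes each path ends in `<label>/<filename>` so that the machine-specific cache prefix can be ignored.

```python
import hashlib


def order_fingerprint(paths):
    """Return a SHA-256 hex digest that depends on the items and on their order.

    Hypothetical helper: only the last two path components are hashed
    (assumed to be <label>/<filename>), so differing cache directories
    on different machines do not change the result.
    """
    digest = hashlib.sha256()
    for path in paths:
        # Normalize separators, then keep only the trailing two components.
        key = "/".join(path.replace("\\", "/").split("/")[-2:])
        # A newline terminator keeps adjacent keys from running together.
        digest.update(key.encode("utf-8") + b"\n")
    return digest.hexdigest()
```

Calling `order_fingerprint(data["image_file_path"])` on each machine should then give identical digests only if the samples come back in the same order.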

Expected behavior

The order of the images is the same on all machines.
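
I have not confirmed how `iter_files` traverses the extracted archive. If it relies on `os.walk`, though, that alone could explain the difference, since `os.walk` returns directory entries in whatever order the filesystem reports them. A minimal sketch of a deterministic traversal under that assumption (`iter_files_sorted` is a hypothetical name, not the library's function):

```python
import os


def iter_files_sorted(root):
    """Yield file paths under root in a deterministic, lexicographic order."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting in place makes os.walk descend into subdirectories in sorted order.
        dirnames.sort()
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

Sorting both the subdirectory list and the file names is what makes the result independent of the underlying filesystem.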

Environment info

On the EC2 instance:

- `datasets` version: 2.6.1
- Platform: Linux-4.14.291-218.527.amzn2.x86_64-x86_64-with-glibc2.2.5
- Python version: 3.7.10
- PyArrow version: 9.0.0
- Pandas version: 1.3.5
- Numpy version: not checked

On my local laptop:

- `datasets` version: 2.6.1
- Platform: Linux-5.15.0-50-generic-x86_64-with-glibc2.35
- Python version: 3.9.12
- PyArrow version: 7.0.0
- Pandas version: 1.3.5
- Numpy version: 1.23.1

On GitHub Actions:

- `datasets` version: 2.6.1
- Platform: Linux-5.15.0-1022-azure-x86_64-with-glibc2.2.5
- Python version: 3.8.14
- PyArrow version: 9.0.0
- Pandas version: 1.5.1
- Numpy version: 1.23.4
