Dataset sharding non-contiguous? #4570

@cakiki

Describe the bug

I'm not sure if this is a bug; it's more likely normal behavior, but I wanted to double-check.
Is it normal that datasets.shard does not produce chunks that, when concatenated, reproduce the original ordering of the sharded dataset?

This might be related to pull request #4466, but I have to admit I did not look into its changes in detail.
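
To illustrate, here is a small example of the selection pattern I am seeing (a minimal sketch, assuming the default contiguous=False behaviour of Dataset.shard):

from datasets import Dataset

ds = Dataset.from_dict({"idx": list(range(10))})

# With the defaults, shard 0 of 3 appears to pick every 3rd row (0, 3, 6, 9)
# rather than a contiguous block at the start of the dataset.
print(ds.shard(num_shards=3, index=0)["idx"])  # -> [0, 3, 6, 9]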

Steps to reproduce the bug

import os

from datasets.utils.py_utils import convert_file_size_to_int  # helper used internally by `datasets` to parse sizes like "300MB"

# `dataset` is the already tokenized Dataset to be written out in ~300MB shards
max_shard_size = convert_file_size_to_int('300MB')
dataset_nbytes = dataset.data.nbytes
num_shards = int(dataset_nbytes / max_shard_size) + 1
num_shards = max(num_shards, 1)
print(f"{num_shards=}")
for shard_index in range(num_shards):
    shard = dataset.shard(num_shards=num_shards, index=shard_index)
    shard.to_parquet(f"tokenized/tokenized-{shard_index:03d}.parquet")
os.listdir('tokenized/')

Expected results

I expected the shards to preserve the order of the data in the original dataset, e.g. dataset[10] being the same as shard_1[10].

Actual results

Only the first element is the same, i.e. dataset[0] is the same as shard_1[0].
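
If this stride-based selection is indeed the intended default, then passing contiguous=True to Dataset.shard seems to give the ordering I expected; a minimal sketch, assuming that parameter behaves as documented:

from datasets import Dataset

ds = Dataset.from_dict({"idx": list(range(10))})

# contiguous=True gives each shard a consecutive block of rows, so concatenating
# the shards in index order reproduces the original row order.
shards = [ds.shard(num_shards=3, index=i, contiguous=True) for i in range(3)]
print([s["idx"] for s in shards])  # -> [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]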

Environment info

  • datasets version: 2.3.2
  • Platform: Linux-4.15.0-176-generic-x86_64-with-glibc2.31
  • Python version: 3.10.4
  • PyArrow version: 8.0.0
  • Pandas version: 1.4.2
