Describe the bug
I'm not sure if this is a bug; it's more likely normal behavior, but I wanted to double-check.
Is it normal that datasets.shard does not produce chunks that, when concatenated, reproduce the original ordering of the sharded dataset?
This might be related to pull request #4466, but I have to admit I did not look closely into the changes it made.
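For context, here is a minimal sketch of the behavior in question. As far as I can tell, the relevant knob is the contiguous flag of Dataset.shard, which defaults to False in this version:

from datasets import Dataset, concatenate_datasets

ds = Dataset.from_dict({"x": list(range(10))})

# Default sharding interleaves rows: shard i holds rows
# i, i + num_shards, i + 2 * num_shards, ...
shards = [ds.shard(num_shards=3, index=i) for i in range(3)]
print([s["x"] for s in shards])  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]

# With contiguous=True, each shard is a consecutive block, so
# concatenating the shards restores the original row order.
contig = [ds.shard(num_shards=3, index=i, contiguous=True) for i in range(3)]
print(concatenate_datasets(contig)["x"] == ds["x"])  # True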
Steps to reproduce the bug
import os
from datasets.utils.py_utils import convert_file_size_to_int

# Aim for roughly 300 MB per shard, based on the in-memory Arrow size.
max_shard_size = convert_file_size_to_int('300MB')
dataset_nbytes = dataset.data.nbytes
num_shards = int(dataset_nbytes / max_shard_size) + 1
num_shards = max(num_shards, 1)
print(f"{num_shards=}")

# Write each shard to its own parquet file.
os.makedirs('tokenized', exist_ok=True)
for shard_index in range(num_shards):
    shard = dataset.shard(num_shards=num_shards, index=shard_index)
    shard.to_parquet(f"tokenized/tokenized-{shard_index:03d}.parquet")
os.listdir('tokenized/')
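To check the ordering, I reload the written shards in filename order and compare rows against the original dataset (a sketch, assuming Dataset.from_parquet is available here):

import glob
from datasets import Dataset, concatenate_datasets

files = sorted(glob.glob("tokenized/tokenized-*.parquet"))
reloaded = concatenate_datasets([Dataset.from_parquet(f) for f in files])
print(reloaded[10] == dataset[10])  # False with the default sharding (only index 0 matches)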
Expected results
I expected the shards to match the order of the data in the original dataset, i.e. dataset[10] being the same as shard_1[10], for example.
Actual results
Only the first element is the same, i.e. dataset[0] is the same as shard_1[0].
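Concretely (shard numbering as in the filenames above, so shard_1 corresponds to index=0), this is a small sketch of the check that fails:

first_shard = dataset.shard(num_shards=num_shards, index=0)
print(first_shard[0] == dataset[0])  # True
print(first_shard[1] == dataset[1])  # False: with default sharding this row is dataset[num_shards]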
Environment info
- datasets version: 2.3.2
- Platform: Linux-4.15.0-176-generic-x86_64-with-glibc2.31
- Python version: 3.10.4
- PyArrow version: 8.0.0
- Pandas version: 1.4.2