
Conversation

lhoestq (Member) commented Jul 26, 2022

Following #4724 (needs to be merged first)

It's good practice to shard Parquet files to enable parallelism with Spark/Dask/etc.

I added the max_shard_size parameter to download_and_prepare (defaults to 500MB for Parquet, and None for Arrow):

```python
from datasets import load_dataset_builder

output_dir = "./output_dir"  # also supports "s3://..."
builder = load_dataset_builder("squad")
builder.download_and_prepare(output_dir, file_format="parquet", max_shard_size="5MB")
```
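Note that max_shard_size accepts a human-readable size string like "5MB" as well as a plain byte count. A minimal sketch of how such a string could be turned into bytes (parse_size is a hypothetical helper for illustration, not the actual datasets utility):

```python
def parse_size(size):
    """Convert a human-readable size ("5MB", "500MB", ...) or a plain int
    into a byte count. Illustrative sketch only; the real conversion is
    handled internally by the `datasets` library."""
    if isinstance(size, int):
        return size
    units = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
    size = size.strip().upper()
    for suffix, factor in units.items():
        if size.endswith(suffix):
            # e.g. "5MB" -> 5 * 10**6
            return int(float(size[: -len(suffix)]) * factor)
    return int(size)  # bare number passed as a string
```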

Implementation details

The examples are written to a parquet file until `ParquetWriter._num_bytes > max_shard_size`. When this happens, a new writer is instantiated to start writing the next shard. At the end, all the shards are renamed to include the total number of shards in their names: `{builder.name}-{split}-{shard_id:05d}-of-{num_shards:05d}.parquet`
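The rotation-then-rename idea above can be sketched in plain Python (a simplified illustration, not the actual datasets internals: write_sharded, byte-string records, and the in-memory buffering are all made up for the sketch):

```python
import os


def write_sharded(records, output_dir, prefix, max_shard_size):
    """Sketch of shard rotation: start a new shard whenever the running
    byte count of the current shard exceeds max_shard_size, then name
    every shard with the total shard count in a final pass."""
    os.makedirs(output_dir, exist_ok=True)
    shards, current, current_bytes = [], [], 0
    for rec in records:
        if current and current_bytes > max_shard_size:
            shards.append(current)       # close the current shard
            current, current_bytes = [], 0
        current.append(rec)
        current_bytes += len(rec)
    if current:
        shards.append(current)
    # Final renaming pass: num_shards is only known once writing is done.
    num_shards = len(shards)
    paths = []
    for shard_id, shard in enumerate(shards):
        path = os.path.join(
            output_dir, f"{prefix}-{shard_id:05d}-of-{num_shards:05d}.parquet"
        )
        with open(path, "wb") as f:
            f.write(b"".join(shard))
        paths.append(path)
    return paths
```

The real implementation writes Arrow record batches through a ParquetWriter instead of buffering raw bytes, but the control flow (check size, rotate, rename at the end) is the same.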

I also added the MAX_SHARD_SIZE config variable (defaulting to 500MB).

TODO:

  • docstrings
  • docs
  • tests

cc @severo

HuggingFaceDocBuilderDev commented Jul 26, 2022

The documentation is not available anymore as the PR was closed or merged.

Base automatically changed from dl-and-pp-as-parquet to main September 5, 2022 17:25
@lhoestq lhoestq marked this pull request as ready for review September 6, 2022 13:38
@lhoestq lhoestq requested a review from mariosasko September 12, 2022 17:28
lhoestq (Member, Author) commented Sep 12, 2022

This is ready for review cc @mariosasko :) Please let me know what you think!

mariosasko (Collaborator) left a comment:

Thanks!

```python
        disable=not logging.is_progress_bar_enabled(),
        desc=f"Generating {split_info.name} split",
    ):
        if max_shard_size is not None and writer._num_bytes > max_shard_size:
```
mariosasko (Collaborator) commented:
The final shard size can easily be off (significantly) using this logic, no? By default, `writer._num_bytes` is only updated every 10k examples (`WRITER_BATCH_SIZE`), so if the `max_shard_size` is small...

lhoestq (Member, Author) replied:

I think right now it's the builder's responsibility to specify whether 10,000 is too much (via the DEFAULT_WRITER_BATCH_SIZE class attribute).

Though I agree this is not ideal. I think it's OK to have it this way in this PR (since it would only be off if the max_shard_size is very, very small), but it would be nice to have something smarter in general.
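Some illustrative arithmetic for the granularity issue discussed in this thread (the numbers besides the 10k default are hypothetical): since the size check only runs after a full write batch, a shard can overshoot the target by up to one batch's worth of bytes.

```python
# Hypothetical worked example of the size-check granularity issue:
# the check `writer._num_bytes > max_shard_size` only runs after each
# write batch, so a shard can overshoot by up to one batch of bytes.
WRITER_BATCH_SIZE = 10_000    # examples per write batch (the 10k default)
bytes_per_example = 2_000     # assumed average example size
max_shard_size = 5_000_000    # a 5MB target, as in the usage example

batch_bytes = WRITER_BATCH_SIZE * bytes_per_example  # 20_000_000
# The first check happens only after a full batch is written, so the
# shard already holds 20MB -- 4x the requested 5MB cap.
overshoot = batch_bytes - max_shard_size
print(overshoot)
```

Lowering the builder's DEFAULT_WRITER_BATCH_SIZE shrinks this worst-case overshoot proportionally, which is why the granularity is left as the builder's responsibility in this PR.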

mariosasko (Collaborator) left a comment:

One nit. Besides that, all looks good!

@lhoestq lhoestq merged commit 38c8c72 into main Sep 15, 2022
@lhoestq lhoestq deleted the shard-parquet branch September 15, 2022 13:41