-
Notifications
You must be signed in to change notification settings - Fork 3k
Shard parquet in download_and_prepare
#4747
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The documentation is not available anymore as the PR was closed or merged. |
Co-authored-by: Mario Šaško <[email protected]>
|
This is ready for review cc @mariosasko :) please let me know what you think ! |
mariosasko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
| disable=not logging.is_progress_bar_enabled(), | ||
| desc=f"Generating {split_info.name} split", | ||
| ): | ||
| if max_shard_size is not None and writer._num_bytes > max_shard_size: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The final shard size can easily be off (significantly) using this logic, no? By default, writer._num_bytes is only updated every 10k examples (WRITER_BATCH_SIZE), so if the max_shard_size is small...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think right now this is the builder's responsibility to specify if 10,000 is too much (DEFAULT_WRITER_BATCH_SIZE class attribute).
Though I agree this is not ideal. I think it's ok to have it this way in this PR (since it would be off if the max_shard_size is very very small), but it would be nice to have something smarter in general
Co-authored-by: Mario Šaško <[email protected]>
mariosasko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One nit. Besides that, all looks good!
Co-authored-by: Mario Šaško <[email protected]>
Following #4724 (needs to be merged first)
It's good practice to shard parquet files to enable parallelism with spark/dask/etc.
I added the
max_shard_sizeparameter todownload_and_prepare(default to 500MB for parquet, and None for arrow).Implementation details
The examples are written to a parquet file until
ParquetWriter._num_bytes > max_shard_size. When this happens, a new writer is instantiated to start writing the next shard. At the end, all the shards are renamed to include the total number of shards in their names:{builder.name}-{split}-{shard_id:05d}-of-{num_shards:05d}.parquetI also added the
MAX_SHARD_SIZEconfig variable (default to 500MB)TODO:
cc @severo