
Conversation

lhoestq (Member) commented Jul 26, 2022

Following #4724 (needs to be merged first)

It's good practice to shard Parquet files to enable parallelism with Spark/Dask/etc.

I added the max_shard_size parameter to download_and_prepare (defaults to 500MB for Parquet, and None for Arrow):

```python
from datasets import load_dataset_builder

output_dir = "./output_dir"  # also supports "s3://..."
builder = load_dataset_builder("squad")
builder.download_and_prepare(output_dir, file_format="parquet", max_shard_size="5MB")
```
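Note that max_shard_size accepts a human-readable size string like "5MB" as well as a plain byte count. A minimal sketch of how such a string could be turned into bytes (parse_size is a hypothetical helper for illustration, not the actual datasets utility):

```python
def parse_size(size):
    """Convert a human-readable size ("5MB", "500MB", ...) or a plain int
    into a byte count. Illustrative sketch only; the real conversion is
    handled internally by the `datasets` library."""
    if isinstance(size, int):
        return size
    units = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
    size = size.strip().upper()
    for suffix, factor in units.items():
        if size.endswith(suffix):
            # e.g. "5MB" -> 5 * 10**6
            return int(float(size[: -len(suffix)]) * factor)
    return int(size)  # bare number passed as a string
```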

Implementation details

The examples are written to a parquet file until `ParquetWriter._num_bytes > max_shard_size`. When this happens, a new writer is instantiated to start writing the next shard. At the end, all the shards are renamed to include the total number of shards in their names: `{builder.name}-{split}-{shard_id:05d}-of-{num_shards:05d}.parquet`
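The rotation-then-rename idea above can be sketched in plain Python (a simplified illustration, not the actual datasets internals: write_sharded, byte-string records, and the in-memory buffering are all made up for the sketch):

```python
import os


def write_sharded(records, output_dir, prefix, max_shard_size):
    """Sketch of shard rotation: start a new shard whenever the running
    byte count of the current shard exceeds max_shard_size, then name
    every shard with the total shard count in a final pass."""
    os.makedirs(output_dir, exist_ok=True)
    shards, current, current_bytes = [], [], 0
    for rec in records:
        if current and current_bytes > max_shard_size:
            shards.append(current)       # close the current shard
            current, current_bytes = [], 0
        current.append(rec)
        current_bytes += len(rec)
    if current:
        shards.append(current)
    # Final renaming pass: num_shards is only known once writing is done.
    num_shards = len(shards)
    paths = []
    for shard_id, shard in enumerate(shards):
        path = os.path.join(
            output_dir, f"{prefix}-{shard_id:05d}-of-{num_shards:05d}.parquet"
        )
        with open(path, "wb") as f:
            f.write(b"".join(shard))
        paths.append(path)
    return paths
```

The real implementation writes Arrow record batches through a ParquetWriter instead of buffering raw bytes, but the control flow (check size, rotate, rename at the end) is the same.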

I also added the MAX_SHARD_SIZE config variable (defaulting to 500MB).

TODO:

  • docstrings
  • docs
  • tests

cc @severo

HuggingFaceDocBuilderDev commented Jul 26, 2022

The documentation is not available anymore as the PR was closed or merged.

Base automatically changed from dl-and-pp-as-parquet to main September 5, 2022 17:25
@lhoestq lhoestq marked this pull request as ready for review September 6, 2022 13:38
@lhoestq lhoestq requested a review from mariosasko September 12, 2022 17:28
lhoestq (Member, Author) commented Sep 12, 2022

This is ready for review cc @mariosasko :) Please let me know what you think!

mariosasko (Collaborator) left a comment:

Thanks!

```python
        disable=not logging.is_progress_bar_enabled(),
        desc=f"Generating {split_info.name} split",
    ):
        if max_shard_size is not None and writer._num_bytes > max_shard_size:
```
mariosasko (Collaborator) commented:
The final shard size can easily be off (significantly) using this logic, no? By default, `writer._num_bytes` is only updated every 10k examples (`WRITER_BATCH_SIZE`), so if the `max_shard_size` is small...

lhoestq (Member, Author) replied:

I think right now it's the builder's responsibility to specify whether 10,000 is too much (via the DEFAULT_WRITER_BATCH_SIZE class attribute).

Though I agree this is not ideal. I think it's OK to have it this way in this PR (since it would only be off if the max_shard_size is very, very small), but it would be nice to have something smarter in general.
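Some illustrative arithmetic for the granularity issue discussed in this thread (the numbers besides the 10k default are hypothetical): since the size check only runs after a full write batch, a shard can overshoot the target by up to one batch's worth of bytes.

```python
# Hypothetical worked example of the size-check granularity issue:
# the check `writer._num_bytes > max_shard_size` only runs after each
# write batch, so a shard can overshoot by up to one batch of bytes.
WRITER_BATCH_SIZE = 10_000    # examples per write batch (the 10k default)
bytes_per_example = 2_000     # assumed average example size
max_shard_size = 5_000_000    # a 5MB target, as in the usage example

batch_bytes = WRITER_BATCH_SIZE * bytes_per_example  # 20_000_000
# The first check happens only after a full batch is written, so the
# shard already holds 20MB -- 4x the requested 5MB cap.
overshoot = batch_bytes - max_shard_size
print(overshoot)
```

Lowering the builder's DEFAULT_WRITER_BATCH_SIZE shrinks this worst-case overshoot proportionally, which is why the granularity is left as the builder's responsibility in this PR.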

mariosasko (Collaborator) left a comment:

One nit. Besides that, all looks good!

@lhoestq lhoestq merged commit 38c8c72 into main Sep 15, 2022
@lhoestq lhoestq deleted the shard-parquet branch September 15, 2022 13:41