Fix multiprocessing with spawn in iterable datasets #6165

bruno-hays · 2023-08-22T10:07:23Z

On macOS and Windows, multiprocessing uses the "spawn" start method by default, whereas Linux uses "fork".

With spawn, the dataset has to be pickled before it is sent to each worker, so some IterableDataset methods break when used in a DataLoader with num_workers greater than 0.

I fixed the issue by replacing the lambdas and locally defined functions, which are not picklable.
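To illustrate why lambdas break under spawn, here is a minimal sketch using only the standard library. The helper names (`_rename_columns`, `make_rename_fn_with_lambda`, `make_rename_fn_with_partial`) are hypothetical and are not the actual code from this PR; the point is only that a `functools.partial` over a module-level function survives pickling, while a lambda created inside a function does not.

```python
import pickle
from functools import partial


def _rename_columns(example, column_mapping):
    """Hypothetical module-level helper: return a copy of `example` with keys renamed."""
    return {column_mapping.get(key, key): value for key, value in example.items()}


def make_rename_fn_with_lambda(column_mapping):
    # A lambda created here is a local object: pickle stores functions by
    # qualified name and cannot look this one up again, so dumping it fails.
    return lambda example: _rename_columns(example, column_mapping)


def make_rename_fn_with_partial(column_mapping):
    # A partial pickles fine as long as the wrapped function is defined at
    # module level and the bound arguments are themselves picklable.
    return partial(_rename_columns, column_mapping=column_mapping)


mapping = {"review": "review1"}

try:
    pickle.dumps(make_rename_fn_with_lambda(mapping))
    lambda_is_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_is_picklable = False

# Round-trip the partial through pickle, as a spawned worker would receive it.
restored = pickle.loads(pickle.dumps(make_rename_fn_with_partial(mapping)))
renamed = restored({"review": "great", "star": 5})
```

With fork, the worker inherits the parent's memory and nothing is pickled, which is presumably why the problem only shows up with spawn.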

See the example below:

from datasets import load_dataset
from torch.utils.data import DataLoader


if __name__ == "__main__":
    dataset = load_dataset("lhoestq/demo1", split="train")
    dataset = dataset.to_iterable_dataset(num_shards=3)

    dataset = dataset.remove_columns(["package_name"])
    dataset = dataset.rename_columns({
        "review": "review1"
    })
    dataset = dataset.rename_column("date", "date1")
    for sample in DataLoader(dataset, batch_size=None, num_workers=3):
        print(sample)

To see the effect of the fix on a Linux system, add these lines to force the spawn method:

import multiprocessing
multiprocessing.set_start_method('spawn')
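As a small sketch of where those two lines could go: the Python documentation recommends calling `set_start_method` at most once and inside the `if __name__ == "__main__":` guard, because spawned workers re-import the main module. The `use_spawn` wrapper and the `force=True` argument below are assumptions made for this illustration, not part of the PR.

```python
import multiprocessing


def use_spawn():
    """Select the "spawn" start method and return the method now in effect."""
    # force=True is an assumption for this sketch: it lets the call succeed
    # even if a start method was already chosen earlier in the process.
    multiprocessing.set_start_method("spawn", force=True)
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    # Keep the call inside the guard so it runs in the parent process only,
    # not again when a spawned worker re-imports this module.
    print(use_spawn())
```

If changing the global start method is not desirable, PyTorch's DataLoader also accepts a `multiprocessing_context="spawn"` argument, which should have the same effect for that loader only.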

I also removed what looks like code duplication between rename_columns and rename_column.

HuggingFaceDocBuilderDev · 2023-08-22T10:13:01Z

The documentation is not available anymore as the PR was closed or merged.

bruno-hays · 2023-08-22T11:53:49Z

@lhoestq
A test is failing, but I don't think it is due to my changes

lhoestq · 2023-08-22T12:57:03Z

Good catch! Could you add a test to make sure transformed IterableDataset objects are still picklable?

Something like test_pickle_after_many_transforms in test_iterable_dataset.py that does a bunch of rename, map, take on a dataset, then checks that the dataset can be pickled at the end and that the reloaded dataset returns the same elements.
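The shape of such a round-trip test can be sketched with the standard library alone. `ToyIterable` below is a hypothetical stand-in for a lazily transformed iterable dataset, and `_rename` and `_add_length` are made-up helpers; none of this is the test that was actually added to test_iterable_dataset.py, it only shows the pattern of transforming, pickling, reloading, and comparing.

```python
import pickle
from functools import partial
from itertools import islice


def _rename(example, old, new):
    # Module-level helper so that the stored transform stays picklable.
    return {(new if key == old else key): value for key, value in example.items()}


def _add_length(example):
    return {**example, "length": len(example["text"])}


class ToyIterable:
    """Hypothetical stand-in for a lazily transformed iterable dataset."""

    def __init__(self, rows, transforms=(), limit=None):
        self.rows = list(rows)
        self.transforms = tuple(transforms)
        self.limit = limit

    def map(self, fn):
        return ToyIterable(self.rows, self.transforms + (fn,), self.limit)

    def rename_column(self, old, new):
        return self.map(partial(_rename, old=old, new=new))

    def take(self, n):
        return ToyIterable(self.rows, self.transforms, n)

    def __iter__(self):
        # Transforms are applied lazily, one row at a time.
        for row in islice(self.rows, self.limit):
            for fn in self.transforms:
                row = fn(row)
            yield row


def test_pickle_after_many_transforms():
    dataset = ToyIterable({"id": i, "text": "x" * i} for i in range(10))
    dataset = dataset.map(_add_length).rename_column("text", "content").take(3)
    # Pickle round trip, as happens when a spawned worker receives the dataset.
    reloaded = pickle.loads(pickle.dumps(dataset))
    assert list(reloaded) == list(dataset)
    return list(reloaded)
```

Comparing the reloaded elements with the original ones matters as much as the pickling itself, since a transform could in principle pickle successfully and still come back with different behavior.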

bruno-hays · 2023-08-25T12:50:55Z

@lhoestq
I added the test and fixed one last method

lhoestq

Awesome, thanks!

github-actions · 2023-08-29T13:27:13Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006537 / 0.011353 (-0.004816)	0.003960 / 0.011008 (-0.007048)	0.085135 / 0.038508 (0.046627)	0.079271 / 0.023109 (0.056162)	0.383743 / 0.275898 (0.107845)	0.414622 / 0.323480 (0.091143)	0.004202 / 0.007986 (-0.003784)	0.003537 / 0.004328 (-0.000791)	0.065758 / 0.004250 (0.061508)	0.054225 / 0.037052 (0.017173)	0.395715 / 0.258489 (0.137226)	0.438985 / 0.293841 (0.145144)	0.030590 / 0.128546 (-0.097956)	0.008754 / 0.075646 (-0.066892)	0.288415 / 0.419271 (-0.130857)	0.051863 / 0.043533 (0.008330)	0.382501 / 0.255139 (0.127363)	0.414428 / 0.283200 (0.131228)	0.024084 / 0.141683 (-0.117599)	1.478726 / 1.452155 (0.026572)	1.544763 / 1.492716 (0.052047)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.285143 / 0.018006 (0.267136)	0.603859 / 0.000490 (0.603369)	0.004330 / 0.000200 (0.004131)	0.000108 / 0.000054 (0.000054)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027856 / 0.037411 (-0.009555)	0.081963 / 0.014526 (0.067437)	0.104106 / 0.176557 (-0.072451)	0.151378 / 0.737135 (-0.585757)	0.096476 / 0.296338 (-0.199862)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.402938 / 0.215209 (0.187729)	4.042312 / 2.077655 (1.964657)	2.068421 / 1.504120 (0.564301)	1.877870 / 1.541195 (0.336675)	1.947643 / 1.468490 (0.479153)	0.482031 / 4.584777 (-4.102746)	3.554747 / 3.745712 (-0.190965)	3.307811 / 5.269862 (-1.962050)	2.082886 / 4.565676 (-2.482791)	0.056853 / 0.424275 (-0.367422)	0.007535 / 0.007607 (-0.000072)	0.483694 / 0.226044 (0.257649)	4.827906 / 2.268929 (2.558978)	2.567572 / 55.444624 (-52.877052)	2.167206 / 6.876477 (-4.709271)	2.414442 / 2.142072 (0.272369)	0.579472 / 4.805227 (-4.225755)	0.132976 / 6.500664 (-6.367688)	0.059315 / 0.075469 (-0.016154)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.260086 / 1.841788 (-0.581702)	19.438297 / 8.074308 (11.363989)	14.188161 / 10.191392 (3.996769)	0.168534 / 0.680424 (-0.511890)	0.018070 / 0.534201 (-0.516131)	0.394241 / 0.579283 (-0.185043)	0.411057 / 0.434364 (-0.023307)	0.461123 / 0.540337 (-0.079215)	0.626844 / 1.386936 (-0.760092)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006896 / 0.011353 (-0.004457)	0.004207 / 0.011008 (-0.006801)	0.064981 / 0.038508 (0.026473)	0.080261 / 0.023109 (0.057152)	0.399403 / 0.275898 (0.123505)	0.433099 / 0.323480 (0.109619)	0.005697 / 0.007986 (-0.002288)	0.003601 / 0.004328 (-0.000728)	0.065924 / 0.004250 (0.061673)	0.058868 / 0.037052 (0.021815)	0.403705 / 0.258489 (0.145216)	0.439218 / 0.293841 (0.145377)	0.032789 / 0.128546 (-0.095757)	0.008675 / 0.075646 (-0.066971)	0.071217 / 0.419271 (-0.348055)	0.048487 / 0.043533 (0.004954)	0.399878 / 0.255139 (0.144739)	0.412816 / 0.283200 (0.129616)	0.023905 / 0.141683 (-0.117778)	1.541402 / 1.452155 (0.089247)	1.588080 / 1.492716 (0.095364)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.322863 / 0.018006 (0.304856)	0.530291 / 0.000490 (0.529802)	0.004862 / 0.000200 (0.004662)	0.000097 / 0.000054 (0.000042)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032697 / 0.037411 (-0.004715)	0.092416 / 0.014526 (0.077891)	0.107355 / 0.176557 (-0.069201)	0.160217 / 0.737135 (-0.576918)	0.109286 / 0.296338 (-0.187052)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.437375 / 0.215209 (0.222166)	4.362644 / 2.077655 (2.284990)	2.335404 / 1.504120 (0.831284)	2.173215 / 1.541195 (0.632020)	2.254061 / 1.468490 (0.785571)	0.493906 / 4.584777 (-4.090871)	3.609025 / 3.745712 (-0.136687)	3.352380 / 5.269862 (-1.917481)	2.074185 / 4.565676 (-2.491492)	0.057863 / 0.424275 (-0.366412)	0.007297 / 0.007607 (-0.000310)	0.512464 / 0.226044 (0.286420)	5.135921 / 2.268929 (2.866993)	2.788889 / 55.444624 (-52.655736)	2.479097 / 6.876477 (-4.397379)	2.717848 / 2.142072 (0.575776)	0.590442 / 4.805227 (-4.214785)	0.133721 / 6.500664 (-6.366943)	0.061491 / 0.075469 (-0.013978)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.429564 / 1.841788 (-0.412224)	20.628733 / 8.074308 (12.554425)	15.299571 / 10.191392 (5.108179)	0.171032 / 0.680424 (-0.509392)	0.019995 / 0.534201 (-0.514206)	0.401283 / 0.579283 (-0.178000)	0.416504 / 0.434364 (-0.017860)	0.471219 / 0.540337 (-0.069118)	0.641299 / 1.386936 (-0.745637)

* fixed remove columns and rename columns
* fixed rename column, removed code duplication
* linting
* typo
* added pickle test
* fixed rename column not being picklable
* linting
* added verif that the pickling process does not change the data

---------

Co-authored-by: Bruno Hays <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>

Bruno Hays added 3 commits August 22, 2023 11:50

fixed remove columns and rename columns

b4f5537

fixed rename column, removed code duplication

260f6fe

linting

b9df254

Bruno Hays added 5 commits August 25, 2023 09:53

typo

cb5e4fe

added pickle test

db8c18f

fixed rename column not being picklable

25e357e

linting

0a92baf

added verif that the pickling process does not change the data

ddc3c3a

Merge branch 'main' into fix_multiprocessing_with_spawn_in_iterable_d…

4f718f0

…atasets

lhoestq approved these changes Aug 29, 2023

lhoestq merged commit 5503e7b into huggingface:main Aug 29, 2023
