Fix multiprocessing with spawn in iterable datasets #6165
@lhoestq
Good catch! Could you add a test to make sure transformed IterableDataset objects are still picklable? Something like
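A test along these lines could look like the sketch below. It is not the test that was added in this PR: `MappedExamples` and `rename_key` are hypothetical stand-ins written with the standard library only, so that the round-trip check can be run without installing anything.

```python
import pickle
from functools import partial


def rename_key(example, old, new):
    """Return a copy of `example` with key `old` renamed to `new`."""
    return {(new if k == old else k): v for k, v in example.items()}


class MappedExamples:
    """Hypothetical stand-in for a transformed iterable dataset."""

    def __init__(self, examples, function):
        self.examples = examples
        self.function = function

    def __iter__(self):
        for example in self.examples:
            yield self.function(example)


def test_transformed_iterable_is_picklable():
    dataset = MappedExamples(
        [{"text": "a"}, {"text": "b"}],
        # A partial over a module-level function can be pickled; a lambda cannot.
        partial(rename_key, old="text", new="content"),
    )
    restored = pickle.loads(pickle.dumps(dataset))
    # Pickling must neither fail nor change the data.
    assert list(restored) == list(dataset)


test_transformed_iterable_is_picklable()
```

The same pattern should carry over to a real IterableDataset: apply the transform, round-trip through `pickle`, then compare the examples yielded before and after.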
@lhoestq |
lhoestq left a comment:
Awesome thanks !
* fixed remove columns and rename columns
* fixed rename column, removed code duplication
* linting
* typo
* added pickle test
* fixed rename column not being picklable
* linting
* added verif that the pickling process does not change the data

---------

Co-authored-by: Bruno Hays <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>

The "spawn" start method is used for multiprocessing on macOS and Windows, whereas Linux uses "fork".
With spawn, the dataset has to be pickled to be sent to the worker processes, which causes some IterableDataset methods to break when using a DataLoader with more than 0 workers.
I fixed the issue by replacing the lambdas and locally defined functions, which are not picklable.
See the example below:
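As a minimal sketch of the failure mode (not the snippet from this PR, and not the library's actual code): a function defined inside another function cannot be pickled, while a `functools.partial` over a module-level function can. The names `drop_columns`, `make_remover_with_closure` and `make_remover_with_partial` are hypothetical.

```python
import pickle
from functools import partial


def is_picklable(obj) -> bool:
    """Return True if `obj` can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def drop_columns(example, column_names):
    """Module-level function: pickled by reference, so it survives spawn."""
    return {k: v for k, v in example.items() if k not in column_names}


def make_remover_with_closure(column_names):
    # Local function: pickle cannot look it up by name, so pickling fails.
    def remove(example):
        return drop_columns(example, column_names)

    return remove


def make_remover_with_partial(column_names):
    # The partial stores a reference to drop_columns plus its arguments.
    return partial(drop_columns, column_names=column_names)


broken = make_remover_with_closure(["label"])
fixed = make_remover_with_partial(["label"])
print(is_picklable(broken), is_picklable(fixed))
```

Both versions behave the same when called directly; the difference only shows up once the object has to be sent to a spawned worker process.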
To reproduce the issue (and notice the fix) on a Linux system, adding these lines should do the trick:
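The lines in question presumably switch the start method to spawn. A minimal sketch with the standard library, assuming the switch is made globally before any worker is created (the helper name `force_spawn` is hypothetical):

```python
import multiprocessing


def force_spawn() -> str:
    """Switch the global start method to "spawn" and return the active one."""
    # force=True allows overriding a start method that was already set.
    multiprocessing.set_start_method("spawn", force=True)
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    # The guard matters with spawn: child processes re-import the main module.
    print(force_spawn())
```

With PyTorch, passing `multiprocessing_context="spawn"` to the `DataLoader` should have a similar effect without changing the global setting.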
I also removed what looks like code duplication between rename_columns and rename_column.