Fix parallel downloads for datasets without scripts #6551
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
albertvillanova
left a comment
Thanks for the useful enhancement.
Benchmarks (PyArrow==8.0.0): benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, benchmark_map_filter.json
@lhoestq I was thinking uninstalling [...] Now, instead of showing progress bars one after another, it seems to download the dataset way, way faster (about 4 minutes instead of 58, thank you very much), but it does not show any download progress bars at all.
Enable parallel downloads using multiprocessing when `num_proc` is passed to `load_dataset`.

It was already enabled for datasets with scripts (if they passed lists to `dl_manager.download`) but not for no-script datasets (for those we pass dicts `{split: [list of files]}` to `dl_manager.download`).

I fixed this by parallelising over the lists contained in the data files dicts when possible.

I also added a context manager `stack_multiprocessing_download_progress_bars` in `DownloadManager` to stack the progress bars of the downloads (from `cached_path(...)` calls). Otherwise the progress bars overlap each other with an annoying flickering effect.
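For readers skimming the thread, here is a rough sketch of the idea of parallelising over the lists inside a `{split: [list of files]}` dict. It is not the code from this PR: `download_per_split` and `fake_download` are hypothetical names, the download is faked, and a thread pool from `concurrent.futures` stands in for the multiprocessing pool so the snippet stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fake_download(url: str) -> str:
    """Hypothetical stand-in for a real download: returns a fake local path."""
    return "/cache/" + url.rsplit("/", 1)[-1]


def download_per_split(
    data_files: Dict[str, List[str]],
    download_fn: Callable[[str], str],
    num_proc: int = 1,
) -> Dict[str, List[str]]:
    """Apply download_fn to every file of every split, in parallel across splits."""
    # Flatten {split: [files]} into one list so the pool sees every file,
    # remembering which split each file belongs to.
    flat = [
        (split, url)
        for split, split_urls in data_files.items()
        for url in split_urls
    ]
    urls = [url for _, url in flat]

    if num_proc > 1:
        # Assumption: threads instead of processes, to keep the sketch simple.
        with ThreadPoolExecutor(max_workers=num_proc) as pool:
            results = list(pool.map(download_fn, urls))  # map keeps input order
    else:
        results = [download_fn(url) for url in urls]

    # Rebuild the original {split: [files]} shape from the flat results.
    out: Dict[str, List[str]] = {split: [] for split in data_files}
    for (split, _), path in zip(flat, results):
        out[split].append(path)
    return out
```

Flattening first means the workers are shared across all splits, so a split with a single large file list does not leave the other workers idle. The returned dict has the same keys and per-split ordering as the input.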