Multiprocessed dataset builder #5107
Merged: lhoestq merged 46 commits into huggingface:main from TevenLeScao:multiprocessed_dataset_prep on Nov 9, 2022
Commits
a802ba5 multiprocessing-compatible naming scheme and refactor (TevenLeScao)
ea56329 multiprocessed shard writing for GeneratorBasedBuilder (TevenLeScao)
9536184 multiprocessed shard writing for ArrowBasedBuilder (TevenLeScao)
31d8395 style (TevenLeScao)
9c5843a multiprocessed dataset loading (TevenLeScao)
328112e compatibility with non-sharded datasets (TevenLeScao)
9dc8539 bugfix (TevenLeScao)
21a603a bugfix (TevenLeScao)
55cb365 Merge remote-tracking branch 'origin/multiprocessed_dataset_prep' int… (TevenLeScao)
94efbdb removed unused import (TevenLeScao)
bac2b2f fixed bad ordering (TevenLeScao)
3e4f337 less misleading tqdm (TevenLeScao)
b2f634d fix gen_kwargs distribution + read shards (lhoestq)
296302f minor (lhoestq)
9b312d4 minor2 (lhoestq)
d2e70f2 support beam datasets (lhoestq)
e3a30fa docstrings + minor (lhoestq)
cf6fd25 add iflatmap_unordered for parallel write & progress updates (lhoestq)
3e5d0cc use 1 tqdm bar receiving updates from subprocesses (lhoestq)
09c13a7 docs (lhoestq)
a2e83d5 add test_iflatmap_unordered (lhoestq)
e3bc7a7 style (lhoestq)
e8923e2 test arrow_reader.py (lhoestq)
ef9c7f1 fix test_iflatmap_unordered (lhoestq)
088dbb1 add Beam test_download_and_prepare_sharded (lhoestq)
eb1fc58 test gen_kwargs distribution (lhoestq)
e035339 test download_and_prepare with num_proc (lhoestq)
06c5d33 Merge branch 'main' into multiprocessed_dataset_prep (lhoestq)
e50ec74 style (lhoestq)
525c829 improve test (lhoestq)
eae6491 don't close the pool (lhoestq)
93f355d Merge branch 'main' into multiprocessed_dataset_prep (lhoestq)
b321c61 fix multiprocessing on windows (lhoestq)
b05e551 keep multiprocessing disabled by default (lhoestq)
020eb89 again + docs (lhoestq)
142f822 more docs (lhoestq)
f22c162 more docs (lhoestq)
08b8626 Merge remote-tracking branch 'upstream/main' into multiprocessed_data… (lhoestq)
4ce2d12 some var renaming (lhoestq)
e05ad83 style (lhoestq)
c621cb6 Apply suggestions from code review (lhoestq)
22d965e Apply suggestions from code review (lhoestq)
dc0ef15 added utils/sharding.py (lhoestq)
95cdd0b Merge remote-tracking branch 'upstream/main' into multiprocessed_data… (lhoestq)
12d69f3 style (lhoestq)
db45b3b style (lhoestq)