Contributor

@mxschmdt mxschmdt commented Jul 6, 2021

Fixes #2600

I also moved the check for `num_proc` being larger than the dataset size (added in #2566) earlier, so that multiprocessing is not used with only one process.

mxschmdt added 2 commits July 6, 2021 18:55
If `num_proc` is larger than the length of the dataset, `num_proc` is reduced to the number of samples in the dataset.
However, this currently happens after multiprocessing has been initialized, so multiprocessing can end up running with a single process.
The check should therefore be safe to move to before multiprocessing is initialized.
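
The reordering described in this commit message can be sketched roughly as follows. This is only an illustration, not the library's actual code: `resolve_num_proc` and `choose_execution_path` are hypothetical helper names, and the dataset is represented by its length alone. The point is that clamping `num_proc` before choosing the execution path lets a one-sample dataset fall through to the single-process branch.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_num_proc(num_proc, dataset_len):
    """Clamp num_proc to the dataset size (hypothetical helper, not the library's API)."""
    if num_proc is not None and num_proc > dataset_len:
        num_proc = dataset_len
        logger.warning(
            f"num_proc must be <= {dataset_len}. Reducing num_proc to {num_proc} "
            f"for dataset of size {dataset_len}."
        )
    return num_proc


def choose_execution_path(num_proc, dataset_len):
    """Decide between single-process and multi-process execution.

    The clamp runs first, so a request for 4 processes on a 1-row dataset
    is treated as single-process work instead of starting a pool of one.
    """
    num_proc = resolve_num_proc(num_proc, dataset_len)
    if num_proc is None or num_proc <= 1:
        return "single"
    return "multi"
```

If the clamp ran only after the branch had been chosen, `choose_execution_path(4, 1)` would take the multi-process branch and then run with one worker, which is the situation this change avoids.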
@mxschmdt mxschmdt marked this pull request as draft July 6, 2021 17:35
@mxschmdt mxschmdt marked this pull request as ready for review July 6, 2021 18:08
Member

@albertvillanova albertvillanova left a comment

Thank you!

Comment on lines +1656 to +1661
    if num_proc is not None and num_proc > len(self):
        num_proc = len(self)
        logger.warning(
            f"num_proc must be <= {len(self)}. Reducing num_proc to {num_proc} for dataset of size {len(self)}."
        )

Good catch to avoid multiprocessing with only 1 process, thanks.

Member

@lhoestq lhoestq left a comment

Thank you! It is indeed better this way.

@lhoestq lhoestq merged commit 5de6041 into huggingface:master Jul 7, 2021
@mxschmdt mxschmdt deleted the fix-filter-multiprocessing branch July 7, 2021 12:53
@albertvillanova albertvillanova added this to the 1.10 milestone Jul 12, 2021
Development

Successfully merging this pull request may close these issues.

Crash when using multiprocessing (num_proc > 1) on filter and all samples are discarded

3 participants