@mariosasko
Collaborator

Iterate over data files outside dl_manager.iter_files to allow parallelization in streaming mode.

(The issue reported here)

PS: Another option would be to override FilesIterable.__getitem__ to make it indexable and check for that type in _shard_kwargs and n_shards, but IMO this solution adds too much unnecessary complexity.
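
For illustration only, a minimal sketch of the idea described above, assuming the loader ends up holding a plain list with one lazy iterator per data file instead of a single iterable over all files. ToyDownloadManager, split_into_shards and generate_tables are hypothetical stand-ins written for this example, not the actual datasets code; only the name iter_files is taken from the description.

```python
import itertools
from typing import Iterable, Iterator, List


class ToyDownloadManager:
    """Hypothetical stand-in for a download manager (not the real datasets class)."""

    def iter_files(self, path: str) -> Iterator[str]:
        # Assumption: each data file or directory lazily expands to two concrete files.
        for i in range(2):
            yield f"{path}/part-{i}"


def split_into_shards(files: list, n_shards: int) -> List[list]:
    # A plain list can be sliced, so it can be distributed across workers.
    return [files[i::n_shards] for i in range(n_shards)]


def generate_tables(files: Iterable[Iterable[str]]) -> Iterator[str]:
    # Each element is itself an iterable of file paths, so flatten before reading.
    for file in itertools.chain.from_iterable(files):
        yield file


dl_manager = ToyDownloadManager()
data_files = ["a", "b", "c"]

# Iterate over the data files outside iter_files: one lazy iterator per data file,
# collected in a list rather than one opaque iterable over everything.
per_file_iterables = [dl_manager.iter_files(f) for f in data_files]

shards = split_into_shards(per_file_iterables, n_shards=2)
results = [list(generate_tables(shard)) for shard in shards]
```

Because the outer container is a list, it has a length and can be split without consuming the lazy iterators inside it, which is what makes per-shard processing possible in this sketch.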

@mariosasko mariosasko requested a review from lhoestq July 4, 2022 13:16
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jul 4, 2022

The documentation is not available anymore as the PR was closed or merged.

@mariosasko mariosasko marked this pull request as draft July 4, 2022 13:25
@lhoestq
Member

lhoestq commented Jul 4, 2022

Cool, thanks! Yup, it sounds like the right solution.

It looks like _generate_tables needs to be updated as well to fix the CI.

@mariosasko mariosasko marked this pull request as ready for review July 4, 2022 17:09
Member

@lhoestq lhoestq left a comment

Sounds good, thanks!

@mariosasko mariosasko merged commit 7feeb56 into master Jul 5, 2022
@mariosasko mariosasko deleted the allow-loaders-parallelization branch July 5, 2022 11:00