-
Notifications
You must be signed in to change notification settings - Fork 3k
Support streaming compressed files #2786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| compression = fsspec.core.get_compression(file, "infer") | ||
| if not compression or compression in ["gzip", "zip"]: | ||
| file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open() | ||
| file_obj = _add_retries_to_file_obj_read_method(file_obj) | ||
| else: | ||
| file_obj = fsspec.open(file, mode=mode, compression=compression, *args, **kwargs) | ||
| file_obj = _add_retries_to_fsspec_open_file(file_obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xopen is an extension of open to make it work with remote files.
Here you change its behavior for compressed files: you automatically uncompress them. Therefore if you try to open a compressed file and then use gzip (or any other compressing tool) to uncompress it, it won't work since it's already uncompressed.
I think we should revert this change, and explicitly use some tool in the dataset scripts to uncompress the files as we do in standard python. Otherwise we may end up with code that works in streaming mode but not in standard mode and vice-versa.
Let me know what you think @albertvillanova
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, fsspec.open (even if passed the compression parameter) does not uncompress the file immediately: it returns an OpenFile instance, which will return a file-object wrapped with a decompressor instance (when called within a context manager), which will decompress on the fly... ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yes, at the end, the result (after having called dl_manager.download_and_extract will be an uncompressed file, either streaming or not. That is the objective! 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is: how do you make that StreamingDownloadManager.extract() passes the parameter compression=compression to fsspec.open(urlpath, compression=compression) if they can communicate only through the parameter urlpath?
Because of this, I always pass compression="infer", which assumes that all datasets scripts have called .extract (or .download_and_extract) before calling fsspec.open. This assumption is sensible and will work for all dataset scripts, except for oscar (as you told me yesterday), because you changed oscar with a call: gzip.open(open()).
| ) | ||
| def test_load_dataset_streaming_compressed_files(path): | ||
| repo_id = "albertvillanova/datasets-tests-compression" | ||
| data_files = f"https://huggingface.co/datasets/{repo_id}/resolve/main/{path}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really nice feature @albertvillanova !
I think the glob logic has to be moved in a data files resolution module, as done in #2662
Line 228 in 04c2a4b
| def _resolve_data_files_locally_or_by_urls( |
The current implementation may not be robust enough to work with path manipulations by users in compressed files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not touched the glob logic in this PR though... 🤔
Add support to stream compressed files (current options in fsspec):
cc: @lewtun