Skip to content

Conversation

@albertvillanova
Copy link
Member

@albertvillanova albertvillanova commented Aug 11, 2021

Add support to stream compressed files (current options in fsspec):

  • bz2
  • lz4
  • xz
  • zstd

cc: @lewtun

@albertvillanova albertvillanova changed the title Support streaming zstd compressed files Support streaming compressed files Aug 11, 2021
@albertvillanova albertvillanova marked this pull request as ready for review August 15, 2021 06:57
@albertvillanova albertvillanova merged commit 98159c4 into huggingface:master Aug 16, 2021
Comment on lines +91 to +97
compression = fsspec.core.get_compression(file, "infer")
if not compression or compression in ["gzip", "zip"]:
file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
file_obj = _add_retries_to_file_obj_read_method(file_obj)
else:
file_obj = fsspec.open(file, mode=mode, compression=compression, *args, **kwargs)
file_obj = _add_retries_to_fsspec_open_file(file_obj)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xopen is an extension of open to make it work with remote files.

Here you change its behavior for compressed files: you automatically uncompress them. Therefore if you try to open a compressed file and then use gzip (or any other compressing tool) to uncompress it, it won't work since it's already uncompressed.

I think we should revert this change, and explicitly use some tool in the dataset scripts to uncompress the files as we do in standard python. Otherwise we may end up with code that works in streaming mode but not in standard mode and vice-versa.

Let me know what you think @albertvillanova

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, fsspec.open (even if passed the compression parameter) does not uncompress the file immediately: it returns an OpenFile instance, which will return a file-object wrapped with a decompressor instance (when called within a context manager), which will decompress on the fly... ;)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes, at the end, the result (after having called dl_manager.download_and_extract will be an uncompressed file, either streaming or not. That is the objective! 😉

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is: how do you make that StreamingDownloadManager.extract() passes the parameter compression=compression to fsspec.open(urlpath, compression=compression) if they can communicate only through the parameter urlpath?

Because of this, I always pass compression="infer", which assumes that all datasets scripts have called .extract (or .download_and_extract) before calling fsspec.open. This assumption is sensible and will work for all dataset scripts, except for oscar (as you told me yesterday), because you changed oscar with a call: gzip.open(open()).

)
def test_load_dataset_streaming_compressed_files(path):
repo_id = "albertvillanova/datasets-tests-compression"
data_files = f"https://huggingface.co/datasets/{repo_id}/resolve/main/{path}"
Copy link
Member

@lhoestq lhoestq Aug 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really nice feature @albertvillanova !

I think the glob logic has to be moved in a data files resolution module, as done in #2662

def _resolve_data_files_locally_or_by_urls(

The current implementation may not be robust enough to work with path manipulations by users in compressed files

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not touched the glob logic in this PR though... 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants