Support streaming compressed files #2786
Changes from all commits
```diff
@@ -247,6 +247,24 @@ def test_load_dataset_streaming_gz_json(jsonl_gz_path):
     assert ds_item == {"col_1": "0", "col_2": 0, "col_3": 0.0}


+@require_streaming
+@pytest.mark.parametrize(
+    "path", ["sample.jsonl", "sample.jsonl.gz", "sample.tar", "sample.jsonl.xz", "sample.zip", "sample.jsonl.zst"]
+)
+def test_load_dataset_streaming_compressed_files(path):
+    repo_id = "albertvillanova/datasets-tests-compression"
+    data_files = f"https://huggingface.co/datasets/{repo_id}/resolve/main/{path}"
```
Member:
This is a really nice feature @albertvillanova! I think the glob logic has to be moved into a data files resolution module, as done in #2662 (line 228 in 04c2a4b). The current implementation may not be robust enough to work with path manipulations by users in compressed files.

Author:
I have not touched the glob logic in this PR though... 🤔
```diff
+    ds = load_dataset("json", split="train", data_files=data_files, streaming=True)
+    assert isinstance(ds, IterableDataset)
+    ds_item = next(iter(ds))
+    assert ds_item == {
+        "tokens": ["Ministeri", "de", "Justícia", "d'Espanya"],
+        "ner_tags": [1, 2, 2, 2],
+        "langs": ["ca", "ca", "ca", "ca"],
+        "spans": ["PER: Ministeri de Justícia d'Espanya"],
+    }


 def test_loading_from_the_datasets_hub():
     with tempfile.TemporaryDirectory() as tmp_dir:
         dataset = load_dataset(SAMPLE_DATASET_IDENTIFIER, cache_dir=tmp_dir)
```
Member:
`xopen` is an extension of `open` that makes it work with remote files. Here you change its behavior for compressed files: you automatically uncompress them. Therefore, if you try to open a compressed file and then use `gzip` (or any other compression tool) to uncompress it, it won't work, since it's already uncompressed.

I think we should revert this change and explicitly use some tool in the dataset scripts to uncompress the files, as we do in standard Python. Otherwise we may end up with code that works in streaming mode but not in standard mode, and vice versa.

Let me know what you think @albertvillanova
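A minimal sketch of the failure mode this comment describes, using `fsspec.open` directly (which, per the replies below, is what the modified `xopen` delegates to; the URL reuses the test fixture from the diff above). Illustrative only, not datasets code:

```python
import gzip

import fsspec

url = (
    "https://huggingface.co/datasets/albertvillanova/"
    "datasets-tests-compression/resolve/main/sample.jsonl.gz"
)

# With compression="infer", the opener already decompresses the stream.
with fsspec.open(url, mode="rb", compression="infer") as f:
    # f yields decompressed JSON-lines bytes, so applying gzip.open on
    # top -- the standard non-streaming pattern -- sees non-gzip data
    # and fails with gzip.BadGzipFile on the first read.
    with gzip.open(f, mode="rt") as g:
        g.readline()
```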
Author:
No, `fsspec.open` (even if passed the `compression` parameter) does not uncompress the file immediately: it returns an `OpenFile` instance, which (when used as a context manager) yields a file object wrapped with a decompressor instance that decompresses on the fly... ;)
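A minimal sketch of this lazy behavior (plain `fsspec`, not datasets code; the URL is again the test fixture from the diff above):

```python
import fsspec

url = (
    "https://huggingface.co/datasets/albertvillanova/"
    "datasets-tests-compression/resolve/main/sample.jsonl.gz"
)

# Nothing is downloaded or decompressed here: `of` is just an OpenFile.
of = fsspec.open(url, mode="rt", compression="gzip")

# Entering the context wraps the raw stream in a gzip decompressor;
# bytes are fetched and decompressed on the fly as you read.
with of as f:
    print(f.readline())
```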
Author:
And yes, in the end, the result (after having called `dl_manager.download_and_extract`) will be an uncompressed file, whether streaming or not. That is the objective! 😉
Author:
The issue is: how do you make `StreamingDownloadManager.extract()` pass the parameter `compression=compression` to `fsspec.open(urlpath, compression=compression)`, if they can communicate only through the parameter `urlpath`? Because of this, I always pass `compression="infer"`, which assumes that all dataset scripts have called `.extract` (or `.download_and_extract`) before calling `fsspec.open`. This assumption is sensible and will work for all dataset scripts, except for oscar (as you told me yesterday), because you changed oscar with the call `gzip.open(open())`.