
Conversation

@albertvillanova
Member

Close #2572.

cc: @thomwolf

@lhoestq
Member

lhoestq left a comment


Thanks! This can be useful for datasets like The Pile.

@albertvillanova
Member Author

> What if people want to run some tests without having zstandard?
> Usually what we do is add a decorator @require_zstandard for example

@lhoestq I think I'm missing something here...

Tests are a development tool (to ensure we deliver a good-quality lib), not something we ship to the end users of the lib. Users of the lib just pip install datasets and no tests are delivered with it (the tests directory lives outside the source code dir).

On the contrary, developers (contributors) of the lib do need to be able to run tests (TDD). Because of that, they are required to install datasets differently: pip install -e .[dev], so that all required development (and testing) dependencies are properly installed (including zstandard).

Apart from zstandard, there are many other dev/test dependencies required for running tests, and we do not have a @require_toto for each and every one of them in our tests:

  • pytest and absl-py (they are not dependencies in install_requires, but only in TEST_REQUIRE extras_require),
  • boto3 (in test_filesystem.py),
  • seqeval (in test_metric_common.py),
  • bs4 (used by eli5 and tested in test_hf_gcp.py)
  • ...

So IMHO, to run tests you should first install datasets with the dev or tests dependencies: either pip install -e .[dev] or pip install -e .[tests] (the latter to be used in the CI testing-only part of the development cycle). And the tests should be written accordingly, assuming all test dependencies are installed.
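
For reference, the @require_zstandard decorator mentioned in the quoted comment above would look roughly like this; a minimal sketch only, analogous to the project's other require_* test helpers (the name and placement here are assumptions, not the actual code):

```python
# Hypothetical sketch of a @require_zstandard test decorator; not the actual datasets helper.
import unittest


def require_zstandard(test_case):
    """Skip the decorated test when the zstandard package is not installed."""
    try:
        import zstandard  # noqa: F401
    except ImportError:
        return unittest.skip("test requires zstandard")(test_case)
    return test_case
```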

@lhoestq
Member

lhoestq commented Jul 5, 2021

Hi!
I was saying that because the other dependencies you mentioned are only required for some tests, whereas here zstd is required for all tests since it's imported in conftest.py.
Feel free to keep it as it is right now, or maybe move the fixture to test_file_utils.py so that users without zstd can run tests for their builders, dataset cards, etc. without issues.
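
As an illustration of the kind of fixture being discussed, here is a minimal sketch that creates a small zstd-compressed file for tests to consume (the fixture name, file name, and contents are assumptions, not the actual conftest.py code):

```python
# Hypothetical sketch of a zstd test fixture; requires the zstandard package.
import pytest
import zstandard as zstd


@pytest.fixture(scope="session")
def zstd_path(tmp_path_factory):
    """Create a small zstd-compressed text file and return its path."""
    path = str(tmp_path_factory.mktemp("data") / "file.txt.zst")
    data = b"hello world\n" * 10
    with open(path, "wb") as f:
        f.write(zstd.ZstdCompressor().compress(data))
    return path
```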

@lhoestq
Member

lhoestq commented Jul 5, 2021

Thank you! I think we can merge now.

@lhoestq lhoestq merged commit 474d46b into huggingface:master Jul 5, 2021
@Shashi456

@lhoestq does this mean that the Pile could have streaming support in the future? AFAIK streaming doesn't support the zstandard compression type.

@lewtun
Member

lewtun commented Aug 9, 2021

> @lhoestq does this mean that the Pile could have streaming support in the future? AFAIK streaming doesn't support the zstandard compression type.

just for reference, i tried to stream one of the .zst files from the pile using

from datasets import load_dataset

data_files = ["https://the-eye.eu/public/AI/pile/train/00.jsonl.zst"]
streamed_dataset = load_dataset('json', split='train', data_files=data_files, streaming=True)

and got the following error:

Using custom data configuration default-4e71acadc389c254
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/tmp/ipykernel_1187680/10848115.py in <module>
      1 data_files = ["https://the-eye.eu/public/AI/pile/train/00.jsonl.zst"]
      2 
----> 3 streamed_dataset = load_dataset('json', split='train', data_files=data_files, streaming=True)
      4 

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, script_version, use_auth_token, task, streaming, **config_kwargs)
    835         # this extends the open and os.path.join functions for data streaming
    836         extend_module_for_streaming(builder_instance.__module__, use_auth_token=use_auth_token)
--> 837         return builder_instance.as_streaming_dataset(
    838             split=split,
    839             use_auth_token=use_auth_token,

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/builder.py in as_streaming_dataset(self, split, base_path, use_auth_token)
    922             data_dir=self.config.data_dir,
    923         )
--> 924         splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
    925         # By default, return all splits
    926         if split is None:

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/packaged_modules/json/json.py in _split_generators(self, dl_manager)
     50         if not self.config.data_files:
     51             raise ValueError(f"At least one data file must be specified, but got data_files={self.config.data_files}")
---> 52         data_files = dl_manager.download_and_extract(self.config.data_files)
     53         if isinstance(data_files, (str, list, tuple)):
     54             files = data_files

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/streaming_download_manager.py in download_and_extract(self, url_or_urls)
    140 
    141     def download_and_extract(self, url_or_urls):
--> 142         return self.extract(self.download(url_or_urls))

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/streaming_download_manager.py in extract(self, path_or_paths)
    115 
    116     def extract(self, path_or_paths):
--> 117         urlpaths = map_nested(self._extract, path_or_paths, map_tuple=True)
    118         return urlpaths
    119 

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/py_utils.py in map_nested(function, data_struct, dict_only, map_list, map_tuple, map_numpy, num_proc, types)
    202         num_proc = 1
    203     if num_proc <= 1 or len(iterable) <= num_proc:
--> 204         mapped = [
    205             _single_map_nested((function, obj, types, None, True))
    206             for obj in utils.tqdm(iterable, disable=disable_tqdm)

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/py_utils.py in <listcomp>(.0)
    203     if num_proc <= 1 or len(iterable) <= num_proc:
    204         mapped = [
--> 205             _single_map_nested((function, obj, types, None, True))
    206             for obj in utils.tqdm(iterable, disable=disable_tqdm)
    207         ]

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/py_utils.py in _single_map_nested(args)
    141     # Singleton first to spare some computation
    142     if not isinstance(data_struct, dict) and not isinstance(data_struct, types):
--> 143         return function(data_struct)
    144 
    145     # Reduce logging to keep things readable in multiprocessing with tqdm

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/streaming_download_manager.py in _extract(self, urlpath)
    119 
    120     def _extract(self, urlpath):
--> 121         protocol = self._get_extraction_protocol(urlpath)
    122         if protocol is None:
    123             # no extraction

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/streaming_download_manager.py in _get_extraction_protocol(self, urlpath)
    137         elif path.endswith(".zip"):
    138             return "zip"
--> 139         raise NotImplementedError(f"Extraction protocol for file at {urlpath} is not implemented yet")
    140 
    141     def download_and_extract(self, url_or_urls):

NotImplementedError: Extraction protocol for file at https://the-eye.eu/public/AI/pile/train/00.jsonl.zst is not implemented yet

i'm not sure whether @Shashi456 is referring to a fundamental limitation with "streaming" zstandard-compressed files or simply to the fact that we need to support the protocol in the streaming API of datasets

@albertvillanova
Member Author

@lewtun our streaming mode patches the Python open function. I can have a look tomorrow to see whether it is easily implementable for this case.
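
For illustration, streaming a remote .zst file boils down to something like the following; a minimal sketch using fsspec directly (not the datasets implementation), and it requires the zstandard package to be installed:

```python
# Rough illustration only: read a zstd-compressed remote JSON-lines file as a stream with fsspec.
import fsspec

url = "https://the-eye.eu/public/AI/pile/train/00.jsonl.zst"
with fsspec.open(url, mode="rt", compression="zstd") as f:
    first_line = f.readline()
    print(first_line)
```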

@albertvillanova
Member Author

@lewtun, I have tested it and yes, it is easily implementable. I've created a draft pull request with an implementation proposal: #2786.
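
Roughly speaking, the missing piece behind the NotImplementedError above is teaching the extraction-protocol sniffing to recognize the .zst suffix as well; an illustrative sketch only, not the actual change from #2786:

```python
# Illustrative only; not the code from PR #2786.
def get_extraction_protocol(urlpath: str):
    """Guess the compression protocol of a file from its suffix."""
    if urlpath.endswith(".zip"):
        return "zip"
    if urlpath.endswith(".gz"):
        return "gzip"
    if urlpath.endswith(".zst"):
        return "zstd"
    return None  # no extraction needed
```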

@lewtun
Member

lewtun commented Aug 11, 2021

thanks a lot @albertvillanova - now i can stream the pile :)
