
Conversation

@albertvillanova
Member

Close #2572.

cc: @thomwolf

@lhoestq
Member

lhoestq left a comment


Thanks! This can be useful for datasets like The Pile.

@albertvillanova
Member Author

> What if people want to run some tests without having zstandard?
> Usually what we do is add a decorator @require_zstandard for example

@lhoestq I think I'm missing something here...

Tests are a development tool (to ensure we deliver a good-quality lib), not something we ship to the end users of the lib. Users of the lib just pip install datasets and no tests are delivered with it (the tests directory lives outside the source code dir).

On the contrary, developers (contributors) of the lib do need to be able to run tests (TDD). Because of that, they are required to install datasets differently: pip install -e .[dev], so that all required development (and testing) dependencies are properly installed (including zstandard).

Apart from zstandard, there are many other dev/test dependencies required for running tests, and we do not have a @require_toto for each and every one of them in our tests:

  • pytest and absl-py (they are not dependencies in install_requires, but only in TEST_REQUIRE extras_require),
  • boto3 (in test_filesystem.py),
  • seqeval (in test_metric_common.py),
  • bs4 (used by eli5 and tested in test_hf_gcp.py)
  • ...

So IMHO, to run tests you should first install datasets with the dev or tests dependencies: either pip install -e .[dev] or pip install -e .[tests] (the latter to be used in the CI testing-only part of the development cycle). And the tests should be written accordingly, assuming all test dependencies are installed.
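
For reference, the @require_zstandard decorator mentioned in the quoted comment above would look roughly like this; a minimal sketch only, analogous to the project's other require_* test helpers (the name and placement here are assumptions, not the actual code):

```python
# Hypothetical sketch of a @require_zstandard test decorator; not the actual datasets helper.
import unittest


def require_zstandard(test_case):
    """Skip the decorated test when the zstandard package is not installed."""
    try:
        import zstandard  # noqa: F401
    except ImportError:
        return unittest.skip("test requires zstandard")(test_case)
    return test_case
```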

@lhoestq
Member

lhoestq commented Jul 5, 2021

Hi!
I was saying that because the other dependencies you mentioned are only required for some tests, whereas here zstd is required for all tests since it's imported in conftest.py.
Feel free to keep it as it is right now, or maybe move the fixture to test_file_utils.py so that users without zstd can run tests for their builders, dataset cards, etc. without issues.
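
As an illustration of the kind of fixture being discussed, here is a minimal sketch that creates a small zstd-compressed file for tests to consume (the fixture name, file name, and contents are assumptions, not the actual conftest.py code):

```python
# Hypothetical sketch of a zstd test fixture; requires the zstandard package.
import pytest
import zstandard as zstd


@pytest.fixture(scope="session")
def zstd_path(tmp_path_factory):
    """Create a small zstd-compressed text file and return its path."""
    path = str(tmp_path_factory.mktemp("data") / "file.txt.zst")
    data = b"hello world\n" * 10
    with open(path, "wb") as f:
        f.write(zstd.ZstdCompressor().compress(data))
    return path
```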

@lhoestq
Member

lhoestq commented Jul 5, 2021

Thank you! I think we can merge now.

@lhoestq lhoestq merged commit 474d46b into huggingface:master Jul 5, 2021
@Shashi456

@lhoestq does this mean that the Pile could have streaming support in the future? AFAIK streaming doesn't support the zstandard compression type.

@lewtun
Member

lewtun commented Aug 9, 2021

> @lhoestq does this mean that the Pile could have streaming support in the future? AFAIK streaming doesn't support the zstandard compression type.

just for reference, i tried to stream one of the .zst files from the pile using

from datasets import load_dataset

data_files = ["https://the-eye.eu/public/AI/pile/train/00.jsonl.zst"]
streamed_dataset = load_dataset('json', split='train', data_files=data_files, streaming=True)

and got the following error:

Using custom data configuration default-4e71acadc389c254
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/tmp/ipykernel_1187680/10848115.py in <module>
      1 data_files = ["https://the-eye.eu/public/AI/pile/train/00.jsonl.zst"]
      2 
----> 3 streamed_dataset = load_dataset('json', split='train', data_files=data_files, streaming=True)
      4 

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, script_version, use_auth_token, task, streaming, **config_kwargs)
    835         # this extends the open and os.path.join functions for data streaming
    836         extend_module_for_streaming(builder_instance.__module__, use_auth_token=use_auth_token)
--> 837         return builder_instance.as_streaming_dataset(
    838             split=split,
    839             use_auth_token=use_auth_token,

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/builder.py in as_streaming_dataset(self, split, base_path, use_auth_token)
    922             data_dir=self.config.data_dir,
    923         )
--> 924         splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
    925         # By default, return all splits
    926         if split is None:

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/packaged_modules/json/json.py in _split_generators(self, dl_manager)
     50         if not self.config.data_files:
     51             raise ValueError(f"At least one data file must be specified, but got data_files={self.config.data_files}")
---> 52         data_files = dl_manager.download_and_extract(self.config.data_files)
     53         if isinstance(data_files, (str, list, tuple)):
     54             files = data_files

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/streaming_download_manager.py in download_and_extract(self, url_or_urls)
    140 
    141     def download_and_extract(self, url_or_urls):
--> 142         return self.extract(self.download(url_or_urls))

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/streaming_download_manager.py in extract(self, path_or_paths)
    115 
    116     def extract(self, path_or_paths):
--> 117         urlpaths = map_nested(self._extract, path_or_paths, map_tuple=True)
    118         return urlpaths
    119 

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/py_utils.py in map_nested(function, data_struct, dict_only, map_list, map_tuple, map_numpy, num_proc, types)
    202         num_proc = 1
    203     if num_proc <= 1 or len(iterable) <= num_proc:
--> 204         mapped = [
    205             _single_map_nested((function, obj, types, None, True))
    206             for obj in utils.tqdm(iterable, disable=disable_tqdm)

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/py_utils.py in <listcomp>(.0)
    203     if num_proc <= 1 or len(iterable) <= num_proc:
    204         mapped = [
--> 205             _single_map_nested((function, obj, types, None, True))
    206             for obj in utils.tqdm(iterable, disable=disable_tqdm)
    207         ]

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/py_utils.py in _single_map_nested(args)
    141     # Singleton first to spare some computation
    142     if not isinstance(data_struct, dict) and not isinstance(data_struct, types):
--> 143         return function(data_struct)
    144 
    145     # Reduce logging to keep things readable in multiprocessing with tqdm

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/streaming_download_manager.py in _extract(self, urlpath)
    119 
    120     def _extract(self, urlpath):
--> 121         protocol = self._get_extraction_protocol(urlpath)
    122         if protocol is None:
    123             # no extraction

~/miniconda3/envs/hf/lib/python3.8/site-packages/datasets/utils/streaming_download_manager.py in _get_extraction_protocol(self, urlpath)
    137         elif path.endswith(".zip"):
    138             return "zip"
--> 139         raise NotImplementedError(f"Extraction protocol for file at {urlpath} is not implemented yet")
    140 
    141     def download_and_extract(self, url_or_urls):

NotImplementedError: Extraction protocol for file at https://the-eye.eu/public/AI/pile/train/00.jsonl.zst is not implemented yet

i'm not sure whether @Shashi456 is referring to a fundamental limitation with "streaming" zstandard-compressed files or simply to the fact that we need to support the protocol in the streaming API of datasets

@albertvillanova
Member Author

@lewtun our streaming mode patches the Python open function. I can have a look tomorrow to see whether it is easily implementable for this case.
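
For illustration, streaming a remote .zst file boils down to something like the following; a minimal sketch using fsspec directly (not the datasets implementation), and it requires the zstandard package to be installed:

```python
# Rough illustration only: read a zstd-compressed remote JSON-lines file as a stream with fsspec.
import fsspec

url = "https://the-eye.eu/public/AI/pile/train/00.jsonl.zst"
with fsspec.open(url, mode="rt", compression="zstd") as f:
    first_line = f.readline()
    print(first_line)
```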

@albertvillanova
Member Author

@lewtun, I have tested it and yes, it is easily implementable. I've created a draft pull request with an implementation proposal: #2786.
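
Roughly speaking, the missing piece behind the NotImplementedError above is teaching the extraction-protocol sniffing to recognize the .zst suffix as well; an illustrative sketch only, not the actual change from #2786:

```python
# Illustrative only; not the code from PR #2786.
def get_extraction_protocol(urlpath: str):
    """Guess the compression protocol of a file from its suffix."""
    if urlpath.endswith(".zip"):
        return "zip"
    if urlpath.endswith(".gz"):
        return "gzip"
    if urlpath.endswith(".zst"):
        return "zstd"
    return None  # no extraction needed
```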

@lewtun
Member

lewtun commented Aug 11, 2021

thanks a lot @albertvillanova - now i can stream the pile :)
