diff --git a/docs/source/index.rst b/docs/source/index.rst index e57b5f762e9..47c783d6de0 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -59,8 +59,8 @@ The documentation is organized in six parts: :maxdepth: 2 :caption: Adding new datasets/metrics - add_dataset share_dataset + add_dataset add_metric .. toctree:: diff --git a/docs/source/loading_datasets.rst b/docs/source/loading_datasets.rst index 83086170e97..260ac2ef3f6 100644 --- a/docs/source/loading_datasets.rst +++ b/docs/source/loading_datasets.rst @@ -3,16 +3,16 @@ Loading a Dataset A :class:`datasets.Dataset` can be created from various sources of data: -- from the `HuggingFace Hub `__, -- from local files, e.g. CSV/JSON/text/pandas files, or +- from the `Hugging Face Hub `__, +- from local or remote files, e.g. CSV/JSON/text/parquet/pandas files, or - from in-memory data like python dict or a pandas dataframe. In this section we study each option. -From the HuggingFace Hub +From the Hugging Face Hub ------------------------------------------------- -Over 1,000 datasets for many NLP tasks like text classification, question answering, language modeling, etc, are provided on the `HuggingFace Hub `__ and can be viewed and explored online with the `🤗 Datasets viewer `__. +Over 1,000 datasets for many NLP tasks like text classification, question answering, language modeling, etc, are provided on the `Hugging Face Hub `__ and can be viewed and explored online with the `🤗 Datasets viewer `__. .. note:: @@ -25,7 +25,7 @@ All the datasets currently available on the `Hub >> from datasets import list_datasets >>> datasets_list = list_datasets() >>> len(datasets_list) - 1067 + 1103 >>> print(', '.join(dataset for dataset in datasets_list)) acronym_identification, ade_corpus_v2, adversarial_qa, aeslc, afrikaans_ner_corpus, ag_news, ai2_arc, air_dialogue, ajgt_twitter_ar, allegro_reviews, allocine, alt, amazon_polarity, amazon_reviews_multi, amazon_us_reviews, ambig_qa, amttl, anli, app_reviews, aqua_rat, @@ -46,7 +46,7 @@ Let's load the **SQuAD dataset for Question Answering**. You can explore this da This call to :func:`datasets.load_dataset` does the following steps under the hood: -1. Download and import in the library the **SQuAD python processing script** from HuggingFace github repository or AWS bucket if it's not already stored in the library. +1. Download and import in the library the **SQuAD python processing script** from Hugging Face github repository or AWS bucket if it's not already stored in the library. .. note:: @@ -158,22 +158,60 @@ Apart from :obj:`name` and :obj:`split`, the :func:`datasets.load_dataset` metho The use of these arguments is discussed in the :ref:`load_dataset_cache_management` section below. You can also find the full details on these arguments on the package reference page for :func:`datasets.load_dataset`. +From a community dataset on the Hugging Face Hub +----------------------------------------------------------- + +The community shares hundreds of datasets on the Hugging Face Hub using **dataset repositories**. +A dataset repository is a versioned repository of data files. +Everyone can create a dataset repository on the Hugging Face Hub and upload their data. + +For example we have created a demo dataset at https://huggingface.co/datasets/lhoestq/demo1. +In this dataset repository we uploaded some CSV files, and you can load the dataset with: + +.. 
code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('lhoestq/demo1') + +You can even choose which files to load from a dataset repository. +For example you can load a subset of the **C4 dataset for language modeling**, hosted by AllenAI on the Hub. +You can browse the dataset repository at https://huggingface.co/datasets/allenai/c4 + +In the following example we specify which subset of the files to use with the ``data_files`` parameter: + +.. code-block:: + + >>> from datasets import load_dataset + >>> c4_subset = load_dataset('allenai/c4', data_files='en/c4-train.0000*-of-01024.json.gz') + + +You can also specify custom splits: + +.. code-block:: + + >>> data_files = {"validation": "en/c4-validation.*.json.gz"} + >>> c4_validation = load_dataset("allenai/c4", data_files=data_files, split="validation") + +In these examples, ``load_dataset`` will return all the files that match the Unix style pattern passed in ``data_files``. +If you don't specify which data files to use, it will use all the data files (here all C4 is about 13TB of data). + .. _loading-from-local-files: -From local files +From local or remote files ----------------------------------------------------------- -It's also possible to create a dataset from local files. +It's also possible to create a dataset from your own local or remote files. Generic loading scripts are provided for: - CSV files (with the :obj:`csv` script), - JSON files (with the :obj:`json` script), - text files (read as a line-by-line dataset with the :obj:`text` script), +- parquet files (with the :obj:`parquet` script). - pandas pickled dataframe (with the :obj:`pandas` script). -If you want to control better how your files are loaded, or if you have a file format exactly reproducing the file format for one of the datasets provided on the `HuggingFace Hub `__, it can be more flexible and simpler to create **your own loading script**, from scratch or by adapting one of the provided loading scripts. In this case, please go check the :doc:`add_dataset` chapter. +If you want more fine-grained control on how your files are loaded or if you have a file format that matches the format for one of the datasets provided on the `Hugging Face Hub `__, it can be more simpler to create **your own loading script**, from scratch or by adapting one of the provided loading scripts. In this case, please go check the :doc:`add_dataset` section. The :obj:`data_files` argument in :func:`datasets.load_dataset` is used to provide paths to one or several data source files. This argument currently accepts three types of inputs: @@ -190,12 +228,19 @@ Let's see an example of all the various ways you can provide files to :func:`dat >>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv']) >>> dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 'test': 'my_test_file.csv'}) + >>> base_url = 'https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/' + >>> dataset = load_dataset('csv', data_files={'train': base_url + 'train.csv', 'test': base_url + 'test.csv'}) .. note:: The :obj:`split` argument will work similarly to what we detailed above for the datasets on the Hub and you can find more details on the syntax for using :obj:`split` on the :doc:`dedicated tutorial on split <./splits>`. 
The only specific behavior related to loading local files is that if you don't indicate which split each files is related to, the provided files are assumed to belong to the **train** split. +.. note:: + + If you use a private dataset repository on the Hub, you just need to pass ``use_auth_token=True`` to ``load_dataset`` after logging in with the ``huggingface-cli login`` bash command. Alternatively you can pass your `API token `__ in ``use_auth_token``. + + CSV files ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -218,6 +263,13 @@ Here is an example loading two CSV file to create a ``train`` split (default spl >>> from datasets import load_dataset >>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv']) +You can also provide the URLs of remote csv files: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('csv', data_files="https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/train.csv") + The ``csv`` loading script provides a few simple access options to control parsing and reading the CSV files: - :obj:`skiprows` (int) - Number of first rows in the file to skip (default is 0) @@ -226,12 +278,6 @@ The ``csv`` loading script provides a few simple access options to control parsi - :obj:`quotechar` (1-character string) – The character used optionally for quoting CSV values (default ``"``). - :obj:`quoting` (int) – Control quoting behavior (default 0, setting this to 3 disables quoting, refer to `pandas.read_csv documentation ` for more details). -If you want more control, the ``csv`` script provides full control on reading, parsing and converting through the Apache Arrow `pyarrow.csv.ReadOptions `__, `pyarrow.csv.ParseOptions `__ and `pyarrow.csv.ConvertOptions `__ - - - :obj:`read_options` — Can be provided with a `pyarrow.csv.ReadOptions `__ to control all the reading options. If :obj:`skiprows`, :obj:`column_names` or :obj:`autogenerate_column_names` are also provided (see above), they will take priority over the attributes in :obj:`read_options`. - - :obj:`parse_options` — Can be provided with a `pyarrow.csv.ParseOptions `__ to control all the parsing options. If :obj:`delimiter` or :obj:`quote_char` are also provided (see above), they will take priority over the attributes in :obj:`parse_options`. - - :obj:`convert_options` — Can be provided with a `pyarrow.csv.ConvertOptions `__ to control all the conversion options. - JSON files ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -258,6 +304,13 @@ You can load such a dataset direcly with: >>> from datasets import load_dataset >>> dataset = load_dataset('json', data_files='my_file.json') +You can also provide the URLs of remote JSON files: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('json', data_files='https://huggingface.co/datasets/allenai/c4/resolve/main/en/c4-train.00000-of-01024.json.gz') + In real-life though, JSON files can have diverse format and the ``json`` script will accordingly fallback on using python JSON loading methods to handle various JSON file format. One common occurence is to have a JSON file with a single root dictionary where the dataset is contained in a specific field, as a list of dicts or a dict of lists. 
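In that case you can tell the ``json`` script which field holds the data with its ``field`` argument. Below is a minimal sketch, where ``my_file.json`` and the ``data`` field are placeholder names standing in for your own file and field:

.. code-block::

    >>> from datasets import load_dataset
    >>> # assuming my_file.json looks like: {"version": "0.1.0", "data": [{"text": "foo"}, {"text": "bar"}]}
    >>> dataset = load_dataset('json', data_files='my_file.json', field='data')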
@@ -289,6 +342,13 @@ This is simply done using the ``text`` loading script which will generate a data >>> from datasets import load_dataset >>> dataset = load_dataset('text', data_files={'train': ['my_text_1.txt', 'my_text_2.txt'], 'test': 'my_test_file.txt'}) +You can also provide the URLs of remote text files: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('text', data_files={'train': 'https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt'}) + Specifying the features of the dataset ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -465,8 +525,8 @@ For example, run the following to skip integrity verifications when loading the Loading datasets offline ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Each dataset builder (e.g. "squad") is a python script that is downloaded and cached either from the 🤗 Datasets GitHub repository or from the `HuggingFace Hub `__. -Only the ``text``, ``csv``, ``json`` and ``pandas`` builders are included in ``datasets`` without requiring external downloads. +Each dataset builder (e.g. "squad") is a Python script that is downloaded and cached either from the 🤗 Datasets GitHub repository or from the `Hugging Face Hub `__. +Only the ``text``, ``csv``, ``json``, ``parquet`` and ``pandas`` builders are included in ``datasets`` without requiring external downloads. Therefore if you don't have an internet connection you can't load a dataset that is not packaged with ``datasets``, unless the dataset is already cached. Indeed, if you've already loaded the dataset once before (when you had an internet connection), then the dataset is reloaded from the cache and you can use it offline. diff --git a/docs/source/share_dataset.rst b/docs/source/share_dataset.rst index c03f30c19d9..3f8fca5dcfb 100644 --- a/docs/source/share_dataset.rst +++ b/docs/source/share_dataset.rst @@ -1,10 +1,17 @@ Sharing your dataset ============================================= -Once you've written a new dataset loading script as detailed on the :doc:`add_dataset` page, you may want to share it with the community for instance on the `HuggingFace Hub `__. There are two options to do that: +Once you have your dataset, you may want to share it with the community for instance on the `HuggingFace Hub `__. There are two options to do that: -- add it as a canonical dataset by opening a pull-request on the `GitHub repository for 🤗 Datasets `__, - directly upload it on the Hub as a community provided dataset. +- add it as a canonical dataset by opening a pull-request on the `GitHub repository for 🤗 Datasets `__, + +Both options offer the same features such as: + +- dataset versioning +- commit history and diffs +- metadata for discoverability +- dataset cards for documentation, licensing, limitations, etc. Here are the main differences between these two options. @@ -12,7 +19,7 @@ Here are the main differences between these two options. 
* are faster to share (no reviewing process) * can contain the data files themselves on the Hub * are identified under the namespace of a user or organization: ``thomwolf/my_dataset`` or ``huggingface/our_dataset`` - * are flagged as ``unsafe`` by default because a dataset contains executable code so the users need to inspect and opt-in to use the datasets + * are flagged as ``unsafe`` by default because a dataset may contain executable code so the users need to inspect and opt-in to use the datasets - **Canonical** datasets: * are slower to add (need to go through the reviewing process on the githup repo) @@ -22,81 +29,7 @@ Here are the main differences between these two options. .. note:: - The distinctions between "canonical" and "community provided" datasets is made purely based on the selected sharing workflow and don't involve any ranking, decision or opinion regarding the content of the dataset it-self. - -.. _canonical-dataset: - -Sharing a "canonical" dataset --------------------------------- - -To add a "canonical" dataset to the library, you need to go through the following steps: - -**1. Fork the** `🤗 Datasets repository `__ by clicking on the 'Fork' button on the repository's home page. This creates a copy of the code under your GitHub user account. - -**2. Clone your fork** to your local disk, and add the base repository as a remote: - -.. code:: - - git clone https://github.com//datasets - cd datasets - git remote add upstream https://github.com/huggingface/datasets.git - - -**3. Create a new branch** to hold your development changes: - -.. code:: - - git checkout -b my-new-dataset - -.. note:: - - **Do not** work on the ``master`` branch. - -**4. Set up a development environment** by running the following command **in a virtual environment**: - -.. code:: - - pip install -e ".[dev]" - -.. note:: - - If 🤗 Datasets was already installed in the virtual environment, remove - it with ``pip uninstall datasets`` before reinstalling it in editable - mode with the ``-e`` flag. - -**5. Create a new folder with your dataset name** inside the `datasets folder `__ of the repository and add the dataset script you wrote and tested while following the instructions on the :doc:`add_dataset` page. - -**6. Format your code.** Run black and isort so that your newly added files look nice with the following command: - -.. code:: - - make style - make quality - - -**7.** Once you're happy with your dataset script file, add your changes and make a commit to **record your changes locally**: - -.. code:: - - git add datasets/ - git commit - -It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes: - -.. code:: - - git fetch upstream - git rebase upstream/master - -Push the changes to your account using: - -.. code:: - - git push -u origin my-new-dataset - -**8.** We also recommend adding **tests** and **metadata** to the dataset script if possible. Go through the :ref:`adding-tests` section to do so. - -**9.** Once you are satisfied with the dataset, go the webpage of your fork on GitHub and click on "Pull request" to **open a pull-request** on the `main github repository `__ for review. + The distinctions between "community provided" and "canonical" datasets is made purely based on the selected sharing workflow and don't involve any ranking, decision or opinion regarding the content of the dataset it-self. .. 
_community-dataset: @@ -114,6 +47,18 @@ In this page, we will show you how to share a dataset with the community on the Prepare your dataset for uploading ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +You can either have your dataset in a supported format (csv/jsonl/json/parquet/txt), or use a dataset script to define how to load your data. + +If your dataset is in a supported format, you're all set ! +Otherwise, you need a dataset script. It simply is a python script and its role is to define: + +- the feature types of your data +- how your dataset is split into train/validation/test (or any other splits) +- how to download the data +- how to process the data + +The dataset script is mandatory if your dataset is not in the supported formats, or if you need more control on how to define our dataset. + We have seen in the :doc:`dataset script tutorial `: how to write a dataset loading script. Let's see how you can share it on the `🤗 Datasets Hub `__. @@ -209,10 +154,10 @@ Check the directory before pushing to the 🤗 Datasets Hub. Make sure there are no garbage files in the directory you'll upload. It should only have: -- a `your_dataset_name.py` file, which is the dataset script; +- a `your_dataset_name.py` file, which is the dataset script (optional if your data files are already in the supported formats csv/jsonl/json/parquet/txt); +- the raw data files (json, csv, txt, mp3, png, etc.) that you need for your dataset - an optional `dataset_infos.json` file, which contains metadata about your dataset like the split sizes; - optional dummy data files, which contains only a small subset from the dataset for tests and preview; -- your raw data files (json, csv, txt, etc.) that you need for your dataset Other files can safely be deleted. @@ -276,6 +221,18 @@ Anyone can load it from code: >>> dataset = load_dataset("namespace/your_dataset_name") +If your dataset doesn't have a dataset script, then by default all your data will be loaded in the "train" split. +You can specify which files goes to which split by specifying the ``data_files`` parameter. + +Let's say your dataset repository contains one CSV file for the train split, and one CSV file for your test split. Then you can load it with: + + +.. code-block:: + + >>> data_files = {"train": "train.csv", "test": "test.csv"} + >>> dataset = load_dataset("namespace/your_dataset_name", data_files=data_files) + + You may specify a version by using the ``script_version`` flag in the ``load_dataset`` function: .. code-block:: @@ -285,11 +242,90 @@ You may specify a version by using the ``script_version`` flag in the ``load_dat >>> script_version="main" # tag name, or branch name, or commit hash >>> ) +You can find more information in the guide on :doc:`how to load a dataset ` + +.. _canonical-dataset: + +Sharing a "canonical" dataset +-------------------------------- + +Add your dataset to the GitHub repository +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To add a "canonical" dataset to the library, you need to go through the following steps: + +**1. Fork the** `🤗 Datasets repository `__ by clicking on the 'Fork' button on the repository's home page. This creates a copy of the code under your GitHub user account. + +**2. Clone your fork** to your local disk, and add the base repository as a remote: + +.. 
code:: + + git clone https://github.com//datasets + cd datasets + git remote add upstream https://github.com/huggingface/datasets.git + + +**3. Create a new branch** to hold your development changes: + +.. code:: + + git checkout -b my-new-dataset + +.. note:: + + **Do not** work on the ``master`` branch. + +**4. Set up a development environment** by running the following command **in a virtual environment**: + +.. code:: + + pip install -e ".[dev]" + +.. note:: + + If 🤗 Datasets was already installed in the virtual environment, remove + it with ``pip uninstall datasets`` before reinstalling it in editable + mode with the ``-e`` flag. + +**5. Create a new folder with your dataset name** inside the `datasets folder `__ of the repository and add the dataset script you wrote and tested while following the instructions on the :doc:`add_dataset` page. + +**6. Format your code.** Run black and isort so that your newly added files look nice with the following command: + +.. code:: + + make style + make quality + + +**7.** Once you're happy with your dataset script file, add your changes and make a commit to **record your changes locally**: + +.. code:: + + git add datasets/ + git commit + +It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes: + +.. code:: + + git fetch upstream + git rebase upstream/master + +Push the changes to your account using: + +.. code:: + + git push -u origin my-new-dataset + +**8.** We also recommend adding **tests** and **metadata** to the dataset script if possible. Go through the :ref:`adding-tests` section to do so. + +**9.** Once you are satisfied with the dataset, go the webpage of your fork on GitHub and click on "Pull request" to **open a pull-request** on the `main github repository `__ for review. + .. _adding-tests: Adding tests and metadata to the dataset ---------------------------------------------- +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We recommend adding testing data and checksum metadata to your dataset so its behavior can be tested and verified, and the generated dataset can be certified. In this section we'll explain how you can add two objects to the repository to do just that: @@ -302,7 +338,7 @@ We recommend adding testing data and checksum metadata to your dataset so its be In the rest of this section, you should make sure that you run all of the commands **from the root** of your local ``datasets`` repository. 1. Adding metadata -^^^^^^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~~~~~~ You can check that the new dataset loading script works correctly and create the ``dataset_infos.json`` file at the same time by running the command: @@ -373,7 +409,7 @@ If the command was succesful, you should now have a ``dataset_infos.json`` file } 2. Adding dummy data -^^^^^^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~~~~~~ Now that we have the metadata prepared we can also create some dummy data for automated testing. You can use the following command to get in-detail instructions on how to create the dummy data: @@ -465,7 +501,7 @@ Usage of the command: 3. Testing -^^^^^^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~~~~~~ Now test that both the real data and the dummy data work correctly. Go back to the root of your datasets folder and use the following command: @@ -496,3 +532,56 @@ and make sure you follow the exact instructions provided by the command. - Your datascript might require a difficult dummy data structure. 
In this case make sure you fully understand the data folder logit created by the function ``_split_generators(...)`` and expected by the function ``_generate_examples(...)`` of your dataset script. Also take a look at `tests/README.md` which lists different possible cases of how the dummy data should be created. - If the dummy data tests still fail, open a PR in the main repository on github and make a remark in the description that you need help creating the dummy data and we will be happy to help you. + + +Add a Dataset Card +-------------------------------- + +Once your dataset is ready for sharing, feel free to write and add a Dataset Card to document your dataset. + +The Dataset Card is a file ``README.md`` file that you may add in your dataset repository. + +At the top of the Dataset Card, you can define the metadata of your dataset for discoverability: + +- annotations_creators +- language_creators +- languages +- licenses +- multilinguality +- pretty_name +- size_categories +- source_datasets +- task_categories +- task_ids +- paperswithcode_id + +It may contain diverse sections to document all the relevant aspects of your dataset: + +- Dataset Description + - Dataset Summary + - Supported Tasks and Leaderboards + - Languages +- Dataset Structure + - Data Instances + - Data Fields + - Data Splits +- Dataset Creation + - Curation Rationale + - Source Data + - Initial Data Collection and Normalization + - Who are the source language producers? + - Annotations + - Annotation process + - Who are the annotators? + - Personal and Sensitive Information +- Considerations for Using the Data + - Social Impact of Dataset + - Discussion of Biases + - Other Known Limitations +- Additional Information + - Dataset Curators + - Licensing Information + - Citation Information + - Contributions + +You can find more information about each section in the `Dataset Card guide `_. diff --git a/setup.py b/setup.py index 8a02f800f15..a0f2ee332e3 100644 --- a/setup.py +++ b/setup.py @@ -97,9 +97,11 @@ "importlib_metadata;python_version<'3.8'", # to save datasets locally or on any filesystem # minimum 2021.05.0 to have the AbstractArchiveFileSystem - "fsspec>=2021.05.0", + "fsspec[http]>=2021.05.0", + # for data streaming via http + "aiohttp", # To get datasets from the Datasets Hub on huggingface.co - "huggingface_hub<0.1.0", + "huggingface_hub>=0.0.14,<0.1.0", # Utilities from PyPA to e.g., compare versions "packaging", ] @@ -117,7 +119,6 @@ "pytest", "pytest-xdist", # optional dependencies - "aiohttp", "apache-beam>=2.26.0", "elasticsearch", "aiobotocore==1.2.2", @@ -192,7 +193,7 @@ "botocore==1.19.52", "s3fs", ], - "streaming": ["aiohttp"], + "streaming": [], # for backward compatibility "dev": TESTS_REQUIRE + QUALITY_REQUIRE, "tests": TESTS_REQUIRE, "quality": QUALITY_REQUIRE, diff --git a/src/datasets/builder.py b/src/datasets/builder.py index 5854bdb10d8..4b9977f2655 100644 --- a/src/datasets/builder.py +++ b/src/datasets/builder.py @@ -26,10 +26,12 @@ import urllib from dataclasses import dataclass from functools import partial -from typing import Dict, Mapping, Optional, Sequence, Tuple, Union +from pathlib import PurePath +from typing import Dict, List, Mapping, Optional, Sequence, Tuple, Union from datasets.features import Features from datasets.utils.mock_download_manager import MockDownloadManager +from datasets.utils.py_utils import map_nested from . 
import config, utils from .arrow_dataset import Dataset @@ -43,9 +45,10 @@ from .splits import Split, SplitDict, SplitGenerator from .utils import logging from .utils.download_manager import DownloadManager, GenerateMode -from .utils.file_utils import DownloadConfig, is_remote_url, request_etags +from .utils.file_utils import DownloadConfig, is_relative_path, is_remote_url, request_etags, url_or_path_join from .utils.filelock import FileLock from .utils.info_utils import get_size_checksum_dict, verify_checksums, verify_splits +from .utils.streaming_download_manager import StreamingDownloadManager logger = logging.get_logger(__name__) @@ -109,6 +112,7 @@ def create_config_id( config_kwargs: dict, custom_features: Optional[Features] = None, use_auth_token: Optional[Union[bool, str]] = None, + base_path: Optional[Union[bool, str]] = None, ) -> str: """ The config id is used to build the cache directory. @@ -163,6 +167,12 @@ def create_config_id( } else: raise ValueError("Please provide a valid `data_files` in `DatasetBuilder`") + + def abspath(data_file) -> str: + data_file = data_file.as_posix() if isinstance(data_file, PurePath) else str(data_file) + return url_or_path_join(base_path, data_file) if is_relative_path(data_file) else data_file + + data_files: Dict[str, List[str]] = map_nested(abspath, data_files) remote_urls = [ data_file for key in data_files for data_file in data_files[key] if is_remote_url(data_file) ] @@ -392,7 +402,10 @@ def _create_builder_config(self, name=None, custom_features=None, **config_kwarg # compute the config id that is going to be used for caching config_id = builder_config.create_config_id( - config_kwargs, custom_features=custom_features, use_auth_token=self.use_auth_token + config_kwargs, + custom_features=custom_features, + use_auth_token=self.use_auth_token, + base_path=self.base_path if self.base_path is not None else "", ) is_custom = config_id not in self.builder_configs if is_custom: @@ -590,12 +603,17 @@ def incomplete_dir(dirname): # Print is intentional: we want this to always go to stdout so user has # information needed to cancel download/preparation if needed. # This comes right before the progress bar. - print( - f"Downloading and preparing dataset {self.info.builder_name}/{self.info.config_name} " - f"(download: {utils.size_str(self.info.download_size)}, generated: {utils.size_str(self.info.dataset_size)}, " - f"post-processed: {utils.size_str(self.info.post_processing_size)}, " - f"total: {utils.size_str(self.info.size_in_bytes)}) to {self._cache_dir}..." - ) + if self.info.size_in_bytes: + print( + f"Downloading and preparing dataset {self.info.builder_name}/{self.info.config_name} " + f"(download: {utils.size_str(self.info.download_size)}, generated: {utils.size_str(self.info.dataset_size)}, " + f"post-processed: {utils.size_str(self.info.post_processing_size)}, " + f"total: {utils.size_str(self.info.size_in_bytes)}) to {self._cache_dir}..." + ) + else: + print( + f"Downloading and preparing dataset {self.info.builder_name}/{self.info.config_name} to {self._cache_dir}..." 
+ ) self._check_manual_download(dl_manager) @@ -917,13 +935,6 @@ def as_streaming_dataset( ) -> Union[Dict[str, IterableDataset], IterableDataset]: if not isinstance(self, (GeneratorBasedBuilder, ArrowBasedBuilder)): raise ValueError(f"Builder {self.name} is not streamable.") - if not config.AIOHTTP_AVAILABLE: - raise ImportError( - f"To be able to use dataset streaming, you need to install dependencies like aiohttp " - f'using "pip install \'datasets[streaming]\'" or "pip install aiohttp" for instance' - ) - - from .utils.streaming_download_manager import StreamingDownloadManager dl_manager = StreamingDownloadManager( base_path=base_path or self.base_path, diff --git a/src/datasets/config.py b/src/datasets/config.py index 6b29c6ac8ef..ebfbda3cf73 100644 --- a/src/datasets/config.py +++ b/src/datasets/config.py @@ -21,7 +21,8 @@ REPO_METRICS_URL = "https://raw.githubusercontent.com/huggingface/datasets/{version}/metrics/{path}/{name}" # Hub -HUB_DATASETS_URL = "https://huggingface.co/datasets/{path}/resolve/{version}/{name}" +HF_ENDPOINT = os.environ.get("HF_ENDPOINT", "https://huggingface.co") +HUB_DATASETS_URL = HF_ENDPOINT + "/datasets/{path}/resolve/{version}/{name}" HUB_DEFAULT_VERSION = "main" PY_VERSION = version.parse(platform.python_version()) @@ -184,6 +185,5 @@ # Streaming -AIOHTTP_AVAILABLE = importlib.util.find_spec("aiohttp") is not None STREAMING_READ_MAX_RETRIES = 3 STREAMING_READ_RETRY_INTERVAL = 1 diff --git a/src/datasets/load.py b/src/datasets/load.py index 20005f06257..acca6d925b9 100644 --- a/src/datasets/load.py +++ b/src/datasets/load.py @@ -16,6 +16,7 @@ # Lint as: python3 """Access datasets.""" import filecmp +import glob import importlib import inspect import json @@ -23,11 +24,14 @@ import re import shutil import time -from pathlib import Path -from typing import List, Mapping, Optional, Sequence, Tuple, Type, Union +from collections import Counter +from pathlib import Path, PurePath +from typing import Dict, List, Mapping, Optional, Sequence, Tuple, Type, Union from urllib.parse import urlparse import fsspec +import huggingface_hub +from huggingface_hub import HfApi from . 
import config from .arrow_dataset import Dataset @@ -37,18 +41,21 @@ from .filesystems import extract_path_from_uri, is_remote_filesystem from .iterable_dataset import IterableDataset from .metric import Metric -from .packaged_modules import _PACKAGED_DATASETS_MODULES, hash_python_lines +from .naming import camelcase_to_snakecase +from .packaged_modules import _EXTENSION_TO_MODULE, _PACKAGED_DATASETS_MODULES, hash_python_lines from .splits import Split +from .streaming import extend_module_for_streaming from .tasks import TaskTemplate from .utils.download_manager import GenerateMode from .utils.file_utils import ( DownloadConfig, cached_path, head_hf_s3, - hf_bucket_url, hf_github_url, hf_hub_url, init_hf_modules, + is_relative_path, + is_remote_url, relative_to_absolute_path, url_or_path_join, url_or_path_parent, @@ -56,13 +63,10 @@ from .utils.filelock import FileLock from .utils.info_utils import is_small_dataset from .utils.logging import get_logger +from .utils.py_utils import NestedDataStructure from .utils.version import Version -if config.AIOHTTP_AVAILABLE: - from .streaming import extend_module_for_streaming - - logger = get_logger(__name__) @@ -221,6 +225,120 @@ def get_imports(file_path: str): return imports +def _resolve_data_files_locally_or_by_urls( + base_path: str, patterns: Union[str, List[str], Dict], allowed_extensions: Optional[list] = None +) -> Union[List[Path], Dict]: + """ + Return the absolute paths to all the files that match the given patterns. + It also supports absolute paths in patterns. + If an URL is passed, it is returned as is.""" + data_files_ignore = ["README.md", "config.json"] + if isinstance(patterns, str): + if is_remote_url(patterns): + return [patterns] + if is_relative_path(patterns): + glob_iter = list(Path(base_path).rglob(patterns)) + else: + glob_iter = [Path(filepath) for filepath in glob.glob(patterns)] + + matched_paths = [ + filepath.resolve() + for filepath in glob_iter + if filepath.name not in data_files_ignore and not filepath.name.startswith(".") and filepath.is_file() + ] + if allowed_extensions is not None: + out = [ + filepath + for filepath in matched_paths + if any(suffix[1:] in allowed_extensions for suffix in filepath.suffixes) + ] + if len(out) < len(matched_paths): + invalid_matched_files = list(set(matched_paths) - set(out)) + logger.info( + f"Some files matched the pattern '{patterns}' at {Path(base_path).resolve()} but don't have valid data file extensions: {invalid_matched_files}" + ) + else: + out = matched_paths + if not out: + error_msg = f"Unable to resolve any data file that matches '{patterns}' at {Path(base_path).resolve()}" + if allowed_extensions is not None: + error_msg += f" with any supported extension {list(allowed_extensions)}" + raise FileNotFoundError(error_msg) + return out + elif isinstance(patterns, dict): + return { + k: _resolve_data_files_locally_or_by_urls(base_path, v, allowed_extensions=allowed_extensions) + for k, v in patterns.items() + } + else: + return sum( + [ + _resolve_data_files_locally_or_by_urls(base_path, pattern, allowed_extensions=allowed_extensions) + for pattern in patterns + ], + [], + ) + + +def _resolve_data_files_in_dataset_repository( + dataset_info: huggingface_hub.hf_api.DatasetInfo, + patterns: Union[str, List[str], Dict], + allowed_extensions: Optional[list] = None, +) -> Union[List[PurePath], Dict]: + data_files_ignore = ["README.md", "config.json"] + if isinstance(patterns, str): + all_data_files = [ + PurePath("/" + dataset_file.rfilename) for dataset_file in 
dataset_info.siblings + ] # add a / at the beginning to make the pattern **/* match files at the root + matched_paths = [ + filepath.relative_to("/") + for filepath in all_data_files + if filepath.name not in data_files_ignore + and not filepath.name.startswith(".") + and filepath.match(patterns) + ] + if allowed_extensions is not None: + out = [ + filepath + for filepath in matched_paths + if any(suffix[1:] in allowed_extensions for suffix in filepath.suffixes) + ] + if len(out) < len(matched_paths): + invalid_matched_files = list(set(matched_paths) - set(out)) + logger.info( + f"Some files matched the pattern {patterns} in dataset repository {dataset_info.id} but don't have valid data file extensions: {invalid_matched_files}" + ) + else: + out = matched_paths + if not out: + error_msg = f"Unable to resolve data_file {patterns} in dataset repository {dataset_info.id}" + if allowed_extensions is not None: + error_msg += f" with any supported extension {list(allowed_extensions)}" + raise FileNotFoundError(error_msg) + return out + elif isinstance(patterns, dict): + return { + k: _resolve_data_files_in_dataset_repository(dataset_info, v, allowed_extensions=allowed_extensions) + for k, v in patterns.items() + } + else: + return sum( + [ + _resolve_data_files_in_dataset_repository(dataset_info, pattern, allowed_extensions=allowed_extensions) + for pattern in patterns + ], + [], + ) + + +def _infer_module_for_data_files(data_files: Union[PurePath, List[PurePath], Dict]) -> Optional[str]: + extensions_counter = Counter( + suffix[1:] for filepath in NestedDataStructure(data_files).flatten() for suffix in filepath.suffixes + ) + if extensions_counter: + return _EXTENSION_TO_MODULE[extensions_counter.most_common(1)[0][0]] + + def prepare_module( path: str, script_version: Optional[Union[str, Version]] = None, @@ -230,6 +348,8 @@ def prepare_module( force_local_path: Optional[str] = None, dynamic_modules_path: Optional[str] = None, return_resolved_file_path: bool = False, + return_associated_base_path: bool = False, + data_files: Optional[Union[Dict, List, str]] = None, **download_kwargs, ) -> Union[Tuple[str, str], Tuple[str, str, Optional[str]]]: r""" @@ -239,10 +359,31 @@ def prepare_module( and using cloudpickle (among other things). Args: - path (str): - path to the dataset or metric script, can be either: - - a path to a local directory containing the dataset processing python script - - an url to a github or S3 directory with a dataset processing python script + + path (str): Path or name of the dataset, or path to a metric script. + Depending on ``path``, the module that is returned id either generic moduler (csv, json, text etc.) or a module defined defined a dataset or metric script (a python file). + + For local datasets: + + - if ``path`` is a local directory (but doesn't contain a dataset script) + -> load a generic module (csv, json, text etc.) based on the content of the directory + e.g. ``'./path/to/directory/with/my/csv/data'``. + - if ``path`` is a local dataset or metric script or a directory containing a local dataset or metric script (if the script has the same name as the directory): + -> load the module from the dataset or metric script + e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'``. 
+ + For datasets on the Hugging Face Hub (list all available datasets and ids with ``datasets.list_datasets()``) + + - if ``path`` is a canonical dataset or metric on the HF Hub (ex: `glue`, `squad`) + -> load the module from the dataset or metric script in the github repository at huggingface/datasets + e.g. ``'squad'`` or ``'glue'`` or ``accuracy``. + - if ``path`` is a dataset repository on the HF hub (without a dataset script) + -> load a generic module (csv, text etc.) based on the content of the repository + e.g. ``'username/dataset_name'``, a dataset repository on the HF hub containing your data files. + - if ``path`` is a dataset repository on the HF hub with a dataset script (if the script has the same name as the directory) + -> load the module from the dataset script in the dataset repository + e.g. ``'username/dataset_name'``, a dataset repository on the HF hub containing a dataset script `'dataset_name.py'`. + script_version (Optional ``Union[str, datasets.Version]``): If specified, the module will be loaded from the datasets repository at this version. By default: @@ -259,6 +400,10 @@ def prepare_module( By default the datasets and metrics are stored inside the `datasets_modules` module. return_resolved_file_path (Optional bool, defaults to False): If True, the url or path to the resolved dataset or metric script is returned with the other ouputs + return_associated_base_path (Optional bool, defaults to False): + If True, the base path associated to the dataset is returned with the other ouputs. + It corresponds to the directory or base url where the dataset script/dataset repo is at. + data_files (:obj:`Union[Dict, List, str]`, optional): Defining the data_files of the dataset configuration. download_kwargs: optional attributes for DownloadConfig() which will override the attributes in download_config if supplied. Returns: @@ -282,15 +427,20 @@ def prepare_module( short_name = name[:-3] # first check if the module is packaged with the `datasets` package - if dataset and path in _PACKAGED_DATASETS_MODULES: + def prepare_packaged_module(name): try: - head_hf_s3(path, filename=name, dataset=dataset, max_retries=download_config.max_retries) + head_hf_s3(name, filename=name + ".py", dataset=dataset, max_retries=download_config.max_retries) except Exception: - logger.debug(f"Couldn't head HF s3 for packaged dataset module '{path}'. Running in offline mode.") - module_path, hash = _PACKAGED_DATASETS_MODULES[path] + logger.debug(f"Couldn't head HF s3 for packaged dataset module '{name}'. 
Running in offline mode.") + return _PACKAGED_DATASETS_MODULES[name] + + if dataset and path in _PACKAGED_DATASETS_MODULES: + output = prepare_packaged_module(path) if return_resolved_file_path: - return module_path, hash, None - return module_path, hash + output += (None,) + if return_associated_base_path: + output += (None,) + return output # otherwise the module is added to the dynamic modules dynamic_modules_path = dynamic_modules_path if dynamic_modules_path else init_dynamic_modules() @@ -305,22 +455,52 @@ def prepare_module( else: main_folder_path = force_local_path - # We have three ways to find the processing file: - # - if os.path.join(path, name) is a file or a remote url - # - if path is a file or a remote url - # - otherwise we assume path/name is a path to our S3 bucket - combined_path = path if path.endswith(name) else os.path.join(path, name) - - if os.path.isfile(combined_path): + # We have several ways to find the processing file: + # - if os.path.join(path, name) is a local python file + # -> use the module from the python file + # - if path is a local directory (but no python file) + # -> use a packaged module (csv, text etc.) based on content of the directory + # - if path has no "/" and is a module on github (in /datasets or in /metrics) + # -> use the module from the python file on github + # - if path has one "/" and is dataset repository on the HF hub with a python file + # -> the module from the python file in the dataset repository + # - if path has one "/" and is dataset repository on the HF hub without a python file + # -> use a packaged module (csv, text etc.) based on content of the repository + resource_type = "dataset" if dataset else "metric" + combined_path = os.path.join(path, name) + if path.endswith(name): + if os.path.isfile(path): + file_path = path + local_path = path + base_path = os.path.dirname(path) + else: + raise FileNotFoundError(f"Couldn't find a {resource_type} script at {relative_to_absolute_path(path)}") + elif os.path.isfile(combined_path): file_path = combined_path local_path = combined_path + base_path = path elif os.path.isfile(path): file_path = path local_path = path + base_path = os.path.dirname(path) + elif os.path.isdir(path): + resolved_data_files = _resolve_data_files_locally_or_by_urls( + path, data_files or "*", allowed_extensions=_EXTENSION_TO_MODULE.keys() + ) + infered_module_name = _infer_module_for_data_files(resolved_data_files) + if not infered_module_name: + raise FileNotFoundError(f"No data files or {resource_type} script found in local directory {path}") + output = prepare_packaged_module(infered_module_name) + if return_resolved_file_path: + output += (None,) + if return_associated_base_path: + output += (path,) + return output else: - # Try github (canonical datasets/metrics) and then S3 (users datasets/metrics) + # Try github (canonical datasets/metrics) and then HF Hub (community datasets) combined_path_abs = relative_to_absolute_path(combined_path) + expected_dir_for_combined_path_abs = os.path.dirname(combined_path_abs) try: head_hf_s3(path, filename=name, dataset=dataset, max_retries=download_config.max_retries) script_version = str(script_version) if script_version is not None else None @@ -331,9 +511,8 @@ def prepare_module( except FileNotFoundError: if script_version is not None: raise FileNotFoundError( - "Couldn't find remote file with version {} at {}. 
Please provide a valid version and a valid {} name.".format( - script_version, file_path, "dataset" if dataset else "metric" - ) + f"Couldn't find a directory or a {resource_type} named '{path}' using version {script_version}. " + f"It doesn't exist locally at {expected_dir_for_combined_path_abs} or remotely at {file_path}" ) else: github_file_path = file_path @@ -341,36 +520,55 @@ def prepare_module( try: local_path = cached_path(file_path, download_config=download_config) logger.warning( - "Couldn't find file locally at {}, or remotely at {}.\n" - "The file was picked from the master branch on github instead at {}.".format( - combined_path_abs, github_file_path, file_path - ) + f"Couldn't find a directory or a {resource_type} named '{path}'. " + f"It was picked from the master branch on github instead at {file_path}" ) except FileNotFoundError: raise FileNotFoundError( - "Couldn't find file locally at {}, or remotely at {}.\n" - "The file is also not present on the master branch on github.".format( - combined_path_abs, github_file_path - ) + f"Couldn't find a directory or a {resource_type} named '{path}'. " + f"It doesn't exist locally at {expected_dir_for_combined_path_abs} or remotely at {github_file_path}" ) elif path.count("/") == 1: # users datasets/metrics: s3 path (hub for datasets and s3 for metrics) - if dataset: - file_path = hf_hub_url(path=path, name=name, version=script_version) - else: - file_path = hf_bucket_url(path, filename=name, dataset=False) + file_path = hf_hub_url(path=path, name=name, version=script_version) + if not dataset: + # We don't have community metrics on the HF Hub + raise FileNotFoundError( + f"Couldn't find a {resource_type} in a directory at '{path}'. " + f"It doesn't exist locally at {combined_path_abs}" + ) try: local_path = cached_path(file_path, download_config=download_config) except FileNotFoundError: - raise FileNotFoundError( - "Couldn't find file locally at {}, or remotely at {}. Please provide a valid {} name.".format( - combined_path_abs, file_path, "dataset" if dataset else "metric" + hf_api = HfApi(config.HF_ENDPOINT) + try: + dataset_info = hf_api.dataset_info( + repo_id=path, revision=script_version, token=download_config.use_auth_token + ) + except Exception: + raise FileNotFoundError( + f"Couldn't find a directory or a {resource_type} named '{path}'. " + f"It doesn't exist locally at {expected_dir_for_combined_path_abs} or remotely on {hf_api.endpoint}/datasets" ) + resolved_data_files = _resolve_data_files_in_dataset_repository( + dataset_info, + data_files if data_files is not None else "*", + allowed_extensions=_EXTENSION_TO_MODULE.keys(), ) + infered_module_name = _infer_module_for_data_files(resolved_data_files) + if not infered_module_name: + raise FileNotFoundError( + f"No data files found in dataset repository '{path}'. Local directory at {expected_dir_for_combined_path_abs} doesn't exist either." + ) + output = prepare_packaged_module(infered_module_name) + if return_resolved_file_path: + output += (None,) + if return_associated_base_path: + output += (url_or_path_parent(file_path),) + return output else: raise FileNotFoundError( - "Couldn't find file locally at {}. Please provide a valid {} name.".format( - combined_path_abs, "dataset" if dataset else "metric" - ) + f"Couldn't find a {resource_type} directory at '{path}'. " + f"It doesn't exist locally at {expected_dir_for_combined_path_abs}" ) except Exception as e: # noqa: all the attempts failed, before raising the error we should check if the module already exists. 
if os.path.isdir(main_folder_path): @@ -389,11 +587,14 @@ def _get_modification_time(module_hash): f"(last modified on {time.ctime(_get_modification_time(hash))}) since it " f"couldn't be found locally at {combined_path_abs}, or remotely ({type(e).__name__})." ) + output = (module_path, hash) if return_resolved_file_path: with open(os.path.join(main_folder_path, hash, short_name + ".json")) as cache_metadata: file_path = json.load(cache_metadata)["original file path"] - return module_path, hash, file_path - return module_path, hash + output += (file_path,) + if return_associated_base_path: + output += (url_or_path_parent(file_path),) + return output raise # Load the module in two steps: @@ -563,9 +764,12 @@ def _get_modification_time(module_hash): # make the new module to be noticed by the import system importlib.invalidate_caches() + output = (module_path, hash) if return_resolved_file_path: - return module_path, hash, file_path - return module_path, hash + output += (file_path,) + if return_associated_base_path: + output += (base_path,) + return output def load_metric( @@ -651,12 +855,31 @@ def load_dataset_builder( Args: - path (:obj:`str`): Path to the dataset processing script with the dataset builder. Can be either: + path (:obj:`str`): Path or name of the dataset. + Depending on ``path``, the dataset builder that is returned id either generic dataset builder (csv, json, text etc.) or a dataset builder defined defined a dataset script (a python file). - - a local path to processing script or the directory containing the script (if the script has the same name as the directory), + For local datasets: + + - if ``path`` is a local directory (but doesn't contain a dataset script) + -> load a generic dataset builder (csv, json, text etc.) based on the content of the directory + e.g. ``'./path/to/directory/with/my/csv/data'``. + - if ``path`` is a local dataset script or a directory containing a local dataset script (if the script has the same name as the directory): + -> load the dataset builder from the dataset script e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'``. - - a dataset identifier in the HuggingFace Datasets Hub (list all available datasets and ids with ``datasets.list_datasets()``) - e.g. ``'squad'``, ``'glue'`` or ``'openai/webtext'``. + + For datasets on the Hugging Face Hub (list all available datasets and ids with ``datasets.list_datasets()``) + + - if ``path`` is a canonical dataset on the HF Hub (ex: `glue`, `squad`) + -> load the dataset builder from the dataset script in the github repository at huggingface/datasets + e.g. ``'squad'`` or ``'glue'``. + - if ``path`` is a dataset repository on the HF hub (without a dataset script) + -> load a generic dataset builder (csv, text etc.) based on the content of the repository + e.g. ``'username/dataset_name'``, a dataset repository on the HF hub containing your data files. + - if ``path`` is a dataset repository on the HF hub with a dataset script (if the script has the same name as the directory) + -> load the dataset builder from the dataset script in the dataset repository + e.g. ``'username/dataset_name'``, a dataset repository on the HF hub containing a dataset script `'dataset_name.py'`. + + name (:obj:`str`, optional): Defining the name of the dataset configuration. data_dir (:obj:`str`, optional): Defining the data_dir of the dataset configuration. data_files (:obj:`str` or :obj:`Sequence` or :obj:`Mapping`, optional): Path(s) to source data file(s). 
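To make the resolution rules described in this docstring concrete, here is a minimal usage sketch with the packaged ``csv`` builder; the glob pattern is a placeholder, and for packaged builders ``data_files`` must be provided, as the error message added below points out:

.. code-block::

    >>> from datasets import load_dataset_builder
    >>> # packaged "csv" builder: data_files is required since there is no dataset script
    >>> builder = load_dataset_builder('csv', data_files={'train': 'path/to/data/train/*.csv'})
    >>> builder.download_and_prepare()
    >>> dataset = builder.as_dataset(split='train')

This mirrors what :func:`datasets.load_dataset` does internally once the builder has been created.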
@@ -678,24 +901,51 @@ def load_dataset_builder( """ # Download/copy dataset processing script - module_path, hash, resolved_file_path = prepare_module( + module_path, hash, base_path = prepare_module( path, script_version=script_version, download_config=download_config, download_mode=download_mode, dataset=True, - return_resolved_file_path=True, + return_associated_base_path=True, use_auth_token=use_auth_token, + data_files=data_files, ) # Get dataset builder class from the processing script builder_cls = import_main_class(module_path, dataset=True) - # Set the base path for downloads as the parent of the script location - if resolved_file_path is not None: - base_path = url_or_path_parent(resolved_file_path) - else: - base_path = None + # For packaged builder used to load data from a dataset repository or dataset directory (no dataset script) + if module_path.startswith("datasets.") and path not in _PACKAGED_DATASETS_MODULES: + # Add a nice name to the configuratiom + if name is None: + name = path.split("/")[-1].split(os.sep)[-1] + # Resolve the data files + allowed_extensions = [ + extension + for extension in _EXTENSION_TO_MODULE + if _EXTENSION_TO_MODULE[extension] == camelcase_to_snakecase(builder_cls.__name__) + ] + data_files = data_files if data_files is not None else "*" + if base_path.startswith(config.HF_ENDPOINT): + dataset_info = HfApi(config.HF_ENDPOINT).dataset_info(path, revision=script_version, token=use_auth_token) + data_files = _resolve_data_files_in_dataset_repository( + dataset_info, data_files, allowed_extensions=allowed_extensions + ) + else: # local dir + data_files = _resolve_data_files_locally_or_by_urls( + path, data_files, allowed_extensions=allowed_extensions + ) + elif path in _PACKAGED_DATASETS_MODULES: + if data_files is None: + error_msg = f"Please specify the data files to load for the {path} dataset builder." + example_extensions = [ + extension for extension in _EXTENSION_TO_MODULE if _EXTENSION_TO_MODULE[extension] == path + ] + if example_extensions: + error_msg += f'\nFor example `data_files={{"train": "path/to/data/train/*.{example_extensions[0]}"}}`' + raise ValueError(error_msg) + data_files = _resolve_data_files_locally_or_by_urls(".", data_files) # Instantiate the dataset builder builder_instance: DatasetBuilder = builder_cls( @@ -755,14 +1005,35 @@ def load_dataset( 3. Return a dataset built from the requested splits in ``split`` (default: all). + It also allows to load a dataset from a local directory or a dataset repository on the Hugging Face Hub without dataset script. + In this case, it automatically loads all the data files from the directory or the dataset repository. + Args: - path (:obj:`str`): Path to the dataset processing script with the dataset builder. Can be either: + path (:obj:`str`): Path or name of the dataset. + Depending on ``path``, the dataset builder that is returned id either generic dataset builder (csv, json, text etc.) or a dataset builder defined defined a dataset script (a python file). + + For local datasets: - - a local path to processing script or the directory containing the script (if the script has the same name as the directory), + - if ``path`` is a local directory (but doesn't contain a dataset script) + -> load a generic dataset builder (csv, json, text etc.) based on the content of the directory + e.g. ``'./path/to/directory/with/my/csv/data'``. 
+ - if ``path`` is a local dataset script or a directory containing a local dataset script (if the script has the same name as the directory): + -> load the dataset builder from the dataset script e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'``. - - a dataset identifier in the HuggingFace Datasets Hub (list all available datasets and ids with ``datasets.list_datasets()``) - e.g. ``'squad'``, ``'glue'`` or ``'openai/webtext'``. + + For datasets on the Hugging Face Hub (list all available datasets and ids with ``datasets.list_datasets()``) + + - if ``path`` is a canonical dataset on the HF Hub (ex: `glue`, `squad`) + -> load the dataset builder from the dataset script in the github repository at huggingface/datasets + e.g. ``'squad'`` or ``'glue'``. + - if ``path`` is a dataset repository on the HF hub (without a dataset script) + -> load a generic dataset builder (csv, text etc.) based on the content of the repository + e.g. ``'username/dataset_name'``, a dataset repository on the HF hub containing your data files. + - if ``path`` is a dataset repository on the HF hub with a dataset script (if the script has the same name as the directory) + -> load the dataset builder from the dataset script in the dataset repository + e.g. ``'username/dataset_name'``, a dataset repository on the HF hub containing a dataset script `'dataset_name.py'`. + name (:obj:`str`, optional): Defining the name of the dataset configuration. data_dir (:obj:`str`, optional): Defining the data_dir of the dataset configuration. data_files (:obj:`str` or :obj:`Sequence` or :obj:`Mapping`, optional): Path(s) to source data file(s). @@ -808,27 +1079,19 @@ def load_dataset( """ ignore_verifications = ignore_verifications or save_infos - # Check streaming - if streaming: - if not config.AIOHTTP_AVAILABLE: - raise ImportError( - f"To be able to use dataset streaming, you need to install dependencies like aiohttp " - f'using "pip install \'datasets[streaming]\'" or "pip install aiohttp" for instance' - ) - # Download/copy dataset processing script # Create a dataset builder builder_instance = load_dataset_builder( - path, - name, - data_dir, - data_files, - cache_dir, - features, - download_config, - download_mode, - script_version, - use_auth_token, + path=path, + name=name, + data_dir=data_dir, + data_files=data_files, + cache_dir=cache_dir, + features=features, + download_config=download_config, + download_mode=download_mode, + script_version=script_version, + use_auth_token=use_auth_token, **config_kwargs, ) diff --git a/src/datasets/packaged_modules/__init__.py b/src/datasets/packaged_modules/__init__.py index 097aa5d3852..91b115cd49b 100644 --- a/src/datasets/packaged_modules/__init__.py +++ b/src/datasets/packaged_modules/__init__.py @@ -31,3 +31,12 @@ def hash_python_lines(lines: List[str]) -> str: "parquet": (parquet.__name__, hash_python_lines(inspect.getsource(parquet).splitlines())), "text": (text.__name__, hash_python_lines(inspect.getsource(text).splitlines())), } + +_EXTENSION_TO_MODULE = { + "csv": "csv", + "tsv": "csv", + "json": "json", + "jsonl": "json", + "parquet": "parquet", + "txt": "text", +} diff --git a/src/datasets/utils/download_manager.py b/src/datasets/utils/download_manager.py index 33b40ebe995..a1f4c0110d7 100644 --- a/src/datasets/utils/download_manager.py +++ b/src/datasets/utils/download_manager.py @@ -214,6 +214,7 @@ def download(self, url_or_urls): return downloaded_path_or_paths.data def _download(self, url_or_filename: str, download_config: DownloadConfig) -> str: + 
url_or_filename = str(url_or_filename) if is_relative_path(url_or_filename): # append the relative path to the base_path url_or_filename = url_or_path_join(self._base_path, url_or_filename) diff --git a/src/datasets/utils/file_utils.py b/src/datasets/utils/file_utils.py index 0adfed535fa..38a2c5692f3 100644 --- a/src/datasets/utils/file_utils.py +++ b/src/datasets/utils/file_utils.py @@ -173,7 +173,7 @@ def hf_hub_url(path: str, name: str, version: Optional[str] = None) -> str: def url_or_path_join(base_name: str, *pathnames: str) -> str: if is_remote_url(base_name): - return posixpath.join(base_name, *pathnames) + return posixpath.join(base_name, *(str(pathname).lstrip("/") for pathname in pathnames)) else: return Path(base_name, *pathnames).as_posix() diff --git a/src/datasets/utils/streaming_download_manager.py b/src/datasets/utils/streaming_download_manager.py index 1a684a647fc..14c5d5a0034 100644 --- a/src/datasets/utils/streaming_download_manager.py +++ b/src/datasets/utils/streaming_download_manager.py @@ -120,6 +120,7 @@ def download(self, url_or_urls): return url_or_urls def _download(self, urlpath: str) -> str: + urlpath = str(urlpath) if is_relative_path(urlpath): # append the relative path to the base_path urlpath = url_or_path_join(self._base_path, urlpath) @@ -130,6 +131,7 @@ def extract(self, path_or_paths): return urlpaths def _extract(self, urlpath: str) -> str: + urlpath = str(urlpath) protocol = self._get_extraction_protocol(urlpath) if protocol is None: # no extraction @@ -151,7 +153,7 @@ def _get_extraction_protocol(self, urlpath: str) -> Optional[str]: extension = path.split(".")[-1] if extension in BASE_KNOWN_EXTENSIONS: return None - elif path.endswith(".tar.gz"): + elif path.endswith(".tar.gz") or path.endswith(".tgz"): pass elif extension in COMPRESSION_EXTENSION_TO_PROTOCOL: return COMPRESSION_EXTENSION_TO_PROTOCOL[extension] diff --git a/tests/test_features.py b/tests/test_features.py index d5f803beb39..9f8021bdd8b 100644 --- a/tests/test_features.py +++ b/tests/test_features.py @@ -335,9 +335,10 @@ def test_cast_to_python_objects_jax(self): "col_1": [{"vec": jnp.array(np.arange(1, 4)), "txt": "foo"}] * 3, "col_2": jnp.array(np.arange(1, 7).reshape(3, 2)), } + assert obj["col_2"].dtype == jnp.int32 expected_obj = { - "col_1": [{"vec": np.array([1, 2, 3]), "txt": "foo"}] * 3, - "col_2": np.array([[1, 2], [3, 4], [5, 6]]), + "col_1": [{"vec": np.array([1, 2, 3], dtype=np.int32), "txt": "foo"}] * 3, + "col_2": np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int32), } casted_obj = cast_to_python_objects(obj) dict_diff(casted_obj, expected_obj) diff --git a/tests/test_load.py b/tests/test_load.py index c61dd494cb2..633fde20580 100644 --- a/tests/test_load.py +++ b/tests/test_load.py @@ -6,11 +6,13 @@ import time from functools import partial from hashlib import sha256 +from pathlib import Path, PurePath from unittest import TestCase from unittest.mock import patch import pytest import requests +from huggingface_hub.hf_api import DatasetInfo import datasets from datasets import SCRIPTS_VERSION, load_dataset, load_from_disk @@ -19,15 +21,19 @@ from datasets.dataset_dict import DatasetDict, IterableDatasetDict from datasets.features import Features, Value from datasets.iterable_dataset import IterableDataset -from datasets.load import prepare_module -from datasets.utils.file_utils import DownloadConfig +from datasets.load import ( + _resolve_data_files_in_dataset_repository, + _resolve_data_files_locally_or_by_urls, + prepare_module, +) +from datasets.utils.file_utils 
import DownloadConfig, is_remote_url from .utils import ( OfflineSimulationMode, assert_arrow_memory_doesnt_increase, assert_arrow_memory_increases, offline, - require_streaming, + set_current_working_directory_to_temp_dir, ) @@ -58,12 +64,14 @@ def _generate_examples(self, filepath, **kwargs): """ SAMPLE_DATASET_IDENTIFIER = "lhoestq/test" +SAMPLE_DATASET_IDENTIFIER2 = "lhoestq/test2" SAMPLE_NOT_EXISTING_DATASET_IDENTIFIER = "lhoestq/_dummy" +SAMPLE_DATASET_NAME_THAT_DOESNT_EXIST = "_dummy" @pytest.fixture def data_dir(tmp_path): - data_dir = tmp_path / "data" + data_dir = tmp_path / "data_dir" data_dir.mkdir() with open(data_dir / "train.txt", "w") as f: f.write("foo\n" * 10) @@ -72,6 +80,22 @@ def data_dir(tmp_path): return str(data_dir) +@pytest.fixture +def complex_data_dir(tmp_path): + data_dir = tmp_path / "complex_data_dir" + data_dir.mkdir() + (data_dir / "data").mkdir() + with open(data_dir / "data" / "train.txt", "w") as f: + f.write("foo\n" * 10) + with open(data_dir / "data" / "test.txt", "w") as f: + f.write("bar\n" * 10) + with open(data_dir / "README.md", "w") as f: + f.write("This is a readme") + with open(data_dir / ".dummy", "w") as f: + f.write("this is a dummy file that is not a data file") + return str(data_dir) + + @pytest.fixture def dataset_loading_script_dir(tmp_path): script_name = DATASET_LOADING_SCRIPT_NAME @@ -188,7 +212,7 @@ def test_load_dataset_users(self): with self.assertRaises(FileNotFoundError) as context: datasets.load_dataset("lhoestq/_dummy") self.assertIn( - "https://huggingface.co/datasets/lhoestq/_dummy/resolve/main/_dummy.py", + "lhoestq/_dummy", str(context.exception), ) for offline_simulation_mode in list(OfflineSimulationMode): @@ -196,18 +220,79 @@ def test_load_dataset_users(self): with self.assertRaises(ConnectionError) as context: datasets.load_dataset("lhoestq/_dummy") self.assertIn( - "https://huggingface.co/datasets/lhoestq/_dummy/resolve/main/_dummy.py", + "lhoestq/_dummy", str(context.exception), ) -def test_load_dataset_builder(dataset_loading_script_dir, data_dir): +def test_load_dataset_builder_for_absolute_script_dir(dataset_loading_script_dir, data_dir): builder = datasets.load_dataset_builder(dataset_loading_script_dir, data_dir=data_dir) assert isinstance(builder, DatasetBuilder) assert builder.name == DATASET_LOADING_SCRIPT_NAME assert builder.info.features == Features({"text": Value("string")}) +def test_load_dataset_builder_for_relative_script_dir(dataset_loading_script_dir, data_dir): + with set_current_working_directory_to_temp_dir(): + relative_script_dir = DATASET_LOADING_SCRIPT_NAME + shutil.copytree(dataset_loading_script_dir, relative_script_dir) + builder = datasets.load_dataset_builder(relative_script_dir, data_dir=data_dir) + assert isinstance(builder, DatasetBuilder) + assert builder.name == DATASET_LOADING_SCRIPT_NAME + assert builder.info.features == Features({"text": Value("string")}) + + +def test_load_dataset_builder_for_script_path(dataset_loading_script_dir, data_dir): + builder = datasets.load_dataset_builder( + os.path.join(dataset_loading_script_dir, DATASET_LOADING_SCRIPT_NAME + ".py"), data_dir=data_dir + ) + assert isinstance(builder, DatasetBuilder) + assert builder.name == DATASET_LOADING_SCRIPT_NAME + assert builder.info.features == Features({"text": Value("string")}) + + +def test_load_dataset_builder_for_absolute_data_dir(complex_data_dir): + builder = datasets.load_dataset_builder(complex_data_dir) + assert isinstance(builder, DatasetBuilder) + assert builder.name == "text" + assert 
builder.config.name == Path(complex_data_dir).name + assert isinstance(builder.config.data_files, list) + assert len(builder.config.data_files) > 0 + + +def test_load_dataset_builder_for_relative_data_dir(complex_data_dir): + with set_current_working_directory_to_temp_dir(): + relative_data_dir = "relative_data_dir" + shutil.copytree(complex_data_dir, relative_data_dir) + builder = datasets.load_dataset_builder(relative_data_dir) + assert isinstance(builder, DatasetBuilder) + assert builder.name == "text" + assert builder.config.name == relative_data_dir + assert isinstance(builder.config.data_files, list) + assert len(builder.config.data_files) > 0 + + +def test_load_dataset_builder_for_community_dataset_with_script(): + builder = datasets.load_dataset_builder(SAMPLE_DATASET_IDENTIFIER) + assert isinstance(builder, DatasetBuilder) + assert builder.name == SAMPLE_DATASET_IDENTIFIER.split("/")[-1] + assert builder.info.features == Features({"text": Value("string")}) + + +def test_load_dataset_builder_for_community_dataset_without_script(): + builder = datasets.load_dataset_builder(SAMPLE_DATASET_IDENTIFIER2) + assert isinstance(builder, DatasetBuilder) + assert builder.name == "text" + assert builder.config.name == SAMPLE_DATASET_IDENTIFIER2.split("/")[-1] + assert isinstance(builder.config.data_files, list) + assert len(builder.config.data_files) > 0 + + +def test_load_dataset_builder_fail(): + with pytest.raises(FileNotFoundError): + datasets.load_dataset_builder("blabla") + + @pytest.mark.parametrize("keep_in_memory", [False, True]) def test_load_dataset_local(dataset_loading_script_dir, data_dir, keep_in_memory, caplog): with assert_arrow_memory_increases() if keep_in_memory else assert_arrow_memory_doesnt_increase(): @@ -224,12 +309,15 @@ def test_load_dataset_local(dataset_loading_script_dir, data_dir, keep_in_memory assert len(dataset) == 2 assert "Using the latest cached version of the module" in caplog.text with pytest.raises(FileNotFoundError) as exc_info: - datasets.load_dataset("_dummy") - m_combined_path = re.search(fr"\S*{re.escape(os.path.join('_dummy', '_dummy.py'))}\b", str(exc_info.value)) - assert m_combined_path is not None and os.path.isabs(m_combined_path.group()) + datasets.load_dataset(SAMPLE_DATASET_NAME_THAT_DOESNT_EXIST) + m_combined_path = re.search( + fr"http\S*{re.escape(SAMPLE_DATASET_NAME_THAT_DOESNT_EXIST + '/' + SAMPLE_DATASET_NAME_THAT_DOESNT_EXIST + '.py')}\b", + str(exc_info.value), + ) + assert m_combined_path is not None and is_remote_url(m_combined_path.group()) + assert os.path.abspath(SAMPLE_DATASET_NAME_THAT_DOESNT_EXIST) in str(exc_info.value) -@require_streaming def test_load_dataset_streaming(dataset_loading_script_dir, data_dir): dataset = load_dataset(dataset_loading_script_dir, streaming=True, data_dir=data_dir) assert isinstance(dataset, IterableDatasetDict) @@ -238,7 +326,6 @@ def test_load_dataset_streaming(dataset_loading_script_dir, data_dir): assert isinstance(next(iter(dataset["train"])), dict) -@require_streaming def test_load_dataset_streaming_gz_json(jsonl_gz_path): data_files = jsonl_gz_path ds = load_dataset("json", split="train", data_files=data_files, streaming=True) @@ -247,7 +334,6 @@ def test_load_dataset_streaming_gz_json(jsonl_gz_path): assert ds_item == {"col_1": "0", "col_2": 0, "col_3": 0.0} -@require_streaming @pytest.mark.parametrize( "path", ["sample.jsonl", "sample.jsonl.gz", "sample.tar", "sample.jsonl.xz", "sample.zip", "sample.jsonl.zst"] ) @@ -292,7 +378,6 @@ def assert_auth(url, *args, headers, **kwargs): 
mock_head.assert_called() -@require_streaming def test_loaded_streaming_dataset_has_use_auth_token(dataset_loading_script_dir, data_dir): from datasets.utils.streaming_download_manager import xopen @@ -384,3 +469,98 @@ def test_load_dataset_deletes_extracted_files(deleted, jsonl_gz_path, tmp_path): ds = load_dataset("json", split="train", data_files=data_files, cache_dir=cache_dir) assert ds[0] == {"col_1": "0", "col_2": 0, "col_3": 0.0} assert (sorted((cache_dir / "downloads" / "extracted").iterdir()) == []) is deleted + + +@pytest.mark.parametrize( + "pattern,size", [("*", 2), ("**/*", 2), ("*.txt", 2), ("data/*", 2), ("**/*.txt", 2), ("**/train.txt", 1)] +) +def test_resolve_data_files_locally_or_by_urls(complex_data_dir, pattern, size): + resolved_data_files = _resolve_data_files_locally_or_by_urls(complex_data_dir, pattern) + files_to_ignore = {".dummy", "README.md"} + expected_resolved_data_files = [ + path for path in Path(complex_data_dir).rglob(pattern) if path.name not in files_to_ignore and path.is_file() + ] + assert len(resolved_data_files) == size + assert sorted(resolved_data_files) == sorted(expected_resolved_data_files) + assert all(isinstance(path, Path) for path in resolved_data_files) + assert all(path.is_file() for path in resolved_data_files) + + +def test_resolve_data_files_locally_or_by_urls_with_absolute_path(tmp_path, complex_data_dir): + abs_path = os.path.join(complex_data_dir, "data", "train.txt") + resolved_data_files = _resolve_data_files_locally_or_by_urls(str(tmp_path / "blabla"), abs_path) + assert len(resolved_data_files) == 1 + + +@pytest.mark.parametrize("pattern,size,extensions", [("*", 2, ["txt"]), ("*", 2, None), ("*", 0, ["blablabla"])]) +def test_resolve_data_files_locally_or_by_urls_with_extensions(complex_data_dir, pattern, size, extensions): + if size > 0: + resolved_data_files = _resolve_data_files_locally_or_by_urls( + complex_data_dir, pattern, allowed_extensions=extensions + ) + assert len(resolved_data_files) == size + else: + with pytest.raises(FileNotFoundError): + _resolve_data_files_locally_or_by_urls(complex_data_dir, pattern, allowed_extensions=extensions) + + +def test_fail_resolve_data_files_locally_or_by_urls(complex_data_dir): + with pytest.raises(FileNotFoundError): + _resolve_data_files_locally_or_by_urls(complex_data_dir, "blablabla") + + +@pytest.mark.parametrize( + "pattern,size", [("*", 2), ("**/*", 2), ("*.txt", 2), ("data/*", 2), ("**/*.txt", 2), ("**/train.txt", 1)] +) +def test_resolve_data_files_in_dataset_repository(complex_data_dir, pattern, size): + dataset_info = DatasetInfo( + siblings=[ + {"rfilename": path.relative_to(complex_data_dir).as_posix()} + for path in Path(complex_data_dir).rglob("*") + if path.is_file() + ] + ) + resolved_data_files = _resolve_data_files_in_dataset_repository(dataset_info, pattern) + files_to_ignore = {".dummy", "README.md"} + expected_resolved_data_files = [ + path.relative_to(complex_data_dir) + for path in Path(complex_data_dir).rglob(pattern) + if path.name not in files_to_ignore and path.is_file() + ] + assert len(resolved_data_files) == size + assert sorted(resolved_data_files) == sorted(expected_resolved_data_files) + assert all(isinstance(path, PurePath) for path in resolved_data_files) + assert all((Path(complex_data_dir) / path).is_file() for path in resolved_data_files) + + +@pytest.mark.parametrize("pattern,size,extensions", [("*", 2, ["txt"]), ("*", 2, None), ("*", 0, ["blablabla"])]) +def test_resolve_data_files_in_dataset_repository_with_extensions(complex_data_dir, 
pattern, size, extensions): + dataset_info = DatasetInfo( + siblings=[ + {"rfilename": path.relative_to(complex_data_dir).as_posix()} + for path in Path(complex_data_dir).rglob("*") + if path.is_file() + ] + ) + if size > 0: + resolved_data_files = _resolve_data_files_in_dataset_repository( + dataset_info, pattern, allowed_extensions=extensions + ) + assert len(resolved_data_files) == size + else: + with pytest.raises(FileNotFoundError): + resolved_data_files = _resolve_data_files_in_dataset_repository( + dataset_info, pattern, allowed_extensions=extensions + ) + + +def test_fail_resolve_data_files_in_dataset_repository(complex_data_dir): + dataset_info = DatasetInfo( + siblings=[ + {"rfilename": path.relative_to(complex_data_dir).as_posix()} + for path in Path(complex_data_dir).rglob("*") + if path.is_file() + ] + ) + with pytest.raises(FileNotFoundError): + _resolve_data_files_in_dataset_repository(dataset_info, "blablabla") diff --git a/tests/test_streaming_download_manager.py b/tests/test_streaming_download_manager.py index 12d7e667850..9ab795cac50 100644 --- a/tests/test_streaming_download_manager.py +++ b/tests/test_streaming_download_manager.py @@ -3,15 +3,15 @@ import pytest from datasets.filesystems import COMPRESSION_FILESYSTEMS +from datasets.utils.streaming_download_manager import xopen -from .utils import require_lz4, require_streaming, require_zstandard +from .utils import require_lz4, require_zstandard TEST_URL = "https://huggingface.co/datasets/lhoestq/test/raw/main/some_text.txt" TEST_URL_CONTENT = "foo\nbar\nfoobar" -@require_streaming @pytest.mark.parametrize( "input_path, paths_to_join, expected_path", [ @@ -35,15 +35,12 @@ def test_xjoin(input_path, paths_to_join, expected_path): assert output_path == expected_path -@require_streaming def test_xopen_local(text_path): - from datasets.utils.streaming_download_manager import xopen with xopen(text_path, encoding="utf-8") as f, open(text_path, encoding="utf-8") as expected_file: assert list(f) == list(expected_file) -@require_streaming def test_xopen_remote(): from datasets.utils.streaming_download_manager import xopen @@ -51,7 +48,6 @@ def test_xopen_remote(): assert list(f) == TEST_URL_CONTENT.splitlines(keepends=True) -@require_streaming @pytest.mark.parametrize("urlpath", [r"C:\\foo\bar.txt", "/foo/bar.txt", "https://f.oo/bar.txt"]) def test_streaming_dl_manager_download_dummy_path(urlpath): from datasets.utils.streaming_download_manager import StreamingDownloadManager @@ -60,7 +56,6 @@ def test_streaming_dl_manager_download_dummy_path(urlpath): assert dl_manager.download(urlpath) == urlpath -@require_streaming def test_streaming_dl_manager_download(text_path): from datasets.utils.streaming_download_manager import StreamingDownloadManager, xopen @@ -71,7 +66,6 @@ def test_streaming_dl_manager_download(text_path): assert f.read() == expected_file.read() -@require_streaming @pytest.mark.parametrize("urlpath", [r"C:\\foo\bar.txt", "/foo/bar.txt", "https://f.oo/bar.txt"]) def test_streaming_dl_manager_download_and_extract_no_extraction(urlpath): from datasets.utils.streaming_download_manager import StreamingDownloadManager @@ -80,7 +74,6 @@ def test_streaming_dl_manager_download_and_extract_no_extraction(urlpath): assert dl_manager.download_and_extract(urlpath) == urlpath -@require_streaming def test_streaming_dl_manager_extract(text_gz_path, text_path): from datasets.utils.streaming_download_manager import StreamingDownloadManager, xopen @@ -93,7 +86,6 @@ def test_streaming_dl_manager_extract(text_gz_path, text_path): 
assert f.read() == expected_file.read() -@require_streaming def test_streaming_dl_manager_download_and_extract_with_extraction(text_gz_path, text_path): from datasets.utils.streaming_download_manager import StreamingDownloadManager, xopen @@ -106,7 +98,6 @@ def test_streaming_dl_manager_download_and_extract_with_extraction(text_gz_path, assert f.read() == expected_file.read() -@require_streaming @pytest.mark.parametrize( "input_path, filename, expected_path", [("https://domain.org/archive.zip", "filename.jsonl", "zip://filename.jsonl::https://domain.org/archive.zip")], @@ -120,7 +111,6 @@ def test_streaming_dl_manager_download_and_extract_with_join(input_path, filenam assert output_path == expected_path -@require_streaming @require_zstandard @require_lz4 @pytest.mark.parametrize("compression_fs_class", COMPRESSION_FILESYSTEMS) diff --git a/tests/utils.py b/tests/utils.py index ee26201e986..fe3701cd03b 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -192,21 +192,6 @@ def require_s3(test_case): return test_case -def require_streaming(test_case): - """ - Decorator marking a test that requires aiohttp. - - These tests are skipped when aiohttp isn't installed. - - """ - try: - import aiohttp # noqa F401 - except ImportError: - return unittest.skip("test requires aiohttp")(test_case) - else: - return test_case - - def slow(test_case): """ Decorator marking a test as slow.
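To illustrate the script-free loading path added to ``load_dataset_builder`` and ``load_dataset`` above, here is a minimal usage sketch. The directory path is hypothetical; the expected outputs follow the assertions in the new tests (the generic ``text`` builder is inferred from the file extensions, the directory name becomes the configuration name, and ``README.md`` and dot-files are ignored when resolving data files):

.. code-block::

    >>> from datasets import load_dataset_builder, load_dataset
    >>> # a plain directory of .txt files, with no dataset script inside (hypothetical path)
    >>> builder = load_dataset_builder('./path/to/text_data')
    >>> builder.name
    'text'
    >>> builder.config.name
    'text_data'
    >>> builder.config.data_files  # resolved by Unix-style pattern matching
    [...]
    >>> # packaged builders such as 'text' or 'csv' still require explicit data_files
    >>> dataset = load_dataset('text', data_files={'train': './path/to/text_data/*.txt'})

Note that calling ``load_dataset('text')`` without ``data_files`` now raises a ``ValueError`` asking which data files to load.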
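The pattern resolution exercised by the new ``_resolve_data_files_*`` tests can be approximated with ``pathlib`` alone. The standalone helper below is only an illustration of that behaviour, not the implementation shipped in ``datasets.load``: it matches a Unix-style pattern under a base directory, skips hidden files and ``README.md``, optionally filters by extension, and raises ``FileNotFoundError`` when nothing matches.

.. code-block:: python

    from pathlib import Path
    from typing import List, Optional

    def resolve_local_data_files(
        base: str, pattern: str, allowed_extensions: Optional[List[str]] = None
    ) -> List[Path]:
        """Illustrative stand-in for local data file resolution (not the shipped helper)."""
        ignored = {"README.md"}
        matched = [
            path
            for path in Path(base).rglob(pattern)
            if path.is_file() and not path.name.startswith(".") and path.name not in ignored
        ]
        if allowed_extensions is not None:
            # keep only files whose extension maps to the requested builder, e.g. ["txt"] for text
            matched = [path for path in matched if path.suffix.lstrip(".") in allowed_extensions]
        if not matched:
            raise FileNotFoundError(f"Unable to resolve any data file matching {pattern!r} in {base}")
        return sorted(matched)

    # e.g. resolve_local_data_files("./path/to/text_data", "**/*.txt", allowed_extensions=["txt"])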