
Conversation

@lhoestq
Member

@lhoestq lhoestq commented Jul 16, 2021

Load the data from any Dataset repository on the Hub

This PR adds support for loading datasets from any dataset repository on the Hub, without requiring any dataset script.

As a user it's now possible to create a repo, upload some csv/json/text/parquet files, and then load the data in one line. Here is an example with the allenai/c4 repository, which contains a lot of compressed JSON Lines files:

from datasets import load_dataset

data_files = {"train": "en/c4-train.*.json.gz"}
c4 = load_dataset("allenai/c4", data_files=data_files, split="train", streaming=True)

print(c4.n_shards)
# 1024
print(next(iter(c4)))
# {'text': 'Beginners BBQ Class Takin...'}

By default it loads all the files, but as shown in the example you can select the ones you want using Unix-style patterns.
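For instance, a minimal sketch of the default behavior (the repository name here is hypothetical): without a data_files argument, every supported data file in the repository is loaded.

from datasets import load_dataset

# hypothetical repository containing only csv files; no data_files needed
dataset = load_dataset("username/my_csv_dataset")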

Of course it's still possible to use dataset scripts since they offer the most flexibility.

Implementation details

It uses huggingface_hub to list the files in a dataset repository.

If you provide a path to a local directory instead of a repository name, it works the same way but it uses glob.
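For instance, a minimal sketch of the local case (the directory name and file patterns are hypothetical):

from datasets import load_dataset

# hypothetical local folder; the patterns are resolved relative to it using glob
data_files = {"train": "train_*.csv", "test": "test_*.csv"}
dataset = load_dataset("path/to/my_data", data_files=data_files)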

Depending on the data files available (or those passed in the data_files parameter), one of the packaged builders is used: csv, json, text or parquet.

Because of this, it's not possible to load both csv and json files at once. In that case you have to load them separately and then, for example, concatenate the two datasets, as sketched below.
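A minimal sketch of that workaround (the file names are hypothetical, and concatenation requires both datasets to have the same features):

from datasets import load_dataset, concatenate_datasets

# load each format with its own packaged builder, then concatenate
csv_part = load_dataset("csv", data_files="part1.csv", split="train")
json_part = load_dataset("json", data_files="part2.json", split="train")
combined = concatenate_datasets([csv_part, json_part])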

TODO

  • tests
  • docs
  • when huggingface_hub gets a new release, update the CI and the setup.py

Close #2629

@lhoestq lhoestq marked this pull request as ready for review July 28, 2021 18:22
@lhoestq
Member Author

lhoestq commented Jul 28, 2021

This is ready for review now :)

I would love to have some feedback on the changes in load.py @albertvillanova. There are many changes so if you have questions let me know, especially on the resolve_data_files functions and on the changes in prepare_module.

And @thomwolf if you want to take a look at the documentation, feel free to share your suggestions :)

Member

@lewtun lewtun left a comment


this feature looks really nice - thank you for improving our quality of life 🥳 !

i left a few nits (feel free to ignore them) and a question about whether we should show that you can download private files

- from the `HuggingFace Hub <https://huggingface.co/datasets>`__,
- from local files, e.g. CSV/JSON/text/pandas files, or
- from the `Hugging Face Hub <https://huggingface.co/datasets>`__,
- from local or remote files, e.g. CSV/JSON/text/parquet/pandas files, or
Member


out of curiosity, what is a "pandas file"?

Member


ah i see the answer is below: it's a pickled dataframe :)

>>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])
>>> dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'],
...                                           'test': 'my_test_file.csv'})
>>> base_url = 'https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/'
Member


really nice to see an explicit example with the expected url!

.. code-block::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset('csv', data_files="https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/train.csv")
Member


i'm not sure where we should mention this, but showing that you can download from private repos with use_auth_token=True would be useful
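for instance, a minimal sketch (the repo name is hypothetical; use_auth_token=True picks up the token saved by huggingface-cli login):

from datasets import load_dataset

# hypothetical private repository; requires being logged in via `huggingface-cli login`
dataset = load_dataset("username/my_private_dataset", data_files="data.csv", use_auth_token=True)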


Member

@thomwolf thomwolf left a comment


This is really cool and will unlock a lot of easy workflows!

f"To be able to use dataset streaming, you need to install dependencies like aiohttp "
f'using "pip install \'datasets[streaming]\'" or "pip install aiohttp" for instance'
"To be able to use dataset streaming, you need to install dependencies like aiohttp "
'using \'datasets[streaming]\'" or "pip install aiohttp" for instance'
Member


I think you can move aiohttp to the required deps since we are now pushing for the streaming feature a lot

# - if path has one "/" and is a dataset repository on the HF hub with a python file
#   -> the module from the python file in the dataset repository
# - if path has one "/" and is a dataset repository on the HF hub without a python file
#   -> use a packaged module (csv, text, etc.) based on the content of the repository
Member


Maybe this should be in the docstring or doc?

Member Author

@lhoestq lhoestq Jul 29, 2021


It is, but the formulation there is a bit different to be clearer for users.
Here the formulation is made for developers


logger = get_logger(__name__)
BASE_KNOWN_EXTENSIONS = ["txt", "csv", "json", "jsonl", "tsv", "conll"]
UNSUPPORTED_ARCHIVE_EXTENSIONS_FOR_STREAMING = ["tar", "xz", "rar", "zst"]
Member


Here also we should have the list of supported (and unsupported) extensions/file formats clearly stated in the doc

Member


I'm a bit surprised you do a black-list instead of a white-list here

Member Author


This is because most scripts call download_and_extract even for files that don't need extraction. We shouldn't raise an error for an uncompressed file. If we had a white-list and you passed an uncompressed file, it wouldn't find any supported compression extension in the filename, so it would fail.
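As a rough sketch of the behavior this preserves (the URLs are hypothetical): download_and_extract extracts files whose extension is a known archive format, and returns other files as-is instead of raising.

from datasets import DownloadManager

dl_manager = DownloadManager()
# hypothetical URLs: the first has a known compression extension and gets extracted,
# the second is a plain file and is simply downloaded and returned unchanged
paths = dl_manager.download_and_extract([
    "https://example.com/data.json.gz",
    "https://example.com/data.json",
])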

Member Author


in the future we'll fall back to reading the header of the file to know whether it's a compressed file or not
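for illustration, a hedged sketch of that kind of header sniffing (not what this PR implements): gzip files always start with the magic bytes 0x1f 0x8b, whatever their extension

def looks_gzipped(path):
    # gzip files start with the two magic bytes 0x1f 0x8b
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"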

@lhoestq
Member Author

lhoestq commented Jul 29, 2021

I took your comments into account, thanks!
And I made aiohttp a required dependency :)

@lhoestq
Member Author

lhoestq commented Aug 25, 2021

Just updated the documentation :)
share_datasets.html

Let me know if you have some comments

@lhoestq
Member Author

lhoestq commented Aug 25, 2021

Merging this one :)

We can try to integrate the changes in the docs into #2718 @stevhliu!

@lhoestq lhoestq merged commit 6c766f9 into master Aug 25, 2021
@lhoestq lhoestq deleted the load_dataset-no-dataset-script branch August 25, 2021 14:18
@stevhliu
Member

Baked this into the docs already, let me know if there is anything else I should add! :)

