
Conversation

@lhoestq
Member

@lhoestq lhoestq commented Jul 16, 2021

Load the data from any Dataset repository on the Hub

This PR adds support for loading datasets from any dataset repository on the Hub, without requiring any dataset script.

As a user it's now possible to create a repo, upload some csv/json/text/parquet files, and then load the data in one line. Here is an example with the allenai/c4 repository, which contains a lot of compressed JSON Lines files:

from datasets import load_dataset

data_files = {"train": "en/c4-train.*.json.gz"}
c4 = load_dataset("allenai/c4", data_files=data_files, split="train", streaming=True)

print(c4.n_shards)
# 1024
print(next(iter(c4)))
# {'text': 'Beginners BBQ Class Takin...'}

By default it loads all the files, but as shown in the example you can select the ones you want using Unix-style patterns.
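For instance, a minimal sketch of the default behavior (the repository name here is hypothetical): without a data_files argument, every supported data file in the repository is loaded.

from datasets import load_dataset

# hypothetical repository containing only csv files; no data_files needed
dataset = load_dataset("username/my_csv_dataset")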

Of course it's still possible to use dataset scripts since they offer the most flexibility.

Implementation details

It uses huggingface_hub to list the files in a dataset repository.

If you provide a path to a local directory instead of a repository name, it works the same way but it uses glob.
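For instance, a minimal sketch of the local case (the directory name and file patterns are hypothetical):

from datasets import load_dataset

# hypothetical local folder; the patterns are resolved relative to it using glob
data_files = {"train": "train_*.csv", "test": "test_*.csv"}
dataset = load_dataset("path/to/my_data", data_files=data_files)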

Depending on the data files available (or those passed in the data_files parameter), one of the packaged builders is used: csv, json, text or parquet.

Because of this, it's not possible to load both csv and json files at once. In that case you have to load them separately and then, for example, concatenate the two datasets, as sketched below.
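A minimal sketch of that workaround (the file names are hypothetical, and concatenation requires both datasets to have the same features):

from datasets import load_dataset, concatenate_datasets

# load each format with its own packaged builder, then concatenate
csv_part = load_dataset("csv", data_files="part1.csv", split="train")
json_part = load_dataset("json", data_files="part2.json", split="train")
combined = concatenate_datasets([csv_part, json_part])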

TODO

  • tests
  • docs
  • when huggingface_hub gets a new release, update the CI and the setup.py

Close #2629

@lhoestq lhoestq marked this pull request as ready for review July 28, 2021 18:22
@lhoestq
Member Author

lhoestq commented Jul 28, 2021

This is ready for review now :)

I would love to have some feedback on the changes in load.py @albertvillanova. There are many changes so if you have questions let me know, especially on the resolve_data_files functions and on the changes in prepare_module.

And @thomwolf if you want to take a look at the documentation, feel free to share your suggestions :)

Member

@lewtun lewtun left a comment


this feature looks really nice - thank you for improving our quality of life 🥳 !

i left a few nits (feel free to ignore them) and a question about whether we should show that you can download private files

- from the `HuggingFace Hub <https://huggingface.co/datasets>`__,
- from local files, e.g. CSV/JSON/text/pandas files, or
- from the `Hugging Face Hub <https://huggingface.co/datasets>`__,
- from local or remote files, e.g. CSV/JSON/text/parquet/pandas files, or
Member


out of curiosity, what is a "pandas file"?

Member


ah i see the answer is below: it's a pickled dataframe :)

>>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])
>>> dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'],
...                                           'test': 'my_test_file.csv'})
>>> base_url = 'https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/'
Member


really nice to see an explicit example with the expected url!

.. code-block::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset('csv', data_files="https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/train.csv")
Member


i'm not sure where we should mention this, but showing that you can download from private repos with use_auth_token=True would be useful
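for instance, a minimal sketch (the repo name is hypothetical; use_auth_token=True picks up the token saved by huggingface-cli login):

from datasets import load_dataset

# hypothetical private repository; requires being logged in via `huggingface-cli login`
dataset = load_dataset("username/my_private_dataset", data_files="data.csv", use_auth_token=True)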


Member

@thomwolf thomwolf left a comment


This is really cool and will unlock a lot of easy workflows!

f"To be able to use dataset streaming, you need to install dependencies like aiohttp "
f'using "pip install \'datasets[streaming]\'" or "pip install aiohttp" for instance'
"To be able to use dataset streaming, you need to install dependencies like aiohttp "
'using \'datasets[streaming]\'" or "pip install aiohttp" for instance'
Member


I think you can move aiohttp to the required deps since we are now pushing for the streaming feature a lot

# - if path has one "/" and is a dataset repository on the HF hub with a python file
#   -> the module from the python file in the dataset repository
# - if path has one "/" and is a dataset repository on the HF hub without a python file
#   -> use a packaged module (csv, text, etc.) based on the content of the repository
Member


Maybe this should be in the docstring or doc?

Member Author

@lhoestq lhoestq Jul 29, 2021


It is, but the formulation there is a bit different to be clearer for users.
Here the formulation is made for developers


logger = get_logger(__name__)
BASE_KNOWN_EXTENSIONS = ["txt", "csv", "json", "jsonl", "tsv", "conll"]
UNSUPPORTED_ARCHIVE_EXTENSIONS_FOR_STREAMING = ["tar", "xz", "rar", "zst"]
Member


Here also we should have the list of supported (and unsupported) extensions/file formats clearly stated in the doc

Member


I'm a bit surprised you do a black-list instead of a white-list here

Member Author


This is because most scripts call download_and_extract even for files that don't need extraction. We shouldn't raise an error for an uncompressed file. If we had a white-list and you passed an uncompressed file, it wouldn't find any supported compression extension in the filename, so it would fail.
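As a rough sketch of the behavior this preserves (the URLs are hypothetical): download_and_extract extracts files whose extension is a known archive format, and returns other files as-is instead of raising.

from datasets import DownloadManager

dl_manager = DownloadManager()
# hypothetical URLs: the first has a known compression extension and gets extracted,
# the second is a plain file and is simply downloaded and returned unchanged
paths = dl_manager.download_and_extract([
    "https://example.com/data.json.gz",
    "https://example.com/data.json",
])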

Member Author


in the future we'll fall back to reading the header of the file to know whether it's a compressed file or not
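for illustration, a hedged sketch of that kind of header sniffing (not what this PR implements): gzip files always start with the magic bytes 0x1f 0x8b, whatever their extension

def looks_gzipped(path):
    # gzip files start with the two magic bytes 0x1f 0x8b
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"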

@lhoestq
Member Author

lhoestq commented Jul 29, 2021

I took your comments into account, thanks!
And I made aiohttp a required dependency :)

@lhoestq
Member Author

lhoestq commented Aug 25, 2021

Just updated the documentation :)
share_datasets.html

Let me know if you have some comments

@lhoestq
Member Author

lhoestq commented Aug 25, 2021

Merging this one :)

We can try to integrate the changes in the docs into #2718 @stevhliu!

@lhoestq lhoestq merged commit 6c766f9 into master Aug 25, 2021
@lhoestq lhoestq deleted the load_dataset-no-dataset-script branch August 25, 2021 14:18
@stevhliu
Member

Baked this into the docs already, let me know if there is anything else I should add! :)

