Load Dataset from the Hub (NO DATASET SCRIPT) #2662
Conversation
This is ready for review now :) I would love to have some feedback on the changes in load.py @albertvillanova. There are many changes, so if you have questions, let me know. And @thomwolf if you want to take a look at the documentation, feel free to share your suggestions :)
lewtun left a comment
this feature looks really nice - thank you for improving our quality of life 🥳 !
i left a few nits (feel free to ignore them) and a question about whether we should show you can download private files
- - from the `HuggingFace Hub <https://huggingface.co/datasets>`__,
- - from local files, e.g. CSV/JSON/text/pandas files, or
+ - from the `Hugging Face Hub <https://huggingface.co/datasets>`__,
+ - from local or remote files, e.g. CSV/JSON/text/parquet/pandas files, or
out of curiosity, what is a "pandas file"?
ah i see the answer is below: it's a pickled dataframe :)
>>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])
>>> dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'],
...                                           'test': 'my_test_file.csv'})
>>> base_url = 'https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/'
really nice to see an explicit example with the expected url!
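For context, the excerpt stops at the base_url definition; presumably the doc continues by combining it with the file names, along these lines (a sketch, file names assumed):

>>> dataset = load_dataset('csv', data_files={'train': base_url + 'train.csv',
...                                           'test': base_url + 'test.csv'})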
.. code-block::

   >>> from datasets import load_dataset
   >>> dataset = load_dataset('csv', data_files="https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/train.csv")
i'm not sure where we should mention this, but showing that you can download from private repos with use_auth_token=True would be useful
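A minimal sketch of what such a doc example could look like (the repository path here is hypothetical; use_auth_token=True sends the Hub token saved by huggingface-cli login):

>>> from datasets import load_dataset
>>> # hypothetical private repository; requires being logged in to the Hub
>>> dataset = load_dataset('csv',
...                        data_files='https://huggingface.co/datasets/my-org/my-private-repo/resolve/main/data/train.csv',
...                        use_auth_token=True)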
thomwolf left a comment
This is really cool and will unlock a lot of easy workflows!
src/datasets/builder.py (Outdated)
| f"To be able to use dataset streaming, you need to install dependencies like aiohttp " | ||
| f'using "pip install \'datasets[streaming]\'" or "pip install aiohttp" for instance' | ||
| "To be able to use dataset streaming, you need to install dependencies like aiohttp " | ||
| 'using \'datasets[streaming]\'" or "pip install aiohttp" for instance' |
I think you can move aiohttp to the required dependencies since we are now pushing for the streaming feature a lot
# - if path has one "/" and is dataset repository on the HF hub with a python file
#    -> the module from the python file in the dataset repository
# - if path has one "/" and is dataset repository on the HF hub without a python file
#    -> use a packaged module (csv, text etc.) based on content of the repository
Maybe this should be in the docstring or doc?
It is, but the formulation is a bit different to be clearer for users.
Here the formulation is made for developers
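To illustrate the two user-facing cases these rules describe (a sketch; lhoestq/demo1 appears earlier in the docs, the second repo name is hypothetical):

>>> from datasets import load_dataset
>>> # repo with only data files: a packaged builder (csv, json, text, parquet) is inferred
>>> dataset = load_dataset('lhoestq/demo1')
>>> # repo that contains a python script: that script is used as the builder
>>> dataset = load_dataset('username/dataset_with_script')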
logger = get_logger(__name__)

BASE_KNOWN_EXTENSIONS = ["txt", "csv", "json", "jsonl", "tsv", "conll"]
UNSUPPORTED_ARCHIVE_EXTENSIONS_FOR_STREAMING = ["tar", "xz", "rar", "zst"]
Here also we should have the list of supported (and unsupported) extensions/file formats clearly stated in the doc
I'm a bit surprised you do a black-list instead of a white-list here
This is because most scripts call download_and_extract even for files that don't need extraction. We shouldn't raise an error for an uncompressed file. If we had a white-list and you passed an uncompressed file, it wouldn't find any supported compression extension in the filename, so it would fail.
in the future we'll fall back on reading the header of the file to know if it's a compressed file or not
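A minimal sketch of what such header-based detection could look like (this is not the datasets implementation, just an illustration of the idea using standard magic numbers):

from typing import Optional

# well-known magic bytes that begin common compressed file formats
_MAGIC_NUMBERS = {
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip",
    b"\xfd7zXZ\x00": "xz",
    b"\x28\xb5\x2f\xfd": "zstd",
    b"BZh": "bz2",
}

def detect_compression(path: str) -> Optional[str]:
    """Return the compression format inferred from the file header, or None."""
    with open(path, "rb") as f:
        header = f.read(8)  # longest magic number above is 6 bytes
    for magic, fmt in _MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return fmt
    return None  # most likely an uncompressed file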
I took your comments into account, thanks!
Just updated the documentation :) Let me know if you have any comments
Baked this into the docs already, let me know if there is anything else I should add! :)
Load the data from any Dataset repository on the Hub
This PR adds support for loading datasets from any dataset repository on the hub, without requiring any dataset script.
As a user it's now possible to create a repo and upload some csv/json/text/parquet files, and then be able to load the data in one line. Here is an example with the allenai/c4 repository, which contains a lot of compressed JSON Lines files (a sketch of such a call is shown below). By default it loads all the files, but as shown in the example you can choose the ones you want with unix-style patterns.
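A sketch of what that one-liner could look like (the data_files pattern here is illustrative, not necessarily the one from the original example):

>>> from datasets import load_dataset
>>> # load only a subset of the compressed json lines files using a unix-style pattern
>>> c4_subset = load_dataset('allenai/c4', data_files='en/c4-train.0000*-of-01024.json.gz')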
Of course it's still possible to use dataset scripts since they offer the most flexibility.
Implementation details
It uses huggingface_hub to list the files in a dataset repository. If you provide a path to a local directory instead of a repository name, it works the same way but it uses glob. Depending on the data files available, or passed in the data_files parameter, one of the available builders will be used among the csv, json, text and parquet builders. Because of this, it's not possible to load both csv and json files at once. In this case you have to load them separately and then concatenate the two datasets, for example (see the sketch below).
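A minimal sketch of that workaround, assuming hypothetical file names:

>>> from datasets import load_dataset, concatenate_datasets
>>> # load the csv and json files as two separate datasets
>>> csv_ds = load_dataset('csv', data_files='data/part1.csv', split='train')
>>> json_ds = load_dataset('json', data_files='data/part2.json', split='train')
>>> # concatenation requires the two datasets to share the same features
>>> combined = concatenate_datasets([csv_ds, json_ds])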
TODO
Close #2629