Merged

30 commits
07c9de2
use the data files of a dataset repo and infer the right dataset builder
lhoestq Jul 13, 2021
6aa3c02
fix import
lhoestq Jul 13, 2021
f8c7b49
Merge branch 'master' into load_dataset-no-dataset-script
lhoestq Jul 16, 2021
69d6b17
temporarily use the huggingface_hub on master in the CI
lhoestq Jul 16, 2021
17a09e7
fix tests
lhoestq Jul 16, 2021
4d7bec8
Merge branch 'master' into load_dataset-no-dataset-script
lhoestq Jul 21, 2021
d651717
bump huggingface_hub version
lhoestq Jul 21, 2021
a51f731
revert huggingface_hub pin in the CI
lhoestq Jul 21, 2021
48b0f88
fix data_files resolutions for urls
lhoestq Jul 21, 2021
d6621c6
data file resolution for local/http/hub
lhoestq Jul 26, 2021
974fd6f
Merge branch 'master' into load_dataset-no-dataset-script
lhoestq Jul 27, 2021
b450e0f
remove old code
lhoestq Jul 27, 2021
18ae58d
style
lhoestq Jul 27, 2021
a6a6a12
more tests
lhoestq Jul 27, 2021
00686c4
tests and docs
lhoestq Jul 27, 2021
f452772
fix test
lhoestq Jul 28, 2021
d320a47
fix test
lhoestq Jul 28, 2021
be38796
style
lhoestq Jul 28, 2021
f83cd44
Merge branch 'master' into load_dataset-no-dataset-script
lhoestq Jul 28, 2021
8361a51
docs
lhoestq Jul 28, 2021
eaa1371
style
lhoestq Jul 28, 2021
130a500
lewis' comments
lhoestq Jul 29, 2021
01f5328
add aiohttp to dependencies
lhoestq Jul 29, 2021
04c2a4b
minor
lhoestq Jul 29, 2021
e6b92c6
Merge branch 'master' into load_dataset-no-dataset-script
lhoestq Aug 24, 2021
f310e4c
fix imports
lhoestq Aug 24, 2021
10103cd
fix imports agains
lhoestq Aug 24, 2021
b4bf1f8
remove remaining require_streaming
lhoestq Aug 24, 2021
5e2cf2a
fix test
lhoestq Aug 24, 2021
0a96b16
docs: share dataset on the hub
lhoestq Aug 25, 2021
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -59,8 +59,8 @@ The documentation is organized in six parts:
:maxdepth: 2
:caption: Adding new datasets/metrics

add_dataset
share_dataset
add_dataset
add_metric

.. toctree::
94 changes: 77 additions & 17 deletions docs/source/loading_datasets.rst
@@ -3,16 +3,16 @@ Loading a Dataset

A :class:`datasets.Dataset` can be created from various sources of data:

- from the `HuggingFace Hub <https://huggingface.co/datasets>`__,
- from local files, e.g. CSV/JSON/text/pandas files, or
- from the `Hugging Face Hub <https://huggingface.co/datasets>`__,
- from local or remote files, e.g. CSV/JSON/text/parquet/pandas files, or
Member: out of curiosity, what is a "pandas file"?

Member: ah i see the answer is below: it's a pickled dataframe :)

- from in-memory data like python dict or a pandas dataframe.

In this section we study each option.

From the HuggingFace Hub
From the Hugging Face Hub
-------------------------------------------------

Over 1,000 datasets for many NLP tasks like text classification, question answering, language modeling, etc, are provided on the `HuggingFace Hub <https://huggingface.co/datasets>`__ and can be viewed and explored online with the `🤗 Datasets viewer <https://huggingface.co/datasets/viewer>`__.
Over 1,000 datasets for many NLP tasks like text classification, question answering, language modeling, etc, are provided on the `Hugging Face Hub <https://huggingface.co/datasets>`__ and can be viewed and explored online with the `🤗 Datasets viewer <https://huggingface.co/datasets/viewer>`__.

.. note::

@@ -25,7 +25,7 @@ All the datasets currently available on the `Hub <https://huggingface.co/dataset
>>> from datasets import list_datasets
>>> datasets_list = list_datasets()
>>> len(datasets_list)
1067
1103
>>> print(', '.join(dataset for dataset in datasets_list))
acronym_identification, ade_corpus_v2, adversarial_qa, aeslc, afrikaans_ner_corpus, ag_news, ai2_arc, air_dialogue, ajgt_twitter_ar,
allegro_reviews, allocine, alt, amazon_polarity, amazon_reviews_multi, amazon_us_reviews, ambig_qa, amttl, anli, app_reviews, aqua_rat,
@@ -46,7 +46,7 @@ Let's load the **SQuAD dataset for Question Answering**. You can explore this da

This call to :func:`datasets.load_dataset` does the following steps under the hood:

1. Download and import in the library the **SQuAD python processing script** from HuggingFace github repository or AWS bucket if it's not already stored in the library.
1. Download and import in the library the **SQuAD Python processing script** from the Hugging Face GitHub repository or AWS bucket if it's not already stored in the library.

.. note::

@@ -158,22 +158,60 @@ Apart from :obj:`name` and :obj:`split`, the :func:`datasets.load_dataset` metho

The use of these arguments is discussed in the :ref:`load_dataset_cache_management` section below. You can also find the full details on these arguments on the package reference page for :func:`datasets.load_dataset`.

From a community dataset on the Hugging Face Hub
-----------------------------------------------------------

The community shares hundreds of datasets on the Hugging Face Hub using **dataset repositories**.
A dataset repository is a versioned repository of data files.
Anyone can create a dataset repository on the Hugging Face Hub and upload their data.

For example, we have created a demo dataset at https://huggingface.co/datasets/lhoestq/demo1.
In this dataset repository, we uploaded some CSV files, and you can load the dataset with:

.. code-block::

>>> from datasets import load_dataset
>>> dataset = load_dataset('lhoestq/demo1')

You can even choose which files to load from a dataset repository.
For example, you can load a subset of the **C4 dataset for language modeling**, hosted by AllenAI on the Hub.
You can browse the dataset repository at https://huggingface.co/datasets/allenai/c4.

In the following example, we specify which subset of the files to use with the ``data_files`` parameter:

.. code-block::

>>> from datasets import load_dataset
>>> c4_subset = load_dataset('allenai/c4', data_files='en/c4-train.0000*-of-01024.json.gz')


You can also specify custom splits:

.. code-block::

>>> data_files = {"validation": "en/c4-validation.*.json.gz"}
>>> c4_validation = load_dataset("allenai/c4", data_files=data_files, split="validation")

In these examples, ``load_dataset`` will load all the files that match the Unix-style pattern passed in ``data_files``.
If you don't specify which data files to use, all the data files of the repository are used (here, the full C4 dataset is about 13TB of data).
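
For example, here is a sketch of loading a small subset of C4 with both a train and a validation split in a single call, reusing the file patterns shown above:

.. code-block::

    >>> from datasets import load_dataset
    >>> data_files = {"train": "en/c4-train.0000*-of-01024.json.gz", "validation": "en/c4-validation.*.json.gz"}
    >>> c4_subset = load_dataset("allenai/c4", data_files=data_files)  # a dataset dict with both splits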


.. _loading-from-local-files:

From local files
From local or remote files
-----------------------------------------------------------

It's also possible to create a dataset from local files.
It's also possible to create a dataset from your own local or remote files.

Generic loading scripts are provided for:

- CSV files (with the :obj:`csv` script),
- JSON files (with the :obj:`json` script),
- text files (read as a line-by-line dataset with the :obj:`text` script),
- parquet files (with the :obj:`parquet` script),
- pandas pickled dataframes (with the :obj:`pandas` script).
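
For instance, here is a minimal sketch of loading local parquet files with the packaged ``parquet`` script (the file names are placeholders):

.. code-block::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset('parquet', data_files={'train': 'my_train.parquet', 'test': 'my_test.parquet'})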

If you want to control better how your files are loaded, or if you have a file format exactly reproducing the file format for one of the datasets provided on the `HuggingFace Hub <https://huggingface.co/datasets>`__, it can be more flexible and simpler to create **your own loading script**, from scratch or by adapting one of the provided loading scripts. In this case, please go check the :doc:`add_dataset` chapter.
If you want more fine-grained control over how your files are loaded, or if you have a file format that matches the format of one of the datasets provided on the `Hugging Face Hub <https://huggingface.co/datasets>`__, it can be simpler to create **your own loading script**, from scratch or by adapting one of the provided loading scripts. In this case, please check the :doc:`add_dataset` section.

The :obj:`data_files` argument in :func:`datasets.load_dataset` is used to provide paths to one or several data source files. This argument currently accepts three types of inputs:

@@ -190,12 +228,19 @@ Let's see an example of all the various ways you can provide files to :func:`dat
>>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])
>>> dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'],
'test': 'my_test_file.csv'})
>>> base_url = 'https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/'

Member: really nice to see an explicit example with the expected url!

>>> dataset = load_dataset('csv', data_files={'train': base_url + 'train.csv', 'test': base_url + 'test.csv'})

.. note::

The :obj:`split` argument will work similarly to what we detailed above for the datasets on the Hub, and you can find more details on the syntax for using :obj:`split` in the :doc:`dedicated tutorial on split <./splits>`. The only specific behavior related to loading local files is that if you don't indicate which split each file is related to, the provided files are assumed to belong to the **train** split.


.. note::

If you use a private dataset repository on the Hub, you just need to pass ``use_auth_token=True`` to ``load_dataset`` after logging in with the ``huggingface-cli login`` bash command. Alternatively, you can pass your `API token <https://huggingface.co/settings/token>`__ in ``use_auth_token``.
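
For example, here is a minimal sketch of loading a private dataset repository (the repository name below is a hypothetical placeholder):

.. code-block::

    >>> from datasets import load_dataset
    >>> # assumes you have access to this (hypothetical) private repository,
    >>> # and that you ran `huggingface-cli login` beforehand
    >>> dataset = load_dataset('lhoestq/my_private_demo', use_auth_token=True)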


CSV files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -218,6 +263,13 @@ Here is an example loading two CSV file to create a ``train`` split (default spl
>>> from datasets import load_dataset
>>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv'])

You can also provide the URLs of remote CSV files:

.. code-block::

>>> from datasets import load_dataset
>>> dataset = load_dataset('csv', data_files="https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/train.csv")

Member: i'm not sure where we should mention this, but showing that you can download from private repos with use_auth_token=True would be useful

The ``csv`` loading script provides a few simple access options to control parsing and reading the CSV files:

- :obj:`skiprows` (int) – Number of first rows in the file to skip (default is 0)
@@ -226,12 +278,6 @@ The ``csv`` loading script provides a few simple access options to control parsi
- :obj:`quotechar` (1-character string) – The character used optionally for quoting CSV values (default ``"``).
- :obj:`quoting` (int) – Control quoting behavior (default 0, setting this to 3 disables quoting, refer to `pandas.read_csv documentation <https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html>`__ for more details).
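
As a quick illustration, here is a sketch of passing a couple of these options through :func:`datasets.load_dataset` (assuming they are forwarded to the ``csv`` script as keyword arguments; the file name is a placeholder):

.. code-block::

    >>> from datasets import load_dataset
    >>> # skip the first row and disable quoting
    >>> dataset = load_dataset('csv', data_files='my_file.csv', skiprows=1, quoting=3)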

If you want more control, the ``csv`` script provides full control on reading, parsing and converting through the Apache Arrow `pyarrow.csv.ReadOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html>`__, `pyarrow.csv.ParseOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html>`__ and `pyarrow.csv.ConvertOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html>`__

- :obj:`read_options` — Can be provided with a `pyarrow.csv.ReadOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html>`__ to control all the reading options. If :obj:`skiprows`, :obj:`column_names` or :obj:`autogenerate_column_names` are also provided (see above), they will take priority over the attributes in :obj:`read_options`.
- :obj:`parse_options` — Can be provided with a `pyarrow.csv.ParseOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html>`__ to control all the parsing options. If :obj:`delimiter` or :obj:`quote_char` are also provided (see above), they will take priority over the attributes in :obj:`parse_options`.
- :obj:`convert_options` — Can be provided with a `pyarrow.csv.ConvertOptions <https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html>`__ to control all the conversion options.


JSON files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -258,6 +304,13 @@ You can load such a dataset direcly with:
>>> from datasets import load_dataset
>>> dataset = load_dataset('json', data_files='my_file.json')

You can also provide the URLs of remote JSON files:

.. code-block::

>>> from datasets import load_dataset
>>> dataset = load_dataset('json', data_files='https://huggingface.co/datasets/allenai/c4/resolve/main/en/c4-train.00000-of-01024.json.gz')

In real life though, JSON files can have diverse formats, and the ``json`` script will accordingly fall back on using Python JSON loading methods to handle the various JSON file formats.

One common occurrence is to have a JSON file with a single root dictionary where the dataset is contained in a specific field, as a list of dicts or a dict of lists.
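
In that case, a minimal sketch of selecting that field could look like the following (assuming the ``json`` script accepts a ``field`` parameter; the file and field names are placeholders):

.. code-block::

    >>> from datasets import load_dataset
    >>> # e.g. a file shaped like {"version": "0.1.0", "data": [{"a": 1}, {"a": 2}]}
    >>> dataset = load_dataset('json', data_files='my_file.json', field='data')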
@@ -289,6 +342,13 @@ This is simply done using the ``text`` loading script which will generate a data
>>> from datasets import load_dataset
>>> dataset = load_dataset('text', data_files={'train': ['my_text_1.txt', 'my_text_2.txt'], 'test': 'my_test_file.txt'})

You can also provide the URLs of remote text files:

.. code-block::

>>> from datasets import load_dataset
>>> dataset = load_dataset('text', data_files={'train': 'https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt'})


Specifying the features of the dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -465,8 +525,8 @@ For example, run the following to skip integrity verifications when loading the
Loading datasets offline
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each dataset builder (e.g. "squad") is a python script that is downloaded and cached either from the 🤗 Datasets GitHub repository or from the `HuggingFace Hub <https://huggingface.co/datasets>`__.
Only the ``text``, ``csv``, ``json`` and ``pandas`` builders are included in ``datasets`` without requiring external downloads.
Each dataset builder (e.g. "squad") is a Python script that is downloaded and cached either from the 🤗 Datasets GitHub repository or from the `Hugging Face Hub <https://huggingface.co/datasets>`__.
Only the ``text``, ``csv``, ``json``, ``parquet`` and ``pandas`` builders are included in ``datasets`` without requiring external downloads.

Therefore, if you don't have an internet connection, you can't load a dataset that is not packaged with ``datasets``, unless the dataset is already cached.
Indeed, if you've already loaded the dataset once before (when you had an internet connection), then the dataset is reloaded from the cache and you can use it offline.
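
For example, here is a minimal sketch of loading data offline with one of the packaged builders, which requires no external download (the file name is a placeholder):

.. code-block::

    >>> from datasets import load_dataset
    >>> # the csv builder ships with the library, so no internet connection is needed
    >>> dataset = load_dataset('csv', data_files='my_file.csv')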