Merged
Changes shown from 10 of 42 commits
- 1984bac (Jul 16, 2021)
- 086a19a: Merge remote-tracking branch 'origin/master' (Jul 16, 2021)
- da84f8e (Jul 16, 2021)
- d5c9d33: add instructions for venv and dataset builder to tutorial (Jul 19, 2021)
- 80cf49f (Jul 23, 2021)
- 42f5c7a (Jul 26, 2021)
- 5803ed8 (Jul 26, 2021)
- e8fa2cf: add concept guide for dataset features (Jul 27, 2021)
- c52c4b4: add concept guides for cache, load, map_batch (Jul 28, 2021)
- 17a0e24 (Jul 30, 2021)
- 6c2bb57 (Jul 30, 2021)
- d720143 (Jul 30, 2021)
- d9eff0e: Update about_dataset_load.rst (lhoestq, Jul 30, 2021)
- 3f267a1 (Jul 30, 2021)
- 7a2c5bc: Merge remote-tracking branch 'origin/master' (Jul 30, 2021)
- e25fb40: Add improve performance section, update doc link to splits (Jul 31, 2021)
- de5d66c: Add section for streaming/iterable datasets (Aug 2, 2021)
- 02889e6: Add edits from review (Aug 6, 2021)
- 66fa452: Add some more edits from review (Aug 11, 2021)
- eb23a30: Add more details for loading dataset from Hub (Aug 25, 2021)
- f1037e8: Merge remote-tracking branch 'upstream/master' into master (lhoestq, Sep 7, 2021)
- 1f9610e: minor improvements (lhoestq, Sep 7, 2021)
- 9fb32e2: explain more differences between community and canonical (lhoestq, Sep 7, 2021)
- 0059f7a: about arrow (lhoestq, Sep 7, 2021)
- e200a3c: about cache (lhoestq, Sep 7, 2021)
- ec0a1b3: Merge remote-tracking branch 'upstream/master' into master (lhoestq, Sep 7, 2021)
- 05f29fd: fix docs (lhoestq, Sep 7, 2021)
- 73d4272: map batch n. rows mistmatch example (lhoestq, Sep 8, 2021)
- 9e5b801: integrity - how and why to ignore verifications (lhoestq, Sep 8, 2021)
- 8d8dccb: more details about arrow in about_arrow (lhoestq, Sep 8, 2021)
- 042f08d: add inspect.py functions to documentation (lhoestq, Sep 8, 2021)
- 06c31e3: separate dataset share/script/card pages (lhoestq, Sep 8, 2021)
- fcde35a: add minor changes to match style (Sep 9, 2021)
- e998410: Apply suggestions from code review (lhoestq, Sep 9, 2021)
- 7564a1f: Apply suggestions from code review (Sep 9, 2021)
- 0ab7fb4: Apply suggestions from code review (lhoestq, Sep 10, 2021)
- a462633: Fix typo in added sentence (albertvillanova, Sep 10, 2021)
- c7a325e: Remove trailing whitespace (albertvillanova, Sep 10, 2021)
- 2bda4dd: Apply suggestions from code review (Sep 10, 2021)
- f798728: Apply suggestions from code review (lhoestq, Sep 13, 2021)
- 13ae8c9: Apply suggestions from code review (lhoestq, Sep 13, 2021)
- 00cc036: latest comments from albert (lhoestq, Sep 13, 2021)
35 changes: 35 additions & 0 deletions docs/source/about_arrow.md
@@ -0,0 +1,35 @@
# Datasets 🤝 Arrow

TO DO:

A brief introduction on why Datasets chose to use Arrow. For example, include some context and background like design decisions and constraints. The user should understand why we decided to use Arrow instead of something else.

## What is Arrow?

Arrow enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:

* Arrow's standard format allows zero-copy reads, which remove virtually all serialization overhead.
* Arrow is language-agnostic so it supports different programming languages.
* Arrow is column-oriented so it is faster at querying and processing data.
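
To get a feel for what column-oriented means in practice, here is a minimal sketch using `pyarrow` directly (the table contents are made up for illustration):

```python
>>> import pyarrow as pa
>>> # Each column is stored as its own contiguous array, so selecting a column is cheap.
>>> table = pa.table({"text": ["foo", "bar", "baz"], "label": [0, 1, 0]})
>>> table.column("label").to_pylist()
[0, 1, 0]
```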

## Performance

TO DO:

Discussion on Arrow's speed and performance, especially as it relates to Datasets. In particular, this [tweet] from Thom is worth unpacking to explain how Datasets can iterate over such a massive dataset so quickly.
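
A hedged sketch of how you could measure this yourself (the dataset choice, batch size, and the use of `dataset.dataset_size` for reporting are illustrative assumptions, not taken from the tweet):

```python
import time
from datasets import load_dataset

# Memory-mapped after the first download, so iterating does not load everything into RAM.
dataset = load_dataset("wikipedia", "20200501.en", split="train")

batch_size = 1000
start = time.time()
for i in range(0, len(dataset), batch_size):
    batch = dataset[i : i + batch_size]  # rows are read on demand from the Arrow file on disk
elapsed = time.time() - start
print(f"Iterated over about {dataset.dataset_size / 1e9:.0f} GB in {elapsed:.1f} s")
```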

## Memory

TO DO:

Discussion on memory-mapping and efficiency, which enables this:

```python
>>> from datasets import load_dataset, total_allocated_bytes
>>> # Load any dataset first; the dataset choice and the sizes printed below are illustrative.
>>> dataset = load_dataset('glue', 'mrpc', split='train')
>>> print("The number of bytes allocated on the drive is", dataset.dataset_size)
The number of bytes allocated on the drive is 1492156
>>> print("For comparison, here is the number of bytes allocated in memory:", total_allocated_bytes())
For comparison, here is the number of bytes allocated in memory: 0
```

[tweet]: https://twitter.com/Thom_Wolf/status/1272512974935203841
32 changes: 32 additions & 0 deletions docs/source/about_cache.rst
@@ -0,0 +1,32 @@
The Cache
=========

One of the reasons why Datasets is so efficient is the cache. It stores previously downloaded and processed datasets, so when you need to use them again, Datasets reloads them straight from the cache. This avoids having to download a dataset all over again, or reapplying all the processing functions. Even after you close and start another Python session, Datasets will reload your dataset directly from the cache!

Fingerprint
-----------

[Reviewer comment (Member): I really liked reading this section - super clear!]

How does the cache keep track of what transforms you applied to your dataset? Datasets assigns a fingerprint to the cache file, which keeps track of the current state of a dataset. The initial fingerprint is computed from a hash of the Arrow table, or a hash of the Arrow files if the dataset is on disk. Subsequent fingerprints are computed by combining the fingerprint of the previous state with a hash of the latest transform applied.

.. tip::

Transforms are any of the processing methods from the :doc:`How-to Process <./process>` guides such as :func:`datasets.Dataset.map` or :func:`datasets.Dataset.shuffle`.

Here is what the actual fingerprints look like:

>>> from datasets import Dataset
>>> dataset1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> dataset2 = dataset1.map(lambda x: {"a": x["a"] + 1})
>>> print(dataset1._fingerprint, dataset2._fingerprint)
d19493523d95e2dc 5b86abacd4b42434

In order for a transform to be hashable, it needs to be picklable using `dill <https://dill.readthedocs.io/en/latest/>`_ or `pickle <https://docs.python.org/3/library/pickle.html>`_. When you use a non-hashable transform, Datasets uses a random fingerprint instead and raises a warning. Since the random fingerprint changes on every run, the transform is considered different from the previous transforms, and Datasets will recompute everything each time. Make sure your transforms are serializable with pickle or dill to avoid this!
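
For example, here is a sketch of a transform that can't be hashed because it closes over a generator, which (as assumed here) dill can't serialize, so Datasets falls back to a random fingerprint and logs a warning:

>>> from datasets import Dataset
>>> dataset = Dataset.from_dict({"a": [0, 1, 2]})
>>> offsets = (i for i in range(3))
>>> # The lambda's closure contains a generator object, so it can't be pickled:
>>> # Datasets warns that the transform isn't hashable and uses a random fingerprint.
>>> shifted = dataset.map(lambda x: {"a": x["a"] + next(offsets)})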

Another case where Datasets recomputes everything is when caching is disabled. Cache files are still created, but they are written to a temporary directory that gets deleted once the session ends, and they are assigned a random hash instead of a fingerprint.

.. tip::

If caching is disabled, use :func:`datasets.Dataset.save_to_disk` to save your transformed dataset or it will be deleted once the session ends.
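
For instance, a minimal sketch (assuming ``datasets.set_caching_enabled`` is the helper available in your version for toggling the cache):

>>> from datasets import Dataset, set_caching_enabled
>>> set_caching_enabled(False)  # cache files now go to a temporary directory with random hashes
>>> dataset = Dataset.from_dict({"a": [0, 1, 2]})
>>> processed = dataset.map(lambda x: {"a": x["a"] + 1})
>>> processed.save_to_disk("path/to/processed_dataset")  # persist it before the session ends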

TO DO: Explain why it needs to be picklable to give the user more context.

45 changes: 45 additions & 0 deletions docs/source/about_dataset_features.rst
@@ -0,0 +1,45 @@
Dataset features
================

:class:`datasets.Features` defines the internal structure of a dataset, and is used to specify the underlying serialization format. What's more relevant to you, though, is that :class:`datasets.Features` contains high-level information about everything from the column names and types to the :class:`datasets.ClassLabel`. Datasets uses Apache Arrow's `automatic type inference <https://arrow.apache.org/docs/python/data.html>`_ to generate the features of your dataset. This guide will help you gain a better understanding of Datasets features.

.. tip::

See the Troubleshooting section of the How-to Load guide to learn how to manually specify features in case Arrow infers an unexpected data type.

The format of :class:`datasets.Features` is simple: ``dict[column_name, column_type]``. The column type provides a wide range of options for describing the type of data you have. Let's take a look at the features of the MRPC dataset again:

>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')
>>> dataset.features
{'idx': Value(dtype='int32', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
}

The :class:`datasets.Value` feature tells Datasets that the ``idx`` data type is ``int32``, and the sentences' data types are ``string``. A large number of other data types are supported such as ``bool``, ``float32`` and ``binary`` to name just a few. Take a look at the :class:`datasets.Value` reference for a full list of supported data types.


:class:`datasets.ClassLabel` informs Datasets that the ``label`` column contains two classes with the labels ``not_equivalent`` and ``equivalent``. The labels are stored as integers in the dataset. When you retrieve the labels, :func:`datasets.ClassLabel.int2str` and :func:`datasets.ClassLabel.str2int` carry out the conversion from integer value to label name, and vice versa.
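
For example, with the MRPC features shown above:

>>> dataset.features['label'].int2str(0)
'not_equivalent'
>>> dataset.features['label'].str2int('equivalent')
1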

If your data type contains a list of objects, then you want to use the :class:`datasets.Sequence` feature. Remember the SQuAD dataset?

>>> from datasets import load_dataset
>>> dataset = load_dataset('squad', split='train')
>>> dataset.features
{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
'context': Value(dtype='string', id=None),
'id': Value(dtype='string', id=None),
'question': Value(dtype='string', id=None),
'title': Value(dtype='string', id=None)}

The ``answers`` field is constructed using the :class:`datasets.Sequence` feature because it contains two sub-fields, ``text`` and ``answer_start``, which are lists of strings and integers respectively.

.. tip::

See the Flatten section of the :doc:`How-to Process <./process>` guides to learn how you can extract the nested sub-fields as their own independent columns.

Lastly, there are two specific features for machine translation: :class:`datasets.Translation` and :class:`datasets.TranslationVariableLanguages`.

[I think for the translation features, we should either add some example code like we did for the other features or remove it all together.]
[Author comment (Member Author): I think maybe we should add sample code for the translation classes and a brief explanation on how to use it. Otherwise we can just remove it and refer the user to the package reference.]
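
As a possible sketch in response to that note (the sentences are invented for illustration):

>>> from datasets import Dataset, Features, Translation
>>> features = Features({'translation': Translation(languages=['en', 'fr'])})
>>> dataset = Dataset.from_dict({'translation': [{'en': 'the cat', 'fr': 'le chat'}]}, features=features)
>>> dataset[0]['translation']
{'en': 'the cat', 'fr': 'le chat'}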

92 changes: 92 additions & 0 deletions docs/source/about_dataset_load.rst
@@ -0,0 +1,92 @@
Loading a Dataset
==================

Entire datasets are readily available with a single line of code: :func:`datasets.load_dataset`. But how does this simple function serve you whatever dataset you request? This guide will help you understand how :func:`datasets.load_dataset` works.

What happens when you call :func:`datasets.load_dataset`?
------------------------------------------------------------

In the beginning, :func:`datasets.load_dataset` downloads and imports the dataset loading script associated with the dataset you requested from the Hugging Face Hub. The Hub is a central repository where all the Hugging Face datasets and models are stored. Code in the loading script defines the dataset information (description, features, URL to the original files, etc.), and tells Datasets how to generate and display examples from it.

.. seealso::

Read the Share section for a step-by-step guide on how to write your own Dataset loading script!

The loading script will download the dataset files from the original URL, and cache the dataset in an Arrow table on your drive. If you've downloaded the dataset before, then Datasets will reload it from the cache to save you the trouble of downloading it again. Finally, Datasets will return the dataset built from the splits specified by the user. In the next section, let's dive a little deeper into the nitty-gritty of how all this works.

Building a Dataset
------------------

When you load a dataset for the first time, Datasets takes the raw data file and builds it into a table of rows and typed columns. There are two main classes that are responsible for building a dataset: :class:`datasets.BuilderConfig` and :class:`datasets.DatasetBuilder`.

.. image:: /imgs/builderconfig.png
:align: center

:class:`datasets.BuilderConfig`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`datasets.BuilderConfig` is the configuration class of :class:`datasets.DatasetBuilder`. The :class:`datasets.BuilderConfig` contains the following basic attributes about a dataset:

.. list-table::
:header-rows: 1

* - Attribute
- Description
* - :obj:`name`
- short name of the dataset
* - :obj:`version`
- dataset version identifier
* - :obj:`data_dir`
- stores the path to a local folder containing the data files
* - :obj:`data_files`
- stores paths to local data files
* - :obj:`description`
- description of the dataset

If you want to add additional attributes to your dataset such as the class labels, you can subclass the base :class:`datasets.BuilderConfig` class. There are two ways to populate the attributes of a :class:`datasets.BuilderConfig` class or subclass:

* Provide a list of predefined :class:`datasets.BuilderConfig` (or subclass) instances in the :attr:`datasets.DatasetBuilder.BUILDER_CONFIGS` attribute of the dataset.

* When you call :func:`datasets.load_dataset`, any keyword arguments that are not specific to the method will be used to set the associated attributes of the :class:`datasets.BuilderConfig` class. This overrides the predefined attributes.
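
For example, here is a hedged sketch of a :class:`datasets.BuilderConfig` subclass with an extra attribute, and the predefined instances a builder could expose (all names and values are hypothetical):

.. code-block:: python

    import datasets

    class MyDatasetConfig(datasets.BuilderConfig):
        """BuilderConfig with an extra ``label_classes`` attribute (illustrative)."""

        def __init__(self, label_classes=("negative", "positive"), **kwargs):
            super().__init__(**kwargs)
            self.label_classes = list(label_classes)

    # Predefined instances that a DatasetBuilder subclass would list in BUILDER_CONFIGS.
    BUILDER_CONFIGS = [
        MyDatasetConfig(name="binary", version=datasets.Version("1.0.0"), description="Two-class labels"),
        MyDatasetConfig(
            name="fine_grained",
            version=datasets.Version("1.0.0"),
            description="Five-class labels",
            label_classes=("1", "2", "3", "4", "5"),
        ),
    ]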

:class:`datasets.DatasetBuilder`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`datasets.DatasetBuilder` accesses all the attributes inside :class:`datasets.BuilderConfig` to build the actual dataset.

.. image:: /imgs/datasetbuilder.png
:align: center

There are three main methods :class:`datasets.DatasetBuilder` uses:

1. :func:`datasets.DatasetBuilder._info` is in charge of defining the dataset attributes. When you call ``dataset.info``, Datasets returns the information stored here. Likewise, the :class:`datasets.Features` are also specified here. Remember, :class:`datasets.Features` is like the skeleton of the dataset: it provides the names and types of each column.

.. seealso::

Take a look at the package reference of :class:`datasets.DatasetInfo` for a full list of attributes.

2. :func:`datasets.DatasetBuilder._split_generators` downloads or retrieves the requested data files, organizes them into splits, and defines specific arguments for the generation process. This method has a :class:`datasets.DownloadManager` that downloads files or fetches them from your local filesystem. The :class:`datasets.DownloadManager` contains a :func:`datasets.DownloadManager.download_and_extract` method that takes a dictionary of URLs to the original data files, and downloads or retrieves the requested files. It is flexible in the type of inputs it accepts: a single URL or path, or a list/dictionary of URLs or paths. On top of this, :func:`datasets.DownloadManager.download_and_extract` will also extract compressed tar, gzip and zip archives.

It returns a list of :class:`datasets.SplitGenerator`. The :class:`datasets.SplitGenerator` contains the name of the split, and keyword arguments that are provided to the :func:`datasets.DatasetBuilder._generate_examples` method. The keyword arguments can be specific to each split, and typically comprise at least the local path to the data files to load for each split.

.. tip::

:func:`datasets.DownloadManager.download_and_extract` can download files from a wide range of sources. If the data files are hosted on a special access server, you should use :func:`datasets.DownloadManager.download_custom`. Refer to the package reference of :class:`datasets.DownloadManager` for more details.

3. :func:`datasets.DatasetBuilder._generate_examples` reads and parses the data files for a split, and yields examples with the format specified in the ``features`` from :func:`datasets.DatasetBuilder._info`. The input of :func:`datasets.DatasetBuilder._generate_examples` is the ``filepath`` provided in the ``gen_kwargs`` of the previous method.

The dataset is generated with a Python generator, which doesn't load all the data in memory. As a result, the generator can handle large datasets. However, before the generated samples are flushed to the dataset file on disk, they are stored in an ``ArrowWriter`` buffer, so the generated samples are written by batch. If your dataset samples consume a lot of memory (images or videos), then make sure to specify a low value for the ``DEFAULT_WRITER_BATCH_SIZE`` attribute in :class:`datasets.DatasetBuilder`. We recommend not exceeding a size of 200 MB.
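
To make these three steps concrete, here is a hedged, minimal sketch of a loading script (the class name, URL, and fields are hypothetical, not taken from a real dataset):

.. code-block:: python

    import json

    import datasets

    _URL = "https://example.com/train.jsonl"  # hypothetical data file

    class MyDataset(datasets.GeneratorBasedBuilder):
        """A minimal, illustrative dataset builder."""

        DEFAULT_WRITER_BATCH_SIZE = 1000  # lower this if individual samples are large (e.g. images)

        def _info(self):
            # Step 1: define the dataset attributes and features.
            return datasets.DatasetInfo(
                description="A toy dataset.",
                features=datasets.Features(
                    {
                        "text": datasets.Value("string"),
                        "label": datasets.ClassLabel(names=["neg", "pos"]),
                    }
                ),
            )

        def _split_generators(self, dl_manager):
            # Step 2: download the data files and organize them into splits.
            filepath = dl_manager.download_and_extract(_URL)
            return [
                datasets.SplitGenerator(
                    name=datasets.Split.TRAIN, gen_kwargs={"filepath": filepath}
                )
            ]

        def _generate_examples(self, filepath):
            # Step 3: read the data file and yield (key, example) pairs matching the features.
            with open(filepath, encoding="utf-8") as f:
                for idx, line in enumerate(f):
                    record = json.loads(line)
                    yield idx, {"text": record["text"], "label": record["label"]}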


Maintaining integrity
---------------------

To ensure a dataset is complete, :func:`datasets.load_dataset` will perform some tests on the downloaded files to make sure everything is there. This way, you don't encounter any nasty surprises when your requested dataset doesn't get generated as expected. :func:`datasets.load_dataset` verifies:

* the list of downloaded files
* the number of bytes of the downloaded files
* the SHA256 checksums of the downloaded files
* the number of splits in the generated ``DatasetDict``
* the number of samples in each split of the generated ``DatasetDict``

TO DO: Explain why you would want to disable the verifications or override the information used to perform the verifications.
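
For example, a hedged sketch of skipping these checks (assuming the ``ignore_verifications`` argument of :func:`datasets.load_dataset` in this version of the library):

>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train', ignore_verifications=True)
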
32 changes: 32 additions & 0 deletions docs/source/about_map_batch.rst
@@ -0,0 +1,32 @@
Batch mapping
=============

[Reviewer comment (Member): Should "mapping" be uppercase for consistency elsewhere? (Not sure what the convention is)]

[Reviewer comment (Member): I think the convention is lowercase, unless we refer to a class or something that requires an uppercase, or if you want to emphasize specific words.]

[Author comment: This actually brings up a good point on convention! I've been using the Google developer style guide, and we should use sentence case for headings and titles. I just changed a couple of headings, so it should reflect that now. :) For mapping, I think it is ok to leave it in lowercase since it is a general thing that isn't specific to HF.]

Combining the utility of :func:`datasets.Dataset.map` with batch mode is very powerful. It allows you to speed up processing, and freely control the size of the generated dataset.

Speed
-----

The primary use for batch map is to speed up processing. Sometimes it is faster to work with batches of data instead of single examples. Naturally, batch map lends itself to tokenization. For example, the 🤗 `Tokenizers <https://huggingface.co/docs/tokenizers/python/latest/>`_ library works faster with batches because it parallelizes the tokenization of all the examples in a batch.
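
For instance, a sketch of batched tokenization (the checkpoint and dataset are arbitrary examples, and 🤗 Transformers is assumed to be installed):

>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> dataset = load_dataset('glue', 'mrpc', split='train')
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
>>> tokenized = dataset.map(lambda batch: tokenizer(batch['sentence1'], truncation=True), batched=True)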

Input size != output size
-------------------------

The ability to control the generated dataset size can be taken advantage of for many interesting use-cases. In the Map section of the :doc:`How-to Process <./process>` guides, there are examples of how to use batch mapping:

* split long sentences into shorter chunks
* augment the dataset with additional tokens

It will be helpful to understand how this works, so you can come up with your own ways to use batch map.

You may be wondering how you can control the size of the generated dataset. The answer is:

✨ The mapped function does not have to return an output batch of the same size. ✨

In other words, the mapped function can take a batch of size ``N`` and return a batch of size ``M``, where ``M`` can be greater or less than ``N``. This means you can concatenate your examples, divide them up, and even add more examples!
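
As a small sketch (dataset contents invented for illustration):

>>> from datasets import Dataset
>>> dataset = Dataset.from_dict({"a": [0, 1, 2]})
>>> # The output batch has twice as many rows as the input batch.
>>> doubled = dataset.map(lambda batch: {"b": batch["a"] * 2}, batched=True, remove_columns=dataset.column_names)
>>> len(doubled)
6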

However, you need to remember that each field in the output dictionary must contain the **same number of elements** as the other fields in the output dictionary. Otherwise, it is not possible to define the number of examples in the output returned by the mapped function. The number can vary between successive batches processed by the mapped function, but within a single batch, all fields of the output dictionary must have the same number of elements.

TO DO:

Maybe add a code example of when the number of elements in the field of an output dictionary aren't the same, so the user knows what not to do.
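
For instance, a sketch of what not to do (column names invented for illustration):

>>> from datasets import Dataset
>>> dataset = Dataset.from_dict({"a": [0, 1, 2]})
>>> # "b" has 6 elements but "c" only has 3 in the same output batch, so this raises an error.
>>> dataset.map(lambda batch: {"b": batch["a"] * 2, "c": batch["a"]}, batched=True, remove_columns=dataset.column_names)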

26 changes: 26 additions & 0 deletions docs/source/about_metrics.rst
@@ -0,0 +1,26 @@
All About Metrics
=================

Datasets provides access to a wide range of NLP metrics. You can load metrics associated with benchmark datasets like GLUE or SQuAD, and complex metrics like BLEURT or BERTScore, with a single line of code: :func:`datasets.load_metric`. Once you've loaded a metric, you can easily compute and evaluate a model's performance.
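
For example (the metric choice and values are just for illustration):

>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc')
>>> metric.compute(predictions=[0, 1, 1], references=[0, 1, 0])  # returns a dict like {'accuracy': ..., 'f1': ...}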

What happens when you call ``load_metric``?
-------------------------------------------

Loading a dataset and loading a metric share many similarities. This was an intentional design choice because we wanted to create a simple and unified experience for you. When you call :func:`datasets.load_metric`, the metric loading script is downloaded and imported from the Hub (if it hasn't already been downloaded before). The metric loading script itself contains information about the metric such as its citation, homepage, and description.

The metric loading script will instantiate and return a :class:`datasets.Metric` object. This stores the predictions and references, which you need to compute the metric values. The :class:`datasets.Metric` object is stored as an Apache Arrow table. As a result, the predictions and references are stored directly on disk with memory-mapping. This allows Datasets to do a lazy computation of the metric, and makes it easy to gather all the predictions in a distributed setting.

TO DO: Briefly explain what lazy computation is.

Distributed evaluation
----------------------

Computing metrics in a distributed environment can be tricky. Metric evaluation is executed in separate Python processes, or nodes, on different subsets of a dataset. Typically, when a metric score is additive (``f(AuB) = f(A) + f(B)``), you can use distributed reduce operations to gather the scores for each subset of the dataset. But when a metric is non-additive (``f(AuB) ≠ f(A) + f(B)``), it's not that simple. For example, you can't just add up all the `F1 <https://huggingface.co/metrics/f1>`_ scores of each data subset. A common way to overcome this issue is to fall back on single-process evaluation: the metric is evaluated on a single GPU, which is inefficient.

TO DO: Briefly explain what distributed reduce operations are.

Datasets solves this by only computing the final metric on the first node. The predictions and references are computed and provided to the metric separately for each node. These are temporarily stored in an Apache Arrow table, avoiding cluttering the GPU or CPU memory. When you are ready to compute the final metric with :func:`datasets.Metric.compute`, the first node accesses the predictions and references stored on all the other nodes. Once it has gathered all the predictions and references, :func:`datasets.Metric.compute` performs the final metric evaluation.

This method allows Datasets to perform distributed predictions, which is important for evaluation speed in distributed settings. At the same time, you can also use complex non-additive metrics without wasting valuable GPU or CPU memory.
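
A rough sketch of the setup described above (``rank``, ``num_nodes``, ``model_predictions`` and ``gold_labels`` are placeholders):

>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', num_process=num_nodes, process_id=rank)
>>> metric.add_batch(predictions=model_predictions, references=gold_labels)
>>> score = metric.compute()  # gathers from every node and returns the final score on the first node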

TO DO: More explanation on how the file locks perform the synchronization, or remove this part.
[Author comment (Member Author): TODO: Add some more explanation about how the file lock synchronization works.]
