New documentation structure #2718
# Datasets 🤝 Arrow

## What is Arrow?

[Arrow](https://arrow.apache.org/) enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:

* Arrow's standard format allows [zero-copy reads](https://en.wikipedia.org/wiki/Zero-copy), which removes virtually all serialization overhead.
* Arrow is language-agnostic, so it supports different programming languages.
* Arrow is column-oriented, so it is faster at querying and processing slices or columns of data.
* Arrow allows for copy-free hand-offs to standard machine learning tools such as NumPy, Pandas, PyTorch, and TensorFlow.
* Arrow supports many, possibly nested, column types.
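To make the column-oriented point concrete, here is a small illustration in plain Python (not Arrow itself, purely a sketch of the two layouts): the same table stored row-wise versus column-wise, where reading a single column in the columnar layout is one contiguous access instead of a scan over every record.

```python
# Row-oriented layout: one record per entry.
# Reading the "id" column requires touching every row object.
rows = [
    {"id": 0, "text": "hello"},
    {"id": 1, "text": "world"},
    {"id": 2, "text": "arrow"},
]
ids_from_rows = [r["id"] for r in rows]

# Column-oriented layout (how Arrow stores data): each column is
# contiguous, so slicing or scanning one column never touches the others.
columns = {"id": [0, 1, 2], "text": ["hello", "world", "arrow"]}
ids_from_columns = columns["id"]
```

In real Arrow the column buffers are typed, contiguous memory regions, which is what makes the zero-copy hand-offs to NumPy or Pandas possible.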
## Memory-mapping

🤗 Datasets uses Arrow for its local caching system. It allows datasets to be backed by an on-disk cache, which is memory-mapped for fast lookup. This architecture allows large datasets to be used on machines with relatively small device memory.

For example, loading the full English Wikipedia dataset only takes a few MB of RAM:
```python
>>> import os
>>> import psutil
>>> import timeit
>>> from datasets import load_dataset

# Process.memory_info is expressed in bytes, so shift by 20 bits to convert to megabytes
>>> mem_before = psutil.Process(os.getpid()).memory_info().rss >> 20
>>> wiki = load_dataset("wikipedia", "20200501.en", split="train")
>>> mem_after = psutil.Process(os.getpid()).memory_info().rss >> 20

>>> print(f"RAM memory used: {mem_after - mem_before} MB")
RAM memory used: 9 MB
```
This is possible because the Arrow data is actually memory-mapped from disk, and not loaded in memory. Memory-mapping allows access to data on disk, and leverages virtual memory capabilities for fast lookups.

## Performance

Iterating over a memory-mapped dataset using Arrow is fast. Iterating over Wikipedia on a laptop gives you speeds of 1-3 Gbit/s:
```python
>>> s = """batch_size = 1000
... for i in range(0, len(wiki), batch_size):
...     batch = wiki[i:i + batch_size]
... """

>>> time = timeit.timeit(stmt=s, number=1, globals=globals())
>>> print(f"Time to iterate over the {wiki.dataset_size >> 30} GB dataset: {time:.1f} sec, "
...       f"ie. {float(wiki.dataset_size >> 27)/time:.1f} Gb/s")
Time to iterate over the 17 GB dataset: 85 sec, ie. 1.7 Gb/s
```

You can obtain the best performance by accessing slices of data (or "batches"), in order to reduce the number of lookups on disk.
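The memory-mapping described above is an operating-system facility, not something specific to 🤗 Datasets. As a minimal, self-contained sketch of the mechanism (using Python's standard `mmap` module, unrelated to the Datasets internals):

```python
import mmap
import os
import tempfile

# Write a file to disk, then access it through a memory map. The bytes
# are paged in by the OS on demand rather than read into process memory
# up front, which is why even huge files cost little RAM to open.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (1 << 20))  # 1 MiB of data
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = mm[0:4]   # only the pages actually touched are loaded
    last = mm[-4:]
    mm.close()

os.remove(path)
```

Arrow files in the Datasets cache are opened in essentially this fashion, which is why the Wikipedia example above fits in a few MB of RAM.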
The cache
=========

The cache is one of the reasons why 🤗 Datasets is so efficient. It stores previously downloaded and processed datasets so when you need to use them again, they are reloaded directly from the cache. This avoids having to download a dataset all over again, or reapply processing functions. Even after you close and start another Python session, 🤗 Datasets will reload your dataset directly from the cache!

Fingerprint
-----------

How does the cache keep track of what transforms are applied to a dataset? Well, 🤗 Datasets assigns a fingerprint to the cache file. A fingerprint keeps track of the current state of a dataset. The initial fingerprint is computed using a hash of the Arrow table, or a hash of the Arrow files if the dataset is on disk. Subsequent fingerprints are computed by combining the fingerprint of the previous state and a hash of the latest transform applied.
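Conceptually, this chaining works like the sketch below. It is purely illustrative (the actual fingerprint computation in 🤗 Datasets is different and based on ``dill``), but it shows the idea: each new fingerprint is a deterministic function of the previous fingerprint and the latest transform.

```python
import hashlib

def update_fingerprint(previous_fingerprint: str, transform_hash: str) -> str:
    # Illustrative only: combine the previous state's fingerprint with a
    # hash of the latest transform to get the new state's fingerprint.
    combined = (previous_fingerprint + transform_hash).encode("utf-8")
    return hashlib.sha256(combined).hexdigest()[:16]

# Initial fingerprint: a hash of the underlying data.
initial = hashlib.sha256(b"arrow-table-bytes").hexdigest()[:16]
# Applying a transform produces a new fingerprint from the old one.
after_map = update_fingerprint(initial, "hash-of-map-fn")
```

Because the chain is deterministic, replaying the same transforms on the same data always reproduces the same fingerprint, which is what lets the cache find previously computed results.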
.. tip::

    Transforms are any of the processing methods from the :doc:`How-to Process <./process>` guides, such as :func:`datasets.Dataset.map` or :func:`datasets.Dataset.shuffle`.

Here is what the actual fingerprints look like:

.. code-block::

    >>> from datasets import Dataset
    >>> dataset1 = Dataset.from_dict({"a": [0, 1, 2]})
    >>> dataset2 = dataset1.map(lambda x: {"a": x["a"] + 1})
    >>> print(dataset1._fingerprint, dataset2._fingerprint)
    d19493523d95e2dc 5b86abacd4b42434
In order for a transform to be hashable, it needs to be picklable by `dill <https://dill.readthedocs.io/en/latest/>`_ or `pickle <https://docs.python.org/3/library/pickle.html>`_.

When you use a non-hashable transform, 🤗 Datasets uses a random fingerprint instead and raises a warning. The non-hashable transform is considered different from the previous transforms. As a result, 🤗 Datasets will recompute all the transforms. Make sure your transforms are serializable with pickle or dill to avoid this!

An example of when 🤗 Datasets recomputes everything is when caching is disabled. When this happens, the cache files are generated every time and they get written to a temporary directory. Once your Python session ends, the cache files in the temporary directory are deleted. A random hash is assigned to these cache files, instead of a fingerprint.

.. tip::

    When caching is disabled, use :func:`datasets.Dataset.save_to_disk` to save your transformed dataset, or it will be deleted once the session ends.
Hashing
-------

The fingerprint of a dataset is updated by hashing the function passed to ``map``, as well as the ``map`` parameters (``batch_size``, ``remove_columns``, etc.).

You can check the hash of any Python object using the :class:`datasets.fingerprint.Hasher`:

.. code-block::

    >>> from datasets.fingerprint import Hasher
    >>> my_func = lambda example: {"length": len(example["text"])}
    >>> print(Hasher.hash(my_func))
    3d35e2b3e94c81d6
The hash is computed by dumping the object using a ``dill`` pickler and hashing the dumped bytes. The pickler recursively dumps all the variables used in your function, so any change you make to an object that is used in your function will cause the hash to change.

If one of your functions doesn't seem to have the same hash across sessions, it means at least one of its variables contains a Python object that is not deterministic. When this happens, feel free to hash any object you find suspicious to try to find the object that caused the hash to change. For example, if you use a list whose element order is not deterministic across sessions, then the hash won't be the same across sessions either.
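The dump-then-hash idea can be sketched with the standard library alone. Note this is a simplified stand-in, not the actual :class:`datasets.fingerprint.Hasher` (which uses ``dill`` so it can also serialize lambdas and closures): any change to a value your function depends on changes the dumped bytes, and therefore the hash.

```python
import hashlib
import pickle

def naive_hash(obj) -> str:
    # Simplified stand-in for the Hasher: serialize the object with the
    # stdlib pickler and hash the resulting bytes.
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()[:16]

# Hypothetical config objects a transform might depend on:
config_v1 = {"lowercase": True, "max_length": 128}
config_v2 = {"lowercase": True, "max_length": 256}
# Changing max_length changes the pickled bytes, hence the hash,
# hence the fingerprint of any transform that uses the config.
```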
Dataset features
================

:class:`datasets.Features` defines the internal structure of a dataset. :class:`datasets.Features` is used to specify the underlying serialization format. What's more interesting to you, though, is that :class:`datasets.Features` contains high-level information about everything from the column names and types to the :class:`datasets.ClassLabel`. You can think of :class:`datasets.Features` as the backbone of a dataset.

The :class:`datasets.Features` format is simple: ``dict[column_name, column_type]``. It is a dictionary of column name and column type pairs. The column type provides a wide range of options for describing the type of data you have.

Let's have a look at the features of the MRPC dataset from the GLUE benchmark:
.. code-block::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset('glue', 'mrpc', split='train')
    >>> dataset.features
    {'idx': Value(dtype='int32', id=None),
     'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
     'sentence1': Value(dtype='string', id=None),
     'sentence2': Value(dtype='string', id=None)}

The :class:`datasets.Value` feature tells 🤗 Datasets:

* The ``idx`` data type is ``int32``.
* The ``sentence1`` and ``sentence2`` data types are ``string``.

🤗 Datasets supports many other data types, such as ``bool``, ``float32`` and ``binary``, to name just a few.
.. seealso::

    Refer to :class:`datasets.Value` for a full list of supported data types.

The :class:`datasets.ClassLabel` feature informs 🤗 Datasets that the ``label`` column contains two classes. The classes are labeled ``not_equivalent`` and ``equivalent``. Labels are stored as integers in the dataset. When you retrieve the labels, :func:`datasets.ClassLabel.int2str` and :func:`datasets.ClassLabel.str2int` carry out the conversion from integer value to label name, and vice versa.

If your data type contains a list of objects, then you want to use the :class:`datasets.Sequence` feature. Remember the SQuAD dataset?
.. code-block::

    >>> from datasets import load_dataset
    >>> dataset = load_dataset('squad', split='train')
    >>> dataset.features
    {'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
     'context': Value(dtype='string', id=None),
     'id': Value(dtype='string', id=None),
     'question': Value(dtype='string', id=None),
     'title': Value(dtype='string', id=None)}

The ``answers`` field is constructed using the :class:`datasets.Sequence` feature because it contains two subfields, ``text`` and ``answer_start``, which are lists of ``string`` and ``int32``, respectively.

.. tip::

    See the :ref:`flatten` section to learn how you can extract the nested subfields as their own independent columns.
Build and load
==============

Nearly every deep learning workflow begins with loading a dataset, which makes it one of the most important steps. With 🤗 Datasets, there are more than 900 datasets available to help you get started with your NLP task. All you have to do is call :func:`datasets.load_dataset` to take your first step. This function is a true workhorse in every sense because it builds and loads every dataset you use.

ELI5: ``load_dataset``
-------------------------------

Let's begin with a basic Explain Like I'm Five.
For community datasets, :func:`datasets.load_dataset` downloads and imports the dataset loading script associated with the requested dataset from the Hugging Face Hub. The Hub is a central repository where all the Hugging Face datasets and models are stored. The code in the loading script defines the dataset information (description, features, URL to the original files, etc.), and tells 🤗 Datasets how to generate and display examples from it.

If you are working with a canonical dataset, :func:`datasets.load_dataset` downloads and imports the dataset loading script from GitHub.

.. seealso::

    Read the :doc:`Share <./share>` section to learn more about the difference between community and canonical datasets. This section also provides a step-by-step guide on how to write your own dataset loading script!

The loading script downloads the dataset files from the original URL, generates the dataset, and caches it in an Arrow table on your drive. If you've downloaded the dataset before, then 🤗 Datasets will reload it from the cache to save you the trouble of downloading it again.

Now that you have a high-level understanding of how datasets are built, let's take a closer look at the nuts and bolts of how all this works.
Building a dataset
------------------

When you load a dataset for the first time, 🤗 Datasets takes the raw data file and builds it into a table of rows and typed columns. There are two main classes responsible for building a dataset: :class:`datasets.BuilderConfig` and :class:`datasets.DatasetBuilder`.

.. image:: /imgs/builderconfig.png
    :align: center

:class:`datasets.BuilderConfig`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`datasets.BuilderConfig` is the configuration class of :class:`datasets.DatasetBuilder`. The :class:`datasets.BuilderConfig` contains the following basic attributes about a dataset:
.. list-table::
    :header-rows: 1

    * - Attribute
      - Description
    * - :obj:`name`
      - Short name of the dataset.
    * - :obj:`version`
      - Dataset version identifier.
    * - :obj:`data_dir`
      - Stores the path to a local folder containing the data files.
    * - :obj:`data_files`
      - Stores paths to local data files.
    * - :obj:`description`
      - Description of the dataset.

If you want to add additional attributes to your dataset, such as the class labels, you can subclass the base :class:`datasets.BuilderConfig` class. There are two ways to populate the attributes of a :class:`datasets.BuilderConfig` class or subclass:

* Provide a list of predefined :class:`datasets.BuilderConfig` class (or subclass) instances in the dataset's :attr:`datasets.DatasetBuilder.BUILDER_CONFIGS` attribute.

* When you call :func:`datasets.load_dataset`, any keyword arguments that are not specific to the method will be used to set the associated attributes of the :class:`datasets.BuilderConfig` class. This will override the predefined attributes if a specific configuration was selected.

You can also set the :attr:`datasets.DatasetBuilder.BUILDER_CONFIG_CLASS` to any custom subclass of :class:`datasets.BuilderConfig`.
:class:`datasets.DatasetBuilder`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`datasets.DatasetBuilder` accesses all the attributes inside :class:`datasets.BuilderConfig` to build the actual dataset.

.. image:: /imgs/datasetbuilder.png
    :align: center

There are three main methods in :class:`datasets.DatasetBuilder`:

1. :func:`datasets.DatasetBuilder._info` is in charge of defining the dataset attributes. When you call ``dataset.info``, 🤗 Datasets returns the information stored here. Likewise, the :class:`datasets.Features` are also specified here. Remember, the :class:`datasets.Features` are like the skeleton of the dataset. They provide the names and types of each column.

2. :func:`datasets.DatasetBuilder._split_generators` downloads or retrieves the requested data files, organizes them into splits, and defines specific arguments for the generation process. This method has a :class:`datasets.DownloadManager` that downloads files or fetches them from your local filesystem. Within the :class:`datasets.DownloadManager`, there is a :func:`datasets.DownloadManager.download_and_extract` method that accepts a dictionary of URLs to the original data files, and downloads the requested files. Accepted inputs include a single URL or path, or a list/dictionary of URLs or paths. Any compressed file types like TAR, GZIP and ZIP archives will be automatically extracted.

   Once the files are downloaded, :class:`datasets.SplitGenerator` organizes them into splits. The :class:`datasets.SplitGenerator` contains the name of the split, and any keyword arguments that are provided to the :func:`datasets.DatasetBuilder._generate_examples` method. The keyword arguments can be specific to each split, and typically comprise at least the local path to the data files for each split.
.. tip::

    :func:`datasets.DownloadManager.download_and_extract` can download files from a wide range of sources. If the data files are hosted on a special access server, you should use :func:`datasets.DownloadManager.download_custom`. Refer to the reference of :class:`datasets.DownloadManager` for more details.

3. :func:`datasets.DatasetBuilder._generate_examples` reads and parses the data files for a split. Then it yields dataset examples according to the format specified in the ``features`` from :func:`datasets.DatasetBuilder._info`. The input of :func:`datasets.DatasetBuilder._generate_examples` is actually the ``filepath`` provided in the keyword arguments of the last method.

   The dataset is generated with a Python generator, which doesn't load all the data in memory. As a result, the generator can handle large datasets. However, before the generated samples are flushed to the dataset file on disk, they are stored in an ``ArrowWriter`` buffer. This means the generated samples are written by batch. If your dataset samples consume a lot of memory (images or videos), then make sure to specify a low value for the ``DEFAULT_WRITER_BATCH_SIZE`` attribute in :class:`datasets.DatasetBuilder`. We recommend not exceeding a size of 200 MB.
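The heart of step 3 is just a generator that yields ``(key, example)`` pairs matching the declared features. Here is a minimal, self-contained sketch of that shape using only the standard library (the column names ``text`` and ``label`` are illustrative, and a real ``_generate_examples`` would receive a ``filepath`` and open it rather than take a string):

```python
import csv
import io

def generate_examples(raw_csv: str):
    # Sketch of what a _generate_examples method does: parse the raw data
    # and lazily yield (key, example) pairs that match the features.
    reader = csv.DictReader(io.StringIO(raw_csv))
    for idx, row in enumerate(reader):
        yield idx, {"text": row["text"], "label": int(row["label"])}

raw = "text,label\nhello,0\nworld,1\n"
examples = list(generate_examples(raw))
```

Because this is a generator, examples are produced one at a time and buffered in batches by the writer, which is what keeps memory usage flat even for very large datasets.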
Without loading scripts
-----------------------

As a user, you want to be able to quickly use a dataset. Implementing a dataset loading script can sometimes get in the way, or it may be a barrier for some people without a developer background. 🤗 Datasets removes this barrier by making it possible to load any dataset from the Hub without a dataset loading script. All a user has to do is upload the data files (see :ref:`upload_dataset_repo` for a list of supported file formats) to a dataset repository on the Hub, and they will be able to load that dataset without having to create a loading script. This doesn't mean we are moving away from loading scripts, because they still offer the most flexibility in controlling how a dataset is generated.

The loading script-free method uses the `huggingface_hub <https://github.com/huggingface/huggingface_hub>`_ library to list the files in a dataset repository. You can also provide a path to a local directory instead of a repository name, in which case 🤗 Datasets will use `glob <https://docs.python.org/3/library/glob.html>`_ instead. Depending on the format of the data files available, one of the data file builders will create your dataset for you. If you have a CSV file, the CSV builder will be used, and if you have a Parquet file, the Parquet builder will be used. The drawback of this approach is that it's not possible to simultaneously load a CSV and a JSON file. You will need to load the two file types separately, and then concatenate them.
Maintaining integrity
---------------------

To ensure a dataset is complete, :func:`datasets.load_dataset` will perform a series of tests on the downloaded files to make sure everything is there. This way, you don't encounter any surprises when your requested dataset doesn't get generated as expected. :func:`datasets.load_dataset` verifies:

* The list of downloaded files.
* The number of bytes of the downloaded files.
* The SHA256 checksums of the downloaded files.
* The number of splits in the generated ``DatasetDict``.
* The number of samples in each split of the generated ``DatasetDict``.

If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files. In this case, an error is raised to alert you that the dataset has changed. To ignore the error, specify ``ignore_verifications=True`` in :func:`datasets.load_dataset`. Anytime you see a verification error, feel free to `open an issue on GitHub <https://github.com/huggingface/datasets/issues>`_ so that we can update the integrity checks for this dataset.
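The size and checksum checks boil down to comparing a downloaded file against values recorded when the dataset was first prepared. A hedged, stdlib-only sketch of that kind of check (not the actual 🤗 Datasets verification code):

```python
import hashlib

def verify_file(data: bytes, expected_num_bytes: int, expected_sha256: str) -> bool:
    # Sketch of an integrity check: the file must match both the recorded
    # size and the recorded SHA256 checksum to pass verification.
    if len(data) != expected_num_bytes:
        return False
    return hashlib.sha256(data).hexdigest() == expected_sha256

# Values like these would be recorded at dataset-preparation time:
payload = b"some downloaded dataset shard"
record = {"num_bytes": len(payload), "checksum": hashlib.sha256(payload).hexdigest()}
```

Any change by the original host to the hosted files breaks one of these comparisons, which is exactly the situation the verification error described above reports.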