diff --git a/docs/source/about_arrow.md b/docs/source/about_arrow.md new file mode 100644 index 00000000000..72ca2089288 --- /dev/null +++ b/docs/source/about_arrow.md @@ -0,0 +1,52 @@ +# Datasets 🀝 Arrow + +## What is Arrow? + +[Arrow](https://arrow.apache.org/) enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages: + +* Arrow's standard format allows [zero-copy reads](https://en.wikipedia.org/wiki/Zero-copy) which removes virtually all serialization overhead. +* Arrow is language-agnostic so it supports different programming languages. +* Arrow is column-oriented so it is faster at querying and processing slices or columns of data. +* Arrow allows for copy-free hand-offs to standard machine learning tools such as NumPy, Pandas, PyTorch, and TensorFlow. +* Arrow supports many, possibly nested, column types. + +## Memory-mapping + +πŸ€— Datasets uses Arrow for its local caching system. It allows datasets to be backed by an on-disk cache, which is memory-mapped for fast lookup. +This architecture allows for large datasets to be used on machines with relatively small device memory. + +For example, loading the full English Wikipedia dataset only takes a few MB of RAM: + +```python +>>> import os; import psutil; import timeit +>>> from datasets import load_dataset + +# Process.memory_info is expressed in bytes, so convert to megabytes +>>> mem_before = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024) +>>> wiki = load_dataset("wikipedia", "20200501.en", split='train') +>>> mem_after = psutil.Process(os.getpid()).memory_info().rss >> 20 + +>>> print(f"RAM memory used: {(mem_after - mem_before)} MB") +'RAM memory used: 9 MB' +``` + +This is possible because the Arrow data is actually memory-mapped from disk, and not loaded in memory. +Memory-mapping allows access to data on disk, and leverages virtual memory capabilities for fast lookups. + +## Performance + +Iterating over a memory-mapped dataset using Arrow is fast. Iterating over Wikipedia on a laptop gives you speeds of 1-3 Gbit/s: + +```python +>>> s = """batch_size = 1000 +... for i in range(0, len(wiki), batch_size): +... batch = wiki[i:i + batch_size] +... """ + +>>> time = timeit.timeit(stmt=s, number=1, globals=globals()) +>>> print(f"Time to iterate over the {wiki.dataset_size >> 30} GB dataset: {time:.1f} sec, " +... f"ie. {float(wiki.dataset_size >> 27)/time:.1f} Gb/s") +'Time to iterate over the 17 GB dataset: 85 sec, ie. 1.7 Gb/s' +``` + +You can obtain the best performance by accessing slices of data (or "batches"), in order to reduce the amount of lookups on disk. diff --git a/docs/source/about_cache.rst b/docs/source/about_cache.rst new file mode 100644 index 00000000000..1cbef6c3645 --- /dev/null +++ b/docs/source/about_cache.rst @@ -0,0 +1,54 @@ +The cache +========= + +The cache is one of the reasons why πŸ€— Datasets is so efficient. It stores previously downloaded and processed datasets so when you need to use them again, they are reloaded directly from the cache. This avoids having to download a dataset all over again, or reapplying processing functions. Even after you close and start another Python session, πŸ€— Datasets will reload your dataset directly from the cache! + +Fingerprint +----------- + +How does the cache keeps track of what transforms are applied to a dataset? Well, πŸ€— Datasets assigns a fingerprint to the cache file. 
A fingerprint keeps track of the current state of a dataset. The initial fingerprint is computed using a hash from the Arrow table, or a hash of the Arrow files if the dataset is on disk. Subsequent fingerprints are computed by combining the fingerprint of the previous state, and a hash of the latest transform applied. + +.. tip:: + + Transforms are any of the processing methods from the :doc:`How-to Process <./process>` guides such as :func:`datasets.Dataset.map` or :func:`datasets.Dataset.shuffle`. + +Here is what the actual fingerprints look like: + +.. code-block:: + + >>> from datasets import Dataset + >>> dataset1 = Dataset.from_dict({"a": [0, 1, 2]}) + >>> dataset2 = dataset1.map(lambda x: {"a": x["a"] + 1}) + >>> print(dataset1._fingerprint, dataset2._fingerprint) + d19493523d95e2dc 5b86abacd4b42434 + +In order for a transform to be hashable, it needs to be picklable by `dill `_ or `pickle `_. + +When you use a non-hashable transform, πŸ€— Datasets uses a random fingerprint instead and raises a warning. The non-hashable transform is considered different from the previous transforms. As a result, πŸ€— Datasets will recompute all the transforms. Make sure your transforms are serializable with pickle or dill to avoid this! + +An example of when πŸ€— Datasets recomputes everything is when caching is disabled. When this happens, the cache files are generated every time and they get written to a temporary directory. Once your Python session ends, the cache files in the temporary directory are deleted. A random hash is assigned to these cache files, instead of a fingerprint. + +.. tip:: + + When caching is disabled, use :func:`datasets.Dataset.save_to_disk` to save your transformed dataset, or it will be deleted once the session ends. + +Hashing +------- + +The fingerprint of a dataset is updated by hashing the function passed to ``map`` as well as the ``map`` parameters (``batch_size``, ``remove_columns``, etc.). + +You can check the hash of any Python object using the :class:`datasets.fingerprint.Hasher`: + +.. code-block:: + + >>> from datasets.fingerprint import Hasher + >>> my_func = lambda example: {"length": len(example["text"])} + >>> print(Hasher.hash(my_func)) + '3d35e2b3e94c81d6' + +The hash is computed by dumping the object using a ``dill`` pickler and hashing the dumped bytes. +The pickler recursively dumps all the variables used in your function, so any change you make to an object used in your function will cause the hash to change. + +If one of your functions doesn't seem to have the same hash across sessions, it means at least one of its variables contains a Python object that is not deterministic. +When this happens, feel free to hash any object you find suspicious to identify which one caused the hash to change. +For example, if you use a list whose element order is not deterministic across sessions, then the hash won't be the same across sessions either. diff --git a/docs/source/about_dataset_features.rst b/docs/source/about_dataset_features.rst new file mode 100644 index 00000000000..5635f2a97a3 --- /dev/null +++ b/docs/source/about_dataset_features.rst @@ -0,0 +1,51 @@ +Dataset features +================ + +:class:`datasets.Features` defines the internal structure of a dataset. It is used to specify the underlying serialization format.
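+
+For a quick preview, here is a minimal hand-built example (a sketch with illustrative column names, loosely modeled on MRPC):
+
+.. code-block::
+
+    >>> from datasets import Features, Value, ClassLabel
+    >>> features = Features(
+    ...     {
+    ...         'sentence1': Value('string'),
+    ...         'sentence2': Value('string'),
+    ...         'label': ClassLabel(names=['not_equivalent', 'equivalent']),
+    ...     }
+    ... )
+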
What's more interesting to you, though, is that :class:`datasets.Features` contains high-level information about everything from the column names and types, to the :class:`datasets.ClassLabel`. You can think of :class:`datasets.Features` as the backbone of a dataset. + +The :class:`datasets.Features` format is simple: ``dict[column_name, column_type]``. It is a dictionary of column name and column type pairs. The column type provides a wide range of options for describing the type of data you have. + +Let's have a look at the features of the MRPC dataset from the GLUE benchmark: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('glue', 'mrpc', split='train') + >>> dataset.features + {'idx': Value(dtype='int32', id=None), + 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), + 'sentence1': Value(dtype='string', id=None), + 'sentence2': Value(dtype='string', id=None), + } + +The :class:`datasets.Value` feature tells πŸ€— Datasets: + +* The ``idx`` data type is ``int32``. +* The ``sentence1`` and ``sentence2`` data types are ``string``. + +πŸ€— Datasets supports many other data types such as ``bool``, ``float32`` and ``binary`` to name just a few. + +.. seealso:: + + Refer to :class:`datasets.Value` for a full list of supported data types. + +The :class:`datasets.ClassLabel` feature informs πŸ€— Datasets that the ``label`` column contains two classes. The classes are labeled ``not_equivalent`` and ``equivalent``. Labels are stored as integers in the dataset. When you retrieve the labels, :func:`datasets.ClassLabel.int2str` and :func:`datasets.ClassLabel.str2int` carry out the conversion from integer value to label name, and vice versa. + +If your data type contains a list of objects, then you want to use the :class:`datasets.Sequence` feature. Remember the SQuAD dataset? + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('squad', split='train') + >>> dataset.features + {'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None), + 'context': Value(dtype='string', id=None), + 'id': Value(dtype='string', id=None), + 'question': Value(dtype='string', id=None), + 'title': Value(dtype='string', id=None)} + +The ``answers`` field is constructed using the :class:`datasets.Sequence` feature because it contains two subfields, ``text`` and ``answer_start``, which are lists of ``string`` and ``int32``, respectively. + +.. tip:: + + See the :ref:`flatten` section to learn how you can extract the nested subfields as their own independent columns. \ No newline at end of file diff --git a/docs/source/about_dataset_load.rst b/docs/source/about_dataset_load.rst new file mode 100644 index 00000000000..47a84563e4d --- /dev/null +++ b/docs/source/about_dataset_load.rst @@ -0,0 +1,105 @@ +Build and load +============== + +Nearly every deep learning workflow begins with loading a dataset, which makes it one of the most important steps. With πŸ€— Datasets, there are more than 900 datasets available to help you get started with your NLP task. All you have to do is call :func:`datasets.load_dataset` to take your first step. This function is a true workhorse in every sense because it builds and loads every dataset you use. + +ELI5: ``load_dataset`` +------------------------------- + +Let's begin with a basic Explain Like I'm Five.
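+
+In practice, the whole process boils down to a single call (SQuAD is used here purely as an example):
+
+.. code-block::
+
+    >>> from datasets import load_dataset
+    >>> dataset = load_dataset('squad', split='train')
+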
+ +For community datasets, :func:`datasets.load_dataset` downloads and imports the dataset loading script associated with the requested dataset from the Hugging Face Hub. The Hub is a central repository where all the Hugging Face datasets and models are stored. Code in the loading script defines the dataset information (description, features, URL to the original files, etc.), and tells πŸ€— Datasets how to generate and display examples from it. + +If you are working with a canonical dataset, :func:`datasets.load_dataset` downloads and imports the dataset loading script from GitHub. + +.. seealso:: + + Read the :doc:`Share <./share>` section to learn more about the difference between community and canonical datasets. This section also provides a step-by-step guide on how to write your own dataset loading script! + +The loading script downloads the dataset files from the original URL, generates the dataset and caches it in an Arrow table on your drive. If you've downloaded the dataset before, then πŸ€— Datasets will reload it from the cache to save you the trouble of downloading it again. + +Now that you have a high-level understanding about how datasets are built, let's take a closer look at the nuts and bolts of how all this works. + +Building a dataset +------------------ + +When you load a dataset for the first time, πŸ€— Datasets takes the raw data file and builds it into a table of rows and typed columns. There are two main classes responsible for building a dataset: :class:`datasets.BuilderConfig` and :class:`datasets.DatasetBuilder`. + +.. image:: /imgs/builderconfig.png + :align: center + +:class:`datasets.BuilderConfig` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:class:`datasets.BuilderConfig` is the configuration class of :class:`datasets.DatasetBuilder`. The :class:`datasets.BuilderConfig` contains the following basic attributes about a dataset: + +.. list-table:: + :header-rows: 1 + + * - Attribute + - Description + * - :obj:`name` + - Short name of the dataset. + * - :obj:`version` + - Dataset version identifier. + * - :obj:`data_dir` + - Stores the path to a local folder containing the data files. + * - :obj:`data_files` + - Stores paths to local data files. + * - :obj:`description` + - Description of the dataset. + +If you want to add additional attributes to your dataset such as the class labels, you can subclass the base :class:`datasets.BuilderConfig` class. There are two ways to populate the attributes of a :class:`datasets.BuilderConfig` class or subclass: + +* Provide a list of predefined :class:`datasets.BuilderConfig` class (or subclass) instances in the datasets :attr:`datasets.DatasetBuilder.BUILDER_CONFIGS` attribute. + +* When you call :func:`datasets.load_dataset`, any keyword arguments that are not specific to the method will be used to set the associated attributes of the :class:`datasets.BuilderConfig` class. This will override the predefined attributes if a specific configuration was selected. + +You can also set the :attr:`datasets.DatasetBuilder.BUILDER_CONFIG_CLASS` to any custom subclass of :class:`datasets.BuilderConfig`. + +:class:`datasets.DatasetBuilder` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:class:`datasets.DatasetBuilder` accesses all the attributes inside :class:`datasets.BuilderConfig` to build the actual dataset. + +.. image:: /imgs/datasetbuilder.png + :align: center + +There are three main methods in :class:`datasets.DatasetBuilder`: + +1. :func:`datasets.DatasetBuilder._info` is in charge of defining the dataset attributes. 
When you call ``dataset.info``, πŸ€— Datasets returns the information stored here. Likewise, the :class:`datasets.Features` are also specified here. Remember, the :class:`datasets.Features` are like the skeleton of the dataset. They provide the names and types of each column. + +2. :func:`datasets.DatasetBuilder._split_generator` downloads or retrieves the requested data files, organizes them into splits, and defines specific arguments for the generation process. This method has a :class:`datasets.DownloadManager` that downloads files or fetches them from your local filesystem. Within the :class:`datasets.DownloadManager`, there is a :func:`datasets.DownloadManager.download_and_extract` method that accepts a dictionary of URLs to the original data files, and downloads the requested files. Accepted inputs include a single URL or path, or a list/dictionary of URLs or paths. Any compressed file types like TAR, GZIP and ZIP archives will be automatically extracted. + + Once the files are downloaded, :class:`datasets.SplitGenerator` organizes them into splits. The :class:`datasets.SplitGenerator` contains the name of the split, and any keyword arguments that are provided to the :func:`datasets.DatasetBuilder._generate_examples` method. The keyword arguments can be specific to each split, and typically comprise at least the local path to the data files for each split. + + .. tip:: + + :func:`datasets.DownloadManager.download_and_extract` can download files from a wide range of sources. If the data files are hosted on a special access server, you should use :func:`datasets.DownloadManager.download_custom`. Refer to the reference of :class:`datasets.DownloadManager` for more details. + +3. :func:`datasets.DatasetBuilder._generate_examples` reads and parses the data files for a split. Then it yields dataset examples according to the format specified in the ``features`` from :func:`datasets.DatasetBuilder._info`. The input of :func:`datasets.DatasetBuilder._generate_examples` is actually the ``filepath`` provided in the keyword arguments of the last method. + + The dataset is generated with a Python generator, which doesn't load all the data in memory. As a result, the generator can handle large datasets. However, before the generated samples are flushed to the dataset file on disk, they are stored in an ``ArrowWriter`` buffer. This means the generated samples are written by batch. If your dataset samples consume a lot of memory (images or videos), then make sure to specify a low value for the ``DEFAULT_WRITER_BATCH_SIZE`` attribute in :class:`datasets.DatasetBuilder`. We recommend not exceeding a size of 200 MB. + +Without loading scripts +----------------------- + +As a user, you want to be able to quickly use a dataset. Implementing a dataset loading script can sometimes get in the way, or it may be a barrier for some people without a developer background. πŸ€— Datasets removes this barrier by making it possible to load any dataset from the Hub without a dataset loading script. All a user has to do is upload the data files (see :ref:`upload_dataset_repo` for a list of supported file formats) to a dataset repository on the Hub, and they will be able to load that dataset without having to create a loading script. This doesn't mean we are moving away from loading scripts, because they still offer the most flexibility in controlling how a dataset is generated. + +The loading script-free method uses the `huggingface_hub `_ library to list the files in a dataset repository.
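+
+For example, loading such a repository is still a single call (``my_username/my_dataset`` is a hypothetical repository containing only CSV files):
+
+.. code-block::
+
+    >>> from datasets import load_dataset
+    >>> dataset = load_dataset('my_username/my_dataset')
+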
You can also provide a path to a local directory instead of a repository name, in which case πŸ€— Datasets will use `glob `_ instead. Depending on the format of the data files available, one of the data file builders will create your dataset for you. If you have a CSV file, the CSV builder will be used, and if you have a Parquet file, the Parquet builder will be used. The drawback of this approach is that it's not possible to simultaneously load a CSV and a JSON file. You will need to load the two file types separately, and then concatenate them. + +Maintaining integrity +--------------------- + +To ensure a dataset is complete, :func:`datasets.load_dataset` will perform a series of tests on the downloaded files to make sure everything is there. This way, you don't encounter any surprises when your requested dataset doesn't get generated as expected. :func:`datasets.load_dataset` verifies: + +* The list of downloaded files. +* The number of bytes of the downloaded files. +* The SHA256 checksums of the downloaded files. +* The number of splits in the generated ``DatasetDict``. +* The number of samples in each split of the generated ``DatasetDict``. + +If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files. +In this case, an error is raised to alert you that the dataset has changed. +To ignore the error, you need to specify ``ignore_verifications=True`` in :func:`load_dataset`. +Anytime you see a verification error, feel free to `open an issue on GitHub `_ so that we can update the integrity checks for this dataset. diff --git a/docs/source/about_map_batch.rst b/docs/source/about_map_batch.rst new file mode 100644 index 00000000000..0945d63e77c --- /dev/null +++ b/docs/source/about_map_batch.rst @@ -0,0 +1,43 @@ +Batch mapping +============= + +Combining the utility of :func:`datasets.Dataset.map` with batch mode is very powerful. It allows you to speed up processing, and freely control the size of the generated dataset. + +Need for speed +-------------- + +The primary objective of batch mapping is to speed up processing. Oftentimes, it is faster to work with batches of data instead of single examples. Naturally, batch mapping lends itself to tokenization. For example, the πŸ€— `Tokenizers `_ library works faster with batches because it parallelizes the tokenization of all the examples in a batch. + +Input size != output size +------------------------- + +The ability to control the size of the generated dataset can be leveraged for many interesting use cases. In the How-to :ref:`map` section, there are examples of using batch mapping to: + +* Split long sentences into shorter chunks. +* Augment a dataset with additional tokens. + +It is helpful to understand how this works, so you can come up with your own ways to use batch mapping. At this point, you may be wondering how you can control the size of the generated dataset. The answer is: **the mapped function does not have to return an output batch of the same size**. + +In other words, your mapped function can take a batch of size ``N`` as input and return a batch of size ``M``. The output ``M`` can be greater than or less than ``N``. This means you can concatenate your examples, divide them up, and even add more examples! + +However, remember that all values in the output dictionary must have the **same number of elements** as the other fields in the output dictionary. Otherwise, it is not possible to define the number of examples in the output returned by the mapped function.
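+
+For instance, here is a sketch of a batched function that splits each text into fixed-size chunks and therefore returns more rows than it received (note that the original columns must be removed, or their lengths would no longer match). The ``text`` column name is purely illustrative:
+
+.. code-block::
+
+    >>> from datasets import Dataset
+    >>> dataset = Dataset.from_dict({'text': ['a sentence that is quite long and will be split into several shorter chunks']})
+    >>> def chunk_examples(batch):
+    ...     chunks = []
+    ...     for text in batch['text']:
+    ...         chunks += [text[i:i + 20] for i in range(0, len(text), 20)]
+    ...     return {'chunks': chunks}
+    >>> chunked_dataset = dataset.map(chunk_examples, batched=True, remove_columns=dataset.column_names)
+    >>> len(chunked_dataset)  # more rows than the original single-row dataset
+    4
+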
The number of examples can vary between successive batches processed by the mapped function. For a single batch, though, all values of the output dictionary should have the same length (i.e., the same number of elements). + +For example, from a dataset of 1 column and 3 rows, if you use ``map`` to return a new column with twice as many rows, then you will get an error. +In this case, you end up with one column with 3 rows, and one column with 6 rows. As you can see, the table will not be valid: + +.. code-block:: + + >>> from datasets import Dataset + >>> dataset = Dataset.from_dict({"a": [0, 1, 2]}) + >>> dataset.map(lambda batch: {"b": batch["a"] * 2}, batched=True) # new column with 6 elements: [0, 1, 2, 0, 1, 2] + 'ArrowInvalid: Column 1 named b expected length 3 but got length 6' + +To make it valid, you have to drop one of the columns: + +.. code-block:: + + >>> from datasets import Dataset + >>> dataset = Dataset.from_dict({"a": [0, 1, 2]}) + >>> dataset_with_duplicates = dataset.map(lambda batch: {"b": batch["a"] * 2}, remove_columns=["a"], batched=True) + >>> len(dataset_with_duplicates) + 6 diff --git a/docs/source/about_metrics.rst b/docs/source/about_metrics.rst new file mode 100644 index 00000000000..0e084dd2b0d --- /dev/null +++ b/docs/source/about_metrics.rst @@ -0,0 +1,22 @@ +All about metrics +================= + +πŸ€— Datasets provides access to a wide range of NLP metrics. You can load metrics associated with benchmark datasets like GLUE or SQuAD, and complex metrics like BLEURT or BERTScore, with a single command: :func:`datasets.load_metric`. Once you've loaded a metric, you can easily compute and evaluate a model's performance. + +ELI5: ``load_metric`` +------------------------------------------- +
Loading a dataset and loading a metric share many similarities. This was an intentional design choice because we wanted to create a simple and unified experience. When you call :func:`datasets.load_metric`, the metric loading script is downloaded and imported from GitHub (if it hasn't already been downloaded before). It contains information about the metric such as its citation, homepage, and description. + +The metric loading script will instantiate and return a :class:`datasets.Metric` object. This stores the predictions and references, which you need to compute the metric values. The :class:`datasets.Metric` object is stored as an Apache Arrow table. As a result, the predictions and references are stored directly on disk with memory-mapping. This enables πŸ€— Datasets to do a lazy computation of the metric, and makes it easier to gather all the predictions in a distributed setting. + +Distributed evaluation +---------------------- + +Computing metrics in a distributed environment can be tricky. Metric evaluation is executed in separate Python processes, or nodes, on different subsets of a dataset. Typically, when a metric score is additive (``f(AβˆͺB) = f(A) + f(B)``), you can use distributed reduce operations to gather the scores for each subset of the dataset. But when a metric is non-additive (``f(AβˆͺB) β‰  f(A) + f(B)``), it's not that simple. For example, you can't take the sum of the `F1 `_ scores of each data subset as your **final metric**. + +A common way to overcome this issue is to fall back on single-process evaluation: the metrics are evaluated on a single GPU, which is inefficient. + +πŸ€— Datasets solves this issue by only computing the final metric on the first node. The predictions and references are computed and provided to the metric separately for each node.
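+
+As a rough sketch, each node instantiates the metric with its own ``process_id`` and adds its local predictions and references (``rank``, ``model_predictions`` and ``gold_labels`` are placeholders for values coming from your own training setup):
+
+.. code-block::
+
+    >>> from datasets import load_metric
+    >>> # run on each of the 8 nodes; `rank` is this node's index
+    >>> metric = load_metric('glue', 'mrpc', num_process=8, process_id=rank)
+    >>> metric.add_batch(predictions=model_predictions, references=gold_labels)
+    >>> score = metric.compute()  # only the first node actually computes and returns the final score
+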
These are temporarily stored in an Apache Arrow table, avoiding cluttering the GPU or CPU memory. When you are ready to :func:`datasets.Metric.compute` the final metric, the first node is able to access the predictions and references stored on all the other nodes. Once it has gathered all the predictions and references, :func:`datasets.Metric.compute` will perform the final metric evaluation. + +This solution allows πŸ€— Datasets to perform distributed predictions, which is important for evaluation speed in distributed settings. At the same time, you can also use complex non-additive metrics without wasting valuable GPU or CPU memory. \ No newline at end of file diff --git a/docs/source/access.rst b/docs/source/access.rst new file mode 100644 index 00000000000..bf60c615969 --- /dev/null +++ b/docs/source/access.rst @@ -0,0 +1,130 @@ +The ``Dataset`` object +====================== + +In the previous tutorial, you learned how to successfully load a dataset. This section will familiarize you with the :class:`datasets.Dataset` object. You will learn about the metadata stored inside a Dataset object, and the basics of querying a Dataset object to return rows and columns. + +A :class:`datasets.Dataset` object is returned when you load an instance of a dataset. This object behaves like a normal Python container. + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('glue', 'mrpc', split='train') + +Metadata +-------- + +The :class:`datasets.Dataset` object contains a lot of useful information about your dataset. For example, call :attr:`dataset.info` to return a short description of the dataset, the authors, and even the dataset size. This will give you a quick snapshot of the datasets most important attributes. + +.. code-block:: + + >>> dataset.info + DatasetInfo( + description='GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n', + citation='@inproceedings{dolan2005automatically,\n title={Automatically constructing a corpus of sentential paraphrases},\n author={Dolan, William B and Brockett, Chris},\n booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\n year={2005}\n}\n@inproceedings{wang2019glue,\n title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n note={In the Proceedings of ICLR.},\n year={2019}\n}\n', homepage='https://www.microsoft.com/en-us/download/details.aspx?id=52398', + license='', + features={'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), 'idx': Value(dtype='int32', id=None)}, post_processed=None, supervised_keys=None, builder_name='glue', config_name='mrpc', version=1.0.0, splits={'train': SplitInfo(name='train', num_bytes=943851, num_examples=3668, dataset_name='glue'), 'validation': SplitInfo(name='validation', num_bytes=105887, num_examples=408, dataset_name='glue'), 'test': SplitInfo(name='test', num_bytes=442418, num_examples=1725, dataset_name='glue')}, + download_checksums={'https://dl.fbaipublicfiles.com/glue/data/mrpc_dev_ids.tsv': {'num_bytes': 6222, 'checksum': '971d7767d81b997fd9060ade0ec23c4fc31cbb226a55d1bd4a1bac474eb81dc7'}, 
'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt': {'num_bytes': 1047044, 'checksum': '60a9b09084528f0673eedee2b69cb941920f0b8cd0eeccefc464a98768457f89'}, 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt': {'num_bytes': 441275, 'checksum': 'a04e271090879aaba6423d65b94950c089298587d9c084bf9cd7439bd785f784'}}, + download_size=1494541, + post_processing_size=None, + dataset_size=1492156, + size_in_bytes=2986697 + ) + +You can request specific attributes of the dataset, like ``description``, ``citation``, and ``homepage``, by calling them directly. Take a look at :class:`datasets.DatasetInfo` for a complete list of attributes you can return. + +.. code-block:: + + >>> dataset.split + NamedSplit('train') + >>> dataset.description + 'GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n' + >>> dataset.citation + '@inproceedings{dolan2005automatically,\n title={Automatically constructing a corpus of sentential paraphrases},\n author={Dolan, William B and Brockett, Chris},\n booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\n year={2005}\n}\n@inproceedings{wang2019glue,\n title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n note={In the Proceedings of ICLR.},\n year={2019}\n}\n\nNote that each GLUE dataset has its own citation. Please see the source to see\nthe correct citation for each contained dataset.' + >>> dataset.homepage + 'https://www.microsoft.com/en-us/download/details.aspx?id=52398' + +Features and columns +-------------------- + +A dataset is a table of rows and typed columns. Querying a dataset returns a Python dictionary where the keys correspond to column names, and the values correspond to column values: + +.. code-block:: + + >>> dataset[0] + {'idx': 0, + 'label': 1, + 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', + 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'} + +Return the number of rows and columns with the following standard attributes: + +.. code-block:: + + >>> dataset.shape + (3668, 4) + >>> dataset.num_columns + 4 + >>> dataset.num_rows + 3668 + >>> len(dataset) + 3668 + +List the columns names with :func:`datasets.Dataset.column_names`: + +.. code-block:: + + >>> dataset.column_names + ['idx', 'label', 'sentence1', 'sentence2'] + +Get detailed information about the columns with :attr:`datasets.Dataset.features`: + +.. code-block:: + + >>> dataset.features + {'idx': Value(dtype='int32', id=None), + 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), + 'sentence1': Value(dtype='string', id=None), + 'sentence2': Value(dtype='string', id=None), + } + +Return even more specific information about a feature like :class:`datasets.ClassLabel`, by calling its parameters ``num_classes`` and ``names``: + +.. 
code-block:: + + >>> dataset.features['label'].num_classes + 2 + >>> dataset.features['label'].names + ['not_equivalent', 'equivalent'] + +Rows, slices, batches, and columns +---------------------------------- + +Get several rows of your dataset at a time with slice notation or a list of indices: + +.. code-block:: + + >>> dataset[:3] + {'idx': [0, 1, 2], + 'label': [1, 0, 1], + 'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .'], + 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale ."] + } + >>> dataset[[1, 3, 5]] + {'idx': [1, 3, 5], + 'label': [0, 0, 1], + 'sentence1': ["Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', 'Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .'], + 'sentence2': ["Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", 'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .', "With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier ."] + } + +Querying by the column name will return its values. For example, if you want to only return the first three examples: + +.. code-block:: + + >>> dataset['sentence1'][:3] + ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .'] + +Depending on how a :class:`datasets.Dataset` object is queried, the format returned will be different: + +* A single row like ``dataset[0]`` returns a Python dictionary of values. +* A batch like ``dataset[5:10]`` returns a Python dictionary of lists of values. +* A column like ``dataset['sentence1']`` returns a Python list of values. \ No newline at end of file diff --git a/docs/source/add_dataset.rst b/docs/source/add_dataset.rst deleted file mode 100644 index 950c9890c4c..00000000000 --- a/docs/source/add_dataset.rst +++ /dev/null @@ -1,358 +0,0 @@ -Writing a dataset loading script -============================================= - -There are two main reasons you may want to write your own dataset loading script: - -- you want to use local/private data files and the generic dataloader for CSV/JSON/text files (see :ref:`loading-from-local-files`) are not enough for your use-case, -- you would like to share a new dataset with the community, for instance in the `HuggingFace Hub `__. - -This chapter will explain how datasets are loaded and how you can write from scratch or adapt a dataset loading script. - -.. 
note:: - - You can start from the `template for a dataset loading script `__ when writing a new dataset loading script. You can find this template in the ``templates`` folder on the github repository. - -Here is a quick general overview of the classes and method involved when generating a dataset: - -.. image:: /imgs/datasets_doc.jpg - -On the left is the general organization inside the library to create a :class:`datasets.Dataset` instance and on the right, the elements which are specific to each dataset loading script. To create a new dataset loading script one mostly needs to specify three methods in a :class:`datasets.DatasetBuilder` class: - -- :func:`datasets.DatasetBuilder._info` which is in charge of specifying the dataset metadata as a :obj:`datasets.DatasetInfo` dataclass and in particular the :class:`datasets.Features` which defines the names and types of each column in the dataset, -- :func:`datasets.DatasetBuilder._split_generator` which is in charge of downloading or retrieving the data files, organizing them by splits and defining specific arguments for the generation process if needed, -- :func:`datasets.DatasetBuilder._generate_examples` which is in charge of loading the files for a split and yielding examples with the format specified in the ``features``. - -Optionally, the dataset loading script can define a configuration to be used by the :class:`datasets.DatasetBuilder` by inheriting from :class:`datasets.BuilderConfig`. Such a class allows us to customize the building process, for instance by allowing to select specific subsets of the data or specific ways to process the data when loading the dataset. - -.. note:: - - Note on naming: the dataset class should be camel case, while the dataset name is its snake case equivalent (ex: :obj:`class BookCorpus(datasets.GeneratorBasedBuilder)` for the dataset ``book_corpus``). - - -Adding dataset metadata ----------------------------------- - -The :func:`datasets.DatasetBuilder._info` method is in charge of specifying the dataset metadata as a :obj:`datasets.DatasetInfo` dataclass and in particular the :class:`datasets.Features` which defined the names and types of each column in the dataset. :class:`datasets.DatasetInfo` has a predefined set of attributes and cannot be extended. The full list of attributes can be found in the package reference. - -The most important attributes to specify are: - -- :attr:`datasets.DatasetInfo.features`: a :class:`datasets.Features` instance defining the name and the type of each column in the dataset and the general organization of the examples, -- :attr:`datasets.DatasetInfo.description`: a :obj:`str` describing the dataset, -- :attr:`datasets.DatasetInfo.citation`: a :obj:`str` containing the citation for the dataset in a BibTex format for inclusion in communications citing the dataset, -- :attr:`datasets.DatasetInfo.homepage`: a :obj:`str` containing an URL to an original homepage of the dataset. - -Here is for instance the :func:`datasets.Dataset._info` for the SQuAD dataset for instance, which is taken from the `squad dataset loading script `__: - -.. 
code-block:: - - def _info(self): - return datasets.DatasetInfo( - description=_DESCRIPTION, - features=datasets.Features( - { - "id": datasets.Value("string"), - "title": datasets.Value("string"), - "context": datasets.Value("string"), - "question": datasets.Value("string"), - "answers": datasets.features.Sequence( - { - "text": datasets.Value("string"), - "answer_start": datasets.Value("int32"), - } - ), - } - ), - # No default supervised_keys (as we have to pass both question - # and context as input). - supervised_keys=None, - homepage="https://rajpurkar.github.io/SQuAD-explorer/", - citation=_CITATION, - ) - - -The :class:`datasets.Features` define the structure for each examples and can define arbitrary nested objects with fields of various types. More details on the available ``features`` can be found in the guide on features :doc:`features` and in the package reference on :class:`datasets.Features`. Many examples of features can also be found in the various `dataset scripts provided on the GitHub repository `__ and even directly inspected on the `datasets viewer `__. - -Here are the features of the SQuAD dataset for instance, which is taken from the `squad dataset loading script `__: - -.. code-block:: - - datasets.Features( - { - "id": datasets.Value("string"), - "title": datasets.Value("string"), - "context": datasets.Value("string"), - "question": datasets.Value("string"), - "answers": datasets.Sequence( - { - "text": datasets.Value("string"), - "answer_start": datasets.Value("int32"), - } - ), - } - ) - -These features should be mostly self-explanatory given the above introduction. One specific behavior here is the fact that the ``Sequence`` field in ``"answers"`` is given a dictionary of sub-fields. As mentioned in the above note, in this case, this feature is actually **converted into a dictionary of lists** (instead of the list of dictionary that we read in the feature here). This is confirmed in the structure of the examples yielded by the generation method at the very end of the `squad dataset loading script `__: - -.. code-block:: - - answer_starts = [answer["answer_start"] for answer in qa["answers"]] - answers = [answer["text"].strip() for answer in qa["answers"]] - - yield key, { - "title": title, - "context": context, - "question": qa["question"], - "id": qa["id"], - "answers": {"answer_start": answer_starts, "text": answers,}, - } - -Here the ``"answers"`` is accordingly provided with a dictionary of lists and not a list of dictionary. - -Let's take another example of features from the `large-scale reading comprehension dataset Race `__: - -.. code-block:: - - features=datasets.Features( - { - "article": datasets.Value("string"), - "answer": datasets.Value("string"), - "question": datasets.Value("string"), - "options": datasets.features.Sequence(datasets.Value("string")) - } - ) - -Here is the corresponding first examples in the dataset: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('race', 'high', split='train') - >>> dataset[0] - {'article': 'My husband is a born shopper. He loves to look at things and to touch them. He likes to compare prices between the same items in different shops. He would never think of buying anything without looking around in several - ... - sadder. When he saw me he said, "I\'m sorry, Mum. I have forgotten to buy oranges and the meat. 
I only remembered to buy six eggs, but I\'ve dropped three of them."', - 'answer': 'C', - 'question': 'The husband likes shopping because _ .', - 'options': - ['he has much money.', - 'he likes the shops.', - 'he likes to compare the prices between the same items.', - 'he has nothing to do but shopping.' - ] - } - - -Downloading data files and organizing splits -------------------------------------------------- - -The :func:`datasets.DatasetBuilder._split_generator` method is in charge of downloading (or retrieving locally the data files), organizing them according to the splits and defining specific arguments for the generation process if needed. - -This method **takes as input** a :class:`datasets.DownloadManager` which is a utility which can be used to download files (or to retrieve them from the local filesystem if they are local files or are already in the cache) and return a list of :class:`datasets.SplitGenerator`. A :class:`datasets.SplitGenerator` is a simple dataclass containing the name of the split and keywords arguments to be provided to the :func:`datasets.DatasetBuilder._generate_examples` method that we detail in the next section. These arguments can be specific to each splits and typically comprise at least the local path to the data files to load for each split. - -.. note:: - - **Using local data files** Two attributes of :class:`datasets.BuilderConfig` are specifically provided to store paths to local data files if your dataset is not online but constituted by local data files. These two attributes are :obj:`data_dir` and :obj:`data_files` and can be freely used to provide a directory path or file paths. These two attributes can be set when calling :func:`datasets.load_dataset` using the associated keyword arguments, e.g. ``dataset = datasets.load_dataset('my_script', data_files='my_local_data_file.csv')`` and the values can be used in :func:`datasets.DatasetBuilder._split_generator` by accessing ``self.config.data_dir`` and ``self.config.data_files``. See the `text file loading script `__ for a simple example using :attr:`datasets.BuilderConfig.data_files`. - -Let's have a look at a simple example of a :func:`datasets.DatasetBuilder._split_generator` method. We'll take the example of the `squad dataset loading script `__: - -.. code-block:: - - _URLS = { - "train": _URL + "train-v1.1.json", - "dev": _URL + "dev-v1.1.json", - } - - class Squad(datasets.GeneratorBasedBuilder): - """SQUAD: The Stanford Question Answering Dataset. Version 1.1.""" - - def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]: - downloaded_files = dl_manager.download_and_extract(_URLS) - - return [ - datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": downloaded_files["train"]}), - datasets.SplitGenerator(name=datasets.Split.VALIDATION, gen_kwargs={"filepath": downloaded_files["dev"]}), - ] - -As you can see, this method first prepares a dict of URLs to the original data files for SQuAD. This dict is then provided to the :func:`datasets.DownloadManager.download_and_extract` method which will take care of downloading or retrieving these files from the local file system and returning a object of the same type and organization (here a dictionary) with the path to the local version of the requested files. 
:func:`datasets.DownloadManager.download_and_extract` can take as input a single URL/path or a list or dictionary of URLs/paths and will return an object of the same structure (single URL/path, list or dictionary of URLs/paths) with the path to the local files. This method also takes care of extracting compressed tar, gzip and zip archives. - -:func:`datasets.DownloadManager.download_and_extract` can download files from a large set of origins but if your data files are hosted on a special access server, it's also possible to provide a callable which will take care of the downloading process to the ``DownloadManager`` using :func:`datasets.DownloadManager.download_custom`. - -.. note:: - - In addition to :func:`datasets.DownloadManager.download_and_extract` and :func:`datasets.DownloadManager.download_custom`, the :class:`datasets.DownloadManager` class also provide more fine-grained control on the download and extraction process through several methods including: :func:`datasets.DownloadManager.download`, :func:`datasets.DownloadManager.extract` and :func:`datasets.DownloadManager.iter_archive`. Please refer to the package reference on :class:`datasets.DownloadManager` for details on these methods. - -Once the data files are downloaded, the next mission for the :func:`datasets.DatasetBuilder._split_generator` method is to prepare the :class:`datasets.SplitGenerator` for each split which will be used to call the :func:`datasets.DatasetBuilder._generate_examples` method that we detail in the next session. - -A :class:`datasets.SplitGenerator` is a simple dataclass containing: - -- :obj:`name` (``string``) : the **name** of a split, when possible, standard split names provided in :class:`datasets.Split` can be used: :obj:`datasets.Split.TRAIN`, :obj:`datasets.Split.VALIDATION` and :obj:`datasets.Split.TEST`, -- :obj:`gen_kwargs` (``dict``): **keywords arguments** to be provided to the :func:`datasets.DatasetBuilder._generate_examples` method to generate the samples in this split. These arguments can be specific to each split and typically comprise at least the local path to the data files to load for each split as indicated in the above SQuAD example. - - -Generating the samples in each split -------------------------------------------------- - -The :func:`datasets.DatasetBuilder._generate_examples` is in charge of reading the data files for a split and yielding examples with the format specified in the ``features`` set in :func:`datasets.DatasetBuilder._info`. - -The input arguments of :func:`datasets.DatasetBuilder._generate_examples` are defined by the :obj:`gen_kwargs` dictionary returned by the :func:`datasets.DatasetBuilder._split_generator` method we detailed above. - -Here again, let's take the simple example of the `squad dataset loading script `__: - -.. code-block:: - - def _generate_examples(self, filepath): - """This function returns the examples in the raw (text) form.""" - logger.info("generating examples from = %s", filepath) - key = 0 - with open(filepath) as f: - squad = json.load(f) - for article in squad["data"]: - title = article.get("title", "") - for paragraph in article["paragraphs"]: - context = paragraph["context"] - for qa in paragraph["qas"]: - answer_starts = [answer["answer_start"] for answer in qa["answers"]] - answers = [answer["text"] for answer in qa["answers"]] - # Features currently used are "context", "question", and "answers". - # Others are extracted here for the ease of future expansions. 
- yield key, { - "title": title, - "context": context, - "question": qa["question"], - "id": qa["id"], - "answers": {"answer_start": answer_starts, "text": answers,}, - } - key += 1 - -The input argument is the ``filepath`` provided in the :obj:`gen_kwargs` of each :class:`datasets.SplitGenerator` returned by the :func:`datasets.DatasetBuilder._split_generator` method. - -The method reads and parses the inputs files and yields a tuple constituted of an ``id_`` (can be arbitrary but should be unique (for backward compatibility with TensorFlow datasets) and an example. The example is a dictionary with the same structure and element types as the ``features`` defined in :func:`datasets.DatasetBuilder._info`. - -.. note:: - - Since generating a dataset is based on a python generator, then it doesn't load all the data in memory and therefore it can handle pretty big datasets. However before being flushed to the dataset file on disk, the generated samples are stored in the :obj:`ArrowWriter` buffer so that they are written by batch. If your dataset's samples take a lot of memory (with images or videos), then make sure to speficy a low value for the `DEFAULT_WRITER_BATCH_SIZE` class attribute of the dataset builder class. We recommend to not exceed 200MB. - -Specifying several dataset configurations -------------------------------------------------- - -Sometimes you want to provide access to several sub-sets of your dataset, for instance if your dataset comprises several languages or is constituted of various sub-sets or if you want to provide several ways to structure examples. - -This is possible by defining a specific :class:`datasets.BuilderConfig` class and providing predefined instances of this class for the user to select from. - -The base :class:`datasets.BuilderConfig` class is very simple and only comprises the following attributes: - -- :obj:`name` (``str``) is the name of the dataset configuration, for instance the language name if the various configurations are specific to various languages -- :obj:`version` an optional version identifier -- :obj:`data_dir` (``str``) can be used to store the path to a local folder containing data files -- :obj:`data_files` (``Union[Dict, List]``) can be used to store paths to local data files -- :obj:`description` (``str``) can be used to give a long description of the configuration - -:class:`datasets.BuilderConfig` is only used as a container of informations which can be used in the :class:`datasets.DatasetBuilder` to build the dataset by being accessed in the ``self.config`` attribute of the :class:`datasets.DatasetBuilder` instance. - -You can sub-class the base :class:`datasets.BuilderConfig` class to add additional attributes that you may want to use to control the generation of a dataset. The specific configuration class that will be used by the dataset is set in the :attr:`datasets.DatasetBuilder.BUILDER_CONFIG_CLASS`. - -There are two ways to populate the attributes of a :class:`datasets.BuilderConfig` class or sub-class: - -- a list of predefined :class:`datasets.BuilderConfig` classes or sub-classes can be set in the :attr:`datasets.DatasetBuilder.BUILDER_CONFIGS` attribute of the dataset. 
Each specific configuration can then be selected by giving its ``name`` as ``name`` keyword to :func:`datasets.load_dataset`, -- when calling :func:`datasets.load_dataset`, all the keyword arguments which are not specific to the :func:`datasets.load_dataset` method will be used to set the associated attributes of the :class:`datasets.BuilderConfig` class and override the predefined attributes if a specific configuration was selected. - -Let's take an example adapted from the `CSV files loading script `__. - -Let's say we would like two simple ways to load CSV files: using ``','`` as a delimiter (we will call this configuration ``'comma'``) or using ``';'`` as a delimiter (we will call this configuration ``'semi-colon'``). - -We can define a custom configuration with a ``delimiter`` attribute: - -.. code-block:: - - @dataclass - class CsvConfig(datasets.BuilderConfig): - """BuilderConfig for CSV.""" - delimiter: str = None - -And then define several predefined configurations in the DatasetBuilder: - -.. code-block:: - - class Csv(datasets.ArrowBasedBuilder): - BUILDER_CONFIG_CLASS = CsvConfig - BUILDER_CONFIGS = [CsvConfig(name='comma', - description="Load CSV using ',' as a delimiter", - delimiter=','), - CsvConfig(name='semi-colon', - description="Load CSV using a semi-colon as a delimiter", - delimiter=';')] - - ... - - def _generate_examples(file): - with open(file) as csvfile: - data = csv.reader(csvfile, delimiter = self.config.delimiter) - for i, row in enumerate(data): - yield i, row - -Here we can see how reading the CSV file can be controlled using the ``self.config.delimiter`` attribute. - -The users of our dataset loading script will be able to select one or the other way to load the CSV files with the configuration names or even a totally different way by setting the ``delimiter`` attrbitute directly. For instance using commands like this: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('my_csv_loading_script', name='comma', data_files='my_file.csv') - >>> dataset = load_dataset('my_csv_loading_script', name='semi-colon', data_files='my_file.csv') - >>> dataset = load_dataset('my_csv_loading_script', name='comma', delimiter='\t', data_files='my_file.csv') - -In the last case, the delimiter set by the configuration will be overriden by the delimiter given as argument to ``load_dataset``. - -While the configuration attributes are used in this case to control the reading/parsing of the data files, the configuration attributes can be used at any stage of the processing and in particular: - -- to control the :class:`datasets.DatasetInfo` attributes set in the :func:`datasets.DatasetBuilder._info` method, for instances the ``features``, -- to control the files downloaded in the :func:`datasets.DatasetBuilder._split_generator` method, for instance to select different URLs depending on a ``language`` attribute defined by the configuration - -An example of a custom configuration class with several predefined configurations can be found in the `Super-GLUE loading script `__ which providescontrol over the various sub-dataset of the SuperGLUE benchmark through the configurations. Another example is the `Wikipedia loading script `__ which provides control over the language of the Wikipedia dataset through the configurations. - -Specifying a default dataset configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -When a user loads a dataset with more than one configuration, they must specify a configuration name or else a ValueError is raised. 
With some datasets, it may be preferable to specify a default configuration that will be loaded if a user does not specify one. - -This can be done with the :attr:`datasets.DatasetBuilder.DEFAULT_CONFIG_NAME` attribute. By setting this attribute equal to the name of one of the dataset configurations, that config will be loaded in the case that the user does not specify a config name. - -This feature is opt-in and should only be used where a default configuration makes sense for the dataset. For example, many cross-lingual datasets have a different configuration for each language. In this case, it may make sense to create an aggregate configuration which can serve as the default. This would, in effect, load all languages of the dataset by default unless the user specifies a particular language. See the `Polyglot NER loading script `__ for an example. - - -Testing the dataset loading script -------------------------------------------------- - -Once you're finished with creating or adapting a dataset loading script, you can try it locally by giving the path to the dataset loading script: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('PATH/TO/MY/SCRIPT.py') - -If your dataset has several configurations or requires to be given the path to local data files, you can use the arguments of :func:`datasets.load_dataset` accordingly: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('PATH/TO/MY/SCRIPT.py', 'my_configuration', data_files={'train': 'my_train_file.txt', 'validation': 'my_validation_file.txt'}) - - - -Dataset scripts of reference -------------------------------------------------- - -It is common to see datasets that share the same format. Therefore it is possible that there already exists a dataset script from which you can get some inspiration to help you write your own. - -Here is a list of datasets of reference. Feel free to reuse parts of their code and adapt them to your case: - -- question-answering: `squad `__ (original data are in json) -- natural language inference: `snli `__ (original data are in text files with tab separated columns) -- POS/NER: `conll2003 `__ (original data are in text files with one token per line) -- sentiment analysis: `allocine `__ (original data are in jsonl files) -- text classification: `ag_news `__ (original data are in csv files) -- translation: `flores `__ (original data come from text files - one per language) -- summarization: `billsum `__ (original data are in json files) -- benchmark: `glue `__ (original data are various formats) -- multilingual: `xquad `__ (original data are in json) -- multitask: `matinf `__ (original data need to be downloaded by the user because it requires authentificaition) diff --git a/docs/source/add_metric.rst b/docs/source/add_metric.rst deleted file mode 100644 index 0b4fa7d9aaa..00000000000 --- a/docs/source/add_metric.rst +++ /dev/null @@ -1,183 +0,0 @@ -Writing a metric loading script -============================================= - -If you want to use your own metric, or if you would like to share a new metric with the community, for instance in the `HuggingFace Hub `__, then you can define a new metric loading script. - -This chapter will explain how metrics are loaded and how you can write from scratch or adapt a metric loading script. - -.. note:: - - You can start from the `template for a metric loading script `__ when writing a new metric loading script. 
You can find this template in the ``templates`` folder on the github repository. - - -To create a new metric loading script one mostly needs to specify three methods in a :class:`datasets.Metric` class: - -- :func:`datasets.Metric._info` which is in charge of specifying the metric metadata as a :obj:`datasets.MetricInfo` dataclass and in particular the :class:`datasets.Features` which defines the types of the predictions and the references, -- :func:`datasets.Metric._compute` which is in charge of computing the actual score(s), given some predictions and references. - -.. note:: - - Note on naming: the metric class should be camel case, while the metric name is its snake case equivalent (ex: :obj:`class Rouge(datasets.Metric)` for the metric ``rouge``). - - -Adding metric metadata ----------------------------------- - -The :func:`datasets.Metric._info` method is in charge of specifying the metric metadata as a :obj:`datasets.MetricInfo` dataclass and in particular the :class:`datasets.Features` which defines the types of the predictions and the references. :class:`datasets.MetricInfo` has a predefined set of attributes and cannot be extended. The full list of attributes can be found in the package reference. - -The most important attributes to specify are: - -- :attr:`datasets.MetricInfo.features`: a :class:`datasets.Features` instance defining the name and the type the predictions and references, -- :attr:`datasets.MetricInfo.description`: a :obj:`str` describing the metric, -- :attr:`datasets.MetricInfo.citation`: a :obj:`str` containing the citation for the metric in a BibTex format for inclusion in communications citing the metric, -- :attr:`datasets.MetricInfo.homepage`: a :obj:`str` containing an URL to an original homepage of the metric. -- :attr:`datasets.MetricInfo.format`: an optional :obj:`str` to tell what is the format of the predictions and the references passed to the :func:`datasets.DatasetBuilder._compute` method. It can be set to "numpy", "torch", "tensorflow", "jax" or "pandas". - -Here is for instance the :func:`datasets.Metric._info` for the Sacrebleu metric, which is taken from the `sacrebleu metric loading script `__: - -.. code-block:: - - def _info(self): - return datasets.MetricInfo( - description=_DESCRIPTION, - citation=_CITATION, - homepage="https://github.com/mjpost/sacreBLEU", - inputs_description=_KWARGS_DESCRIPTION, - features=datasets.Features({ - 'predictions': datasets.Value('string'), - 'references': datasets.Sequence(datasets.Value('string')), - }), - codebase_urls=["https://github.com/mjpost/sacreBLEU"], - reference_urls=["https://github.com/mjpost/sacreBLEU", - "https://en.wikipedia.org/wiki/BLEU", - "https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213"] - ) - - -The :class:`datasets.Features` define the type of the predictions and the references and can define arbitrary nested objects with fields of various types. More details on the available ``features`` can be found in the guide on features :doc:`features` and in the package reference on :class:`datasets.Features`. Many examples of features can also be found in the various `metric scripts provided on the GitHub repository `__ and even in `dataset scripts provided on the GitHub repository `__ or directly inspected on the `datasets viewer `__. - -Here are the features of the SQuAD metric for instance, which is taken from the `squad metric loading script `__: - -.. 
code-block:: - - datasets.Features({ - 'predictions': datasets.Value('string'), - 'references': datasets.Sequence(datasets.Value('string')), - }), - -We can see that each prediction is a string, and each reference is a sequence of strings. -Indeed we can use the metric the following way: - -.. code-block:: - - >>> import datasets - - >>> metric = datasets.load_metric('./metrics/sacrebleu') - >>> reference_batch = [['The dog bit the man.', 'The dog had bit the man.'], - ... ['It was not unexpected.', 'No one was surprised.'], - ... ['The man bit him first.', 'The man had bitten the dog.']] - >>> sys_batch = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.'] - >>> score = metric.compute(predictions=sys_batch, references=reference_batch) - >>> print(score) - {'score': 48.530827009929865, 'counts': [14, 7, 5, 3], 'totals': [17, 14, 11, 8], 'precisions': [82.3529411764706, 50.0, 45.45454545454545, 37.5], 'bp': 0.9428731438548749, 'sys_len': 17, 'ref_len': 18} - - -Downloading data files -------------------------------------------------- - -The :func:`datasets.Metric._download_and_prepare` method is in charge of downloading (or retrieving locally the data files) if needed. - -This method **takes as input** a :class:`datasets.DownloadManager` which is a utility which can be used to download files (or to retrieve them from the local filesystem if they are local files or are already in the cache). - -Let's have a look at a simple example of a :func:`datasets.Metric._download_and_prepare` method. We'll take the example of the `bleurt metric loading script `__: - -.. code-block:: - - def _download_and_prepare(self, dl_manager): - - # check that config name specifies a valid BLEURT model - if self.config_name not in CHECKPOINT_URLS.keys(): - raise KeyError(f"{self.config_name} model not found. You should supply the name of a model checkpoint for bleurt in {CHECKPOINT_URLS.keys()}") - - # download the model checkpoint specified by self.config_name and set up the scorer - model_path = dl_manager.download_and_extract(CHECKPOINT_URLS[self.config_name]) - self.scorer = score.BleurtScorer(os.path.join(model_path, self.config_name)) - -As you can see this method downloads a model checkpoint depending of the configuration name of the metric. The checkpoint url is then provided to the :func:`datasets.DownloadManager.download_and_extract` method which will take care of downloading or retrieving the file from the local file system and returning a object of the same type and organization (here a just one path, but it could be a list or a dict of paths) with the path to the local version of the requested files. :func:`datasets.DownloadManager.download_and_extract` can take as input a single URL/path or a list or dictionary of URLs/paths and will return an object of the same structure (single URL/path, list or dictionary of URLs/paths) with the path to the local files. This method also takes care of extracting compressed tar, gzip and zip archives. - -:func:`datasets.DownloadManager.download_and_extract` can download files from a large set of origins but if your data files are hosted on a special access server, it's also possible to provide a callable which will take care of the downloading process to the ``DownloadManager`` using :func:`datasets.DownloadManager.download_custom`. - -.. 
note:: - - In addition to :func:`datasets.DownloadManager.download_and_extract` and :func:`datasets.DownloadManager.download_custom`, the :class:`datasets.DownloadManager` class also provide more fine-grained control on the download and extraction process through several methods including: :func:`datasets.DownloadManager.download`, :func:`datasets.DownloadManager.extract` and :func:`datasets.DownloadManager.iter_archive`. Please refer to the package reference on :class:`datasets.DownloadManager` for details on these methods. - - -Computing the scores -------------------------------------------------- - -The :func:`datasets.DatasetBuilder._compute` is in charge of computing the metric scores given predictions and references that are in the format specified in the ``features`` set in :func:`datasets.DatasetBuilder._info`. - -Here again, let's take the simple example of the `xnli metric loading script `__: - -.. code-block:: - - def simple_accuracy(preds, labels): - return (preds == labels).mean() - - class Xnli(datasets.Metric): - def _info(self): - return datasets.MetricInfo( - description=_DESCRIPTION, - citation=_CITATION, - inputs_description=_KWARGS_DESCRIPTION, - features=datasets.Features({ - 'predictions': datasets.Value('int64' if self.config_name != 'sts-b' else 'float32'), - 'references': datasets.Value('int64' if self.config_name != 'sts-b' else 'float32'), - }), - codebase_urls=[], - reference_urls=[], - format='numpy' - ) - - def _compute(self, predictions, references): - return {"accuracy": simple_accuracy(predictions, references)} - -Here to compute the accuracy it uses the simple_accuracy function, that uses numpy to compute the accuracy using .mean() - -The predictions and references objects passes to _compute are sequences of integers or floats, and the sequences are formated as numpy arrays since the format specified in the :obj:`datasets.MetricInfo` object is set to "numpy". - -Specifying several metric configurations -------------------------------------------------- - -Sometimes you want to provide several ways of computing the scores. - -It is possible to gave different configurations for a metric. The configuration name is stored in the :obj:`datasets.Metric.config_name` attribute. The configuration name can be specified by the user when instantiating a metric: - -.. code-block:: - - >>> from datasets import load_metric - >>> metric = load_metric('bleurt', name='bleurt-base-128') - >>> metric = load_metric('bleurt', name='bleurt-base-512') - -Here depending on the configuration name, a different checkpoint will be downloaded and used to compute the BLEURT score. - -You can access :obj:`datasets.Metric.config_name` from inside :func:`datasets.Metric._info`, :func:`datasets.Metric._download_and_prepare` and :func:`datasets.Metric._compute` - -Testing the metric loading script -------------------------------------------------- - -Once you're finished with creating or adapting a metric loading script, you can try it locally by giving the path to the metric loading script: - -.. code-block:: - - >>> from datasets import load_metric - >>> metric = load_metric('PATH/TO/MY/SCRIPT.py') - -If your metric has several configurations you can use the arguments of :func:`datasets.load_metric` accordingly: - -.. 
code-block:: - - >>> from datasets import load_metric - >>> metric = load_metric('PATH/TO/MY/SCRIPT.py', 'my_configuration') - - diff --git a/docs/source/beam.rst b/docs/source/beam.rst new file mode 100644 index 00000000000..b7ca09252e9 --- /dev/null +++ b/docs/source/beam.rst @@ -0,0 +1,45 @@ +Beam Datasets +============= + +Some datasets are too large to be processed on a single machine. Instead, you can process them with `Apache Beam `_, a library for parallel data processing. The processing pipeline is executed on a distributed processing backend such as `Apache Flink `_, `Apache Spark `_, or `Google Cloud Dataflow `_. + +We have already created Beam pipelines for some of the larger datasets like `wikipedia `_, and `wiki40b `_. You can load these normally with :func:`datasets.Datasets.load_dataset`. But if you want to run your own Beam pipeline with Dataflow, here is how: + +1. Specify the dataset and configuration you want to process: + +.. code-block:: + + DATASET_NAME=your_dataset_name # ex: wikipedia + CONFIG_NAME=your_config_name # ex: 20200501.en + +2. Input your Google Cloud Platform information: + +.. code-block:: + + PROJECT=your_project + BUCKET=your_bucket + REGION=your_region + +3. Specify your Python requirements: + +.. code-block:: + + echo "datasets" > /tmp/beam_requirements.txt + echo "apache_beam" >> /tmp/beam_requirements.txt + +4. Run the pipeline: + +.. code-block:: + + datasets-cli run_beam datasets/$DATASET_NAME \ + --name $CONFIG_NAME \ + --save_infos \ + --cache_dir gs://$BUCKET/cache/datasets \ + --beam_pipeline_options=\ + "runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen,"\ + "staging_location=gs://$BUCKET/binaries,temp_location=gs://$BUCKET/temp,"\ + "region=$REGION,requirements_file=/tmp/beam_requirements.txt" + +.. tip:: + + When you run your pipeline, you can adjust the parameters to change the runner (Flink or Spark), output location (S3 bucket or HDFS), and the number of workers. \ No newline at end of file diff --git a/docs/source/beam_dataset.rst b/docs/source/beam_dataset.rst deleted file mode 100644 index 982ce3a66e3..00000000000 --- a/docs/source/beam_dataset.rst +++ /dev/null @@ -1,69 +0,0 @@ -Beam Datasets -================ - -Intro -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Some datasets are too big to be processed on a single machine, for example: wikipedia, wiki40b, etc. -Instead, we allow to process them using `Apache Beam `__. - -Beam processing pipelines can be executed on many execution engines like Dataflow, Spark, Flink, etc. -More infos about the different runners `here `__. - -Already processed datasets are provided -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -At Hugging Face we have already run the Beam pipelines for datasets like wikipedia and wiki40b to provide already processed datasets. Therefore users can simply run ``load_dataset('wikipedia', '20200501.en')`` and the already processed dataset will be downloaded. - -How to run a Beam dataset processing pipeline -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -If you want to run the Beam pipeline of a dataset anyway, here are the different steps to run on Dataflow: - -- First define which dataset and config you want to process: - -.. code:: - - DATASET_NAME=your_dataset_name # ex: wikipedia - CONFIG_NAME=your_config_name # ex: 20200501.en - -- Then, define your GCP infos: - -.. code:: - - PROJECT=your_project - BUCKET=your_bucket - REGION=your_region - -- Specify the python requirements: - -.. 
code:: - - echo "datasets" > /tmp/beam_requirements.txt - echo "apache_beam" >> /tmp/beam_requirements.txt - -- Finally run your pipeline: - -.. code:: - - datasets-cli run_beam datasets/$DATASET_NAME \ - --name $CONFIG_NAME \ - --save_infos \ - --cache_dir gs://$BUCKET/cache/datasets \ - --beam_pipeline_options=\ - "runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen,"\ - "staging_location=gs://$BUCKET/binaries,temp_location=gs://$BUCKET/temp,"\ - "region=$REGION,requirements_file=/tmp/beam_requirements.txt" - - -.. note:: - - You can also use the flags `num_workers` or `machine_type` to fit your needs. - -Note that it also works if you change the runner to Spark, Flink, etc. instead of Dataflow, or if you change the output location to S3 or HDFS instead of GCS. - -How to create your own Beam dataset -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -It is highly recommended to become familiar with Beam pipelines first. -Then, you can start looking at the `wikipedia.py `_ script for an example.
diff --git a/docs/source/cache.rst b/docs/source/cache.rst new file mode 100644 index 00000000000..761258cf817 --- /dev/null +++ b/docs/source/cache.rst @@ -0,0 +1,103 @@ +Cache management +================ + +When you download a dataset, the processing scripts and data are stored locally on your computer. The cache allows πŸ€— Datasets to avoid re-downloading or processing the entire dataset every time you use it. + +This guide will show you how to: + +* Change the cache directory. +* Control how a dataset is loaded from the cache. +* Clean up cache files in the directory. +* Enable or disable caching. + +Cache directory +--------------- + +The default cache directory is ``~/.cache/huggingface/datasets``. Change the cache location by setting the shell environment variable ``HF_DATASETS_CACHE`` to another directory: + +.. code:: + + $ export HF_DATASETS_CACHE="/path/to/another/directory" + +When you load a dataset, you also have the option to change where the data is cached. Set the ``cache_dir`` parameter to the path you want: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR") + +Similarly, you can change where a metric is cached with the ``cache_dir`` parameter: + +.. code-block:: + + >>> from datasets import load_metric + >>> metric = load_metric('glue', 'mrpc', cache_dir="MY/CACHE/DIRECTORY") + +Download mode +------------- + +After you download a dataset, control how it is loaded by :func:`datasets.load_dataset` with the :obj:`download_mode` parameter. By default, πŸ€— Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('squad', download_mode='force_redownload') + +Refer to :class:`datasets.GenerateMode` for a full list of download modes. + +Cache files + ----------- + +Clean up the cache files in the directory with :func:`datasets.Dataset.cleanup_cache_files`: + +.. code-block:: + + # Returns the number of removed cache files + >>> dataset.cleanup_cache_files() + 2 + +Enable or disable caching + ------------------------- + +If you're using a cached file locally, it will automatically reload the dataset with any previous transforms you applied to the dataset. Disable this behavior by setting the argument ``load_from_cache_file=False`` in :func:`datasets.Dataset.map`: + +.. code:: + + >>> updated_dataset = small_dataset.map(add_prefix, load_from_cache_file=False) + +In the example above, πŸ€— Datasets will execute the function ``add_prefix`` over the entire dataset again instead of loading the dataset from its previous state. + +Disable caching on a global scale with :func:`datasets.set_caching_enabled`: + +.. code-block:: + + >>> from datasets import set_caching_enabled + >>> set_caching_enabled(False) + +When you disable caching, πŸ€— Datasets will no longer reload cached files when applying transforms to datasets. Any transform you apply to your dataset will need to be reapplied. + +.. tip:: + + If you want to rebuild a dataset from scratch, try setting the ``download_mode`` parameter in :func:`datasets.load_dataset` instead. + +You can also avoid caching your metric entirely, and keep it in CPU memory instead: + +.. code-block:: + + >>> from datasets import load_metric + >>> metric = load_metric('glue', 'mrpc', keep_in_memory=True) + +.. caution:: + + Keeping the predictions in-memory is not possible in a distributed setting since the CPU memory spaces of the various processes are not shared. + +.. _load_dataset_enhancing_performance: + +Improve performance +------------------- + +Disabling the cache and copying the dataset in-memory will speed up dataset operations. There are two options for copying the dataset in-memory: + +1. Set ``datasets.config.IN_MEMORY_MAX_SIZE`` to a nonzero value (in bytes) that fits in your RAM. + +2. Set the environment variable ``HF_DATASETS_IN_MEMORY_MAX_SIZE`` to a nonzero value. Note that the first method takes precedence over the second. \ No newline at end of file
diff --git a/docs/source/conf.py b/docs/source/conf.py index d57d39ba8d3..d7562a880d2 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -37,11 +37,13 @@ "sphinx.ext.autodoc", "sphinx.ext.coverage", "sphinx.ext.napoleon", - "recommonmark", "sphinx.ext.viewcode", "sphinx_markdown_tables", "sphinx_copybutton", "sphinxext.opengraph", + "sphinx_panels", + "myst_parser", + "sphinx_inline_tabs" ] @@ -205,4 +207,4 @@ def setup(app): # -- Extension configuration ------------------------------------------------- # Napoleon settings napoleon_use_ivar = True -napoleon_use_rtype = False +napoleon_use_rtype = False \ No newline at end of file
diff --git a/docs/source/dataset_card.rst b/docs/source/dataset_card.rst new file mode 100644 index 00000000000..8a38404c4c6 --- /dev/null +++ b/docs/source/dataset_card.rst @@ -0,0 +1,27 @@ +Create a dataset card +===================== + +Each dataset should be accompanied by a Dataset card to promote responsible usage and alert the user to any potential biases within the dataset. +This idea is inspired by the Model Cards proposed by `Mitchell, 2018 `_. +Dataset cards help users understand the contents of the dataset, the context for how the dataset should be used, how it was created, and considerations for using the dataset. +This guide shows you how to create your own Dataset card. + +1. Create a new Dataset card by opening the `online card creator `_, or manually copying the template from `here `_. + +2. Next, you need to generate structured tags. The tags help users discover your dataset on the Hub. Create the tags with the `online tagging app `_, or clone and install the `Datasets tagging app `_ locally. + +3. Select the appropriate tags for your dataset from the dropdown menus, and save the file once you are done. + +4.
Expand the **Show YAML output aggregating the tags** section on the right, copy the YAML tags, and paste it under the matching section on the online form. Paste the tags into your ``README.md`` file if you manually created your Dataset card. + +5. Expand the **Show Markdown Data Fields** section, paste it into the **Data Fields** section under **Data Structure** on the online form (or your local ``README.md``). Modify the descriptions as needed, and briefly describe each of the fields. + +6. Fill out the Dataset card to the best of your ability. Refer to the `Dataset Card Creation Guide `_ for more detailed information about each section of the card. For fields you are unable to complete, you can write **[More Information Needed]**. + +7. Once you are done filling out the card with the online form, click the **Export** button to download the Dataset card. Place it in the same folder as your dataset. + +Feel free to take a look at these examples of good Dataset cards for inspiration: + +- `SNLI `_ +- `CNN / DailyMail `_ +- `AllocinΓ© `_ diff --git a/docs/source/dataset_script.rst b/docs/source/dataset_script.rst new file mode 100644 index 00000000000..c8685d8f2a6 --- /dev/null +++ b/docs/source/dataset_script.rst @@ -0,0 +1,360 @@ +Create a dataset loading script +=============================== + +Write a dataset script to load and share your own datasets. It is a Python file that defines the different configurations and splits of your dataset, as well as how to download and process the data. + +Any dataset script, for example ``my_dataset.py``, can be placed in a folder or a repository named ``my_dataset`` and be loaded with: + +.. code-block:: + + >>> from datasets import load_dataset + >>> load_dataset("path/to/my_dataset") + +The following guide includes instructions for dataset scripts for how to: + +* Add dataset metadata. +* Download data files. +* Generate samples. +* Test if your dataset was generated correctly. +* Create a Dataset card. +* Upload a dataset to the Hugging Face Hub or GitHub. + +Open the `SQuAD dataset loading script `_ template to follow along on how to share a dataset. + +.. tip:: + + To help you get started, try beginning with the dataset loading script `template `_! + + +Add dataset attributes +---------------------- + +The first step is to add some information, or attributes, about your dataset in :func:`datasets.DatasetBuilder._info`. The most important attributes you should specify are: + +1. :obj:`datasets.DatasetInfo.description` provides a concise description of your dataset. The description informs the user what's in the dataset, how it was collected, and how it can be used for a NLP task. + +2. :obj:`datasets.DatasetInfo.features` defines the name and type of each column in your dataset. This will also provide the structure for each example, so it is possible to create nested subfields in a column if you want. Take a look at :class:`datasets.Features` for a full list of feature types you can use. + +.. code-block:: + + datasets.Features( + { + "id": datasets.Value("string"), + "title": datasets.Value("string"), + "context": datasets.Value("string"), + "question": datasets.Value("string"), + "answers": datasets.Sequence( + { + "text": datasets.Value("string"), + "answer_start": datasets.Value("int32"), + } + ), + } + ) + +3. :obj:`datasets.DatasetInfo.homepage` contains the URL to the dataset homepage so users can find more details about the dataset. + +4. :obj:`datasets.DatasetInfo.citation` contains a BibTeX citation for the dataset. 
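The ``_info`` example below refers to module-level constants such as ``_DESCRIPTION`` and ``_CITATION`` that are defined near the top of the loading script. As a rough sketch (the description text and BibTeX entry here are placeholders, not the real SQuAD strings), they could look like:

.. code-block::

    # Placeholder strings: replace these with a real description and BibTeX entry for your dataset.
    _DESCRIPTION = """\
    My Dataset is a question answering dataset consisting of questions and answers
    written about a collection of articles.
    """

    _CITATION = """\
    @inproceedings{mydataset2021,
      title={My Dataset: A Placeholder Citation},
      author={Last, First and Other, Author},
      year={2021}
    }
    """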
+ +After you've filled out all these fields in the template, it should look like the following example from the SQuAD loading script: + +.. code-block:: + + def _info(self): + return datasets.DatasetInfo( + description=_DESCRIPTION, + features=datasets.Features( + { + "id": datasets.Value("string"), + "title": datasets.Value("string"), + "context": datasets.Value("string"), + "question": datasets.Value("string"), + "answers": datasets.features.Sequence( + {"text": datasets.Value("string"), "answer_start": datasets.Value("int32"),} + ), + } + ), + # No default supervised_keys (as we have to pass both question + # and context as input). + supervised_keys=None, + homepage="https://rajpurkar.github.io/SQuAD-explorer/", + citation=_CITATION, + ) + +Multiple configurations +^^^^^^^^^^^^^^^^^^^^^^^ + +In some cases, your dataset may have multiple configurations. For example, the `SuperGLUE `_ dataset is a collection of 5 datasets designed to evaluate language understanding tasks. πŸ€— Datasets provides :class:`datasets.BuilderConfig` which allows you to create different configurations for the user to select from. + +Let's study the `SuperGLUE loading script `_ to see how you can define several configurations. + +1. Create a :class:`datasets.BuilderConfig` subclass with attributes about your dataset. These attributes can be the features of your dataset, label classes, and a URL to the data files. + +.. code-block:: + + class SuperGlueConfig(datasets.BuilderConfig): + """BuilderConfig for SuperGLUE.""" + + def __init__(self, features, data_url, citation, url, label_classes=("False", "True"), **kwargs): + """BuilderConfig for SuperGLUE. + + Args: + features: `list[string]`, list of the features that will appear in the + feature dict. Should not include "label". + data_url: `string`, url to download the zip file from. + citation: `string`, citation for the data set. + url: `string`, url for information about the data set. + label_classes: `list[string]`, the list of classes for the label if the + label is present as a string. Non-string labels will be cast to either + 'False' or 'True'. + **kwargs: keyword arguments forwarded to super. + """ + # Version history: + # 1.0.2: Fixed non-nondeterminism in ReCoRD. + # 1.0.1: Change from the pre-release trial version of SuperGLUE (v1.9) to + # the full release (v2.0). + # 1.0.0: S3 (new shuffling, sharding and slicing mechanism). + # 0.0.2: Initial version. + super(SuperGlueConfig, self).__init__(version=datasets.Version("1.0.2"), **kwargs) + self.features = features + self.label_classes = label_classes + self.data_url = data_url + self.citation = citation + self.url = url + +2. Create instances of your config to specify the values of the attributes of each configuration. This gives you the flexibility to specify all the name and description of each configuration. These sub-class instances should be listed under :obj:`datasets.DatasetBuilder.BUILDER_CONFIGS`: + +.. code-block:: + + class SuperGlue(datasets.GeneratorBasedBuilder): + """The SuperGLUE benchmark.""" + + BUILDER_CONFIGS = [ + SuperGlueConfig( + name="boolq", + description=_BOOLQ_DESCRIPTION, + features=["question", "passage"], + data_url="https://dl.fbaipublicfiles.com/glue/superglue/data/v2/BoolQ.zip", + citation=_BOOLQ_CITATION, + url="https://github.com/google-research-datasets/boolean-questions", + ), + ... + ... 
+ SuperGlueConfig( + name="axg", + description=_AXG_DESCRIPTION, + features=["premise", "hypothesis"], + label_classes=["entailment", "not_entailment"], + data_url="https://dl.fbaipublicfiles.com/glue/superglue/data/v2/AX-g.zip", + citation=_AXG_CITATION, + url="https://github.com/rudinger/winogender-schemas", + ), + + +3. Now, users can load a specific configuration of the dataset with the configuration ``name``: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('super_glue', 'boolq') + +Default configurations +^^^^^^^^^^^^^^^^^^^^^^ + +Users must specify a configuration name when they load a dataset with multiple configurations. Otherwise, πŸ€— Datasets will raise a ``ValueError``, and prompt the user to select a configuration name. You can avoid this by setting a default dataset configuration with the :attr:`datasets.DatasetBuilder.DEFAULT_CONFIG_NAME` attribute: + +.. code-block:: + + class NewDataset(datasets.GeneratorBasedBuilder): + + VERSION = datasets.Version("1.1.0") + + BUILDER_CONFIGS = [ + datasets.BuilderConfig(name="first_domain", version=VERSION, description="This part of my dataset covers a first domain"), + datasets.BuilderConfig(name="second_domain", version=VERSION, description="This part of my dataset covers a second domain"), + ] + + DEFAULT_CONFIG_NAME = "first_domain" + +.. important:: + + Only use a default configuration when it makes sense. Don't set one because it may be more convenient for the user to not specify a configuration when they load your dataset. For example, multi-lingual datasets often have a separate configuration for each language. An appropriate default may be an aggregated configuration that loads all the languages of the dataset if the user doesn't request a particular one. + +Download data files and organize splits +--------------------------------------- + +After you've defined the attributes of your dataset, the next step is to download the data files and organize them according to their splits. + +1. Create a dictionary of URLs in the loading script that point to the original SQuAD data files: + +.. code-block:: + + _URL = "https://rajpurkar.github.io/SQuAD-explorer/dataset/" + _URLS = { + "train": _URL + "train-v1.1.json", + "dev": _URL + "dev-v1.1.json", + } + +.. tip:: + If the data files live in the same folder or repository of the dataset script, you can just pass the relative paths to the files instead of URLs. + +2. :func:`datasets.DownloadManager.download_and_extract` takes this dictionary and downloads the data files. Once the files are downloaded, use :class:`datasets.SplitGenerator` to organize each split in the dataset. This is a simple class that contains: + + * The :obj:`name` of each split. You should use the standard split names: :obj:`datasets.Split.TRAIN`, :obj:`datasets.Split.TEST`, and :obj:`datasets.Split.VALIDATION`. + + * :obj:`gen_kwargs` provides the file paths to the data files to load for each split. + +Your :obj:`datasets.DatasetBuilder._split_generator()` should look like this now: + +.. 
code-block:: + + def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]: + urls_to_download = self._URLS + downloaded_files = dl_manager.download_and_extract(urls_to_download) + + return [ + datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": downloaded_files["train"]}), + datasets.SplitGenerator(name=datasets.Split.VALIDATION, gen_kwargs={"filepath": downloaded_files["dev"]}), + ] + +Generate samples +---------------- + +At this point, you have: + +* Added the dataset attributes. +* Provided instructions for how to download the data files. +* Organized the splits. + +The next step is to actually generate the samples in each split. + +1. :obj:`datasets.DatasetBuilder._generate_examples` takes the file path provided by :obj:`gen_kwargs` to read and parse the data files. You need to write a function that loads the data files and extracts the columns. + +2. Your function should yield a tuple of an ``id_``, and an example from the dataset. + +.. code-block:: + + def _generate_examples(self, filepath): + """This function returns the examples in the raw (text) form.""" + logger.info("generating examples from = %s", filepath) + with open(filepath) as f: + squad = json.load(f) + for article in squad["data"]: + title = article.get("title", "").strip() + for paragraph in article["paragraphs"]: + context = paragraph["context"].strip() + for qa in paragraph["qas"]: + question = qa["question"].strip() + id_ = qa["id"] + + answer_starts = [answer["answer_start"] for answer in qa["answers"]] + answers = [answer["text"].strip() for answer in qa["answers"]] + + # Features currently used are "context", "question", and "answers". + # Others are extracted here for the ease of future expansions. + yield id_, { + "title": title, + "context": context, + "question": question, + "id": id_, + "answers": {"answer_start": answer_starts, "text": answers,}, + } + +Testing data and checksum metadata +---------------------------------- + +We strongly recommend adding testing data and checksum metadata to your dataset to verify and test its behavior. This ensures the generated dataset matches your expectations. +Testing data and checksum metadata are mandatory for Canonical datasets stored in the GitHub repository of the πŸ€— Datasets library. + +.. important:: + + Make sure you run all of the following commands **from the root** of your local ``datasets`` repository. + +Dataset metadata +^^^^^^^^^^^^^^^^ + +1. Run the following command to create the metadata file, ``dataset_infos.json``. This will also test your new dataset loading script and make sure it works correctly. + +.. code:: + + datasets-cli test datasets/ --save_infos --all_configs + +2. If your dataset loading script passed the test, you should now have a ``dataset_infos.json`` file in your dataset folder. This file contains information about the dataset, like its ``features`` and ``download_size``. + +Dummy data +^^^^^^^^^^ + +Next, you need to create some dummy data for automated testing. There are two methods for generating dummy data: automatically and manually. + +Automatic +""""""""" + +If your data file is one of the following formats, then you can automatically generate the dummy data: + +* txt +* csv +* tsv +* jsonl +* json +* xml + +Run the command below to generate the dummy data: + +.. code:: + + datasets-cli dummy_data datasets/ --auto_generate + +Manual +"""""" + +If your data files are not among the supported formats, you will need to generate your dummy data manually. 
Run the command below to output detailed instructions on how to create the dummy data: + +.. code-block:: + + datasets-cli dummy_data datasets/ + + ==============================DUMMY DATA INSTRUCTIONS============================== + - In order to create the dummy data for my-dataset, please go into the folder './datasets/my-dataset/dummy/1.1.0' with `cd ./datasets/my-dataset/dummy/1.1.0` . + + - Please create the following dummy data files 'dummy_data/TREC_10.label, dummy_data/train_5500.label' from the folder './datasets/my-dataset/dummy/1.1.0' + + - For each of the splits 'train, test', make sure that one or more of the dummy data files provide at least one example + + - If the method `_generate_examples(...)` includes multiple `open()` statements, you might have to create other files in addition to 'dummy_data/TREC_10.label, dummy_data/train_5500.label'. In this case please refer to the `_generate_examples(...)` method + + - After all dummy data files are created, they should be zipped recursively to 'dummy_data.zip' with the command `zip -r dummy_data.zip dummy_data/` + + - You can now delete the folder 'dummy_data' with the command `rm -r dummy_data` + + - To get the folder 'dummy_data' back for further changes to the dummy data, simply unzip dummy_data.zip with the command `unzip dummy_data.zip` + + - Make sure you have created the file 'dummy_data.zip' in './datasets/my-dataset/dummy/1.1.0' + =================================================================================== + +.. tip:: + + Manually creating dummy data can be tricky. Make sure you follow the instructions from the command ``datasets-cli dummy_data datasets/``. If you are still unable to succesfully generate dummy data, open a `Pull Request `_ and we will be happy to help you out! + +There should be two new files in your dataset folder: + +* ``dataset_infos.json`` stores the dataset metadata including the data file checksums, and the number of examples required to confirm the dataset was generated properly. + +* ``dummy_data.zip`` is a file used to test the behavior of the loading script without having to download the full dataset. + + +Test +^^^^ + +The last step is to actually test dataset generation with the real and dummy data. Run the following command to test the real data: + +.. code:: + + RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_real_dataset_ + +Test the dummy data: + +.. code:: + + RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_dataset_all_configs_ + +If both tests pass, your dataset was generated correctly! diff --git a/docs/source/dataset_streaming.rst b/docs/source/dataset_streaming.rst deleted file mode 100644 index 527d4d7b4e6..00000000000 --- a/docs/source/dataset_streaming.rst +++ /dev/null @@ -1,271 +0,0 @@ -Load a Dataset in Streaming mode -============================================================== - -When a dataset is in streaming mode, you can iterate over it directly without having to download the entire dataset. -The data are downloaded progressively as you iterate over the dataset. -You can enable dataset streaming by passing ``streaming=True`` in the :func:`load_dataset` function to get an iterable dataset. - -This is useful if you don't have enough space on your disk to download the dataset, or if you don't want to wait for your dataset to be downloaded before using it. - -Here is a demonstration: - -.. 
code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True) - >>> print(next(iter(dataset))) - {'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visit to Malawi. Chief Napoleon conveyed the desperate need for a program to intervene and care for the orphans and vulnerable children (OVC) in Malawi, and John committed to help... - -Even though the dataset is 1.2 terabytes of data, you can start using it right away. Under the hood, it downloaded only the first examples of the dataset for buffering, and returned the first example. - -.. note:: - - The dataset that is returned is a :class:`datasets.IterableDataset`, not the classic map-style :class:`datasets.Dataset`. To get examples from an iterable dataset, you have to iterate over it using a for loop for example. To get the very last example of the dataset, you first have to iterate on all the previous examples. - Therefore iterable datasets are mostly useful for iterative jobs like training a model, but not for jobs that require random access of examples. - -.. _iterable-dataset-shuffling: - -Shuffling the dataset: ``shuffle`` --------------------------------------------------- - -Shuffle the dataset -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To shuffle your dataset, the :func:`datasets.IterableDataset.shuffle` method fills a buffer of size ``buffer_size`` and randomly samples examples from this buffer. -The selected examples in the buffer are replaced by new examples. - -For instance, if your dataset contains 1,000,000 examples but ``buffer_size`` is set to 1,000, then shuffle will initially select a random examples from only the first 1,000 examples in the buffer. -Once an example is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) example, maintaining the 1,000 example buffer. - -.. note:: - For perfect shuffling, you need to set ``buffer_size`` to be greater than the size of your dataset. But in this case it will download the full dataset in the buffer. - -Moreover, for larger datasets that are sharded into multiple files, :func:`datasets.IterableDataset.shuffle` also shuffles the order of the shards. - -.. code-block:: - - >>> shuffled_dataset = dataset.shuffle(buffer_size=10_000, seed=42) - >>> print(next(iter(shuffled_dataset))) - {text': 'In this role, she oversees the day-to-day operations of the agency’s motoring services divisions (Vehicle Titles & Registration, Motor Vehicles, Motor Carrier, Enforcement, Consumer Relations and the Automobile Burglary & Theft Prevention Authority) to ensure they are constantly improving and identifying opportunities to become more efficient and effective in service delivery... - >>> print(dataset.n_shards) - 670 - -In this example, the shuffle buffer contains 10,000 examples that were downloaded from one random shard of the dataset (here it actually comes from the 480-th shard out of 670). -The example was selected randomly from this buffer, and replaced by the 10,001-st example of the dataset shard. - -Note that if the order of the shards has been fixed by using :func:`datasets.IterableDataset.skip` or :func:`datasets.IterableDataset.take` then the order of the shards is kept unchanged. - - -Reshuffle the dataset at each epoch -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The seed used to shuffle the dataset is the one you specify in :func:`datasets.IterableDataset.shuffle`. 
But often we want to use another seed after each epoch to reshuffle the dataset. -Therefore between epochs you can simply tell the dataset at what epoch you're at, and the data will be shuffled using an effective seed of ``seed + epoch``. - -For example your training loop can look like this: - -.. code-block:: - - >>> for epoch in range(epochs): - ... shuffled_dataset.set_epoch(epoch) - ... for example in shuffled_dataset: - ... ... - -In this case in the first epoch, the dataset is shuffled with ``seed + 0`` and in the second epoch it is shuffled with ``seed + 1``, making your dataset reshuffled at each epoch. It randomizes both the shuffle buffer and the shards order. - - -Processing data with ``map`` --------------------------------------------------- - -As for :class:`datasets.Dataset` objects, you can process your data using ``map``. This is useful if you want to transform the data or rename/remove columns. -Since the examples of an :class:`datasets.IterableDataset` are downloaded progressively, the :func:`datasets.IterableDataset.map` method processes the examples on-the-fly when you are iterating over the dataset (contrary to :func:`datasets.Dataset.map` which processes all the examples directly). - -This example shows how to tokenize your dataset: - -.. code-block:: - - >>> from transformers import AutoTokenizer - >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") - >>> tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"])) - >>> print(next(iter(tokenized_dataset))) - {'input_ids': [101, 11047, 10497, 7869, 2352...], 'token_type_ids': [0, 0, 0, 0, 0...], 'attention_mask': [1, 1, 1, 1, 1...]} - -Tokenizers are written in Rust and use parallelism to speed up tokenization. To leverage parallelism, you can process the examples batch by batch. Note that the output examples are still returned one by one. - - >>> tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"]), batched=True) # default batch_size is 1000 but you can specify another batch_size if needed - >>> print(next(iter(tokenized_dataset))) - {'input_ids': [101, 11047, 10497, 7869, 2352...], 'token_type_ids': [0, 0, 0, 0, 0...], 'attention_mask': [1, 1, 1, 1, 1...]} - - -Split your dataset with ``take`` and ``skip`` --------------------------------------------------- - -You can split your dataset by taking or skipping the first ``n`` examples. - -You can create a new dataset with the first ``n`` examples by using :func:`datasets.IterableDataset.take`, or you can create a dataset with the rest of the examples by skipping the first ``n`` examples with :func:`datasets.IterableDataset.skip`: - - -.. code-block:: - - >>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True) - >>> dataset_head = dataset.take(2) - >>> list(dataset_head) - [{'id': 0, 'text': 'Mtendere Village was...'}, '{id': 1, 'text': 'Lily James cannot fight the music...'}] - >>> # You can also create splits from a shuffled dataset - >>> train_dataset = shuffled_dataset.skip(1000) - >>> eval_dataset = shuffled_dataset.take(1000) - -Some things to keep in mind: - -- When you apply ``skip`` to a dataset, iterating on the new dataset will take some time to start. This is because under the hood it has to iterate over the skipped examples first. -- Using ``take`` (or ``skip``) prevents future calls to ``shuffle`` from shuffling the dataset shards order, otherwise the taken examples could come from other shards. In this case it only uses the shuffle buffer. 
Therefore it is advised to shuffle the dataset before splitting using ``take`` or ``skip``. See more details in the :ref:`iterable-dataset-shuffling` section. - - -Mix several iterable datasets together with ``interleave_datasets`` ----------------------------------------------------------------------------------------------------- - -It is common to use several datasets to use a model. For example BERT was trained on a mix of Wikipedia and BookCorpus. -You can mix several iterable datasets together using :func:`datasets.interleave_datasets`. - -By default, the resulting dataset alternates between the original datasets, but can also define sampling probabilities to sample randomly from the different datasets. - -For example if you want a dataset in several languages: - -.. code-block:: - - >>> from datasets import interleave_datasets - >>> from itertools import islice - >>> en_dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True) - >>> fr_dataset = load_dataset('oscar', "unshuffled_deduplicated_fr", split='train', streaming=True) - >>> - >>> multilingual_dataset = interleave_datasets([en_dataset, fr_dataset]) - >>> print(list(islice(multilingual_dataset, 2))) - [{'text': 'Mtendere Village was inspired by the vision...}, {'text': "MΓ©dia de dΓ©bat d'idΓ©es, de culture et de littΓ©rature....}] - >>> - >>> multilingual_dataset_with_oversampling = interleave_datasets([en_dataset, fr_dataset], probabilities=[0.8, 0.2], seed=42) - >>> print(list(islice(multilingual_dataset_with_oversampling, 2))) - [{'text': 'Mtendere Village was inspired by the vision...}, {'text': 'Lily James cannot fight the music...}] - - -Working with NumPy, pandas, PyTorch and TensorFlow --------------------------------------------------- - -This part is still experimental and breaking changes may happen in the near future. - -It is possible to get a ``torch.utils.data.IterableDataset`` from a :class:`datasets.IterableDataset` by setting the dataset format to "torch", as for a :class:`datasets.Dataset`: - -.. code-block:: - - >>> import torch - >>> tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"], return_tensors="pt")) - >>> torch_tokenized_dataset = tokenized_dataset.with_format("torch") - >>> assert isinstance(torch_tokenized_dataset, torch.utils.data.IterableDataset) - >>> print(next(iter(torch_tokenized_dataset))) - {'input_ids': tensor([[101, 11047, 10497, 7869, 2352...]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0...]]), 'attention_mask': tensor([[1, 1, 1, 1, 1...]])} - -For now, only the PyTorch format is supported but support for TensorFlow and others will be added soon. - - -How does dataset streaming work ? --------------------------------------------------- - -The StreamingDownloadManager -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The standard (i.e. non-streaming) way of loading a dataset has two steps: - -1. download and extract the raw data files of the dataset by using the :class:`datasets.DownloadManager` -2. process the data files to generate the Arrow file used to load the :class:`datasets.Dataset` object. - -For example, in non-streaming mode a file is simply downloaded like this: - -.. 
code-block:: - - >>> from datasets import DownloadManager - >>> url = "https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt" - >>> filepath = DownloadManager().download(url) # the file is downloaded here - >>> print(filepath) - '/Users/user/.cache/huggingface/datasets/downloads/16b702620cad8d485bafea59b1d2ed69e796196e6f2c73f005dee935a413aa19.ab631f60c6cb31a079ecf1ad910005a7c009ef0f1e4905b69d489fb2bd162683' - >>> with open(filepath) as f: - ... print(f.read()) - -When you load a dataset in streaming mode, the download manager that is used instead is the :class:`datasets.StreamingDownloadManager`. -Instead of actually downloading and extracting all the data when you load the dataset, it is done lazily. -The file starts to be downloaded and extracted only when ``open`` is called. -This is made possible by extending ``open`` to support opening remote files via HTTP. -In each dataset script, ``open`` is replaced by our function ``xopen`` that extends ``open`` to be able to stream data from remote files. - -Here is a sample code that shows what is done under the hood: - -.. code-block:: - - >>> from datasets.utils.streaming_download_manager import StreamingDownloadManager, xopen - >>> url = "https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt" - >>> urlpath = StreamingDownloadManager().download(url) - >>> print(urlpath) - 'https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt' - >>> with xopen(urlpath) as f: - ... print(f.read()) # the file is actually downloaded here - -As you can see, since it's possible to open remote files via an URL, the streaming download manager just returns the URL instead of the path to the local downloaded file. - -Then the file is downloaded in a streaming fashion: it is downloaded progessively as you iterate over the data file. -This is made possible because it is based on ``fsspec``, a library that allows to open and iterate on remote files. -You can find more information about ``fsspec`` in `its documentation `_ - -Compressed files and archives -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You may have noticed that the streaming download manager returns the exact same URL that was given as input for a text file. -However if you use ``download_and_extract`` on a compressed file instead, then the output url will be a chained URL. -Chained URLs are used by ``fsspec`` to navigate in remote compressed archives. - -Some examples of chained URL are: - -.. code-block:: - - >>> from datasets.utils.streaming_download_manager import xopen - >>> chained_url = "zip://combined/train.json::https://adversarialqa.github.io/data/aqa_v1.0.zip" - >>> with xopen(chained_url) as f: - ... print(f.read()[:100]) - '{"data": [{"title": "Brain", "paragraphs": [{"context": "Another approach to brain function is to ex' - >>> chained_url2 = "gzip://mkqa.jsonl::https://github.com/apple/ml-mkqa/raw/master/dataset/mkqa.jsonl.gz" - >>> with xopen(chained_url2) as f: - ... print(f.readline()[:100]) - '{"query": "how long did it take the twin towers to be built", "answers": {"en": [{"type": "number_wi' - -We also extended some functions from ``os.path`` to work with chained URLs. -For example ``os.path.join`` is replaced by our function ``xjoin`` that extends ``os.path.join`` to work with chained URLs: - -.. 
code-block:: - - >>> from datasets.utils.streaming_download_manager import StreamingDownloadManager, xopen, xjoin - >>> url = "https://adversarialqa.github.io/data/aqa_v1.0.zip" - >>> archive_path = StreamingDownloadManager().download_and_extract(url) - >>> print(archive_path) - 'zip://::https://adversarialqa.github.io/data/aqa_v1.0.zip' - >>> filepath = xjoin(archive_path, "combined", "train.json") - >>> print(filepath) - 'zip://combined/train.json::https://adversarialqa.github.io/data/aqa_v1.0.zip' - >>> with xopen(filepath) as f: - ... print(f.read()[:100]) - '{"data": [{"title": "Brain", "paragraphs": [{"context": "Another approach to brain function is to ex' - -You can also take a look at the ``fsspec`` documentation about URL chaining `here `_ - -.. note:: - - Streaming data from TAR archives is currently highly inefficient and requires a lot of bandwidth. We are working on optimizing this to offer you the best performance, stay tuned ! - -Dataset script compatibility -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Now that you are aware of how dataset streaming works, you can make sure your dataset script work in streaming mode: - -1. make sure you use ``open`` to open the data files: it is extended to work with remote files -2. if you have to deal with archives like ZIP files, make sure you use ``os.path.join`` and ``os.path.dirname`` to navigate in the archive - -Currently a few python functions or classes are not supported for dataset streaming: - -- ``pathlib.Path`` and all its methods are not supported, please use ``os.path.join`` and string objects -- ``os.walk``, ``os.listdir``, ``glob.glob`` are not supported yet diff --git a/docs/source/exploring.rst b/docs/source/exploring.rst deleted file mode 100644 index 8a4a3855acd..00000000000 --- a/docs/source/exploring.rst +++ /dev/null @@ -1,257 +0,0 @@ -What's in the Dataset object -============================================================== - - -The :class:`datasets.Dataset` object that you get when you execute for instance the following commands: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('glue', 'mrpc', split='train') - -behaves like a normal python container. You can query its length, get rows, columns and also a lot of metadata on the dataset (description, citation, split sizes, etc). - -In this guide we will detail what's in this object and how to access all the information. - -A :class:`datasets.Dataset` is a python container with a length corresponding to the number of examples in the dataset. You can access a single example by its index. Let's query the first sample in the dataset: - -.. code-block:: - - >>> len(dataset) - 3668 - >>> dataset[0] - {'idx': 0, - 'label': 1, - 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', - 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'} - -Features and columns ------------------------------------------------------- - -A :class:`datasets.Dataset` instance is more precisely a table with **rows** and **columns** in which the columns are typed. Querying an example (a single row) will thus return a python dictionary with keys corresponding to column names, and values corresponding to the example's value for each column. - -You can get the number of rows and columns of the dataset with various standard attributes: - -.. 
code-block:: - - >>> dataset.shape - (3668, 4) - >>> dataset.num_columns - 4 - >>> dataset.num_rows - 3668 - >>> len(dataset) - 3668 - -You can list the column names with :attr:`datasets.Dataset.column_names` and get their detailed types (called ``features``) with :attr:`datasets.Dataset.features`: - -.. code-block:: - - >>> dataset.column_names - ['idx', 'label', 'sentence1', 'sentence2'] - >>> dataset.features - {'idx': Value(dtype='int32', id=None), - 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), - 'sentence1': Value(dtype='string', id=None), - 'sentence2': Value(dtype='string', id=None), - } - -Here we can see that the column ``label`` is a :class:`datasets.ClassLabel` feature. - -We can access this feature to get more information on the values in the ``label`` columns. In particular, a :class:`datasets.ClassLabel` feature provides a mapping from integers (as single integer, lists, numpy arrays or even pytorch/tensorflow tensors) to human-readable names and vice-versa: - -.. code-block:: - - >>> dataset.features['label'].num_classes - 2 - >>> dataset.features['label'].names - ['not_equivalent', 'equivalent'] - >>> dataset.features['label'].str2int('equivalent') - 1 - >>> dataset.features['label'].str2int('not_equivalent') - 0 - -More details on the ``features`` can be found in the guide on :doc:`features` and in the package reference on :class:`datasets.Features`. - -Metadata ------------------------------------------------------- - -The :class:`datasets.Dataset` object also hosts many important metadata on the dataset which are all stored in ``dataset.info``. Many of these metadata are also accessible on the lower level, i.e. directly as attributes of the Dataset for shorter access (e.g. ``dataset.info.features`` is also available as ``dataset.features``). - -All these attributes are listed in the package reference on :class:`datasets.DatasetInfo`. The most important metadata are ``split``, ``description``, ``citation``, ``homepage`` (and ``license`` when this one is available). - -.. code-block:: - - >>> dataset.split - NamedSplit('train') - >>> dataset.description - 'GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n' - >>> dataset.citation - '@inproceedings{dolan2005automatically,\n title={Automatically constructing a corpus of sentential paraphrases},\n author={Dolan, William B and Brockett, Chris},\n booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\n year={2005}\n}\n@inproceedings{wang2019glue,\n title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n note={In the Proceedings of ICLR.},\n year={2019}\n}\n\nNote that each GLUE dataset has its own citation. Please see the source to see\nthe correct citation for each contained dataset.' - >>> dataset.homepage - 'https://www.microsoft.com/en-us/download/details.aspx?id=52398' - >>> dataset.license - '' - -Accessing ``dataset.info`` will give you all the metadata in a single object. - -Cache files and memory-usage ------------------------------------------------------- - -Datasets are backed by Apache Arrow cache files. - -You can check the current cache files backing the dataset with the ``cache_file`` property - -.. 
code-block:: - - >>> dataset.cache_files - [{'filename': '/Users/thomwolf/.cache/huggingface/datasets/glue/mrpc/1.0.0/glue-train.arrow'}] - -Using cache files allows: - -- to load arbitrary large datasets by using memory mapping (as long as the datasets can fit on the drive) -- to use a fast backend to process the dataset efficiently -- to do smart caching by storing and reusing the results of operations performed on the drive - -Let's see how big is our dataset and how much RAM loading it requires: - -.. code-block:: - - >>> from datasets import total_allocated_bytes - >>> print("The number of bytes allocated on the drive is", dataset.dataset_size) - The number of bytes allocated on the drive is 1492156 - >>> print("For comparison, here is the number of bytes allocated in memory:", total_allocated_bytes()) - For comparison, here is the number of bytes allocated in memory: 0 - -This is not a typo. The dataset is memory-mapped on the drive and requires no space in RAM for storage. This memory-mapping is done using a zero-deserialization-cost format so the speed of reading/writing is usually really high as well. - -You can clean up the cache files in the current dataset directory (only keeping the currently used one) with :func:`datasets.Dataset.cleanup_cache_files`: - -.. code-block:: - - >>> dataset.cleanup_cache_files() # Returns the number of removed cache files - 2 - -.. note:: - - Be careful to check that no other process might be using other cache files when running this command. - - -Getting rows, slices, batches and columns ------------------------------------------------------- - -While you can access a single row with the ``dataset[i]`` pattern, you can also access several rows using slice notation or with a list of indices (or a numpy/torch/tf array of indices): - -.. code-block:: - - >>> dataset[:3] - {'idx': [0, 1, 2], - 'label': [1, 0, 1], - 'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .'], - 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale ."] - } - >>> dataset[[1, 3, 5]] - {'idx': [1, 3, 5], - 'label': [0, 0, 1], - 'sentence1': ["Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', 'Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .'], - 'sentence2': ["Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", 'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .', "With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier ."] - } - - -You can also get a full column by querying its name as a string. This will return a list of elements: - -.. 
code-block:: - - >>> dataset['sentence1'][:3] - ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .'] - -As you can see depending on the object queried (single row, batch of rows or column), the returned object is different: - -- a single row like ``dataset[0]`` will be returned as a python dictionary of values, -- a batch like ``dataset[5:10]`` will be returned as a python dictionary of lists of values, -- a column like ``dataset['sentence1']`` will be returned as a python lists of values. - -This may seems surprising at first but in our experiments it's actually easier to use these various format for data processing than returning the same format for each of these views on the dataset. - -In particular, you can easily select a specific column in batches, and also naturally permute rows and column indexings with identical results: - -.. code-block:: - - >>> dataset[0]['sentence1'] == dataset['sentence1'][0] - True - >>> dataset[2:5]['sentence1'] == dataset['sentence1'][2:5] - True - - -Working with NumPy, pandas, PyTorch, TensorFlow, JAX and on-the-fly formatting transforms ------------------------------------------------------------------------------------------------------ - -Up to now, the rows/batches/columns returned when querying the elements of the dataset were python objects. - -Sometimes we would like to have more sophisticated objects returned by our dataset, for instance NumPy arrays or PyTorch tensors instead of python lists. - -πŸ€— Datasets provides a way to do that through what is called a ``format``. - -While the internal storage of the dataset is always the Apache Arrow format, by setting a specific format on a dataset, you can filter some columns and cast the output of :func:`datasets.Dataset.__getitem__` in NumPy/pandas/PyTorch/TensorFlow, on-the-fly. - -A specific format can be activated with :func:`datasets.Dataset.set_format`. - -:func:`datasets.Dataset.set_format` accepts those inputs to control the format of the dataset: - -- :obj:`type` (``Union[None, str]``, default to ``None``) defines the return type for the dataset :obj:`__getitem__` method and is one of ``[None, 'numpy', 'pandas', 'torch', 'tensorflow', 'jax']`` (``None`` means return python objects), -- :obj:`columns` (``Union[None, str, List[str]]``, default to ``None``) defines the columns returned by :obj:`__getitem__` and takes the name of a column in the dataset or a list of columns to return (``None`` means return all columns), -- :obj:`output_all_columns` (``bool``, default to ``False``) controls whether the columns which cannot be formatted (e.g. a column with ``string`` cannot be cast in a PyTorch Tensor) are still outputted as python objects. -- :obj:`format_kwargs` can be used to provide additional keywords arguments that will be forwarded to the converting function like ``np.array``, ``torch.tensor``, ``tensorflow.ragged.constant`` or ``jnp.array``. For instance, to create ``torch.Tensor`` directly on the GPU you can specify ``device='cuda'``. - -.. note:: - - The format is only applied to a single row or batches of rows (i.e. when querying :obj:`dataset[0]` or :obj:`dataset[10:20]`). Querying a column (e.g. :obj:`dataset['sentence1']`) will return the column even if it's filtered by the format. In this case the un-formatted column is returned. 
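To make this concrete, a minimal sketch (it assumes the ``glue/mrpc`` dataset loaded earlier, that PyTorch is installed, and the ``torch`` format set as in the example further below):

.. code-block::

    >>> dataset.set_format(type='torch', columns=['label'])
    >>> dataset[0]                      # row access: formatted and filtered to 'label'
    {'label': tensor(1)}
    >>> dataset['sentence1'][0][:14]    # column access: still a plain python string
    'Amrozi accused'
    >>> dataset.reset_format()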
- This design choice was made because it's quite rare to use column-only access when working with deep-learning frameworks and it's quite useful to be able to access column even when they are masked by the format. - -Here is an example: - -.. code-block:: - - >>> dataset.set_format(type='torch', columns=['label']) - >>> dataset[0] - {'label': tensor(1)} - -It's also possible to use :func:`datasets.Dataset.with_format` instead, to get a new dataset object with the specified format. - -The current format of the dataset can be queried with ``datasets.Dataset.format`` and can be reset to the original format (python and no column filtered) with :func:`datasets.Dataset.reset_format`: - -.. code-block:: - - >>> dataset.format - {'type': 'torch', 'format_kwargs': {}, 'columns': ['label'], 'output_all_columns': False} - >>> dataset.reset_format() - >>> dataset.format - {'type': 'python', 'format_kwargs': {}, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False} - -You can also define your own formatting transform that is applied on-the-fly. To do so you can use :func:`datasets.Dataset.set_transform`. It replaces any format that may have been defined beforehand. -A formatting transform is a callable that takes a batch (as a dict) as input and returns a batch. - -Here is an example to tokenize and pad tokens on-the-fly when accessing the samples: - -.. code-block:: - - >>> from transformers import BertTokenizer - >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") - >>> def encode(batch): - >>> return tokenizer(batch["sentence1"], padding="longest", truncation=True, max_length=512, return_tensors="pt") - >>> - >>> dataset.set_transform(encode) - >>> dataset.format - {'type': 'custom', 'format_kwargs': {'transform': }, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False} - >>> dataset[:2] - {'input_ids': tensor([[ 101, 2572, 3217, ... 102]]), 'token_type_ids': tensor([[0, 0, 0, ... 0]]), 'attention_mask': tensor([[1, 1, 1, ... 1]])} - -It’s also possible to use :func:`datasets.Dataset.with_transform` instead, to get a new dataset object with the specified transform. - -Since the formatting function is applied on-the-fly, your original data are intact: - -.. code-block:: - - >>> dataset.reset_format() - >>> dataset[0] - {'idx': 0, 'label': 1, 'sentence1': 'Amrozi accused his [...] evidence .', 'sentence2': 'Referring to him [...] evidence .'} diff --git a/docs/source/faiss_and_ea.rst b/docs/source/faiss_and_ea.rst deleted file mode 100644 index c22fb9ca5d6..00000000000 --- a/docs/source/faiss_and_ea.rst +++ /dev/null @@ -1,151 +0,0 @@ -Adding a FAISS or Elastic Search index to a Dataset -==================================================================== - -It is possible to do document retrieval in a dataset. - -For example, one way to do Open Domain Question Answering is to first retrieve documents that may be relevant to answer a question, and then we can use a model to generate an answer given the retrieved documents. - -FAISS is a library for dense retrieval. It means that it retrieves documents based on their vector representations, by doing a nearest neighbors search. -As we now have models that can generate good semantic vector representations of documents, this has become an interesting tool for document retrieval. - -On the other hand there exist other tools like ElasticSearch for exact match retrieval in texts (sparse retrieval). 
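To make the dense-versus-sparse distinction concrete, here is a tiny NumPy sketch of what a nearest neighbors search over vector representations amounts to (the vectors are random toy data, and this is not the FAISS API itself):

.. code-block::

    >>> import numpy as np
    >>> rng = np.random.default_rng(0)
    >>> passage_vectors = rng.normal(size=(100, 768)).astype("float32")  # toy passage embeddings
    >>> query_vector = rng.normal(size=768).astype("float32")            # toy question embedding
    >>> scores = passage_vectors @ query_vector                          # inner-product similarity
    >>> top_k_indices = np.argsort(-scores)[:5]                          # the 5 most similar passages

FAISS performs the same kind of search, but with index structures that scale to millions of vectors.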
- -Both FAISS and ElasticSearch can be used in :class:`datasets.Dataset`, using these methods: - -- :func:`datasets.Dataset.add_faiss_index` to add a FAISS index -- :func:`datasets.Dataset.add_elasticsearch_index` to add an ElasticSearch index - -.. note:: - - One :class:`datasets.Dataset` can have several indexes, each identified by its :obj:`index_name`. By default it corresponds to the name of the column used to build the index. - -Then as soon as you have your index you can query it using these methods: - -- :func:`datasets.Dataset.search` to retrieve the scores and the ids of the examples. There is a version to do batched queries: :func:`datasets.Dataset.search_batch`. -- :func:`datasets.Dataset.get_nearest_examples` to retrieve the scores and the content of the examples. There is a version to do batched queries: :func:`datasets.Dataset.get_nearest_examples_batch`. - -Adding a FAISS index ----------------------------------- - -The :func:`datasets.Dataset.add_faiss_index` method is in charge of building, training and adding vectors to a FAISS index. - -One way to get good vector representations for text passages is to use the `DPR model `_. We'll compute the representations of only 100 examples just to give you the idea of how it works. - -.. code-block:: - - >>> from transformers import DPRContextEncoder, DPRContextEncoderTokenizer - >>> import torch - >>> torch.set_grad_enabled(False) - >>> ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base") - >>> ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base") - -Then you can load your dataset and compute the representations: - -.. code-block:: - - >>> from datasets import load_dataset - >>> ds = load_dataset('crime_and_punish', split='train[:100]') - >>> ds_with_embeddings = ds.map(lambda example: {'embeddings': ctx_encoder(**ctx_tokenizer(example["line"], return_tensors="pt"))[0][0].numpy()}) - -.. note:: - - If you have the embeddings in numpy format, you can call :func:`datasets.Dataset.add_faiss_index_from_external_arrays` instead. - -We can create the index: - -.. code-block:: - - >>> ds_with_embeddings.add_faiss_index(column='embeddings') - -Now have an index named 'embeddings' that we can query. Let's load the question encoder from DPR to have vector representations of questions. - -.. code-block:: - - >>> from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer - >>> q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base") - >>> q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base") - -.. code-block:: - - >>> question = "Is it serious ?" - >>> question_embedding = q_encoder(**q_tokenizer(question, return_tensors="pt"))[0][0].numpy() - >>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', question_embedding, k=10) - >>> retrieved_examples["line"][0] - '_that_ serious? It is not serious at all. It’s simply a fantasy to amuse\r\n' - - -When you are done with your queries you can save your index on disk: - -.. code-block:: - - ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss') - -And reload it later: - -.. 
code-block:: - - >>> ds = load_dataset('crime_and_punish', split='train[:100]') - >>> ds.load_faiss_index('embeddings', 'my_index.faiss') - - -Adding an ElasticSearch index ----------------------------------- - -The :func:`datasets.Dataset.add_elasticsearch_index` method is in charge of adding documents to an ElasticSearch index. - -ElasticSearch is a distributed text search engine based on Lucene. - -To use an ElasticSearch index with your dataset, you first need to have ElasticSearch running and accessible from your machine. - -For example if you have ElasticSearch running on your machine (default host=localhost, port=9200), you can run - -.. code-block:: - - >>> from datasets import load_dataset - >>> squad = load_dataset('squad', split='validation') - >>> squad.add_elasticsearch_index("context", host="localhost", port="9200") - -and then query the index of the "context" column of the squad dataset: - -.. code-block:: - - >>> query = "machine" - >>> scores, retrieved_examples = squad.get_nearest_examples("context", query, k=10) - >>> retrieved_examples["title"][0] - 'Computational_complexity_theory' - -You can reuse your index later by specifying the ElasticSearch index name - -.. code-block:: - - >>> from datasets import load_dataset - >>> squad = load_dataset('squad', split='validation') - >>> squad.add_elasticsearch_index("context", host="localhost", port="9200", es_index_name="hf_squad_val_context") - >>> squad.get_index("context").es_index_name - hf_squad_val_context - -.. code-block:: - - >>> from datasets import load_dataset - >>> squad = load_dataset('squad', split='validation') - >>> squad.load_elasticsearch_index("context", host="localhost", port="9200", es_index_name="hf_squad_val_context") - >>> query = "machine" - >>> scores, retrieved_examples = squad.get_nearest_examples("context", query, k=10) - -If you want to use a more advanced ElasticSearch configuration, you can also specify your own ElasticSearch, your own ElasticSearch index configuration, as well as you own ElasticSearch index name. - -.. code-block:: - - >>> import elasticsearch as es - >>> import elasticsearch.helpers - >>> from elasticsearch import Elasticsearch - >>> es_client = Elasticsearch([{"host": "localhost", "port": "9200"}]) # default client - >>> es_config = { - "settings": { - "number_of_shards": 1, - "analysis": {"analyzer": {"stop_standard": {"type": "standard", " stopwords": "_english_"}}}, - }, - "mappings": {"properties": {"text": {"type": "text", "analyzer": "standard", "similarity": "BM25"}}}, - } # default config - >>> es_index_name = "hf_squad_context" # name of the index in ElasticSearch - >>> squad.add_elasticsearch_index("context", es_client=es_client, es_config=es_config, es_index_name=es_index_name) diff --git a/docs/source/faiss_es.rst b/docs/source/faiss_es.rst new file mode 100644 index 00000000000..d98c14b39fd --- /dev/null +++ b/docs/source/faiss_es.rst @@ -0,0 +1,129 @@ +Search index +============ + +`FAISS `_ and `ElasticSearch `_ enables searching for examples in a dataset. This can be useful when you want to retrieve specific examples from a dataset that are relevant to your NLP task. For example, if you are working on a Open Domain Question Answering task, you may want to only return examples that are relevant to answering your question. + +This guide will show you how to build an index for your dataset that will allow you to search it. + +FAISS +----- + +FAISS retrieves documents based on the similiarity of their vector representations. 
In this example, you will generate the vector representations with the `DPR `_ model. + +1. Download the DPR model from πŸ€— Transformers: + +.. code-block:: + + >>> from transformers import DPRContextEncoder, DPRContextEncoderTokenizer + >>> import torch + >>> torch.set_grad_enabled(False) + >>> ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base") + >>> ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base") + +2. Load your dataset and compute the vector representations: + +.. code-block:: + + >>> from datasets import load_dataset + >>> ds = load_dataset('crime_and_punish', split='train[:100]') + >>> ds_with_embeddings = ds.map(lambda example: {'embeddings': ctx_encoder(**ctx_tokenizer(example["line"], return_tensors="pt"))[0][0].numpy()}) + +3. Create the index with :func:`datasets.Dataset.add_faiss_index`: + +.. code:: + + >>> ds_with_embeddings.add_faiss_index(column='embeddings') + +4. Now you can query your dataset with the ``embeddings`` index. Load the DPR Question Encoder, and search for a question with :func:`datasets.Dataset.get_nearest_examples`: + +.. code-block:: + + >>> from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer + >>> q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base") + >>> q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base") + + >>> question = "Is it serious ?" + >>> question_embedding = q_encoder(**q_tokenizer(question, return_tensors="pt"))[0][0].numpy() + >>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', question_embedding, k=10) + >>> retrieved_examples["line"][0] + '_that_ serious? It is not serious at all. It’s simply a fantasy to amuse\r\n' + +5. When you are done querying, save the index on disk with :func:`datasets.Dataset.save_faiss_index`: + +.. code:: + + >>> ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss') + +6. Reload it at a later time with :func:`datasets.Dataset.load_faiss_index`: + +.. code-block:: + + >>> ds = load_dataset('crime_and_punish', split='train[:100]') + >>> ds.load_faiss_index('embeddings', 'my_index.faiss') + +ElasticSearch +------------- + +Unlike FAISS, ElasticSearch retrieves documents based on exact matches. + +Start ElasticSearch on your machine, or see the `ElasticSearch installation guide `_ if you don't already have it installed. + +1. Load the dataset you want to index: + +.. code-block:: + + >>> from datasets import load_dataset + >>> squad = load_dataset('squad', split='validation') + +2. Build the index with :func:`datasets.Dataset.add_elasticsearch_index`: + +.. code:: + + >>> squad.add_elasticsearch_index("context", host="localhost", port="9200") + +3. Then you can query the ``context`` index with :func:`datasets.Dataset.get_nearest_examples`: + +.. code-block:: + + >>> query = "machine" + >>> scores, retrieved_examples = squad.get_nearest_examples("context", query, k=10) + >>> retrieved_examples["title"][0] + 'Computational_complexity_theory' + +4. If you want to reuse the index, define the ``es_index_name`` parameter when you build the index: + +.. code-block:: + + >>> from datasets import load_dataset + >>> squad = load_dataset('squad', split='validation') + >>> squad.add_elasticsearch_index("context", host="localhost", port="9200", es_index_name="hf_squad_val_context") + >>> squad.get_index("context").es_index_name + hf_squad_val_context + +5. 
Reload it later with the index name when you call :func:`datasets.Dataset.load_elasticsearch_index`: + +.. code-block:: + + >>> from datasets import load_dataset + >>> squad = load_dataset('squad', split='validation') + >>> squad.load_elasticsearch_index("context", host="localhost", port="9200", es_index_name="hf_squad_val_context") + >>> query = "machine" + >>> scores, retrieved_examples = squad.get_nearest_examples("context", query, k=10) + +For more advanced ElasticSearch usage, you can specify your own configuration with custom settings: + +.. code-block:: + + >>> import elasticsearch as es + >>> import elasticsearch.helpers + >>> from elasticsearch import Elasticsearch + >>> es_client = Elasticsearch([{"host": "localhost", "port": "9200"}]) # default client + >>> es_config = { + ... "settings": { + ... "number_of_shards": 1, + ... "analysis": {"analyzer": {"stop_standard": {"type": "standard", " stopwords": "_english_"}}}, + ... }, + ... "mappings": {"properties": {"text": {"type": "text", "analyzer": "standard", "similarity": "BM25"}}}, + ... } # default config + >>> es_index_name = "hf_squad_context" # name of the index in ElasticSearch + >>> squad.add_elasticsearch_index("context", es_client=es_client, es_config=es_config, es_index_name=es_index_name) \ No newline at end of file diff --git a/docs/source/features.rst b/docs/source/features.rst deleted file mode 100644 index 28db1079127..00000000000 --- a/docs/source/features.rst +++ /dev/null @@ -1,60 +0,0 @@ -Dataset features -================ - -:class:`datasets.Features` defines the internal structure of a dataset. Features are used to specify the underlying -serialization format but also contain high-level information regarding the fields, e.g. column names, types, and -conversion methods from class label strings to integer values for a :class:`datasets.ClassLabel` field. - -A brief summary of how to use this class: - -- :class:`datasets.Features` should be only called once and instantiated with a ``dict[str, FieldType]``, where keys are - your desired column names, and values are the type of that column. - -``FieldType`` can be one of a few possibilities: - -- a :class:`datasets.Value` feature specifies a single typed value, e.g. ``int64`` or ``string``. The dtypes supported - are as follows: - - - null - - bool - - int8 - - int16 - - int32 - - int64 - - uint8 - - uint16 - - uint32 - - uint64 - - float16 - - float32 (alias float) - - float64 (alias double) - - timestamp[(s|ms|us|ns)] - - timestamp[(s|ms|us|ns), tz=(tzstring)] - - binary - - large_binary - - string - - large_string - -- a python :obj:`dict` specifies that the field is a nested field containing a mapping of sub-fields to sub-fields - features. It's possible to have nested fields of nested fields in an arbitrary manner. - -- a python :obj:`list` or a :class:`datasets.Sequence` specifies that the field contains a list of objects. The python - :obj:`list` or :class:`datasets.Sequence` should be provided with a single sub-feature as an example of the feature - type hosted in this list. Python :obj:`list` are simplest to define and write while :class:`datasets.Sequence` provide - a few more specific behaviors like the possibility to specify a fixed length for the list (slightly more efficient). - - .. note:: - - A :class:`datasets.Sequence` with a internal dictionary feature will be automatically converted into a dictionary of - lists. This behavior is implemented to have a compatilbity layer with the TensorFlow Datasets library but may be - un-wanted in some cases. 
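For instance, a minimal sketch of this conversion with a SQuAD-style ``answers`` field (the values are made up for illustration):

.. code-block::

    >>> from datasets import Dataset, Features, Sequence, Value
    >>> features = Features({"answers": Sequence({"text": Value("string"), "answer_start": Value("int32")})})
    >>> ds = Dataset.from_dict(
    ...     {"answers": [{"text": ["a", "b"], "answer_start": [0, 3]}]},  # one example
    ...     features=features,
    ... )
    >>> ds[0]["answers"]  # returned as a dictionary of lists, not a list of dictionaries
    {'text': ['a', 'b'], 'answer_start': [0, 3]}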
If you don't want this behavior, you can use a python :obj:`list` instead of the - :class:`datasets.Sequence`. - -- a :class:`datasets.ClassLabel` feature specifies a field with a predefined set of classes which can have labels - associated to them and will be stored as integers in the dataset. This field will be stored and retrieved as an - integer value and two conversion methods, :func:`datasets.ClassLabel.str2int` and :func:`datasets.ClassLabel.int2str` - can be used to convert from the label names to the associate integer value and vice-versa. - -- finally, two features are specific to Machine Translation: :class:`datasets.Translation` and - :class:`datasets.TranslationVariableLanguages`. We refer to the :ref:`package reference ` - for more details on these features. diff --git a/docs/source/filesystems.rst b/docs/source/filesystems.rst index 39a5686d005..7552627fbcc 100644 --- a/docs/source/filesystems.rst +++ b/docs/source/filesystems.rst @@ -1,171 +1,147 @@ -FileSystems Integration for cloud storages -==================================================================== +Cloud storage +============== -Supported Filesystems ---------------------- +πŸ€— Datasets supports access to cloud storage providers through a S3 filesystem implementation: :class:`datasets.filesystems.S3FileSystem`. You can save and load datasets from your Amazon S3 bucket in a Pythonic way. Take a look at the following table for other supported cloud storage providers: -Currently ``datasets`` offers an s3 filesystem implementation with :class:`datasets.filesystems.S3FileSystem`. ``S3FileSystem`` is a subclass of `s3fs.S3FileSystem `_, which is a known implementation of ``fsspec``. +.. list-table:: + :header-rows: 1 -Furthermore ``datasets`` supports all ``fsspec`` implementations. Currently known implementations are: + * - Storage provider + - Filesystem implementation + * - Amazon S3 + - `s3fs `_ + * - Google Cloud Storage + - `gcsfs `_ + * - Azure DataLake + - `adl `_ + * - Azure Blob + - `abfs `_ + * - Dropbox + - `dropboxdrivefs `_ + * - Google Drive + - `gdrivefs `_ -- `s3fs `_ for Amazon S3 and other compatible stores -- `gcsfs `_ for Google Cloud Storage -- `adl `_ for Azure DataLake storage -- `abfs `_ for Azure Blob service -- `dropbox `_ for access to dropbox shares -- `gdrive `_ to access Google Drive and shares (experimental) +This guide will show you how to save and load datasets with **s3fs** to a S3 bucket, but other filesystem implementations can be used similarly. -These known implementations are going to be natively added in the near future within ``datasets``, but you can use them already in a similar way to ``s3fs``. +Listing datasets +---------------- -**Examples:** +1. Install the S3 dependecy with πŸ€— Datasets: -Example using :class:`datasets.filesystems.S3FileSystem` within ``datasets``. +.. code:: + >>> pip install datasets[s3] -.. code-block:: - - >>> pip install "datasets[s3]" - -Listing files from a public s3 bucket. +2. List files from a public S3 bucket with ``s3.ls``: .. code-block:: - >>> import datasets - >>> s3 = datasets.filesystems.S3FileSystem(anon=True) # doctest: +SKIP - >>> s3.ls('some-public-datasets/imdb/train') # doctest: +SKIP - ['dataset_info.json.json','dataset.arrow','state.json'] + >>> import datasets + >>> s3 = datasets.filesystems.S3FileSystem(anon=True) + >>> s3.ls('public-datasets/imdb/train') + ['dataset_info.json.json','dataset.arrow','state.json'] -Listing files from a private s3 bucket using ``aws_access_key_id`` and ``aws_secret_access_key``. 
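Because ``S3FileSystem`` is a subclass of ``s3fs.S3FileSystem``, the other ``fsspec`` helpers are available as well. A sketch reusing the illustrative bucket from above (the paths are placeholders, not a guaranteed public bucket):

.. code-block::

    >>> import datasets
    >>> s3 = datasets.filesystems.S3FileSystem(anon=True)
    >>> # pattern-match files under the same prefix as above
    >>> s3.glob('public-datasets/imdb/train/*.arrow')
    >>> # read a file directly from the bucket
    >>> with s3.open('public-datasets/imdb/train/state.json') as f:
    ...     state_bytes = f.read()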
+Access a private S3 bucket by entering your ``aws_access_key_id`` and ``aws_secret_access_key``: .. code-block:: - >>> import datasets - >>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key) # doctest: +SKIP - >>> s3.ls('my-private-datasets/imdb/train') # doctest: +SKIP - ['dataset_info.json.json','dataset.arrow','state.json'] - -Using ``S3FileSystem`` with ``botocore.session.Session`` and custom AWS ``profile``. - -.. code-block:: - - >>> import botocore - >>> from datasets.filesystems import S3FileSystem - >>> s3_session = botocore.session.Session(profile='my_profile_name') - >>> s3 = S3FileSystem(session=s3_session) # doctest: +SKIP - - + >>> import datasets + >>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key) + >>> s3.ls('my-private-datasets/imdb/train') + ['dataset_info.json.json','dataset.arrow','state.json'] -Example using a another ``fsspec`` implementations, like ``gcsfs`` within ``datasets``. +Google Cloud Storage +^^^^^^^^^^^^^^^^^^^^ -.. code-block:: +Other filesystem implementations, like Google Cloud Storage, are used similarly: - >>> conda install -c conda-forge gcsfs - >>> # or - >>> pip install gcsfs +1. Install the Google Cloud Storage implementation: .. code-block:: - >>> import gcsfs - >>> gcs = gcsfs.GCSFileSystem(project='my-google-project') # doctest: +SKIP - >>> - >>> # saves encoded_dataset to your s3 bucket - >>> encoded_dataset.save_to_disk('gcs://my-private-datasets/imdb/train', fs=gcs) # doctest: +SKIP - - - -Saving a processed dataset to s3 --------------------------------- - -Once you have your final dataset you can save it to s3 and reuse it later using :obj:`datasets.load_from_disk`. -Saving a dataset to s3 will upload various files to your bucket: + >>> conda install -c conda-forge gcsfs + # or install with pip + >>> pip install gcsfs -- ``arrow files``: they contain your dataset's data -- ``dataset_info.json``: contains the description, citations, etc. of the dataset -- ``state.json``: contains the list of the arrow files and other informations like the dataset format type, if any (torch or tensorflow for example) - -Saving ``encoded_dataset`` to a private s3 bucket using ``aws_access_key_id`` and ``aws_secret_access_key``. +2. Load your dataset: .. code-block:: - >>> from datasets.filesystems import S3FileSystem - >>> - >>> # create S3FileSystem instance with aws_access_key_id and aws_secret_access_key - >>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key) # doctest: +SKIP - >>> - >>> # saves encoded_dataset to your s3 bucket - >>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train',fs=s3) # doctest: +SKIP + >>> import gcsfs + >>> gcs = gcsfs.GCSFileSystem(project='my-google-project') + + >>> # saves encoded_dataset to your s3 bucket + >>> encoded_dataset.save_to_disk('gcs://my-private-datasets/imdb/train', fs=gcs) -Saving ``encoded_dataset`` to a private s3 bucket using ``botocore.session.Session`` and custom AWS ``profile``. +Saving datasets +--------------- -.. 
code-block:: +After you have processed your dataset, you can save it to S3 with :func:`datasets.Dataset.save_to_disk`: - >>> import botocore - >>> from datasets.filesystems import S3FileSystem - >>> - >>> # creates a botocore session with the provided AWS profile - >>> s3_session = botocore.session.Session(profile='my_profile_name') - >>> - >>> # create S3FileSystem instance with s3_session - >>> s3 = S3FileSystem(session=s3_session) # doctest: +SKIP - >>> - >>> # saves encoded_dataset to your s3 bucket - >>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train',fs=s3) # doctest: +SKIP +.. code-block:: + >>> from datasets.filesystems import S3FileSystem + + >>> # create S3FileSystem instance + >>> s3 = S3FileSystem(anon=True) + + >>> # saves encoded_dataset to your s3 bucket + >>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train', fs=s3) -Loading a processed dataset from s3 ------------------------------------ +.. tip:: -After you have saved your processed dataset to s3 you can load it using :obj:`datasets.load_from_disk`. -You can only load datasets from s3, which are saved using :func:`datasets.Dataset.save_to_disk` -and :func:`datasets.DatasetDict.save_to_disk`. + Remember to include your ``aws_access_key_id`` and ``aws_secret_access_key`` whenever you are interacting with a private S3 bucket. -Loading ``encoded_dataset`` from a public s3 bucket. +Save your dataset with ``botocore.session.Session`` and a custom AWS profile: .. code-block:: - >>> from datasets import load_from_disk - >>> from datasets.filesystems import S3FileSystem - >>> - >>> # create S3FileSystem without credentials - >>> s3 = S3FileSystem(anon=True) # doctest: +SKIP - >>> - >>> # load encoded_dataset from s3 bucket - >>> dataset = load_from_disk('s3://some-public-datasets/imdb/train',fs=s3) # doctest: +SKIP - >>> - >>> print(len(dataset)) - >>> # 25000 + >>> import botocore + >>> from datasets.filesystems import S3FileSystem + + >>> # creates a botocore session with the provided AWS profile + >>> s3_session = botocore.session.Session(profile='my_profile_name') + + >>> # create S3FileSystem instance with s3_session + >>> s3 = S3FileSystem(session=s3_session) + + >>> # saves encoded_dataset to your s3 bucket + >>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train',fs=s3) + +Loading datasets +---------------- -Loading ``encoded_dataset`` from a private s3 bucket using ``aws_access_key_id`` and ``aws_secret_access_key``. +When you are ready to use your dataset again, reload it with :obj:`datasets.load_from_disk`: .. code-block:: - >>> from datasets import load_from_disk - >>> from datasets.filesystems import S3FileSystem - >>> - >>> # create S3FileSystem instance with aws_access_key_id and aws_secret_access_key - >>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key) # doctest: +SKIP - >>> - >>> # load encoded_dataset to from s3 bucket - >>> dataset = load_from_disk('s3://my-private-datasets/imdb/train',fs=s3) # doctest: +SKIP - >>> - >>> print(len(dataset)) - >>> # 25000 + >>> from datasets import load_from_disk + >>> from datasets.filesystems import S3FileSystem + + >>> # create S3FileSystem without credentials + >>> s3 = S3FileSystem(anon=True) + + >>> # load encoded_dataset to from s3 bucket + >>> dataset = load_from_disk('s3://a-public-datasets/imdb/train',fs=s3) + + >>> print(len(dataset)) + >>> # 25000 -Loading ``encoded_dataset`` from a private s3 bucket using ``botocore.session.Session`` and custom AWS ``profile``. 
+Load with ``botocore.session.Session`` and custom AWS profile: .. code-block:: - >>> import botocore - >>> from datasets.filesystems import S3FileSystem - >>> - >>> # create S3FileSystem instance with aws_access_key_id and aws_secret_access_key - >>> s3_session = botocore.session.Session(profile='my_profile_name') - >>> - >>> # create S3FileSystem instance with s3_session - >>> s3 = S3FileSystem(session=s3_session) - >>> - >>> # load encoded_dataset to from s3 bucket - >>> dataset = load_from_disk('s3://my-private-datasets/imdb/train',fs=s3) # doctest: +SKIP - >>> - >>> print(len(dataset)) - >>> # 25000 + >>> import botocore + >>> from datasets.filesystems import S3FileSystem + + >>> # create S3FileSystem instance with aws_access_key_id and aws_secret_access_key + >>> s3_session = botocore.session.Session(profile='my_profile_name') + + >>> # create S3FileSystem instance with s3_session + >>> s3 = S3FileSystem(session=s3_session) + + >>> # load encoded_dataset to from s3 bucket + >>> dataset = load_from_disk('s3://my-private-datasets/imdb/train',fs=s3) + + >>> print(len(dataset)) + >>> # 25000 \ No newline at end of file diff --git a/docs/source/how_to.md b/docs/source/how_to.md new file mode 100644 index 00000000000..df6c3d1f493 --- /dev/null +++ b/docs/source/how_to.md @@ -0,0 +1,23 @@ +# Overview + +Our how-to guides will show you how to complete a specific task. These guides are intended to help you apply your knowledge of πŸ€— Datasets to real-world problems you may encounter. Want to flatten a column or load a dataset from a local file? We got you covered! You should already be familiar and comfortable with the πŸ€— Datasets basics, and if you aren't, we recommend reading our [tutorial](../tutorial.md) first. + +The how-to guides will cover eight key areas of πŸ€— Datasets: + +* How to load a dataset from other data sources. + +* How to process a dataset. + +* How to stream large datasets. + +* How to upload and share a dataset. + +* How to create a dataset loading script. + +* How to create a dataset card. + +* How to compute metrics. + +* How to manage the cache. + +You can also find guides on how to process massive datasets with Beam, how to integrate with cloud storage providers, and how to add an index to search your dataset. diff --git a/docs/source/how_to_metrics.rst b/docs/source/how_to_metrics.rst new file mode 100644 index 00000000000..7f2e5e7f630 --- /dev/null +++ b/docs/source/how_to_metrics.rst @@ -0,0 +1,228 @@ +Metrics +======= + +Metrics are important for evaluating a model's predictions. In the tutorial, you learned how to compute a metric over an entire evaluation set. You have also seen how to load a metric. + +This guide will show you how to: + +* Add predictions and references. +* Compute metrics using different methods. +* Write your own metric loading script. + +Add predictions and references +------------------------------ + +When you want to add model predictions and references to a :class:`datasets.Metric` instance, you have two options: + +* :func:`datasets.Metric.add` adds a single ``prediction`` and ``reference``. + +* :func:`datasets.Metric.add_batch` adds a batch of ``predictions`` and ``references``. + +Use :func:`datasets.Metric.add_batch` by passing it your model predictions, and the references the model predictions should be evaluated against: + +.. code-block:: + + >>> import datasets + >>> metric = datasets.load_metric('my_metric') + >>> for model_input, gold_references in evaluation_dataset: + ... model_predictions = model(model_inputs) + ... 
metric.add_batch(predictions=model_predictions, references=gold_references) + >>> final_score = metric.compute() + +.. note:: + + Metrics accepts various input formats (Python lists, NumPy arrays, PyTorch tensors, etc.) and converts them to an appropriate format for storage and computation. + +Compute scores +-------------- + +The most straightforward way to calculate a metric is to call :func:`datasets.Metric.compute`. But some metrics have additional arguments that allow you to modify the metrics behavior. + +Let's load the `SacreBLEU `_ metric, and compute it with a different smoothing method. + +1. Load the SacreBLEU metric: + +.. code-block:: + + >>> import datasets + >>> metric = datasets.load_metric('sacrebleu') + +2. Inspect the different argument methods for computing the metric: + +.. code-block:: + + >>> print(metric.inputs_description) + Produces BLEU scores along with its sufficient statistics + from a source against one or more references. + + Args: + predictions: The system stream (a sequence of segments). + references: A list of one or more reference streams (each a sequence of segments). + smooth_method: The smoothing method to use. (Default: 'exp'). + smooth_value: The smoothing value. Only valid for 'floor' and 'add-k'. (Defaults: floor: 0.1, add-k: 1). + tokenize: Tokenization method to use for BLEU. If not provided, defaults to 'zh' for Chinese, 'ja-mecab' for Japanese and '13a' (mteval) otherwise. + lowercase: Lowercase the data. If True, enables case-insensitivity. (Default: False). + force: Insist that your tokenized input is actually detokenized. + ... + +3. Compute the metric with the ``floor`` method, and a different ``smooth_value``: + +.. code:: + + >>> score = metric.compute(smooth_method="floor", smooth_value=0.2) + +.. _metric_script: + +Custom metric loading script +---------------------------- + +Write a metric loading script to use your own custom metric (or one that is not on the Hub). Then you can load it as usual with :func:`datasets.load_metric`. + +To help you get started, open the `SQuAD metric loading script `_ and follow along. + +.. tip:: + + Get jump started with our metric loading script `template `_! + +Add metric attributes +^^^^^^^^^^^^^^^^^^^^^ + +Start by adding some information about your metric in :func:`datasets.Metric._info`. The most important attributes you should specify are: + +1. :attr:`datasets.MetricInfo.description` provides a brief description about your metric. + +2. :attr:`datasets.MetricInfo.citation` contains a BibTex citation for the metric. + +3. :attr:`datasets.MetricInfo.inputs_description` describes the expected inputs and outputs. It may also provide an example usage of the metric. + +4. :attr:`datasets.MetricInfo.features` defines the name and type of the predictions and references. + +After you've filled out all these fields in the template, it should look like the following example from the SQuAD metric script: + +.. 
code-block:: + + class Squad(datasets.Metric): + def _info(self): + return datasets.MetricInfo( + description=_DESCRIPTION, + citation=_CITATION, + inputs_description=_KWARGS_DESCRIPTION, + features=datasets.Features( + { + "predictions": {"id": datasets.Value("string"), "prediction_text": datasets.Value("string")}, + "references": { + "id": datasets.Value("string"), + "answers": datasets.features.Sequence( + { + "text": datasets.Value("string"), + "answer_start": datasets.Value("int32"), + } + ), + }, + } + ), + codebase_urls=["https://rajpurkar.github.io/SQuAD-explorer/"], + reference_urls=["https://rajpurkar.github.io/SQuAD-explorer/"], + ) + +Download metric files +^^^^^^^^^^^^^^^^^^^^^ + +If your metric needs to download, or retrieve local files, you will need to use the :func:`datasets.Metric._download_and_prepare` method. For this example, let's examine the `BLEURT metric loading script `_. + +1. Provide a dictionary of URLs that point to the metric files: + +.. code-block:: + + CHECKPOINT_URLS = { + "bleurt-tiny-128": "https://storage.googleapis.com/bleurt-oss/bleurt-tiny-128.zip", + "bleurt-tiny-512": "https://storage.googleapis.com/bleurt-oss/bleurt-tiny-512.zip", + "bleurt-base-128": "https://storage.googleapis.com/bleurt-oss/bleurt-base-128.zip", + "bleurt-base-512": "https://storage.googleapis.com/bleurt-oss/bleurt-base-512.zip", + "bleurt-large-128": "https://storage.googleapis.com/bleurt-oss/bleurt-large-128.zip", + "bleurt-large-512": "https://storage.googleapis.com/bleurt-oss/bleurt-large-512.zip", + } + +.. tip:: + + If the files are stored locally, provide a dictionary of path(s) instead of URLs. + +2. :func:`datasets.Metric._download_and_prepare` will take the URLs and download the metric files specified: + +.. code-block:: + + def _download_and_prepare(self, dl_manager): + + # check that config name specifies a valid BLEURT model + if self.config_name == "default": + logger.warning( + "Using default BLEURT-Base checkpoint for sequence maximum length 128. " + "You can use a bigger model for better results with e.g.: datasets.load_metric('bleurt', 'bleurt-large-512')." + ) + self.config_name = "bleurt-base-128" + if self.config_name not in CHECKPOINT_URLS.keys(): + raise KeyError( + f"{self.config_name} model not found. You should supply the name of a model checkpoint for bleurt in {CHECKPOINT_URLS.keys()}" + ) + + # download the model checkpoint specified by self.config_name and set up the scorer + model_path = dl_manager.download_and_extract(CHECKPOINT_URLS[self.config_name]) + self.scorer = score.BleurtScorer(os.path.join(model_path, self.config_name)) + +Compute score +^^^^^^^^^^^^^ + +:func:`datasets.DatasetBuilder._compute` provides the actual instructions for how to compute a metric given the predictions and references. Now let's take a look at the `GLUE metric loading script `_. + +1. Provide the functions for :func:`datasets.DatasetBuilder._compute` to calculate your metric: + +.. code-block:: + + def simple_accuracy(preds, labels): + return (preds == labels).mean().item() + + def acc_and_f1(preds, labels): + acc = simple_accuracy(preds, labels) + f1 = f1_score(y_true=labels, y_pred=preds).item() + return { + "accuracy": acc, + "f1": f1, + } + + def pearson_and_spearman(preds, labels): + pearson_corr = pearsonr(preds, labels)[0].item() + spearman_corr = spearmanr(preds, labels)[0].item() + return { + "pearson": pearson_corr, + "spearmanr": spearman_corr, + } + +2. 
Create :func:`datasets.DatasetBuilder._compute` with instructions for what metric to calculate for each configuration: + +.. code-block:: + + def _compute(self, predictions, references): + if self.config_name == "cola": + return {"matthews_correlation": matthews_corrcoef(references, predictions)} + elif self.config_name == "stsb": + return pearson_and_spearman(predictions, references) + elif self.config_name in ["mrpc", "qqp"]: + return acc_and_f1(predictions, references) + elif self.config_name in ["sst2", "mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]: + return {"accuracy": simple_accuracy(predictions, references)} + else: + raise KeyError( + "You should supply a configuration name selected in " + '["sst2", "mnli", "mnli_mismatched", "mnli_matched", ' + '"cola", "stsb", "mrpc", "qqp", "qnli", "rte", "wnli", "hans"]' + ) + +Test +^^^^ + +Once you're finished writing your metric loading script, try to load it locally: + +.. code-block:: + + >>> from datasets import load_metric + >>> metric = load_metric('PATH/TO/MY/SCRIPT.py') \ No newline at end of file diff --git a/docs/source/imgs/builderconfig.png b/docs/source/imgs/builderconfig.png new file mode 100644 index 00000000000..888734accf5 Binary files /dev/null and b/docs/source/imgs/builderconfig.png differ diff --git a/docs/source/imgs/datasetbuilder.png b/docs/source/imgs/datasetbuilder.png new file mode 100644 index 00000000000..c1d74153874 Binary files /dev/null and b/docs/source/imgs/datasetbuilder.png differ diff --git a/docs/source/imgs/datasets_logo.png b/docs/source/imgs/datasets_logo.png new file mode 100644 index 00000000000..d6f6ff56db6 Binary files /dev/null and b/docs/source/imgs/datasets_logo.png differ diff --git a/docs/source/index.rst b/docs/source/index.rst index 47c783d6de0..faafa24ee03 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -1,83 +1,103 @@ -HuggingFace Datasets -======================================= +Datasets +======== -Datasets and evaluation metrics for natural language processing +.. image:: /imgs/datasets_logo.png + :align: center -Compatible with NumPy, Pandas, PyTorch and TensorFlow +πŸ€— Datasets is a library for easily accessing and sharing datasets, and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks. -πŸ€— Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP). +Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the `Hugging Face Hub `_, allowing you to easily load and share a dataset with the wider NLP community. There are currently over 900 datasets, and more than 25 metrics available. -πŸ€— Datasets has many interesting features (beside easy sharing and accessing datasets/metrics): +Find your dataset today on the `Hugging Face Hub `_, or take an in-depth look inside a dataset with the live `Datasets Viewer `_. -- Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2 -- Lightweight and fast with a transparent and pythonic API -- Strive on large datasets: πŸ€— Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default. 
-- Smart caching: never wait for your data to process several times -- πŸ€— Datasets currently provides access to ~1,000 datasets and ~30 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live `πŸ€— Datasets viewer `_. +.. panels:: + :card: shadow -πŸ€— Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between πŸ€— Datasets and `tfds` can be found in the section Main differences between πŸ€— Datasets and `tfds`. + .. link-button:: tutorial + :type: ref + :text: Tutorials + :classes: btn-primary btn-block + + ^^^ + Learn the basics and become familiar with loading, accessing, and processing a dataset. Start here if you are using πŸ€— Datasets for the first time! -Contents ---------------------------------- + --- + .. link-button:: how_to + :type: ref + :text: How-to guides + :classes: btn-primary btn-block -The documentation is organized in six parts: + ^^^ + Practical guides to help you achieve a specific goal. Take a look at these guides to learn how to use πŸ€— Datasets to solve real-world problems. -- **GET STARTED** contains a quick tour and the installation instructions. -- **USING DATASETS** contains general tutorials on how to use and contribute to the datasets in the library. -- **USING METRICS** contains general tutorials on how to use and contribute to the metrics in the library. -- **ADDING NEW DATASETS/METRICS** explains how to create your own dataset or metric loading script. -- **ADVANCED GUIDES** contains more advanced guides that are more specific to a part of the library. -- **PACKAGE REFERENCE** contains the documentation of each public class and function. + --- + .. link-button:: about_arrow + :type: ref + :text: Conceptual guides + :classes: btn-primary btn-block -.. toctree:: - :maxdepth: 2 - :caption: Get started + ^^^ + High-level explanations for building a better understanding about important topics such as the underlying data format, the cache, and how datasets are generated. + --- + .. link-button:: package_reference/main_classes + :type: ref + :text: Reference + :classes: btn-primary btn-block - quicktour - installation + ^^^ + Technical descriptions of how πŸ€— Datasets classes and methods work. .. toctree:: - :maxdepth: 2 - :caption: Using datasets + :hidden: - loading_datasets - exploring - processing - torch_tensorflow - filesystems - faiss_and_ea - dataset_streaming + quickstart .. toctree:: - :maxdepth: 2 - :caption: Using metrics - - loading_metrics - using_metrics + :hidden: + :caption: Tutorials + tutorial + installation + load_hub + access + use_dataset + metrics + .. toctree:: - :maxdepth: 2 - :caption: Adding new datasets/metrics - - share_dataset - add_dataset - add_metric + :hidden: + :caption: How-to guides + + how_to + loading + process + stream + share + dataset_script + dataset_card + cache + filesystems + faiss_es + how_to_metrics + beam .. toctree:: - :maxdepth: 2 - :caption: Advanced guides + :hidden: + :caption: Conceptual guides - features - splits - beam_dataset + about_arrow + about_cache + about_dataset_features + about_dataset_load + about_map_batch + about_metrics .. 
toctree:: - :maxdepth: 2 - :caption: Package reference + :hidden: + :caption: Reference - package_reference/loading_methods package_reference/main_classes package_reference/builder_classes + package_reference/loading_methods package_reference/table_classes package_reference/logging_methods package_reference/task_templates diff --git a/docs/source/installation.md b/docs/source/installation.md index 61574b6121d..89eddcd8d40 100644 --- a/docs/source/installation.md +++ b/docs/source/installation.md @@ -1,40 +1,63 @@ # Installation -πŸ€— Datasets is tested on Python 3.6+. +Before you start, you will need to setup your environment and install the appropriate packages. πŸ€— Datasets is tested on **Python 3.6+**. -You should install πŸ€— Datasets in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're -unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). Create a virtual environment with the version of Python you're going to use and activate it. +```{seealso} +If you want to use πŸ€— Datasets with TensorFlow or PyTorch, you will need to install them separately. Refer to the [TensorFlow](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) or the [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) for the specific install command for your framework. +``` + +## Virtual environment + +You should install πŸ€— Datasets in a [virtual environment](https://docs.python.org/3/library/venv.html) to keep everything neat and tidy. + +1. Create and navigate to your project directory: + + ```bash + mkdir ~/my-project + cd ~/my-project + ``` -Now, if you want to use πŸ€— Datasets, you can install it with pip. If you'd like to play with the examples, you must install it from source. +2. Start a virtual environment inside the directory: -## Installation with pip + ```bash + python -m venv .env + ``` -πŸ€— Datasets can be installed using pip as follows: +3. Activate and deactivate the virtual environment with the following commands: + + ```bash + # Activate the virtual environment + source .env/bin/activate + + # Deactivate the virtual environment + source .env/bin/deactivate + ``` + +Once you have created your virtual environment, you can install πŸ€— Datasets in it. + +## pip + +The most straightforward way to install πŸ€— Datasets is with pip: ```bash pip install datasets ``` -To check πŸ€— Datasets is properly installed, run the following command: +Run the following command to check if πŸ€— Datasets has been properly installed: ```bash python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])" ``` -It should download version 1 of the [Stanford Question Answering Dataset](https://rajpurkar.github.io/SQuAD-explorer/), load its training split and print the first training example: +This should download version 1 of the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/), load the training split, and print the first training example: ```python {'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. 
Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5733be284776f41900661182', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'title': 'University_of_Notre_Dame'} ``` -If you want to use the πŸ€— Datasets library with TensorFlow 2.0 or PyTorch, you will need to install these seperately. -Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) -and/or [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) regarding the specific install command for your platform. - +## source -## Installing from source - -To install from source, clone the repository and install with the following commands: +Building πŸ€— Datasets from source lets you make changes to the code base. To install from source, clone the repository and install with the following commands: ```bash git clone https://github.com/huggingface/datasets.git @@ -42,34 +65,16 @@ cd datasets pip install -e . ``` -Again, you can run: +Again, you can check if πŸ€— Datasets has been properly installed with: ```bash python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])" ``` -to check πŸ€— Datasets is properly installed. - -## With conda +## conda -πŸ€— Datasets can be installed using conda as follows: +πŸ€— Datasets can also be installed with conda, a package management system: ```bash conda install -c huggingface -c conda-forge datasets -``` - -Follow the installation pages of TensorFlow and PyTorch to see how to install them with conda. - -## Caching datasets and metrics - -This library will download and cache datasets and metrics processing scripts and data locally. - -Unless you specify a location with `cache_dir=...` when you use methods like `load_dataset` and `load_metric`, these datasets and metrics will automatically be downloaded in the folders respectively given by the shell environment variables ``HF_DATASETS_CACHE`` and ``HF_METRICS_CACHE``. The default value for it will be the HuggingFace cache home followed by ``/datasets/`` for datasets scripts and data, and ``/metrics/`` for metrics scripts and data. - -The HuggingFace cache home is (by order of priority): - - * shell environment variable ``HF_HOME`` - * shell environment variable ``XDG_CACHE_HOME`` + ``/huggingface/`` - * default: ``~/.cache/huggingface/`` - -So if you don't have any specific environment variable set, the cache directory for dataset scripts and data will be at ``~/.cache/huggingface/datasets/``. +``` \ No newline at end of file diff --git a/docs/source/load_hub.rst b/docs/source/load_hub.rst new file mode 100644 index 00000000000..54be77daed0 --- /dev/null +++ b/docs/source/load_hub.rst @@ -0,0 +1,96 @@ +Hugging Face Hub +================ + +Now that you are all setup, the first step is to load a dataset. The easiest way to load a dataset is from the `Hugging Face Hub `_. There are already over 900 datasets in over 100 languages on the Hub. Choose from a wide category of datasets to use for NLP tasks like question answering, summarization, machine translation, and language modeling. For a more in-depth look inside a dataset, use the live `Datasets Viewer `_. 
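If you prefer to explore programmatically, you can also pull the full list of dataset names in a couple of lines. A quick sketch (the exact count keeps growing as new datasets are added):

.. code-block::

    >>> from datasets import list_datasets
    >>> datasets_list = list_datasets()
    >>> len(datasets_list)   # over 900 at the time of writing
    >>> datasets_list[:3]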
+ +Load a dataset +-------------- + +Before you take the time to download a dataset, it is often helpful to quickly get all the relevant information about a dataset. The :func:`datasets.load_dataset_builder` method allows you to inspect the attributes of a dataset without downloading it. + +.. code-block:: + + >>> from datasets import load_dataset_builder + >>> dataset_builder = load_dataset_builder('imdb') + >>> print(dataset_builder.cache_dir) + /Users/thomwolf/.cache/huggingface/datasets/imdb/plain_text/1.0.0/fdc76b18d5506f14b0646729b8d371880ef1bc48a26d00835a7f3da44004b676 + >>> print(dataset_builder.info.features) + {'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)} + >>> print(dataset_builder.info.splits) + {'train': SplitInfo(name='train', num_bytes=33432835, num_examples=25000, dataset_name='imdb'), 'test': SplitInfo(name='test', num_bytes=32650697, num_examples=25000, dataset_name='imdb'), 'unsupervised': SplitInfo(name='unsupervised', num_bytes=67106814, num_examples=50000, dataset_name='imdb')} + +.. seealso:: + + Take a look at :class:`datasets.DatasetInfo` for a full list of attributes you can use with ``dataset_builder``. + +Once you are happy with the dataset you want, load it in a single line with :func:`datasets.load_dataset`: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('glue', 'mrpc', split='train') + +Select a split +-------------- + +A split is a specific subset of the dataset like ``train`` and ``test``. Make sure you select a split when you load a dataset. If you don't supply a ``split`` argument, πŸ€— Datasets will only return a dictionary containing the subsets of the dataset. + +.. code-block:: + + >>> from datasets import load_dataset + >>> datasets = load_dataset('glue', 'mrpc') + >>> print(datasets) + {train: Dataset({ + features: ['idx', 'label', 'sentence1', 'sentence2'], + num_rows: 3668 + }) + validation: Dataset({ + features: ['idx', 'label', 'sentence1', 'sentence2'], + num_rows: 408 + }) + test: Dataset({ + features: ['idx', 'label', 'sentence1', 'sentence2'], + num_rows: 1725 + }) + } + +Select a configuration +---------------------- + +Some datasets, like the `General Language Understanding Evaluation (GLUE) `_ benchmark, are actually made up of several datasets. These sub-datasets are called **configurations**, and you must explicitly select one when you load the dataset. If you don't provide a configuration name, πŸ€— Datasets will raise a ``ValueError`` and remind you to select a configuration. + +Use ``get_dataset_config_names`` to retrieve a a list of all the possible configurations available to your dataset: + +.. code-block:: + + from datasets import get_dataset_config_names + + configs = get_dataset_config_names("glue") + print(configs) + # ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax'] + + +❌ Incorrect way to load a configuration: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('glue') + ValueError: Config name is missing. + Please pick one among the available configs: ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax'] + Example of usage: + `load_dataset('glue', 'cola')` + +βœ… Correct way to load a configuration: + +.. 
code-block:: + + >>> dataset = load_dataset('glue', 'sst2') + Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, total: 11.90 MiB) to /Users/thomwolf/.cache/huggingface/datasets/glue/sst2/1.0.0... + Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7.44M/7.44M [00:01<00:00, 7.03MB/s] + Dataset glue downloaded and prepared to /Users/huggignface/.cache/huggingface/datasets/glue/sst2/1.0.0. Subsequent calls will reuse this data. + >>> print(dataset) + {'train': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 67349), + 'validation': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 872), + 'test': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 1821) + } \ No newline at end of file diff --git a/docs/source/loading.rst b/docs/source/loading.rst new file mode 100644 index 00000000000..a110e600057 --- /dev/null +++ b/docs/source/loading.rst @@ -0,0 +1,422 @@ +Load +==== + +You have already seen how to load a dataset from the Hugging Face Hub. But datasets are stored in a variety of places, and sometimes you won't find the one you want on the Hub. A dataset can be on disk on your local machine, in a Github repository, and in in-memory data structures like Python dictionaries and Pandas DataFrames. Wherever your dataset may be stored, πŸ€— Datasets provides a way for you to load and use it for training. + +This guide will show you how to load a dataset from: + +* The Hub without a dataset loading script +* Local files +* In-memory data +* Offline +* A specific slice of a split + +You will also learn how to troubleshoot common errors, and how to load specific configurations of a metric. + +.. _load-from-the-hub: + +Hugging Face Hub +---------------- + +In the tutorial, you learned how to load a dataset from the Hub. This method relies on a dataset loading script that downloads and builds the dataset. However, you can also load a dataset from any dataset repository on the Hub **without** a loading script! + +First, create a dataset repository and upload your data files. Then you can use :func:`datasets.load_dataset` like you learned in the tutorial. For example, load the files from this `demo repository `_ by providing the repository namespace and dataset name: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('lhoestq/demo1') + +This dataset repository contains CSV files, and this code loads all the data from the CSV files. + +Some datasets may have more than one version, based on Git tags, branches or commits. Use the ``script_version`` flag to specifiy which dataset version you want to load: + +.. code-block:: + + >>> dataset = load_dataset( + >>> "lhoestq/custom_squad", + >>> script_version="main" # tag name, or branch name, or commit hash + >>> ) + +.. seealso:: + + Refer to the :ref:`upload_dataset_repo` guide for more instructions on how to create a dataset repository on the Hub, and how to upload your data files. + +If the dataset doesn't have a dataset loading script, then by default, all the data will be loaded in the ``train`` split. Use the ``data_files`` parameter to map data files to splits like ``train``, ``validation`` and ``test``: + +.. 
code-block:: + + >>> data_files = {"train": "train.csv", "test": "test.csv"} + >>> dataset = load_dataset("namespace/your_dataset_name", data_files=data_files) + +.. important:: + + If you don't specify which data files to use, ``load_dataset`` will return all the data files. This can take a long time if you are loading a large dataset like C4, which is approximately 13TB of data. + +You can also load a specific subset of the files with the ``data_files`` parameter. The example below loads files from the `C4 dataset `_: + +.. code-block:: + + >>> from datasets import load_dataset + >>> c4_subset = load_dataset('allenai/c4', data_files='en/c4-train.0000*-of-01024.json.gz') + +Specify a custom split with the ``split`` parameter: + +.. code-block:: + + >>> data_files = {"validation": "en/c4-validation.*.json.gz"} + >>> c4_validation = load_dataset("allenai/c4", data_files=data_files, split="validation") + +Local and remote files +----------------------- + +πŸ€— Datasets can be loaded from local files stored on your computer, and also from remote files. The datasets are most likely stored as a ``csv``, ``json``, ``txt`` or ``parquet`` file. The :func:`datasets.load_dataset` method is able to load each of these file types. + +CSV +^^^ + +πŸ€— Datasets can read a dataset made up of one or several CSV files: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('csv', data_files='my_file.csv') + +If you have more than one CSV file: + +.. code:: + + >>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv']) + +You can also map the training and test splits to specific CSV files: + +.. code:: + + >>> dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'] 'test': 'my_test_file.csv'}) + +To load remote CSV files via HTTP, you can pass the URLs: + +.. code:: + + >>> base_url = "https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/" + >>> dataset = load_dataset('csv', data_files={'train': base_url + 'train.csv', 'test': base_url + 'test.csv'}) + +JSON +^^^^ + +JSON files are loaded directly with :func:`datasets.load_dataset` as shown below: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('json', data_files='my_file.json') + +JSON files can have diverse formats, but we think the most efficient format is to have multiple JSON objects; each line represents an individual row of data. For example: + +.. code-block:: + + {"a": 1, "b": 2.0, "c": "foo", "d": false} + {"a": 4, "b": -5.5, "c": null, "d": true} + +Another JSON format you may encounter is a nested field, in which case you will need to specify the ``field`` argument as shown in the following: + +.. code-block:: + + {"version": "0.1.0", + "data": [{"a": 1, "b": 2.0, "c": "foo", "d": false}, + {"a": 4, "b": -5.5, "c": null, "d": true}] + } + + >>> from datasets import load_dataset + >>> dataset = load_dataset('json', data_files='my_file.json', field='data') + +To load remote JSON files via HTTP, you can pass the URLs: + +.. code-block:: + + >>> base_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/" + >>> dataset = load_dataset('json', data_files={'train': base_url + 'train-v1.1.json', 'validation': base_url + 'dev-v1.1.json'}, field="data") + +While these are the most common JSON formats, you will see other datasets that are formatted differently. πŸ€— Datasets recognizes these other formats, and will fallback accordingly on the Python JSON loading methods to handle them. 
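+
+Note that compressed JSON Lines files can typically be loaded directly as well, since input files are decompressed automatically based on their file extension (as with the ``.json.gz`` files in the C4 example above). A minimal sketch, assuming a hypothetical local file named ``my_file.json.gz``:
+
+.. code-block::
+
+    >>> from datasets import load_dataset
+    >>> # gzip-compressed JSON Lines file; decompression is handled from the extension
+    >>> dataset = load_dataset('json', data_files='my_file.json.gz')
+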
+ +Text files +^^^^^^^^^^ + +Text files are one of the most common file types for storing a dataset. πŸ€— Datasets will read the text file line by line to build the dataset. + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('text', data_files={'train': ['my_text_1.txt', 'my_text_2.txt'], 'test': 'my_test_file.txt'}) + +To load remote TXT files via HTTP, you can pass the URLs: + +.. code-block:: + + >>> dataset = load_dataset('text', data_files='https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt') + +Parquet +^^^^^^^ + +Parquet files are stored in a columnar format unlike row-based files like CSV. Large datasets may be stored in a Parquet file because it is more efficient, and faster at returning your query. Load a Parquet file as shown in the following example: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset("parquet", data_files={'train': 'train.parquet', 'test': 'test.parquet'}) + +To load remote parquet files via HTTP, you can pass the URLs: + + >>> base_url = "https://storage.googleapis.com/huggingface-nlp/cache/datasets/wikipedia/20200501.en/1.0.0/" + >>> data_files = {"train": base_url + "wikipedia-train.parquet"} + >>> wiki = load_dataset("parquet", data_files=data_files, split="train") + +In-memory data +-------------- + +πŸ€— Datasets will also allow you to create a :class:`datasets.Dataset` directly from in-memory data structures like Python dictionaries and Pandas DataFrames. + +Python dictionary +^^^^^^^^^^^^^^^^^ + +Load Python dictionaries with :func:`datasets.Dataset.from_dict`: + +.. code-block:: + + >>> from datasets import Dataset + >>> my_dict = {"a": [1, 2, 3]} + >>> dataset = Dataset.from_dict(my_dict) + +Pandas DataFrame +^^^^^^^^^^^^^^^^ + +Load Pandas DataFrames with :func:`datasets.Dataset.from_pandas`: + +.. code-block:: + + >>> from datasets import Dataset + >>> import pandas as pd + >>> df = pd.DataFrame({"a": [1, 2, 3]}) + >>> dataset = Dataset.from_pandas(df) + +.. important:: + + An object data type in `pandas.Series `_ doesn't always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. Avoid potential errors by constructing an explicit schema with :class:`datasets.Features` using the ``from_dict`` or ``from_pandas`` methods. See the :ref:`troubleshoot` for more details on how to explicitly specify your own features. + +Offline +------- + +Even if you don't have an internet connection, it is still possible to load a dataset. As long as you've downloaded a dataset from the Hub or πŸ€— Datasets GitHub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline. + +If you know you won't have internet access, you can run πŸ€— Datasets in full offline mode. This saves time because instead of waiting for the Dataset builder download to time out, πŸ€— Datasets will look directly in the cache. Set the environment variable ``HF_DATASETS_OFFLINE`` to ``1`` to enable full offline mode. + +Slice splits +------------ + +For even greater control over how to load a split, you can choose to only load specific slices of a split. There are two options for slicing a split: using strings or :class:`datasets.ReadInstruction`. Strings are more compact and readable for simple cases, while :class:`datasets.ReadInstruction` is easier to use with variable slicing parameters. 
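+
+For example, :class:`datasets.ReadInstruction` is convenient when the slice boundaries are held in ordinary Python variables. A minimal sketch (the boundary values below are arbitrary placeholders):
+
+.. code-block::
+
+    >>> import datasets
+    >>> # hypothetical slice boundaries stored in variables
+    >>> start, stop = 25, 75
+    >>> ri = datasets.ReadInstruction('train', from_=start, to=stop, unit='%')
+    >>> train_slice = datasets.load_dataset('bookcorpus', split=ri)
+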
+ +Concatenate the ``train`` and ``test`` split by: + +.. tab:: String API + + >>> train_test_ds = datasets.load_dataset('bookcorpus', split='train+test') + +.. tab:: ReadInstruction + + >>> ri = datasets.ReadInstruction('train') + datasets.ReadInstruction('test') + >>> train_test_ds = datasets.load_dataset('bookcorpus', split=ri) + +Select specific rows of the ``train`` split: + +.. tab:: String API + + >>> train_10_20_ds = datasets.load_dataset('bookcorpus', split='train[10:20]') + +.. tab:: ReadInstruction + + >>> train_10_20_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction('train', from_=10, to=20, unit='abs')) + +Or select a percentage of the split with: + +.. tab:: String API + + >>> train_10pct_ds = datasets.load_dataset('bookcorpus', split='train[:10%]') + +.. tab:: ReadInstruction + + >>> train_10_20_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction('train', to=10, unit='%')) + +You can even select a combination of percentages from each split: + +.. tab:: String API + + >>> train_10_80pct_ds = datasets.load_dataset('bookcorpus', split='train[:10%]+train[-80%:]') + +.. tab:: ReadInstruction + + >>> ri = (datasets.ReadInstruction('train', to=10, unit='%') + datasets.ReadInstruction('train', from_=-80, unit='%')) + >>> train_10_80pct_ds = datasets.load_dataset('bookcorpus', split=ri) + +Finally, create cross-validated dataset splits by: + +.. tab:: String API + + >>> # 10-fold cross-validation (see also next section on rounding behavior): + >>> # The validation datasets are each going to be 10%: + >>> # [0%:10%], [10%:20%], ..., [90%:100%]. + >>> # And the training datasets are each going to be the complementary 90%: + >>> # [10%:100%] (for a corresponding validation set of [0%:10%]), + >>> # [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ..., + >>> # [0%:90%] (for a validation set of [90%:100%]). + >>> vals_ds = datasets.load_dataset('bookcorpus', split=[f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)]) + >>> trains_ds = datasets.load_dataset('bookcorpus', split=[f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10)]) + +.. tab:: ReadInstruction + + >>> # 10-fold cross-validation (see also next section on rounding behavior): + >>> # The validation datasets are each going to be 10%: + >>> # [0%:10%], [10%:20%], ..., [90%:100%]. + >>> # And the training datasets are each going to be the complementary 90%: + >>> # [10%:100%] (for a corresponding validation set of [0%:10%]), + >>> # [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ..., + >>> # [0%:90%] (for a validation set of [90%:100%]). + >>> vals_ds = datasets.load_dataset('bookcorpus', [datasets.ReadInstruction('train', from_=k, to=k+10, unit='%') for k in range(0, 100, 10)]) + >>> trains_ds = datasets.load_dataset('bookcorpus', [(datasets.ReadInstruction('train', to=k, unit='%') + datasets.ReadInstruction('train', from_=k+10, unit='%')) for k in range(0, 100, 10)]) + +Percent slicing and rounding +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For datasets where the requested slice boundaries do not divide evenly by 100, the default behavior is to round the boundaries to the nearest integer. As a result, some slices may contain more examples than others as shown in the following example: + +.. code-block:: + + # Assuming `train` split contains 999 records. + # 19 records, from 500 (included) to 519 (excluded). + >>> train_50_52_ds = datasets.load_dataset('bookcorpus', split='train[50%:52%]') + # 20 records, from 519 (included) to 539 (excluded). 
+ >>> train_52_54_ds = datasets.load_dataset('bookcorpus', split='train[52%:54%]') + +If you want equal sized splits, use ``pct1_dropremainder`` rounding instead. This will treat the specified percentage boundaries as multiples of 1%. + +.. code-block:: + + # 18 records, from 450 (included) to 468 (excluded). + >>> train_50_52pct1_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction( 'train', from_=50, to=52, unit='%', rounding='pct1_dropremainder')) + # 18 records, from 468 (included) to 486 (excluded). + >>> train_52_54pct1_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction('train',from_=52, to=54, unit='%', rounding='pct1_dropremainder')) + # Or equivalently: + >>> train_50_52pct1_ds = datasets.load_dataset('bookcorpus', split='train[50%:52%](pct1_dropremainder)') + >>> train_52_54pct1_ds = datasets.load_dataset('bookcorpus', split='train[52%:54%](pct1_dropremainder)') + +.. important:: + + Using ``pct1_dropremainder`` rounding may truncate the last examples in a dataset if the number of examples in your dataset don't divide evenly by 100. + +.. _troubleshoot: + +Troubleshooting +--------------- + +Sometimes, you may get unexpected results when you load a dataset. In this section, you will learn how to solve two common issues you may encounter when you load a dataset: manually download a dataset, and specify features of a dataset. + +Manual download +^^^^^^^^^^^^^^^ + +Certain datasets require you to manually download the dataset files due to licensing incompatibility, or if the files are hidden behind a login page. This will cause :func:`datasets.load_dataset` to throw an ``AssertionError``. But πŸ€— Datasets provides detailed instructions for downloading the missing files. After you have downloaded the files, use the ``data_dir`` argument to specify the path to the files you just downloaded. + +For example, if you try to download a configuration from the `MATINF `_ dataset: + +.. code-block:: + + >>> dataset = load_dataset("matinf", "summarization") + Downloading and preparing dataset matinf/summarization (download: Unknown size, generated: 246.89 MiB, post-processed: Unknown size, total: 246.89 MiB) to /root/.cache/huggingface/datasets/matinf/summarization/1.0.0/82eee5e71c3ceaf20d909bca36ff237452b4e4ab195d3be7ee1c78b53e6f540e... + AssertionError: The dataset matinf with config summarization requires manual data. + Please follow the manual download instructions: To use MATINF you have to download it manually. Please fill this google form (https://forms.gle/nkH4LVE4iNQeDzsc9). You will receive a download link and a password once you complete the form. Please extract all files in one folder and load the dataset with: `datasets.load_dataset('matinf', data_dir='path/to/folder/folder_name')`. + Manual data can be loaded with `datasets.load_dataset(matinf, data_dir='') + +Specify features +^^^^^^^^^^^^^^^^ + +When you create a dataset from local files, the :class:`datasets.Features` are automatically inferred by `Apache Arrow `_. However, the features of the dataset may not always align with your expectations or you may want to define the features yourself. + +The following example shows how you can add custom labels with :class:`datasets.ClassLabel`. First, define your own labels using the :class:`datasets.Features` class: + +.. 
code-block:: + + >>> class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"] + >>> emotion_features = Features({'text': Value('string'), 'label': ClassLabel(names=class_names)}) + +Next, specify the ``features`` argument in :func:`datasets.load_dataset` with the features you just created: + +.. code:: + + >>> dataset = load_dataset('csv', data_files=file_dict, delimiter=';', column_names=['text', 'label'], features=emotion_features) + +Now when you look at your dataset features, you can see it uses the custom labels you defined: + +.. code:: + + >>> dataset['train'].features + {'text': Value(dtype='string', id=None), + 'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], names_file=None, id=None)} + +Metrics +------- + +When the metric you want to use is not supported by πŸ€— Datasets, you can write and use your own metric script. Load your metric by providing the path to your local metric loading script: + +.. code-block:: + + >>> from datasets import load_metric + >>> metric = load_metric('PATH/TO/MY/METRIC/SCRIPT') + + >>> # Example of typical usage + >>> for batch in dataset: + ... inputs, references = batch + ... predictions = model(inputs) + ... metric.add_batch(predictions=predictions, references=references) + >>> score = metric.compute() + +.. seealso:: + + See the :ref:`metric_script` guide for more details on how to write your own metric loading script. + +Load configurations +^^^^^^^^^^^^^^^^^^^ + +It is possible for a metric to have different configurations. The configurations are stored in the :attr:`datasets.Metric.config_name` attribute. When you load a metric, provide the configuration name as shown in the following: + +.. code-block:: + + >>> from datasets import load_metric + >>> metric = load_metric('bleurt', name='bleurt-base-128') + >>> metric = load_metric('bleurt', name='bleurt-base-512') + +Distributed setup +^^^^^^^^^^^^^^^^^ + +When you work in a distributed or parallel processing environment, loading and computing a metric can be tricky because these processes are executed in parallel on separate subsets of the data. πŸ€— Datasets supports distributed usage with a few additional arguments when you load a metric. + +For example, imagine you are training and evaluating on eight parallel processes. Here's how you would load a metric in this distributed setting: + +1. Define the total number of processes with the ``num_process`` argument. + +2. Set the process ``rank`` as an integer between zero and ``num_process - 1``. + +3. Load your metric with :func:`datasets.load_metric` with these arguments: + +.. code-block:: + + >>> from datasets import load_metric + >>> metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=rank) + +.. tip:: + + Once you've loaded a metric for distributed usage, you can compute the metric as usual. Behind the scenes, :func:`datasets.Metric.compute` gathers all the predictions and references from the nodes, and computes the final metric. + +In some instances, you may be simultaneously running multiple independent distributed evaluations on the same server and files. To avoid any conflicts, it is important to provide an ``experiment_id`` to distinguish the separate evaluations: + +.. 
code-block:: + + >>> from datasets import load_metric + >>> metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=process_id, experiment_id="My_experiment_10") \ No newline at end of file diff --git a/docs/source/loading_datasets.rst b/docs/source/loading_datasets.rst deleted file mode 100644 index 260ac2ef3f6..00000000000 --- a/docs/source/loading_datasets.rst +++ /dev/null @@ -1,571 +0,0 @@ -Loading a Dataset -============================================================== - -A :class:`datasets.Dataset` can be created from various sources of data: - -- from the `Hugging Face Hub `__, -- from local or remote files, e.g. CSV/JSON/text/parquet/pandas files, or -- from in-memory data like python dict or a pandas dataframe. - -In this section we study each option. - -From the Hugging Face Hub -------------------------------------------------- - -Over 1,000 datasets for many NLP tasks like text classification, question answering, language modeling, etc, are provided on the `Hugging Face Hub `__ and can be viewed and explored online with the `πŸ€— Datasets viewer `__. - -.. note:: - - You can also add a new dataset to the Hub to share with the community as detailed in the guide on :doc:`adding a new dataset `. - -All the datasets currently available on the `Hub `__ can be listed using :func:`datasets.list_datasets`: - -.. code-block:: - - >>> from datasets import list_datasets - >>> datasets_list = list_datasets() - >>> len(datasets_list) - 1103 - >>> print(', '.join(dataset for dataset in datasets_list)) - acronym_identification, ade_corpus_v2, adversarial_qa, aeslc, afrikaans_ner_corpus, ag_news, ai2_arc, air_dialogue, ajgt_twitter_ar, - allegro_reviews, allocine, alt, amazon_polarity, amazon_reviews_multi, amazon_us_reviews, ambig_qa, amttl, anli, app_reviews, aqua_rat, - aquamuse, ar_cov19, ar_res_reviews, ar_sarcasm, arabic_billion_words, arabic_pos_dialect, arabic_speech_corpus, arcd, arsentd_lev, art, - arxiv_dataset, ascent_kb, aslg_pc12, asnq, asset, assin, assin2, atomic, autshumato, babi_qa, banking77, bbaw_egyptian, bbc_hindi_nli, - bc2gm_corpus, best2009, bianet, bible_para, big_patent, billsum, bing_coronavirus_query_set, biomrc, blended_skill_talk, blimp, - blog_authorship_corpus, bn_hate_speech [...] - - -To load a dataset from the Hub we use the :func:`datasets.load_dataset` command and give it the short name of the dataset you would like to load as listed above or on the `Hub `__. - -Let's load the **SQuAD dataset for Question Answering**. You can explore this dataset and find more details about it `on the online viewer here `__ (which is actually just a wrapper on top of the :class:`datasets.Dataset` we will now create): - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('squad', split='train') - -This call to :func:`datasets.load_dataset` does the following steps under the hood: - -1. Download and import in the library the **SQuAD python processing script** from Hugging Face github repository or AWS bucket if it's not already stored in the library. - -.. note:: - - Processing scripts are small python scripts which define the info (citation, description) and format of the dataset and contain the URL to the original SQuAD JSON files and the code to load examples from the original SQuAD JSON files. You can find the SQuAD processing script `here `__ for instance. - -2. 
Run the SQuAD python processing script which will download the SQuAD dataset from the original URL (if it's not already downloaded and cached) and process and cache all SQuAD in a cache Arrow table for each standard split stored on the drive. - -.. note:: - - An Apache Arrow Table is the internal storing format for πŸ€— Datasets. It allows to store an arbitrarily long dataframe, - typed with potentially complex nested types that can be mapped to numpy/pandas/python types. Apache Arrow allows you - to map blobs of data on-drive without doing any deserialization. So caching the dataset directly on disk can use - memory-mapping and pay effectively zero cost with O(1) random access. Alternatively, you can copy it in CPU memory - (RAM) by setting the ``keep_in_memory`` argument of :func:`datasets.load_dataset` to ``True``. - The default in πŸ€— Datasets is to memory-map the dataset on disk unless you set ``datasets.config.IN_MEMORY_MAX_SIZE`` - different from ``0`` bytes (default). In that case, the dataset will be copied in-memory if its size is smaller than - ``datasets.config.IN_MEMORY_MAX_SIZE`` bytes, and memory-mapped otherwise. This behavior can be enabled by setting - either the configuration option ``datasets.config.IN_MEMORY_MAX_SIZE`` (higher precedence) or the environment - variable ``HF_DATASETS_IN_MEMORY_MAX_SIZE`` (lower precedence) to nonzero. - -3. Return a **dataset built from the splits** asked by the user (default: all); in the above example we create a dataset with the train split. - - -Selecting a split -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -If you don't provide a :obj:`split` argument to :func:`datasets.load_dataset`, this method will return a dictionary containing a datasets for each split in the dataset. - -.. code-block:: - - >>> from datasets import load_dataset - >>> datasets = load_dataset('squad') - >>> print(datasets) - DatasetDict({ - train: Dataset({ - features: ['id', 'title', 'context', 'question', 'answers'], - num_rows: 87599 - }) - validation: Dataset({ - features: ['id', 'title', 'context', 'question', 'answers'], - num_rows: 10570 - }) - }) - -The :obj:`split` argument can actually be used to control extensively the generated dataset split. You can use this argument to build a split from only a portion of a split in absolute number of examples or in proportion (e.g. :obj:`split='train[:10%]'` will load only the first 10% of the train split) or to mix splits (e.g. :obj:`split='train[:100]+validation[:100]'` will create a split from the first 100 examples of the train split and the first 100 examples of the validation split). - -You can find more details on the syntax for using :obj:`split` on the :doc:`dedicated tutorial on split <./splits>`. - -Selecting a configuration -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Some datasets comprise several :obj:`configurations`. A Configuration defines a sub-part of a dataset which can be selected. Unlike split, you have to select a single configuration for the dataset, you cannot mix several configurations. Examples of dataset with several configurations are: - -- the **GLUE** dataset which is an agregated benchmark comprised of 10 subsets: COLA, SST2, MRPC, QQP, STSB, MNLI, QNLI, RTE, WNLI and the diagnostic subset AX. -- the **wikipedia** dataset which is provided for several languages. - -When a dataset is provided with more than one :obj:`configuration`, you will be requested to explicitely select a configuration among the possibilities. 
- -Selecting a configuration is done by providing :func:`datasets.load_dataset` with a :obj:`name` argument. Here is an example for **GLUE**: - -.. code-block:: - - >>> from datasets import load_dataset - - >>> dataset = load_dataset('glue') - ValueError: Config name is missing. - Please pick one among the available configs: ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax'] - Example of usage: - `load_dataset('glue', 'cola')` - - >>> dataset = load_dataset('glue', 'sst2') - Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, total: 11.90 MiB) to /Users/thomwolf/.cache/huggingface/datasets/glue/sst2/1.0.0... - Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7.44M/7.44M [00:01<00:00, 7.03MB/s] - Dataset glue downloaded and prepared to /Users/huggignface/.cache/huggingface/datasets/glue/sst2/1.0.0. Subsequent calls will reuse this data. - >>> print(dataset) - DatasetDict({ - train: Dataset({ - features: ['sentence', 'label', 'idx'], - num_rows: 67349 - }) - validation: Dataset({ - features: ['sentence', 'label', 'idx'], - num_rows: 872 - }) - test: Dataset({ - features: ['sentence', 'label', 'idx'], - num_rows: 1821 - }) - }) - -Manually downloading files -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Some dataset require you to download manually some files, usually because of licencing issues or when these files are behind a login page. - -In this case specific instruction for dowloading the missing files will be provided when running the script with :func:`datasets.load_dataset` for the first time to explain where and how you can get the files. - -After you've downloaded the files, you can point to the folder hosting them locally with the :obj:`data_dir` argument as follows: - -.. code-block:: - - >>> dataset = load_dataset("xtreme", "PAN-X.fr") - Downloading and preparing dataset xtreme/PAN-X.fr (download: Unknown size, generated: 5.80 MiB, total: 5.80 MiB) to /Users/thomwolf/.cache/huggingface/datasets/xtreme/PAN-X.fr/1.0.0... - AssertionError: The dataset xtreme with config PAN-X.fr requires manual data. - Please follow the manual download instructions: You need to manually download the AmazonPhotos.zip file on Amazon Cloud Drive (https://www.amazon.com/clouddrive/share/d3KGCRCIYwhKJF0H3eWA26hjg2ZCRhjpEQtDL70FSBN). The folder containing the saved file can be used to load the dataset via 'datasets.load_dataset("xtreme", data_dir="")' - - -Apart from :obj:`name` and :obj:`split`, the :func:`datasets.load_dataset` method provide a few arguments which can be used to control where the data is cached (:obj:`cache_dir`), some options for the download process it-self like the proxies and whether the download cache should be used (:obj:`download_config`, :obj:`download_mode`). - -The use of these arguments is discussed in the :ref:`load_dataset_cache_management` section below. You can also find the full details on these arguments on the package reference page for :func:`datasets.load_dataset`. - -From a community dataset on the Hugging Face Hub ------------------------------------------------------------ - -The community shares hundreds of datasets on the Hugging Face Hub using **dataset repositories**. -A dataset repository is a versioned repository of data files. -Everyone can create a dataset repository on the Hugging Face Hub and upload their data. 
- -For example we have created a demo dataset at https://huggingface.co/datasets/lhoestq/demo1. -In this dataset repository we uploaded some CSV files, and you can load the dataset with: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('lhoestq/demo1') - -You can even choose which files to load from a dataset repository. -For example you can load a subset of the **C4 dataset for language modeling**, hosted by AllenAI on the Hub. -You can browse the dataset repository at https://huggingface.co/datasets/allenai/c4 - -In the following example we specify which subset of the files to use with the ``data_files`` parameter: - -.. code-block:: - - >>> from datasets import load_dataset - >>> c4_subset = load_dataset('allenai/c4', data_files='en/c4-train.0000*-of-01024.json.gz') - - -You can also specify custom splits: - -.. code-block:: - - >>> data_files = {"validation": "en/c4-validation.*.json.gz"} - >>> c4_validation = load_dataset("allenai/c4", data_files=data_files, split="validation") - -In these examples, ``load_dataset`` will return all the files that match the Unix style pattern passed in ``data_files``. -If you don't specify which data files to use, it will use all the data files (here all C4 is about 13TB of data). - - -.. _loading-from-local-files: - -From local or remote files ------------------------------------------------------------ - -It's also possible to create a dataset from your own local or remote files. - -Generic loading scripts are provided for: - -- CSV files (with the :obj:`csv` script), -- JSON files (with the :obj:`json` script), -- text files (read as a line-by-line dataset with the :obj:`text` script), -- parquet files (with the :obj:`parquet` script). -- pandas pickled dataframe (with the :obj:`pandas` script). - -If you want more fine-grained control on how your files are loaded or if you have a file format that matches the format for one of the datasets provided on the `Hugging Face Hub `__, it can be more simpler to create **your own loading script**, from scratch or by adapting one of the provided loading scripts. In this case, please go check the :doc:`add_dataset` section. - -The :obj:`data_files` argument in :func:`datasets.load_dataset` is used to provide paths to one or several data source files. This argument currently accepts three types of inputs: - -- :obj:`str`: A single string as the path to a single file (considered to constitute the `train` split by default). -- :obj:`Sequence[str]`: A list of strings as paths to a list of files (also considered to constitute the `train` split by default). -- :obj:`Mapping[str, Union[str, Sequence[str]]`: A dictionary mapping splits names to a single file path or a list of file paths. - -Let's see an example of all the various ways you can provide files to :func:`datasets.load_dataset`: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('csv', data_files='my_file.csv') - >>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv']) - >>> dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], - 'test': 'my_test_file.csv'}) - >>> base_url = 'https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/' - >>> dataset = load_dataset('csv', data_files={'train': base_url + 'train.csv', 'test': base_url + 'test.csv'}) - -.. 
note:: - - The :obj:`split` argument will work similarly to what we detailed above for the datasets on the Hub and you can find more details on the syntax for using :obj:`split` on the :doc:`dedicated tutorial on split <./splits>`. The only specific behavior related to loading local files is that if you don't indicate which split each files is related to, the provided files are assumed to belong to the **train** split. - - -.. note:: - - If you use a private dataset repository on the Hub, you just need to pass ``use_auth_token=True`` to ``load_dataset`` after logging in with the ``huggingface-cli login`` bash command. Alternatively you can pass your `API token `__ in ``use_auth_token``. - - -CSV files -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -πŸ€— Datasets can read a dataset made of one or several CSV files. - -All the CSV files in the dataset should have the same organization and in particular the same datatypes for the columns. - -A few interesting features are provided out-of-the-box by the Apache Arrow backend: - -- multi-threaded or single-threaded reading -- automatic decompression of input files (based on the filename extension, such as my_data.csv.gz) -- fetching column names from the first row in the CSV file -- column-wise type inference and conversion to one of null, int64, float64, timestamp[s], string or binary data -- detecting various spellings of null values such as NaN or #N/A - -Here is an example loading two CSV file to create a ``train`` split (default split unless specify otherwise): - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv']) - -You can also provide the URLs of remote csv files: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('csv', data_files="https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/train.csv") - -The ``csv`` loading script provides a few simple access options to control parsing and reading the CSV files: - - - :obj:`skiprows` (int) - Number of first rows in the file to skip (default is 0) - - :obj:`column_names` (list, optional) – The column names of the target table. If empty, fall back on autogenerate_column_names (default: empty). - - :obj:`delimiter` (1-character string) – The character delimiting individual cells in the CSV data (default ``,``). - - :obj:`quotechar` (1-character string) – The character used optionally for quoting CSV values (default ``"``). - - :obj:`quoting` (int) – Control quoting behavior (default 0, setting this to 3 disables quoting, refer to `pandas.read_csv documentation ` for more details). - - -JSON files -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -πŸ€— Datasets supports building a dataset from JSON files in various formats. - -The most efficient format is to have JSON files consisting of multiple JSON objects, one per line, representing individual data rows: - -.. code-block:: - - {"a": 1, "b": 2.0, "c": "foo", "d": false} - {"a": 4, "b": -5.5, "c": null, "d": true} - -In this case, interesting features are provided out-of-the-box by the Apache Arrow backend: - -- multi-threaded reading -- automatic decompression of input files (based on the filename extension, such as my_data.json.gz) -- sophisticated type inference (see below) - -You can load such a dataset direcly with: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('json', data_files='my_file.json') - -You can also provide the URLs of remote JSON files: - -.. 
code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('json', data_files='https://huggingface.co/datasets/allenai/c4/resolve/main/en/c4-train.00000-of-01024.json.gz') - -In real-life though, JSON files can have diverse format and the ``json`` script will accordingly fallback on using python JSON loading methods to handle various JSON file format. - -One common occurence is to have a JSON file with a single root dictionary where the dataset is contained in a specific field, as a list of dicts or a dict of lists. - -.. code-block:: - - {"version": "0.1.0", - "data": [{"a": 1, "b": 2.0, "c": "foo", "d": false}, - {"a": 4, "b": -5.5, "c": null, "d": true}] - } - -In this case you will need to specify which field contains the dataset using the :obj:`field` argument as follows: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('json', data_files='my_file.json', field='data') - - -Text files -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -πŸ€— Datasets also supports building a dataset from text files read line by line (each line will be a row in the dataset). - -This is simply done using the ``text`` loading script which will generate a dataset with a single column called ``text`` containing all the text lines of the input files as strings. - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('text', data_files={'train': ['my_text_1.txt', 'my_text_2.txt'], 'test': 'my_test_file.txt'}) - -You can also provide the URLs of remote text files: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('text', data_files={'train': 'https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt'}) - - -Specifying the features of the dataset -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -When you create a dataset from local files, the :class:`datasets.Features` of the dataset are automatically guessed using an automatic type inference system based on `Apache Arrow Automatic Type Inference `__. - -However sometime you may want to define yourself the features of the dataset, for instance to control the names and indices of labels using a :class:`datasets.ClassLabel`. - -In this case you can use the :obj:`features` arguments to :func:`datasets.load_dataset` to supply a :class:`datasets.Features` instance definining the features of your dataset and overriding the default pre-computed features. - -From in-memory data ------------------------------------------------------------ - -Eventually, it's also possible to instantiate a :class:`datasets.Dataset` directly from in-memory data, currently: - -- a python dict, or -- a pandas dataframe. - -From a python dictionary -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Let's say that you have already loaded some data in a in-memory object in your python session: - -.. code-block:: - - >>> my_dict = {'id': [0, 1, 2], - >>> 'name': ['mary', 'bob', 'eve'], - >>> 'age': [24, 53, 19]} - -You can then directly create a :class:`datasets.Dataset` object using the :func:`datasets.Dataset.from_dict` or the :func:`datasets.Dataset.from_pandas` class methods of the :class:`datasets.Dataset` class: - -.. code-block:: - - >>> from datasets import Dataset - >>> dataset = Dataset.from_dict(my_dict) - -From a pandas dataframe -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -You can similarly instantiate a Dataset object from a ``pandas`` DataFrame: - -.. 
code-block:: - - >>> from datasets import Dataset - >>> import pandas as pd - >>> df = pd.DataFrame({"a": [1, 2, 3]}) - >>> dataset = Dataset.from_pandas(df) - -.. note:: - - The column types in the resulting Arrow Table are inferred from the dtypes of the pandas.Series in the DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the case of `object`, we need to guess the datatype by looking at the Python objects in this Series. - - Be aware that Series of the `object` dtype don't carry enough information to always lead to a meaningful Arrow type. In the case that we cannot infer a type, e.g. because the DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. This behavior can be avoided by constructing an explicit schema and passing it to this function. - -To be sure that the schema and type of the instantiated :class:`datasets.Dataset` are as intended, you can explicitely provide the features of the dataset as a :class:`datasets.Features` object to the ``from_dict`` and ``from_pandas`` methods. - -Using a custom dataset loading script ------------------------------------------------------------ - -If the provided loading scripts for Hub dataset or for local files are not adapted for your use case, you can also easily write and use your own dataset loading script. - -You can use a local loading script by providing its path instead of the usual shortcut name: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('PATH/TO/MY/LOADING/SCRIPT', data_files='PATH/TO/MY/FILE') - -We provide more details on how to create your own dataset generation script on the :doc:`add_dataset` page and you can also find some inspiration in all the already provided loading scripts on the `GitHub repository `__. - -.. _load_dataset_cache_management: - - -Loading datasets in streaming mode ------------------------------------------------------------ - -When a dataset is in streaming mode, you can iterate over it directly without having to download the entire dataset. -The data are downloaded progressively as you iterate over the dataset. -You can enable dataset streaming by passing ``streaming=True`` in the :func:`load_dataset` function to get an iterable dataset. - -For example, you can start iterating over big datasets like OSCAR without having to download terabytes of data using this code: - - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True) - >>> print(next(iter(dataset))) - {'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visit to Malawi. Chief Napoleon conveyed the desperate need for a program to intervene and care for the orphans and vulnerable children (OVC) in Malawi, and John committed to help... - -.. note:: - - A dataset in streaming mode is not a :class:`datasets.Dataset` object, but an :class:`datasets.IterableDataset` object. You can find more information about iterable datasets in the `dataset streaming documentation `__ - -Cache management and integrity verifications ------------------------------------------------------------ - -Cache directory -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -To avoid re-downloading the whole dataset every time you use it, the `datasets` library caches the data on your computer. 
- -By default, the `datasets` library caches the datasets and the downloaded data files under the following directory: `~/.cache/huggingface/datasets`. - -If you want to change the location where the datasets cache is stored, simply set the `HF_DATASETS_CACHE` environment variable. For example, if you're using linux: - -.. code-block:: - - $ export HF_DATASETS_CACHE="/path/to/another/directory" - -In addition, you can control where the data is cached when invoking the loading script, by setting the :obj:`cache_dir` parameter: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR") - -Download mode -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -You can control the way the the :func:`datasets.load_dataset` function handles already downloaded data by setting its :obj:`download_mode` parameter. - -By default, :obj:`download_mode` is set to ``"reuse_dataset_if_exists"``. The :func:`datasets.load_dataset` function will reuse both raw downloads and the prepared dataset, if they exist in the cache directory. - -The following table describes the three available modes for download: - -.. list-table:: Behavior of :func:`datasets.load_dataset` depending on :obj:`download_mode` - :header-rows: 1 - - * - :obj:`download_mode` parameter value - - Downloaded files (raw data) - - Dataset object - * - ``"reuse_dataset_if_exists"`` (default) - - Reuse - - Reuse - * - ``"reuse_cache_if_exists"`` - - Reuse - - Fresh - * - ``"force_redownload"`` - - Fresh - - Fresh - -For example, you can run the following if you want to force the re-download of the SQuAD raw data files: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('squad', download_mode="force_redownload") - - -Integrity verifications -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -When downloading a dataset from the πŸ€— Datasets Hub, the :func:`datasets.load_dataset` function performs by default a number of verifications on the downloaded files. These verifications include: - -- Verifying the list of downloaded files -- Verifying the number of bytes of the downloaded files -- Verifying the SHA256 checksums of the downloaded files -- Verifying the number of splits in the generated `DatasetDict` -- Verifying the number of samples in each split of the generated `DatasetDict` - -You can disable these verifications by setting the :obj:`ignore_verifications` parameter to ``True``. - -You also have the possibility to locally override the informations used to perform the integrity verifications by setting the :obj:`save_infos` parameter to ``True``. - -For example, run the following to skip integrity verifications when loading the IMDB dataset: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('imdb', ignore_verifications=True) - - -Loading datasets offline -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Each dataset builder (e.g. "squad") is a Python script that is downloaded and cached either from the πŸ€— Datasets GitHub repository or from the `Hugging Face Hub `__. -Only the ``text``, ``csv``, ``json``, ``parquet`` and ``pandas`` builders are included in ``datasets`` without requiring external downloads. - -Therefore if you don't have an internet connection you can't load a dataset that is not packaged with ``datasets``, unless the dataset is already cached. 
-Indeed, if you've already loaded the dataset once before (when you had an internet connection), then the dataset is reloaded from the cache and you can use it offline. - -You can even set the environment variable `HF_DATASETS_OFFLINE` to ``1`` to tell ``datasets`` to run in full offline mode. -This mode disables all the network calls of the library. -This way, instead of waiting for a dataset builder download to time out, the library looks directly at the cache. - -.. _load_dataset_load_builder: - -Loading a dataset builder ------------------------------------------------------------ - -You can use :func:`datasets.load_dataset_builder` to inspect metadata (cache directory, configs, dataset info, etc.) that is required to build a dataset without downloading the dataset itself. - -For example, run the following to get the path to the cache directory of the IMDB dataset: - -.. code-block:: - - >>> from datasets import load_dataset_builder - >>> dataset_builder = load_dataset_builder('imdb') - >>> print(dataset_builder.cache_dir) - /Users/thomwolf/.cache/huggingface/datasets/imdb/plain_text/1.0.0/fdc76b18d5506f14b0646729b8d371880ef1bc48a26d00835a7f3da44004b676 - >>> print(dataset_builder.info.features) - {'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)} - >>> print(dataset_builder.info.splits) - {'train': SplitInfo(name='train', num_bytes=33432835, num_examples=25000, dataset_name='imdb'), 'test': SplitInfo(name='test', num_bytes=32650697, num_examples=25000, dataset_name='imdb'), 'unsupervised': SplitInfo(name='unsupervised', num_bytes=67106814, num_examples=50000, dataset_name='imdb')} - -You can see all the attributes of ``dataset_builder.info`` in the documentation of :class:`datasets.DatasetInfo` - - -.. _load_dataset_enhancing_performance: - -Enhancing performance ------------------------------------------------------------ - -If you would like to speed up dataset operations, you can disable caching and copy the dataset in-memory by setting -``datasets.config.IN_MEMORY_MAX_SIZE`` to a nonzero size (in bytes) that fits in your RAM memory. In that case, the -dataset will be copied in-memory if its size is smaller than ``datasets.config.IN_MEMORY_MAX_SIZE`` bytes, and -memory-mapped otherwise. This behavior can be enabled by setting either the configuration option -``datasets.config.IN_MEMORY_MAX_SIZE`` (higher precedence) or the environment variable -``HF_DATASETS_IN_MEMORY_MAX_SIZE`` (lower precedence) to nonzero. diff --git a/docs/source/loading_metrics.rst b/docs/source/loading_metrics.rst deleted file mode 100644 index 9945eaa1bcf..00000000000 --- a/docs/source/loading_metrics.rst +++ /dev/null @@ -1,205 +0,0 @@ -Loading a Metric -============================================================== - -The library also provides a selection of metrics focusing in particular on: - -- providing a common API accross a range of NLP metrics, -- providing metrics associated to some benchmark datasets provided by the libray such as GLUE or SQuAD, -- providing access to recent and somewhat complex metrics such as BLEURT or BERTScore, -- allowing simple use of metrics in distributed and large-scale settings. - -Metrics in the `datasets` library have a lot in common with how :class:`datasets.Datasets` are loaded and provided using :func:`datasets.load_dataset`. - -Like datasets, metrics are added to the library as small scripts wrapping them in a common API. 
- -A :class:`datasets.Metric` can be created from various source: - -- from a metric script provided on the `HuggingFace Hub `__, or -- from a metric script provide at a local path in the filesystem. - -In this section we detail these options to access metrics. - -From the HuggingFace Hub -------------------------------------------------- - -A range of metrics are provided on the `HuggingFace Hub `__. - -.. note:: - - You can also add new metric to the Hub to share with the community as detailed in the guide on :doc:`adding a new metric`. - -All the metrics currently available on the `Hub `__ can be listed using :func:`datasets.list_metrics`: - -.. code-block:: - - >>> from datasets import list_metrics - >>> metrics_list = list_metrics() - >>> len(metrics_list) - 13 - >>> print(', '.join(metric.id for metric in metrics_list)) - bertscore, bleu, bleurt, coval, gleu, glue, meteor, - rouge, sacrebleu, seqeval, squad, squad_v2, xnli - - -To load a metric from the Hub we use the :func:`datasets.load_metric` command and give it the short name of the metric you would like to load as listed above. - -Let's load the metric associated to the **MRPC subset of the GLUE benchmark for Natural Language Understanding**. You can explore this dataset and find more details about it `on the online viewer here `__ : - -.. code-block:: - - >>> from datasets import load_metric - >>> metric = load_metric('glue', 'mrpc') - >>> - >>> # Example of typical usage - >>> for batch in dataset: - >>> inputs, references = batch - >>> predictions = model(inputs) - >>> metric.add_batch(predictions=predictions, references=references) - >>> score = metric.compute() - -This call to :func:`datasets.load_metric` does the following steps under the hood: - -1. Download and import the **GLUE metric python script** from the Hub if it's not already stored in the library. - - .. note:: - - Metric scripts are small python scripts that define the metrics API and contain the meta-information on the metric (citation, homepage, etc). - Metric scripts sometime need to import additional packages. If these packages are not installed, an explicit message with information on how to install the additional packages should be raised. - You can find the GLUE metric script `here `__ for instance. - -2. Run the python metric script which will **instantiate and return a** :class:`datasets.Metric` **object**, which is in charge of storing the predictions/references and computing the metric values. - - .. note:: - - The :class:`datasets.Metric` object uses Apache Arrow Tables as the internal storing format for predictions and references. It allows to store predictions and references directly on disk with memory-mapping and thus do lazy computation of the metrics, in particular to easily gather the predictions in a distributed setup. The default in πŸ€— Datasets is to always memory-map metrics data on drive. - -Using a custom metric script ------------------------------------------------------------ - -If the provided metrics are not adapted for your use case or you want to test and use a novel metric script, you can also easily write and use your own metric script. - -You can use a local metric script just by providing its path instead of the usual shortcut name: - -.. 
code-block:: - - >>> from datasets import load_metric - >>> metric = load_metric('PATH/TO/MY/METRIC/SCRIPT') - >>> - >>> # Example of typical usage - >>> for batch in dataset: - >>> inputs, references = batch - >>> predictions = model(inputs) - >>> metric.add_batch(predictions=predictions, references=references) - >>> score = metric.compute() - -We provide more details on how to create your own metric script on the :doc:`add_metric` page and you can also find some inspiration in all the already provided metric scripts on the `GitHub repository `__. - - -Special arguments for loading ------------------------------------------------------------ - -In addition to the name of the metric, the :func:`datasets.load_metric` function accept a few arguments to customize the behaviors of the metrics. We detail them in this section. - -Selecting a configuration -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Some metrics comprise several :obj:`configurations`. A Configuration define a specific behavior for a metric which can be selected among several behaviors. - -This is in particular useful for composite benchmarks like GLUE which comprise several sub-sets with different associated metrices. - -For instance the GLUE benchmark comprise 11 sub-sets and this metric was further extended with support for the adversarial `HANS dataset by McCoy et al. `__. Therefore, the GLUE metric is provided with 12 configurations coresponding to various sub-set of this Natural Language Inference benchmark: "sst2", "mnli", "mnli_mismatched", "mnli_matched", "cola", "stsb", "mrpc", "qqp", "qnli", "rte", "wnli", "hans". - -To select a specific configuration of a metric, just provide the configuration name as the second argument to :func:`datasets.load_metric`. - -.. code-block:: - - >>> from datasets import load_metric - >>> metric = load_metric('glue', 'mrpc') - -Distributed setups -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Computing metrics in distributed and parallel processing environments can be tricky since the evaluation on different sub-sets of the data is done in separate python processes. The ``datasets`` library overcomes this difficulty using the method described in this section. - -.. note:: - - When a metric score is additive with regards to the dataset sub-set (meaning that ``f(AβˆͺB) = f(A) + f(B)``) you can use distributed reduce operations to gather the scores computed by different processes. But when a metric is non-additive (``f(AβˆͺB) β‰  f(A) + f(B)``) which happens even for simple metrics like F1, you cannot simply gather the results of metrics evaluation on different sub-sets. A usual way to overcome this issue is to fallback on (inefficient) single process evaluation (e.g. evaluating metrics on a single GPU). The ``datasets`` library solves this problem by allowing distributed evaluation for any type of metric as detailed in this section. - -Let's first see how to use a metric in a distributed setting before giving a few words about the internals. Let's say we train and evaluate a model in 8 parallel processes (e.g. using PyTorch's `DistributedDataParallel `__ on a server with 8 GPUs). - -We assume your python script has access to: - -1. the total number of processes as an integer we'll call ``num_process`` (in our example 8). -2. the process rank as an integer between 0 and ``num_process-1`` that we'll call ``rank`` (in our example between 0 and 7 included). - -Here is how we can instantiate the metric in such a distributed script: - -.. 
code-block:: - - >>> from datasets import load_metric - >>> metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=rank) - -And that's it, you can use the metric on each node as described in :doc:`using_metrics` without taking special care for the distributed setting. In particular, the predictions and references can be computed and provided to the metric separately on each process. By default, the final evaluation of the metric will be done on the first node (rank 0) only when calling :func:`datasets.Metric.compute` after gathering the predictions and references from all the nodes. Computing on other processes (rank > 0) returns ``None``. - -Under the hood :class:`datasets.Metric` uses an Apache Arrow table to store (temporarily) predictions and references for each node on the filesystem, thereby not cluttering the GPU or CPU memory. Once the final metric evalution is requested with :func:`datasets.Metric.compute`, the first node gets access to all the nodes' temp files and reads them to compute the metric at once. - -This way it's possible to perform distributed predictions (which is important for evaluation speed in distributed setting) while allowing to use complex non-additive metrics and not wasting GPU/CPU memory with prediction data. - -The synchronization is performed with the help of file locks on the filesystem. - - -Multiple and independent distributed setups -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -In some cases, several **independent and not related** distributed evaluations might be running on the same server and the same file system at the same time (e.g. two independent multiprocessing trainings running on the same server) and it is then important to distinguish these experiemnts and allow them to operate in independently. - -In this situation you should provide an ``experiment_id`` to :func:`datasets.load_metric` which has to be a unique identifier of the current distributed experiment. - -This identifier will be added to the cache file used by each process of this evaluation to avoid conflicting access to the same cache files for storing predictions and references for each node. - -.. note:: - Specifying an ``experiment_id`` to :func:`datasets.load_metric` is only required in the specific situation where you have **independent (i.e. not related) distributed** evaluations running on the same file system at the same time. - -Here is an example: - - >>> from datasets import load_metric - >>> metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=process_id, experiment_id="My_experiment_10") - -Cache file and in-memory -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -As detailed in :doc:`using_metrics`, each time you call :func:`datasets.Metric.add_batch` or :func:`datasets.Metric.add` in a typical setup as illustrated below, the new predictions and references are added to a temporary storing table. - -.. code-block:: - - >>> from datasets import load_metric - >>> metric = load_metric('glue', 'mrpc') - >>> - >>> # Example of typical usage - >>> for batch in dataset: - >>> inputs, references = batch - >>> predictions = model(inputs) - >>> metric.add_batch(predictions=predictions, references=references) - >>> score = metric.compute() - -By default this table is stored on the drive to avoid consuming GPU/CPU memory. - -You can control the location where this temporary table is stored with the ``cache_dir`` argument of :func:`datasets.load_metric`. 
``cache_dir`` should be provided with the path of a directory in a writable file system. - -Here is an example: - -.. code-block:: - - >>> from datasets import load_metric - >>> metric = load_metric('glue', 'mrpc', cache_dir="MY/CACHE/DIRECTORY") - -Alternatively, it's possible to avoid storing the predictions and references on the drive and keep them in CPU memory (RAM) by setting the ``keep_in_memory`` argument of :func:`datasets.load_metric` to ``True`` as shown here: - -.. code-block:: - - >>> from datasets import load_metric - >>> metric = load_metric('glue', 'mrpc', keep_in_memory=True) - - -.. note:: - Keeping the predictions in-memory is not possible in distributed setting since the CPU memory spaces of the various process are not shared. diff --git a/docs/source/metrics.rst b/docs/source/metrics.rst new file mode 100644 index 00000000000..a8fe7a5f662 --- /dev/null +++ b/docs/source/metrics.rst @@ -0,0 +1,93 @@ +Evaluate predictions +==================== + +πŸ€— Datasets provides various common and NLP-specific `metrics `_ for you to measure your models performance. In the final part of the tutorials, you will load a metric and use it to evaluate your models predictions. + +You can see what metrics are available with :func:`datasets.list_metrics`: + +.. code-block:: + + >>> from datasets import list_metrics + >>> metrics_list = list_metrics() + >>> len(metrics_list) + 28 + >>> print(metrics_list) + ['accuracy', 'bertscore', 'bleu', 'bleurt', 'cer', 'comet', 'coval', 'cuad', 'f1', 'gleu', 'glue', 'indic_glue', 'matthews_correlation', 'meteor', 'pearsonr', 'precision', 'recall', 'rouge', 'sacrebleu', 'sari', 'seqeval', 'spearmanr', 'squad', 'squad_v2', 'super_glue', 'wer', 'wiki_split', 'xnli'] + +Load metric +------------- + +It is very easy to load a metric with πŸ€— Datasets. In fact, you will notice that it is very similar to loading a dataset! Load a metric from the Hub with :func:`datasets.load_metric`: + +.. code-block:: + + >>> from datasets import load_metric + >>> metric = load_metric('glue', 'mrpc') + +This will load the metric associated with the MRPC dataset from the GLUE benchmark. + +Select a configuration +---------------------- + +If you are using a benchmark dataset, you need to select a metric that is associated with the configuration you are using. Select a metric configuration by providing the configuration name: + +.. code:: + + >>> metric = load_metric('glue', 'mrpc') + +Metrics object +-------------- + +Before you begin using a :class:`datasets.Metric` object, you should get to know it a little better. As with a dataset, you can return some basic information about a metric. For example, use :obj:`datasets.Metric.inputs_descriptions` to get more information about a metrics expected input format and some usage examples: + +.. code-block:: + + >>> print(metric.inputs_description) + Compute GLUE evaluation metric associated to each GLUE dataset. + Args: + predictions: list of predictions to score. + Each translation should be tokenized into a list of tokens. + references: list of lists of references for each translation. + Each reference should be tokenized into a list of tokens. 
+ Returns: depending on the GLUE subset, one or several of: + "accuracy": Accuracy + "f1": F1 score + "pearson": Pearson Correlation + "spearmanr": Spearman Correlation + "matthews_correlation": Matthew Correlation + Examples: + >>> glue_metric = datasets.load_metric('glue', 'sst2') # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"] + >>> references = [0, 1] + >>> predictions = [0, 1] + >>> results = glue_metric.compute(predictions=predictions, references=references) + >>> print(results) + {'accuracy': 1.0} + ... + >>> glue_metric = datasets.load_metric('glue', 'mrpc') # 'mrpc' or 'qqp' + >>> references = [0, 1] + >>> predictions = [0, 1] + >>> results = glue_metric.compute(predictions=predictions, references=references) + >>> print(results) + {'accuracy': 1.0, 'f1': 1.0} + ... + +Notice for the MRPC configuration, the metric expects the input format to be zero or one. For a complete list of attributes you can return with your metric, take a look at :class:`datasets.MetricInfo`. + +Compute metric +-------------- + +Once you have loaded a metric, you are ready to use it to evaluate a models predictions. Provide the model predictions and references to :obj:`datasets.Metric.compute`: + +.. code-block:: + + >>> model_predictions = model(model_inputs) + >>> final_score = metric.compute(predictions=model_predictions, references=gold_references) + +What's next? +------------ + +Congratulations, you have completed your first πŸ€— Datasets tutorial! + +Over the course of these tutorials, you learned the basic steps of using πŸ€— Datasets. You loaded a dataset from the Hub, and learned how to access the information stored inside the dataset. Next, you tokenized the dataset into sequences of integers, and formatted it so you can use it with PyTorch or TensorFlow. Finally, you loaded a metric to evaluate your models predictions. This is all you need to get started with πŸ€— Datasets! + +Now that you have a solid grasp of what πŸ€— Datasets can do, you can begin formulating your own questions about how you can use it with your dataset. Please take a look at our :doc:`How-to guides <./how_to>` for more practical help on solving common use-cases, or read our :doc:`Conceptual guides <./about_arrow>` to deepen your understanding about πŸ€— Datasets. \ No newline at end of file diff --git a/docs/source/package_reference/builder_classes.rst b/docs/source/package_reference/builder_classes.rst index ec641528780..b540c22fc61 100644 --- a/docs/source/package_reference/builder_classes.rst +++ b/docs/source/package_reference/builder_classes.rst @@ -1,7 +1,7 @@ -Classes used during the dataset building process ----------------------------------------------------- +Builder classes +--------------- -Two main classes are mostly used during the dataset building process. +πŸ€— Datasets relies on two main classes during the dataset building process: :class:`datasets.DatasetBuilder` and :class:`datasets.BuilderConfig`. .. autoclass:: datasets.DatasetBuilder diff --git a/docs/source/package_reference/loading_methods.rst b/docs/source/package_reference/loading_methods.rst index f2a76985c8a..5f3ed95af22 100644 --- a/docs/source/package_reference/loading_methods.rst +++ b/docs/source/package_reference/loading_methods.rst @@ -1,7 +1,7 @@ Loading methods ----------------------------------------------------- +--------------- -Methods are provided to list and load datasets and metrics. 
+Methods for listing and loading datasets and metrics: Datasets ~~~~~~~~~~~~~~~~~~~~~ @@ -14,9 +14,17 @@ Datasets .. autofunction:: datasets.load_dataset_builder +.. autofunction:: datasets.get_dataset_config_names + +.. autofunction:: datasets.get_dataset_infos + +.. autofunction:: datasets.inspect_dataset + Metrics ~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: datasets.list_metrics .. autofunction:: datasets.load_metric + +.. autofunction:: datasets.inspect_metric diff --git a/docs/source/package_reference/logging_methods.rst b/docs/source/package_reference/logging_methods.rst index 7ca4cb757b3..f9dc9846001 100644 --- a/docs/source/package_reference/logging_methods.rst +++ b/docs/source/package_reference/logging_methods.rst @@ -1,37 +1,33 @@ Logging methods ---------------------------------------------------- -πŸ€— Datasets tries to be very transparent and explicit about its inner working, but this can be quite verbose at times. +πŸ€— Datasets strives to be transparent and explicit about how it works, but this can be quite verbose at times. We have included a series of logging methods which allow you to easily adjust the level of verbosity of the entire library. Currently the default verbosity of the library is set to ``WARNING``. -A series of logging methods let you easily adjust the level of verbosity of the whole library. - -Currently the default verbosity of the library is ``WARNING``. - -To change the level of verbosity, just use one of the direct setters. For instance, here is how to change the verbosity to the INFO level. +To change the level of verbosity, use one of the direct setters. For instance, here is how to change the verbosity to the ``INFO`` level: .. code-block:: python import datasets datasets.logging.set_verbosity_info() -You can also use the environment variable ``DATASETS_VERBOSITY`` to override the default verbosity. You can set it to one of the following: ``debug``, ``info``, ``warning``, ``error``, ``critical``. For example: +You can also use the environment variable ``DATASETS_VERBOSITY`` to override the default verbosity, and set it to one of the following: ``debug``, ``info``, ``warning``, ``error``, ``critical``: .. code-block:: bash DATASETS_VERBOSITY=error ./myprogram.py -All the methods of this logging module are documented below, the main ones are -:func:`datasets.logging.get_verbosity` to get the current level of verbosity in the logger and -:func:`datasets.logging.set_verbosity` to set the verbosity to the level of your choice. In order (from the least -verbose to the most verbose), those levels (with their corresponding int values in parenthesis) are: - -- :obj:`datasets.logging.CRITICAL` or :obj:`datasets.logging.FATAL` (int value, 50): only report the most - critical errors. -- :obj:`datasets.logging.ERROR` (int value, 40): only report errors. -- :obj:`datasets.logging.WARNING` or :obj:`datasets.logging.WARN` (int value, 30): only reports error and - warnings. This the default level used by the library. -- :obj:`datasets.logging.INFO` (int value, 20): reports error, warnings and basic information. -- :obj:`datasets.logging.DEBUG` (int value, 10): report all information. +All the methods of this logging module are documented below. The main ones are: + +* :func:`datasets.logging.get_verbosity` to get the current level of verbosity in the logger +* :func:`datasets.logging.set_verbosity` to set the verbosity to the level of your choice + +In order from the least to the most verbose (with their corresponding ``int`` values): + +1. 
:obj:`datasets.logging.CRITICAL` or :obj:`datasets.logging.FATAL` (int value, 50): only reports the most critical errors. +2. :obj:`datasets.logging.ERROR` (int value, 40): only reports errors. +3. :obj:`datasets.logging.WARNING` or :obj:`datasets.logging.WARN` (int value, 30): only reports errors and warnings. This is the default level used by the library. +4. :obj:`datasets.logging.INFO` (int value, 20): reports errors, warnings, and basic information. +5. :obj:`datasets.logging.DEBUG` (int value, 10): reports all information. Functions diff --git a/docs/source/package_reference/main_classes.rst b/docs/source/package_reference/main_classes.rst index 2e0c075734c..e7e9bf27a4c 100644 --- a/docs/source/package_reference/main_classes.rst +++ b/docs/source/package_reference/main_classes.rst @@ -140,4 +140,10 @@ The base class ``Metric`` implements a Metric backed by one or several :class:`d .. autofunction:: datasets.filesystems.extract_path_from_uri -.. autofunction:: datasets.filesystems.is_remote_filesystem \ No newline at end of file +.. autofunction:: datasets.filesystems.is_remote_filesystem + +``Fingerprint`` +~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: datasets.fingerprint.Hasher + :members: diff --git a/docs/source/package_reference/table_classes.rst b/docs/source/package_reference/table_classes.rst index 01f30dde7c9..381f97a487a 100644 --- a/docs/source/package_reference/table_classes.rst +++ b/docs/source/package_reference/table_classes.rst @@ -1,9 +1,9 @@ Table Classes ---------------------------------------------------- -Each :obj:`datasets.Dataset` object is backed by a pyarrow Table. -A Table can be loaded either from the disk (memory mapped) or in memory. -Several Table types are available, and they all inherit from datasets.table.Table. +Each :obj:`datasets.Dataset` object is backed by a PyArrow Table. +A Table can be loaded from either the disk (memory mapped) or in memory. +Several Table types are available, and they all inherit from :class:`datasets.table.Table`. .. autoclass:: datasets.table.Table diff --git a/docs/source/process.rst b/docs/source/process.rst new file mode 100644 index 00000000000..4026fb4c54b --- /dev/null +++ b/docs/source/process.rst @@ -0,0 +1,597 @@ +Process +======= + +πŸ€— Datasets provides many tools for modifying the structure and content of a dataset. You can rearrange the order of rows or extract nested fields into their own columns. For more powerful processing applications, you can even alter the contents of a dataset by applying a function to the entire dataset to generate new rows and columns. These processing methods provide a lot of control and flexibility to mold your dataset into the desired shape and size with the appropriate features. + +This guide will show you how to: + +* Reorder rows and split the dataset. +* Rename and remove columns, and other common column operations. +* Apply processing functions to each example in a dataset. +* Concatenate datasets. +* Apply a custom formatting transform. +* Save and export processed datasets. + +Load the MRPC dataset from the GLUE benchmark to follow along with our examples: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('glue', 'mrpc', split='train') + +.. attention:: + + All the processing methods in this guide return a new :class:`datasets.Dataset`. Modification is not done in-place. Be careful about overwriting your previous dataset!
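To make the no-in-place behavior concrete, here is a minimal sketch (reusing the MRPC ``dataset`` loaded above; the ``trimmed`` name is only for illustration) showing that a processing method hands back a new :class:`datasets.Dataset` and leaves the original object untouched:

.. code-block::

    >>> # remove_columns() returns a new Dataset instead of modifying `dataset` in place
    >>> trimmed = dataset.remove_columns("label")
    >>> dataset.column_names      # the original object still has its "label" column
    ['sentence1', 'sentence2', 'label', 'idx']
    >>> trimmed.column_names
    ['sentence1', 'sentence2', 'idx']

If you want to keep working with the processed version, reassign it explicitly, for example ``dataset = trimmed``.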
+ +Sort, shuffle, select, split, and shard +--------------------------------------- + +There are several methods for rearranging the structure of a dataset. These methods are useful for selecting only the rows you want, creating train and test splits, and sharding very large datasets into smaller chunks. + +``Sort`` +^^^^^^^^ + +Use :func:`datasets.Dataset.sort` to sort a column's values according to their numerical values. The provided column must be NumPy compatible. + +.. code-block:: + + >>> dataset['label'][:10] + [1, 0, 1, 0, 1, 1, 0, 1, 0, 0] + >>> sorted_dataset = dataset.sort('label') + >>> sorted_dataset['label'][:10] + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] + >>> sorted_dataset['label'][-10:] + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + +``Shuffle`` +^^^^^^^^^^^ + +The :func:`datasets.Dataset.shuffle` method randomly rearranges the rows of the dataset. You can specify the ``generator`` argument in this method to use a different ``numpy.random.Generator`` if you want more control over the algorithm used to shuffle the dataset. + +.. code-block:: + + >>> shuffled_dataset = sorted_dataset.shuffle(seed=42) + >>> shuffled_dataset['label'][:10] + [1, 1, 1, 0, 1, 1, 1, 1, 1, 0] + +``Select`` and ``Filter`` +^^^^^^^^^^^^^^^^^^^^^^^^^ + +There are two options for filtering rows in a dataset: :func:`datasets.Dataset.select` and :func:`datasets.Dataset.filter`. + +* :func:`datasets.Dataset.select` returns rows according to a list of indices: + +.. code-block:: + + >>> small_dataset = dataset.select([0, 10, 20, 30, 40, 50]) + >>> len(small_dataset) + 6 + +* :func:`datasets.Dataset.filter` returns rows that match a specified condition: + +.. code-block:: + + >>> start_with_ar = dataset.filter(lambda example: example['sentence1'].startswith('Ar')) + >>> len(start_with_ar) + 6 + >>> start_with_ar['sentence1'] + ['Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', + 'Arison said Mann may have been one of the pioneers of the world music movement and he had a deep love of Brazilian music .', + 'Arts helped coach the youth on an eighth-grade football team at Lombardi Middle School in Green Bay .', + 'Around 9 : 00 a.m. EDT ( 1300 GMT ) , the euro was at $ 1.1566 against the dollar , up 0.07 percent on the day .', + "Arguing that the case was an isolated example , Canada has threatened a trade backlash if Tokyo 's ban is not justified on scientific grounds .", + 'Artists are worried the plan would harm those who need help most - performers who have a difficult time lining up shows .' + ] + +:func:`datasets.Dataset.filter` can also filter by indices if you set ``with_indices=True``: + +.. code-block:: + + >>> even_dataset = dataset.filter(lambda example, indice: indice % 2 == 0, with_indices=True) + >>> len(even_dataset) + 1834 + >>> len(dataset) / 2 + 1834.0 + +``Split`` +^^^^^^^^^ + +:func:`datasets.Dataset.train_test_split` creates train and test splits if your dataset doesn't already have them. This allows you to adjust the relative proportions or absolute number of samples in each split. In the example below, use the ``test_size`` argument to create a test split that is 10% of the original dataset: + +..
code-block:: + + >>> dataset.train_test_split(test_size=0.1) + {'train': Dataset(schema: {'sentence1': 'string', 'sentence2': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 3301), + 'test': Dataset(schema: {'sentence1': 'string', 'sentence2': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 367)} + >>> 0.1 * len(dataset) + 366.8 + +The splits are shuffled by default, but you can set ``shuffle=False`` to prevent shuffling. + +``Shard`` +^^^^^^^^^ + +πŸ€— Datasets supports sharding to divide a very large dataset into a predefined number of chunks. Specify the ``num_shards`` argument in :func:`datasets.Dataset.shard` to determine the number of shards to split the dataset into. You will also need to provide the shard you want to return with the ``index`` argument. + +For example, the `imdb `_ dataset has 25000 examples: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('imdb', split='train') + >>> print(dataset) + Dataset({ + features: ['text', 'label'], + num_rows: 25000 + }) + +After you shard the dataset into four chunks, the first shard only has 6250 examples: + +.. code-block:: + + >>> dataset.shard(num_shards=4, index=0) + Dataset({ + features: ['text', 'label'], + num_rows: 6250 + }) + >>> print(25000/4) + 6250.0 + +Rename, remove, cast, and flatten +--------------------------------- + +The following methods allow you to modify the columns of a dataset. These methods are useful for renaming or removing columns, changing columns to a new set of features, and flattening nested column structures. + +``Rename`` +^^^^^^^^^^ + +Use :func:`datasets.Dataset.rename_column` when you need to rename a column in your dataset. Features associated with the original column are actually moved under the new column name, instead of just replacing the original column in-place. + +Provide :func:`datasets.Dataset.rename_column` with the name of the original column and the new column name: + +.. code-block:: + + >>> dataset + Dataset({ + features: ['sentence1', 'sentence2', 'label', 'idx'], + num_rows: 3668 + }) + >>> dataset = dataset.rename_column("sentence1", "sentenceA") + >>> dataset = dataset.rename_column("sentence2", "sentenceB") + >>> dataset + Dataset({ + features: ['sentenceA', 'sentenceB', 'label', 'idx'], + num_rows: 3668 + }) + +``Remove`` +^^^^^^^^^^ + +When you need to remove one or more columns, give :func:`datasets.Dataset.remove_columns` the name of the column to remove. Remove more than one column by providing a list of column names: + +.. code-block:: + + >>> dataset = dataset.remove_columns("label") + >>> dataset + Dataset({ + features: ['sentence1', 'sentence2', 'idx'], + num_rows: 3668 + }) + >>> dataset = dataset.remove_columns(['sentence1', 'sentence2']) + >>> dataset + Dataset({ + features: ['idx'], + num_rows: 3668 + }) + +``Cast`` +^^^^^^^^ + +:func:`datasets.Dataset.cast` changes the feature type of one or more columns. This method takes your new :obj:`datasets.Features` as its argument. The following sample code shows how to change the feature types of :obj:`datasets.ClassLabel` and :obj:`datasets.Value`: + +..
code-block:: + + >>> dataset.features + {'sentence1': Value(dtype='string', id=None), + 'sentence2': Value(dtype='string', id=None), + 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), + 'idx': Value(dtype='int32', id=None)} + + >>> from datasets import ClassLabel, Value + >>> new_features = dataset.features.copy() + >>> new_features["label"] = ClassLabel(names=['negative', 'positive']) + >>> new_features["idx"] = Value('int64') + >>> dataset = dataset.cast(new_features) + >>> dataset.features + {'sentence1': Value(dtype='string', id=None), + 'sentence2': Value(dtype='string', id=None), + 'label': ClassLabel(num_classes=2, names=['negative', 'positive'], names_file=None, id=None), + 'idx': Value(dtype='int64', id=None)} + +.. tip:: + + Casting only works if the original feature type and new feature type are compatible. For example, you can cast a column with the feature type ``Value('int32')`` to ``Value('bool')`` if the original column only contains ones and zeros. + +.. _flatten: + +``Flatten`` +^^^^^^^^^^^ + +Sometimes a column can be a nested structure of several types. Use :func:`datasets.Dataset.flatten` to extract the subfields into their own separate columns. Take a look at the nested structure below from the SQuAD dataset: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('squad', split='train') + >>> dataset.features + {'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None), + 'context': Value(dtype='string', id=None), + 'id': Value(dtype='string', id=None), + 'question': Value(dtype='string', id=None), + 'title': Value(dtype='string', id=None)} + +The ``answers`` field contains two subfields: ``text`` and ``answer_start``. Flatten them with :func:`datasets.Dataset.flatten`: + +.. code-block:: + + >>> flat_dataset = dataset.flatten() + >>> flat_dataset + Dataset({ + features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'], + num_rows: 87599 + }) + +Notice how the subfields are now their own independent columns: ``answers.text`` and ``answers.answer_start``. + +.. _map: + +``Map`` +------- + +Some of the more powerful applications of πŸ€— Datasets come from using :func:`datasets.Dataset.map`. The primary purpose of :func:`datasets.Dataset.map` is to speed up processing functions. It allows you to apply a processing function to each example in a dataset, independently or in batches. This function can even create new rows and columns. + +In the following example, you will prefix each ``sentence1`` value in the dataset with ``'My sentence: '``. First, create a function that adds ``'My sentence: '`` to the beginning of each sentence. The function needs to accept and output a ``dict``: + +.. code-block:: + + >>> def add_prefix(example): + ... example['sentence1'] = 'My sentence: ' + example['sentence1'] + ... return example + +Next, apply this function to the dataset with :func:`datasets.Dataset.map`: + +.. 
code-block:: + + >>> updated_dataset = small_dataset.map(add_prefix) + >>> updated_dataset['sentence1'][:5] + ['My sentence: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', + "My sentence: Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", + 'My sentence: They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .', + 'My sentence: Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', + ] + +Let's take a look at another example, except this time, you will remove a column with :func:`datasets.Dataset.map`. When you remove a column, it is only removed after the example has been provided to the mapped function. This allows the mapped function to use the content of the columns before they are removed. + +Specify the column to remove with the ``remove_columns`` argument in :func:`datasets.Dataset.map`: + +.. code-block:: + + >>> updated_dataset = dataset.map(lambda example: {'new_sentence': example['sentence1']}, remove_columns=['sentence1']) + >>> updated_dataset.column_names + ['sentence2', 'label', 'idx', 'new_sentence'] + +.. tip:: + + πŸ€— Datasets also has a :func:`datasets.Dataset.remove_columns` method that is functionally identical, but faster, because it doesn't copy the data of the remaining columns. + +You can also use :func:`datasets.Dataset.map` with indices if you set ``with_indices=True``. The example below adds the index to the beginning of each sentence: + +.. code-block:: + + >>> updated_dataset = dataset.map(lambda example, idx: {'sentence2': f'{idx}: ' + example['sentence2']}, with_indices=True) + >>> updated_dataset['sentence2'][:5] + ['0: Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', + "1: Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", + "2: On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .", + '3: Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .', + '4: PG & E Corp. shares jumped $ 1.63 or 8 percent to $ 21.03 on the New York Stock Exchange on Friday .' + ] + +Multiprocessing +^^^^^^^^^^^^^^^ + +Multiprocessing can significantly speed up processing by parallelizing the processes on your CPU. Set the ``num_proc`` argument in :func:`datasets.Dataset.map` to set the number of processes to use: + +.. code:: + + >>> updated_dataset = dataset.map(lambda example, idx: {'sentence2': f'{idx}: ' + example['sentence2']}, num_proc=4) + +Batch processing +^^^^^^^^^^^^^^^^ + +:func:`datasets.Dataset.map` also supports working with batches of examples. Operate on batches by setting ``batched=True``. The default batch size is 1000, but you can adjust it with the ``batch_size`` argument. This opens the door to many interesting applications such as tokenization, splitting long sentences into shorter chunks, and data augmentation. + +Tokenization +"""""""""""" + +One of the most obvious use-cases for batch processing is tokenization, which accepts batches of inputs. + +First, load the tokenizer from the BERT model: + +.. code-block:: + + >>> from transformers import BertTokenizerFast + >>> tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased') + +Apply the tokenizer to batches of the ``sentence1`` field: + +.. 
code-block:: + + >>> encoded_dataset = dataset.map(lambda examples: tokenizer(examples['sentence1']), batched=True) + >>> encoded_dataset.column_names + ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'] + >>> encoded_dataset[0] + {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', + 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', + 'label': 1, + 'idx': 0, + 'input_ids': [ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + } + +Now you have three new columns, ``input_ids``, ``token_type_ids``, and ``attention_mask``, that contain the encoded version of the ``sentence1`` field. + +Split long examples +""""""""""""""""""" + +When your examples are too long, you may want to split them into several smaller snippets. Begin by creating a function that: + +1. Splits the ``sentence1`` field into snippets of 50 characters. + +2. Stacks all the snippets together to create the new dataset. + +.. code-block:: + + >>> def chunk_examples(examples): + ... chunks = [] + ... for sentence in examples['sentence1']: + ... chunks += [sentence[i:i + 50] for i in range(0, len(sentence), 50)] + ... return {'chunks': chunks} + +Apply the function with :func:`datasets.Dataset.map`: + +.. code-block:: + + >>> chunked_dataset = dataset.map(chunk_examples, batched=True, remove_columns=dataset.column_names) + >>> chunked_dataset[:10] + {'chunks': ['Amrozi accused his brother , whom he called " the ', + 'witness " , of deliberately distorting his evidenc', + 'e .', + "Yucaipa owned Dominick 's before selling the chain", + ' to Safeway in 1998 for $ 2.5 billion .', + 'They had published an advertisement on the Interne', + 't on June 10 , offering the cargo for sale , he ad', + 'ded .', + 'Around 0335 GMT , Tab shares were up 19 cents , or', + ' 4.4 % , at A $ 4.56 , having earlier set a record']} + +Notice how the sentences are split into shorter chunks now, and there are more rows in the dataset. + +.. code-block:: + + >>> dataset + Dataset({ + features: ['sentence1', 'sentence2', 'label', 'idx'], + num_rows: 3668 + }) + >>> chunked_dataset + Dataset(schema: {'chunks': 'string'}, num_rows: 10470) + +Data augmentation +""""""""""""""""" + +With batch processing, you can even augment your dataset with additional examples. In the following example, you will generate additional words for a masked token in a sentence. + +Load the `RoBERTa `_ model for use in the πŸ€— Transformers `FillMaskPipeline `_: + +.. code-block:: + + >>> from random import randint + >>> from transformers import pipeline + + >>> fillmask = pipeline('fill-mask', model='roberta-base') + >>> mask_token = fillmask.tokenizer.mask_token + >>> smaller_dataset = dataset.filter(lambda e, i: i<100, with_indices=True) + +Create a function to randomly select a word to mask in the sentence. The function should also return the original sentence and the top three replacements generated by RoBERTa. + +.. code-block:: + + >>> def augment_data(examples): + ... outputs = [] + ... for sentence in examples['sentence1']: + ... words = sentence.split(' ') + ... K = randint(1, len(words)-1) + ...
masked_sentence = " ".join(words[:K] + [mask_token] + words[K+1:]) + ... predictions = fillmask(masked_sentence) + ... augmented_sequences = [predictions[i]['sequence'] for i in range(3)] + ... outputs += [sentence] + augmented_sequences + ... + ... return {'data': outputs} + +Use :func:`datasets.Dataset.map` to apply the function over the whole dataset: + +.. code-block:: + + >>> augmented_dataset = smaller_dataset.map(augment_data, batched=True, remove_columns=dataset.column_names, batch_size=8) + >>> augmented_dataset[:9]['data'] + ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', + 'Amrozi accused his brother, whom he called " the witness ", of deliberately withholding his evidence.', + 'Amrozi accused his brother, whom he called " the witness ", of deliberately suppressing his evidence.', + 'Amrozi accused his brother, whom he called " the witness ", of deliberately destroying his evidence.', + "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", + 'Yucaipa owned Dominick Stores before selling the chain to Safeway in 1998 for $ 2.5 billion.', + "Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $ 2.5 billion.", + 'Yucaipa owned Dominick Pizza before selling the chain to Safeway in 1998 for $ 2.5 billion.' + ] + +For each original sentence, RoBERTa augmented a random word with three alternatives. In the first sentence, the word ``distorting`` is augmented with ``withholding``, ``suppressing``, and ``destroying``. + +Process multiple splits +^^^^^^^^^^^^^^^^^^^^^^^ + +Many datasets have splits that you can process simultaneously with :func:`datasets.DatasetDict.map`. For example, tokenize the ``sentence1`` field in the train and test splits: + +.. code-block:: + + >>> from datasets import load_dataset + + # load all the splits + >>> dataset = load_dataset('glue', 'mrpc') + >>> encoded_dataset = dataset.map(lambda examples: tokenizer(examples['sentence1']), batched=True) + >>> encoded_dataset["train"][0] + {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', + 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', + 'label': 1, + 'idx': 0, + 'input_ids': [ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + } + +Distributed usage +^^^^^^^^^^^^^^^^^ + +When you use :func:`datasets.Dataset.map` in a distributed setting, you should also use `torch.distributed.barrier `_. This ensures the main process performs the mapping, while the other processes load the results, thereby avoiding duplicate work. + +The following example shows how you can use ``torch.distributed.barrier`` to synchronize the processes: + +.. code-block:: + + >>> from datasets import Dataset + >>> import torch.distributed + + >>> dataset1 = Dataset.from_dict({"a": [0, 1, 2]}) + + >>> if training_args.local_rank > 0: + ... print("Waiting for main process to perform the mapping") + ... torch.distributed.barrier() + + >>> dataset2 = dataset1.map(lambda x: {"a": x["a"] + 1}) + + >>> if training_args.local_rank == 0: + ... print("Loading results from main process") + ...
torch.distributed.barrier() + +Concatenate +------------ + +Separate datasets can be concatenated if they share the same column types. Concatenate datasets with :func:`datasets.concatenate_datasets`: + +.. code-block:: + + >>> from datasets import concatenate_datasets, load_dataset + + >>> bookcorpus = load_dataset("bookcorpus", split="train") + >>> wiki = load_dataset("wikipedia", "20200501.en", split="train") + >>> wiki = wiki.remove_columns("title") # only keep the text + + >>> assert bookcorpus.features.type == wiki.features.type + >>> bert_dataset = concatenate_datasets([bookcorpus, wiki]) + +.. seealso:: + + You can also mix several datasets together by taking alternating examples from each one to create a new dataset. This is known as interleaving, and you can use it with :func:`datasets.interleave_datasets`. Both :func:`datasets.interleave_datasets` and :func:`datasets.concatenate_datasets` will work with regular :class:`datasets.Dataset` and :class:`datasets.IterableDataset` objects. Refer to the :ref:`interleave_datasets` section for an example of how it's used. + +You can also concatenate two datasets horizontally (axis=1) as long as they have the same number of rows: + + >>> from datasets import Dataset + >>> bookcorpus_ids = Dataset.from_dict({"ids": list(range(len(bookcorpus)))}) + >>> bookcorpus_with_ids = concatenate_datasets([bookcorpus, bookcorpus_ids], axis=1) + +Format +------ + +:func:`datasets.Dataset.with_format` provides an alternative method to set the format. This method will return a new :class:`datasets.Dataset` object with your specified format: + +.. code:: + + >>> dataset.with_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label']) + +Use :func:`datasets.Dataset.reset_format` if you need to reset the dataset to the original format: + +.. code-block:: + + >>> dataset.format + {'type': 'torch', 'format_kwargs': {}, 'columns': ['label'], 'output_all_columns': False} + >>> dataset.reset_format() + >>> dataset.format + {'type': 'python', 'format_kwargs': {}, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False} + +Format transform +^^^^^^^^^^^^^^^^ + +:func:`datasets.Dataset.set_transform` allows you to apply a custom formatting transform on-the-fly. This will replace any previously specified format. For example, you can use this method to tokenize and pad tokens on-the-fly: + +.. code-block:: + + >>> from transformers import BertTokenizer + >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") + >>> def encode(batch): + ... return tokenizer(batch["sentence1"], padding="longest", truncation=True, max_length=512, return_tensors="pt") + >>> dataset.set_transform(encode) + >>> dataset.format + {'type': 'custom', 'format_kwargs': {'transform': }, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False} + >>> dataset[:2] + {'input_ids': tensor([[ 101, 2572, 3217, ... 102]]), 'token_type_ids': tensor([[0, 0, 0, ... 0]]), 'attention_mask': tensor([[1, 1, 1, ... 1]])} + +In this case, the tokenization is applied only when the examples are accessed. + + +Save +---- + +Once you are done processing your dataset, you can save and reuse it later with :func:`datasets.Dataset.save_to_disk`. + +Save your dataset by providing the path to the directory you wish to save it to: + +.. code:: + + >>> encoded_dataset.save_to_disk("path/of/my/dataset/directory") + +When you want to use your dataset again, use :func:`datasets.load_from_disk` to reload it: + +.. 
code-block:: + + >>> from datasets import load_from_disk + >>> reloaded_encoded_dataset = load_from_disk("path/of/my/dataset/directory") + +.. tip:: + + Want to save your dataset to a cloud storage provider? Read our :doc:`Cloud Storage <./filesystems>` guide on how to save your dataset to AWS or Google Cloud Storage! + +Export +------ + +πŸ€— Datasets supports exporting as well, so you can work with your dataset in other applications. The following table shows currently supported file formats you can export to: + +.. list-table:: + :header-rows: 1 + + * - File type + - Export method + * - CSV + - :func:`datasets.Dataset.to_csv` + * - JSON + - :func:`datasets.Dataset.to_json` + * - Parquet + - :func:`datasets.Dataset.to_parquet` + * - In-memory Python object + - :func:`datasets.Dataset.to_pandas` or :func:`datasets.Dataset.to_dict` + +For example, export your dataset to a CSV file like this: + +.. code:: + + >>> encoded_dataset.to_csv("path/of/my/dataset.csv") diff --git a/docs/source/processing.rst b/docs/source/processing.rst deleted file mode 100644 index 9b7408f007c..00000000000 --- a/docs/source/processing.rst +++ /dev/null @@ -1,708 +0,0 @@ -Processing data in a Dataset -============================================================== - -πŸ€— Datasets provides many methods to modify a Dataset, be it to reorder, split or shuffle the dataset or to apply data processing functions or evaluation functions to its elements. - -We'll start by presenting the methods which change the order or number of elements before presenting methods which access and can change the content of the elements themselves. - -As always, let's start by loading a small dataset for our demonstrations: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('glue', 'mrpc', split='train') - -.. note:: - - **No in-place policy** All the methods in this chapter return a new :class:`datasets.Dataset`. No modification is done in-place and it's thus responsibility of the user to decide to override the previous dataset with the newly returned one. - -.. note:: - - **Caching policy** All the methods in this chapter store the updated dataset in a cache file indexed by a hash of current state and all the arguments used to call the method. - - A subsequent call to any of the methods detailed here (like :func:`datasets.Dataset.sort`, :func:`datasets.Dataset.map`, etc) will thus **reuse the cached file instead of recomputing the operation** (even in another python session). - - This usually makes it very efficient to process data with πŸ€— Datasets. - - If the disk space is critical, these methods can be called with arguments to avoid this behavior (see the last section), or the cache files can be cleaned using the method :func:`datasets.Dataset.cleanup_cache_files`. 
- - -Selecting, sorting, shuffling, splitting rows --------------------------------------------------- - -Several methods are provided to reorder rows and/or split the dataset: - -- sorting the dataset according to a column (:func:`datasets.Dataset.sort`) -- shuffling the dataset (:func:`datasets.Dataset.shuffle`) -- filtering rows either according to a list of indices (:func:`datasets.Dataset.select`) or with a filter function returning true for the rows to keep (:func:`datasets.Dataset.filter`), -- splitting the dataset in a (potentially shuffled) train and a test split (:func:`datasets.Dataset.train_test_split`), -- splitting the dataset in a deterministic list of shards (:func:`datasets.Dataset.shard`), -- concatenate datasets that have the same column types (:func:`datasets.concatenate_datasets`). - -These methods have quite simple signature and should be for the most part self-explanatory. - -Let's see them in action: - -Sorting the dataset according to a column: ``sort`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The provided column has to be NumPy compatible (typically a column containing numerical values). - -.. code-block:: - - >>> dataset['label'][:10] - [1, 0, 1, 0, 1, 1, 0, 1, 0, 0] - >>> sorted_dataset = dataset.sort('label') - >>> sorted_dataset['label'][:10] - [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] - >>> sorted_dataset['label'][-10:] - [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - -Shuffling the dataset: ``shuffle`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code-block:: - - >>> shuffled_dataset = sorted_dataset.shuffle(seed=42) - >>> shuffled_dataset['label'][:10] - [1, 1, 1, 0, 1, 1, 1, 1, 1, 0] - -You can also provide a :obj:`numpy.random.Generator` to :func:`datasets.Dataset.shuffle` to control more finely the algorithm used to shuffle the dataset. - -Filtering rows: ``select`` and ``filter`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -You can filter rows according to a list of indices (:func:`datasets.Dataset.select`) or with a filter function returning true for the rows to keep (:func:`datasets.Dataset.filter`): - -.. code-block:: - - >>> small_dataset = dataset.select([0, 10, 20, 30, 40, 50]) - >>> len(small_dataset) - 6 - - >>> start_with_ar = dataset.filter(lambda example: example['sentence1'].startswith('Ar')) - >>> len(start_with_ar) - 6 - >>> start_with_ar['sentence1'] - ['Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', - 'Arison said Mann may have been one of the pioneers of the world music movement and he had a deep love of Brazilian music .', - 'Arts helped coach the youth on an eighth-grade football team at Lombardi Middle School in Green Bay .', - 'Around 9 : 00 a.m. EDT ( 1300 GMT ) , the euro was at $ 1.1566 against the dollar , up 0.07 percent on the day .', - "Arguing that the case was an isolated example , Canada has threatened a trade backlash if Tokyo 's ban is not justified on scientific grounds .", - 'Artists are worried the plan would harm those who need help most - performers who have a difficult time lining up shows .' - ] - -:func:`datasets.Dataset.filter` expects a function which can accept a single example of the dataset, i.e. the python dictionary returned by :obj:`dataset[i]` and returns a boolean value. It's also possible to use the index of each example in the function by setting :obj:`with_indices=True` in :func:`datasets.Dataset.filter`. 
In this case, the signature of the function given to :func:`datasets.Dataset.filter` should be :obj:`function(example: dict, index: int) -> bool`: - -.. code-block:: - - >>> even_dataset = dataset.filter(lambda example, index: index % 2 == 0, with_indices=True) - >>> len(even_dataset) - 1834 - >>> len(dataset) / 2 - 1834.0 - -Splitting the dataset in train and test split: ``train_test_split`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This method is adapted from scikit-learn celebrated: obj:`train_test_split` `method `_ with the omission of the stratified options. - -You can select the test and train sizes as relative proportions or absolute number of samples. - -The splits will be **shuffled by default** using the above described :func:`datasets.Dataset.shuffle` method. You can deactivate this behavior by setting :obj:`shuffle=False` in the arguments of :func:`datasets.Dataset.train_test_split`. - -The two splits are returned as a dictionary of :class:`datasets.Dataset`. - -.. code-block:: - - >>> dataset.train_test_split(test_size=0.1) - {'train': Dataset(schema: {'sentence1': 'string', 'sentence2': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 3301), - 'test': Dataset(schema: {'sentence1': 'string', 'sentence2': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 367)} - >>> 0.1 * len(dataset) - 366.8 - -We can see that the test split is 10% of the original dataset. - -The :func:`datasets.Dataset.train_test_split` has many ways to select the relative sizes of the train and test split so we refer the reader to the package reference of :func:`datasets.Dataset.train_test_split` for all the details. - -Sharding the dataset: ``shard`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Eventually, it's possible to "shard" the dataset, i.e. divide it in a deterministic list of datasets of (almost) the same size. - -The :func:`datasets.Dataset.shard` takes as arguments the total number of shards (:obj:`num_shards`) and the index of the currently requested shard (:obj:`index`) and return a :class:`datasets.Dataset` instance constituted by the requested shard. - -This method can be used to slice a very large dataset in a predefined number of chunks. - -.. code-block:: - - >>> dataset_shard = dataset.shard(num_shards=40, index=3) - >>> print(dataset_shard.num_rows) - 92 - >>> print(dataset.num_rows /40) - 91.7 - -Renaming, removing, casting and flattening columns --------------------------------------------------- - -Renaming a column: ``rename_column`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This method renames a column in the dataset, and moves the features associated to the original column under the new column name. This operation will fail if the new column name already exists. - -:func:`datasets.Dataset.rename_column` takes the name of the original column and the new name as arguments. - -.. code-block:: - - >>> dataset = dataset.rename_column("sentence1", "sentenceA") - >>> dataset = dataset.rename_column("sentence2", "sentenceB") - >>> dataset - Dataset({ - features: ['sentenceA', 'sentenceB', 'label', 'idx'], - num_rows: 3668 - }) - - -Removing one or several columns: ``remove_columns`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -It allows to remove one or several column(s) in the dataset and the features associated to them. 
- -You can also remove a column using :func:`Dataset.map` with `remove_columns` but the present method -doesn't copy the data to a new dataset object and is thus faster. - -:func:`datasets.Dataset.remove_columns` takes the names of the column to remove as argument. -You can provide one single column name or a list of column names. - -.. code-block:: - - >>> dataset = dataset.remove_columns("label") - >>> dataset - Dataset({ - features: ['sentence1', 'sentence2', 'idx'], - num_rows: 3668 - }) - >>> dataset = dataset.remove_columns(['sentence1', 'sentence2']) - >>> dataset - Dataset({ - features: ['idx'], - num_rows: 3668 - }) - -Casting the dataset to a new set of features types: ``cast`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This method is used to cast the dataset to a new set of features. -You can change the feature type of one or several columns. - -For the dataset casting to work, the original features type and the new feature types must be compatible for casting one to the other. -For example you can cast a column with the feature type ``Value("int32")`` to ``Value("bool")`` only if it only contains ones and zeros. -In general, you can only cast a column to a new type if pyarrow allows to cast between the underlying pyarrow data types. - -:func:`datasets.Dataset.cast` takes the new :obj:`datasets.Features` definition as argument. - -In this example, we change the :obj:`datasets.ClassLabel` label names, and we also change the ``idx`` from ``int32`` to ``int64``: - -.. code-block:: - - >>> dataset.features - {'sentence1': Value(dtype='string', id=None), - 'sentence2': Value(dtype='string', id=None), - 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), - 'idx': Value(dtype='int32', id=None)} - >>> from datasets import ClassLabel, Value - >>> new_features = dataset.features.copy() - >>> new_features["label"] = ClassLabel(names=['negative', 'positive']) - >>> new_features["idx"] = Value('int64') - >>> dataset = dataset.cast(new_features) - >>> dataset.features - {'sentence1': Value(dtype='string', id=None), - 'sentence2': Value(dtype='string', id=None), - 'label': ClassLabel(num_classes=2, names=['negative', 'positive'], names_file=None, id=None), - 'idx': Value(dtype='int64', id=None)} - - -Flattening columns: ``flatten`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -A column type can be a nested struct of several types. -For example a column "answers" may have two subfields "answer_start" and "text". -In this case if you want each of the two subfields to be actual columns, you can use :func:`datasets.Dataset.flatten`: - -.. code-block:: - - >>> squad = load_dataset("squad", split="train") - >>> squad - Dataset({ - features: ['id', 'title', 'context', 'question', 'answers'], - num_rows: 87599 - }) - >>> flattened_squad = squad.flatten() - >>> flattened_squad - Dataset({ - features: ['answers.answer_start', 'answers.text', 'context', 'id', 'question', 'title'], - num_rows: 87599 - }) - - - -Processing data with ``map`` --------------------------------- - -All the methods we've seen up to now operate on examples taken as a whole and don't inspect (excepted for the ``filter`` method) or modify the content of the samples. 
- -We now turn to the :func:`datasets.Dataset.map` method which is a powerful method inspired by ``tf.data.Dataset`` map method and which you can use to apply a processing function to each example in a dataset, independently or in batch and even generate new rows or columns. - -:func:`datasets.Dataset.map` takes a callable accepting a dict as argument (same dict as returned by :obj:`dataset[i]`) and iterates over the dataset by calling the function with each example. - -Let's print the length of the ``sentence1`` value for each sample in our dataset: - -.. code-block:: - - >>> from datasets.utils import disable_progress_bar - >>> disable_progress_bar() - >>> small_dataset = dataset.select(range(10)) - >>> small_dataset - Dataset(schema: {'sentence1': 'string', 'sentence2': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 10) - >>> small_dataset.map(lambda example: print(len(example['sentence1']))) - 103 - 89 - 105 - 119 - 105 - 97 - 88 - 54 - 85 - 108 - Dataset(schema: {'sentence1': 'string', 'sentence2': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 10) - -This is basically the same as doing - -.. code-block:: - - for example in dataset: - function(example) - -The above example had no effect on the dataset because the method we supplied to :func:`datasets.Dataset.map` didn't return a :obj:`dict` or a :obj:`abc.Mapping` that could be used to update the examples in the dataset. - -In such a case, :func:`datasets.Dataset.map` will return the original dataset (:obj:`self`) and the user is usually only interested in side effects of the provided method. - -Now let's see how we can use a method that actually modifies the dataset with :func:`datasets.Dataset.map`. - -Processing data row by row -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The main interest of :func:`datasets.Dataset.map` is to update and modify the content of the table and leverage smart caching and fast backend. - -To use :func:`datasets.Dataset.map` to update elements in the table you need to provide a function with the following signature: :obj:`function(example: dict) -> dict`. - -Let's add a prefix ``'My sentence: '`` to each ``sentence1`` value in our small dataset: - -.. code-block:: - - >>> def add_prefix(example): - ... example['sentence1'] = 'My sentence: ' + example['sentence1'] - ... return example - - >>> updated_dataset = small_dataset.map(add_prefix) - >>> updated_dataset['sentence1'][:5] - ['My sentence: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', - "My sentence: Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", - 'My sentence: They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .', - 'My sentence: Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', - ] - -This call to :func:`datasets.Dataset.map` computed and returned an updated table. - -.. note:: - - Calling :func:`datasets.Dataset.map` also stored the updated table in a cache file indexed by the current state and the mapped function. - A subsequent call to :func:`datasets.Dataset.map` (even in another python session) will reuse the cached file instead of recomputing the operation. - You can test this by running again the previous cell, you will see that the result is directly loaded from the cache and not re-computed again. 
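If you ever want to bypass this cache reuse, for instance while you are iterating on a mapped function, :func:`datasets.Dataset.map` lets you control it directly. Below is a minimal sketch; it assumes the ``load_from_cache_file`` and ``cache_file_name`` arguments available in your installed version, and reuses the ``add_prefix`` function and ``small_dataset`` from above:

.. code-block::

    >>> # Recompute the transform instead of reloading the previously cached result
    >>> recomputed_dataset = small_dataset.map(add_prefix, load_from_cache_file=False)
    >>> # Write the processed Arrow table to an explicit cache file of your choosing
    >>> named_cache_dataset = small_dataset.map(add_prefix, cache_file_name="./add_prefix_cache.arrow")

The resulting dataset is the same in both cases; only where (and whether) the cached result is reused changes.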
- -The function you provide to :func:`datasets.Dataset.map` should accept an input with the format of an item of the dataset: :obj:`function(dataset[0])` and return a python dict. - -The columns and type of the outputs **can be different** from columns and type of the input dict. In this case the new keys will be **added** as additional columns in the dataset. - -Each dataset example dict is updated with the dictionary returned by the function. Under the hood :obj:`map` operates like this: - -.. code-block:: - - new_dataset = [] - for example in dataset: - processed_example = function(example) - example.update(processed_example) - new_dataset.append(example) - return new_dataset - -Since the input example dict is **updated** with output dict generated by our :obj:`add_prefix` function, we could have actually just returned the updated ``sentence1`` field, instead of the full example which is simpler to write: - -.. code-block:: - - >>> updated_dataset = small_dataset.map(lambda example: {'sentence1': 'My sentence: ' + example['sentence1']}) - >>> updated_dataset['sentence1'][:5] - ['My sentence: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', - "My sentence: Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'My sentence: They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .', - 'My sentence: Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', - 'My sentence: The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .'] - -If a dataset was formatted using :func:`datasets.Dataset.set_format`, then: - -- if a format type was set, then the format type doesn't change -- if a list of columns that :func:`datasets.Dataset.__getitem__` should return was set, then the new columns added by map are added to this list - -Removing columns -^^^^^^^^^^^^^^^^^^^^^^^^ - -This process of **updating** the original example with the output of the mapped function is simpler to write when mostly adding new columns to a dataset but we need an additional mechanism to easily remove columns. - - -To this aim, the :obj:`remove_columns=List[str]` argument can be used and provided with a single name or a list of names of columns which should be removed during the :func:`datasets.Dataset.map` operation. - -Columns to remove are removed **after** the example has been provided to the mapped function so that the mapped function can use the content of these columns before they are removed. - -Here is an example removing the ``sentence1`` column while adding a ``new_sentence`` column with the content of the ``sentence1``. Said more simply, we are renaming the ``sentence1`` column as ``new_sentence``: - -.. code-block:: - - >>> updated_dataset = small_dataset.map(lambda example: {'new_sentence': example['sentence1']}, remove_columns=['sentence1']) - >>> updated_dataset.column_names - ['sentence2', 'label', 'idx', 'new_sentence'] - - -Using row indices -^^^^^^^^^^^^^^^^^^^^^^ - -When the argument :obj:`with_indices` is set to :obj:`True`, the indices of the rows (from ``0`` to ``len(dataset)``) will be provided to the mapped function. This function must then have the following signature: :obj:`function(example: dict, index: int) -> Union[None, dict]`. 
- -In the following example, we add the index of the example as a prefix to the ``sentence2`` field of each example: - -.. code-block:: - - >>> updated_dataset = small_dataset.map(lambda example, idx: {'sentence2': f'{idx}: ' + example['sentence2']}, with_indices=True) - >>> updated_dataset['sentence2'][:5] - ['0: Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', - "1: Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", - "2: On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .", - '3: Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .', - '4: PG & E Corp. shares jumped $ 1.63 or 8 percent to $ 21.03 on the New York Stock Exchange on Friday .'] - - -Processing data in batches -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -:func:`datasets.Dataset.map` can also work with batches of examples (slices of the dataset). - -This is particularly interesting if you have a mapped function which can efficiently handle batches of inputs like the tokenizers of the fast `HuggingFace tokenizers library `__. - -To operate on batch of examples, just set :obj:`batched=True` when calling :func:`datasets.Dataset.map` and provide a function with the following signature: :obj:`function(examples: Dict[List]) -> Dict[List]` or, if you use indices (:obj:`with_indices=True`): :obj:`function(examples: Dict[List], indices: List[int]) -> Dict[List])`. - -In other words, the mapped function should accept an input with the format of a slice of the dataset: :obj:`function(dataset[:10])`. - -Let's take an example with a fast tokenizer of the πŸ€— Transformers library. - -First install this library if you haven't already done it: - -.. code-block:: - - pip install transformers - -Then we will import a fast tokenizer, for instance the tokenizer of the Bert model: - -.. code-block:: - - >>> from transformers import BertTokenizerFast - >>> tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased') - -Now let's batch tokenize the ``sentence1`` fields of our dataset. The tokenizers of the πŸ€— Transformers library can accept lists of texts as inputs and tokenize them efficiently in batch (for the fast tokenizers in particular). - -For more details on the tokenizers of the πŸ€— Transformers library please refer to its `guide on processing data `__. - -This tokenizer will output a dictionary-like object with three fields: ``input_ids``, ``token_type_ids``, ``attention_mask`` corresponding to Bert model's required inputs. Each field contains a list (batch) of samples. - -The output of the tokenizer is thus compatible with the :func:`datasets.Dataset.map` method which is also expected to return a dictionary. We can thus directly return the dictionary generated by the tokenizer as the output of our mapped function: - -.. 
code-block:: - - >>> encoded_dataset = dataset.map(lambda examples: tokenizer(examples['sentence1']), batched=True) - >>> encoded_dataset.column_names - ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'] - >>> encoded_dataset[0] - {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', - 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', - 'label': 1, - 'idx': 0, - 'input_ids': [ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102], - 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - -We have indeed added the ``input_ids``, ``token_type_ids`` and ``attention_mask`` columns, which contain the encoded version of the ``sentence1`` field. - -The batch size provided to the mapped function can be controlled by the :obj:`batch_size` argument. The default value is ``1000``, i.e. batches of 1000 samples will be provided to the mapped function by default. - -Multiprocessing ---------------------------- - -Multiprocessing can significantly speed up the processing of your dataset. - -The :func:`datasets.Dataset.map` method has an argument ``num_proc`` that allows you to set the number of processes to use. - -In this case, each process takes care of processing one shard of the dataset, and all the processes run in parallel. - -Augmenting the dataset ---------------------------- - -Using :func:`datasets.Dataset.map` in batched mode (i.e. with :obj:`batched=True`) actually lets you control the size of the generated dataset freely. - -More precisely, in batched mode :func:`datasets.Dataset.map` will provide a batch of examples (as a dict of lists) to the mapped function and expects the mapped function to return a batch of examples (as a dict of lists), but **the input and output batch are not required to be of the same size**. - -In other words, a batched mapped function can take as input a batch of size ``N`` and return a batch of size ``M`` where ``M`` can be greater or less than ``N`` and can even be zero. - -The resulting dataset can thus have a different size from the original dataset. - -This can be taken advantage of for several use cases: - -- the :func:`datasets.Dataset.filter` method makes use of variable size batched mapping under the hood to change the size of the dataset and filter out some examples, -- it's possible to cut examples which are too long into several snippets, -- it's also possible to do data augmentation on each example. - -.. note:: - - **One important condition on the output of the mapped function.** Each field in the output dictionary returned by the mapped function must contain the **same number of elements** as the other fields in this output dictionary, otherwise it's not possible to define the number of examples in the output returned by the mapped function. This number can vary between the successive batches processed by the mapped function, but within a single batch, all fields of the output dictionary should have the same number of elements. - -Let's show how to implement the two simple examples we mentioned: cutting examples which are too long into several snippets, and doing some data augmentation.
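Before that, here is a minimal sketch of the ``num_proc`` argument introduced in the Multiprocessing section above. It reuses the tokenizer defined earlier, and the ``tokenize_sentence1`` helper is just an illustrative name; the actual speed-up depends on your hardware and on how expensive the mapped function is, so treat the number of processes as a value to tune rather than a recommendation:

.. code-block::

    >>> def tokenize_sentence1(examples):
    ...     return tokenizer(examples['sentence1'])

    >>> # The dataset is split into 4 shards and each process works on its own shard
    >>> encoded_dataset = dataset.map(tokenize_sentence1, batched=True, num_proc=4)

With that aside, let's get back to our two examples.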
- -We'll start by chunking the ``sentence1`` field of our dataset into chunks of 50 characters, and stack all these chunks to make our new dataset. - -We will also remove all the columns of the dataset and only keep the chunks, in order to avoid the issue of uneven field lengths mentioned in the above note (we could also duplicate the other fields to compensate, but let's make it as simple as possible here): - -.. code-block:: - - >>> def chunk_examples(examples): - ... chunks = [] - ... for sentence in examples['sentence1']: - ... chunks += [sentence[i:i + 50] for i in range(0, len(sentence), 50)] - ... return {'chunks': chunks} - - >>> chunked_dataset = dataset.map(chunk_examples, batched=True, remove_columns=dataset.column_names) - >>> chunked_dataset - Dataset(schema: {'chunks': 'string'}, num_rows: 10470) - >>> chunked_dataset[:10] - {'chunks': ['Amrozi accused his brother , whom he called " the ', - 'witness " , of deliberately distorting his evidenc', - 'e .', - "Yucaipa owned Dominick 's before selling the chain", - ' to Safeway in 1998 for $ 2.5 billion .', - 'They had published an advertisement on the Interne', - 't on June 10 , offering the cargo for sale , he ad', - 'ded .', - 'Around 0335 GMT , Tab shares were up 19 cents , or', - ' 4.4 % , at A $ 4.56 , having earlier set a record']} - -As we can see, our dataset is now much longer (10470 rows) and contains a single column with chunks of 50 characters. Some chunks are shorter since they are the last part of a sentence whose length is not a multiple of 50 characters. We could then filter them out with :func:`datasets.Dataset.filter`, for instance. - -Now let's finish with the other example and try to do some data augmentation. We will use a RoBERTa model to sample replacements for masked tokens. - -Here we can use the `FillMaskPipeline of πŸ€— Transformers `__ to generate options for a masked token in a sentence. - -We will randomly select a word to mask in the sentence and return the original sentence plus the top three replacements predicted by RoBERTa. - -Since the RoBERTa model is too large to run quickly on a small laptop CPU, we will restrict this example to a small dataset of 100 examples, and we will lower the batch size to be able to follow the processing more precisely. - -.. code-block:: - - >>> from random import randint - >>> from transformers import pipeline - >>> - >>> fillmask = pipeline('fill-mask') - >>> mask_token = fillmask.tokenizer.mask_token - >>> smaller_dataset = dataset.filter(lambda e, i: i < 100, with_indices=True) - >>> - >>> def augment_data(examples): - ... outputs = [] - ... for sentence in examples['sentence1']: - ... words = sentence.split(' ') - ... K = randint(1, len(words)-1) - ... masked_sentence = " ".join(words[:K] + [mask_token] + words[K+1:]) - ... predictions = fillmask(masked_sentence) - ... augmented_sequences = [predictions[i]['sequence'] for i in range(3)] - ... outputs += [sentence] + augmented_sequences - ... - ...
return {'data': outputs} - - >>> augmented_dataset = smaller_dataset.map(augment_data, batched=True, remove_columns=dataset.column_names, batch_size=8) - >>> len(augmented_dataset) - 400 - >>> augmented_dataset[:8]['data'] - ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', - 'Amrozi accused his brother, whom he called " the witness ", of deliberately withholding his evidence.', - 'Amrozi accused his brother, whom he called " the witness ", of deliberately suppressing his evidence.', - 'Amrozi accused his brother, whom he called " the witness ", of deliberately destroying his evidence.', - "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", - 'Yucaipa owned Dominick Stores before selling the chain to Safeway in 1998 for $ 2.5 billion.', - "Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $ 2.5 billion.", - 'Yucaipa owned Dominick Pizza before selling the chain to Safeway in 1998 for $ 2.5 billion.'] - -Here we have now multiplied the size of our dataset by ``4`` by adding three alternatives generated with Roberta to each example. We can see that the word ``distorting`` in the first example was augmented with other possibilities by the Roberta model: ``withholding``, ``suppressing``, ``destroying``, while in the second sentence, it was the ``'s`` token which was randomly sampled and replaced by ``Stores`` and ``Pizza``. - -Obviously this is a very simple example for data augmentation and it could be improved in several ways, the most interesting take-away is probably how this can be written in roughly ten lines of code without any loss in flexibility. - -Processing several splits at once ------------------------------------ - -When you load a dataset that has various splits, :func:`datasets.load_dataset` returns a :obj:`datasets.DatasetDict` that is a dictionary with split names as keys ('train', 'test' for example), and :obj:`datasets.Dataset` objects as values. -You can directly call map, filter, shuffle, and sort directly on a :obj:`datasets.DatasetDict` object: - -.. code-block:: - - >>> from datasets import load_dataset - >>> - >>> dataset = load_dataset('glue', 'mrpc') # load all the splits - >>> dataset.keys() - dict_keys(['train', 'validation', 'test']) - >>> encoded_dataset = dataset.map(lambda examples: tokenizer(examples['sentence1']), batched=True) - >>> encoded_dataset["train"][0] - {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', - 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', - 'label': 1, - 'idx': 0, - 'input_ids': [ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102], - 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - -Concatenate several datasets ----------------------------- - -When you have several :obj:`datasets.Dataset` objects that share the same column types, you can create a new :obj:`datasets.Dataset` object that is the concatenation of them: - -.. 
code-block:: - - >>> from datasets import concatenate_datasets, load_dataset - >>> - >>> bookcorpus = load_dataset("bookcorpus", split="train") - >>> wiki = load_dataset("wikipedia", "20200501.en", split="train") - >>> wiki = wiki.remove_columns("title") # only keep the text - >>> - >>> assert bookcorpus.features.type == wiki.features.type - >>> bert_dataset = concatenate_datasets([bookcorpus, wiki]) - -If you want to interleave the datasets instead of concatenating them, you can use :func:`datasets.interleave_datasets`. - - -Saving a processed dataset on disk and reload it ------------------------------------------------- - -Once you have your final dataset you can save it on your disk and reuse it later using :obj:`datasets.load_from_disk`. -Saving a dataset creates a directory with various files: - -- arrow files: they contain your dataset's data -- dataset_info.json: contains the description, citations, etc. of the dataset -- state.json: contains the list of the arrow files and other information like the dataset format type, if any (torch or tensorflow for example) - -.. code-block:: - - >>> encoded_dataset.save_to_disk("path/of/my/dataset/directory") - >>> from datasets import load_from_disk - >>> reloaded_encoded_dataset = load_from_disk("path/of/my/dataset/directory") - -Both :obj:`datasets.Dataset` and :obj:`datasets.DatasetDict` objects can be saved on disk, by using respectively :func:`datasets.Dataset.save_to_disk` and :func:`datasets.DatasetDict.save_to_disk`. - -Furthermore it is also possible to save :obj:`datasets.Dataset` and :obj:`datasets.DatasetDict` to other filesystems and cloud storages such as S3 by using respectively :func:`datasets.Dataset.save_to_disk` -and :func:`datasets.DatasetDict.save_to_disk` and providing a ``Filesystem`` as input ``fs``. To learn more about saving your ``datasets`` to other filesystem take a look at :doc:`filesystems`. - -Exporting a dataset to csv/json/parquet, or to python objects ------------------------------------------------------------------------- - -In order to use your dataset in other applications, you can save your dataset in non-arrow formats. Currently natively supported are: - -* CSV: :func:`datasets.Dataset.to_csv` -* JSON/JSON Lines: :func:`datasets.Dataset.to_json` (JSON Lines by default, JSON with ``lines=False``) -* Parquet: :func:`datasets.Dataset.to_parquet` - -To get python objects directly, you can use :func:`datasets.Dataset.to_pandas` or :func:`datasets.Dataset.to_dict` to export the dataset as a pandas DataFrame or a python dict. - -Controlling the cache behavior ------------------------------------ - -When applying transforms on a dataset, the data are stored in cache files. -The caching mechanism allows to reload an existing cache file if it's already been computed. - -Reloading a dataset is possible since the cache files are named using the dataset fingerprint, which is updated after each transform. - -Note that the caching extends beyond sessions. Re-running the very same dataset processing methods (in the same order and on the same data files) in a different session will load from the same cache files. -This is possible thanks to a custom hashing function that works with most python objects (see fingerprinting section below). 
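If you want to see which cache files currently back a dataset, or to clean up the reusable cache files created by the transforms you applied, you can use the short sketch below. It relies on the ``cache_files`` attribute and on :func:`datasets.Dataset.cleanup_cache_files`; the exact file paths and the number of removed files depend on your environment:

.. code-block::

    >>> # List the Arrow cache files currently backing the dataset
    >>> encoded_dataset.cache_files
    >>> # Remove the cache files created by previous transforms (returns how many files were removed)
    >>> encoded_dataset.cleanup_cache_files()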
- - -Fingerprinting -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The fingerprint of a dataset in a given state is an internal value computed by combining the fingerprint of the previous state and a hash of the latest transform that was applied (transforms are all the processing methods for transforming a dataset that we listed in this chapter: :func:`datasets.Dataset.map`, :func:`datasets.Dataset.shuffle`, etc). The initial fingerprint is computed using a hash of the arrow table, or a hash of the arrow files if the dataset lives on disk. - -For example: - -.. code-block:: - - >>> from datasets import Dataset - >>> dataset1 = Dataset.from_dict({"a": [0, 1, 2]}) - >>> dataset2 = dataset1.map(lambda x: {"a": x["a"] + 1}) - >>> print(dataset1._fingerprint, dataset2._fingerprint) - d19493523d95e2dc 5b86abacd4b42434 - -The new fingerprint is a combination of the previous fingerprint and the hash of the given transform. For a transform to be hashable, it needs to be pickleable using `dill `_ or `pickle `_. In particular for :func:`datasets.Dataset.map`, you need to provide a pickleable processing method to apply on the dataset so that a deterministic fingerprint can be computed by hashing the full state of the provided method (the fingerprint is computed taking into account all the dependencies of the method you provide). -For non-hashable transforms, a random fingerprint is used and a warning is raised. -Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. -If you reuse a non-hashable transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. - -Enable or disable caching -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Locally, you can prevent the library from reloading a cached file by passing ``load_from_cache_file=False`` to transforms like :func:`datasets.Dataset.map` for example. -You can also specify the path where the cache file will be written using the ``cache_file_name`` parameter. - -It is also possible to disable caching globally with :func:`datasets.set_caching_enabled`. - -If caching is disabled, the library will no longer reload cached dataset files when applying transforms to the datasets. -More precisely, if caching is disabled: - -- cache files are always recreated -- cache files are written to a temporary directory that is deleted when the session closes -- cache files are named using a random hash instead of the dataset fingerprint -- use :func:`datasets.Dataset.save_to_disk` to save a transformed dataset or it will be deleted when the session closes -- caching doesn't affect :func:`datasets.load_dataset`. If you want to regenerate a dataset from scratch you should use the ``download_mode`` parameter in :func:`datasets.load_dataset`. - -To disable caching you can run: - -.. code-block:: - - >>> from datasets import set_caching_enabled - >>> set_caching_enabled(False) - -You can also query the current status of the caching with :func:`datasets.is_caching_enabled`. - -Mapping in a distributed setting -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -In a distributed setting, you may use caching and a :func:`torch.distributed.barrier` to make sure that only the main process performs the mapping, while the other ones load its results. This avoids duplicating work between all the processes, or worse, requesting more CPUs than your system can handle. For example: - ..
code-block:: - - >>> from datasets import Dataset - >>> import torch.distributed - - >>> dataset1 = Dataset.from_dict({"a": [0, 1, 2]}) - - >>> if training_args.local_rank > 0: - ... print("Waiting for main process to perform the mapping") - ... torch.distributed.barrier() - - >>> dataset2 = dataset1.map(lambda x: {"a": x["a"] + 1}) - >>> - >>> if training_args.local_rank == 0: - ... print("Loading results from main process") - ... torch.distributed.barrier() - - -When it encounters a barrier, each process will stop until all other processes have reached the barrier. The non-main processes reach the barrier first, before the mapping, and wait there. The main processes creates the cache for the processed dataset. It then reaches the barrier, at which point the other processes resume, and load the cache instead of performing the processing themselves. diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst new file mode 100644 index 00000000000..ff4e8bc0b9c --- /dev/null +++ b/docs/source/quickstart.rst @@ -0,0 +1,185 @@ +Quick Start +=========== + +The quick start is intended for developers who are ready to dive in to the code, and see an end-to-end example of how they can integrate πŸ€— Datasets into their model training workflow. For beginners who are looking for a gentler introduction, we recommend you begin with the :doc:`tutorials <./tutorial>`. + +In the quick start, you will walkthrough all the steps to fine-tune `BERT `_ on a paraphrase classification task. Depending on the specific dataset you use, these steps may vary, but the general steps of how to load a dataset and process it are the same. + +.. tip:: + + For more detailed information on loading and processing a dataset, take a look at `Chapter 3 `_ of the Hugging Face course! It covers additional important topics like dynamic padding, and fine-tuning with the Trainer API. + +Get started by installing πŸ€— Datasets: + +.. code:: + + pip install datasets + +Load the dataset and model +-------------------------- + +Begin by loading the `Microsoft Research Paraphrase Corpus (MRPC) `_ training dataset from the `General Language Understanding Evaluation (GLUE) benchmark `_. MRPC is a corpus of human annotated sentence pairs used to train a model to determine whether sentence pairs are semantically equivalent. + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('glue', 'mrpc', split='train') + +Next, import the pre-trained BERT model and its tokenizer from the `πŸ€— Transformers `_ library: + +.. tab:: PyTorch + + >>> from transformers import AutoModelForSequenceClassification, AutoTokenizer + >>> model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased') + Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias'] + - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model). 
+ - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). + Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias'] + You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. + >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') + +.. tab:: TensorFlow + + >>> from transformers import TFAutoModelForSequenceClassification, AutoTokenizer + >>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased") + Some weights of the model checkpoint at bert-base-cased were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls'] + - This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model). + - This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). + Some weights of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['dropout_37', 'classifier'] + You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. + >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') + +Tokenize the dataset +-------------------- + +The next step is to tokenize the text in order to build sequences of integers the model can understand. Encode the entire dataset with :func:`datasets.Dataset.map`, and truncate and pad the inputs to the maximum length of the model. This ensures the appropriate tensor batches are built. + +.. code-block:: + + >>> def encode(examples): + ... return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True, padding='max_length') + + >>> dataset = dataset.map(encode, batched=True) + >>> dataset[0] + {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', + 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', + 'label': 1, + 'idx': 0, + 'input_ids': array([ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102]), + 'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), + 'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])} + +Notice how there are three new columns in the dataset: ``input_ids``, ``token_type_ids``, and ``attention_mask``. These columns are the inputs to the model. 
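If you want to sanity-check the encoding, you can decode the ``input_ids`` of an example back into text. This is an optional sketch; since the example above used ``padding='max_length'``, the decoded string ends with a long run of ``[PAD]`` tokens (output abridged):

.. code-block::

    >>> # Decode the first example to verify that both sentences were packed into one sequence
    >>> tokenizer.decode(dataset[0]['input_ids'])
    '[CLS] Amrozi accused his brother, whom he called " the witness ", of deliberately distorting his evidence. [SEP] Referring to him as only " the witness ", Amrozi accused his brother of deliberately distorting his evidence. [SEP] [PAD] [PAD] ...'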
+ +Format the dataset +------------------ + +Depending on whether you are using PyTorch, TensorFlow, or JAX, you will need to format the dataset accordingly. There are three changes you need to make to the dataset: + +1. Rename the ``label`` column to ``labels``, the expected input name in `BertForSequenceClassification `__ or `TFBertForSequenceClassification `__: + +.. code:: + + >>> dataset = dataset.map(lambda examples: {'labels': examples['label']}, batched=True) + +2. Retrieve the actual tensors from the Dataset object instead of using the current Python objects. +3. Filter the dataset to only return the model inputs: ``input_ids``, ``token_type_ids``, and ``attention_mask``. + +:func:`datasets.Dataset.set_format` completes the last two steps on-the-fly. After you set the format, wrap the dataset in ``torch.utils.data.DataLoader`` or ``tf.data.Dataset``: + +.. tab:: PyTorch + + >>> import torch + >>> dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels']) + >>> dataloader = torch.utils.data.DataLoader(dataset, batch_size=32) + >>> next(iter(dataloader)) + {'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0], + [1, 1, 1, ..., 0, 0, 0], + [1, 1, 1, ..., 0, 0, 0], + ..., + [1, 1, 1, ..., 0, 0, 0], + [1, 1, 1, ..., 0, 0, 0], + [1, 1, 1, ..., 0, 0, 0]]), + 'input_ids': tensor([[ 101, 7277, 2180, ..., 0, 0, 0], + [ 101, 10684, 2599, ..., 0, 0, 0], + [ 101, 1220, 1125, ..., 0, 0, 0], + ..., + [ 101, 16944, 1107, ..., 0, 0, 0], + [ 101, 1109, 11896, ..., 0, 0, 0], + [ 101, 1109, 4173, ..., 0, 0, 0]]), + 'label': tensor([1, 0, 1, 0, 1, 1, 0, 1]), + 'token_type_ids': tensor([[0, 0, 0, ..., 0, 0, 0], + [0, 0, 0, ..., 0, 0, 0], + [0, 0, 0, ..., 0, 0, 0], + ..., + [0, 0, 0, ..., 0, 0, 0], + [0, 0, 0, ..., 0, 0, 0], + [0, 0, 0, ..., 0, 0, 0]])} + +.. tab:: TensorFlow + + >>> import tensorflow as tf + >>> dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels']) + >>> features = {x: dataset[x].to_tensor(default_value=0, shape=[None, tokenizer.model_max_length]) for x in ['input_ids', 'token_type_ids', 'attention_mask']} + >>> tfdataset = tf.data.Dataset.from_tensor_slices((features, dataset["labels"])).batch(32) + >>> next(iter(tfdataset)) + ({'input_ids': , 'token_type_ids': , 'attention_mask': }, ) + +Train the model +--------------- + +Lastly, create a simple training loop and start training: + +.. tab:: PyTorch + + >>> from tqdm import tqdm + >>> device = 'cuda' if torch.cuda.is_available() else 'cpu' + >>> model.train().to(device) + >>> optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5) + >>> for epoch in range(3): + ... for i, batch in enumerate(tqdm(dataloader)): + ... batch = {k: v.to(device) for k, v in batch.items()} + ... outputs = model(**batch) + ... loss = outputs[0] + ... loss.backward() + ... optimizer.step() + ... optimizer.zero_grad() + ... if i % 10 == 0: + ... print(f"loss: {loss}") + +.. tab:: TensorFlow + + >>> loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True) + >>> opt = tf.keras.optimizers.Adam(learning_rate=3e-5) + >>> model.compile(optimizer=opt, loss=loss_fn, metrics=["accuracy"]) + >>> model.fit(tfdataset, epochs=3) + +What's next? +------------ + +This completes the basic steps of loading a dataset to train a model. You loaded and processed the MRPC dataset to fine-tune BERT to determine whether sentence pairs have the same meaning. 
+ +For your next steps, take a look at our :doc:`How-to guides <./how_to>` and learn how to achieve a specific task (e.g. load a dataset offline, add a dataset to the Hub, change the name of a column). Or if you want to deepen your knowledge of πŸ€— Datasets core concepts, read our :doc:`Conceptual Guides <./about_arrow>`. \ No newline at end of file diff --git a/docs/source/quicktour.rst b/docs/source/quicktour.rst deleted file mode 100644 index d6dc132710c..00000000000 --- a/docs/source/quicktour.rst +++ /dev/null @@ -1,292 +0,0 @@ -Quick tour -========== - -Let's have a quick look at the πŸ€— Datasets library. This library has three main features: - -- It provides a very **efficient way to load and process data** from raw files (CSV/JSON/text) or in-memory data (python dict, pandas dataframe) with a special focus on memory efficiency and speed. As a matter of example, loading a 18GB dataset like English Wikipedia allocate 9 MB in RAM and you can iterate over the dataset at 1-2 GBit/s in python. -- It provides a very **simple way to access and share datasets** with the research and practitioner communities (over 1,000 datasets are already accessible in one line with the library as we'll see below). -- It was designed with a particular focus on interoperabilty with frameworks like **pandas, NumPy, PyTorch and TensorFlow**. - -πŸ€— Datasets provides datasets for many NLP tasks like text classification, question answering, language modeling, etc., and obviously these datasets can always be used for other tasks than their originally assigned task. Let's list all the currently provided datasets using :func:`datasets.list_datasets`: - -.. code-block:: - - >>> from datasets import list_datasets - >>> datasets_list = list_datasets() - >>> len(datasets_list) - 1067 - >>> print(', '.join(dataset for dataset in datasets_list)) - acronym_identification, ade_corpus_v2, adversarial_qa, aeslc, afrikaans_ner_corpus, ag_news, ai2_arc, air_dialogue, ajgt_twitter_ar, - allegro_reviews, allocine, alt, amazon_polarity, amazon_reviews_multi, amazon_us_reviews, ambig_qa, amttl, anli, app_reviews, aqua_rat, - aquamuse, ar_cov19, ar_res_reviews, ar_sarcasm, arabic_billion_words, arabic_pos_dialect, arabic_speech_corpus, arcd, arsentd_lev, art, - arxiv_dataset, ascent_kb, aslg_pc12, asnq, asset, assin, assin2, atomic, autshumato, babi_qa, banking77, bbaw_egyptian, bbc_hindi_nli, - bc2gm_corpus, best2009, bianet, bible_para, big_patent, billsum, bing_coronavirus_query_set, biomrc, blended_skill_talk, blimp, - blog_authorship_corpus, bn_hate_speech [...] - -All these datasets can also be browsed on the `HuggingFace Hub `__ and can be viewed and explored online with the `πŸ€— Datasets viewer `__. - -Loading a dataset --------------------- - -Now let's load a simple dataset for classification, we'll use the MRPC dataset provided in the GLUE banchmark which is small enough for quick prototyping. You can explore this dataset and read more details `on the online viewer here `__: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('glue', 'mrpc', split='train') - -When typing this command for the first time, a processing script called a ``builder`` which is in charge of loading the MRPC/GLUE dataset will be downloaded, cached and imported. Then the dataset files themselves are downloaded and cached (usually from the original dataset URLs) and are processed to return a :class:`datasets.Dataset` comprising the training split of MRPC/GLUE as requested here. 
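The :obj:`split` argument is not limited to full split names: you can also request slices or combinations of splits. Here is a small sketch (the exact number of returned examples depends on the dataset):

.. code-block::

    >>> # Load only the first 10% of the training split
    >>> train_10pct = load_dataset('glue', 'mrpc', split='train[:10%]')
    >>> # Combine the last 80% of the training split with the validation split
    >>> train_and_validation = load_dataset('glue', 'mrpc', split='train[20%:]+validation')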
- -If you want to create a :class:`datasets.Dataset` from local CSV, JSON, text or pandas files instead of a community provided dataset, you can use one of the ``csv``, ``json``, ``text`` or ``pandas`` builder. They all accept a variety of file paths as inputs: a path to a single file, a list of paths to files or a dict of paths to files for each split. Here are some examples to load from CSV files: - -.. code-block:: - - >>> from datasets import load_dataset - >>> dataset = load_dataset('csv', data_files='my_file.csv') - >>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv']) - >>> dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], - >>> 'test': 'my_test_file.csv'}) - -.. note:: - - If you don't provide a :obj:`split` argument to :func:`datasets.load_dataset`, this method will return a dictionary containing a dataset for each split in the dataset. This dictionary is a :obj:`datasets.DatasetDict` object that lets you process all the splits at once using :func:`datasets.DatasetDict.map`, :func:`datasets.DatasetDict.filter`, etc. - -Now let's have a look at our newly created :class:`datasets.Dataset` object. It basically behaves like a normal python container. You can query its length, get a single row but also get multiple rows and even index along columns (see all the details in :doc:`exploring `): - -.. code-block:: - - >>> len(dataset) - 3668 - >>> dataset[0] - {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', - 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', - 'label': 1, - 'idx': 0} - -A lot of metadata are available in the dataset attributes (description, citation, split sizes, etc) and we'll dive in this in the :doc:`exploring ` page. -We'll just say here that :class:`datasets.Dataset` has columns which are typed with types which can be arbitrarily nested complex types (e.g. list of strings or list of lists of int64 values). - -Let's take a look at the column in our dataset by printing its :attr:`datasets.Dataset.features`: - -.. code-block:: - - >>> dataset.features - {'idx': Value(dtype='int32', id=None), - 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), - 'sentence1': Value(dtype='string', id=None), - 'sentence2': Value(dtype='string', id=None)} - -Fine-tuning a deep-learning model ------------------------------------------- - -In the rest of this quick-tour we will use this dataset to fine-tune a Bert model on the sentence pair classification task of Paraphrase Classification. Let's have a quick look at our task. - -As you can see from the above features, the labels are a :class:`datasets.ClassLabel` instance with two classes: ``not_equivalent`` and ``equivalent``. - -We can print one example of each class using :func:`datasets.Dataset.filter` and a name-to-integer conversion method of the feature :class:`datasets.ClassLabel` called :func:`datasets.ClassLabel.str2int` (we explain these methods in more detail in :doc:`processing ` and :doc:`exploring `): - -.. 
code-block:: - - >>> dataset.filter(lambda example: example['label'] == dataset.features['label'].str2int('equivalent'))[0] - {'idx': 0, - 'label': 1, - 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', - 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .' - } - >>> dataset.filter(lambda example: example['label'] == dataset.features['label'].str2int('not_equivalent'))[0] - {'idx': 1, - 'label': 0, - 'sentence1': "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", - 'sentence2': "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 ." - } - -Now our goal will be to train a model which can predict the correct label (``not_equivalent`` or ``equivalent``) from a pair of sentences. - -Let's import a pretrained Bert model and its tokenizer using πŸ€— Transformers. - -.. code-block:: - - >>> ## PYTORCH CODE - >>> from transformers import AutoModelForSequenceClassification, AutoTokenizer - >>> model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased') - Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias'] - - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model). - - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). - Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias'] - You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. - >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') - >>> ## TENSORFLOW CODE - >>> from transformers import TFAutoModelForSequenceClassification, AutoTokenizer - >>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased") - Some weights of the model checkpoint at bert-base-cased were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls'] - - This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model). - - This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). 
- Some weights of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['dropout_37', 'classifier'] - You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. - >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') - -πŸ€— Transformers warns us that we should probably train this model on a downstream task before using it which is exactly what we are going to do. -If you want more details on the models and tokenizers of πŸ€— Transformers, you should refer to the documentation and tutorials of this library `which are available here `__. - -Tokenizing the dataset -^^^^^^^^^^^^^^^^^^^^^^ - -The first step is to tokenize our sentences in order to build sequences of integers that our model can digest from the pairs of sequences. Bert's tokenizer knows how to do that and we can simply feed it with a pair of sentences as inputs to generate the right inputs for our model: - -.. code-block:: - - >>> print(tokenizer(dataset[0]['sentence1'], dataset[0]['sentence2'])) - {'input_ids': [101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102], - 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - >>> tokenizer.decode(tokenizer(dataset[0]['sentence1'], dataset[0]['sentence2'])['input_ids']) - '[CLS] Amrozi accused his brother, whom he called " the witness ", of deliberately distorting his evidence. [SEP] Referring to him as only " the witness ", Amrozi accused his brother of deliberately distorting his evidence. [SEP]' - -As you can see, the tokenizer has merged the pair of sequences in a single input separating them by some special tokens ``[CLS]`` and ``[SEP]`` expected by Bert. For more details on this, you can refer to `πŸ€— Transformers's documentation on data processing `__. - -In our case, we want to tokenize our full dataset, so we will use a method called :func:`datasets.Dataset.map` to apply the encoding process to the whole dataset. -To be sure we can easily build tensor batches for our model, we will truncate and pad the inputs to the max length of our model. - -.. 
code-block:: - - >>> def encode(examples): - >>> return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True, padding='max_length') - >>> - >>> dataset = dataset.map(encode, batched=True) - 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:02<00:00, 1.75it/s] - >>> dataset[0] - {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', - 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', - 'label': 1, - 'idx': 0, - 'input_ids': array([ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102]), - 'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), - 'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])} - -This operation has added three new columns to our dataset: ``input_ids``, ``token_type_ids`` and ``attention_mask``. These are the inputs our model needs for training. - -.. note:: - - Note that this is not the most efficient padding strategy, we could also avoid padding at this stage and use ``tokenizer.pad`` as the ``collate_fn`` method in the ``torch.utils.data.DataLoader`` further below. - -Formatting the dataset -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Now that we have encoded our dataset, we want to use it in a ``torch.Dataloader`` or a ``tf.data.Dataset`` and use it to train our model. - -To be able to train our model with this dataset and PyTorch, we will need to do three modifications: - -- rename our ``label`` column in ``labels`` which is the expected input name for labels in `BertForSequenceClassification `__ or `TFBertForSequenceClassification `__, -- get pytorch (or tensorflow, or jax) tensors out of our :class:`datasets.Dataset`, instead of python objects, and -- filter the columns to return only the subset of the columns that we need for our model inputs (``input_ids``, ``token_type_ids`` and ``attention_mask``). - -.. note:: - - We don't want the columns `sentence1` or `sentence2` as inputs to train our model, but we could still want to keep them in the dataset, for instance for the evaluation of the model. πŸ€— Datasets let you control the output format of :func:`datasets.Dataset.__getitem__` to just mask them as detailed in :doc:`exploring <./exploring>`. - -The first modification is just a matter of renaming the column as follows (we could have done it during the tokenization process as well): - -.. 
code-block:: - - >>> dataset = dataset.map(lambda examples: {'labels': examples['label']}, batched=True) - -The two other modifications can be handled by the :func:`datasets.Dataset.set_format` method which will convert, on the fly, the returned output from :func:`datasets.Dataset.__getitem__` to filter the unwanted columns and convert python objects in PyTorch tensors. - -Here is how we can apply the right format to our dataset using :func:`datasets.Dataset.set_format` and wrap it in a ``torch.utils.data.DataLoader`` or a ``tf.data.Dataset``: - -.. code-block:: - - >>> ## PYTORCH CODE - >>> import torch - >>> dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels']) - >>> dataloader = torch.utils.data.DataLoader(dataset, batch_size=32) - >>> next(iter(dataloader)) - {'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0], - [1, 1, 1, ..., 0, 0, 0], - [1, 1, 1, ..., 0, 0, 0], - ..., - [1, 1, 1, ..., 0, 0, 0], - [1, 1, 1, ..., 0, 0, 0], - [1, 1, 1, ..., 0, 0, 0]]), - 'input_ids': tensor([[ 101, 7277, 2180, ..., 0, 0, 0], - [ 101, 10684, 2599, ..., 0, 0, 0], - [ 101, 1220, 1125, ..., 0, 0, 0], - ..., - [ 101, 16944, 1107, ..., 0, 0, 0], - [ 101, 1109, 11896, ..., 0, 0, 0], - [ 101, 1109, 4173, ..., 0, 0, 0]]), - 'label': tensor([1, 0, 1, 0, 1, 1, 0, 1]), - 'token_type_ids': tensor([[0, 0, 0, ..., 0, 0, 0], - [0, 0, 0, ..., 0, 0, 0], - [0, 0, 0, ..., 0, 0, 0], - ..., - [0, 0, 0, ..., 0, 0, 0], - [0, 0, 0, ..., 0, 0, 0], - [0, 0, 0, ..., 0, 0, 0]])} - >>> ## TENSORFLOW CODE - >>> import tensorflow as tf - >>> dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels']) - >>> features = {x: dataset[x].to_tensor(default_value=0, shape=[None, tokenizer.model_max_length]) for x in ['input_ids', 'token_type_ids', 'attention_mask']} - >>> tfdataset = tf.data.Dataset.from_tensor_slices((features, dataset["labels"])).batch(32) - >>> next(iter(tfdataset)) - ({'input_ids': , 'token_type_ids': , 'attention_mask': }, ) - - -We are now ready to train our model. Let's write a simple training loop and start the training: - -.. 
code-block:: - - >>> ## PYTORCH CODE - >>> from tqdm import tqdm - >>> device = 'cuda' if torch.cuda.is_available() else 'cpu' - >>> model.train().to(device) - >>> optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5) - >>> for epoch in range(3): - >>> for i, batch in enumerate(tqdm(dataloader)): - >>> batch = {k: v.to(device) for k, v in batch.items()} - >>> outputs = model(**batch) - >>> loss = outputs[0] - >>> loss.backward() - >>> optimizer.step() - >>> optimizer.zero_grad() - >>> if i % 10 == 0: - >>> print(f"loss: {loss}") - >>> ## TENSORFLOW CODE - >>> loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True) - >>> opt = tf.keras.optimizers.Adam(learning_rate=3e-5) - >>> model.compile(optimizer=opt, loss=loss_fn, metrics=["accuracy"]) - >>> model.fit(tfdataset, epochs=3) - - -Now this was a very simple tour, you should continue with either the detailed notebook which is `here `__ or the in-depth guides on - -- :doc:`loading datasets <./loading_datasets>` -- :doc:`exploring the dataset object attributes <./exploring>` -- :doc:`processing dataset data <./processing>` -- :doc:`indexing a dataset with FAISS or Elastic Search <./faiss_and_ea>` -- :doc:`Adding new datasets <./add_dataset>` -- :doc:`Sharing datasets <./share_dataset>` diff --git a/docs/source/share.rst b/docs/source/share.rst new file mode 100644 index 00000000000..ba8ee7d2ddd --- /dev/null +++ b/docs/source/share.rst @@ -0,0 +1,223 @@ +Share +====== + +At Hugging Face, we are on a mission to democratize NLP and we believe in the value of open source. That's why we designed πŸ€— Datasets so that anyone can share a dataset with the greater NLP community. There are currently over 900 datasets in over 100 languages in the Hugging Face Hub, and the Hugging Face team always welcomes new contributions! + +This guide will show you how to share a dataset that can be easily accessed by anyone. + +There are two options to share a new dataset: + +- Directly upload it on the Hub as a community provided dataset. +- Add it as a canonical dataset by opening a pull-request on the `GitHub repository for πŸ€— Datasets `__. + +Community vs. canonical +----------------------- + +Both options offer the same features such as: + +- Dataset versioning +- Commit history and diffs +- Metadata for discoverability +- Dataset cards for documentation, licensing, limitations, etc. + +The main differences between the two are highlighted in the table below: + +.. list-table:: + :header-rows: 1 + + * - Community datasets + - Canonical datasets + * - Faster to share, no review process. + - Slower to add, needs to be reviewed. + * - Data files can be stored on the Hub. + - Data files are typically retrieved from the original host URLs. + * - Identified by a user or organization namespace like **thomwolf/my_dataset** or **huggingface/our_dataset**. + - Identified by a root namespace. Need to select a short name that is available. + * - Requires data files and/or a dataset loading script. + - Always requires a dataset loading script. + * - Flagged as **unsafe** because the dataset contains executable code. + - Flagged as **safe** because the dataset has been reviewed. + +For community datasets, if your dataset is in a supported format, you can skip directly below to learn how to upload your files and add a :doc:`dataset card `. There is no need to write your own dataset loading script (unless you want more control over how to load your dataset). 
However, if the dataset isn't in one of the supported formats, you will need to write a :doc:`dataset loading script `. The dataset loading script is a Python script that defines the dataset splits, feature types, and how to download and process the data. + +On the other hand, a dataset script is always required for canonical datasets. + +.. important:: + + The distinction between a canonical and community dataset is based solely on the selected sharing workflow. It does not involve any ranking, decisioning, or opinion regarding the contents of the dataset itself. + +.. _upload_dataset_repo: + +Add a community dataset +----------------------- + +You can share your dataset with the community with a dataset repository on the Hugging Face Hub. +In a dataset repository, you can either host all your data files and/or use a dataset script. + +The dataset script is optional if your dataset is in one of the following formats: CSV, JSON, JSON lines, text or Parquet. +The script also supports many kinds of compressed file types such as: GZ, BZ2, LZ4, LZMA or ZSTD. +For example, your dataset can be made of ``.json.gz`` files. + +On the other hand, if your dataset is not in a supported format or if you want more control over how your dataset is loaded, you can write your own dataset script. + +When loading a dataset from the Hub: + +- If there's no dataset script, all the files in the supported formats are loaded. +- If there's a dataset script, it is downloaded and executed to download and prepare the dataset. + +For more information on how to load a dataset from the Hub, see how to load from the :ref:`load-from-the-hub`. + +Create the repository +^^^^^^^^^^^^^^^^^^^^^ + +Sharing a community dataset will require you to create an account on `hf.co `_ if you don't have one yet. +You can directly create a `new dataset repository `_ from your account on the Hugging Face Hub, but this guide will show you how to upload a dataset from the terminal. + +1. Make sure you are in the virtual environment where you installed Datasets, and run the following command: + +.. code:: + + huggingface-cli login + +2. Login using your Hugging Face Hub credentials, and create a new dataset repository: + +.. code:: + + huggingface-cli repo create your_dataset_name --type dataset + +Add the ``-organization`` flag to create a repository under a specific organization: + +.. code:: + + huggingface-cli repo create your_dataset_name --type dataset --organization your-org-name + +Clone the repository +^^^^^^^^^^^^^^^^^^^^ + +3. Install `Git LFS `_ and clone your repository: + +.. code-block:: + + # Make sure you have git-lfs installed + # (https://git-lfs.github.com/) + git lfs install + + git clone https://huggingface.co/datasets/namespace/your_dataset_name + +Here the ``namespace`` is either your username or your organization name. + +Prepare your files +^^^^^^^^^^^^^^^^^^ + +4. Now is a good time to check your directory to ensure the only files you're uploading are: + +* ``README.md`` is a Dataset card that describes the datasets contents, creation, and usage. To write a Dataset card, see the :doc:`dataset card ` page. + +* The raw data files of the dataset (optional, if they are hosted elsewhere you can specify the URLs in the dataset script). + +* ``your_dataset_name.py`` is your dataset loading script (optional if your data files are already in the supported formats csv/jsonl/json/parquet/txt). To create a dataset script, see the :doc:`dataset script ` page. 
+ +* ``dataset_infos.json`` contains metadata about the dataset (required only if you have a dataset script). + +Upload your files +^^^^^^^^^^^^^^^^^ + +You can directly upload your files from your repository on the Hugging Face Hub, but this guide will show you how to upload the files from the terminal. + +5. It is important to add the large data files first with ``git lfs track`` or else you will encounter an error later when you push your files: + +.. code-block:: + + cp /somewhere/data/*.json . + git lfs track *.json + git add .gitattributes + git add *.json + git commit -m "add json files" + +6. Add the dataset loading script and metadata file: + +.. code-block:: + + cp /somewhere/data/dataset_infos.json . + cp /somewhere/data/load_script.py . + git add --all + +7. Verify the files have been correctly staged. Then you can commit and push your files: + +.. code-block:: + + git status + git commit -m "First version of the your_dataset_name dataset." + git push + + +Congratulations, your dataset has now been uploaded to the Hugging Face Hub where anyone can load it in a single line of code! πŸ₯³ + +.. code:: + + dataset = load_dataset("namespace/your_dataset_name") + +Add a canonical dataset +----------------------- + +Canonical datasets are dataset scripts hosted in the GitHub repository of the πŸ€— Dataset library. +The code of these datasets are reviewed by the Hugging Face team, and they require test data in order to be regularly tested. + +Clone the repository +^^^^^^^^^^^^^^^^^^^^ + +To share a canonical dataset: + +1. Fork the πŸ€— `Datasets repository `_ by clicking on the **Fork** button. + +2. Clone your fork to your local disk, and add the base repository as a remote: + +.. code-block:: + + git clone https://github.com//datasets + cd datasets + git remote add upstream https://github.com/huggingface/datasets.git + +Prepare your files +^^^^^^^^^^^^^^^^^^ + +3. Create a new branch to hold your changes. You can name the new branch using the short name of your dataset: + +.. code:: + + git checkout -b my-new-dataset + +4. Set up a development environment by running the following command in a virtual environment: + +.. code:: + + pip install -e ".[dev]" + +5. Create a new folder with the dataset name inside ``huggingface/datasets``, and add the dataset loading script. To create a dataset script, see the :doc:`dataset script ` page. + +6. Check your directory to ensure the only files you're uploading are: + +* ``README.md`` is a Dataset card that describes the datasets contents, creation, and usage. To write a Dataset card, see the :doc:`dataset card ` page. + +* ``your_dataset_name.py`` is your dataset loading script. + +* ``dataset_infos.json`` contains metadata about the dataset. + +* ``dummy`` folder with ``dummy_data.zip`` files that hold a small subset of data from the dataset for tests and preview. + +7. Run `Black `_ and `isort `_ to tidy up your code and files: + +.. code-block:: + + make style + make quality + +8. Add your changes, and make a commit to record your changes locally. Then you can push the changes to your account: + +.. code-block:: + + git add datasets/ + git commit + git push -u origin my-new-dataset + +9. Go back to your fork on GitHub, and click on **Pull request** to open a pull request on the main πŸ€— `Datasets repository `_ for review. 
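+
+Before opening the pull request, it can help to check that your loading script builds the dataset end to end. Below is a minimal, non-official sketch of such a check; it assumes your script lives under ``datasets/your_dataset_name`` (a placeholder name) and that your installed version of πŸ€— Datasets supports loading from a local script directory:
+
+.. code-block::
+
+    >>> from datasets import load_dataset
+    >>> # Build the dataset locally from your loading script instead of downloading it from the Hub
+    >>> dataset = load_dataset("./datasets/your_dataset_name")
+
+If the script downloads the data and generates the splits without errors, the dataset is ready for review.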
diff --git a/docs/source/share_dataset.rst b/docs/source/share_dataset.rst deleted file mode 100644 index 3f8fca5dcfb..00000000000 --- a/docs/source/share_dataset.rst +++ /dev/null @@ -1,587 +0,0 @@ -Sharing your dataset -============================================= - -Once you have your dataset, you may want to share it with the community for instance on the `HuggingFace Hub `__. There are two options to do that: - -- directly upload it on the Hub as a community provided dataset. -- add it as a canonical dataset by opening a pull-request on the `GitHub repository for πŸ€— Datasets `__, - -Both options offer the same features such as: - -- dataset versioning -- commit history and diffs -- metadata for discoverability -- dataset cards for documentation, licensing, limitations, etc. - -Here are the main differences between these two options. - -- **Community provided** datasets: - * are faster to share (no reviewing process) - * can contain the data files themselves on the Hub - * are identified under the namespace of a user or organization: ``thomwolf/my_dataset`` or ``huggingface/our_dataset`` - * are flagged as ``unsafe`` by default because a dataset may contain executable code so the users need to inspect and opt-in to use the datasets - -- **Canonical** datasets: - * are slower to add (need to go through the reviewing process on the githup repo) - * are identified under the root namespace (``my_dataset``) so they need to select a shortname which is still free - * usually don't contain the data files which are retrieved from the original URLs (but this can be changed under specific request to add the files to the Hub) - * are flagged as ``safe`` by default since they went through the reviewing process (no need to opt-in). - -.. note:: - - The distinctions between "community provided" and "canonical" datasets is made purely based on the selected sharing workflow and don't involve any ranking, decision or opinion regarding the content of the dataset it-self. - -.. _community-dataset: - -Sharing a "community provided" dataset ------------------------------------------ - -In this page, we will show you how to share a dataset with the community on the `πŸ€— Datasets Hub `__. - -.. note:: - - You will need to create an account on `huggingface.co `__ for this. - - Optionally, you can join an existing organization or create a new one. - -Prepare your dataset for uploading -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -You can either have your dataset in a supported format (csv/jsonl/json/parquet/txt), or use a dataset script to define how to load your data. - -If your dataset is in a supported format, you're all set ! -Otherwise, you need a dataset script. It simply is a python script and its role is to define: - -- the feature types of your data -- how your dataset is split into train/validation/test (or any other splits) -- how to download the data -- how to process the data - -The dataset script is mandatory if your dataset is not in the supported formats, or if you need more control on how to define our dataset. - -We have seen in the :doc:`dataset script tutorial `: how to write a dataset loading script. Let's see how you can share it on the -`πŸ€— Datasets Hub `__. - -Dataset versioning -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Since version 2.0, the πŸ€— Datasets Hub has built-in dataset versioning based on git and git-lfs. 
It is based on the paradigm -that one dataset *is* one repo. - -This allows: - -- built-in versioning -- access control -- scalability - -This is built around *revisions*, which is a way to pin a specific version of a dataset, using a commit hash, tag or -branch. - -For instance: - -.. code-block:: - - >>> dataset = load_dataset( - >>> "lhoestq/custom_squad", - >>> script_version="main" # tag name, or branch name, or commit hash - >>> ) - -Basic steps -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -In order to upload a dataset, you'll need to first create a git repo. This repo will live on the πŸ€— Datasets Hub, allowing -users to clone it and you (and your organization members) to push to it. - -You can create a dataset repo directly from `the /new-dataset page on the website `__. - -Alternatively, you can use the ``huggingface-cli``. The next steps describe that process: - -Go to a terminal and run the following command. It should be in the virtual environment where you installed πŸ€— -Datasets, since that command :obj:`huggingface-cli` comes from the library. - -.. code-block:: bash - - huggingface-cli login - - -Once you are logged in with your πŸ€— Datasets Hub credentials, you can start building your repositories. To create a repo: - -.. code-block:: bash - - huggingface-cli repo create your_dataset_name --type dataset - - -If you want to create a repo under a specific organization, you should add a `--organization` flag: - -.. code-block:: bash - - huggingface-cli repo create your_dataset_name --type dataset --organization your-org-name - - -This creates a repo on the πŸ€— Datasets Hub, which can be cloned. - -.. code-block:: bash - - # Make sure you have git-lfs installed - # (https://git-lfs.github.com/) - git lfs install - - git clone https://huggingface.co/datasets/username/your_dataset_name - -When you have your local clone of your repo and lfs installed, you can then add/remove from that clone as you would -with any other git repo. - -.. code-block:: bash - - # Commit as usual - cd your_dataset_name - echo "hello" >> README.md - git add . && git commit -m "Update from $USER" - -We are intentionally not wrapping git too much, so that you can go on with the workflow you're used to and the tools -you already know. - -The only learning curve you might have compared to regular git is the one for git-lfs. The documentation at -`git-lfs.github.com `__ is decent, but we'll work on a tutorial with some tips and tricks -in the coming weeks! - -Additionally, if you want to change multiple repos at once, the `change_config.py script -`__ can probably save you some time. - - -Check the directory before pushing to the πŸ€— Datasets Hub. -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Make sure there are no garbage files in the directory you'll upload. It should only have: - -- a `your_dataset_name.py` file, which is the dataset script (optional if your data files are already in the supported formats csv/jsonl/json/parquet/txt); -- the raw data files (json, csv, txt, mp3, png, etc.) that you need for your dataset -- an optional `dataset_infos.json` file, which contains metadata about your dataset like the split sizes; -- optional dummy data files, which contains only a small subset from the dataset for tests and preview; - -Other files can safely be deleted. 
- - -Uploading your files -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Once the repo is cloned, If you need to add data files, instead of relying on the data to be hosted -elsewhere, add these files using the following steps. Let's say that the files you're adding are ``*.json`` files, then: - -.. code-block:: bash - - cp /somewhere/data/*.json . - git lfs track *.json - git add .gitattributes - git add *.json - git commit -m "add json files" - -It's crucial that ``git lfs track`` gets run on the large data files before ``git add``. If later during ``git push`` you get the error: - -.. code-block:: bash - - remote: Your push was rejected because it contains files larger than 10M. - remote: Please use https://git-lfs.github.com/ to store larger files. - -it means you ``git add``\ed the data files before telling ``lfs`` to track those. - -Now you can add the dataset script and `dataset_infos.json` file: - -.. code-block:: bash - - cp /somewhere/data/dataset_infos.json . - cp /somewhere/data/load_script.py . - git add --all - -Quickly verify that they have been correctly staged with: - -.. code-block:: bash - - git status - -Finally, the files are ready to be committed and pushed to the remote: - -.. code-block:: bash - - git commit -m "First version of the your_dataset_name dataset." - git push - -This will upload the folder containing the dataset script and dataset infos and data files. - - -Using your dataset -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Your dataset now has a page on huggingface.co/datasets πŸ”₯ - -Anyone can load it from code: - -.. code-block:: - - >>> dataset = load_dataset("namespace/your_dataset_name") - - -If your dataset doesn't have a dataset script, then by default all your data will be loaded in the "train" split. -You can specify which files goes to which split by specifying the ``data_files`` parameter. - -Let's say your dataset repository contains one CSV file for the train split, and one CSV file for your test split. Then you can load it with: - - -.. code-block:: - - >>> data_files = {"train": "train.csv", "test": "test.csv"} - >>> dataset = load_dataset("namespace/your_dataset_name", data_files=data_files) - - -You may specify a version by using the ``script_version`` flag in the ``load_dataset`` function: - -.. code-block:: - - >>> dataset = load_dataset( - >>> "lhoestq/custom_squad", - >>> script_version="main" # tag name, or branch name, or commit hash - >>> ) - -You can find more information in the guide on :doc:`how to load a dataset ` - -.. _canonical-dataset: - -Sharing a "canonical" dataset --------------------------------- - -Add your dataset to the GitHub repository -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -To add a "canonical" dataset to the library, you need to go through the following steps: - -**1. Fork the** `πŸ€— Datasets repository `__ by clicking on the 'Fork' button on the repository's home page. This creates a copy of the code under your GitHub user account. - -**2. Clone your fork** to your local disk, and add the base repository as a remote: - -.. code:: - - git clone https://github.com//datasets - cd datasets - git remote add upstream https://github.com/huggingface/datasets.git - - -**3. Create a new branch** to hold your development changes: - -.. code:: - - git checkout -b my-new-dataset - -.. 
note:: - - **Do not** work on the ``master`` branch. - -**4. Set up a development environment** by running the following command **in a virtual environment**: - -.. code:: - - pip install -e ".[dev]" - -.. note:: - - If πŸ€— Datasets was already installed in the virtual environment, remove - it with ``pip uninstall datasets`` before reinstalling it in editable - mode with the ``-e`` flag. - -**5. Create a new folder with your dataset name** inside the `datasets folder `__ of the repository and add the dataset script you wrote and tested while following the instructions on the :doc:`add_dataset` page. - -**6. Format your code.** Run black and isort so that your newly added files look nice with the following command: - -.. code:: - - make style - make quality - - -**7.** Once you're happy with your dataset script file, add your changes and make a commit to **record your changes locally**: - -.. code:: - - git add datasets/ - git commit - -It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes: - -.. code:: - - git fetch upstream - git rebase upstream/master - -Push the changes to your account using: - -.. code:: - - git push -u origin my-new-dataset - -**8.** We also recommend adding **tests** and **metadata** to the dataset script if possible. Go through the :ref:`adding-tests` section to do so. - -**9.** Once you are satisfied with the dataset, go the webpage of your fork on GitHub and click on "Pull request" to **open a pull-request** on the `main github repository `__ for review. - - -.. _adding-tests: - -Adding tests and metadata to the dataset -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -We recommend adding testing data and checksum metadata to your dataset so its behavior can be tested and verified, and the generated dataset can be certified. In this section we'll explain how you can add two objects to the repository to do just that: - -- ``dummy data`` which are used for testing the behavior of the script (without having to download the full data files), and - -- ``dataset_infos.json`` which are metadata used to store the metadata of the dataset including the data files checksums and the number of examples required to confirm that the dataset generation procedure went well. - -.. note:: - - In the rest of this section, you should make sure that you run all of the commands **from the root** of your local ``datasets`` repository. - -1. Adding metadata -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can check that the new dataset loading script works correctly and create the ``dataset_infos.json`` file at the same time by running the command: - -.. code-block:: - - datasets-cli test datasets/ --save_infos --all_configs - -If the command was succesful, you should now have a ``dataset_infos.json`` file created in the folder of your dataset loading script. Here is a dummy example of the content for a dataset with a single configuration: - -.. 
code-block:: - - { - "default": { - "description": "The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 ...\n", - "citation": "@inproceedings{li-roth-2002-learning,\n title = \"Learning Question Classifiers\",..\",\n}\n", - "homepage": "https://cogcomp.seas.upenn.edu/Data/QA/QC/", - "license": "", - "features": { - "label-coarse": { - "num_classes": 6, - "names": ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"], - "names_file": null, - "id": null, - "_type": "ClassLabel" - }, - "text": { - "dtype": "string", - "id": null, - "_type": "Value" - } - }, - "supervised_keys": null, - "builder_name": "trec", - "config_name": "default", - "version": { - "version_str": "1.1.0", "description": null, - "datasets_version_to_prepare": null, - "major": 1, "minor": 1, "patch": 0 - }, - "splits": { - "train": { - "name": "train", - "num_bytes": 385090, - "num_examples": 5452, - "dataset_name": "trec" - }, - "test": { - "name": "test", - "num_bytes": 27983, - "num_examples": 500, - "dataset_name": "trec" - } - }, - "download_checksums": { - "http://cogcomp.org/Data/QA/QC/train_5500.label": { - "num_bytes": 335858, - "checksum": "9e4c8bdcaffb96ed61041bd64b564183d52793a8e91d84fc3a8646885f466ec3" - }, - "http://cogcomp.org/Data/QA/QC/TREC_10.label": { - "num_bytes": 23354, - "checksum": "033f22c028c2bbba9ca682f68ffe204dc1aa6e1cf35dd6207f2d4ca67f0d0e8e" - } - }, - "download_size": 359212, - "dataset_size": 413073, - "size_in_bytes": 772285 - } - } - -2. Adding dummy data -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Now that we have the metadata prepared we can also create some dummy data for automated testing. You can use the following command to get in-detail instructions on how to create the dummy data: - -.. code-block:: - - datasets-cli dummy_data datasets/ - -This command will output instructions specifically tailored to your dataset and will look like: - -.. code-block:: - - ==============================DUMMY DATA INSTRUCTIONS============================== - - In order to create the dummy data for my-dataset, please go into the folder './datasets/my-dataset/dummy/1.1.0' with `cd ./datasets/my-dataset/dummy/1.1.0` . - - - Please create the following dummy data files 'dummy_data/TREC_10.label, dummy_data/train_5500.label' from the folder './datasets/my-dataset/dummy/1.1.0' - - - For each of the splits 'train, test', make sure that one or more of the dummy data files provide at least one example - - - If the method `_generate_examples(...)` includes multiple `open()` statements, you might have to create other files in addition to 'dummy_data/TREC_10.label, dummy_data/train_5500.label'. In this case please refer to the `_generate_examples(...)` method - - - After all dummy data files are created, they should be zipped recursively to 'dummy_data.zip' with the command `zip -r dummy_data.zip dummy_data/` - - - You can now delete the folder 'dummy_data' with the command `rm -r dummy_data` - - - To get the folder 'dummy_data' back for further changes to the dummy data, simply unzip dummy_data.zip with the command `unzip dummy_data.zip` - - - Make sure you have created the file 'dummy_data.zip' in './datasets/my-dataset/dummy/1.1.0' - =================================================================================== - -There is a tool that automatically generates dummy data for you. At the moment it supports data files in the following format: txt, csv, tsv, jsonl, json, xml. 
-If the extensions of the raw data files of your dataset are in this list, then you can automatically generate your dummy data with: - -.. code-block:: - - datasets-cli dummy_data datasets/ --auto_generate - -Examples: - -.. code-block:: - - datasets-cli dummy_data ./datasets/snli --auto_generate - datasets-cli dummy_data ./datasets/squad --auto_generate --json_field data - datasets-cli dummy_data ./datasets/iwslt2017 --auto_generate --xml_tag seg --match_text_files "train*" --n_lines 15 - # --xml_tag seg => each sample corresponds to a "seg" tag in the xml tree - # --match_text_files "train*" => also match text files that don't have a proper text file extension (no suffix like ".txt" for example) - # --n_lines 15 => some text files have headers so we have to use at least 15 lines - -Usage of the command: - -.. code-block:: - - usage: datasets-cli [] dummy_data [-h] [--auto_generate] - [--n_lines N_LINES] - [--json_field JSON_FIELD] - [--xml_tag XML_TAG] - [--match_text_files MATCH_TEXT_FILES] - [--keep_uncompressed] - [--cache_dir CACHE_DIR] - [--encoding ENCODING] - path_to_dataset - - positional arguments: - path_to_dataset Path to the dataset (example: ./datasets/squad) - - optional arguments: - -h, --help show this help message and exit - --auto_generate Automatically generate dummy data - --n_lines N_LINES Number of lines or samples to keep when auto- - generating dummy data - --json_field JSON_FIELD - Optional, json field to read the data from when auto- - generating dummy data. In the json data files, this - field must point to a list of samples as json objects - (ex: the 'data' field for squad-like files) - --xml_tag XML_TAG Optional, xml tag name of the samples inside the xml - files when auto-generating dummy data. - --match_text_files MATCH_TEXT_FILES - Optional, a comma separated list of file patterns that - looks for line-by-line text files other than *.txt or - *.csv. Example: --match_text_files *.label - --keep_uncompressed Whether to leave the dummy data folders uncompressed - when auto-generating dummy data. Useful for debugging - for to do manual adjustements before compressing. - --cache_dir CACHE_DIR - Cache directory to download and cache files when auto- - generating dummy data - --encoding ENCODING Encoding to use when auto-generating dummy data. - Defaults to utf-8 - - -3. Testing -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Now test that both the real data and the dummy data work correctly. Go back to the root of your datasets folder and use the following command: - -*For the real data*: - -.. code-block:: - - RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_real_dataset_ - - -And *for the dummy data*: - -.. code-block:: - - RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_dataset_all_configs_ - - -If all tests pass, your dataset works correctly. Awesome! You can now follow the last steps of the :ref:`canonical-dataset` or :ref:`community-dataset` sections to share the dataset with the community. If you experienced problems with the dummy data tests, here are some additional tips: - -- Verify that all filenames are spelled correctly. Rerun the command: - -.. code-block:: - - datasets-cli dummy_data datasets/ - -and make sure you follow the exact instructions provided by the command. - -- Your datascript might require a difficult dummy data structure. 
In this case make sure you fully understand the data folder logit created by the function ``_split_generators(...)`` and expected by the function ``_generate_examples(...)`` of your dataset script. Also take a look at `tests/README.md` which lists different possible cases of how the dummy data should be created. - -- If the dummy data tests still fail, open a PR in the main repository on github and make a remark in the description that you need help creating the dummy data and we will be happy to help you. - - -Add a Dataset Card --------------------------------- - -Once your dataset is ready for sharing, feel free to write and add a Dataset Card to document your dataset. - -The Dataset Card is a file ``README.md`` file that you may add in your dataset repository. - -At the top of the Dataset Card, you can define the metadata of your dataset for discoverability: - -- annotations_creators -- language_creators -- languages -- licenses -- multilinguality -- pretty_name -- size_categories -- source_datasets -- task_categories -- task_ids -- paperswithcode_id - -It may contain diverse sections to document all the relevant aspects of your dataset: - -- Dataset Description - - Dataset Summary - - Supported Tasks and Leaderboards - - Languages -- Dataset Structure - - Data Instances - - Data Fields - - Data Splits -- Dataset Creation - - Curation Rationale - - Source Data - - Initial Data Collection and Normalization - - Who are the source language producers? - - Annotations - - Annotation process - - Who are the annotators? - - Personal and Sensitive Information -- Considerations for Using the Data - - Social Impact of Dataset - - Discussion of Biases - - Other Known Limitations -- Additional Information - - Dataset Curators - - Licensing Information - - Citation Information - - Contributions - -You can find more information about each section in the `Dataset Card guide `_. diff --git a/docs/source/splits.rst b/docs/source/splits.rst deleted file mode 100644 index 993d3242681..00000000000 --- a/docs/source/splits.rst +++ /dev/null @@ -1,135 +0,0 @@ -Splits and slicing -=========================== - -Similarly to Tensorfow Datasets, all :class:`DatasetBuilder` s expose various data subsets defined as splits (eg: -``train``, ``test``). When constructing a :class:`datasets.Dataset` instance using either -:func:`datasets.load_dataset()` or :func:`datasets.DatasetBuilder.as_dataset()`, one can specify which -split(s) to retrieve. It is also possible to retrieve slice(s) of split(s) -as well as combinations of those. - -Slicing API ---------------------------------------------------- - -Slicing instructions are specified in :obj:`datasets.load_dataset` or :obj:`datasets.DatasetBuilder.as_dataset`. - -Instructions can be provided as either strings or :obj:`ReadInstruction`. Strings -are more compact and readable for simple cases, while :obj:`ReadInstruction` -might be easier to use with variable slicing parameters. - -Examples -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Examples using the string API: - -.. code-block:: - - # The full `train` split. - train_ds = datasets.load_dataset('bookcorpus', split='train') - - # The full `train` split and the full `test` split as two distinct datasets. - train_ds, test_ds = datasets.load_dataset('bookcorpus', split=['train', 'test']) - - # The full `train` and `test` splits, concatenated together. - train_test_ds = datasets.load_dataset('bookcorpus', split='train+test') - - # From record 10 (included) to record 20 (excluded) of `train` split. 
- train_10_20_ds = datasets.load_dataset('bookcorpus', split='train[10:20]') - - # The first 10% of `train` split. - train_10pct_ds = datasets.load_dataset('bookcorpus', split='train[:10%]') - - # The first 10% of `train` + the last 80% of `train`. - train_10_80pct_ds = datasets.load_dataset('bookcorpus', split='train[:10%]+train[-80%:]') - - # 10-fold cross-validation (see also next section on rounding behavior): - # The validation datasets are each going to be 10%: - # [0%:10%], [10%:20%], ..., [90%:100%]. - # And the training datasets are each going to be the complementary 90%: - # [10%:100%] (for a corresponding validation set of [0%:10%]), - # [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ..., - # [0%:90%] (for a validation set of [90%:100%]). - vals_ds = datasets.load_dataset('bookcorpus', split=[ - f'train[{k}%:{k+10}%]' for k in range(0, 100, 10) - ]) - trains_ds = datasets.load_dataset('bookcorpus', split=[ - f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10) - ]) - - -Examples using the ``ReadInstruction`` API (equivalent as above): - -.. code-block:: - - # The full `train` split. - train_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction('train')) - - # The full `train` split and the full `test` split as two distinct datasets. - train_ds, test_ds = datasets.load_dataset('bookcorpus', split=[ - datasets.ReadInstruction('train'), - datasets.ReadInstruction('test'), - ]) - - # The full `train` and `test` splits, concatenated together. - ri = datasets.ReadInstruction('train') + datasets.ReadInstruction('test') - train_test_ds = datasets.load_dataset('bookcorpus', split=ri) - - # From record 10 (included) to record 20 (excluded) of `train` split. - train_10_20_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction( - 'train', from_=10, to=20, unit='abs')) - - # The first 10% of `train` split. - train_10_20_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction( - 'train', to=10, unit='%')) - - # The first 10% of `train` + the last 80% of `train`. - ri = (datasets.ReadInstruction('train', to=10, unit='%') + - datasets.ReadInstruction('train', from_=-80, unit='%')) - train_10_80pct_ds = datasets.load_dataset('bookcorpus', split=ri) - - # 10-fold cross-validation (see also next section on rounding behavior): - # The validation datasets are each going to be 10%: - # [0%:10%], [10%:20%], ..., [90%:100%]. - # And the training datasets are each going to be the complementary 90%: - # [10%:100%] (for a corresponding validation set of [0%:10%]), - # [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ..., - # [0%:90%] (for a validation set of [90%:100%]). - vals_ds = datasets.load_dataset('bookcorpus', [ - datasets.ReadInstruction('train', from_=k, to=k+10, unit='%') - for k in range(0, 100, 10)]) - trains_ds = datasets.load_dataset('bookcorpus', [ - (datasets.ReadInstruction('train', to=k, unit='%') + - datasets.ReadInstruction('train', from_=k+10, unit='%')) - for k in range(0, 100, 10)]) - -Percent slicing and rounding -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -If a slice of a split is requested using the percent (``%``) unit, and the -requested slice boundaries do not divide evenly by 100, then the default -behaviour is to round boundaries to the nearest integer (``closest``). This means -that some slices may contain more examples than others. For example: - -.. code-block:: - - # Assuming `train` split contains 999 records. - # 19 records, from 500 (included) to 519 (excluded). 
- train_50_52_ds = datasets.load_dataset('bookcorpus', split='train[50%:52%]') - # 20 records, from 519 (included) to 539 (excluded). - train_52_54_ds = datasets.load_dataset('bookcorpus', split='train[52%:54%]') - -Alternatively, the ``pct1_dropremainder`` rounding can be used, so specified -percentage boundaries are treated as multiples of 1%. This option should be used -when consistency is needed (eg: ``len(5%) == 5 * len(1%)``). This means the last -examples may be truncated if ``info.splits[split_name].num_examples % 100 != 0``. - -.. code-block:: - - # 18 records, from 450 (included) to 468 (excluded). - train_50_52pct1_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction( - 'train', from_=50, to=52, unit='%', rounding='pct1_dropremainder')) - # 18 records, from 468 (included) to 486 (excluded). - train_52_54pct1_ds = datasets.load_dataset('bookcorpus', split=datasets.ReadInstruction( - 'train', from_=52, to=54, unit='%', rounding='pct1_dropremainder')) - # Or equivalently: - train_50_52pct1_ds = datasets.load_dataset('bookcorpus', split='train[50%:52%](pct1_dropremainder)') - train_52_54pct1_ds = datasets.load_dataset('bookcorpus', split='train[52%:54%](pct1_dropremainder)') diff --git a/docs/source/stream.rst b/docs/source/stream.rst new file mode 100644 index 00000000000..85c34dd4764 --- /dev/null +++ b/docs/source/stream.rst @@ -0,0 +1,105 @@ +Stream +====== + +Dataset streaming lets you get started with a dataset without waiting for the entire dataset to download. The data is downloaded progressively as you iterate over the dataset. This is especially helpful when: + +* You don't want to wait for an extremely large dataset to download. +* The dataset size exceeds the amount of disk space on your computer. + +For example, the English split of the `OSCAR `_ dataset is 1.2 terabytes, but you can use it instantly with streaming. Stream a dataset by setting ``streaming=True`` in :func:`datasets.load_dataset` as shown below: + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True) + >>> print(next(iter(dataset))) + {'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visit to Malawi. Chief Napoleon conveyed the desperate need for a program to intervene and care for the orphans and vulnerable children (OVC) in Malawi, and John committed to help... + +Loading a dataset in streaming mode creates a new dataset type instance (instead of the classic :class:`datasets.Dataset` object), known as an :class:`datasets.IterableDataset`. This special type of dataset has its own set of processing methods shown below. + +.. tip:: + + An :class:`datasets.IterableDataset` is useful for iterative jobs like training a model. You shouldn't use a :class:`datasets.IterableDataset` for jobs that require random access to examples because you have to iterate all over it using a for loop. Getting the last example in an iterable dataset would require you to iterate over all the previous examples. + +``Shuffle`` +^^^^^^^^^^^ + +Like a regular :class:`datasets.Dataset` object, you can also shuffle a :class:`datasets.IterableDataset` with :func:`datasets.IterableDataset.shuffle`. + +The ``buffer_size`` argument controls the size of the buffer to randomly sample examples from. Let's say your dataset has one million examples, and you set the ``buffer_size`` to ten thousand. 
:func:`datasets.IterableDataset.shuffle` will randomly select examples from the first ten thousand examples in the buffer. Selected examples in the buffer are replaced with new examples.
+
+.. code-block::
+
+    >>> from datasets import load_dataset
+    >>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
+    >>> shuffled_dataset = dataset.shuffle(buffer_size=10_000, seed=42)
+
+.. tip::
+
+    :func:`datasets.IterableDataset.shuffle` will also shuffle the order of the shards if the dataset is sharded into multiple sets.
+
+Reshuffle
+^^^^^^^^^
+
+Sometimes you may want to reshuffle the dataset after each epoch. This will require you to set a different seed for each epoch. Use :func:`datasets.IterableDataset.set_epoch` in between epochs to tell the dataset what epoch you're on.
+
+Your seed effectively becomes: ``initial seed + current epoch``.
+
+.. code-block::
+
+    >>> for epoch in range(epochs):
+    ...     shuffled_dataset.set_epoch(epoch)
+    ...     for example in shuffled_dataset:
+    ...         ...
+
+Split dataset
+^^^^^^^^^^^^^
+
+You can split your dataset in one of two ways:
+
+* :func:`datasets.IterableDataset.take` returns the first ``n`` examples in a dataset:
+
+.. code-block::
+
+    >>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
+    >>> dataset_head = dataset.take(2)
+    >>> list(dataset_head)
+    [{'id': 0, 'text': 'Mtendere Village was...'}, {'id': 1, 'text': 'Lily James cannot fight the music...'}]
+
+* :func:`datasets.IterableDataset.skip` omits the first ``n`` examples in a dataset and returns the remaining examples:
+
+.. code::
+
+    >>> train_dataset = shuffled_dataset.skip(1000)
+
+.. important::
+
+    ``take`` and ``skip`` prevent future calls to ``shuffle`` because they lock in the order of the shards. You should ``shuffle`` your dataset before splitting it.
+
+.. _interleave_datasets:
+
+``Interleave``
+^^^^^^^^^^^^^^
+
+:func:`datasets.interleave_datasets` can combine an :class:`datasets.IterableDataset` with other datasets. The combined dataset returns alternating examples from each of the original datasets.
+
+.. code-block::
+
+    >>> from datasets import interleave_datasets
+    >>> from itertools import islice
+    >>> en_dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
+    >>> fr_dataset = load_dataset('oscar', "unshuffled_deduplicated_fr", split='train', streaming=True)
+
+    >>> multilingual_dataset = interleave_datasets([en_dataset, fr_dataset])
+    >>> print(list(islice(multilingual_dataset, 2)))
+    [{'text': 'Mtendere Village was inspired by the vision...}, {'text': "MΓ©dia de dΓ©bat d'idΓ©es, de culture et de littΓ©rature....}]
+
+Define sampling probabilities from each of the original datasets for more control over how each of them is sampled and combined. Set the ``probabilities`` argument with your desired sampling probabilities:
+
+.. code-block::
+
+    >>> multilingual_dataset_with_oversampling = interleave_datasets([en_dataset, fr_dataset], probabilities=[0.8, 0.2], seed=42)
+    >>> print(list(islice(multilingual_dataset_with_oversampling, 2)))
+    [{'text': 'Mtendere Village was inspired by the vision...}, {'text': 'Lily James cannot fight the music...}]
+
+Around 80% of the final dataset is made of the ``en_dataset``, and 20% of the ``fr_dataset``. 
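+
+The streaming methods above compose naturally. As a rough sketch (the split sizes are arbitrary and this is not an official recipe), you can shuffle a streamed dataset and then carve off a small held-out set with ``take`` and ``skip``, remembering to shuffle *before* splitting:
+
+.. code-block::
+
+    >>> from datasets import load_dataset
+    >>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
+    >>> shuffled = dataset.shuffle(buffer_size=10_000, seed=42)  # shuffle first, then split
+    >>> validation_dataset = shuffled.take(1000)  # the first 1,000 shuffled examples
+    >>> train_dataset = shuffled.skip(1000)       # everything after the first 1,000
+    >>> for example in validation_dataset:
+    ...     pass  # examples are downloaded progressively as you iterate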
diff --git a/docs/source/torch_tensorflow.rst b/docs/source/torch_tensorflow.rst deleted file mode 100644 index 18f530dc2ee..00000000000 --- a/docs/source/torch_tensorflow.rst +++ /dev/null @@ -1,102 +0,0 @@ -Using a Dataset with PyTorch/Tensorflow -============================================================== - -Once your dataset is processed, you often want to use it with a framework such as PyTorch, Tensorflow, Numpy or Pandas. For instance we may want to use our dataset in a ``torch.Dataloader`` or a ``tf.data.Dataset`` and train a model with it. - -πŸ€— Datasets provides a simple way to do this through what is called the format of a dataset. - -The format of a :class:`datasets.Dataset` instance defines which columns of the dataset are returned by the :func:`datasets.Dataset.__getitem__` method and cast them in PyTorch, Tensorflow, Numpy or Pandas types. - -By default, all the columns of the dataset are returned as a python object. Setting a specific format allows to cast dataset examples as PyTorch/Tensorflow/Numpy/Pandas tensors, arrays or DataFrames and to filter out some columns. A typical examples is columns with strings which are usually not used to train a model and cannot be converted in PyTorch tensors. We may still want to keep them in the dataset though, for instance for the evaluation of the model so it's interesting to just "mask" them during model training. - -.. note:: - The format of the dataset has no effect on the internal table storing the data, it just dynamically change the view of the dataset and examples which is returned when calling :func:`datasets.Dataset.__getitem__`. - -Setting the format -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The format of a :class:`datasets.Dataset` instance can be set using the :func:`datasets.Dataset.set_format` which take as arguments: - -- ``type``: an optional string defining the type of the objects that should be returned by :func:`datasets.Dataset.__getitem__`: - - - ``None``/``'python'`` (default): return python objects, - - ``'torch'``/``'pytorch'``/``'pt'``: return PyTorch tensors, - - ``'tensorflow'``/``'tf'``: return Tensorflow tensors, - - ``'jax'``: return JAX arrays, - - ``'numpy'``/``'np'``: return Numpy arrays, - - ``'pandas'``/``'pd'``: return Pandas DataFrames. - -- ``columns``: an optional list of column names (string) defining the list of the columns which should be formatted and returned by :func:`datasets.Dataset.__getitem__`. Set to None to return all the columns in the dataset (default). -- ``output_all_columns``: an optional boolean to return as python object the columns which are not selected to be formatted (see the above arguments). This can be used for instance if you cannot format some columns (e.g. string columns cannot be formatted as PyTorch Tensors) but would still like to have these columns returned. See an example below. - -Here is how we can apply a format to a simple dataset using :func:`datasets.Dataset.set_format` and wrap it in a ``torch.utils.data.DataLoader`` or a ``tf.data.Dataset``: - -.. 
code-block:: - - >>> ## PYTORCH CODE - >>> import torch - >>> from datasets import load_dataset - >>> from transformers import AutoTokenizer - >>> dataset = load_dataset('glue', 'mrpc', split='train') - >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') - >>> dataset = dataset.map(lambda e: tokenizer(e['sentence1'], truncation=True, padding='max_length'), batched=True) - >>> - >>> dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label']) - >>> dataloader = torch.utils.data.DataLoader(dataset, batch_size=32) - >>> next(iter(dataloader)) - {'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0], - ..., - [1, 1, 1, ..., 0, 0, 0]]), - 'input_ids': tensor([[ 101, 7277, 2180, ..., 0, 0, 0], - ..., - [ 101, 1109, 4173, ..., 0, 0, 0]]), - 'label': tensor([1, 0, 1, 0, 1, 1, 0, 1]), - 'token_type_ids': tensor([[0, 0, 0, ..., 0, 0, 0], - ..., - [0, 0, 0, ..., 0, 0, 0]])} - >>> ## TENSORFLOW CODE - >>> import tensorflow as tf - >>> from datasets import load_dataset - >>> from transformers import AutoTokenizer - >>> dataset = load_dataset('glue', 'mrpc', split='train') - >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') - >>> dataset = dataset.map(lambda e: tokenizer(e['sentence1'], truncation=True, padding='max_length'), batched=True) - >>> - >>> dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label']) - >>> features = {x: dataset[x].to_tensor(default_value=0, shape=[None, tokenizer.model_max_length]) for x in ['input_ids', 'token_type_ids', 'attention_mask']} - >>> tfdataset = tf.data.Dataset.from_tensor_slices((features, dataset["label"])).batch(32) - >>> next(iter(tfdataset)) - ({'input_ids': , 'token_type_ids': , 'attention_mask': }, ) - -In this example we filtered out the string columns `sentence1` and `sentence2` since they cannot be converted easily as tensors (at least in PyTorch). As detailed above, we could still output them as python object by setting ``output_all_columns=True``. - -We can also pass ``**kwargs`` to the respective convert functions like ``np.array``, ``torch.tensor``, ``tensorflow.ragged.constant`` or ``jnp.array`` by adding keyword arguments to :func:`datasets.Dataset.set_format()`. For example, if we want the columns formatted as PyTorch CUDA tensors, we use the following: - -.. code-block:: - - >>> dataset.set_format('torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'], device='cuda') - -We don't support any keyword arguments for the ``'pandas'`` format. - -Resetting the format -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Resetting the format to the default behavior (returning all columns as python object) can be done either by calling :func:`datasets.Dataset.reset_format` or by calling :func:`datasets.Dataset.set_format` with no arguments. - -Accessing the format -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The current format of the dataset can be queried by accessing the :obj:`datasets.Dataset.format` property which returns a dictionary with the current values of the ``type``, ``columns`` and ``output_all_columns`` values. - -This dict can be stored and used as named argument inputs for :func:`datasets.Dataset.set_format` if necessary (``dataset.set_format(**dataset.format)``). diff --git a/docs/source/tutorial.md b/docs/source/tutorial.md new file mode 100644 index 00000000000..16309f9c4b8 --- /dev/null +++ b/docs/source/tutorial.md @@ -0,0 +1,15 @@ +# Overview + +Welcome to the πŸ€— Datasets tutorial! 
+
+The goal of the tutorials is to help new users build up a basic understanding of πŸ€— Datasets. You will learn to:
+
+* Set up a virtual environment and install πŸ€— Datasets.
+* Load a dataset.
+* Explore what's inside a Dataset object.
+* Use a dataset with PyTorch and TensorFlow.
+* Evaluate your model predictions with a metric.
+
+After completing the tutorials, we hope you will have the necessary skills to start using our library in your own projects!
+
+We understand that people who want to use πŸ€— Datasets come from a wide and diverse range of disciplines. The tutorials are designed to be as accessible as possible to people without a developer background. If you already have some experience, take a look at our [Quick Start](../quickstart.rst) to see an end-to-end code example in context.
\ No newline at end of file
diff --git a/docs/source/use_dataset.rst b/docs/source/use_dataset.rst
new file mode 100644
index 00000000000..2d9e7039bb7
--- /dev/null
+++ b/docs/source/use_dataset.rst
@@ -0,0 +1,105 @@
+Train with πŸ€— Datasets
+======================
+
+So far, you loaded a dataset from the Hugging Face Hub and learned how to access the information stored inside the dataset. Now you will tokenize and use your dataset with a framework such as PyTorch or TensorFlow. By default, all the dataset columns are returned as Python objects. But you can bridge the gap between a Python object and your machine learning framework by setting the format of a dataset. Formatting casts the columns into compatible PyTorch or TensorFlow types.
+
+.. important::
+
+    Often, you may want to modify the structure and content of your dataset before you use it to train a model. For example, you may want to remove a column or cast it as a different type. πŸ€— Datasets provides the necessary tools to do this, but since each dataset is so different, the processing approach will vary from dataset to dataset. For more detailed information about preprocessing data, take a look at our `guide `_ from the πŸ€— Transformers library. Then come back and read our :doc:`How-to Process <./process>` guide to see all the different methods for processing your dataset.
+
+Tokenize
+--------
+
+Tokenization divides text into smaller units called tokens. Tokens are converted into numbers, which is what the model receives as its input.
+
+The first step is to install the πŸ€— Transformers library:
+
+.. code::
+
+    pip install transformers
+
+Next, import a tokenizer. It is important to use the tokenizer that is associated with the model you are using, so the text is split in the same way. In this example, load the `BERT tokenizer `_ because you are using the `BERT `_ model:
+
+.. code-block::
+
+    >>> from transformers import BertTokenizerFast
+    >>> tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
+
+Now you can tokenize the ``sentence1`` field of the dataset:
+
+.. 
code-block:: + + >>> encoded_dataset = dataset.map(lambda examples: tokenizer(examples['sentence1']), batched=True) + >>> encoded_dataset.column_names + ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'] + >>> encoded_dataset[0] + {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', + 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', + 'label': 1, + 'idx': 0, + 'input_ids': [ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + } + +The tokenization process creates three new columns: ``input_ids``, ``token_type_ids``, and ``attention_mask``. These are the inputs to the model. + +Format +------ + +Set the format with :func:`datasets.Dataset.set_format`, which accepts two main arguments: + +1. ``type`` defines the type of column to cast to. For example, ``torch`` returns PyTorch tensors and ``tensorflow`` returns TensorFlow tensors. + +2. ``columns`` specifies which columns should be formatted. + +After you set the format, wrap the dataset in a ``torch.utils.data.DataLoader`` or a ``tf.data.Dataset``: + +.. tab:: PyTorch + + >>> import torch + >>> from datasets import load_dataset + >>> from transformers import AutoTokenizer + >>> dataset = load_dataset('glue', 'mrpc', split='train') + >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') + >>> dataset = dataset.map(lambda e: tokenizer(e['sentence1'], truncation=True, padding='max_length'), batched=True) + ... + >>> dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label']) + >>> dataloader = torch.utils.data.DataLoader(dataset, batch_size=32) + >>> next(iter(dataloader)) + {'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0], + ..., + [1, 1, 1, ..., 0, 0, 0]]), + 'input_ids': tensor([[ 101, 7277, 2180, ..., 0, 0, 0], + ..., + [ 101, 1109, 4173, ..., 0, 0, 0]]), + 'label': tensor([1, 0, 1, 0, 1, 1, 0, 1]), + 'token_type_ids': tensor([[0, 0, 0, ..., 0, 0, 0], + ..., + [0, 0, 0, ..., 0, 0, 0]])} + +.. tab:: TensorFlow + + >>> import tensorflow as tf + >>> from datasets import load_dataset + >>> from transformers import AutoTokenizer + >>> dataset = load_dataset('glue', 'mrpc', split='train') + >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') + >>> dataset = dataset.map(lambda e: tokenizer(e['sentence1'], truncation=True, padding='max_length'), batched=True) + ... 
+ >>> dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label']) + >>> features = {x: dataset[x].to_tensor(default_value=0, shape=[None, tokenizer.model_max_length]) for x in ['input_ids', 'token_type_ids', 'attention_mask']} + >>> tfdataset = tf.data.Dataset.from_tensor_slices((features, dataset["label"])).batch(32) + >>> next(iter(tfdataset)) + ({'input_ids': , 'token_type_ids': , 'attention_mask': }, ) \ No newline at end of file diff --git a/docs/source/using_metrics.rst b/docs/source/using_metrics.rst deleted file mode 100644 index bb4a9ba698d..00000000000 --- a/docs/source/using_metrics.rst +++ /dev/null @@ -1,281 +0,0 @@ -Using a Metric -============================================================== - -Evaluating a model's predictions with :class:`datasets.Metric` involves just a couple of methods: - -- :func:`datasets.Metric.add` and :func:`datasets.Metric.add_batch` are used to add pairs of predictions/reference (or just predictions if a metric doesn't make use of references) to a temporary and memory efficient cache table, -- :func:`datasets.Metric.compute` then gathers all the cached predictions and references to compute the metric score. - -A typical **two-step workflow** to compute the metric is thus as follows: - -.. code-block:: - - import datasets - - metric = datasets.load_metric('my_metric') - - for model_input, gold_references in evaluation_dataset: - model_predictions = model(model_inputs) - metric.add_batch(predictions=model_predictions, references=gold_references) - - final_score = metric.compute() - -Alternatively, when the model predictions over the whole evaluation dataset can be computed in one step, a **single-step workflow** can be used by directly feeding the predictions/references to the :func:`datasets.Metric.compute` method as follows: - -.. code-block:: - - import datasets - - metric = datasets.load_metric('my_metric') - - model_predictions = model(model_inputs) - - final_score = metric.compute(predictions=model_predictions, references=gold_references) - - -.. note:: - - Under the hood, both the two-steps workflow and the single-step workflow use memory-mapped temporary cache tables to store predictions/references before computing the scores (similarly to a :class:`datasets.Dataset`). This is convenient for several reasons: - - - let us easily handle metrics whose scores depends on the evaluation set in non-additive ways, i.e. when f(AβˆͺB) β‰  f(A) + f(B), - - very efficient in terms of CPU/GPU memory (effectively requiring no CPU/GPU memory to use the metrics), - - enable easy distributed computation for the metrics by using the cache file as synchronization objects across the various processes. - -Adding predictions and references ------------------------------------------ - -Adding model predictions and references to a :class:`datasets.Metric` instance can be done using either one of :func:`datasets.Metric.add`, :func:`datasets.Metric.add_batch` and :func:`datasets.Metric.compute` methods. - -There methods are pretty simple to use and only accept two arguments for predictions/references: - -- ``predictions`` (for :func:`datasets.Metric.add_batch`) and ``prediction`` (for :func:`datasets.Metric.add`) should contains the predictions of a model to be evaluated by mean of the metric. For :func:`datasets.Metric.add` this will be a single prediction, for :func:`datasets.Metric.add_batch` this will be a batch of predictions. 
-- ``references`` (for :func:`datasets.Metric.add_batch`) and ``reference`` (for :func:`datasets.Metric.add`) should contains the references that the model predictions should be compared to (if the metric requires references). For :func:`datasets.Metric.add` this will be the reference associated to a single prediction, for :func:`datasets.Metric.add_batch` this will be references associated to a batch of predictions. Note that some metrics accept several references to compare each model prediction to. - -:func:`datasets.Metric.add` and :func:`datasets.Metric.add_batch` require the use of **named arguments** to avoid the silent error of mixing predictions with references. - -The model predictions and references can be provided in a wide number of formats (python lists, numpy arrays, pytorch tensors, tensorflow tensors), the metric object will take care of converting them to a suitable format for temporary storage and computation (as well as bringing them back to cpu and detaching them from gradients for PyTorch tensors). - -The exact format of the inputs is specific to each metric script and can be found in :obj:`datasets.Metric.features`, :obj:`datasets.Metric.inputs_descriptions` and the string representation of the :class:`datasets.Metric` object. - -Here is an example for the sacrebleu metric: - -.. code-block:: - - >>> import datasets - >>> metric = datasets.load_metric('sacrebleu') - >>> print(metric) - Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """ - Produces BLEU scores along with its sufficient statistics - from a source against one or more references. - - Args: - predictions: The system stream (a sequence of segments). - references: A list of one or more reference streams (each a sequence of segments). - smooth_method: The smoothing method to use. (Default: 'exp'). - smooth_value: The smoothing value. Only valid for 'floor' and 'add-k'. (Defaults: floor: 0.1, add-k: 1). - tokenize: Tokenization method to use for BLEU. If not provided, defaults to 'zh' for Chinese, 'ja-mecab' for - Japanese and '13a' (mteval) otherwise. - lowercase: Lowercase the data. If True, enables case-insensitivity. (Default: False). - force: Insist that your tokenized input is actually detokenized. - - Returns: - 'score': BLEU score, - 'counts': Counts, - 'totals': Totals, - 'precisions': Precisions, - 'bp': Brevity penalty, - 'sys_len': predictions length, - 'ref_len': reference length, - - Examples: - - >>> predictions = ["hello there general kenobi", "foo bar foobar"] - >>> references = [["hello there general kenobi", "hello there !"], ["foo bar foobar", "foo bar foobar"]] - >>> sacrebleu = datasets.load_metric("sacrebleu") - >>> results = sacrebleu.compute(predictions=predictions, references=references) - >>> print(list(results.keys())) - ['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len'] - >>> print(round(results["score"], 1)) - 100.0 - """, stored examples: 0) - >>> print(metric.features) - {'predictions': Value(dtype='string', id='sequence'), - 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')} - >>> print(metric.inputs_description) - Produces BLEU scores along with its sufficient statistics - from a source against one or more references. - - Args: - predictions: The system stream (a sequence of segments). 
-        references: A list of one or more reference streams (each a sequence of segments).
-        smooth_method: The smoothing method to use. (Default: 'exp').
-        smooth_value: The smoothing value. Only valid for 'floor' and 'add-k'. (Defaults: floor: 0.1, add-k: 1).
-        tokenize: Tokenization method to use for BLEU. If not provided, defaults to 'zh' for Chinese, 'ja-mecab' for
-            Japanese and '13a' (mteval) otherwise.
-        lowercase: Lowercase the data. If True, enables case-insensitivity. (Default: False).
-        force: Insist that your tokenized input is actually detokenized.
-
-    Returns:
-        'score': BLEU score,
-        'counts': Counts,
-        'totals': Totals,
-        'precisions': Precisions,
-        'bp': Brevity penalty,
-        'sys_len': predictions length,
-        'ref_len': reference length,
-
-    Examples:
-        >>> predictions = ["hello there general kenobi", "foo bar foobar"]
-        >>> references = [["hello there general kenobi", "hello there !"], ["foo bar foobar", "foo bar foobar"]]
-        >>> sacrebleu = datasets.load_metric("sacrebleu")
-        >>> results = sacrebleu.compute(predictions=predictions, references=references)
-        >>> print(list(results.keys()))
-        ['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
-        >>> print(round(results["score"], 1))
-        100.0
-
-Here we can see that the ``sacrebleu`` metric expects a sequence of segments as predictions and a list of one or several sequences of segments as references.
-
-You can find more information on the segments in the description, homepage and publication of ``sacrebleu``, which can be accessed with the respective attributes on the metric:
-
-.. code-block::
-
-    >>> print(metric.description)
-    SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores.
-    Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text.
-    It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
-    See the [README.md] file at https://github.com/mjpost/sacreBLEU for more information.
-    >>> print(metric.homepage)
-    https://github.com/mjpost/sacreBLEU
-    >>> print(metric.citation)
-    @inproceedings{post-2018-call,
        title = "A Call for Clarity in Reporting {BLEU} Scores",
        author = "Post, Matt",
        booktitle = "Proceedings of the Third Conference on Machine Translation: Research Papers",
        month = oct,
        year = "2018",
        address = "Belgium, Brussels",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/W18-6319",
        pages = "186--191",
    }
-
-Let's use ``sacrebleu`` with the official quick-start example on its homepage at https://github.com/mjpost/sacreBLEU:
-
-.. code-block::
-
-    >>> reference_batch = [['The dog bit the man.', 'The dog had bit the man.'],
-    ...                    ['It was not unexpected.', 'No one was surprised.'],
-    ...                    ['The man bit him first.', 'The man had bitten the dog.']]
-    >>> sys_batch = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']
-    >>> metric.add_batch(predictions=sys_batch, references=reference_batch)
-    >>> print(len(metric))
-    3
-
-Note that the format of the inputs is slightly different from the official ``sacrebleu`` format: here the references for each prediction are given as a list inside the list entry associated with that prediction, while the official quick-start example is nested the other way around (the outer list runs over reference streams and each inner list runs over the examples).
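-
-If it helps to see the difference concretely, here is a minimal sketch (the variable names are purely illustrative) that transposes the official ``sacrebleu`` layout, where the outer list runs over reference streams, into the per-prediction layout expected by :func:`datasets.Metric.add_batch`:
-
-.. code-block::
-
-    >>> # Official sacrebleu layout: one list per reference stream, each holding one reference per example.
-    >>> refs_per_stream = [['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],
-    ...                    ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.']]
-    >>> # πŸ€— Datasets layout: one list per prediction, holding all the references for that prediction.
-    >>> refs_per_prediction = [list(refs) for refs in zip(*refs_per_stream)]
-    >>> refs_per_prediction[0]
-    ['The dog bit the man.', 'The dog had bit the man.']
-
-The transposed ``refs_per_prediction`` can then be passed as ``references`` to :func:`datasets.Metric.add_batch`, exactly like ``reference_batch`` above.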
-
-Querying the length of a Metric object will return the number of examples (predictions or prediction/reference pairs) currently stored in the metric's cache. As we can see on the last line, we have stored three evaluation examples in our metric.
-
-Now let's compute the ``sacrebleu`` score from these 3 evaluation datapoints.
-
-Computing the metric scores
------------------------------------------
-
-Metric scores are computed using the :func:`datasets.Metric.compute` method.
-
-This method can accept several arguments:
-
-- ``predictions`` and ``references``: you can pass predictions and references directly (they are appended at the end of the cache if you have used :func:`datasets.Metric.add` or :func:`datasets.Metric.add_batch` before),
-- specific arguments that may be required by some metrics or that modify their behavior (print the metric input description to see the details with ``print(metric)`` or ``print(metric.inputs_description)``).
-
-In the simplest case, when the predictions and references have already been added with ``add`` or ``add_batch`` and no specific arguments need to be set to modify the default behavior of the metric, we can just call :func:`datasets.Metric.compute`:
-
-.. code-block::
-
-    >>> score = metric.compute()
-    >>> print(score)
-    {'score': 48.530827009929865, 'counts': [14, 7, 5, 3], 'totals': [17, 14, 11, 8], 'precisions': [82.3529411764706, 50.0, 45.45454545454545, 37.5], 'bp': 0.9428731438548749, 'sys_len': 17, 'ref_len': 18}
-
-If the metric supports them, you can pass additional arguments to the :func:`datasets.Metric.compute` method to control its behavior more precisely.
-These additional arguments are detailed in the metric information.
-
-For example, ``sacrebleu`` accepts the following additional arguments:
-
-- ``smooth_method``: The smoothing method to use. (Default: 'exp').
-- ``smooth_value``: The smoothing value. Only valid for 'floor' and 'add-k'. (Defaults: floor: 0.1, add-k: 1).
-- ``tokenize``: Tokenization method to use for BLEU. If not provided, defaults to 'zh' for Chinese, 'ja-mecab' for
-  Japanese and '13a' (mteval) otherwise.
-- ``lowercase``: Lowercase the data. If True, enables case-insensitivity. (Default: False).
-- ``force``: Insist that your tokenized input is actually detokenized.
-
-To use the ``"floor"`` smoothing method with a floor value of 0.2, pass these arguments to :func:`datasets.Metric.compute`:
-
-.. code-block::
-
-    score = metric.compute(smooth_method="floor", smooth_value=0.2)
-
-You can list these arguments with ``print(metric)`` or ``print(metric.inputs_description)``, as we saw in the previous section, and find more details on the official ``sacrebleu`` homepage and in its publication (accessible with ``print(metric.homepage)`` and ``print(metric.citation)``):
-
-.. code-block::
-
-    >>> print(metric.inputs_description)
-    Produces BLEU scores along with its sufficient statistics
-    from a source against one or more references.
-
-    Args:
-        predictions: The system stream (a sequence of segments).
-        references: A list of one or more reference streams (each a sequence of segments).
-        smooth_method: The smoothing method to use. (Default: 'exp').
-        smooth_value: The smoothing value. Only valid for 'floor' and 'add-k'. (Defaults: floor: 0.1, add-k: 1).
-        tokenize: Tokenization method to use for BLEU. If not provided, defaults to 'zh' for Chinese, 'ja-mecab' for
-            Japanese and '13a' (mteval) otherwise.
-        lowercase: Lowercase the data. If True, enables case-insensitivity. (Default: False).
-        force: Insist that your tokenized input is actually detokenized.
-
-    Returns:
-        'score': BLEU score,
-        'counts': Counts,
-        'totals': Totals,
-        'precisions': Precisions,
-        'bp': Brevity penalty,
-        'sys_len': predictions length,
-        'ref_len': reference length,
-
-    Examples:
-        >>> predictions = ["hello there general kenobi", "foo bar foobar"]
-        >>> references = [["hello there general kenobi", "hello there !"], ["foo bar foobar", "foo bar foobar"]]
-        >>> sacrebleu = datasets.load_metric("sacrebleu")
-        >>> results = sacrebleu.compute(predictions=predictions, references=references)
-        >>> print(list(results.keys()))
-        ['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
-        >>> print(round(results["score"], 1))
-        100.0
-
-Distributed usage
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Using the metric in a distributed or multiprocessing setting is exactly the same; the only difference is that the metric score is only computed on the first node (``process_id=0``). On the other processes, :func:`datasets.Metric.compute` will return ``None``. You should still run :func:`datasets.Metric.compute` on each node though, to finalize the writing of the predictions/references.
-
-We detailed how to load a metric in a distributed setup on the :doc:`loading_metrics` page.
-
-Here is a sample script showing how to instantiate the metric and run a metric computation in a distributed/multiprocessing setup:
-
-.. code-block::
-
-    >>> from datasets import load_metric
-
-    >>> # NUM_PROCESS is the total number of processes in the pool (it CANNOT evolve dynamically at the moment)
-    >>> # PROCESS_ID is the rank of the current process, ranging from 0 to NUM_PROCESS - 1 (it also CANNOT evolve dynamically at the moment)
-    >>> # For instance with pytorch:
-    >>> # NUM_PROCESS = torch.distributed.get_world_size()
-    >>> # PROCESS_ID = torch.distributed.get_rank()
-
-    >>> metric = load_metric('sacrebleu', num_process=NUM_PROCESS, process_id=PROCESS_ID)
-
-    >>> for model_input, gold_references in evaluation_dataset:
-    ...     model_predictions = model(model_input)
-    ...     metric.add_batch(predictions=model_predictions, references=gold_references)
-
-    >>> final_score = metric.compute()  # final_score is returned on process with process_id==0 and will be `None` on the other processes
diff --git a/setup.py b/setup.py
index a49eeb0de78..f8bdcd0aa7c 100644
--- a/setup.py
+++ b/setup.py
@@ -208,6 +208,9 @@
         "sphinx-copybutton",
         "fsspec",
         "s3fs",
+        "sphinx-panels",
+        "sphinx-inline-tabs",
+        "myst-parser",
     ],
 }
diff --git a/src/datasets/inspect.py b/src/datasets/inspect.py
index 0c8af56f21a..f9c97019ddb 100644
--- a/src/datasets/inspect.py
+++ b/src/datasets/inspect.py
@@ -55,6 +55,7 @@ def inspect_dataset(path: str, local_path: str, download_config: Optional[Downlo
     Args:
         path (``str``): path to the dataset processing script with the dataset builder. Can be either:
+
             - a local path to processing script or the directory containing the script (if the script has the same name as the directory),
                 e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'``
             - a dataset identifier on HuggingFace AWS bucket (list all available datasets and ids with ``datasets.list_datasets()``)
@@ -79,6 +80,7 @@ def inspect_metric(path: str, local_path: str, download_config: Optional[Downloa
     Args:
         path (``str``): path to the dataset processing script with the dataset builder.
Can be either: + - a local path to processing script or the directory containing the script (if the script has the same name as the directory), e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'`` - a dataset identifier on HuggingFace AWS bucket (list all available datasets and ids with ``datasets.list_datasets()``) @@ -102,6 +104,7 @@ def get_dataset_infos(path: str): Args: path (``str``): path to the dataset processing script with the dataset builder. Can be either: + - a local path to processing script or the directory containing the script (if the script has the same name as the directory), e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'`` - a dataset identifier on HuggingFace AWS bucket (list all available datasets and ids with ``datasets.list_datasets()``) @@ -117,6 +120,7 @@ def get_dataset_config_names(path: str): Args: path (``str``): path to the dataset processing script with the dataset builder. Can be either: + - a local path to processing script or the directory containing the script (if the script has the same name as the directory), e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'`` - a dataset identifier on HuggingFace AWS bucket (list all available datasets and ids with ``datasets.list_datasets()``) diff --git a/src/datasets/splits.py b/src/datasets/splits.py index 883ced8dbe2..3e4fadb2e01 100644 --- a/src/datasets/splits.py +++ b/src/datasets/splits.py @@ -402,7 +402,7 @@ class Split: Note: All splits, including compositions inherit from `datasets.SplitBase` - See the :doc:`guide on splits ` for more information. + See the :doc:`guide on splits ` for more information. """ # pylint: enable=line-too-long TRAIN = NamedSplit("train")