From 8a9d2c3ffe3ae9bc191a4becb9953b5130f36984 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest Date: Thu, 1 Jul 2021 11:32:02 +0200 Subject: [PATCH] add streaming in load a dataset docs --- docs/source/loading_datasets.rst | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/docs/source/loading_datasets.rst b/docs/source/loading_datasets.rst index 2964f7b4940..53af794291d 100644 --- a/docs/source/loading_datasets.rst +++ b/docs/source/loading_datasets.rst @@ -349,6 +349,28 @@ We provide more details on how to create your own dataset generation script on t .. _load_dataset_cache_management: + +Loading datasets in streaming mode +----------------------------------------------------------- + +When a dataset is in streaming mode, you can iterate over it directly without having to download the entire dataset. +The data are downloaded progressively as you iterate over the dataset. +You can enable dataset streaming by passing ``streaming=True`` in the :func:`load_dataset` function to get an iterable dataset. + +For example, you can start iterating over big datasets like OSCAR without having to download terabytes of data using this code: + + +.. code-block:: + + >>> from datasets import load_dataset + >>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True) + >>> print(next(iter(dataset))) + {'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visit to Malawi. Chief Napoleon conveyed the desperate need for a program to intervene and care for the orphans and vulnerable children (OVC) in Malawi, and John committed to help... + +.. note:: + + A dataset in streaming mode is not a :class:`datasets.Dataset` object, but an :class:`datasets.IterableDataset` object. You can find more information about iterable datasets in the `dataset streaming documentation `__ + Cache management and integrity verifications -----------------------------------------------------------