22 changes: 17 additions & 5 deletions docs/source/loading_datasets.rst
@@ -66,11 +66,11 @@ This call to :func:`datasets.load_dataset` does the following steps under the hood:
to map blobs of data on-drive without doing any deserialization. So caching the dataset directly on disk can use
memory-mapping and pay effectively zero cost with O(1) random access. Alternatively, you can copy it in CPU memory
(RAM) by setting the ``keep_in_memory`` argument of :func:`datasets.load_dataset` to ``True``.
The default in 🤗Datasets is to memory-map the dataset on drive if its size is larger than
``datasets.config.IN_MEMORY_MAX_SIZE`` (default ``250 * 2 ** 20`` B); otherwise, the dataset is copied in-memory.
This behavior can be disabled (i.e., the dataset will not be loaded in memory) by setting to ``0`` either the
configuration option ``datasets.config.IN_MEMORY_MAX_SIZE`` (higher precedence) or the environment variable
``HF_DATASETS_IN_MEMORY_MAX_SIZE`` (lower precedence).
The default in 🤗Datasets is to memory-map the dataset on disk unless you set ``datasets.config.IN_MEMORY_MAX_SIZE``
to a nonzero number of bytes (it defaults to ``0``). In that case, the dataset is copied in-memory if its size is
smaller than ``datasets.config.IN_MEMORY_MAX_SIZE`` bytes, and memory-mapped otherwise. You can enable this behavior
by setting either the configuration option ``datasets.config.IN_MEMORY_MAX_SIZE`` (higher precedence) or the
environment variable ``HF_DATASETS_IN_MEMORY_MAX_SIZE`` (lower precedence) to a nonzero value.

3. Return a **dataset built from the splits** asked by the user (default: all); in the above example we create a dataset with the train split.
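A minimal sketch of the two loading modes described in step 2, assuming the public ``squad`` dataset can be downloaded or is already cached (dataset name and split are only illustrative):

```python
from datasets import load_dataset

# keep_in_memory=True copies the Arrow data into RAM for this call,
# regardless of datasets.config.IN_MEMORY_MAX_SIZE.
ds_in_ram = load_dataset("squad", split="train", keep_in_memory=True)

# Default behaviour with IN_MEMORY_MAX_SIZE left at 0: the dataset stays
# memory-mapped from the on-disk Arrow cache (O(1) random access, ~zero RAM cost).
ds_mmap = load_dataset("squad", split="train")
```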

@@ -440,3 +440,15 @@ Indeed, if you've already loaded the dataset once before (when you had an internet connection)
You can even set the environment variable `HF_DATASETS_OFFLINE` to ``1`` to tell ``datasets`` to run in full offline mode.
This mode disables all the network calls of the library.
This way, instead of waiting for a dataset builder download to time out, the library looks directly at the cache.
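A short sketch of this offline setup; the flag is read when ``datasets`` loads its configuration, so exporting it in the shell (``HF_DATASETS_OFFLINE=1 python script.py``) works just as well:

```python
import os

# Must be set before importing datasets, since the flag is read at import time.
os.environ["HF_DATASETS_OFFLINE"] = "1"

from datasets import load_dataset

# With a warm cache, this resolves from disk instead of attempting any network call.
ds = load_dataset("squad", split="train")
```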

.. _load_dataset_enhancing_performance:

Enhancing performance
-----------------------------------------------------------

If you would like to speed up dataset operations, you can disable caching and copy the dataset in-memory by setting
``datasets.config.IN_MEMORY_MAX_SIZE`` to a nonzero size (in bytes) that fits in your RAM. In that case, the dataset
is copied in-memory if its size is smaller than ``datasets.config.IN_MEMORY_MAX_SIZE`` bytes, and memory-mapped
otherwise. You can enable this behavior by setting either the configuration option
``datasets.config.IN_MEMORY_MAX_SIZE`` (higher precedence) or the environment variable
``HF_DATASETS_IN_MEMORY_MAX_SIZE`` (lower precedence) to a nonzero value.
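A hedged sketch of the opt-in described in this section; the 1 GiB threshold is only an example value:

```python
import os

# Lower precedence: environment variable, read once when datasets is imported.
os.environ["HF_DATASETS_IN_MEMORY_MAX_SIZE"] = str(1 * 2 ** 30)  # 1 GiB

import datasets

# Higher precedence: the configuration option, adjustable at runtime.
datasets.config.IN_MEMORY_MAX_SIZE = 1 * 2 ** 30  # 1 GiB

# Datasets smaller than the threshold are now copied into RAM;
# larger ones remain memory-mapped from the cache.
ds = datasets.load_dataset("squad", split="train")
```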
7 changes: 3 additions & 4 deletions src/datasets/arrow_dataset.py
@@ -660,10 +660,9 @@ def load_from_disk(dataset_path: str, fs=None, keep_in_memory: Optional[bool] =
fs (:class:`~filesystems.S3FileSystem`, ``fsspec.spec.AbstractFileSystem``, optional, default ``None``):
Instance of the remote filesystem used to download the files from.
keep_in_memory (:obj:`bool`, default ``None``): Whether to copy the dataset in-memory. If `None`, the
dataset will be copied in-memory if its size is smaller than `datasets.config.IN_MEMORY_MAX_SIZE`
(default ``250 * 2 ** 20`` B). This behavior can be disabled (i.e., the dataset will not be loaded in
memory) by setting to ``0`` either the configuration option ``datasets.config.IN_MEMORY_MAX_SIZE``
(higher precedence) or the environment variable ``HF_DATASETS_IN_MEMORY_MAX_SIZE`` (lower precedence).
dataset will not be copied in-memory unless explicitly enabled by setting
`datasets.config.IN_MEMORY_MAX_SIZE` to nonzero. See more details in the
:ref:`load_dataset_enhancing_performance` section.

Returns:
:class:`Dataset` or :class:`DatasetDict`.
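An illustrative round-trip for the ``keep_in_memory`` argument documented above; the ``./my_dataset`` path is hypothetical:

```python
from datasets import Dataset, load_from_disk

ds = Dataset.from_dict({"text": ["a", "b", "c"]})
ds.save_to_disk("./my_dataset")

# keep_in_memory=None (default): with IN_MEMORY_MAX_SIZE at its new default of 0,
# the reloaded dataset stays memory-mapped from disk.
reloaded = load_from_disk("./my_dataset")

# Explicit opt-in: copy the Arrow data into RAM for this dataset only.
reloaded_in_ram = load_from_disk("./my_dataset", keep_in_memory=True)
```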
2 changes: 1 addition & 1 deletion src/datasets/config.py
@@ -144,7 +144,7 @@
HF_DATASETS_OFFLINE = False

# In-memory
DEFAULT_IN_MEMORY_MAX_SIZE = 250 * 2 ** 20 # 250 MiB
DEFAULT_IN_MEMORY_MAX_SIZE = 0 # Disabled
IN_MEMORY_MAX_SIZE = float(os.environ.get("HF_DATASETS_IN_MEMORY_MAX_SIZE", DEFAULT_IN_MEMORY_MAX_SIZE))

# File names
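For illustration, a standalone snippet mirroring how the changed line resolves the threshold (not a drop-in replacement for ``datasets/config.py``):

```python
import os

DEFAULT_IN_MEMORY_MAX_SIZE = 0  # 0 means: never copy datasets into RAM by default
IN_MEMORY_MAX_SIZE = float(os.environ.get("HF_DATASETS_IN_MEMORY_MAX_SIZE", DEFAULT_IN_MEMORY_MAX_SIZE))

print(IN_MEMORY_MAX_SIZE)  # 0.0 unless HF_DATASETS_IN_MEMORY_MAX_SIZE is set
```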
7 changes: 3 additions & 4 deletions src/datasets/dataset_dict.py
@@ -706,10 +706,9 @@ def load_from_disk(dataset_dict_path: str, fs=None, keep_in_memory: Optional[bool] =
fs (:class:`~filesystems.S3FileSystem` or ``fsspec.spec.AbstractFileSystem``, optional, default ``None``):
Instance of the remote filesystem used to download the files from.
keep_in_memory (:obj:`bool`, default ``None``): Whether to copy the dataset in-memory. If `None`, the
dataset will be copied in-memory if its size is smaller than `datasets.config.IN_MEMORY_MAX_SIZE`
(default ``250 * 2 ** 20`` B). This behavior can be disabled (i.e., the dataset will not be loaded in
memory) by setting to ``0`` either the configuration option ``datasets.config.IN_MEMORY_MAX_SIZE``
(higher precedence) or the environment variable ``HF_DATASETS_IN_MEMORY_MAX_SIZE`` (lower precedence).
dataset will not be copied in-memory unless explicitly enabled by setting
`datasets.config.IN_MEMORY_MAX_SIZE` to nonzero. See more details in the
:ref:`load_dataset_enhancing_performance` section.

Returns:
:class:`DatasetDict`
12 changes: 4 additions & 8 deletions src/datasets/load.py
@@ -683,10 +683,8 @@ def load_dataset(
download_mode (:class:`GenerateMode`, optional): Select the download/generate mode - Default to REUSE_DATASET_IF_EXISTS
ignore_verifications (:obj:`bool`, default ``False``): Ignore the verifications of the downloaded/processed dataset information (checksums/size/splits/...).
keep_in_memory (:obj:`bool`, default ``None``): Whether to copy the dataset in-memory. If `None`, the dataset
will be copied in-memory if its size is smaller than `datasets.config.IN_MEMORY_MAX_SIZE` (default
``250 * 2 ** 20`` B). This behavior can be disabled (i.e., the dataset will not be loaded in memory) by
setting to ``0`` either the configuration option ``datasets.config.IN_MEMORY_MAX_SIZE`` (higher precedence)
or the environment variable ``HF_DATASETS_IN_MEMORY_MAX_SIZE`` (lower precedence).
will not be copied in-memory unless explicitly enabled by setting `datasets.config.IN_MEMORY_MAX_SIZE` to
nonzero. See more details in the :ref:`load_dataset_enhancing_performance` section.
save_infos (:obj:`bool`, default ``False``): Save the dataset information (checksums/size/splits/...).
script_version (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:

@@ -775,10 +773,8 @@ def load_from_disk(dataset_path: str, fs=None, keep_in_memory: Optional[bool] =
fs (:class:`~filesystems.S3FileSystem` or ``fsspec.spec.AbstractFileSystem``, optional, default ``None``):
Instance of the remote filesystem used to download the files from.
keep_in_memory (:obj:`bool`, default ``None``): Whether to copy the dataset in-memory. If `None`, the dataset
will be copied in-memory if its size is smaller than `datasets.config.IN_MEMORY_MAX_SIZE` (default
``250 * 2 ** 20`` B). This behavior can be disabled (i.e., the dataset will not be loaded in memory) by
setting to ``0`` either the configuration option ``datasets.config.IN_MEMORY_MAX_SIZE`` (higher precedence)
or the environment variable ``HF_DATASETS_IN_MEMORY_MAX_SIZE`` (lower precedence).
will not be copied in-memory unless explicitly enabled by setting `datasets.config.IN_MEMORY_MAX_SIZE` to
nonzero. See more details in the :ref:`load_dataset_enhancing_performance` section.

Returns:
``datasets.Dataset`` or ``datasets.DatasetDict``
2 changes: 1 addition & 1 deletion tests/test_info_utils.py
@@ -23,7 +23,7 @@ def test_is_small_dataset(
if env_max_in_memory_dataset_size:
assert max_in_memory_dataset_size == env_max_in_memory_dataset_size
else:
assert max_in_memory_dataset_size == 250 * 2 ** 20
assert max_in_memory_dataset_size == 0
else:
assert max_in_memory_dataset_size == config_max_in_memory_dataset_size
if dataset_size and max_in_memory_dataset_size:
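A sketch of the size check this test exercises, re-implemented here only to spell out the semantics implied by the assertions; the real helper lives in the library, and the names and signature below are illustrative:

```python
from typing import Optional


def is_small_dataset(dataset_size: Optional[int], in_memory_max_size: float) -> bool:
    """True when the dataset is known to fit under the in-memory threshold."""
    if dataset_size is None or not in_memory_max_size:
        # Unknown size, or threshold disabled (the new default of 0): not "small".
        return False
    return dataset_size < in_memory_max_size


print(is_small_dataset(148, 0))              # False: in-memory copies disabled by default
print(is_small_dataset(148, 500 * 2 ** 20))  # True: fits under a 500 MiB threshold
```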
6 changes: 2 additions & 4 deletions tests/test_load.py
@@ -233,8 +233,7 @@ def test_load_dataset_local_with_default_in_memory(
):
current_dataset_size = 148
if max_in_memory_dataset_size == "default":
# default = 250 * 2 ** 20
max_in_memory_dataset_size = datasets.config.IN_MEMORY_MAX_SIZE
max_in_memory_dataset_size = 0 # default
else:
monkeypatch.setattr(datasets.config, "IN_MEMORY_MAX_SIZE", max_in_memory_dataset_size)
if max_in_memory_dataset_size:
@@ -253,8 +252,7 @@ def test_load_from_disk_with_default_in_memory(
):
current_dataset_size = 512 # arrow file size = 512, in-memory dataset size = 148
if max_in_memory_dataset_size == "default":
# default = 250 * 2 ** 20
max_in_memory_dataset_size = datasets.config.IN_MEMORY_MAX_SIZE
max_in_memory_dataset_size = 0 # default
else:
monkeypatch.setattr(datasets.config, "IN_MEMORY_MAX_SIZE", max_in_memory_dataset_size)
if max_in_memory_dataset_size:
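A minimal sketch of the monkeypatch pattern these tests use to override the threshold for a single test, assuming only pytest's built-in ``monkeypatch`` fixture:

```python
import datasets


def test_in_memory_threshold_override(monkeypatch):
    # Raise the threshold to 500 MiB for this test only; the environment is untouched.
    monkeypatch.setattr(datasets.config, "IN_MEMORY_MAX_SIZE", 500 * 2 ** 20)
    assert datasets.config.IN_MEMORY_MAX_SIZE == 500 * 2 ** 20
    # Any load_dataset()/load_from_disk() call made here would now copy datasets
    # smaller than 500 MiB into RAM, matching the documented behaviour.
```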