README.md (2 changes: 1 addition & 1 deletion)
@@ -27,7 +27,7 @@

`🤗Datasets` is a lightweight library providing **two** main features:

- **one-line dataloaders for many public datasets**: one liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (in 467 languages and dialects!) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_datasets("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
- **one-line dataloaders for many public datasets**: one liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (in 467 languages and dialects!) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text. With simple commands like `tokenized_dataset = dataset.map(tokenize_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.

[🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🕹 **Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)
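For reference, a minimal sketch of the two commands mentioned in the diff above; `tokenize_example` here is a hypothetical placeholder for whatever per-example preprocessing you need, not a function shipped with the library:

```python
from datasets import load_dataset

# One-line dataloader: downloads SQuAD from the Hub and caches it locally.
squad_dataset = load_dataset("squad")

# Hypothetical preprocessing function; any callable that takes an example
# dict and returns an (updated) example dict can be passed to map().
def tokenize_example(example):
    example["question_tokens"] = example["question"].lower().split()
    return example

# map() applies the function to every example in every split.
tokenized_dataset = squad_dataset.map(tokenize_example)
print(tokenized_dataset["train"][0]["question_tokens"][:5])
```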
datasets/babi_qa/README.md (2 changes: 1 addition & 1 deletion)
@@ -419,7 +419,7 @@ The "types" are:
- `en-valid` and `en-valid-10k`
- are the same as `en` and `en10k` except the train sets have been conveniently split into train and valid portions (90% and 10% split).

To get a particular dataset, use `load_datasets('babi_qa',type=f'{type}',task_no=f'{task_no}')` where `type` is one of the types, and `task_no` is one of the task numbers. For example, `load_dataset('babi_qa', type='en', task_no='qa1')`.
To get a particular dataset, use `load_dataset('babi_qa',type=f'{type}',task_no=f'{task_no}')` where `type` is one of the types, and `task_no` is one of the task numbers. For example, `load_dataset('babi_qa', type='en', task_no='qa1')`.
### Languages


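For reference, a minimal sketch of the call documented above, assuming the `babi_qa` loading script accepts the `type` and `task_no` keyword arguments exactly as described in that README:

```python
from datasets import load_dataset

# Load task qa1 of the English (1k-example) bAbI configuration.
babi = load_dataset("babi_qa", type="en", task_no="qa1")
print(babi["train"][0])
```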
docs/source/loading_datasets.rst (2 changes: 1 addition & 1 deletion)
@@ -65,7 +65,7 @@ This call to :func:`datasets.load_dataset` does the following steps under the hood:
typed with potentially complex nested types that can be mapped to numpy/pandas/python types. Apache Arrow allows you
to map blobs of data on-drive without doing any deserialization. So caching the dataset directly on disk can use
memory-mapping and pay effectively zero cost with O(1) random access. Alternatively, you can copy it in CPU memory
(RAM) by setting the ``keep_in_memory`` argument of :func:`datasets.load_datasets` to ``True``.
(RAM) by setting the ``keep_in_memory`` argument of :func:`datasets.load_dataset` to ``True``.
The default in 🤗Datasets is to memory-map the dataset on disk unless you set ``datasets.config.IN_MEMORY_MAX_SIZE``
different from ``0`` bytes (default). In that case, the dataset will be copied in-memory if its size is smaller than
``datasets.config.IN_MEMORY_MAX_SIZE`` bytes, and memory-mapped otherwise. This behavior can be enabled by setting
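For reference, a minimal sketch contrasting the memory-mapped default with the in-memory options described above; the 250 MB threshold is only an illustrative value:

```python
import datasets
from datasets import load_dataset

# Default: the dataset is memory-mapped from the Arrow cache on disk.
squad = load_dataset("squad")

# Copy the dataset into RAM instead of memory-mapping it.
squad_in_memory = load_dataset("squad", keep_in_memory=True)

# Alternatively, datasets smaller than this many bytes are copied into RAM
# automatically, while larger ones stay memory-mapped (0, the default, disables this).
datasets.config.IN_MEMORY_MAX_SIZE = 250 * 2**20  # illustrative: ~250 MB
```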