diff --git a/README.md b/README.md
index 3b2e5218192..ec4845c7e1b 100644
--- a/README.md
+++ b/README.md
@@ -27,7 +27,7 @@
 
 `🤗Datasets` is a lightweight library providing **two** main features:
 
-- **one-line dataloaders for many public datasets**: one liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (in 467 languages and dialects!) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_datasets("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
+- **one-line dataloaders for many public datasets**: one liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (in 467 languages and dialects!) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
 - **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text. With simple commands like `tokenized_dataset = dataset.map(tokenize_exemple)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
 
 [🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🕹 **Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)
diff --git a/datasets/babi_qa/README.md b/datasets/babi_qa/README.md
index 8af4fdf7a9b..98562c1e6cd 100644
--- a/datasets/babi_qa/README.md
+++ b/datasets/babi_qa/README.md
@@ -419,7 +419,7 @@ The "types" are are:
 
 - `en-valid` and `en-valid-10k` - are the same as `en` and `en10k` except the train sets have been conveniently split into train and valid portions (90% and 10% split).
 
-To get a particular dataset, use `load_datasets('babi_qa',type=f'{type}',task_no=f'{task_no}')` where `type` is one of the types, and `task_no` is one of the task numbers. For example, `load_dataset('babi_qa', type='en', task_no='qa1')`.
+To get a particular dataset, use `load_dataset('babi_qa',type=f'{type}',task_no=f'{task_no}')` where `type` is one of the types, and `task_no` is one of the task numbers. For example, `load_dataset('babi_qa', type='en', task_no='qa1')`.
 
 ### Languages
 
diff --git a/docs/source/loading_datasets.rst b/docs/source/loading_datasets.rst
index 8a6f5d29cfa..2964f7b4940 100644
--- a/docs/source/loading_datasets.rst
+++ b/docs/source/loading_datasets.rst
@@ -65,7 +65,7 @@ This call to :func:`datasets.load_dataset` does the following steps under the ho
    typed with potentially complex nested types that can be mapped to numpy/pandas/python types. Apache Arrow allows you to map
    blobs of data on-drive without doing any deserialization. So caching the dataset directly on disk can use memory-mapping and
    pay effectively zero cost with O(1) random access. Alternatively, you can copy it in CPU memory
-   (RAM) by setting the ``keep_in_memory`` argument of :func:`datasets.load_datasets` to ``True``.
+   (RAM) by setting the ``keep_in_memory`` argument of :func:`datasets.load_dataset` to ``True``.
    The default in 🤗Datasets is to memory-map the dataset on disk unless you set ``datasets.config.IN_MEMORY_MAX_SIZE`` different
    from ``0`` bytes (default). In that case, the dataset will be copied in-memory if its size is smaller than
    ``datasets.config.IN_MEMORY_MAX_SIZE`` bytes, and memory-mapped otherwise. This behavior can be enabled by setting
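As a quick sanity check of the renamed call (not part of the diff), here is a minimal sketch of the corrected `load_dataset` usage together with the `keep_in_memory` option mentioned in the docs hunk; the `tokenize_example` helper is hypothetical and stands in for whatever preprocessing you pass to `Dataset.map`:

```python
from datasets import load_dataset

# The correct entry point is `load_dataset` (singular), as this diff fixes.
squad = load_dataset("squad", split="train")

# Optionally copy the dataset into RAM instead of memory-mapping it from disk.
squad_in_memory = load_dataset("squad", split="train", keep_in_memory=True)

# A hypothetical preprocessing function applied to every example with `Dataset.map`.
def tokenize_example(example):
    example["question_tokens"] = example["question"].split()
    return example

tokenized = squad.map(tokenize_example)
```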