More consistent naming #2611
Changes shown from 1 commit. PR commits: d938669, a7c4ee1, 509a46f, d27ecf6, 7716214, ba51f68, adb71e3.
First changed file (unified diff):

````diff
@@ -25,7 +25,7 @@
 <a href="https://zenodo.org/badge/latestdoi/250213286"><img src="https://zenodo.org/badge/250213286.svg" alt="DOI"></a>
 </p>
 
-`🤗Datasets` is a lightweight library providing **two** main features:
+`🤗 Datasets` is a lightweight library providing **two** main features:
 
 - **one-line dataloaders for many public datasets**: one liners to download and pre-process any of the major public datasets (in 467 languages and dialects!) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
 - **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text. With simple commands like `tokenized_dataset = dataset.map(tokenize_exemple)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
@@ -38,29 +38,29 @@
 <a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/datasets/master/docs/source/imgs/course_banner.png"></a>
 </h3>
 
-`🤗Datasets` also provides access to +15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.
+`🤗 Datasets` also provides access to +15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.
 
-`🤗Datasets` has many additional interesting features:
-- Thrive on large datasets: `🤗Datasets` naturally frees the user from RAM memory limitation, all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow).
+`🤗 Datasets` has many additional interesting features:
+- Thrive on large datasets: `🤗 Datasets` naturally frees the user from RAM memory limitation, all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow).
 - Smart caching: never wait for your data to process several times.
 - Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
 - Built-in interoperability with NumPy, pandas, PyTorch, Tensorflow 2 and JAX.
 
-`🤗Datasets` originated from a fork of the awesome [`TensorFlow Datasets`](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between `🤗Datasets` and `tfds` can be found in the section [Main differences between `🤗Datasets` and `tfds`](#main-differences-between-datasets-and-tfds).
+`🤗 Datasets` originated from a fork of the awesome [`TensorFlow Datasets`](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between `🤗 Datasets` and `tfds` can be found in the section [Main differences between `🤗 Datasets` and `tfds`](#main-differences-between-datasets-and-tfds).
 
 # Installation
 
 ## With pip
 
-`🤗Datasets` can be installed from PyPi and has to be installed in a virtual environment (venv or conda for instance)
+`🤗 Datasets` can be installed from PyPi and has to be installed in a virtual environment (venv or conda for instance)
 
 ```bash
 pip install datasets
 ```
 
 ## With conda
 
-`🤗Datasets` can be installed using conda as follows:
+`🤗 Datasets` can be installed using conda as follows:
 
 ```bash
 conda install -c huggingface -c conda-forge datasets
@@ -72,13 +72,13 @@ For more details on installation, check the installation page in the documentati
 
 ## Installation to use with PyTorch/TensorFlow/pandas
 
-If you plan to use `🤗Datasets` with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.
+If you plan to use `🤗 Datasets` with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.
 
 For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html
 
 # Usage
 
-`🤗Datasets` is made to be very simple to use. The main methods are:
+`🤗 Datasets` is made to be very simple to use. The main methods are:
 
 - `datasets.list_datasets()` to list the available datasets
 - `datasets.load_dataset(dataset_name, **kwargs)` to instantiate a dataset
@@ -106,7 +106,7 @@ squad_metric = load_metric('squad')
 # Process the dataset - add a column with the length of the context texts
 dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})
 
-# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗Transformers library)
+# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library)
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
 
@@ -117,11 +117,11 @@ For more details on using the library, check the quick tour page in the document
 
 - Loading a dataset https://huggingface.co/docs/datasets/loading_datasets.html
 - What's in a Dataset: https://huggingface.co/docs/datasets/exploring.html
-- Processing data with `🤗Datasets`: https://huggingface.co/docs/datasets/processing.html
+- Processing data with `🤗 Datasets`: https://huggingface.co/docs/datasets/processing.html
 - Writing your own dataset loading script: https://huggingface.co/docs/datasets/add_dataset.html
 - etc.
 
-Another introduction to `🤗Datasets` is the tutorial on Google Colab here:
+Another introduction to `🤗 Datasets` is the tutorial on Google Colab here:
 [](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)
 
 # Add a new dataset to the Hub
@@ -132,17 +132,17 @@ You will find [the step-by-step guide here](https://github.com/huggingface/datas
 
 You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in [the documentation section about dataset sharing](https://huggingface.co/docs/datasets/share_dataset.html).
 
-# Main differences between `🤗Datasets` and `tfds`
+# Main differences between `🤗 Datasets` and `tfds`
 
-If you are familiar with the great `Tensorflow Datasets`, here are the main differences between `🤗Datasets` and `tfds`:
-- the scripts in `🤗Datasets` are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
-- `🤗Datasets` also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to the pair of a benchmark dataset and a benchmark metric for instance for benchmarks like [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) or [GLUE](https://gluebenchmark.com/).
-- the backend serialization of `🤗Datasets` is based on [Apache Arrow](https://arrow.apache.org/) instead of TF Records and leverage python dataclasses for info and features with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache).
-- the user-facing dataset object of `🤗Datasets` is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method). It basically wraps a memory-mapped Arrow table cache.
+If you are familiar with the great `Tensorflow Datasets`, here are the main differences between `🤗 Datasets` and `tfds`:
+- the scripts in `🤗 Datasets` are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
+- `🤗 Datasets` also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to the pair of a benchmark dataset and a benchmark metric for instance for benchmarks like [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) or [GLUE](https://gluebenchmark.com/).
+- the backend serialization of `🤗 Datasets` is based on [Apache Arrow](https://arrow.apache.org/) instead of TF Records and leverage python dataclasses for info and features with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache).
+- the user-facing dataset object of `🤗 Datasets` is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method). It basically wraps a memory-mapped Arrow table cache.
 
 # Disclaimers
 
-Similar to `TensorFlow Datasets`, `🤗Datasets` is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
+Similar to `TensorFlow Datasets`, `🤗 Datasets` is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
 
 If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a [GitHub issue](https://github.com/huggingface/datasets/issues/new). Thanks for your contribution to the ML community!
````
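One of the README bullets touched by this diff claims "Smart caching: never wait for your data to process several times." A toy sketch of that idea, not the library's actual implementation: 🤗 Datasets fingerprints the dataset and the transform automatically and caches Arrow files, whereas here the caller passes an explicit `tag` for the transform and results are stored as JSON. `cached_map`, `tag`, and `cache_dir` are names made up for this example.

```python
import hashlib
import json
import os

def cached_map(records, transform, cache_dir, tag):
    """Toy "smart caching": fingerprint the inputs plus a caller-supplied
    tag for the transform, and reuse the on-disk result on later calls
    instead of reprocessing."""
    payload = json.dumps([records, tag], sort_keys=True).encode("utf-8")
    key = hashlib.sha256(payload).hexdigest()
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):  # cache hit: skip reprocessing entirely
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    result = [transform(r) for r in records]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result
```

The second call with the same records and tag reads the cached file and never re-runs the transform, which is the behavior the bullet describes.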
Second changed file (unified diff):

```diff
@@ -5,17 +5,17 @@ Datasets and evaluation metrics for natural language processing
 
 Compatible with NumPy, Pandas, PyTorch and TensorFlow
 
-🤗Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP).
+🤗 Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP).
 
-🤗Datasets has many interesting features (beside easy sharing and accessing datasets/metrics):
+🤗 Datasets has many interesting features (beside easy sharing and accessing datasets/metrics):
 
 Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
 Lightweight and fast with a transparent and pythonic API
-Strive on large datasets: 🤗Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
+Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
 Smart caching: never wait for your data to process several times
-🤗Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗Datasets viewer.
+🤗 Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗 Datasets viewer.
 
-🤗Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗Datasets and tfds can be found in the section Main differences between 🤗Datasets and tfds.
+🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and tfds can be found in the section Main differences between 🤗 Datasets and tfds.
```
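The description above says all datasets are "memory-mapped on drive by default". A minimal stdlib illustration of what memory-mapping buys: the OS pages data in from disk on access, so the whole file never has to sit in RAM at once. The library itself relies on Apache Arrow's memory-mapped format rather than raw `mmap`, and the file contents here are made up for the example.

```python
import mmap
import os
import tempfile

# Write a small stand-in "table" to disk.
path = os.path.join(tempfile.mkdtemp(), "table.bin")
with open(path, "wb") as f:
    f.write(b"row-0\nrow-1\nrow-2\n")

# Map the file instead of reading it: slicing `mapped` touches only the
# pages actually accessed, regardless of the file's total size.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_row = bytes(mapped[0:5])
    mapped.close()
```

For a multi-gigabyte file the same slice would still cost only a few pages of memory, which is the "frees the user from RAM memory limitation" claim in the description.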
Suggested change:

```diff
-🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and tfds can be found in the section Main differences between 🤗 Datasets and tfds.
+🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and tfds can be found in the section Main differences between 🤗 Datasets and `tfds`.
```
But see my earlier comment.
I think it's fine this way. We could also change it to "Tensorflow Datasets".
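The change discussed in this thread is mechanical: insert a space between the 🤗 emoji and the word that follows it. A hypothetical helper, not how this PR was actually produced, that would apply the same normalization:

```python
import re

def space_after_hugs(text: str) -> str:
    """Insert a space after the 🤗 emoji when a non-space character
    follows, e.g. "🤗Datasets" -> "🤗 Datasets". Occurrences that are
    already spaced are left untouched, so the function is idempotent."""
    return re.sub(r"🤗(?=\S)", "🤗 ", text)
```

The lookahead `(?=\S)` matches without consuming the following character, so only the space is inserted and already-correct text passes through unchanged.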