More consistent naming #2611
Changes shown from 1 commit. PR commits: d938669, a7c4ee1, 509a46f, d27ecf6, 7716214, ba51f68, adb71e3.
First changed file (unified diff):

````diff
@@ -25,7 +25,7 @@
 <a href="https://zenodo.org/badge/latestdoi/250213286"><img src="https://zenodo.org/badge/250213286.svg" alt="DOI"></a>
 </p>
 
-`🤗Datasets` is a lightweight library providing **two** main features:
+`🤗 Datasets` is a lightweight library providing **two** main features:
 
 - **one-line dataloaders for many public datasets**: one liners to download and pre-process any of the major public datasets (in 467 languages and dialects!) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
 - **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text. With simple commands like `tokenized_dataset = dataset.map(tokenize_exemple)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
@@ -38,29 +38,29 @@
 <a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/datasets/master/docs/source/imgs/course_banner.png"></a>
 </h3>
 
-`🤗Datasets` also provides access to +15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.
+`🤗 Datasets` also provides access to +15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.
 
-`🤗Datasets` has many additional interesting features:
-- Thrive on large datasets: `🤗Datasets` naturally frees the user from RAM memory limitation, all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow).
+`🤗 Datasets` has many additional interesting features:
+- Thrive on large datasets: `🤗 Datasets` naturally frees the user from RAM memory limitation, all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow).
 - Smart caching: never wait for your data to process several times.
 - Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
 - Built-in interoperability with NumPy, pandas, PyTorch, Tensorflow 2 and JAX.
 
-`🤗Datasets` originated from a fork of the awesome [`TensorFlow Datasets`](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between `🤗Datasets` and `tfds` can be found in the section [Main differences between `🤗Datasets` and `tfds`](#main-differences-between-datasets-and-tfds).
+`🤗 Datasets` originated from a fork of the awesome [`TensorFlow Datasets`](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between `🤗 Datasets` and `tfds` can be found in the section [Main differences between `🤗 Datasets` and `tfds`](#main-differences-between-datasets-and-tfds).
 
 # Installation
 
 ## With pip
 
-`🤗Datasets` can be installed from PyPi and has to be installed in a virtual environment (venv or conda for instance)
+`🤗 Datasets` can be installed from PyPi and has to be installed in a virtual environment (venv or conda for instance)
 
 ```bash
 pip install datasets
 ```
 
 ## With conda
 
-`🤗Datasets` can be installed using conda as follows:
+`🤗 Datasets` can be installed using conda as follows:
 
 ```bash
 conda install -c huggingface -c conda-forge datasets
@@ -72,13 +72,13 @@ For more details on installation, check the installation page in the documentati
 
 ## Installation to use with PyTorch/TensorFlow/pandas
 
-If you plan to use `🤗Datasets` with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.
+If you plan to use `🤗 Datasets` with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.
 
 For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html
 
 # Usage
 
-`🤗Datasets` is made to be very simple to use. The main methods are:
+`🤗 Datasets` is made to be very simple to use. The main methods are:
 
 - `datasets.list_datasets()` to list the available datasets
 - `datasets.load_dataset(dataset_name, **kwargs)` to instantiate a dataset
@@ -106,7 +106,7 @@ squad_metric = load_metric('squad')
 # Process the dataset - add a column with the length of the context texts
 dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})
 
-# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗Transformers library)
+# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library)
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
 
@@ -117,11 +117,11 @@ For more details on using the library, check the quick tour page in the document
 
 - Loading a dataset https://huggingface.co/docs/datasets/loading_datasets.html
 - What's in a Dataset: https://huggingface.co/docs/datasets/exploring.html
-- Processing data with `🤗Datasets`: https://huggingface.co/docs/datasets/processing.html
+- Processing data with `🤗 Datasets`: https://huggingface.co/docs/datasets/processing.html
 - Writing your own dataset loading script: https://huggingface.co/docs/datasets/add_dataset.html
 - etc.
 
-Another introduction to `🤗Datasets` is the tutorial on Google Colab here:
+Another introduction to `🤗 Datasets` is the tutorial on Google Colab here:
 [](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)
 
 # Add a new dataset to the Hub
@@ -132,17 +132,17 @@ You will find [the step-by-step guide here](https://github.com/huggingface/datas
 
 You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in [the documentation section about dataset sharing](https://huggingface.co/docs/datasets/share_dataset.html).
 
-# Main differences between `🤗Datasets` and `tfds`
+# Main differences between `🤗 Datasets` and `tfds`
 
-If you are familiar with the great `Tensorflow Datasets`, here are the main differences between `🤗Datasets` and `tfds`:
-- the scripts in `🤗Datasets` are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
-- `🤗Datasets` also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to the pair of a benchmark dataset and a benchmark metric for instance for benchmarks like [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) or [GLUE](https://gluebenchmark.com/).
-- the backend serialization of `🤗Datasets` is based on [Apache Arrow](https://arrow.apache.org/) instead of TF Records and leverage python dataclasses for info and features with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache).
-- the user-facing dataset object of `🤗Datasets` is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method). It basically wraps a memory-mapped Arrow table cache.
+If you are familiar with the great `Tensorflow Datasets`, here are the main differences between `🤗 Datasets` and `tfds`:
+- the scripts in `🤗 Datasets` are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
+- `🤗 Datasets` also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to the pair of a benchmark dataset and a benchmark metric for instance for benchmarks like [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) or [GLUE](https://gluebenchmark.com/).
+- the backend serialization of `🤗 Datasets` is based on [Apache Arrow](https://arrow.apache.org/) instead of TF Records and leverage python dataclasses for info and features with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache).
+- the user-facing dataset object of `🤗 Datasets` is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method). It basically wraps a memory-mapped Arrow table cache.
 
 # Disclaimers
 
-Similar to `TensorFlow Datasets`, `🤗Datasets` is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
+Similar to `TensorFlow Datasets`, `🤗 Datasets` is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
 
 If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a [GitHub issue](https://github.com/huggingface/datasets/issues/new). Thanks for your contribution to the ML community!
````
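One of the README bullets touched by this diff claims "Smart caching: never wait for your data to process several times." A toy sketch of that idea, not the library's actual implementation: 🤗 Datasets fingerprints the dataset and the transform automatically and caches Arrow files, whereas here the caller passes an explicit `tag` for the transform and results are stored as JSON. `cached_map`, `tag`, and `cache_dir` are names made up for this example.

```python
import hashlib
import json
import os

def cached_map(records, transform, cache_dir, tag):
    """Toy "smart caching": fingerprint the inputs plus a caller-supplied
    tag for the transform, and reuse the on-disk result on later calls
    instead of reprocessing."""
    payload = json.dumps([records, tag], sort_keys=True).encode("utf-8")
    key = hashlib.sha256(payload).hexdigest()
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):  # cache hit: skip reprocessing entirely
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    result = [transform(r) for r in records]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result
```

The second call with the same records and tag reads the cached file and never re-runs the transform, which is the behavior the bullet describes.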
Second changed file (unified diff):

```diff
@@ -5,17 +5,17 @@ Datasets and evaluation metrics for natural language processing
 
 Compatible with NumPy, Pandas, PyTorch and TensorFlow
 
-🤗Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP).
+🤗 Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP).
 
-🤗Datasets has many interesting features (beside easy sharing and accessing datasets/metrics):
+🤗 Datasets has many interesting features (beside easy sharing and accessing datasets/metrics):
 
 Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
 Lightweight and fast with a transparent and pythonic API
-Strive on large datasets: 🤗Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
+Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
 Smart caching: never wait for your data to process several times
-🤗Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗Datasets viewer.
+🤗 Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗 Datasets viewer.
 
-🤗Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗Datasets and tfds can be found in the section Main differences between 🤗Datasets and tfds.
+🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and tfds can be found in the section Main differences between 🤗 Datasets and tfds.
```
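The description above says all datasets are "memory-mapped on drive by default". A minimal stdlib illustration of what memory-mapping buys: the OS pages data in from disk on access, so the whole file never has to sit in RAM at once. The library itself relies on Apache Arrow's memory-mapped format rather than raw `mmap`, and the file contents here are made up for the example.

```python
import mmap
import os
import tempfile

# Write a small stand-in "table" to disk.
path = os.path.join(tempfile.mkdtemp(), "table.bin")
with open(path, "wb") as f:
    f.write(b"row-0\nrow-1\nrow-2\n")

# Map the file instead of reading it: slicing `mapped` touches only the
# pages actually accessed, regardless of the file's total size.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_row = bytes(mapped[0:5])
    mapped.close()
```

For a multi-gigabyte file the same slice would still cost only a few pages of memory, which is the "frees the user from RAM memory limitation" claim in the description.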
Suggested change:

```diff
-🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and tfds can be found in the section Main differences between 🤗 Datasets and tfds.
+🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and tfds can be found in the section Main differences between 🤗 Datasets and `tfds`.
```
But see my earlier comment.
I think it's fine this way. We could also change it to "Tensorflow Datasets".
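The change discussed in this thread is mechanical: insert a space between the 🤗 emoji and the word that follows it. A hypothetical helper, not how this PR was actually produced, that would apply the same normalization:

```python
import re

def space_after_hugs(text: str) -> str:
    """Insert a space after the 🤗 emoji when a non-space character
    follows, e.g. "🤗Datasets" -> "🤗 Datasets". Occurrences that are
    already spaced are left untouched, so the function is idempotent."""
    return re.sub(r"🤗(?=\S)", "🤗 ", text)
```

The lookahead `(?=\S)` matches without consuming the following character, so only the space is inserted and already-correct text passes through unchanged.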