- behaves like a normal python container. You can query its length, get rows, columns and also lot of metadata on the dataset (description, citation, split sizes, etc).
+ behaves like a normal python container. You can query its length, get rows, columns and also a lot of metadata on the dataset (description, citation, split sizes, etc).
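
For instance, a few container-style queries (a minimal sketch; column names and counts depend on the dataset you loaded):

.. code-block::

    >>> len(dataset)          # number of examples
    >>> dataset.num_rows      # same as len(dataset)
    >>> dataset.column_names  # names of the columns
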
In this guide we will detail what's in this object and how to access all the information.

- An:class:`datasets.Dataset` is a python container with a length coresponding to the number of examples in the dataset. You can access a single example by its index. Let's query the first sample in the dataset:
+ A :class:`datasets.Dataset` is a python container with a length corresponding to the number of examples in the dataset. You can access a single example by its index. Let's query the first sample in the dataset:

.. code-block::
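
    # Illustrative sketch only: the exact fields and values depend on
    # which dataset was loaded earlier in this guide.
    >>> dataset[0]
    {'idx': 0, 'label': 1, 'sentence1': '...', 'sentence2': '...'}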
@@ -76,9 +76,9 @@ More details on the ``features`` can be found in the guide on :doc:`features` an
- The :class:`datasets.Dataset` object also host many important metadata on the dataset which are all stored in ``dataset.info``. Many of these metadata are also accessible on the lower level, i.e. directly as attributes of the Dataset for shorter access (e.g. ``dataset.info.features`` is also available as ``dataset.features``).
+ The :class:`datasets.Dataset` object also hosts many important metadata on the dataset, all stored in ``dataset.info``. Many of these metadata are also accessible at a lower level, directly as attributes of the Dataset for shorter access (e.g. ``dataset.info.features`` is also available as ``dataset.features``).

- All these attributes are listed in the package reference on :class:`datasets.DatasetInfo`. The most important metadata are ``split``, ``description``, ``citation``, ``homepage`` (and ``licence`` when this one is available).
+ All these attributes are listed in the package reference on :class:`datasets.DatasetInfo`. The most important metadata are ``split``, ``description``, ``citation``, ``homepage`` (and ``license`` when available).

.. code-block::
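
    # Illustrative sketch only: actual values depend on the loaded dataset.
    >>> dataset.split
    NamedSplit('train')
    >>> dataset.description   # also available as dataset.info.description
    '...'
    >>> dataset.citation      # also available as dataset.info.citation
    '...'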
@@ -168,7 +168,7 @@ You can also get a full column by querying its name as a string. This will retur
As you can see, depending on the object queried (single row, batch of rows or column), the returned object is different:

- a single row like ``dataset[0]`` will be returned as a python dictionary of values,
- - a batch like ``dataset[5:10]``) will be returned as a python dictionary of lists of values,
+ - a batch like ``dataset[5:10]`` will be returned as a python dictionary of lists of values,
- a column like ``dataset['sentence1']`` will be returned as a python list of values.

This may seem surprising at first, but in our experiments it's actually easier to use these various formats for data processing than to return the same format for each of these views on the dataset.
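
The sketch below illustrates the three access patterns (the ``sentence1`` column is an assumption, taken from the GLUE/MRPC example used elsewhere in this guide):

.. code-block::

    >>> type(dataset[0])            # single row -> dict of values
    <class 'dict'>
    >>> type(dataset[5:10])         # batch of rows -> dict of lists of values
    <class 'dict'>
    >>> type(dataset['sentence1'])  # column -> list of values
    <class 'list'>
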
@@ -201,12 +201,12 @@ A specific format can be activated with :func:`datasets.Dataset.set_format`.
- :obj:`type` (``Union[None, str]``, defaults to ``None``) defines the return type for the dataset :obj:`__getitem__` method and is one of ``[None, 'numpy', 'pandas', 'torch', 'tensorflow', 'jax']`` (``None`` means return python objects),
- :obj:`columns` (``Union[None, str, List[str]]``, defaults to ``None``) defines the columns returned by :obj:`__getitem__` and takes the name of a column in the dataset or a list of columns to return (``None`` means return all columns),
- :obj:`output_all_columns` (``bool``, defaults to ``False``) controls whether the columns which cannot be formatted (e.g. a column of ``string`` values cannot be cast to a PyTorch Tensor) are still returned as python objects.
- - :obj:`format_kwargs` can be used to provide additional keywords arguments that will be forwarded to the convertiong function like ``np.array``, ``torch.tensor``, ``tensorflow.ragged.constant`` or ``jnp.array``. For instance, to create ``torch.Tensor`` directly on the GPU you can specify ``device='cuda'``.
+ - :obj:`format_kwargs` can be used to provide additional keyword arguments that will be forwarded to the converting function like ``np.array``, ``torch.tensor``, ``tensorflow.ragged.constant`` or ``jnp.array``. For instance, to create ``torch.Tensor`` directly on the GPU you can specify ``device='cuda'``.
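
For example, the following sketch (the column name and device are illustrative assumptions) activates a PyTorch format restricted to the ``label`` column, with tensors created directly on the GPU:

.. code-block::

    dataset.set_format(type='torch', columns=['label'], device='cuda')

Calling :func:`datasets.Dataset.reset_format` reverts to the default python objects.
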
.. note::
The format is only applied to a single row or batches of rows (i.e. when querying :obj:`dataset[0]` or :obj:`dataset[10:20]`). Querying a column (e.g. :obj:`dataset['sentence1']`) will return the column even if it's filtered out by the format. In this case the un-formatted column is returned.
- This design choice was made because it's quite rare to use column-only access when working with deep-learning frameworks and it's quite usefull to be able to access column even when they are masked by the format.
+ This design choice was made because it's quite rare to use column-only access when working with deep-learning frameworks, and it's quite useful to be able to access columns even when they are masked by the format.
Here is an example:
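
.. code-block::

    # A sketch with assumed columns ('label', 'sentence1'):
    >>> dataset.set_format(type='torch', columns=['label'])
    >>> dataset[0]['label']          # formatted as a torch tensor
    tensor(1)
    >>> type(dataset['sentence1'])   # column access bypasses the format
    <class 'list'>
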
@@ -239,6 +239,7 @@ Here is an example to tokenize and pad tokens on-the-fly when accessing the samp

docs/source/faiss_and_ea.rst (2 additions, 2 deletions)

@@ -3,7 +3,7 @@ Adding a FAISS or Elastic Search index to a Dataset
It is possible to do document retrieval in a dataset.

- For example, one way to do Open Domain Question Answering, one way to do that is to first retrieve documents that may be relevant to answer a question, and then we can use a model to generate an answer given the retrieved documents.
+ For example, one way to do Open Domain Question Answering is to first retrieve documents that may be relevant to answering a question, and then use a model to generate an answer given the retrieved documents.

FAISS is a library for dense retrieval. This means that it retrieves documents based on their vector representations, by doing a nearest neighbors search.
As we now have models that can generate good semantic vector representations of documents, this has become an interesting tool for document retrieval.
@@ -29,7 +29,7 @@ Adding a FAISS index
The :func:`datasets.Dataset.add_faiss_index` method is in charge of building, training and adding vectors to a FAISS index.

- One way to get good vector representations for text passages is to use the DPR model. We'll compute the representations of only 100 examples just to give you the idea of how it works.
+ One way to get good vector representations for text passages is to use the `DPR model <https://huggingface.co/transformers/model_doc/dpr.html>`_. We'll compute the representations of only 100 examples just to give you an idea of how it works.
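
A minimal sketch of that workflow, assuming a dataset with a ``text`` column (the column name is an assumption; the model names are the public DPR checkpoints on the hub):

.. code-block::

    from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

    ctx_encoder = DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
    ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')

    # Embed the first 100 passages and index them with FAISS
    ds_with_embeddings = dataset.select(range(100)).map(
        lambda example: {'embeddings': ctx_encoder(
            **ctx_tokenizer(example['text'], return_tensors='pt', truncation=True)
        )[0][0].detach().numpy()}
    )
    ds_with_embeddings.add_faiss_index(column='embeddings')

Once the index is built, :func:`datasets.Dataset.get_nearest_examples` retrieves the passages closest to a query embedding.
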

docs/source/filesystems.rst (5 additions, 5 deletions)

@@ -4,7 +4,7 @@ FileSystems Integration for cloud storages
Supported Filesystems
---------------------

- Currenlty ``datasets`` offers an s3 filesystem implementation with :class:`datasets.filesystems.S3FileSystem`. ``S3FileSystem`` is a subclass of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_, which is a known implementation of ``fsspec``.
+ Currently ``datasets`` offers an s3 filesystem implementation with :class:`datasets.filesystems.S3FileSystem`. ``S3FileSystem`` is a subclass of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_, which is a known implementation of ``fsspec``.
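
For instance, a minimal sketch (the bucket name is a placeholder; ``anon=True`` assumes a public bucket):

.. code-block::

    from datasets.filesystems import S3FileSystem

    # Anonymous access to a public bucket; pass key/secret for a private one
    s3 = S3FileSystem(anon=True)
    s3.ls('a-public-bucket')  # hypothetical bucket name
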
Furthermore ``datasets`` supports all ``fsspec`` implementations. Currently known implementations are:
@@ -24,15 +24,15 @@ Example using :class:`datasets.filesystems.S3FileSystem` within ``datasets``.

docs/source/index.rst (8 additions, 7 deletions)

@@ -9,22 +9,23 @@ Compatible with NumPy, Pandas, PyTorch and TensorFlow
🤗 Datasets has many interesting features (besides easy sharing and accessing datasets/metrics):

- Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
- Lightweight and fast with a transparent and pythonic API
- Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
- Smart caching: never wait for your data to process several times
- 🤗 Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗 Datasets viewer.
+ - Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
+ - Lightweight and fast with a transparent and pythonic API
+ - Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitations; all datasets are memory-mapped on drive by default.
+ - Smart caching: never wait for your data to be processed several times
+ - 🤗 Datasets currently provides access to ~1,000 datasets and ~30 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live `🤗 Datasets viewer <https://huggingface.co/datasets/viewer/>`_.
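
A quick way to check what's available (a sketch; requires network access, and the counts grow over time):

.. code-block::

    from datasets import list_datasets, list_metrics

    print(len(list_datasets()), 'datasets,', len(list_metrics()), 'metrics')
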
🤗 Datasets originated from a fork of the awesome TensorFlow Datasets, and the HuggingFace team wants to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section Main differences between 🤗 Datasets and `tfds`.
Contents
---------------------------------

- The documentation is organized in five parts:
+ The documentation is organized in six parts:

- **GET STARTED** contains a quick tour and the installation instructions.
- **USING DATASETS** contains general tutorials on how to use and contribute to the datasets in the library.
- **USING METRICS** contains general tutorials on how to use and contribute to the metrics in the library.
+ - **ADDING NEW DATASETS/METRICS** explains how to create your own dataset or metric loading script.
- **ADVANCED GUIDES** contains more advanced guides that are more specific to a part of the library.
- **PACKAGE REFERENCE** contains the documentation of each public class and function.
@@ -79,4 +80,4 @@ The documentation is organized in five parts: