
Commit f83cd44

Merge branch 'master' into load_dataset-no-dataset-script
2 parents be38796 + 0a0227f

File tree

11 files changed: +121 −106 lines


datasets/c4/dataset_infos.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docs/source/exploring.rst

Lines changed: 8 additions & 7 deletions
@@ -9,11 +9,11 @@ The :class:`datasets.Dataset` object that you get when you execute for instance
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('glue', 'mrpc', split='train')

-behaves like a normal python container. You can query its length, get rows, columns and also lot of metadata on the dataset (description, citation, split sizes, etc).
+behaves like a normal python container. You can query its length, get rows, columns and also a lot of metadata on the dataset (description, citation, split sizes, etc).

 In this guide we will detail what's in this object and how to access all the information.

-An :class:`datasets.Dataset` is a python container with a length coresponding to the number of examples in the dataset. You can access a single example by its index. Let's query the first sample in the dataset:
+A :class:`datasets.Dataset` is a python container with a length corresponding to the number of examples in the dataset. You can access a single example by its index. Let's query the first sample in the dataset:

 .. code-block::

@@ -76,9 +76,9 @@ More details on the ``features`` can be found in the guide on :doc:`features` an
 Metadata
 ------------------------------------------------------

-The :class:`datasets.Dataset` object also host many important metadata on the dataset which are all stored in ``dataset.info``. Many of these metadata are also accessible on the lower level, i.e. directly as attributes of the Dataset for shorter access (e.g. ``dataset.info.features`` is also available as ``dataset.features``).
+The :class:`datasets.Dataset` object also hosts many important metadata on the dataset which are all stored in ``dataset.info``. Many of these metadata are also accessible on the lower level, i.e. directly as attributes of the Dataset for shorter access (e.g. ``dataset.info.features`` is also available as ``dataset.features``).

-All these attributes are listed in the package reference on :class:`datasets.DatasetInfo`. The most important metadata are ``split``, ``description``, ``citation``, ``homepage`` (and ``licence`` when this one is available).
+All these attributes are listed in the package reference on :class:`datasets.DatasetInfo`. The most important metadata are ``split``, ``description``, ``citation``, ``homepage`` (and ``license`` when this one is available).

 .. code-block::

@@ -168,7 +168,7 @@ You can also get a full column by querying its name as a string. This will retur
 As you can see depending on the object queried (single row, batch of rows or column), the returned object is different:

 - a single row like ``dataset[0]`` will be returned as a python dictionary of values,
-- a batch like ``dataset[5:10]``) will be returned as a python dictionary of lists of values,
+- a batch like ``dataset[5:10]`` will be returned as a python dictionary of lists of values,
 - a column like ``dataset['sentence1']`` will be returned as a python lists of values.

 This may seems surprising at first but in our experiments it's actually easier to use these various format for data processing than returning the same format for each of these views on the dataset.
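
For reference (an illustrative aside, not part of this commit), a minimal sketch of the three access patterns the hunk above documents, assuming the 'glue'/'mrpc' split used throughout that page:

    >>> from datasets import load_dataset
    >>> dataset = load_dataset('glue', 'mrpc', split='train')
    >>> type(dataset[0])            # single row -> python dict of values
    <class 'dict'>
    >>> type(dataset[5:10])         # batch -> python dict of lists of values
    <class 'dict'>
    >>> type(dataset['sentence1'])  # column -> python list of values
    <class 'list'>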
@@ -201,12 +201,12 @@ A specific format can be activated with :func:`datasets.Dataset.set_format`.
 - :obj:`type` (``Union[None, str]``, default to ``None``) defines the return type for the dataset :obj:`__getitem__` method and is one of ``[None, 'numpy', 'pandas', 'torch', 'tensorflow', 'jax']`` (``None`` means return python objects),
 - :obj:`columns` (``Union[None, str, List[str]]``, default to ``None``) defines the columns returned by :obj:`__getitem__` and takes the name of a column in the dataset or a list of columns to return (``None`` means return all columns),
 - :obj:`output_all_columns` (``bool``, default to ``False``) controls whether the columns which cannot be formatted (e.g. a column with ``string`` cannot be cast in a PyTorch Tensor) are still outputted as python objects.
-- :obj:`format_kwargs` can be used to provide additional keywords arguments that will be forwarded to the convertiong function like ``np.array``, ``torch.tensor``, ``tensorflow.ragged.constant`` or ``jnp.array``. For instance, to create ``torch.Tensor`` directly on the GPU you can specify ``device='cuda'``.
+- :obj:`format_kwargs` can be used to provide additional keywords arguments that will be forwarded to the converting function like ``np.array``, ``torch.tensor``, ``tensorflow.ragged.constant`` or ``jnp.array``. For instance, to create ``torch.Tensor`` directly on the GPU you can specify ``device='cuda'``.

 .. note::

 The format is only applied to a single row or batches of rows (i.e. when querying :obj:`dataset[0]` or :obj:`dataset[10:20]`). Querying a column (e.g. :obj:`dataset['sentence1']`) will return the column even if it's filtered by the format. In this case the un-formatted column is returned.
-This design choice was made because it's quite rare to use column-only access when working with deep-learning frameworks and it's quite usefull to be able to access column even when they are masked by the format.
+This design choice was made because it's quite rare to use column-only access when working with deep-learning frameworks and it's quite useful to be able to access column even when they are masked by the format.

 Here is an example:

@@ -239,6 +239,7 @@ Here is an example to tokenize and pad tokens on-the-fly when accessing the samp
 >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
 >>> def encode(batch):
 >>> return tokenizer(batch["sentence1"], padding="longest", truncation=True, max_length=512, return_tensors="pt")
+>>>
 >>> dataset.set_transform(encode)
 >>> dataset.format
 {'type': 'custom', 'format_kwargs': {'transform': <function __main__.encode(batch)>}, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False}
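
For reference (an illustrative aside, not part of this commit), a minimal sketch of :func:`datasets.Dataset.set_format` with the parameters listed in the hunk above; extra keyword arguments play the role of ``format_kwargs`` and are forwarded to ``torch.tensor``, and the ``device='cuda'`` call assumes a GPU is available:

    >>> dataset.set_format(type='torch', columns=['label', 'idx'], output_all_columns=True)
    >>> dataset[0]['label']       # numeric columns come back as torch.Tensor
    >>> dataset[0]['sentence1']   # string column still returned as a python object thanks to output_all_columns=True
    >>> dataset.set_format(type='torch', columns=['label'], device='cuda')  # format_kwargs forwarded to torch.tensor
    >>> dataset.reset_format()    # back to plain python objects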

docs/source/faiss_and_ea.rst

Lines changed: 2 additions & 2 deletions
@@ -3,7 +3,7 @@ Adding a FAISS or Elastic Search index to a Dataset

 It is possible to do document retrieval in a dataset.

-For example, one way to do Open Domain Question Answering, one way to do that is to first retrieve documents that may be relevant to answer a question, and then we can use a model to generate an answer given the retrieved documents.
+For example, one way to do Open Domain Question Answering is to first retrieve documents that may be relevant to answer a question, and then we can use a model to generate an answer given the retrieved documents.

 FAISS is a library for dense retrieval. It means that it retrieves documents based on their vector representations, by doing a nearest neighbors search.
 As we now have models that can generate good semantic vector representations of documents, this has become an interesting tool for document retrieval.

@@ -29,7 +29,7 @@ Adding a FAISS index

 The :func:`datasets.Dataset.add_faiss_index` method is in charge of building, training and adding vectors to a FAISS index.

-One way to get good vector representations for text passages is to use the DPR model. We'll compute the representations of only 100 examples just to give you the idea of how it works.
+One way to get good vector representations for text passages is to use the `DPR model <https://huggingface.co/transformers/model_doc/dpr.html>`_. We'll compute the representations of only 100 examples just to give you the idea of how it works.

 .. code-block::
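
For reference (an illustrative aside, not part of this commit), a minimal sketch of the workflow that page describes, using the DPR context encoder it links to; the 'crime_and_punish' dataset and its 'line' text column are placeholder choices, any dataset with a text column works:

    >>> import torch
    >>> from datasets import load_dataset
    >>> from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
    >>> ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    >>> ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    >>> ds = load_dataset('crime_and_punish', split='train[:100]')
    >>> with torch.no_grad():
    ...     ds_with_embeddings = ds.map(lambda example: {'embeddings': ctx_encoder(**ctx_tokenizer(example["line"], return_tensors="pt"))[0][0].numpy()})
    >>> ds_with_embeddings.add_faiss_index(column='embeddings')  # the index can then serve nearest-neighbor queries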

docs/source/filesystems.rst

Lines changed: 5 additions & 5 deletions
@@ -4,7 +4,7 @@ FileSystems Integration for cloud storages
 Supported Filesystems
 ---------------------

-Currenlty ``datasets`` offers an s3 filesystem implementation with :class:`datasets.filesystems.S3FileSystem`. ``S3FileSystem`` is a subclass of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_, which is a known implementation of ``fsspec``.
+Currently ``datasets`` offers an s3 filesystem implementation with :class:`datasets.filesystems.S3FileSystem`. ``S3FileSystem`` is a subclass of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_, which is a known implementation of ``fsspec``.

 Furthermore ``datasets`` supports all ``fsspec`` implementations. Currently known implementations are:

@@ -24,15 +24,15 @@ Example using :class:`datasets.filesystems.S3FileSystem` within ``datasets``.

 .. code-block::

->>> pip install datasets[s3]
+>>> pip install "datasets[s3]"

 Listing files from a public s3 bucket.

 .. code-block::

 >>> import datasets
 >>> s3 = datasets.filesystems.S3FileSystem(anon=True) # doctest: +SKIP
->>> s3.ls('public-datasets/imdb/train') # doctest: +SKIP
+>>> s3.ls('some-public-datasets/imdb/train') # doctest: +SKIP
 ['dataset_info.json.json','dataset.arrow','state.json']

 Listing files from a private s3 bucket using ``aws_access_key_id`` and ``aws_secret_access_key``.

@@ -129,8 +129,8 @@ Loading ``encoded_dataset`` from a public s3 bucket.
 >>> # create S3FileSystem without credentials
 >>> s3 = S3FileSystem(anon=True) # doctest: +SKIP
 >>>
->>> # load encoded_dataset to from s3 bucket
->>> dataset = load_from_disk('s3://a-public-datasets/imdb/train',fs=s3) # doctest: +SKIP
+>>> # load encoded_dataset from s3 bucket
+>>> dataset = load_from_disk('s3://some-public-datasets/imdb/train',fs=s3) # doctest: +SKIP
 >>>
 >>> print(len(dataset))
 >>> # 25000
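
For reference (an illustrative aside, not part of this commit), the save-side counterpart of the load example above, assuming ``save_to_disk`` accepts the same ``fs`` argument and that ``aws_access_key_id``/``aws_secret_access_key`` hold credentials for a bucket you control (the bucket path is a placeholder):

    >>> from datasets import load_dataset
    >>> from datasets.filesystems import S3FileSystem
    >>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  # doctest: +SKIP
    >>> encoded_dataset = load_dataset('glue', 'mrpc', split='train')
    >>> # save encoded_dataset to a private s3 bucket (placeholder path)
    >>> encoded_dataset.save_to_disk('s3://my-private-datasets/glue/mrpc/train', fs=s3)  # doctest: +SKIP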

docs/source/index.rst

Lines changed: 8 additions & 7 deletions
@@ -9,22 +9,23 @@ Compatible with NumPy, Pandas, PyTorch and TensorFlow

 🤗 Datasets has many interesting features (beside easy sharing and accessing datasets/metrics):

-Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
-Lightweight and fast with a transparent and pythonic API
-Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
-Smart caching: never wait for your data to process several times
-🤗 Datasets currently provides access to ~100 NLP datasets and ~10 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗 Datasets viewer.
+- Built-in interoperability with Numpy, Pandas, PyTorch and Tensorflow 2
+- Lightweight and fast with a transparent and pythonic API
+- Strive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped on drive by default.
+- Smart caching: never wait for your data to process several times
+- 🤗 Datasets currently provides access to ~1,000 datasets and ~30 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live `🤗 Datasets viewer <https://huggingface.co/datasets/viewer/>`_.

 🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section Main differences between 🤗 Datasets and `tfds`.

 Contents
 ---------------------------------

-The documentation is organized in five parts:
+The documentation is organized in six parts:

 - **GET STARTED** contains a quick tour and the installation instructions.
 - **USING DATASETS** contains general tutorials on how to use and contribute to the datasets in the library.
 - **USING METRICS** contains general tutorials on how to use and contribute to the metrics in the library.
+- **ADDING NEW DATASETS/METRICS** explains how to create your own dataset or metric loading script.
 - **ADVANCED GUIDES** contains more advanced guides that are more specific to a part of the library.
 - **PACKAGE REFERENCE** contains the documentation of each public class and function.

@@ -79,4 +80,4 @@ The documentation is organized in five parts:
 package_reference/builder_classes
 package_reference/table_classes
 package_reference/logging_methods
-package_reference/task_templates
+package_reference/task_templates

0 commit comments
