Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions docs/source/_toctree.yml
Original file line number Diff line number Diff line change
Expand Up @@ -10,15 +10,15 @@
- local: installation
title: Installation
- local: load_hub
title: Hugging Face Hub
title: Load a dataset from the Hub
- local: access
title: The Dataset object
title: Know your dataset
- local: use_dataset
title: Train with 🤗 Datasets
title: Preprocess
- local: metrics
title: Evaluate predictions
- local: upload_dataset
title: Upload a dataset to the Hub
title: Share a dataset to the Hub
title: "Tutorials"
- sections:
- local: how_to
Expand Down
141 changes: 48 additions & 93 deletions docs/source/access.mdx
Original file line number Diff line number Diff line change
@@ -1,126 +1,81 @@
# The Dataset object
# Know your dataset

In the previous tutorial, you learned how to successfully load a dataset. This section will familiarize you with the [`Dataset`] object. You will learn about the metadata stored inside a Dataset object, and the basics of querying a Dataset object to return rows and columns.

A [`Dataset`] object is returned when you load an instance of a dataset. This object behaves like a normal Python container.
When you load a dataset split, you'll get a [`Dataset`] object. You can do many things with a [`Dataset`] object, which is why it's important to learn how to manipulate and interact with the data stored inside.
This tutorial uses the [rotten_tomatoes](https://huggingface.co/datasets/rotten_tomatoes) dataset, but feel free to load any dataset you'd like and follow along!

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')
```

## Metadata

The [`Dataset`] object contains a lot of useful information about your dataset. For example, access [`DatasetInfo`] to return a short description of the dataset, the authors, and even the dataset size. This will give you a quick snapshot of the datasets most important attributes.

```py
>>> dataset.info
DatasetInfo(
description='GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n',
citation='@inproceedings{dolan2005automatically,\n title={Automatically constructing a corpus of sentential paraphrases},\n author={Dolan, William B and Brockett, Chris},\n booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\n year={2005}\n}\n@inproceedings{wang2019glue,\n title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n note={In the Proceedings of ICLR.},\n year={2019}\n}\n', homepage='https://www.microsoft.com/en-us/download/details.aspx?id=52398',
license='',
features={'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), 'idx': Value(dtype='int32', id=None)}, post_processed=None, supervised_keys=None, builder_name='glue', config_name='mrpc', version=1.0.0, splits={'train': SplitInfo(name='train', num_bytes=943851, num_examples=3668, dataset_name='glue'), 'validation': SplitInfo(name='validation', num_bytes=105887, num_examples=408, dataset_name='glue'), 'test': SplitInfo(name='test', num_bytes=442418, num_examples=1725, dataset_name='glue')},
download_checksums={'https://dl.fbaipublicfiles.com/glue/data/mrpc_dev_ids.tsv': {'num_bytes': 6222, 'checksum': '971d7767d81b997fd9060ade0ec23c4fc31cbb226a55d1bd4a1bac474eb81dc7'}, 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt': {'num_bytes': 1047044, 'checksum': '60a9b09084528f0673eedee2b69cb941920f0b8cd0eeccefc464a98768457f89'}, 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt': {'num_bytes': 441275, 'checksum': 'a04e271090879aaba6423d65b94950c089298587d9c084bf9cd7439bd785f784'}},
download_size=1494541,
post_processing_size=None,
dataset_size=1492156,
size_in_bytes=2986697
)
```

You can request specific attributes of the dataset, like `description`, `citation`, and `homepage`, by calling them directly. Take a look at [`DatasetInfo`] for a complete list of attributes you can return.

```py
>>> dataset.split
NamedSplit('train')
>>> dataset.description
'GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n'
>>> dataset.citation
'@inproceedings{dolan2005automatically,\n title={Automatically constructing a corpus of sentential paraphrases},\n author={Dolan, William B and Brockett, Chris},\n booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\n year={2005}\n}\n@inproceedings{wang2019glue,\n title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n note={In the Proceedings of ICLR.},\n year={2019}\n}\n\nNote that each GLUE dataset has its own citation. Please see the source to see\nthe correct citation for each contained dataset.'
>>> dataset.homepage
'https://www.microsoft.com/en-us/download/details.aspx?id=52398'
>>> dataset = load_dataset("rotten_tomatoes", split="train")
```

## Features and columns
## Indexing

A dataset is a table of rows and typed columns. Querying a dataset returns a Python dictionary where the keys correspond to column names, and the values correspond to column values:
A [`Dataset`] contains columns of data, and each column can be a different type of data. The *index*, or axis label, is used to access examples from the dataset. For example, indexing by the row returns a dictionary of an example from the dataset:

```py
# Get the first row in the dataset
>>> dataset[0]
{'idx': 0,
'label': 1,
'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}
{'label': 1,
'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
```

Return the number of rows and columns with the following standard attributes:
Use the `-` operator to start from the end of the dataset:

```py
>>> dataset.shape
(3668, 4)
>>> dataset.num_columns
4
>>> dataset.num_rows
3668
>>> len(dataset)
3668
# Get the last row in the dataset
>>> dataset[-1]
{'label': 0,
'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .'}
```

List the columns names with [`Dataset.column_names`]:
Indexing by the column name returns a list of all the values in the column:

```py
>>> dataset.column_names
['idx', 'label', 'sentence1', 'sentence2']
>>> dataset["text"]
['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
'effective but too-tepid biopic',
...,
'things really get weird , though not particularly scary : the movie is all portent and no content .']
```

Get detailed information about the columns with [`~datasets.Features`]:
You can combine row and column name indexing to return a specific value at a position:

```py
>>> dataset.features
{'idx': Value(dtype='int32', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
}
>>> dataset[0]["text"]
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
```

Return even more specific information about a feature like [`ClassLabel`], by calling its parameters `num_classes` and `names`:
But it is important to remember that indexing order matters, especially when working with large audio and image datasets. Indexing by the column name returns all the values in the column first, then loads the value at that position. For large datasets, it may be slower to index by the column name first.

```py
>>> dataset.features['label'].num_classes
2
>>> dataset.features['label'].names
['not_equivalent', 'equivalent']
>>> with Timer():
... dataset[0]['text']
Elapsed time: 0.0031 seconds

>>> with Timer():
... dataset["text"][0]
Elapsed time: 0.0094 seconds
```

## Rows, slices, batches, and columns
## Slicing

Get several rows of your dataset at a time with slice notation or a list of indices:
Slicing returns a slice - or subset - of the dataset, which is useful for viewing several rows at once. To slice a dataset, use the `:` operator to specify a range of positions.

```py
# Get the first three rows
>>> dataset[:3]
{'idx': [0, 1, 2],
'label': [1, 0, 1],
'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .'],
'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale ."]
}
>>> dataset[[1, 3, 5]]
{'idx': [1, 3, 5],
'label': [0, 0, 1],
'sentence1': ["Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', 'Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .'],
'sentence2': ["Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", 'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .', "With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier ."]
}
```

Querying by the column name will return its values. For example, if you want to only return the first three examples:

```py
>>> dataset['sentence1'][:3]
['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .']
```

Depending on how a [`Dataset`] object is queried, the format returned will be different:

- A single row like `dataset[0]` returns a Python dictionary of values.
- A batch like `dataset[5:10]` returns a Python dictionary of lists of values.
- A column like `dataset['sentence1']` returns a Python list of values.
{'label': [1, 1, 1],
'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
'effective but too-tepid biopic']}

# Get rows between three and six
>>> dataset[3:6]
{'label': [1, 1, 1],
'text': ['if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
"emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .']}
```
Loading