2 changes: 1 addition & 1 deletion docs/source/dataset_card.mdx
@@ -24,4 +24,4 @@ Creating a dataset card is easy and can be done in just a few steps:
 
 YAML also allows you to customize the way your dataset is loaded by [defining splits and/or configurations](./repository_structure#define-your-splits-and-subsets-in-yaml) without the need to write any code.
 
-Feel free to take a look at the [SNLI](https://huggingface.co/datasets/snli), [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail), and [Allociné](https://huggingface.co/datasets/allocine) dataset cards as examples to help you get started.
+Feel free to take a look at the [SNLI](https://huggingface.co/datasets/stanfordnlp/snli), [CNN/DailyMail](https://huggingface.co/datasets/abisee/cnn_dailymail), and [Allociné](https://huggingface.co/datasets/tblard/allocine) dataset cards as examples to help you get started.
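For context on the YAML-defined splits and configurations mentioned above, a minimal sketch of how such a configuration is then selected at load time; the repository id `username/my_dataset` and the `en` config name are hypothetical placeholders.

```py
>>> from datasets import load_dataset

>>> # "username/my_dataset" and "en" are placeholders for a repo whose card's YAML
>>> # declares an "en" configuration with a "train" split
>>> ds = load_dataset("username/my_dataset", "en", split="train")
```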
4 changes: 2 additions & 2 deletions docs/source/faiss_es.mdx
@@ -22,7 +22,7 @@ FAISS retrieves documents based on the similarity of their vector representation
 
 ```py
 >>> from datasets import load_dataset
->>> ds = load_dataset('crime_and_punish', split='train[:100]')
+>>> ds = load_dataset('community-datasets/crime_and_punish', split='train[:100]')
 >>> ds_with_embeddings = ds.map(lambda example: {'embeddings': ctx_encoder(**ctx_tokenizer(example["line"], return_tensors="pt"))[0][0].numpy()})
 ```
 
@@ -62,7 +62,7 @@ FAISS retrieves documents based on the similarity of their vector representation
 7. Reload it at a later time with [`Dataset.load_faiss_index`]:
 
 ```py
->>> ds = load_dataset('crime_and_punish', split='train[:100]')
+>>> ds = load_dataset('community-datasets/crime_and_punish', split='train[:100]')
 >>> ds.load_faiss_index('embeddings', 'my_index.faiss')
 ```
 
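For reference, the two hunks above bracket the full FAISS workflow; a condensed sketch, assuming `embed` is any function that returns a 1-D NumPy array per example:

```py
>>> from datasets import load_dataset

>>> ds = load_dataset('community-datasets/crime_and_punish', split='train[:100]')
>>> ds_with_embeddings = ds.map(lambda example: {'embeddings': embed(example['line'])})  # `embed` is a placeholder
>>> ds_with_embeddings.add_faiss_index(column='embeddings')
>>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', embed('my new query'), k=10)
>>> ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')

>>> # later: reload the dataset and re-attach the saved index
>>> ds = load_dataset('community-datasets/crime_and_punish', split='train[:100]')
>>> ds.load_faiss_index('embeddings', 'my_index.faiss')
```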
4 changes: 2 additions & 2 deletions docs/source/image_load.mdx
@@ -10,7 +10,7 @@ When you load an image dataset and call the image column, the images are decoded
 ```py
 >>> from datasets import load_dataset, Image
 
->>> dataset = load_dataset("beans", split="train")
+>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train")
 >>> dataset[0]["image"]
 ```
 
@@ -33,7 +33,7 @@ You can load a dataset from the image path. Use the [`~Dataset.cast_column`] fun
 If you only want to load the underlying path to the image dataset without decoding the image object, set `decode=False` in the [`Image`] feature:
 
 ```py
->>> dataset = load_dataset("beans", split="train").cast_column("image", Image(decode=False))
+>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train").cast_column("image", Image(decode=False))
 >>> dataset[0]["image"]
 {'bytes': None,
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/b0a21163f78769a2cf11f58dfc767fb458fc7cea5c05dccc0144a2c0f0bc1292/train/bean_rust/bean_rust_train.29.jpg'}
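The second hunk above refers to loading a dataset from image paths with [`~Dataset.cast_column`]; a minimal sketch, with hypothetical local file paths:

```py
>>> from datasets import Dataset, Image

>>> # the paths below are placeholders; point them at real image files on disk
>>> dataset = Dataset.from_dict({"image": ["path/to/image_1.jpg", "path/to/image_2.jpg"]})
>>> dataset = dataset.cast_column("image", Image())
>>> dataset[0]["image"]  # decoded as a PIL image on access
```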
2 changes: 1 addition & 1 deletion docs/source/loading.mdx
@@ -327,7 +327,7 @@ Select specific rows of the `train` split:
 ```py
 >>> train_10_20_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train[10:20]")
 ===STRINGAPI-READINSTRUCTION-SPLIT===
->>> train_10_20_ds = datasets.load_dataset("bookcorpu", split=datasets.ReadInstruction("train", from_=10, to=20, unit="abs"))
+>>> train_10_20_ds = datasets.load_dataset("rojagtap/bookcorpus", split=datasets.ReadInstruction("train", from_=10, to=20, unit="abs"))
 ```
 
 Or select a percentage of a split with:
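The hunk above selects rows by absolute index; for comparison, a sketch of the percentage-based form mentioned right after it, reusing the same `rojagtap/bookcorpus` id:

```py
>>> import datasets

>>> # first 10% of the train split, as a string split spec
>>> train_10pct_ds = datasets.load_dataset("rojagtap/bookcorpus", split="train[:10%]")
>>> # equivalent ReadInstruction form
>>> train_10pct_ds = datasets.load_dataset("rojagtap/bookcorpus", split=datasets.ReadInstruction("train", to=10, unit="%"))
```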
4 changes: 2 additions & 2 deletions docs/source/object_detection.mdx
@@ -8,14 +8,14 @@ To run these examples, make sure you have up-to-date versions of [albumentations
 pip install -U albumentations opencv-python
 ```
 
-In this example, you'll use the [`cppe-5`](https://huggingface.co/datasets/cppe-5) dataset for identifying medical personal protective equipment (PPE) in the context of the COVID-19 pandemic.
+In this example, you'll use the [`cppe-5`](https://huggingface.co/datasets/rishitdagli/cppe-5) dataset for identifying medical personal protective equipment (PPE) in the context of the COVID-19 pandemic.
 
 Load the dataset and take a look at an example:
 
 ```py
 >>> from datasets import load_dataset
 
->>> ds = load_dataset("cppe-5")
+>>> ds = load_dataset("rishitdagli/cppe-5")
 >>> example = ds['train'][0]
 >>> example
 {'height': 663,
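Since the hunk above installs albumentations and loads `cppe-5` for object detection, a sketch of the kind of transform the guide builds toward; the field names (`image`, `objects`, `bbox`, `category`) follow the example record shown above, and the flip transform is an arbitrary choice:

```py
>>> import albumentations
>>> import numpy as np

>>> # compose a transform that updates COCO-format bounding boxes along with the image
>>> transform = albumentations.Compose(
...     [albumentations.HorizontalFlip(p=1.0)],
...     bbox_params=albumentations.BboxParams(format="coco", label_fields=["category"]),
... )
>>> out = transform(
...     image=np.asarray(example["image"]),
...     bboxes=example["objects"]["bbox"],
...     category=example["objects"]["category"],
... )
```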
2 changes: 1 addition & 1 deletion docs/source/quickstart.mdx
@@ -288,7 +288,7 @@ pip install -U albumentations opencv-python
 
 ## NLP
 
-Text needs to be tokenized into individual tokens by a [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer). For the quickstart, you'll load the [Microsoft Research Paraphrase Corpus (MRPC)](https://huggingface.co/datasets/glue/viewer/mrpc) training dataset to train a model to determine whether a pair of sentences mean the same thing.
+Text needs to be tokenized into individual tokens by a [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer). For the quickstart, you'll load the [Microsoft Research Paraphrase Corpus (MRPC)](https://huggingface.co/datasets/nyu-mll/glue/viewer/mrpc) training dataset to train a model to determine whether a pair of sentences mean the same thing.
 
 **1**. Load the MRPC dataset by providing the [`load_dataset`] function with the dataset name, dataset configuration (not all datasets will have a configuration), and dataset split:
 
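To connect the two sentences in the hunk above, a sketch of loading MRPC with its `glue` configuration and tokenizing the sentence pairs; the `bert-base-uncased` checkpoint is only an example choice:

```py
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer

>>> dataset = load_dataset("nyu-mll/glue", "mrpc", split="train")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> # tokenize both sentences of each pair in batches
>>> dataset = dataset.map(
...     lambda examples: tokenizer(examples["sentence1"], examples["sentence2"], truncation=True),
...     batched=True,
... )
```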
4 changes: 2 additions & 2 deletions docs/source/stream.mdx
@@ -160,11 +160,11 @@ You can split your dataset one of two ways:
 
 🤗 Datasets supports sharding to divide a very large dataset into a predefined number of chunks. Specify the `num_shards` parameter in [`~IterableDataset.shard`] to determine the number of shards to split the dataset into. You'll also need to provide the shard you want to return with the `index` parameter.
 
-For example, the [amazon_polarity](https://huggingface.co/datasets/amazon_polarity) dataset has 4 shards (in this case they are 4 Parquet files):
+For example, the [amazon_polarity](https://huggingface.co/datasets/fancyzhx/amazon_polarity) dataset has 4 shards (in this case they are 4 Parquet files):
 
 ```py
 >>> from datasets import load_dataset
->>> dataset = load_dataset("amazon_polarity", split="train", streaming=True)
+>>> dataset = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
 >>> print(dataset)
 IterableDataset({
 features: ['label', 'title', 'content'],
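Following the `num_shards`/`index` explanation in the hunk above, a minimal sketch of keeping one of the four shards:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
>>> shard_0 = dataset.shard(num_shards=4, index=0)  # first of the 4 Parquet shards
```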
4 changes: 2 additions & 2 deletions docs/source/use_with_jax.mdx
@@ -195,11 +195,11 @@ part.
 
 The easiest way to get JAX arrays out of a dataset is to use the `with_format('jax')` method. Lets assume
 that we want to train a neural network on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) available
-at the HuggingFace Hub at https://huggingface.co/datasets/mnist.
+at the HuggingFace Hub at https://huggingface.co/datasets/ylecun/mnist.
 
 ```py
 >>> from datasets import load_dataset
->>> ds = load_dataset("mnist")
+>>> ds = load_dataset("ylecun/mnist")
 >>> ds = ds.with_format("jax")
 >>> ds["train"][0]
 {'image': DeviceArray([[ 0, 0, 0, ...],
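As a follow-up to the `with_format("jax")` hunk above, a sketch of iterating the formatted split in batches, e.g. to feed a training step; the batch size is arbitrary:

```py
>>> from datasets import load_dataset

>>> ds = load_dataset("ylecun/mnist", split="train").with_format("jax")
>>> for batch in ds.iter(batch_size=32):
...     images, labels = batch["image"], batch["label"]  # JAX arrays
...     break
```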
2 changes: 1 addition & 1 deletion docs/source/use_with_numpy.mdx
@@ -160,7 +160,7 @@ at the HuggingFace Hub at https://huggingface.co/datasets/mnist.
 
 ```py
 >>> from datasets import load_dataset
->>> ds = load_dataset("mnist")
+>>> ds = load_dataset("ylecun/mnist")
 >>> ds = ds.with_format("numpy")
 >>> ds["train"][0]
 {'image': array([[ 0, 0, 0, ...],
10 changes: 5 additions & 5 deletions src/datasets/arrow_dataset.py
@@ -1970,7 +1970,7 @@ def class_encode_column(self, column: str, include_nulls: bool = False) -> "Data
 
 ```py
 >>> from datasets import load_dataset
->>> ds = load_dataset("boolq", split="validation")
+>>> ds = load_dataset("google/boolq", split="validation")
 >>> ds.features
 {'answer': Value('bool'),
 'passage': Value('string'),
@@ -4725,7 +4725,7 @@ def train_test_split(
 >>> ds = ds.train_test_split(test_size=0.2, seed=42)
 
 # stratified split
->>> ds = load_dataset("imdb",split="train")
+>>> ds = load_dataset("stanfordnlp/imdb",split="train")
 Dataset({
 features: ['text', 'label'],
 num_rows: 25000
@@ -6175,15 +6175,15 @@ def add_faiss_index(
 Example:
 
 ```python
->>> ds = datasets.load_dataset('crime_and_punish', split='train')
+>>> ds = datasets.load_dataset('community-datasets/crime_and_punish', split='train')
 >>> ds_with_embeddings = ds.map(lambda example: {'embeddings': embed(example['line']}))
 >>> ds_with_embeddings.add_faiss_index(column='embeddings')
 >>> # query
 >>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', embed('my new query'), k=10)
 >>> # save index
 >>> ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')
 
->>> ds = datasets.load_dataset('crime_and_punish', split='train')
+>>> ds = datasets.load_dataset('community-datasets/crime_and_punish', split='train')
 >>> # load index
 >>> ds.load_faiss_index('embeddings', 'my_index.faiss')
 >>> # query
@@ -6314,7 +6314,7 @@ def add_elasticsearch_index(
 
 ```python
 >>> es_client = elasticsearch.Elasticsearch()
->>> ds = datasets.load_dataset('crime_and_punish', split='train')
+>>> ds = datasets.load_dataset('community-datasets/crime_and_punish', split='train')
 >>> ds.add_elasticsearch_index(column='line', es_client=es_client, es_index_name="my_es_index")
 >>> scores, retrieved_examples = ds.get_nearest_examples('line', 'my new query', k=10)
 ```
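The `train_test_split` hunk above mentions a stratified split, but the call itself sits outside the visible context; a sketch of the usual form, assuming the `label` column is a `ClassLabel`:

```py
>>> from datasets import load_dataset

>>> ds = load_dataset("stanfordnlp/imdb", split="train")
>>> # keep the label distribution identical in both resulting splits
>>> ds = ds.train_test_split(test_size=0.2, stratify_by_column="label", seed=42)
```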
24 changes: 12 additions & 12 deletions src/datasets/arrow_reader.py
@@ -459,34 +459,34 @@ class ReadInstruction:
 Examples::
 
 # The following lines are equivalent:
-ds = datasets.load_dataset('mnist', split='test[:33%]')
-ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec('test[:33%]'))
-ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction('test', to=33, unit='%'))
-ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
+ds = datasets.load_dataset('ylecun/mnist', split='test[:33%]')
+ds = datasets.load_dataset('ylecun/mnist', split=datasets.ReadInstruction.from_spec('test[:33%]'))
+ds = datasets.load_dataset('ylecun/mnist', split=datasets.ReadInstruction('test', to=33, unit='%'))
+ds = datasets.load_dataset('ylecun/mnist', split=datasets.ReadInstruction(
 'test', from_=0, to=33, unit='%'))
 
 # The following lines are equivalent:
-ds = datasets.load_dataset('mnist', split='test[:33%]+train[1:-1]')
-ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
+ds = datasets.load_dataset('ylecun/mnist', split='test[:33%]+train[1:-1]')
+ds = datasets.load_dataset('ylecun/mnist', split=datasets.ReadInstruction.from_spec(
 'test[:33%]+train[1:-1]'))
-ds = datasets.load_dataset('mnist', split=(
+ds = datasets.load_dataset('ylecun/mnist', split=(
 datasets.ReadInstruction('test', to=33, unit='%') +
 datasets.ReadInstruction('train', from_=1, to=-1, unit='abs')))
 
 # The following lines are equivalent:
-ds = datasets.load_dataset('mnist', split='test[:33%](pct1_dropremainder)')
-ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
+ds = datasets.load_dataset('ylecun/mnist', split='test[:33%](pct1_dropremainder)')
+ds = datasets.load_dataset('ylecun/mnist', split=datasets.ReadInstruction.from_spec(
 'test[:33%](pct1_dropremainder)'))
-ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
+ds = datasets.load_dataset('ylecun/mnist', split=datasets.ReadInstruction(
 'test', from_=0, to=33, unit='%', rounding="pct1_dropremainder"))
 
 # 10-fold validation:
 tests = datasets.load_dataset(
-'mnist',
+'ylecun/mnist',
 [datasets.ReadInstruction('train', from_=k, to=k+10, unit='%')
 for k in range(0, 100, 10)])
 trains = datasets.load_dataset(
-'mnist',
+'ylecun/mnist',
 [datasets.ReadInstruction('train', to=k, unit='%') + datasets.ReadInstruction('train', from_=k+10, unit='%')
 for k in range(0, 100, 10)])
 
2 changes: 1 addition & 1 deletion src/datasets/dataset_dict.py
@@ -515,7 +515,7 @@ def class_encode_column(self, column: str, include_nulls: bool = False) -> "Data
 
 ```py
 >>> from datasets import load_dataset
->>> ds = load_dataset("boolq")
+>>> ds = load_dataset("google/boolq")
 >>> ds["train"].features
 {'answer': Value('bool'),
 'passage': Value('string'),
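For context on the `class_encode_column` docstring above, a sketch of the call it documents, casting the boolean `answer` column of BoolQ to a `ClassLabel` across all splits:

```py
>>> from datasets import load_dataset

>>> ds = load_dataset("google/boolq")
>>> ds = ds.class_encode_column("answer")
>>> ds["train"].features["answer"]  # now a ClassLabel feature
```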
2 changes: 1 addition & 1 deletion src/datasets/download/download_manager.py
@@ -269,7 +269,7 @@ def iter_files(self, paths: Union[str, list[str]]):
 Example:
 
 ```py
->>> files = dl_manager.download_and_extract('https://huggingface.co/datasets/beans/resolve/main/data/train.zip')
+>>> files = dl_manager.download_and_extract('https://huggingface.co/datasets/AI-Lab-Makerere/beans/resolve/main/data/train.zip')
 >>> files = dl_manager.iter_files(files)
 ```
 """
2 changes: 1 addition & 1 deletion src/datasets/download/streaming_download_manager.py
@@ -206,7 +206,7 @@ def iter_files(self, urlpaths: Union[str, list[str]]) -> Iterable[str]:
 Example:
 
 ```py
->>> files = dl_manager.download_and_extract('https://huggingface.co/datasets/beans/resolve/main/data/train.zip')
+>>> files = dl_manager.download_and_extract('https://huggingface.co/datasets/AI-Lab-Makerere/beans/resolve/main/data/train.zip')
 >>> files = dl_manager.iter_files(files)
 ```
 """
2 changes: 1 addition & 1 deletion src/datasets/iterable_dataset.py
@@ -3218,7 +3218,7 @@ def shard(
 
 ```py
 >>> from datasets import load_dataset
->>> ds = load_dataset("amazon_polarity", split="train", streaming=True)
+>>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
 >>> ds
 Dataset({
 features: ['label', 'title', 'content'],
2 changes: 1 addition & 1 deletion src/datasets/utils/patching.py
@@ -28,7 +28,7 @@ class patch_submodule:
 >>> from datasets.load import dataset_module_factory
 >>> from datasets.streaming import patch_submodule, xjoin
 >>>
->>> dataset_module = dataset_module_factory("snli")
+>>> dataset_module = dataset_module_factory("stanfordnlp/snli")
 >>> snli_module = importlib.import_module(dataset_module.module_path)
 >>> patcher = patch_submodule(snli_module, "os.path.join", xjoin)
 >>> patcher.start()
12 changes: 6 additions & 6 deletions tests/test_metadata_util.py
@@ -282,23 +282,23 @@ def test_split_order_in_metadata_configs_from_exported_parquet_files_and_dataset
 "dataset": "AI-Lab-Makerere/beans",
 "config": "default",
 "split": "test",
-"url": "https://huggingface.co/datasets/beans/resolve/refs%2Fconvert%2Fparquet/default/test/0000.parquet",
+"url": "https://huggingface.co/datasets/AI-Lab-Makerere/beans/resolve/refs%2Fconvert%2Fparquet/default/test/0000.parquet",
 "filename": "0000.parquet",
 "size": 17707203,
 },
 {
 "dataset": "AI-Lab-Makerere/beans",
 "config": "default",
 "split": "train",
-"url": "https://huggingface.co/datasets/beans/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet",
+"url": "https://huggingface.co/datasets/AI-Lab-Makerere/beans/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet",
 "filename": "0000.parquet",
 "size": 143780164,
 },
 {
 "dataset": "AI-Lab-Makerere/beans",
 "config": "default",
 "split": "validation",
-"url": "https://huggingface.co/datasets/beans/resolve/refs%2Fconvert%2Fparquet/default/validation/0000.parquet",
+"url": "https://huggingface.co/datasets/AI-Lab-Makerere/beans/resolve/refs%2Fconvert%2Fparquet/default/validation/0000.parquet",
 "filename": "0000.parquet",
 "size": 18500862,
 },
@@ -332,15 +332,15 @@ def test_split_order_in_metadata_configs_from_exported_parquet_files_and_dataset
 },
 },
 download_checksums={
-"https://huggingface.co/datasets/beans/resolve/main/data/train.zip": {
+"https://huggingface.co/datasets/AI-Lab-Makerere/beans/resolve/main/data/train.zip": {
 "num_bytes": 143812152,
 "checksum": None,
 },
-"https://huggingface.co/datasets/beans/resolve/main/data/validation.zip": {
+"https://huggingface.co/datasets/AI-Lab-Makerere/beans/resolve/main/data/validation.zip": {
 "num_bytes": 18504213,
 "checksum": None,
 },
-"https://huggingface.co/datasets/beans/resolve/main/data/test.zip": {
+"https://huggingface.co/datasets/AI-Lab-Makerere/beans/resolve/main/data/test.zip": {
 "num_bytes": 17708541,
 "checksum": None,
 },