Merged
5 changes: 5 additions & 0 deletions docs/source/package_reference/main_classes.mdx
@@ -175,7 +175,12 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
- take
- shard
- repeat
- to_csv
- to_pandas
- to_dict
- to_json
- to_parquet
- to_sql
- push_to_hub
- load_state_dict
- state_dict
21 changes: 16 additions & 5 deletions docs/source/process.mdx
@@ -808,17 +808,28 @@ The example below uses the [`pydub`](http://pydub.com/) package to open an audio

## Save

Once your dataset is ready, you can save it as a Hugging Face Dataset in Parquet format and reuse it later with [`load_dataset`].

Save your dataset by passing the name of the target dataset repository on Hugging Face to [`~IterableDataset.push_to_hub`]:

```python
encoded_dataset.push_to_hub("username/my_dataset")
```

Use the [`load_dataset`] function to reload the dataset (in streaming mode or not):

```python
from datasets import load_dataset
reloaded_dataset = load_dataset("username/my_dataset", streaming=True)
```

Alternatively, you can save it locally on disk in Arrow format. Compared to Parquet, Arrow is uncompressed, which makes it much faster to reload; this is great for local use on disk and ephemeral caching. But because the files are larger and carry less metadata, Arrow is slower to upload, download, and query than Parquet, and less suited to long-term storage.

Use [`~Dataset.save_to_disk`] to save the dataset and [`load_from_disk`] to reload it from your disk:

```py
>>> encoded_dataset.save_to_disk("path/of/my/dataset/directory")
>>> # later
>>> from datasets import load_from_disk
>>> reloaded_dataset = load_from_disk("path/of/my/dataset/directory")
```
60 changes: 53 additions & 7 deletions docs/source/stream.mdx
@@ -51,6 +51,17 @@ You can find more details in the [Dataset vs. IterableDataset guide](./about_map

</Tip>


## Column indexing

Sometimes it is convenient to iterate over values of a specific column. Fortunately, an [`IterableDataset`] supports column indexing:
```python
>>> from datasets import load_dataset
>>> dataset = load_dataset("allenai/c4", "en", streaming=True, split="train")
>>> print(next(iter(dataset["text"])))
Beginners BBQ Class Taking Place in Missoula!...
```

## Convert from a Dataset

If you have an existing [`Dataset`] object, you can convert it to an [`IterableDataset`] with the [`~Dataset.to_iterable_dataset`] function. This is actually faster than setting the `streaming=True` argument in [`load_dataset`] because the data is streamed from local files.
@@ -495,12 +506,47 @@ Resuming returns exactly where the checkpoint was saved except if `.shuffle()` i

</Tip>

## Save

Once your iterable dataset is ready, you can save it as a Hugging Face Dataset in Parquet format and reuse it later with [`load_dataset`].

Save your dataset by passing the name of the target dataset repository on Hugging Face to [`~IterableDataset.push_to_hub`]. This iterates over the dataset and progressively uploads the data to Hugging Face:

```python
dataset.push_to_hub("username/my_dataset")
```

Use the [`load_dataset`] function to reload the dataset:

```python
from datasets import load_dataset
reloaded_dataset = load_dataset("username/my_dataset")
```

## Export

🤗 Datasets supports exporting as well, so you can work with your dataset in other applications. The following table shows the currently supported file formats you can export to:

| File type | Export method |
|-------------------------|----------------------------------------------------------------|
| CSV | [`IterableDataset.to_csv`] |
| JSON | [`IterableDataset.to_json`] |
| Parquet | [`IterableDataset.to_parquet`] |
| SQL | [`IterableDataset.to_sql`] |
| In-memory Python object | [`IterableDataset.to_pandas`], [`IterableDataset.to_polars`] or [`IterableDataset.to_dict`] |

For example, export your dataset to a CSV file like this:

```py
>>> dataset.to_csv("path/of/my/dataset.csv")
```

If you have a large dataset, you can save one file per shard, e.g.

```py
>>> num_shards = dataset.num_shards
>>> for index in range(num_shards):
...     shard = dataset.shard(num_shards, index)
... shard.to_parquet(f"path/of/my/dataset/data-{index:05d}.parquet")
```
17 changes: 10 additions & 7 deletions src/datasets/arrow_dataset.py
@@ -4937,12 +4937,15 @@ def to_csv(
**to_csv_kwargs,
).write()

def to_dict(self, batch_size: Optional[int] = None, batched: bool = False) -> Union[dict, Iterator[dict]]:
"""Returns the dataset as a Python dict. Can also return a generator for large datasets.

Args:
batch_size (`int`, *optional*): The size (number of rows) of the batches if `batched` is `True`.
Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`.
batched (`bool`):
Set to `True` to return a generator that yields the dataset as batches
of `batch_size` rows. Defaults to `False` (returns the whole dataset at once).

Returns:
`dict` or `Iterator[dict]`
@@ -5045,12 +5048,12 @@ def to_pandas(
"""Returns the dataset as a `pandas.DataFrame`. Can also return a generator for large datasets.

Args:
batch_size (`int`, *optional*):
The size (number of rows) of the batches if `batched` is `True`.
Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`.
batched (`bool`):
Set to `True` to return a generator that yields the dataset as batches
of `batch_size` rows. Defaults to `False` (returns the whole dataset at once).

Returns:
`pandas.DataFrame` or `Iterator[pandas.DataFrame]`
@@ -5088,12 +5091,12 @@ def to_polars(
"""Returns the dataset as a `polars.DataFrame`. Can also return a generator for large datasets.

Args:
batch_size (`int`, *optional*):
The size (number of rows) of the batches if `batched` is `True`.
Defaults to `datasets.config.DEFAULT_MAX_BATCH_SIZE`.
batched (`bool`):
Set to `True` to return a generator that yields the dataset as batches
of `batch_size` rows. Defaults to `False` (returns the whole dataset at once).
schema_overrides (`dict`, *optional*):
Support type specification or override of one or more columns; note that
any dtypes inferred from the schema param will be overridden.