fix(deps): update dependency datasets to v4 #854

renovate · 2025-07-16T15:05:42Z

This PR contains the following updates:

| Package | Change |
|---|---|
| datasets | `<4.0.0,>=2.19.0` -> `<4.1.0,>=4.0.0` |
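To make the range change concrete, here is a minimal sketch in plain Python that checks a few release numbers against the old and new constraints. The helper names `parse` and `in_range` are made up for illustration, and the tuple comparison only handles simple `X.Y.Z` versions; a real resolver such as pip follows the full PEP 440 rules.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn a plain 'X.Y.Z' string into a tuple of ints (no pre-release support)."""
    return tuple(int(part) for part in version.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """Return True when lower <= version < upper."""
    return parse(lower) <= parse(version) < parse(upper)


# Constraints taken from the "Change" column of this PR.
OLD = ("2.19.0", "4.0.0")  # >=2.19.0,<4.0.0
NEW = ("4.0.0", "4.1.0")   # >=4.0.0,<4.1.0

for candidate in ("3.6.0", "4.0.0", "4.1.0"):
    print(candidate, "old:", in_range(candidate, *OLD), "new:", in_range(candidate, *NEW))
```

The two ranges do not overlap, so an environment that currently resolves to a 2.x or 3.x release will have to move to 4.0.x once this PR is merged.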

Release Notes

huggingface/datasets (datasets)

`v4.0.0`


New Features

Add IterableDataset.push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7595

Build streaming data pipelines in a few lines of code!

```python
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
```


* Add `num_proc=` to `.push_to_hub()` (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606

```python
# Faster push to Hub! Available for both Dataset and IterableDataset
ds.push_to_hub(..., num_proc=8)
```

New Column object
- Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in https://github.com/huggingface/datasets/pull/7564
- Lazy column by @lhoestq in https://github.com/huggingface/datasets/pull/7614

Syntax:

```python
ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)
```

Iterate on a column:

```python
for text in ds["text"]:
    ...
```

Load one cell without bringing the full column in memory:

```python
first_text = ds["text"][0]  # equivalent to ds[0]["text"]
```
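The equivalence between `ds["text"][0]` and `ds[0]["text"]` can be illustrated with a toy model. The sketch below is not how `datasets` implements `Column`; `ToyDataset` and `ToyColumn` are hypothetical stand-ins that only show the idea of a column object that keeps a reference to the table and reads a cell when asked, instead of copying every value up front.

```python
class ToyColumn:
    """Hypothetical lazy view over one column of a row-oriented table."""

    def __init__(self, rows, name):
        self._rows = rows  # reference to the shared rows, not a copy
        self._name = name

    def __getitem__(self, index):
        # Only the requested row is touched.
        return self._rows[index][self._name]

    def __iter__(self):
        # Values are produced one at a time.
        return (row[self._name] for row in self._rows)

    def __len__(self):
        return len(self._rows)


class ToyDataset:
    """Hypothetical dataset: an int key selects a row, a str key selects a column."""

    def __init__(self, rows):
        self._rows = rows

    def __getitem__(self, key):
        if isinstance(key, str):
            return ToyColumn(self._rows, key)
        return self._rows[key]


ds = ToyDataset([{"text": "hello", "label": 0}, {"text": "world", "label": 1}])
first_text = ds["text"][0]  # same cell as ds[0]["text"]
```

In this toy version, selecting a column costs nothing until a value is requested, which is the property the release notes describe for the new column objects.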

* Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
- Enables streaming only the ranges you need!

```python
# Don't download full audios/videos when it's not necessary
# Now with torchcodec it only streams the required ranges/frames:
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
```

- Requires `torch>=2.7.0` and FFmpeg >= 4.
- Not available for Windows yet, but support is coming soon. In the meantime, please use `datasets<4.0`.

Load audio data with `AudioDecoder`:

```python
audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
samples.data  # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]])
samples.sample_rate  # 16000

# old syntax is still supported
array, sr = audio["array"], audio["sampling_rate"]
```

Load video data with VideoDecoder:

```python
video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape  # (3, 240, 320)
first_frame.pts_seconds  # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape  # torch.Size([5, 3, 240, 320])
```

Breaking changes

Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592
- trust_remote_code is no longer supported
Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
- torchcodec replaces soundfile for audio decoding
- torchcodec replaces decord for video decoding

Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634

Introduction of the List type

```python
from datasets import Features, List, Value

features = Features({
    "texts": List(Value("string")),
    "four_paragraphs": List(Value("string"), length=4)
})
```

`Sequence` was a legacy type from TensorFlow Datasets that converted lists of dicts into dicts of lists. It is no longer a type; it is now a utility that returns a `List` or a `dict` depending on the subfeature:

```python
from datasets import Sequence

Sequence(Value("string"))  # List(Value("string"))
Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
```
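The dispatch described above (a dict subfeature becomes a dict of lists, anything else becomes a list) can be mimicked in a few lines. This is a rough sketch, not the library's code: `Value` and `List` below are hypothetical stand-in dataclasses rather than the real `datasets` types, `sequence` is a made-up helper name, and the `-1` default for `length` is an assumption used here to mean "variable length". Nested dicts are not handled.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Value:
    """Stand-in for a scalar feature type (hypothetical, not datasets.Value)."""
    dtype: str


@dataclass(frozen=True)
class List:
    """Stand-in for the new list feature type (hypothetical, not datasets.List)."""
    feature: Any
    length: int = -1  # assumption: -1 means variable length in this sketch


def sequence(feature: Any, length: int = -1) -> Any:
    """Mimic the described behavior: dict subfeature -> dict of lists, otherwise a List."""
    if isinstance(feature, dict):
        return {name: List(sub, length) for name, sub in feature.items()}
    return List(feature, length)


print(sequence(Value("string")))
print(sequence({"texts": Value("string")}))
```

Code that still builds features with `Sequence({...})` therefore ends up with a plain dict of `List` features, which is worth keeping in mind when comparing schemas before and after the upgrade.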

Other improvements and bug fixes

Refactor Dataset.map to reuse cache files mapped with different num_proc by @ringohoffman in https://github.com/huggingface/datasets/pull/7434
fix string_to_dict test by @lhoestq in https://github.com/huggingface/datasets/pull/7571
Preserve formatting in concatenated IterableDataset by @francescorubbo in https://github.com/huggingface/datasets/pull/7522
Fix typos in PDF and Video documentation by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7579
fix: Add embed_storage in Pdf feature by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7582
load_dataset splits typing by @lhoestq in https://github.com/huggingface/datasets/pull/7587
Fixed typos by @TopCoder2K in https://github.com/huggingface/datasets/pull/7572
Fix regex library warnings by @emmanuel-ferdman in https://github.com/huggingface/datasets/pull/7576
[MINOR:TYPO] Update save_to_disk docstring by @cakiki in https://github.com/huggingface/datasets/pull/7575
Add missing property on RepeatExamplesIterable by @SilvanCodes in https://github.com/huggingface/datasets/pull/7581
Avoid multiple default config names by @albertvillanova in https://github.com/huggingface/datasets/pull/7585
Fix broken link to albumentations by @ternaus in https://github.com/huggingface/datasets/pull/7593
fix string_to_dict usage for windows by @lhoestq in https://github.com/huggingface/datasets/pull/7598
No TF in win tests by @lhoestq in https://github.com/huggingface/datasets/pull/7603
Docs and more methods for IterableDataset: push_to_hub, to_parquet... by @lhoestq in https://github.com/huggingface/datasets/pull/7604
Tests typing and fixes for push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7608
fix parallel push_to_hub in dataset_dict by @lhoestq in https://github.com/huggingface/datasets/pull/7613
remove unused code by @lhoestq in https://github.com/huggingface/datasets/pull/7615
Update _dill.py to use co_linetable for Python 3.10+ in place of co_lnotab by @qgallouedec in https://github.com/huggingface/datasets/pull/7609
Fixes in docs by @lhoestq in https://github.com/huggingface/datasets/pull/7620
Add albumentations to use dataset by @ternaus in https://github.com/huggingface/datasets/pull/7596
minor docs data aug by @lhoestq in https://github.com/huggingface/datasets/pull/7621
fix: raise error in FolderBasedBuilder when data_dir and data_files are missing by @ArjunJagdale in https://github.com/huggingface/datasets/pull/7623
fix save_infos by @lhoestq in https://github.com/huggingface/datasets/pull/7639
better features repr by @lhoestq in https://github.com/huggingface/datasets/pull/7640
update docs and docstrings by @lhoestq in https://github.com/huggingface/datasets/pull/7641
fix length for ci by @lhoestq in https://github.com/huggingface/datasets/pull/7642
Backward compat sequence instance by @lhoestq in https://github.com/huggingface/datasets/pull/7643
fix sequence ci by @lhoestq in https://github.com/huggingface/datasets/pull/7644
Custom metadata filenames by @lhoestq in https://github.com/huggingface/datasets/pull/7663
Update the beans dataset link in Preprocess by @HJassar in https://github.com/huggingface/datasets/pull/7659
Backward compat list feature by @lhoestq in https://github.com/huggingface/datasets/pull/7666
Fix infer list of images by @lhoestq in https://github.com/huggingface/datasets/pull/7667
Fix audio bytes by @lhoestq in https://github.com/huggingface/datasets/pull/7670
Fix double sequence by @lhoestq in https://github.com/huggingface/datasets/pull/7672

New Contributors

@TopCoder2K made their first contribution in https://github.com/huggingface/datasets/pull/7564
@francescorubbo made their first contribution in https://github.com/huggingface/datasets/pull/7522
@emmanuel-ferdman made their first contribution in https://github.com/huggingface/datasets/pull/7576
@SilvanCodes made their first contribution in https://github.com/huggingface/datasets/pull/7581
@ternaus made their first contribution in https://github.com/huggingface/datasets/pull/7593
@ArjunJagdale made their first contribution in https://github.com/huggingface/datasets/pull/7623
@TyTodd made their first contribution in https://github.com/huggingface/datasets/pull/7616
@HJassar made their first contribution in https://github.com/huggingface/datasets/pull/7659

Full Changelog: huggingface/datasets@3.6.0...4.0.0

Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Never, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.

If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

fix(deps): update dependency datasets to v4

3a010c9

renovate bot added the renovate label Jul 16, 2025
