@renovate renovate bot commented Jul 16, 2025

This PR contains the following updates:

| Package | Change | Age | Confidence |
|---|---|---|---|
| datasets | `<4.0.0,>=2.19.0` -> `<4.1.0,>=4.0.0` | age | confidence |
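The change above moves the lower bound from 2.19.0 to 4.0.0, so no 2.x or 3.x release satisfies the new range. A minimal sketch of that check follows, assuming plain `X.Y.Z` versions and only `<` and `>=` clauses; the names `satisfies`, `OLD_SPEC` and `NEW_SPEC` are illustrative, and real resolvers follow the full PEP 440 rules rather than this simplification.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn a plain 'X.Y.Z' string into a tuple of integers for comparison."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, spec: str) -> bool:
    """Check a version against a comma-separated spec made of '<' and '>=' clauses."""
    candidate = parse(version)
    for clause in spec.split(","):
        clause = clause.strip()
        if clause.startswith(">="):
            ok = candidate >= parse(clause[2:])
        elif clause.startswith("<"):
            ok = candidate < parse(clause[1:])
        else:
            # Other operators (==, ~=, !=, ...) are out of scope for this sketch.
            raise ValueError(f"unsupported clause: {clause!r}")
        if not ok:
            return False
    return True


OLD_SPEC = "<4.0.0,>=2.19.0"  # constraint before this PR
NEW_SPEC = "<4.1.0,>=4.0.0"   # constraint after this PR

for candidate in ("2.19.0", "3.6.0", "4.0.0", "4.1.0"):
    print(candidate, "old:", satisfies(candidate, OLD_SPEC), "new:", satisfies(candidate, NEW_SPEC))
```

In this sketch `3.6.0` satisfies only the old range and `4.0.0` only the new one, which is why merging this PR forces the major-version upgrade instead of merely allowing it.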

Release Notes

huggingface/datasets (datasets)

v4.0.0

Compare Source

New Features
Build streaming data pipelines in a few lines of code!

```python
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
```


* Add `num_proc=` to `.push_to_hub()` (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606

```python
# Faster push to Hub! Available for both Dataset and IterableDataset
ds.push_to_hub(..., num_proc=8)
```

```python
# Syntax:
ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

# Iterate on a column:
for text in ds["text"]:
    ...

# Load one cell without bringing the full column in memory
first_text = ds["text"][0]  # equivalent to ds[0]["text"]
```
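To illustrate why `ds["text"][0]` can avoid loading the whole column, here is a minimal sketch of a lazy column view over a list of row dictionaries. `LazyColumn` is a hypothetical stand-in written for this example. It is not the `datasets.Column` implementation, which works on Arrow-backed storage.

```python
class LazyColumn:
    """Hypothetical lazy view of one column; illustrative only, not datasets.Column."""

    def __init__(self, rows, name):
        self._rows = rows  # underlying row storage; never copied
        self._name = name  # column this view exposes

    def __getitem__(self, index):
        # Read a single cell: only the requested row is touched.
        return self._rows[index][self._name]

    def __iter__(self):
        # Yield values one at a time instead of building a full list.
        for row in self._rows:
            yield row[self._name]

    def __len__(self):
        return len(self._rows)


rows = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
column = LazyColumn(rows, "text")

first_text = column[0]  # same value as rows[0]["text"]
all_texts = list(column)  # materialize only when explicitly asked
```

The design point is that indexing and iteration delegate to the underlying storage on demand, so creating the column view costs nothing up front.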

* Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
  - Enables streaming only the ranges you need!

```python
# Don't download full audios/videos when it's not necessary
# Now with torchcodec it only streams the required ranges/frames:
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
```

  - Requires `torch>=2.7.0` and FFmpeg >= 4
  - Not available for Windows yet but it is coming soon - in the meantime please use `datasets<4.0`
  - Load audio data with `AudioDecoder`:

```python
audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
samples.data  # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]]
samples.sample_rate  # 16000

# old syntax is still supported
array, sr = audio["array"], audio["sampling_rate"]
```

  - Load video data with `VideoDecoder`:

```python
video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape  # (3, 240, 320)
first_frame.pts_seconds  # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape  # torch.Size([5, 3, 240, 320])
```
Breaking changes
Other improvements and bug fixes
New Contributors

Full Changelog: huggingface/datasets@3.6.0...4.0.0


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Never, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


- [ ] If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

@renovate renovate bot added the renovate label Jul 16, 2025