fix(deps): update dependency datasets to v4 #854
This PR contains the following updates:
datasets: <4.0.0,>=2.19.0 -> <4.1.0,>=4.0.0
Release Notes
huggingface/datasets (datasets)
v4.0.0
New Features
Add IterableDataset.push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7595
Build streaming data pipelines in a few lines of code!
from datasets import load_dataset
ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
New Column object
Syntax:
ds["column_name"] # datasets.Column([...]) or datasets.IterableColumn(...)
Iterate on a column:
for text in ds["text"]:
...
Load one cell without loading the full column into memory:
first_text = ds["text"][0] # equivalent to ds[0]["text"]
torch>=2.7.0 and FFmpeg >= 4
datasets<4.0
AudioDecoder:
VideoDecoder:
Breaking changes
Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592
trust_remote_code is no longer supported
Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634
List type
Sequence was a legacy type from tensorflow datasets which converted list of dicts to dicts of lists. It is no longer a type but it becomes a utility that returns a List or a dict depending on the subfeature
Other improvements and bug fixes
Dataset.map to reuse cache files mapped with different num_proc by @ringohoffman in https://github.com/huggingface/datasets/pull/7434
RepeatExamplesIterable by @SilvanCodes in https://github.com/huggingface/datasets/pull/7581
_dill.py to use co_linetable for Python 3.10+ in place of co_lnotab by @qgallouedec in https://github.com/huggingface/datasets/pull/7609
New Contributors
Full Changelog: huggingface/datasets@3.6.0...4.0.0
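For reviewers weighing the Sequence note under Breaking changes, the "list of dicts to dicts of lists" conversion it mentions can be pictured in plain Python. This is only an illustration of the data shape, not the datasets implementation; the helper name and the sample rows are hypothetical.

```python
def list_of_dicts_to_dict_of_lists(rows):
    """Transpose row-wise records into column-wise lists.

    Hypothetical helper: it mimics the shape change the release notes
    attribute to the legacy Sequence type and is not part of datasets.
    """
    keys = list(rows[0]) if rows else []
    return {key: [row[key] for row in rows] for key in keys}


# Made-up sample rows for illustration.
rows = [{"id": 1, "tag": "a"}, {"id": 2, "tag": "b"}]
columnar = list_of_dicts_to_dict_of_lists(rows)
print(columnar)
```

Code in this repository that indexes such a feature as a dict of lists may be worth re-checking after the upgrade, since the notes say Sequence now resolves to a List or a dict depending on the subfeature.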
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Never, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.