Releases · huggingface/datasets

16 Nov 08:06

albertvillanova

2.15.0

0caf912

2.15.0

What's Changed

Fix typo in Audio dataset documentation by @prassanna-ravishankar in #6222
Add push_to_hub with multiple configs docs by @lhoestq in #6226
Remove RGB -> BGR image conversion in Object Detection tutorial by @mariosasko in #6228
Update README.md by @NinoRisteski in #6233
Don't skip hidden files in dl_manager.iter_files when they are given as input by @mariosasko in #6230
Update README.md by @NinoRisteski in #6223
Remove unused global variables in audio.py by @mariosasko in #6241
Improve error message for missing function parameters by @suavemint in #6232
Fix cast from fixed size list to variable size list by @mariosasko in #6243
Update create_dataset.mdx by @EswarDivi in #6247
[DOCS] Fix typo: Elasticsearch by @leemthompo in #6258
Support streaming datasets with pyarrow.parquet.read_table by @albertvillanova in #6251
Temporarily pin tensorflow < 2.14.0 by @albertvillanova in #6264
Fix CI 404 errors by @albertvillanova in #6262
Remove apache_beam import in BeamBasedBuilder._save_info by @mariosasko in #6265
Improve documentation of dataset.from_generator by @hartmans in #6281
Fix parquet columns argument in streaming mode by @lhoestq in #6295
Doc readme improvements by @mariosasko in #6298
Unpin tensorflow maximum version by @mariosasko in #6301
Unpin jax maximum version by @mariosasko in #6300
Fix ArrayXD cast by @mariosasko in #6297
Reduce the number of commits in push_to_hub by @mariosasko in #6269
Fix typo in code example in docs by @bryant1410 in #6307
Update README.md by @smty2018 in #6304
Deterministic set hash by @lhoestq in #6318
docs: resolving namespace conflict, refactored variable by @smty2018 in #6312
Fix typos by @python273 in #6321
Fix commit message formatting in multi-commit uploads by @qgallouedec in #6313
Temporarily pin fsspec < 2023.10.0 by @albertvillanova in #6331
Unpin fsspec by @lhoestq in #6336
Fix use_dataset.mdx by @angel-luis in #6351
Add fsspec version to the datasets-cli env command output by @mariosasko in #6356
Expanduser in save_to_disk() by @Unknown3141592 in #6098
Fix time measuring snippet in docs by @mariosasko in #6367
Temporarily pin pyarrow < 14.0.0 by @albertvillanova in #6375
Fix typo in Dataset.map docstring by @bryant1410 in #6373
Avoid redundant warning when encoding NumPy array as Image by @mariosasko in #6379
Replace deprecated license_file in setup.cfg by @albertvillanova in #6332
Minor release step improvement by @lhoestq in #6339
Fix dependency conflict within CI build documentation by @albertvillanova in #6411
Remove redundant condition in builders by @albertvillanova in #6398
Handle future deprecation argument by @winglian in #6390
Remove token value from warnings by @mariosasko in #6418
Rename audio_classificiation.py to audio_classification.py by @carlthome in #6416
Add pyarrow-hotfix to release docs by @albertvillanova in #6421
Simplify filesystem logic by @mariosasko in #6362
Fix conda release by adding pyarrow-hotfix dependency by @albertvillanova in #6423

New Contributors

@prassanna-ravishankar made their first contribution in #6222
@NinoRisteski made their first contribution in #6233
@suavemint made their first contribution in #6232
@EswarDivi made their first contribution in #6247
@leemthompo made their first contribution in #6258
@hartmans made their first contribution in #6281
@smty2018 made their first contribution in #6304
@python273 made their first contribution in #6321
@angel-luis made their first contribution in #6351
@Unknown3141592 made their first contribution in #6098
@winglian made their first contribution in #6390
@carlthome made their first contribution in #6416

Full Changelog: 2.14.7...2.15.0

Contributors

hartmans, winglian, and 15 other contributors


15 Nov 08:19

albertvillanova

2.14.7

bf02cff

2.14.7

Bug Fixes

Fix UnboundLocalError if preprocessing returns an empty list by @cwallenwein in #6346
Fix python formatting for complex types in format_table by @mariosasko in #6368
Support pyarrow 14.0.0 by @albertvillanova in #6378
Do not try to download from HF GCS for generator by @yundai424 in #6372
Support pyarrow 14.0.1 and fix vulnerability CVE-2023-47248 by @albertvillanova in #6404

New Contributors

@cwallenwein made their first contribution in #6346
@yundai424 made their first contribution in #6372

Full Changelog: 2.14.6...2.14.7

Contributors

albertvillanova, cwallenwein, and 2 other contributors


24 Oct 08:15

lhoestq

2.14.6

06c3ffb

2.14.6

What's Changed

Ignore dataset_info.json in data files resolution by @mariosasko in #6224
Check builder cls default config name in inspect by @lhoestq in #6253
Add support for fsspec>=2023.9.0 by @mariosasko in #6244
Create DefunctDatasetError by @albertvillanova in #6286
Fix get_data_patterns for directories with the word data twice by @albertvillanova in #6309
Fix loading Hub datasets with CSV metadata file by @albertvillanova in #6316
datasets.filesystems: fix is_remote_filesystems by @ap-- in #6334
Pin upper version of fsspec by @albertvillanova in #6337
Fix regex get_data_files formatting for base paths by @ZachNagengast in #6322

New Contributors

@ap-- made their first contribution in #6334
@ZachNagengast made their first contribution in #6322

Full Changelog: 2.14.5...2.14.6

Contributors

ap--, ZachNagengast, and 3 other contributors


24 Oct 08:15

albertvillanova

2.14.5

1a598a0

2.14.5

Bug fixes

Bump fsspec from 2021.11.1 to 2022.3.0 by @mariosasko in #6091
Minor fix in iter_files for hidden files by @mariosasko in #6092
Use yaml instead of get data patterns when possible by @lhoestq in #6154
Fix Parquet loading with columns by @mariosasko in #6160
Fix: Missing a MetadataConfigs init when the repo has a datasets_info.json but no README by @clefourrier in #6164
PyArrow 13 CI fixes by @mariosasko in #6175
Don't alter input in Features.from_dict by @lhoestq in #6189
Fix multiprocessing with spawn in iterable datasets by @Hubert-Bonisseur in #6165
Set minimal fsspec version requirement to 2023.1.0 by @mariosasko in #6192
Temporarily pin pandas < 2.1.0 by @albertvillanova in #6200
Preserve split order in DataFilesDict by @albertvillanova in #6198
Add missing revision argument by @qgallouedec in #6191
Temporarily pin fsspec < 2023.9.0 by @albertvillanova in #6210
Do not filter out .zip extensions from no-script datasets by @albertvillanova in #6208
Fix empty splitinfo json by @lhoestq in #6211
Fix to_json ValueError and remove pandas pin by @albertvillanova in #6201
Fix checking patterns to infer packaged builder by @polinaeterna in #6215
Rename old push_to_hub configs to "default" in dataset_infos by @lhoestq in #6218

Other improvements

Deprecate Dataset.export by @mariosasko in #6081
Deprecate download_custom by @mariosasko in #6093
Ignore CI lint rule violation in Pickler.memoize by @albertvillanova in #6138
Remove unused allowed_extensions param by @albertvillanova in #6135
Export to_iterable_dataset to document by @npuichigo in #6145
[Docs] Add description of select_columns to guide by @unifyh in #6119
Ignore parallel warning in map_nested by @lhoestq in #6148
[docs] Complete to_iterable_dataset by @stevhliu in #6158
Raise FileNotFoundError when passing data_files that don't exist by @lhoestq in #6155
Fix typo in about_mapstyle_vs_iterable.mdx by @lhoestq in #6171
Document BUILDER_CONFIG_CLASS by @lhoestq in #6166
Fix import in image_load doc by @mariosasko in #6181
Use object detection images from huggingface/documentation-images by @mariosasko in #6177
Use hf-internal-testing repos for hosting test dataset repos by @mariosasko in #6180

New Contributors

@npuichigo made their first contribution in #6145
@unifyh made their first contribution in #6119

Full Changelog: 2.14.4...2.14.5

Contributors

albertvillanova, npuichigo, and 8 other contributors


06 Sep 08:29

albertvillanova

2.13.2

98b1bdd

2.13.2

Bug fixes

Do not filter out .zip extensions from no-script datasets by @albertvillanova in #6208

Full Changelog: 2.13.1...2.13.2

Contributors

albertvillanova


08 Aug 15:52

albertvillanova

2.14.4

53d55f3

2.14.4

Bug fixes

Fix authentication issues by @albertvillanova in #6127

Full Changelog: 2.14.3...2.14.4

Contributors

albertvillanova


03 Aug 10:31

albertvillanova

2.14.3

33f736e

2.14.3

Bug fixes

Fix error when loading from GCP bucket by @albertvillanova in #6105
Fix deprecation of use_auth_token in file_utils by @albertvillanova in #6107

Full Changelog: 2.14.2...2.14.3

Contributors

albertvillanova


31 Jul 06:39

albertvillanova

2.14.2

09492ba

2.14.2

Bug fixes

Fix deprecation of use_auth_token in DownloadConfig by @albertvillanova in #6094
Fix deprecation of errors in TextConfig by @albertvillanova in #6095

Full Changelog: 2.14.1...2.14.2

Contributors

albertvillanova


27 Jul 17:09

lhoestq

2.14.1

029956a

2.14.1

Bug fixes

fix tqdm lock by @lhoestq in #6067
fix tqdm lock deletion by @lhoestq in #6068
Fix fsspec storage_options from load_dataset by @lhoestq in #6072
No gzip encoding from github by @lhoestq in #6076

Other improvements

Fix Overview.ipynb & detach Jupyter Notebooks from datasets repository by @alvarobartt in #5902
Fix Quickstart notebook link by @mariosasko in #6070
Remove README link to deprecated Colab notebook by @mariosasko in #6080
Misc doc improvements by @mariosasko in #6074

Full Changelog: 2.14.0...2.14.1

Contributors

alvarobartt, lhoestq, and mariosasko


24 Jul 15:54

lhoestq

2.14.0

88896a7

2.14.0

Important: caching

Datasets downloaded and cached with datasets>=2.14.0 may not be reloadable from the cache with older versions of datasets, in which case they are re-downloaded.
Datasets that were already cached are still supported.
This affects datasets on Hugging Face without dataset scripts, e.g. those made of pure parquet, csv, jsonl, etc. files.
The reason is that the default configuration name for those datasets was changed from "username--dataset_name" to "default" in #5331.
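
As a rough illustration of the caching note above, the check below flags the one combination that can trigger a re-download: a cache written by datasets>=2.14.0 and read by an older version. The helper names (parse_version, may_need_redownload) are hypothetical and not part of datasets; the version parsing is deliberately simplified and only reads the leading X.Y.Z digits.

```python
from importlib import metadata

# First release that uses "default" as the config name for no-script datasets.
FIRST_NEW_CACHE_VERSION = (2, 14, 0)


def parse_version(version: str) -> tuple:
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints.

    Simplified on purpose: suffixes such as '.dev0' are ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def may_need_redownload(cached_with: str, loading_with: str) -> bool:
    """Hypothetical helper: True when a cache written by datasets>=2.14.0
    is read by an older version. Only relevant for datasets without scripts."""
    return (
        parse_version(cached_with) >= FIRST_NEW_CACHE_VERSION
        and parse_version(loading_with) < FIRST_NEW_CACHE_VERSION
    )


# Look up the installed version, if any, without importing the package.
try:
    installed = metadata.version("datasets")
except metadata.PackageNotFoundError:
    installed = None

print("installed datasets version:", installed)
print(may_need_redownload("2.14.0", "2.13.2"))
```

Caches written by older versions and read by newer ones are not flagged, which matches the statement that already-cached datasets are still supported.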

Dataset Configuration

Support for multiple configs via metadata yaml info by @polinaeterna in #5331

Configure your dataset using YAML at the top of your dataset card (see the docs)
Choose which file goes into which split

  ---
  configs:
  - config_name: default
    data_files:
    - split: train
      path: data.csv
    - split: test
      path: holdout.csv
  ---

Define multiple dataset configurations

  ---
  configs:
  - config_name: main_data
    data_files: main_data.csv
  - config_name: additional_data
    data_files: additional_data.csv
  ---
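
For dataset cards that are generated programmatically, the two YAML layouts above can be produced from a plain mapping. This is a minimal sketch: render_configs_front_matter is a hypothetical helper, not part of datasets, and the extra_train.csv / extra_test.csv file names are made up for the example. Values are written unquoted, so names containing YAML special characters would need extra handling.

```python
from typing import Mapping, Union

# split name -> file path
SplitFiles = Mapping[str, str]


def render_configs_front_matter(configs: Mapping[str, Union[str, SplitFiles]]) -> str:
    """Render the `configs:` YAML block for the top of a dataset card.

    Hypothetical helper. Each value is either a single file path or a
    mapping of split name to file path.
    """
    lines = ["---", "configs:"]
    for config_name, data_files in configs.items():
        lines.append(f"- config_name: {config_name}")
        if isinstance(data_files, str):
            # One file (or pattern) for the whole configuration.
            lines.append(f"  data_files: {data_files}")
        else:
            # One entry per split, using the split/path keys shown above.
            lines.append("  data_files:")
            for split, path in data_files.items():
                lines.append(f"  - split: {split}")
                lines.append(f"    path: {path}")
    lines.append("---")
    return "\n".join(lines) + "\n"


front_matter = render_configs_front_matter(
    {
        "main_data": "main_data.csv",
        "additional_data": {"train": "extra_train.csv", "test": "extra_test.csv"},
    }
)
print(front_matter)
```

The output is meant to be placed at the very top of the dataset card, before any Markdown content.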

Dataset Features

Support for multiple configs via metadata yaml info by @polinaeterna in #5331

push_to_hub() additional dataset configurations

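from datasets import load_dataset

# ds is an existing Dataset; push it as an additional configuration
ds.push_to_hub("username/dataset_name", config_name="additional_data")
# reload later
ds = load_dataset("username/dataset_name", "additional_data")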

Support returning dataframe in map transform by @mariosasko in #5995

What's Changed

Deprecate errors param in favor of encoding_errors in text builder by @mariosasko in #5974
Fix select_columns columns order by @lhoestq in #5994
Replace metadata utils with huggingface_hub's RepoCard API by @mariosasko in #5949
Pin joblib to avoid joblibspark test failures by @mariosasko in #6000
Align column_names type check with type hint in sort by @mariosasko in #6001
Deprecate use_auth_token in favor of token by @mariosasko in #5996
Drop Python 3.7 support by @mariosasko in #6005
Misc improvements by @mariosasko in #6004
Make IterableDataset.from_spark more efficient by @mathewjacob1002 in #5986
Fix cast for dictionaries with no keys by @mariosasko in #6009
Avoid stuck map operation when subprocesses crashes by @pappacena in #5976
Deprecate task api by @mariosasko in #5865
Add metadata ui screenshot in docs by @lhoestq in #6015
Fix ClassLabel min max check for None values by @mariosasko in #6023
[docs] Update return statement of index search by @stevhliu in #6021
Improve logging by @mariosasko in #6019
Fix style with ruff 0.0.278 by @lhoestq in #6026
Don't reference self in Spark._validate_cache_dir by @maddiedawson in #6024
Delete task_templates in IterableDataset when they are no longer valid by @mariosasko in #6027
[docs] Fix link by @stevhliu in #6029
fixed typo in comment by @NightMachinery in #6030
Fix legacy_dataset_infos by @lhoestq in #6040
Flatten repository_structure docs on yaml by @lhoestq in #6041
Use new hffs by @lhoestq in #6028
Bump dev version by @lhoestq in #6047
Fix unused DatasetInfosDict code in push_to_hub by @lhoestq in #6042
Rename "pattern" to "path" in YAML data_files configs by @lhoestq in #6044
Remove HfFileSystem and deprecate S3FileSystem by @mariosasko in #6052
Dill 3.7 support by @mariosasko in #6061
Improve Dataset.from_list docstring by @mariosasko in #6062
Check if column names match in Parquet loader only when config features are specified by @mariosasko in #6045
Release: 2.14.0 by @lhoestq in #6063

New Contributors

@mathewjacob1002 made their first contribution in #5986
@pappacena made their first contribution in #5976

Full Changelog: 2.13.1...2.14.0

Contributors

pappacena, polinaeterna, and 6 other contributors

