Releases: huggingface/datasets
2.15.0
What's Changed
- Fix typo in Audio dataset documentation by @prassanna-ravishankar in #6222
- Add push_to_hub with multiple configs docs by @lhoestq in #6226
- Remove RGB -> BGR image conversion in Object Detection tutorial by @mariosasko in #6228
- Update README.md by @NinoRisteski in #6233
- Don't skip hidden files in `dl_manager.iter_files` when they are given as input by @mariosasko in #6230
- Update README.md by @NinoRisteski in #6223
- Remove unused global variables in `audio.py` by @mariosasko in #6241
- Improve error message for missing function parameters by @suavemint in #6232
- Fix cast from fixed size list to variable size list by @mariosasko in #6243
- Update create_dataset.mdx by @EswarDivi in #6247
- [DOCS] Fix typo: Elasticsearch by @leemthompo in #6258
- Support streaming datasets with pyarrow.parquet.read_table by @albertvillanova in #6251
- Temporarily pin tensorflow < 2.14.0 by @albertvillanova in #6264
- Fix CI 404 errors by @albertvillanova in #6262
- Remove `apache_beam` import in `BeamBasedBuilder._save_info` by @mariosasko in #6265
- Improve documentation of dataset.from_generator by @hartmans in #6281
- Fix parquet columns argument in streaming mode by @lhoestq in #6295
- Doc readme improvements by @mariosasko in #6298
- Unpin `tensorflow` maximum version by @mariosasko in #6301
- Unpin `jax` maximum version by @mariosasko in #6300
- Fix ArrayXD cast by @mariosasko in #6297
- Reduce the number of commits in `push_to_hub` by @mariosasko in #6269
- Fix typo in code example in docs by @bryant1410 in #6307
- Update README.md by @smty2018 in #6304
- Deterministic set hash by @lhoestq in #6318
- docs: resolving namespace conflict, refactored variable by @smty2018 in #6312
- Fix typos by @python273 in #6321
- Fix commit message formatting in multi-commit uploads by @qgallouedec in #6313
- Temporarily pin fsspec < 2023.10.0 by @albertvillanova in #6331
- Unpin fsspec by @lhoestq in #6336
- Fix use_dataset.mdx by @angel-luis in #6351
- Add `fsspec` version to the `datasets-cli env` command output by @mariosasko in #6356
- Expanduser in save_to_disk() by @Unknown3141592 in #6098
- Fix time measuring snippet in docs by @mariosasko in #6367
- Temporarily pin pyarrow < 14.0.0 by @albertvillanova in #6375
- Fix typo in `Dataset.map` docstring by @bryant1410 in #6373
- Avoid redundant warning when encoding NumPy array as `Image` by @mariosasko in #6379
- Replace deprecated license_file in setup.cfg by @albertvillanova in #6332
- Minor release step improvement by @lhoestq in #6339
- Fix dependency conflict within CI build documentation by @albertvillanova in #6411
- Remove redundant condition in builders by @albertvillanova in #6398
- Handle future deprecation argument by @winglian in #6390
- Remove token value from warnings by @mariosasko in #6418
- Rename audio_classificiation.py to audio_classification.py by @carlthome in #6416
- Add pyarrow-hotfix to release docs by @albertvillanova in #6421
- Simplify filesystem logic by @mariosasko in #6362
- Fix conda release by adding pyarrow-hotfix dependency by @albertvillanova in #6423
New Contributors
- @prassanna-ravishankar made their first contribution in #6222
- @NinoRisteski made their first contribution in #6233
- @suavemint made their first contribution in #6232
- @EswarDivi made their first contribution in #6247
- @leemthompo made their first contribution in #6258
- @hartmans made their first contribution in #6281
- @smty2018 made their first contribution in #6304
- @python273 made their first contribution in #6321
- @angel-luis made their first contribution in #6351
- @Unknown3141592 made their first contribution in #6098
- @winglian made their first contribution in #6390
- @carlthome made their first contribution in #6416
Full Changelog: 2.14.7...2.15.0
2.14.7
Bug Fixes
- Fix UnboundLocalError if preprocessing returns an empty list by @cwallenwein in #6346
- Fix python formatting for complex types in format_table by @mariosasko in #6368
- Support pyarrow 14.0.0 by @albertvillanova in #6378
- Do not try to download from HF GCS for generator by @yundai424 in #6372
- Support pyarrow 14.0.1 and fix vulnerability CVE-2023-47248 by @albertvillanova in #6404
New Contributors
- @cwallenwein made their first contribution in #6346
- @yundai424 made their first contribution in #6372
Full Changelog: 2.14.6...2.14.7
2.14.6
What's Changed
- Ignore dataset_info.json in data files resolution by @mariosasko in #6224
- Check builder cls default config name in inspect by @lhoestq in #6253
- Add support for fsspec>=2023.9.0 by @mariosasko in #6244
- Create DefunctDatasetError by @albertvillanova in #6286
- Fix get_data_patterns for directories with the word data twice by @albertvillanova in #6309
- Fix loading Hub datasets with CSV metadata file by @albertvillanova in #6316
- datasets.filesystems: fix is_remote_filesystems by @ap-- in #6334
- Pin upper version of fsspec by @albertvillanova in #6337
- Fix regex get_data_files formatting for base paths by @ZachNagengast in #6322
New Contributors
- @ap-- made their first contribution in #6334
- @ZachNagengast made their first contribution in #6322
Full Changelog: 2.14.5...2.14.6
2.14.5
Bug fixes
- Bump fsspec from 2021.11.1 to 2022.3.0 by @mariosasko in #6091
- Minor fix in `iter_files` for hidden files by @mariosasko in #6092
- Use yaml instead of get data patterns when possible by @lhoestq in #6154
- Fix Parquet loading with `columns` by @mariosasko in #6160
- Fix: Missing a MetadataConfigs init when the repo has a `datasets_info.json` but no README by @clefourrier in #6164
- PyArrow 13 CI fixes by @mariosasko in #6175
- Don't alter input in Features.from_dict by @lhoestq in #6189
- Fix multiprocessing with spawn in iterable datasets by @Hubert-Bonisseur in #6165
- Set minimal fsspec version requirement to 2023.1.0 by @mariosasko in #6192
- Temporarily pin pandas < 2.1.0 by @albertvillanova in #6200
- Preserve split order in DataFilesDict by @albertvillanova in #6198
- Add missing `revision` argument by @qgallouedec in #6191
- Temporarily pin fsspec < 2023.9.0 by @albertvillanova in #6210
- Do not filter out .zip extensions from no-script datasets by @albertvillanova in #6208
- Fix empty splitinfo json by @lhoestq in #6211
- Fix to_json ValueError and remove pandas pin by @albertvillanova in #6201
- Fix checking patterns to infer packaged builder by @polinaeterna in #6215
- Rename old push_to_hub configs to "default" in dataset_infos by @lhoestq in #6218
Other improvements
- Deprecate `Dataset.export` by @mariosasko in #6081
- Deprecate `download_custom` by @mariosasko in #6093
- Ignore CI lint rule violation in Pickler.memoize by @albertvillanova in #6138
- Remove unused allowed_extensions param by @albertvillanova in #6135
- Export to_iterable_dataset to document by @npuichigo in #6145
- [Docs] Add description of `select_columns` to guide by @unifyh in #6119
- Ignore parallel warning in map_nested by @lhoestq in #6148
- [docs] Complete `to_iterable_dataset` by @stevhliu in #6158
- Raise FileNotFoundError when passing data_files that don't exist by @lhoestq in #6155
- Fix typo in about_mapstyle_vs_iterable.mdx by @lhoestq in #6171
- Document BUILDER_CONFIG_CLASS by @lhoestq in #6166
- Fix import in `image_load` doc by @mariosasko in #6181
- Use object detection images from `huggingface/documentation-images` by @mariosasko in #6177
- Use `hf-internal-testing` repos for hosting test dataset repos by @mariosasko in #6180
New Contributors
- @npuichigo made their first contribution in #6145
- @unifyh made their first contribution in #6119
Full Changelog: 2.14.4...2.14.5
2.13.2
Bug fixes
- Do not filter out .zip extensions from no-script datasets by @albertvillanova in #6208
Full Changelog: 2.13.1...2.13.2
2.14.4
2.14.3
Bug fixes
- Fix error when loading from GCP bucket by @albertvillanova in #6105
- Fix deprecation of use_auth_token in file_utils by @albertvillanova in #6107
Full Changelog: 2.14.2...2.14.3
2.14.2
Bug fixes
- Fix deprecation of use_auth_token in DownloadConfig by @albertvillanova in #6094
- Fix deprecation of errors in TextConfig by @albertvillanova in #6095
Full Changelog: 2.14.1...2.14.2
2.14.1
Bug fixes
- fix tqdm lock by @lhoestq in #6067
- fix tqdm lock deletion by @lhoestq in #6068
- Fix fsspec storage_options from load_dataset by @lhoestq in #6072
- No gzip encoding from github by @lhoestq in #6076
Other improvements
- Fix `Overview.ipynb` & detach Jupyter Notebooks from `datasets` repository by @alvarobartt in #5902
- Fix Quickstart notebook link by @mariosasko in #6070
- Remove README link to deprecated Colab notebook by @mariosasko in #6080
- Misc doc improvements by @mariosasko in #6074
Full Changelog: 2.14.0...2.14.1
2.14.0
Important: caching
- Datasets downloaded and cached using `datasets>=2.14.0` may not be reloaded from cache using an older version of `datasets` (and are therefore re-downloaded).
- Datasets that were already cached are still supported.
- This affects datasets on Hugging Face without dataset scripts, e.g. made of pure parquet, csv, jsonl, etc. files.
- This is because the default configuration name for those datasets has been fixed (from "username--dataset_name" to "default") in #5331.
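To make the rename concrete, here is a small illustrative sketch. It is not code from `datasets` itself: the helper names `legacy_default_config_name` and `default_config_name` are hypothetical, and the rule that the legacy name is the repository id with `/` replaced by `--` is an assumption inferred from the "username--dataset_name" pattern quoted above.

```python
def legacy_default_config_name(repo_id: str) -> str:
    """Return the pre-2.14.0 style name, e.g. "username--dataset_name".

    Assumption: the legacy name is the repo id with "/" replaced by "--".
    """
    return repo_id.replace("/", "--")


def default_config_name(repo_id: str) -> str:
    """Return the name used from 2.14.0 on: always "default"."""
    return "default"


repo_id = "username/dataset_name"
old_name = legacy_default_config_name(repo_id)
new_name = default_config_name(repo_id)

# The two names differ, so a cache entry keyed by one name
# is not found when looking it up under the other.
cache_is_shared = old_name == new_name
```

Because the two names never match for a namespaced repository, an older version looking for the legacy name would not find an entry written under "default".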
Dataset Configuration
- Support for multiple configs via metadata yaml info by @polinaeterna in #5331
  - Configure your dataset using YAML at the top of your dataset card (docs here)
  - Choose which file goes into which split

        ---
        configs:
        - config_name: default
          data_files:
          - split: train
            path: data.csv
          - split: test
            path: holdout.csv
        ---

  - Define multiple dataset configurations

        ---
        configs:
        - config_name: main_data
          data_files: main_data.csv
        - config_name: additional_data
          data_files: additional_data.csv
        ---
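The two YAML layouts above share one structure: a `configs` list whose entries carry a `config_name` and a `data_files` value that is either a single path or a list of `split`/`path` pairs. The sketch below renders that structure from plain Python data, using only the standard library. The function name `render_configs_yaml` is hypothetical and is not part of `datasets`. The sketch does no YAML quoting or escaping, so it is only suitable for simple names and paths like the ones shown here.

```python
from typing import Dict, List, Union

# A data_files value is either one path or a list of {"split", "path"} entries.
SplitFiles = List[Dict[str, str]]
Config = Dict[str, Union[str, SplitFiles]]


def render_configs_yaml(configs: List[Config]) -> str:
    """Render the dataset card front matter for the given configs.

    Hypothetical helper: no quoting or escaping is applied to values.
    """
    lines = ["---", "configs:"]
    for config in configs:
        lines.append(f"- config_name: {config['config_name']}")
        data_files = config["data_files"]
        if isinstance(data_files, str):
            # Single file for the whole configuration.
            lines.append(f"  data_files: {data_files}")
        else:
            # One entry per split, each pointing at its own file.
            lines.append("  data_files:")
            for entry in data_files:
                lines.append(f"  - split: {entry['split']}")
                lines.append(f"    path: {entry['path']}")
    lines.append("---")
    return "\n".join(lines)


split_card = render_configs_yaml(
    [
        {
            "config_name": "default",
            "data_files": [
                {"split": "train", "path": "data.csv"},
                {"split": "test", "path": "holdout.csv"},
            ],
        }
    ]
)

multi_card = render_configs_yaml(
    [
        {"config_name": "main_data", "data_files": "main_data.csv"},
        {"config_name": "additional_data", "data_files": "additional_data.csv"},
    ]
)
```

`split_card` corresponds to the "which file goes into which split" layout and `multi_card` to the multiple-configurations layout.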
Dataset Features
- Support for multiple configs via metadata yaml info by @polinaeterna in #5331
  - `push_to_hub()` additional dataset configurations

        ds.push_to_hub("username/dataset_name", config_name="additional_data")
        # reload later
        ds = load_dataset("username/dataset_name", "additional_data")

- Support returning dataframe in map transform by @mariosasko in #5995
What's Changed
- Deprecate `errors` param in favor of `encoding_errors` in text builder by @mariosasko in #5974
- Fix select_columns columns order by @lhoestq in #5994
- Replace metadata utils with `huggingface_hub`'s RepoCard API by @mariosasko in #5949
- Pin `joblib` to avoid `joblibspark` test failures by @mariosasko in #6000
- Align `column_names` type check with type hint in `sort` by @mariosasko in #6001
- Deprecate `use_auth_token` in favor of `token` by @mariosasko in #5996
- Drop Python 3.7 support by @mariosasko in #6005
- Misc improvements by @mariosasko in #6004
- Make IterableDataset.from_spark more efficient by @mathewjacob1002 in #5986
- Fix cast for dictionaries with no keys by @mariosasko in #6009
- Avoid stuck map operation when subprocesses crashes by @pappacena in #5976
- Deprecate task api by @mariosasko in #5865
- Add metadata ui screenshot in docs by @lhoestq in #6015
- Fix `ClassLabel` min max check for `None` values by @mariosasko in #6023
- [docs] Update return statement of index search by @stevhliu in #6021
- Improve logging by @mariosasko in #6019
- Fix style with ruff 0.0.278 by @lhoestq in #6026
- Don't reference self in Spark._validate_cache_dir by @maddiedawson in #6024
- Delete `task_templates` in `IterableDataset` when they are no longer valid by @mariosasko in #6027
- [docs] Fix link by @stevhliu in #6029
- fixed typo in comment by @NightMachinery in #6030
- Fix legacy_dataset_infos by @lhoestq in #6040
- Flatten repository_structure docs on yaml by @lhoestq in #6041
- Use new hffs by @lhoestq in #6028
- Bump dev version by @lhoestq in #6047
- Fix unused DatasetInfosDict code in push_to_hub by @lhoestq in #6042
- Rename "pattern" to "path" in YAML data_files configs by @lhoestq in #6044
- Remove `HfFileSystem` and deprecate `S3FileSystem` by @mariosasko in #6052
- Dill 3.7 support by @mariosasko in #6061
- Improve `Dataset.from_list` docstring by @mariosasko in #6062
- Check if column names match in Parquet loader only when config `features` are specified by @mariosasko in #6045
- Release: 2.14.0 by @lhoestq in #6063
New Contributors
- @mathewjacob1002 made their first contribution in #5986
- @pappacena made their first contribution in #5976
Full Changelog: 2.13.1...2.14.0