Releases: huggingface/datasets
2.7.0
Dataset Features
- Multiprocessed dataset builder by @TevenLeScao in #5107
- Load big datasets faster than before using multiprocessing:
    from datasets import load_dataset
    ds = load_dataset("imagenet-1k", num_proc=4)
- Make torch.Tensor and spacy models cacheable by @mariosasko in #5191
- Function passed to `map` or `filter` that uses tensors or pipelines can now be cached
- Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in #5192
- TextConfig: added "errors" by @NightMachinery in #5155
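The "errors" option for text loading is easiest to understand through Python's own decoding behaviour. The sketch below uses only the standard library and assumes the option is forwarded to the built-in `open` call unchanged; the temporary file and its contents are made up for illustration.

```python
import os
import tempfile

# Write a small file containing one byte (0xff) that is not valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"caf\xff\n")
    path = tmp.name


def read_first_line(path, errors):
    """Decode the first line of `path` as UTF-8 with the given error policy."""
    with open(path, encoding="utf-8", errors=errors) as f:
        return f.readline().rstrip("\n")


replaced = read_first_line(path, "replace")  # invalid byte becomes U+FFFD
ignored = read_first_line(path, "ignore")    # invalid byte is dropped

try:
    read_first_line(path, "strict")          # Python's default policy
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

os.remove(path)
```

If the loader does forward the option this way, choosing "replace" or "ignore" trades a hard failure for lossy text, so it is worth picking deliberately.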
Audio setup
- Add ffmpeg4 installation instructions in warnings by @polinaeterna in #5167
Docs
- Update create image dataset docs by @stevhliu in #5177
- add: segmentation guide. by @sayakpaul in #5188
- Reword E2E training and inference tips in the vision guides by @sayakpaul in #5217
- Add SQL guide by @stevhliu in #5223
General improvements and bug fixes
- Add `pyproject.toml` for `black` by @mariosasko in #5125
- Fix `tqdm` zip bug by @david1542 in #5120
- Install tensorflow-macos dependency conditionally by @albertvillanova in #5124
- [TYPO] Update new_dataset_script.py by @cakiki in #5119
- Avoid extra cast in `class_encode_column` by @mariosasko in #5130
- Use yaml for issue templates + revamp by @mariosasko in #5116
- Update docs once dataset scripts transferred to the Hub by @albertvillanova in #5136
- Delete duplicate issue template file by @albertvillanova in #5146
- Deprecate num_proc parameter in DownloadManager.extract by @ayushthe1 in #5142
- Raise ImportError instead of OSError by @ayushthe1 in #5141
- Fix CI require beam by @albertvillanova in #5168
- Make iter_files deterministic by @albertvillanova in #5149
- Add PB and TB in convert_file_size_to_int by @lhoestq in #5171
- Reduce default max `writer_batch_size` by @mariosasko in #5163
- Support dill 0.3.6 by @albertvillanova in #5166
- Make filename matching more robust by @riccardobucco in #5128
- Preserve None in list type cast in PyArrow 10 by @mariosasko in #5174
- Raise ffmpeg warnings only once by @polinaeterna in #5173
- Add "ipykernel" to list of `co_filename`s to remove by @gpucce in #5169
- chore: add notebook links to img cls and obj det. by @sayakpaul in #5187
- Fix docs about dataset_info in YAML by @albertvillanova in #5194
- fsspec lock reset in multiprocessing by @lhoestq in #5159
- Add note about the name of a dataset script by @polinaeterna in #5198
- Deprecate dummy data generation command by @mariosasko in #5199
- Do not sort splits in dataset info by @polinaeterna in #5201
- Add missing `DownloadConfig.use_auth_token` value by @alvarobartt in #5205
- Update canonical links to Hub links by @stevhliu in #5203
- Refactor CI hub fixtures to use monkeypatch instead of patch by @albertvillanova in #5208
- Update github pr docs actions by @mishig25 in #5214
- Use hfh hf_hub_url function by @albertvillanova in #5196
- Pin `typer` version in tests to <0.5 to fix Windows CI by @polinaeterna in #5235
- Fix shards in IterableDataset.from_generator by @lhoestq in #5233
- Fix class name of symbolic link by @riccardobucco in #5126
- Make `Version` hashable by @mariosasko in #5238
- Handle ArrowNotImplementedError caused by try_type being Image or Audio in cast by @mariosasko in #5236
- Encode path only for old versions of hfh by @lhoestq in #5237
- Fix CI require_beam maximum compatible dill version by @albertvillanova in #5212
- Support hfh rc version by @lhoestq in #5241
- Cleaner error tracebacks for dataset script errors by @mariosasko in #5240
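The PB and TB entry above concerns turning human-readable size strings into byte counts. A minimal sketch of that kind of conversion follows; `parse_size` and `_UNITS` are hypothetical names written for illustration, not the library's `convert_file_size_to_int`, and the sketch ignores bit units and other details the real function may handle.

```python
# Hypothetical helper for illustration; not the library's implementation.
_UNITS = {
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40, "PIB": 2**50,
}


def parse_size(size):
    """Convert an int or a string such as "500MB" or "2TiB" to a byte count."""
    if isinstance(size, int):
        return size
    text = size.strip().upper()
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            # Everything before the suffix is the numeric part.
            return int(float(text[: -len(suffix)]) * factor)
    raise ValueError(f"unrecognised size: {size!r}")


one_tb = parse_size("1TB")
two_tib = parse_size("2TiB")
```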
New Contributors
- @david1542 made their first contribution in #5120
- @ayushthe1 made their first contribution in #5142
- @gpucce made their first contribution in #5169
- @sayakpaul made their first contribution in #5187
- @NightMachinery made their first contribution in #5155
Full Changelog: 2.6.1...2.7.0
2.6.1
Bug fixes
- Fix filter indices when batched by @albertvillanova in #5113
- fixed a bug where `filter` could return examples with the wrong indices
- Fix iter_batches by @lhoestq in #5115
- fixed a bug where `map` with `batched=True` could return a dataset with fewer examples
- Fix a typo in arrow_dataset.py by @yangky11 in #5108
New Contributors
Full Changelog: 2.6.0...2.6.1
2.6.0
Important
- [GH->HF] Remove all dataset scripts from github by @lhoestq in #4974
- all the dataset scripts and dataset cards are now on https://hf.co/datasets
- we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on
Datasets features
- Add ability to read-write to SQL databases. by @Dref360 in #4928
- Read from sqlite file:
    from datasets import Dataset
    dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
- Allow connection objects in `from_sql` + small doc improvement by @mariosasko in #5091
    from datasets import Dataset
    from sqlite3 import connect
    con = connect(...)
    dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)
- Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in #5072
- return numpy/torch/tf/jax tensors with:
    from datasets import load_dataset
    # request a single split so the result can be indexed by row
    ds = load_dataset("imagenet-1k", split="train").with_format("torch")  # or numpy/tf/jax
    ds[0]["image"]
- Added `IterableDataset.from_generator` by @hamid-vakilzadeh in #5052
- Fast dataset iter by @mariosasko in #5030
- speed up by a factor of 2 using the Arrow Table reader
- Dataset infos in yaml by @lhoestq in #4926
- you can now specify the feature types and number of samples in the dataset card, see https://huggingface.co/docs/datasets/dataset_card
- Add `kwargs` to `Dataset.from_generator` by @mariosasko in #5049
- Support `converters` in `CsvBuilder` by @mariosasko in #5057
- Restore saved format state in `load_from_disk` by @asofiaoliveira in #5073
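To try the SQL reader without an existing database, a throwaway in-memory SQLite database is enough. The sketch below is a minimal setup under stated assumptions: the table name `data_table` matches the snippet above, the rows are made up, and `load_with_datasets` is a hypothetical wrapper that imports `datasets` lazily so the rest runs with the standard library alone.

```python
import sqlite3

# Build a throwaway in-memory database; the rows are made up for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data_table (text TEXT)")
con.executemany(
    "INSERT INTO data_table (text) VALUES (?)",
    [("short",), ("x" * 150,), ("y" * 200,)],
)
con.commit()

QUERY = "SELECT text FROM data_table WHERE length(text) > 100 LIMIT 10"

# Plain sqlite3 view of the rows the query selects.
rows = con.execute(QUERY).fetchall()


def load_with_datasets(query, connection):
    """Run the same query through Dataset.from_sql (needs `datasets` and pandas)."""
    from datasets import Dataset  # imported lazily so the sketch runs without it
    return Dataset.from_sql(query, connection)
```

Calling `load_with_datasets(QUERY, con)` should then yield a dataset with a single `text` column holding the two long rows.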
Dataset changes
- Update: hendrycks_test - support streaming by @albertvillanova in #5041
- Update: swiss judgment prediction by @JoelNiklaus in #5019
- Update swiss judgment prediction by @JoelNiklaus in #5042
- Fix: xcsr - fix languages of X-CSQA configs by @albertvillanova in #5022
- Fix: sbu_captions - fix URLs by @donglixp in #5020
- Fix: xcsr - fix string features by @albertvillanova in #5024
- Fix: hendrycks_test - fix NonMatchingChecksumError by @albertvillanova in #5040
- Fix: cats_vs_dogs - fix number of samples by @lhoestq in #5047
- Fix: lex_glue - fix bug with labels of eurlex config of lex_glue dataset by @iliaschalkidis in #5048
- Fix: msr_sqa - fix dataset generation by @Timothyxxx in #3715
Dataset cards
- Add description to hellaswag dataset by @julien-c in #4810
- Add deprecation warning to multilingual_librispeech dataset card by @albertvillanova in #5010
- Update languages in aeslc dataset card by @apergo-ai in #3357
- Update license to bookcorpus dataset card by @meg-huggingface in #3526
- Update paper link in medmcqa dataset card by @monk1337 in #4290
- Add oversampling strategy iterable datasets interleave by @ylacombe in #5036
- Fix license/citation information of squadshifts dataset card by @albertvillanova in #5054
General improvements and bug fixes
- Fix missing use_auth_token in streaming docstrings by @albertvillanova in #5003
- Add some note about running the transformers ci before a release by @lhoestq in #5007
- Remove license tag file and validation by @albertvillanova in #5004
- Re-apply input columns change by @mariosasko in #5008
- patch CI_HUB_TOKEN_PATH with Path instead of str by @Wauplin in #5026
- Fix typo in error message by @severo in #5027
- Fix import in `ClassLabel` docstring example by @alvarobartt in #5029
- Remove redundant code from some dataset module factories by @albertvillanova in #5033
- Fix typos in load docstrings and comments by @albertvillanova in #5035
- Prefer split patterns from directories over split patterns from filenames by @polinaeterna in #4985
- Fix tar extraction vuln by @lhoestq in #5016
- Support hfh 0.10 implicit auth by @lhoestq in #5031
- Fix `flatten_indices` with empty indices mapping by @mariosasko in #5043
- Improve CI performance speed of PackagedDatasetTest by @albertvillanova in #5037
- Revert task removal in folder-based builders by @mariosasko in #5051
- Fix backward compatibility for dataset_infos.json by @lhoestq in #5055
- Fix typo by @stevhliu in #5059
- Fix CI hfh token warning by @albertvillanova in #5062
- Mark CI tests as xfail when 502 error by @albertvillanova in #5058
- Fix passed download_config in HubDatasetModuleFactoryWithoutScript by @albertvillanova in #5077
- Fix CONTRIBUTING once dataset scripts transferred to Hub by @albertvillanova in #5067
- Fix header level in Audio docs by @stevhliu in #5078
- Support DEFAULT_CONFIG_NAME when no BUILDER_CONFIGS by @albertvillanova in #5071
- Support streaming gzip.open by @albertvillanova in #5066
- adding keep in memory by @Mustapha-AJEGHRIR in #5082
- refactor: replace AssertionError with more meaningful exceptions (#5074) by @galbwe in #5079
- fix: update exception throw from OSError to EnvironmentError in `push… by @rahulXs in #5076
- Align signature of list_repo_files with latest hfh by @albertvillanova in #5063
- Align signature of create/delete_repo with latest hfh by @albertvillanova in #5064
- Fix filter with empty indices by @Mouhanedg56 in #5087
- Fix tutorial (#5093) by @riccardobucco in #5095
- Use HTML relative paths for tiles in the docs by @lewtun in #5092
- Fix loading how to guide (#5102) by @riccardobucco in #5104
- url encode hub url (#5099) by @riccardobucco in #5103
- Free the "hf" filesystem protocol for `hffs` by @lhoestq in #5101
- Fix task template reload from dict by @lhoestq in #5106
New Contributors
- @Wauplin made their first contribution in #5026
- @donglixp made their first contribution in #5020
- @Timothyxxx made their first contribution in #3715
- @hamid-vakilzadeh made their first contribution in #5052
- @Mustapha-AJEGHRIR made their first contribution in #5082
- @galbwe made their first contribution in #5079
- @rahulXs made their first contribution in #5076
- @Mouhanedg56 made their first contribution in #5087
- @riccardobucco made their first contribution in #5095
- @asofiaoliveira made their first contribution in #5073
Full Changelog: 2.5.1...2.6.0
2.5.2
2.5.1
2.5.0
Important
- Drop Python 3.6 support by @mariosasko in #4460
- Deprecate metrics by @albertvillanova in #4739
- Metrics are now deprecated and have been moved to evaluate:
    !pip install evaluate
    import evaluate
    metric = evaluate.load("accuracy")
- Load GitHub datasets from Hub by @albertvillanova in #4059
- datasets with no namespace like "squad" were loaded from this GitHub repository, now they're loaded from https://huggingface.co/datasets
- Decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in #4923
- latest version of torchaudio 0.12 now requires ffmpeg (version 4) to read MP3 files, please downgrade to 0.12 for now or use librosa
- Use HTTP requests to access data and metadata through the Datasets REST API (docs here)
Datasets features
No-code loaders
- Add AudioFolder packaged loader by @polinaeterna in #4530
- Add support for CSV metadata files to ImageFolder by @mariosasko in #4837
- Add support for parsing JSON files in array form by @mariosasko in #4997
Dataset methods
- add `Dataset.from_list` by @sanderland in #4890
- Add `Dataset.from_generator` by @mariosasko in #4957
- Add oversampling strategies to interleave datasets by @ylacombe in #4831
- Preserve non-`input_columns` in `Dataset.map` if `input_columns` are specified by @mariosasko in #4971
- Add `fn_kwargs` param to `IterableDataset.map` by @mariosasko in #4975
- More rigorous shape inference in to_tf_dataset by @Rocketknight1 in #4763
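The oversampling strategies for interleaving can be pictured with plain lists. The sketch below is a simplified model, not the library's implementation: `interleave` is a hypothetical helper that alternates deterministically between sources (no sampling probabilities), and its `stopping_strategy` values mirror the `"first_exhausted"` and `"all_exhausted"` options of `interleave_datasets`.

```python
def interleave(sources, stopping_strategy="first_exhausted"):
    """Round-robin over lists (hypothetical helper, not the datasets API).

    "first_exhausted" stops when the shortest source runs out;
    "all_exhausted" continues until the longest one does, restarting
    shorter sources from their beginning (oversampling).
    """
    if stopping_strategy not in ("first_exhausted", "all_exhausted"):
        raise ValueError(f"unknown stopping_strategy: {stopping_strategy!r}")
    lengths = [len(s) for s in sources]
    rounds = min(lengths) if stopping_strategy == "first_exhausted" else max(lengths)
    # In each round take one item per source, wrapping around short sources.
    return [s[r % len(s)] for r in range(rounds) for s in sources]


d1, d2, d3 = [0, 1, 2], [10, 11, 12, 13], [20, 21, 22, 23, 24]
undersampled = interleave([d1, d2, d3])
oversampled = interleave([d1, d2, d3], stopping_strategy="all_exhausted")
```

With oversampling, every item of the longest source appears at least once, at the cost of repeating items from the shorter ones.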
Parquet support
- Download and prepare as Parquet for cloud storage by @lhoestq in #4724
- Shard parquet in download_and_prepare by @lhoestq in #4747
- Embed image/audio data in dl_and_prepare parquet by @lhoestq in #4987
Datasets changes
- Update: natural questions - Add long answer candidates by @seirasto in #4368
- Update: opus_paracrawl - update version by @albertvillanova in #4816
- Update: ReCoRD - Include entity positions as feature by @richarddwang in #4479
- Update: swda - Support streaming by @albertvillanova in #4914
- Update: Enwik8 - update broken link and information by @mtanghu in #4
- Update: compguesswhat - Support streaming by @albertvillanova in #4968
- Update: nli_tr - Support streaming by @albertvillanova in #4970
- Update: IndicGLUE - update download links by @sumanthd17 in #4978
- Update: iwslt2017 - Support streaming by @albertvillanova in #4992
- Fix: mbpp - fix NonMatchingChecksumError by @albertvillanova in #4788
- Fix: mkqa - Update data URL by @albertvillanova in #4823
- Fix: exams - fix bug and checksums by @albertvillanova in #4853
- Fix: trec - use fine classes by @albertvillanova in #4801
- Fix: wmt datasets - fix CWMT zh subsets by @lhoestq in #4871
- Fix: LibriSpeech - Fix dev split local_extracted_archive for 'all' config by @sanchit-gandhi in #4904
- Fix: compguesswhat - fix data URLs by @albertvillanova in #4959
- Fix: vivos - fix data URL and metadata by @albertvillanova in #4969
- Fix: MBPP - Add splits by @cwarny in #4943
Dataset cards
- Add `language_bcp47` tag by @lhoestq in #4753
- Added more information in the README about contributors of the Arabic Speech Corpus by @nawarhalabi in #4701
- Remove "unkown" language tags by @lhoestq in #4754
- Highlight non-commercial license in amazon_reviews_multi dataset card by @sbroadhurst-hf in #4712
- Added dataset information in clinic oos dataset card by @arnav-ladkat in #4751
- Fix opus_gnome dataset card by @gojiteji in #4806
- Complete the mlqa dataset card by @eldhoittangeorge in #4809
- Fix loading example in opus dataset cards by @albertvillanova in #4813
- Add missing language tags to resources by @albertvillanova in #4819
- Fix titles in dataset cards by @albertvillanova in #4824
- Fix language tags in dataset cards by @albertvillanova in #4826
- Add license metadata to pg19 by @julien-c in #4827
- Fix task tags in dataset cards by @albertvillanova in #4830
- Fix tags in dataset cards by @albertvillanova in #4832
- Fix missing tags in dataset cards by @albertvillanova in #4833
- Fix documentation card of recipe_nlg dataset by @albertvillanova in #4834
- Fix documentation card of ethos dataset by @albertvillanova in #4835
- Update documentation card of miam dataset by @PierreColombo in #4846
- Update stackexchange license by @cakiki in #4842
- Update ted_talks_iwslt license to include ND by @cakiki in #4841
- Fix documentation card of adv_glue dataset by @albertvillanova in #4838
- Complete tags of superglue dataset card by @richarddwang in https://github.com/huggingface/datasets/pull/48674869
- Fix license tag and Source Data section in billsum dataset card by @kashif in #4851
- Fix documentation card of covid_qa_castorini dataset by @albertvillanova in #4877
- Fix Citation Information section in dataset cards by @albertvillanova in #4879
- Fix documentation card of math_qa dataset by @albertvillanova in #4884
- Added names of less-studied languages by @BenjaminGalliot in #4880
- Fix language tags resource file by @albertvillanova in #4882
- Add citation to ro_sts and ro_sts_parallel datasets by @albertvillanova in #4892
- Add citation information to makhzan dataset by @albertvillanova in #4894
- Fix missing tags in dataset cards by @albertvillanova in #4891
- Fix missing tags in dataset cards by @albertvillanova in #4896
- Re-add code and und language tags by @albertvillanova in #4899
- Add "cc-by-nc-sa-2.0" to list of licenses by @osanseviero in https://github.com/huggingface/datasets/pull/48874903
- Update GLUE evaluation metadata by @lewtun in #4909
- Fix missing tags in dataset cards by @albertvillanova in #4908
- Add license and citation information to cosmos_qa dataset by @albertvillanova in #4913
- Fix missing tags in dataset cards by @albertvillanova in #4921
- Add cc-by-nc-2.0 to list of licenses by @albertvillanova in #4930
- Fix missing tags in dataset cards by @albertvillanova in #4931
- Add Papers with Code ID to scifact dataset by @albertvillanova in #4941
- Fix license information in qasc dataset card by @albertvillanova in #4951
- Fix multilinguality tag and missing sections in xquad_r dataset card by @albertvillanova in #4940
- Fix missing tags in dataset cards by @albertvillanova in #4979
- Fix missing tags in dataset cards by @albertvillanova in #4991
Documentation
2.4.0
Dataset Features
- Add `concatenate_datasets` for iterable datasets by @lhoestq in #4500
- Support parallelism with PyTorch DataLoader with parquet/json/csv/text/image/etc. files by @mariosasko in #4625
- Support using PCM audio files (#4323) by @YooSungHyun in #4409
- [data_files] Files disambiguation: match split names in data files if they are between separators by @lhoestq in #4633
- Support extract 7-zip compressed data files by @albertvillanova in #4672
- Support extract lz4 compressed data files by @albertvillanova in #4700
- Support `metadata.jsonl` from parent directories in `imagefolder` by @mariosasko in #4576
Dataset changes
- Update: allocine - Support streaming by @albertvillanova in #4563
- Update: multi_news - Host data on the Hub instead of Google Drive by @albertvillanova in #4585
- Update: pn_summary - Host data on the Hub instead of Google Drive by @albertvillanova in #4586
- Update: financial_phrasebank - Host data on the Hub by @albertvillanova in #4598
- Update: cfq - Support streaming by @albertvillanova in #4579
- Update: head_qa - Host data on the Hub and fix NonMatchingChecksumError by @albertvillanova in #4588
- Update: bookcorpus - Support streaming dataset by @albertvillanova in #4564
- Update: fever - Refactor and add metadata by @albertvillanova in #4503
- Update: mlsum - Support streaming dataset by @albertvillanova in #4574
- Fix: cats_vs_dogs - Update download url and improve card by @mariosasko in #4523
- Fix: conll2003 - fix empty example by @lhoestq in #4662
- Fix: WMT datasets - fix loading issue when choosing specific subsets and docs update by @khushmeeet in #4554
- Fix: xtreme - fix empty examples in dataset for bucc18 config by @lhoestq in #4706
- Fix: crd3 - fix splits that were containing the same data by @lhoestq in #4705
Dataset Cards
- Add action names in schema_guided_dstc8 dataset card by @lhoestq in #4559
- Add evaluation data to acronym_identification by @lewtun in #4561
- Update WinoBias README by @sashavor in #4631
- Support "tags" yaml tag by @lhoestq in #4716
- Fix POS tags by @lhoestq in #4715
- AESLC dataset: Add summarization tags by @hobson in #4517
Documentation
- Update docs around audio and vision by @stevhliu in #4440
- Update Google Cloud Storage documentation and add Azure Blob Storage example by @alvarobartt in #4513
- Remove multiple config section by @stevhliu in #4600
- Create new sections for audio and vision in guides by @stevhliu in #4519
- Document installation of sox OS dependency for audio by @albertvillanova in #4713
General improvements and bug fixes
- Add regression test for `ArrowWriter.write_batch` when batch is empty by @alvarobartt in #4510
- Support all negative values in ClassLabel by @lhoestq in #4511
- Add uppercased versions of image file extensions for automatic module inference by @mariosasko in #4515
- Patch tests for hfh v0.8.0 by @LysandreJik in #4518
- Replace deprecated logging.warn with logging.warning by @hugovk in #4539
- [CI] Fix upstream hub test url by @lhoestq in #4543
- Fix timestamp conversion from Pandas to Python datetime in streaming mode by @lhoestq in #4541
- [CI] fixing seqeval install in ci by pinning setuptools-scm by @lhoestq in #4546
- Tell users to upload on the hub directly by @lhoestq in #4552
- Add `batch_size` parameter when calling `add_faiss_index` and `add_faiss_index_from_external_arrays` by @alvarobartt in #4535
- Make DuplicateKeysError more user friendly [For Issue #2556] by @VijayKalmath in #4545
- Properly raise FileNotFound even if the dataset is private by @lhoestq in #4536
- Fix hashing for python 3.9 by @lhoestq in #4516
- [CI] Fix some warnings by @lhoestq in #4547
- Validate new_fingerprint passed by user by @lhoestq in #4587
- Update CI Windows orb by @albertvillanova in #4604
- Perform hidden file check on relative data file path by @mariosasko in #4551
- Align more metadata with other repo types (models,spaces) by @julien-c in #4607
- Align/fix license metadata info by @julien-c in #4613
- Preserve member order by MockDownloadManager.iter_archive by @albertvillanova in #4611
- Add authentication tip to `load_dataset` by @mariosasko in #4577
- Stop dropping columns in to_tf_dataset() before we load batches by @Rocketknight1 in #4553
- fix(dataset_wrappers): Fixes access to fsspec.asyn in torch_iterable_dataset.py. by @gugarosa in #4630
- Fix xisfile, xgetsize, xisdir, xlistdir in private repo by @lhoestq in #4608
- Rename master to main by @lhoestq in #4643
- Set HF_SCRIPTS_VERSION to main by @lhoestq in #4645
- [Minor fix] Typo correction by @cakiki in #4644
- fixed duplicate calculation of spearmanr function in metrics wrapper. by @benlipkin in #4627
- Generalize meta_path json file creation in load.py [#4540] by @VijayKalmath in #4590
- Fix time type `_arrow_to_datasets_dtype` conversion by @mariosasko in #4628
- Fix _resolve_single_pattern_locally on Windows with multiple drives by @albertvillanova in #4660
- Replace `assertEqual` with `assertTupleEqual` in unit tests for verbosity by @alvarobartt in #4496
- Fix `embed_storage` on features inside lists/sequences by @mariosasko in #4615
- Add links to vision tasks scripts in ADD_NEW_DATASET template by @mariosasko in #4512
- Transfer CI to GitHub Actions by @albertvillanova in #4659
- Fix mock fsspec by @albertvillanova in #4685
- Trigger CI also on push to main by @albertvillanova in #4687
- Fix ImageFolder with parameters drop_metadata=True and drop_labels=False (when metadata.jsonl is present) by @polinaeterna in #4622
- Skip test_extractor only for zstd param if zstandard not installed by @albertvillanova in #4688
- Test extractors for all compression formats by @albertvillanova in #4689
- Refactor base extractors by @albertvillanova in #4690
- Update create dataset card docs by @stevhliu in #4683
- Add text decorators by @stevhliu in #4663
- Skip tests only for lz4/zstd params if not installed by @albertvillanova in #4704
- Ensure ConcatenationTable.cast uses target_schema metadata by @dtuit in #4614
- Docs: Fix same-page hashlinks by @mishig25 in #4722
- Fix broken link to the Hub by @stevhliu in #4726
- Refactor conftest fixtures by @albertvillanova in #4723
- Add object detection processing tutorial by @nateraw in #4710
- Fix require torchaudio and refactor test requirements by @albertvillanova in https:/...
2.3.2
Bug fixes
- Fix double dots in data files by @lhoestq in #4505
- fix a bug when `/../` is passed to `data_files` causing FileNotFoundError
- fix ETT m1/m2 test/val dataset by @kashif in #4499
- Corrected broken links in doc by @clefourrier in #4501
New Contributors
- @clefourrier made their first contribution in #4501
Full Changelog: 2.3.1...2.3.2
2.3.1
Bug fixes
- Fix patching module that doesn't exist by @lhoestq in #4495
- fix bug when importing the lib when scipy is not installed
- Re-add download_manager module in utils by @lhoestq in #4497
- fix moved imports of `DownloadConfig`, `DownloadMode`, `DownloadManager`
- Support streaming UDHR dataset by @albertvillanova in #4487
Full Changelog: 2.3.0...2.3.1
2.3.0
Datasets Changes
- New: ImageNet-Sketch by @nateraw in #4301
- New: Biwi Kinect Head Pose by @dnaveenr in #3903
- New: enwik8 by @HallerPatrick in #4321
- New: LCCC dataset by @silverriver in #4416
- New: TruthfulQA by @jon-tow in #4159
- New: BIG-bench by @andersjohanandreassen in #4125
- New: QuickDraw by @mariosasko in #3592
- New: SST-2 by @albertvillanova in #4473
- Update: imagenet-1k - remove manual download by @mariosasko in #4299
- ImageNet can now be loaded in python with `load_dataset` without requiring a manual download!
- It also supports streaming mode with `load_dataset("imagenet-1k", streaming=True)`
- Update: spider - Remove Google Drive URL by @albertvillanova in #4410
- Update: blended_skill_talk - add missing columns by @mariosasko in #4437
- Update: multi-news - Use newer version with fixes by @JohnGiorgi in #4451
- Update: fever - update data URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/44554459
- Update: udhr - Add and fix language tags by @albertvillanova in https://github.com/huggingface/datasets/pull/
- Update: udhr - update metadata by @leondz in #4362
- Update: wider_face - Replace data URLs once hosted on the Hub by @albertvillanova in #4469
- Update: PASS - update dataset version by @mariosasko in #4488
- Fix: GEM - fix bug in wiki_auto_asset_turk config by @albertvillanova in #4389
- Fix: GEM - fix URL for totto config by @albertvillanova in #4396
- Fix: timit_asr - fix DuplicatedKeysError by @albertvillanova in #4424
- Fix: timit_asr - Make extensions case-insensitive by @albertvillanova in #4425
- Fix: timit_asr - Fix directory names for LDC data by @albertvillanova in #4436
- Fix: iwslt2017 by @lhoestq in #4481
Dataset Features
- to_tf_dataset rewrite by @Rocketknight1 in #4170
- see more in the documentation
- Support DataLoader with num_workers > 0 in streaming mode by @lhoestq in #4375
- see more in the documentation
- Added stratify option to `train_test_split` by @nandwalritik in #4322
- Re-add support for Apache Beam functionality by @albertvillanova in #4328
- Resume `push_to_hub`: skip identical files in `push_to_hub` instead of overwriting by @mariosasko in #4402
- Support nested/complex feature types as `features` in packaged loaders by @mariosasko in #4364
- Optimize contiguous shard and select by @lhoestq in #4466
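The stratify option keeps label proportions similar across the train and test parts. The sketch below illustrates the idea with plain dictionaries; `stratified_split` is a hypothetical helper, not the library's implementation, and the toy rows are made up. In the library itself the option is exposed through the `stratify_by_column` argument of `train_test_split`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_key, test_size=0.25, seed=0):
    """Split dict rows so each label keeps roughly the same share in both parts.

    Hypothetical helper for illustration; not the library's implementation.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)  # shuffle within each label before cutting
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy data: 80 rows of label 0 and 20 rows of label 1.
rows = [{"label": 0, "x": i} for i in range(80)]
rows += [{"label": 1, "x": i} for i in range(20)]
train, test = stratified_split(rows, "label", test_size=0.25)
test_counts = Counter(r["label"] for r in test)
```

Splitting per label is what keeps the 80/20 class ratio intact in both parts; a plain random split only matches it on average.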
Dataset Cards
- Minor fixes/improvements in `scene_parse_150` card by @mariosasko in #4447
- Tidy up license metadata for google_wellformed_query, newspop, sick by @leondz in #4378
- Fix example in opus_ubuntu, Add license info by @leondz in #4360
- Update README.md of fquad by @lhoestq in #4450
Documentation
- Add API code examples for loading methods by @stevhliu in #4300
- Add API code examples for remaining main classes by @stevhliu in #4292
- Generalize tutorials for audio and vision by @stevhliu in #4468
- [Docs] How to use with PyTorch page by @lhoestq in #4474
- First draft of the docs for TF + Datasets by @Rocketknight1 in #4457
Other improvements and bug fixes
- Update CI deprecated legacy image by @albertvillanova in #4393
- remove int documentation from logging docs by @lvwerra in #4392
- Fix docstring in DatasetDict::shuffle by @felixdivo in #4344
- Fix Version equality by @albertvillanova in #4359
- Set builder name from module instead of class by @albertvillanova in #4388
- Test dill by @albertvillanova in #4385
- Refactor download by @albertvillanova in #4384
- Fix dependency on dill version by @albertvillanova in #4397
- Support remote cache_dir by @albertvillanova in #4347
- Update imagenet gate by @lhoestq in #4408
- Fix dataset builder default version by @albertvillanova in #4356
- Uncomment logging deactivation for ArrowBasedBuilder by @thomasw21 in #4403
- Rename DatasetBuilder config_name by @albertvillanova in #4414
- Fix metadata validation by @albertvillanova in #4390
- Add HF.co for PRs/Issues for specific datasets by @lhoestq in #4427
- Fix type hint and documentation for `new_fingerprint` by @fxmarty in #4326
- Skip hidden files/directories in data files resolution and `iter_files` by @mariosasko in #4412
- Fix docstring of inspect_dataset by @albertvillanova in #4438
- Fix builder docstring by @albertvillanova in #4432
- Fix kwargs in docstrings by @albertvillanova in #4444
- Fix missing args in docstring of load_dataset_builder by @albertvillanova in #4445
- Add missing kwargs to docstrings by @albertvillanova in #4446
- Add extractor for bzip2-compressed files by @asivokon in #4421
- Fix dummy dataset generation script for handling nested types of _URLs by @silverriver in #4434
- Update `dataset_infos.json` with new split info in `Dataset.push_to_hub` to avoid verification error by @mariosasko in #4415
- Update builder docstring for deprecated/added arguments by @albertvillanova in #4429
- Extend support for streaming datasets that use xml.dom.minidom.parse by @albertvillanova in #4464
- Fix script fetching and local path handling in `inspect_dataset` and `inspect_metric` by @mariosasko in #4433
- Fix bigbench config names by @lhoestq in #4465
- Fix 401 error for unauthenticated requests to non-existing repos by @lhoestq in #4472
- Reorder returned validation/test splits in script template by @albertvillanova in #4470
- Better ImportError message when a dataset script dependency is missing by @lhoestq in #4484
- Fix cast to null by @lhoestq in #4485
- Update `_format_columns` in `remove_columns` by @alvarobartt in #4411
- Fix wrong map parameter name in cache docs by @h4iku in #4293
- Pin the revision in imagenet download links by @lhoestq in #4492
- Refactor column mappings for question answering datasets by @lewtun in #4391
New Contributors
- @leondz made their first contribution in #4378
- @felixdivo made their first contribution in #4344
- @nandwalritik made their first contribution in #4322
- @fxmarty made their first contribution in #4326
- @HallerPatrick made their first contribution in #4321
- @silverriver made their first contribution in #4416
- @asivokon made their first contribution in #4421
- @andersjohanandreassen made their first contribution in #4125
Full Changelog: https://git...