Releases · huggingface/datasets

16 Nov 10:11

albertvillanova

2.7.0

edf1902

2.7.0

Dataset Features

Multiprocessed dataset builder by @TevenLeScao in #5107
- Load big datasets faster than before using multiprocessing:
```
from datasets import load_dataset
ds = load_dataset("imagenet-1k", num_proc=4)
```
Make torch.Tensor and spacy models cacheable by @mariosasko in #5191
- Functions passed to map or filter that use tensors or pipelines can now be cached
Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in #5192
TextConfig: added "errors" by @NightMachinery in #5155
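
The `errors` option added to `TextConfig` presumably follows the `errors` argument of Python's built-in `open`. The sketch below shows what the common error modes do to a file with an invalid UTF-8 byte. The file name and helper names are hypothetical, and the `load_dataset(..., errors=...)` call is an assumption about how the option is passed through; it is defined but not executed here.

```python
import os
import tempfile


def write_sample(path: str) -> None:
    """Write one valid line and one line containing an invalid UTF-8 byte."""
    with open(path, "wb") as f:
        f.write(b"good line\n")
        f.write(b"bad \xff line\n")


def read_lines(path: str, errors: str) -> list:
    """Decode the file with the built-in open() under a given error mode."""
    with open(path, encoding="utf-8", errors=errors) as f:
        return [line.rstrip("\n") for line in f]


def load_text_dataset(path: str, errors: str = "replace"):
    """Assumed usage of the new option; requires the `datasets` package."""
    from datasets import load_dataset

    return load_dataset("text", data_files=path, errors=errors)


sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
write_sample(sample_path)

# "replace" swaps undecodable bytes for U+FFFD, "ignore" drops them.
replaced = read_lines(sample_path, errors="replace")
ignored = read_lines(sample_path, errors="ignore")
```

With the default strict mode the same file would raise a `UnicodeDecodeError`, which is the situation the option is meant to avoid.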

Audio setup

Add ffmpeg4 installation instructions in warnings by @polinaeterna in #5167

Docs

Update create image dataset docs by @stevhliu in #5177
add: segmentation guide. by @sayakpaul in #5188
Reword E2E training and inference tips in the vision guides by @sayakpaul in #5217
Add SQL guide by @stevhliu in #5223

General improvements and bug fixes

Add pyproject.toml for black by @mariosasko in #5125
Fix tqdm zip bug by @david1542 in #5120
Install tensorflow-macos dependency conditionally by @albertvillanova in #5124
[TYPO] Update new_dataset_script.py by @cakiki in #5119
Avoid extra cast in class_encode_column by @mariosasko in #5130
Use yaml for issue templates + revamp by @mariosasko in #5116
Update docs once dataset scripts transferred to the Hub by @albertvillanova in #5136
Delete duplicate issue template file by @albertvillanova in #5146
Deprecate num_proc parameter in DownloadManager.extract by @ayushthe1 in #5142
Raise ImportError instead of OSError by @ayushthe1 in #5141
Fix CI require beam by @albertvillanova in #5168
Make iter_files deterministic by @albertvillanova in #5149
Add PB and TB in convert_file_size_to_int by @lhoestq in #5171
Reduce default max writer_batch_size by @mariosasko in #5163
Support dill 0.3.6 by @albertvillanova in #5166
Make filename matching more robust by @riccardobucco in #5128
Preserve None in list type cast in PyArrow 10 by @mariosasko in #5174
Raise ffmpeg warnings only once by @polinaeterna in #5173
Add "ipykernel" to list of co_filenames to remove by @gpucce in #5169
chore: add notebook links to img cls and obj det. by @sayakpaul in #5187
Fix docs about dataset_info in YAML by @albertvillanova in #5194
fsspec lock reset in multiprocessing by @lhoestq in #5159
Add note about the name of a dataset script by @polinaeterna in #5198
Deprecate dummy data generation command by @mariosasko in #5199
Do not sort splits in dataset info by @polinaeterna in #5201
Add missing DownloadConfig.use_auth_token value by @alvarobartt in #5205
Update canonical links to Hub links by @stevhliu in #5203
Refactor CI hub fixtures to use monkeypatch instead of patch by @albertvillanova in #5208
Update github pr docs actions by @mishig25 in #5214
Use hfh hf_hub_url function by @albertvillanova in #5196
Pin typer version in tests to <0.5 to fix Windows CI by @polinaeterna in #5235
Fix shards in IterableDataset.from_generator by @lhoestq in #5233
Fix class name of symbolic link by @riccardobucco in #5126
Make Version hashable by @mariosasko in #5238
Handle ArrowNotImplementedError caused by try_type being Image or Audio in cast by @mariosasko in #5236
Encode path only for old versions of hfh by @lhoestq in #5237
Fix CI require_beam maximum compatible dill version by @albertvillanova in #5212
Support hfh rc version by @lhoestq in #5241
Cleaner error tracebacks for dataset script errors by @mariosasko in #5240
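
Entry #5171 above extends size strings with TB and PB units. As an illustration of what converting such strings to a byte count involves, here is a minimal sketch. `size_to_bytes` is a hypothetical helper, not the library's `convert_file_size_to_int`, and it ignores bit units, which the real function may treat differently.

```python
import re

# Decimal (KB, MB, ...) and binary (KiB, MiB, ...) multipliers, up to peta.
_MULTIPLIERS = {
    "B": 1,
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40, "PIB": 2**50,
}
_SIZE_RE = re.compile(r"^\s*(\d+)\s*([A-Za-z]+)\s*$")


def size_to_bytes(size) -> int:
    """Convert an int or a string such as '5TB' or '1PiB' to a number of bytes."""
    if isinstance(size, int):
        return size
    match = _SIZE_RE.match(size)
    if match is None or match.group(2).upper() not in _MULTIPLIERS:
        raise ValueError(f"cannot parse size: {size!r}")
    return int(match.group(1)) * _MULTIPLIERS[match.group(2).upper()]


five_terabytes = size_to_bytes("5TB")
one_pebibyte = size_to_bytes("1PiB")
```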

New Contributors

@david1542 made their first contribution in #5120
@ayushthe1 made their first contribution in #5142
@gpucce made their first contribution in #5169
@sayakpaul made their first contribution in #5187
@NightMachinery made their first contribution in #5155

Full Changelog: 2.6.1...2.7.0

Contributors

cakiki, albertvillanova, and 13 other contributors


14 Oct 15:45

lhoestq

2.6.1

1742cf1

2.6.1

Bug fixes

Fix filter indices when batched by @albertvillanova in #5113
- fixed a bug where filter could return examples with the wrong indices
Fix iter_batches by @lhoestq in #5115
- fixed a bug where map with batched=True could return a dataset with fewer examples
Fix a typo in arrow_dataset.py by @yangky11 in #5108

New Contributors

@yangky11 made their first contribution in #5108

Full Changelog: 2.6.0...2.6.1

Contributors

yangky11, albertvillanova, and lhoestq


13 Oct 11:00

lhoestq

2.6.0

dc3f72e

2.6.0

Important

[GH->HF] Remove all dataset scripts from github by @lhoestq in #4974
- all the dataset scripts and dataset cards are now on https://hf.co/datasets
- we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on

Datasets features

Add ability to read-write to SQL databases. by @Dref360 in #4928

Read from sqlite file:

```
from datasets import Dataset
dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
```

Allow connection objects in from_sql + small doc improvement by @mariosasko in #5091

```
from datasets import Dataset
from sqlite3 import connect
con = connect(...)
dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)
```
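
For a self-contained variant of the two SQL snippets above, the sketch below first creates a small SQLite file with the standard library and then reads it back through a connection object. The table contents and helper names are hypothetical; `load_long_texts` assumes `datasets` and its SQL dependencies are installed and is defined but not executed here.

```python
import os
import sqlite3
import tempfile


def make_sqlite_db(path: str) -> None:
    """Create a small table so there is something to read back."""
    con = sqlite3.connect(path)
    try:
        con.execute("CREATE TABLE data_table (id INTEGER PRIMARY KEY, text TEXT)")
        con.executemany(
            "INSERT INTO data_table (text) VALUES (?)",
            [("short",), ("x" * 150,), ("y" * 200,)],
        )
        con.commit()
    finally:
        con.close()


def load_long_texts(path: str):
    """Read rows through a connection object; requires the `datasets` package."""
    from datasets import Dataset

    con = sqlite3.connect(path)
    try:
        return Dataset.from_sql(
            "SELECT text FROM data_table WHERE length(text) > 100", con
        )
    finally:
        con.close()


db_path = os.path.join(tempfile.mkdtemp(), "sqlite_file.db")
make_sqlite_db(db_path)

# Sanity check with the standard library only: count the rows the query selects.
check_con = sqlite3.connect(db_path)
long_rows = check_con.execute(
    "SELECT COUNT(*) FROM data_table WHERE length(text) > 100"
).fetchone()[0]
check_con.close()
```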

Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in #5072

Return numpy/torch/tf/jax tensors with:

```
from datasets import load_dataset
# select a split so that integer indexing returns a single example
ds = load_dataset("imagenet-1k", split="train").with_format("torch")  # or numpy/tf/jax
ds[0]["image"]
```

Added IterableDataset.from_generator by @hamid-vakilzadeh in #5052
Fast dataset iter by @mariosasko in #5030
- speed up by a factor of 2 using the Arrow Table reader
Dataset infos in yaml by @lhoestq in #4926
- you can now specify the feature types and number of samples in the dataset card, see https://huggingface.co/docs/datasets/dataset_card
Add kwargs to Dataset.from_generator by @mariosasko in #5049
Support converters in CsvBuilder by @mariosasko in #5057
Restore saved format state in load_from_disk by @asofiaoliveira in #5073
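
The generator-based constructors listed above take a plain Python generator function. A minimal sketch, assuming `gen_kwargs` is forwarded to the generator as keyword arguments; the shard names and helper names are hypothetical, and `as_iterable_dataset` needs the `datasets` package, so it is defined but not executed here.

```python
from typing import Dict, Iterator, List


def generate_examples(shards: List[str]) -> Iterator[Dict[str, object]]:
    """Yield two examples per shard; the shard names are placeholders."""
    for shard in shards:
        for row in range(2):
            yield {"shard": shard, "row": row}


def as_iterable_dataset(shards: List[str]):
    """Wrap the generator lazily; requires the `datasets` package."""
    from datasets import IterableDataset

    return IterableDataset.from_generator(
        generate_examples, gen_kwargs={"shards": shards}
    )


# The generator itself is plain Python and can be checked without the library.
examples = list(generate_examples(["shard-0", "shard-1"]))
```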

Dataset changes

Update: hendrycks_test - support streaming by @albertvillanova in #5041
Update: swiss judgment prediction by @JoelNiklaus in #5019
- Update swiss judgment prediction by @JoelNiklaus in #5042
Fix: xcsr - fix languages of X-CSQA configs by @albertvillanova in #5022
Fix: sbu_captions - fix URLs by @donglixp in #5020
Fix: xcsr - fix string features by @albertvillanova in #5024
Fix: hendrycks_test - fix NonMatchingChecksumError by @albertvillanova in #5040
Fix: cats_vs_dogs - fix number of samples by @lhoestq in #5047
Fix: lex_glue - fix bug with labels of eurlex config of lex_glue dataset by @iliaschalkidis in #5048
Fix: msr_sqa - fix dataset generation by @Timothyxxx in #3715

Dataset cards

Add description to hellaswag dataset by @julien-c in #4810
Add deprecation warning to multilingual_librispeech dataset card by @albertvillanova in #5010
Update languages in aeslc dataset card by @apergo-ai in #3357
Update license to bookcorpus dataset card by @meg-huggingface in #3526
Update paper link in medmcqa dataset card by @monk1337 in #4290
Add oversampling strategy for interleaving iterable datasets by @ylacombe in #5036
Fix license/citation information of squadshifts dataset card by @albertvillanova in #5054

General improvements and bug fixes

Fix missing use_auth_token in streaming docstrings by @albertvillanova in #5003
Add some note about running the transformers ci before a release by @lhoestq in #5007
Remove license tag file and validation by @albertvillanova in #5004
Re-apply input columns change by @mariosasko in #5008
patch CI_HUB_TOKEN_PATH with Path instead of str by @Wauplin in #5026
Fix typo in error message by @severo in #5027
Fix import in ClassLabel docstring example by @alvarobartt in #5029
Remove redundant code from some dataset module factories by @albertvillanova in #5033
Fix typos in load docstrings and comments by @albertvillanova in #5035
Prefer split patterns from directories over split patterns from filenames by @polinaeterna in #4985
Fix tar extraction vuln by @lhoestq in #5016
Support hfh 0.10 implicit auth by @lhoestq in #5031
Fix flatten_indices with empty indices mapping by @mariosasko in #5043
Improve CI performance speed of PackagedDatasetTest by @albertvillanova in #5037
Revert task removal in folder-based builders by @mariosasko in #5051
Fix backward compatibility for dataset_infos.json by @lhoestq in #5055
Fix typo by @stevhliu in #5059
Fix CI hfh token warning by @albertvillanova in #5062
Mark CI tests as xfail when 502 error by @albertvillanova in #5058
Fix passed download_config in HubDatasetModuleFactoryWithoutScript by @albertvillanova in #5077
Fix CONTRIBUTING once dataset scripts transferred to Hub by @albertvillanova in #5067
Fix header level in Audio docs by @stevhliu in #5078
Support DEFAULT_CONFIG_NAME when no BUILDER_CONFIGS by @albertvillanova in #5071
Support streaming gzip.open by @albertvillanova in #5066
adding keep in memory by @Mustapha-AJEGHRIR in #5082
refactor: replace AssertionError with more meaningful exceptions (#5074) by @galbwe in #5079
fix: update exception throw from OSError to EnvironmentError in `push… by @rahulXs in #5076
Align signature of list_repo_files with latest hfh by @albertvillanova in #5063
Align signature of create/delete_repo with latest hfh by @albertvillanova in #5064
Fix filter with empty indices by @Mouhanedg56 in #5087
Fix tutorial (#5093) by @riccardobucco in #5095
Use HTML relative paths for tiles in the docs by @lewtun in #5092
Fix loading how to guide (#5102) by @riccardobucco in #5104
url encode hub url (#5099) by @riccardobucco in #5103
Free the "hf" filesystem protocol for hffs by @lhoestq in #5101
Fix task template reload from dict by @lhoestq in #5106

New Contributors

@Wauplin made their first contribution in #5026
@donglixp made their first contribution in #5020
@Timothyxxx made their first contribution in #3715
@hamid-vakilzadeh made their first contribution in #5052
@Mustapha-AJEGHRIR made their first contribution in #5082
@galbwe made their first contribution in #5079
@rahulXs made their first contribution in #5076
@Mouhanedg56 made their first contribution in #5087
@riccardobucco made their first contribution in #5095
@asofiaoliveira made their first contribution in #5073

Full Changelog: 2.5.1...2.6.0

Contributors

julien-c, donglixp, and 24 other contributors


05 Oct 10:17

lhoestq

2.5.2

c59cc34

2.5.2

Bug fixes

Revert task removal in folder-based builders (#5051)
Support hfh 0.10 implicit auth (#5031)

Full Changelog: 2.5.1...2.5.2


21 Sep 15:17

lhoestq

2.5.1

0c84b71

2.5.1

Bug fixes

Revert input_columns change by @lhoestq in #5006

Full Changelog: 2.5.0...2.5.1

Contributors

lhoestq


21 Sep 13:14

lhoestq

2.5.0

6fc30c1

2.5.0

Important

Drop Python 3.6 support by @mariosasko in #4460
Deprecate metrics by @albertvillanova in #4739
- Metrics are now deprecated and have been moved to evaluate:
```
!pip install evaluate
import evaluate
metric = evaluate.load("accuracy")
```
Load GitHub datasets from Hub by @albertvillanova in #4059
- datasets with no namespace like "squad" used to be loaded from this GitHub repository; they are now loaded from https://huggingface.co/datasets
Decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in #4923
- latest version of torchaudio 0.12 now requires ffmpeg (version 4) to read MP3 files, please downgrade to 0.12 for now or use librosa
Use HTTP requests to access data and metadata through the Datasets REST API (docs here)

Datasets features

No-code loaders

Add AudioFolder packaged loader by @polinaeterna in #4530
Add support for CSV metadata files to ImageFolder by @mariosasko in #4837
Add support for parsing JSON files in array form by @mariosasko in #4997

Dataset methods

add Dataset.from_list by @sanderland in #4890
Add Dataset.from_generator by @mariosasko in #4957
Add oversampling strategies to interleave datasets by @ylacombe in #4831
Preserve non-input_columns in Dataset.map if input_columns are specified by @mariosasko in #4971
Add fn_kwargs param to IterableDataset.map by @mariosasko in #4975
More rigorous shape inference in to_tf_dataset by @Rocketknight1 in #4763
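
The oversampling entry above (#4831) concerns when interleaving stops. The plain-Python sketch below illustrates the idea on lists. It is not the library's implementation: `interleave` is a hypothetical helper, and the strategy names `first_exhausted` and `all_exhausted` are assumed to match those used by `interleave_datasets`.

```python
from typing import List, Sequence


def interleave(
    sources: Sequence[Sequence[int]], stopping_strategy: str = "first_exhausted"
) -> List[int]:
    """Alternate between sources, taking one example from each per round."""
    if stopping_strategy not in ("first_exhausted", "all_exhausted"):
        raise ValueError(f"unknown stopping strategy: {stopping_strategy!r}")
    if not sources or any(len(source) == 0 for source in sources):
        return []

    positions = [0] * len(sources)
    seen_fully = [False] * len(sources)
    result: List[int] = []
    while True:
        for i, source in enumerate(sources):
            # Wrap around so a short source is sampled again (oversampled).
            result.append(source[positions[i] % len(source)])
            positions[i] += 1
            if positions[i] >= len(source):
                seen_fully[i] = True
        if stopping_strategy == "first_exhausted":
            done = any(seen_fully)
        else:
            done = all(seen_fully)
        if done:
            return result


d1, d2, d3 = [0, 1, 2], [10, 11, 12, 13], [20, 21, 22, 23, 24]
subsampled = interleave([d1, d2, d3])
oversampled = interleave([d1, d2, d3], stopping_strategy="all_exhausted")
```

With `first_exhausted` the result ends once the shortest source runs out; with `all_exhausted` the shorter sources are cycled until every source has been seen in full.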

Parquet support

Download and prepare as Parquet for cloud storage by @lhoestq in #4724
Shard parquet in download_and_prepare by @lhoestq in #4747
Embed image/audio data in dl_and_prepare parquet by @lhoestq in #4987

Datasets changes

Update: natural questions - Add long answer candidates by @seirasto in #4368
Update: opus_paracrawl - update version by @albertvillanova in #4816
Update: ReCoRD - Include entity positions as feature by @richarddwang in #4479
Update: swda - Support streaming by @albertvillanova in #4914
Update: Enwik8 - update broken link and information by @mtanghu in #4
Update: compguesswhat - Support streaming by @albertvillanova in #4968
Update: nli_tr - Support streaming by @albertvillanova in #4970
Update: IndicGLUE - update download links by @sumanthd17 in #4978
Update: iwslt2017 - Support streaming by @albertvillanova in #4992
Fix: mbpp - fix NonMatchingChecksumError by @albertvillanova in #4788
Fix: mkqa - Update data URL by @albertvillanova in #4823
Fix: exams - fix bug and checksums by @albertvillanova in #4853
Fix: trec - use fine classes by @albertvillanova in #4801
Fix: wmt datasets - fix CWMT zh subsets by @lhoestq in #4871
Fix: LibriSpeech - Fix dev split local_extracted_archive for 'all' config by @sanchit-gandhi in #4904
Fix: compguesswhat - fix data URLs by @albertvillanova in #4959
Fix: vivos - fix data URL and metadata by @albertvillanova in #4969
Fix: MBPP - Add splits by @cwarny in #4943

Dataset cards

Add language_bcp47 tag by @lhoestq in #4753
Added more information in the README about contributors of the Arabic Speech Corpus by @nawarhalabi in #4701
Remove "unknown" language tags by @lhoestq in #4754
Highlight non-commercial license in amazon_reviews_multi dataset card by @sbroadhurst-hf in #4712
Added dataset information in clinic oos dataset card by @arnav-ladkat in #4751
Fix opus_gnome dataset card by @gojiteji in #4806
Complete the mlqa dataset card by @eldhoittangeorge in #4809
Fix loading example in opus dataset cards by @albertvillanova in #4813
Add missing language tags to resources by @albertvillanova in #4819
Fix titles in dataset cards by @albertvillanova in #4824
Fix language tags in dataset cards by @albertvillanova in #4826
Add license metadata to pg19 by @julien-c in #4827
Fix task tags in dataset cards by @albertvillanova in #4830
Fix tags in dataset cards by @albertvillanova in #4832
Fix missing tags in dataset cards by @albertvillanova in #4833
Fix documentation card of recipe_nlg dataset by @albertvillanova in #4834
Fix documentation card of ethos dataset by @albertvillanova in #4835
Update documentation card of miam dataset by @PierreColombo in #4846
Update stackexchange license by @cakiki in #4842
Update ted_talks_iwslt license to include ND by @cakiki in #4841
Fix documentation card of adv_glue dataset by @albertvillanova in #4838
Complete tags of superglue dataset card by @richarddwang in https://github.com/huggingface/datasets/pull/48674869
Fix license tag and Source Data section in billsum dataset card by @kashif in #4851
Fix documentation card of covid_qa_castorini dataset by @albertvillanova in #4877
Fix Citation Information section in dataset cards by @albertvillanova in #4879
Fix documentation card of math_qa dataset by @albertvillanova in #4884
Added names of less-studied languages by @BenjaminGalliot in #4880
Fix language tags resource file by @albertvillanova in #4882
Add citation to ro_sts and ro_sts_parallel datasets by @albertvillanova in #4892
Add citation information to makhzan dataset by @albertvillanova in #4894
Fix missing tags in dataset cards by @albertvillanova in #4891
Fix missing tags in dataset cards by @albertvillanova in #4896
Re-add code and und language tags by @albertvillanova in #4899
Add "cc-by-nc-sa-2.0" to list of licenses by @osanseviero in https://github.com/huggingface/datasets/pull/48874903
Update GLUE evaluation metadata by @lewtun in #4909
Fix missing tags in dataset cards by @albertvillanova in #4908
Add license and citation information to cosmos_qa dataset by @albertvillanova in #4913
Fix missing tags in dataset cards by @albertvillanova in #4921
Add cc-by-nc-2.0 to list of licenses by @albertvillanova in #4930
Fix missing tags in dataset cards by @albertvillanova in #4931
Add Papers with Code ID to scifact dataset by @albertvillanova in #4941
Fix license information in qasc dataset card by @albertvillanova in #4951
Fix multilinguality tag and missing sections in xquad_r dataset card by @albertvillanova in #4940
Fix missing tags in dataset cards by @albertvillanova in #4979
Fix missing tags in dataset cards by @albertvillanova in #4991

Documentation

Update map docs by @stevhliu in #4743
Ad...

Contributors

kashif, apohllo, and 38 other contributors


25 Jul 13:41

lhoestq

2.4.0

401d4c4

2.4.0

Dataset Features

Add concatenate_datasets for iterable datasets by @lhoestq in #4500
Support parallelism with PyTorch DataLoader with parquet/json/csv/text/image/etc. files by @mariosasko in #4625
Support using PCM audio files (#4323) by @YooSungHyun in #4409
[data_files] Files disambiguation: match split names in data files if they are between separators by @lhoestq in #4633
Support extract 7-zip compressed data files by @albertvillanova in #4672
Support extract lz4 compressed data files by @albertvillanova in #4700
Support metadata.jsonl from parent directories in imagefolder by @mariosasko in #4576

Dataset changes

Update: allocine - Support streaming by @albertvillanova in #4563
Update: multi_news - Host data on the Hub instead of Google Drive by @albertvillanova in #4585
Update: pn_summary - Host data on the Hub instead of Google Drive by @albertvillanova in #4586
Update: financial_phrasebank - Host data on the Hub by @albertvillanova in #4598
Update: cfq - Support streaming by @albertvillanova in #4579
Update: head_qa - Host data on the Hub and fix NonMatchingChecksumError by @albertvillanova in #4588
Update: bookcorpus - Support streaming dataset by @albertvillanova in #4564
Update: fever - Refactor and add metadata by @albertvillanova in #4503
Update: mlsum - Support streaming dataset by @albertvillanova in #4574
Fix: cats_vs_dogs - Update download url and improve card by @mariosasko in #4523
Fix: conll2003 - fix empty example by @lhoestq in #4662
Fix: WMT datasets - fix loading issue when choosing specific subsets and docs update by @khushmeeet in #4554
Fix: xtreme - fix empty examples in dataset for bucc18 config by @lhoestq in #4706
Fix: crd3 - fix splits that were containing the same data by @lhoestq in #4705

Dataset Cards

Add action names in schema_guided_dstc8 dataset card by @lhoestq in #4559
Add evaluation data to acronym_identification by @lewtun in #4561
Update WinoBias README by @sashavor in #4631
Support "tags" yaml tag by @lhoestq in #4716
Fix POS tags by @lhoestq in #4715
AESLC dataset: Add summarization tags by @hobson in #4517

Documentation

Update docs around audio and vision by @stevhliu in #4440
Update Google Cloud Storage documentation and add Azure Blob Storage example by @alvarobartt in #4513
Remove multiple config section by @stevhliu in #4600
Create new sections for audio and vision in guides by @stevhliu in #4519
Document installation of sox OS dependency for audio by @albertvillanova in #4713

General improvements and bug fixes

Add regression test for ArrowWriter.write_batch when batch is empty by @alvarobartt in #4510
Support all negative values in ClassLabel by @lhoestq in #4511
Add uppercased versions of image file extensions for automatic module inference by @mariosasko in #4515
Patch tests for hfh v0.8.0 by @LysandreJik in #4518
Replace deprecated logging.warn with logging.warning by @hugovk in #4539
[CI] Fix upstream hub test url by @lhoestq in #4543
Fix timestamp conversion from Pandas to Python datetime in streaming mode by @lhoestq in #4541
[CI] fixing seqeval install in ci by pinning setuptools-scm by @lhoestq in #4546
Tell users to upload on the hub directly by @lhoestq in #4552
Add batch_size parameter when calling add_faiss_index and add_faiss_index_from_external_arrays by @alvarobartt in #4535
Make DuplicateKeysError more user friendly [For Issue #2556] by @VijayKalmath in #4545
Properly raise FileNotFound even if the dataset is private by @lhoestq in #4536
Fix hashing for python 3.9 by @lhoestq in #4516
[CI] Fix some warnings by @lhoestq in #4547
Validate new_fingerprint passed by user by @lhoestq in #4587
Update CI Windows orb by @albertvillanova in #4604
Perform hidden file check on relative data file path by @mariosasko in #4551
Align more metadata with other repo types (models,spaces) by @julien-c in #4607
Align/fix license metadata info by @julien-c in #4613
Preserve member order by MockDownloadManager.iter_archive by @albertvillanova in #4611
Add authentication tip to load_dataset by @mariosasko in #4577
Stop dropping columns in to_tf_dataset() before we load batches by @Rocketknight1 in #4553
fix(dataset_wrappers): Fixes access to fsspec.asyn in torch_iterable_dataset.py. by @gugarosa in #4630
Fix xisfile, xgetsize, xisdir, xlistdir in private repo by @lhoestq in #4608
Rename master to main by @lhoestq in #4643
Set HF_SCRIPTS_VERSION to main by @lhoestq in #4645
[Minor fix] Typo correction by @cakiki in #4644
fixed duplicate calculation of spearmanr function in metrics wrapper. by @benlipkin in #4627
Generalize meta_path json file creation in load.py [#4540] by @VijayKalmath in #4590
Fix time type _arrow_to_datasets_dtype conversion by @mariosasko in #4628
Fix _resolve_single_pattern_locally on Windows with multiple drives by @albertvillanova in #4660
Replace assertEqual with assertTupleEqual in unit tests for verbosity by @alvarobartt in #4496
Fix embed_storage on features inside lists/sequences by @mariosasko in #4615
Add links to vision tasks scripts in ADD_NEW_DATASET template by @mariosasko in #4512
Transfer CI to GitHub Actions by @albertvillanova in #4659
Fix mock fsspec by @albertvillanova in #4685
Trigger CI also on push to main by @albertvillanova in #4687
Fix ImageFolder with parameters drop_metadata=True and drop_labels=False (when metadata.jsonl is present) by @polinaeterna in #4622
Skip test_extractor only for zstd param if zstandard not installed by @albertvillanova in #4688
Test extractors for all compression formats by @albertvillanova in #4689
Refactor base extractors by @albertvillanova in #4690
Update create dataset card docs by @stevhliu in #4683
Add text decorators by @stevhliu in #4663
Skip tests only for lz4/zstd params if not installed by @albertvillanova in #4704
Ensure ConcatenationTable.cast uses target_schema metadata by @dtuit in #4614
Docs: Fix same-page hashlinks by @mishig25 in #4722
Fix broken link to the Hub by @stevhliu in #4726
Refactor conftest fixtures by @albertvillanova in #4723
Add object detection processing tutorial by @nateraw in #4710
Fix require torchaudio and refactor test requirements by @albertvillanova in https:/...

Contributors

hobson, julien-c, and 21 other contributors


15 Jun 18:08

lhoestq

2.3.2

9f9f0b5

2.3.2

Bug fixes

Fix double dots in data files by @lhoestq in #4505
- fix a bug when /../ is passed to data_files causing FileNotFoundError
fix ETT m1/m2 test/val dataset by @kashif in #4499
Corrected broken links in doc by @clefourrier in #4501

New Contributors

@clefourrier made their first contribution in #4501

Full Changelog: 2.3.1...2.3.2

Contributors

kashif, clefourrier, and lhoestq


15 Jun 11:08

lhoestq

2.3.1

23f37b2

2.3.1

Bug fixes

Fix patching module that doesn't exist by @lhoestq in #4495
- fix bug when importing the lib when scipy is not installed
Re-add download_manager module in utils by @lhoestq in #4497
- fix moved imports of DownloadConfig, DownloadMode, DownloadManager
Support streaming UDHR dataset by @albertvillanova in #4487

Full Changelog: 2.3.0...2.3.1

Contributors

albertvillanova and lhoestq


14 Jun 18:12

lhoestq

2.3.0

c82d4c4

2.3.0

Datasets Changes

New: ImageNet-Sketch by @nateraw in #4301
New: Biwi Kinect Head Pose by @dnaveenr in #3903
New: enwik8 by @HallerPatrick in #4321
New: LCCC dataset by @silverriver in #4416
New: TruthfulQA by @jon-tow in #4159
New: BIG-bench by @andersjohanandreassen in #4125
New: QuickDraw by @mariosasko in #3592
New: SST-2 by @albertvillanova in #4473
Update: imagenet-1k - remove manual download by @mariosasko in #4299
- ImageNet can now be loaded in Python with load_dataset without requiring a manual download!
- It also supports streaming mode with load_dataset("imagenet-1k", streaming=True)
Update: spider - Remove Google Drive URL by @albertvillanova in #4410
Update: blended_skill_talk - add missing columns by @mariosasko in #4437
Update: multi-news - Use newer version with fixes by @JohnGiorgi in #4451
Update: fever - update data URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/44554459
Update: udhr - Add and fix language tags by @albertvillanova in https://github.com/huggingface/datasets/pull/
Update: udhr - update metadata by @leondz in #4362
Update: wider_face - Replace data URLs once hosted on the Hub by @albertvillanova in #4469
Update: PASS - update dataset version by @mariosasko in #4488
Fix: GEM - fix bug in wiki_auto_asset_turk config by @albertvillanova in #4389
Fix: GEM - fix URL for totto config by @albertvillanova in #4396
Fix: timit_asr - fix DuplicatedKeysError by @albertvillanova in #4424
Fix: timit_asr - Make extensions case-insensitive by @albertvillanova in #4425
Fix: timit_asr - Fix directory names for LDC data by @albertvillanova in #4436
Fix: iwslt2017 by @lhoestq in #4481

Dataset Features

to_tf_dataset rewrite by @Rocketknight1 in #4170
- see more in the documentation
Support DataLoader with num_workers > 0 in streaming mode by @lhoestq in #4375
- see more in the documentation
Added stratify option to train_test_split by @nandwalritik in #4322
Re-add support for Apache Beam functionality by @albertvillanova in #4328
Resume push_to_hub: skip identical files in push_to_hub instead of overwriting by @mariosasko in #4402
Support nested/complex feature types as features in packaged loaders by @mariosasko in #4364
Optimize contiguous shard and select by @lhoestq in #4466
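
The stratify option listed above (#4322) keeps label proportions the same in both splits. A plain-Python sketch of the idea follows. `stratified_split` is a hypothetical helper and not the library's implementation.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_split(
    labels: List[int], test_size: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), splitting each label separately."""
    by_label: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Each label contributes the same fraction to the test split.
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = [0] * 8 + [1] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.25)
```

With the library itself, the corresponding call would be along the lines of `ds.train_test_split(test_size=0.25, stratify_by_column="label")`, assuming the column is a `ClassLabel` feature.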

Dataset Cards

Minor fixes/improvements in scene_parse_150 card by @mariosasko in #4447
Tidy up license metadata for google_wellformed_query, newspop, sick by @leondz in #4378
Fix example in opus_ubuntu, Add license info by @leondz in #4360
Update README.md of fquad by @lhoestq in #4450

Documentation

Add API code examples for loading methods by @stevhliu in #4300
Add API code examples for remaining main classes by @stevhliu in #4292
Generalize tutorials for audio and vision by @stevhliu in #4468
[Docs] How to use with PyTorch page by @lhoestq in #4474
First draft of the docs for TF + Datasets by @Rocketknight1 in #4457

Other improvements and bug fixes

Update CI deprecated legacy image by @albertvillanova in #4393
remove int documentation from logging docs by @lvwerra in #4392
Fix docstring in DatasetDict::shuffle by @felixdivo in #4344
Fix Version equality by @albertvillanova in #4359
Set builder name from module instead of class by @albertvillanova in #4388
Test dill by @albertvillanova in #4385
Refactor download by @albertvillanova in #4384
Fix dependency on dill version by @albertvillanova in #4397
Support remote cache_dir by @albertvillanova in #4347
Update imagenet gate by @lhoestq in #4408
Fix dataset builder default version by @albertvillanova in #4356
Uncomment logging deactivation for ArrowBasedBuilder by @thomasw21 in #4403
Rename DatasetBuilder config_name by @albertvillanova in #4414
Fix metadata validation by @albertvillanova in #4390
Add HF.co for PRs/Issues for specific datasets by @lhoestq in #4427
Fix type hint and documentation for new_fingerprint by @fxmarty in #4326
Skip hidden files/directories in data files resolution and iter_files by @mariosasko in #4412
Fix docstring of inspect_dataset by @albertvillanova in #4438
Fix builder docstring by @albertvillanova in #4432
Fix kwargs in docstrings by @albertvillanova in #4444
Fix missing args in docstring of load_dataset_builder by @albertvillanova in #4445
Add missing kwargs to docstrings by @albertvillanova in #4446
Add extractor for bzip2-compressed files by @asivokon in #4421
Fix dummy dataset generation script for handling nested types of _URLs by @silverriver in #4434
Update dataset_infos.json with new split info in Dataset.push_to_hub to avoid verification error by @mariosasko in #4415
Update builder docstring for deprecated/added arguments by @albertvillanova in #4429
Extend support for streaming datasets that use xml.dom.minidom.parse by @albertvillanova in #4464
Fix script fetching and local path handling in inspect_dataset and inspect_metric by @mariosasko in #4433
Fix bigbench config names by @lhoestq in #4465
Fix 401 error for unauthenticated requests to non-existing repos by @lhoestq in #4472
Reorder returned validation/test splits in script template by @albertvillanova in #4470
Better ImportError message when a dataset script dependency is missing by @lhoestq in #4484
Fix cast to null by @lhoestq in #4485
Update _format_columns in remove_columns by @alvarobartt in #4411
Fix wrong map parameter name in cache docs by @h4iku in #4293
Pin the revision in imagenet download links by @lhoestq in #4492
Refactor column mappings for question answering datasets by @lewtun in #4391