Releases · huggingface/datasets
1.12.1
1.12.0
New documentation
- New documentation structure #2718 (@stevhliu):
- New: Tutorials
- New: How-to guides
- New: Conceptual guides
- Update: Reference
See the new documentation here!
Datasets changes
- New: VIVOS dataset for Vietnamese ASR #2780 (@binh234)
- New: The Pile books3 #2801 (@richarddwang)
- New: The Pile stack exchange #2803 (@richarddwang)
- New: The Pile openwebtext2 #2802 (@richarddwang)
- New: Food-101 #2804 (@nateraw)
- New: Beans #2809 (@nateraw)
- New: cedr #2796 (@naumov-al)
- New: cats_vs_dogs #2807 (@nateraw)
- New: MultiEURLEX #2865 (@iliaschalkidis)
- New: BIOSSES #2881 (@bwang482)
- Update: TTC4900 - add download URL #2732 (@yavuzKomecoglu)
- Update: Wikihow - Generate metadata JSON for wikihow dataset #2748 (@albertvillanova)
- Update: lm1b - Generate metadata JSON #2752 (@albertvillanova)
- Update: reclor - Generate metadata JSON #2753 (@albertvillanova)
- Update: telugu_books - Generate metadata JSON #2754 (@albertvillanova)
- Update: SUPERB - Add SD task #2661 (@albertvillanova)
- Update: SUPERB - Add KS task #2783 (@anton-l)
- Update: GooAQ - add train/val/test splits #2792 (@bhavitvyamalik)
- Update: Openwebtext - update size #2857 (@lhoestq)
- Update: timit_asr - make the dataset streamable #2835 (@lhoestq)
- Fix: journalists_questions -fix key by recreating metadata JSON #2744 (@albertvillanova)
- Fix: turkish_movie_sentiment - fix metadata JSON #2755 (@albertvillanova)
- Fix: ubuntu_dialogs_corpus - fix metadata JSON #2756 (@albertvillanova)
- Fix: CNN/DailyMail - typo #2791 (@omaralsayed)
- Fix: linnaeus - fix url #2852 (@lhoestq)
- Fix: ToTTo - fix data URL #2864 (@albertvillanova)
- Fix: wikicorpus - fix keys #2844 (@lhoestq)
- Fix: COUNTER - fix bad file name #2894 (@albertvillanova)
- Fix: DocRED - fix data URLs and metadata #2883 (@albertvillanova)
Datasets features
- Load Dataset from the Hub (NO DATASET SCRIPT) #2662 (@lhoestq)
- Preserve dtype for numpy/torch/tf/jax arrays #2361 (@bhavitvyamalik)
- add multi-proc in `to_json` #2747 (@bhavitvyamalik) (see the sketch after this list)
- Optimize Dataset.filter to only compute the indices to keep #2836 (@lhoestq)
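A minimal sketch of the new script-free Hub loading (#2662 above) and the multi-processed JSON export (#2747). The repository name is hypothetical, and `num_proc` is assumed to be the knob added by #2747:

```python
from datasets import load_dataset

# Load a Hub repository that contains only raw data files (no loading script).
# "username/my-csv-dataset" is a hypothetical repository name.
ds = load_dataset("username/my-csv-dataset", split="train")

# Export to JSON lines using several processes (num_proc assumed per #2747).
ds.to_json("data.jsonl", num_proc=4)
```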
Dataset streaming - better support for compression:
- Fix streaming zip files #2798 (@albertvillanova)
- Support streaming tar files #2800 (@albertvillanova)
- Support streaming compressed files (gzip, bz2, lz4, xz, zst) #2786 (@albertvillanova)
- Fix streaming zip files from canonical datasets #2805 (@albertvillanova)
- Add url prefix convention for many compression formats #2822 (@lhoestq)
- Support streaming datasets that use pathlib #2874 (@albertvillanova)
- Extend support for streaming datasets that use pathlib.Path stem/suffix #2880 (@albertvillanova)
- Extend support for streaming datasets that use pathlib.Path.glob #2876 (@albertvillanova)
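A minimal streaming sketch under these changes; the URL is hypothetical, and gzip here stands in for any of the newly supported compression formats (bz2, lz4, xz, zst, zip, tar):

```python
from datasets import load_dataset

# Stream a remote gzip-compressed JSON-lines file; it is decompressed on the fly.
ds = load_dataset(
    "json",
    data_files="https://example.com/corpus.jsonl.gz",  # hypothetical URL
    split="train",
    streaming=True,
)
print(next(iter(ds)))
```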
Metrics changes
- Update: BERTScore - Add support for fast tokenizer #2770 (@mariosasko)
- Fix: Sacrebleu - Fix sacrebleu tokenizers #2739 #2778 #2779 (@albertvillanova)
Dataset cards
- Updated dataset description of DaNE #2789 (@KennethEnevoldsen)
- Update ELI5 README.md #2848 (@odellus)
General improvements and bug fixes
- Update release instructions #2740 (@albertvillanova)
- Raise ManualDownloadError when loading a dataset that requires previous manual download #2758 (@albertvillanova)
- Allow PyArrow from source #2769 (@patrickvonplaten)
- fix typo (ShuffingConfig -> ShufflingConfig) #2766 (@daleevans)
- Fix typo in test_dataset_common #2790 (@nateraw)
- Fix type hint for data_files #2793 (@albertvillanova)
- Bump tqdm version #2814 (@mariosasko)
- Use packaging to handle versions #2777 (@albertvillanova)
- Tiny typo fixes of "fo" -> "of" #2815 (@aronszanto)
- Rename The Pile subsets #2817 (@lhoestq)
- Fix IndexError by ignoring empty RecordBatch #2834 (@lhoestq)
- Fix defaults in cache_dir docstring in load.py #2824 (@mariosasko)
- Fix extraction protocol inference from urls with params #2843 (@lhoestq)
- Fix caching when moving script #2854 (@lhoestq)
- Fix windows CI CondaError #2855 (@lhoestq)
- fix: 🐛 remove URL's query string only if it's ?dl=1 #2856 (@severo)
- Update `column_names` shown as `:func:` in exploring.rst #2851 (@ClementRomac)
- Fix s3fs version in CI #2858 (@lhoestq)
- Fix three typos in two files for documentation #2870 (@leny-mi)
- Move checks from _map_single to map #2660 (@mariosasko)
- fix regex to accept negative timezone #2847 (@jadermcs)
- Prevent .map from using multiprocessing when loading from cache #2774 (@thomasw21)
- Fix null sequence encoding #2900 (@lhoestq)
1.11.0
Datasets Changes
- New: Add Russian SuperGLUE #2668 (@slowwavesleep)
- New: Add Disfl-QA #2473 (@bhavitvyamalik)
- New: Add TimeDial #2476 (@bhavitvyamalik)
- Fix: Enumerate all ner_tags values in WNUT 17 dataset #2713 (@albertvillanova)
- Fix: Update WikiANN data URL #2710 (@albertvillanova)
- Fix: Update PAN-X data URL in XTREME dataset #2715 (@albertvillanova)
- Fix: C4 - en subset by modifying dataset_info with correct validation infos #2723 (@thomasw21)
General improvements and bug fixes
- fix: 🐛 change string format to allow copy/paste to work in bash #2694 (@severo)
- Update BibTeX entry #2706 (@albertvillanova)
- Print absolute local paths in load_dataset error messages #2684 (@mariosasko)
- Add support for disable_progress_bar on Windows #2696 (@mariosasko)
- Ignore empty batch when writing #2698 (@pcuenca)
- Fix shuffle on IterableDataset that disables batching in case any functions were mapped #2717 (@amankhandelia)
- fix: 🐛 fix two typos #2720 (@severo)
- Docs details #2690 (@severo)
- Deal with the bad check in test_load.py #2721 (@mariosasko)
- Pass use_auth_token to request_etags #2725 (@albertvillanova)
- Typo fix `tokenize_exemple` #2726 (@shabie)
- Fix IndexError while loading Arabic Billion Words dataset #2729 (@albertvillanova)
- Add missing parquet known extension #2733 (@lhoestq)
1.10.2
1.10.1
1.10.0
Datasets Features
- Support remote data files #2616 (@albertvillanova)
This allows you to pass URLs of remote data files to any dataset loader:
`load_dataset("csv", data_files={"train": [url_to_one_csv_file, url_to_another_csv_file, ...]})`
This works for all these dataset loaders:
- text
- csv
- json
- parquet
- pandas
- Streaming from remote text/json/csv/parquet/pandas files:
When you pass URLs to a dataset loader, you can enable streaming mode with `streaming=True` (see the sketch after this list).
Main contributions:
- Faster search_batch for ElasticsearchIndex due to threading #2581 (@mwrzalik)
- Delete extracted files when loading dataset #2631 (@albertvillanova)
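A sketch of loading remote data files (#2616), with and without streaming; the URLs are hypothetical:

```python
from datasets import load_dataset

data_files = {"train": [
    "https://example.com/part1.csv",  # hypothetical URLs
    "https://example.com/part2.csv",
]}

# Regular loading: the files are downloaded and cached.
ds = load_dataset("csv", data_files=data_files, split="train")

# Streaming: rows are read over HTTP as you iterate.
streamed = load_dataset("csv", data_files=data_files, split="train", streaming=True)
```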
Datasets Changes
- Fix: C4 - fix expected files list #2682 (@lhoestq)
- Fix: SQuAD - fix misalignment #2586 (@albertvillanova)
- Fix: omp - fix DuplicatedKeysError #2603 (@albertvillanova)
- Fix: wi_locness - potential DuplicatedKeysError #2609 (@albertvillanova)
- Fix: LibriSpeech - potential DuplicatedKeysError #2672 (@albertvillanova)
- Fix: SQuAD - potential DuplicatedKeysError #2673 (@albertvillanova)
- Fix: Blog Authorship Corpus - fix split sizes and text encoding #2685 (@albertvillanova)
Dataset Tasks
- Add speech processing tasks #2620 (@lewtun)
- Update ASR tags #2633 (@lewtun)
- Inject ASR template for lj_speech dataset #2634 (@albertvillanova)
- Add ASR task for SUPERB #2619 (@lewtun)
- add image-classification task template #2632 (@nateraw)
Metrics Changes
- New: wiki_split #2623 (@bhadreshpsavani)
- Update: accuracy,f1,precision,recall - Support multilabel metrics #2589 (@albertvillanova)
- Fix: sacrebleu - fix parameter name #2674 (@albertvillanova)
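A hedged sketch of the new multilabel support (#2589), assuming the `multilabel` config name and scikit-learn-style binary indicator inputs:

```python
from datasets import load_metric

# Assumed usage: the "multilabel" config plus indicator arrays per example.
f1 = load_metric("f1", "multilabel")
result = f1.compute(
    predictions=[[0, 1, 1], [1, 0, 0]],
    references=[[0, 1, 0], [1, 0, 0]],
    average="macro",  # an averaging strategy is required for multilabel inputs
)
print(result)
```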
General improvements and bug fixes
- Fix BibTeX entry #2594 (@albertvillanova)
- Fix test_is_small_dataset #2588 (@albertvillanova)
- Remove import of transformers #2602 (@albertvillanova)
- Make any ClientError trigger retry in streaming mode (e.g. ClientOSError) #2605 (@lhoestq)
- Fix `filter` with multiprocessing in case all samples are discarded #2601 (@mxschmdt)
- Remove redundant prepare_module #2597 (@albertvillanova)
- Create ExtractManager #2295 (@albertvillanova)
- Return Python float instead of numpy.float64 in sklearn metrics #2612 (@lewtun)
- Use ndarray.item instead of ndarray.tolist #2613 (@lewtun)
- Convert numpy scalar to python float in Pearsonr output #2614 (@lhoestq)
- Fix missing EOL issue in to_json for old versions of pandas #2617 (@lhoestq)
- Use correct logger in metrics.py #2626 (@mariosasko)
- Minor fix tests with Windows paths #2627 (@albertvillanova)
- Use ETag of remote data files #2628 (@albertvillanova)
- More consistent naming #2611 (@mariosasko)
- Refactor patching to specific submodule #2639 (@albertvillanova)
- Fix docstrings #2640 (@albertvillanova)
- Fix anchor in README #2647 (@mariosasko)
- Fix logging docstring #2652 (@mariosasko)
- Allow dataset config kwargs to be None #2659 (@lhoestq)
- Use prefix to allow exceed Windows MAX_PATH #2621 (@albertvillanova)
- Use tqdm from tqdm_utils #2667 (@mariosasko)
- Increase json reader block_size automatically #2676 (@lhoestq)
- Parallelize ETag requests #2675 (@lhoestq)
- Fix bad config ids that name cache directories #2686 (@lhoestq)
- Minor documentation fix #2687 (@slowwavesleep)
Dataset Cards
- Add missing WikiANN language tags #2610 (@albertvillanova)
- feat: 🎸 add paperswithcode id for qasper dataset #2680 (@severo)
Docs
- Update processing.rst with other export formats #2599 (@TevenLeScao)
1.9.0
Datasets Changes
- New: C4 #2575 #2592 (@lhoestq)
- New: mC4 #2576 (@lhoestq)
- New: MasakhaNER #2465 (@dadelani)
- New: Eduge #2492 (@enod)
- Update: xor_tydi_qa - update version #2455 (@cccntu)
- Update: kilt-TriviaQA - original answers #2410 (@PaulLerner)
- Update: udpos - change features structure #2466 (@JerryIsHere)
- Update: WebNLG - update checksums #2558 (@lhoestq)
- Fix: climate fever - adjusting indexing for the labels. #2464 (@drugilsberg)
- Fix: proto_qa - fix download link #2463 (@mariosasko)
- Fix: ProductReviews - fix label parsing #2530 (@yavuzKomecoglu)
- Fix: DROP - fix DuplicatedKeysError #2545 (@albertvillanova)
- Fix: code_search_net - fix keys #2555 (@lhoestq)
- Fix: discofuse - fix link cc #2541 (@VictorSanh)
- Fix: fever - fix keys #2557 (@lhoestq)
Datasets Features
- Dataset Streaming #2375 #2582 (@lhoestq)
- Fast download and process your data on-the-fly when iterating over your dataset
- Works with huge datasets like OSCAR, C4, mC4 and hundreds of other datasets
- JAX integration #2502 (@lhoestq)
- Add Parquet loader + from_parquet and to_parquet #2537 (@lhoestq)
- Implement ClassLabel encoding in JSON loader #2468 (@albertvillanova)
- Set configurable downloaded datasets path #2488 (@albertvillanova)
- Set configurable extracted datasets path #2487 (@albertvillanova)
- Add align_labels_with_mapping function #2457 (@lewtun) #2510 (@lhoestq)
- Add interleave_datasets for map-style datasets #2568 (@lhoestq)
- Add load_dataset_builder #2500 (@mariosasko)
- Support Zstandard compressed files #2578 (@albertvillanova)
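A minimal sketch of streaming (#2375) and interleaving (#2568). The OSCAR config name follows its dataset card, so treat it as an assumption:

```python
from datasets import load_dataset, interleave_datasets

# Stream a huge corpus without downloading it entirely.
oscar = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)
print(next(iter(oscar)))

# interleave_datasets now also works for map-style (non-streaming) datasets.
ds1 = load_dataset("squad", split="train[:100]")
ds2 = load_dataset("squad", split="train[100:200]")
mixed = interleave_datasets([ds1, ds2])
```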
Task templates
- Add task templates for tydiqa and xquad #2518 (@lewtun)
- Insert text classification template for Emotion dataset #2521 (@lewtun)
- Add summarization template #2529 (@lewtun)
- Add task template for automatic speech recognition #2533 (@lewtun)
- Remove task templates if required features are removed during `Dataset.map` #2540 (@lewtun)
- Inject templates for ASR datasets #2565 (@lewtun)
General improvements and bug fixes
- Allow to use tqdm>=4.50.0 #2482 (@lhoestq)
- Use gc.collect only when needed to avoid slow downs #2483 (@lhoestq)
- Allow latest pyarrow version #2490 (@albertvillanova)
- Use default cast for sliced list arrays if pyarrow >= 4 #2497 (@albertvillanova)
- Add Zenodo metadata file with license #2501 (@albertvillanova)
- add tensorflow-macos support #2493 (@slayerjain)
- Keep original features order #2453 (@albertvillanova)
- Add course banner #2506 (@sgugger)
- Rearrange JSON field names to match passed features schema field names #2507 (@albertvillanova)
- Fix typo in MatthewsCorrelation class name #2517 (@albertvillanova)
- Use scikit-learn package rather than sklearn in setup.py #2525 (@lesteve)
- Improve performance of pandas arrow extractor #2519 (@albertvillanova)
- Fix fingerprint when moving cache dir #2509 (@lhoestq)
- Replace bad `n>1M` size tag #2527 (@lhoestq)
- Fix dev version #2531 (@lhoestq)
- Sync with transformers disabling NOTSET #2534 (@albertvillanova)
- Fix logging levels #2544 (@albertvillanova)
- Add support for Split.ALL #2259 (@mariosasko)
- Raise FileNotFoundError in WindowsFileLock #2524 (@mariosasko)
- Make numpy arrow extractor faster #2505 (@lhoestq)
- fix Dataset.map when num_procs > num rows #2566 (@connor-mccarthy)
- Add ASR task and new languages to resources #2567 (@lewtun)
- Filter expected warning log from transformers #2571 (@albertvillanova)
- Fix BibTeX entry #2579 (@albertvillanova)
- Fix Counter import #2580 (@albertvillanova)
- Add aiohttp to tests extras require #2587 (@albertvillanova)
- Add language tags #2590 (@lewtun)
- Support pandas 1.3.0 read_csv #2593 (@lhoestq)
Dataset cards
- Updated Dataset Description #2420 (@binny-mathew)
- Update DatasetMetadata and ReadMe #2436 (@gchhablani)
- CRD3 dataset card #2515 (@wilsonyhlee)
- Add license to the Cambridge English Write & Improve + LOCNESS dataset card #2546 (@lhoestq)
- wi_locness: reference latest leaderboard on codalab #2584 (@aseifert)
Docs
- no s at load_datasets #2479 (@julien-c)
- Fix docs custom stable version #2477 (@albertvillanova)
- Improve Features docs #2535 (@albertvillanova)
- Update README.md #2414 (@cryoff)
- Fix FileSystems documentation #2551 (@connor-mccarthy)
- Minor fix in loading metrics docs #2562 (@albertvillanova)
- Minor fix docs format for bertscore #2570 (@albertvillanova)
- Add streaming in load a dataset docs #2574 (@lhoestq)
1.8.0
Datasets Changes
- New: Microsoft CodeXGlue Datasets #2357 (@madlag @ncoop57)
- New: KLUE benchmark #2416 (@jungwhank)
- New: HendrycksTest #2370 (@andyzoujm)
- Update: xor_tydi_qa - update url to v1.1 #2449 (@cccntu)
- Fix: adversarial_qa - DuplicatedKeysError #2433 (@mariosasko)
- Fix: bn_hate_speech and covid_tweets_japanese - fix broken URLs for #2445 (@lewtun)
- Fix: flores - fix download link #2448 (@mariosasko)
Datasets Features
- Add `desc` parameter in `map` for `DatasetDict` object #2423 (@bhavitvyamalik)
- Support sliced list arrays in cast #2461 (@lhoestq)
`Dataset.cast` can now change the feature types of Sequence fields
- Revert default in-memory for small datasets #2460 (@albertvillanova) Breaking:
- we used to set the datasets config value IN_MEMORY_MAX_SIZE to 250MB by default, so small datasets were copied in memory
- we changed this to zero: by default, datasets are loaded from disk with memory mapping and are not copied in memory
- users can still set `keep_in_memory=True` when loading a dataset to load it in memory (see the sketch after this list)
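A sketch of the new default and the opt-out, using a small public dataset for illustration:

```python
from datasets import load_dataset

# New default: the dataset is memory-mapped from its Arrow cache on disk.
ds = load_dataset("imdb", split="train")

# Opt back in to fully in-memory loading.
ds_mem = load_dataset("imdb", split="train", keep_in_memory=True)
```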
Datasets Cards
- adds license information for DailyDialog. #2419 (@aditya2211)
- add english language tags for ~100 datasets #2442 (@VictorSanh)
- Add copyright info to MLSUM dataset #2427 (@PhilipMay)
- Add copyright info for wiki_lingua dataset #2428 (@PhilipMay)
- Mention that there are no answers in adversarial_qa test set #2451 (@lhoestq)
General improvements and bug fixes
- Add DOI badge to README #2411 (@albertvillanova)
- Make datasets PEP-561 compliant #2417 (@SBrandeis)
- Fix save_to_disk nested features order in dataset_info.json #2422 (@lhoestq)
- Fix CI six installation on linux #2432 (@lhoestq)
- Fix Docstring Mistake: dataset vs. metric #2425 (@PhilipMay)
- Fix NQ features loading: reorder fields of features to match nested fields order in arrow data #2438 (@lhoestq)
- doc: fix typo HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2421 (@borisdayma)
- add utf-8 while reading README #2418 (@bhavitvyamalik)
- Better error message when trying to access elements of a DatasetDict without specifying the split #2439 (@lhoestq)
- Rename config and environment variable for in memory max size #2454 (@albertvillanova)
- Add version-specific BibTeX #2430 (@albertvillanova)
- Fix cross-reference typos in documentation #2456 (@albertvillanova)
- Better error message when using the wrong load_from_disk #2437 (@lhoestq)
Experimental and work in progress: Format a dataset for specific tasks
1.7.0
Dataset Changes
- New: NLU evaluation data #2238 (@dkajtoch)
- New: Add SLR32, SLR52, SLR53 to OpenSLR #2241, #2311 (@cahya-wirawan)
- New: Bbaw egyptian #2290 (@phiwi)
- New: GooAQ #2260 (@bhavitvyamalik)
- New: SubjQA #2302 (@lewtun)
- New: Ascent KB #2341, #2349 (@phongnt570)
- New: HLGD #2325 (@tingofurro)
- New: Qasper #2346 (@cceyda)
- New: ConvQuestions benchmark #2372 (@PhilippChr)
- Update: Wikihow - Clarify how to load wikihow #2240 (@albertvillanova)
- Update multi_woz_v22 - update checksum #2281 (@lhoestq)
- Update: OSCAR - Set encoding in OSCAR dataset #2321 (@albertvillanova)
- Update: XTREME - Enable auto-download for PAN-X / Wikiann domain in XTREME #2326 (@lewtun)
- Update: GEM - the DART file checksums in GEM #2334 (@yjernite)
- Update: web_science - fixed download link #2338 (@bhavitvyamalik)
- Update: SNLI, MNLI- README updated for SNLI, MNLI #2364 (@bhavitvyamalik)
- Update: conll2003 - correct labels #2369 (@philschmid)
- Update: offenseval_dravidian - update citations #2385 (@adeepH)
- Update: ai2_arc - Add dataset tags #2405 (@OyvindTafjord)
- Fix: newsph_nli - test data added, dataset_infos updated #2263 (@bhavitvyamalik)
- Fix: hyperpartisan news detection - Remove getchildren #2367 (@ghomasHudson)
- Fix: indic_glue - Fix number of classes in indic_glue sna.bn dataset #2397 (@albertvillanova)
- Fix: head_qa - Fix keys #2408 (@lhoestq)
Dataset Features
- Implement Dataset add_item #1870 (@albertvillanova)
- Implement Dataset add_column #2145 (@albertvillanova)
- Implement Dataset to JSON #2248, #2352 (@albertvillanova)
- Add `rename_columns` method #2312 (@SBrandeis)
- add `desc` to `tqdm` in `Dataset.map()` #2374 (@bhavitvyamalik) (see the sketch after this list)
- Add env variable HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2399, #2409 (@albertvillanova)
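A short sketch exercising the new Dataset methods from this release; file and column names are placeholders:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["foo", "bar"]})
ds = ds.add_column("label", [0, 1])              # add_column (#2145)
ds = ds.add_item({"text": "baz", "label": 1})    # add_item (#1870)
ds.to_json("dump.jsonl")                         # Dataset -> JSON (#2248)
ds = ds.map(lambda ex: ex, desc="identity map")  # desc shows on the tqdm bar (#2374)
```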
Metric Changes
- New: CUAD metrics #2273 (@bhavitvyamalik)
- New: Matthews/Pearson/Spearman correlation metrics #2328 (@lhoestq)
- Update: CER - Docs, CER above 1 #2342 (@borisdayma)
General improvements and bug fixes
- Update black #2265 (@lhoestq)
- Fix incorrect update_metadata_with_features calls in ArrowDataset #2258 (@mariosasko)
- Faster map w/ input_columns & faster slicing w/ Iterable keys #2246 (@norabelrose)
- Don't use pyarrow 4.0.0 since it segfaults when casting a sliced ListArray of integers #2268 (@lhoestq)
- Fix query table with iterable #2269 (@lhoestq)
- Perform minor refactoring: use config #2253 (@albertvillanova)
- Update format, fingerprint and indices after add_item #2254 (@lhoestq)
- Always update metadata in arrow schema #2274 (@lhoestq)
- Make tests run faster #2266 (@lhoestq)
- Fix metadata validation with config names #2286 (@lhoestq)
- Fixed typo seperate->separate #2292 (@laksh9950)
- Allow collaborators to self-assign issues #2289 (@albertvillanova)
- Mapping in the distributed setting #2298 (@TevenLeScao)
- Fix conda release #2309 (@lhoestq)
- Fix incorrect version specification for the pyarrow package #2317 (@cemilcengiz)
- Set default name in init_dynamic_modules #2320 (@albertvillanova)
- Fix duplicate keys #2333 (@lhoestq)
- Add note about indices mapping in save_to_disk docstring #2332 (@lhoestq)
- Metadata validation #2107 (@theo-m)
- Add Validation For README #2121 (@gchhablani)
- Fix overflow issue in interpolation search #2336 (@mariosasko)
- Datasets cli improvements #2315 (@mariosasko)
- Add `key` type and duplicates verification with hashing #2245 (@NikhilBartwal)
- More consistent copy logic #2340 (@mariosasko)
- Update README validation rules #2353 (@gchhablani)
- normalized TOCs and titles in data cards #2355 (@yjernite)
- simplify faiss index save #2351 (@Guitaricet)
- Allow "other-X" in licenses #2368 (@gchhablani)
- Improve ReadInstruction logic and update docs #2261 (@mariosasko)
- Disallow duplicate keys in yaml tags #2379 (@lhoestq)
- maintain YAML structure reading from README #2380 (@bhavitvyamalik)
- add dataset card title #2381 (@bhavitvyamalik)
- Add tests for dataset cards #2348 (@gchhablani)
- Improve example in rounding docs #2383 (@mariosasko)
- Paperswithcode dataset mapping #2404 (@julien-c)
- Free datasets with cache file in temp dir on exit #2403 (@mariosasko)
Experimental and work in progress: Format a dataset for specific tasks
- Task formatting for text classification & question answering #2255 (@SBrandeis)
- Add check for task templates on dataset load #2390 (@lewtun)
- Add args description to DatasetInfo #2384 (@lewtun)
- Improve task api code quality #2376 (@mariosasko)
1.6.2
Fix memory issue: don't copy recordbatches in memory during a table deepcopy #2291 (@lhoestq)
This affected methods like concatenate_datasets, multiprocessed map and load_from_disk.
Breaking change:
- when using `Dataset.map` with the `input_columns` parameter, the resulting dataset will only have the columns from `input_columns` and the columns added by the map function. The other columns are discarded (see the sketch below).
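A minimal sketch of the breaking behavior; column and function names are illustrative:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "bb"], "meta": [1, 2]})

# With input_columns, the function receives the column values as positional args.
out = ds.map(lambda text: {"length": len(text)}, input_columns=["text"])

# Only input_columns plus the mapped outputs survive: 'meta' is discarded.
print(out.column_names)  # ['text', 'length']
```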