Releases · huggingface/datasets
1.12.1
1.12.0
New documentation
- New documentation structure #2718 (@stevhliu):
- New: Tutorials
- New: How-to guides
- New: Conceptual guides
- Update: Reference
See the new documentation here!
Datasets changes
- New: VIVOS dataset for Vietnamese ASR #2780 (@binh234)
- New: The Pile books3 #2801 (@richarddwang)
- New: The Pile stack exchange #2803 (@richarddwang)
- New: The Pile openwebtext2 #2802 (@richarddwang)
- New: Food-101 #2804 (@nateraw)
- New: Beans #2809 (@nateraw)
- New: cedr #2796 (@naumov-al)
- New: cats_vs_dogs #2807 (@nateraw)
- New: MultiEURLEX #2865 (@iliaschalkidis)
- New: BIOSSES #2881 (@bwang482)
- Update: TTC4900 - add download URL #2732 (@yavuzKomecoglu)
- Update: Wikihow - Generate metadata JSON for wikihow dataset #2748 (@albertvillanova)
- Update: lm1b - Generate metadata JSON #2752 (@albertvillanova)
- Update: reclor - Generate metadata JSON #2753 (@albertvillanova)
- Update: telugu_books - Generate metadata JSON #2754 (@albertvillanova)
- Update: SUPERB - Add SD task #2661 (@albertvillanova)
- Update: SUPERB - Add KS task #2783 (@anton-l)
- Update: GooAQ - add train/val/test splits #2792 (@bhavitvyamalik)
- Update: Openwebtext - update size #2857 (@lhoestq)
- Update: timit_asr - make the dataset streamable #2835 (@lhoestq)
- Fix: journalists_questions -fix key by recreating metadata JSON #2744 (@albertvillanova)
- Fix: turkish_movie_sentiment - fix metadata JSON #2755 (@albertvillanova)
- Fix: ubuntu_dialogs_corpus - fix metadata JSON #2756 (@albertvillanova)
- Fix: CNN/DailyMail - typo #2791 (@omaralsayed)
- Fix: linnaeus - fix url #2852 (@lhoestq)
- Fix: ToTTo - fix data URL #2864 (@albertvillanova)
- Fix: wikicorpus - fix keys #2844 (@lhoestq)
- Fix: COUNTER - fix bad file name #2894 (@albertvillanova)
- Fix: DocRED - fix data URLs and metadata #2883 (@albertvillanova)
Datasets features
- Load Dataset from the Hub (NO DATASET SCRIPT) #2662 (@lhoestq)
- Preserve dtype for numpy/torch/tf/jax arrays #2361 (@bhavitvyamalik)
- add multi-proc in `to_json` #2747 (@bhavitvyamalik) (see the sketch after this list)
- Optimize Dataset.filter to only compute the indices to keep #2836 (@lhoestq)
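A minimal sketch of the new script-free Hub loading (#2662 above) and the multi-processed JSON export (#2747). The repository name is hypothetical, and `num_proc` is assumed to be the knob added by #2747:

```python
from datasets import load_dataset

# Load a Hub repository that contains only raw data files (no loading script).
# "username/my-csv-dataset" is a hypothetical repository name.
ds = load_dataset("username/my-csv-dataset", split="train")

# Export to JSON lines using several processes (num_proc assumed per #2747).
ds.to_json("data.jsonl", num_proc=4)
```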
Dataset streaming - better support for compression:
- Fix streaming zip files #2798 (@albertvillanova)
- Support streaming tar files #2800 (@albertvillanova)
- Support streaming compressed files (gzip, bz2, lz4, xz, zst) #2786 (@albertvillanova)
- Fix streaming zip files from canonical datasets #2805 (@albertvillanova)
- Add url prefix convention for many compression formats #2822 (@lhoestq)
- Support streaming datasets that use pathlib #2874 (@albertvillanova)
- Extend support for streaming datasets that use pathlib.Path stem/suffix #2880 (@albertvillanova)
- Extend support for streaming datasets that use pathlib.Path.glob #2876 (@albertvillanova)
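A minimal streaming sketch under these changes; the URL is hypothetical, and gzip here stands in for any of the newly supported compression formats (bz2, lz4, xz, zst, zip, tar):

```python
from datasets import load_dataset

# Stream a remote gzip-compressed JSON-lines file; it is decompressed on the fly.
ds = load_dataset(
    "json",
    data_files="https://example.com/corpus.jsonl.gz",  # hypothetical URL
    split="train",
    streaming=True,
)
print(next(iter(ds)))
```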
Metrics changes
- Update: BERTScore - Add support for fast tokenizer #2770 (@mariosasko)
- Fix: Sacrebleu - Fix sacrebleu tokenizers #2739 #2778 #2779 (@albertvillanova)
Dataset cards
- Updated dataset description of DaNE #2789 (@KennethEnevoldsen)
- Update ELI5 README.md #2848 (@odellus)
General improvements and bug fixes
- Update release instructions #2740 (@albertvillanova)
- Raise ManualDownloadError when loading a dataset that requires previous manual download #2758 (@albertvillanova)
- Allow PyArrow from source #2769 (@patrickvonplaten)
- fix typo (ShuffingConfig -> ShufflingConfig) #2766 (@daleevans)
- Fix typo in test_dataset_common #2790 (@nateraw)
- Fix type hint for data_files #2793 (@albertvillanova)
- Bump tqdm version #2814 (@mariosasko)
- Use packaging to handle versions #2777 (@albertvillanova)
- Tiny typo fixes of "fo" -> "of" #2815 (@aronszanto)
- Rename The Pile subsets #2817 (@lhoestq)
- Fix IndexError by ignoring empty RecordBatch #2834 (@lhoestq)
- Fix defaults in cache_dir docstring in load.py #2824 (@mariosasko)
- Fix extraction protocol inference from urls with params #2843 (@lhoestq)
- Fix caching when moving script #2854 (@lhoestq)
- Fix windows CI CondaError #2855 (@lhoestq)
- fix: 🐛 remove URL's query string only if it's ?dl=1 #2856 (@severo)
- Update `column_names` shown as `:func:` in exploring.rst #2851 (@ClementRomac)
- Fix s3fs version in CI #2858 (@lhoestq)
- Fix three typos in two files for documentation #2870 (@leny-mi)
- Move checks from _map_single to map #2660 (@mariosasko)
- fix regex to accept negative timezone #2847 (@jadermcs)
- Prevent .map from using multiprocessing when loading from cache #2774 (@thomasw21)
- Fix null sequence encoding #2900 (@lhoestq)
1.11.0
Datasets Changes
- New: Add Russian SuperGLUE #2668 (@slowwavesleep)
- New: Add Disfl-QA #2473 (@bhavitvyamalik)
- New: Add TimeDial #2476 (@bhavitvyamalik)
- Fix: Enumerate all ner_tags values in WNUT 17 dataset #2713 (@albertvillanova)
- Fix: Update WikiANN data URL #2710 (@albertvillanova)
- Fix: Update PAN-X data URL in XTREME dataset #2715 (@albertvillanova)
- Fix: C4 - en subset by modifying dataset_info with correct validation infos #2723 (@thomasw21)
General improvements and bug fixes
- fix: 🐛 change string format to allow copy/paste to work in bash #2694 (@severo)
- Update BibTeX entry #2706 (@albertvillanova)
- Print absolute local paths in load_dataset error messages #2684 (@mariosasko)
- Add support for disable_progress_bar on Windows #2696 (@mariosasko)
- Ignore empty batch when writing #2698 (@pcuenca)
- Fix shuffle on IterableDataset that disables batching in case any functions were mapped #2717 (@amankhandelia)
- fix: 🐛 fix two typos #2720 (@severo)
- Docs details #2690 (@severo)
- Deal with the bad check in test_load.py #2721 (@mariosasko)
- Pass use_auth_token to request_etags #2725 (@albertvillanova)
- Typo fix `tokenize_exemple` #2726 (@shabie)
- Fix IndexError while loading Arabic Billion Words dataset #2729 (@albertvillanova)
- Add missing parquet known extension #2733 (@lhoestq)
1.10.2
1.10.1
1.10.0
Datasets Features
- Support remote data files #2616 (@albertvillanova)
This allows you to pass URLs of remote data files to any dataset loader:
`load_dataset("csv", data_files={"train": [url_to_one_csv_file, url_to_another_csv_file, ...]})`
This works for all these dataset loaders:
- text
- csv
- json
- parquet
- pandas
- Streaming from remote text/json/csv/parquet/pandas files:
When you pass URLs to a dataset loader, you can enable streaming mode with `streaming=True` (see the sketch after this list).
Main contributions:
- Faster search_batch for ElasticsearchIndex due to threading #2581 (@mwrzalik)
- Delete extracted files when loading dataset #2631 (@albertvillanova)
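A sketch of loading remote data files (#2616), with and without streaming; the URLs are hypothetical:

```python
from datasets import load_dataset

data_files = {"train": [
    "https://example.com/part1.csv",  # hypothetical URLs
    "https://example.com/part2.csv",
]}

# Regular loading: the files are downloaded and cached.
ds = load_dataset("csv", data_files=data_files, split="train")

# Streaming: rows are read over HTTP as you iterate.
streamed = load_dataset("csv", data_files=data_files, split="train", streaming=True)
```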
Datasets Changes
- Fix: C4 - fix expected files list #2682 (@lhoestq)
- Fix: SQuAD - fix misalignment #2586 (@albertvillanova)
- Fix: omp - fix DuplicatedKeysError #2603 (@albertvillanova)
- Fix: wi_locness - potential DuplicatedKeysError #2609 (@albertvillanova)
- Fix: LibriSpeech - potential DuplicatedKeysError #2672 (@albertvillanova)
- Fix: SQuAD - potential DuplicatedKeysError #2673 (@albertvillanova)
- Fix: Blog Authorship Corpus - fix split sizes and text encoding #2685 (@albertvillanova)
Dataset Tasks
- Add speech processing tasks #2620 (@lewtun)
- Update ASR tags #2633 (@lewtun)
- Inject ASR template for lj_speech dataset #2634 (@albertvillanova)
- Add ASR task for SUPERB #2619 (@lewtun)
- add image-classification task template #2632 (@nateraw)
Metrics Changes
- New: wiki_split #2623 (@bhadreshpsavani)
- Update: accuracy,f1,precision,recall - Support multilabel metrics #2589 (@albertvillanova)
- Fix: sacrebleu - fix parameter name #2674 (@albertvillanova)
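A hedged sketch of the new multilabel support (#2589), assuming the `multilabel` config name and scikit-learn-style binary indicator inputs:

```python
from datasets import load_metric

# Assumed usage: the "multilabel" config plus indicator arrays per example.
f1 = load_metric("f1", "multilabel")
result = f1.compute(
    predictions=[[0, 1, 1], [1, 0, 0]],
    references=[[0, 1, 0], [1, 0, 0]],
    average="macro",  # an averaging strategy is required for multilabel inputs
)
print(result)
```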
General improvements and bug fixes
- Fix BibTeX entry #2594 (@albertvillanova)
- Fix test_is_small_dataset #2588 (@albertvillanova)
- Remove import of transformers #2602 (@albertvillanova)
- Make any ClientError trigger retry in streaming mode (e.g. ClientOSError) #2605 (@lhoestq)
- Fix `filter` with multiprocessing in case all samples are discarded #2601 (@mxschmdt)
- Remove redundant prepare_module #2597 (@albertvillanova)
- Create ExtractManager #2295 (@albertvillanova)
- Return Python float instead of numpy.float64 in sklearn metrics #2612 (@lewtun)
- Use ndarray.item instead of ndarray.tolist #2613 (@lewtun)
- Convert numpy scalar to python float in Pearsonr output #2614 (@lhoestq)
- Fix missing EOL issue in to_json for old versions of pandas #2617 (@lhoestq)
- Use correct logger in metrics.py #2626 (@mariosasko)
- Minor fix tests with Windows paths #2627 (@albertvillanova)
- Use ETag of remote data files #2628 (@albertvillanova)
- More consistent naming #2611 (@mariosasko)
- Refactor patching to specific submodule #2639 (@albertvillanova)
- Fix docstrings #2640 (@albertvillanova)
- Fix anchor in README #2647 (@mariosasko)
- Fix logging docstring #2652 (@mariosasko)
- Allow dataset config kwargs to be None #2659 (@lhoestq)
- Use prefix to allow exceed Windows MAX_PATH #2621 (@albertvillanova)
- Use tqdm from tqdm_utils #2667 (@mariosasko)
- Increase json reader block_size automatically #2676 (@lhoestq)
- Parallelize ETag requests #2675 (@lhoestq)
- Fix bad config ids that name cache directories #2686 (@lhoestq)
- Minor documentation fix #2687 (@slowwavesleep)
Dataset Cards
- Add missing WikiANN language tags #2610 (@albertvillanova)
- feat: 🎸 add paperswithcode id for qasper dataset #2680 (@severo)
Docs
- Update processing.rst with other export formats #2599 (@TevenLeScao)
1.9.0
Datasets Changes
- New: C4 #2575 #2592 (@lhoestq)
- New: mC4 #2576 (@lhoestq)
- New: MasakhaNER #2465 (@dadelani)
- New: Eduge #2492 (@enod)
- Update: xor_tydi_qa - update version #2455 (@cccntu)
- Update: kilt-TriviaQA - original answers #2410 (@PaulLerner)
- Update: udpos - change features structure #2466 (@JerryIsHere)
- Update: WebNLG - update checksums #2558 (@lhoestq)
- Fix: climate fever - adjusting indexing for the labels. #2464 (@drugilsberg)
- Fix: proto_qa - fix download link #2463 (@mariosasko)
- Fix: ProductReviews - fix label parsing #2530 (@yavuzKomecoglu)
- Fix: DROP - fix DuplicatedKeysError #2545 (@albertvillanova)
- Fix: code_search_net - fix keys #2555 (@lhoestq)
- Fix: discofuse - fix link cc #2541 (@VictorSanh)
- Fix: fever - fix keys #2557 (@lhoestq)
Datasets Features
- Dataset Streaming #2375 #2582 (@lhoestq)
- Fast download and process your data on-the-fly when iterating over your dataset
- Works with huge datasets like OSCAR, C4, mC4 and hundreds of other datasets
- JAX integration #2502 (@lhoestq)
- Add Parquet loader + from_parquet and to_parquet #2537 (@lhoestq)
- Implement ClassLabel encoding in JSON loader #2468 (@albertvillanova)
- Set configurable downloaded datasets path #2488 (@albertvillanova)
- Set configurable extracted datasets path #2487 (@albertvillanova)
- Add align_labels_with_mapping function #2457 (@lewtun) #2510 (@lhoestq)
- Add interleave_datasets for map-style datasets #2568 (@lhoestq)
- Add load_dataset_builder #2500 (@mariosasko)
- Support Zstandard compressed files #2578 (@albertvillanova)
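A minimal sketch of streaming (#2375) and interleaving (#2568). The OSCAR config name follows its dataset card, so treat it as an assumption:

```python
from datasets import load_dataset, interleave_datasets

# Stream a huge corpus without downloading it entirely.
oscar = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)
print(next(iter(oscar)))

# interleave_datasets now also works for map-style (non-streaming) datasets.
ds1 = load_dataset("squad", split="train[:100]")
ds2 = load_dataset("squad", split="train[100:200]")
mixed = interleave_datasets([ds1, ds2])
```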
Task templates
- Add task templates for tydiqa and xquad #2518 (@lewtun)
- Insert text classification template for Emotion dataset #2521 (@lewtun)
- Add summarization template #2529 (@lewtun)
- Add task template for automatic speech recognition #2533 (@lewtun)
- Remove task templates if required features are removed during `Dataset.map` #2540 (@lewtun)
- Inject templates for ASR datasets #2565 (@lewtun)
General improvements and bug fixes
- Allow to use tqdm>=4.50.0 #2482 (@lhoestq)
- Use gc.collect only when needed to avoid slow downs #2483 (@lhoestq)
- Allow latest pyarrow version #2490 (@albertvillanova)
- Use default cast for sliced list arrays if pyarrow >= 4 #2497 (@albertvillanova)
- Add Zenodo metadata file with license #2501 (@albertvillanova)
- add tensorflow-macos support #2493 (@slayerjain)
- Keep original features order #2453 (@albertvillanova)
- Add course banner #2506 (@sgugger)
- Rearrange JSON field names to match passed features schema field names #2507 (@albertvillanova)
- Fix typo in MatthewsCorrelation class name #2517 (@albertvillanova)
- Use scikit-learn package rather than sklearn in setup.py #2525 (@lesteve)
- Improve performance of pandas arrow extractor #2519 (@albertvillanova)
- Fix fingerprint when moving cache dir #2509 (@lhoestq)
- Replace bad `n>1M` size tag #2527 (@lhoestq)
- Fix dev version #2531 (@lhoestq)
- Sync with transformers disabling NOTSET #2534 (@albertvillanova)
- Fix logging levels #2544 (@albertvillanova)
- Add support for Split.ALL #2259 (@mariosasko)
- Raise FileNotFoundError in WindowsFileLock #2524 (@mariosasko)
- Make numpy arrow extractor faster #2505 (@lhoestq)
- fix Dataset.map when num_procs > num rows #2566 (@connor-mccarthy)
- Add ASR task and new languages to resources #2567 (@lewtun)
- Filter expected warning log from transformers #2571 (@albertvillanova)
- Fix BibTeX entry #2579 (@albertvillanova)
- Fix Counter import #2580 (@albertvillanova)
- Add aiohttp to tests extras require #2587 (@albertvillanova)
- Add language tags #2590 (@lewtun)
- Support pandas 1.3.0 read_csv #2593 (@lhoestq)
Dataset cards
- Updated Dataset Description #2420 (@binny-mathew)
- Update DatasetMetadata and ReadMe #2436 (@gchhablani)
- CRD3 dataset card #2515 (@wilsonyhlee)
- Add license to the Cambridge English Write & Improve + LOCNESS dataset card #2546 (@lhoestq)
- wi_locness: reference latest leaderboard on codalab #2584 (@aseifert)
Docs
- no s at load_datasets #2479 (@julien-c)
- Fix docs custom stable version #2477 (@albertvillanova)
- Improve Features docs #2535 (@albertvillanova)
- Update README.md #2414 (@cryoff)
- Fix FileSystems documentation #2551 (@connor-mccarthy)
- Minor fix in loading metrics docs #2562 (@albertvillanova)
- Minor fix docs format for bertscore #2570 (@albertvillanova)
- Add streaming in load a dataset docs #2574 (@lhoestq)
1.8.0
Datasets Changes
- New: Microsoft CodeXGlue Datasets #2357 (@madlag @ncoop57)
- New: KLUE benchmark #2416 (@jungwhank)
- New: HendrycksTest #2370 (@andyzoujm)
- Update: xor_tydi_qa - update url to v1.1 #2449 (@cccntu)
- Fix: adversarial_qa - DuplicatedKeysError #2433 (@mariosasko)
- Fix: bn_hate_speech and covid_tweets_japanese - fix broken URLs for #2445 (@lewtun)
- Fix: flores - fix download link #2448 (@mariosasko)
Datasets Features
- Add `desc` parameter in `map` for `DatasetDict` object #2423 (@bhavitvyamalik)
- Support sliced list arrays in cast #2461 (@lhoestq)
`Dataset.cast` can now change the feature types of Sequence fields
- Revert default in-memory for small datasets #2460 (@albertvillanova) Breaking:
- we used to set the datasets config value IN_MEMORY_MAX_SIZE to 250MB by default, so small datasets were copied in memory
- we changed this to zero: by default, datasets are loaded from disk with memory mapping and are not copied in memory
- users can still set `keep_in_memory=True` when loading a dataset to load it in memory (see the sketch after this list)
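A sketch of the new default and the opt-out, using a small public dataset for illustration:

```python
from datasets import load_dataset

# New default: the dataset is memory-mapped from its Arrow cache on disk.
ds = load_dataset("imdb", split="train")

# Opt back in to fully in-memory loading.
ds_mem = load_dataset("imdb", split="train", keep_in_memory=True)
```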
Datasets Cards
- adds license information for DailyDialog. #2419 (@aditya2211)
- add english language tags for ~100 datasets #2442 (@VictorSanh)
- Add copyright info to MLSUM dataset #2427 (@PhilipMay)
- Add copyright info for wiki_lingua dataset #2428 (@PhilipMay)
- Mention that there are no answers in adversarial_qa test set #2451 (@lhoestq)
General improvements and bug fixes
- Add DOI badge to README #2411 (@albertvillanova)
- Make datasets PEP-561 compliant #2417 (@SBrandeis)
- Fix save_to_disk nested features order in dataset_info.json #2422 (@lhoestq)
- Fix CI six installation on linux #2432 (@lhoestq)
- Fix Docstring Mistake: dataset vs. metric #2425 (@PhilipMay)
- Fix NQ features loading: reorder fields of features to match nested fields order in arrow data #2438 (@lhoestq)
- doc: fix typo HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2421 (@borisdayma)
- add utf-8 while reading README #2418 (@bhavitvyamalik)
- Better error message when trying to access elements of a DatasetDict without specifying the split #2439 (@lhoestq)
- Rename config and environment variable for in memory max size #2454 (@albertvillanova)
- Add version-specific BibTeX #2430 (@albertvillanova)
- Fix cross-reference typos in documentation #2456 (@albertvillanova)
- Better error message when using the wrong load_from_disk #2437 (@lhoestq)
Experimental and work in progress: Format a dataset for specific tasks
1.7.0
Dataset Changes
- New: NLU evaluation data #2238 (@dkajtoch)
- New: Add SLR32, SLR52, SLR53 to OpenSLR #2241, #2311 (@cahya-wirawan)
- New: Bbaw egyptian #2290 (@phiwi)
- New: GooAQ #2260 (@bhavitvyamalik)
- New: SubjQA #2302 (@lewtun)
- New: Ascent KB #2341, #2349 (@phongnt570)
- New: HLGD #2325 (@tingofurro)
- New: Qasper #2346 (@cceyda)
- New: ConvQuestions benchmark #2372 (@PhilippChr)
- Update: Wikihow - Clarify how to load wikihow #2240 (@albertvillanova)
- Update multi_woz_v22 - update checksum #2281 (@lhoestq)
- Update: OSCAR - Set encoding in OSCAR dataset #2321 (@albertvillanova)
- Update: XTREME - Enable auto-download for PAN-X / Wikiann domain in XTREME #2326 (@lewtun)
- Update: GEM - the DART file checksums in GEM #2334 (@yjernite)
- Update: web_science - fixed download link #2338 (@bhavitvyamalik)
- Update: SNLI, MNLI- README updated for SNLI, MNLI #2364 (@bhavitvyamalik)
- Update: conll2003 - correct labels #2369 (@philschmid)
- Update: offenseval_dravidian - update citations #2385 (@adeepH)
- Update: ai2_arc - Add dataset tags #2405 (@OyvindTafjord)
- Fix: newsph_nli - test data added, dataset_infos updated #2263 (@bhavitvyamalik)
- Fix: hyperpartisan news detection - Remove getchildren #2367 (@ghomasHudson)
- Fix: indic_glue - Fix number of classes in indic_glue sna.bn dataset #2397 (@albertvillanova)
- Fix: head_qa - Fix keys #2408 (@lhoestq)
Dataset Features
- Implement Dataset add_item #1870 (@albertvillanova)
- Implement Dataset add_column #2145 (@albertvillanova)
- Implement Dataset to JSON #2248, #2352 (@albertvillanova)
- Add `rename_columns` method #2312 (@SBrandeis)
- add `desc` to `tqdm` in `Dataset.map()` #2374 (@bhavitvyamalik) (see the sketch after this list)
- Add env variable HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2399, #2409 (@albertvillanova)
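A short sketch exercising the new Dataset methods from this release; file and column names are placeholders:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["foo", "bar"]})
ds = ds.add_column("label", [0, 1])              # add_column (#2145)
ds = ds.add_item({"text": "baz", "label": 1})    # add_item (#1870)
ds.to_json("dump.jsonl")                         # Dataset -> JSON (#2248)
ds = ds.map(lambda ex: ex, desc="identity map")  # desc shows on the tqdm bar (#2374)
```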
Metric Changes
- New: CUAD metrics #2273 (@bhavitvyamalik)
- New: Matthews/Pearson/Spearman correlation metrics #2328 (@lhoestq)
- Update: CER - Docs, CER above 1 #2342 (@borisdayma)
General improvements and bug fixes
- Update black #2265 (@lhoestq)
- Fix incorrect update_metadata_with_features calls in ArrowDataset #2258 (@mariosasko)
- Faster map w/ input_columns & faster slicing w/ Iterable keys #2246 (@norabelrose)
- Don't use pyarrow 4.0.0 since it segfaults when casting a sliced ListArray of integers #2268 (@lhoestq)
- Fix query table with iterable #2269 (@lhoestq)
- Perform minor refactoring: use config #2253 (@albertvillanova)
- Update format, fingerprint and indices after add_item #2254 (@lhoestq)
- Always update metadata in arrow schema #2274 (@lhoestq)
- Make tests run faster #2266 (@lhoestq)
- Fix metadata validation with config names #2286 (@lhoestq)
- Fixed typo seperate->separate #2292 (@laksh9950)
- Allow collaborators to self-assign issues #2289 (@albertvillanova)
- Mapping in the distributed setting #2298 (@TevenLeScao)
- Fix conda release #2309 (@lhoestq)
- Fix incorrect version specification for the pyarrow package #2317 (@cemilcengiz)
- Set default name in init_dynamic_modules #2320 (@albertvillanova)
- Fix duplicate keys #2333 (@lhoestq)
- Add note about indices mapping in save_to_disk docstring #2332 (@lhoestq)
- Metadata validation #2107 (@theo-m)
- Add Validation For README #2121 (@gchhablani)
- Fix overflow issue in interpolation search #2336 (@mariosasko)
- Datasets cli improvements #2315 (@mariosasko)
- Add `key` type and duplicates verification with hashing #2245 (@NikhilBartwal)
- More consistent copy logic #2340 (@mariosasko)
- Update README validation rules #2353 (@gchhablani)
- normalized TOCs and titles in data cards #2355 (@yjernite)
- simplify faiss index save #2351 (@Guitaricet)
- Allow "other-X" in licenses #2368 (@gchhablani)
- Improve ReadInstruction logic and update docs #2261 (@mariosasko)
- Disallow duplicate keys in yaml tags #2379 (@lhoestq)
- maintain YAML structure reading from README #2380 (@bhavitvyamalik)
- add dataset card title #2381 (@bhavitvyamalik)
- Add tests for dataset cards #2348 (@gchhablani)
- Improve example in rounding docs #2383 (@mariosasko)
- Paperswithcode dataset mapping #2404 (@julien-c)
- Free datasets with cache file in temp dir on exit #2403 (@mariosasko)
Experimental and work in progress: Format a dataset for specific tasks
- Task formatting for text classification & question answering #2255 (@SBrandeis)
- Add check for task templates on dataset load #2390 (@lewtun)
- Add args description to DatasetInfo #2384 (@lewtun)
- Improve task api code quality #2376 (@mariosasko)
1.6.2
Fix memory issue: don't copy recordbatches in memory during a table deepcopy #2291 (@lhoestq)
This affected methods like concatenate_datasets, multiprocessed map and load_from_disk.
Breaking change:
- when using `Dataset.map` with the `input_columns` parameter, the resulting dataset will only have the columns from `input_columns` and the columns added by the map function. The other columns are discarded (see the sketch below).
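A minimal sketch of the breaking behavior; column and function names are illustrative:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "bb"], "meta": [1, 2]})

# With input_columns, the function receives the column values as positional args.
out = ds.map(lambda text: {"length": len(text)}, input_columns=["text"])

# Only input_columns plus the mapped outputs survive: 'meta' is discarded.
print(out.column_names)  # ['text', 'length']
```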