
Releases: huggingface/datasets

1.12.1

15 Sep 17:45


Bug fixes

  • Fix fsspec AbstractFileSystem access #2915 (@pierre-godard)
  • Fix unwanted tqdm bar when accessing examples #2920 (@lhoestq)
  • Fix conversion of multidim arrays in list to arrow #2922 (@lhoestq):
• this fixes the "ArrowInvalid: Can only convert 1-dimensional array values" errors

1.12.0

13 Sep 18:35


New documentation

  • New documentation structure #2718 (@stevhliu):
    • New: Tutorials
• New: How-to guides
    • New: Conceptual guides
    • Update: Reference

See the new documentation here!

Datasets Changes

Datasets Features

Dataset streaming - better support for compression:

Metrics Changes

Dataset Cards

General improvements and bug fixes

1.11.0

30 Jul 14:27


Datasets Changes

General improvements and bug fixes

1.10.2

22 Jul 10:08


The error message telling users which dataset config name to load was not displayed:

Docstrings:

1.10.1

22 Jul 08:47


1.10.0

21 Jul 13:46


Datasets Features

  • Support remote data files #2616 (@albertvillanova)
This allows passing URLs of remote data files to any dataset loader:
    load_dataset("csv", data_files={"train": [url_to_one_csv_file, url_to_another_csv_file...]})
    This works for all these dataset loaders:
    • text
    • csv
    • json
    • parquet
    • pandas
  • Streaming from remote text/json/csv/parquet/pandas files:
When you pass URLs to a dataset loader, you can enable streaming mode with streaming=True.
  • Faster search_batch for ElasticsearchIndex due to threading #2581 (@mwrzalik)
  • Delete extracted files when loading dataset #2631 (@albertvillanova)

Datasets Changes

Dataset Tasks

Metrics Changes

General improvements and bug fixes

Dataset Cards

Docs

1.9.0

05 Jul 17:25


Datasets Changes

Datasets Features

Task templates

  • Add task templates for tydiqa and xquad #2518 (@lewtun)
  • Insert text classification template for Emotion dataset #2521 (@lewtun)
  • Add summarization template #2529 (@lewtun)
  • Add task template for automatic speech recognition #2533 (@lewtun)
  • Remove task templates if required features are removed during Dataset.map #2540 (@lewtun)
  • Inject templates for ASR datasets #2565 (@lewtun)

General improvements and bug fixes

Dataset Cards

Docs

1.8.0

08 Jun 18:23


Datasets Changes

Datasets Features

  • Add desc parameter in map for DatasetDict object #2423 (@bhavitvyamalik)
  • Support sliced list arrays in cast #2461 (@lhoestq)
    • Dataset.cast can now change the feature types of Sequence fields
  • Revert default in-memory for small datasets #2460 (@albertvillanova) Breaking:
    • the IN_MEMORY_MAX_SIZE config used to default to 250MB
    • it is now zero: by default, datasets are loaded from disk with memory mapping and are not copied into memory
    • users can still pass keep_in_memory=True when loading a dataset to load it in memory

Dataset Cards

General improvements and bug fixes

Experimental and work in progress: Format a dataset for specific tasks

  • Update text classification template labels in DatasetInfo post_init #2392 (@lewtun)
  • Insert task templates for text classification #2389 (@lewtun)
  • Rename QuestionAnswering template to QuestionAnsweringExtractive #2429 (@lewtun)
  • Insert Extractive QA templates for SQuAD-like datasets #2435 (@lewtun)

1.7.0

27 May 10:00


Datasets Changes

Datasets Features

Metrics Changes

General improvements and bug fixes

Experimental and work in progress: Format a dataset for specific tasks

1.6.2

30 Apr 13:20


Fix memory issue: don't copy record batches in memory during a table deepcopy #2291 (@lhoestq)
This affected methods like concatenate_datasets, multiprocessed map, and load_from_disk.

Breaking change:

  • when using Dataset.map with the input_columns parameter, the resulting dataset will only contain the columns from input_columns plus the columns added by the map function; the other columns are discarded.