Releases: huggingface/datasets

2.7.0

16 Nov 10:11
edf1902

Dataset Features

  • Multiprocessed dataset builder by @TevenLeScao in #5107
    • Load big datasets faster than before using multiprocessing:
    from datasets import load_dataset
    ds = load_dataset("imagenet-1k", num_proc=4)
  • Make torch.Tensor and spacy models cacheable by @mariosasko in #5191
    • Functions passed to map or filter that use tensors or pipelines can now be cached
  • Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in #5192
  • TextConfig: added "errors" by @NightMachinery in #5155
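
These notes do not describe the new "errors" option of TextConfig. Assuming it is forwarded to Python's built-in `open` in the same way as an encoding, the standard-library sketch below shows what the common values do with a byte that is not valid UTF-8. The file name and contents are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample: the second line contains a byte (0xff) that is invalid in UTF-8.
raw = b"good line\nbad \xff line\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # errors="strict" (the default) raises on the invalid byte.
    try:
        with open(path, encoding="utf-8", errors="strict") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # errors="replace" substitutes U+FFFD for the invalid byte.
    with open(path, encoding="utf-8", errors="replace") as f:
        replaced = f.read().splitlines()

    # errors="ignore" silently drops the invalid byte.
    with open(path, encoding="utf-8", errors="ignore") as f:
        ignored = f.read().splitlines()

print(strict_failed, replaced, ignored)
```

Which value is appropriate depends on whether corrupted characters should stop loading, stay visible as placeholders, or be discarded.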

Audio setup

Docs

General improvements and bug fixes

New Contributors

Full Changelog: 2.6.1...2.7.0

2.6.1

14 Oct 15:45

Bug fixes

  • Fix filter indices when batched by @albertvillanova in #5113
    • fixed a bug where filter could return examples with the wrong indices
  • Fix iter_batches by @lhoestq in #5115
    • fixed a bug where map with batched=True could return a dataset with fewer examples
  • Fix a typo in arrow_dataset.py by @yangky11 in #5108
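
For context on the filter fix: with batched=True, the function passed to filter returns one boolean per example in the batch, and with with_indices=True it also receives the position of each example. The plain-Python emulation below is not the library's implementation. `filter_batched` and its arguments are hypothetical names, used only to illustrate the behavior one would expect: the indices handed to the function are dataset-wide positions and do not restart at 0 for each batch.

```python
from typing import Callable, Dict, List


def filter_batched(
    rows: List[Dict],
    keep: Callable[[List[Dict], List[int]], List[bool]],
    batch_size: int,
) -> List[Dict]:
    """Hypothetical stand-in for a batched filter that passes global indices."""
    kept = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # Indices are positions in the whole dataset, not within the batch.
        indices = list(range(start, start + len(batch)))
        mask = keep(batch, indices)
        kept.extend(row for row, flag in zip(batch, mask) if flag)
    return kept


rows = [{"id": i} for i in range(10)]
# Keep the examples at even positions.
even = filter_batched(rows, lambda batch, idx: [i % 2 == 0 for i in idx], batch_size=3)
print([r["id"] for r in even])
```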

New Contributors

Full Changelog: 2.6.0...2.6.1

2.6.0

13 Oct 11:00

Important

  • [GH->HF] Remove all dataset scripts from github by @lhoestq in #4974
    • all the dataset scripts and dataset cards are now on https://hf.co/datasets
    • we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on

Datasets features

  • Add ability to read-write to SQL databases. by @Dref360 in #4928
    • Read from sqlite file:
    from datasets import Dataset
    dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
    • Allow connection objects in from_sql + small doc improvement by @mariosasko in #5091
    from datasets import Dataset
    from sqlite3 import connect
    con = connect(...)
    # "table" is a reserved word in SQLite, so query the table by its actual name
    dataset = Dataset.from_sql("SELECT text FROM data_table WHERE length(text) > 100 LIMIT 10", con)
  • Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in #5072
    • return numpy/torch/tf/jax tensors with
    from datasets import load_dataset
    # select a split so that ds[0] indexes an example rather than a DatasetDict key
    ds = load_dataset("imagenet-1k", split="train").with_format("torch")  # or numpy/tf/jax
    ds[0]["image"]
  • Added IterableDataset.from_generator by @hamid-vakilzadeh in #5052
  • Fast dataset iter by @mariosasko in #5030
    • speed up by a factor of 2 using the Arrow Table reader
  • Dataset infos in yaml by @lhoestq in #4926
  • Add kwargs to Dataset.from_generator by @mariosasko in #5049
  • Support converters in CsvBuilder by @mariosasko in #5057
  • Restore saved format state in load_from_disk by @asofiaoliveira in #5073
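
To try the SQL reading feature listed above locally, a database is needed first. The standard-library sketch below builds a throwaway in-memory SQLite database and runs a query of the kind shown in the release note. The table name `data_table`, its single column, and the sample rows are made up. Handing the connection to Dataset.from_sql is left as a comment because it requires datasets to be installed.

```python
import sqlite3

# In-memory database; nothing is written to disk.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data_table (text TEXT)")
con.executemany(
    "INSERT INTO data_table (text) VALUES (?)",
    [("short",), ("x" * 150,), ("y" * 101,), ("z" * 100,)],
)
con.commit()

# Only rows whose text is longer than 100 characters are selected.
query = "SELECT text FROM data_table WHERE length(text) > 100 LIMIT 10"
rows = con.execute(query).fetchall()
print([len(r[0]) for r in rows])

# Assumption: with `datasets` installed, the same query and connection could be
# passed along as in the release note: Dataset.from_sql(query, con)
```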

Dataset changes

Dataset cards

General improvements and bug fixes

New Contributors

Full Changelog: 2.5.1...2.6.0

2.5.2

05 Oct 10:17

Bug fixes

  • Revert task removal in folder-based builders (#5051)
  • Support hfh 0.10 implicit auth (#5031)

Full Changelog: 2.5.1...2.5.2

2.5.1

21 Sep 15:17

Bug fixes

Full Changelog: 2.5.0...2.5.1

2.5.0

21 Sep 13:14

Important

  • Drop Python 3.6 support by @mariosasko in #4460
  • Deprecate metrics by @albertvillanova in #4739
    • Metrics are now deprecated and have been moved to evaluate:
      !pip install evaluate
      import evaluate
      metric = evaluate.load("accuracy")
  • Load GitHub datasets from Hub by @albertvillanova in #4059
  • Decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in #4923
    • the latest version of torchaudio (0.12) now requires ffmpeg (version 4) to read MP3 files; please downgrade to a version below 0.12 for now, or use librosa
  • Use HTTP requests to access data and metadata through the Datasets REST API (docs here)

Datasets features

No-code loaders

Dataset methods

Parquet support

  • Download and prepare as Parquet for cloud storage by @lhoestq in #4724
  • Shard parquet in download_and_prepare by @lhoestq in #4747
  • Embed image/audio data in dl_and_prepare parquet by @lhoestq in #4987

Datasets changes

Dataset cards

Documentation

2.4.0

25 Jul 13:41

Dataset Features

Dataset changes

Dataset Cards

Documentation

General improvements and bug fixes

2.3.2

15 Jun 18:08

Bug fixes

  • Fix double dots in data files by @lhoestq in #4505
    • fix a bug where passing a path containing /../ to data_files caused a FileNotFoundError
  • fix ETT m1/m2 test/val dataset by @kashif in #4499
  • Corrected broken links in doc by @clefourrier in #4501
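
As an illustration of the /../ case fixed in #4505: a parent-directory segment cancels the path component before it. The sketch below uses only the standard library and is not the resolution code used by datasets. The path is made up.

```python
import posixpath

# Hypothetical data_files entry containing a parent-directory segment.
pattern = "data/train/../test/part-0.csv"

# normpath collapses "train/.." so the path points into data/test.
resolved = posixpath.normpath(pattern)
print(resolved)
```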

New Contributors

Full Changelog: 2.3.1...2.3.2

2.3.1

15 Jun 11:08

Bug fixes

  • Fix patching module that doesn't exist by @lhoestq in #4495
    • fix bug when importing the lib when scipy is not installed
  • Re-add download_manager module in utils by @lhoestq in #4497
    • fix moved imports of DownloadConfig, DownloadMode, DownloadManager
  • Support streaming UDHR dataset by @albertvillanova in #4487

Full Changelog: 2.3.0...2.3.1

2.3.0

14 Jun 18:12

Datasets Changes

Dataset Features

Dataset Cards

  • Minor fixes/improvements in scene_parse_150 card by @mariosasko in #4447
  • Tidy up license metadata for google_wellformed_query, newspop, sick by @leondz in #4378
  • Fix example in opus_ubuntu, Add license info by @leondz in #4360
  • Update README.md of fquad by @lhoestq in #4450

Documentation

Other improvements and bug fixes

New Contributors

Full Changelog: https://git...
