
datasets.keyhash.DuplicatedKeysError for drop and adversarial_qa/adversarialQA #2542

@VictorSanh

Describe the bug

Dataset generation fails for drop and for the adversarialQA subset of adversarial_qa because of duplicate keys.

Steps to reproduce the bug

from datasets import load_dataset
load_dataset("drop")
load_dataset("adversarial_qa", "adversarialQA")

Expected results

The example keys should be unique, so both datasets should load without errors.

Actual results

>>> load_dataset("drop")
Using custom data configuration default
Downloading and preparing dataset drop/default (download: 7.92 MiB, generated: 111.88 MiB, post-processed: Unknown size, total: 119.80 MiB) to /home/hf/.cache/huggingface/datasets/drop/default/0.1.0/7a94f1e2bb26c4b5c75f89857c06982967d7416e5af935a9374b9bccf5068026...
Traceback (most recent call last):         
  File "<stdin>", line 1, in <module>
  File "/home/hf/dev/promptsource/.venv/lib/python3.7/site-packages/datasets/load.py", line 751, in load_dataset
    use_auth_token=use_auth_token,
  File "/home/hf/dev/promptsource/.venv/lib/python3.7/site-packages/datasets/builder.py", line 575, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/home/hf/dev/promptsource/.venv/lib/python3.7/site-packages/datasets/builder.py", line 652, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/hf/dev/promptsource/.venv/lib/python3.7/site-packages/datasets/builder.py", line 992, in _prepare_split
    num_examples, num_bytes = writer.finalize()
  File "/home/hf/dev/promptsource/.venv/lib/python3.7/site-packages/datasets/arrow_writer.py", line 409, in finalize
    self.check_duplicate_keys()
  File "/home/hf/dev/promptsource/.venv/lib/python3.7/site-packages/datasets/arrow_writer.py", line 349, in check_duplicate_keys
    raise DuplicatedKeysError(key)
datasets.keyhash.DuplicatedKeysError: FAILURE TO GENERATE DATASET !
Found duplicate Key: 28553293-d719-441b-8f00-ce3dc6df5398
Keys should be unique and deterministic in nature
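For context, a loading script's _generate_examples yields (key, example) pairs, and the ArrowWriter raises DuplicatedKeysError when a key repeats. Below is a minimal sketch of a generator that avoids the collision by keying on an enumerated index instead of a possibly repeated id field; the field names ("query_id", "question", "answer") are hypothetical and are not taken from the actual drop or adversarial_qa scripts.

import json

# Minimal sketch of a _generate_examples method for a GeneratorBasedBuilder.
# Field names ("query_id", "question", "answer") are hypothetical, not the
# actual drop/adversarial_qa schema. Keying on the enumerated index keeps the
# keys unique and deterministic even if the source id field repeats.
def _generate_examples(self, filepath):
    with open(filepath, encoding="utf-8") as f:
        records = json.load(f)
    for idx, record in enumerate(records):
        # yield (idx, ...) rather than (record["query_id"], ...) to avoid
        # DuplicatedKeysError when ids are reused across records
        yield idx, {
            "question": record["question"],
            "answer": record["answer"],
        }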

Environment info

  • datasets version: 1.7.0
  • Platform: Linux-5.4.0-1044-gcp-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyArrow version: 3.0.0

Labels

bug (Something isn't working)
