-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Closed
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
Failure to generate the datasets (drop and subset adversarialQA from adversarial_qa) because of duplicate keys.
Steps to reproduce the bug
from datasets import load_dataset
load_dataset("drop")
load_dataset("adversarial_qa", "adversarialQA")Expected results
The examples keys should be unique.
Actual results
>>> load_dataset("drop")
Using custom data configuration default
Downloading and preparing dataset drop/default (download: 7.92 MiB, generated: 111.88 MiB, post-processed: Unknown size, total: 119.80 MiB) to /home/hf/.cache/huggingface/datasets/drop/default/0.1.0/7a94f1e2bb26c4b5c75f89857c06982967d7416e5af935a9374b9bccf5068026...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/hf/dev/promptsource/.venv/lib/python3.7/site-packages/datasets/load.py", line 751, in load_dataset
use_auth_token=use_auth_token,
File "/home/hf/dev/promptsource/.venv/lib/python3.7/site-packages/datasets/builder.py", line 575, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "/home/hf/dev/promptsource/.venv/lib/python3.7/site-packages/datasets/builder.py", line 652, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/hf/dev/promptsource/.venv/lib/python3.7/site-packages/datasets/builder.py", line 992, in _prepare_split
num_examples, num_bytes = writer.finalize()
File "/home/hf/dev/promptsource/.venv/lib/python3.7/site-packages/datasets/arrow_writer.py", line 409, in finalize
self.check_duplicate_keys()
File "/home/hf/dev/promptsource/.venv/lib/python3.7/site-packages/datasets/arrow_writer.py", line 349, in check_duplicate_keys
raise DuplicatedKeysError(key)
datasets.keyhash.DuplicatedKeysError: FAILURE TO GENERATE DATASET !
Found duplicate Key: 28553293-d719-441b-8f00-ce3dc6df5398
Keys should be unique and deterministic in natureEnvironment info
datasetsversion: 1.7.0- Platform: Linux-5.4.0-1044-gcp-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.10
- PyArrow version: 3.0.0
albertvillanova
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working