Keys should be unique error on code_search_net #2552

@thomwolf

Describe the bug

Loading code_search_net currently fails: dataset generation aborts with a DuplicatedKeysError.

Steps to reproduce the bug

>>> load_dataset('code_search_net')
Downloading: 8.50kB [00:00, 3.09MB/s]
Downloading: 19.1kB [00:00, 10.1MB/s]
No config specified, defaulting to: code_search_net/all
Downloading and preparing dataset code_search_net/all (download: 4.77 GiB, generated: 5.99 GiB, post-processed: Unknown size, total: 10.76 GiB) to /Users/thomwolf/.cache/huggingface/datasets/code_search_net/all/1.0.0/b3e8278faf5d67da1d06981efbeac3b76a2900693bd2239bbca7a4a3b0d6e52a...
Traceback (most recent call last):
  File "/Users/thomwolf/Documents/GitHub/datasets/src/datasets/builder.py", line 1067, in _prepare_split
    writer.write(example, key)
  File "/Users/thomwolf/Documents/GitHub/datasets/src/datasets/arrow_writer.py", line 343, in write
    self.check_duplicate_keys()
  File "/Users/thomwolf/Documents/GitHub/datasets/src/datasets/arrow_writer.py", line 354, in check_duplicate_keys
    raise DuplicatedKeysError(key)
datasets.keyhash.DuplicatedKeysError: FAILURE TO GENERATE DATASET !
Found duplicate Key: 48
Keys should be unique and deterministic in nature
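
The error message says keys must be unique across the examples being written. One way this kind of failure can arise, offered here as an assumption and not a confirmed diagnosis of the code_search_net script, is a generator that restarts its key counter for every input file. The sketch below uses plain Python only; `FILES`, `per_file_keys`, `global_keys`, and `find_duplicate_keys` are hypothetical names made up for illustration and are not part of the datasets library.

```python
from typing import Dict, Iterator, List, Tuple

Example = Tuple[int, Dict[str, str]]

# Hypothetical stand-in for two data files; not the real code_search_net data.
FILES: Dict[str, List[str]] = {
    "a.jsonl": ["def f(): pass", "def g(): pass"],
    "b.jsonl": ["def h(): pass"],
}


def per_file_keys(files: Dict[str, List[str]]) -> Iterator[Example]:
    """Yield (key, example) pairs where the key restarts at 0 for each file."""
    for rows in files.values():
        for row_id, code in enumerate(rows):
            yield row_id, {"code": code}


def global_keys(files: Dict[str, List[str]]) -> Iterator[Example]:
    """Yield (key, example) pairs using one counter shared across all files."""
    id_ = 0
    for rows in files.values():
        for code in rows:
            yield id_, {"code": code}
            id_ += 1


def find_duplicate_keys(examples: Iterator[Example]) -> List[int]:
    """Return every key that was already seen earlier in the stream."""
    seen = set()
    duplicates = []
    for key, _ in examples:
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


print(find_duplicate_keys(per_file_keys(FILES)))  # [0]
print(find_duplicate_keys(global_keys(FILES)))    # []
```

If the loading script does reset its keys per file, switching to a single running counter (or a key that includes the file name) would be one way to make the keys unique and deterministic.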

Environment info

  • datasets version: 1.8.1.dev0
  • Platform: macOS-10.15.7-x86_64-i386-64bit
  • Python version: 3.8.5
  • PyArrow version: 2.0.0
