Better DuplicatedKeysError message to help the user debug the issue #2556

@lhoestq

As mentioned in #2552, it would be nice to improve the error message shown when a dataset fails to build because of duplicate example keys.

The current one is

datasets.keyhash.DuplicatedKeysError: FAILURE TO GENERATE DATASET !
Found duplicate Key: 48
Keys should be unique and deterministic in nature

We could instead have something that guides the user in debugging the issue:

DuplicatedKeysError: both the 42nd and 1337th examples have the same key `48`.
Please fix the dataset script at <path/to/the/dataset/script>
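
To make the idea concrete, here is a minimal sketch of one way such a message could be produced: remember the index at which each key was first seen, so that both positions are available when a duplicate shows up. This is not how `datasets` implements its check. The `DuplicatedKeysError` class below is a hypothetical stand-in defined only for this example, and `check_unique_keys` and `script_path` are made-up names.

```python
class DuplicatedKeysError(Exception):
    """Hypothetical stand-in for illustration; not the library's actual class."""

    def __init__(self, key, first_index, duplicate_index, script_path=None):
        self.key = key
        self.first_index = first_index
        self.duplicate_index = duplicate_index
        message = (
            f"examples at indices {first_index} and {duplicate_index} "
            f"have the same key `{key}`."
        )
        # Only point at the script when the caller knows where it is.
        if script_path is not None:
            message += f"\nPlease fix the dataset script at {script_path}"
        super().__init__(message)


def check_unique_keys(keys, script_path=None):
    """Raise on the first repeated key, reporting both example indices.

    `keys` is any iterable of hashable example keys (an assumption of this sketch).
    """
    first_seen = {}  # key -> index of the example that first used it
    for index, key in enumerate(keys):
        if key in first_seen:
            raise DuplicatedKeysError(key, first_seen[key], index, script_path)
        first_seen[key] = index
```

Keeping a key-to-index mapping costs memory proportional to the number of distinct keys, so a real implementation might prefer to track indices only within a batch or store hashes instead.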
