
Conversation

@VijayKalmath (Contributor) commented Jun 22, 2022

What does this PR do?

Summary

The DuplicateKeysError does not provide any information about the examples that share the same key.

This information is very helpful for debugging the dataset generator script.

Additions

Changes

  • Changed the DuplicateKeysError class in src/datasets/keyhash.py to include the current index and duplicate_key_indices in the error message.
  • Changed the check_duplicate_keys function in src/datasets/arrow_writer.py to find the indices of examples with the same hash when duplicate keys are found.
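The duplicate-index lookup described in the second bullet can be sketched roughly as follows (the helper name is an assumption for illustration, not the actual arrow_writer.py code):

```python
from collections import Counter
from typing import List

def find_duplicate_indices(hashes: List[str]) -> List[int]:
    # Return the indices of every example whose hash occurs more than
    # once, in generation order, so the error can point at all of them.
    counts = Counter(hashes)
    return [i for i, h in enumerate(hashes) if counts[h] > 1]
```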

Deletions

To do :

  • Find a way to determine and print the <Path to Dataset> in the error message

Issues Addressed:

Fixes #2556

@lhoestq (Member) left a comment


Cool, thanks! Looking good so far :)

Comment on lines 73 to 83
def __init__(self, key, duplicate_key_indices=None):
    if duplicate_key_indices:
        self.prefix = f"Found multiple examples with duplicate key: {key}"
        self.err_msg = f"\nThe following examples {' ,'.join(duplicate_key_indices)} have the same key {key} "
        self.suffix = "\nPlease fix the dataset script at <Path to Dataset>"
        super().__init__(f"{self.prefix}{self.err_msg}{self.suffix}")
    else:
        self.prefix = "FAILURE TO GENERATE DATASET !"
        self.err_msg = f"\nFound duplicate Key: {key}"
        self.suffix = "\nKeys should be unique and deterministic in nature"
        super().__init__(f"{self.prefix}{self.err_msg}{self.suffix}")

I think you can always require duplicate_key_indices and completely drop the old FAILURE TO GENERATE DATASET message
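Dropping the fallback branch as suggested, the simplified constructor could look roughly like this (the message wording here is a sketch, not the merged code):

```python
class DuplicatedKeysError(AssertionError):
    # Sketch: always require the colliding indices; no legacy fallback message.
    def __init__(self, key, duplicate_key_indices):
        self.key = key
        self.duplicate_key_indices = duplicate_key_indices
        super().__init__(
            f"Found multiple examples generated with the same key\n"
            f"The examples at index {', '.join(duplicate_key_indices)} have the key {key}\n"
            "Please fix the dataset script to avoid duplicate keys"
        )
```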

@mariosasko (Collaborator) left a comment

One nit from me:

I like this error message format better:

DuplicateKeysError: both 42nd and 1337th examples have the same key `48`.
Please fix the dataset script at <path/to/the/dataset/script>

(Originally proposed by @lhoestq in the referenced issue)

PS: The absolute indices needed for this format can be computed using ArrowWriter's _num_examples attribute.
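To illustrate the idea with a toy sketch (assuming, as with ArrowWriter, that examples are buffered and flushed in batches, and that _num_examples counts the examples already flushed, so an index within the current in-memory batch is only relative):

```python
class BatchedWriter:
    # Toy model of a writer that buffers examples and flushes in batches.
    def __init__(self):
        self._num_examples = 0   # examples already flushed
        self._batch = []         # examples still in memory

    def write(self, example):
        self._batch.append(example)

    def flush(self):
        self._num_examples += len(self._batch)
        self._batch = []

    def absolute_index(self, batch_index):
        # An index into the current batch is relative; adding
        # _num_examples recovers the example's absolute position.
        return self._num_examples + batch_index
```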

@VijayKalmath (Contributor, Author) commented Jun 23, 2022

@mariosasko, apologies, I saw your comment after I pushed ccba4f5.

The current Error looks like :

DuplicatedKeysError: Found multiple examples generated with the same key
The following examples 15, 8784 have the key 519
Please fix the dataset script at datasets/wikicorpus/wikicorpus.py to avoid duplicate keys

Do let me know if you still prefer the original proposal and I shall update it.

DuplicateKeysError: both 42nd and 1337th examples have the same key `48`.
Please fix the dataset script at <path/to/the/dataset/script>

Also, could you elaborate on what you mean by "the absolute indices can be computed using the _num_examples attribute"?
Should the indices be calculated as _num_examples + index?

@lhoestq (Member) left a comment

Nice, thanks!

After your changes feel free to mark this PR as "ready for review" ;)

DuplicateKeysError error does not provide any information regarding
the examples which have the same key. This information is very
helpful for debugging the dataset generator script.
The DuplicatedKeysError class was updated to always expect
duplicate_key_indices, and the old error message was removed.

Path to the dataset in the error message is now built from the dataset
name to make it more user-friendly.
The current index calculation did not compute absolute indices.
The error message used str.replace; changed it to a simple string append.
@VijayKalmath (Contributor, Author) commented

Nice, thanks!

After your changes feel free to mark this PR as "ready for review" ;)

Marking PR ready for review.

@lhoestq Let me know if there is anything else required or if we are good to go ahead and merge.

@VijayKalmath VijayKalmath marked this pull request as ready for review June 27, 2022 18:39
@lhoestq (Member) left a comment

Awesome, thanks! I added a small suggestion to crop the number of indices shown to 20 (to not spam the logs), and to only mention the dataset script name (without assuming it's in datasets/, since that's not the case in general). Here is an example:

DuplicatedKeysError: Found multiple examples generated with the same key
The examples at index 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19... (9980 more) have the key 0
To avoid duplicate keys, please fix the dataset script squad.py
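The cropping in the example above could be sketched like this (the cap constant and helper name are assumptions for illustration, not the merged code):

```python
MAX_SHOWN = 20  # assumed cap on how many indices to print

def format_duplicate_indices(indices):
    # Show at most MAX_SHOWN indices, then summarize how many are hidden.
    shown = ", ".join(str(i) for i in indices[:MAX_SHOWN])
    hidden = len(indices) - MAX_SHOWN
    return shown + (f"... ({hidden} more)" if hidden > 0 else "")
```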

@HuggingFaceDocBuilderDev commented Jun 28, 2022

The documentation is not available anymore as the PR was closed or merged.

@lhoestq lhoestq merged commit 8910eda into huggingface:master Jun 28, 2022


Development

Successfully merging this pull request may close these issues.

Better DuplicateKeysError error to help the user debug the issue

4 participants