Make DuplicateKeysError more user friendly [For Issue #2556] #4545
Conversation
lhoestq
left a comment
Cool thanks ! Looking good so far :)
src/datasets/keyhash.py
Outdated
```python
def __init__(self, key, duplicate_key_indices=None):
    if duplicate_key_indices:
        self.prefix = f"Found multiple examples with duplicate key: {key}"
        self.err_msg = f"\nThe following examples {', '.join(duplicate_key_indices)} have the same key {key} "
        self.suffix = "\nPlease fix the dataset script at <Path to Dataset>"
        super().__init__(f"{self.prefix}{self.err_msg}{self.suffix}")
    else:
        self.prefix = "FAILURE TO GENERATE DATASET !"
        self.err_msg = f"\nFound duplicate Key: {key}"
        self.suffix = "\nKeys should be unique and deterministic in nature"
        super().__init__(f"{self.prefix}{self.err_msg}{self.suffix}")
```
I think you can always require duplicate_key_indices and completely drop the old FAILURE TO GENERATE DATASET message
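A minimal sketch of that suggestion (hypothetical, not the final PR code): make `duplicate_key_indices` a required argument so the generic fallback branch disappears entirely.

```python
class DuplicatedKeysError(Exception):
    """Raised when two or more examples in a dataset share the same key (sketch)."""

    def __init__(self, key, duplicate_key_indices):
        self.key = key
        self.duplicate_key_indices = duplicate_key_indices
        super().__init__(
            f"Found multiple examples generated with the same key\n"
            f"The examples at index "
            f"{', '.join(str(i) for i in duplicate_key_indices)} "
            f"have the key {key}"
        )
```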
One nit from me:
I like this error message format better:
```
DuplicateKeysError: both 42nd and 1337th examples have the same key `48`.
Please fix the dataset script at <path/to/the/dataset/script>
```
(Originally proposed by @lhoestq in the referenced issue)
PS: The absolute indices needed for this format can be computed using ArrowWriter's _num_examples attribute.
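In other words (a hypothetical sketch, assuming `_num_examples` holds the count of examples already written before the current batch), the batch-relative duplicate indices become absolute by adding that running count as an offset:

```python
# Indices found within the current batch are relative to the batch start;
# adding the number of examples already written makes them absolute.
num_examples = 10_000        # e.g. the writer's _num_examples before this batch
relative_indices = [15, 42]  # positions inside the current batch
absolute_indices = [num_examples + i for i in relative_indices]
print(absolute_indices)      # [10015, 10042]
```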
@mariosasko, apologies, I saw your comment after I pushed ccba4f5.
The current error looks like:
```
DuplicatedKeysError: Found multiple examples generated with the same key
The following examples 15, 8784 have the key 519
Please fix the dataset script at datasets/wikicorpus/wikicorpus.py to avoid duplicate keys
```
Do let me know if you still prefer the original proposal and I shall update it.
```
DuplicateKeysError: both 42nd and 1337th examples have the same key `48`.
Please fix the dataset script at <path/to/the/dataset/script>
```
Also, could you elaborate on what you mean by computing the absolute indices using the _num_examples attribute?
Should the indices be calculated as _num_examples + index?
lhoestq
left a comment
Nice thanks !
After your changes feel free to mark this PR as "ready for review" ;)
DuplicateKeysError does not provide any information about the examples that share the same key. This information is very helpful for debugging the dataset generator script.
The DuplicatedKeysError class was updated to always expect duplicate_key_indices, and the old error message was removed. The path to the dataset in the error message now uses the dataset name to make it more user friendly.
The previous index calculation did not produce absolute indices; this is now fixed. The error message used str.replace; changed it to a simple string append.
Marking the PR ready for review. @lhoestq Let me know if there is anything else required or if we are good to go ahead and merge.
lhoestq
left a comment
Awesome thanks ! I added a small suggestion to crop the number of indices shown to 20 (to not spam the logs), and also to only mention the dataset script name (without assuming it's in datasets/, since it's not the case in general). Here is an example:
```
DuplicatedKeysError: Found multiple examples generated with the same key
The examples at index 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19... (9980 more) have the key 0
To avoid duplicate keys, please fix the dataset script squad.py
```
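The cropping could look something like this (a sketch of the suggestion, not the merged code; the 20-index cutoff and the `format_indices` helper name are assumptions matching the example above):

```python
MAX_SHOWN = 20

def format_indices(indices):
    # Show at most MAX_SHOWN indices, then summarize the rest to keep logs short.
    shown = ", ".join(str(i) for i in indices[:MAX_SHOWN])
    hidden = len(indices) - MAX_SHOWN
    return f"{shown}... ({hidden} more)" if hidden > 0 else shown

print(format_indices(list(range(10_000))))
# 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19... (9980 more)
```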
The documentation is not available anymore as the PR was closed or merged.
What does this PR do?
Summary
DuplicateKeysError does not provide any information about the examples that share the same key.
This information is very helpful for debugging the dataset generator script.
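As an illustration of the duplicate-finding step (a hypothetical sketch of the behavior described here, not the actual check_duplicate_keys implementation; the function name below is invented), given the hashed keys in write order one can collect every index that shares a hash:

```python
from collections import defaultdict

def find_duplicate_key_indices(hashed_keys):
    # Map each key hash to the list of example indices where it appeared.
    positions = defaultdict(list)
    for index, key_hash in enumerate(hashed_keys):
        positions[key_hash].append(index)
    # Keep only the hashes that occurred more than once.
    return {h: idxs for h, idxs in positions.items() if len(idxs) > 1}

print(find_duplicate_key_indices(["a1", "b2", "a1", "c3"]))  # {'a1': [0, 2]}
```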
Additions
Changes
- `DuplicateKeysError` class in `src/datasets/keyhash.py`: add the current index and `duplicate_key_indices` to the error message.
- `check_duplicate_keys` function in `src/datasets/arrow_writer.py`: find the indices of examples with a duplicate hash if duplicate keys are found.

Deletions
To do:
- `<Path to Dataset>` in error message

Issues Addressed:
Fixes #2556