
Conversation

@VijayKalmath (Contributor) commented Jun 22, 2022

What does this PR do?

Summary

The DuplicateKeysError does not provide any information about the examples that share the same key.

This information is very helpful for debugging the dataset generator script.

Additions

Changes

  • Changed the DuplicateKeysError class in src/datasets/keyhash.py to include the current index and duplicate_key_indices in the error message.
  • Changed the check_duplicate_keys function in src/datasets/arrow_writer.py to find the indices of examples with the same hash when duplicate keys are found.
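The duplicate-index lookup described in the second bullet can be sketched roughly as follows (the helper name is an assumption for illustration, not the actual arrow_writer.py code):

```python
from collections import Counter
from typing import List

def find_duplicate_indices(hashes: List[str]) -> List[int]:
    # Return the indices of every example whose hash occurs more than
    # once, in generation order, so the error can point at all of them.
    counts = Counter(hashes)
    return [i for i, h in enumerate(hashes) if counts[h] > 1]
```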

Deletions

To do :

  • Find a way to determine and print the <Path to Dataset> in the error message

Issues Addressed:

Fixes #2556

@lhoestq (Member) left a comment


Cool, thanks! Looking good so far :)

Comment on lines 73 to 83
def __init__(self, key, duplicate_key_indices=None):
    if duplicate_key_indices:
        self.prefix = f"Found multiple examples with duplicate key: {key}"
        self.err_msg = f"\nThe following examples {' ,'.join(duplicate_key_indices)} have the same key {key} "
        self.suffix = "\nPlease fix the dataset script at <Path to Dataset>"
        super().__init__(f"{self.prefix}{self.err_msg}{self.suffix}")
    else:
        self.prefix = "FAILURE TO GENERATE DATASET !"
        self.err_msg = f"\nFound duplicate Key: {key}"
        self.suffix = "\nKeys should be unique and deterministic in nature"
        super().__init__(f"{self.prefix}{self.err_msg}{self.suffix}")

I think you can always require duplicate_key_indices and completely drop the old FAILURE TO GENERATE DATASET message
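Dropping the fallback branch as suggested, the simplified constructor could look roughly like this (the message wording here is a sketch, not the merged code):

```python
class DuplicatedKeysError(AssertionError):
    # Sketch: always require the colliding indices; no legacy fallback message.
    def __init__(self, key, duplicate_key_indices):
        self.key = key
        self.duplicate_key_indices = duplicate_key_indices
        super().__init__(
            f"Found multiple examples generated with the same key\n"
            f"The examples at index {', '.join(duplicate_key_indices)} have the key {key}\n"
            "Please fix the dataset script to avoid duplicate keys"
        )
```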

@mariosasko (Collaborator) left a comment

One nit from me:

I like this error message format better:

DuplicateKeysError: both 42nd and 1337th examples have the same key `48`.
Please fix the dataset script at <path/to/the/dataset/script>

(Originally proposed by @lhoestq in the referenced issue)

PS: The absolute indices needed for this format can be computed using ArrowWriter's _num_examples attribute.
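To illustrate the idea with a toy sketch (assuming, as with ArrowWriter, that examples are buffered and flushed in batches, and that _num_examples counts the examples already flushed, so an index within the current in-memory batch is only relative):

```python
class BatchedWriter:
    # Toy model of a writer that buffers examples and flushes in batches.
    def __init__(self):
        self._num_examples = 0   # examples already flushed
        self._batch = []         # examples still in memory

    def write(self, example):
        self._batch.append(example)

    def flush(self):
        self._num_examples += len(self._batch)
        self._batch = []

    def absolute_index(self, batch_index):
        # An index into the current batch is relative; adding
        # _num_examples recovers the example's absolute position.
        return self._num_examples + batch_index
```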

@VijayKalmath (Contributor, Author) commented Jun 23, 2022

@mariosasko, apologies, I saw your comment after I pushed ccba4f5.

The current Error looks like :

DuplicatedKeysError: Found multiple examples generated with the same key
The following examples 15, 8784 have the key 519
Please fix the dataset script at datasets/wikicorpus/wikicorpus.py to avoid duplicate keys

Do let me know if you still prefer the original proposal and I shall update it.

DuplicateKeysError: both 42nd and 1337th examples have the same key `48`.
Please fix the dataset script at <path/to/the/dataset/script>

Also, could you elaborate on what you mean by "the absolute indices can be computed using the _num_examples attribute"?
Should the indices be calculated as _num_examples + index?

@lhoestq (Member) left a comment

Nice, thanks!

After your changes feel free to mark this PR as "ready for review" ;)

DuplicateKeysError error does not provide any information regarding
the examples which have the same key. This information is very
helpful for debugging the dataset generator script.
The DuplicatedKeysError class was updated to always expect
duplicate_key_indices, and the old error message was removed.

Path to the dataset in the error message is now built from the dataset
name to make it more user-friendly.
The current index calculation did not compute absolute indices.
The error message used str.replace; changed it to a simple string append.
@VijayKalmath (Contributor, Author) commented

Nice, thanks!

After your changes feel free to mark this PR as "ready for review" ;)

Marking PR ready for review.

@lhoestq Let me know if there is anything else required or if we are good to go ahead and merge.

@VijayKalmath VijayKalmath marked this pull request as ready for review June 27, 2022 18:39
@lhoestq (Member) left a comment

Awesome, thanks! I added a small suggestion to crop the number of indices shown to 20 (to not spam the logs), and to only mention the dataset script name (without assuming it's in datasets/, since that's not the case in general). Here is an example:

DuplicatedKeysError: Found multiple examples generated with the same key
The examples at index 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19... (9980 more) have the key 0
To avoid duplicate keys, please fix the dataset script squad.py
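The cropping in the example above could be sketched like this (the cap constant and helper name are assumptions for illustration, not the merged code):

```python
MAX_SHOWN = 20  # assumed cap on how many indices to print

def format_duplicate_indices(indices):
    # Show at most MAX_SHOWN indices, then summarize how many are hidden.
    shown = ", ".join(str(i) for i in indices[:MAX_SHOWN])
    hidden = len(indices) - MAX_SHOWN
    return shown + (f"... ({hidden} more)" if hidden > 0 else "")
```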

@HuggingFaceDocBuilderDev commented Jun 28, 2022

The documentation is not available anymore as the PR was closed or merged.

@lhoestq lhoestq merged commit 8910eda into huggingface:master Jun 28, 2022


Development

Successfully merging this pull request may close these issues.

Better DuplicateKeysError error to help the user debug the issue

4 participants