Sort unique_no_split_tokens to make it deterministic #6461
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #6461      +/-   ##
==========================================
- Coverage   80.09%   79.62%   -0.48%
==========================================
  Files         153      153
  Lines       28005    28005
==========================================
- Hits        22430    22298     -132
- Misses       5575     5707     +132
Continue to review full report at Codecov.
sgugger left a comment:
This looks good to me; however, it might break existing code if a user relies on list operations on this attribute (probably a long shot). Would sorting the list solve the issue as well? That would allow us to keep the same exposed API. If not, no problem with using a set.
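For illustration, a minimal sketch of the two options being weighed here (simplified, not the actual tokenizer code; `collected_tokens` is a hypothetical stand-in for whatever populates the attribute):

```python
# Sketch of the two options discussed above (not the actual transformers code).
# `collected_tokens` is a hypothetical stand-in for the values that end up in
# unique_no_split_tokens.
collected_tokens = ["<mask>", "<pad>", "<s>", "</s>", "<pad>"]

# Option A: keep the attribute a list, but deduplicate and sort it.
# Existing code that indexes or slices the attribute keeps working.
unique_no_split_tokens = sorted(set(collected_tokens))

# Option B: make the attribute a set. Deduplication comes for free, but
# list-only operations (e.g. unique_no_split_tokens[0]) would break, and the
# iteration order of a set of strings is not stable across Python sessions.
unique_no_split_tokens_set = set(collected_tokens)
```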
Maybe we should rather have a sorted list?
I just tested and a
This is such an important use-case (and potential source of regression) for us that we may want to add a test on that in
Yes, definitely. Not sure how to test consistency across sessions in the CI though. Or maybe there's a way to do two CI jobs in a row: one to generate the hashes in a first session, and one to verify that they're the same in another session.
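One way to approximate "two sessions" without chaining CI jobs would be to recompute the fingerprint in a fresh interpreter spawned from the test itself. A sketch under that assumption (this is not the test that was actually added, and `fingerprint` here is just a hash over the serialized token list):

```python
# Sketch: verify that a token-list fingerprint is stable across interpreter
# sessions by recomputing it in a fresh subprocess (which gets its own string
# hash seed). Not the test added to the repo; just one possible approach.
import hashlib
import json
import subprocess
import sys

TOKENS = {"<s>", "</s>", "<pad>", "<mask>"}

def fingerprint(tokens):
    # Deterministic as long as `tokens` is in a deterministic order.
    return hashlib.sha256(json.dumps(list(tokens)).encode("utf-8")).hexdigest()

CHILD_SNIPPET = """
import hashlib, json
tokens = sorted({"<s>", "</s>", "<pad>", "<mask>"})
print(hashlib.sha256(json.dumps(tokens).encode("utf-8")).hexdigest())
"""

def test_fingerprint_is_session_independent():
    here = fingerprint(sorted(TOKENS))
    # -I starts an isolated interpreter, so string hashing (and therefore
    # set/dict ordering effects) is re-seeded independently of this process.
    there = subprocess.run(
        [sys.executable, "-I", "-c", CHILD_SNIPPET],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    assert here == there

if __name__ == "__main__":
    test_fingerprint_is_session_independent()
    print("fingerprints match across sessions")
```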
LysandreJik left a comment:
LGTM
* Generation doc
* MBartForConditionalGeneration (#6441)
* add MBartForConditionalGeneration
* style
* rebase and fixes
* add mbart test in TEST_FILES_WITH_NO_COMMON_TESTS
* fix docs
* don't ignore mbart
* doc
* fix mbart fairseq link
* put mbart before bart
* apply doc suggestions
* Use hash to clean the test dirs (#6475)
* Use hash to clean the test dirs
* Use hash to clean the test dirs
* Use hash to clean the test dirs
* fix
* [EncoderDecoder] Add Cross Attention for GPT2 (#6415)
* add cross attention layers for gpt2
* make gpt2 cross attention work
* finish bert2gpt2
* add explicit comments
* remove attention mask since not yet supported
* revert attn mask in pipeline
* Update src/transformers/modeling_gpt2.py
  Co-authored-by: Sylvain Gugger <[email protected]>
* Update src/transformers/modeling_encoder_decoder.py
  Co-authored-by: Sylvain Gugger <[email protected]>
  Co-authored-by: Sylvain Gugger <[email protected]>
* Sort unique_no_split_tokens to make it deterministic (#6461)
* change unique_no_split_tokens's type to set
* use sorted list instead of set
* style
* Import accuracy_score (#6480)
* Apply suggestions from code review
  Co-authored-by: Lysandre Debut <[email protected]>
* Address comments
* Styling
* Generation doc
* Apply suggestions from code review
  Co-authored-by: Lysandre Debut <[email protected]>
* Address comments
* Styling

Co-authored-by: Suraj Patil <[email protected]>
Co-authored-by: Kevin Canwen Xu <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: gijswijnholds <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
The `unique_no_split_tokens` attribute of tokenizers is not deterministic, and it makes the hashing in the `nlp` lib return different hashes for the same tokenizer over different sessions. To fix that, I changed its type to a `set` instead of a `list`.

Fix #6460
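For context, a self-contained sketch of the failure mode described above (hypothetical code, not the `nlp` lib's actual fingerprinting): when the attribute is built from a set of strings, its order follows the per-session hash seed, so any digest computed over its serialized form can change between runs; sorting makes the digest reproducible.

```python
# Hypothetical demonstration of the bug, not the nlp lib's real hashing code.
# Run this script in two separate interpreter sessions: the "unsorted" digest
# can differ between runs (string hashing is seeded per session, and set
# iteration order follows that seed), while the "sorted" digest never changes.
import hashlib
import json

special_tokens = {"<s>", "</s>", "<pad>", "<mask>", "<unk>"}

unsorted_tokens = list(special_tokens)  # order depends on the session's hash seed
sorted_tokens = sorted(special_tokens)  # order is reproducible

def digest(tokens):
    return hashlib.md5(json.dumps(tokens).encode("utf-8")).hexdigest()

print("unsorted:", digest(unsorted_tokens))
print("sorted:  ", digest(sorted_tokens))
```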