
Conversation

@lhoestq (Member) commented Aug 13, 2020

The unique_no_split_tokens attribute of tokenizers is not deterministic, and it makes the hashing in the nlp lib return different hashes for the same tokenizer over different sessions.

To fix that I changed its type to a set instead of a list.

Fix #6460
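
For context, here is a minimal sketch of the failure mode (the `fingerprint` helper below is a hypothetical stand-in for the hashing done in the nlp lib, not its actual code): when the attribute's order comes from a set, Python's per-process string-hash randomization can change the iteration order between sessions, so the serialized bytes, and hence the hash, change too.

```python
import hashlib
import pickle

def fingerprint(obj) -> str:
    # Hypothetical stand-in for the nlp lib's hashing:
    # hash the serialized state of the object.
    return hashlib.md5(pickle.dumps(obj)).hexdigest()

# unique_no_split_tokens is populated via set operations, so the order of the
# resulting list can depend on the per-process string hash seed (PYTHONHASHSEED)
# and therefore differ from one Python session to the next.
tokens = {"<s>", "</s>", "<pad>", "<unk>", "<mask>"}
unique_no_split_tokens = list(tokens)

print(fingerprint(unique_no_split_tokens))          # can change between sessions
print(fingerprint(sorted(unique_no_split_tokens)))  # stable across sessions
```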

@lhoestq requested review from LysandreJik and sgugger on August 13, 2020 at 12:45
@codecov bot commented Aug 13, 2020

Codecov Report

Merging #6461 into master will decrease coverage by 0.47%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #6461      +/-   ##
==========================================
- Coverage   80.09%   79.62%   -0.48%     
==========================================
  Files         153      153              
  Lines       28005    28005              
==========================================
- Hits        22430    22298     -132     
- Misses       5575     5707     +132     
Impacted Files Coverage Δ
src/transformers/tokenization_utils.py 90.40% <100.00%> (ø)
src/transformers/modeling_tf_electra.py 25.13% <0.00%> (-70.95%) ⬇️
src/transformers/modeling_tf_utils.py 86.97% <0.00%> (-0.33%) ⬇️
src/transformers/file_utils.py 82.44% <0.00%> (+0.25%) ⬆️
src/transformers/modeling_tf_gpt2.py 94.72% <0.00%> (+22.87%) ⬆️
src/transformers/tokenization_albert.py 87.50% <0.00%> (+58.65%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9d94aec...dfb7549.

@sgugger (Collaborator) left a comment

This looks good to me; however, it might break existing code if a user relies on list behavior for this attribute (probably a long shot). Would sorting the list solve the issue as well? That would let us keep the same exposed API. If not, no problem with using a set.

@thomwolf (Member)

I actually switched from a set to a list in the last version of transformers for the same reason (it was not deterministic for nlp). Are you sure this really solves the problem, @lhoestq?
Also, regarding backward compatibility, I'm fine with changing this from a list to a set, @sgugger.

@thomwolf (Member)

Maybe we should use a sorted list instead?

@lhoestq (Member, Author) commented Aug 13, 2020

sorted should solve the issue.

I just tested, and a set doesn't actually solve it. I'll switch to a sorted list, thanks @thomwolf.
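
For reference, a minimal sketch of the sorted-list approach (the class and method names below are made up for illustration; the real change lives in src/transformers/tokenization_utils.py): the attribute stays a plain list, so user code relying on list behavior keeps working, but its order no longer depends on set iteration order.

```python
class TokenizerSketch:
    """Illustration only, not the actual transformers tokenizer."""

    def __init__(self):
        # Stays a list for backward compatibility with user code.
        self.unique_no_split_tokens = []

    def add_no_split_tokens(self, tokens):
        # Deduplicate with a set, then sort so the stored order never depends
        # on set iteration order (which varies across Python sessions).
        self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens) | set(tokens))


tok = TokenizerSketch()
tok.add_no_split_tokens(["<pad>", "<s>", "</s>"])
tok.add_no_split_tokens(["<mask>", "<pad>"])
print(tok.unique_no_split_tokens)  # ['</s>', '<mask>', '<pad>', '<s>'] in every session
```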

@thomwolf (Member)

This is such an important use case (and potential source of regressions) for us that we may want to add a test for it in nlp or transformers in the not-too-distant future.

@lhoestq (Member, Author) commented Aug 13, 2020

Yes, definitely. I'm not sure how to test consistency across sessions in the CI, though.
I guess we could have tests with hardcoded hashes for some tokenizers, but I'm not sure that's ideal.

Or maybe there's a way to run two CI jobs in a row: one to generate the hashes in a first session, and another to verify that they're the same in a second session.
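
One possible shape for such a regression test (everything below is hypothetical; a real test would load an actual tokenizer and use the nlp lib's own fingerprinting): compute the hash in two freshly spawned interpreters, which get different string-hash seeds as long as PYTHONHASHSEED isn't pinned, and assert that the results match.

```python
import subprocess
import sys

# Snippet executed in a child interpreter; md5-of-pickle stands in for the
# nlp lib's hashing, and the sorted token list mimics the fixed tokenizer state.
SNIPPET = """
import hashlib, pickle
tokens = sorted({"<s>", "</s>", "<pad>", "<unk>", "<mask>"})
print(hashlib.md5(pickle.dumps(tokens)).hexdigest())
"""

def hash_in_fresh_session() -> str:
    # Each child process gets its own random string-hash seed (assuming
    # PYTHONHASHSEED is not set in the environment), mimicking a new session.
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

def test_hash_is_deterministic_across_sessions():
    assert hash_in_fresh_session() == hash_in_fresh_session()

if __name__ == "__main__":
    test_hash_is_deterministic_across_sessions()
    print("hashes match across fresh interpreter sessions")
```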

@LysandreJik (Member) left a comment

LGTM

@lhoestq changed the title from "Change unique_no_split_tokens's type to set" to "Sort unique_no_split_tokens to make it deterministic" on Aug 14, 2020
@lhoestq merged commit 9a8c168 into huggingface:master on Aug 14, 2020
sgugger added a commit that referenced this pull request on Aug 14, 2020
* Generation doc

* MBartForConditionalGeneration (#6441)

* add MBartForConditionalGeneration

* style

* rebase and fixes

* add mbart test in TEST_FILES_WITH_NO_COMMON_TESTS

* fix docs

* don't ignore mbart

* doc

* fix mbart fairseq link

* put mbart before bart

* apply doc suggestions

* Use hash to clean the test dirs (#6475)

* Use hash to clean the test dirs

* Use hash to clean the test dirs

* Use hash to clean the test dirs

* fix

* [EncoderDecoder] Add Cross Attention for GPT2 (#6415)

* add cross attention layers for gpt2

* make gpt2 cross attention work

* finish bert2gpt2

* add explicit comments

* remove attention mask since not yet supported

* revert attn mask in pipeline

* Update src/transformers/modeling_gpt2.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/modeling_encoder_decoder.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Sort unique_no_split_tokens to make it deterministic (#6461)

* change unique_no_split_tokens's type to set

* use sorted list instead of set

* style

* Import accuracy_score (#6480)

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Address comments

* Styling

* Generation doc

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Address comments

* Styling

Co-authored-by: Suraj Patil <[email protected]>
Co-authored-by: Kevin Canwen Xu <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: gijswijnholds <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>

Development

Successfully merging this pull request may close these issues.

Hashing a tokenizer using the 🤗 nlp lib is not deterministic
