
Conversation

@lhoestq (Member) commented Aug 13, 2020

The unique_no_split_tokens attribute of tokenizers is not deterministic, and it makes the hashing in the nlp lib return different hashes for the same tokenizer over different sessions.

To fix that I changed its type to a set instead of a list.

Fix #6460
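
For context, here is a minimal sketch of the failure mode (the `fingerprint` helper below is a hypothetical stand-in for the hashing done in the nlp lib, not its actual code): when the attribute's order comes from a set, Python's per-process string-hash randomization can change the iteration order between sessions, so the serialized bytes, and hence the hash, change too.

```python
import hashlib
import pickle

def fingerprint(obj) -> str:
    # Hypothetical stand-in for the nlp lib's hashing:
    # hash the serialized state of the object.
    return hashlib.md5(pickle.dumps(obj)).hexdigest()

# unique_no_split_tokens is populated via set operations, so the order of the
# resulting list can depend on the per-process string hash seed (PYTHONHASHSEED)
# and therefore differ from one Python session to the next.
tokens = {"<s>", "</s>", "<pad>", "<unk>", "<mask>"}
unique_no_split_tokens = list(tokens)

print(fingerprint(unique_no_split_tokens))          # can change between sessions
print(fingerprint(sorted(unique_no_split_tokens)))  # stable across sessions
```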

@lhoestq requested review from LysandreJik and sgugger on August 13, 2020 at 12:45
@codecov bot commented Aug 13, 2020

Codecov Report

Merging #6461 into master will decrease coverage by 0.47%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #6461      +/-   ##
==========================================
- Coverage   80.09%   79.62%   -0.48%     
==========================================
  Files         153      153              
  Lines       28005    28005              
==========================================
- Hits        22430    22298     -132     
- Misses       5575     5707     +132     
Impacted Files Coverage Δ
src/transformers/tokenization_utils.py 90.40% <100.00%> (ø)
src/transformers/modeling_tf_electra.py 25.13% <0.00%> (-70.95%) ⬇️
src/transformers/modeling_tf_utils.py 86.97% <0.00%> (-0.33%) ⬇️
src/transformers/file_utils.py 82.44% <0.00%> (+0.25%) ⬆️
src/transformers/modeling_tf_gpt2.py 94.72% <0.00%> (+22.87%) ⬆️
src/transformers/tokenization_albert.py 87.50% <0.00%> (+58.65%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9d94aec...dfb7549.

@sgugger (Collaborator) left a comment

This looks good to me; however, it might break existing code if a user relies on list behavior for this attribute (probably a long shot). Would sorting the list solve the issue as well? That would let us keep the same exposed API. If not, no problem with using a set.

@thomwolf (Member)

I actually switched from a set to a list in the last version of transformers for the same reason (it was not deterministic for nlp). Are you sure this really solves the problem, @lhoestq?
Also, regarding backward compatibility, I'm fine with changing this from a list to a set, @sgugger.

@thomwolf (Member)

Maybe we should use a sorted list instead?

@lhoestq (Member, Author) commented Aug 13, 2020

sorted should solve the issue.

I just tested, and a set doesn't actually solve it. I'll switch to a sorted list, thanks @thomwolf.
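
For reference, a minimal sketch of the sorted-list approach (the class and method names below are made up for illustration; the real change lives in src/transformers/tokenization_utils.py): the attribute stays a plain list, so user code relying on list behavior keeps working, but its order no longer depends on set iteration order.

```python
class TokenizerSketch:
    """Illustration only, not the actual transformers tokenizer."""

    def __init__(self):
        # Stays a list for backward compatibility with user code.
        self.unique_no_split_tokens = []

    def add_no_split_tokens(self, tokens):
        # Deduplicate with a set, then sort so the stored order never depends
        # on set iteration order (which varies across Python sessions).
        self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens) | set(tokens))


tok = TokenizerSketch()
tok.add_no_split_tokens(["<pad>", "<s>", "</s>"])
tok.add_no_split_tokens(["<mask>", "<pad>"])
print(tok.unique_no_split_tokens)  # ['</s>', '<mask>', '<pad>', '<s>'] in every session
```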

@thomwolf (Member)

This is such an important use case (and potential source of regressions) for us that we may want to add a test for it in nlp or transformers in the not-too-distant future.

@lhoestq (Member, Author) commented Aug 13, 2020

Yes, definitely. I'm not sure how to test consistency across sessions in the CI, though.
I guess we could have tests with hardcoded hashes for some tokenizers, but I'm not sure that's ideal.

Or maybe there's a way to run two CI jobs in a row: one to generate the hashes in a first session, and another to verify that they're the same in a second session.
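
One possible shape for such a regression test (everything below is hypothetical; a real test would load an actual tokenizer and use the nlp lib's own fingerprinting): compute the hash in two freshly spawned interpreters, which get different string-hash seeds as long as PYTHONHASHSEED isn't pinned, and assert that the results match.

```python
import subprocess
import sys

# Snippet executed in a child interpreter; md5-of-pickle stands in for the
# nlp lib's hashing, and the sorted token list mimics the fixed tokenizer state.
SNIPPET = """
import hashlib, pickle
tokens = sorted({"<s>", "</s>", "<pad>", "<unk>", "<mask>"})
print(hashlib.md5(pickle.dumps(tokens)).hexdigest())
"""

def hash_in_fresh_session() -> str:
    # Each child process gets its own random string-hash seed (assuming
    # PYTHONHASHSEED is not set in the environment), mimicking a new session.
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

def test_hash_is_deterministic_across_sessions():
    assert hash_in_fresh_session() == hash_in_fresh_session()

if __name__ == "__main__":
    test_hash_is_deterministic_across_sessions()
    print("hashes match across fresh interpreter sessions")
```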

@LysandreJik (Member) left a comment

LGTM

@lhoestq changed the title from "Change unique_no_split_tokens's type to set" to "Sort unique_no_split_tokens to make it deterministic" on Aug 14, 2020
@lhoestq merged commit 9a8c168 into huggingface:master on Aug 14, 2020
sgugger added a commit that referenced this pull request on Aug 14, 2020
* Generation doc

* MBartForConditionalGeneration (#6441)

* add MBartForConditionalGeneration

* style

* rebase and fixes

* add mbart test in TEST_FILES_WITH_NO_COMMON_TESTS

* fix docs

* don't ignore mbart

* doc

* fix mbart fairseq link

* put mbart before bart

* apply doc suggestions

* Use hash to clean the test dirs (#6475)

* Use hash to clean the test dirs

* Use hash to clean the test dirs

* Use hash to clean the test dirs

* fix

* [EncoderDecoder] Add Cross Attention for GPT2 (#6415)

* add cross attention layers for gpt2

* make gpt2 cross attention work

* finish bert2gpt2

* add explicit comments

* remove attention mask since not yet supported

* revert attn mask in pipeline

* Update src/transformers/modeling_gpt2.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/modeling_encoder_decoder.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>

* Sort unique_no_split_tokens to make it deterministic (#6461)

* change unique_no_split_tokens's type to set

* use sorted list instead of set

* style

* Import accuracy_score (#6480)

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Address comments

* Styling

* Generation doc

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <[email protected]>

* Address comments

* Styling

Co-authored-by: Suraj Patil <[email protected]>
Co-authored-by: Kevin Canwen Xu <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: gijswijnholds <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>

Development

Successfully merging this pull request may close these issues.

Hashing a tokenizer using the 🤗 nlp lib is not deterministic
