ALBERT Tokenizer integration test #9943

Merged

LysandreJik merged 3 commits into master from albert-tokenizer-integration-tests on Feb 2, 2021

Conversation

@LysandreJik
Member

Implements an integration test for the ALBERT tokenizer.

Comment thread: tests/test_tokenization_albert.py (Outdated)

expected_encoding = {
"input_ids": [2, 2953, 45, 21, 13, 10601, 11502, 26, 1119, 8, 8542, 3762, 69, 2477, 16, 816, 18667, 3],
"token_type_ids": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
Contributor

Maybe make a batched integration test to make it a bit more aggressive and make sure attention_mask is correctly created?

@LysandreJik
Member Author

Good point!
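For reference, a batched check along the lines suggested above could look like the sketch below. The checkpoint name and example sentences are assumptions for illustration only; the actual test added in this PR uses its own fixed inputs and expected ids.

```python
from transformers import AlbertTokenizer

# Illustrative checkpoint; the PR's test pins its own model and inputs.
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")

# Hypothetical sequences of different lengths so that padding actually kicks in.
sequences = [
    "ALBERT uses a factorized embedding parameterization.",
    "Cross-layer parameter sharing keeps the parameter count from growing with network depth.",
]

encoding = tokenizer(sequences, padding=True)

# With padding enabled, every batch entry should carry an attention_mask of the
# same length as its input_ids, with 1s on real tokens and 0s on padding positions.
for input_ids, attention_mask in zip(encoding["input_ids"], encoding["attention_mask"]):
    assert len(input_ids) == len(attention_mask)
    assert all(mask in (0, 1) for mask in attention_mask)
```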

"The first one is a factorized embedding parameterization. By decomposing the large vocabulary embedding matrix into two small matrices, we separate the size of the hidden layers from the size of vocabulary embedding."
]

encoding = tokenizer(sequences, padding=True)
@patrickvonplaten
Contributor
Feb 2, 2021

(nit & question) What is the recommended way of handling padding? Isn't it rather padding="longest"?

Member Author

padding=True is equivalent to "longest", False to "do_not_pad", and then there's "max_length". Any of those five values work, and they're all mentioned in the docs, so I would argue they're all recommended.
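To make those equivalences concrete, here is a minimal sketch; the checkpoint and sentences are illustrative assumptions, not taken from the test.

```python
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
sequences = ["A short sequence.", "A noticeably longer second sequence that needs padding."]

# padding=True and padding="longest" both pad to the longest sequence in the batch.
longest_a = tokenizer(sequences, padding=True)
longest_b = tokenizer(sequences, padding="longest")
assert longest_a["input_ids"] == longest_b["input_ids"]

# padding=False and padding="do_not_pad" leave each sequence at its natural length.
unpadded = tokenizer(sequences, padding=False)
assert len(unpadded["input_ids"][0]) != len(unpadded["input_ids"][1])

# padding="max_length" pads every sequence up to max_length.
to_max = tokenizer(sequences, padding="max_length", max_length=32)
assert all(len(ids) == 32 for ids in to_max["input_ids"])
```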

@patrickvonplaten (Contributor) left a comment

Great, thanks for adding the test!

@LysandreJik merged commit 1809de5 into master on Feb 2, 2021
@LysandreJik deleted the albert-tokenizer-integration-tests branch on February 2, 2021, 09:39