-
Notifications
You must be signed in to change notification settings - Fork 31.3k
Make OpenAIGPTTokenizer work with SpaCy 2.x and 3.x #15019
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I'm not able to test the changes locally. When I run |
|
@patrickvonplaten @LysandreJik Not sure why but I can't tag you as reviewers on GitHub, so tagging you in comments here. |
patrickvonplaten
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks good to me. @LysandreJik do you think it's worth adding a test the depends on a spacys install?
|
Yes, why not! GPT-2 is among our most used models, so I think testing that the tokenization behaves correctly across all possibilities is important. Would you like to take a stab at it @cody-moveworks? In order to do so, you could start by adding an transformers/src/transformers/file_utils.py Lines 507 to 508 in 2e9af29
Then it would require defining a transformers/src/transformers/testing_utils.py Lines 404 to 412 in 2e9af29
Thirdly, you can add a test in the And finally, we can modify this CircleCI run: transformers/.circleci/config.yml Lines 538 to 565 in 2e9af29
So that it:
|
|
@LysandreJik Thanks for the detailed walkthrough of the changes needed to add testing. I'll take a stab at it. |
d951e6a to
fe57356
Compare
SpaCy 3.x introduced an API change to creating the tokenizer that breaks OpenAIGPTTokenizer. The old API for creating the tokenizer in SpaCy 2.x no longer works under SpaCy 3.x, but the new API for creating the tokenizer in SpaCy 3.x DOES work under SpaCy 2.x. Switching to the new API should allow OpenAIGPTTokenizer to work under both SpaCy 2.x and SpaCy 3.x versions.
fe57356 to
2a19f40
Compare
|
@LysandreJik I made the changes and all checks are passing. Can you take a look? |
LysandreJik
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's very nice! This looks good to me, @cody-moveworks, tested it locally.
Thank you, merging!
SpaCy 3.x introduced an API change to creating the tokenizer that
breaks OpenAIGPTTokenizer. The old API for creating the tokenizer in
SpaCy 2.x no longer works under SpaCy 3.x, but the new API for creating
the tokenizer in SpaCy 3.x DOES work under SpaCy 2.x. Switching to the
new API should allow OpenAIGPTTokenizer to work under both SpaCy 2.x and
SpaCy 3.x versions.
Fixes #14449