@cody-moveworks
Contributor

SpaCy 3.x changed the API for creating the tokenizer, and the change
breaks OpenAIGPTTokenizer. The SpaCy 2.x API for creating the tokenizer
no longer works under SpaCy 3.x, but the SpaCy 3.x API for creating the
tokenizer DOES work under SpaCy 2.x. Switching to the new API should
allow OpenAIGPTTokenizer to work under both SpaCy 2.x and SpaCy 3.x.

Fixes #14449
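
The description above does not name the two calls. As a minimal sketch, assuming the old API is `Defaults.create_tokenizer` and the new API is the `tokenizer` attribute on the `English` object, the switch might look like the following. The function name `build_word_tokenizer` and the whitespace fallback are hypothetical and only there so the sketch runs without SpaCy installed.

```python
def build_word_tokenizer():
    """Return a callable that maps a string to a list of token strings."""
    try:
        from spacy.lang.en import English
    except ImportError:
        # Assumption for this sketch: fall back to whitespace splitting
        # when SpaCy is not installed.
        return lambda text: text.split()

    nlp = English()
    # Assumed old call (SpaCy 2.x only): nlp.Defaults.create_tokenizer(nlp)
    # Assumed new call (SpaCy 2.x and 3.x): the `tokenizer` attribute.
    spacy_tokenizer = nlp.tokenizer
    return lambda text: [token.text for token in spacy_tokenizer(text)]
```

Calling `build_word_tokenizer()("Hello world")` yields one string per token whichever branch is taken.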

@cody-moveworks
Contributor Author

I'm not able to test the changes locally. When I run pip install -e ".[dev]", I encounter a build error and the install fails. It doesn't look like pytest coverage is sufficient here either since SpaCy and ftfy are not required by transformers, and the code can execute correctly without them being installed.

@cody-moveworks marked this pull request as ready for review on January 3, 2022 23:15
@cody-moveworks
Contributor Author

@patrickvonplaten @LysandreJik Not sure why but I can't tag you as reviewers on GitHub, so tagging you in comments here.

Contributor

@patrickvonplaten left a comment

That looks good to me. @LysandreJik do you think it's worth adding a test that depends on a SpaCy install?

@LysandreJik
Member

Yes, why not! GPT-2 is among our most used models, so I think testing that the tokenization behaves correctly across all possibilities is important. Would you like to take a stab at it @cody-moveworks?

In order to do so, you could start by adding an is_spacy_available function in file_utils.py, analogous to other functions such as is_vision_available here:

def is_vision_available():
    return importlib.util.find_spec("PIL") is not None
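
Following the same pattern, a SpaCy check could be sketched as below. This is an assumption about what the suggested helper would look like, not the code that was merged; it only relies on `importlib.util.find_spec` from the standard library.

```python
import importlib.util


def is_spacy_available():
    # True when the `spacy` package can be located, without importing it.
    return importlib.util.find_spec("spacy") is not None
```

Because `find_spec` only locates the package, the check stays cheap and has no side effects when SpaCy is absent.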

Then it would require defining a require_spacy unittest decorator in testing_utils.py, such as the require_vision here:

def require_vision(test_case):
    """
    Decorator marking a test that requires the vision dependencies. These tests are skipped when torchaudio isn't
    installed.
    """
    if not is_vision_available():
        return unittest.skip("test requires vision")(test_case)
    else:
        return test_case
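
A `require_spacy` decorator modeled on `require_vision` might look like this. It is a sketch under the assumption that `is_spacy_available` is implemented with `find_spec`, as suggested above; the skip message is illustrative.

```python
import importlib.util
import unittest


def is_spacy_available():
    # Assumed helper, mirroring is_vision_available.
    return importlib.util.find_spec("spacy") is not None


def require_spacy(test_case):
    """
    Decorator marking a test that requires SpaCy. These tests are skipped
    when SpaCy isn't installed.
    """
    if not is_spacy_available():
        return unittest.skip("test requires spacy")(test_case)
    else:
        return test_case
```

When SpaCy is installed the decorator returns the test unchanged; otherwise `unittest.skip` wraps it so the runner reports it as skipped rather than failed.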

Thirdly, you can add a test in the tests/test_tokenization_gpt2.py file with the @require_spacy decorator, which will only run when SpaCy is installed.
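
As an illustration of the shape such a test could take, here is a self-contained sketch. The test class, test name, and input sentence are hypothetical, and the test exercises SpaCy's `English().tokenizer` directly rather than the transformers tokenizer, so it shows the skipping behavior and not the test that was eventually added.

```python
import importlib.util
import unittest


def require_spacy(test_case):
    """Hypothetical decorator: skip the test when SpaCy is not installed."""
    if importlib.util.find_spec("spacy") is None:
        return unittest.skip("test requires spacy")(test_case)
    return test_case


class SpacyTokenizationTest(unittest.TestCase):
    @require_spacy
    def test_english_tokenizer_splits_punctuation(self):
        # Imported inside the test so the module still loads without SpaCy.
        from spacy.lang.en import English

        tokenizer = English().tokenizer
        tokens = [token.text for token in tokenizer("Hello world!")]
        self.assertEqual(tokens, ["Hello", "world", "!"])
```

Run under pytest or `python -m unittest`, the test executes only where SpaCy is present and is reported as skipped everywhere else.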

And finally, we can modify this CircleCI run:

run_tests_custom_tokenizers:
    working_directory: ~/transformers
    docker:
        - image: circleci/python:3.7
    environment:
        RUN_CUSTOM_TOKENIZERS: yes
        TRANSFORMERS_IS_CI: yes
    steps:
        - checkout
        - restore_cache:
              keys:
                  - v0.4-custom_tokenizers-{{ checksum "setup.py" }}
                  - v0.4-{{ checksum "setup.py" }}
        - run: pip install --upgrade pip
        - run: pip install .[ja,testing,sentencepiece,jieba]
        - run: python -m unidic download
        - save_cache:
              key: v0.4-custom_tokenizers-{{ checksum "setup.py" }}
              paths:
                  - '~/.cache/pip'
        - run: |
              if [ -f test_list.txt ]; then
                python -m pytest -s --make-reports=tests_custom_tokenizers ./tests/test_tokenization_bert_japanese.py | tee tests_output.txt
              fi
        - store_artifacts:
              path: ~/transformers/tests_output.txt
        - store_artifacts:
              path: ~/transformers/reports

So that it:

  1. Installs SpaCy
  2. Runs the tokenization GPT-2 test file!

@cody-moveworks
Contributor Author

@LysandreJik Thanks for the detailed walkthrough of the changes needed to add testing. I'll take a stab at it.

@cody-moveworks force-pushed the fix_tokenizer branch 3 times, most recently from d951e6a to fe57356, on January 7, 2022 07:15
@cody-moveworks
Contributor Author

@LysandreJik I made the changes and all checks are passing. Can you take a look?

Member

@LysandreJik left a comment

That's very nice! This looks good to me, @cody-moveworks, tested it locally.

Thank you, merging!

@LysandreJik merged commit a54961c into huggingface:master on Jan 10, 2022
Successfully merging this pull request may close these issues.

OpenAIGPTTokenizer does not work with spacy 3.x installed
