Remove tokenizer creation from sft example script
#4197
We also have
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
lgtm!
Unless we have a good reason to create the tokenizer outside the trainer, I think we should let the trainer create it. I'm fine with removing it from the other scripts now if you want.
commit 65eb45c, Quentin Gallouédec, Mon Oct 6 13:07:18 2025 -0600: Apply style and revert change in `sft_video_llm` example (#4214)
commit ae6837f, Sergio Paniego Blanco, Mon Oct 6 18:40:18 2025 +0200: Removed tokenizer/processor creation from example scripts (#4211)
commit 56a8f11, Albert Villanova del Moral, Mon Oct 6 17:45:44 2025 +0200: Replace setup with pyproject and fix packaging unintended modules (#4194)
commit 5291015, Sergio Paniego Blanco, Mon Oct 6 16:04:06 2025 +0200: Remove `Optional` from `processing_class` in `PPOTrainer` (#4212)
commit 0588b1f, Sergio Paniego Blanco, Mon Oct 6 15:57:17 2025 +0200: Updated vLLM integration guide (#4162) (Co-authored-by: Quentin Gallouédec)
commit 45ee98b, Albert Villanova del Moral, Mon Oct 6 11:14:54 2025 +0200: Replace unittest with pytest (#4188)
commit 3800a6e, Albert Villanova del Moral, Mon Oct 6 11:13:21 2025 +0200: Hotfix: Exclude transformers 4.57.0 for Python 3.9 (#4209) (Co-authored-by: Sergio Paniego Blanco)
commit 7ad9ce8, Sergio Paniego Blanco, Mon Oct 6 11:04:20 2025 +0200: Remove tokenizer creation from `sft` example script (#4197)
commit 0c2dc14, Albert Villanova del Moral, Mon Oct 6 08:31:58 2025 +0200: Remove custome_container for building the docs (#4198)
commit ced8b33, burtenshaw, Mon Oct 6 08:23:11 2025 +0200: [DOCS/FIX] lora without regrets - fix lr (#4207)
What does this PR do?
In the `sft` example script, the tokenizer is currently created before the trainer. This causes errors for VLMs, which use a processor rather than a plain tokenizer. The tokenizer creation can safely be removed from the script because the trainer creates it internally.
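As a rough sketch of the pattern this change moves the script toward (not the actual diff), the trainer can be handed a model identifier and no `processing_class`, leaving it to load the matching tokenizer or processor itself. The helper name `build_trainer`, its parameters, and the default output directory are illustrative assumptions, not names from the script.

```python
def build_trainer(model_id: str, dataset_name: str, output_dir: str = "sft-output"):
    """Build an SFTTrainer without creating a tokenizer or processor by hand.

    Hypothetical helper for illustration; not part of the TRL example script.
    """
    # Imported lazily so that defining this helper does not itself require
    # TRL or datasets to be installed.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    train_dataset = load_dataset(dataset_name, split="train")

    # No AutoTokenizer.from_pretrained(...) call and no processing_class
    # argument: the trainer is left to load the tokenizer (or, for a VLM,
    # the processor) that matches model_id.
    return SFTTrainer(
        model=model_id,
        args=SFTConfig(output_dir=output_dir),
        train_dataset=train_dataset,
    )
```

Calling, for example, `build_trainer("Qwen/Qwen2.5-0.5B", "trl-lib/Capybara").train()` would download the model and dataset and start training; both identifiers are placeholders chosen for the example.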
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.