[rollout, vllm, sglang] fix: allow user customization of repetition_penalty to avoid watchdog timeout during GRPO rollout
#3309
Could you also modify the args of server mode and sglang? Thanks!

Sure, let me read the related code.
Could you grant me CI test permissions to execute CI pipeline test cases? And please take another look, thank you.

/gemini how to fix the failed sgl CI testcase?
Hi Mighten, I understand you're looking for help with a failed However, I can offer some general guidance on how to approach debugging CI failures, especially for
Focusing on the error messages in the CI logs will be key to understanding the root cause. Let me know if you have specific code snippets or error messages you'd like me to review from a code correctness or best practices perspective, but I cannot directly debug the CI system itself.
/gemini review
Code Review
This pull request introduces a configurable repetition_penalty for SGLang and vLLM rollouts, which is a useful addition to prevent generation loops. The implementation correctly reads the new configuration in the initialization paths. However, in the asynchronous server implementations for both SGLang and vLLM, the new logic incorrectly overrides any repetition_penalty provided by the caller, instead of using the configured value as a default. I've added critical comments with suggestions to fix this behavior to ensure that per-request customization is still possible.
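The difference between overriding and defaulting described in the review can be sketched as follows. This is a minimal illustration only: `merge_sampling_params` and its argument names are hypothetical and do not come from the verl, vLLM, or SGLang code; it assumes the per-request sampling parameters arrive as a plain dict.

```python
from typing import Any, Dict, Optional


def merge_sampling_params(
    request_params: Optional[Dict[str, Any]],
    configured_repetition_penalty: float = 1.0,
) -> Dict[str, Any]:
    """Hypothetical helper: treat the configured penalty as a fallback only."""
    # Copy so the caller's dict is never mutated.
    merged = dict(request_params or {})
    # setdefault keeps a caller-supplied value and fills in the configured
    # value only when the key is absent. Assigning unconditionally
    # (merged["repetition_penalty"] = ...) would be the override behavior
    # the review flags.
    merged.setdefault("repetition_penalty", configured_repetition_penalty)
    return merged


# A per-request value wins over the configured default.
print(merge_sampling_params({"repetition_penalty": 1.2}, 1.05))
# With no per-request value, the configured default is used.
print(merge_sampling_params({"temperature": 0.7}, 1.05))
```

With this shape, a caller that passes its own `repetition_penalty` is never silently overwritten, while requests that omit it still pick up the rollout configuration.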
…penalty` to avoid watchdog timeout during GRPO rollout (volcengine#3309)

Allow user customization of `repetition_penalty` to avoid watchdog timeout during GRPO rollout

### What does this PR do?

This PR adds an interface for users to specify `repetition_penalty`, which helps avoid repetition in LLM generation and prevents watchdog timeouts during GRPO rollout. If not specified, `repetition_penalty` will remain at its default value of `1.0`.

### Checklist Before Starting

- [X] Search for similar PRs. No similar PRs found.
- [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

This PR can be vetted by existing CI test cases.

### API and Usage Example

Previously, users could not specify `repetition_penalty`, but this PR adds support for it.

For example, users can now start GRPO training with a command like:

```bash
python -m verl.trainer.main_ppo \
    +actor_rollout_ref.rollout.repetition_penalty=1.05 \
    # other params here...
```

### Design & Code Changes

This PR adds an interface allowing users to specify the `repetition_penalty` (e.g., `1.05`), while maintaining backward compatibility with the default value of `1.0`.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
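For readers unfamiliar with what `repetition_penalty` does to generation, here is a rough sketch of the commonly used formulation: the logit of every already-seen token is divided by the penalty when positive and multiplied by it when negative. The function name `apply_repetition_penalty` is illustrative, and this is not taken from the vLLM or SGLang source; the engines may differ in details such as whether prompt tokens count as seen.

```python
from typing import List, Sequence


def apply_repetition_penalty(
    logits: Sequence[float],
    seen_token_ids: Sequence[int],
    penalty: float = 1.0,
) -> List[float]:
    """Illustrative only: down-weight tokens that were already generated."""
    adjusted = list(logits)
    # Each seen token is penalized once, no matter how often it occurred.
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Dividing a positive logit and multiplying a negative one both
        # push the token's probability down when penalty > 1.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


# Tokens 0 and 1 were already generated; token 2 was not.
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1, 1], penalty=2.0))
```

Under this formulation a penalty of `1.0` leaves the logits unchanged, which is consistent with the PR keeping `1.0` as the backward-compatible default, and a value slightly above it such as `1.05` only mildly discourages repeats.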