Commit 8616165

Mightentechkang authored and committed
[rollout, vllm, sglang] fix: allow user customization of repetition_penalty to avoid watchdog timeout during GRPO rollout (volcengine#3309)
Allow user customization of `repetition_penalty` to avoid watchdog timeout during GRPO rollout

### What does this PR do?

This PR adds an interface for users to specify `repetition_penalty`, which helps avoid repetition in LLM generation and prevents watchdog timeouts during GRPO rollout. If not specified, `repetition_penalty` remains at its default value of `1.0`.

### Checklist Before Starting

- [x] Search for similar PRs. No similar PRs found.
- [x] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,`, e.g. `[megatron, fsdp, doc]`
  - `{type}` is one of `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

This PR can be vetted by existing CI test cases.

### API and Usage Example

Previously, users could not specify `repetition_penalty`; this PR adds support for it. For example, users can now start GRPO training with a command like:

```bash
python -m verl.trainer.main_ppo \
    +actor_rollout_ref.rollout.repetition_penalty=1.05 \
    # other params here...
```

### Design & Code Changes

This PR adds an interface allowing users to specify `repetition_penalty` (e.g., `1.05`), while maintaining backward compatibility with the default value of `1.0`.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
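The change follows one pattern in all four touched files: resolve `repetition_penalty` from the rollout config with a fallback of `1.0`, and only fill it into the per-request sampling parameters when the caller has not already supplied one. Below is a minimal sketch of that precedence rule, using plain dicts as stand-ins for verl's actual config and sampling objects (names are illustrative only):

```python
# Minimal sketch of the precedence rule introduced by this PR.
# The dicts below are illustrative stand-ins, not verl's real config objects.

rollout_config = {"repetition_penalty": 1.05}   # e.g. +actor_rollout_ref.rollout.repetition_penalty=1.05
sampling_params = {"temperature": 1.0}          # per-request params; no repetition_penalty supplied

# Config value is used only when the request did not specify one; 1.0 is the final fallback.
sampling_params.setdefault("repetition_penalty", rollout_config.get("repetition_penalty", 1.0))
assert sampling_params["repetition_penalty"] == 1.05

# A request that sets its own value is left untouched.
explicit = {"repetition_penalty": 1.2}
explicit.setdefault("repetition_penalty", rollout_config.get("repetition_penalty", 1.0))
assert explicit["repetition_penalty"] == 1.2
```

Because `dict.setdefault` only writes a missing key, an explicit per-request `repetition_penalty` always wins over the config-level default.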
1 parent: 817d64d · commit: 8616165

4 files changed, +4 -1 lines changed


verl/workers/rollout/sglang_rollout/async_sglang_server.py (1 addition, 0 deletions)

@@ -82,6 +82,7 @@ async def generate(
         request_id: str,
         image_data: Optional[list[Any]] = None,
     ) -> TokenOutput:
+        sampling_params.setdefault("repetition_penalty", self.config.rollout.get("repetition_penalty", 1.0))
         return await self.master_worker.generate.remote(prompt_ids, sampling_params, request_id, image_data=image_data)
 
     async def wake_up(self):

verl/workers/rollout/sglang_rollout/sglang_rollout.py (1 addition, 1 deletion)

@@ -481,7 +481,7 @@ def _init_sampling_params(self, **kwargs):
             max_new_tokens=self.config.response_length,
             presence_penalty=0.0,
             frequency_penalty=0.0,
-            repetition_penalty=1.0,
+            repetition_penalty=self.config.get("repetition_penalty", 1.0),
         )
         # supporting adding any sampling params from the config file
         for k in self.config.keys():
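Note that the two async servers read the value as `self.config.rollout.get(...)` (they hold the full trainer config), while the rollout workers read `self.config.get(...)` (they hold only the rollout sub-config); both fall back to `1.0`, so behaviour is unchanged when the key is absent. A small sketch of that fallback, assuming the config behaves like an OmegaConf `DictConfig` (as verl's Hydra-based configs generally do):

```python
# Illustrative only: cfg stands in for the rollout section of the config.
from omegaconf import OmegaConf

cfg = OmegaConf.create({"response_length": 512})          # key absent
cfg_user = OmegaConf.create(
    {"response_length": 512, "repetition_penalty": 1.05}  # user-supplied value
)

print(cfg.get("repetition_penalty", 1.0))       # 1.0  -> old behaviour preserved
print(cfg_user.get("repetition_penalty", 1.0))  # 1.05 -> user override picked up
```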

verl/workers/rollout/vllm_rollout/vllm_async_server.py (1 addition, 0 deletions)

@@ -350,6 +350,7 @@ async def generate(
     ) -> TokenOutput:
         max_tokens = self.max_model_len - len(prompt_ids)
         sampling_params["logprobs"] = 0 if sampling_params.pop("logprobs", False) else None
+        sampling_params.setdefault("repetition_penalty", self.config.rollout.get("repetition_penalty", 1.0))
         sampling_params = SamplingParams(max_tokens=max_tokens, **sampling_params)
         prompt_ids = _qwen2_5_vl_dedup_image_tokens(prompt_ids, self.processor)
         prompt = TokensPrompt(
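On the vLLM side the key is simply forwarded into `SamplingParams`, where `repetition_penalty` is a supported field whose own default is `1.0` (no penalty). A minimal sketch; the value `1.05` is just an example, not something fixed by this PR:

```python
# Sketch: the resolved config value is passed straight through to vLLM.
from vllm import SamplingParams

resolved = 1.05  # e.g. the rollout config value, or 1.0 if unspecified
params = SamplingParams(max_tokens=1024, repetition_penalty=resolved)
print(params.repetition_penalty)  # 1.05
```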

verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py (1 addition, 0 deletions)

@@ -206,6 +206,7 @@ def __init__(
             n=1,
             logprobs=0,  # can be set to 0 and let actor to recompute
             max_tokens=config.response_length,
+            repetition_penalty=config.get("repetition_penalty", 1.0),
         )
 
         kwargs["detokenize"] = False
