Description
We tried to train a system similar to #1682 using VeRL and noticed a curious situation: a sudden drop in performance and entropy, a spike in KL, and, most prominently, a HUGE spike in response length.
On further investigation we found that our behavior matches the profile of the merged implementation in #1682 (see their wandb), as well as results from several other projects that use verl and issues like #1967.
I traced this down to the fact that the problem always appeared right after the first evaluation cycle, which warranted a closer look at the sampling params. It turns out that after the first eval/test cycle the temperature parameter is not properly restored, so training continues with a temperature of 0 and the policy acts completely greedily for the rest of the run. This explains the spike in response lengths, the drop in entropy, the tendency to collapse, etc.
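To make the failure mode concrete, here is a minimal, self-contained sketch. This is not verl's actual code; the class and the snapshot logic are invented for illustration. It shows how a context manager whose snapshot is corrupted (here, taken after the override is applied) leaves the eval-time temperature of 0 in place even though the "restore" path runs:

```python
from contextlib import contextmanager

class RolloutSketch:
    """Toy stand-in for a rollout worker; all names here are hypothetical."""

    def __init__(self):
        self.sampling_params = {"temperature": 1.0, "top_p": 1.0}

    @contextmanager
    def update_sampling_params(self, **kwargs):
        self.sampling_params.update(kwargs)
        # Corrupted snapshot: taken AFTER the override was applied, so the
        # "restore" below just re-writes the eval-time values.
        old = {k: self.sampling_params[k] for k in kwargs}
        try:
            yield
        finally:
            for k, v in old.items():
                self.sampling_params[k] = v

rollout = RolloutSketch()
with rollout.update_sampling_params(temperature=0):  # greedy eval pass
    pass
print(rollout.sampling_params["temperature"])  # 0 -> all later training is greedy
```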
I tried to implement a proper fix and open a PR, but couldn't cleanly handle multiple nested contexts each trying to restore already-corrupted parameters.
For the moment, you can test this hypothesis with a rather dirty fix that got rid of all of these issues for me:
```python
@contextmanager
def update_sampling_params(self, **kwargs):
    # ... existing context manager code ...
    try:
        yield
    finally:
        # roll back to previous sampling params
        for key, value in old_sampling_params_args.items():
            self.sampling_params[key] = value
        # NUCLEAR FIX: Ensure temperature is always 1.0 for training
        if 'temperature' in self.sampling_params and self.sampling_params['temperature'] == 0:
            self.sampling_params['temperature'] = 1.0
```
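If you want to confirm the hypothesis in your own run without patching verl, a cheap guard called before each training rollout turns the silent regression into a loud failure. This is a hypothetical helper, not part of verl:

```python
def assert_training_temperature(sampling_params: dict, expected: float = 1.0) -> None:
    """Hypothetical guard: fail fast if an eval-time override leaked into training."""
    temp = sampling_params.get("temperature")
    if temp is not None and temp != expected:
        raise RuntimeError(
            f"sampling temperature is {temp}, expected {expected}; "
            "an eval override was probably not rolled back"
        )
```

Calling this at the top of the training-time generation path should raise on the first step after the faulty eval cycle.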
And another slightly unrelated issue:
There are sampling parameters such as stop that SGLang supports and that I'd have expected to be configurable by simply setting actor_rollout_ref.rollout.stop (since most other parameters are passed through as-is). It would be nice if unrecognized parameters could just be forwarded to SGLang dynamically, so that each one doesn't have to be supported explicitly; a rough sketch follows.
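This is what such a pass-through could look like, assuming the rollout config behaves like a mapping. The helper name and the key set are made up for illustration; verl's actual config handling differs:

```python
# Keys the trainer consumes itself; everything else is forwarded verbatim.
TRAINER_ONLY_KEYS = {"prompt_length", "response_length", "dtype"}  # illustrative

def build_sglang_sampling_params(rollout_config: dict) -> dict:
    """Forward any unrecognized rollout option (e.g. `stop`) straight to SGLang."""
    return {k: v for k, v in rollout_config.items() if k not in TRAINER_ONLY_KEYS}

params = build_sglang_sampling_params(
    {"temperature": 1.0, "top_p": 0.9, "stop": ["</answer>"], "response_length": 1024}
)
# -> {"temperature": 1.0, "top_p": 0.9, "stop": ["</answer>"]}
```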