Closed
Description
verl/verl/trainer/ppo/core_algos.py
Lines 294 to 303 in a1dd922
    negative_approx_kl = log_prob - old_log_prob
    ratio = torch.exp(negative_approx_kl)
    ppo_kl = verl_F.masked_mean(-negative_approx_kl, eos_mask)
    pg_losses = -advantages * ratio
    pg_losses2 = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
    pg_loss = verl_F.masked_mean(torch.max(pg_losses, pg_losses2), eos_mask)
    pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses).float(), eos_mask)
    return pg_loss, pg_clipfrac, ppo_kl
In your implementation, pg_loss is clipped with torch.max(pg_losses, pg_losses2), which only has an effect when the unclipped loss is very small. However, pg_loss can also be positive, because advantages can be negative. In that case, should the clipping depend on the sign of pg_loss (namely, clip with torch.max when pg_loss is positive and with torch.min when pg_loss is negative)?
I noticed this because, in my PPO experiment, pg_loss spikes to very large values at certain steps.
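
To make the sign cases concrete, here is a minimal scalar sketch of the per-token loss from the snippet above. It is plain Python rather than verl code: `ppo_token_loss` is a hypothetical helper written for illustration, the masking and averaging are left out, and `cliprange=0.2` is an assumed value.

```python
def ppo_token_loss(ratio, advantage, cliprange=0.2):
    """Scalar stand-in for torch.max(pg_losses, pg_losses2) on one token."""
    # Same as torch.clamp(ratio, 1 - cliprange, 1 + cliprange) for a scalar.
    clipped_ratio = min(max(ratio, 1.0 - cliprange), 1.0 + cliprange)
    pg_loss = -advantage * ratio            # unclipped term
    pg_loss2 = -advantage * clipped_ratio   # clipped term
    return max(pg_loss, pg_loss2)


# Positive advantage, ratio above 1 + cliprange: the clipped term is selected.
print(ppo_token_loss(ratio=5.0, advantage=1.0))    # -1.2

# Negative advantage, ratio below 1 - cliprange: the clipped term is selected.
print(ppo_token_loss(ratio=0.1, advantage=-1.0))   # 0.8

# Negative advantage, ratio above 1 + cliprange: the unclipped term is
# selected, so the loss keeps growing with the ratio.
print(ppo_token_loss(ratio=5.0, advantage=-1.0))   # 5.0
print(ppo_token_loss(ratio=50.0, advantage=-1.0))  # 50.0
```

In this sketch the max bounds the loss in two of the three cases, but with a negative advantage and a ratio far above 1 + cliprange the unclipped term wins and the value is unbounded. That is at least consistent with the spikes I am seeing, though I have not confirmed it is the cause.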
