With async rollout, the pipeline is already split up within the generate stage. However, the whole batch must still finish generating before the reward stage can begin. When compute_reward_async is enabled, reward calculation runs in parallel with old_log_prob and value computation. In practice, reward calculation is often the slower of the two, so the GPUs sit idle until the rewards are ready and advantages can be computed, as shown in the figure below:

To start reward calculation sooner and avoid GPU idle time, it’s better to integrate compute_score into the generate pipeline:
- compute_score will be executed by a Ray actor.
- The reward manager gets Ray futures from compute_score, then calculates reward_tensor and reward_extra_info from the scores.
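
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: it uses the standard library's `ThreadPoolExecutor` as a stand-in for the Ray actor (with Ray, `pool.submit(...)` would correspond to calling `.remote(...)` on an actor method, and `future.result()` to `ray.get(...)`). The `generate` function, the exact-match scoring rule, the whitespace tokenization, and the choice to place each score on the last response token are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_score(response: str, ground_truth: str) -> float:
    """Hypothetical scoring rule: 1.0 on exact match, else 0.0."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0


def generate(prompt: str) -> str:
    """Hypothetical stand-in for one rollout; a real one would call the model."""
    return prompt.upper()


def rollout_with_scores(prompts, ground_truths, pool):
    """Submit each sample for scoring as soon as its response exists."""
    responses, futures = [], []
    for prompt, truth in zip(prompts, ground_truths):
        response = generate(prompt)
        responses.append(response)
        # Scoring starts here, while later prompts are still being generated.
        futures.append(pool.submit(compute_score, response, truth))
    return responses, futures


def collect_rewards(responses, futures):
    """Reward-manager step: resolve the futures and build the reward outputs."""
    # Assumption: whitespace-separated words play the role of tokens.
    lengths = [max(1, len(r.split())) for r in responses]
    max_len = max(lengths)
    reward_tensor = [[0.0] * max_len for _ in responses]
    scores = []
    for i, future in enumerate(futures):
        score = future.result()  # blocks only if this score is not ready yet
        scores.append(score)
        # Assumption: the scalar score is assigned to the last response token.
        reward_tensor[i][lengths[i] - 1] = score
    reward_extra_info = {"score": scores}
    return reward_tensor, reward_extra_info


with ThreadPoolExecutor(max_workers=2) as pool:
    responses, futures = rollout_with_scores(["a b", "c"], ["A B", "x"], pool)
    reward_tensor, reward_extra_info = collect_rewards(responses, futures)
```

Because each score is requested the moment its response is available, most scores should already be resolved by the time the reward manager collects them, which is the source of the reduced idle time.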