Async pipeline in generate and compute score

With async rollout, the generate stage is already pipelined. However, we must still wait for the whole batch to complete before moving to the reward stage. When `compute_reward_async` is enabled, reward calculation can run in parallel with `old_log_prob` and value computation. In practice, reward calculation is often the slower of the two, so the GPU sits idle until the rewards are ready and advantages can be computed, as shown in the figure below:

![Image](https://github.com/user-attachments/assets/4f918bee-4b36-454f-8dab-2378199b6aae)

To start reward calculation sooner and avoid this GPU idle time, it would be better to integrate `compute_score` into the generate pipeline:

1. `compute_score` is executed by a Ray actor.
2. The reward manager gets Ray futures from `compute_score`, then calculates `reward_tensor` and `reward_extra_info` from the scores.
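The two steps above can be sketched roughly as follows. This is only an illustration of the control flow, not the proposed implementation: it uses a standard-library `ThreadPoolExecutor` as a stand-in for the Ray actor, and `generate_one`, the exact-match `compute_score`, the `max_len` parameter, and the convention of placing each score at the last response token are all assumptions made for the example.

```python
import asyncio
from concurrent.futures import Future, ThreadPoolExecutor


def compute_score(response: str, ground_truth: str) -> float:
    """Hypothetical scorer: 1.0 on exact match, else 0.0."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0


async def generate_one(prompt: str) -> str:
    """Stand-in for the async rollout of a single sample."""
    await asyncio.sleep(0.01)
    return prompt.upper()


async def generate_and_score(prompts, ground_truths, pool):
    """Submit compute_score as soon as each sample finishes generating."""

    async def run(prompt: str, truth: str) -> Future:
        response = await generate_one(prompt)
        # Non-blocking: scoring starts while other samples are still generating.
        return pool.submit(compute_score, response, truth)

    tasks = [run(p, t) for p, t in zip(prompts, ground_truths)]
    return await asyncio.gather(*tasks)


def reward_manager(score_futures, response_lengths, max_len):
    """Resolve the futures, then build reward_tensor and reward_extra_info."""
    scores = [f.result() for f in score_futures]
    reward_tensor = []
    for score, length in zip(scores, response_lengths):
        row = [0.0] * max_len
        row[length - 1] = score  # assumption: score sits on the last response token
        reward_tensor.append(row)
    reward_extra_info = {"score": scores}
    return reward_tensor, reward_extra_info


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = asyncio.run(generate_and_score(["a", "b"], ["A", "x"], pool))
        print(reward_manager(futures, response_lengths=[3, 2], max_len=3))
```

With Ray, `pool.submit(...)` would presumably become a remote call on the actor and `f.result()` a `ray.get(...)` on the returned object references; the point of the sketch is only that scoring overlaps with generation, so the reward manager has little left to wait for once the batch is done.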


Async pipeline in generate and compute score #1584
