Lost rollout while training #218

@XianglongTan

Error traceback:

File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/agentlightning/verl/entrypoint.py", line 152, in run
    trainer.fit()
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/agentlightning/verl/trainer.py", line 318, in fit
    metrics = self._train_step(batch_dict)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/agentlightning/verl/trainer.py", line 95, in _train_step
    batch, agent_metrics = self.agent_mode_daemon.get_train_data_batch(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/agentlightning/verl/daemon.py", line 379, in get_train_data_batch
    original_sample = self._task_id_to_original_sample[rollout_id]
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^

Training log:

(Process-11615 agentlightning.server) Requeuing task rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284 after timeout (attempt 1)
(Process-11615 agentlightning.server) Task rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284 timed out after 600.0s, requeued (attempt 1)
(Process-11615 agentlightning.server) Task rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284 re-claimed (attempt 2)
(Process-11615 agentlightning.server) Rollout received and stored: rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284

Agent log:

[Task 10133 Received] ID: rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284

[Task 10190 Received] ID: rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284

2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)   [Rollout rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284] Message length details:
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Message 0: 2633 characters
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Message 1: 3002 characters
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Message 2: 176 characters
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Message 3: 3013 characters
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Message 4: 323 characters
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Message 5: 4113 characters
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Total: 6 messages, 13260 characters

(Process-1116 agentlightning.runner)   [Worker 3 | Rollout rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284] Completed in 25.88s. Triplet length: 4. Reward: 0.0

2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)   [Rollout rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284] Message length details:
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Message 0: 2633 characters
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Message 1: 4985 characters
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Message 2: 265 characters
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Message 3: 3013 characters
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Message 4: 412 characters
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Message 5: 4444 characters
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Total: 6 messages, 15752 characters

(Process-1113 agentlightning.runner)   [Worker 0 | Rollout rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284] Completed in 1505.21s. Triplet length: 4. Reward: 0.0

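I guess the server raises the timeout because the agent takes too long to finish the task. The logs show the same rollout ID being handled twice: one attempt completed in 25.88s and the other in 1505.21s, well past the 600.0s timeout. I suggest that when a rollout times out, it simply be ignored.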
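
A minimal sketch of the "ignore it" idea, not the actual agentlightning code. It assumes the failure is a missing-key lookup on `_task_id_to_original_sample` (the traceback above is cut off before the exception line). The helper `select_known_rollouts` and its parameters `completed` and `id_to_sample` are hypothetical names used only for illustration. It keeps the first result per rollout ID and skips any ID that is unknown or already seen, logging a warning instead of crashing:

```python
import logging
from typing import Any, Dict, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def select_known_rollouts(
    completed: Iterable[Tuple[str, Any]],
    id_to_sample: Dict[str, Any],
) -> List[Tuple[str, Any, Any]]:
    """Pair finished rollouts with their original samples.

    Hypothetical helper: `completed` yields (rollout_id, rollout) pairs and
    `id_to_sample` plays the role of the ID-to-sample lookup in the traceback.
    Unknown IDs and repeated IDs are skipped rather than raising.
    """
    seen = set()
    kept = []
    for rollout_id, rollout in completed:
        if rollout_id in seen:
            # A requeued task can report twice under one ID; keep the first.
            logger.warning("Duplicate rollout %s ignored", rollout_id)
            continue
        if rollout_id not in id_to_sample:
            # No original sample for this ID; skip instead of failing the step.
            logger.warning("Unknown rollout %s ignored", rollout_id)
            continue
        seen.add(rollout_id)
        kept.append((rollout_id, id_to_sample[rollout_id], rollout))
    return kept
```

With something like this, a late or duplicated result from a timed-out attempt would be dropped from the batch, at the cost of training on slightly fewer samples in that step.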

By the way, is there a WeChat or RedNote group for this project?
