Online policy distillation (OPD) for agentic tool-use, using textual hindsight hints extracted from environment feedback to construct a stronger teacher signal.
Unlike binary RL, which assigns only a scalar reward, OPD constructs a token-level teacher distribution by augmenting the original prompt with a hindsight hint, then distills this improved distribution back into the student policy.
For each main-line turn, the system:
- Forwards the request to the policy model (SGLang) and collects the response with per-token log-probabilities.
- When the next turn arrives, the next state (user reply / environment feedback) reveals whether the previous response was helpful.
- A judge model (served on the PRM GPUs) evaluates the (response, next_state) pair $m$ times. Each evaluation produces:
  - A binary decision: $\boxed{1}$ (the next state reveals useful hindsight) or $\boxed{-1}$ (no useful signal).
  - If positive: a textual hint wrapped in `[HINT_START]...[HINT_END]` — a concise, actionable description of what the response should have done differently.
- Hint selection: among all votes scored `+1` with a non-trivial hint (>10 chars), the longest hint is selected. If no valid hint exists, the sample is dropped entirely from training (see the sketch after this list).
- The selected hint is appended to the original prompt as `[user's hint / instruction]\n{hint}`, creating an enhanced prompt.
- Teacher log-probs are computed by running the enhanced prompt + original response through the teacher model. This gives $\log\pi_{\text{teacher}}(a_t \mid s_{\text{enhanced}})$ — what the model would have predicted if it had known the hint.
- The sample is submitted for training with these teacher log-probs as the distillation target.
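A minimal sketch of the vote-parsing, hint-selection, and prompt-enhancement steps above, assuming the judge's raw text outputs are available as strings. The function names, regexes, and the `min_len` threshold are illustrative assumptions, not the repo's actual API.

```python
import re

# The judge wraps its decision in \boxed{1} / \boxed{-1} and its hint
# in [HINT_START]...[HINT_END], per the pipeline description above.
DECISION_RE = re.compile(r"\\boxed\{(-?1)\}")
HINT_RE = re.compile(r"\[HINT_START\](.*?)\[HINT_END\]", re.DOTALL)

def select_hint(judge_outputs: list[str], min_len: int = 10) -> str | None:
    """Collect hints from positive votes and keep the longest one.

    Returns None when no vote is both positive and carries a
    non-trivial hint, in which case the sample is dropped.
    """
    hints = []
    for text in judge_outputs:
        decision = DECISION_RE.search(text)
        if decision is None or decision.group(1) != "1":
            continue  # \boxed{-1} or unparsable: no useful signal
        match = HINT_RE.search(text)
        if match and len(match.group(1).strip()) > min_len:
            hints.append(match.group(1).strip())
    return max(hints, key=len) if hints else None

def build_enhanced_prompt(prompt: str, hint: str) -> str:
    # Mirror the `[user's hint / instruction]\n{hint}` format above.
    return f"{prompt}\n[user's hint / instruction]\n{hint}"
```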
The advantage at each token is the log-probability gap between the hint-enhanced teacher and the current student:

$$A_t = \log\pi_{\text{teacher}}(a_t \mid s_{\text{enhanced}}) - \log\pi_{\text{student}}(a_t \mid s)$$
Intuitively:
- When $A_t > 0$: the teacher (with the hint) assigns higher probability to token $a_t$ than the student — the student should increase this probability.
- When $A_t < 0$: the teacher considers this token less likely given the hint — the student should decrease this probability.
This provides a token-level, directional training signal that is richer than a single scalar reward.
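In code, the advantage is just the elementwise difference of the two per-token log-prob vectors. A sketch, assuming both tensors are aligned over the same response tokens (tensor names are illustrative):

```python
import torch

def token_advantages(teacher_logprobs: torch.Tensor,
                     student_logprobs: torch.Tensor) -> torch.Tensor:
    """A_t = log pi_teacher(a_t | s_enhanced) - log pi_student(a_t | s)."""
    # Both tensors have shape [response_len]; positive entries mark tokens
    # the hint-informed teacher prefers more strongly than the student.
    return teacher_logprobs - student_logprobs
```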
The same PPO-style clipped surrogate is used:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_t\left[\min\left(r_t(\theta)\,A_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)A_t\right)\right]$$

But now $A_t$ is the teacher–student log-probability gap defined above rather than an environment-derived advantage, where

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

is the usual importance ratio between the current policy and the rollout (old) policy.
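A minimal sketch of this objective in PyTorch, assuming per-token log-probs from the current student and the rollout (old) policy are already aligned with the advantages above; the variable names and the `eps` default are assumptions, not the trainer's actual interface.

```python
import torch

def opd_clipped_loss(student_logprobs: torch.Tensor,
                     old_logprobs: torch.Tensor,
                     advantages: torch.Tensor,
                     eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate where `advantages` is the
    teacher-student log-prob gap rather than a scalar reward."""
    ratio = torch.exp(student_logprobs - old_logprobs)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Negative sign: minimizing this loss maximizes the surrogate.
    return -torch.min(unclipped, clipped).mean()
```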
```bash
cd slime
bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh
```

```
openclaw-opd/
├── README.md
├── run_qwen3_4b_openclaw_opd.sh   # Launch script
├── openclaw_opd_api_server.py     # FastAPI proxy + judge + teacher log-probs + sample submission
├── openclaw_opd_rollout.py        # Async rollout worker (bridges API server ↔ SLIME trainer)
└── results/                       # Runtime records (auto-created)
```