|
1 | | -# E2E fixes journal |
| 1 | +# E2E Live-Run Fixes |
2 | 2 |
|
3 | | -Log of everything we fixed while getting the tinker-nemogym ↔ nemo-gym stack |
4 | | -to run end-to-end without a real ``TINKER_API_KEY``. One paragraph per fix. |
| 3 | +Running log of bugs found and patches applied while doing live-Tinker
| 4 | +end-to-end validation beyond the minimal `smoke_single_tool` path. For the
| 5 | +original session (the first-ever live run), see commit `a558e63`; this document
| 6 | +covers the follow-ups.
5 | 7 |
|
6 | | -## 1. `policy_model` was using `vllm_model` and crashed on `choice.logprobs.content` |
| 8 | +## 2026-04-22 — Varied-reward / multi-step / resume / wandb validation |
7 | 9 |
|
8 | | -**Problem.** Every `/run` call returned HTTP 500 from the agent with a cascade |
9 | | -pointing at `responses_api_models/vllm_model/app.py:374` — |
10 | | -`KeyError: 'logprobs'`. |
| 10 | +### Fix #1: per-step INFO line missing hot-swap id + loss |
11 | 11 |
|
12 | | -**Root cause.** `vllm_model` is configured for real vLLM: when |
13 | | -`return_token_id_information: true` it POSTs `/v1/chat/completions` to the |
14 | | -backing model with `logprobs=True, return_tokens_as_token_ids=True` and then |
15 | | -parses `choice.logprobs.content[*].token` (of the form `"token_id:<int>"`). |
16 | | -Our shim speaks OpenAI Responses API shape directly and does not produce |
17 | | -vLLM's native `choice.logprobs.content`. |
| 12 | +**Symptom**: `/tmp/live_run_mcqa.log` showed only |
| 13 | +`Step N: mean_reward=X n_datums=Y dropped=Z`. That line alone gave no way to
| 14 | +tell whether the hot-swap fired or what the loss was, so proving that
| 15 | +hot-swap IDs progress between steps and that the loss is computed meant
| 16 | +falling back to reading wandb artifacts or Tinker server logs.
18 | 17 |
|
19 | | -**Fix.** Switched `configs/nemogym_agent.yaml` from `vllm_model` to |
20 | | -`openai_model`, which forwards `/v1/responses` verbatim to our shim. Because |
21 | | -nemo-gym's `NeMoGymResponse` pydantic model auto-selects the |
22 | | -`ForTraining` output variants when `prompt_token_ids/generation_token_ids/ |
23 | | -generation_log_probs` are present on a message, the training fields flow |
24 | | -through end-to-end without vLLM-specific translation. |
| 18 | +**Fix**: `tinker_nemogym/trainer.py::run` now appends ` save_count=<N> |
| 19 | +loss=<F>` to the INFO line whenever those keys are present in the metrics |
| 20 | +dict. `save_count` is only populated when the hot-swap succeeded, and `loss` |
| 21 | +only when `forward_backward_async` returned a metrics dict (tinker SDK |
| 22 | +convention: `loss:sum`). Steps skipped for constant reward have neither, so
| 23 | +their line stays unchanged, preserving backward-compatible log output for
| 24 | +those cases.
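
As an illustration, here is a minimal sketch of how the extended line can be assembled. The metric keys (`mean_reward`, `n_datums`, `dropped`, `save_count`, `loss`) are the ones named above; the function name, logger setup, and dict access are hypothetical, not the actual `trainer.py` code:

```python
# Sketch only: assumes `metrics` is the per-step dict described above.
import logging

logger = logging.getLogger("tinker_nemogym.trainer")

def log_step(step: int, metrics: dict) -> None:
    line = (
        f"Step {step}: mean_reward={metrics['mean_reward']} "
        f"n_datums={metrics['n_datums']} dropped={metrics['dropped']}"
    )
    # Appended only when present, so skipped constant-reward steps keep the
    # original log format unchanged.
    if "save_count" in metrics:
        line += f" save_count={metrics['save_count']}"
    if "loss" in metrics:
        line += f" loss={metrics['loss']}"
    logger.info(line)
```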
25 | 25 |
|
26 | | -**Verification.** Manually ran a `/run` against the agent and confirmed a |
27 | | -200 with `reward=1.0` and the training fields intact on |
28 | | -`response.output[0]`. |
| 26 | +### Fix #2: Checkpoint URI not surfaced to operator |
29 | 27 |
|
30 | | -## 2. `NeMoGymResponse.model_validate` rejected our shim's `/v1/responses` output |
| 28 | +**Symptom**: `Saving checkpoint label=smoke_mcqa-step-2 (logical path: ...)` |
| 29 | +tells the operator the local dir name but not the `tinker://<run-id>/.../<label>` |
| 30 | +URI that `load_state_async` expects. We were already awaiting the save
| 31 | +future but discarding its `.path`, so resuming with
| 32 | +`resume_from_checkpoint` after a run meant asking a Tinker admin for the run_id.
31 | 33 |
|
32 | | -**Problem.** Agent returned HTTP 500 when validating our shim's response: |
33 | | -``3 validation errors for NeMoGymResponse`` (three missing fields). |
| 34 | +**Fix**: `_maybe_save_checkpoint` now captures `result = await |
| 35 | +future.result_async()`, reads `result.path`, and emits a second INFO line: |
| 36 | +`Checkpoint saved: label=<L> uri=tinker://<run-id>:train:0/weights/<L>`. |
| 37 | +That URI is now copy-pasteable into `tinker.resume_from_checkpoint` for |
| 38 | +the next run. Phase 4 of the live validation relies on this. |
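
A minimal sketch of the save-and-surface flow, assuming the call shapes described in this journal (`save_state_async` returning a future whose `result_async()` result carries `.path`); the function name and error handling are illustrative, not the exact `_maybe_save_checkpoint` body:

```python
# Sketch only: call names follow the journal's description of the tinker SDK,
# not a verified API reference.
import logging

logger = logging.getLogger("tinker_nemogym.trainer")

async def save_checkpoint_and_report(training_client, label: str) -> str | None:
    try:
        future = await training_client.save_state_async(name=label)
        result = await future.result_async()
    except Exception:
        logger.exception("Checkpoint save failed for label=%s", label)
        return None
    # result.path is the tinker://<run-id>:train:0/weights/<label> URI that
    # resume_from_checkpoint / load_state_async expects on the next run.
    logger.info("Checkpoint saved: label=%s uri=%s", label, result.path)
    return result.path
```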
34 | 39 |
|
35 | | -**Root cause.** `nemo_gym.openai_utils.NeMoGymResponse` inherits from the |
36 | | -OpenAI `Response` model which requires `parallel_tool_calls`, `tool_choice`, |
37 | | -and `tools` even for tool-free responses. |
| 40 | +### Config tweak: mcqa smoke needs group_size=8 (not 4) |
38 | 41 |
|
39 | | -**Fix.** `tinker_responses_model.responses_endpoint` now echoes these fields |
40 | | -from the request (defaulting to `auto`/`True`/`[]` when absent) onto the |
41 | | -response body. |
| 42 | +**Symptom**: the first mcqa smoke with group_size=4 gave only 1 of 5 steps with
| 43 | +n_datums>0; the other four all collapsed to constant reward (0 or 1) within |
| 44 | +each group and were dropped. |
42 | 45 |
|
43 | | -**Verification.** Re-ran a rollout; response validated and the |
44 | | -`/run` round-trip succeeded. |
45 | | - |
46 | | -## 3. `extract_trajectory` looked at the wrong nesting level |
47 | | - |
48 | | -**Problem.** After fix #2 the rollout returned 200 but |
49 | | -`extract_trajectory` raised: ``Response is missing token_ids/logprobs``. |
50 | | - |
51 | | -**Root cause.** `NeMoGymResponse.output` is a **list** of heterogeneous |
52 | | -items; the training fields are carried on the first message item |
53 | | -(`NeMoGymResponseOutputMessageForTraining`), **not** at |
54 | | -`response["prompt_token_ids"]`. Our extractor was only checking the top |
55 | | -level of the `response` dict. |
56 | | - |
57 | | -**Fix.** Added `_find_training_output` that scans `response.output` for |
58 | | -the first item with all three training fields. Keeps the top-level |
59 | | -fallback for backward compatibility with older synthetic fixtures and |
60 | | -unit tests. |
61 | | - |
62 | | -**Verification.** `rollout_group` → `extract_trajectory` → |
63 | | -`build_datum` pipeline now succeeds on the real agent response. |
64 | | - |
65 | | -## 4. Trainer returned before its uvicorn shim was listening |
66 | | - |
67 | | -**Problem.** In fast-iteration scripts that did `trainer.setup()` followed |
68 | | -immediately by HTTP calls to `http://127.0.0.1:8001`, `aiohttp` raised |
69 | | -`ConnectionRefusedError: Cannot connect to host 127.0.0.1:8001`. |
70 | | - |
71 | | -**Root cause.** `_start_server_thread` spawned the uvicorn thread but |
72 | | -returned immediately — no readiness barrier. The main thread raced the |
73 | | -listener socket. |
74 | | - |
75 | | -**Fix.** `_start_server_thread` now polls `server.started` (uvicorn's |
76 | | -post-bind signal) for up to 10 seconds and logs the readiness state. |
77 | | - |
78 | | -**Verification.** Scripts stop producing connection-refused errors; e2e |
79 | | -script completes cleanly. |
80 | | - |
81 | | -## 5. `example_single_tool_call`'s verifier always returns reward=1.0 |
82 | | - |
83 | | -**Problem.** After the above fixes, `_run_step` kept early-exiting: |
84 | | -``all groups had constant reward, skipping training step``. The call |
85 | | -history showed `save_weights_and_get_sampling_client_async` but **no** |
86 | | -`forward_backward_async` or `optim_step_async`. |
87 | | - |
88 | | -**Root cause.** Not a bug — `SimpleWeatherResourcesServer.verify` is |
89 | | -hardcoded to reward=1.0. GRPO's "drop constant-reward groups" short-circuit |
90 | | -(designed to skip zero-gradient steps) was doing its job. |
91 | | - |
92 | | -**Fix.** Added `training.drop_constant_reward: bool = True` to |
93 | | -`TrainingConfig`; the e2e script sets it to `False` so the full training |
94 | | -pipeline gets exercised even against the toy verifier. |
95 | | - |
96 | | -**Verification.** `call_history` now contains `forward_backward_async`, |
97 | | -`optim_step_async`, and multiple `save_weights_and_get_sampling_client_async` |
98 | | -calls across steps. Hot-swap verified by comparing |
99 | | -`trainer.current_sampling_client` identity between steps — it changes |
100 | | -after each step. |
101 | | - |
102 | | -## Nothing else needed a fix |
103 | | - |
104 | | -- **Tokenizer.** `meta-llama/Llama-3.2-1B` was already cached in |
105 | | - `~/.cache/huggingface/hub` on the dev machine and loads without an HF |
106 | | - token. If a different environment gates this model, the fix is to set |
107 | | - `tinker.base_model` in the YAML to an ungated model (e.g. |
108 | | - `Qwen/Qwen2.5-0.5B`). Keeping the gated one as the default mirrors |
109 | | - production. |
110 | | -- **Renderer.** `tinker_cookbook.model_info.get_recommended_renderer_name` |
111 | | - supports the Llama 3.2 chat template out of the box. |
112 | | -- **Import paths.** `TokenIDLogProbMixin` and `SimpleResponsesAPIModel` are |
113 | | - imported from `nemo_gym.openai_utils` and `nemo_gym.base_responses_api_model` |
114 | | - respectively — already correct. |
115 | | -- **Agent port.** The agent listens on a random port allocated by nemo-gym |
116 | | - (default range 10001-20000). We discover it at runtime by polling |
117 | | - `GET {head_url}/server_instances` and filtering by |
118 | | - `server_type == "responses_api_agents"`. |
119 | | - |
120 | | -## 6. Multi-turn rollouts: datum_builder only used the first ForTraining output |
121 | | - |
122 | | -**Problem.** When pointed at nemo-gym's `example_multi_step` env (or any |
123 | | -agent that emits multiple assistant turns before the final answer) the |
124 | | -`/v1/responses` body contains *several* `NeMoGym*ForTraining` output items |
125 | | -interleaved with `function_call_output` items. The old |
126 | | -`_find_training_output` returned only the first match, so |
127 | | -`extract_trajectory` silently dropped every generation after turn 1 — the |
128 | | -datum covered just one turn of a potentially-10-turn trajectory. |
129 | | - |
130 | | -**Root cause.** `extract_trajectory` assumed a single ForTraining item. It |
131 | | -called `_find_training_output` (first-match) and stopped. |
132 | | - |
133 | | -**Fix.** Added `_collect_training_outputs` in `datum_builder.py` that |
134 | | -returns *every* ForTraining item in list order. `extract_trajectory` now |
135 | | -concatenates their `generation_token_ids` + `generation_log_probs` across |
136 | | -turns; the prompt for the trajectory is the prompt attached to the first |
137 | | -item (subsequent items carry the same prefix plus prior generations). |
138 | | -Falls back to the old top-level-fields shape for legacy fixtures. The |
139 | | -old single-match helper is kept for backward compatibility. |
140 | | - |
141 | | -**Verification.** New integration tests drive a multi-turn response |
142 | | -through `extract_trajectory` + `build_datum` and assert |
143 | | -`len(target) == len(logprobs) == len(advantages)` with the concatenated |
144 | | -generations (see `tests/integration/test_multi_turn.py`). |
145 | | - |
146 | | -## 7. Trainer had no error recovery around forward_backward / hot-swap |
147 | | - |
148 | | -**Problem.** A single transient tinker failure (503 from the backend, |
149 | | -intermittent network hiccup) killed the training loop — the exception |
150 | | -propagated up through `_run_step` and the outer run loop saw it only via |
151 | | -`logger.exception`, but by then sampling state could be half-updated. |
152 | | - |
153 | | -**Root cause.** `_run_step` awaited `forward_backward_async`, `optim_step_async`, |
154 | | -and `save_weights_and_get_sampling_client_async` without try/except. |
155 | | - |
156 | | -**Fix.** Wrapped the fwd_bwd + optim calls in a single try/except that |
157 | | -logs at ERROR and returns metrics (with `fwd_bwd_error` marker) instead of |
158 | | -raising. Separately wrapped the hot-swap. `_maybe_save_checkpoint` was |
159 | | -already wrapped but is now exercised by a test. This lets the outer loop |
160 | | -count the failed step and move on. |
161 | | - |
162 | | -**Verification.** `tests/integration/test_error_recovery.py` injects |
163 | | -exceptions into each call and asserts the trainer survives and subsequent |
164 | | -steps still run. |
165 | | - |
166 | | -## 8. Shim mapped every RuntimeError to 503, hiding real sampling failures |
167 | | - |
168 | | -**Problem.** When a sampling client raised — e.g. a real tinker backend |
169 | | -error surfaced as a RuntimeError — the shim returned a 503. Callers |
170 | | -treated that as "service not ready, retry after wiring up", which is wrong |
171 | | -for a failure originating *inside* the sampling client (no amount of retry |
172 | | -helps without a code fix). |
173 | | - |
174 | | -**Root cause.** The bare `except RuntimeError` in `chat_completions_endpoint` |
175 | | -+ `responses_endpoint` swallowed *every* RuntimeError as 503. |
176 | | - |
177 | | -**Fix.** Narrowed the 503 mapping to RuntimeErrors whose message matches |
178 | | -the "not initialized" / "no sampling client" strings we raise ourselves. |
179 | | -Everything else (including arbitrary RuntimeErrors from the sampling |
180 | | -client) maps to 500 with `detail="sampling_client error: ..."`. Process |
181 | | -stays alive either way. |
182 | | - |
183 | | -**Verification.** New tests set up a sampling client that always raises |
184 | | -RuntimeError; POST to `/v1/responses` and `/v1/chat/completions` now |
185 | | -return 500 (not 503), the shim remains responsive, and a subsequent |
186 | | -request after swapping in a healthy sampler succeeds. See |
187 | | -`tests/integration/test_error_recovery.py::test_shim_handles_sampling_client_exception`. |
188 | | - |
189 | | -## 9. Resume from checkpoint was not wired |
190 | | - |
191 | | -**Problem.** The config schema had `checkpoint_dir` + `save_every` but no |
192 | | -knob to *load* a saved checkpoint — the only way to restart a run was to |
193 | | -hot-patch `training_client.load_state_async` manually. |
194 | | - |
195 | | -**Root cause.** Feature wasn't implemented. Not a bug per se, but the |
196 | | -e2e test matrix needed a real resume path. |
197 | | - |
198 | | -**Fix.** Added `tinker.resume_from_checkpoint: str | None` to |
199 | | -`TinkerConfig`. When set, `setup()` calls |
200 | | -`training_client.load_state_async(path)` *before* the initial |
201 | | -`save_weights_and_get_sampling_client_async(name="step_0")` call so the |
202 | | -first rollout samples from the restored weights. Matches the real tinker |
203 | | -SDK's signature: `load_state_async(path, weights_access_token=None)`. |
204 | | -`FakeLoRATrainingClient` grows matching `load_state_async` + |
205 | | -`load_state_with_optimizer_async` methods that record the call and stash |
206 | | -the path on `self._loaded_state` for test assertions. |
207 | | - |
208 | | -**Verification.** `tests/integration/test_checkpoint_resume.py` verifies |
209 | | -(a) load_state_async is called with the configured path, (b) it precedes |
210 | | -the initial sampler snapshot, (c) load failures don't abort setup, and |
211 | | -(d) a full save-then-resume-in-a-new-trainer roundtrip round-trips the |
212 | | -path string correctly. |
213 | | - |
214 | | -## 10. Default `base_model` was a Tinker-unsupported variant |
215 | | - |
216 | | -**Problem.** Real Tinker rejected `create_lora_training_client_async` with |
217 | | -an "unsupported base model" error when the smoke YAMLs specified |
218 | | -`meta-llama/Llama-3.2-1B-Instruct`. Tinker's production roster doesn't |
219 | | -include the 1B Llama-3.2 variant at the moment. |
220 | | - |
221 | | -**Root cause.** We picked the smallest Llama for the default because |
222 | | -tokenizer caches were handy, but the backend's supported-model list is |
223 | | -narrower than what the HF tokenizer zoo contains. `get_server_capabilities_async()` |
224 | | -confirms which base models the server accepts. |
225 | | - |
226 | | -**Fix.** Switched the default `base_model` in both |
227 | | -`configs/smoke_single_tool.yaml` and `configs/smoke_multi_step.yaml` to |
228 | | -`meta-llama/Llama-3.1-8B-Instruct` — the smallest Llama currently on |
229 | | -Tinker's supported list. Tokenizer is ungated enough to download on a |
230 | | -fresh box; context length (32768) comfortably covers our smoke prompts. |
231 | | - |
232 | | -**Verification.** Real-Tinker smoke run now passes the |
233 | | -`create_lora_training_client_async` step (next failure is further down |
234 | | -the pipeline). |
235 | | - |
236 | | -## 11. `save_state_async(name=...)` rejected filesystem-style paths |
237 | | - |
238 | | -**Problem.** First real training step got past `forward_backward_async` |
239 | | -and `optim_step_async` but `_maybe_save_checkpoint` raised — Tinker's |
240 | | -server returned a validation error on the `name` field. |
241 | | - |
242 | | -**Root cause.** Tinker validates `save_state_async`'s `name` as a |
243 | | -"weights label" and only allows `[A-Za-z0-9._-]`. Our code was passing |
244 | | -a full filesystem path like `./checkpoints/smoke_single_tool/step_1`, |
245 | | -which contains slashes and a leading dot-slash. |
246 | | - |
247 | | -**Fix.** `_maybe_save_checkpoint` now constructs a safe identifier |
248 | | -from the checkpoint_dir basename plus the step number |
249 | | -(`{dir_basename}-step-{n}`) and passes *that* to `save_state_async`. |
250 | | -The filesystem-style path is still logged for operator readability but |
251 | | -doesn't go to the API. |
252 | | - |
253 | | -**Verification.** Real-Tinker run: no more 400 on checkpoint save; the |
254 | | -saved label appears in subsequent `list_state_ids_async` output. |
255 | | - |
256 | | -## 12. Hardcoded `agent_url` raced nemo-gym's random port allocation |
257 | | - |
258 | | -**Problem.** Trainer POSTed to `http://127.0.0.1:11001/run` and got |
259 | | -`ConnectionRefusedError`. nemo-gym's agent was listening — just on a |
260 | | -different random port inside the 10001-20000 range. |
261 | | - |
262 | | -**Root cause.** `configs/smoke_single_tool.yaml` (and `smoke_multi_step.yaml`) |
263 | | -hardcoded `agent_url: http://127.0.0.1:11001` based on an old assumption |
264 | | -that agents always land on 11001. In reality nemo-gym allocates agent |
265 | | -ports randomly from 10001-20000 and only registers them with the |
266 | | -HeadServer. We already have a discovery helper |
267 | | -(`tinker_nemogym/utils/process.py::discover_agent_url`) that the trainer |
268 | | -invokes when `cfg.nemogym.agent_url is None`. |
269 | | - |
270 | | -**Fix.** Set `agent_url: null` in both YAMLs to unlock the dynamic |
271 | | -discovery path. No code change required — the discovery was wired but |
272 | | -gated behind an explicit opt-in. |
273 | | - |
274 | | -**Verification.** Real-Tinker run: trainer logs |
275 | | -`discovered agent_url=http://127.0.0.1:<port>` and the first `/run` |
276 | | -request goes through without `ConnectionRefusedError`. |
| 46 | +**Fix**: bump `group_size` from 4 to 8 in `configs/smoke_mcqa.yaml`. At T=1.0
| 47 | +with a 10-option MCQ on an 8B Instruct model, 8 samples give consistent
| 48 | +intra-group variance. Confirmed: 3 of 5 steps produce n_datums=8, and advantages + loss fire.
277 | 49 |
|
| 50 | +Not a code fix — parameters only. Documented in the YAML with a comment so |
| 51 | +future operators don't regress it. |
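
For reference, roughly what the guarded setting can look like in `configs/smoke_mcqa.yaml`; the surrounding key nesting and the exact comment wording are assumptions, and only the `group_size: 8` value comes from this fix:

```yaml
# Assumed layout; only group_size: 8 is the actual change.
training:
  group_size: 8  # do not drop below 8: at group_size=4 the 10-option MCQ
                 # groups collapse to constant reward and the step is dropped
```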