Commit 8e8dd14

docs: live-Tinker E2E validation results + bugfix log
- docs/LIVE_RUN_RESULTS.md: full write-up of the 6-phase validation
  (varied-reward wiring, live mcqa run, multi-turn pipeline check, checkpoint
  resume, wandb offline logging, regression). Includes per-step result tables
  with actual loss values, captured checkpoint URIs, wandb artifacts path,
  and a tinker budget note.
- docs/E2E_FIXES.md: short running-log of the two trainer.py patches needed
  to make live runs self-validating (per-step save_count+loss; emit
  tinker:// URI after checkpoint save), plus the group_size=8 tuning note
  for the mcqa smoke.
1 parent 3a8e799 commit 8e8dd14

2 files changed: 206 additions & 265 deletions

File shown: docs/E2E_FIXES.md (39 additions & 265 deletions)
```diff
@@ -1,277 +1,51 @@
-# E2E fixes journal
+# E2E Live-Run Fixes
 
-Log of everything we fixed while getting the tinker-nemogym ↔ nemo-gym stack
-to run end-to-end without a real ``TINKER_API_KEY``. One paragraph per fix.
+Running-log of bugs found and patches applied while doing live-Tinker
+end-to-end validation beyond the minimal `smoke_single_tool` path. For the
+original session (first-ever live run) see commit `a558e63` — this document
+covers follow-ups.
 
-## 1. `policy_model` was using `vllm_model` and crashed on `choice.logprobs.content`
+## 2026-04-22 — Varied-reward / multi-step / resume / wandb validation
 
-**Problem.** Every `/run` call returned HTTP 500 from the agent with a cascade
-pointing at `responses_api_models/vllm_model/app.py:374`
-`KeyError: 'logprobs'`.
+### Fix #1: per-step INFO line missing hot-swap id + loss
 
-**Root cause.** `vllm_model` is configured for real vLLM: when
-`return_token_id_information: true` it POSTs `/v1/chat/completions` to the
-backing model with `logprobs=True, return_tokens_as_token_ids=True` and then
-parses `choice.logprobs.content[*].token` (of the form `"token_id:<int>"`).
-Our shim speaks OpenAI Responses API shape directly and does not produce
-vLLM's native `choice.logprobs.content`.
+**Symptom**: `/tmp/live_run_mcqa.log` showed only
+`Step N: mean_reward=X n_datums=Y dropped=Z`. No way to tell from that line
+alone whether the hot-swap fired or what the loss was, so the exercise of
+"prove hot-swap IDs progress between steps + prove loss is computed" had to
+fall back to reading wandb artifacts or Tinker server logs.
 
-**Fix.** Switched `configs/nemogym_agent.yaml` from `vllm_model` to
-`openai_model`, which forwards `/v1/responses` verbatim to our shim. Because
-nemo-gym's `NeMoGymResponse` pydantic model auto-selects the
-`ForTraining` output variants when `prompt_token_ids/generation_token_ids/
-generation_log_probs` are present on a message, the training fields flow
-through end-to-end without vLLM-specific translation.
+**Fix**: `tinker_nemogym/trainer.py::run` now appends ` save_count=<N>
+loss=<F>` to the INFO line whenever those keys are present in the metrics
+dict. `save_count` is only populated when the hot-swap succeeded, and `loss`
+only when `forward_backward_async` returned a metrics dict (tinker SDK
+convention: `loss:sum`). Constant-reward skipped steps have neither — the
+string remains unchanged for them, preserving backward-compatible log output
+for those cases.
 
-**Verification.** Manually ran a `/run` against the agent and confirmed a
-200 with `reward=1.0` and the training fields intact on
-`response.output[0]`.
+### Fix #2: Checkpoint URI not surfaced to operator
 
-## 2. `NeMoGymResponse.model_validate` rejected our shim's `/v1/responses` output
+**Symptom**: `Saving checkpoint label=smoke_mcqa-step-2 (logical path: ...)`
+tells the operator the local dir name but not the `tinker://<run-id>/.../<label>`
+URI that `load_state_async` expects. Previously we were awaiting the save
+future and discarding its `.path`. So after a run, if you wanted to
+`resume_from_checkpoint`, you had to ask Tinker admin for the run_id.
 
-**Problem.** Agent returned HTTP 500 when validating our shim's response:
-``3 validation errors for NeMoGymResponse`` (three missing fields).
+**Fix**: `_maybe_save_checkpoint` now captures `result = await
+future.result_async()`, reads `result.path`, and emits a second INFO line:
+`Checkpoint saved: label=<L> uri=tinker://<run-id>:train:0/weights/<L>`.
+That URI is now copy-pasteable into `tinker.resume_from_checkpoint` for
+the next run. Phase 4 of the live validation relies on this.
 
-**Root cause.** `nemo_gym.openai_utils.NeMoGymResponse` inherits from the
-OpenAI `Response` model which requires `parallel_tool_calls`, `tool_choice`,
-and `tools` even for tool-free responses.
+### Config tweak: mcqa smoke needs group_size=8 (not 4)
 
-**Fix.** `tinker_responses_model.responses_endpoint` now echoes these fields
-from the request (defaulting to `auto`/`True`/`[]` when absent) onto the
-response body.
+**Symptom**: first mcqa smoke with group_size=4 gave 1/5 step with
+n_datums>0; the other four all collapsed to constant reward (0 or 1) within
+each group and were dropped.
 
-**Verification.** Re-ran a rollout; response validated and the
-`/run` round-trip succeeded.
-
-## 3. `extract_trajectory` looked at the wrong nesting level
-
-**Problem.** After fix #2 the rollout returned 200 but
-`extract_trajectory` raised: ``Response is missing token_ids/logprobs``.
-
-**Root cause.** `NeMoGymResponse.output` is a **list** of heterogeneous
-items; the training fields are carried on the first message item
-(`NeMoGymResponseOutputMessageForTraining`), **not** at
-`response["prompt_token_ids"]`. Our extractor was only checking the top
-level of the `response` dict.
-
-**Fix.** Added `_find_training_output` that scans `response.output` for
-the first item with all three training fields. Keeps the top-level
-fallback for backward compatibility with older synthetic fixtures and
-unit tests.
-
-**Verification.** `rollout_group` → `extract_trajectory` →
-`build_datum` pipeline now succeeds on the real agent response.
-
-## 4. Trainer returned before its uvicorn shim was listening
-
-**Problem.** In fast-iteration scripts that did `trainer.setup()` followed
-immediately by HTTP calls to `http://127.0.0.1:8001`, `aiohttp` raised
-`ConnectionRefusedError: Cannot connect to host 127.0.0.1:8001`.
-
-**Root cause.** `_start_server_thread` spawned the uvicorn thread but
-returned immediately — no readiness barrier. The main thread raced the
-listener socket.
-
-**Fix.** `_start_server_thread` now polls `server.started` (uvicorn's
-post-bind signal) for up to 10 seconds and logs the readiness state.
-
-**Verification.** Scripts stop producing connection-refused errors; e2e
-script completes cleanly.
-
-## 5. `example_single_tool_call`'s verifier always returns reward=1.0
-
-**Problem.** After the above fixes, `_run_step` kept early-exiting:
-``all groups had constant reward, skipping training step``. The call
-history showed `save_weights_and_get_sampling_client_async` but **no**
-`forward_backward_async` or `optim_step_async`.
-
-**Root cause.** Not a bug — `SimpleWeatherResourcesServer.verify` is
-hardcoded to reward=1.0. GRPO's "drop constant-reward groups" short-circuit
-(designed to skip zero-gradient steps) was doing its job.
-
-**Fix.** Added `training.drop_constant_reward: bool = True` to
-`TrainingConfig`; the e2e script sets it to `False` so the full training
-pipeline gets exercised even against the toy verifier.
-
-**Verification.** `call_history` now contains `forward_backward_async`,
-`optim_step_async`, and multiple `save_weights_and_get_sampling_client_async`
-calls across steps. Hot-swap verified by comparing
-`trainer.current_sampling_client` identity between steps — it changes
-after each step.
-
-## Nothing else needed a fix
-
-- **Tokenizer.** `meta-llama/Llama-3.2-1B` was already cached in
-  `~/.cache/huggingface/hub` on the dev machine and loads without an HF
-  token. If a different environment gates this model, the fix is to set
-  `tinker.base_model` in the YAML to an ungated model (e.g.
-  `Qwen/Qwen2.5-0.5B`). Keeping the gated one as the default mirrors
-  production.
-- **Renderer.** `tinker_cookbook.model_info.get_recommended_renderer_name`
-  supports the Llama 3.2 chat template out of the box.
-- **Import paths.** `TokenIDLogProbMixin` and `SimpleResponsesAPIModel` are
-  imported from `nemo_gym.openai_utils` and `nemo_gym.base_responses_api_model`
-  respectively — already correct.
-- **Agent port.** The agent listens on a random port allocated by nemo-gym
-  (default range 10001-20000). We discover it at runtime by polling
-  `GET {head_url}/server_instances` and filtering by
-  `server_type == "responses_api_agents"`.
-
-## 6. Multi-turn rollouts: datum_builder only used the first ForTraining output
-
-**Problem.** When pointed at nemo-gym's `example_multi_step` env (or any
-agent that emits multiple assistant turns before the final answer) the
-`/v1/responses` body contains *several* `NeMoGym*ForTraining` output items
-interleaved with `function_call_output` items. The old
-`_find_training_output` returned only the first match, so
-`extract_trajectory` silently dropped every generation after turn 1 — the
-datum covered just one turn of a potentially-10-turn trajectory.
-
-**Root cause.** `extract_trajectory` assumed a single ForTraining item. It
-called `_find_training_output` (first-match) and stopped.
-
-**Fix.** Added `_collect_training_outputs` in `datum_builder.py` that
-returns *every* ForTraining item in list order. `extract_trajectory` now
-concatenates their `generation_token_ids` + `generation_log_probs` across
-turns; the prompt for the trajectory is the prompt attached to the first
-item (subsequent items carry the same prefix plus prior generations).
-Falls back to the old top-level-fields shape for legacy fixtures. The
-old single-match helper is kept for backward compatibility.
-
-**Verification.** New integration tests drive a multi-turn response
-through `extract_trajectory` + `build_datum` and assert
-`len(target) == len(logprobs) == len(advantages)` with the concatenated
-generations (see `tests/integration/test_multi_turn.py`).
-
-## 7. Trainer had no error recovery around forward_backward / hot-swap
-
-**Problem.** A single transient tinker failure (503 from the backend,
-intermittent network hiccup) killed the training loop — the exception
-propagated up through `_run_step` and the outer run loop saw it only via
-`logger.exception`, but by then sampling state could be half-updated.
-
-**Root cause.** `_run_step` awaited `forward_backward_async`, `optim_step_async`,
-and `save_weights_and_get_sampling_client_async` without try/except.
-
-**Fix.** Wrapped the fwd_bwd + optim calls in a single try/except that
-logs at ERROR and returns metrics (with `fwd_bwd_error` marker) instead of
-raising. Separately wrapped the hot-swap. `_maybe_save_checkpoint` was
-already wrapped but is now exercised by a test. This lets the outer loop
-count the failed step and move on.
-
-**Verification.** `tests/integration/test_error_recovery.py` injects
-exceptions into each call and asserts the trainer survives and subsequent
-steps still run.
-
-## 8. Shim mapped every RuntimeError to 503, hiding real sampling failures
-
-**Problem.** When a sampling client raised — e.g. a real tinker backend
-error surfaced as a RuntimeError — the shim returned a 503. Callers
-treated that as "service not ready, retry after wiring up", which is wrong
-for a failure originating *inside* the sampling client (no amount of retry
-helps without a code fix).
-
-**Root cause.** The bare `except RuntimeError` in `chat_completions_endpoint`
-+ `responses_endpoint` swallowed *every* RuntimeError as 503.
-
-**Fix.** Narrowed the 503 mapping to RuntimeErrors whose message matches
-the "not initialized" / "no sampling client" strings we raise ourselves.
-Everything else (including arbitrary RuntimeErrors from the sampling
-client) maps to 500 with `detail="sampling_client error: ..."`. Process
-stays alive either way.
-
-**Verification.** New tests set up a sampling client that always raises
-RuntimeError; POST to `/v1/responses` and `/v1/chat/completions` now
-return 500 (not 503), the shim remains responsive, and a subsequent
-request after swapping in a healthy sampler succeeds. See
-`tests/integration/test_error_recovery.py::test_shim_handles_sampling_client_exception`.
-
-## 9. Resume from checkpoint was not wired
-
-**Problem.** The config schema had `checkpoint_dir` + `save_every` but no
-knob to *load* a saved checkpoint — the only way to restart a run was to
-hot-patch `training_client.load_state_async` manually.
-
-**Root cause.** Feature wasn't implemented. Not a bug per se, but the
-e2e test matrix needed a real resume path.
-
-**Fix.** Added `tinker.resume_from_checkpoint: str | None` to
-`TinkerConfig`. When set, `setup()` calls
-`training_client.load_state_async(path)` *before* the initial
-`save_weights_and_get_sampling_client_async(name="step_0")` call so the
-first rollout samples from the restored weights. Matches the real tinker
-SDK's signature: `load_state_async(path, weights_access_token=None)`.
-`FakeLoRATrainingClient` grows matching `load_state_async` +
-`load_state_with_optimizer_async` methods that record the call and stash
-the path on `self._loaded_state` for test assertions.
-
-**Verification.** `tests/integration/test_checkpoint_resume.py` verifies
-(a) load_state_async is called with the configured path, (b) it precedes
-the initial sampler snapshot, (c) load failures don't abort setup, and
-(d) a full save-then-resume-in-a-new-trainer roundtrip round-trips the
-path string correctly.
-
-## 10. Default `base_model` was a Tinker-unsupported variant
-
-**Problem.** Real Tinker rejected `create_lora_training_client_async` with
-an "unsupported base model" error when the smoke YAMLs specified
-`meta-llama/Llama-3.2-1B-Instruct`. Tinker's production roster doesn't
-include the 1B Llama-3.2 variant at the moment.
-
-**Root cause.** We picked the smallest Llama for the default because
-tokenizer caches were handy, but the backend's supported-model list is
-narrower than what the HF tokenizer zoo contains. `get_server_capabilities_async()`
-confirms which base models the server accepts.
-
-**Fix.** Switched the default `base_model` in both
-`configs/smoke_single_tool.yaml` and `configs/smoke_multi_step.yaml` to
-`meta-llama/Llama-3.1-8B-Instruct` — the smallest Llama currently on
-Tinker's supported list. Tokenizer is ungated enough to download on a
-fresh box; context length (32768) comfortably covers our smoke prompts.
-
-**Verification.** Real-Tinker smoke run now passes the
-`create_lora_training_client_async` step (next failure is further down
-the pipeline).
-
-## 11. `save_state_async(name=...)` rejected filesystem-style paths
-
-**Problem.** First real training step got past `forward_backward_async`
-and `optim_step_async` but `_maybe_save_checkpoint` raised — Tinker's
-server returned a validation error on the `name` field.
-
-**Root cause.** Tinker validates `save_state_async`'s `name` as a
-"weights label" and only allows `[A-Za-z0-9._-]`. Our code was passing
-a full filesystem path like `./checkpoints/smoke_single_tool/step_1`,
-which contains slashes and a leading dot-slash.
-
-**Fix.** `_maybe_save_checkpoint` now constructs a safe identifier
-from the checkpoint_dir basename plus the step number
-(`{dir_basename}-step-{n}`) and passes *that* to `save_state_async`.
-The filesystem-style path is still logged for operator readability but
-doesn't go to the API.
-
-**Verification.** Real-Tinker run: no more 400 on checkpoint save; the
-saved label appears in subsequent `list_state_ids_async` output.
-
-## 12. Hardcoded `agent_url` raced nemo-gym's random port allocation
-
-**Problem.** Trainer POSTed to `http://127.0.0.1:11001/run` and got
-`ConnectionRefusedError`. nemo-gym's agent was listening — just on a
-different random port inside the 10001-20000 range.
-
-**Root cause.** `configs/smoke_single_tool.yaml` (and `smoke_multi_step.yaml`)
-hardcoded `agent_url: http://127.0.0.1:11001` based on an old assumption
-that agents always land on 11001. In reality nemo-gym allocates agent
-ports randomly from 10001-20000 and only registers them with the
-HeadServer. We already have a discovery helper
-(`tinker_nemogym/utils/process.py::discover_agent_url`) that the trainer
-invokes when `cfg.nemogym.agent_url is None`.
-
-**Fix.** Set `agent_url: null` in both YAMLs to unlock the dynamic
-discovery path. No code change required — the discovery was wired but
-gated behind an explicit opt-in.
-
-**Verification.** Real-Tinker run: trainer logs
-`discovered agent_url=http://127.0.0.1:<port>` and the first `/run`
-request goes through without `ConnectionRefusedError`.
+**Fix**: bump `group_size: 8` in `configs/smoke_mcqa.yaml`. At T=1.0 with
+10-option MCQ on an 8B Instruct, 8 samples gives consistent intra-group
+variance. Confirmed: 3/5 steps produce n_datums=8, advantages + loss fire.
 
+Not a code fix — parameters only. Documented in the YAML with a comment so
+future operators don't regress it.
```
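The Fix #1 log-line behavior described in the diff can be sketched in a few lines of Python. This is an illustrative reconstruction, not the actual `trainer.py` code; the base format is taken from the Symptom, and only `save_count` and `loss` are keys documented by the fix — anything else about the metrics dict is assumed.

```python
def format_step_line(step: int, metrics: dict) -> str:
    """Per-step INFO line: append save_count/loss only when present.

    Constant-reward (skipped) steps carry neither key, so their line stays
    byte-identical to the old format — backward compatible by construction.
    """
    line = (
        f"Step {step}: mean_reward={metrics['mean_reward']} "
        f"n_datums={metrics['n_datums']} dropped={metrics['dropped']}"
    )
    if "save_count" in metrics:  # populated only when the hot-swap succeeded
        line += f" save_count={metrics['save_count']}"
    if "loss" in metrics:  # populated only when forward_backward returned metrics
        line += f" loss={metrics['loss']}"
    return line
```

Keeping the old prefix untouched means existing `grep "Step "` patterns over `/tmp/live_run_mcqa.log` keep working.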

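The Fix #2 pattern (await the save future instead of discarding it, then surface the `tinker://` URI) can be sketched as below. Only `save_state_async(name=...)`, `result_async()`, and `.path` come from the fix description; the `Fake*` classes and the URI value are hypothetical stand-ins for the tinker SDK, used so the sketch runs without a live backend.

```python
import asyncio
import logging
from dataclasses import dataclass

logger = logging.getLogger("tinker_nemogym.trainer")

@dataclass
class FakeSaveResult:
    path: str  # e.g. "tinker://<run-id>:train:0/weights/<label>"

class FakeSaveFuture:
    def __init__(self, path: str) -> None:
        self._path = path

    async def result_async(self) -> FakeSaveResult:
        return FakeSaveResult(self._path)

class FakeTrainingClient:
    """Stand-in for the tinker training client; real signatures may differ."""
    async def save_state_async(self, name: str) -> FakeSaveFuture:
        return FakeSaveFuture(f"tinker://demo-run:train:0/weights/{name}")

async def save_and_report(client: FakeTrainingClient, label: str) -> str:
    # The patched behavior: keep the save future, await its result, and
    # emit the URI in a second INFO line instead of dropping it.
    future = await client.save_state_async(name=label)
    result = await future.result_async()
    logger.info("Checkpoint saved: label=%s uri=%s", label, result.path)
    return result.path  # copy-pasteable into tinker.resume_from_checkpoint

uri = asyncio.run(save_and_report(FakeTrainingClient(), "smoke_mcqa-step-2"))
```

The returned URI is exactly what Phase 4's `resume_from_checkpoint` needs, with no trip to Tinker admin for the run_id.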
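The group_size tweak follows from how constant-reward groups behave under group-normalized advantages. The sketch below is an assumption about the general GRPO-style mechanism, not the trainer's exact normalization: a group whose rewards are all equal yields all-zero advantages (zero gradient), so the trainer drops it, and larger groups make that collapse rarer.

```python
from statistics import mean, pstdev

def group_advantages(rewards):
    """Group-normalized advantages; None for constant-reward groups.

    A constant-reward group would produce identical zero advantages
    (no gradient signal), so it is dropped rather than trained on.
    """
    if len(set(rewards)) == 1:
        return None  # constant reward within the group: dropped
    mu = mean(rewards)
    sigma = pstdev(rewards)  # nonzero here, since rewards are not constant
    return [(r - mu) / sigma for r in rewards]

# Illustrative odds for a binary 0/1 reward with P(correct) = p:
# a whole group collapses with probability p**n + (1 - p)**n, e.g. at
# p = 0.5 that is 12.5% for group_size=4 but under 1% for group_size=8.
```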