Commit 8e8dd14

docs: live-Tinker E2E validation results + bugfix log
- docs/LIVE_RUN_RESULTS.md: full write-up of the 6-phase validation
  (varied-reward wiring, live mcqa run, multi-turn pipeline check, checkpoint
  resume, wandb offline logging, regression). Includes per-step result tables
  with actual loss values, captured checkpoint URIs, wandb artifacts path,
  and a tinker budget note.
- docs/E2E_FIXES.md: short running-log of the two trainer.py patches needed
  to make live runs self-validating (per-step save_count+loss; emit
  tinker:// URI after checkpoint save), plus the group_size=8 tuning note
  for the mcqa smoke.
1 parent 3a8e799 commit 8e8dd14

2 files changed: 206 additions & 265 deletions

File shown: docs/E2E_FIXES.md (39 additions & 265 deletions)
```diff
@@ -1,277 +1,51 @@
-# E2E fixes journal
+# E2E Live-Run Fixes
 
-Log of everything we fixed while getting the tinker-nemogym ↔ nemo-gym stack
-to run end-to-end without a real ``TINKER_API_KEY``. One paragraph per fix.
+Running-log of bugs found and patches applied while doing live-Tinker
+end-to-end validation beyond the minimal `smoke_single_tool` path. For the
+original session (first-ever live run) see commit `a558e63` — this document
+covers follow-ups.
 
-## 1. `policy_model` was using `vllm_model` and crashed on `choice.logprobs.content`
+## 2026-04-22 — Varied-reward / multi-step / resume / wandb validation
 
-**Problem.** Every `/run` call returned HTTP 500 from the agent with a cascade
-pointing at `responses_api_models/vllm_model/app.py:374`
-`KeyError: 'logprobs'`.
+### Fix #1: per-step INFO line missing hot-swap id + loss
 
-**Root cause.** `vllm_model` is configured for real vLLM: when
-`return_token_id_information: true` it POSTs `/v1/chat/completions` to the
-backing model with `logprobs=True, return_tokens_as_token_ids=True` and then
-parses `choice.logprobs.content[*].token` (of the form `"token_id:<int>"`).
-Our shim speaks OpenAI Responses API shape directly and does not produce
-vLLM's native `choice.logprobs.content`.
+**Symptom**: `/tmp/live_run_mcqa.log` showed only
+`Step N: mean_reward=X n_datums=Y dropped=Z`. No way to tell from that line
+alone whether the hot-swap fired or what the loss was, so the exercise of
+"prove hot-swap IDs progress between steps + prove loss is computed" had to
+fall back to reading wandb artifacts or Tinker server logs.
 
-**Fix.** Switched `configs/nemogym_agent.yaml` from `vllm_model` to
-`openai_model`, which forwards `/v1/responses` verbatim to our shim. Because
-nemo-gym's `NeMoGymResponse` pydantic model auto-selects the
-`ForTraining` output variants when `prompt_token_ids/generation_token_ids/
-generation_log_probs` are present on a message, the training fields flow
-through end-to-end without vLLM-specific translation.
+**Fix**: `tinker_nemogym/trainer.py::run` now appends ` save_count=<N>
+loss=<F>` to the INFO line whenever those keys are present in the metrics
+dict. `save_count` is only populated when the hot-swap succeeded, and `loss`
+only when `forward_backward_async` returned a metrics dict (tinker SDK
+convention: `loss:sum`). Constant-reward skipped steps have neither — the
+string remains unchanged for them, preserving backward-compatible log output
+for those cases.
 
-**Verification.** Manually ran a `/run` against the agent and confirmed a
-200 with `reward=1.0` and the training fields intact on
-`response.output[0]`.
+### Fix #2: Checkpoint URI not surfaced to operator
 
-## 2. `NeMoGymResponse.model_validate` rejected our shim's `/v1/responses` output
+**Symptom**: `Saving checkpoint label=smoke_mcqa-step-2 (logical path: ...)`
+tells the operator the local dir name but not the `tinker://<run-id>/.../<label>`
+URI that `load_state_async` expects. Previously we were awaiting the save
+future and discarding its `.path`. So after a run, if you wanted to
+`resume_from_checkpoint`, you had to ask Tinker admin for the run_id.
 
-**Problem.** Agent returned HTTP 500 when validating our shim's response:
-``3 validation errors for NeMoGymResponse`` (three missing fields).
+**Fix**: `_maybe_save_checkpoint` now captures `result = await
+future.result_async()`, reads `result.path`, and emits a second INFO line:
+`Checkpoint saved: label=<L> uri=tinker://<run-id>:train:0/weights/<L>`.
+That URI is now copy-pasteable into `tinker.resume_from_checkpoint` for
+the next run. Phase 4 of the live validation relies on this.
 
-**Root cause.** `nemo_gym.openai_utils.NeMoGymResponse` inherits from the
-OpenAI `Response` model which requires `parallel_tool_calls`, `tool_choice`,
-and `tools` even for tool-free responses.
+### Config tweak: mcqa smoke needs group_size=8 (not 4)
 
-**Fix.** `tinker_responses_model.responses_endpoint` now echoes these fields
-from the request (defaulting to `auto`/`True`/`[]` when absent) onto the
-response body.
+**Symptom**: first mcqa smoke with group_size=4 gave 1/5 step with
+n_datums>0; the other four all collapsed to constant reward (0 or 1) within
+each group and were dropped.
 
-**Verification.** Re-ran a rollout; response validated and the
-`/run` round-trip succeeded.
-
-## 3. `extract_trajectory` looked at the wrong nesting level
-
-**Problem.** After fix #2 the rollout returned 200 but
-`extract_trajectory` raised: ``Response is missing token_ids/logprobs``.
-
-**Root cause.** `NeMoGymResponse.output` is a **list** of heterogeneous
-items; the training fields are carried on the first message item
-(`NeMoGymResponseOutputMessageForTraining`), **not** at
-`response["prompt_token_ids"]`. Our extractor was only checking the top
-level of the `response` dict.
-
-**Fix.** Added `_find_training_output` that scans `response.output` for
-the first item with all three training fields. Keeps the top-level
-fallback for backward compatibility with older synthetic fixtures and
-unit tests.
-
-**Verification.** `rollout_group` → `extract_trajectory` →
-`build_datum` pipeline now succeeds on the real agent response.
-
-## 4. Trainer returned before its uvicorn shim was listening
-
-**Problem.** In fast-iteration scripts that did `trainer.setup()` followed
-immediately by HTTP calls to `http://127.0.0.1:8001`, `aiohttp` raised
-`ConnectionRefusedError: Cannot connect to host 127.0.0.1:8001`.
-
-**Root cause.** `_start_server_thread` spawned the uvicorn thread but
-returned immediately — no readiness barrier. The main thread raced the
-listener socket.
-
-**Fix.** `_start_server_thread` now polls `server.started` (uvicorn's
-post-bind signal) for up to 10 seconds and logs the readiness state.
-
-**Verification.** Scripts stop producing connection-refused errors; e2e
-script completes cleanly.
-
-## 5. `example_single_tool_call`'s verifier always returns reward=1.0
-
-**Problem.** After the above fixes, `_run_step` kept early-exiting:
-``all groups had constant reward, skipping training step``. The call
-history showed `save_weights_and_get_sampling_client_async` but **no**
-`forward_backward_async` or `optim_step_async`.
-
-**Root cause.** Not a bug — `SimpleWeatherResourcesServer.verify` is
-hardcoded to reward=1.0. GRPO's "drop constant-reward groups" short-circuit
-(designed to skip zero-gradient steps) was doing its job.
-
-**Fix.** Added `training.drop_constant_reward: bool = True` to
-`TrainingConfig`; the e2e script sets it to `False` so the full training
-pipeline gets exercised even against the toy verifier.
-
-**Verification.** `call_history` now contains `forward_backward_async`,
-`optim_step_async`, and multiple `save_weights_and_get_sampling_client_async`
-calls across steps. Hot-swap verified by comparing
-`trainer.current_sampling_client` identity between steps — it changes
-after each step.
-
-## Nothing else needed a fix
-
-- **Tokenizer.** `meta-llama/Llama-3.2-1B` was already cached in
-  `~/.cache/huggingface/hub` on the dev machine and loads without an HF
-  token. If a different environment gates this model, the fix is to set
-  `tinker.base_model` in the YAML to an ungated model (e.g.
-  `Qwen/Qwen2.5-0.5B`). Keeping the gated one as the default mirrors
-  production.
-- **Renderer.** `tinker_cookbook.model_info.get_recommended_renderer_name`
-  supports the Llama 3.2 chat template out of the box.
-- **Import paths.** `TokenIDLogProbMixin` and `SimpleResponsesAPIModel` are
-  imported from `nemo_gym.openai_utils` and `nemo_gym.base_responses_api_model`
-  respectively — already correct.
-- **Agent port.** The agent listens on a random port allocated by nemo-gym
-  (default range 10001-20000). We discover it at runtime by polling
-  `GET {head_url}/server_instances` and filtering by
-  `server_type == "responses_api_agents"`.
-
-## 6. Multi-turn rollouts: datum_builder only used the first ForTraining output
-
-**Problem.** When pointed at nemo-gym's `example_multi_step` env (or any
-agent that emits multiple assistant turns before the final answer) the
-`/v1/responses` body contains *several* `NeMoGym*ForTraining` output items
-interleaved with `function_call_output` items. The old
-`_find_training_output` returned only the first match, so
-`extract_trajectory` silently dropped every generation after turn 1 — the
-datum covered just one turn of a potentially-10-turn trajectory.
-
-**Root cause.** `extract_trajectory` assumed a single ForTraining item. It
-called `_find_training_output` (first-match) and stopped.
-
-**Fix.** Added `_collect_training_outputs` in `datum_builder.py` that
-returns *every* ForTraining item in list order. `extract_trajectory` now
-concatenates their `generation_token_ids` + `generation_log_probs` across
-turns; the prompt for the trajectory is the prompt attached to the first
-item (subsequent items carry the same prefix plus prior generations).
-Falls back to the old top-level-fields shape for legacy fixtures. The
-old single-match helper is kept for backward compatibility.
-
-**Verification.** New integration tests drive a multi-turn response
-through `extract_trajectory` + `build_datum` and assert
-`len(target) == len(logprobs) == len(advantages)` with the concatenated
-generations (see `tests/integration/test_multi_turn.py`).
-
-## 7. Trainer had no error recovery around forward_backward / hot-swap
-
-**Problem.** A single transient tinker failure (503 from the backend,
-intermittent network hiccup) killed the training loop — the exception
-propagated up through `_run_step` and the outer run loop saw it only via
-`logger.exception`, but by then sampling state could be half-updated.
-
-**Root cause.** `_run_step` awaited `forward_backward_async`, `optim_step_async`,
-and `save_weights_and_get_sampling_client_async` without try/except.
-
-**Fix.** Wrapped the fwd_bwd + optim calls in a single try/except that
-logs at ERROR and returns metrics (with `fwd_bwd_error` marker) instead of
-raising. Separately wrapped the hot-swap. `_maybe_save_checkpoint` was
-already wrapped but is now exercised by a test. This lets the outer loop
-count the failed step and move on.
-
-**Verification.** `tests/integration/test_error_recovery.py` injects
-exceptions into each call and asserts the trainer survives and subsequent
-steps still run.
-
-## 8. Shim mapped every RuntimeError to 503, hiding real sampling failures
-
-**Problem.** When a sampling client raised — e.g. a real tinker backend
-error surfaced as a RuntimeError — the shim returned a 503. Callers
-treated that as "service not ready, retry after wiring up", which is wrong
-for a failure originating *inside* the sampling client (no amount of retry
-helps without a code fix).
-
-**Root cause.** The bare `except RuntimeError` in `chat_completions_endpoint`
-+ `responses_endpoint` swallowed *every* RuntimeError as 503.
-
-**Fix.** Narrowed the 503 mapping to RuntimeErrors whose message matches
-the "not initialized" / "no sampling client" strings we raise ourselves.
-Everything else (including arbitrary RuntimeErrors from the sampling
-client) maps to 500 with `detail="sampling_client error: ..."`. Process
-stays alive either way.
-
-**Verification.** New tests set up a sampling client that always raises
-RuntimeError; POST to `/v1/responses` and `/v1/chat/completions` now
-return 500 (not 503), the shim remains responsive, and a subsequent
-request after swapping in a healthy sampler succeeds. See
-`tests/integration/test_error_recovery.py::test_shim_handles_sampling_client_exception`.
-
-## 9. Resume from checkpoint was not wired
-
-**Problem.** The config schema had `checkpoint_dir` + `save_every` but no
-knob to *load* a saved checkpoint — the only way to restart a run was to
-hot-patch `training_client.load_state_async` manually.
-
-**Root cause.** Feature wasn't implemented. Not a bug per se, but the
-e2e test matrix needed a real resume path.
-
-**Fix.** Added `tinker.resume_from_checkpoint: str | None` to
-`TinkerConfig`. When set, `setup()` calls
-`training_client.load_state_async(path)` *before* the initial
-`save_weights_and_get_sampling_client_async(name="step_0")` call so the
-first rollout samples from the restored weights. Matches the real tinker
-SDK's signature: `load_state_async(path, weights_access_token=None)`.
-`FakeLoRATrainingClient` grows matching `load_state_async` +
-`load_state_with_optimizer_async` methods that record the call and stash
-the path on `self._loaded_state` for test assertions.
-
-**Verification.** `tests/integration/test_checkpoint_resume.py` verifies
-(a) load_state_async is called with the configured path, (b) it precedes
-the initial sampler snapshot, (c) load failures don't abort setup, and
-(d) a full save-then-resume-in-a-new-trainer roundtrip round-trips the
-path string correctly.
-
-## 10. Default `base_model` was a Tinker-unsupported variant
-
-**Problem.** Real Tinker rejected `create_lora_training_client_async` with
-an "unsupported base model" error when the smoke YAMLs specified
-`meta-llama/Llama-3.2-1B-Instruct`. Tinker's production roster doesn't
-include the 1B Llama-3.2 variant at the moment.
-
-**Root cause.** We picked the smallest Llama for the default because
-tokenizer caches were handy, but the backend's supported-model list is
-narrower than what the HF tokenizer zoo contains. `get_server_capabilities_async()`
-confirms which base models the server accepts.
-
-**Fix.** Switched the default `base_model` in both
-`configs/smoke_single_tool.yaml` and `configs/smoke_multi_step.yaml` to
-`meta-llama/Llama-3.1-8B-Instruct` — the smallest Llama currently on
-Tinker's supported list. Tokenizer is ungated enough to download on a
-fresh box; context length (32768) comfortably covers our smoke prompts.
-
-**Verification.** Real-Tinker smoke run now passes the
-`create_lora_training_client_async` step (next failure is further down
-the pipeline).
-
-## 11. `save_state_async(name=...)` rejected filesystem-style paths
-
-**Problem.** First real training step got past `forward_backward_async`
-and `optim_step_async` but `_maybe_save_checkpoint` raised — Tinker's
-server returned a validation error on the `name` field.
-
-**Root cause.** Tinker validates `save_state_async`'s `name` as a
-"weights label" and only allows `[A-Za-z0-9._-]`. Our code was passing
-a full filesystem path like `./checkpoints/smoke_single_tool/step_1`,
-which contains slashes and a leading dot-slash.
-
-**Fix.** `_maybe_save_checkpoint` now constructs a safe identifier
-from the checkpoint_dir basename plus the step number
-(`{dir_basename}-step-{n}`) and passes *that* to `save_state_async`.
-The filesystem-style path is still logged for operator readability but
-doesn't go to the API.
-
-**Verification.** Real-Tinker run: no more 400 on checkpoint save; the
-saved label appears in subsequent `list_state_ids_async` output.
-
-## 12. Hardcoded `agent_url` raced nemo-gym's random port allocation
-
-**Problem.** Trainer POSTed to `http://127.0.0.1:11001/run` and got
-`ConnectionRefusedError`. nemo-gym's agent was listening — just on a
-different random port inside the 10001-20000 range.
-
-**Root cause.** `configs/smoke_single_tool.yaml` (and `smoke_multi_step.yaml`)
-hardcoded `agent_url: http://127.0.0.1:11001` based on an old assumption
-that agents always land on 11001. In reality nemo-gym allocates agent
-ports randomly from 10001-20000 and only registers them with the
-HeadServer. We already have a discovery helper
-(`tinker_nemogym/utils/process.py::discover_agent_url`) that the trainer
-invokes when `cfg.nemogym.agent_url is None`.
-
-**Fix.** Set `agent_url: null` in both YAMLs to unlock the dynamic
-discovery path. No code change required — the discovery was wired but
-gated behind an explicit opt-in.
-
-**Verification.** Real-Tinker run: trainer logs
-`discovered agent_url=http://127.0.0.1:<port>` and the first `/run`
-request goes through without `ConnectionRefusedError`.
+**Fix**: bump `group_size: 8` in `configs/smoke_mcqa.yaml`. At T=1.0 with
+10-option MCQ on an 8B Instruct, 8 samples gives consistent intra-group
+variance. Confirmed: 3/5 steps produce n_datums=8, advantages + loss fire.
 
+Not a code fix — parameters only. Documented in the YAML with a comment so
+future operators don't regress it.
```
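The Fix #1 log-line behavior described in the diff can be sketched in a few lines of Python. This is an illustrative reconstruction, not the actual `trainer.py` code; the base format is taken from the Symptom, and only `save_count` and `loss` are keys documented by the fix — anything else about the metrics dict is assumed.

```python
def format_step_line(step: int, metrics: dict) -> str:
    """Per-step INFO line: append save_count/loss only when present.

    Constant-reward (skipped) steps carry neither key, so their line stays
    byte-identical to the old format — backward compatible by construction.
    """
    line = (
        f"Step {step}: mean_reward={metrics['mean_reward']} "
        f"n_datums={metrics['n_datums']} dropped={metrics['dropped']}"
    )
    if "save_count" in metrics:  # populated only when the hot-swap succeeded
        line += f" save_count={metrics['save_count']}"
    if "loss" in metrics:  # populated only when forward_backward returned metrics
        line += f" loss={metrics['loss']}"
    return line
```

Keeping the old prefix untouched means existing `grep "Step "` patterns over `/tmp/live_run_mcqa.log` keep working.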

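The Fix #2 pattern (await the save future instead of discarding it, then surface the `tinker://` URI) can be sketched as below. Only `save_state_async(name=...)`, `result_async()`, and `.path` come from the fix description; the `Fake*` classes and the URI value are hypothetical stand-ins for the tinker SDK, used so the sketch runs without a live backend.

```python
import asyncio
import logging
from dataclasses import dataclass

logger = logging.getLogger("tinker_nemogym.trainer")

@dataclass
class FakeSaveResult:
    path: str  # e.g. "tinker://<run-id>:train:0/weights/<label>"

class FakeSaveFuture:
    def __init__(self, path: str) -> None:
        self._path = path

    async def result_async(self) -> FakeSaveResult:
        return FakeSaveResult(self._path)

class FakeTrainingClient:
    """Stand-in for the tinker training client; real signatures may differ."""
    async def save_state_async(self, name: str) -> FakeSaveFuture:
        return FakeSaveFuture(f"tinker://demo-run:train:0/weights/{name}")

async def save_and_report(client: FakeTrainingClient, label: str) -> str:
    # The patched behavior: keep the save future, await its result, and
    # emit the URI in a second INFO line instead of dropping it.
    future = await client.save_state_async(name=label)
    result = await future.result_async()
    logger.info("Checkpoint saved: label=%s uri=%s", label, result.path)
    return result.path  # copy-pasteable into tinker.resume_from_checkpoint

uri = asyncio.run(save_and_report(FakeTrainingClient(), "smoke_mcqa-step-2"))
```

The returned URI is exactly what Phase 4's `resume_from_checkpoint` needs, with no trip to Tinker admin for the run_id.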
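The group_size tweak follows from how constant-reward groups behave under group-normalized advantages. The sketch below is an assumption about the general GRPO-style mechanism, not the trainer's exact normalization: a group whose rewards are all equal yields all-zero advantages (zero gradient), so the trainer drops it, and larger groups make that collapse rarer.

```python
from statistics import mean, pstdev

def group_advantages(rewards):
    """Group-normalized advantages; None for constant-reward groups.

    A constant-reward group would produce identical zero advantages
    (no gradient signal), so it is dropped rather than trained on.
    """
    if len(set(rewards)) == 1:
        return None  # constant reward within the group: dropped
    mu = mean(rewards)
    sigma = pstdev(rewards)  # nonzero here, since rewards are not constant
    return [(r - mu) / sigma for r in rewards]

# Illustrative odds for a binary 0/1 reward with P(correct) = p:
# a whole group collapses with probability p**n + (1 - p)**n, e.g. at
# p = 0.5 that is 12.5% for group_size=4 but under 1% for group_size=8.
```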