# Tinker + Agent-lightning Integration

This example shows how to use [Tinker's reinforcement-learning infrastructure](https://tinker-docs.thinkingmachines.ai/) as a fine-tuning backend for agents written against Agent-lightning. You author the agent exactly the way you would for deployment, while the bridge code reconstructs Tinker-compatible trajectories from Agent-lightning traces.

**NOTE: The example is tested against and compatible with Agent-lightning v0.2.x, but it is not yet exercised on CI due to the cost of running the Tinker training service.**

## How this differs from the original Tinker Cookbook RL recipe

Real-world agent apps orchestrate logic in familiar frameworks (CrewAI, LangChain, AutoGen, OpenAI Agents, etc.) or by calling OpenAI-compatible REST APIs. A simple number-guessing agent might look like this:

```python
import openai

# `MAX_TURNS`, `gold_answer`, and `extract_number` are assumed to be defined
# elsewhere; they are kept abstract here to focus on the control flow.
def guess_number_agent():
    client = openai.OpenAI()
    messages = [{"role": "user", "content": "Guess a number between 1 and 100."}]
    for _ in range(MAX_TURNS):
        response = client.chat.completions.create(model="gpt-4.1", messages=messages)
        response_content = response.choices[0].message.content
        messages.append({"role": "assistant", "content": response_content})
        guessed_number = extract_number(response_content)
        if guessed_number == gold_answer:
            return 1.0  # reward for a correct guess
        elif guessed_number < gold_answer:
            messages.append({"role": "user", "content": "Too low"})
        else:
            messages.append({"role": "user", "content": "Too high"})
    return 0.0
```

The reference [Tinker Cookbook example](https://github.com/thinking-machines-lab/tinker-cookbook/tree/51d9e8226f2dcf82ceac272c734a5f6e3b4f0203/tinker_cookbook/recipes/multiplayer_rl/guess_number), however, expects you to rewrite the same logic as a callback-style `Env`, and it drives a simple loop that alternates between a language model (`TokenCompleter`) and the `Env`:

```python
class GuessNumberEnv:
    def __init__(self, gold_answer: int):
        self.system_prompt: Message = {"role": "system", "content": SYSTEM_PROMPT}
        self.turns: list[Message] = []
        self.gold_answer: int = gold_answer

    async def initial_observation(self) -> list[int]:
        return message_to_tokens(self.system_prompt)

    async def step(self, action_tokens: list[int]) -> tuple[list[int], float, bool]:
        action_message = tokens_to_message(action_tokens)
        guessed_number = extract_number(action_message["content"])

        if guessed_number == self.gold_answer:
            text, reward = "Correct", 1.0
        elif guessed_number < self.gold_answer:
            text, reward = "Too low", 0.0
        else:
            text, reward = "Too high", 0.0

        self.turns.append(action_message)
        # Environment feedback is rendered as a user turn from the policy's perspective.
        self.turns.append({"role": "user", "content": text})
        episode_done = reward == 1 or len(self.turns) // 2 >= MAX_TURNS
        return message_to_tokens(self.turns), reward, episode_done
```

As agents grow more complex, writing them in callback style becomes increasingly painful: you have to break the control flow wherever an LLM call occurs, which fragments the code and makes it harder to maintain.

Agent-lightning hides that translation step: you keep the first style for development and production, while the framework queues tasks to the store, rebuilds trajectories from spans, and feeds them to the training loop. This example shows how to make Tinker's original training loop work with Agent-lightning.
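
To make that concrete, here is a rough sketch of how the same guessing agent could be exposed as an Agent-lightning rollout. It is hedged: the `@agl.rollout` decorator and `agl.LLM` resource follow the Agent-lightning v0.2 API as we understand it (check the docs for exact signatures), and the `task["gold_answer"]` field plus the `extract_number` helper are assumptions carried over from the snippet above.

```python
import openai
import agentlightning as agl

MAX_TURNS = 10  # hypothetical turn budget

@agl.rollout
def guess_number_rollout(task: dict, llm: agl.LLM) -> float:
    # The endpoint points at the trainable policy (here: TinkerLLM behind a LiteLLM proxy).
    client = openai.OpenAI(base_url=llm.endpoint, api_key="dummy")
    gold_answer = task["gold_answer"]  # hypothetical task schema
    messages = [{"role": "user", "content": "Guess a number between 1 and 100."}]
    for _ in range(MAX_TURNS):
        response = client.chat.completions.create(model=llm.model, messages=messages)
        content = response.choices[0].message.content
        messages.append({"role": "assistant", "content": content})
        guessed = extract_number(content)  # same helper as in the snippet above
        if guessed == gold_answer:
            return 1.0  # the return value is recorded as the episode reward
        messages.append({"role": "user", "content": "Too low" if guessed < gold_answer else "Too high"})
    return 0.0
```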

## Included files

| Path | Purpose |
| ---- | ------- |
| `hello.py` | Minimal end-to-end fine-tuning example. Trains a model to repeat small identity strings. |
| `q20_agent.py` | CrewAI flow that powers the 20 Questions player, answerer, and mock search tool. Shared by training and evaluation. **Unrelated to Agent-lightning or Tinker.** |
| `q20_train.py` | Reinforcement-learning driver that adapts the Cookbook loop to Agent-lightning rollouts. Supports dry-run, distributed training, and search tool toggles. **Related to both Agent-lightning and Tinker.** |
| `q20_evaluate.py` | Offline evaluator that reuses the CrewAI flow to benchmark any OpenAI- or Qwen-backed model against the provided dataset. **Related to Tinker only.** |
| `q20_nouns.csv` | Categories and answers used for training and validation. Contains `split` and `search_enabled` metadata. |
| `agl_tinker/` | Bridge package for integrating Agent-lightning with Tinker (see breakdown below). |
| `tests/test_tinker_llm.py` | Sanity tests for the custom LiteLLM provider. Run with `pytest examples/tinker/tests`. |
| `.env.example` | Template for environment variables required by LiteLLM, CrewAI helpers, and the hosted Tinker service. |

`agl_tinker/` components:

| Path | Purpose |
| ---- | ------- |
| `agl_tinker/algo.py` | Agent-lightning `Algorithm` wrapper that plugs the training loop into `agl.Trainer`. |
| `agl_tinker/env.py` | Dummy env and dataset builders that adapt Agent-lightning tasks to Tinker expectations. |
| `agl_tinker/llm.py` | LiteLLM custom provider backed by the Tinker sampling client. |
| `agl_tinker/rollout.py` | Span-to-trajectory reconstruction and rollout batching helpers. |
| `agl_tinker/train.py` | RL training loop adapted from the Tinker Cookbook. |

## Setup

**1. Install dependencies.** From the repo root:

```bash
uv sync --frozen --extra apo --group dev --group agents --group tinker
```

If you are not using `uv`, make sure `tinker`, `tinker_cookbook`, `litellm`, `crewai`, and Agent-lightning are available in the same environment.
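
As a rough pip-based equivalent (the package names below are best-effort assumptions; `tinker-cookbook` in particular may need to be installed from its GitHub repository, so check each project's install docs):

```bash
pip install agentlightning litellm crewai tinker
pip install git+https://github.com/thinking-machines-lab/tinker-cookbook.git
```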

**2. Copy the environment template and fill in credentials:**

```bash
cp examples/tinker/.env.example examples/tinker/.env
```

- `OPENAI_API_KEY` / `OPENAI_BASE_URL`: routes helper agents (answerer, search, tool simulations) through a LiteLLM or OpenAI-compatible endpoint.
- `TINKER_API_KEY`: required to talk to the hosted Tinker training service. Skip if you are using OpenAI models only.
- `WANDB_API_KEY`: optional, enables Weights & Biases logging when configured in `q20_train.py`.
- `CREWAI_DISABLE_TELEMETRY=true`: keeps CrewAI from emitting its own telemetry so that Agent-lightning tracing stays coherent.

**3. Load the environment** before running commands, e.g. `dotenv run -- <command>`, or export the variables manually.
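
A filled-in `.env` might look like this (all values below are placeholders, not working credentials):

```bash
OPENAI_API_KEY=sk-placeholder
OPENAI_BASE_URL=https://api.openai.com/v1
TINKER_API_KEY=tk-placeholder
WANDB_API_KEY=wandb-placeholder  # optional
CREWAI_DISABLE_TELEMETRY=true
```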

## Running the Hello 1024 example

This is the quickest way to see the integration in action. It fine-tunes a Qwen model so it introduces itself with the target identity.

**One-click workflow (spawns store, algorithm, and runners in a single process)**

```bash
dotenv run python hello.py oneclick
```

The script will pick free ports for the LiteLLM proxy and Agent-lightning store, then iterate through the synthetic dataset of identities.

**Distributed workflow (useful for inspecting each component)**

```bash
agl store --port 4747
dotenv run python hello.py algo
dotenv run python hello.py runner
```

Start the commands in separate terminals. The algorithm process connects to the existing store, while the runner process launches eight worker processes by default. Logs are written to `examples/tinker/logs/hello`.

## Training the 20 Questions agent

The 20 Questions setup mirrors the official Cookbook recipe but drives rollouts through the shared CrewAI flow.

**Dry run (in-memory store and LiteLLM proxy)**

```bash
dotenv run python q20_train.py dryrun
```

This is useful to verify that the CrewAI flow, reward emission, and span reconstruction succeed on a handful of samples without touching the hosted Tinker service.

**Full distributed training**

```bash
agl store --port 4747
dotenv run python q20_train.py algo --model qwen30b --search --port 4747
dotenv run python q20_train.py runner --port 4747 --n-runners 4
```

`--model` selects the Tinker-hosted checkpoint (`qwen4b` or `qwen30b`). Add `--search` to enable the mocked search tool, which relies on the helper LLM defined in the environment variables (the example uses an LLM-powered search simulation instead of a real API). Training metrics and checkpoints are recorded under `examples/tinker/logs/q20_*`.

You can run additional runner processes at any time; they register with the store and start dequeuing tasks immediately.
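
For example, to add a second runner from another terminal (this simply reuses the runner command shown above):

```bash
dotenv run python q20_train.py runner --port 4747 --n-runners 4
```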

## Evaluating a model on 20 Questions

Reuse the CrewAI flow to benchmark any OpenAI-compatible model (hosted on Tinker, OpenAI, or another LiteLLM backend):

```bash
dotenv run python q20_evaluate.py \
  --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --output-file logs/twenty_questions_results.jsonl \
  --search
```

Results are appended to the specified JSONL file so you can compute aggregate stats later.
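
As a quick sketch, aggregate accuracy can then be computed from the JSONL file. The `reward` field name below is an assumption about the record schema; adjust it to match the actual output of `q20_evaluate.py`:

```python
import json

# Hypothetical schema: one JSON object per evaluated episode with a "reward" field.
with open("logs/twenty_questions_results.jsonl") as f:
    rewards = [json.loads(line)["reward"] for line in f if line.strip()]

print(f"episodes={len(rewards)}, mean reward={sum(rewards) / len(rewards):.3f}")
```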

## How the bridge works

The `agl_tinker` package keeps the Tinker and Tinker Cookbook codebases untouched by emulating the interfaces they expect:

- `AGLDatasetBuilder` and `AGLDummyEnv` wrap plain Agent-lightning datasets so batches still yield Tinker `EnvGroupBuilder` objects, even though rollouts run remotely.
- `do_group_of_group_rollouts` (in [`rollout.py`](agl_tinker/rollout.py)) enqueues tasks to the Agent-lightning store, waits for runners to finish, then reconstructs `Trajectory` objects from span triplets collected by `TracerTraceToTriplet` (see the sketch after this list).
- `TinkerLLM` implements LiteLLM's `CustomLLM` so the training loop can update sampling clients and expose them through an OpenAI-compatible endpoint without rewriting agent code.
- `agl_tinker.algo.Tinker` satisfies Agent-lightning's `Algorithm` contract, meaning you can launch training via `agl.Trainer` alongside other algorithms, schedulers, or resources.
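
Conceptually, that reconstruction turns each `(prompt tokens, response tokens, reward)` triplet into one step of a trajectory. The sketch below only illustrates the idea, with hypothetical names and shapes; the real logic lives in `agl_tinker/rollout.py`:

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    """One policy turn reconstructed from spans (hypothetical shape)."""
    prompt_token_ids: list[int]    # tokens the policy conditioned on
    response_token_ids: list[int]  # tokens the policy sampled
    reward: float                  # reward attributed to this turn

def triplets_to_steps(triplets: list[Triplet]) -> list[dict]:
    """Flatten one episode's triplets into (observation, action, reward) steps."""
    return [
        {
            "observation_tokens": t.prompt_token_ids,
            "action_tokens": t.response_token_ids,
            "reward": t.reward,
        }
        for t in triplets
    ]
```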

Because spans and rewards are emitted by the same rollout function you would deploy, evaluation and production stay in sync: there are no separate simulator code paths to maintain.
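
For reference, LiteLLM's custom-provider hook has roughly the following shape. This is not the actual `TinkerLLM` implementation (the Tinker sampling-client call is deliberately elided); it only shows how a handler is registered so the proxy can route an OpenAI-compatible request to custom sampling code:

```python
import litellm
from litellm import CustomLLM, ModelResponse

class TinkerLLMSketch(CustomLLM):
    """Minimal custom-provider shape; not the real agl_tinker/llm.py code."""

    def completion(self, *args, **kwargs) -> ModelResponse:
        # A real implementation renders kwargs["messages"] to tokens, calls the
        # current Tinker sampling client, and wraps the sampled tokens (plus the
        # logprobs and token IDs needed for trajectory reconstruction) into a
        # ModelResponse.
        raise NotImplementedError("sampling-client call elided in this sketch")

# Registering the handler makes models routable as "tinker/<model-name>".
litellm.custom_provider_map = [
    {"provider": "tinker", "custom_handler": TinkerLLMSketch()}
]
```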

## Troubleshooting tips

- If the runner logs show `Triplet has no token_ids`, ensure your LiteLLM proxy returns logprobs and token IDs, and that the token IDs are present in the store. The provided adapter requires them to rebuild trajectories. See the debugging tutorial for more details.
- CrewAI telemetry must stay disabled (see `.env.example`) so AgentOps traces remain self-contained; otherwise, you may see malformed traces.
- Tune `learning_rate`, `batch_size`, and `group_size` carefully; training is sensitive to these hyperparameters.