[Agent refactor] Event-driven agent loop with hooks, interrupts, and steering

<details>
<summary>中文</summary>

当前的 agent loop（`pkg/agent/loop.go`）是一个黑盒——外部代码无法观察内部执行状态，无法 hook 执行过程，无法打断正在进行的处理，也无法在处理过程中追加消息。本提案将 agent loop 重新设计为事件驱动、可 hook、可中断、可追加的系统。

## 问题是什么？

现有的 `runAgentLoop` → `runLLMIteration` 处理流水线存在四个架构层面的限制：

1. **无可观测性** — Loop 不发射任何事件。外部消费者（UI、日志、自动化脚本）无法得知 loop 处于哪个阶段、正在执行哪个工具、或 Turn 为何结束。
2. **无 Hook 点** — 外部代码无法拦截或修改 LLM 请求，无法审批/拒绝工具执行，也无法在运行时改变 loop 行为。
3. **无中断机制** — Turn 一旦开始就无法停止。没有优雅中断（跳过剩余工具让 LLM 生成总结），也没有硬中断（立即取消一切）。
4. **无消息注入** — 外部代码无法向正在进行的 Turn 注入消息（steering），也无法排队等待 Turn 结束后处理的消息（follow-up）。

此外，工具执行通过 WaitGroup 完全并行，工具之间没有检查点，无法在工具执行间检查中断或新的 steering 消息。

---

## 重构目标与非目标

### 目标

1. **事件透明**: loop 内部的每个关键阶段都向外发射事件
2. **Hook 可控**: 外部可以在关键阶段拦截和修改行为
3. **可打断**: 支持 Graceful Interrupt 和 Hard Abort
4. **可追加**: 支持 Steering（影响当前 Turn）和 FollowUp（排队等待）
5. **工具智能并行**: 只读工具并行，有副作用的工具顺序执行

### 非目标

- 不改变 MessageBus 的 inbound/outbound 模型
- 不改变 Provider 接口（`LLMProvider.Chat`）
- 不改变 Tool 接口的核心签名（`Execute` / `ExecuteAsync`）
- 不改变 `ContextBuilder.BuildMessages` 的输入输出契约
- 不改变 session 的存储格式（JSON）
- 不引入外部依赖

---

## 边界定义

### 什么在 Loop 内部，什么在外部

```mermaid
graph TD
 subgraph AgentLoop["AgentLoop (外壳)"]
 Run["Run() 消费 MessageBus，发送最终回复 这一层不变，只增加 EventBus 初始化"]

 subgraph ProcessMessage["processMessage() — 前置处理层"]
 Transcribe["transcribeAudio 保持在外部"]
 Route["resolveMessageRoute 保持在外部"]
 Command["handleCommand 保持在外部"]
 end

 subgraph RunTurn["runTurn() — 重构后的核心 (原 runAgentLoop)"]
 subgraph TurnBoundary["Turn 边界 — 事件 + Hook 覆盖范围"]
 Context["初始上下文组装 ContextBuilder.BuildMessages() 仍为纯函数 resolveMediaRefs() 在此处调用 产出: []providers.Message"]

 subgraph LLMLoop["LLM 迭代循环"]
 DrainSteering["drain steering queue"]
 BeforeLLM["Hook: BeforeLLMRequest"]
 CallLLM["调用 Provider.Chat()"]
 AfterLLM["Hook: AfterLLMResponse"]
 ParseTools["解析 tool calls"]

 subgraph ToolExec["工具执行 (智能并行)"]
 BeforeTool["Hook: BeforeToolExecute"]
 Approval["Hook: RequestApproval"]
 ExecTool["Execute tool"]
 AfterTool["Hook: AfterToolExecute"]
 end

 CheckInterrupt["检查 steering / interrupt"]
 end

 Compress["上下文压缩 (retry 分支) forceCompression() Hook: BeforeContextCompress"]
 SessionSave["Session 保存 中间: AddFullMessage (只写内存) 结束: AddMessage + Save (落盘) 中断: 回滚内存中的中间状态"]
 Summarize["摘要触发 (Turn 结束后异步)"]
 end
 end

 API1["InjectSteering() ← 公开 API"]
 API2["InjectFollowUp() ← 公开 API"]
 API3["InterruptGraceful() ← 公开 API"]
 API4["InterruptHard() ← 公开 API"]
 API5["SubscribeEvents() ← 公开 API"]
 API6["RegisterHook() ← 公开 API"]
 API7["GetActiveTurn() ← 公开 API"]
 end

 Run --> ProcessMessage
 ProcessMessage --> RunTurn
 Context --> LLMLoop
 DrainSteering --> BeforeLLM --> CallLLM --> AfterLLM --> ParseTools --> ToolExec
 BeforeTool --> Approval --> ExecTool --> AfterTool
 ToolExec --> CheckInterrupt
 CheckInterrupt -->|"继续迭代"| DrainSteering
 LLMLoop --> Compress
 LLMLoop --> SessionSave
 SessionSave --> Summarize
```

---

## 核心概念

### Turn 与 Iteration

| 术语 | 定义 |
|------|------|
| **Turn** | 一次完整的「用户输入 → LLM 迭代 → 最终回复」处理过程 |
| **Iteration** | Turn 内的一次 LLM 调用（可能产生工具调用，触发下一次 Iteration） |
| **Steering** | 在 Turn 进行中注入的消息，影响当前 Turn 的下一次 Iteration |
| **FollowUp** | 在 Turn 进行中排队的消息，当前 Turn 结束后作为新的 Inbound 处理 |
| **Graceful Interrupt** | 跳过剩余工具，可注入最后一条 steering 提示，让 LLM 生成总结后结束 Turn |
| **Hard Abort** | 立即取消 Provider 调用和所有工具执行，Turn 以错误结束 |
| **SubTurn** | 由父 Turn 内部 spawn 的子 Turn，走完整 `runTurn` 路径，拥有独立 turnID |
| **Ephemeral Session** | SubTurn 使用的纯内存 session，不落盘，SubTurn 结束后清理 |

---

## EventBus 是什么？

**EventBus** 是一个多订阅者广播系统，agent loop 通过它在每个关键执行阶段发射结构化事件。

可以理解为：*「让任何人都能实时观察 loop 在做什么，而不影响其行为。」*

核心设计要点：
- **非阻塞 fan-out**：Emit 遍历订阅者并以非阻塞方式发送；channel 满时丢弃事件（永远不阻塞 loop）
- **Per-EventKind 丢弃计数**：`[EventKindCount]atomic.Int64` 数组，可精确诊断哪类事件正在丢失
- **无独立 goroutine**：Emit 在 loop 自身的 goroutine 中运行（减少 goroutine 数量，适合 $10 开发板、<10MB 内存的场景）
- **每个订阅者缓冲容量 16**：考虑到 LLM 调用延迟主导迭代时间，此容量足够

事件类型覆盖完整的 Turn 生命周期：

| 分类 | 事件 |
|------|------|
| Turn 生命周期 | `TurnStart`, `TurnEnd` |
| LLM 调用 | `LLMRequest`, `LLMDelta`, `LLMResponse`, `LLMRetry` |
| 上下文管理 | `ContextCompress`, `SessionSummarize` |
| 工具执行 | `ToolExecStart`, `ToolExecEnd`, `ToolExecSkipped` |
| 外部交互 | `SteeringInjected`, `FollowUpQueued`, `InterruptReceived` |
| 子 Agent 生命周期 | `SubTurnSpawn`, `SubTurnEnd`, `SubTurnResultDelivered` |
| 错误 | `Error` |

---

## Hook 系统是什么？

**Hook 系统**允许外部代码在明确定义的执行节点上观察和拦截 loop 的执行。Hook 是同步的、按优先级排序的、且有超时保护。

共五个 hook 接口，均为可选——hook 只需实现它需要的接口：

| 接口 | 用途 | 可修改？ |
|------|------|----------|
| `EventObserver` | 被动接收事件 | 否 |
| `LLMInterceptor` | 在 LLM 调用前后拦截 | 是——可修改消息、模型、工具定义；可中止 Turn |
| `ToolInterceptor` | 在工具执行前后拦截 | 是——可修改参数、拒绝工具、中止 Turn |
| `ToolApprover` | 暂停等待外部审批（如 UI 确认） | 否——审批或拒绝 |
| `ContextCompressInterceptor` | 在上下文压缩前拦截 | 否——可中止 Turn |

Hook 动作：`Continue`、`Modify`、`AbortTurn`、`HardAbort`、`DenyTool`。

**超时分级**：
- 普通 Hook（Observer、LLMInterceptor、ToolInterceptor）：默认 **5s**，超时 → `Continue`
- 审批 Hook（ToolApprover）：默认 **60s**（可配置），超时 → `deny`（安全优先）

审批在独立 goroutine 中运行，使用 channel+timeout 模式，因此即使 hook 实现者行为不当也无法永久阻塞 loop。

---

## 中断与消息注入机制是什么？

### 中断

两种中断模式，作为 `AgentLoop` 的公开 API 暴露：

| | 优雅中断（Graceful Interrupt） | 硬中断（Hard Abort） |
|---|---|---|
| **行为** | 设置 `interrupted=true`，跳过剩余工具，注入 steering 提示，允许再做一轮 LLM 迭代生成总结 | 设置 `aborted=true`，立即取消 provider context 和 turn context |
| **Session** | 保留已完成的工具结果，正常保存 | 回滚到快照点（仅内存回滚，不写盘） |
| **子 Turn** | **不级联**——让子 Turn 自然完成 | 通过 `cascadeAbort` 递归级联到所有子 Turn |

### Steering 与 Follow-up

| | Steering | Follow-up |
|---|---|---|
| **时机** | 影响当前 Turn 的下一次 Iteration | 排队等待；在当前 Turn 结束后作为新的 Inbound 处理 |
| **消费方式** | 在每次迭代循环顶部 drain | 作为 `runTurn` 的返回值，由调用者在所有清理完成后投递 |
| **边界情况** | 如果 LLM 返回最终内容（无工具调用）但 steering 队列非空：若迭代余量充足则继续迭代，若已达 MaxIterations 则降级为 follow-up | 通过 `bus.PublishInbound` 在 `runTurn` 完全返回后投递——避免与 defer 清理产生竞态 |

---

## 工具执行策略

当前基于 WaitGroup 的并行执行替换为**智能分组**：

```
工具调用: [read_file, list_dir, write_file, web_search, exec]

分组:
 → {read_file, list_dir} parallel=true (均为只读)
 → {write_file} parallel=false (有副作用)
 → {web_search} parallel=true (只读)
 → {exec} parallel=false (有副作用)
```

- **只读工具**（通过可选的 `ReadOnlyIndicator` 接口声明）：连续的只读工具并行执行
- **有副作用的工具**：顺序执行，每个工具之间有 steering/中断检查点
- 未声明的工具默认为有副作用（保守策略——只会降低并行度，不会导致正确性问题）
- `IsReadOnly()` 是静态的（无参数）——有意不支持依赖参数的只读判断，以保持简单和安全

---

## 多 Agent 协作（子 Turn）

当前的 `SubagentManager` + `RunToolLoop` 架构存在关键缺陷：

1. `RunToolLoop` 完全脱离 EventBus/Hook/中断体系
2. 异步子 agent 结果路由到 `agent:main:main` session，丢失原始对话上下文
3. 无父子 Turn 关系——无法级联中断
4. 子 agent 不可被外部中断

### 设计：通过 `runTurn` 实现子 Turn

子 agent 重构为使用与顶层 Turn 相同的 `runTurn` 路径：

- **`spawnSubTurn(parentTS, config)`** — 同步和异步子 Turn 的统一入口
- **Turn 树**：通过 `turnState` 中的 `parentTurnID` / `childTurnIDs` 追踪父子关系
- **Ephemeral Session**：子 Turn 使用纯内存 session（无磁盘 I/O，无需清理）
- **资源控制**：深度限制（默认 3），per-agent 并发信号量（buffered channel，非阻塞获取）
- **结果投递**：同步 → 直接返回 ToolResult；异步 → `deliverSubTurnResult` 投递到父 Turn 的 `pendingResults` 队列（在下次 Iteration 前 drain）或写入 session 历史（如果父 Turn 已结束）

---

## 与现有代码的关系

| 当前 | 重构后 |
|------|--------|
| `runAgentLoop()` | 重命名为 `runTurn()`，使用 turnState + 事件发射重构 |
| `runLLMIteration()` | 合并到 `runTurn()` 的迭代循环中 |
| 工具执行（`runLLMIteration` 中的 WaitGroup） | 提取到 `tool_exec.go`，使用智能分组 |
| `SubagentTool` → `RunToolLoop` | `SubagentTool` → `spawnSubTurn(Sync)` → `runTurn` |
| `SpawnTool` → `RunToolLoop`（异步） | `SpawnTool` → `spawnSubTurn(Async)` → goroutine 中 `runTurn` |
| 无可观测性 | EventBus，17 种事件类型 |
| 无 Hook | HookManager，5 种 Hook 接口 |
| 无中断 | `InterruptGraceful` / `InterruptHard`，含 session 回滚 |
| 无消息注入 | `InjectSteering` / `InjectFollowUp`，基于队列的投递 |

### 新增文件

| 文件 | 职责 |
|------|------|
| `pkg/agent/events.go` | 事件类型、EventKind 枚举、Payload 结构体 |
| `pkg/agent/eventbus.go` | EventBus：Subscribe/Emit/Close |
| `pkg/agent/hooks.go` | Hook 接口族、HookManager |
| `pkg/agent/turn_state.go` | turnState、steering/followup 队列、中断标记 |
| `pkg/agent/tool_exec.go` | 带 Hook 的智能工具执行 |
| `pkg/agent/subturn.go` | spawnSubTurn、deliverSubTurnResult |

### 修改文件

| 文件 | 范围 |
|------|------|
| `pkg/agent/loop.go` | 大——新增字段，`runTurn` 重写，公开 API |
| `pkg/tools/base.go` | 极小——新增 `ReadOnlyIndicator` 接口 |
| `pkg/tools/*.go` | 小——为已知只读工具添加 `IsReadOnly()` |
| `pkg/agent/instance.go` | 小——新增子 Turn 配置字段 |
| `pkg/tools/subagent.go` | 大——重写为调用 `spawnSubTurn` |
| `pkg/tools/spawn.go` | 大——重写为调用 `spawnSubTurn` |

---

## 并发安全

核心不变量：

- **锁顺序**：`turnState.mu` → `eventBus.mu (RLock)` → `sessionManager.mu`——永远不逆序
- **锁外发射事件**：EventBus.Emit 在释放 `turnState.mu` 后调用，最小化持锁时间
- **providerCancel 双向检查**：Loop 设置 `providerCancel` 后检查 `aborted`；`InterruptHard` 设置 `aborted` 后调用 `providerCancel`——在 `ts.mu` 保护下互为冗余
- **FollowUp 作为返回值**：`runTurn` 返回 follow-up 而非在 defer 中投递——消除清理与新 Turn 启动之间的竞态
- **审批 goroutine 隔离**：审批在独立 goroutine 中运行并带超时——即使 Hook 行为不当也无法阻塞 loop
- **Session 快照修正**：`forceCompression()` 后立即更新 `ts.sessionSnapshot`，防止回滚目标过时

---

## 开放问题

1. EventBus 是否应支持过滤订阅（仅订阅特定 EventKind），还是应由订阅者自行过滤？
2. 是否需要内置一个「debug hook」将所有事件写入文件，通过配置开关启用？
3. 如果异步子 Turn 的结果在父 Turn 结束后到达，是否默认通知用户，还是需要用户 opt-in？
4. `ReadOnlyIndicator` 是否应扩展到 MCP 工具（远程工具通过 MCP 元数据声明只读）？

</details>
The current agent loop (`pkg/agent/loop.go`) is a black box — external code cannot observe internal execution state, hook into the execution process, interrupt ongoing processing, or inject messages mid-turn. This proposal redesigns the agent loop as an event-driven, hookable, interruptible, and appendable system.

## What is the problem?

The existing `runAgentLoop` → `runLLMIteration` pipeline has four architectural limitations:

1. **No observability** — The loop emits no events. External consumers (UI, logging, automation) cannot know what phase the loop is in, which tool is executing, or why a turn ended.
2. **No hook points** — There is no way for external code to intercept or modify LLM requests, approve/deny tool executions, or alter loop behavior at runtime.
3. **No interrupt mechanism** — Once a turn starts, it cannot be stopped. There is no graceful interrupt (skip remaining tools, let LLM summarize) or hard abort (cancel everything immediately).
4. **No message injection** — External code cannot inject messages into an ongoing turn (steering) or queue messages for after the turn ends (follow-up).

Additionally, tool execution is fully parallel via WaitGroup with no checkpoints between tools, making it impossible to check for interrupts or new steering messages between tool executions.

---

## Goals & Non-goals

### Goals

1. **Event transparency**: Every critical phase inside the loop emits events to external consumers
2. **Hook control**: External code can intercept and modify behavior at key phases
3. **Interruptible**: Support both Graceful Interrupt and Hard Abort
4. **Appendable**: Support Steering (affects current Turn) and FollowUp (queued for later)
5. **Smart tool parallelism**: Read-only tools run in parallel, mutating tools run sequentially

### Non-goals

- Do not change the MessageBus inbound/outbound model
- Do not change the Provider interface (`LLMProvider.Chat`)
- Do not change the core Tool interface signatures (`Execute` / `ExecuteAsync`)
- Do not change the `ContextBuilder.BuildMessages` input/output contract
- Do not change the session storage format (JSON)
- Do not introduce external dependencies

---

## Boundary definitions

### What is inside the Loop vs. outside

```mermaid
graph TD
 subgraph AgentLoop["AgentLoop (outer shell)"]
 Run["Run() Consume MessageBus, send final reply This layer unchanged, only adds EventBus init"]

 subgraph ProcessMessage["processMessage() — preprocessing layer"]
 Transcribe["transcribeAudio stays outside"]
 Route["resolveMessageRoute stays outside"]
 Command["handleCommand stays outside"]
 end

 subgraph RunTurn["runTurn() — refactored core (formerly runAgentLoop)"]
 subgraph TurnBoundary["Turn boundary — event + hook coverage"]
 Context["Initial context assembly ContextBuilder.BuildMessages() remains a pure function resolveMediaRefs() called here Output: []providers.Message"]

 subgraph LLMLoop["LLM iteration loop"]
 DrainSteering["drain steering queue"]
 BeforeLLM["Hook: BeforeLLMRequest"]
 CallLLM["call Provider.Chat()"]
 AfterLLM["Hook: AfterLLMResponse"]
 ParseTools["parse tool calls"]

 subgraph ToolExec["Tool execution (smart parallel)"]
 BeforeTool["Hook: BeforeToolExecute"]
 Approval["Hook: RequestApproval"]
 ExecTool["Execute tool"]
 AfterTool["Hook: AfterToolExecute"]
 end

 CheckInterrupt["check steering / interrupt"]
 end

 Compress["Context compression (retry branch) forceCompression() Hook: BeforeContextCompress"]
 SessionSave["Session save Intermediate: AddFullMessage (memory only) End: AddMessage + Save (disk) Abort: rollback in-memory state"]
 Summarize["Summary trigger (async after Turn ends)"]
 end
 end

 API1["InjectSteering() ← public API"]
 API2["InjectFollowUp() ← public API"]
 API3["InterruptGraceful() ← public API"]
 API4["InterruptHard() ← public API"]
 API5["SubscribeEvents() ← public API"]
 API6["RegisterHook() ← public API"]
 API7["GetActiveTurn() ← public API"]
 end

 Run --> ProcessMessage
 ProcessMessage --> RunTurn
 Context --> LLMLoop
 DrainSteering --> BeforeLLM --> CallLLM --> AfterLLM --> ParseTools --> ToolExec
 BeforeTool --> Approval --> ExecTool --> AfterTool
 ToolExec --> CheckInterrupt
 CheckInterrupt -->|"continue iteration"| DrainSteering
 LLMLoop --> Compress
 LLMLoop --> SessionSave
 SessionSave --> Summarize
```

---

## Core concepts

### Turn & Iteration

| Term | Definition |
|------|------------|
| **Turn** | One complete "user input → LLM iterations → final reply" processing cycle |
| **Iteration** | One LLM call within a Turn (may produce tool calls, triggering the next Iteration) |
| **Steering** | A message injected during a Turn that affects the next Iteration |
| **FollowUp** | A message queued during a Turn, processed as a new inbound after the Turn ends |
| **Graceful Interrupt** | Skip remaining tools, optionally inject a steering hint, let LLM generate a summary, then end the Turn |
| **Hard Abort** | Immediately cancel the Provider call and all tool executions; Turn ends with an error |
| **SubTurn** | A child Turn spawned inside a parent Turn, running the full `runTurn` path with its own turnID |
| **Ephemeral Session** | An in-memory-only session used by SubTurns — never persisted to disk |

---

## What is the EventBus?

The **EventBus** is a multi-subscriber broadcast system that the agent loop uses to emit structured events at every critical phase of execution.

You can think of it as: *"a way for anyone to watch what the loop is doing, in real time, without affecting its behavior."*

Key design points:
- **Non-blocking fan-out**: Emit iterates subscribers with non-blocking sends; full channels drop events (never block the loop)
- **Per-EventKind drop counters**: `[EventKindCount]atomic.Int64` array enables precise diagnostics of which event types are being lost
- **No dedicated goroutine**: Emit runs in the loop's own goroutine (fewer goroutines, suitable for $10 boards with <10MB RAM)
- **Buffer capacity 16** per subscriber: Sufficient given LLM call latency dominates iteration timing

Event types cover the full turn lifecycle:

| Category | Events |
|----------|--------|
| Turn lifecycle | `TurnStart`, `TurnEnd` |
| LLM calls | `LLMRequest`, `LLMDelta`, `LLMResponse`, `LLMRetry` |
| Context management | `ContextCompress`, `SessionSummarize` |
| Tool execution | `ToolExecStart`, `ToolExecEnd`, `ToolExecSkipped` |
| External interaction | `SteeringInjected`, `FollowUpQueued`, `InterruptReceived` |
| Sub-agent lifecycle | `SubTurnSpawn`, `SubTurnEnd`, `SubTurnResultDelivered` |
| Errors | `Error` |

---

## What is the Hook system?

The **Hook system** allows external code to observe and intercept the loop's execution at well-defined points. Hooks are synchronous, priority-ordered, and timeout-protected.

There are five hook interfaces, each optional — a hook implements only the ones it needs:

| Interface | Purpose | Can modify? |
|-----------|---------|-------------|
| `EventObserver` | Passively receive events | No |
| `LLMInterceptor` | Intercept before/after LLM calls | Yes — can modify messages, model, tools; can abort turn |
| `ToolInterceptor` | Intercept before/after tool execution | Yes — can modify args, deny tool, abort turn |
| `ToolApprover` | Pause for external approval (e.g., UI confirmation) | No — approve or deny |
| `ContextCompressInterceptor` | Intercept before context compression | No — can abort turn |

Hook actions: `Continue`, `Modify`, `AbortTurn`, `HardAbort`, `DenyTool`.

**Timeout tiers**:
- Normal hooks (Observer, LLMInterceptor, ToolInterceptor): **5s** default, timeout → `Continue`
- Approval hooks (ToolApprover): **60s** default (configurable), timeout → `deny` (safety-first)

Approval runs in a separate goroutine with a channel+timeout pattern, so a misbehaving hook cannot block the loop indefinitely.

---

## What is the Interrupt & Steering mechanism?

### Interrupts

Two interrupt modes, exposed as public APIs on `AgentLoop`:

| | Graceful Interrupt | Hard Abort |
|---|---|---|
| **Behavior** | Sets `interrupted=true`, skips remaining tools, injects steering hint, allows one more LLM iteration for summary | Sets `aborted=true`, cancels provider context + turn context immediately |
| **Session** | Keeps completed tool results, saves normally | Rolls back to snapshot (memory only, no disk write) |
| **Sub-turns** | Does NOT cascade — lets children finish naturally | Cascades recursively to all children via `cascadeAbort` |

### Steering & Follow-up

| | Steering | Follow-up |
|---|---|---|
| **Timing** | Affects the current turn's next iteration | Queued; processed as new inbound after current turn ends |
| **Consumption** | Drained at the top of each iteration loop | Returned from `runTurn`, published by caller after all cleanup |
| **Edge case** | If LLM returns final content (no tools) but steering is pending: continues iteration if budget remains, degrades to follow-up at MaxIterations | Published via `bus.PublishInbound` after `runTurn` fully returns — avoids race with defer cleanup |

---

## Tool execution strategy

The current WaitGroup-based parallel execution is replaced with **smart grouping**:

```
Tool calls: [read_file, list_dir, write_file, web_search, exec]

Groups:
 → {read_file, list_dir} parallel=true (both ReadOnly)
 → {write_file} parallel=false (mutating)
 → {web_search} parallel=true (ReadOnly)
 → {exec} parallel=false (mutating)
```

- **ReadOnly tools** (declared via optional `ReadOnlyIndicator` interface): consecutive read-only tools run in parallel
- **Mutating tools**: run sequentially with steering/interrupt checkpoints between each
- Undeclared tools default to mutating (conservative — only reduces parallelism, never causes correctness issues)
- `IsReadOnly()` is static (no args parameter) — args-dependent read-only is intentionally not supported for simplicity and safety

---

## Multi-agent coordination (Sub-turns)

The current `SubagentManager` + `RunToolLoop` architecture has critical flaws:

1. `RunToolLoop` is completely outside the EventBus/Hook/interrupt system
2. Async sub-agent results route to `agent:main:main` session, losing the original conversation context
3. No parent-child Turn relationship — no cascading interrupts
4. Sub-agents cannot be externally interrupted

### Design: Sub-turns via `runTurn`

Sub-agents are refactored to use the same `runTurn` path as top-level turns:

- **`spawnSubTurn(parentTS, config)`** — single entry point for both sync and async sub-turns
- **Turn tree**: parent/child relationship tracked via `parentTurnID` / `childTurnIDs` in `turnState`
- **Ephemeral sessions**: sub-turns use in-memory-only sessions (no disk I/O, no cleanup needed)
- **Resource control**: depth limit (default 3), per-agent concurrency semaphore (buffered channel, non-blocking acquire)
- **Result delivery**: sync → direct ToolResult; async → `deliverSubTurnResult` to parent's `pendingResults` queue (drained at next iteration) or session history (if parent finished)

---

## Relationship to existing code

| Current | After refactor |
|---------|----------------|
| `runAgentLoop()` | Renamed to `runTurn()`, restructured with turnState + event emission |
| `runLLMIteration()` | Merged into `runTurn()` iteration loop |
| Tool execution (WaitGroup in `runLLMIteration`) | Extracted to `tool_exec.go` with smart grouping |
| `SubagentTool` → `RunToolLoop` | `SubagentTool` → `spawnSubTurn(Sync)` → `runTurn` |
| `SpawnTool` → `RunToolLoop` (async) | `SpawnTool` → `spawnSubTurn(Async)` → `runTurn` in goroutine |
| No observability | EventBus with 17 event types |
| No hooks | HookManager with 5 hook interfaces |
| No interrupts | `InterruptGraceful` / `InterruptHard` with session rollback |
| No steering | `InjectSteering` / `InjectFollowUp` with queue-based delivery |

### New files

| File | Responsibility |
|------|----------------|
| `pkg/agent/events.go` | Event types, EventKind enum, payload structs |
| `pkg/agent/eventbus.go` | EventBus: Subscribe/Emit/Close |
| `pkg/agent/hooks.go` | Hook interfaces, HookManager |
| `pkg/agent/turn_state.go` | turnState, steering/followup queues, interrupt flags |
| `pkg/agent/tool_exec.go` | Smart tool execution with hooks |
| `pkg/agent/subturn.go` | spawnSubTurn, deliverSubTurnResult |

### Modified files

| File | Scope |
|------|-------|
| `pkg/agent/loop.go` | Major — new fields, `runTurn` rewrite, public APIs |
| `pkg/tools/base.go` | Minimal — add `ReadOnlyIndicator` interface |
| `pkg/tools/*.go` | Small — add `IsReadOnly()` to known read-only tools |
| `pkg/agent/instance.go` | Small — add sub-turn config fields |
| `pkg/tools/subagent.go` | Large — rewrite to call `spawnSubTurn` |
| `pkg/tools/spawn.go` | Large — rewrite to call `spawnSubTurn` |

---

## Concurrency safety

Key invariants:

- **Lock ordering**: `turnState.mu` → `eventBus.mu (RLock)` → `sessionManager.mu` — never reversed
- **Emit outside locks**: EventBus.Emit is called after releasing `turnState.mu` to minimize lock hold time
- **Double-check on providerCancel**: Loop sets `providerCancel` then checks `aborted`; `InterruptHard` sets `aborted` then calls `providerCancel` — mutual redundancy under `ts.mu`
- **FollowUp as return value**: `runTurn` returns follow-ups instead of publishing in defer — eliminates race between cleanup and new turn startup
- **Approval goroutine isolation**: Approval runs in separate goroutine with timeout — cannot block loop even if hook misbehaves
- **Session snapshot correction**: `ts.sessionSnapshot` is updated after `forceCompression()` to prevent stale rollback targets

---

## Open questions

1. Should EventBus support filtered subscriptions (subscribe to specific EventKinds only), or should filtering be subscriber-side?
2. Should there be a built-in "debug hook" that logs all events to a file, enabled via config flag?
3. should async sub-turn results that arrive after parent turn ends trigger a user notification by default, or should this be opt-in?
4. Should `ReadOnlyIndicator` be extended to MCP tools (remote tools declaring read-only via MCP metadata)?

	优雅中断（Graceful Interrupt）	硬中断（Hard Abort）
行为	设置 `interrupted=true`，跳过剩余工具，注入 steering 提示，允许再做一轮 LLM 迭代生成总结	设置 `aborted=true`，立即取消 provider context 和 turn context
Session	保留已完成的工具结果，正常保存	回滚到快照点（仅内存回滚，不写盘）
子 Turn	不级联——让子 Turn 自然完成	通过 `cascadeAbort` 递归级联到所有子 Turn

当前	重构后
`runAgentLoop()`	重命名为 `runTurn()`，使用 turnState + 事件发射重构
`runLLMIteration()`	合并到 `runTurn()` 的迭代循环中
工具执行（`runLLMIteration` 中的 WaitGroup）	提取到 `tool_exec.go`，使用智能分组
`SubagentTool` → `RunToolLoop`	`SubagentTool` → `spawnSubTurn(Sync)` → `runTurn`
`SpawnTool` → `RunToolLoop`（异步）	`SpawnTool` → `spawnSubTurn(Async)` → goroutine 中 `runTurn`
无可观测性	EventBus，17 种事件类型
无 Hook	HookManager，5 种 Hook 接口
无中断	`InterruptGraceful` / `InterruptHard`，含 session 回滚
无消息注入	`InjectSteering` / `InjectFollowUp`，基于队列的投递

文件	范围
`pkg/agent/loop.go`	大——新增字段，`runTurn` 重写，公开 API
`pkg/tools/base.go`	极小——新增 `ReadOnlyIndicator` 接口
`pkg/tools/*.go`	小——为已知只读工具添加 `IsReadOnly()`
`pkg/agent/instance.go`	小——新增子 Turn 配置字段
`pkg/tools/subagent.go`	大——重写为调用 `spawnSubTurn`
`pkg/tools/spawn.go`	大——重写为调用 `spawnSubTurn`

	Graceful Interrupt	Hard Abort
Behavior	Sets `interrupted=true`, skips remaining tools, injects steering hint, allows one more LLM iteration for summary	Sets `aborted=true`, cancels provider context + turn context immediately
Session	Keeps completed tool results, saves normally	Rolls back to snapshot (memory only, no disk write)
Sub-turns	Does NOT cascade — lets children finish naturally	Cascades recursively to all children via `cascadeAbort`

Current	After refactor
`runAgentLoop()`	Renamed to `runTurn()`, restructured with turnState + event emission
`runLLMIteration()`	Merged into `runTurn()` iteration loop
Tool execution (WaitGroup in `runLLMIteration`)	Extracted to `tool_exec.go` with smart grouping
`SubagentTool` → `RunToolLoop`	`SubagentTool` → `spawnSubTurn(Sync)` → `runTurn`
`SpawnTool` → `RunToolLoop` (async)	`SpawnTool` → `spawnSubTurn(Async)` → `runTurn` in goroutine
No observability	EventBus with 17 event types
No hooks	HookManager with 5 hook interfaces
No interrupts	`InterruptGraceful` / `InterruptHard` with session rollback
No steering	`InjectSteering` / `InjectFollowUp` with queue-based delivery

术语	定义
Turn	一次完整的「用户输入 → LLM 迭代 → 最终回复」处理过程
Iteration	Turn 内的一次 LLM 调用（可能产生工具调用，触发下一次 Iteration）
Steering	在 Turn 进行中注入的消息，影响当前 Turn 的下一次 Iteration
FollowUp	在 Turn 进行中排队的消息，当前 Turn 结束后作为新的 Inbound 处理
Graceful Interrupt	跳过剩余工具，可注入最后一条 steering 提示，让 LLM 生成总结后结束 Turn
Hard Abort	立即取消 Provider 调用和所有工具执行，Turn 以错误结束
SubTurn	由父 Turn 内部 spawn 的子 Turn，走完整 `runTurn` 路径，拥有独立 turnID
Ephemeral Session	SubTurn 使用的纯内存 session，不落盘，SubTurn 结束后清理

分类	事件
Turn 生命周期	`TurnStart`, `TurnEnd`
LLM 调用	`LLMRequest`, `LLMDelta`, `LLMResponse`, `LLMRetry`
上下文管理	`ContextCompress`, `SessionSummarize`
工具执行	`ToolExecStart`, `ToolExecEnd`, `ToolExecSkipped`
外部交互	`SteeringInjected`, `FollowUpQueued`, `InterruptReceived`
子 Agent 生命周期	`SubTurnSpawn`, `SubTurnEnd`, `SubTurnResultDelivered`
错误	`Error`

接口	用途	可修改？
`EventObserver`	被动接收事件	否
`LLMInterceptor`	在 LLM 调用前后拦截	是——可修改消息、模型、工具定义；可中止 Turn
`ToolInterceptor`	在工具执行前后拦截	是——可修改参数、拒绝工具、中止 Turn
`ToolApprover`	暂停等待外部审批（如 UI 确认）	否——审批或拒绝
`ContextCompressInterceptor`	在上下文压缩前拦截	否——可中止 Turn

	Steering	Follow-up
时机	影响当前 Turn 的下一次 Iteration	排队等待；在当前 Turn 结束后作为新的 Inbound 处理
消费方式	在每次迭代循环顶部 drain	作为 `runTurn` 的返回值，由调用者在所有清理完成后投递
边界情况	如果 LLM 返回最终内容（无工具调用）但 steering 队列非空：若迭代余量充足则继续迭代，若已达 MaxIterations 则降级为 follow-up	通过 `bus.PublishInbound` 在 `runTurn` 完全返回后投递——避免与 defer 清理产生竞态

文件	职责
`pkg/agent/events.go`	事件类型、EventKind 枚举、Payload 结构体
`pkg/agent/eventbus.go`	EventBus：Subscribe/Emit/Close
`pkg/agent/hooks.go`	Hook 接口族、HookManager
`pkg/agent/turn_state.go`	turnState、steering/followup 队列、中断标记
`pkg/agent/tool_exec.go`	带 Hook 的智能工具执行
`pkg/agent/subturn.go`	spawnSubTurn、deliverSubTurnResult

Term	Definition
Turn	One complete "user input → LLM iterations → final reply" processing cycle
Iteration	One LLM call within a Turn (may produce tool calls, triggering the next Iteration)
Steering	A message injected during a Turn that affects the next Iteration
FollowUp	A message queued during a Turn, processed as a new inbound after the Turn ends
Graceful Interrupt	Skip remaining tools, optionally inject a steering hint, let LLM generate a summary, then end the Turn
Hard Abort	Immediately cancel the Provider call and all tool executions; Turn ends with an error
SubTurn	A child Turn spawned inside a parent Turn, running the full `runTurn` path with its own turnID
Ephemeral Session	An in-memory-only session used by SubTurns — never persisted to disk

Category	Events
Turn lifecycle	`TurnStart`, `TurnEnd`
LLM calls	`LLMRequest`, `LLMDelta`, `LLMResponse`, `LLMRetry`
Context management	`ContextCompress`, `SessionSummarize`
Tool execution	`ToolExecStart`, `ToolExecEnd`, `ToolExecSkipped`
External interaction	`SteeringInjected`, `FollowUpQueued`, `InterruptReceived`
Sub-agent lifecycle	`SubTurnSpawn`, `SubTurnEnd`, `SubTurnResultDelivered`
Errors	`Error`

Interface	Purpose	Can modify?
`EventObserver`	Passively receive events	No
`LLMInterceptor`	Intercept before/after LLM calls	Yes — can modify messages, model, tools; can abort turn
`ToolInterceptor`	Intercept before/after tool execution	Yes — can modify args, deny tool, abort turn
`ToolApprover`	Pause for external approval (e.g., UI confirmation)	No — approve or deny
`ContextCompressInterceptor`	Intercept before context compression	No — can abort turn

	Steering	Follow-up
Timing	Affects the current turn's next iteration	Queued; processed as new inbound after current turn ends
Consumption	Drained at the top of each iteration loop	Returned from `runTurn`, published by caller after all cleanup
Edge case	If LLM returns final content (no tools) but steering is pending: continues iteration if budget remains, degrades to follow-up at MaxIterations	Published via `bus.PublishInbound` after `runTurn` fully returns — avoids race with defer cleanup

File	Responsibility
`pkg/agent/events.go`	Event types, EventKind enum, payload structs
`pkg/agent/eventbus.go`	EventBus: Subscribe/Emit/Close
`pkg/agent/hooks.go`	Hook interfaces, HookManager
`pkg/agent/turn_state.go`	turnState, steering/followup queues, interrupt flags
`pkg/agent/tool_exec.go`	Smart tool execution with hooks
`pkg/agent/subturn.go`	spawnSubTurn, deliverSubTurnResult

File	Scope
`pkg/agent/loop.go`	Major — new fields, `runTurn` rewrite, public APIs
`pkg/tools/base.go`	Minimal — add `ReadOnlyIndicator` interface
`pkg/tools/*.go`	Small — add `IsReadOnly()` to known read-only tools
`pkg/agent/instance.go`	Small — add sub-turn config fields
`pkg/tools/subagent.go`	Large — rewrite to call `spawnSubTurn`
`pkg/tools/spawn.go`	Large — rewrite to call `spawnSubTurn`

[Agent refactor] Event-driven agent loop with hooks, interrupts, and steering #1316

Description

问题是什么？

重构目标与非目标

目标

非目标

边界定义

什么在 Loop 内部，什么在外部

核心概念

Turn 与 Iteration

EventBus 是什么？

Hook 系统是什么？

中断与消息注入机制是什么？

中断

Steering 与 Follow-up

工具执行策略

多 Agent 协作（子 Turn）

设计：通过 runTurn 实现子 Turn

与现有代码的关系

新增文件

修改文件

并发安全

开放问题

What is the problem?

Goals & Non-goals

Goals

Non-goals

Boundary definitions

What is inside the Loop vs. outside

Core concepts

Turn & Iteration

What is the EventBus?

What is the Hook system?

What is the Interrupt & Steering mechanism?

Interrupts

Steering & Follow-up

Tool execution strategy

Multi-agent coordination (Sub-turns)

Design: Sub-turns via runTurn

Relationship to existing code

New files

Modified files

Concurrency safety

Open questions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

设计：通过 `runTurn` 实现子 Turn

Design: Sub-turns via `runTurn`