-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Description
中文
当前的 agent loop(pkg/agent/loop.go)是一个黑盒——外部代码无法观察内部执行状态,无法 hook 执行过程,无法打断正在进行的处理,也无法在处理过程中追加消息。本提案将 agent loop 重新设计为事件驱动、可 hook、可中断、可追加的系统。
问题是什么?
现有的 runAgentLoop → runLLMIteration 处理流水线存在四个架构层面的限制:
- 无可观测性 — Loop 不发射任何事件。外部消费者(UI、日志、自动化脚本)无法得知 loop 处于哪个阶段、正在执行哪个工具、或 Turn 为何结束。
- 无 Hook 点 — 外部代码无法拦截或修改 LLM 请求,无法审批/拒绝工具执行,也无法在运行时改变 loop 行为。
- 无中断机制 — Turn 一旦开始就无法停止。没有优雅中断(跳过剩余工具让 LLM 生成总结),也没有硬中断(立即取消一切)。
- 无消息注入 — 外部代码无法向正在进行的 Turn 注入消息(steering),也无法排队等待 Turn 结束后处理的消息(follow-up)。
此外,工具执行通过 WaitGroup 完全并行,工具之间没有检查点,无法在工具执行间检查中断或新的 steering 消息。
重构目标与非目标
目标
- 事件透明: loop 内部的每个关键阶段都向外发射事件
- Hook 可控: 外部可以在关键阶段拦截和修改行为
- 可打断: 支持 Graceful Interrupt 和 Hard Abort
- 可追加: 支持 Steering(影响当前 Turn)和 FollowUp(排队等待)
- 工具智能并行: 只读工具并行,有副作用的工具顺序执行
非目标
- 不改变 MessageBus 的 inbound/outbound 模型
- 不改变 Provider 接口(
LLMProvider.Chat) - 不改变 Tool 接口的核心签名(
Execute/ExecuteAsync) - 不改变
ContextBuilder.BuildMessages的输入输出契约 - 不改变 session 的存储格式(JSON)
- 不引入外部依赖
边界定义
什么在 Loop 内部,什么在外部
graph TD
subgraph AgentLoop["AgentLoop (外壳)"]
Run["Run()<br/>消费 MessageBus,发送最终回复<br/>这一层不变,只增加 EventBus 初始化"]
subgraph ProcessMessage["processMessage() — 前置处理层"]
Transcribe["transcribeAudio<br/><i>保持在外部</i>"]
Route["resolveMessageRoute<br/><i>保持在外部</i>"]
Command["handleCommand<br/><i>保持在外部</i>"]
end
subgraph RunTurn["runTurn() — 重构后的核心 (原 runAgentLoop)"]
subgraph TurnBoundary["Turn 边界 — 事件 + Hook 覆盖范围"]
Context["初始上下文组装<br/>ContextBuilder.BuildMessages() 仍为纯函数<br/>resolveMediaRefs() 在此处调用<br/>产出: []providers.Message"]
subgraph LLMLoop["LLM 迭代循环"]
DrainSteering["drain steering queue"]
BeforeLLM["Hook: BeforeLLMRequest"]
CallLLM["调用 Provider.Chat()"]
AfterLLM["Hook: AfterLLMResponse"]
ParseTools["解析 tool calls"]
subgraph ToolExec["工具执行 (智能并行)"]
BeforeTool["Hook: BeforeToolExecute"]
Approval["Hook: RequestApproval"]
ExecTool["Execute tool"]
AfterTool["Hook: AfterToolExecute"]
end
CheckInterrupt["检查 steering / interrupt"]
end
Compress["上下文压缩 (retry 分支)<br/>forceCompression()<br/>Hook: BeforeContextCompress"]
SessionSave["Session 保存<br/>中间: AddFullMessage (只写内存)<br/>结束: AddMessage + Save (落盘)<br/>中断: 回滚内存中的中间状态"]
Summarize["摘要触发 (Turn 结束后异步)"]
end
end
API1["InjectSteering() ← 公开 API"]
API2["InjectFollowUp() ← 公开 API"]
API3["InterruptGraceful() ← 公开 API"]
API4["InterruptHard() ← 公开 API"]
API5["SubscribeEvents() ← 公开 API"]
API6["RegisterHook() ← 公开 API"]
API7["GetActiveTurn() ← 公开 API"]
end
Run --> ProcessMessage
ProcessMessage --> RunTurn
Context --> LLMLoop
DrainSteering --> BeforeLLM --> CallLLM --> AfterLLM --> ParseTools --> ToolExec
BeforeTool --> Approval --> ExecTool --> AfterTool
ToolExec --> CheckInterrupt
CheckInterrupt -->|"继续迭代"| DrainSteering
LLMLoop --> Compress
LLMLoop --> SessionSave
SessionSave --> Summarize
核心概念
Turn 与 Iteration
| 术语 | 定义 |
|---|---|
| Turn | 一次完整的「用户输入 → LLM 迭代 → 最终回复」处理过程 |
| Iteration | Turn 内的一次 LLM 调用(可能产生工具调用,触发下一次 Iteration) |
| Steering | 在 Turn 进行中注入的消息,影响当前 Turn 的下一次 Iteration |
| FollowUp | 在 Turn 进行中排队的消息,当前 Turn 结束后作为新的 Inbound 处理 |
| Graceful Interrupt | 跳过剩余工具,可注入最后一条 steering 提示,让 LLM 生成总结后结束 Turn |
| Hard Abort | 立即取消 Provider 调用和所有工具执行,Turn 以错误结束 |
| SubTurn | 由父 Turn 内部 spawn 的子 Turn,走完整 runTurn 路径,拥有独立 turnID |
| Ephemeral Session | SubTurn 使用的纯内存 session,不落盘,SubTurn 结束后清理 |
EventBus 是什么?
EventBus 是一个多订阅者广播系统,agent loop 通过它在每个关键执行阶段发射结构化事件。
可以理解为:「让任何人都能实时观察 loop 在做什么,而不影响其行为。」
核心设计要点:
- 非阻塞 fan-out:Emit 遍历订阅者并以非阻塞方式发送;channel 满时丢弃事件(永远不阻塞 loop)
- Per-EventKind 丢弃计数:
[EventKindCount]atomic.Int64数组,可精确诊断哪类事件正在丢失 - 无独立 goroutine:Emit 在 loop 自身的 goroutine 中运行(减少 goroutine 数量,适合 $10 开发板、<10MB 内存的场景)
- 每个订阅者缓冲容量 16:考虑到 LLM 调用延迟主导迭代时间,此容量足够
事件类型覆盖完整的 Turn 生命周期:
| 分类 | 事件 |
|---|---|
| Turn 生命周期 | TurnStart, TurnEnd |
| LLM 调用 | LLMRequest, LLMDelta, LLMResponse, LLMRetry |
| 上下文管理 | ContextCompress, SessionSummarize |
| 工具执行 | ToolExecStart, ToolExecEnd, ToolExecSkipped |
| 外部交互 | SteeringInjected, FollowUpQueued, InterruptReceived |
| 子 Agent 生命周期 | SubTurnSpawn, SubTurnEnd, SubTurnResultDelivered |
| 错误 | Error |
Hook 系统是什么?
Hook 系统允许外部代码在明确定义的执行节点上观察和拦截 loop 的执行。Hook 是同步的、按优先级排序的、且有超时保护。
共五个 hook 接口,均为可选——hook 只需实现它需要的接口:
| 接口 | 用途 | 可修改? |
|---|---|---|
EventObserver |
被动接收事件 | 否 |
LLMInterceptor |
在 LLM 调用前后拦截 | 是——可修改消息、模型、工具定义;可中止 Turn |
ToolInterceptor |
在工具执行前后拦截 | 是——可修改参数、拒绝工具、中止 Turn |
ToolApprover |
暂停等待外部审批(如 UI 确认) | 否——审批或拒绝 |
ContextCompressInterceptor |
在上下文压缩前拦截 | 否——可中止 Turn |
Hook 动作:Continue、Modify、AbortTurn、HardAbort、DenyTool。
超时分级:
- 普通 Hook(Observer、LLMInterceptor、ToolInterceptor):默认 5s,超时 →
Continue - 审批 Hook(ToolApprover):默认 60s(可配置),超时 →
deny(安全优先)
审批在独立 goroutine 中运行,使用 channel+timeout 模式,因此即使 hook 实现者行为不当也无法永久阻塞 loop。
中断与消息注入机制是什么?
中断
两种中断模式,作为 AgentLoop 的公开 API 暴露:
| 优雅中断(Graceful Interrupt) | 硬中断(Hard Abort) | |
|---|---|---|
| 行为 | 设置 interrupted=true,跳过剩余工具,注入 steering 提示,允许再做一轮 LLM 迭代生成总结 |
设置 aborted=true,立即取消 provider context 和 turn context |
| Session | 保留已完成的工具结果,正常保存 | 回滚到快照点(仅内存回滚,不写盘) |
| 子 Turn | 不级联——让子 Turn 自然完成 | 通过 cascadeAbort 递归级联到所有子 Turn |
Steering 与 Follow-up
| Steering | Follow-up | |
|---|---|---|
| 时机 | 影响当前 Turn 的下一次 Iteration | 排队等待;在当前 Turn 结束后作为新的 Inbound 处理 |
| 消费方式 | 在每次迭代循环顶部 drain | 作为 runTurn 的返回值,由调用者在所有清理完成后投递 |
| 边界情况 | 如果 LLM 返回最终内容(无工具调用)但 steering 队列非空:若迭代余量充足则继续迭代,若已达 MaxIterations 则降级为 follow-up | 通过 bus.PublishInbound 在 runTurn 完全返回后投递——避免与 defer 清理产生竞态 |
工具执行策略
当前基于 WaitGroup 的并行执行替换为智能分组:
工具调用: [read_file, list_dir, write_file, web_search, exec]
分组:
→ {read_file, list_dir} parallel=true (均为只读)
→ {write_file} parallel=false (有副作用)
→ {web_search} parallel=true (只读)
→ {exec} parallel=false (有副作用)
- 只读工具(通过可选的
ReadOnlyIndicator接口声明):连续的只读工具并行执行 - 有副作用的工具:顺序执行,每个工具之间有 steering/中断检查点
- 未声明的工具默认为有副作用(保守策略——只会降低并行度,不会导致正确性问题)
IsReadOnly()是静态的(无参数)——有意不支持依赖参数的只读判断,以保持简单和安全
多 Agent 协作(子 Turn)
当前的 SubagentManager + RunToolLoop 架构存在关键缺陷:
RunToolLoop完全脱离 EventBus/Hook/中断体系- 异步子 agent 结果路由到
agent:main:mainsession,丢失原始对话上下文 - 无父子 Turn 关系——无法级联中断
- 子 agent 不可被外部中断
设计:通过 runTurn 实现子 Turn
子 agent 重构为使用与顶层 Turn 相同的 runTurn 路径:
spawnSubTurn(parentTS, config)— 同步和异步子 Turn 的统一入口- Turn 树:通过
turnState中的parentTurnID/childTurnIDs追踪父子关系 - Ephemeral Session:子 Turn 使用纯内存 session(无磁盘 I/O,无需清理)
- 资源控制:深度限制(默认 3),per-agent 并发信号量(buffered channel,非阻塞获取)
- 结果投递:同步 → 直接返回 ToolResult;异步 →
deliverSubTurnResult投递到父 Turn 的pendingResults队列(在下次 Iteration 前 drain)或写入 session 历史(如果父 Turn 已结束)
与现有代码的关系
| 当前 | 重构后 |
|---|---|
runAgentLoop() |
重命名为 runTurn(),使用 turnState + 事件发射重构 |
runLLMIteration() |
合并到 runTurn() 的迭代循环中 |
工具执行(runLLMIteration 中的 WaitGroup) |
提取到 tool_exec.go,使用智能分组 |
SubagentTool → RunToolLoop |
SubagentTool → spawnSubTurn(Sync) → runTurn |
SpawnTool → RunToolLoop(异步) |
SpawnTool → spawnSubTurn(Async) → goroutine 中 runTurn |
| 无可观测性 | EventBus,17 种事件类型 |
| 无 Hook | HookManager,5 种 Hook 接口 |
| 无中断 | InterruptGraceful / InterruptHard,含 session 回滚 |
| 无消息注入 | InjectSteering / InjectFollowUp,基于队列的投递 |
新增文件
| 文件 | 职责 |
|---|---|
pkg/agent/events.go |
事件类型、EventKind 枚举、Payload 结构体 |
pkg/agent/eventbus.go |
EventBus:Subscribe/Emit/Close |
pkg/agent/hooks.go |
Hook 接口族、HookManager |
pkg/agent/turn_state.go |
turnState、steering/followup 队列、中断标记 |
pkg/agent/tool_exec.go |
带 Hook 的智能工具执行 |
pkg/agent/subturn.go |
spawnSubTurn、deliverSubTurnResult |
修改文件
| 文件 | 范围 |
|---|---|
pkg/agent/loop.go |
大——新增字段,runTurn 重写,公开 API |
pkg/tools/base.go |
极小——新增 ReadOnlyIndicator 接口 |
pkg/tools/*.go |
小——为已知只读工具添加 IsReadOnly() |
pkg/agent/instance.go |
小——新增子 Turn 配置字段 |
pkg/tools/subagent.go |
大——重写为调用 spawnSubTurn |
pkg/tools/spawn.go |
大——重写为调用 spawnSubTurn |
并发安全
核心不变量:
- 锁顺序:
turnState.mu→eventBus.mu (RLock)→sessionManager.mu——永远不逆序 - 锁外发射事件:EventBus.Emit 在释放
turnState.mu后调用,最小化持锁时间 - providerCancel 双向检查:Loop 设置
providerCancel后检查aborted;InterruptHard设置aborted后调用providerCancel——在ts.mu保护下互为冗余 - FollowUp 作为返回值:
runTurn返回 follow-up 而非在 defer 中投递——消除清理与新 Turn 启动之间的竞态 - 审批 goroutine 隔离:审批在独立 goroutine 中运行并带超时——即使 Hook 行为不当也无法阻塞 loop
- Session 快照修正:
forceCompression()后立即更新ts.sessionSnapshot,防止回滚目标过时
开放问题
- EventBus 是否应支持过滤订阅(仅订阅特定 EventKind),还是应由订阅者自行过滤?
- 是否需要内置一个「debug hook」将所有事件写入文件,通过配置开关启用?
- 如果异步子 Turn 的结果在父 Turn 结束后到达,是否默认通知用户,还是需要用户 opt-in?
ReadOnlyIndicator是否应扩展到 MCP 工具(远程工具通过 MCP 元数据声明只读)?
What is the problem?
The existing runAgentLoop → runLLMIteration pipeline has four architectural limitations:
- No observability — The loop emits no events. External consumers (UI, logging, automation) cannot know what phase the loop is in, which tool is executing, or why a turn ended.
- No hook points — There is no way for external code to intercept or modify LLM requests, approve/deny tool executions, or alter loop behavior at runtime.
- No interrupt mechanism — Once a turn starts, it cannot be stopped. There is no graceful interrupt (skip remaining tools, let LLM summarize) or hard abort (cancel everything immediately).
- No message injection — External code cannot inject messages into an ongoing turn (steering) or queue messages for after the turn ends (follow-up).
Additionally, tool execution is fully parallel via WaitGroup with no checkpoints between tools, making it impossible to check for interrupts or new steering messages between tool executions.
Goals & Non-goals
Goals
- Event transparency: Every critical phase inside the loop emits events to external consumers
- Hook control: External code can intercept and modify behavior at key phases
- Interruptible: Support both Graceful Interrupt and Hard Abort
- Appendable: Support Steering (affects current Turn) and FollowUp (queued for later)
- Smart tool parallelism: Read-only tools run in parallel, mutating tools run sequentially
Non-goals
- Do not change the MessageBus inbound/outbound model
- Do not change the Provider interface (
LLMProvider.Chat) - Do not change the core Tool interface signatures (
Execute/ExecuteAsync) - Do not change the
ContextBuilder.BuildMessagesinput/output contract - Do not change the session storage format (JSON)
- Do not introduce external dependencies
Boundary definitions
What is inside the Loop vs. outside
graph TD
subgraph AgentLoop["AgentLoop (outer shell)"]
Run["Run()<br/>Consume MessageBus, send final reply<br/>This layer unchanged, only adds EventBus init"]
subgraph ProcessMessage["processMessage() — preprocessing layer"]
Transcribe["transcribeAudio<br/><i>stays outside</i>"]
Route["resolveMessageRoute<br/><i>stays outside</i>"]
Command["handleCommand<br/><i>stays outside</i>"]
end
subgraph RunTurn["runTurn() — refactored core (formerly runAgentLoop)"]
subgraph TurnBoundary["Turn boundary — event + hook coverage"]
Context["Initial context assembly<br/>ContextBuilder.BuildMessages() remains a pure function<br/>resolveMediaRefs() called here<br/>Output: []providers.Message"]
subgraph LLMLoop["LLM iteration loop"]
DrainSteering["drain steering queue"]
BeforeLLM["Hook: BeforeLLMRequest"]
CallLLM["call Provider.Chat()"]
AfterLLM["Hook: AfterLLMResponse"]
ParseTools["parse tool calls"]
subgraph ToolExec["Tool execution (smart parallel)"]
BeforeTool["Hook: BeforeToolExecute"]
Approval["Hook: RequestApproval"]
ExecTool["Execute tool"]
AfterTool["Hook: AfterToolExecute"]
end
CheckInterrupt["check steering / interrupt"]
end
Compress["Context compression (retry branch)<br/>forceCompression()<br/>Hook: BeforeContextCompress"]
SessionSave["Session save<br/>Intermediate: AddFullMessage (memory only)<br/>End: AddMessage + Save (disk)<br/>Abort: rollback in-memory state"]
Summarize["Summary trigger (async after Turn ends)"]
end
end
API1["InjectSteering() ← public API"]
API2["InjectFollowUp() ← public API"]
API3["InterruptGraceful() ← public API"]
API4["InterruptHard() ← public API"]
API5["SubscribeEvents() ← public API"]
API6["RegisterHook() ← public API"]
API7["GetActiveTurn() ← public API"]
end
Run --> ProcessMessage
ProcessMessage --> RunTurn
Context --> LLMLoop
DrainSteering --> BeforeLLM --> CallLLM --> AfterLLM --> ParseTools --> ToolExec
BeforeTool --> Approval --> ExecTool --> AfterTool
ToolExec --> CheckInterrupt
CheckInterrupt -->|"continue iteration"| DrainSteering
LLMLoop --> Compress
LLMLoop --> SessionSave
SessionSave --> Summarize
Core concepts
Turn & Iteration
| Term | Definition |
|---|---|
| Turn | One complete "user input → LLM iterations → final reply" processing cycle |
| Iteration | One LLM call within a Turn (may produce tool calls, triggering the next Iteration) |
| Steering | A message injected during a Turn that affects the next Iteration |
| FollowUp | A message queued during a Turn, processed as a new inbound after the Turn ends |
| Graceful Interrupt | Skip remaining tools, optionally inject a steering hint, let LLM generate a summary, then end the Turn |
| Hard Abort | Immediately cancel the Provider call and all tool executions; Turn ends with an error |
| SubTurn | A child Turn spawned inside a parent Turn, running the full runTurn path with its own turnID |
| Ephemeral Session | An in-memory-only session used by SubTurns — never persisted to disk |
What is the EventBus?
The EventBus is a multi-subscriber broadcast system that the agent loop uses to emit structured events at every critical phase of execution.
You can think of it as: "a way for anyone to watch what the loop is doing, in real time, without affecting its behavior."
Key design points:
- Non-blocking fan-out: Emit iterates subscribers with non-blocking sends; full channels drop events (never block the loop)
- Per-EventKind drop counters:
[EventKindCount]atomic.Int64array enables precise diagnostics of which event types are being lost - No dedicated goroutine: Emit runs in the loop's own goroutine (fewer goroutines, suitable for $10 boards with <10MB RAM)
- Buffer capacity 16 per subscriber: Sufficient given LLM call latency dominates iteration timing
Event types cover the full turn lifecycle:
| Category | Events |
|---|---|
| Turn lifecycle | TurnStart, TurnEnd |
| LLM calls | LLMRequest, LLMDelta, LLMResponse, LLMRetry |
| Context management | ContextCompress, SessionSummarize |
| Tool execution | ToolExecStart, ToolExecEnd, ToolExecSkipped |
| External interaction | SteeringInjected, FollowUpQueued, InterruptReceived |
| Sub-agent lifecycle | SubTurnSpawn, SubTurnEnd, SubTurnResultDelivered |
| Errors | Error |
What is the Hook system?
The Hook system allows external code to observe and intercept the loop's execution at well-defined points. Hooks are synchronous, priority-ordered, and timeout-protected.
There are five hook interfaces, each optional — a hook implements only the ones it needs:
| Interface | Purpose | Can modify? |
|---|---|---|
EventObserver |
Passively receive events | No |
LLMInterceptor |
Intercept before/after LLM calls | Yes — can modify messages, model, tools; can abort turn |
ToolInterceptor |
Intercept before/after tool execution | Yes — can modify args, deny tool, abort turn |
ToolApprover |
Pause for external approval (e.g., UI confirmation) | No — approve or deny |
ContextCompressInterceptor |
Intercept before context compression | No — can abort turn |
Hook actions: Continue, Modify, AbortTurn, HardAbort, DenyTool.
Timeout tiers:
- Normal hooks (Observer, LLMInterceptor, ToolInterceptor): 5s default, timeout →
Continue - Approval hooks (ToolApprover): 60s default (configurable), timeout →
deny(safety-first)
Approval runs in a separate goroutine with a channel+timeout pattern, so a misbehaving hook cannot block the loop indefinitely.
What is the Interrupt & Steering mechanism?
Interrupts
Two interrupt modes, exposed as public APIs on AgentLoop:
| Graceful Interrupt | Hard Abort | |
|---|---|---|
| Behavior | Sets interrupted=true, skips remaining tools, injects steering hint, allows one more LLM iteration for summary |
Sets aborted=true, cancels provider context + turn context immediately |
| Session | Keeps completed tool results, saves normally | Rolls back to snapshot (memory only, no disk write) |
| Sub-turns | Does NOT cascade — lets children finish naturally | Cascades recursively to all children via cascadeAbort |
Steering & Follow-up
| Steering | Follow-up | |
|---|---|---|
| Timing | Affects the current turn's next iteration | Queued; processed as new inbound after current turn ends |
| Consumption | Drained at the top of each iteration loop | Returned from runTurn, published by caller after all cleanup |
| Edge case | If LLM returns final content (no tools) but steering is pending: continues iteration if budget remains, degrades to follow-up at MaxIterations | Published via bus.PublishInbound after runTurn fully returns — avoids race with defer cleanup |
Tool execution strategy
The current WaitGroup-based parallel execution is replaced with smart grouping:
Tool calls: [read_file, list_dir, write_file, web_search, exec]
Groups:
→ {read_file, list_dir} parallel=true (both ReadOnly)
→ {write_file} parallel=false (mutating)
→ {web_search} parallel=true (ReadOnly)
→ {exec} parallel=false (mutating)
- ReadOnly tools (declared via optional
ReadOnlyIndicatorinterface): consecutive read-only tools run in parallel - Mutating tools: run sequentially with steering/interrupt checkpoints between each
- Undeclared tools default to mutating (conservative — only reduces parallelism, never causes correctness issues)
IsReadOnly()is static (no args parameter) — args-dependent read-only is intentionally not supported for simplicity and safety
Multi-agent coordination (Sub-turns)
The current SubagentManager + RunToolLoop architecture has critical flaws:
RunToolLoopis completely outside the EventBus/Hook/interrupt system- Async sub-agent results route to
agent:main:mainsession, losing the original conversation context - No parent-child Turn relationship — no cascading interrupts
- Sub-agents cannot be externally interrupted
Design: Sub-turns via runTurn
Sub-agents are refactored to use the same runTurn path as top-level turns:
spawnSubTurn(parentTS, config)— single entry point for both sync and async sub-turns- Turn tree: parent/child relationship tracked via
parentTurnID/childTurnIDsinturnState - Ephemeral sessions: sub-turns use in-memory-only sessions (no disk I/O, no cleanup needed)
- Resource control: depth limit (default 3), per-agent concurrency semaphore (buffered channel, non-blocking acquire)
- Result delivery: sync → direct ToolResult; async →
deliverSubTurnResultto parent'spendingResultsqueue (drained at next iteration) or session history (if parent finished)
Relationship to existing code
| Current | After refactor |
|---|---|
runAgentLoop() |
Renamed to runTurn(), restructured with turnState + event emission |
runLLMIteration() |
Merged into runTurn() iteration loop |
Tool execution (WaitGroup in runLLMIteration) |
Extracted to tool_exec.go with smart grouping |
SubagentTool → RunToolLoop |
SubagentTool → spawnSubTurn(Sync) → runTurn |
SpawnTool → RunToolLoop (async) |
SpawnTool → spawnSubTurn(Async) → runTurn in goroutine |
| No observability | EventBus with 17 event types |
| No hooks | HookManager with 5 hook interfaces |
| No interrupts | InterruptGraceful / InterruptHard with session rollback |
| No steering | InjectSteering / InjectFollowUp with queue-based delivery |
New files
| File | Responsibility |
|---|---|
pkg/agent/events.go |
Event types, EventKind enum, payload structs |
pkg/agent/eventbus.go |
EventBus: Subscribe/Emit/Close |
pkg/agent/hooks.go |
Hook interfaces, HookManager |
pkg/agent/turn_state.go |
turnState, steering/followup queues, interrupt flags |
pkg/agent/tool_exec.go |
Smart tool execution with hooks |
pkg/agent/subturn.go |
spawnSubTurn, deliverSubTurnResult |
Modified files
| File | Scope |
|---|---|
pkg/agent/loop.go |
Major — new fields, runTurn rewrite, public APIs |
pkg/tools/base.go |
Minimal — add ReadOnlyIndicator interface |
pkg/tools/*.go |
Small — add IsReadOnly() to known read-only tools |
pkg/agent/instance.go |
Small — add sub-turn config fields |
pkg/tools/subagent.go |
Large — rewrite to call spawnSubTurn |
pkg/tools/spawn.go |
Large — rewrite to call spawnSubTurn |
Concurrency safety
Key invariants:
- Lock ordering:
turnState.mu→eventBus.mu (RLock)→sessionManager.mu— never reversed - Emit outside locks: EventBus.Emit is called after releasing
turnState.muto minimize lock hold time - Double-check on providerCancel: Loop sets
providerCancelthen checksaborted;InterruptHardsetsabortedthen callsproviderCancel— mutual redundancy underts.mu - FollowUp as return value:
runTurnreturns follow-ups instead of publishing in defer — eliminates race between cleanup and new turn startup - Approval goroutine isolation: Approval runs in separate goroutine with timeout — cannot block loop even if hook misbehaves
- Session snapshot correction:
ts.sessionSnapshotis updated afterforceCompression()to prevent stale rollback targets
Open questions
- Should EventBus support filtered subscriptions (subscribe to specific EventKinds only), or should filtering be subscriber-side?
- Should there be a built-in "debug hook" that logs all events to a file, enabled via config flag?
- should async sub-turn results that arrive after parent turn ends trigger a user notification by default, or should this be opt-in?
- Should
ReadOnlyIndicatorbe extended to MCP tools (remote tools declaring read-only via MCP metadata)?