fix(memory): avoid stale tool schema recall by he-yufeng · Pull Request #5058 · QwenLM/qwen-code

he-yufeng · 2026-06-12T17:27:31Z

What this PR does

This makes managed auto-memory more conservative around MCP and other live tool definitions. It stops auto-memory from saving MCP tool names, parameter schemas, field mappings, guessed call formats, or failed tool-call attempts, tightens model-based recall so active tool schemas are not selected while the tool is in use, and applies the same protection to the heuristic recall fallback.

Why it's needed

Issue #4976 describes a long tool-call loop where stale or guessed tool usage could become persistent context and influence later calls. Live tool definitions are a better source of truth than memory for names, parameter shapes, and call formats. Keeping those details out of extraction and recall reduces the chance that old failed attempts get injected near a future tool call.

Reviewer Test Plan

How to verify

Run the focused memory tests and checks below. The new tests cover the extraction prompt guard, model-selector prompt guidance, and heuristic fallback filtering for recently used tools.

Evidence (Before & After)

N/A for UI. Before this change, the heuristic fallback ignored recentTools, so a memory document about ata::article-list-query parameter schemas or failed calls could still be selected when the model selector failed. After this change, those active-tool schema memories are filtered while durable operational context, such as an escalation path, can still be recalled.

Tested on

OS	Status
Windows	tested
macOS	not tested
Linux	not tested

Environment (optional)

Node.js from the local repository environment after npm ci.

Commands run:

npm run test --workspace=@qwen-code/qwen-code-core -- src/memory/prompt.test.ts src/memory/relevanceSelector.test.ts src/memory/recall.test.ts
npx eslint packages/core/src/memory/prompt.ts packages/core/src/memory/relevanceSelector.ts packages/core/src/memory/recall.ts packages/core/src/memory/prompt.test.ts packages/core/src/memory/relevanceSelector.test.ts packages/core/src/memory/recall.test.ts --max-warnings 0
npx prettier --check packages/core/src/memory/prompt.ts packages/core/src/memory/relevanceSelector.ts packages/core/src/memory/recall.ts packages/core/src/memory/prompt.test.ts packages/core/src/memory/relevanceSelector.test.ts packages/core/src/memory/recall.test.ts
npm run typecheck --workspace=@qwen-code/qwen-code-core
git diff --check

Risk & Scope

Main risk or tradeoff: the marker-based fallback filter intentionally ignores active-tool schema/failed-call memories, but it preserves non-schema operational memories such as ownership or escalation notes.
Not validated / out of scope: this does not redesign MCP tool discovery or add schema-first tool loading; it only prevents memory from reinforcing stale tool-call details.
Breaking changes / migration notes: none.

Linked Issues

Fixes #4976

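Chinese description (translated)

What this PR does

This PR makes managed auto-memory more conservative around MCP and other live tool definitions: it no longer saves MCP tool names, parameter schemas, field mappings, guessed call formats, or failed tool-call records. It also strengthens the model selector prompt so that stale schema memories for a tool are not recalled while that tool is in use, and adds the same filtering to the heuristic fallback.

Why it's needed

#4976 describes a very long tool-call detour. Part of the problem is that stale, guessed tool usage can be saved into memory and then injected near tool-call context in later sessions. Tool names, parameter shapes, and call formats should follow the current live tool definitions rather than old memory. This change lowers the chance that a wrong call is amplified across sessions.

Verification

Ran the memory-related focused tests, ESLint, Prettier check, core typecheck, and git diff --check locally on Windows. The new tests cover the extraction prompt, the model selector prompt, and the heuristic fallback's filtering of active-tool schema/failed-call memories after the model selector fails.

Scope and risk

This PR does not rework the MCP tool discovery flow or implement schema-first tool loading; it only fixes the memory side so that memory does not reinforce outdated tool-call details. The fallback filter is conditional: a document must match both a currently active tool name and a schema/parameter/failed-call marker. Durable operational context such as an owner or escalation path can still be recalled.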

wenshao

Thanks — the diagnosis is right, the three-layer approach (extraction prompt / model selector / heuristic fallback) is the correct shape, and the PR itself is clean and well-tested with an honest description. But while tracing the call chain I found a structural gap that changes what this PR actually does at runtime, plus a naming-convention mismatch. Details inline; summary:

Blocking

1. recentTools is never populated in production. The only production caller of recall is client.ts:1692 (memory prefetch), and it passes only config / excludedFilePaths / abortSignal, on main and on this branch alike. So options.recentTools ?? [] is always []: the new heuristic filter never fires, and the Recently used tools: ... context line (gated on recentTools.length > 0) never reaches the model selector either. Of the three changes, only the extraction-prompt bullet has live effect; the recall-side hardening, which is the headline of this PR, is currently dead code, and the tests pass because they inject recentTools explicitly. Either wire it up in client.ts (collect the tool names executed recently in this session), or reposition the PR as "extraction-side fix + recall-side groundwork" and soften Fixes #4976 to Refs #4976.
2. toolAliases splits on '::', but qwen-code registers MCP tools as mcp__<server>__<tool> (see generateValidName(`mcp__${serverName}__${serverToolName}`) in mcp-tool.ts). ata::article-list-query is the issue #4976 transcript format, not the runtime format; once wiring lands, the alias derivation won't match real names.

Strongly suggested
3. The old prompt's explicit carve-out — "DO still select memories containing warnings, gotchas, or known issues about those tools — active use is exactly when those matter" — was dropped. Combined with broad heuristic markers (mcp tool, tool usage, api docs), the highest-value memories (behavioral warnings about an active tool) get suppressed exactly when they matter most. Excluding schemas/failed transcripts and keeping behavioral warnings are compatible goals.
4. The extraction bullet's blanket "failed tool-call attempts" would also block saving confirmed workarounds for recurring failures — worth distinguishing raw transcripts/guessed schemas (don't save) from confirmed workarounds/gotchas (do save).

Positives: filtering happens before scoring so excluded docs free up top-5 slots; the model-success path and telemetry (docs_scanned stays unfiltered) are untouched; tests cover each layer; bilingual PR description with a real test plan.


wenshao · 2026-06-12T17:40:16Z

  }

-  const selectedDocs = selectRelevantAutoMemoryDocuments(query, docs, limit);
+  const heuristicDocs = docs.filter(


This filter never fires in production: the only production caller of recall is client.ts:1692 (the memory prefetch in sendMessageStream), and its options carry only config / excludedFilePaths / abortSignal — recentTools is not passed anywhere outside tests (this is true on main too; the parameter has had no producer since it was introduced). So options.recentTools ?? [] is always [] and isActiveToolUsageMemory returns false on its first line for every doc. The same applies to the model-selector side: the Recently used tools: ... context line is gated on recentTools.length > 0, so the strengthened system-prompt instruction is vacuous at runtime.

To actually close #4976's recall side, client.ts needs to collect the tool names executed recently in the session and pass them here. If that's intended as a follow-up, consider repositioning this PR (Fixes → Refs).


wenshao · 2026-06-12T17:40:16Z

+function toolAliases(toolName: string): string[] {
+  return Array.from(
+    new Set(
+      [toolName, toolName.split('::').at(-1) ?? '']


The '::' separator comes from the issue #4976 transcript, not from this codebase: qwen-code registers MCP tools as mcp__<server>__<tool> (generateValidName(`mcp__${serverName}__${serverToolName}`) in mcp-tool.ts). Once recentTools is wired with real registry names like mcp__ata__article-list-query, this split is a no-op, the alias list contains only the full name, and a memory that mentions just article-list-query won't match, so the filter misses its own target scenario.

Suggest splitting on both '::' and '__', dropping the mcp prefix segment, and adding a test that uses the real naming format (that test would also have surfaced the wiring gap).

Micro-nit: .at(-1) ?? '' — split() never returns an empty array, so the ?? '' branch is dead (harmless).
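
The alias suggestion above could look roughly like the sketch below. It is not the code that landed in the PR: only the name toolAliases comes from the diff, and the splitting rules (dropping the leading mcp segment and the server segment, keeping the suffix after the last '::') are assumptions made for illustration. It also assumes server names contain no '__'.

```typescript
// Hypothetical sketch: derive match aliases for a tool name.
// Handles the runtime shape `mcp__<server>__<tool>` and the transcript
// shape `<server>::<tool>`; any other name passes through unchanged.
function toolAliases(toolName: string): string[] {
  const aliases = new Set<string>([toolName]);

  if (toolName.startsWith('mcp__')) {
    // Runtime registry name: drop the `mcp` prefix and the server segment.
    const segments = toolName.split('__');
    if (segments.length >= 3) {
      // The tool part may itself contain `__`, so rejoin what follows the server.
      aliases.add(segments.slice(2).join('__'));
    }
  } else if (toolName.indexOf('::') !== -1) {
    // Transcript shape: keep the suffix after the last `::`.
    aliases.add(toolName.slice(toolName.lastIndexOf('::') + 2));
  }

  // Drop empty strings so a malformed name cannot match every document.
  return Array.from(aliases).filter((alias) => alias.length > 0);
}

console.log(toolAliases('mcp__ata__article-list-query'));
console.log(toolAliases('ata::article-list-query'));
```

A test that feeds the real registry format through a function like this is the kind of check the comment asks for, since it fails as soon as only '::' is handled.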


wenshao · 2026-06-12T17:40:16Z

 const MAX_DOC_BODY_CHARS = 1_200;
 const debugLogger = createDebugLogger('AUTO_MEMORY_RECALL');

+const ACTIVE_TOOL_USAGE_MEMORY_MARKERS = [


Two notes on the marker list:

Singular/plural pairs are redundant under substring matching — parameter schema already matches parameter schemas, same for field mapping / tool schema. The list can shrink by ~a third.

The broad markers (mcp tool, tool usage, api docs) make this filter catch warnings-class memories too: a feedback note like "when using mcp tool X, always pass locale=zh or it silently returns empty" names the active tool + hits mcp tool → filtered, exactly when it's most needed. Consider skipping type: 'feedback' docs here, or narrowing the markers to schema-ish terms only (schema / mapping / guessed / failed-call), to stay consistent with the warnings carve-out discussion on the selector prompt.


wenshao · 2026-06-12T17:40:16Z

 - If you are unsure if a memory will be useful in processing the user's query, then do not include it in your list. Be selective and discerning.
 - If there are no memories in the list that would clearly be useful, feel free to return an empty list.
- If a list of recently-used tools is provided, do not select memories that are usage reference or API documentation for those tools (the assistant is already exercising them). DO still select memories containing warnings, gotchas, or known issues about those tools — active use is exactly when those matter.`;
+- If a list of recently-used tools is provided, do not select memories that are usage reference, API documentation, parameter schemas, field mappings, guessed call formats, or failed-call transcripts for those tools. Live tool definitions are the source of truth. Only select a tool-related memory if it contains durable operational context that cannot be obtained from the live schema, such as credentials location, ownership, or an external escalation path.`;


The old wording carried an explicit carve-out: "DO still select memories containing warnings, gotchas, or known issues about those tools — active use is exactly when those matter." The new whitelist (credentials location / ownership / escalation path) doesn't obviously cover it, and the preceding exclusion list could plausibly classify a parameter gotcha ("date param must be UTC or it silently returns empty") as schema-ish — suppressing the highest-value memory class right when it matters most.

The two goals are compatible: exclude schemas / guessed formats / failed-call transcripts, AND keep behavioral warnings. Suggest re-adding the carve-out sentence alongside the new exclusions.

(Also note: this instruction is currently unreachable at runtime — the Recently used tools: line never appears because no production caller passes recentTools; see the recall.ts comment.)


wenshao · 2026-06-12T17:40:16Z

  '- Code patterns, conventions, architecture, file paths, or project structure — these can be derived by reading the current project state.',
  '- Git history, recent changes, or who-changed-what — `git log` / `git blame` are authoritative.',
  '- Debugging solutions or fix recipes — the fix is in the code; the commit message has the context.',
+  '- MCP tool names, parameter schemas, field mappings, guessed tool-call formats, or failed tool-call attempts — live tool definitions are authoritative and may change.',


The blanket "failed tool-call attempts" is a bit over-broad: a recurring failure with its confirmed workaround is exactly the kind of durable feedback memory worth saving ("tool X 500s when page>10 — paginate with cursor instead"). Raw failure transcripts and guessed schemas, agreed, should never be saved. Suggest wording that distinguishes the two, e.g. "...or raw failed-call transcripts — live tool definitions are authoritative. (Confirmed workarounds or gotchas about a tool's behavior are still worth saving.)"


Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

he-yufeng · 2026-06-12T19:06:08Z

Thanks for the careful trace. I pushed an update that addresses the runtime gap rather than just changing the tests:

GeminiClient.recordCompletedToolCall() now keeps a bounded recent-tool list, and the managed memory prefetch passes it into recall().
MCP tool aliasing now understands the runtime mcp__<server>__<tool> shape while keeping the legacy server::tool suffix behavior for older transcripts.
The selector and heuristic fallback still suppress live schema / failed-call transcript memories for active tools, but preserve durable gotchas, warnings, confirmed workarounds, ownership, credentials, and escalation notes.
The extraction prompt now distinguishes raw failed tool-call transcripts from confirmed durable workarounds/gotchas.
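
As a rough illustration of the bounded recent-tool list described above, here is a minimal sketch. The class and method names are hypothetical and do not mirror GeminiClient; the cap of 20 follows the figure quoted in the later code review and should be treated as illustrative.

```typescript
// Hypothetical sketch of a bounded, de-duplicated recent-tool list.
const MAX_RECENT_TOOLS = 20; // illustrative cap

class RecentToolTracker {
  private names: string[] = [];

  // Record a completed call: move the name to the end, then trim the oldest.
  record(toolName: string): void {
    this.names = this.names.filter((name) => name !== toolName);
    this.names.push(toolName);
    if (this.names.length > MAX_RECENT_TOOLS) {
      this.names = this.names.slice(this.names.length - MAX_RECENT_TOOLS);
    }
  }

  // Return a copy so callers (for example a recall prefetch) cannot mutate state.
  snapshot(): string[] {
    return this.names.slice();
  }

  // Forget everything, for example when the conversation is reset.
  reset(): void {
    this.names = [];
  }
}

const tracker = new RecentToolTracker();
tracker.record('tool_search');
tracker.record('mcp__ata__article-list-query');
console.log(tracker.snapshot());
```

Moving a repeated name to the end keeps the list ordered by most recent use, so trimming from the front discards the tools least likely to still be active.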

Validation:

npm run test --workspace=packages/core -- memory/recall.test.ts memory/relevanceSelector.test.ts memory/prompt.test.ts core/client.test.ts
npm run typecheck --workspace=packages/core
npx eslint packages/core/src/core/client.ts packages/core/src/core/client.test.ts packages/core/src/memory/recall.ts packages/core/src/memory/recall.test.ts packages/core/src/memory/relevanceSelector.ts packages/core/src/memory/relevanceSelector.test.ts packages/core/src/memory/prompt.ts packages/core/src/memory/prompt.test.ts --max-warnings 0
git diff --check

wenshao · 2026-06-12T20:07:35Z

  private lastSentIdeContext: IdeContext | undefined;
  private forceFullIdeContext = true;
+  private recentCompletedToolNames: string[] = [];
  private pendingMemoryPrefetch: MemoryPrefetchHandle | undefined;


[Suggestion] resetChat() (line ~628) does not clear recentCompletedToolNames. Every other piece of per-conversation state is explicitly reset there (surfacedRelevantAutoMemoryPaths, cachedGitStatus, lastApiCompletionTimestamp, file read cache, deferred tools, etc.), but this new field was omitted.

After a /clear, stale tool names from the previous conversation persist and continue to drive the isActiveToolUsageMemory filter — ephemeral tool-schema memories for tools used in the old conversation will be incorrectly suppressed in the new one.

Suggested change

private pendingMemoryPrefetch: MemoryPrefetchHandle | undefined;

private recentCompletedToolNames: string[] = [];

// NOTE: also reset in resetChat()

— qwen3.7-max via Qwen Code /review

wenshao · 2026-06-12T20:07:35Z

@@ -286,7 +288,10 @@ export class GeminiClient {
    // Check if we're resuming from a previous session
    const resumedSessionData = this.config.getResumedSessionData();


[Suggestion] When a session is resumed (resumedSessionData is truthy), the code replays UI telemetry and rebuilds API history but never extracts tool-call names to seed recentCompletedToolNames. The field stays at [] until the first new tool call completes, so the active-tool filter is inert during the resumed session's first query — which is the query most likely to benefit from memory recall.

Consider walking resumedSessionData.conversation.messages after buildApiHistoryFromConversation to extract the last N assistant functionCall.name values and assign them to this.recentCompletedToolNames.

— qwen3.7-max via Qwen Code /review

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

he-yufeng · 2026-06-12T20:38:27Z

Thanks for the follow-up review. I pushed one more update for both points:

resetChat() now clears recentCompletedToolNames, so a new chat does not carry stale tool context.
Resume now seeds the recent-tool list from the loaded history before startChat(...) repairs dangling calls. It only treats calls as completed when their functionResponse.id is present; id-less calls are still kept as a legacy/provider fallback.
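
One way to read the resume seeding described above is sketched below. The history types and the function name are assumptions for illustration, not the actual qwen-code types, and the handling of id-less calls reflects one interpretation of the "legacy/provider fallback" wording.

```typescript
// Hypothetical, simplified history shapes (not the real qwen-code types).
interface HistoryPart {
  functionCall?: { id?: string; name: string };
  functionResponse?: { id?: string; name: string };
}
interface HistoryMessage {
  role: 'user' | 'model';
  parts: HistoryPart[];
}

// Collect names of tool calls that completed, oldest first, capped at `limit`.
function seedRecentToolNames(history: HistoryMessage[], limit: number): string[] {
  // First pass: ids of every call that received a response.
  const answeredIds = new Set<string>();
  for (const message of history) {
    for (const part of message.parts) {
      const id = part.functionResponse?.id;
      if (id !== undefined) answeredIds.add(id);
    }
  }

  // Second pass: keep calls whose id was answered; id-less calls are kept
  // as a fallback (assumption), and pending calls with an id are skipped.
  let names: string[] = [];
  for (const message of history) {
    for (const part of message.parts) {
      const call = part.functionCall;
      if (!call) continue;
      const completed = call.id === undefined || answeredIds.has(call.id);
      if (!completed) continue;
      const name = call.name;
      names = names.filter((existing) => existing !== name);
      names.push(name);
    }
  }
  return names.slice(Math.max(0, names.length - limit));
}
```

Skipping calls that have an id but no matching response keeps a dangling call from the interrupted session out of the active-tool list.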

Validation:

npm run test --workspace=packages/core -- memory/recall.test.ts memory/relevanceSelector.test.ts memory/prompt.test.ts core/client.test.ts
npm run typecheck --workspace=packages/core
npx eslint packages/core/src/core/client.ts packages/core/src/core/client.test.ts packages/core/src/memory/recall.ts packages/core/src/memory/recall.test.ts packages/core/src/memory/relevanceSelector.ts packages/core/src/memory/relevanceSelector.test.ts packages/core/src/memory/prompt.ts packages/core/src/memory/prompt.test.ts --max-warnings 0
git diff --check

qqqys

Critical follow-up rechecked: the production recent-tool context is now wired into recall, MCP tool aliases handle the runtime mcp__server__tool shape, resetChat clears stale tool context, and resumed sessions seed completed tool names from matched call/response history. I did not find any remaining critical issue in this pass.

wenshao · 2026-06-12T22:31:53Z

Local runtime verification report (Linux, real build + real MCP server)

I built this branch (009b32de, merge-base a283ca04) and drove the real CLI in tmux to verify the runtime behavior this PR now claims, plus an A/B against the merge-base build. Since the PR was only tested on Windows with unit tests, this covers the Linux/live-session gap.

Rig. A local stdio MCP server registered as ata exposing article-list-query (runtime registry name observed: mcp__ata__article-list-query). Two auto-memory docs planted on disk with unguessable sentinel tokens in their bodies:

ata-article-query-schema.md — described as "Parameter schema and usage reference for the ata MCP tool article-list-query…", body sentinel SCHEMA-TOKEN-4417 (should be suppressed while the tool is in active use)
ata-escalation.md — "Escalation path and owner…", body sentinel ESCALATION-TOKEN-7731 (durable; should survive)

The relevance selector only ever sees doc headers, so a sentinel can reach the main model only via actual recall injection — "model echoes the token" is injection proof (probes forbade file reads; transcripts were checked for stray reads, and one contaminated probe was discarded). For per-turn ground truth I added log-only instrumentation to the compiled client.js / recall.js / relevanceSelector.js of both builds (dumping recentTools at prefetch, the selector's tools section and selection, and the heuristic filter decision). Docs were planted only after the first MCP-call turn of each session so excludedFilePaths could not mask results. Main and side model: glm-4.7.

Verified working (PR build, vs merge-base control)

1. The production wiring gap is genuinely fixed. After a real MCP call in turn 1, the next turn's prefetch carries runtime tool names into the selector prompt:

[CLIENT] [E2E] prefetch recentTools=["tool_search","mcp__ata__article-list-query"]
[AUTO_MEMORY_RECALL] [E2E] selector toolsSection="\n\nRecently used tools: tool_search, mcp__ata__article-list-query" ...

Merge-base control, same scenario: prefetch (base) no recentTools in options + toolsSection="" on every turn — even immediately after the MCP call — and the schema doc was injected (model echoed SCHEMA-TOKEN-4417).

2. The deterministic fallback layer works end-to-end. With the side query deliberately failing fast (workspace fastModel pointed at a dead local provider; main chat unaffected), a real session produced:

Model-driven auto-memory recall failed; falling back to heuristic selection. (Connection error)
[E2E] heuristic recentTools=["tool_search","mcp__ata__article-list-query"]
      kept=["ata-escalation.md"] of=["ata-article-query-schema.md","ata-escalation.md"]

Probe answer: SCHEMA=NONE / ESCALATION=ESCALATION-TOKEN-7731 — schema memory suppressed, durable memory injected, sentinel-verified. Additionally, 5/5 deterministic assertions against the compiled recall.js (no model involved) passed: the runtime shape mcp__ata__article-list-query and the issue-transcript shape ata::article-list-query both filter the schema doc (the alias fix from the earlier review round holds), the durable doc survives, and an unrelated recent tool filters nothing.

3. resetChat() clearing and resume seeding are real.

/clear → next prefetch logs recentTools=[], and previously-surfaced docs become selectable again.
Resume: quit after the MCP turn, relaunch with -c. The resumed process's first prefetch already carries recentTools=["tool_search","mcp__ata__article-list-query"], seeded purely from recorded history (matched call/response ids) before any tool ran in the new process. Merge-base resume control: recall still works but with no recentTools — so the seeding is exactly this PR's delta, not a side effect.

4. Extraction guard suppresses the harmful save class. Same mixed prompt in fresh workspaces ("remember: I guessed the call format {topic: string} for the ata tool and the call failed with a type error; also the on-call owner is the platform team in #ata-ops"):

Merge-base: wrote a feedback memory preserving the guessed format + failure ("don't use {topic: string}" — which is in fact the tool's correct schema; a textbook demo of why these saves mislead later sessions), plus the owner memory.
PR build, 2 runs: the guessed-format/failed-call detail was saved in neither run. Run 2 saved exactly the owner/escalation memory (the intended outcome); run 1 saved nothing (over-suppressed the allowed durable item). Small n and model-dependent, but directionally right.

5. Focused unit tests: all 4 touched test files pass in the worktree (212/212). Core + CLI build clean.

Findings to weigh (none are regressions — the PR beats base in every scenario I ran)

F1. On the happy path the protection is advisory, and glm-4.7 often ignores it. With the tools section present and a doc literally described as "Parameter schema and usage reference for the <active tool>", the model selector still selected the schema doc in 3 of 4 instrumented probes (temperature 0; the probe asks how to use the tool, so query-relevance and the exclusion rule conflict, which is representative of the #4976 scenario). The deterministic filter only runs when the side query fails. If we want the guarantee on the healthy path too, isActiveToolUsageMemory() could also post-filter the model's selection in recall.ts (same durable carve-out); suggest as a follow-up.

F2. Heuristic markers are English-only; Chinese memories pass through. For zh-locale users the extraction agent writes Chinese descriptions/bodies (live demo: merge-base extraction produced ATA MCP 工具调用格式注意事项, a guessed-format/failed-call memory in Chinese). Against the PR's compiled filter with the tool active, that Chinese doc is not filtered (the tool-name alias matches but no ACTIVE_TOOL_USAGE_MEMORY_MARKERS entry matches Chinese text), while an English equivalent is filtered. Given the user base, zh marker equivalents (参数格式 / 调用格式 / 失败调用 / 字段映射 …) or a language-neutral approach would be a worthwhile follow-up.

F3. (Pre-existing, for awareness) A slow-failing side model skips the fallback entirely. With a misconfigured provider that fails slowly (real 502s × 5 retries ≈ 5.5 s), the prefetch is aborted when the first stream completes, the abortSignal?.aborted short-circuit then skips the heuristic, and no memory is recalled that turn (SCHEMA=NONE / ESCALATION=NONE). The fallback layer protects fast failures (connection-refused: ~2.9 s, worked perfectly), not slow ones. Not introduced by this PR — it just bounds how much the new fallback filter can cover.
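
The post-filter proposed in F1 could be sketched as follows. This is a follow-up idea rather than code from the PR: the MemoryDoc shape, the function name postFilterSelection, and the predicate signature are all assumptions, and the predicate stands in for whatever isActiveToolUsageMemory() actually checks.

```typescript
// Hypothetical, simplified document shape.
interface MemoryDoc {
  path: string;
  description: string;
}

// Stand-in for the deterministic check; the real predicate lives in recall.ts.
type ActiveToolPredicate = (doc: MemoryDoc, recentTools: string[]) => boolean;

// Apply the deterministic filter to whatever the model selector returned,
// so the exclusion holds even when the model ignores the prompt guidance.
function postFilterSelection(
  selected: MemoryDoc[],
  recentTools: string[],
  isActiveToolUsageMemory: ActiveToolPredicate,
): MemoryDoc[] {
  if (recentTools.length === 0) return selected; // nothing active, nothing to filter
  return selected.filter((doc) => !isActiveToolUsageMemory(doc, recentTools));
}
```

Because the same predicate would serve both the fallback and the healthy path, the durable carve-out would apply identically in each.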

Verdict

The three fixes from the earlier review rounds (production wiring, runtime alias shape, resetChat/resume lifecycle) are all real and verified live against the merge-base. The deterministic fallback works end-to-end, and the extraction guard measurably stops the harmful save class. F1/F2 bound the protection but never make anything worse than base — fine as follow-up issues rather than blockers. LGTM from the runtime-behavior side.


wenshao · 2026-06-12T22:34:40Z

@qwen-code /triage

qwen-code-ci-bot · 2026-06-12T22:39:24Z

Thanks for the PR, @he-yufeng!

Template looks good ✓ — all required sections present, bilingual, linked to #4976.

On direction: this is a clear win. Issue #4976 shows a textbook case of auto-memory reinforcing a guessed tool-call pattern across sessions, turning a one-time detour into a persistent feedback loop. Making memory more conservative about live tool definitions is exactly the right fix — tool schemas should come from the live registry, not from stale notes. No CHANGELOG reference in other agents, but the area is core to qwen-code's memory system.

On approach: the three-layer defense (extraction guard → model selector prompt → deterministic heuristic fallback) is well-scoped and each layer covers a distinct failure mode. I considered whether a single post-filter on recall output would suffice, but the extraction guard prevents the bad saves in the first place — which the post-filter alone can't do. The scope feels right: it constrains memory behavior without touching MCP discovery or tool loading.

Moving on to code review. 🔍


— Qwen Code · qwen3.7-max

qwen-code-ci-bot · 2026-06-12T22:41:03Z

Code Review

The implementation is clean and well-structured. Three layers of defense, each independently useful:

client.ts — recentCompletedToolNames is a bounded (20) move-to-end dedup list. seedRecentCompletedToolNamesFromHistory walks resumed history matching function calls with responses by ID, correctly skipping pending calls. resetChat() clears the list. Defensive copy ([...this.recentCompletedToolNames]) when passing to recall(). All straightforward.
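To make the list semantics concrete, here is a minimal sketch of a bounded move-to-end dedup list and of seeding it from resumed history. It is not the PR's code: `rememberCompletedTool`, `seedFromHistory`, and the `HistoryPart` shape are hypothetical simplifications, and only the bound of 20 and the "skip pending calls" rule come from the review above.

```typescript
const MAX_RECENT_TOOLS = 20; // bound stated in the review

// Returns a new list with `name` moved (or appended) to the end,
// dropping the oldest entries once the bound is exceeded.
function rememberCompletedTool(
  recent: readonly string[],
  name: string,
): string[] {
  const next = recent.filter((existing) => existing !== name);
  next.push(name);
  return next.slice(-MAX_RECENT_TOOLS);
}

// Hypothetical, simplified history parts; the real history format is richer.
type HistoryPart =
  | { kind: "call"; id: string; name: string }
  | { kind: "response"; id: string };

// Seeds the list from resumed history, counting only calls whose ID has a
// matching response, so pending calls are skipped.
function seedFromHistory(history: readonly HistoryPart[]): string[] {
  const answered = new Set<string>();
  for (const part of history) {
    if (part.kind === "response") answered.add(part.id);
  }
  let recent: string[] = [];
  for (const part of history) {
    if (part.kind === "call" && answered.has(part.id)) {
      recent = rememberCompletedTool(recent, part.name);
    }
  }
  return recent;
}
```

Returning a fresh array on every update has the same effect as the defensive copy mentioned above: callers such as recall() cannot mutate the client's state.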

recall.ts — isActiveToolUsageMemory() requires three conditions simultaneously: (1) a recent tool name matches via aliases, (2) schema/failed-call markers are present, (3) no durable markers (gotcha, workaround, owner, escalation) override. This is conservative by design — false positives are unlikely because all three must hold. toolAliases() correctly handles the runtime shape mcp__server__tool, the transcript shape server::tool, and bare tool names.
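The three-condition check can be sketched as follows. This is an illustration under assumptions, not the code in recall.ts: the marker lists are made up (only the four durable markers are named in the review), the bodies of `toolAliases` and `isActiveToolUsageMemory` are guesses at the described behavior, and the alias parsing assumes the server name contains no double underscore.

```typescript
// Illustrative marker lists; the real ones in recall.ts may differ.
const USAGE_MARKERS = ["parameter schema", "field mapping", "call format", "failed call"];
const DURABLE_MARKERS = ["gotcha", "workaround", "owner", "escalation"];

// Expands a tool name into the runtime shape (mcp__server__tool),
// the transcript shape (server::tool), and the bare tool name.
function toolAliases(toolName: string): string[] {
  const name = toolName.toLowerCase();
  const aliases = new Set<string>([name]);
  if (name.startsWith("mcp__")) {
    const rest = name.slice("mcp__".length);
    const sep = rest.indexOf("__");
    if (sep > 0) {
      const server = rest.slice(0, sep);
      const tool = rest.slice(sep + 2);
      aliases.add(`${server}::${tool}`);
      aliases.add(tool);
    }
  } else if (name.includes("::")) {
    const sep = name.indexOf("::");
    const server = name.slice(0, sep);
    const tool = name.slice(sep + 2);
    aliases.add(`mcp__${server}__${tool}`);
    aliases.add(tool);
  }
  // Drop empty aliases so a malformed name cannot match every document.
  return Array.from(aliases).filter((alias) => alias.length > 0);
}

// Filters a memory only when all three conditions hold at once.
function isActiveToolUsageMemory(
  memoryText: string,
  recentTools: readonly string[],
): boolean {
  const text = memoryText.toLowerCase();
  const mentionsRecentTool = recentTools.some((tool) =>
    toolAliases(tool).some((alias) => text.includes(alias)),
  );
  const hasUsageMarker = USAGE_MARKERS.some((m) => text.includes(m));
  const hasDurableMarker = DURABLE_MARKERS.some((m) => text.includes(m));
  return mentionsRecentTool && hasUsageMarker && !hasDurableMarker;
}
```

Letting a durable marker veto the filter biases the check toward recalling too much rather than too little, which fits the "conservative by design" description.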

relevanceSelector.ts — System prompt tightened from "don't select usage reference" to an explicit list (parameter schemas, field mappings, guessed call formats, failed-call transcripts) while preserving the durable carve-out.

prompt.ts — One line added to WHAT_NOT_TO_SAVE_SECTION. Concise and well-worded.

No critical blockers. No AGENTS.md violations. Code follows project conventions (ESM, strict TS, collocated tests).

Known limitations (all from wenshao's runtime verification, none are regressions):

F1: Model selector protection is advisory on the happy path — the deterministic filter only kicks in when the side query fails. Suggest post-filtering model selection as a follow-up.
F2: Heuristic markers are English-only — Chinese memories pass through. Worth adding zh markers or a language-neutral approach.
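The F1 follow-up could look roughly like this: run the same deterministic predicate over the model's selection instead of only over the heuristic fallback. This is a sketch of the suggestion, not something the PR implements; `MemoryDoc` and `postFilterSelection` are hypothetical names, and the predicate is passed in so the example stays self-contained.

```typescript
// Hypothetical document shape for illustration.
type MemoryDoc = { path: string; text: string };

// Applies a deterministic active-tool filter to whatever the model selected,
// so the happy path no longer relies on prompt guidance alone.
function postFilterSelection(
  selected: readonly MemoryDoc[],
  recentTools: readonly string[],
  isActiveToolUsage: (text: string, recentTools: readonly string[]) => boolean,
): MemoryDoc[] {
  // With no recently completed tools there is nothing to protect against.
  if (recentTools.length === 0) return [...selected];
  return selected.filter((doc) => !isActiveToolUsage(doc.text, recentTools));
}
```

Passing isActiveToolUsageMemory as the predicate would give both recall paths identical filtering, at the cost of inheriting the same English-only marker limitation described in F2.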

Tests

Unit tests for prompt.test.ts and relevanceSelector.test.ts: 15/15 passed. recall.test.ts and client.test.ts failed in the worktree due to a missing ajv package (environment issue, not PR code) — wenshao's report confirmed all 212 tests pass in a proper build.

Typecheck errors in the worktree are all in files not touched by this PR (filesearch/ignore.ts, schemaValidator.ts, gitIgnoreParser.ts) — pre-existing environment issues with ajv and ignore packages.

Smoke Test (main working tree, `npm run dev`)

$ npm run dev -- -p 'What is 2+2? Reply with just the number.'
> @qwen-code/qwen-code@0.18.0 dev
> node scripts/dev.js -p What is 2+2? Reply with just the number.
DEV is set to true, but the React DevTools server is not running.
4
$

CLI responds correctly. The memory subsystem changes require an MCP server to exercise end-to-end — wenshao's comprehensive runtime verification (sentinel tokens, merge-base A/B, deterministic assertions against compiled output) already covers all three layers thoroughly.

— Qwen Code · qwen3.7-max

qwen-code-ci-bot · 2026-06-12T22:41:25Z

Reflection

Going back to my independent proposal before reading the diff: I'd have done essentially the same thing — add a "don't save tool schemas" rule to the extraction prompt, pass recent tool names to the relevance selector with a skip instruction, and add a deterministic post-filter on the heuristic fallback. The PR matches this and adds the lifecycle wiring (resume seeding, resetChat clearing) that I would have missed in a first pass. That's a sign the author thought through the full lifecycle, not just the happy path.

The problem is real — #4976 is a vivid 14-turn detour where guessed tool patterns got reinforced by memory. The fix is narrowly scoped: it touches exactly the three points where stale tool details can enter or persist in memory (extraction, model selection, heuristic fallback). No scope creep, no speculative abstractions.

wenshao's runtime verification is the strongest evidence I've seen on a community PR — sentinel tokens, merge-base A/B comparison, deterministic assertions against compiled output, all three layers verified independently. The three findings (F1 advisory happy-path, F2 English-only markers, F3 slow-fail gap) are valid follow-ups that don't make anything worse than the current baseline.

The code is straightforward, the tests are focused, and the implementation follows project conventions. If I had to maintain this in six months, I'd thank the author for the clear layering and the conservative filter logic (three conditions must all hold, with a durable whitelist override).

Verdict: LGTM. Approving. ✅

— Qwen Code · qwen3.7-max

qwen-code-ci-bot

LGTM, looks ready to ship. ✅

* fix(memory): avoid stale tool schema recall
  Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
* fix(memory): seed resumed tool recall context
  Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
---------
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

wenshao reviewed Jun 12, 2026

fix(memory): avoid stale tool schema recall

e02d6ae

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

he-yufeng force-pushed the fix/memory-tool-schema-guard branch from 03eaab4 to e02d6ae Compare June 12, 2026 19:05

wenshao reviewed Jun 12, 2026

fix(memory): seed resumed tool recall context

009b32d

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

qqqys approved these changes Jun 12, 2026

wenshao approved these changes Jun 12, 2026

wenshao merged commit 92c4a82 into QwenLM:main Jun 12, 2026
23 checks passed

qwen-code-ci-bot approved these changes Jun 12, 2026

github-actions Bot mentioned this pull request Jun 15, 2026

Upstream changes detected: Gemini CLI, Codex CLI (App Server Protocol), GitHub Copilot CLI, Qwen Code greenpolo/cc-multi-cli-plugin#8

Open

	private pendingMemoryPrefetch: MemoryPrefetchHandle | undefined;
	private recentCompletedToolNames: string[] = [];
	// NOTE: also reset in resetChat()

		@@ -286,7 +288,10 @@ export class GeminiClient {
		// Check if we're resuming from a previous session
		const resumedSessionData = this.config.getResumedSessionData();
