fix(mcp): retry listTools() before evicting client on transient failure #17153

Open
nil957 wants to merge 1 commit into anomalyco:dev from nil957:fix/mcp-retry-on-transient-failure

nil957 commented Mar 12, 2026

Issue for this PR

Closes #17099

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

This PR fixes a critical bug where MCP tools silently vanish mid-session after a single transient listTools() failure.

The Problem:
In mcp/index.ts, when client.listTools() fails (for example on a timeout, pipe hiccup, or GC pause), the code immediately executes delete s.clients[clientName], removing the client from the singleton state. The MCP server process may still be running normally, but its tools stay unavailable until OpenCode restarts.

The Fix:

  1. Added retry logic (3 attempts with 1s delay) before marking a client as failed
  2. Removed the immediate delete s.clients[clientName] that permanently evicted healthy servers
  3. Added warn-level logging for retry attempts to aid debugging

Why it works:
Transient failures (network blips, GC pauses, brief timeouts) are common in long-running sessions. Retrying gives the MCP server a chance to respond before we give up. Not deleting the client allows future reconnection attempts.
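The retry flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `withRetry`, its option names, and the injectable `sleep` are hypothetical, and only the "3 attempts with 1s delay" defaults and the warn-on-retry hook come from the description.

```typescript
// Hypothetical helper: names and option shapes are illustrative, not the PR's actual code.
type Sleep = (ms: number) => Promise<void>

const defaultSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms))

interface RetryOptions {
  attempts?: number // total tries, including the first one (default 3)
  delayMs?: number // pause between tries (default 1000 ms)
  sleep?: Sleep // injectable so tests do not have to wait
  onRetry?: (attempt: number, error: unknown) => void // e.g. a warn-level log call
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const attempts = Math.max(1, opts.attempts ?? 3)
  const delayMs = opts.delayMs ?? 1000
  const sleep = opts.sleep ?? defaultSleep

  let lastError: unknown
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn()
    } catch (error) {
      lastError = error
      // Only report and wait if another attempt is still coming.
      if (attempt < attempts) {
        opts.onRetry?.(attempt, error)
        await sleep(delayMs)
      }
    }
  }
  // Every attempt failed: surface the last error so the caller can mark the client as failed.
  throw lastError
}
```

A caller would wrap the tool listing in it, for example `withRetry(() => client.listTools(), { onRetry: (n, e) => console.warn("listTools retry", n, e) })`, and only treat the client as failed if the returned promise rejects.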

How did you verify your code works?

  1. Read through the existing code and traced the bug as described in Bug: MCP tools permanently lost mid-session after single transient listTools() failure — no retry, no reconnect #17099
  2. Verified the fix logic handles the retry/catch flow correctly
  3. Confirmed the change is minimal and surgical: it only affects the failure-handling path

Screenshots / recordings

N/A - backend change

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

Previously, a single transient listTools() failure (timeout, pipe hiccup, GC
pause) would permanently delete the MCP client from the session state with no
retry or reconnection mechanism. This caused MCP tools to silently vanish
mid-session while the server process was still running.

This commit:
1. Adds retry logic (3 attempts with 1s delay) before marking a client as failed
2. Removes the immediate 'delete s.clients[clientName]' that permanently evicted
   healthy servers on transient errors
3. Logs retry attempts at warn level for visibility

The client remains in 'failed' status after all retries are exhausted, but is
no longer deleted, allowing future state() calls to attempt reconnection.

Fixes anomalyco#17099
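
To illustrate the "failed but not deleted" behavior from the commit message, here is a small sketch under stated assumptions. The `McpState` shape, the `status` map, and both helper functions are hypothetical; only `s.clients[clientName]` and the 'failed' status come from the description above.

```typescript
// Hypothetical state shape: the real one in mcp/index.ts may differ.
type ClientStatus = { status: "connected" } | { status: "failed"; error: string }

interface McpState {
  clients: Record<string, unknown>
  status: Record<string, ClientStatus>
}

// After all retries are exhausted: record the failure, but leave the client entry in place.
function markFailed(s: McpState, clientName: string, error: unknown): void {
  const message = error instanceof Error ? error.message : String(error)
  s.status[clientName] = { status: "failed", error: message }
  // Deliberately no `delete s.clients[clientName]` here.
}

// A later pass can find clients that are still registered but marked as failed.
function reconnectCandidates(s: McpState): string[] {
  return Object.keys(s.clients).filter((name) => s.status[name]?.status === "failed")
}
```

Because the entry survives, a later call can look up failed clients and try them again, instead of finding nothing left to reconnect.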
github-actions bot added and then removed the needs:compliance label Mar 12, 2026
github-actions (Contributor) commented

Thanks for updating your PR! It now meets our contributing guidelines. 👍
