fix: retry transient local agent server errors (#3044)
Conversation
@BugBot run
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the resilience of the local agent's streaming capabilities by introducing a robust retry mechanism for transient server errors originating from external providers. By intelligently retrying failed stream passes and leveraging existing continuation paths, the system can gracefully recover from temporary service interruptions, leading to a more stable and reliable user experience.
@BugBot run
Code Review
The pull request introduces a robust retry mechanism for transient server errors encountered during the local agent stream processing. This significantly improves the reliability and resilience of the AI agent by allowing it to recover from temporary network or provider-side issues. A new test case has been added to validate this functionality, which is a good practice. The refactoring of the retry logic and the introduction of specific retryable error patterns are well-implemented.
```typescript
const RETRYABLE_STREAM_ERROR_STATUS_CODES = new Set([
  408, 429, 500, 502, 503, 504,
]);
const RETRYABLE_STREAM_ERROR_PATTERNS = [
  "server_error",
  "internal server error",
  "service unavailable",
  "bad gateway",
  "gateway timeout",
  "too many requests",
  "rate_limit",
  "overloaded",
  "timeout",
  "econnrefused",
  "enotfound",
  "econnreset",
  "epipe",
  "etimedout",
];
```
Consider grouping these new constants (`RETRYABLE_STREAM_ERROR_STATUS_CODES`, `RETRYABLE_STREAM_ERROR_PATTERNS`) with the other related constants (e.g., `MAX_TERMINATED_STREAM_RETRIES`, `STREAM_RETRY_BASE_DELAY_MS`, `STREAM_CONTINUE_MESSAGE`) for better organization and readability. This would make it easier to find all configuration-like values in one place.
```typescript
const MAX_TERMINATED_STREAM_RETRIES = 3;
const STREAM_RETRY_BASE_DELAY_MS = 400;
const STREAM_CONTINUE_MESSAGE =
  "[System] Your previous response stream was interrupted by a transient network error. Continue from exactly where you left off and do not repeat text that has already been sent.";
const RETRYABLE_STREAM_ERROR_STATUS_CODES = new Set([
  408, 429, 500, 502, 503, 504,
]);
const RETRYABLE_STREAM_ERROR_PATTERNS = [
  "server_error",
  "internal server error",
  "service unavailable",
  "bad gateway",
  "gateway timeout",
  "too many requests",
  "rate_limit",
  "overloaded",
  "timeout",
  "econnrefused",
  "enotfound",
  "econnreset",
  "epipe",
  "etimedout",
];
```
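For readers following along, the pattern list above is presumably consumed by a substring check over the stringified error. A hedged sketch of such a check — the PR's real `isRetryableProviderStreamError` is not shown in this thread and may differ:

```typescript
// Hypothetical helper built from the constants above; a local name is used
// to keep this sketch self-contained.
const PATTERNS = [
  "server_error",
  "internal server error",
  "service unavailable",
  "bad gateway",
  "gateway timeout",
  "too many requests",
  "rate_limit",
  "overloaded",
  "timeout",
  "econnrefused",
  "enotfound",
  "econnreset",
  "epipe",
  "etimedout",
];

function matchesRetryablePattern(error: unknown): boolean {
  // Lower-case so codes like "ECONNRESET" in Node error messages still match.
  const text = String(error).toLowerCase();
  return PATTERNS.some((pattern) => text.includes(pattern));
}
```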
Greptile Summary

This PR extends the local agent stream retry logic to cover structured provider-side errors (e.g. Azure `server_error`).
Confidence Score: 4/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Stream iteration / response.await] -->|error thrown| B{aborted?}
    B -- yes --> C[Break / log]
    B -- no --> D{shouldRetryTransientStreamError?}
    D -->|retryCount >= MAX 3| E[throw error → surface to user]
    D -->|retryCount < MAX| F{isTerminatedStreamError?}
    F -- yes --> H[Schedule retry with continuation instruction]
    F -- no --> G{isRetryableProviderStreamError?}
    G -->|statusCode in set OR ≥500| H
    G -->|error string matches pattern| H
    G -- no match --> E
    H --> I[maybeAppendRetryReplayForRetry]
    I --> J[terminatedRetryCount++]
    J --> K[needsContinuationInstruction = true]
    K --> L[delay STREAM_RETRY_BASE_DELAY_MS × retryCount]
    L --> M[continue → restart stream pass]
```
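The delay step in the flowchart amounts to a simple linear backoff. A minimal sketch — the constant values come from the diff, but the helper names are assumptions:

```typescript
const STREAM_RETRY_BASE_DELAY_MS = 400;
const MAX_TERMINATED_STREAM_RETRIES = 3;

// Linear backoff: retry pass 1 waits 400ms, pass 2 waits 800ms, pass 3 waits 1200ms.
function retryDelayMs(retryCount: number): number {
  return STREAM_RETRY_BASE_DELAY_MS * retryCount;
}

// Worst-case latency added before the error finally surfaces to the user.
function totalWorstCaseDelayMs(): number {
  let total = 0;
  for (let pass = 1; pass <= MAX_TERMINATED_STREAM_RETRIES; pass++) {
    total += retryDelayMs(pass);
  }
  return total;
}
```

Note the 2400ms worst case matches the "~2.4s" figure cited elsewhere in this review thread.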
Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts
Line: 1033-1039
Comment:
**Telemetry event name no longer reflects its scope**
The event name `"local_agent:terminated_stream_retry"` is now fired for both transport-level `"terminated"` errors and structured provider-side errors (e.g. Azure `server_error`). Any telemetry dashboards filtering on this event name will silently lump in provider retries under the old "terminated" bucket, making it hard to distinguish the two failure modes.
The same stale name is used again at line 1082 (the `response` await catch path).
Consider a name that covers both cases, and optionally add an `errorKind` field to differentiate:
```suggestion
sendTelemetryEvent("local_agent:transient_stream_retry", {
chatId: req.chatId,
dyadRequestId,
retryCount: terminatedRetryCount,
error: String(streamError),
phase: "stream_iteration",
});
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts
Line: 1357-1363
Comment:
**`unwrapStreamError` discards the outer envelope's HTTP status code**
`unwrapStreamError` returns `error.error` when an `"error"` key is present, which strips any `statusCode` / `status` sitting on the outer wrapper. For a provider error shaped like `{ status: 408, error: { message: "Request Timeout" } }`, the `408` is lost and the code falls through to string-pattern matching — but `"request timeout"` is not in `RETRYABLE_STREAM_ERROR_PATTERNS`, so a legitimate 408 wrapped error would silently not be retried.
Consider extracting the status code from both levels before unwrapping:
```typescript
const statusCode =
(isRecord(error) && typeof (error as any).statusCode === "number" && (error as any).statusCode) ||
(isRecord(error) && typeof (error as any).status === "number" && (error as any).status) ||
// ... then fall through to the unwrapped object
```
Or alternatively, add `"request timeout"` to `RETRYABLE_STREAM_ERROR_PATTERNS` so the string-path still catches it.
How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "Remove 'timeout' err..."
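The reviewer's partial snippet can be fleshed out into a self-contained helper. This is a hedged sketch of the proposal only — `isRecord` and the envelope shape are assumptions drawn from the comment, not the PR's actual code:

```typescript
// Read statusCode/status from the outer envelope before unwrapping the nested
// "error" object, so a shape like { status: 408, error: { message: "Request
// Timeout" } } keeps its 408 instead of falling through to pattern matching.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

function extractStatusCode(error: unknown): number | undefined {
  if (!isRecord(error)) return undefined;
  if (typeof error.statusCode === "number") return error.statusCode;
  if (typeof error.status === "number") return error.status;
  // Then fall through to the wrapped inner error object, if any.
  if ("error" in error) return extractStatusCode(error.error);
  return undefined;
}
```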
```typescript
if (
  typeof statusCode === "number" &&
  (statusCode >= 500 || RETRYABLE_STREAM_ERROR_STATUS_CODES.has(statusCode))
) {
  return true;
}
```
**`>= 500` makes the explicit Set redundant and retries non-retryable codes**

`statusCode >= 500` catches every 5xx code including non-transient ones like `501 Not Implemented` and `505 HTTP Version Not Supported`, which a provider would never recover from on a retry. Because the Set already enumerates the exact 5xx codes worth retrying (`500, 502, 503, 504`) alongside the 4xx ones (`408, 429`), the `>= 500` branch is both overly broad and redundant.
Consider replacing the condition with just the Set lookup:
```suggestion
if (
  typeof statusCode === "number" &&
  RETRYABLE_STREAM_ERROR_STATUS_CODES.has(statusCode)
) {
  return true;
}
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts
Line: 1366-1371
Comment:
**`>= 500` makes the explicit Set redundant and retries non-retryable codes**
`statusCode >= 500` catches every 5xx code including non-transient ones like `501 Not Implemented` and `505 HTTP Version Not Supported`, which a provider would never recover from on a retry. Because the Set already enumerates the exact 5xx codes worth retrying (`500, 502, 503, 504`) alongside the 4xx ones (`408, 429`), the `>= 500` branch is both overly broad and redundant.
Consider replacing the condition with just the Set lookup:
```suggestion
if (
typeof statusCode === "number" &&
RETRYABLE_STREAM_ERROR_STATUS_CODES.has(statusCode)
) {
return true;
}
```
How can I resolve this? If you propose a fix, please make it concise.
🔍 Dyadbot Code Review Summary

Verdict: 🤔 NOT SURE - Potential issues

Reviewed by 3 independent agents: Correctness Expert, Code Health Expert, UX Wizard.

Issues Summary
🟢 Low Priority Notes (4 items)
🚫 Dropped False Positives (3 items)
Generated by Dyadbot multi-agent code review
```diff
  } catch (err) {
    if (
-     shouldRetryTerminatedStreamError({
+     shouldRetryTransientStreamError({
```
Exhausted-retries telemetry missing for new provider errors
Low Severity
In the response_finalization phase, the telemetry guard at line 1096 still only checks isTerminatedStreamError(err). Since shouldRetryTransientStreamError now also retries isRetryableProviderStreamError errors, when those new provider errors exhaust retries, the terminated_stream_retries_exhausted telemetry event silently won't fire. The stream_iteration phase (line 1047) correctly fires telemetry unconditionally on exhaustion, making this an inconsistency between the two paths.
Additional Locations (1)
```typescript
  "too many requests",
  "rate_limit",
  "overloaded",
  "timeout",
```
🟡 MEDIUM | correctness
Bare timeout pattern is too broad
The substring 'timeout' will match any error whose message/code/type contains the word — including non-transient client-imposed timeouts (e.g., AbortSignal.timeout(), user-configured request timeouts, or messages like "connection timeout set too low"). This would cause up to 3 silent retries of non-retryable errors before surfacing the real failure.
The more specific 'etimedout' and 'gateway timeout' patterns already cover the network-level and HTTP 504 cases. Status code 408 is also handled by the Set.
💡 Suggestion: Remove the bare 'timeout' entry and rely on the existing specific patterns (etimedout, gateway timeout, status 408/504).
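The over-matching this comment describes is easy to reproduce with a plain substring check. The messages below are illustrative, not real provider output:

```typescript
// A bare "timeout" substring flags client-side aborts alongside genuinely
// transient failures; the specific patterns do not.
const bareMatch = (msg: string): boolean => msg.toLowerCase().includes("timeout");

const specificMatch = (msg: string): boolean =>
  ["etimedout", "gateway timeout"].some((p) => msg.toLowerCase().includes(p));

const transient = "HTTP 504 Gateway Timeout"; // worth retrying
const clientAbort = "The operation was aborted due to timeout"; // e.g. a client-imposed AbortSignal timeout
```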
```typescript
import { DEFAULT_MAX_TOOL_CALL_STEPS } from "@/constants/settings_constants";

const RETRYABLE_STREAM_ERROR_STATUS_CODES = new Set([
  408, 429, 500, 502, 503, 504,
]);
```
🟡 MEDIUM | user-experience
429 rate-limit errors retried too aggressively
429 (Too Many Requests) is included in RETRYABLE_STREAM_ERROR_STATUS_CODES and will be retried after only ~400ms-1200ms (linear backoff). Providers that emit 429 typically expect longer backoff (seconds to minutes) and may include a Retry-After header. Retrying in <2s will likely hit the rate limit again immediately, burning all 3 retry attempts and delaying the error the user sees by ~2.4s with no benefit.
💡 Suggestion: Either exclude 429 from automatic retry (and let the existing rate-limit error UI surface immediately), or apply a significantly longer minimum delay for 429s and respect Retry-After headers if present.
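One way to implement that suggestion — the 5-second floor, parameter names, and header plumbing are assumptions, not code from the PR:

```typescript
// Give 429s a longer minimum delay and honor Retry-After when provided,
// instead of the flat linear backoff used for other retryable statuses.
const STREAM_RETRY_BASE_DELAY_MS = 400;
const MIN_RATE_LIMIT_DELAY_MS = 5_000; // assumed floor, not from the PR

function retryDelayFor(
  statusCode: number,
  retryCount: number,
  retryAfterSeconds?: number,
): number {
  if (statusCode === 429) {
    const headerDelayMs = (retryAfterSeconds ?? 0) * 1000;
    return Math.max(headerDelayMs, MIN_RATE_LIMIT_DELAY_MS);
  }
  return STREAM_RETRY_BASE_DELAY_MS * retryCount;
}
```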
```diff
@@ -1338,7 +1394,7 @@ function shouldRetryTerminatedStreamError(params: {
   return (
```
🟡 MEDIUM | observability
Exhausted-retries telemetry misses new provider errors
The terminated_stream_retries_exhausted telemetry event in the response-finalization path (~line 1096) is still gated on isTerminatedStreamError(err). After this PR, retries can also be exhausted by provider-side errors (e.g., Azure server_error, 429s) matched by isRetryableProviderStreamError — but those cases silently skip telemetry.
Note: the stream-iteration exhaustion path (~line 1047) fires unconditionally, so only the response-finalization path has the gap.
💡 Suggestion: Change the condition at ~line 1096 to isTerminatedStreamError(err) || isRetryableProviderStreamError(err).
🟡 Missing "retries exhausted" telemetry for retryable provider errors in response finalization phase
The PR expands shouldRetryTransientStreamError (line 1066) to cover both terminated errors and retryable provider errors, but the fallback telemetry guard at local_agent_handler.ts:1096 still only checks isTerminatedStreamError(err). When a retryable provider error (e.g., 500 server_error) exhausts its MAX_TERMINATED_STREAM_RETRIES retries during the response finalization phase, no "local_agent:terminated_stream_retries_exhausted" telemetry event is emitted — unlike the stream iteration error site (local_agent_handler.ts:1047) which unconditionally sends the telemetry. This creates an observability blind spot for the newly added error types.
(Refers to lines 1096-1107)
1 issue found across 2 files
Confidence score: 3/5
- There is a concrete reliability risk in `src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts`: using `statusCode >= 500` broadens retries to all 5xx responses, including non-transient cases that are unlikely to succeed on retry.
- Because this issue is medium severity (6/10) with high confidence (9/10), it introduces meaningful behavior risk rather than a minor housekeeping concern.
- The retry logic may ignore the intended curated `RETRYABLE_STREAM_ERRORS` behavior, which could cause unnecessary retry loops and delayed failure handling for users.
- Pay close attention to `src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts` - ensure retry conditions exclude non-retryable 5xx statuses and align with the curated retry policy.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts">
<violation number="1" location="src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts:1368">
P2: `statusCode >= 500` catches every 5xx code including non-transient ones like `501 Not Implemented` and `505 HTTP Version Not Supported`, which will never recover on retry. This makes the curated `RETRYABLE_STREAM_ERROR_STATUS_CODES` Set redundant for 5xx codes and causes unnecessary retries for permanent failures. Replace the compound condition with just the Set lookup.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```typescript
if (
  typeof statusCode === "number" &&
  (statusCode >= 500 || RETRYABLE_STREAM_ERROR_STATUS_CODES.has(statusCode))
```
P2: statusCode >= 500 catches every 5xx code including non-transient ones like 501 Not Implemented and 505 HTTP Version Not Supported, which will never recover on retry. This makes the curated RETRYABLE_STREAM_ERROR_STATUS_CODES Set redundant for 5xx codes and causes unnecessary retries for permanent failures. Replace the compound condition with just the Set lookup.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts, line 1368:
<comment>`statusCode >= 500` catches every 5xx code including non-transient ones like `501 Not Implemented` and `505 HTTP Version Not Supported`, which will never recover on retry. This makes the curated `RETRYABLE_STREAM_ERROR_STATUS_CODES` Set redundant for 5xx codes and causes unnecessary retries for permanent failures. Replace the compound condition with just the Set lookup.</comment>
<file context>
@@ -1329,7 +1349,43 @@ function isTerminatedStreamError(error: unknown): boolean {
+
+ if (
+ typeof statusCode === "number" &&
+ (statusCode >= 500 || RETRYABLE_STREAM_ERROR_STATUS_CODES.has(statusCode))
+ ) {
+ return true;
</file context>
```suggestion
  RETRYABLE_STREAM_ERROR_STATUS_CODES.has(statusCode)
```
Removed 'timeout' from the list of error messages.
@BugBot run
🔍 Dyadbot Code Review Summary

Verdict: ✅ YES - Ready to merge

Reviewed by 3 independent agents: Correctness Expert, Code Health Expert, UX Wizard.

Issues Summary
🟢 Low Priority Notes (3 items)
🚫 Dropped False Positives (7 items)
Generated by Dyadbot multi-agent code review
```typescript
const MAX_TERMINATED_STREAM_RETRIES = 3;
const STREAM_RETRY_BASE_DELAY_MS = 400;
const STREAM_CONTINUE_MESSAGE =
  "[System] Your previous response stream was interrupted by a transient network error. Continue from exactly where you left off and do not repeat text that has already been sent.";
```
🟡 MEDIUM | user-experience / correctness
Continuation instruction semantically wrong for provider errors
STREAM_CONTINUE_MESSAGE says "interrupted by a transient network error. Continue from exactly where you left off" — but when a provider emits a structured server error (e.g. Azure server_error) before any text was streamed, there is nothing to continue from. The model receives a misleading instruction to "continue" non-existent partial output.
For the existing terminated-stream path this was appropriate (the TCP connection dropped mid-response), but provider errors can fire before any output is generated.
💡 Suggestion: Only set needsContinuationInstruction = true when fullResponse is non-empty, or use a distinct retry message for provider errors that fired before any output (e.g. a simple "Please retry the request" instead of "continue from where you left off").
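A minimal sketch of that suggestion — pick the message based on whether any output was actually streamed. The continue message is from the diff; the alternate retry message text is an assumption:

```typescript
const STREAM_CONTINUE_MESSAGE =
  "[System] Your previous response stream was interrupted by a transient network error. Continue from exactly where you left off and do not repeat text that has already been sent.";
// Hypothetical alternate message for retries that fired before any output.
const STREAM_RETRY_MESSAGE =
  "[System] The previous request failed with a transient provider error. Please retry the request.";

// Only ask the model to "continue" when there is partial output to continue from.
function retryInstruction(fullResponse: string): string {
  return fullResponse.length > 0 ? STREAM_CONTINUE_MESSAGE : STREAM_RETRY_MESSAGE;
}
```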
🎭 Playwright Test Results

❌ Some tests failed

Summary: 787 passed, 9 failed, 11 flaky, 252 skipped

Failed Tests

🍎 macOS

🪟 Windows

📋 Re-run Failing Tests (macOS)

Copy and paste to re-run all failing spec files locally:

```shell
npm run e2e \
  e2e-tests/queued_message.spec.ts
```


Summary

`server_error`

Test plan

🤖 Generated with Claude Code