Merged
80 changes: 80 additions & 0 deletions src/__tests__/local_agent_handler.test.ts
@@ -1026,6 +1026,86 @@ describe("handleLocalAgentStream", () => {
expect(hasReplayedToolCall).toBe(true);
expect(hasReplayedToolResult).toBe(true);
});

it("should retry and resume when the provider emits a retryable server error", async () => {
  // Arrange
  const { event, getMessagesByChannel } = createFakeEvent();
  mockSettings = buildTestSettings({ enableDyadPro: true });
  mockChatData = buildTestChat();

  const streamMessagesByAttempt: any[][] = [];
  let attemptCount = 0;
  mockStreamTextImpl = (options) => {
    attemptCount += 1;
    streamMessagesByAttempt.push(options.messages ?? []);

    if (attemptCount === 1) {
      return {
        fullStream: (async function* () {
          throw {
            type: "error",
            sequence_number: 0,
            error: {
              type: "server_error",
              code: "server_error",
              message: "The server had an error processing your request.",
            },
          };
        })(),
        response: Promise.resolve({ messages: [] }),
        steps: Promise.resolve([]),
      };
    }

    return {
      fullStream: (async function* () {
        yield { type: "text-delta", text: "Recovered after retry." };
      })(),
      response: Promise.resolve({
        messages: [
          {
            role: "assistant",
            content: [{ type: "text", text: "Recovered after retry." }],
          },
        ],
      }),
      steps: Promise.resolve([{ toolCalls: [] }]),
    };
  };

  // Act
  await handleLocalAgentStream(
    event,
    { chatId: 1, prompt: "test" },
    new AbortController(),
    {
      placeholderMessageId: 10,
      systemPrompt: "You are helpful",
      dyadRequestId,
    },
  );

  // Assert
  expect(attemptCount).toBe(2);
  expect(getMessagesByChannel("chat:response:error")).toHaveLength(0);

  const continuationInstructionFound = (
    streamMessagesByAttempt[1] ?? []
  ).some(
    (message: any) =>
      message.role === "user" &&
      Array.isArray(message.content) &&
      message.content.some(
        (part: any) =>
          part.type === "text" &&
          typeof part.text === "string" &&
          part.text.includes(
            "previous response stream was interrupted by a transient network error",
          ),
      ),
  );
  expect(continuationInstructionFound).toBe(true);
});
});

describe("Stream processing - reasoning blocks", () => {
66 changes: 61 additions & 5 deletions src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts
🟡 Missing "retries exhausted" telemetry for retryable provider errors in response finalization phase

The PR expands shouldRetryTransientStreamError (line 1066) to cover both terminated errors and retryable provider errors, but the fallback telemetry guard at local_agent_handler.ts:1096 still only checks isTerminatedStreamError(err). When a retryable provider error (e.g., 500 server_error) exhausts its MAX_TERMINATED_STREAM_RETRIES retries during the response finalization phase, no "local_agent:terminated_stream_retries_exhausted" telemetry event is emitted — unlike the stream iteration error site (local_agent_handler.ts:1047) which unconditionally sends the telemetry. This creates an observability blind spot for the newly added error types.

(Refers to lines 1096-1107)
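The widened guard this comment asks for can be sketched as follows. The helper names mirror the PR, but the simplified error shape and the stand-in checks are illustrative only:

```typescript
// Simplified error shape for illustration; the real handler inspects
// richer provider error objects.
type StreamError = { code?: string; statusCode?: number };

// Stand-in for the real isTerminatedStreamError check (connection dropped).
const isTerminatedStreamError = (err: StreamError): boolean =>
  err.code === "ECONNRESET";

// Stand-in for the PR's isRetryableProviderStreamError (HTTP-level errors).
const isRetryableProviderStreamError = (err: StreamError): boolean =>
  typeof err.statusCode === "number" &&
  [408, 429, 500, 502, 503, 504].includes(err.statusCode);

// Widened guard: emit retries-exhausted telemetry for every error class
// that shouldRetryTransientStreamError is willing to retry, not just
// terminated streams.
function shouldEmitRetriesExhaustedTelemetry(err: StreamError): boolean {
  return isTerminatedStreamError(err) || isRetryableProviderStreamError(err);
}
```

This keeps the finalization path consistent with the stream-iteration path, which already reports exhaustion unconditionally.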


@@ -80,14 +80,34 @@ import {
checkAndMarkForCompaction,
} from "@/ipc/handlers/compaction/compaction_handler";
import { getPostCompactionMessages } from "@/ipc/handlers/compaction/compaction_utils";
import { DEFAULT_MAX_TOOL_CALL_STEPS } from "@/constants/settings_constants";

const logger = log.scope("local_agent_handler");
const PLANNING_QUESTIONNAIRE_TOOL_NAME = "planning_questionnaire";
const MAX_TERMINATED_STREAM_RETRIES = 3;
const STREAM_RETRY_BASE_DELAY_MS = 400;
const STREAM_CONTINUE_MESSAGE =
  "[System] Your previous response stream was interrupted by a transient network error. Continue from exactly where you left off and do not repeat text that has already been sent.";
🟡 MEDIUM | user-experience / correctness

Continuation instruction semantically wrong for provider errors

STREAM_CONTINUE_MESSAGE says "interrupted by a transient network error. Continue from exactly where you left off" — but when a provider emits a structured server error (e.g. Azure server_error) before any text was streamed, there is nothing to continue from. The model receives a misleading instruction to "continue" non-existent partial output.

For the existing terminated-stream path this was appropriate (the TCP connection dropped mid-response), but provider errors can fire before any output is generated.

💡 Suggestion: Only set needsContinuationInstruction = true when fullResponse is non-empty, or use a distinct retry message for provider errors that fired before any output (e.g. a simple "Please retry the request" instead of "continue from where you left off").
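One way to implement this suggestion, as a sketch: only `STREAM_CONTINUE_MESSAGE` comes from the PR, while `STREAM_RETRY_MESSAGE` and `buildRetryInstruction` are hypothetical names for the proposed split.

```typescript
// From the PR: the instruction for resuming a stream that dropped mid-response.
const STREAM_CONTINUE_MESSAGE =
  "[System] Your previous response stream was interrupted by a transient network error. " +
  "Continue from exactly where you left off and do not repeat text that has already been sent.";

// Hypothetical alternative for errors that fired before any output existed.
const STREAM_RETRY_MESSAGE =
  "[System] The previous request failed with a transient server error before " +
  "any output was produced. Please retry the request.";

// Hypothetical helper: only ask the model to "continue" when there is
// partial output to continue from.
function buildRetryInstruction(fullResponse: string): string {
  return fullResponse.length > 0 ? STREAM_CONTINUE_MESSAGE : STREAM_RETRY_MESSAGE;
}
```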

import { DEFAULT_MAX_TOOL_CALL_STEPS } from "@/constants/settings_constants";

const RETRYABLE_STREAM_ERROR_STATUS_CODES = new Set([
  408, 429, 500, 502, 503, 504,

🟡 MEDIUM | user-experience

429 rate-limit errors retried too aggressively

429 (Too Many Requests) is included in RETRYABLE_STREAM_ERROR_STATUS_CODES and will be retried after only ~400ms-1200ms (linear backoff). Providers that emit 429 typically expect longer backoff (seconds to minutes) and may include a Retry-After header. Retrying in <2s will likely hit the rate limit again immediately, burning all 3 retry attempts and delaying the error the user sees by ~2.4s with no benefit.

💡 Suggestion: Either exclude 429 from automatic retry (and let the existing rate-limit error UI surface immediately), or apply a significantly longer minimum delay for 429s and respect Retry-After headers if present.
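A sketch of one possible shape for a 429-aware delay. `STREAM_RETRY_BASE_DELAY_MS` mirrors the PR's constant; `computeRetryDelayMs` and `MIN_RATE_LIMIT_DELAY_MS` are hypothetical:

```typescript
const STREAM_RETRY_BASE_DELAY_MS = 400; // mirrors the PR's constant
const MIN_RATE_LIMIT_DELAY_MS = 5_000; // hypothetical floor for 429s

// Hypothetical delay helper: linear backoff for generic transient errors,
// but a longer floor for 429s that also honors a parsed Retry-After header.
function computeRetryDelayMs(params: {
  attempt: number; // 1-based retry attempt
  statusCode?: number;
  retryAfterSeconds?: number; // parsed Retry-After header, if present
}): number {
  const linear = STREAM_RETRY_BASE_DELAY_MS * params.attempt; // 400, 800, 1200
  if (params.statusCode !== 429) {
    return linear;
  }
  const retryAfterMs = (params.retryAfterSeconds ?? 0) * 1000;
  return Math.max(linear, MIN_RATE_LIMIT_DELAY_MS, retryAfterMs);
}
```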

]);
const RETRYABLE_STREAM_ERROR_PATTERNS = [
  "server_error",
  "internal server error",
  "service unavailable",
  "bad gateway",
  "gateway timeout",
  "too many requests",
  "rate_limit",
  "overloaded",
  "timeout",

🟡 MEDIUM | correctness

Bare timeout pattern is too broad

The substring 'timeout' will match any error whose message/code/type contains the word — including non-transient client-imposed timeouts (e.g., AbortSignal.timeout(), user-configured request timeouts, or messages like "connection timeout set too low"). This would cause up to 3 silent retries of non-retryable errors before surfacing the real failure.

The more specific 'etimedout' and 'gateway timeout' patterns already cover the network-level and HTTP 504 cases. Status code 408 is also handled by the Set.

💡 Suggestion: Remove the bare 'timeout' entry and rely on the existing specific patterns (etimedout, gateway timeout, status 408/504).
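The narrowed list would look like this (a sketch: the pattern entries come from the PR minus the bare "timeout", and the matching helper is illustrative):

```typescript
// Pattern list per the suggestion above: bare "timeout" removed, the
// specific network-level and HTTP patterns retained.
const RETRYABLE_STREAM_ERROR_PATTERNS = [
  "server_error",
  "internal server error",
  "service unavailable",
  "bad gateway",
  "gateway timeout",
  "too many requests",
  "rate_limit",
  "overloaded",
  "econnrefused",
  "enotfound",
  "econnreset",
  "epipe",
  "etimedout",
];

// Illustrative stand-in for the error-string matching in the handler.
function matchesRetryablePattern(message: string): boolean {
  const lower = message.toLowerCase();
  return RETRYABLE_STREAM_ERROR_PATTERNS.some((p) => lower.includes(p));
}
```

With this list, `connect ETIMEDOUT` still matches via "etimedout", while a client-imposed "request timeout set too low" no longer triggers retries.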

  "econnrefused",
  "enotfound",
  "econnreset",
  "epipe",
  "etimedout",
Comment on lines +92 to +108

medium

Consider grouping these new constants (RETRYABLE_STREAM_ERROR_STATUS_CODES, RETRYABLE_STREAM_ERROR_PATTERNS) with the other related constants (e.g., MAX_TERMINATED_STREAM_RETRIES, STREAM_RETRY_BASE_DELAY_MS, STREAM_CONTINUE_MESSAGE) for better organization and readability. This would make it easier to find all configuration-like values in one place.

Suggested change
const MAX_TERMINATED_STREAM_RETRIES = 3;
const STREAM_RETRY_BASE_DELAY_MS = 400;
const STREAM_CONTINUE_MESSAGE =
  "[System] Your previous response stream was interrupted by a transient network error. Continue from exactly where you left off and do not repeat text that has already been sent.";

const RETRYABLE_STREAM_ERROR_STATUS_CODES = new Set([
  408, 429, 500, 502, 503, 504,
]);
const RETRYABLE_STREAM_ERROR_PATTERNS = [
  "server_error",
  "internal server error",
  "service unavailable",
  "bad gateway",
  "gateway timeout",
  "too many requests",
  "rate_limit",
  "overloaded",
  "timeout",
  "econnrefused",
  "enotfound",
  "econnreset",
  "epipe",
  "etimedout",
];

];

// ============================================================================
// Tool Streaming State Management
@@ -994,7 +1014,7 @@ export async function handleLocalAgentStream(
streamErrorFromIteration ?? streamErrorFromCallback;
if (streamError) {
if (
shouldRetryTerminatedStreamError({
shouldRetryTransientStreamError({
error: streamError,
retryCount: terminatedRetryCount,
aborted: abortController.signal.aborted,
@@ -1043,7 +1063,7 @@
responseMessages = response.messages;
} catch (err) {
if (
shouldRetryTerminatedStreamError({
shouldRetryTransientStreamError({

Exhausted-retries telemetry missing for new provider errors

Low Severity

In the response_finalization phase, the telemetry guard at line 1096 still only checks isTerminatedStreamError(err). Since shouldRetryTransientStreamError now also retries isRetryableProviderStreamError errors, when those new provider errors exhaust retries, the terminated_stream_retries_exhausted telemetry event silently won't fire. The stream_iteration phase (line 1047) correctly fires telemetry unconditionally on exhaustion, making this an inconsistency between the two paths.


error: err,
retryCount: terminatedRetryCount,
aborted: abortController.signal.aborted,
@@ -1329,7 +1349,43 @@ function isTerminatedStreamError(error: unknown): boolean {
  return false;
}

function shouldRetryTerminatedStreamError(params: {
function isRetryableProviderStreamError(error: unknown): boolean {
  const normalized = unwrapStreamError(error);
  if (!isRecord(normalized)) {
    return false;
  }

  const statusCode =
    (typeof normalized.statusCode === "number" && normalized.statusCode) ||
    (typeof normalized.status === "number" && normalized.status) ||
    (isRecord(normalized.response) &&
    typeof normalized.response.status === "number"
      ? normalized.response.status
      : undefined);

  if (
    typeof statusCode === "number" &&
    (statusCode >= 500 || RETRYABLE_STREAM_ERROR_STATUS_CODES.has(statusCode))
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: statusCode >= 500 catches every 5xx code including non-transient ones like 501 Not Implemented and 505 HTTP Version Not Supported, which will never recover on retry. This makes the curated RETRYABLE_STREAM_ERROR_STATUS_CODES Set redundant for 5xx codes and causes unnecessary retries for permanent failures. Replace the compound condition with just the Set lookup.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/pro/main/ipc/handlers/local_agent/local_agent_handler.ts, line 1368:

<comment>`statusCode >= 500` catches every 5xx code including non-transient ones like `501 Not Implemented` and `505 HTTP Version Not Supported`, which will never recover on retry. This makes the curated `RETRYABLE_STREAM_ERROR_STATUS_CODES` Set redundant for 5xx codes and causes unnecessary retries for permanent failures. Replace the compound condition with just the Set lookup.</comment>

<file context>
@@ -1329,7 +1349,43 @@ function isTerminatedStreamError(error: unknown): boolean {
+
+  if (
+    typeof statusCode === "number" &&
+    (statusCode >= 500 || RETRYABLE_STREAM_ERROR_STATUS_CODES.has(statusCode))
+  ) {
+    return true;
</file context>
Suggested change
(statusCode >= 500 || RETRYABLE_STREAM_ERROR_STATUS_CODES.has(statusCode))
RETRYABLE_STREAM_ERROR_STATUS_CODES.has(statusCode)
Fix with Cubic

  ) {
    return true;
  }
Comment on lines +1365 to +1370

P1 >= 500 makes the explicit Set redundant and retries non-retryable codes

statusCode >= 500 catches every 5xx code including non-transient ones like 501 Not Implemented and 505 HTTP Version Not Supported, which a provider would never recover from on a retry. Because the Set already enumerates the exact 5xx codes worth retrying (500, 502, 503, 504) alongside the 4xx ones (408, 429), the >= 500 branch is both overly broad and redundant.

Consider replacing the condition with just the Set lookup:

Suggested change
  if (
    typeof statusCode === "number" &&
    (statusCode >= 500 || RETRYABLE_STREAM_ERROR_STATUS_CODES.has(statusCode))
  ) {
    return true;
  }
  if (
    typeof statusCode === "number" &&
    RETRYABLE_STREAM_ERROR_STATUS_CODES.has(statusCode)
  ) {
    return true;
  }

  const errorString =
    [
      typeof normalized.message === "string" ? normalized.message : undefined,
      typeof normalized.code === "string" ? normalized.code : undefined,
      typeof normalized.type === "string" ? normalized.type : undefined,
    ]
      .filter(Boolean)
      .join(" ")
      .toLowerCase() || getErrorMessage(normalized).toLowerCase();

  return RETRYABLE_STREAM_ERROR_PATTERNS.some((pattern) =>
    errorString.includes(pattern),
  );
}

function shouldRetryTransientStreamError(params: {
  error: unknown;
  retryCount: number;
  aborted: boolean;
@@ -1338,7 +1394,7 @@ function shouldRetryTerminatedStreamError(params: {
  return (

🟡 MEDIUM | observability

Exhausted-retries telemetry misses new provider errors

The terminated_stream_retries_exhausted telemetry event in the response-finalization path (~line 1096) is still gated on isTerminatedStreamError(err). After this PR, retries can also be exhausted by provider-side errors (e.g., Azure server_error, 429s) matched by isRetryableProviderStreamError — but those cases silently skip telemetry.

Note: the stream-iteration exhaustion path (~line 1047) fires unconditionally, so only the response-finalization path has the gap.

💡 Suggestion: Change the condition at ~line 1096 to isTerminatedStreamError(err) || isRetryableProviderStreamError(err).

    !aborted &&
    retryCount < MAX_TERMINATED_STREAM_RETRIES &&
    isTerminatedStreamError(error)
    (isTerminatedStreamError(error) || isRetryableProviderStreamError(error))
  );
}
}
