fix(agent): retry transient provider failures#1457

Closed
Alix-007 wants to merge 3 commits into sipeed:main from Alix-007:pr/issue-629-transient-retry

@Alix-007
Contributor

Summary

  • retry transient single-provider LLM failures inside the agent loop instead of aborting immediately
  • reuse provider error classification so that HTTP 500/502/503-style failures are treated as temporary outages
  • add a regression test proving the agent recovers after one transient provider failure

Fixes #629

Test plan

  • go test ./pkg/agent -run TestAgentLoop_ContextExhaustionRetry -count=1
  • go test ./pkg/agent -run TestAgentLoop_TransientProviderRetry -count=1

@Alix-007
Contributor Author

Pushed an updated branch for this PR. It keeps the transient single-provider retry fix, resolves the current pkg/agent/loop drift against main, and keeps the targeted regression in place.

Local verification:

  • go test ./pkg/agent -run TestAgentLoop_ -count=1

@Alix-007
Contributor Author

Pushed a follow-up lint-only fix on top of the retry update to satisfy golines in pkg/agent/loop.go and reran go test ./pkg/agent -run TestAgentLoop_ -count=1 locally.

@yinwm yinwm mentioned this pull request Mar 18, 2026
7 tasks
@yinwm yinwm added this to the Refactor Agent milestone Mar 18, 2026
@Alix-007 Alix-007 force-pushed the pr/issue-629-transient-retry branch from 8e0727b to f026151 Compare March 19, 2026 03:17
@Alix-007
Contributor Author

Refreshed this branch onto the current main and force-updated the PR head.

During the refresh I explicitly dropped an old conflict-resolution commit that had accumulated unrelated registry/provider reload changes, so the PR is back to the intended narrow scope: transient retry handling for single-provider calls plus the focused regression test.

Local verification after the refresh:

  • go test ./pkg/agent -run 'TestAgentLoop_' -count=1
  • go run github.com/golangci/golangci-lint/v2/cmd/golangci-lint@v2.10.1 run ./pkg/agent/...

GitHub PR checks are green again.

@Alix-007
Contributor Author

Closing this to reduce queue noise. The branch is conflicting again, the issue lane already has another open carrier (#866), and this PR has aged without maintainer review despite several newer focused fixes being merged first. If we revisit this bug, it should come back as a fresh clean-base follow-up rather than continue on this stale branch.

@Alix-007 Alix-007 closed this Mar 25, 2026
Labels

  • domain: agent
  • go (Pull requests that update go code)
  • type: bug (Something isn't working)

Development

Successfully merging this pull request may close these issues.

[BUG] Didn't retry if meet LLM call failed

2 participants