fix: add circuit breakers to prevent re-queue infinite retry loop#80

Closed
frostebite wants to merge 1 commit into game-ci:main from
frostebite:fix/retry-loop-circuit-breakers

Conversation


@frostebite frostebite commented Mar 14, 2026

Summary

Three targeted fixes that close the race windows causing cascading job failures and infinite re-dispatch loops in the build queue scheduler. These changes are minimal and surgical -- each addresses a specific failure mode that has been observed in production.

  • Make registerNewBuild idempotent so that network-timeout retries of the "started" report no longer throw, preventing the cascading failure that marks jobs as failed and halts the scheduler when failingJobs.length > maxToleratedFailures
  • Add retry limits to base/hub image dispatch so the scheduler stops re-dispatching workflows every 15-minute cron cycle when a base or hub job is stuck in "created" or "failed" state (editor images already had this protection via the Ingeminator)
  • Allow created -> inProgress job transition so a workflow's reportNewBuild call moves the job out of the schedulable state even if the scheduler crashed before writing created -> scheduled, preventing duplicate dispatch

Detailed Changes

Fix 1: Idempotent registerNewBuild (functions/src/model/ciBuilds.ts)

Problem: When a GitHub Actions workflow calls reportNewBuild with status: started but the HTTP response is lost (network timeout, Firebase cold start), the action's error handler calls reportBuildFailure, marking the job as failed. The next cron cycle retries the workflow, which calls reportNewBuild again -- but now the build already exists with status "started", so it throws "A build with X as identifier already exists", causing another failure report, ad infinitum.

Fix:

  • If build exists with status "started" and same jobId: silently return (idempotent retry)
  • If build exists with status "started" and different jobId: throw with descriptive error (indicates a real bug)
  • If build exists with status "published": silently return (build already completed)
  • If build exists with status "failed": overwrite with new "started" (existing behavior, preserved)

This is essentially what #73 proposed.
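As a rough sketch, the branching above can be expressed as a pure decision function. The `ExistingBuild` shape, `decideRegisterAction` name, and `RegisterAction` values are illustrative stand-ins, not the actual ciBuilds.ts API:

```typescript
// Illustrative decision table for registerNewBuild retries; the real
// implementation reads a Firestore snapshot instead of this struct.
type BuildStatus = 'started' | 'failed' | 'published';

interface ExistingBuild {
  status: BuildStatus;
  jobId: string;
}

type RegisterAction = 'create' | 'skip' | 'overwrite' | 'conflict';

function decideRegisterAction(existing: ExistingBuild | null, incomingJobId: string): RegisterAction {
  if (existing === null) return 'create'; // no build yet: normal path
  switch (existing.status) {
    case 'started':
      // Same job retrying after a lost response: idempotent no-op.
      // A different job claiming the same build id indicates a real bug.
      return existing.jobId === incomingJobId ? 'skip' : 'conflict';
    case 'published':
      return 'skip'; // build already completed
    case 'failed':
      return 'overwrite'; // existing retry behavior, preserved
  }
}
```

The `conflict` branch is where the descriptive error would be thrown; every other branch either returns silently or writes.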

Fix 2: Retry limits for base/hub image dispatch (functions/src/logic/buildQueue/scheduler.ts, functions/src/model/ciJobs.ts)

Problem: ensureThatBaseImageHasBeenBuilt() and ensureThatHubImageHasBeenBuilt() re-dispatch GitHub workflows on every cron cycle (every 15 minutes) when the job is in "created" or "failed" state, with no backoff and no retry limit. Editor images are protected by the Ingeminator's maxFailuresPerBuild check, but base/hub images have no such protection.

Fix:

  • Added CiJobs.hasExceededRetryLimit() helper that checks job.meta.failureCount >= maxRetries
  • Both ensureThatBaseImageHasBeenBuilt() and ensureThatHubImageHasBeenBuilt() now check this limit (using the existing settings.maxFailuresPerBuild = 15) before dispatching
  • When the limit is reached, logs a warning and sends a Discord alert asking for manual intervention
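A minimal sketch of the guard described above, assuming a reduced `CiJob` shape with only the field the check reads (the real model and alerting code are more involved):

```typescript
// Reduced stand-in for the CiJob model: only the field the check needs.
interface CiJob {
  meta: { failureCount: number };
}

// Mirrors the described check: job.meta.failureCount >= maxRetries.
function hasExceededRetryLimit(job: CiJob, maxRetries: number): boolean {
  return job.meta.failureCount >= maxRetries;
}

// Guard evaluated before dispatching a base/hub workflow; the Discord
// alert is elided here and replaced by a warning log.
function shouldDispatch(job: CiJob, maxFailuresPerBuild: number = 15): boolean {
  if (hasExceededRetryLimit(job, maxFailuresPerBuild)) {
    console.warn(`Retry limit (${maxFailuresPerBuild}) reached; skipping dispatch.`);
    return false;
  }
  return true;
}
```

When the count is below the limit, dispatch proceeds exactly as before; the guard only adds an early exit.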

Fix 3: Allow created -> inProgress transition (functions/src/model/ciJobs.ts)

Problem: The scheduler dispatches a GitHub workflow first, then updates Firestore (created -> scheduled). If the scheduler crashes between these two steps, the job stays created even though a workflow is running. The next cron cycle picks it up again and dispatches a duplicate workflow.

When that workflow calls reportNewBuild, markJobAsInProgress only accepts "scheduled" status, so the job stays "created" -- leaving it eligible for yet another duplicate dispatch.

Fix: markJobAsInProgress now accepts both "created" and "scheduled" statuses, so the workflow's report moves the job out of the schedulable state regardless of whether the scheduler's markJobAsScheduled call succeeded.

This is essentially what #74 proposed.
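The widened transition can be sketched as a simple eligibility check. The status names follow the PR text; the function shape is illustrative, not the real ciJobs.ts signature:

```typescript
// Job lifecycle states as described in the PR text.
type JobStatus = 'created' | 'scheduled' | 'inProgress' | 'completed' | 'failed';

// Previously only 'scheduled' could move to inProgress; accepting
// 'created' as well means a workflow's report always advances the job,
// even if the scheduler crashed before writing created -> scheduled.
const IN_PROGRESS_ELIGIBLE: ReadonlyArray<JobStatus> = ['created', 'scheduled'];

function canMarkInProgress(status: JobStatus): boolean {
  return IN_PROGRESS_ELIGIBLE.includes(status);
}
```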

Related

Test plan

  • Verify TypeScript compiles cleanly (npx tsc --noEmit passes with zero errors)
  • Review each change against the existing Firestore state machine to confirm no unintended state transitions
  • Verify the maxFailuresPerBuild setting (15) is appropriate for base/hub images
  • Consider deploying to a staging Firebase project and simulating the failure scenarios

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Implemented automatic retry limit enforcement to prevent indefinite job re-dispatch attempts.
    • Added Discord notifications when builds exceed configured retry limits.
    • Enhanced concurrent build registration with improved conflict detection and idempotent retry handling.
    • Improved job status transition logic for more reliable lifecycle management.

Three targeted fixes that close the race windows causing cascading
job failures and infinite re-dispatch loops:

1. Make registerNewBuild idempotent (ciBuilds.ts)
   - If build already exists with status "started" and same jobId,
     silently succeed (handles network timeout retries)
   - If build already exists with status "published", silently succeed
   - If build already exists with status "failed", overwrite with
     "started" (existing retry behavior, preserved)
   - If build exists with "started" but different jobId, throw with
     a descriptive error message
   Inspired by PR game-ci#73.

2. Add retry limits to base/hub image dispatch (scheduler.ts)
   - Check job failureCount against maxFailuresPerBuild (15) before
     re-dispatching base or hub image workflows
   - Log a warning and send a Discord alert when the limit is reached
   - Prevents infinite re-dispatch on every cron cycle when a
     base/hub job is stuck in "created" or "failed" state
   Uses new CiJobs.hasExceededRetryLimit() helper (ciJobs.ts).

3. Allow created -> inProgress transition (ciJobs.ts)
   - markJobAsInProgress now accepts jobs with status "created" in
     addition to "scheduled"
   - Closes the race window where scheduler dispatches a workflow but
     crashes before updating Firestore from "created" to "scheduled"
   - The workflow's reportNewBuild call now moves the job out of the
     schedulable state regardless of whether the scheduler updated it
   Inspired by PR game-ci#74.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

coderabbitai bot commented Mar 14, 2026

📝 Walkthrough

Three CI/job management files were modified to add retry-limit checks before dispatching build events and improve concurrent build registration handling. A new utility method validates whether job retry limits are exceeded, while enhanced status transition logic distinguishes between failed, started, and published build states.

Changes

  • Retry Limit Checks and Build Registration (functions/src/logic/buildQueue/scheduler.ts, functions/src/model/ciBuilds.ts, functions/src/model/ciJobs.ts): Added pre-dispatch retry-limit validation in two scheduling paths with Discord alerts; enhanced build registration to handle concurrent/duplicate requests with conflict detection for mismatched job IDs; introduced new hasExceededRetryLimit utility method; expanded markJobAsInProgress status transitions to accept both created and scheduled states.

Sequence Diagram

sequenceDiagram
    participant Scheduler
    participant CiJobs
    participant Discord
    participant GitHub
    
    Scheduler->>CiJobs: hasExceededRetryLimit(job, maxFailures)?
    CiJobs-->>Scheduler: true/false
    
    alt Retry Limit Exceeded
        Scheduler->>Scheduler: Log warning
        Scheduler->>Discord: Send alert (job type + max limit)
        Scheduler-->>Scheduler: Return false, abort dispatch
    else Retry Limit Not Exceeded
        Scheduler->>GitHub: Dispatch event (Base/Hub image)
        GitHub-->>Scheduler: Handle response
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested reviewers

  • GabLeRoux

Poem

🐰 Retry limits checked with care,
Builds queued with conflict-aware flair,
Discord alerts when limits near,
Concurrent requests, crystal clear!
The scheduler hops—no doubts remain. 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check (❓ Inconclusive): The PR description is comprehensive and well-structured, covering summary, detailed changes, and test plan. However, the required checklist section from the template is incomplete (no entries checked or marked as not needed). Resolution: complete the checklist section, confirming whether the readme needs updating and whether tests were added/updated, then mark items accordingly.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title clearly and specifically identifies the main change: adding circuit breakers to prevent infinite retry loops in the scheduler.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage; skipping the check.



@coderabbitai coderabbitai bot left a comment


⚠️ Outside diff range comments (1)
functions/src/model/ciBuilds.ts (1)

133-168: ⚠️ Potential issue | 🟠 Major

Wrap registerNewBuild in a transaction to prevent concurrent race conditions.

The code reads document state at line 134, then conditionally creates at line 167 without transactional protection. Two concurrent requests can both read "missing," causing the second create to fail and breaking idempotent behavior.

Wrap the entire read-check-act flow in db.runTransaction() to ensure atomicity and prevent concurrent conflicts.
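A hedged sketch of what that refactor might look like. The `Transaction` and `Snapshot` interfaces below are minimal stand-ins so the flow is visible without firebase-admin; in the real code this would be the body passed to `db.runTransaction(async (tx) => ...)` using `FirebaseFirestore.Transaction`:

```typescript
// Minimal stand-ins for the Firestore types involved.
interface Snapshot {
  exists: boolean;
  data: { status: string; jobId: string } | null;
}

interface Transaction {
  get(ref: string): Promise<Snapshot>;
  create(ref: string, data: object): void;
  set(ref: string, data: object, opts?: { merge: boolean }): void;
}

// All reads and writes go through the same transaction, so two
// concurrent registrations cannot both observe "missing" and race
// on the create.
async function registerNewBuildTx(
  tx: Transaction,
  ref: string,
  build: { status: string; jobId: string },
): Promise<void> {
  const snap = await tx.get(ref);
  if (!snap.exists) {
    tx.create(ref, build); // no build yet: normal creation path
    return;
  }
  const { status, jobId } = snap.data!;
  if (status === 'failed') {
    tx.set(ref, build, { merge: true }); // retry after failure: overwrite
  } else if (status === 'started' && jobId !== build.jobId) {
    throw new Error(`Build already started by job ${jobId}`);
  }
  // "started" with the same jobId, or "published": idempotent no-op.
}
```

Firestore retries the transaction callback on contention, so the error/return behavior seen by callers stays the same as the non-transactional version.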

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@functions/src/model/ciBuilds.ts` around lines 133 - 168, The read-check-act
flow in registerNewBuild (the logic that gets doc ref for CiBuilds.collection,
inspects snapshot.exists/status, and then either ref.create(data) or
ref.set(...mergeFields)) must be run inside db.runTransaction() to prevent
races; refactor so you call db.runTransaction(async (tx) => { const snap = await
tx.get(ref); /* same status checks: if failed -> tx.set(ref, data, {mergeFields:
[...]}); if started -> validate relatedJobId and return/throw as before; if
published -> return; else -> throw; if not exists -> tx.create(ref, data) */ });
ensure all reads/creates/sets use tx.get/tx.create/tx.set and preserve the same
error/return behavior.
🧹 Nitpick comments (1)
functions/src/logic/buildQueue/scheduler.ts (1)

93-95: Consider deduplicating retry-limit alerts to reduce alert fatigue.

When a job stays over the threshold, each scheduler cycle sends another Discord.sendAlert, which can create ongoing noise. Consider persisting an “alert sent” marker (or cooldown timestamp) so alerts fire once per threshold crossing.

Also applies to: 144-146
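One possible shape for that deduplication, sketched with an in-memory map as a stand-in for persisting a marker on the job document (the cooldown value is an arbitrary example, not a project setting):

```typescript
// In-memory stand-in for a persisted "alert sent" marker per job.
const lastAlertAt = new Map<string, number>();
const ALERT_COOLDOWN_MS = 6 * 60 * 60 * 1000; // 6h, example value only

// Returns true (and records the send) only on the first threshold
// crossing or after the cooldown has expired.
function shouldSendAlert(jobId: string, now: number = Date.now()): boolean {
  const last = lastAlertAt.get(jobId);
  if (last !== undefined && now - last < ALERT_COOLDOWN_MS) {
    return false; // alerted recently; stay quiet this cycle
  }
  lastAlertAt.set(jobId, now);
  return true;
}

// Clear the marker when a job drops back below the threshold so the
// next crossing alerts again.
function resetAlertMarker(jobId: string): void {
  lastAlertAt.delete(jobId);
}
```

In the scheduler this guard would wrap the existing `Discord.sendAlert` call; since Cloud Functions instances are ephemeral, a real implementation would persist the timestamp on the job's Firestore document rather than in process memory.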

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@functions/src/logic/buildQueue/scheduler.ts` around lines 93 - 95, The
scheduler currently calls logger.warn and Discord.sendAlert every cycle for jobs
over the retry threshold; change it to deduplicate alerts by recording an "alert
sent" marker or cooldown timestamp for each job (e.g., in the job's metadata or
a short-lived store) and only call Discord.sendAlert when the job first crosses
the threshold (or after the cooldown expires). Update the logic around the
existing logger.warn / Discord.sendAlert calls in scheduler.ts (the scheduler
loop / retry-limit handling) to check this marker before sending, set the marker
when sending, and clear/reset it when the job falls back below the threshold so
future threshold crossings will alert again.


📥 Commits

Reviewing files that changed from the base of the PR and between b6cb391 and dcb6267.

📒 Files selected for processing (3)
  • functions/src/logic/buildQueue/scheduler.ts
  • functions/src/model/ciBuilds.ts
  • functions/src/model/ciJobs.ts

@frostebite

Readiness & Safety Assessment

What do these fixes do?

This PR adds three circuit breakers to prevent the re-queue infinite retry loop:

  1. Idempotent registerNewBuild — When a build report arrives for an already-existing build ID, the current code throws an error (unless status is "failed"). This causes cascading failures when GitHub Actions retries a "started" report after a network timeout. The fix handles each existing status gracefully, as detailed in Fix 1 of the PR description.

  2. Base/hub retry limits — Editor images already have a maxFailuresPerBuild: 15 cap via the Ingeminator. But base and hub images are re-dispatched on every 15-minute cron cycle with no backoff and no limit — truly infinite retries. This fix adds the same maxFailuresPerBuild check, with a Discord alert when exceeded.

  3. created → inProgress transition — markJobAsInProgress previously only accepted "scheduled" status. If the scheduler dispatched a workflow but crashed before writing created → scheduled to Firestore, the job stayed in "created" and could be dispatched again. Now it accepts both "created" and "scheduled". This incorporates the approach from PR #74 ("Attempt to prevent duplicate workflow dispatch by marking jobs in progress at build start", closed without merge).

Relationship to game-ci/docker PR #276

The docker PR (#276) fixes the root cause (wrong digest field in Windows workflows). This PR adds backend-side safety nets so that even if future build reports fail for any reason, the system degrades gracefully instead of entering an infinite retry loop.

Both PRs are complementary but independent — either can be merged first.

CI Status

  • ✅ TypeScript compilation passes
  • ✅ Lint + format passes
  • ✅ Existing tests pass (2/2)
  • ❌ Test Deploy fails — this is a secrets access issue for fork PRs (GCP_SA_KEY unavailable), not a code problem. This failure would also occur on any external contributor's PR.

Test Coverage Gap

The modified files (ciBuilds.ts, ciJobs.ts, scheduler.ts) have zero existing unit tests. The repo has only 2 test files total, both for version scraping logic. The hasExceededRetryLimit helper is a pure function that's trivially testable. The registerNewBuild idempotency logic would benefit from Firestore-mocked tests but requires mock infrastructure that doesn't exist yet.

Safety

  • registerNewBuild changes: Only adds new if branches for existing statuses. The "failed" → retry path (existing behavior) is untouched. New branches either return silently (safe) or throw descriptive errors (safe). No existing happy-path behavior changes.
  • Retry limits: Only adds a guard check before the existing dispatch call. If failureCount is below the limit, behavior is identical to current code. The Discord alert reuses existing Discord.sendAlert().
  • markJobAsInProgress: Expands the accepted statuses from ['scheduled'] to ['created', 'scheduled']. This is strictly more permissive — it cannot break any transition that currently works.

Recommendation

Safe to merge. All changes are additive guards on existing code paths. The Test Deploy failure is expected for fork PRs and not related to the code changes.

@webbertakken

The AI is focusing on the wrong things and drawing the wrong conclusions. Again it's adding 'created' to the array, which has never been part of the problem. And it's assuming network requests get lost, but the system is already self-healing in that regard.

The problem is actually not with the scheduling itself, but with certain builds getting stuck, and the pipeline then getting stuck because it doesn't continue scheduling builds when something isn't building correctly. That is by design, so we don't create a ton of failing workflows (in practice max 20 or so, instead of hundreds).

@frostebite

Closing this PR. After deeper investigation and reviewer feedback, the scheduling system isn't the problem — it's working as designed. The pipeline halts when builds get stuck, which correctly prevents hundreds of failing workflows.

The actual root cause is in the docker workflows: Config.Image is absent from docker inspect output on newer Docker versions (containerd image store), causing every Windows build to fail at the reporting step. That fix belongs in game-ci/docker#276, not here.

The self-healing pattern doesn't need circuit breakers — it needs the builds to stop failing.
