fix: add circuit breakers to prevent re-queue infinite retry loop#80

Closed
frostebite wants to merge 1 commit into game-ci:main from
frostebite:fix/retry-loop-circuit-breakers

Conversation


@frostebite frostebite commented Mar 14, 2026

Summary

Three targeted fixes that close the race windows causing cascading job failures and infinite re-dispatch loops in the build queue scheduler. These changes are minimal and surgical -- each addresses a specific failure mode that has been observed in production.

  • Make registerNewBuild idempotent so that network-timeout retries of the "started" report no longer throw, preventing the cascading failure that marks jobs as failed and halts the scheduler when failingJobs.length > maxToleratedFailures
  • Add retry limits to base/hub image dispatch so the scheduler stops re-dispatching workflows every 15-minute cron cycle when a base or hub job is stuck in "created" or "failed" state (editor images already had this protection via the Ingeminator)
  • Allow created -> inProgress job transition so a workflow's reportNewBuild call moves the job out of the schedulable state even if the scheduler crashed before writing created -> scheduled, preventing duplicate dispatch

Detailed Changes

Fix 1: Idempotent registerNewBuild (functions/src/model/ciBuilds.ts)

Problem: When a GitHub Actions workflow calls reportNewBuild with status: started but the HTTP response is lost (network timeout, Firebase cold start), the action's error handler calls reportBuildFailure, marking the job as failed. The next cron cycle retries the workflow, which calls reportNewBuild again -- but now the build already exists with status "started", so it throws "A build with X as identifier already exists", causing another failure report, ad infinitum.

Fix:

  • If build exists with status "started" and same jobId: silently return (idempotent retry)
  • If build exists with status "started" and different jobId: throw with descriptive error (indicates a real bug)
  • If build exists with status "published": silently return (build already completed)
  • If build exists with status "failed": overwrite with new "started" (existing behavior, preserved)

This is essentially what #73 proposed.
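As a rough sketch, the branching above can be expressed as a pure decision function. The `ExistingBuild` shape, `decideRegisterAction` name, and `RegisterAction` values are illustrative stand-ins, not the actual ciBuilds.ts API:

```typescript
// Illustrative decision table for registerNewBuild retries; the real
// implementation reads a Firestore snapshot instead of this struct.
type BuildStatus = 'started' | 'failed' | 'published';

interface ExistingBuild {
  status: BuildStatus;
  jobId: string;
}

type RegisterAction = 'create' | 'skip' | 'overwrite' | 'conflict';

function decideRegisterAction(existing: ExistingBuild | null, incomingJobId: string): RegisterAction {
  if (existing === null) return 'create'; // no build yet: normal path
  switch (existing.status) {
    case 'started':
      // Same job retrying after a lost response: idempotent no-op.
      // A different job claiming the same build id indicates a real bug.
      return existing.jobId === incomingJobId ? 'skip' : 'conflict';
    case 'published':
      return 'skip'; // build already completed
    case 'failed':
      return 'overwrite'; // existing retry behavior, preserved
  }
}
```

The `conflict` branch is where the descriptive error would be thrown; every other branch either returns silently or writes.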

Fix 2: Retry limits for base/hub image dispatch (functions/src/logic/buildQueue/scheduler.ts, functions/src/model/ciJobs.ts)

Problem: ensureThatBaseImageHasBeenBuilt() and ensureThatHubImageHasBeenBuilt() re-dispatch GitHub workflows on every cron cycle (every 15 minutes) when the job is in "created" or "failed" state, with no backoff and no retry limit. Editor images are protected by the Ingeminator's maxFailuresPerBuild check, but base/hub images have no such protection.

Fix:

  • Added CiJobs.hasExceededRetryLimit() helper that checks job.meta.failureCount >= maxRetries
  • Both ensureThatBaseImageHasBeenBuilt() and ensureThatHubImageHasBeenBuilt() now check this limit (using the existing settings.maxFailuresPerBuild = 15) before dispatching
  • When the limit is reached, logs a warning and sends a Discord alert asking for manual intervention
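A minimal sketch of the guard described above, assuming a reduced `CiJob` shape with only the field the check reads (the real model and alerting code are more involved):

```typescript
// Reduced stand-in for the CiJob model: only the field the check needs.
interface CiJob {
  meta: { failureCount: number };
}

// Mirrors the described check: job.meta.failureCount >= maxRetries.
function hasExceededRetryLimit(job: CiJob, maxRetries: number): boolean {
  return job.meta.failureCount >= maxRetries;
}

// Guard evaluated before dispatching a base/hub workflow; the Discord
// alert is elided here and replaced by a warning log.
function shouldDispatch(job: CiJob, maxFailuresPerBuild: number = 15): boolean {
  if (hasExceededRetryLimit(job, maxFailuresPerBuild)) {
    console.warn(`Retry limit (${maxFailuresPerBuild}) reached; skipping dispatch.`);
    return false;
  }
  return true;
}
```

When the count is below the limit, dispatch proceeds exactly as before; the guard only adds an early exit.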

Fix 3: Allow created -> inProgress transition (functions/src/model/ciJobs.ts)

Problem: The scheduler dispatches a GitHub workflow first, then updates Firestore (created -> scheduled). If the scheduler crashes between these two steps, the job stays created even though a workflow is running. The next cron cycle picks it up again and dispatches a duplicate workflow.

When that workflow calls reportNewBuild, markJobAsInProgress only accepts "scheduled" status, so the job stays "created" -- leaving it eligible for yet another duplicate dispatch.

Fix: markJobAsInProgress now accepts both "created" and "scheduled" statuses, so the workflow's report moves the job out of the schedulable state regardless of whether the scheduler's markJobAsScheduled call succeeded.

This is essentially what #74 proposed.
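The widened transition can be sketched as a simple eligibility check. The status names follow the PR text; the function shape is illustrative, not the real ciJobs.ts signature:

```typescript
// Job lifecycle states as described in the PR text.
type JobStatus = 'created' | 'scheduled' | 'inProgress' | 'completed' | 'failed';

// Previously only 'scheduled' could move to inProgress; accepting
// 'created' as well means a workflow's report always advances the job,
// even if the scheduler crashed before writing created -> scheduled.
const IN_PROGRESS_ELIGIBLE: ReadonlyArray<JobStatus> = ['created', 'scheduled'];

function canMarkInProgress(status: JobStatus): boolean {
  return IN_PROGRESS_ELIGIBLE.includes(status);
}
```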

Related

Test plan

  • Verify TypeScript compiles cleanly (npx tsc --noEmit passes with zero errors)
  • Review each change against the existing Firestore state machine to confirm no unintended state transitions
  • Verify the maxFailuresPerBuild setting (15) is appropriate for base/hub images
  • Consider deploying to a staging Firebase project and simulating the failure scenarios

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Implemented automatic retry limit enforcement to prevent indefinite job re-dispatch attempts.
    • Added Discord notifications when builds exceed configured retry limits.
    • Enhanced concurrent build registration with improved conflict detection and idempotent retry handling.
    • Improved job status transition logic for more reliable lifecycle management.

Three targeted fixes that close the race windows causing cascading
job failures and infinite re-dispatch loops:

1. Make registerNewBuild idempotent (ciBuilds.ts)
   - If build already exists with status "started" and same jobId,
     silently succeed (handles network timeout retries)
   - If build already exists with status "published", silently succeed
   - If build already exists with status "failed", overwrite with
     "started" (existing retry behavior, preserved)
   - If build exists with "started" but different jobId, throw with
     a descriptive error message
   Inspired by PR game-ci#73.

2. Add retry limits to base/hub image dispatch (scheduler.ts)
   - Check job failureCount against maxFailuresPerBuild (15) before
     re-dispatching base or hub image workflows
   - Log a warning and send a Discord alert when the limit is reached
   - Prevents infinite re-dispatch on every cron cycle when a
     base/hub job is stuck in "created" or "failed" state
   Uses new CiJobs.hasExceededRetryLimit() helper (ciJobs.ts).

3. Allow created -> inProgress transition (ciJobs.ts)
   - markJobAsInProgress now accepts jobs with status "created" in
     addition to "scheduled"
   - Closes the race window where scheduler dispatches a workflow but
     crashes before updating Firestore from "created" to "scheduled"
   - The workflow's reportNewBuild call now moves the job out of the
     schedulable state regardless of whether the scheduler updated it
   Inspired by PR game-ci#74.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

coderabbitai bot commented Mar 14, 2026

📝 Walkthrough

Three CI/job management files were modified to add retry-limit checks before dispatching build events and improve concurrent build registration handling. A new utility method validates whether job retry limits are exceeded, while enhanced status transition logic distinguishes between failed, started, and published build states.

Changes

  • Retry Limit Checks and Build Registration (functions/src/logic/buildQueue/scheduler.ts, functions/src/model/ciBuilds.ts, functions/src/model/ciJobs.ts): Added pre-dispatch retry-limit validation in two scheduling paths with Discord alerts; enhanced build registration to handle concurrent/duplicate requests with conflict detection for mismatched job IDs; introduced new hasExceededRetryLimit utility method; expanded markJobAsInProgress status transitions to accept both created and scheduled states.

Sequence Diagram

sequenceDiagram
    participant Scheduler
    participant CiJobs
    participant Discord
    participant GitHub
    
    Scheduler->>CiJobs: hasExceededRetryLimit(job, maxFailures)?
    CiJobs-->>Scheduler: true/false
    
    alt Retry Limit Exceeded
        Scheduler->>Scheduler: Log warning
        Scheduler->>Discord: Send alert (job type + max limit)
        Scheduler-->>Scheduler: Return false, abort dispatch
    else Retry Limit Not Exceeded
        Scheduler->>GitHub: Dispatch event (Base/Hub image)
        GitHub-->>Scheduler: Handle response
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested reviewers

  • GabLeRoux

Poem

🐰 Retry limits checked with care,
Builds queued with conflict-aware flair,
Discord alerts when limits near,
Concurrent requests, crystal clear!
The scheduler hops—no doubts remain. 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check (❓ Inconclusive): The PR description is comprehensive and well-structured, covering summary, detailed changes, and test plan. However, the required checklist section from the template is incomplete (no entries checked or marked as not needed). Resolution: complete the checklist section, confirming whether the readme needs updating and whether tests were added/updated, then mark items accordingly.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title clearly and specifically identifies the main change: adding circuit breakers to prevent infinite retry loops in the scheduler.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage; skipping the check.



@coderabbitai coderabbitai bot left a comment


⚠️ Outside diff range comments (1)
functions/src/model/ciBuilds.ts (1)

133-168: ⚠️ Potential issue | 🟠 Major

Wrap registerNewBuild in a transaction to prevent concurrent race conditions.

The code reads document state at line 134, then conditionally creates at line 167 without transactional protection. Two concurrent requests can both read "missing," causing the second create to fail and breaking idempotent behavior.

Wrap the entire read-check-act flow in db.runTransaction() to ensure atomicity and prevent concurrent conflicts.
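A hedged sketch of what that refactor might look like. The `Transaction` and `Snapshot` interfaces below are minimal stand-ins so the flow is visible without firebase-admin; in the real code this would be the body passed to `db.runTransaction(async (tx) => ...)` using `FirebaseFirestore.Transaction`:

```typescript
// Minimal stand-ins for the Firestore types involved.
interface Snapshot {
  exists: boolean;
  data: { status: string; jobId: string } | null;
}

interface Transaction {
  get(ref: string): Promise<Snapshot>;
  create(ref: string, data: object): void;
  set(ref: string, data: object, opts?: { merge: boolean }): void;
}

// All reads and writes go through the same transaction, so two
// concurrent registrations cannot both observe "missing" and race
// on the create.
async function registerNewBuildTx(
  tx: Transaction,
  ref: string,
  build: { status: string; jobId: string },
): Promise<void> {
  const snap = await tx.get(ref);
  if (!snap.exists) {
    tx.create(ref, build); // no build yet: normal creation path
    return;
  }
  const { status, jobId } = snap.data!;
  if (status === 'failed') {
    tx.set(ref, build, { merge: true }); // retry after failure: overwrite
  } else if (status === 'started' && jobId !== build.jobId) {
    throw new Error(`Build already started by job ${jobId}`);
  }
  // "started" with the same jobId, or "published": idempotent no-op.
}
```

Firestore retries the transaction callback on contention, so the error/return behavior seen by callers stays the same as the non-transactional version.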

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@functions/src/model/ciBuilds.ts` around lines 133 - 168, The read-check-act
flow in registerNewBuild (the logic that gets doc ref for CiBuilds.collection,
inspects snapshot.exists/status, and then either ref.create(data) or
ref.set(...mergeFields)) must be run inside db.runTransaction() to prevent
races; refactor so you call db.runTransaction(async (tx) => { const snap = await
tx.get(ref); /* same status checks: if failed -> tx.set(ref, data, {mergeFields:
[...]}); if started -> validate relatedJobId and return/throw as before; if
published -> return; else -> throw; if not exists -> tx.create(ref, data) */ });
ensure all reads/creates/sets use tx.get/tx.create/tx.set and preserve the same
error/return behavior.
🧹 Nitpick comments (1)
functions/src/logic/buildQueue/scheduler.ts (1)

93-95: Consider deduplicating retry-limit alerts to reduce alert fatigue.

When a job stays over the threshold, each scheduler cycle sends another Discord.sendAlert, which can create ongoing noise. Consider persisting an “alert sent” marker (or cooldown timestamp) so alerts fire once per threshold crossing.

Also applies to: 144-146
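One possible shape for that deduplication, sketched with an in-memory map as a stand-in for persisting a marker on the job document (the cooldown value is an arbitrary example, not a project setting):

```typescript
// In-memory stand-in for a persisted "alert sent" marker per job.
const lastAlertAt = new Map<string, number>();
const ALERT_COOLDOWN_MS = 6 * 60 * 60 * 1000; // 6h, example value only

// Returns true (and records the send) only on the first threshold
// crossing or after the cooldown has expired.
function shouldSendAlert(jobId: string, now: number = Date.now()): boolean {
  const last = lastAlertAt.get(jobId);
  if (last !== undefined && now - last < ALERT_COOLDOWN_MS) {
    return false; // alerted recently; stay quiet this cycle
  }
  lastAlertAt.set(jobId, now);
  return true;
}

// Clear the marker when a job drops back below the threshold so the
// next crossing alerts again.
function resetAlertMarker(jobId: string): void {
  lastAlertAt.delete(jobId);
}
```

In the scheduler this guard would wrap the existing `Discord.sendAlert` call; since Cloud Functions instances are ephemeral, a real implementation would persist the timestamp on the job's Firestore document rather than in process memory.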

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@functions/src/logic/buildQueue/scheduler.ts` around lines 93 - 95, The
scheduler currently calls logger.warn and Discord.sendAlert every cycle for jobs
over the retry threshold; change it to deduplicate alerts by recording an "alert
sent" marker or cooldown timestamp for each job (e.g., in the job's metadata or
a short-lived store) and only call Discord.sendAlert when the job first crosses
the threshold (or after the cooldown expires). Update the logic around the
existing logger.warn / Discord.sendAlert calls in scheduler.ts (the scheduler
loop / retry-limit handling) to check this marker before sending, set the marker
when sending, and clear/reset it when the job falls back below the threshold so
future threshold crossings will alert again.


📥 Commits

Reviewing files that changed from the base of the PR and between b6cb391 and dcb6267.

📒 Files selected for processing (3)
  • functions/src/logic/buildQueue/scheduler.ts
  • functions/src/model/ciBuilds.ts
  • functions/src/model/ciJobs.ts

@frostebite

Readiness & Safety Assessment

What do these fixes do?

This PR adds three circuit breakers to prevent the re-queue infinite retry loop:

  1. Idempotent registerNewBuild — When a build report arrives for an already-existing build ID, the current code throws an error (unless status is "failed"). This causes cascading failures when GitHub Actions retries a "started" report after a network timeout. The fix handles each existing status gracefully, as detailed in Fix 1 of the PR description.

  2. Base/hub retry limits — Editor images already have a maxFailuresPerBuild: 15 cap via the Ingeminator. But base and hub images are re-dispatched on every 15-minute cron cycle with no backoff and no limit — truly infinite retries. This fix adds the same maxFailuresPerBuild check, with a Discord alert when exceeded.

  3. created → inProgress transition — markJobAsInProgress previously only accepted "scheduled" status. If the scheduler dispatched a workflow but crashed before writing created → scheduled to Firestore, the job stayed in "created" and could be dispatched again. Now it accepts both "created" and "scheduled". This incorporates the approach from PR #74 ("Attempt to prevent duplicate workflow dispatch by marking jobs in progress at build start", closed without merge).

Relationship to game-ci/docker PR #276

The docker PR (#276) fixes the root cause (wrong digest field in Windows workflows). This PR adds backend-side safety nets so that even if future build reports fail for any reason, the system degrades gracefully instead of entering an infinite retry loop.

Both PRs are complementary but independent — either can be merged first.

CI Status

  • ✅ TypeScript compilation passes
  • ✅ Lint + format passes
  • ✅ Existing tests pass (2/2)
  • ❌ Test Deploy fails — this is a secrets access issue for fork PRs (GCP_SA_KEY unavailable), not a code problem. This failure would also occur on any external contributor's PR.

Test Coverage Gap

The modified files (ciBuilds.ts, ciJobs.ts, scheduler.ts) have zero existing unit tests. The repo has only 2 test files total, both for version scraping logic. The hasExceededRetryLimit helper is a pure function that's trivially testable. The registerNewBuild idempotency logic would benefit from Firestore-mocked tests but requires mock infrastructure that doesn't exist yet.

Safety

  • registerNewBuild changes: Only adds new if branches for existing statuses. The "failed" → retry path (existing behavior) is untouched. New branches either return silently (safe) or throw descriptive errors (safe). No existing happy-path behavior changes.
  • Retry limits: Only adds a guard check before the existing dispatch call. If failureCount is below the limit, behavior is identical to current code. The Discord alert reuses existing Discord.sendAlert().
  • markJobAsInProgress: Expands the accepted statuses from ['scheduled'] to ['created', 'scheduled']. This is strictly more permissive — it cannot break any transition that currently works.

Recommendation

Safe to merge. All changes are additive guards on existing code paths. The Test Deploy failure is expected for fork PRs and not related to the code changes.

@webbertakken

The AI is focusing on the wrong things and drawing the wrong conclusions. Again it's adding 'created' to the array, which has never been part of the problem. And it's assuming network requests get lost, but the system is already self-healing in that regard.

The problem is actually not with the scheduling itself, but with certain builds getting stuck, and the pipeline then getting stuck because it doesn't continue scheduling builds when something isn't building correctly. That is by design, so we don't create a ton of failing workflows (in practice max 20 or so, instead of hundreds).

@frostebite

Closing this PR. After deeper investigation and reviewer feedback, the scheduling system isn't the problem — it's working as designed. The pipeline halts when builds get stuck, which correctly prevents hundreds of failing workflows.

The actual root cause is in the docker workflows: Config.Image is absent from docker inspect output on newer Docker versions (containerd image store), causing every Windows build to fail at the reporting step. That fix belongs in game-ci/docker#276, not here.

The self-healing pattern doesn't need circuit breakers — it needs the builds to stop failing.
