fix: make registerNewBuild idempotent to prevent cascading job failures #73
wilg wants to merge 1 commit into game-ci:main
Conversation
Actionable comments posted: 0
🧹 Nitpick comments (1)
functions/src/model/ciBuilds.ts (1)
138-161: Logic is correct for the retry scenario; consider a minor refactor for efficiency.

The idempotency logic correctly addresses the PR objectives. A few observations:
- Minor inefficiency: `snapshot.data()` is called twice (lines 138 and 146). Consider extracting once:

```diff
 if (snapshot.exists) {
-  const existingStatus = snapshot.data()?.status;
+  const existingData = snapshot.data();
+  const existingStatus = existingData?.status;
   if (existingStatus === 'failed') {
     // Builds can be retried after a failure.
     // In case of reporting a new build during retry step, only overwrite these fields
     result = await ref.set(data, {
       mergeFields: ['status', 'meta.lastBuildStart', 'modifiedDate'],
     });
   } else if (existingStatus === 'started') {
-    const existingJobId = snapshot.data()?.relatedJobId;
+    const existingJobId = existingData?.relatedJobId;
```
- Potential race on create: If two concurrent requests for a new build arrive simultaneously, both may see `!snapshot.exists`; one succeeds with `create()` and the other throws. While unlikely in practice (the PR describes sequential retries), wrapping the check in a transaction or catching the create error would make this fully idempotent.
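One way to close that race, sketched here with an in-memory `Map` standing in for the Firestore collection (the names `createBuild` and `registerBuildIdempotent` are invented for illustration, not the real API): attempt the create, and on a duplicate error re-read and apply the same idempotency rules, so the "loser" of a concurrent create still resolves to success when it is the same job retrying.

```typescript
type BuildStatus = 'failed' | 'started' | 'published';
interface BuildDoc { status: BuildStatus; relatedJobId: string; }

// Hypothetical in-memory stand-in for the Firestore collection.
const store = new Map<string, BuildDoc>();

// Mimics Firestore's ref.create(): throws if the document already exists.
function createBuild(id: string, jobId: string): void {
  if (store.has(id)) throw new Error(`A build with ${id} as identifier already exists`);
  store.set(id, { status: 'started', relatedJobId: jobId });
}

// Catch the duplicate-create error and re-check, instead of checking first
// and then creating (which leaves a window between the two operations).
function registerBuildIdempotent(id: string, jobId: string): 'created' | 'already-started' {
  try {
    createBuild(id, jobId);
    return 'created';
  } catch {
    const existing = store.get(id);
    if (existing?.status === 'started' && existing.relatedJobId === jobId) {
      return 'already-started'; // same job retried after a lost response
    }
    throw new Error(`A build with ${id} as identifier already exists`);
  }
}
```

In the real code the same shape would use a Firestore transaction or catch the `ALREADY_EXISTS` error from `ref.create()`.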
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
functions/src/model/ciBuilds.ts (1 hunks)
🔇 Additional comments (2)
functions/src/model/ciBuilds.ts (2)
145-156: Idempotent handling for 'started' status looks good.

The logic correctly distinguishes between:
- Same job retrying (idempotent skip)
- Different job trying to claim the same build (error)
The early `return` at line 156 appropriately skips the "Build created" log, since no mutation occurred.
157-161: The error handling for the 'published' status case is correct.

The else branch correctly catches the 'published' status (along with any other unexpected statuses) and throws an error that includes the existing status in the message for debugging clarity. The BuildStatus type defines only three values ('failed', 'started', and 'published'), and the logic properly routes:
- 'failed': allowed to update
- 'started': throws error
- 'published': throws error (via the else branch)
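The routing above can be captured as a small pure decision function. This is an illustrative sketch only (the names `decideRegisterAction` and `Action` are invented here and do not appear in `ciBuilds.ts`); the merge fields are taken from the diff earlier in this review.

```typescript
type BuildStatus = 'failed' | 'started' | 'published';

type Action =
  | { kind: 'create' }
  | { kind: 'overwrite'; mergeFields: string[] } // retry after a failure
  | { kind: 'skip' }                             // idempotent no-op
  | { kind: 'reject'; reason: string };          // surfaces as a thrown error

function decideRegisterAction(
  existing: { status: BuildStatus; relatedJobId: string } | undefined,
  incomingJobId: string,
): Action {
  if (existing === undefined) return { kind: 'create' };
  switch (existing.status) {
    case 'failed':
      // Builds can be retried after a failure; only refresh a few fields.
      return { kind: 'overwrite', mergeFields: ['status', 'meta.lastBuildStart', 'modifiedDate'] };
    case 'started':
      return existing.relatedJobId === incomingJobId
        ? { kind: 'skip' } // same job retrying after a lost response
        : { kind: 'reject', reason: 'build already claimed by a different job' };
    default:
      // 'published' (and any unexpected status) falls through to the else branch
      return { kind: 'reject', reason: `build already exists (status: ${existing.status})` };
  }
}
```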
Appreciate coming up with a fix!
The original report suggested Discord rate limiting was the cause. However, Eris handles 429s with automatic retry. The actual error in logs ("A build with X as identifier already exists") is a Firestore duplicate error from non-idempotent API handling.
Correct!
- Request succeeds but response is lost (network timeout, Firebase cold start, etc.)
Note that the mechanism to recover from this already exists.
Firstly, this has never happened in six years until we started making changes to the backend code, likely because the backend returns a response immediately after adding a row in the DB. The chance of the DB row existing and this going wrong is extremely small.
Even when that happens, the workflow will report in as failed.
In a case where the workflow never reports in at all, it's also automatically detected.
In other words: the rationale must assume the reality of our system, which is that the Versioning Backend is very much the core of this whole process. Its integrity is paramount.
Right now, something that never used to happen for many years has started happening: a build that is reported in already exists in the database. This is the bug that should be fixed, as it should never be possible for this to happen (and didn't use to be, as far as I know).
This PR is kind of solving the symptom of the build pipeline getting stuck, but it doesn't solve the root problem. Therefore this isn't the right solution for us. By design, if anything goes wrong, the process will halt, so that we don't end up having to do a lot of manual work but can fix the root cause.
That said, the root cause fix shouldn't be much more complicated than this one.
OK, perhaps this is the root issue then? #74
Three targeted fixes that close the race windows causing cascading
job failures and infinite re-dispatch loops:
1. Make registerNewBuild idempotent (ciBuilds.ts)
- If build already exists with status "started" and same jobId,
silently succeed (handles network timeout retries)
- If build already exists with status "published", silently succeed
- If build already exists with status "failed", overwrite with
"started" (existing retry behavior, preserved)
- If build exists with "started" but different jobId, throw with
a descriptive error message
Inspired by PR game-ci#73.
2. Add retry limits to base/hub image dispatch (scheduler.ts)
- Check job failureCount against maxFailuresPerBuild (15) before
re-dispatching base or hub image workflows
- Log a warning and send a Discord alert when the limit is reached
- Prevents infinite re-dispatch on every cron cycle when a
base/hub job is stuck in "created" or "failed" state
Uses new CiJobs.hasExceededRetryLimit() helper (ciJobs.ts).
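A minimal sketch of what such a guard could look like, under assumptions: the real `hasExceededRetryLimit()` in `ciJobs.ts` may have a different signature, and whether the comparison is `>=` or `>` is a guess; `maxFailuresPerBuild = 15` is taken from the commit message above.

```typescript
const maxFailuresPerBuild = 15; // limit named in the commit message

interface CiJob { failureCount: number; }

// Hedged sketch of the helper described above.
function hasExceededRetryLimit(job: CiJob): boolean {
  return job.failureCount >= maxFailuresPerBuild;
}

// Scheduler-side guard: skip re-dispatch once the limit is hit, instead of
// re-dispatching a stuck base/hub job on every cron cycle.
function shouldDispatch(job: CiJob): boolean {
  if (hasExceededRetryLimit(job)) {
    // real code: log a warning and send a Discord alert here
    return false;
  }
  return true;
}
```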
3. Allow created -> inProgress transition (ciJobs.ts)
- markJobAsInProgress now accepts jobs with status "created" in
addition to "scheduled"
- Closes the race window where scheduler dispatches a workflow but
crashes before updating Firestore from "created" to "scheduled"
- The workflow's reportNewBuild call now moves the job out of the
schedulable state regardless of whether the scheduler updated it
Inspired by PR game-ci#74.
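The widened transition in item 3 amounts to a one-line change to a status guard; this sketch is illustrative only, and the job status names beyond 'created' and 'scheduled' are assumptions, not taken from `ciJobs.ts`.

```typescript
// Assumed status vocabulary; only 'created' and 'scheduled' come from the text above.
type JobStatus = 'created' | 'scheduled' | 'inProgress' | 'failed' | 'completed';

// Widened guard for markJobAsInProgress: 'created' is now a valid source
// state alongside 'scheduled', so the workflow's reportNewBuild call still
// moves the job forward even if the scheduler crashed after dispatching but
// before writing 'scheduled' to Firestore.
function canMarkInProgress(status: JobStatus): boolean {
  return status === 'scheduled' || status === 'created';
}
```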
Co-Authored-By: Claude Opus 4.6 <[email protected]>
So, I'm not claiming to be 100% confident in this fix because it's a highly interdependent system, but this is the result of investigating the problem (game-ci/docker#273) with a coding agent.
This seems plausible, so presenting it for a maintainer to review.
Context
After PR #266 fixed the Android cmdline-tools issue on Nov 3, dozens of previously-blocked Unity versions became eligible to build simultaneously. This burst of concurrent builds exposed an idempotency issue in the backend that caused cascading job failures, ultimately halting the scheduler.
Problem
The scheduler stops processing new Unity versions when more than 2 jobs are marked as failed (`maxToleratedFailures: 2`).

The Failure Cascade
1. GitHub Actions workflow calls `report-to-backend` with `status: started`
2. Request succeeds but response is lost (network timeout, Firebase cold start, etc.)
3. HTTP client throws → action calls `core.setFailed(err)`
4. Workflow's "Report failure" step runs (`if: ${{ failure() || cancelled() }}`)
5. Backend receives `status: failed` → `CiJobs.markFailureForJob()` marks the job as failed
6. Once `failingJobs.length > 2`, the scheduler stops entirely

Why Not Discord Rate Limiting?
The original report suggested Discord rate limiting was the cause. However, Eris handles 429s with automatic retry. The actual error in logs ("A build with X as identifier already exists") is a Firestore duplicate error from non-idempotent API handling.

Solution
Make `registerNewBuild` idempotent for retries:
- Status `'started'` and same jobId → return success (legitimate retry)
- Existing handling for `'failed'` and `'published'` statuses is preserved

Note
This fix prevents future cascading failures. To unblock the current queue, failed jobs in the database may need to be manually reset or retried.
Changes
functions/src/model/ciBuilds.ts: Added idempotency check in `registerNewBuild()`

Summary by CodeRabbit
Release Notes