fix: make registerNewBuild idempotent to prevent cascading job failures#73

Closed
wilg wants to merge 1 commit into game-ci:main from wilg:main

Conversation

@wilg wilg commented Dec 9, 2025

I'm not 100% confident in this fix, because this is a highly interdependent system, but it is the result of investigating game-ci/docker#273 with a coding agent.

It seems plausible, so I'm presenting it for a maintainer to review.


Context

After PR #266 fixed the Android cmdline-tools issue on Nov 3, dozens of previously-blocked Unity versions became eligible to build simultaneously. This burst of concurrent builds exposed an idempotency issue in the backend that caused cascading job failures, ultimately halting the scheduler.

Problem

The scheduler stops processing new Unity versions when more than 2 jobs are marked as failed (maxToleratedFailures: 2).

The Failure Cascade

  1. GitHub Actions workflow calls report-to-backend with status: started

  2. Request succeeds but response is lost (network timeout, Firebase cold start, etc.)

  3. HTTP client throws → action calls core.setFailed(err)

  4. Workflow's "Report failure" step runs (if: ${{ failure() || cancelled() }})

  5. Backend receives status: failed → CiJobs.markFailureForJob() marks the job as failed

  6. Once failingJobs.length > 2, the scheduler stops entirely
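The cascade above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `FakeBackend`, `withLostResponse`, and `runWorkflow` are hypothetical names that do not exist in the backend, and the only values taken from the description are the `maxToleratedFailures: 2` threshold and the duplicate-build error message.

```typescript
// Hypothetical in-memory model of the failure cascade; not the real backend code.
type BuildStatus = 'started' | 'failed' | 'published';

// Mirrors maxToleratedFailures: 2 from the description.
const MAX_TOLERATED_FAILURES = 2;

class FakeBackend {
  builds = new Map<string, BuildStatus>();
  failingJobs = new Set<string>();

  // Non-idempotent registration: a second call for the same build throws.
  reportStarted(buildId: string): void {
    if (this.builds.has(buildId)) {
      throw new Error(`A build with "${buildId}" as identifier already exists`);
    }
    this.builds.set(buildId, 'started');
  }

  // Stand-in for the "Report failure" path that marks the job as failed.
  reportFailed(buildId: string, jobId: string): void {
    this.builds.set(buildId, 'failed');
    this.failingJobs.add(jobId);
  }

  // The scheduler stops once more than MAX_TOLERATED_FAILURES jobs are failing.
  schedulerMayContinue(): boolean {
    return this.failingJobs.size <= MAX_TOLERATED_FAILURES;
  }
}

// Step 2: the request reaches the backend, but the client never sees the response.
function withLostResponse(request: () => void): void {
  request();
  throw new Error('Request timed out before the response arrived');
}

// Steps 1 to 5 for a single workflow run.
function runWorkflow(backend: FakeBackend, buildId: string, jobId: string): void {
  try {
    withLostResponse(() => backend.reportStarted(buildId));
  } catch (_err) {
    // Steps 3 to 5: the client error triggers the failure report.
    backend.reportFailed(buildId, jobId);
  }
}
```

In this model, three workflow runs that each lose their first response are enough to make `schedulerMayContinue()` return `false`, even though every build was registered successfully.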

Why Not Discord Rate Limiting?

The original report suggested Discord rate limiting was the cause. However, Eris handles 429s with automatic retry. The actual error in logs ("A build with X as identifier already exists") is a Firestore duplicate error from non-idempotent API handling.

Solution

Make registerNewBuild idempotent for retries:

  • If build exists with status 'started' and same jobId → return success (legitimate retry)
  • If build exists with different jobId → throw error (indicates a bug)
  • Existing behavior unchanged for 'failed' and 'published' statuses
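A minimal sketch of the decision these bullets describe, written as a pure function so it can be read without Firestore. The names `decideRegistration`, `ExistingBuild`, and `RegistrationAction` are hypothetical and the error wording is illustrative; the actual change lives in `registerNewBuild()` and may be structured differently.

```typescript
// Hypothetical decision helper; the real logic is inline in registerNewBuild().
type BuildStatus = 'started' | 'failed' | 'published';

interface ExistingBuild {
  status: BuildStatus;
  relatedJobId: string;
}

type RegistrationAction = 'create' | 'retry-after-failure' | 'skip';

function decideRegistration(
  existing: ExistingBuild | undefined,
  incomingJobId: string,
  buildId: string,
): RegistrationAction {
  // No document yet: register the build as usual.
  if (existing === undefined) return 'create';

  switch (existing.status) {
    case 'failed':
      // Existing behavior: a failed build may be started again.
      return 'retry-after-failure';
    case 'started':
      // Same job reporting "started" again is treated as a legitimate retry.
      if (existing.relatedJobId === incomingJobId) return 'skip';
      // A different job claiming the same build indicates a bug.
      throw new Error(
        `Build "${buildId}" was already started by job "${existing.relatedJobId}", not "${incomingJobId}"`,
      );
    default:
      // 'published': existing behavior, registration is rejected.
      throw new Error(
        `A build with "${buildId}" as identifier already exists (status: ${existing.status})`,
      );
  }
}
```

Keeping the decision separate from the write makes each of the four cases easy to unit test, which matters here because the review asked for the conflict and skip paths to be verified.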

Note

This fix prevents future cascading failures. To unblock the current queue, failed jobs in the database may need to be manually reset or retried.

Changes

  • functions/src/model/ciBuilds.ts: Added idempotency check in registerNewBuild()

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Strengthened build registration validation to detect and prevent conflicting registrations when the same build is registered with different job identifiers
    • Enhanced idempotent request handling to properly skip redundant registration attempts while maintaining data consistency
    • Improved error reporting for edge cases with more descriptive error messages based on build state


Contributor

coderabbitai bot commented Dec 9, 2025

Walkthrough

Enhanced the registerNewBuild function to implement stricter validation and idempotent behavior for existing builds. The function now checks the existingStatus and applies different logic: retains merge-only behavior for 'failed' status, validates relatedJobId consistency for 'started' status (skipping idempotently if matching or throwing on conflict), and throws errors for other status values.

Changes

Cohort / File(s) | Summary
Enhanced Build Registration Logic (functions/src/model/ciBuilds.ts) | Refactored registerNewBuild to extract and switch on existingStatus, adding strict validation for conflicting registrations and idempotent behavior for 'started' status builds, with new error handling and logging paths.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify the relatedJobId comparison logic and conflict detection for 'started' status builds
  • Ensure idempotent behavior correctly skips modification when status and relatedJobId match
  • Validate error messages for conflicting registrations and unhandled status values are clear and actionable
  • Confirm logging appropriately captures the skip case without noise

Poem

A build that's already begun,
Now checked with care, not rushed or run.
If paths align, we skip the fuss,
But conflicts found? They'll tell us thus.
Idempotent magic, smooth as can be! 🐰✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately describes the main change: making registerNewBuild idempotent to prevent cascading job failures, which is the core objective of the PR.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check | ✅ Passed | The pull request description is comprehensive and well-structured, providing context, problem analysis, solution details, and changes overview.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
functions/src/model/ciBuilds.ts (1)

138-161: Logic is correct for the retry scenario; consider a minor refactor for efficiency.

The idempotency logic correctly addresses the PR objectives. A few observations:

  1. Minor inefficiency: snapshot.data() is called twice (lines 138 and 146). Consider extracting once:
     if (snapshot.exists) {
-      const existingStatus = snapshot.data()?.status;
+      const existingData = snapshot.data();
+      const existingStatus = existingData?.status;
       if (existingStatus === 'failed') {
         // Builds can be retried after a failure.
         // In case of reporting a new build during retry step, only overwrite these fields
         result = await ref.set(data, {
           mergeFields: ['status', 'meta.lastBuildStart', 'modifiedDate'],
         });
       } else if (existingStatus === 'started') {
-        const existingJobId = snapshot.data()?.relatedJobId;
+        const existingJobId = existingData?.relatedJobId;
  2. Potential race on create: If two concurrent requests for a new build arrive simultaneously, both may see !snapshot.exists, one succeeds with create(), and the other throws. While unlikely in practice (the PR describes sequential retries), wrapping in a transaction or catching the create error could make this fully idempotent.
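One way to picture the "catch the create error" option is the sketch below. It is an assumption-heavy illustration: `FakeStore` is a synchronous in-memory stand-in, not the Firestore API, and `createOrConfirm` and the `'AlreadyExists'` error name are hypothetical. The idea is that the loser of the race re-reads the document and treats a matching 'started' record as success.

```typescript
// Hypothetical stand-in for a document store whose create() fails on duplicates.
interface BuildDoc {
  status: 'started' | 'failed' | 'published';
  relatedJobId: string;
}

class FakeStore {
  private docs = new Map<string, BuildDoc>();

  get(id: string): BuildDoc | undefined {
    return this.docs.get(id);
  }

  // Atomic create: throws if the document already exists.
  create(id: string, doc: BuildDoc): void {
    if (this.docs.has(id)) {
      const error = new Error(`Document "${id}" already exists`);
      error.name = 'AlreadyExists';
      throw error;
    }
    this.docs.set(id, doc);
  }
}

function createOrConfirm(
  store: FakeStore,
  buildId: string,
  jobId: string,
): 'created' | 'already-registered' {
  try {
    store.create(buildId, { status: 'started', relatedJobId: jobId });
    return 'created';
  } catch (err) {
    // Only the duplicate error is recoverable; anything else is rethrown.
    if (!(err instanceof Error) || err.name !== 'AlreadyExists') throw err;

    // Another request won the race: accept it only if it was the same job.
    const existing = store.get(buildId);
    if (existing !== undefined && existing.status === 'started' && existing.relatedJobId === jobId) {
      return 'already-registered';
    }
    throw err;
  }
}
```

A transaction would close the same window by making the read and the write atomic; the catch-based variant shown here keeps the happy path to a single write.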
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0089cf and 2604cd7.

📒 Files selected for processing (1)
  • functions/src/model/ciBuilds.ts (1 hunks)
🔇 Additional comments (2)
functions/src/model/ciBuilds.ts (2)

145-156: Idempotent handling for 'started' status looks good.

The logic correctly distinguishes between:

  • Same job retrying (idempotent skip)
  • Different job trying to claim the same build (error)

The early return at line 156 appropriately skips the "Build created" log since no mutation occurred.


157-161: The error handling for the 'published' status case is correct.

The else branch correctly catches the 'published' status (along with any other unexpected statuses) and throws an error with the existing status included in the message for debugging clarity. The BuildStatus type defines only three values—'failed', 'started', and 'published'—and the logic properly routes:

  • 'failed': allowed to update
  • 'started': skips idempotently when relatedJobId matches, otherwise throws error
  • 'published': throws error (via the else branch)

Member

@webbertakken webbertakken left a comment

Appreciate you coming up with a fix!

The original report suggested Discord rate limiting was the cause. However, Eris handles 429s with automatic retry. The actual error in logs ("A build with X as identifier already exists") is a Firestore duplicate error from non-idempotent API handling.

Correct!

  1. Request succeeds but response is lost (network timeout, Firebase cold start, etc.)

Note that the mechanism to recover from this already exists.

First, this had never happened in six years until we started making changes to the backend code, likely because the backend responds immediately after adding a row to the DB. The chance of the DB row existing and the response still being lost is extremely small.

Even when that happens, the workflow will report in as failed.

In a case where the workflow never reports in at all, it's also automatically detected.

In other words: the rationale MUST assume the reality of our system, which is that the Versioning Backend is very much the core of this whole process. Its integrity is paramount.

Right now something that never used to happen for many years started happening: a build that is reported in already exists in the database. This is the bug that should be fixed, as it should never be possible for this to happen (and it didn't use to be, as far as I know).

This PR is kind of solving the symptom of the build pipeline getting stuck, but it doesn't solve the root problem. Therefore this isn't the right solution for us. By design, if anything goes wrong, the process will halt, so that we don't end up having to do a lot of manual work, but can fix the root cause.

That said, the root cause fix shouldn't be much more complicated than this one.

Author

wilg commented Dec 13, 2025

OK, perhaps this is the root issue then? #74

@wilg wilg closed this Dec 13, 2025
@webbertakken webbertakken mentioned this pull request Dec 13, 2025
frostebite added a commit to frostebite/versioning-backend that referenced this pull request Mar 14, 2026
Three targeted fixes that close the race windows causing cascading
job failures and infinite re-dispatch loops:

1. Make registerNewBuild idempotent (ciBuilds.ts)
   - If build already exists with status "started" and same jobId,
     silently succeed (handles network timeout retries)
   - If build already exists with status "published", silently succeed
   - If build already exists with status "failed", overwrite with
     "started" (existing retry behavior, preserved)
   - If build exists with "started" but different jobId, throw with
     a descriptive error message
   Inspired by PR game-ci#73.

2. Add retry limits to base/hub image dispatch (scheduler.ts)
   - Check job failureCount against maxFailuresPerBuild (15) before
     re-dispatching base or hub image workflows
   - Log a warning and send a Discord alert when the limit is reached
   - Prevents infinite re-dispatch on every cron cycle when a
     base/hub job is stuck in "created" or "failed" state
   Uses new CiJobs.hasExceededRetryLimit() helper (ciJobs.ts).

3. Allow created -> inProgress transition (ciJobs.ts)
   - markJobAsInProgress now accepts jobs with status "created" in
     addition to "scheduled"
   - Closes the race window where scheduler dispatches a workflow but
     crashes before updating Firestore from "created" to "scheduled"
   - The workflow's reportNewBuild call now moves the job out of the
     schedulable state regardless of whether the scheduler updated it
   Inspired by PR game-ci#74.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
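
Items 2 and 3 of the commit message above can be sketched roughly as follows. This is a hedged illustration, not the committed code: `exceededRetryLimit` and `toInProgress` are hypothetical stand-ins for `CiJobs.hasExceededRetryLimit()` and `markJobAsInProgress`, the `Job` shape is assumed, the status list contains only the states the message mentions, and whether the real limit check uses `>=` or `>` is not stated in the source.

```typescript
// Hypothetical job shape; only the statuses named in the commit message are modeled.
type JobStatus = 'created' | 'scheduled' | 'inProgress' | 'failed';

interface Job {
  status: JobStatus;
  failureCount: number;
}

// Mirrors maxFailuresPerBuild (15) from the commit message.
const MAX_FAILURES_PER_BUILD = 15;

// Assumption: the limit counts as reached once failureCount equals the maximum.
function exceededRetryLimit(job: Job, limit: number = MAX_FAILURES_PER_BUILD): boolean {
  return job.failureCount >= limit;
}

// Accepts both "created" and "scheduled", so a workflow that reports in before
// the scheduler has recorded "scheduled" still leaves the schedulable state.
function toInProgress(job: Job): Job {
  if (job.status !== 'created' && job.status !== 'scheduled') {
    throw new Error(`Cannot move a job with status "${job.status}" to "inProgress"`);
  }
  return { ...job, status: 'inProgress' };
}
```

With a check like `exceededRetryLimit` in front of each re-dispatch, a job that keeps failing stops being retried on every cron cycle once it reaches the limit.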