fix: make registerNewBuild idempotent to prevent cascading job failures #73
wilg wants to merge 1 commit into game-ci:main
Conversation
Actionable comments posted: 0
🧹 Nitpick comments (1)
functions/src/model/ciBuilds.ts (1)
138-161: Logic is correct for the retry scenario; consider a minor refactor for efficiency.

The idempotency logic correctly addresses the PR objectives. A few observations:
- Minor inefficiency: `snapshot.data()` is called twice (lines 138 and 146). Consider extracting once:

```diff
 if (snapshot.exists) {
-  const existingStatus = snapshot.data()?.status;
+  const existingData = snapshot.data();
+  const existingStatus = existingData?.status;
   if (existingStatus === 'failed') {
     // Builds can be retried after a failure.
     // In case of reporting a new build during retry step, only overwrite these fields
     result = await ref.set(data, {
       mergeFields: ['status', 'meta.lastBuildStart', 'modifiedDate'],
     });
   } else if (existingStatus === 'started') {
-    const existingJobId = snapshot.data()?.relatedJobId;
+    const existingJobId = existingData?.relatedJobId;
```
- Potential race on create: If two concurrent requests for a new build arrive simultaneously, both may see `!snapshot.exists`; one succeeds with `create()` and the other throws. While unlikely in practice (the PR describes sequential retries), wrapping the check in a transaction or catching the create error would make this fully idempotent.
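One way to close that race, sketched here with an in-memory `Map` standing in for the Firestore collection (the names `createBuild` and `registerBuildIdempotent` are invented for illustration, not the real API): attempt the create, and on a duplicate error re-read and apply the same idempotency rules, so the "loser" of a concurrent create still resolves to success when it is the same job retrying.

```typescript
type BuildStatus = 'failed' | 'started' | 'published';
interface BuildDoc { status: BuildStatus; relatedJobId: string; }

// Hypothetical in-memory stand-in for the Firestore collection.
const store = new Map<string, BuildDoc>();

// Mimics Firestore's ref.create(): throws if the document already exists.
function createBuild(id: string, jobId: string): void {
  if (store.has(id)) throw new Error(`A build with ${id} as identifier already exists`);
  store.set(id, { status: 'started', relatedJobId: jobId });
}

// Catch the duplicate-create error and re-check, instead of checking first
// and then creating (which leaves a window between the two operations).
function registerBuildIdempotent(id: string, jobId: string): 'created' | 'already-started' {
  try {
    createBuild(id, jobId);
    return 'created';
  } catch {
    const existing = store.get(id);
    if (existing?.status === 'started' && existing.relatedJobId === jobId) {
      return 'already-started'; // same job retried after a lost response
    }
    throw new Error(`A build with ${id} as identifier already exists`);
  }
}
```

In the real code the same shape would use a Firestore transaction or catch the `ALREADY_EXISTS` error from `ref.create()`.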
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
functions/src/model/ciBuilds.ts (1 hunks)
🔇 Additional comments (2)
functions/src/model/ciBuilds.ts (2)
145-156: Idempotent handling for 'started' status looks good.

The logic correctly distinguishes between:
- Same job retrying (idempotent skip)
- Different job trying to claim the same build (error)
The early `return` at line 156 appropriately skips the "Build created" log, since no mutation occurred.
157-161: The error handling for the 'published' status case is correct.

The else branch correctly catches the 'published' status (along with any other unexpected statuses) and throws an error that includes the existing status in the message for debugging clarity. The BuildStatus type defines only three values ('failed', 'started', and 'published'), and the logic properly routes:
- 'failed': allowed to update
- 'started': throws error
- 'published': throws error (via the else branch)
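The routing above can be captured as a small pure decision function. This is an illustrative sketch only (the names `decideRegisterAction` and `Action` are invented here and do not appear in `ciBuilds.ts`); the merge fields are taken from the diff earlier in this review.

```typescript
type BuildStatus = 'failed' | 'started' | 'published';

type Action =
  | { kind: 'create' }
  | { kind: 'overwrite'; mergeFields: string[] } // retry after a failure
  | { kind: 'skip' }                             // idempotent no-op
  | { kind: 'reject'; reason: string };          // surfaces as a thrown error

function decideRegisterAction(
  existing: { status: BuildStatus; relatedJobId: string } | undefined,
  incomingJobId: string,
): Action {
  if (existing === undefined) return { kind: 'create' };
  switch (existing.status) {
    case 'failed':
      // Builds can be retried after a failure; only refresh a few fields.
      return { kind: 'overwrite', mergeFields: ['status', 'meta.lastBuildStart', 'modifiedDate'] };
    case 'started':
      return existing.relatedJobId === incomingJobId
        ? { kind: 'skip' } // same job retrying after a lost response
        : { kind: 'reject', reason: 'build already claimed by a different job' };
    default:
      // 'published' (and any unexpected status) falls through to the else branch
      return { kind: 'reject', reason: `build already exists (status: ${existing.status})` };
  }
}
```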
Appreciate coming up with a fix!
The original report suggested Discord rate limiting was the cause. However, Eris handles 429s with automatic retry. The actual error in logs ("A build with X as identifier already exists") is a Firestore duplicate error from non-idempotent API handling.
Correct!
- Request succeeds but response is lost (network timeout, Firebase cold start, etc.)
Note that the mechanism to recover from this already exists.
Firstly, this has never happened in six years until we started making changes to the backend code, likely because the backend returns a response immediately after adding a row in the DB. The chance of the DB row existing and this going wrong is extremely small.
Even when that happens, the workflow will report in as failed.
In a case where the workflow never reports in at all, it's also automatically detected.
In other words: the rationale must assume the reality of our system, which is that the Versioning Backend is very much the core of this whole process. Its integrity is paramount.
Right now, something that never used to happen for many years has started happening: a build that is reported in already exists in the database. This is the bug that should be fixed, as it should never be possible for this to happen (and didn't use to be, as far as I know).
This PR is kind of solving the symptom of the build pipeline getting stuck, but it doesn't solve the root problem. Therefore this isn't the right solution for us. By design, if anything goes wrong, the process will halt, so that we don't end up having to do a lot of manual work but can fix the root cause.
That said, the root cause fix shouldn't be much more complicated than this one.
OK, perhaps this is the root issue then? #74
Three targeted fixes that close the race windows causing cascading
job failures and infinite re-dispatch loops:
1. Make registerNewBuild idempotent (ciBuilds.ts)
- If build already exists with status "started" and same jobId,
silently succeed (handles network timeout retries)
- If build already exists with status "published", silently succeed
- If build already exists with status "failed", overwrite with
"started" (existing retry behavior, preserved)
- If build exists with "started" but different jobId, throw with
a descriptive error message
Inspired by PR game-ci#73.
2. Add retry limits to base/hub image dispatch (scheduler.ts)
- Check job failureCount against maxFailuresPerBuild (15) before
re-dispatching base or hub image workflows
- Log a warning and send a Discord alert when the limit is reached
- Prevents infinite re-dispatch on every cron cycle when a
base/hub job is stuck in "created" or "failed" state
Uses new CiJobs.hasExceededRetryLimit() helper (ciJobs.ts).
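A minimal sketch of what such a guard could look like, under assumptions: the real `hasExceededRetryLimit()` in `ciJobs.ts` may have a different signature, and whether the comparison is `>=` or `>` is a guess; `maxFailuresPerBuild = 15` is taken from the commit message above.

```typescript
const maxFailuresPerBuild = 15; // limit named in the commit message

interface CiJob { failureCount: number; }

// Hedged sketch of the helper described above.
function hasExceededRetryLimit(job: CiJob): boolean {
  return job.failureCount >= maxFailuresPerBuild;
}

// Scheduler-side guard: skip re-dispatch once the limit is hit, instead of
// re-dispatching a stuck base/hub job on every cron cycle.
function shouldDispatch(job: CiJob): boolean {
  if (hasExceededRetryLimit(job)) {
    // real code: log a warning and send a Discord alert here
    return false;
  }
  return true;
}
```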
3. Allow created -> inProgress transition (ciJobs.ts)
- markJobAsInProgress now accepts jobs with status "created" in
addition to "scheduled"
- Closes the race window where scheduler dispatches a workflow but
crashes before updating Firestore from "created" to "scheduled"
- The workflow's reportNewBuild call now moves the job out of the
schedulable state regardless of whether the scheduler updated it
Inspired by PR game-ci#74.
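The widened transition in item 3 amounts to a one-line change to a status guard; this sketch is illustrative only, and the job status names beyond 'created' and 'scheduled' are assumptions, not taken from `ciJobs.ts`.

```typescript
// Assumed status vocabulary; only 'created' and 'scheduled' come from the text above.
type JobStatus = 'created' | 'scheduled' | 'inProgress' | 'failed' | 'completed';

// Widened guard for markJobAsInProgress: 'created' is now a valid source
// state alongside 'scheduled', so the workflow's reportNewBuild call still
// moves the job forward even if the scheduler crashed after dispatching but
// before writing 'scheduled' to Firestore.
function canMarkInProgress(status: JobStatus): boolean {
  return status === 'scheduled' || status === 'created';
}
```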
Co-Authored-By: Claude Opus 4.6 <[email protected]>
So, I'm not claiming to be 100% confident in this fix because it's a highly interdependent system, but this is the result of investigating the problem (game-ci/docker#273) with a coding agent.
This seems plausible, so presenting it for a maintainer to review.
Context
After PR #266 fixed the Android cmdline-tools issue on Nov 3, dozens of previously-blocked Unity versions became eligible to build simultaneously. This burst of concurrent builds exposed an idempotency issue in the backend that caused cascading job failures, ultimately halting the scheduler.
Problem
The scheduler stops processing new Unity versions when more than 2 jobs are marked as failed (`maxToleratedFailures: 2`).

The Failure Cascade
1. GitHub Actions workflow calls `report-to-backend` with `status: started`
2. Request succeeds but response is lost (network timeout, Firebase cold start, etc.)
3. HTTP client throws → action calls `core.setFailed(err)`
4. Workflow's "Report failure" step runs (`if: ${{ failure() || cancelled() }}`)
5. Backend receives `status: failed` → `CiJobs.markFailureForJob()` marks the job as failed
6. Once `failingJobs.length > 2`, the scheduler stops entirely

Why Not Discord Rate Limiting?
The original report suggested Discord rate limiting was the cause. However, Eris handles 429s with automatic retry. The actual error in logs ("A build with X as identifier already exists") is a Firestore duplicate error from non-idempotent API handling.

Solution
Make `registerNewBuild` idempotent for retries:
- Status `'started'` and same jobId → return success (legitimate retry)
- Existing handling for `'failed'` and `'published'` statuses is preserved

Note
This fix prevents future cascading failures. To unblock the current queue, failed jobs in the database may need to be manually reset or retried.
Changes
functions/src/model/ciBuilds.ts: Added idempotency check in `registerNewBuild()`

Summary by CodeRabbit
Release Notes