[CI] Add on-demand rerun for spot-terminated jobs #2403
yzh119 merged 9 commits into flashinfer-ai:main from
Conversation
Note: Gemini is unable to generate a summary for this pull request because the file types involved are not currently supported.
📝 Walkthrough

Adds spot-aware rerun logic to PR CI: background spot termination monitoring (IMDSv2/v1), failure-analysis jobs that detect spot/infra markers and emit rerun matrices, conditional on-demand rerun jobs for AOT import and GPU (A10G/T4) tests, and updated test reporting to reflect spot vs on-demand outcomes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GH as "GitHub Actions"
    participant Runner as "Job Runner (spot/on-demand)"
    participant Monitor as "Spot Monitor (background)"
    participant Analyzer as "Analyze Failure Job"
    participant Rerun as "On-demand Rerun Job"
    rect rgba(0,128,0,0.5)
        GH->>Runner: start job (matrix: spot / on-demand)
    end
    Runner->>Monitor: start background spot monitor
    Runner->>GH: run tests, upload logs/artifacts
    Monitor->>GH: emit termination marker (if detected)
    GH->>Analyzer: run analyze job on logs
    Analyzer->>GH: set outputs (is_spot_termination, rerun_matrix)
    alt spot termination detected
        GH->>Rerun: trigger on-demand rerun with rerun_matrix
        Rerun->>Runner: start on-demand runner and execute tests
        Rerun->>GH: upload rerun results
    else no spot termination
        GH->>GH: finalize original job results
    end
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @.github/workflows/pr-test.yml:
- Around lines 163-201: The analyze-aot-failure job and its analyze step treat job logs as plain text and only consider conclusions == "failure". Update the job-level if condition to also cover cancelled runs, e.g. needs.aot-build-import.result == 'failure' || needs.aot-build-import.result == 'cancelled'; extend the FAILED_JOBS jq selection to include .conclusion == "cancelled" (and "timed_out" if desired); and fetch the logs as a ZIP before grepping: save the gh api response (gh api ... > job_log.zip 2>/dev/null), then extract its text with unzip -p job_log.zip > job_log.txt 2>/dev/null || continue before running grep on job_log.txt. Apply the same changes to the other two analyze jobs with IDs/steps analogous to analyze-aot-failure/analyze.
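Applied to the workflow, the suggested changes might look like this sketch of the analyze step (step names, the output variable, and the exact endpoint usage are illustrative, not copied from pr-test.yml; the ZIP handling follows the review's suggestion):

```yaml
- name: Analyze failure
  # Also trigger on cancelled runs, not just plain failures
  if: needs.aot-build-import.result == 'failure' || needs.aot-build-import.result == 'cancelled'
  run: |
    FAILED_JOBS=$(gh api "/repos/${{ github.repository }}/actions/runs/${RUN_ID}/jobs" \
      --jq '.jobs[] | select(.conclusion == "failure" or .conclusion == "cancelled") | .id')
    for JOB_ID in $FAILED_JOBS; do
      # Save the log response as a ZIP, then extract its text before grepping
      gh api "/repos/${{ github.repository }}/actions/jobs/${JOB_ID}/logs" > job_log.zip 2>/dev/null || continue
      unzip -p job_log.zip > job_log.txt 2>/dev/null || continue
      if grep -q "FLASHINFER_SPOT_TERMINATION_DETECTED" job_log.txt; then
        echo "is_spot_termination=true" >> "$GITHUB_OUTPUT"
        break
      fi
    done
```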
- Around lines 118-129: The spot termination monitor's curl call to http://169.254.169.254/latest/meta-data/spot/instance-action uses IMDSv1 only and will fail when IMDSv2 is required. In the "Start spot termination monitor" background loop (and the two other identical monitor blocks), first request an IMDSv2 token from http://169.254.169.254/latest/api/token with a PUT and the X-aws-ec2-metadata-token-ttl-seconds header, storing the token if the request succeeds; then call the metadata endpoint with curl including the header "X-aws-ec2-metadata-token: $TOKEN". If the token request fails (non-2xx or empty), fall back to the existing tokenless IMDSv1 curl behavior so the monitor still works on instances not enforcing IMDSv2.
🧹 Nitpick comments (1)
.github/workflows/pr-test.yml (1)
203-214: Optional: limit the rerun matrix to failed shards/arches to save on-demand capacity. Right now you rerun the full matrix even if only one shard/arch failed. Consider building the rerun matrix from the failed job names returned by the same gh api call to reduce cost and time.

Also applies to: 370-375
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/pr-test.yml:
- Around lines 189-190: The Jobs API call that populates FAILED_JOBS is missing pagination and only returns the first 30 jobs. Update the gh api request (gh api "/repos/${{ github.repository }}/actions/runs/${RUN_ID}/jobs") to include ?per_page=100 so it retrieves up to 100 jobs, and add the same ?per_page=100 to the other identical gh api calls in this workflow (the other analyze-job queries that filter .jobs[] for AOT failures/cancellations) to ensure no failed or cancelled jobs are omitted.
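A sketch of the paginated call (the step name is hypothetical; the endpoint and jq filter follow the comment):

```yaml
- name: Collect failed jobs
  run: |
    # ?per_page=100 raises the page size from the API's default of 30 jobs
    FAILED_JOBS=$(gh api "/repos/${{ github.repository }}/actions/runs/${RUN_ID}/jobs?per_page=100" \
      --jq '.jobs[] | select(.conclusion == "failure" or .conclusion == "cancelled") | .name')
```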
🧹 Nitpick comments (1)
.github/workflows/pr-test.yml (1)
118-137: Consider extracting the spot monitor into a reusable action/script. There are three near-identical monitor loops; keeping them in sync will be error-prone as the detection logic evolves. A small composite action (or shared script in scripts/) would reduce drift and simplify future changes.

Also applies to: 301-320, 470-489
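One way to do this in GitHub Actions is a local composite action that each job references. A minimal sketch, assuming the monitor loop is moved into a script shipped alongside the action (all paths and names here are illustrative):

```yaml
# .github/actions/spot-monitor/action.yml (hypothetical location)
name: Start spot termination monitor
description: Launches the IMDS spot-termination watcher in the background
runs:
  using: composite
  steps:
    - name: Launch monitor
      shell: bash
      # monitor.sh would contain the IMDSv2/v1 polling loop shown elsewhere in this PR
      run: bash "${{ github.action_path }}/monitor.sh" &
```

Each job could then replace its inline loop with a single `- uses: ./.github/actions/spot-monitor` step, so detection-logic changes happen in one place.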
The CI is actually green after the rerun; the workflow is here: https://github.com/flashinfer-ai/flashinfer/actions/runs/21263150552?pr=2403 cc: @yzh119
@flashinfer-bot run
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/pr-test.yml (2)
370-383: Critical: Duplicate YAML keys cause undefined behavior and bypass the authorization check.

Lines 378-380 duplicate the needs, if, and runs-on keys already defined at lines 372-377. YAML parsers typically use the last occurrence, which means:

- needs: [gate, setup] is overwritten by needs: setup, so the gate dependency is lost, allowing unauthorized PRs to potentially run GPU tests
- The multi-line if condition checking needs.gate.outputs.authorized is replaced by a simpler condition
- runs-on loses the original labels

Remove the duplicate keys at lines 378-380 and merge the intended changes into the original definitions.

Proposed fix

```diff
 gpu-tests-a10g:
   name: JIT Unittest ${{ matrix.shard }} (A10G)
-  needs: [gate, setup]
-  if: |
-    needs.gate.outputs.authorized == 'true' &&
-    needs.setup.outputs.skip_build != 'true' &&
-    github.event.inputs.skip_gpu != 'true'
-  runs-on: [self-hosted, Linux, X64, gpu, sm86]
-  needs: setup
-  if: needs.setup.outputs.skip_build != 'true' && github.event.inputs.skip_gpu != 'true'
+  needs: [gate, setup]
+  if: |
+    needs.gate.outputs.authorized == 'true' &&
+    needs.setup.outputs.skip_build != 'true' &&
+    github.event.inputs.skip_gpu != 'true'
   runs-on: [self-hosted, Linux, X64, gpu, sm86, spot]
```
549-559: Critical: Duplicate YAML keys cause undefined behavior and bypass the authorization check.

Same issue as the A10G job: lines 557-559 duplicate the needs, if, and runs-on keys, causing the gate dependency and authorization check to be lost.

Proposed fix

```diff
 gpu-tests-t4:
   name: JIT Unittest (T4)
-  needs: [gate, setup]
-  if: |
-    needs.gate.outputs.authorized == 'true' &&
-    needs.setup.outputs.skip_build != 'true' &&
-    github.event.inputs.skip_gpu != 'true'
-  runs-on: [self-hosted, Linux, X64, gpu, sm75]
-  needs: setup
-  if: needs.setup.outputs.skip_build != 'true' && github.event.inputs.skip_gpu != 'true'
+  needs: [gate, setup]
+  if: |
+    needs.gate.outputs.authorized == 'true' &&
+    needs.setup.outputs.skip_build != 'true' &&
+    github.event.inputs.skip_gpu != 'true'
   runs-on: [self-hosted, Linux, X64, gpu, sm75, spot]
```
@flashinfer-bot stop
@flashinfer-bot rerun
@yongwww looks like even if we rerun the tests, the original tests are still shown as failing in the panel?
.github/workflows/pr-test.yml
Outdated
```shell
(
  while true; do
    # Support both IMDSv2 (token-based) and IMDSv1 (legacy)
    IMDS_URL="http://169.254.169.254/latest/meta-data/spot/instance-action"
    TOKEN=$(curl -s --max-time 2 -X PUT "http://169.254.169.254/latest/api/token" \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null || true)
    if [ -n "$TOKEN" ]; then
      META=$(curl -sf --max-time 2 -H "X-aws-ec2-metadata-token: $TOKEN" "$IMDS_URL" 2>/dev/null || true)
    else
      META=$(curl -sf --max-time 2 "$IMDS_URL" 2>/dev/null || true)
    fi
    if echo "$META" | grep -q "terminate"; then
      echo "::error::FLASHINFER_SPOT_TERMINATION_DETECTED"
      echo "AWS Spot Termination Notice received at $(date)"
      exit 0
    fi
    sleep 5
  done
) &
```
This script and the one used for analyzing failures are repeated a few times across different configurations. To me, the diff would be more readable and the config more maintainable if these were put into a separate string variable, or if these jobs inherited some common config (note: I'm not completely sure how to do that in GitHub Actions, or whether it's possible).
good catch! will resolve this!
Yes, all jobs are shown in the panel, including the originally failed ones, even if a rerun later succeeds. I haven't found a workaround to remove them from the panel yet; this appears to be GitHub behavior. I'll spend some additional time exploring possible alternatives (please suggest any options). Currently, the Test Results Summary is required to pass for merging. The summary should be treated as the indicator of PR health: it turns green if all tests pass, either in the original run or via a successful rerun: https://github.com/flashinfer-ai/flashinfer/actions/runs/21333554813?pr=2403.

Update: since the retry logic runs as a distinct job in the workflow graph, GitHub considers them two separate historical events. The log of job_A failing is immutable and will always be displayed in the workflow visualization and the PR checks list. (by Gemini)
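The pattern described above, a summary job that turns green when either the original run or a successful on-demand rerun passed, might be sketched like this (job names and structure are hypothetical, not taken from pr-test.yml):

```yaml
test-results-summary:
  # Runs even when upstream jobs failed, so it can inspect both outcomes
  if: always()
  needs: [gpu-tests-a10g, rerun-gpu-tests-a10g]
  runs-on: ubuntu-latest
  steps:
    - name: Combine original and rerun results
      run: |
        if [ "${{ needs.gpu-tests-a10g.result }}" = "success" ] || \
           [ "${{ needs.rerun-gpu-tests-a10g.result }}" = "success" ]; then
          echo "Tests green (original run or on-demand rerun)"
        else
          echo "Tests failed in both the original run and the rerun"
          exit 1
        fi
```

Making only this summary job a required check, rather than the individual spot jobs, is what lets the PR merge despite the immutable failed-job entries in the panel.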
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/pr-test.yml:
- Around lines 215-216: Change the workflow steps that start the spot termination monitor to invoke the script via bash rather than executing it directly: replace "./scripts/task_monitor_spot.sh &" with "bash ./scripts/task_monitor_spot.sh &" in the "Start spot termination monitor" steps (and the two other occurrences mentioned), so the script runs regardless of its executable bit.
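The corrected step would look roughly like this (the step name is taken from the comment; the rest of the job context is omitted):

```yaml
- name: Start spot termination monitor
  # bash invocation works even if the script lacks the executable bit
  run: bash ./scripts/task_monitor_spot.sh &
```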
🧹 Nitpick comments (1)
.github/workflows/pr-test.yml (1)
283-295: Consider deriving rerun_matrix from the actual failed jobs. Current matrices rerun all variants even if only a subset failed, which can be costly. Parsing FAILED_JOBS names into a minimal rerun matrix would save capacity.

Also applies to: 451-456
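As an illustration of the idea (the job names and shard-numbering convention here are assumed for the sketch, not taken from pr-test.yml), a minimal rerun matrix could be derived from the failed job names like this:

```shell
# Hypothetical FAILED_JOBS output, one failed job name per line
FAILED_JOBS='JIT Unittest 2 (A10G)
JIT Unittest 4 (A10G)'

# Pull the shard number out of each failed name and build a JSON matrix
SHARDS=$(echo "$FAILED_JOBS" | sed -n 's/JIT Unittest \([0-9]*\) (A10G)/\1/p' | paste -sd, -)
RERUN_MATRIX="{\"shard\":[${SHARDS}]}"
echo "$RERUN_MATRIX"
```

The resulting JSON could be written to `$GITHUB_OUTPUT` and consumed via `fromJSON()` in the rerun job's `strategy.matrix`, so only the failed shards run on on-demand instances.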
```yaml
- uses: docker/login-action@v3
  with:
    username: flashinfer
    password: ${{ secrets.DOCKERHUB_TOKEN }}
```
Why do we have to log in to Docker Hub here? I suppose we only need this when pushing to Docker Hub?
Using docker/login mitigates Docker Hub's pull rate limit of 100 pulls/6hr; for example, we ran into the pull rate limit here: https://github.com/flashinfer-ai/flashinfer/actions/runs/21193948826/job/60965961746. Logging in increases the pull rate limit from 100 pulls/6hr (anonymous) to 200 pulls/6hr (authenticated), reducing the likelihood of rate-limit errors when running concurrent CI jobs.
By the way, we could consider using ECR in the future (faster pulls, and no pull rate limits like Docker Hub's).
I see, that's reasonable. Thanks so much for the explanation.

📌 Description
Recently, we found some workflow jobs failing due to EC2 spot termination, e.g. this job: https://github.com/flashinfer-ai/flashinfer/actions/runs/21222912953/job/61062117663
This PR enhances CI by rerunning failed jobs on on-demand instances when spot termination is detected. How it works:
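At its core, the analyze jobs look for a marker that the background monitor writes into the job log when AWS announces a spot interruption; the detection itself reduces to a grep. A simplified sketch with a synthetic log (the marker string matches the one in this PR; the log content is invented for illustration):

```shell
# Synthetic job log standing in for the text extracted from the downloaded logs
printf 'Running tests...\n::error::FLASHINFER_SPOT_TERMINATION_DETECTED\n' > job_log.txt

# The analyze jobs flag the run as spot-terminated when the marker appears
if grep -q "FLASHINFER_SPOT_TERMINATION_DETECTED" job_log.txt; then
  IS_SPOT_TERMINATION=true
else
  IS_SPOT_TERMINATION=false
fi
echo "is_spot_termination=$IS_SPOT_TERMINATION"
```

When the flag is true, the workflow emits a rerun matrix and triggers the equivalent jobs on on-demand runners instead of spot instances.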
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed pre-commit by running pip install pre-commit (or used your preferred method).
- I have installed the hooks with pre-commit install.
- I have run the hooks with pre-commit run --all-files and fixed any reported issues.

🧪 Tests
- All tests are passing (unittest, etc.).

Reviewer Notes
Summary by CodeRabbit
Chores
Tests
New Features
✏️ Tip: You can customize this high-level summary in your review settings.