
[CI] Add on-demand rerun for spot-terminated jobs#2403

Merged
yzh119 merged 9 commits into flashinfer-ai:main from yongwww:od_fallback
Jan 30, 2026

Conversation


@yongwww yongwww commented Jan 22, 2026

📌 Description

Recently, we found some workflow jobs failing due to EC2 Spot termination, e.g. this job: https://github.com/flashinfer-ai/flashinfer/actions/runs/21222912953/job/61062117663

This PR enhances CI by rerunning failed jobs on on-demand instances when spot termination is detected. How it works:

  • Spot jobs run a background monitor checking AWS metadata for termination notice
  • If termination is detected, a marker is written to the job log
  • Analyze jobs check logs via GitHub API for the marker
  • If spot termination: rerun all failed/cancelled jobs on on-demand
  • If real test failure: no rerun, workflow fails fast
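The marker-based detection described above can be sketched in shell. The marker string matches the monitor script quoted later in this thread; the sample log content here is invented for illustration, and in CI the log would be fetched via the GitHub API.

```shell
# Analyze-side check: does the failed job's log contain the spot marker?
# JOB_LOG is a made-up sample standing in for a log fetched from GitHub.
MARKER="FLASHINFER_SPOT_TERMINATION_DETECTED"

JOB_LOG='::error::FLASHINFER_SPOT_TERMINATION_DETECTED
AWS Spot Termination Notice received'

if printf '%s\n' "$JOB_LOG" | grep -q "$MARKER"; then
  IS_SPOT_TERMINATION=true   # rerun failed/cancelled jobs on on-demand
else
  IS_SPOT_TERMINATION=false  # real failure: fail fast, no rerun
fi
echo "is_spot_termination=$IS_SPOT_TERMINATION"
```

In the workflow, the resulting flag would be exposed as a job output that the rerun jobs condition on.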

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or using my preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Chores

    • Enhanced CI to detect spot-instance terminations, start spot monitoring, mark affected runs, add cleanup/diagnostic steps, and permit on‑demand reruns across AOT and GPU paths (A10G, T4).
    • Exposed analysis outputs and adjusted workflow permissions to orchestrate conditional reruns.
  • Tests

    • Added automated failure analysis and on‑demand rerun paths for AOT and GPU tests.
    • Improved test reporting to show Skipped, Passed (spot), Passed (on‑demand rerun), and Failed.
  • New Features

    • Added a runtime spot-termination monitor that watches for instance termination notices and signals CI for timely reruns.


@gemini-code-assist (Contributor)

Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.


coderabbitai bot commented Jan 22, 2026

📝 Walkthrough

Walkthrough

Adds spot-aware rerun logic to PR CI: background spot termination monitoring (IMDSv2/v1), failure-analysis jobs that detect spot/infra markers and emit rerun matrices, conditional on‑demand rerun jobs for AOT import and GPU (A10G/T4) tests, and updated test reporting to reflect spot vs on‑demand outcomes.

Changes

Cohort / File(s) Summary
Primary CI workflow
/.github/workflows/pr-test.yml
Added spot-monitor start/stop steps, new analysis jobs (analyze-*), public outputs (is_spot_termination, rerun_matrix), on‑demand rerun jobs for AOT, A10G, and T4, adjusted matrices/concurrency/permissions, and updated Test Results Summary with check_status.
Spot monitor script
scripts/task_monitor_spot.sh
New script that polls EC2 IMDS (IMDSv2 with token, fallback to IMDSv1) for Spot termination notices, logs and annotates detection, and exits to mark termination. Robust curl handling and 5s polling loop.
Rerun analysis & orchestration (co-located)
/.github/workflows/pr-test.yml
Introduced analyze jobs that inspect artifacts/logs and produce is_spot_termination and rerun_matrix; added rerun jobs (aot-build-import-rerun, gpu-tests-a10g-rerun, gpu-tests-t4-rerun) that execute conditionally from those matrices.

Sequence Diagram(s)

sequenceDiagram
    participant GH as "GitHub Actions"
    participant Runner as "Job Runner (spot/on‑demand)"
    participant Monitor as "Spot Monitor (background)"
    participant Analyzer as "Analyze Failure Job"
    participant Rerun as "On‑demand Rerun Job"

    rect rgba(0,128,0,0.5)
    GH->>Runner: start job (matrix: spot / on‑demand)
    end

    Runner->>Monitor: start background spot monitor
    Runner->>GH: run tests, upload logs/artifacts
    Monitor->>GH: emit termination marker (if detected)
    GH->>Analyzer: run analyze job on logs
    Analyzer->>GH: set outputs (is_spot_termination, rerun_matrix)
    alt spot termination detected
        GH->>Rerun: trigger on‑demand rerun with rerun_matrix
        Rerun->>Runner: start on‑demand runner and execute tests
        Rerun->>GH: upload rerun results
    else no spot termination
        GH->>GH: finalize original job results
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • jimmyzho
  • yzh119

Poem

🐰 I nibble logs beneath the moonlit sky,
A little monitor that watches by and by.
When spot instances scamper and flee,
I trumpet a rerun to set tests free,
Then hop away to munch a carrot pie. 🥕

🚥 Pre-merge checks: ✅ 3 passed
  • Title check (✅ Passed): The title '[CI] Add on-demand rerun for spot-terminated jobs' clearly and concisely describes the main change: adding a rerun mechanism for CI jobs that fail due to spot termination.
  • Description check (✅ Passed): The description provides a comprehensive explanation of the changes, including the problem (spot termination failures), the solution approach, and how the implementation works. All key template sections are addressed.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



@yongwww yongwww marked this pull request as ready for review January 22, 2026 19:03
@yongwww yongwww marked this pull request as draft January 22, 2026 19:09

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.github/workflows/pr-test.yml:
- Around line 163-201: The analyze-aot-failure job and the analyze step
currently treat job logs as plain text and only consider conclusions ==
"failure", so update the job-level if condition and the jq filter to include
canceled (and optionally timed_out) results, and fetch/unzip logs before
grepping: change the workflow job condition from needs.aot-build-import.result
== 'failure' to include 'cancelled' (e.g., needs.aot-build-import.result ==
'failure' || needs.aot-build-import.result == 'cancelled'), update the
FAILED_JOBS jq selection in the analyze step to include .conclusion ==
"cancelled" (and "timed_out" if desired), and replace the gh api logs fetch/grep
logic to save the response as a ZIP (e.g., gh api ... > job_log.zip 2>/dev/null)
and then unzip its text with unzip -p job_log.zip > job_log.txt 2>/dev/null ||
continue before running grep on job_log.txt; apply the same changes to the other
two analyze jobs with IDs/steps analogous to analyze-aot-failure/analyze.
- Around line 118-129: The spot termination monitor's curl call to
http://169.254.169.254/latest/meta-data/spot/instance-action uses IMDSv1 only
and will fail when IMDSv2 is required; modify the monitor (the "Start spot
termination monitor" background loop and the two other identical monitor blocks)
to first request an IMDSv2 token from http://169.254.169.254/latest/api/token
with PUT and header X-aws-ec2-metadata-token-ttl-seconds, store that token if
successful, then call the metadata endpoint with curl including header
"X-aws-ec2-metadata-token: $TOKEN"; if the token request fails (non-2xx or
empty), fall back to the existing tokenless IMDSv1 curl behavior so monitors
still work on instances not enforcing IMDSv2.
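The first fix above (widening the rerun test from failure-only conclusions) can be sketched as below. The job names and conclusions are hardcoded examples; in CI the name/conclusion pairs would come from the jq filter over the Jobs API response.

```shell
# Treat cancelled and timed_out like failure when deciding what to rerun.
needs_rerun() {
  case "$1" in
    failure|cancelled|timed_out) return 0 ;;
    *) return 1 ;;
  esac
}

RERUN_LIST=""
for JOB in "aot-build-import:failure" "gpu-tests-a10g:cancelled" "gpu-tests-t4:success"; do
  NAME="${JOB%%:*}"
  CONCLUSION="${JOB##*:}"
  if needs_rerun "$CONCLUSION"; then
    RERUN_LIST="$RERUN_LIST $NAME"
  fi
done
echo "jobs to rerun:$RERUN_LIST"
```

With the sample data, only the failed and cancelled jobs are selected; the successful T4 job is left alone.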
🧹 Nitpick comments (1)
.github/workflows/pr-test.yml (1)

203-214: Optional: limit rerun matrix to failed shards/arches to save on‑demand capacity.

Right now you rerun the full matrix even if only one shard/arch failed. Consider building the rerun matrix from the failed job names returned by the same gh api call to reduce cost and time.

Also applies to: 370-375

@yongwww yongwww marked this pull request as ready for review January 22, 2026 19:34

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/pr-test.yml:
- Around line 189-190: The Jobs API call that populates FAILED_JOBS is missing
pagination and only returns the first 30 jobs; update the gh api request string
used to set FAILED_JOBS (the command using gh api "/repos/${{ github.repository
}}/actions/runs/${RUN_ID}/jobs") to include ?per_page=100 so it retrieves up to
100 jobs, and apply the same ?per_page=100 addition to the other identical gh
api calls in this workflow (the other analyze job queries that filter .jobs[]
for AOT failures/cancellations) to ensure no failed/cancelled jobs are omitted.
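A sketch of the paginated request suggested above; the repository path is real, but the RUN_ID is a placeholder taken from elsewhere in this thread.

```shell
# Without ?per_page=100, the Jobs API returns only the first 30 jobs,
# so a failed shard beyond the first page would be invisible to the
# analyze step. RUN_ID is a placeholder value.
REPO="flashinfer-ai/flashinfer"
RUN_ID="21263150552"
JOBS_URL="/repos/${REPO}/actions/runs/${RUN_ID}/jobs?per_page=100"

# In the workflow this would be consumed as: gh api "$JOBS_URL" --jq '...'
echo "$JOBS_URL"
```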
🧹 Nitpick comments (1)
.github/workflows/pr-test.yml (1)

118-137: Consider extracting the spot monitor into a reusable action/script.

There are three near-identical monitor loops; keeping them in sync will be error-prone as the detection logic evolves. A small composite action (or shared script in scripts/) would reduce drift and simplify future changes.

Also applies to: 301-320, 470-489

@yongwww

yongwww commented Jan 23, 2026

The CI is actually green after the re-run; the workflow is here: https://github.com/flashinfer-ai/flashinfer/actions/runs/21263150552?pr=2403
(screenshot: latest_workflow)

cc: @yzh119


@yzh119 yzh119 left a comment


Hi @yongwww, can you resolve the conflicts with the main branch?

@yongwww

yongwww commented Jan 25, 2026

@flashinfer-bot run


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
.github/workflows/pr-test.yml (2)

370-383: Critical: Duplicate YAML keys cause undefined behavior and bypass authorization check.

Lines 378-380 duplicate the needs, if, and runs-on keys already defined at lines 372-377. YAML parsers typically use the last occurrence, which means:

  1. needs: [gate, setup] is overwritten by needs: setup, so the gate dependency is lost, allowing unauthorized PRs to potentially run GPU tests
  2. The multi-line if condition checking needs.gate.outputs.authorized is replaced by a simpler condition
  3. runs-on loses the original labels

Remove the duplicate keys at lines 378-380 and merge the intended changes into the original definitions.

Proposed fix
   gpu-tests-a10g:
     name: JIT Unittest ${{ matrix.shard }} (A10G)
-    needs: [gate, setup]
-    if: |
-      needs.gate.outputs.authorized == 'true' &&
-      needs.setup.outputs.skip_build != 'true' &&
-      github.event.inputs.skip_gpu != 'true'
-    runs-on: [self-hosted, Linux, X64, gpu, sm86]
     needs: setup
-    if: needs.setup.outputs.skip_build != 'true' && github.event.inputs.skip_gpu != 'true'
+    needs: [gate, setup]
+    if: |
+      needs.gate.outputs.authorized == 'true' &&
+      needs.setup.outputs.skip_build != 'true' &&
+      github.event.inputs.skip_gpu != 'true'
     runs-on: [self-hosted, Linux, X64, gpu, sm86, spot]

549-559: Critical: Duplicate YAML keys cause undefined behavior and bypass authorization check.

Same issue as the A10G job — lines 557-559 duplicate the needs, if, and runs-on keys, causing the gate dependency and authorization check to be lost.

Proposed fix
   gpu-tests-t4:
     name: JIT Unittest (T4)
-    needs: [gate, setup]
-    if: |
-      needs.gate.outputs.authorized == 'true' &&
-      needs.setup.outputs.skip_build != 'true' &&
-      github.event.inputs.skip_gpu != 'true'
-    runs-on: [self-hosted, Linux, X64, gpu, sm75]
-    needs: setup
-    if: needs.setup.outputs.skip_build != 'true' && github.event.inputs.skip_gpu != 'true'
+    needs: [gate, setup]
+    if: |
+      needs.gate.outputs.authorized == 'true' &&
+      needs.setup.outputs.skip_build != 'true' &&
+      github.event.inputs.skip_gpu != 'true'
     runs-on: [self-hosted, Linux, X64, gpu, sm75, spot]

@yongwww

yongwww commented Jan 25, 2026

@flashinfer-bot stop

@yongwww

yongwww commented Jan 25, 2026

@flashinfer-bot rerun

@yzh119

yzh119 commented Jan 25, 2026

@yongwww looks like even if we rerun the tests, the original tests are still shown as failing in the panel?


@dierksen dierksen left a comment


nit on repetition

Comment on lines +202 to +220
(
  while true; do
    # Support both IMDSv2 (token-based) and IMDSv1 (legacy)
    IMDS_URL="http://169.254.169.254/latest/meta-data/spot/instance-action"
    TOKEN=$(curl -s --max-time 2 -X PUT "http://169.254.169.254/latest/api/token" \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null || true)
    if [ -n "$TOKEN" ]; then
      META=$(curl -sf --max-time 2 -H "X-aws-ec2-metadata-token: $TOKEN" "$IMDS_URL" 2>/dev/null || true)
    else
      META=$(curl -sf --max-time 2 "$IMDS_URL" 2>/dev/null || true)
    fi
    if echo "$META" | grep -q "terminate"; then
      echo "::error::FLASHINFER_SPOT_TERMINATION_DETECTED"
      echo "AWS Spot Termination Notice received at $(date)"
      exit 0
    fi
    sleep 5
  done
) &


This script and the one used for analyzing failures are repeated a few times across different configurations. To me, it would make the diff more readable and make the config more maintainable to put these into a separate string variable or to make these jobs inherit some common config (note: I'm not completely sure how to do that in github actions, or if it's possible).



good catch! will resolve this!



fixed in 0010d47, pls take a look

@yongwww

yongwww commented Jan 26, 2026

@yongwww looks like even if we rerun the tests, the original tests are still shown as failing in the panel?

Yes, all jobs are shown in the panel, including the originally failed ones, even if a rerun later succeeds. I haven't found a workaround to remove them from the panel yet; this appears to be GitHub behavior. I'll spend some additional time exploring possible alternatives (please suggest any options).

Currently, the Test Results Summary is required to pass for merging. The summary should be treated as the indicator of PR health: it turns green if all tests pass, either in the original run or via a successful rerun: https://github.com/flashinfer-ai/flashinfer/actions/runs/21333554813?pr=2403.


update:

Because the retry logic runs as a distinct job in the workflow graph, GitHub considers them two separate historical events. The log of job_A failing is immutable and will always be displayed in the workflow visualization and the PR checks list. (by Gemini)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/pr-test.yml:
- Around line 215-216: Change the workflow steps that start the spot termination
monitor to invoke the script via bash instead of attempting to execute it
directly: replace uses of "./scripts/task_monitor_spot.sh &" with "bash
./scripts/task_monitor_spot.sh &" in the steps named "Start spot termination
monitor" (and the two other occurrences mentioned) so the script runs regardless
of file executable bit; target the workflow step entries that reference
./scripts/task_monitor_spot.sh to apply this update.
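The fix above rests on a property of interpreter invocation that is easy to demonstrate locally; the temp script below stands in for scripts/task_monitor_spot.sh.

```shell
# "bash script.sh" works even when the checkout lost the executable bit.
TMP_SCRIPT=$(mktemp)
printf 'echo monitor-started\n' > "$TMP_SCRIPT"
chmod a-x "$TMP_SCRIPT"       # simulate a file without the exec bit

# "$TMP_SCRIPT" &             # direct execution would fail: Permission denied
OUT=$(bash "$TMP_SCRIPT")     # interpreter invocation ignores the file mode
rm -f "$TMP_SCRIPT"
echo "$OUT"
```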
🧹 Nitpick comments (1)
.github/workflows/pr-test.yml (1)

283-295: Consider deriving rerun_matrix from the actual failed jobs.

Current matrices rerun all variants even if only a subset failed, which can be costly. Parsing FAILED_JOBS names into a minimal rerun matrix would save capacity.

Also applies to: 451-456
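As a sketch of that suggestion, one could parse shard numbers out of the failed job names. The name pattern follows the workflow's "JIT Unittest <shard> (A10G)" convention; the failed-job list here is a hardcoded example of what the Jobs API filter might return.

```shell
# Build a minimal rerun matrix from failed job names rather than
# rerunning every shard.
FAILED_JOBS='JIT Unittest 2 (A10G)
JIT Unittest 5 (A10G)'

SHARDS=""
while IFS= read -r NAME; do
  SHARD=$(printf '%s\n' "$NAME" | sed -n 's/^JIT Unittest \([0-9][0-9]*\) (A10G)$/\1/p')
  if [ -n "$SHARD" ]; then
    SHARDS="$SHARDS $SHARD"
  fi
done <<EOF
$FAILED_JOBS
EOF
echo "rerun shards:$SHARDS"
```

The extracted shard list could then be serialized into the rerun_matrix output instead of the full matrix.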

      uses: docker/login-action@v3
      with:
        username: flashinfer
        password: ${{ secrets.DOCKERHUB_TOKEN }}


why do we have to login to docker hub here? I suppose we only need this when pushing to dockerhub?



Using docker/login is to mitigate the rate limit of 100 pulls/6hr (for example, we ran into the Docker pull rate limit: https://github.com/flashinfer-ai/flashinfer/actions/runs/21193948826/job/60965961746). The login increases the pull rate limit from 100 pulls/6hr (anonymous) to 200 pulls/6hr (authenticated), reducing the likelihood of rate limit errors when running concurrent CI jobs.

REF: https://docs.docker.com/docker-hub/usage/



By the way, we could consider using ECR in the future (faster pulls and no pull rate limits like dockerhub)



I see, that's reasonable. Thanks so much for the explanation.

@yzh119 yzh119 merged commit 1c2cc29 into flashinfer-ai:main Jan 30, 2026
67 of 87 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Feb 12, 2026