
[CI] Add on-demand rerun for spot-terminated jobs#2403

Merged
yzh119 merged 9 commits into flashinfer-ai:main from yongwww:od_fallback
Jan 30, 2026

Conversation


@yongwww yongwww commented Jan 22, 2026

📌 Description

Recently, we found some workflow jobs failing due to EC2 Spot termination, e.g. this job: https://github.com/flashinfer-ai/flashinfer/actions/runs/21222912953/job/61062117663

This PR enhances CI by rerunning failed jobs on on-demand instances when spot termination is detected. How it works:

  • Spot jobs run a background monitor checking AWS metadata for termination notice
  • If termination is detected, a marker is written to the job log
  • Analyze jobs check logs via GitHub API for the marker
  • If spot termination: rerun all failed/cancelled jobs on on-demand
  • If real test failure: no rerun, workflow fails fast
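The marker-based detection described above can be sketched in shell. The marker string matches the monitor script quoted later in this thread; the sample log content here is invented for illustration, and in CI the log would be fetched via the GitHub API.

```shell
# Analyze-side check: does the failed job's log contain the spot marker?
# JOB_LOG is a made-up sample standing in for a log fetched from GitHub.
MARKER="FLASHINFER_SPOT_TERMINATION_DETECTED"

JOB_LOG='::error::FLASHINFER_SPOT_TERMINATION_DETECTED
AWS Spot Termination Notice received'

if printf '%s\n' "$JOB_LOG" | grep -q "$MARKER"; then
  IS_SPOT_TERMINATION=true   # rerun failed/cancelled jobs on on-demand
else
  IS_SPOT_TERMINATION=false  # real failure: fail fast, no rerun
fi
echo "is_spot_termination=$IS_SPOT_TERMINATION"
```

In the workflow, the resulting flag would be exposed as a job output that the rerun jobs condition on.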

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or using my preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Chores

    • Enhanced CI to detect spot-instance terminations, start spot monitoring, mark affected runs, add cleanup/diagnostic steps, and permit on‑demand reruns across AOT and GPU paths (A10G, T4).
    • Exposed analysis outputs and adjusted workflow permissions to orchestrate conditional reruns.
  • Tests

    • Added automated failure analysis and on‑demand rerun paths for AOT and GPU tests.
    • Improved test reporting to show Skipped, Passed (spot), Passed (on‑demand rerun), and Failed.
  • New Features

    • Added a runtime spot-termination monitor that watches for instance termination notices and signals CI for timely reruns.


@gemini-code-assist (Contributor)

Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.


coderabbitai bot commented Jan 22, 2026

📝 Walkthrough

Walkthrough

Adds spot-aware rerun logic to PR CI: background spot termination monitoring (IMDSv2/v1), failure-analysis jobs that detect spot/infra markers and emit rerun matrices, conditional on‑demand rerun jobs for AOT import and GPU (A10G/T4) tests, and updated test reporting to reflect spot vs on‑demand outcomes.

Changes

Cohort / File(s) Summary
Primary CI workflow
/.github/workflows/pr-test.yml
Added spot-monitor start/stop steps, new analysis jobs (analyze-*), public outputs (is_spot_termination, rerun_matrix), on‑demand rerun jobs for AOT, A10G, and T4, adjusted matrices/concurrency/permissions, and updated Test Results Summary with check_status.
Spot monitor script
scripts/task_monitor_spot.sh
New script that polls EC2 IMDS (IMDSv2 with token, fallback to IMDSv1) for Spot termination notices, logs and annotates detection, and exits to mark termination. Robust curl handling and 5s polling loop.
Rerun analysis & orchestration (co-located)
/.github/workflows/pr-test.yml
Introduced analyze jobs that inspect artifacts/logs and produce is_spot_termination and rerun_matrix; added rerun jobs (aot-build-import-rerun, gpu-tests-a10g-rerun, gpu-tests-t4-rerun) that execute conditionally from those matrices.

Sequence Diagram(s)

sequenceDiagram
    participant GH as "GitHub Actions"
    participant Runner as "Job Runner (spot/on‑demand)"
    participant Monitor as "Spot Monitor (background)"
    participant Analyzer as "Analyze Failure Job"
    participant Rerun as "On‑demand Rerun Job"

    rect rgba(0,128,0,0.5)
    GH->>Runner: start job (matrix: spot / on‑demand)
    end

    Runner->>Monitor: start background spot monitor
    Runner->>GH: run tests, upload logs/artifacts
    Monitor->>GH: emit termination marker (if detected)
    GH->>Analyzer: run analyze job on logs
    Analyzer->>GH: set outputs (is_spot_termination, rerun_matrix)
    alt spot termination detected
        GH->>Rerun: trigger on‑demand rerun with rerun_matrix
        Rerun->>Runner: start on‑demand runner and execute tests
        Rerun->>GH: upload rerun results
    else no spot termination
        GH->>GH: finalize original job results
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • jimmyzho
  • yzh119

Poem

🐰 I nibble logs beneath the moonlit sky,
A little monitor that watches by and by.
When spot instances scamper and flee,
I trumpet a rerun to set tests free,
Then hop away to munch a carrot pie. 🥕

🚥 Pre-merge checks: ✅ 3 passed
  • Title check (✅ Passed): The title '[CI] Add on-demand rerun for spot-terminated jobs' clearly and concisely describes the main change: adding a rerun mechanism for CI jobs that fail due to spot termination.
  • Description check (✅ Passed): The description provides a comprehensive explanation of the changes, including the problem (spot termination failures), the solution approach, and how the implementation works. All key template sections are addressed.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



@yongwww yongwww marked this pull request as ready for review January 22, 2026 19:03
@yongwww yongwww marked this pull request as draft January 22, 2026 19:09

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.github/workflows/pr-test.yml:
- Around line 163-201: The analyze-aot-failure job and the analyze step
currently treat job logs as plain text and only consider conclusions ==
"failure", so update the job-level if condition and the jq filter to include
canceled (and optionally timed_out) results, and fetch/unzip logs before
grepping: change the workflow job condition from needs.aot-build-import.result
== 'failure' to include 'cancelled' (e.g., needs.aot-build-import.result ==
'failure' || needs.aot-build-import.result == 'cancelled'), update the
FAILED_JOBS jq selection in the analyze step to include .conclusion ==
"cancelled" (and "timed_out" if desired), and replace the gh api logs fetch/grep
logic to save the response as a ZIP (e.g., gh api ... > job_log.zip 2>/dev/null)
and then unzip its text with unzip -p job_log.zip > job_log.txt 2>/dev/null ||
continue before running grep on job_log.txt; apply the same changes to the other
two analyze jobs with IDs/steps analogous to analyze-aot-failure/analyze.
- Around line 118-129: The spot termination monitor's curl call to
http://169.254.169.254/latest/meta-data/spot/instance-action uses IMDSv1 only
and will fail when IMDSv2 is required; modify the monitor (the "Start spot
termination monitor" background loop and the two other identical monitor blocks)
to first request an IMDSv2 token from http://169.254.169.254/latest/api/token
with PUT and header X-aws-ec2-metadata-token-ttl-seconds, store that token if
successful, then call the metadata endpoint with curl including header
"X-aws-ec2-metadata-token: $TOKEN"; if the token request fails (non-2xx or
empty), fall back to the existing tokenless IMDSv1 curl behavior so monitors
still work on instances not enforcing IMDSv2.
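The first fix above (widening the rerun test from failure-only conclusions) can be sketched as below. The job names and conclusions are hardcoded examples; in CI the name/conclusion pairs would come from the jq filter over the Jobs API response.

```shell
# Treat cancelled and timed_out like failure when deciding what to rerun.
needs_rerun() {
  case "$1" in
    failure|cancelled|timed_out) return 0 ;;
    *) return 1 ;;
  esac
}

RERUN_LIST=""
for JOB in "aot-build-import:failure" "gpu-tests-a10g:cancelled" "gpu-tests-t4:success"; do
  NAME="${JOB%%:*}"
  CONCLUSION="${JOB##*:}"
  if needs_rerun "$CONCLUSION"; then
    RERUN_LIST="$RERUN_LIST $NAME"
  fi
done
echo "jobs to rerun:$RERUN_LIST"
```

With the sample data, only the failed and cancelled jobs are selected; the successful T4 job is left alone.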
🧹 Nitpick comments (1)
.github/workflows/pr-test.yml (1)

203-214: Optional: limit rerun matrix to failed shards/arches to save on‑demand capacity.

Right now you rerun the full matrix even if only one shard/arch failed. Consider building the rerun matrix from the failed job names returned by the same gh api call to reduce cost and time.

Also applies to: 370-375

@yongwww yongwww marked this pull request as ready for review January 22, 2026 19:34

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/pr-test.yml:
- Around line 189-190: The Jobs API call that populates FAILED_JOBS is missing
pagination and only returns the first 30 jobs; update the gh api request string
used to set FAILED_JOBS (the command using gh api "/repos/${{ github.repository
}}/actions/runs/${RUN_ID}/jobs") to include ?per_page=100 so it retrieves up to
100 jobs, and apply the same ?per_page=100 addition to the other identical gh
api calls in this workflow (the other analyze job queries that filter .jobs[]
for AOT failures/cancellations) to ensure no failed/cancelled jobs are omitted.
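A sketch of the paginated request suggested above; the repository path is real, but the RUN_ID is a placeholder taken from elsewhere in this thread.

```shell
# Without ?per_page=100, the Jobs API returns only the first 30 jobs,
# so a failed shard beyond the first page would be invisible to the
# analyze step. RUN_ID is a placeholder value.
REPO="flashinfer-ai/flashinfer"
RUN_ID="21263150552"
JOBS_URL="/repos/${REPO}/actions/runs/${RUN_ID}/jobs?per_page=100"

# In the workflow this would be consumed as: gh api "$JOBS_URL" --jq '...'
echo "$JOBS_URL"
```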
🧹 Nitpick comments (1)
.github/workflows/pr-test.yml (1)

118-137: Consider extracting the spot monitor into a reusable action/script.

There are three near-identical monitor loops; keeping them in sync will be error-prone as the detection logic evolves. A small composite action (or shared script in scripts/) would reduce drift and simplify future changes.

Also applies to: 301-320, 470-489

@yongwww

yongwww commented Jan 23, 2026

The CI is actually green after the re-run; the workflow is here: https://github.com/flashinfer-ai/flashinfer/actions/runs/21263150552?pr=2403
(screenshot: latest_workflow)

cc: @yzh119


@yzh119 yzh119 left a comment


Hi @yongwww, can you resolve the conflicts with the main branch?

@yongwww

yongwww commented Jan 25, 2026

@flashinfer-bot run


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
.github/workflows/pr-test.yml (2)

370-383: Critical: Duplicate YAML keys cause undefined behavior and bypass authorization check.

Lines 378-380 duplicate the needs, if, and runs-on keys already defined at lines 372-377. YAML parsers typically use the last occurrence, which means:

  1. needs: [gate, setup] is overwritten by needs: setup, so the gate dependency is lost, allowing unauthorized PRs to potentially run GPU tests
  2. The multi-line if condition checking needs.gate.outputs.authorized is replaced by a simpler condition
  3. runs-on loses the original labels

Remove the duplicate keys at lines 378-380 and merge the intended changes into the original definitions.

Proposed fix
   gpu-tests-a10g:
     name: JIT Unittest ${{ matrix.shard }} (A10G)
-    needs: [gate, setup]
-    if: |
-      needs.gate.outputs.authorized == 'true' &&
-      needs.setup.outputs.skip_build != 'true' &&
-      github.event.inputs.skip_gpu != 'true'
-    runs-on: [self-hosted, Linux, X64, gpu, sm86]
     needs: setup
-    if: needs.setup.outputs.skip_build != 'true' && github.event.inputs.skip_gpu != 'true'
+    needs: [gate, setup]
+    if: |
+      needs.gate.outputs.authorized == 'true' &&
+      needs.setup.outputs.skip_build != 'true' &&
+      github.event.inputs.skip_gpu != 'true'
     runs-on: [self-hosted, Linux, X64, gpu, sm86, spot]

549-559: Critical: Duplicate YAML keys cause undefined behavior and bypass authorization check.

Same issue as the A10G job — lines 557-559 duplicate the needs, if, and runs-on keys, causing the gate dependency and authorization check to be lost.

Proposed fix
   gpu-tests-t4:
     name: JIT Unittest (T4)
-    needs: [gate, setup]
-    if: |
-      needs.gate.outputs.authorized == 'true' &&
-      needs.setup.outputs.skip_build != 'true' &&
-      github.event.inputs.skip_gpu != 'true'
-    runs-on: [self-hosted, Linux, X64, gpu, sm75]
-    needs: setup
-    if: needs.setup.outputs.skip_build != 'true' && github.event.inputs.skip_gpu != 'true'
+    needs: [gate, setup]
+    if: |
+      needs.gate.outputs.authorized == 'true' &&
+      needs.setup.outputs.skip_build != 'true' &&
+      github.event.inputs.skip_gpu != 'true'
     runs-on: [self-hosted, Linux, X64, gpu, sm75, spot]

@yongwww

yongwww commented Jan 25, 2026

@flashinfer-bot stop

@yongwww

yongwww commented Jan 25, 2026

@flashinfer-bot rerun

@yzh119

yzh119 commented Jan 25, 2026

@yongwww looks like even if we rerun the tests, the original tests are still shown as failing in the panel?


@dierksen dierksen left a comment


nit on repetition

Comment on lines +202 to +220
(
  while true; do
    # Support both IMDSv2 (token-based) and IMDSv1 (legacy)
    IMDS_URL="http://169.254.169.254/latest/meta-data/spot/instance-action"
    TOKEN=$(curl -s --max-time 2 -X PUT "http://169.254.169.254/latest/api/token" \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null || true)
    if [ -n "$TOKEN" ]; then
      META=$(curl -sf --max-time 2 -H "X-aws-ec2-metadata-token: $TOKEN" "$IMDS_URL" 2>/dev/null || true)
    else
      META=$(curl -sf --max-time 2 "$IMDS_URL" 2>/dev/null || true)
    fi
    if echo "$META" | grep -q "terminate"; then
      echo "::error::FLASHINFER_SPOT_TERMINATION_DETECTED"
      echo "AWS Spot Termination Notice received at $(date)"
      exit 0
    fi
    sleep 5
  done
) &


This script and the one used for analyzing failures are repeated a few times across different configurations. To me, it would make the diff more readable and make the config more maintainable to put these into a separate string variable or to make these jobs inherit some common config (note: I'm not completely sure how to do that in github actions, or if it's possible).



good catch! will resolve this!



fixed in 0010d47, pls take a look

@yongwww

yongwww commented Jan 26, 2026

@yongwww looks like even if we rerun the tests, the original tests are still shown as failing in the panel?

Yes, all jobs are shown in the panel, including the originally failed ones, even if a rerun later succeeds. I haven't found a workaround to remove them from the panel yet; this appears to be GitHub behavior. I'll spend some additional time exploring possible alternatives (please suggest any options).

Currently, the Test Results Summary is required to pass for merging. The summary should be treated as the indicator of PR health: it turns green if all tests pass, either in the original run or via a successful rerun: https://github.com/flashinfer-ai/flashinfer/actions/runs/21333554813?pr=2403.


update:

Because the retry logic runs as a distinct job in the workflow graph, GitHub considers them two separate historical events. The log of job_A failing is immutable and will always be displayed in the workflow visualization and the PR checks list. (by Gemini)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/pr-test.yml:
- Around line 215-216: Change the workflow steps that start the spot termination
monitor to invoke the script via bash instead of attempting to execute it
directly: replace uses of "./scripts/task_monitor_spot.sh &" with "bash
./scripts/task_monitor_spot.sh &" in the steps named "Start spot termination
monitor" (and the two other occurrences mentioned) so the script runs regardless
of file executable bit; target the workflow step entries that reference
./scripts/task_monitor_spot.sh to apply this update.
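The fix above rests on a property of interpreter invocation that is easy to demonstrate locally; the temp script below stands in for scripts/task_monitor_spot.sh.

```shell
# "bash script.sh" works even when the checkout lost the executable bit.
TMP_SCRIPT=$(mktemp)
printf 'echo monitor-started\n' > "$TMP_SCRIPT"
chmod a-x "$TMP_SCRIPT"       # simulate a file without the exec bit

# "$TMP_SCRIPT" &             # direct execution would fail: Permission denied
OUT=$(bash "$TMP_SCRIPT")     # interpreter invocation ignores the file mode
rm -f "$TMP_SCRIPT"
echo "$OUT"
```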
🧹 Nitpick comments (1)
.github/workflows/pr-test.yml (1)

283-295: Consider deriving rerun_matrix from the actual failed jobs.

Current matrices rerun all variants even if only a subset failed, which can be costly. Parsing FAILED_JOBS names into a minimal rerun matrix would save capacity.

Also applies to: 451-456
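As a sketch of that suggestion, one could parse shard numbers out of the failed job names. The name pattern follows the workflow's "JIT Unittest <shard> (A10G)" convention; the failed-job list here is a hardcoded example of what the Jobs API filter might return.

```shell
# Build a minimal rerun matrix from failed job names rather than
# rerunning every shard.
FAILED_JOBS='JIT Unittest 2 (A10G)
JIT Unittest 5 (A10G)'

SHARDS=""
while IFS= read -r NAME; do
  SHARD=$(printf '%s\n' "$NAME" | sed -n 's/^JIT Unittest \([0-9][0-9]*\) (A10G)$/\1/p')
  if [ -n "$SHARD" ]; then
    SHARDS="$SHARDS $SHARD"
  fi
done <<EOF
$FAILED_JOBS
EOF
echo "rerun shards:$SHARDS"
```

The extracted shard list could then be serialized into the rerun_matrix output instead of the full matrix.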

      uses: docker/login-action@v3
      with:
        username: flashinfer
        password: ${{ secrets.DOCKERHUB_TOKEN }}


why do we have to login to docker hub here? I suppose we only need this when pushing to dockerhub?



Using docker/login is to mitigate the rate limit of 100 pulls/6hr (for example, we ran into the Docker pull rate limit: https://github.com/flashinfer-ai/flashinfer/actions/runs/21193948826/job/60965961746). The login increases the pull rate limit from 100 pulls/6hr (anonymous) to 200 pulls/6hr (authenticated), reducing the likelihood of rate limit errors when running concurrent CI jobs.

REF: https://docs.docker.com/docker-hub/usage/



By the way, we could consider using ECR in the future (faster pulls and no pull rate limits like dockerhub)



I see, that's reasonable. Thanks so much for the explanation.

@yzh119 yzh119 merged commit 1c2cc29 into flashinfer-ai:main Jan 30, 2026
67 of 87 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Feb 12, 2026