Conversation
- Check job metadata/annotations for "operation was canceled" errors
- Treat failed log downloads as infrastructure failures
- Fixes cases where spot termination happens too fast for the monitor script
Summary of Changes

Hello @yongwww, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request expands the project's continuous integration capabilities by incorporating H100 GPUs for public testing, ensuring broader hardware compatibility. Concurrently, it refines the mechanism for identifying infrastructure-related job failures, specifically improving the detection of spot instance terminations by recognizing common cancellation messages.
📝 Walkthrough

Standardizes CI runner/architecture labels, refactors Docker cleanup and checkout steps, removes Docker Hub login, adds a new H100 GPU test job and aggregates its result, and extracts inline spot-termination analysis into a new `scripts/task_analyze_spot.sh` script.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Workflow as GitHub Actions Workflow
    participant Script as scripts/task_analyze_spot.sh
    participant GHAPI as GitHub CLI / API
    participant Runner as Runner & Job Logs

    Workflow->>Script: invoke (job_filter, repo, run_id)
    Script->>GHAPI: list jobs for run (filtered)
    GHAPI-->>Script: job metadata (conclusions)
    alt failed/cancelled jobs exist
        Script->>GHAPI: download job logs
        GHAPI-->>Runner: provide logs (zip or text)
        Script->>Script: extract & scan logs for markers/errors
        alt marker/error found
            Script-->>Workflow: set is_spot_termination=true (GITHUB_OUTPUT)
        else
            Script-->>Workflow: set is_spot_termination=false (GITHUB_OUTPUT)
        end
    else
        Script-->>Workflow: set is_spot_termination=false (GITHUB_OUTPUT)
    end
```
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 3 failed (2 warnings, 1 inconclusive)
No actionable comments were generated in the recent review.
Code Review
This PR adds a script to analyze CI job failures for spot instance termination. The script is a good addition. I've found two areas for improvement to make it more robust: handling API pagination to ensure all jobs are checked, and using secure temporary file creation. Details are in the specific comments.
scripts/task_analyze_spot.sh
Outdated
```shell
FAILED_JOBS=$(gh api "/repos/${REPOSITORY}/actions/runs/${RUN_ID}/jobs?per_page=100" \
  --jq ".jobs[] | select(.name | ${JOB_FILTER}) | select(.conclusion == \"failure\" or .conclusion == \"cancelled\") | .id")
```
The current `gh api` call to fetch jobs has a hardcoded `per_page=100` limit and does not handle pagination. If a workflow run has more than 100 jobs, this script could miss the failed job, leading to an incorrect result. To make this robust, you should use the `--paginate` flag, which will ensure all jobs from all pages are fetched and processed.
```diff
-FAILED_JOBS=$(gh api "/repos/${REPOSITORY}/actions/runs/${RUN_ID}/jobs?per_page=100" \
-  --jq ".jobs[] | select(.name | ${JOB_FILTER}) | select(.conclusion == \"failure\" or .conclusion == \"cancelled\") | .id")
+FAILED_JOBS=$(gh api --paginate "/repos/${REPOSITORY}/actions/runs/${RUN_ID}/jobs" \
+  --jq ".jobs[] | select(.name | ${JOB_FILTER}) | select(.conclusion == \"failure\" or .conclusion == \"cancelled\") | .id")
```
scripts/task_analyze_spot.sh
Outdated
```shell
SPOT_TERMINATION=false

# Temp file for job logs (cleaned up on exit)
LOG_FILE="/tmp/job_log.txt"
```
Using a hardcoded path for a temporary file is not recommended due to potential race conditions and security vulnerabilities. Please use `mktemp` to generate a unique temporary file name.
Note that after this change, you will need to adjust how the corresponding `.zip` file is handled. For example, you could define `LOG_FILE_ZIP="${LOG_FILE}.zip"` and use that variable throughout the script.
```diff
-LOG_FILE="/tmp/job_log.txt"
+LOG_FILE=$(mktemp)
```
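A minimal sketch of the pattern this suggestion implies — `mktemp` plus a `trap`-based cleanup. The variable names and the trap wiring are illustrative assumptions, not the script's actual code:

```shell
#!/usr/bin/env bash
set -euo pipefail

# mktemp creates a unique file, avoiding the collisions and symlink
# attacks that a fixed /tmp path is exposed to.
LOG_FILE=$(mktemp)
# Derive the zip path from the temp file, as the comment suggests.
LOG_FILE_ZIP="${LOG_FILE}.zip"

# Remove both files whenever the script exits, on success or failure.
cleanup() { rm -f "$LOG_FILE" "$LOG_FILE_ZIP"; }
trap cleanup EXIT

echo "downloaded job log" > "$LOG_FILE"
```

Because the cleanup is registered on `EXIT`, the temp files are removed even if a later `gh api` call fails under `set -e`.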
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@scripts/task_analyze_spot.sh`:
- Around line 45-53: The current broad grep of JOB_INFO (variable JOB_INFO from
gh api for JOB_ID in FAILED_JOBS) risks matching fail-fast cancellations; narrow
the heuristic by inspecting the JSON field that contains runner/annotation
messages (e.g., .annotations[].message or the runner-specific message field)
instead of grepping the whole JOB_INFO, and add a guard that skips setting
SPOT_TERMINATION when the workflow-triggering job’s own conclusion is "failure"
(i.e., only mark SPOT_TERMINATION when the triggering job did not fail and the
runner/annotation field contains "operation was canceled" or runner-lost
patterns).
🧹 Nitpick comments (4)
scripts/task_analyze_spot.sh (2)

36-37: Pagination: `per_page=100` may miss jobs in large matrix runs.

The AOT matrix alone is 2 arches × 4 CUDA versions = 8 jobs, plus reruns and other GPU jobs all share the same run ID. While 100 is likely sufficient today, consider using `--paginate` (as done elsewhere in this workflow for team member listing) or adding a comment noting the assumption.

30-33: Hardcoded `/tmp/job_log.txt` is safe for now but fragile.

Each analysis job runs on its own `ubuntu-latest` runner, so there's no concurrent collision risk today. However, using `mktemp` would be more robust if this script is ever reused in a different context.

♻️ Suggested improvement

```diff
-LOG_FILE="/tmp/job_log.txt"
-cleanup() { rm -f "$LOG_FILE" "${LOG_FILE}.zip"; }
+LOG_DIR=$(mktemp -d)
+LOG_FILE="${LOG_DIR}/job_log.txt"
+cleanup() { rm -rf "$LOG_DIR"; }
```

.github/workflows/pr-test.yml (2)

159-159: TODO: Re-add `.github/` to skip patterns before merging.

Line 159 has a TODO noting that `^\.github/` should be added back to `SKIP_PATTERNS` before merging to main. Without it, changes to only `.github/` workflow files will trigger the full build/test suite unnecessarily on main. Make sure this is addressed before merge.

661-667: H100 result handling in the summary is correct but inconsistent with other tests.

The H100 result is handled inline rather than through the `check_status` helper function used for AOT, A10G, and T4. This is understandable since H100 has no rerun logic, but the `check_status` function already handles the simple case (when `spot_term` is not `"true"` and the spot result is `"success"`, it reports passed). You could reuse it for consistency:

♻️ Optional: reuse check_status for H100

```diff
-          # H100 tests (no rerun logic yet - CB instances don't get spot terminated)
-          echo "" >> $GITHUB_STEP_SUMMARY
-          H100="${{ needs.gpu-tests-h100.result }}"
-          echo "GPU Tests (H100): $H100" >> $GITHUB_STEP_SUMMARY
-          if [ "$H100" != "success" ] && [ "$H100" != "skipped" ] && [ "${{ github.event.inputs.skip_gpu }}" != "true" ]; then
-            FAILED=true
-          fi
+          echo "" >> $GITHUB_STEP_SUMMARY
+          check_status "GPU Tests (H100)" \
+            "${{ github.event.inputs.skip_gpu }}" \
+            "${{ needs.gpu-tests-h100.result }}" \
+            "false" \
+            "" || FAILED=true
```
```shell
for JOB_ID in $FAILED_JOBS; do
  # First check job metadata for runner communication errors
  # This catches "The operation was canceled" which appears in annotations, not logs
  JOB_INFO=$(gh api "/repos/${REPOSITORY}/actions/jobs/${JOB_ID}" 2>/dev/null || true)
  if echo "$JOB_INFO" | grep -qiE "operation was canceled|runner.*lost|lost communication"; then
    echo "Detected: Runner lost communication or operation canceled (job $JOB_ID)"
    SPOT_TERMINATION=true
    break
  fi
```
"operation was canceled" may cause false-positive spot-termination reruns with `fail-fast: true`.

When `fail-fast: true` is set (as it is for all matrix jobs in this workflow), GitHub Actions cancels sibling matrix jobs when one fails. Cancelled jobs can have "The operation was canceled" in their annotations/metadata. Grepping the entire JSON response for this pattern will match those fail-fast cancellations too, incorrectly flagging them as spot terminations and triggering unnecessary on-demand reruns.
Previously, the analysis relied on specific network error patterns (e.g., `connection reset by peer`, `grpc.*closing`) that fail-fast cancellation does not produce. This new, broader check on line 49 undermines that safety. Consider either:

- Restricting the grep to a specific JSON field (e.g., `.steps[].conclusion` or annotation messages from the runner itself), or
- Adding a negative guard that skips this heuristic when the triggering job's own conclusion is `failure` (i.e., a real test failure caused the fail-fast cascade).

Based on learnings: "the analyze-aot-failure and analyze-gpu-*-failure steps look for specific network error patterns that indicate actual spot termination, not generic cancellation messages. fail-fast cancellation does not produce these patterns, so there is no risk of false-positive spot-termination reruns."
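One way to sketch the suggested narrowing as a pure-shell helper. The function name, the conclusion guard, and the exact patterns are illustrative assumptions, not the script's actual implementation:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Succeeds only when a message looks like a runner-level spot termination
# AND the job did not conclude as a genuine test failure. Note that
# fail-fast cancellations write "Canceling since another job has failed"
# to the job logs, which does not match these metadata patterns.
is_spot_termination() {
  local message="$1" conclusion="$2"
  if [ "$conclusion" = "failure" ]; then
    return 1  # a real test failure, not infrastructure
  fi
  echo "$message" | grep -qiE "operation was canceled|runner.*lost|lost communication"
}

if is_spot_termination "Error: The operation was canceled." "cancelled"; then
  echo "spot termination"
fi
```

Keeping the guard and the pattern match in one small function makes the heuristic easy to unit-test outside the workflow.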
fail-fast cancellation produces "Canceling since another job has failed" in the logs, not "operation was canceled" in the job metadata.
@yongwww, thank you for the clarification! You're absolutely right—I misunderstood the distinction. Since the script checks job metadata (not logs), and fail-fast cancellations produce "Canceling since another job has failed" in logs rather than "operation was canceled" in metadata, there's no risk of false positives here. My concern was unfounded.
I appreciate you explaining this behavioral difference between spot termination and fail-fast cancellation!
(\_(\
(='.'=) Learned something new!
(")_(")
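The distinction the thread settles on can be demonstrated directly with `grep`. The two sample strings are taken from the discussion above; the variable names are arbitrary:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Message that fail-fast cancellation writes to the job *logs*:
FAILFAST_LOG_MSG="Canceling since another job has failed"
# Message that spot termination surfaces in the job *metadata* (API response):
SPOT_META_MSG="Error: The operation was canceled."

PATTERN="operation was canceled"

# The metadata pattern matches only the spot-termination message.
echo "$SPOT_META_MSG" | grep -qi "$PATTERN" && echo "metadata: spot signal"
echo "$FAILFAST_LOG_MSG" | grep -qi "$PATTERN" || echo "logs: no spot signal"
```

Since "Canceling" never appears as "canceled", a metadata-only check cannot be tripped by fail-fast cascades.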
✏️ Learnings added
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2552
File: scripts/task_analyze_spot.sh:45-53
Timestamp: 2026-02-12T22:25:03.967Z
Learning: In GitHub Actions, fail-fast cancellations produce "Canceling since another job has failed" in the job logs, while spot terminations produce "operation was canceled" in the job metadata (API response). These are distinct patterns in different locations, so checking job metadata for "operation was canceled" (as in `scripts/task_analyze_spot.sh`) will not match fail-fast cancellations.
🧠 Learnings used
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2500
File: .github/workflows/pr-test-runner.yml:56-67
Timestamp: 2026-02-06T06:06:12.386Z
Learning: In `.github/workflows/pr-test-runner.yml`, the analyze-aot-failure and analyze-gpu-*-failure steps look for specific network error patterns that indicate actual spot termination, not generic cancellation messages. fail-fast cancellation does not produce these patterns, so there is no risk of false-positive spot-termination reruns.
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2500
File: .github/workflows/pr-test-runner.yml:56-67
Timestamp: 2026-02-06T05:51:05.040Z
Learning: In `.github/workflows/pr-test-runner.yml`, the team prefers to keep `fail-fast: true` for matrix jobs to reduce overall test time, even with the potential for false-positive spot-termination reruns.
@bkryu FYI, as we discussed earlier, changes to benchmarks won’t automatically trigger the PR tests with this change.
📌 Description
#2355.
Add H100 to public ci testing.
Added "Error: The operation was canceled." (occurred in jobs like https://github.com/flashinfer-ai/flashinfer/actions/runs/21931167188/attempts/1?pr=2425) as one of the spot termination signals.
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- Installed the hooks with `pre-commit install`.
- Ran `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed and are passing (`unittest`, etc.).

Reviewer Notes