
Add Hopper to CI #2552

Merged
yongwww merged 11 commits into flashinfer-ai:main from yongwww:ci_hopper
Feb 19, 2026

Conversation

@yongwww (Member) commented Feb 12, 2026

📌 Description

#2355.

Add H100 to public CI testing.

Added "Error: The operation was canceled." (seen in jobs such as https://github.com/flashinfer-ai/flashinfer/actions/runs/21931167188/attempts/1?pr=2425) as one of the spot-termination signals.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Tests

    • Added H100 GPU capacity testing and included its results in final summaries
    • Standardized GPU/AOT test matrices and runner architecture labels
    • Improved failure diagnostics with automated spot-termination analysis
  • Chores

    • Updated CI cleanup procedures to use targeted image/builder pruning and removed redundant login steps
    • Added sparse-checkout for test scripts and harmonized cleanup behavior

@gemini-code-assist (Contributor)

Summary of Changes

Hello @yongwww, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request expands the project's continuous integration capabilities by incorporating H100 GPUs for public testing, ensuring broader hardware compatibility. Concurrently, it refines the mechanism for identifying infrastructure-related job failures, specifically improving the detection of spot instance terminations by recognizing common cancellation messages.

Highlights

  • CI Expansion: Integrated H100 (Hopper) GPUs into the public CI testing infrastructure to broaden hardware coverage.
  • Improved Spot Termination Detection: Enhanced the CI's ability to detect spot instance terminations by recognizing "Error: The operation was canceled." as a key indicator, alongside other infrastructure failure patterns.


Changelog
  • scripts/task_analyze_spot.sh
    • Added a new shell script to analyze GitHub Actions job failures for spot termination indicators.
    • Implemented logic to check job metadata for runner communication errors like "operation was canceled" or "runner lost".
    • Included functionality to download and parse job logs for specific infrastructure error patterns such as "connection reset by peer" or "context canceled".
    • Outputs a boolean is_spot_termination indicating if a job failure is likely due to a spot instance termination.
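For readers unfamiliar with the step-output mechanism referenced above, a minimal sketch of emitting a boolean to both stdout and `GITHUB_OUTPUT` might look like the following. The guard and default value are illustrative assumptions, not the actual contents of `task_analyze_spot.sh`:

```shell
#!/usr/bin/env bash
# Sketch: emit is_spot_termination the way a GitHub Actions step output
# is written. Assumes SPOT_TERMINATION was set by earlier analysis.
set -euo pipefail

SPOT_TERMINATION="${SPOT_TERMINATION:-false}"

# Always echo for human-readable logs.
echo "is_spot_termination=${SPOT_TERMINATION}"

# GITHUB_OUTPUT is only defined inside a workflow run; guard for local use.
if [ -n "${GITHUB_OUTPUT:-}" ]; then
  echo "is_spot_termination=${SPOT_TERMINATION}" >> "$GITHUB_OUTPUT"
fi
```

A downstream workflow step can then read the value as `steps.<id>.outputs.is_spot_termination`.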
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/pr-test.yml

@coderabbitai (bot) commented Feb 12, 2026

📝 Walkthrough

Walkthrough

Standardizes CI runner/architecture labels, refactors Docker cleanup and checkout steps, removes Docker Hub login, adds a new H100 GPU test job and aggregates its result, and extracts inline spot-termination analysis into a new scripts/task_analyze_spot.sh.

Changes

  • CI Workflow (.github/workflows/pr-test.yml): Lowercases runner/arch labels and matrices, replaces docker system prune -f with docker image prune -f + docker builder prune -f --filter "until=24h", removes Docker Hub login steps, adds sparse-checkout: scripts, replaces inline spot-analysis with script calls, adds a gpu-tests-h100 job, and includes H100 in test-result aggregation.
  • Spot Termination Analysis Script (scripts/task_analyze_spot.sh): New Bash utility that queries GitHub Actions jobs via gh, downloads and inspects job logs (zip or text) for spot-termination and infrastructure error patterns, and emits is_spot_termination to stdout and GITHUB_OUTPUT.

Sequence Diagram(s)

sequenceDiagram
    participant Workflow as GitHub Actions Workflow
    participant Script as scripts/task_analyze_spot.sh
    participant GHAPI as GitHub CLI / API
    participant Runner as Runner & Job Logs

    Workflow->>Script: invoke (job_filter, repo, run_id)
    Script->>GHAPI: list jobs for run (filtered)
    GHAPI-->>Script: job metadata (conclusions)
    alt failed/cancelled jobs exist
        Script->>GHAPI: download job logs
        GHAPI-->>Runner: provide logs (zip or text)
        Script->>Script: extract & scan logs for markers/errors
        alt marker/error found
            Script-->>Workflow: set is_spot_termination=true (GITHUB_OUTPUT)
        else
            Script-->>Workflow: set is_spot_termination=false (GITHUB_OUTPUT)
        end
    else
        Script-->>Workflow: set is_spot_termination=false (GITHUB_OUTPUT)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

run-ci

Suggested reviewers

  • nvmbreughe
  • kahyunnam
  • jimmyzho
  • yzh119

Poem

🐰 I hopped into logs with ears so keen,
lowercased the runners, kept the workflow clean.
I sniffed for spot-falls in compressed night logs,
cheered for H100 and pruned old Docker clogs. ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Merge Conflict Detection ⚠️ Warning: merge conflicts detected (3 files):
    ⚔️ .github/workflows/pr-test.yml (content)
    ⚔️ scripts/task_run_unit_tests.sh (content)
    ⚔️ scripts/test_utils.sh (content)
    These conflicts must be resolved before merging into main. Resolution: resolve conflicts locally and push changes to this branch.
  • Title check ❓ Inconclusive: the title "Add Hopper to CI" is vague and does not capture all of the main changes. While the PR does add H100 (Hopper) testing, the summary shows extensive CI infrastructure refactoring, including spot-termination analysis, runner-label standardization, and Docker command updates. Consider a more descriptive title such as "Add H100 GPU testing and improve spot termination detection".

✅ Passed checks (1 passed)

  • Description check ✅ Passed: the description provides a GitHub issue reference and explains the main objective (adding H100 to CI and improving spot termination detection), but the PR checklist items are unchecked and no details about testing or pre-commit verification are provided, so the description is incomplete relative to the template.

No actionable comments were generated in the recent review. 🎉



@yongwww (Member, Author) commented Feb 12, 2026

cc: @dierksen @yzh119 @sricketts

@gemini-code-assist (bot) left a comment

Code Review

This PR adds a script to analyze CI job failures for spot instance termination. The script is a good addition. I've found two areas for improvement to make it more robust: handling API pagination to ensure all jobs are checked, and using secure temporary file creation. Details are in the specific comments.

Comment on lines 36 to 37
FAILED_JOBS=$(gh api "/repos/${REPOSITORY}/actions/runs/${RUN_ID}/jobs?per_page=100" \
--jq ".jobs[] | select(.name | ${JOB_FILTER}) | select(.conclusion == \"failure\" or .conclusion == \"cancelled\") | .id")

Severity: high

The current gh api call to fetch jobs has a hardcoded per_page=100 limit and does not handle pagination. If a workflow run has more than 100 jobs, this script could miss the failed job, leading to an incorrect result. To make this robust, you should use the --paginate flag, which will ensure all jobs from all pages are fetched and processed.

Suggested change
FAILED_JOBS=$(gh api "/repos/${REPOSITORY}/actions/runs/${RUN_ID}/jobs?per_page=100" \
--jq ".jobs[] | select(.name | ${JOB_FILTER}) | select(.conclusion == \"failure\" or .conclusion == \"cancelled\") | .id")
FAILED_JOBS=$(gh api --paginate "/repos/${REPOSITORY}/actions/runs/${RUN_ID}/jobs" \
--jq ".jobs[] | select(.name | ${JOB_FILTER}) | select(.conclusion == \"failure\" or .conclusion == \"cancelled\") | .id")

yongwww (Member, Author) replied:

fixed in b4e8481

SPOT_TERMINATION=false

# Temp file for job logs (cleaned up on exit)
LOG_FILE="/tmp/job_log.txt"

Severity: medium

Using a hardcoded path for a temporary file is not recommended due to potential race conditions and security vulnerabilities. Please use mktemp to generate a unique temporary file name.

Note that after this change, you will need to adjust how the corresponding .zip file is handled. For example, you could define LOG_FILE_ZIP="${LOG_FILE}.zip" and use that variable throughout the script.

Suggested change
LOG_FILE="/tmp/job_log.txt"
LOG_FILE=$(mktemp)
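Putting the review's advice together, a sketch of the mktemp pattern with the companion .zip path and cleanup-on-exit could look like this. The LOG_FILE_ZIP name comes from the comment above; the surrounding script logic is assumed, not the actual implementation:

```shell
#!/usr/bin/env bash
# Sketch: race-free temp file plus companion archive path, both removed
# on exit via trap, as suggested in the review comment.
set -euo pipefail

LOG_FILE=$(mktemp)             # unique temp file, avoids /tmp collisions
LOG_FILE_ZIP="${LOG_FILE}.zip" # companion archive path (assumed name)
cleanup() { rm -f "$LOG_FILE" "$LOG_FILE_ZIP"; }
trap cleanup EXIT

echo "sample log line" > "$LOG_FILE"
grep -q "sample" "$LOG_FILE" && echo "log written to $LOG_FILE"
```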

yongwww (Member, Author) replied:

fixed in b4e8481

@coderabbitai (bot) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@scripts/task_analyze_spot.sh`:
- Around line 45-53: The current broad grep of JOB_INFO (variable JOB_INFO from
gh api for JOB_ID in FAILED_JOBS) risks matching fail-fast cancellations; narrow
the heuristic by inspecting the JSON field that contains runner/annotation
messages (e.g., .annotations[].message or the runner-specific message field)
instead of grepping the whole JOB_INFO, and add a guard that skips setting
SPOT_TERMINATION when the workflow-triggering job’s own conclusion is "failure"
(i.e., only mark SPOT_TERMINATION when the triggering job did not fail and the
runner/annotation field contains "operation was canceled" or runner-lost
patterns).
🧹 Nitpick comments (4)
scripts/task_analyze_spot.sh (2)

36-37: Pagination: per_page=100 may miss jobs in large matrix runs.

The AOT matrix alone is 2 arches × 4 CUDA versions = 8 jobs, plus reruns and other GPU jobs all share the same run ID. While 100 is likely sufficient today, consider using --paginate (as done elsewhere in this workflow for team member listing) or adding a comment noting the assumption.


30-33: Hardcoded /tmp/job_log.txt is safe for now but fragile.

Each analysis job runs on its own ubuntu-latest runner, so there's no concurrent collision risk today. However, using mktemp would be more robust if this script is ever reused in a different context.

♻️ Suggested improvement
-LOG_FILE="/tmp/job_log.txt"
-cleanup() { rm -f "$LOG_FILE" "${LOG_FILE}.zip"; }
+LOG_DIR=$(mktemp -d)
+LOG_FILE="${LOG_DIR}/job_log.txt"
+cleanup() { rm -rf "$LOG_DIR"; }
.github/workflows/pr-test.yml (2)

159-159: TODO: Re-add .github/ to skip patterns before merging.

Line 159 has a TODO noting that ^\.github/ should be added back to SKIP_PATTERNS before merging to main. Without it, changes to only .github/ workflow files will trigger the full build/test suite unnecessarily on main. Make sure this is addressed before merge.


661-667: H100 result handling in the summary is correct but inconsistent with other tests.

The H100 result is handled inline rather than through the check_status helper function used for AOT, A10G, and T4. This is understandable since H100 has no rerun logic, but the check_status function already handles the simple case (when spot_term is not "true" and the spot result is "success", it reports passed). You could reuse it for consistency:

♻️ Optional: reuse check_status for H100
-          # H100 tests (no rerun logic yet - CB instances don't get spot terminated)
-          echo "" >> $GITHUB_STEP_SUMMARY
-          H100="${{ needs.gpu-tests-h100.result }}"
-          echo "GPU Tests (H100): $H100" >> $GITHUB_STEP_SUMMARY
-          if [ "$H100" != "success" ] && [ "$H100" != "skipped" ] && [ "${{ github.event.inputs.skip_gpu }}" != "true" ]; then
-            FAILED=true
-          fi
+          echo "" >> $GITHUB_STEP_SUMMARY
+          check_status "GPU Tests (H100)" \
+            "${{ github.event.inputs.skip_gpu }}" \
+            "${{ needs.gpu-tests-h100.result }}" \
+            "false" \
+            "" || FAILED=true

Comment on lines +45 to +53
for JOB_ID in $FAILED_JOBS; do
# First check job metadata for runner communication errors
# This catches "The operation was canceled" which appears in annotations, not logs
JOB_INFO=$(gh api "/repos/${REPOSITORY}/actions/jobs/${JOB_ID}" 2>/dev/null || true)
if echo "$JOB_INFO" | grep -qiE "operation was canceled|runner.*lost|lost communication"; then
echo "Detected: Runner lost communication or operation canceled (job $JOB_ID)"
SPOT_TERMINATION=true
break
fi
@coderabbitai (bot) commented Feb 12, 2026

⚠️ Potential issue | 🟠 Major

"operation was canceled" may cause false-positive spot-termination reruns with fail-fast: true.

When fail-fast: true is set (as it is for all matrix jobs in this workflow), GitHub Actions cancels sibling matrix jobs when one fails. Cancelled jobs can have "The operation was canceled" in their annotations/metadata. Grepping the entire JSON response for this pattern will match those fail-fast cancellations too, incorrectly flagging them as spot terminations and triggering unnecessary on-demand reruns.

Previously, the analysis relied on specific network error patterns (e.g., connection reset by peer, grpc.*closing) that fail-fast cancellation does not produce. This new, broader check on line 49 undermines that safety. Consider either:

  1. Restricting the grep to a specific JSON field (e.g., .steps[].conclusion or annotation messages from the runner itself), or
  2. Adding a negative guard that skips this heuristic when the triggering job's own conclusion is failure (i.e., a real test failure caused the fail-fast cascade).

Based on learnings: "the analyze-aot-failure and analyze-gpu-*-failure steps look for specific network error patterns that indicate actual spot termination, not generic cancellation messages. fail-fast cancellation does not produce these patterns, so there is no risk of false-positive spot-termination reruns."


yongwww (Member, Author) replied:

fail-fast cancellation produces "Canceling since another job has failed" in the logs, not "operation was canceled" in the job metadata.

coderabbitai (bot) replied:

@yongwww, thank you for the clarification! You're absolutely right—I misunderstood the distinction. Since the script checks job metadata (not logs), and fail-fast cancellations produce "Canceling since another job has failed" in logs rather than "operation was canceled" in metadata, there's no risk of false positives here. My concern was unfounded.

I appreciate you explaining this behavioral difference between spot termination and fail-fast cancellation!

(\_(\ 
(='.'=)  Learned something new!
(")_(")

✏️ Learnings added
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2552
File: scripts/task_analyze_spot.sh:45-53
Timestamp: 2026-02-12T22:25:03.967Z
Learning: In GitHub Actions, fail-fast cancellations produce "Canceling since another job has failed" in the job logs, while spot terminations produce "operation was canceled" in the job metadata (API response). These are distinct patterns in different locations, so checking job metadata for "operation was canceled" (as in `scripts/task_analyze_spot.sh`) will not match fail-fast cancellations.
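The distinction captured in this learning can be illustrated with a small grep check. The sample strings below are illustrative stand-ins, not real gh api responses or runner logs:

```shell
#!/usr/bin/env bash
# Sketch: classify a failure using the two distinct signals described above.
#   fail-fast cancellation -> "Canceling since another job has failed" (logs)
#   spot termination       -> "operation was canceled" (job metadata)
set -euo pipefail

# Illustrative stand-ins for a gh api job response and a job log line.
JOB_METADATA='{"conclusion":"cancelled","message":"The operation was canceled."}'
JOB_LOG='Canceling since another job has failed'

# Metadata check: matches spot terminations, not fail-fast cancellations.
if echo "$JOB_METADATA" | grep -qiE "operation was canceled|runner.*lost|lost communication"; then
  echo "metadata: likely spot termination"
fi

# Log check: matches fail-fast cancellations only.
if echo "$JOB_LOG" | grep -q "Canceling since another job has failed"; then
  echo "log: fail-fast cancellation"
fi
```

Because the two patterns live in different places (metadata vs. logs), checking metadata for "operation was canceled" cannot be triggered by a fail-fast cascade.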



🧠 Learnings used
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2500
File: .github/workflows/pr-test-runner.yml:56-67
Timestamp: 2026-02-06T06:06:12.386Z
Learning: In `.github/workflows/pr-test-runner.yml`, the analyze-aot-failure and analyze-gpu-*-failure steps look for specific network error patterns that indicate actual spot termination, not generic cancellation messages. fail-fast cancellation does not produce these patterns, so there is no risk of false-positive spot-termination reruns.

Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2500
File: .github/workflows/pr-test-runner.yml:56-67
Timestamp: 2026-02-06T05:51:05.040Z
Learning: In `.github/workflows/pr-test-runner.yml`, the team prefers to keep `fail-fast: true` for matrix jobs to reduce overall test time, even with the potential for false-positive spot-termination reruns.

@yongwww (Member, Author) commented Feb 12, 2026

@bkryu FYI, as we discussed earlier, changes to benchmarks won’t automatically trigger the PR tests with this change.

@dierksen (Contributor) left a comment

LGTM

@yongwww yongwww merged commit 1556d78 into flashinfer-ai:main Feb 19, 2026
56 of 70 checks passed