
[CI]: Enable Blackwell & Hopper in public CI testing #2413

Draft
yongwww wants to merge 9 commits into flashinfer-ai:main from yongwww:ci_blackwell

Conversation

@yongwww
Member

@yongwww yongwww commented Jan 24, 2026

📌 Description

#2355

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Tests

    • Added GPU test support for B200 and H100 with dedicated capacity blocks; test orchestration and final results aggregation now include these GPUs and affect PR status.
    • Normalized architecture labels across matrices and rerun logic for consistent AOT and GPU job handling.
  • Bug Fixes / Reliability

    • Replaced inline log parsing with a scripted spot-failure analysis and hardened detection for infrastructure/spot terminations.
    • Removed redundant Docker Hub login steps and improved sparse checkout for analysis scripts.
  • Chores

    • Switched cleanup to targeted image and builder prune commands.


@gemini-code-assist
Contributor

Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.

@coderabbitai
Contributor

coderabbitai bot commented Jan 24, 2026

📝 Walkthrough

Walkthrough

Adds two GPU test jobs (gpu-tests-b200, gpu-tests-h100), normalizes runner/arch labels to lowercase, replaces broad docker prune with targeted prune commands, factors spot-failure analysis into a new script (scripts/task_analyze_spot.sh), and extends test-results-summary to include B200/H100 statuses.

Changes

Cohort / File(s) Summary
Workflow file
.github/workflows/pr-test.yml
Added gpu-tests-b200 and gpu-tests-h100; normalized runs-on and matrix arch values to lowercase; removed Docker Hub login steps from some jobs; updated DOCKER_IMAGE references for GPU blocks.
Cleanup & pruning
.github/workflows/pr-test.yml
Replaced docker system prune with docker image prune and added docker builder prune --filter "until=24h" in cleanup steps.
Failure handling & log analysis
.github/workflows/pr-test.yml, scripts/task_analyze_spot.sh
Extracted spot/infra failure detection into scripts/task_analyze_spot.sh; workflow now sparse-checkouts scripts, invokes the analyzer to gate log-downloads and rerun logic; adjusted rerun matrix to use lowercase arch values.
Results aggregation
.github/workflows/pr-test.yml
Expanded test-results-summary needs and final evaluation to include B200 and H100 blocks and propagate their statuses in the final summary.
New script
scripts/task_analyze_spot.sh
New shell script that queries GitHub Jobs for a run, inspects failed/cancelled job logs and metadata for spot-termination/infrastructure error patterns, and emits is_spot_termination via GITHUB_OUTPUT.
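The detection flow described above can be sketched roughly as follows. This is a hedged illustration, not the actual scripts/task_analyze_spot.sh; the failure patterns and the `is_spot_termination` function name are assumptions for the example.

```shell
#!/usr/bin/env bash
# Illustrative sketch of spot-termination detection (not the real
# scripts/task_analyze_spot.sh; patterns and names are placeholders).
set -euo pipefail

# is_spot_termination <logfile>: print "true" if the log matches known
# spot/infrastructure failure signatures, "false" otherwise.
is_spot_termination() {
  if grep -Eqi 'spot.?(instance|termination)|The (runner|operation) was canceled|Instance was terminated' "$1"; then
    echo "true"
  else
    echo "false"
  fi
}

# In CI this would run over logs fetched for failed/cancelled jobs, e.g.:
#   gh api "repos/$REPO/actions/runs/$RUN_ID/jobs" --paginate
# and the verdict would be exported for later workflow steps:
#   echo "is_spot_termination=$result" >> "$GITHUB_OUTPUT"
```

The actual script additionally inspects job metadata, not just log text, before deciding whether a rerun is warranted.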

Sequence Diagram(s)

sequenceDiagram
    participant PR as "PR Trigger"
    participant GH as "GitHub Actions"
    participant JobB as "gpu-tests-b200"
    participant JobH as "gpu-tests-h100"
    participant Script as "scripts/task_analyze_spot.sh"
    participant Art as "Artifacts / Logs"
    participant Summary as "test-results-summary"

    PR->>GH: trigger pr-test workflow
    GH->>JobB: start B200 job (setup, tests, upload logs)
    GH->>JobH: start H100 job (setup, tests, upload logs)
    JobB-->>Art: upload logs & status
    JobH-->>Art: upload logs & status
    Summary->>Script: call analyzer (job_filter, repo, run_id)
    Script->>GH: query workflow run jobs / download logs
    Script-->>Summary: return is_spot_termination flag
    alt is_spot_termination == false
        Summary->>Art: download logs, aggregate statuses -> final pass/fail
    else
        Summary->>Summary: mark SPOT_TERMINATION / infra-failure, adjust rerun matrix
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

run-ci

Suggested reviewers

  • yzh119
  • nvmbreughe
  • kahyunnam
  • jimmyzho

Poem

🐰 I hopped through CI with a curious nose,
Two GPUs now spin where the test river flows.
I sniffed for spot slips, pruned images just so,
Collected the logs, then watched statuses grow.
Carrots for passing — hop, rerun, and go! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check — ⚠️ Warning. The PR description contains only a template reference (issue #2355) with incomplete checklist items and no substantive explanation of the changes. Resolution: describe what this PR does (e.g., enabling Blackwell & Hopper GPU testing, removing Docker login steps, introducing a spot-termination analysis script) and why these changes are needed.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed. The title clearly and specifically summarizes the main change: enabling Blackwell and Hopper GPU support in the CI testing workflow.




@yongwww yongwww changed the title ci: Enable blackwell tests in public ci [CI]: Enable Blackwell tests in public CI Jan 24, 2026
@yongwww yongwww changed the title [CI]: Enable Blackwell tests in public CI [CI]: Enable Blackwell & Hopper in public CI testing Jan 24, 2026
@yongwww yongwww marked this pull request as ready for review January 24, 2026 17:10
@yongwww
Member Author

yongwww commented Jan 24, 2026

The B200 test is green:
https://github.com/flashinfer-ai/flashinfer/actions/runs/21314185264/job/61365090019?pr=2413

The H100 test is still pending due to a lack of available H100 runners (no EC2 capacity available at the moment).

@yzh119
Collaborator

yzh119 commented Jan 24, 2026

Considering we have 8 GPUs for Hopper and Blackwell runners, my suggestion is to have a script automatically load balance the unittests and dispatch them to all 8 GPUs.

@yongwww
Member Author

yongwww commented Jan 25, 2026

Considering we have 8 GPUs for Hopper and Blackwell runners, my suggestion is to have a script automatically load balance the unittests and dispatch them to all 8 GPUs.

Thanks @yzh119! Good point about utilizing all 8 GPUs. Here are the two approaches I've considered:
Option 1: 8 Runners (1 per GPU) - current

  • ✅ 8 PRs can run in parallel
  • ❌ Cannot run multi-GPU tests

Option 2: 1 Runner (1 per node as suggested)

  • ✅ Can run multi-GPU tests and utilize all the GPUs for each run (so each PR's tests complete faster)
  • ❌ Only 1 PR runs at a time; others queue

This PR currently uses Option 1; I will switch to Option 2.


update (Jan 29, 2026)

Option 3: 1 runner with 4 GPUs, plus 4 runners with 1 GPU each (proposed by @dierksen )

  • ✅ Several PRs can run in parallel
  • ✅ We can run multi-GPU tests (only one multi-GPU runner, but we could have two 2-GPU runners if needed)
  • ❌ Slightly more complex runner management (but not hard to handle)

I’m in favor of Option 3; please let me know if you have any concerns. @yzh119 @dierksen
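To make the dispatch idea concrete, the load balancing yzh119 suggests could start as simple round-robin sharding of test files across GPUs. This is a hedged sketch, not an existing FlashInfer script; `NUM_GPUS`, the test list, and `shard_for` are all illustrative names.

```shell
#!/usr/bin/env bash
# Illustrative sketch of balancing unit tests across GPUs (hypothetical,
# not an existing FlashInfer script; paths and names are placeholders).
set -euo pipefail

NUM_GPUS=8
# In CI the list would be discovered, e.g.: git ls-files 'tests/test_*.py'
TESTS=(tests/test_a.py tests/test_b.py tests/test_c.py tests/test_d.py)

# shard_for <index>: GPU id for the i-th test (simple round-robin)
shard_for() { echo $(( $1 % NUM_GPUS )); }

for i in "${!TESTS[@]}"; do
  gpu=$(shard_for "$i")
  # Real dispatch would pin each shard to its GPU, e.g.:
  #   CUDA_VISIBLE_DEVICES=$gpu pytest "${TESTS[$i]}" &
  echo "gpu $gpu <- ${TESTS[$i]}"
done
```

A production version would weight shards by historical test duration rather than file count, since per-file runtimes vary widely; round-robin is just the simplest baseline.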

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In .github/workflows/pr-test.yml:
- Around lines 684-688: the gpu-tests-b200 job is missing the gate dependency and authorization check, so unauthorized PRs could run on B200 runners. Add gate to the job's needs list (needs: [setup, gate]) and extend its if condition with needs.gate.outputs.authorized == 'true' alongside the existing skip_build and skip_gpu checks, so the job waits for gate and only runs when authorized.
- Around lines 733-737: the gpu-tests-h100 job has the same problem as the B200 job. Add gate to its needs list and append needs.gate.outputs.authorized == 'true' to its if expression, making sure the output key (authorized) matches the gate job's output name.
- Around lines 775-776: the "Run H100 Kernel Tests" step invokes task_test_blackwell_kernels.sh, mislabeling H100/Hopper runs as Blackwell. Either add a Hopper-specific scripts/task_test_hopper_kernels.sh and point the step (bash ci/bash.sh ${DOCKER_IMAGE} ./scripts/task_test_blackwell_kernels.sh) at it, or, if the Blackwell kernel tests are verified to run on Hopper, rename the script to a generic name (or add a wrapper) and document the compatibility.
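The gate wiring requested in the first two comments might look roughly like this in pr-test.yml. This is a sketch assuming the gate job exposes an `authorized` output and the setup job exposes `skip_build`/`skip_gpu` outputs; the exact job names, output keys, and runner labels must match the actual workflow.

```yaml
# Hypothetical fragment: gating a GPU job on authorization (names assumed).
gpu-tests-b200:
  needs: [setup, gate]
  if: >-
    needs.setup.outputs.skip_build != 'true' &&
    needs.setup.outputs.skip_gpu != 'true' &&
    needs.gate.outputs.authorized == 'true'
  runs-on: [self-hosted, b200]
```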

@yongwww
Member Author

yongwww commented Feb 1, 2026

All CI jobs are green except tests/moe/test_trtllm_gen_fused_moe.py on B200, which matches the current pipeline.

@yongwww
Member Author

yongwww commented Feb 1, 2026

@flashinfer-bot rerun failed

@yongwww yongwww marked this pull request as draft February 12, 2026 21:43
