
[CI]: Enable Blackwell & Hopper in public CI testing #2413

Draft
yongwww wants to merge 9 commits into flashinfer-ai:main from yongwww:ci_blackwell

Conversation

@yongwww
Member

@yongwww yongwww commented Jan 24, 2026

📌 Description

#2355

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Tests

    • Added GPU test support for B200 and H100 with dedicated capacity blocks; test orchestration and final results aggregation now include these GPUs and affect PR status.
    • Normalized architecture labels across matrices and rerun logic for consistent AOT and GPU job handling.
  • Bug Fixes / Reliability

    • Replaced inline log parsing with a scripted spot-failure analysis and hardened detection for infrastructure/spot terminations.
    • Removed redundant Docker Hub login steps and improved sparse checkout for analysis scripts.
  • Chores

    • Switched cleanup to targeted image and builder prune commands.


@gemini-code-assist
Contributor

Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.

@coderabbitai
Contributor

coderabbitai bot commented Jan 24, 2026

📝 Walkthrough

Walkthrough

Adds two GPU test jobs (gpu-tests-b200, gpu-tests-h100), normalizes runner/arch labels to lowercase, replaces broad docker prune with targeted prune commands, factors spot-failure analysis into a new script (scripts/task_analyze_spot.sh), and extends test-results-summary to include B200/H100 statuses.

Changes

Cohort / File(s) Summary
Workflow file
.github/workflows/pr-test.yml
Added gpu-tests-b200 and gpu-tests-h100; normalized runs-on and matrix arch values to lowercase; removed Docker Hub login steps from some jobs; updated DOCKER_IMAGE references for GPU blocks.
Cleanup & pruning
.github/workflows/pr-test.yml
Replaced docker system prune with docker image prune and added docker builder prune --filter "until=24h" in cleanup steps.
Failure handling & log analysis
.github/workflows/pr-test.yml, scripts/task_analyze_spot.sh
Extracted spot/infra failure detection into scripts/task_analyze_spot.sh; workflow now sparse-checkouts scripts, invokes the analyzer to gate log-downloads and rerun logic; adjusted rerun matrix to use lowercase arch values.
Results aggregation
.github/workflows/pr-test.yml
Expanded test-results-summary needs and final evaluation to include B200 and H100 blocks and propagate their statuses in the final summary.
New script
scripts/task_analyze_spot.sh
New shell script that queries GitHub Jobs for a run, inspects failed/cancelled job logs and metadata for spot-termination/infrastructure error patterns, and emits is_spot_termination via GITHUB_OUTPUT.
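The detection flow described above can be sketched roughly as follows. This is a hedged illustration, not the actual scripts/task_analyze_spot.sh; the failure patterns and the `is_spot_termination` function name are assumptions for the example.

```shell
#!/usr/bin/env bash
# Illustrative sketch of spot-termination detection (not the real
# scripts/task_analyze_spot.sh; patterns and names are placeholders).
set -euo pipefail

# is_spot_termination <logfile>: print "true" if the log matches known
# spot/infrastructure failure signatures, "false" otherwise.
is_spot_termination() {
  if grep -Eqi 'spot.?(instance|termination)|The (runner|operation) was canceled|Instance was terminated' "$1"; then
    echo "true"
  else
    echo "false"
  fi
}

# In CI this would run over logs fetched for failed/cancelled jobs, e.g.:
#   gh api "repos/$REPO/actions/runs/$RUN_ID/jobs" --paginate
# and the verdict would be exported for later workflow steps:
#   echo "is_spot_termination=$result" >> "$GITHUB_OUTPUT"
```

The actual script additionally inspects job metadata, not just log text, before deciding whether a rerun is warranted.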

Sequence Diagram(s)

sequenceDiagram
    participant PR as "PR Trigger"
    participant GH as "GitHub Actions"
    participant JobB as "gpu-tests-b200"
    participant JobH as "gpu-tests-h100"
    participant Script as "scripts/task_analyze_spot.sh"
    participant Art as "Artifacts / Logs"
    participant Summary as "test-results-summary"

    PR->>GH: trigger pr-test workflow
    GH->>JobB: start B200 job (setup, tests, upload logs)
    GH->>JobH: start H100 job (setup, tests, upload logs)
    JobB-->>Art: upload logs & status
    JobH-->>Art: upload logs & status
    Summary->>Script: call analyzer (job_filter, repo, run_id)
    Script->>GH: query workflow run jobs / download logs
    Script-->>Summary: return is_spot_termination flag
    alt is_spot_termination == false
        Summary->>Art: download logs, aggregate statuses -> final pass/fail
    else
        Summary->>Summary: mark SPOT_TERMINATION / infra-failure, adjust rerun matrix
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

run-ci

Suggested reviewers

  • yzh119
  • nvmbreughe
  • kahyunnam
  • jimmyzho

Poem

🐰 I hopped through CI with a curious nose,
Two GPUs now spin where the test river flows.
I sniffed for spot slips, pruned images just so,
Collected the logs, then watched statuses grow.
Carrots for passing — hop, rerun, and go! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check — ⚠️ Warning. The PR description contains only a template reference (issue #2355) with incomplete checklist items and no substantive explanation of the changes. Resolution: describe what this PR does (e.g., enabling Blackwell & Hopper GPU testing, removing Docker login steps, introducing a spot-termination analysis script) and why these changes are needed.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed. The title clearly and specifically summarizes the main change: enabling Blackwell and Hopper GPU support in the CI testing workflow.




@yongwww yongwww changed the title ci: Enable blackwell tests in public ci [CI]: Enable Blackwell tests in public CI Jan 24, 2026
@yongwww yongwww changed the title [CI]: Enable Blackwell tests in public CI [CI]: Enable Blackwell & Hopper in public CI testing Jan 24, 2026
@yongwww yongwww marked this pull request as ready for review January 24, 2026 17:10
@yongwww
Member Author

yongwww commented Jan 24, 2026

The B200 test is green:
https://github.com/flashinfer-ai/flashinfer/actions/runs/21314185264/job/61365090019?pr=2413

The H100 test is still pending due to a lack of available H100 runners (no EC2 capacity available at the moment).

@yzh119
Collaborator

yzh119 commented Jan 24, 2026

Considering we have 8 GPUs for Hopper and Blackwell runners, my suggestion is to have a script automatically load balance the unittests and dispatch them to all 8 GPUs.

@yongwww
Member Author

yongwww commented Jan 25, 2026

Considering we have 8 GPUs for Hopper and Blackwell runners, my suggestion is to have a script automatically load balance the unittests and dispatch them to all 8 GPUs.

Thanks @yzh119! Good point about utilizing all 8 GPUs. Here are the two approaches I've considered:
Option 1: 8 Runners (1 per GPU) - current

  • ✅ 8 PRs can run in parallel
  • ❌ Cannot run multi-GPU tests

Option 2: 1 Runner (1 per node as suggested)

  • ✅ Can run multi-GPU tests and utilize all the GPUs for each run (so each PR's tests complete faster)
  • ❌ Only 1 PR runs at a time; others queue

This PR currently uses Option 1; I will switch to Option 2.


update (Jan 29, 2026)

Option 3: 1 runner with 4 GPUs, plus 4 runners with 1 GPU each (proposed by @dierksen )

  • ✅ Several PRs can run in parallel
  • ✅ We can run multi-GPU tests (only one multi-GPU runner, but we could have two 2-GPU runners if needed)
  • ❌ Slightly more complex runner management (but not hard to handle)

I’m in favor of Option 3; please let me know if you have any concerns. @yzh119 @dierksen
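To make the dispatch idea concrete, the load balancing yzh119 suggests could start as simple round-robin sharding of test files across GPUs. This is a hedged sketch, not an existing FlashInfer script; `NUM_GPUS`, the test list, and `shard_for` are all illustrative names.

```shell
#!/usr/bin/env bash
# Illustrative sketch of balancing unit tests across GPUs (hypothetical,
# not an existing FlashInfer script; paths and names are placeholders).
set -euo pipefail

NUM_GPUS=8
# In CI the list would be discovered, e.g.: git ls-files 'tests/test_*.py'
TESTS=(tests/test_a.py tests/test_b.py tests/test_c.py tests/test_d.py)

# shard_for <index>: GPU id for the i-th test (simple round-robin)
shard_for() { echo $(( $1 % NUM_GPUS )); }

for i in "${!TESTS[@]}"; do
  gpu=$(shard_for "$i")
  # Real dispatch would pin each shard to its GPU, e.g.:
  #   CUDA_VISIBLE_DEVICES=$gpu pytest "${TESTS[$i]}" &
  echo "gpu $gpu <- ${TESTS[$i]}"
done
```

A production version would weight shards by historical test duration rather than file count, since per-file runtimes vary widely; round-robin is just the simplest baseline.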

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In .github/workflows/pr-test.yml:
- Around lines 684-688: the gpu-tests-b200 job is missing the gate dependency and authorization check, so unauthorized PRs could run on B200 runners. Add gate to the job's needs list (needs: [setup, gate]) and extend its if condition with needs.gate.outputs.authorized == 'true' alongside the existing skip_build and skip_gpu checks, so the job waits for gate and only runs when authorized.
- Around lines 733-737: the gpu-tests-h100 job has the same problem as the B200 job. Add gate to its needs list and append needs.gate.outputs.authorized == 'true' to its if expression, making sure the output key (authorized) matches the gate job's output name.
- Around lines 775-776: the "Run H100 Kernel Tests" step invokes task_test_blackwell_kernels.sh, mislabeling H100/Hopper runs as Blackwell. Either add a Hopper-specific scripts/task_test_hopper_kernels.sh and point the step (bash ci/bash.sh ${DOCKER_IMAGE} ./scripts/task_test_blackwell_kernels.sh) at it, or, if the Blackwell kernel tests are verified to run on Hopper, rename the script to a generic name (or add a wrapper) and document the compatibility.
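The gate wiring requested in the first two comments might look roughly like this in pr-test.yml. This is a sketch assuming the gate job exposes an `authorized` output and the setup job exposes `skip_build`/`skip_gpu` outputs; the exact job names, output keys, and runner labels must match the actual workflow.

```yaml
# Hypothetical fragment: gating a GPU job on authorization (names assumed).
gpu-tests-b200:
  needs: [setup, gate]
  if: >-
    needs.setup.outputs.skip_build != 'true' &&
    needs.setup.outputs.skip_gpu != 'true' &&
    needs.gate.outputs.authorized == 'true'
  runs-on: [self-hosted, b200]
```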

@yongwww
Member Author

yongwww commented Feb 1, 2026

All CI jobs are green except tests/moe/test_trtllm_gen_fused_moe.py on B200, which matches the current pipeline.

@yongwww
Member Author

yongwww commented Feb 1, 2026

@flashinfer-bot rerun failed

@yongwww yongwww marked this pull request as draft February 12, 2026 21:43
