[CI]: Enable Blackwell & Hopper in public CI testing#2413
yongwww wants to merge 9 commits into flashinfer-ai:main
📝 Walkthrough
Adds two GPU test jobs (…)
Sequence Diagram(s)
sequenceDiagram
participant PR as "PR Trigger"
participant GH as "GitHub Actions"
participant JobB as "gpu-tests-b200"
participant JobH as "gpu-tests-h100"
participant Script as "scripts/task_analyze_spot.sh"
participant Art as "Artifacts / Logs"
participant Summary as "test-results-summary"
PR->>GH: trigger pr-test workflow
GH->>JobB: start B200 job (setup, tests, upload logs)
GH->>JobH: start H100 job (setup, tests, upload logs)
JobB-->>Art: upload logs & status
JobH-->>Art: upload logs & status
Summary->>Script: call analyzer (job_filter, repo, run_id)
Script->>GH: query workflow run jobs / download logs
Script-->>Summary: return is_spot_termination flag
alt is_spot_termination == false
Summary->>Art: download logs, aggregate statuses -> final pass/fail
else
Summary->>Summary: mark SPOT_TERMINATION / infra-failure, adjust rerun matrix
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 failed (2 warnings)
|
The B200 test is green. The H100 test is still pending because no H100 runners are available (no EC2 capacity at the moment).
|
Considering that the Hopper and Blackwell runners have 8 GPUs, I suggest having a script automatically load-balance the unit tests and dispatch them across all 8 GPUs.
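The load-balancing idea above could be sketched roughly as follows. This is only an illustration, not the script used in this PR: `assign_tests`, `worker_env`, the test file names, and the per-file durations are all hypothetical, and it assumes each test file can run on a single GPU selected through `CUDA_VISIBLE_DEVICES`. It uses a greedy "longest test first" heuristic, which always hands the next-slowest test to the least-loaded GPU.

```python
import heapq


def assign_tests(durations, num_gpus=8):
    """Greedily assign tests to GPUs, slowest test first.

    durations: dict mapping test path -> estimated runtime in seconds
               (hypothetical numbers, e.g. taken from a previous run).
    Returns a dict mapping GPU index -> list of test paths.
    """
    # Min-heap of (accumulated_seconds, gpu_index): the least-loaded GPU pops first.
    heap = [(0.0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    buckets = {gpu: [] for gpu in range(num_gpus)}

    # Sort by descending cost; break ties by name so the result is deterministic.
    for test, cost in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, gpu = heapq.heappop(heap)
        buckets[gpu].append(test)
        heapq.heappush(heap, (load + cost, gpu))
    return buckets


def worker_env(gpu):
    """Environment override that pins one worker process to one GPU."""
    return {"CUDA_VISIBLE_DEVICES": str(gpu)}


if __name__ == "__main__":
    # Hypothetical test files and runtimes, for illustration only.
    example = {
        "tests/test_a.py": 10,
        "tests/test_b.py": 8,
        "tests/test_c.py": 5,
        "tests/test_d.py": 4,
        "tests/test_e.py": 3,
    }
    for gpu, tests in assign_tests(example, num_gpus=2).items():
        print(gpu, worker_env(gpu), tests)
```

Each bucket could then be handed to a separate test process started with the matching `worker_env(gpu)` merged into its environment. The heuristic does not guarantee an optimal split, but it tends to keep the slowest GPU's total close to the average.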
Thanks @yzh119! Good point about utilizing all 8 GPUs. Here are the two approaches I've considered:
Option 2: 1 Runner (1 per node as suggested)
The PR currently uses Option 1; I will switch to Option 2.
Update (Jan 29, 2026):
Option 3: 1 runner with 4 GPUs, plus 4 runners with 1 GPU each (proposed by @dierksen)
I'm in favor of Option 3; please let me know if you have any concerns. @yzh119 @dierksen
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @.github/workflows/pr-test.yml:
- Around line 733-737: The gpu-tests-h100 job is missing the gate dependency and
authorization check (same as the B200 job); update the job definition for
gpu-tests-h100 to include the gate job in its needs list and extend its if
condition to require the gate authorization output (e.g., add gate to "needs:
setup" so it becomes "needs: [setup, gate]" and append "&&
needs.gate.outputs.authorized == 'true'" to the existing if expression), making
sure the output key (authorized) matches the gate job's output name.
- Around line 684-688: The gpu-tests-b200 job definition (gpu-tests-b200) is
missing the gate dependency and authorization check, allowing unauthorized PRs
to run on B200 runners; update the job to include needs: [gate, setup] and add
the same conditional check used by other GPU jobs (i.e. include
needs.gate.outputs.authorized == 'true' in the if expression alongside the
existing skip_build and skip_gpu checks) so the job waits for gate and only runs
when authorized.
- Around line 775-776: The H100 job ("Run H100 Kernel Tests") incorrectly
invokes task_test_blackwell_kernels.sh; either create a Hopper-specific script
named task_test_hopper_kernels.sh and update the job to run that, or if
Blackwell kernels are verified compatible with Hopper, update the job to
document/clarify the compatibility and rename the script reference to a more
generic name (or add a wrapper script) so the job no longer mislabels
H100/Hopper as Blackwell; update the workflow entry that runs bash ci/bash.sh
${DOCKER_IMAGE} ./scripts/task_test_blackwell_kernels.sh to point to
task_test_hopper_kernels.sh or the new generic/wrapper script and add the new
script file (task_test_hopper_kernels.sh) implementing Hopper-specific tests if
creating one.
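The first two review items above amount to a rule that could be checked mechanically. The following is a minimal sketch of such a check, not something that exists in this PR: `jobs_missing_gate` is a hypothetical helper, the `gpu-tests-` prefix is taken from the job names in the review, and the workflow is assumed to have been loaded into a plain dictionary already (for example with a YAML parser). The `if` expressions in the usage example are illustrative assumptions, not the real workflow contents.

```python
# Expression the review asks every GPU job to include in its `if` condition.
GATE_CHECK = "needs.gate.outputs.authorized == 'true'"


def jobs_missing_gate(workflow, prefix="gpu-tests-"):
    """Return {job_name: [problems]} for GPU jobs that are not gated.

    A job is reported with "needs" if `gate` is absent from its needs list,
    and with "if" if its condition does not contain GATE_CHECK.
    """
    problems = {}
    for name, job in workflow.get("jobs", {}).items():
        if not name.startswith(prefix):
            continue
        needs = job.get("needs", [])
        if isinstance(needs, str):  # `needs: setup` is a valid scalar form
            needs = [needs]
        issues = []
        if "gate" not in needs:
            issues.append("needs")
        if GATE_CHECK not in str(job.get("if", "")):
            issues.append("if")
        if issues:
            problems[name] = issues
    return problems


if __name__ == "__main__":
    # Hypothetical, simplified workflow: one ungated job and one gated job.
    workflow = {
        "jobs": {
            "gpu-tests-h100": {
                "needs": "setup",
                "if": "needs.setup.outputs.skip_build != 'true'",
            },
            "gpu-tests-b200": {
                "needs": ["gate", "setup"],
                "if": "needs.gate.outputs.authorized == 'true' && "
                      "needs.setup.outputs.skip_build != 'true'",
            },
        }
    }
    print(jobs_missing_gate(workflow))
```

The substring match on the `if` expression is deliberately simple; it would miss a condition that is equivalent but formatted differently, so it is better treated as a reminder than as proof that a job is gated.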
|
All CI jobs are green except tests/moe/test_trtllm_gen_fused_moe.py on B200; that result matches the current pipeline.
|
@flashinfer-bot rerun failed |
- Check job metadata/annotations for "operation was canceled" errors
- Treat failed log downloads as infrastructure failures
- Fixes cases where spot termination happens too fast for the monitor script
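The heuristics in this commit message could be expressed along the following lines. This is a sketch under stated assumptions, not the contents of scripts/task_analyze_spot.sh: the `JobReport` structure, the `classify` function, and the `PASSED` / `INFRA_FAILURE` / `TEST_FAILURE` labels are hypothetical, and only the "operation was canceled" marker and the `SPOT_TERMINATION` label come from this page.

```python
from dataclasses import dataclass, field

# Marker text mentioned in the commit message.
CANCEL_MARKER = "operation was canceled"


@dataclass
class JobReport:
    """Hypothetical summary of one workflow job."""
    name: str
    conclusion: str                       # e.g. "success" or "failure"
    annotations: list = field(default_factory=list)
    logs_downloaded: bool = True          # False if the log download failed


def classify(job):
    """Classify a finished job so infra problems are not counted as test failures."""
    if job.conclusion == "success":
        return "PASSED"
    # A cancellation annotation suggests the runner was reclaimed mid-job.
    if any(CANCEL_MARKER in note.lower() for note in job.annotations):
        return "SPOT_TERMINATION"
    # No logs usually means the runner vanished before it could upload them.
    if not job.logs_downloaded:
        return "INFRA_FAILURE"
    return "TEST_FAILURE"


if __name__ == "__main__":
    reports = [
        JobReport("gpu-tests-b200", "success"),
        JobReport("gpu-tests-h100", "failure",
                  annotations=["The operation was canceled."]),
    ]
    for report in reports:
        print(report.name, classify(report))
```

Checking annotations before logs reflects the third bullet: if the instance disappears before a monitor script can write anything, job metadata may be the only evidence left.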
📌 Description
#2355
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed pre-commit by running pip install pre-commit (or used your preferred method).
- Installed the hooks with pre-commit install.
- Ran pre-commit run --all-files and fixed any reported issues.
🧪 Tests
- All tests are passing (unittest, etc.).
Reviewer Notes