Conversation
- Check job metadata/annotations for "operation was canceled" errors
- Treat failed log downloads as infrastructure failures
- Fixes cases where spot termination happens too fast for the monitor script
Summary of Changes

Hello @yongwww, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request expands the project's continuous integration capabilities by incorporating H100 GPUs for public testing, ensuring broader hardware compatibility. Concurrently, it refines the mechanism for identifying infrastructure-related job failures, specifically improving the detection of spot instance terminations by recognizing common cancellation messages.
📝 Walkthrough

Standardizes CI runner/architecture labels, refactors Docker cleanup and checkout steps, removes Docker Hub login, adds a new H100 GPU test job and aggregates its result, and extracts inline spot-termination analysis into a new `scripts/task_analyze_spot.sh` script.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Workflow as GitHub Actions Workflow
    participant Script as scripts/task_analyze_spot.sh
    participant GHAPI as GitHub CLI / API
    participant Runner as Runner & Job Logs

    Workflow->>Script: invoke (job_filter, repo, run_id)
    Script->>GHAPI: list jobs for run (filtered)
    GHAPI-->>Script: job metadata (conclusions)
    alt failed/cancelled jobs exist
        Script->>GHAPI: download job logs
        GHAPI-->>Runner: provide logs (zip or text)
        Script->>Script: extract & scan logs for markers/errors
        alt marker/error found
            Script-->>Workflow: set is_spot_termination=true (GITHUB_OUTPUT)
        else
            Script-->>Workflow: set is_spot_termination=false (GITHUB_OUTPUT)
        end
    else
        Script-->>Workflow: set is_spot_termination=false (GITHUB_OUTPUT)
    end
```
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 3 failed (2 warnings, 1 inconclusive)
No actionable comments were generated in the recent review.
Code Review
This PR adds a script to analyze CI job failures for spot instance termination. The script is a good addition. I've found two areas for improvement to make it more robust: handling API pagination to ensure all jobs are checked, and using secure temporary file creation. Details are in the specific comments.
scripts/task_analyze_spot.sh
Outdated
```shell
FAILED_JOBS=$(gh api "/repos/${REPOSITORY}/actions/runs/${RUN_ID}/jobs?per_page=100" \
  --jq ".jobs[] | select(.name | ${JOB_FILTER}) | select(.conclusion == \"failure\" or .conclusion == \"cancelled\") | .id")
```
The current `gh api` call to fetch jobs has a hardcoded `per_page=100` limit and does not handle pagination. If a workflow run has more than 100 jobs, this script could miss the failed job, leading to an incorrect result. To make this robust, you should use the `--paginate` flag, which will ensure all jobs from all pages are fetched and processed.
```diff
-FAILED_JOBS=$(gh api "/repos/${REPOSITORY}/actions/runs/${RUN_ID}/jobs?per_page=100" \
-  --jq ".jobs[] | select(.name | ${JOB_FILTER}) | select(.conclusion == \"failure\" or .conclusion == \"cancelled\") | .id")
+FAILED_JOBS=$(gh api --paginate "/repos/${REPOSITORY}/actions/runs/${RUN_ID}/jobs" \
+  --jq ".jobs[] | select(.name | ${JOB_FILTER}) | select(.conclusion == \"failure\" or .conclusion == \"cancelled\") | .id")
```
scripts/task_analyze_spot.sh
Outdated
```shell
SPOT_TERMINATION=false

# Temp file for job logs (cleaned up on exit)
LOG_FILE="/tmp/job_log.txt"
```
Using a hardcoded path for a temporary file is not recommended due to potential race conditions and security vulnerabilities. Please use `mktemp` to generate a unique temporary file name.
Note that after this change, you will need to adjust how the corresponding `.zip` file is handled. For example, you could define `LOG_FILE_ZIP="${LOG_FILE}.zip"` and use that variable throughout the script.
```diff
-LOG_FILE="/tmp/job_log.txt"
+LOG_FILE=$(mktemp)
```
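A minimal sketch of the pattern this suggestion implies — `mktemp` plus a `trap`-based cleanup. The variable names and the trap wiring are illustrative assumptions, not the script's actual code:

```shell
#!/usr/bin/env bash
set -euo pipefail

# mktemp creates a unique file, avoiding the collisions and symlink
# attacks that a fixed /tmp path is exposed to.
LOG_FILE=$(mktemp)
# Derive the zip path from the temp file, as the comment suggests.
LOG_FILE_ZIP="${LOG_FILE}.zip"

# Remove both files whenever the script exits, on success or failure.
cleanup() { rm -f "$LOG_FILE" "$LOG_FILE_ZIP"; }
trap cleanup EXIT

echo "downloaded job log" > "$LOG_FILE"
```

Because the cleanup is registered on `EXIT`, the temp files are removed even if a later `gh api` call fails under `set -e`.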
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@scripts/task_analyze_spot.sh`:
- Around line 45-53: The current broad grep of JOB_INFO (variable JOB_INFO from
gh api for JOB_ID in FAILED_JOBS) risks matching fail-fast cancellations; narrow
the heuristic by inspecting the JSON field that contains runner/annotation
messages (e.g., .annotations[].message or the runner-specific message field)
instead of grepping the whole JOB_INFO, and add a guard that skips setting
SPOT_TERMINATION when the workflow-triggering job’s own conclusion is "failure"
(i.e., only mark SPOT_TERMINATION when the triggering job did not fail and the
runner/annotation field contains "operation was canceled" or runner-lost
patterns).
🧹 Nitpick comments (4)
scripts/task_analyze_spot.sh (2)

36-37: Pagination: `per_page=100` may miss jobs in large matrix runs.

The AOT matrix alone is 2 arches × 4 CUDA versions = 8 jobs, plus reruns and other GPU jobs all share the same run ID. While 100 is likely sufficient today, consider using `--paginate` (as done elsewhere in this workflow for team member listing) or adding a comment noting the assumption.

30-33: Hardcoded `/tmp/job_log.txt` is safe for now but fragile.

Each analysis job runs on its own `ubuntu-latest` runner, so there's no concurrent collision risk today. However, using `mktemp` would be more robust if this script is ever reused in a different context.

♻️ Suggested improvement

```diff
-LOG_FILE="/tmp/job_log.txt"
-cleanup() { rm -f "$LOG_FILE" "${LOG_FILE}.zip"; }
+LOG_DIR=$(mktemp -d)
+LOG_FILE="${LOG_DIR}/job_log.txt"
+cleanup() { rm -rf "$LOG_DIR"; }
```

.github/workflows/pr-test.yml (2)

159-159: TODO: Re-add `.github/` to skip patterns before merging.

Line 159 has a TODO noting that `^\.github/` should be added back to `SKIP_PATTERNS` before merging to main. Without it, changes to only `.github/` workflow files will trigger the full build/test suite unnecessarily on main. Make sure this is addressed before merge.

661-667: H100 result handling in the summary is correct but inconsistent with other tests.

The H100 result is handled inline rather than through the `check_status` helper function used for AOT, A10G, and T4. This is understandable since H100 has no rerun logic, but the `check_status` function already handles the simple case (when `spot_term` is not `"true"` and the spot result is `"success"`, it reports passed). You could reuse it for consistency:

♻️ Optional: reuse check_status for H100

```diff
-          # H100 tests (no rerun logic yet - CB instances don't get spot terminated)
-          echo "" >> $GITHUB_STEP_SUMMARY
-          H100="${{ needs.gpu-tests-h100.result }}"
-          echo "GPU Tests (H100): $H100" >> $GITHUB_STEP_SUMMARY
-          if [ "$H100" != "success" ] && [ "$H100" != "skipped" ] && [ "${{ github.event.inputs.skip_gpu }}" != "true" ]; then
-            FAILED=true
-          fi
+          echo "" >> $GITHUB_STEP_SUMMARY
+          check_status "GPU Tests (H100)" \
+            "${{ github.event.inputs.skip_gpu }}" \
+            "${{ needs.gpu-tests-h100.result }}" \
+            "false" \
+            "" || FAILED=true
```
```shell
for JOB_ID in $FAILED_JOBS; do
  # First check job metadata for runner communication errors
  # This catches "The operation was canceled" which appears in annotations, not logs
  JOB_INFO=$(gh api "/repos/${REPOSITORY}/actions/jobs/${JOB_ID}" 2>/dev/null || true)
  if echo "$JOB_INFO" | grep -qiE "operation was canceled|runner.*lost|lost communication"; then
    echo "Detected: Runner lost communication or operation canceled (job $JOB_ID)"
    SPOT_TERMINATION=true
    break
  fi
```
"operation was canceled" may cause false-positive spot-termination reruns with `fail-fast: true`.

When `fail-fast: true` is set (as it is for all matrix jobs in this workflow), GitHub Actions cancels sibling matrix jobs when one fails. Cancelled jobs can have "The operation was canceled" in their annotations/metadata. Grepping the entire JSON response for this pattern will match those fail-fast cancellations too, incorrectly flagging them as spot terminations and triggering unnecessary on-demand reruns.
Previously, the analysis relied on specific network error patterns (e.g., `connection reset by peer`, `grpc.*closing`) that fail-fast cancellation does not produce. This new, broader check on line 49 undermines that safety. Consider either:

- Restricting the grep to a specific JSON field (e.g., `.steps[].conclusion` or annotation messages from the runner itself), or
- Adding a negative guard that skips this heuristic when the triggering job's own conclusion is `failure` (i.e., a real test failure caused the fail-fast cascade).

Based on learnings: "the analyze-aot-failure and analyze-gpu-*-failure steps look for specific network error patterns that indicate actual spot termination, not generic cancellation messages. fail-fast cancellation does not produce these patterns, so there is no risk of false-positive spot-termination reruns."
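One way to sketch the suggested narrowing as a pure-shell helper. The function name, the conclusion guard, and the exact patterns are illustrative assumptions, not the script's actual implementation:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Succeeds only when a message looks like a runner-level spot termination
# AND the job did not conclude as a genuine test failure. Note that
# fail-fast cancellations write "Canceling since another job has failed"
# to the job logs, which does not match these metadata patterns.
is_spot_termination() {
  local message="$1" conclusion="$2"
  if [ "$conclusion" = "failure" ]; then
    return 1  # a real test failure, not infrastructure
  fi
  echo "$message" | grep -qiE "operation was canceled|runner.*lost|lost communication"
}

if is_spot_termination "Error: The operation was canceled." "cancelled"; then
  echo "spot termination"
fi
```

Keeping the guard and the pattern match in one small function makes the heuristic easy to unit-test outside the workflow.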
fail-fast cancellation produces "Canceling since another job has failed" in the logs, not "operation was canceled" in the job metadata.
@yongwww, thank you for the clarification! You're absolutely right—I misunderstood the distinction. Since the script checks job metadata (not logs), and fail-fast cancellations produce "Canceling since another job has failed" in logs rather than "operation was canceled" in metadata, there's no risk of false positives here. My concern was unfounded.
I appreciate you explaining this behavioral difference between spot termination and fail-fast cancellation!
(\_(\
(='.'=) Learned something new!
(")_(")
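The distinction the thread settles on can be demonstrated directly with `grep`. The two sample strings are taken from the discussion above; the variable names are arbitrary:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Message that fail-fast cancellation writes to the job *logs*:
FAILFAST_LOG_MSG="Canceling since another job has failed"
# Message that spot termination surfaces in the job *metadata* (API response):
SPOT_META_MSG="Error: The operation was canceled."

PATTERN="operation was canceled"

# The metadata pattern matches only the spot-termination message.
echo "$SPOT_META_MSG" | grep -qi "$PATTERN" && echo "metadata: spot signal"
echo "$FAILFAST_LOG_MSG" | grep -qi "$PATTERN" || echo "logs: no spot signal"
```

Since "Canceling" never appears as "canceled", a metadata-only check cannot be tripped by fail-fast cascades.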
✏️ Learnings added
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2552
File: scripts/task_analyze_spot.sh:45-53
Timestamp: 2026-02-12T22:25:03.967Z
Learning: In GitHub Actions, fail-fast cancellations produce "Canceling since another job has failed" in the job logs, while spot terminations produce "operation was canceled" in the job metadata (API response). These are distinct patterns in different locations, so checking job metadata for "operation was canceled" (as in `scripts/task_analyze_spot.sh`) will not match fail-fast cancellations.
🧠 Learnings used
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2500
File: .github/workflows/pr-test-runner.yml:56-67
Timestamp: 2026-02-06T06:06:12.386Z
Learning: In `.github/workflows/pr-test-runner.yml`, the analyze-aot-failure and analyze-gpu-*-failure steps look for specific network error patterns that indicate actual spot termination, not generic cancellation messages. fail-fast cancellation does not produce these patterns, so there is no risk of false-positive spot-termination reruns.
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2500
File: .github/workflows/pr-test-runner.yml:56-67
Timestamp: 2026-02-06T05:51:05.040Z
Learning: In `.github/workflows/pr-test-runner.yml`, the team prefers to keep `fail-fast: true` for matrix jobs to reduce overall test time, even with the potential for false-positive spot-termination reruns.
@bkryu FYI, as we discussed earlier, changes to benchmarks won’t automatically trigger the PR tests with this change.
📌 Description
#2355.
Add H100 to public ci testing.
Added "Error: The operation was canceled." (occurred in jobs like https://github.com/flashinfer-ai/flashinfer/actions/runs/21931167188/attempts/1?pr=2425) as one of the spot termination signals.
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- Installed the hooks with `pre-commit install`.
- Ran `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed and are passing (`unittest`, etc.).

Reviewer Notes