
Add parallel testing to unit test script #2531

Merged
kahyunnam merged 6 commits into flashinfer-ai:main from dierksen:parallel-testing
Feb 12, 2026
Conversation

@dierksen
Contributor

@dierksen dierksen commented Feb 9, 2026

  • Enable parallel testing on systems with multiple GPUs
  • Remove default for CUDA_VISIBLE_DEVICES

📌 Description

To make better use of available GPUs, instead of running all unit tests sequentially, this breaks
them up into chunks based on the number of available GPUs. This should help reduce issues with
timeouts on internal CI/CD and give us faster results from ci-bot as well.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • [x] I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • [x] I have installed the hooks with pre-commit install.
  • [x] I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • GPU-aware parallel test execution enabled by default with automatic GPU detection and graceful fallback to single-GPU mode.
  • Tests

    • Added a dedicated GPU-detection test covering multiple visibility and fallback scenarios.
    • Improved test harness to run tests concurrently across GPUs, assign/reclaim GPUs dynamically, report per-test outcomes, and summarize pass/fail totals.

@gemini-code-assist
Contributor

Summary of Changes

Hello @dierksen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the efficiency of unit test execution, particularly on systems equipped with multiple GPUs. By enabling parallel testing, it aims to reduce CI/CD pipeline timeouts and accelerate feedback from automated testing, leading to a more streamlined development workflow.

Highlights

  • Parallel Test Execution: Implemented a new mechanism to run unit tests in parallel across multiple GPUs, significantly reducing execution time on multi-GPU systems.
  • GPU Detection Logic: Introduced a detect_gpus function that intelligently determines available GPUs based on CUDA_VISIBLE_DEVICES or nvidia-smi output.
  • Default CUDA_VISIBLE_DEVICES Removal: Removed the default setting for CUDA_VISIBLE_DEVICES in test_utils.sh, allowing the new detection logic to take precedence.
  • Parallel Tests Toggle: Added a PARALLEL_TESTS environment variable, defaulting to true, to enable or disable parallel test execution.
  • GPU Detection Test Script: Added a dedicated script (test_gpu_detection.sh) to verify the new GPU detection logic under various scenarios.
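The detection order summarized above (CUDA_VISIBLE_DEVICES first, then nvidia-smi, then a single-GPU fallback) can be sketched as follows. This is a hypothetical reconstruction from the summary, not the actual scripts/test_utils.sh code:

```shell
# Hypothetical sketch of the detection order: CUDA_VISIBLE_DEVICES takes
# precedence, then nvidia-smi, then a graceful single-GPU fallback.
detect_gpus() {
    if [ -n "${CUDA_VISIBLE_DEVICES:-}" ] && [ "${CUDA_VISIBLE_DEVICES:-}" != "-1" ]; then
        # Normalize "0,1,2,3" (or "0 1 2 3") to a space-separated list
        echo "${CUDA_VISIBLE_DEVICES}" | tr ',' ' '
        return
    fi
    if command -v nvidia-smi >/dev/null 2>&1; then
        local gpus
        gpus=$(nvidia-smi --query-gpu=index --format=csv,noheader 2>/dev/null | tr '\n' ' ' | sed 's/ *$//')
        if [ -n "$gpus" ]; then
            echo "$gpus"
            return
        fi
    fi
    echo "0"  # graceful fallback to single-GPU mode
}
```

With CUDA_VISIBLE_DEVICES="0,1,2,3" set, this sketch would report the GPU list "0 1 2 3".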


Changelog
  • scripts/test_gpu_detection.sh
    • Added a new shell script to test the detect_gpus function under different CUDA_VISIBLE_DEVICES configurations and nvidia-smi availability.
  • scripts/test_utils.sh
    • Removed the default assignment for CUDA_VISIBLE_DEVICES.
    • Added PARALLEL_TESTS environment variable with a default value of true.
    • Introduced the detect_gpus function to identify available GPUs.
    • Implemented the run_tests_parallel function to orchestrate parallel test execution across detected GPUs.
    • Modified the execute_tests function to conditionally invoke parallel or sequential test execution based on the PARALLEL_TESTS flag.

@coderabbitai
Contributor

coderabbitai bot commented Feb 9, 2026

📝 Walkthrough

Walkthrough

Adds GPU detection and parallel test orchestration utilities, a new GPU-detection test script, and enables parallel test execution in the unit test task script.

Changes

Cohort / File(s) Summary
GPU Detection Test Script
scripts/test_gpu_detection.sh
New shell test validating GPU detection across five scenarios: comma/space CUDA_VISIBLE_DEVICES, single-GPU, PARALLEL_TESTS=false behavior, unset-device handling with optional nvidia-smi, and summary/pass-fail reporting.
Parallel Test Execution Framework
scripts/test_utils.sh
Added detect_gpus(), run_tests_parallel(), and run_single_test_background(); execute_tests() now delegates to parallel flow when PARALLEL_TESTS=true; includes GPU assignment, dynamic reclamation, per-test background execution, and result aggregation (logging/JUnit preserved).
Unit Test Task Invocation
scripts/task_run_unit_tests.sh
Exports PARALLEL_TESTS=true, enabling the parallel test execution path by default for unit test runs.

Sequence Diagram

sequenceDiagram
    participant Executor as Test Executor
    participant Detector as GPU Detector
    participant Orchestrator as Parallel Orchestrator
    participant Worker as Background Test Worker
    participant Aggregator as Result Aggregator

    Executor->>Detector: detect_gpus()
    Detector-->>Executor: gpu_list

    Executor->>Orchestrator: run_tests_parallel(tests, gpu_list)

    par parallel test runs
        Orchestrator->>Worker: run_single_test_background(test1, gpu=0)
        Worker-->>Orchestrator: exit code / log
    and
        Orchestrator->>Worker: run_single_test_background(test2, gpu=1)
        Worker-->>Orchestrator: exit code / log
    and
        Orchestrator->>Worker: run_single_test_background(testN, gpu=k)
        Worker-->>Orchestrator: exit code / log
    end

    Orchestrator->>Aggregator: collect per-test results & JUnit XML
    Aggregator-->>Executor: aggregated report

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested reviewers

  • yzh119
  • nvmbreughe
  • kahyunnam
  • jimmyzho
  • yongwww

Poem

🐰 I sniff the GPUs from zero to three,
I spin up tests and set them free,
Logs in my burrow, results hop back,
I tally passes on the rabbit-track,
A happy hopper, reporting with glee.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
  • Description check (❓ Inconclusive): The PR description provides context for why parallel testing is needed and mentions key changes, but the Tests checklist is incomplete. Resolution: clarify test coverage by either adding and verifying tests for the new parallel testing logic, or explaining why existing tests cover the changes.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title accurately summarizes the main change: enabling parallel testing in the unit test script.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which meets the required 80.00% threshold.

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
scripts/test_utils.sh (3)

454-527: Nested function run_single_test_background leaks into global scope.

In bash, functions defined inside other functions become globally visible. After run_tests_parallel runs, run_single_test_background remains callable from anywhere, and it captures variables like total_files, mode, and PARALLEL_TMP_DIR by dynamic scope—meaning it will silently use stale or unset values if invoked outside its intended context.

This is unlikely to cause issues in practice given the script's usage, but worth being aware of.
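The behavior described here can be demonstrated standalone; the function names below are illustrative, not from the script:

```shell
# Demonstrates that a bash function defined inside another function
# becomes globally visible once the outer function has run.
outer() {
    inner() { echo "inner called"; }
    inner
}

if type inner >/dev/null 2>&1; then
    echo "before: inner visible"
else
    echo "before: inner not visible"
fi

outer   # defines and calls inner

if type inner >/dev/null 2>&1; then
    echo "after: inner visible"   # inner has leaked into global scope
else
    echo "after: inner not visible"
fi
```

Before outer runs, inner is undefined; afterwards it is callable from anywhere in the script, which is exactly the leak the comment describes.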


610-629: Duplicate sanity-mode result parsing in PASSED and FAILED branches.

Lines 612–617 and 623–628 contain identical logic for extracting and accumulating total_in_file / sampled_in_file. Consider extracting the common parsing above the if/elif so it runs once for any non-SKIPPED result.

Sketch
             TOTAL_TESTS=$((TOTAL_TESTS + 1))
 
+            # Parse sanity-mode counts (common to PASSED and FAILED)
+            if [ "$mode" = "sanity" ] && [[ "$result" == PASSED:* || "$result" == FAILED:* ]]; then
+                local total_in_file sampled_in_file
+                IFS=':' read -r _ total_in_file sampled_in_file <<< "$result"
+                TOTAL_TEST_CASES=$((TOTAL_TEST_CASES + total_in_file))
+                SAMPLED_TEST_CASES=$((SAMPLED_TEST_CASES + sampled_in_file))
+            fi
+
             if [[ "$result" == PASSED* ]]; then
                 PASSED_TESTS=$((PASSED_TESTS + 1))
-                if [ "$mode" = "sanity" ]; then
-                    local total_in_file sampled_in_file
-                    IFS=':' read -r _ total_in_file sampled_in_file <<< "$result"
-                    TOTAL_TEST_CASES=$((TOTAL_TEST_CASES + total_in_file))
-                    SAMPLED_TEST_CASES=$((SAMPLED_TEST_CASES + sampled_in_file))
-                fi
             elif [[ "$result" == FAILED* ]]; then
                 FAILED_TESTS="$FAILED_TESTS\n  - $test_file"
                 EXIT_CODE=1
-                if [ "$mode" = "sanity" ]; then
-                    local total_in_file sampled_in_file
-                    IFS=':' read -r _ total_in_file sampled_in_file <<< "$result"
-                    TOTAL_TEST_CASES=$((TOTAL_TEST_CASES + total_in_file))
-                    SAMPLED_TEST_CASES=$((SAMPLED_TEST_CASES + sampled_in_file))
-                fi

434-435: Trap preservation can break if the existing EXIT trap body contains single quotes.

The sed extracts the inner body from trap -p EXIT, then splices it back via string concatenation into a new single-quoted trap string. If the original trap contained single quotes (which trap -p escapes as '\''), the re-wrapping produces malformed quoting.

A more robust alternative uses eval:

Suggested alternative
-    PREV_EXIT_TRAP=$(trap -p EXIT | sed -E "s/^trap -- '(.*)' EXIT$/\1/")
-    trap 'rm -rf "$PARALLEL_TMP_DIR"; '"$PREV_EXIT_TRAP" EXIT
+    PREV_EXIT_TRAP=$(trap -p EXIT)
+    # shellcheck disable=SC2064
+    trap "rm -rf '$PARALLEL_TMP_DIR'; ${PREV_EXIT_TRAP:+eval \"\$PREV_EXIT_TRAP\"}" EXIT

This is unlikely to matter in practice since the caller probably doesn't set an EXIT trap with quotes, but it's worth noting for robustness.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces parallel testing for unit tests, which is a great improvement for CI/CD efficiency. The implementation is solid, but I've identified a few areas for improvement regarding robustness and performance. Specifically, the GPU detection can be made more reliable, the parallel job scheduling logic has a flaw that underutilizes GPUs, and the method for passing job information is not robust against certain characters in file paths. I've provided detailed comments and suggestions to address these points.

Comment on lines +503 to +535
    for i in "${!test_files_array[@]}"; do
        local test_file="${test_files_array[$i]}"
        local gpu_index=$((i % NUM_GPUS))
        local gpu_id="${GPU_LIST[$gpu_index]}"
        local file_index=$((i + 1))

        # Launch test in background
        local job_info
        job_info=$(run_single_test_background "$test_file" "$gpu_id" "$file_index")

        # Parse job info
        local pid result_file log_file
        IFS=':' read -r pid test_file result_file log_file file_index <<< "$job_info"
        test_result_files[$pid]="$result_file:$test_file:$log_file:$file_index"
        test_pid_map[$pid]="$test_file"

        active_jobs=$((active_jobs + 1))

        # Limit concurrent jobs to NUM_GPUS
        while [ $active_jobs -ge $NUM_GPUS ]; do
            # Wait for any job to finish
            for pid in "${!test_pid_map[@]}"; do
                if ! kill -0 "$pid" 2>/dev/null; then
                    # Job finished, process result
                    wait "$pid" 2>/dev/null || true
                    active_jobs=$((active_jobs - 1))
                    unset "test_pid_map[$pid]"
                    break
                fi
            done
            sleep 0.1
        done
    done

Severity: high

The current job scheduling logic underutilizes the available GPUs. For example, with NUM_GPUS=4, it will only keep 3 tests running concurrently. This is because it launches a job and then checks if capacity is met, at which point it waits. This results in NUM_GPUS - 1 concurrency.

The correct approach is to wait for a slot to become available before launching a new job when at capacity. This can be fixed by restructuring the launch loop and removing the redundant active_jobs counter in favor of checking the size of test_pid_map directly.

    for i in "${!test_files_array[@]}"; do
        # Wait for a free slot if we are at capacity
        while [ "${#test_pid_map[@]}" -ge "$NUM_GPUS" ]; do
            local finished=false
            for pid in "${!test_pid_map[@]}"; do
                if ! kill -0 "$pid" 2>/dev/null; then
                    wait "$pid" 2>/dev/null || true
                    unset "test_pid_map[$pid]"
                    finished=true
                    break
                fi
            done
            if ! $finished; then
                sleep 0.1
            fi
        done

        local test_file="${test_files_array[$i]}"
        local gpu_index=$((i % NUM_GPUS))
        local gpu_id="${GPU_LIST[$gpu_index]}"
        local file_index=$((i + 1))

        # Launch test in background
        local job_info
        job_info=$(run_single_test_background "$test_file" "$gpu_id" "$file_index")

        # Parse job info
        local pid result_file log_file
        IFS=':' read -r pid test_file result_file log_file file_index <<< "$job_info"
        test_result_files[$pid]="$result_file:$test_file:$log_file:$file_index"
        test_pid_map[$pid]="$test_file"
    done

Comment on lines +377 to +381
    AVAILABLE_GPUS=$(nvidia-smi --list-gpus 2>/dev/null | awk '{print NR-1}' | tr '\n' ' ' | sed 's/ $//')
    if [ -n "$AVAILABLE_GPUS" ]; then
        echo "$AVAILABLE_GPUS"
        return
    fi

Severity: medium

The use of nvidia-smi --list-gpus is not ideal for scripting as its output format is intended for human consumption and may change between driver versions. A more robust approach is to use nvidia-smi --query-gpu, which provides a stable, script-friendly interface. This also allows for better error handling by checking the command's exit code.

Suggested change
-    AVAILABLE_GPUS=$(nvidia-smi --list-gpus 2>/dev/null | awk '{print NR-1}' | tr '\n' ' ' | sed 's/ $//')
-    if [ -n "$AVAILABLE_GPUS" ]; then
-        echo "$AVAILABLE_GPUS"
-        return
-    fi
+    gpus=$(nvidia-smi --query-gpu=index --format=csv,noheader 2>/dev/null)
+    if [ $? -eq 0 ] && [ -n "$gpus" ]; then
+        AVAILABLE_GPUS=$(echo "$gpus" | tr '\n' ' ')
+        echo "$AVAILABLE_GPUS"
+        return
+    fi


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@scripts/test_utils.sh`:
- Around line 8-10: Defaulting PARALLEL_TESTS to true changes behavior and can
cause GPU resource contention; change the default to false (or make
PARALLEL_TESTS opt-in) and/or implement auto-fallback logic that sets
PARALLEL_TESTS=false when only a single GPU is detected by your existing
detection logic (e.g., detect_gpus/nvidia-smi check) so parallel execution is
only enabled when multiple GPUs are available; update the assignment for
PARALLEL_TESTS and the startup logic that reads it to respect the
GPU-count-based fallback while leaving SAMPLE_RATE unchanged.
- Around line 408-410: The current trap 'rm -rf "$PARALLEL_TMP_DIR"' overwrites
any existing EXIT handler; capture the existing EXIT trap (e.g.,
PREV_EXIT_TRAP=$(trap -p EXIT)), then install a new EXIT trap that removes
PARALLEL_TMP_DIR and then invokes the previous trap (via eval "$PREV_EXIT_TRAP"
or calling the saved handler), ensuring you reference PARALLEL_TMP_DIR, mktemp,
and the trap invocation in the change so the temporary directory is cleaned
while preserving any caller-registered EXIT cleanup.
- Around line 501-535: The round-robin GPU assignment (using i % NUM_GPUS and
GPU_LIST) allows multiple concurrent tests to use the same GPU because the
throttle only tracks active_jobs; change this to a free-GPU queue: maintain an
available_gpus list initialized from GPU_LIST, pop a GPU id from available_gpus
when launching run_single_test_background (use run_single_test_background
"$test_file" "$gpu_id" "$file_index"), store a mapping pid->gpu (e.g.,
test_gpu_map[$pid]="$gpu_id") and still track test_result_files/test_pid_map by
pid, then in the wait loop detect finished pids, return their GPU to
available_gpus before launching new jobs, and only launch when available_gpus is
non-empty (instead of checking active_jobs >= NUM_GPUS); ensure all places
referencing test_pid_map, test_result_files, active_jobs are updated to use the
new pid->gpu mapping and the available_gpus queue.
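The free-GPU-queue scheduling described in the last item can be sketched as below. The stubs (GPU_LIST, test_files_array, run_single_test_background, the RESULTS file) are illustrative stand-ins, not the real script pieces:

```shell
# Sketch of free-GPU-queue scheduling: a GPU is popped from a pool at
# launch and only returned when the owning PID has exited, so no two
# concurrent tests ever share a GPU. Stub environment for the demo:
RESULTS=$(mktemp)
GPU_LIST=(0 1)
test_files_array=("test_a.py" "test_b.py" "test_c.py")
run_single_test_background() {  # dummy worker: record which GPU ran the test
    sleep 0.1
    echo "$1 gpu=$2" >> "$RESULTS"
}

available_gpus=("${GPU_LIST[@]}")   # pool of idle GPU IDs
declare -A test_gpu_map             # pid -> gpu_id

for test_file in "${test_files_array[@]}"; do
    # Block until a GPU is free, reclaiming GPUs from finished jobs
    while [ "${#available_gpus[@]}" -eq 0 ]; do
        for pid in "${!test_gpu_map[@]}"; do
            if ! kill -0 "$pid" 2>/dev/null; then
                wait "$pid" 2>/dev/null || true
                available_gpus+=("${test_gpu_map[$pid]}")   # return GPU to pool
                unset "test_gpu_map[$pid]"
            fi
        done
        sleep 0.05
    done

    # Pop one GPU from the pool and launch the test on it
    gpu_id="${available_gpus[0]}"
    available_gpus=("${available_gpus[@]:1}")
    run_single_test_background "$test_file" "$gpu_id" &
    test_gpu_map[$!]="$gpu_id"
done
wait   # drain remaining jobs
echo "completed $(wc -l < "$RESULTS") tests"
```

With two pool entries and three tests, the third launch blocks until one of the first two jobs finishes and its GPU is reclaimed.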
🧹 Nitpick comments (3)
scripts/test_gpu_detection.sh (1)

1-56: Consider adding assertions to make this a proper automated test.

This script prints detected values but never asserts expected outcomes, so regressions can only be caught by a human reading the output. Adding simple equality checks would make it usable in CI:

Example assertion pattern
 echo "Test 1: CUDA_VISIBLE_DEVICES='0,1,2,3'"
 export CUDA_VISIBLE_DEVICES="0,1,2,3"
 export PARALLEL_TESTS=true
 DETECTED=$(detect_gpus)
 echo "Detected GPUs: $DETECTED"
+if [ "$DETECTED" != "0 1 2 3" ]; then
+    echo "FAIL: Expected '0 1 2 3', got '$DETECTED'"
+    exit 1
+fi
 echo ""
scripts/test_utils.sh (2)

548-590: Test results are printed in non-deterministic order.

Iterating over the associative array test_result_files yields PIDs in arbitrary order, so logs and result lines will appear in a different (seemingly random) order each run. Consider sorting by file_index when printing to make output reproducible and easier to scan.
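One way to get the deterministic ordering suggested here is a numeric sort on the index field before printing; the colon-delimited record format below is illustrative, not the script's actual layout:

```shell
# Illustrative: sort result records numerically by a leading file_index
# field so output order is reproducible across runs.
records="3:test_c.py:PASSED
1:test_a.py:PASSED
2:test_b.py:FAILED"

sorted=$(printf '%s\n' "$records" | sort -t: -k1,1n)
printf '%s\n' "$sorted"
```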


412-416: Paths with spaces will break the array split.

IFS=' ' read -r -a test_files_array <<< "$test_files" mirrors the pre-existing for test_file in $test_files pattern, so this isn't a regression. Worth noting if test paths ever contain spaces in the future—passing an array (or newline-delimited list) would be more robust.
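A small standalone illustration of the hazard and the array-based alternative; the paths here are demo fixtures created under mktemp, not real test files:

```shell
# Word-splitting a flat string breaks paths containing spaces, while a
# glob-built array keeps each path whole. Demo paths are fixtures.
dir=$(mktemp -d)
mkdir -p "$dir/demo tests"
touch "$dir/demo tests/test one.py" "$dir/demo tests/test_two.py"

test_files="$dir/demo tests/test one.py $dir/demo tests/test_two.py"
IFS=' ' read -r -a broken <<< "$test_files"
echo "split entries: ${#broken[@]}"      # more than 2: paths got split

test_files_array=("$dir/demo tests"/*.py)
echo "array entries: ${#test_files_array[@]}"
```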

- Fix GPU queue management: use free-GPU queue instead of round-robin
  to ensure only one test runs per GPU at a time (critical fix)
- Preserve existing EXIT trap to avoid breaking caller cleanup
- Add auto-fallback to sequential execution when only 1 GPU detected
- Sort test results by file_index for deterministic output
- Add assertions to test_gpu_detection.sh for automated validation

Addresses all critical and nitpick issues from PR flashinfer-ai#2531 review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@dierksen
Contributor Author

/bot run

@flashinfer-bot
Collaborator

@dierksen is not authorized to trigger this CI job. cc: @yzh119, @sricketts, @yongwww


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@scripts/test_utils.sh`:
- Around line 558-565: The colon-delimited parsing of the output from
run_single_test_background (the job_info variable) is unsafe because file paths
may contain ':'; update run_single_test_background to return fields using a safe
separator (e.g., NUL or newline) or, better, write each field to separate files
in PARALLEL_TMP_DIR keyed by PID, then change the consumer that currently does
IFS=':' read -r pid test_file result_file log_file file_index (and the other
result-processing reads that populate test_result_files, test_pid_map,
test_gpu_map) to read from those per-PID files (or use read -d '' for
NUL-separated input) and populate test_result_files[$pid], test_pid_map[$pid],
and test_gpu_map[$pid] accordingly so parsing cannot break on colons in paths.
- Around line 367-373: The unguarded use of CUDA_VISIBLE_DEVICES will fail under
set -u; update the conditional and uses to use the default-value syntax. Replace
occurrences in the if test and in the assignment with
"${CUDA_VISIBLE_DEVICES:-}" (e.g., change if [ -n "$CUDA_VISIBLE_DEVICES" ] && [
"$CUDA_VISIBLE_DEVICES" != "-1" ] to if [ -n "${CUDA_VISIBLE_DEVICES:-}" ] && [
"${CUDA_VISIBLE_DEVICES:-}" != "-1" ] and change AVAILABLE_GPUS=$(echo
"$CUDA_VISIBLE_DEVICES" | ...) to AVAILABLE_GPUS=$(echo
"${CUDA_VISIBLE_DEVICES:-}" | ...)) so the script won’t error when the variable
is unset.
- Around line 604-635: Add an explicit else branch for the existing if [ -f
"$result_file" ] check that logs a clear warning when the result file is missing
(include $test_file and $result_file in the message) and set EXIT_CODE=1 so CI
can surface the issue; modify the block around variables result_file,
TOTAL_TESTS and EXIT_CODE (the loop that reads and handles result) to echo or
printf a one-line warning like "WARNING: missing result file for $test_file
(expected $result_file)" and set EXIT_CODE=1 (do not change existing
passed/failed/skipped counting logic).
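The set -u hazard in the second item can be reproduced standalone; this snippet is a demonstration of the shell behavior, not the script itself:

```shell
# Under set -u, expanding an unset variable is fatal; the ${VAR:-}
# default-value form is safe. Standalone reproduction of the hazard.
set -u
unset CUDA_VISIBLE_DEVICES 2>/dev/null || true

if [ -n "${CUDA_VISIBLE_DEVICES:-}" ] && [ "${CUDA_VISIBLE_DEVICES:-}" != "-1" ]; then
    echo "devices: ${CUDA_VISIBLE_DEVICES:-}"
else
    echo "no visible devices"   # safe even though the variable is unset
fi

# Unguarded expansion fails; run in a subshell so the demo continues
if ( echo "$CUDA_VISIBLE_DEVICES" ) 2>/dev/null; then
    echo "unguarded: ok"
else
    echo "unguarded: failed under set -u"
fi
```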
🧹 Nitpick comments (3)
scripts/test_utils.sh (3)

433-435: Trap preservation: edge-case fragility with single quotes in existing trap.

The sed pattern on line 434 strips the outer '...' wrapping from trap -p EXIT, but if the existing trap body itself contains escaped single quotes (e.g., 'can'\''t'), the resulting string will be malformed. This is unlikely in practice but worth a defensive note.


454-527: Nested function run_single_test_background leaks into global scope.

In bash, functions defined inside other functions are still added to the global function namespace. After run_tests_parallel runs, run_single_test_background remains callable from anywhere. This is a bash quirk, not a bug, but worth knowing — especially since it silently captures mode, total_files, and PARALLEL_TMP_DIR from the caller's scope at fork time and won't work correctly if called standalone.


437-440: Test file paths with spaces will break the space-delimited splitting.

Both the sequential fallback (line 410) and the parallel path (line 439) split $test_files on whitespace. This is consistent with the pre-existing pattern in the script, but worth noting: if any test file path contains spaces, it will be incorrectly split into multiple entries.

@dierksen
Contributor Author

/bot run

@flashinfer-bot
Collaborator

@dierksen is not authorized to trigger this CI job. cc: @yzh119, @sricketts, @yongwww

@yongwww
Member

yongwww commented Feb 10, 2026

Maybe it needs some time to propagate; could you try it again in a few minutes? @dierksen

@dierksen
Contributor Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !305 has been created, and the CI pipeline #43649074 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #43649074: 10/20 passed

@yongwww
Member

yongwww commented Feb 12, 2026

Hi @dierksen, I am trying to read the multi-gpu / multi-node test log in the pipeline, for example in https://gitlab-master.nvidia.com/dl/flashinfer/flashinfer-ci/-/jobs/265444028, I saw:

TEST SUMMARY
==========================================
Total test files executed: 0
Passed: 0
Failed: 0

It’d be good to double-check whether the tests are actually running, or if this is just a summary/reporting issue.


@dierksen dierksen left a comment


/bot run


@dierksen dierksen left a comment


Hi @dierksen, I am trying to read the multi-gpu / multi-node test log in the pipeline, for example in https://gitlab-master.nvidia.com/dl/flashinfer/flashinfer-ci/-/jobs/265444028, I saw:

TEST SUMMARY
==========================================
Total test files executed: 0
Passed: 0
Failed: 0

It’d be good to double-check whether the tests are actually running, or if this is just a summary/reporting issue.

Good catch, I think the multi-GPU tests were being skipped because they were essentially falling back to single-GPU testing in parallel. I updated the logic for that and requested another ci-bot run.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@scripts/test_utils.sh`:
- Line 10: The comment for the PARALLEL_TESTS shell variable is inaccurate: it
says "Enable parallel test execution by default" while the variable is
initialized to false; update either the comment or the default so they match —
e.g., change the comment to "Disable parallel test execution by default" or set
PARALLEL_TESTS to true; modify the line that sets PARALLEL_TESTS (the :
"${PARALLEL_TESTS:=false}" assignment) and its inline comment to keep semantics
and documentation consistent.

@dierksen
Contributor Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !305 has been updated with latest changes, and the CI pipeline #43843112 is currently running. I'll report back once the pipeline job completes.

@yongwww
Member

yongwww commented Feb 12, 2026

GitLab MR !305 has been updated with latest changes, and the CI pipeline #43843112 is currently running. I'll report back once the pipeline job completes.

Seems like the multi-gpu (4-gpu) test duration is not faster than before. For example, in run 266339217, multi-gpu cu130 takes ~12 min; however, the test is taking ~17 min with this change.

@dierksen
Contributor Author

dierksen commented Feb 12, 2026

GitLab MR !305 has been updated with latest changes, and the CI pipeline #43843112 is currently running. I'll report back once the pipeline job completes.

Seems like the multi-gpu (4-gpu) test duration is not faster than before. For example, in run 266339217, multi-gpu cu130 takes ~12 min; however, the test is taking ~17 min with this change.

I wouldn't expect the multi-gpu tests to be faster than before; parallel testing is disabled in those cases now.

In the cu130 case here, it seems like the 4 minute difference is just due to wait time for compute resources. The cu129 case was actually faster, likely for the same reason.

I had Claude compile a table for the two pipeline runs for a little better visibility:

Test Job                                   Before     After      Δ Time      % Change
multi_gpu_test_b300: [cu129]               11m 58s    8m 49s     -3m 8s      -26.3%
multi_gpu_test_b300: [cu130]               12m 51s    17m 13s    +4m 22s     +34.0%
multi_node_test_gb200: [cu129]             5m 32s     4m 41s     -0m 50s     -15.3%
multi_node_test_gb200: [cu130]             9m 42s     4m 41s     -5m 1s      -51.7%
multi_node_test_gb300: [cu129]             7m 28s     7m 46s     +0m 18s     +4.1%
multi_node_test_gb300: [cu130]             5m 51s     6m 47s     +0m 56s     +16.0%
unit_test_5090: [cu129]                    146m 24s   147m 26s   +1m 1s      +0.7%
unit_test_5090: [cu130]                    148m 50s   149m 21s   +0m 31s     +0.4%
unit_test_b200: [cu129]                    215m 19s   55m 7s     -160m 12s   -74.4%
unit_test_b200: [cu130]                    207m 6s    54m 24s    -152m 41s   -73.7%
unit_test_b300: [cu129]                    221m 36s   220m 9s    -1m 26s     -0.6%
unit_test_b300: [cu130]                    213m 49s   213m 58s   +0m 9s      +0.1%
unit_test_gb200: [cu129]                   314m 46s   68m 11s    -246m 34s   -78.3%
unit_test_gb200: [cu130]                   314m 45s   66m 42s    -248m 3s    -78.8%
unit_test_gb300: [cu129]                   360m 52s   78m 36s    -282m 15s   -78.2%
unit_test_gb300: [cu130]                   360m 52s   72m 50s    -288m 1s    -79.8%
unit_test_h100: [cu129]                    213m 32s   236m 9s    +22m 37s    +10.6%
unit_test_h100: [cu130]                    199m 28s   243m 39s   +44m 10s    +22.1%
unit_test_rtx-pro-6000-blackwell: [cu129]  126m 19s   126m 31s   +0m 12s     +0.2%
unit_test_rtx-pro-6000-blackwell: [cu130]  126m 5s    124m 44s   -1m 21s     -1.1%
unit_test_spark: [cu129]                   205m 30s   192m 15s   -13m 15s    -6.5%
unit_test_spark: [cu130]                   189m 57s   216m 49s   +26m 51s    +14.1%
unit_test_thor: [cu130]                    331m 8s    464m 14s   +133m 6s    +40.2%

B200, GB200, and GB300 see huge improvements since those are running in clusters. H100, Spark, and Thor are still going through some changes on the internal CI, so I wouldn't put too much stock in those. The multi-GPU numbers look pretty variable because the queue time can wind up being significant compared to the total runtime of the tests.

(edit: I should also note the GB200 and GB300 were timing out before, whereas now they're actually finishing)

@flashinfer-bot
Collaborator

[FAILED] Pipeline #43843112: 16/20 passed


@kahyunnam kahyunnam left a comment


lgtm

@kahyunnam kahyunnam merged commit 1e94c60 into flashinfer-ai:main Feb 12, 2026
25 checks passed
@dierksen dierksen deleted the parallel-testing branch February 12, 2026 22:07