Add YAML-based benchmark definitions for cuML regression runs #7980
dantegd wants to merge 10 commits into rapidsai:main from
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough

Adds YAML-driven benchmark manifests and end-to-end config handling: new CLI args (…)

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuml/cuml/benchmark/config.py`:
- Around line 593-607: The algorithm filter in _apply_algorithm_filter currently
does exact matches, so inputs like "logisticregression" that differ only in case
fail to match; make matching case-insensitive by normalizing both sides (use
str.casefold() for robustness) — build wanted = set(s.casefold() for s in
algorithm_filter) and test e["algorithm"].casefold() in wanted when filtering
benchmark_entries, and keep the existing BenchmarkConfigError behavior (you may
include the original algorithm_filter value in the error text if desired). A
sketch appears after this list.
- Around line 577-590: The _apply_profile_selection function currently returns
an empty list when a selected_profile filters out all benchmark_entries; update
it to raise an exception instead of silently returning nothing: after computing
include_tags and the filtered list (using benchmark_entries, profiles,
selected_profile, include_tags), if the resulting list is empty raise a
ValueError (or a custom exception) with a clear message that includes the
selected_profile and the include_tags so CI fails fast on mistagged/overly
narrow profiles.
- Around line 379-390: The numeric validators currently accept bools because
bool is a subclass of int; update the checks to explicitly reject booleans
before the isinstance checks: for the `version` validation (around
symbol/version check), modify the guard to first check `isinstance(value, bool)`
and raise BenchmarkConfigError if True, then continue with the existing
`isinstance(..., int)` or `isinstance(..., (int, float))` checks; do the same in
the loop that validates `n_reps` and `random_state` and the `test_split` check,
and also update `_normalize_int_list()` and `_normalize_shapes()` to reject
bools (check `isinstance(item, bool)` and raise) before allowing int/float
normalization so boolean values like True/False are not treated as 1/0.
In `@python/cuml/cuml/benchmark/run_benchmarks.py`:
- Around line 390-392: The bug is that run_benchmark's explicit_options default
(empty set) causes "--skip-gpu/--skip-cpu" to be treated as explicitly absent
and thus override args.skip_gpu/args.skip_cpu; change run_benchmark signature to
default explicit_options to None and treat None as "no explicit overrides" so
the code falls back to args.skip_gpu/args.skip_cpu. Update the checks that
compute allow_gpu_runs and allow_cpu_runs (and any other "--skip-*" membership
tests) to first check if explicit_options is None (use args.skip_* flags) or
else consult the provided explicit_options set; reference symbols:
run_benchmark, explicit_options, args.skip_gpu, args.skip_cpu, allow_gpu_runs,
allow_cpu_runs.
- Around line 275-278: In both _validate_benchmark_inputs and _validate_args add
the stacklevel=2 kwarg to the warnings.warn calls so the warning points to the
caller rather than the helper; locate the warnings.warn invocation in
_validate_benchmark_inputs (the message about "--input-type... Switching to
'numpy'") and the analogous warnings.warn in _validate_args and pass
stacklevel=2 to each call.
In `@wiki/BENCHMARK.md`:
- Around line 51-59: The docs show running run_benchmarks.py with a YAML config
but omit installing PyYAML; update the standalone prerequisites to list PyYAML
as a required package and instruct users to install it (e.g., via pip) before
running the example so that python/cuml/cuml/benchmark/config.py can call
yaml.safe_load() successfully; reference run_benchmarks.py and the
yaml.safe_load() usage in config.py so readers know why PyYAML is required.
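For illustration, a minimal sketch of the case-insensitive filter described in the first comment; `_apply_algorithm_filter` and `BenchmarkConfigError` are names taken from the review, but the exact signatures and error message here are assumptions:

```python
class BenchmarkConfigError(Exception):
    """Stub for the config error type named in the review."""


def _apply_algorithm_filter(benchmark_entries, algorithm_filter):
    """Keep entries whose algorithm matches the filter, ignoring case."""
    # casefold() is more aggressive than lower() and handles Unicode edge cases
    wanted = {s.casefold() for s in algorithm_filter}
    filtered = [
        e for e in benchmark_entries if e["algorithm"].casefold() in wanted
    ]
    if not filtered:
        # Echo the user's original input in the error for easier debugging
        raise BenchmarkConfigError(
            f"No benchmarks matched algorithm filter: {sorted(algorithm_filter)}"
        )
    return filtered
```

With this, "LogisticRegression" and "logisticregression" select the same entries, and an empty result still fails loudly.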
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 85ab12c0-638b-461b-86f7-e112e1fbab5e
📒 Files selected for processing (9)
- python/cuml/cuml/benchmark/cli.py
- python/cuml/cuml/benchmark/config.py
- python/cuml/cuml/benchmark/configs/single_gpu.yaml
- python/cuml/cuml/benchmark/configs/test.yaml
- python/cuml/cuml/benchmark/run_benchmarks.py
- python/cuml/cuml/benchmark/runners.py
- python/cuml/tests/test_benchmark_config.py
- python/cuml/tests/test_benchmark_runners.py
- wiki/BENCHMARK.md
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuml/cuml/benchmark/config.py`:
- Around line 209-211: The current check only verifies type via
_is_int_value(version), so any integer (e.g., 2) is accepted; change the
validation around version, _is_int_value, and BenchmarkConfigError to reject
unsupported manifest versions by enforcing the allowed set (e.g., only 1) —
after confirming version is an int, verify it is in the supported_versions
list/tuple and, if not, raise BenchmarkConfigError with a clear message listing
supported versions so future schema bumps fail fast. A sketch covering this and
the test_split range check appears after this list.
- Around line 487-491: The resolver currently only checks that test_split is
numeric; update load_and_resolve_config (where the block uses
_is_numeric_value(entry.get("test_split"))) to also validate that the numeric
value is within [0.0, 1.0] and raise BenchmarkConfigError (using benchmark_name)
if it is out of range; locate the check around the existing _is_numeric_value
call and replace/extend it so it first ensures numeric and then verifies 0.0 <=
float(entry.get("test_split")) <= 1.0, raising a clear BenchmarkConfigError if
the range check fails.
In `@wiki/BENCHMARK.md`:
- Around line 245-249: The bullets imply mutually exclusive tiers but profile
selection in config.py matches any shared tag, so update the wording for the
'default', 'extended', and 'nightly' profiles to state they are cumulative:
'extended' includes workloads from both 'default-profile' and 'extended-profile'
(i.e., default + extended row-count tiers) and 'nightly' includes workloads from
'default-profile', 'extended-profile', and 'nightly-profile' (i.e., all three
tiers); mention that selection matches any shared tag so profiles aggregate
rather than replace earlier tiers.
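A minimal sketch combining the two validation fixes above, reusing the `BenchmarkConfigError` stub from the earlier sketch; `SUPPORTED_VERSIONS` and the helper names are assumptions for illustration, not the PR's actual code:

```python
SUPPORTED_VERSIONS = (1,)  # hypothetical allow-list; extend on schema bumps


def _validate_version(version):
    # bool is a subclass of int in Python, so reject it explicitly first
    if isinstance(version, bool) or not isinstance(version, int):
        raise BenchmarkConfigError(f"version must be an integer, got {version!r}")
    if version not in SUPPORTED_VERSIONS:
        raise BenchmarkConfigError(
            f"Unsupported manifest version {version}; "
            f"supported versions: {list(SUPPORTED_VERSIONS)}"
        )


def _validate_test_split(entry, benchmark_name):
    value = entry.get("test_split")
    if value is None:
        return  # optional field
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise BenchmarkConfigError(f"{benchmark_name}: test_split must be numeric")
    if not 0.0 <= float(value) <= 1.0:
        raise BenchmarkConfigError(
            f"{benchmark_name}: test_split must be within [0.0, 1.0], got {value}"
        )
```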
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fd348e6e-9351-4c34-9e37-0ea75bb5bdee
📒 Files selected for processing (5)
- conda/recipes/cuml/recipe.yaml
- python/cuml/cuml/benchmark/config.py
- python/cuml/cuml/benchmark/run_benchmarks.py
- python/cuml/tests/test_benchmark_config.py
- wiki/BENCHMARK.md
✅ Files skipped from review due to trivial changes (1)
- conda/recipes/cuml/recipe.yaml
🧹 Nitpick comments (1)
python/cuml/cuml/benchmark/config.py (1)
824-831: Consider adding `strict=True` to `zip()` for defensive programming. Since Python 3.10, `zip()` has an optional `strict` keyword argument. While `grid_keys` and the unpacked `combo` tuple are guaranteed to have matching lengths (since both derive from the same `grid` dict and `itertools.product` maintains length consistency), adding `strict=True` provides defensive protection against potential bugs elsewhere. Given that cuML requires Python 3.11+, the parameter is fully available.

Suggested improvement:

  for combo in combinations:
-     combo_dict = dict(zip(grid_keys, combo))
+     combo_dict = dict(zip(grid_keys, combo, strict=True))
      result.append({**fixed, **combo_dict})

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/benchmark/config.py` around lines 824 - 831, The zip call in the combination builder should be made strict to guard against mismatched lengths: update the dict creation that uses dict(zip(grid_keys, combo)) to pass strict=True (i.e., dict(zip(grid_keys, combo, strict=True))) so combo and grid_keys must match; locate this change around the variables grid_keys, combinations, combo, and combo_dict in the combination generation block and add strict=True to the zip invocation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3b7d7a84-2d24-4cec-8787-1d5c38ad0673
📒 Files selected for processing (4)
- python/cuml/cuml/benchmark/config.py
- python/cuml/cuml/benchmark/configs/single_gpu.yaml
- python/cuml/tests/test_benchmark_config.py
- wiki/BENCHMARK.md
✅ Files skipped from review due to trivial changes (1)
- python/cuml/cuml/benchmark/configs/single_gpu.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- wiki/BENCHMARK.md
🧹 Nitpick comments (1)
python/cuml/cuml/benchmark/config.py (1)
648-653: Consider validating tag types in `_named_tags`. If `explicit_tags` is a non-list value (e.g., a bare string from YAML like `tags: foo`), line 652 creates `[name, explicit_tags]` without validation. While `_validate_tag_list` catches this elsewhere, this function could produce unexpected results if called with unvalidated input. This is minor since the validation pipeline should catch malformed tags before reaching this point. (A sketch follows below.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/benchmark/config.py` around lines 648 - 653, The helper _named_tags currently appends non-list explicit_tags blindly which can produce unexpected types; update _named_tags to validate the type when explicit_tags is not None and not a list: if explicit_tags is a str return [name, explicit_tags], otherwise raise a clear TypeError/ValueError (or coerce to str only if intended) so callers like _validate_tag_list no longer receive malformed tag values; reference _named_tags and _validate_tag_list to locate and enforce this validation.
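The sketch referenced above: only `_named_tags` and the bare-string failure mode come from the comment, while the list-based return shape is an assumption:

```python
def _named_tags(name, explicit_tags):
    """Combine a benchmark's name with its explicit tags, validating the type."""
    if explicit_tags is None:
        return [name]
    if isinstance(explicit_tags, str):
        # A bare YAML scalar like `tags: foo` arrives as a plain string
        return [name, explicit_tags]
    if isinstance(explicit_tags, list):
        return [name, *explicit_tags]
    raise TypeError(
        "tags must be a string or a list of strings, "
        f"got {type(explicit_tags).__name__}"
    )
```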
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ab3c6499-d2df-43b9-befc-d6be930ac399
📒 Files selected for processing (7)
- python/cuml/cuml/benchmark/cli.py
- python/cuml/cuml/benchmark/config.py
- python/cuml/cuml/benchmark/configs/single_gpu.yaml
- python/cuml/cuml/benchmark/configs/test.yaml
- python/cuml/cuml/benchmark/run_benchmarks.py
- python/cuml/tests/test_benchmark_config.py
- wiki/BENCHMARK.md
✅ Files skipped from review due to trivial changes (2)
- python/cuml/cuml/benchmark/configs/test.yaml
- python/cuml/cuml/benchmark/configs/single_gpu.yaml
🧹 Nitpick comments (1)
python/cuml/cuml/benchmark/run_benchmarks.py (1)
460: Consider adding `strict=True` to `zip()` for defensive coding. Both lists are derived from the same source, so they should always have matching lengths. However, adding `strict=True` would catch any future bugs that might cause length mismatches.

Suggested fix:

- for entry, selected_backends in zip(benchmark_entries, entry_backends):
+ for entry, selected_backends in zip(benchmark_entries, entry_backends, strict=True):

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/benchmark/run_benchmarks.py` at line 460, The zip over benchmark_entries and entry_backends may silently truncate if their lengths diverge; update the iteration that currently reads "for entry, selected_backends in zip(benchmark_entries, entry_backends):" to use strict pairing by passing strict=True to zip so mismatched lengths raise an error (i.e., change to zip(..., strict=True)); target the loop in run_benchmarks.py where benchmark_entries and entry_backends are zipped to implement this defensive check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 353290e0-bf39-4a2c-84ed-85a5b599a305
📒 Files selected for processing (2)
- python/cuml/cuml/benchmark/run_benchmarks.py
- python/cuml/tests/test_benchmark_config.py
)

# Run CPU benchmark
if run_cpu and algo_pair.has_cpu():
I ran this on my DGX Spark - smoke test passed but single_gpu failed.
- cuml: 26.06.00
- cupy: 14.0.1
- numpy: 2.4.3
- sklearn: 1.8.0
- scipy: 1.16.3
- CUDA toolkit: 13.0
- GPU: NVIDIA GB10, sm_121 (Blackwell, aarch64)
quoting Claude:
The bug: single_gpu.yaml sets input_type: cupy, which means all benchmark data is generated as CuPy (GPU) arrays. When the runner compared cuML against the sklearn CPU baseline, it passed those CuPy arrays directly to sklearn. CuPy 13.x removed implicit NumPy conversion, so sklearn's internal numpy.asarray(cupy_array) call raises TypeError instead of silently converting.
The fix: One line at the top of the CPU benchmark block in runners.py — convert data to NumPy before passing it to anything CPU-side:
cpu_data = datagen._convert_to_numpy(data)
_convert_to_numpy already existed and handles tuples recursively, so it covers the setup, run, and accuracy sections all at once.
# Run CPU benchmark
if run_cpu and algo_pair.has_cpu():
+ cpu_data = datagen._convert_to_numpy(data)
setup_override = algo_pair.setup_cpu(
- data, **param_overrides, **cpu_param_overrides
+ cpu_data, **param_overrides, **cpu_param_overrides
)
cpu_timer = BenchmarkTimer(self.n_reps)
for rep in cpu_timer.benchmark_runs():
cpu_model = algo_pair.run_cpu(
- data,
+ cpu_data,
**setup_override,
)
cpu_elapsed = np.min(cpu_timer.timings)
if algo_pair.accuracy_function:
if algo_pair.cpu_data_prep_hook is not None:
- X_test, y_test = algo_pair.cpu_data_prep_hook(data[2:])
+ X_test, y_test = algo_pair.cpu_data_prep_hook(cpu_data[2:])
else:
- X_test, y_test = data[2:]
+ X_test, y_test = cpu_data[2:]
if hasattr(cpu_model, "predict"):
y_pred_cpu = cpu_model.predict(X_test)
else:
- numba >=0.60.0,<0.65.0
- numba-cuda >=0.22.2,<0.29.0
- numpy >=1.23,<3.0
- pyyaml
Is this benchmark CLI user-facing? Do we want to always require the dependencies for it to run be installed (like you added pyyaml here to the recipe)? Or should we instead error at runtime nicely and flag the extra dependency needed? I might vote for the latter since I suspect most users won't ever use this functionality.
Also, I suspect we might need a similar change in dependencies.yaml.
Agreed. This is user-facing, but we should not add pyyaml to our standard run dependencies.
try:
    import yaml
except ImportError as exc:  # pragma: no cover - dependency issue
    raise RuntimeError(
Nit: this should be an ImportError, not a RuntimeError.
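A minimal sketch of what both suggestions could look like together: failing lazily at the point of use with an `ImportError` that names the missing optional dependency (the helper name and message text are illustrative, not the PR's code):

```python
def _require_yaml():
    """Import PyYAML on first use instead of requiring it at install time."""
    try:
        import yaml
    except ImportError as exc:  # pragma: no cover - dependency issue
        raise ImportError(
            "PyYAML is required to load YAML benchmark manifests; "
            "install it with `pip install pyyaml` or `conda install pyyaml`."
        ) from exc
    return yaml


# Call sites then do e.g.:
# yaml = _require_yaml()
# manifest = yaml.safe_load(manifest_text)
```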
try:
    from cuml.benchmark import algorithms
except ImportError:
    if not any("cuml/benchmark" in p for p in sys.path):
        raise
    import algorithms  # noqa: E402
Is this to hack around local execution instead of the packaged execution? If so, I'm against this - the executing version should always be the built version of the package, and not what's in the source tree.
We want to be able to run the benchmark suite without having cuML installed at all, to generate the sklearn baseline.
import algorithms  # noqa: E402


TOP_LEVEL_KEYS = {"version", "suite", "profiles", "defaults", "benchmarks"}
No strong thoughts on the schema, but I would advise we use something like pydantic (or my library msgspec) to manage these rather than manually writing out the parsing and validation logic. It's easier to manage, easier to read, and would drastically reduce the amount of code added in this PR.
If we opt to make the deps for running the benchmarks not required, then any additional dep should be fine here. I'd vote for pydantic since it's ubiquitous and I'm not having to support it :).
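For comparison, a rough pydantic v2 sketch of a schema matching `TOP_LEVEL_KEYS`; the field types and constraints are guesses for illustration, not the PR's actual schema:

```python
from pydantic import BaseModel, ConfigDict, Field


class BenchmarkEntry(BaseModel):
    model_config = ConfigDict(extra="forbid")  # reject unknown keys

    algorithm: str
    tags: list[str] = Field(default_factory=list)
    n_reps: int | None = None
    test_split: float | None = Field(default=None, ge=0.0, le=1.0)


class Manifest(BaseModel):
    model_config = ConfigDict(extra="forbid")

    version: int
    suite: str
    profiles: dict[str, list[str]] = Field(default_factory=dict)
    defaults: dict = Field(default_factory=dict)
    benchmarks: list[BenchmarkEntry]


# Wrong types, out-of-range values, and unknown keys all raise ValidationError:
# manifest = Manifest.model_validate(yaml.safe_load(manifest_text))
```

Type coercion, range checks, and readable error messages then come for free, which is the code-reduction the comment is pointing at.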
try:
    import yaml
except ImportError as exc:  # pragma: no cover - dependency issue
- except ImportError as exc:  # pragma: no cover - dependency issue
+ except ModuleNotFoundError as exc:  # pragma: no cover - dependency issue
- numba >=0.60.0,<0.65.0
- numba-cuda >=0.22.2,<0.29.0
- numpy >=1.23,<3.0
- pyyaml
I don't think we should add pyyaml to our runtime dependencies.
This PR adds YAML-based benchmark definitions to describe and run cuML benchmarks. It introduces manifest loading and validation, default and profile resolution, support for explicit shape pairs and parameter grids, and the CLI/config plumbing needed to run benchmarks directly from a YAML file. It also adds the first manifests for a tiny smoke suite and a single-GPU suite, plus test coverage and docs for the new flow.
The point of this PR is to put a clean declarative layer in front of the existing benchmark harness so we can define suites in one place instead of rebuilding them through large CLI invocations. Follow-up PRs can then build on top of this foundation: richer result metadata and output schemas, better timing statistics and warmup handling, stronger reproducibility controls, baseline comparison tooling, and potentially CI/nightly workflows.
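To make the declarative layer concrete, here is a hypothetical manifest in the shape the top-level keys suggest (`version`, `suite`, `profiles`, `defaults`, `benchmarks`); the field values, profile structure, and tag names are illustrative, not copied from the shipped configs:

```python
import yaml

manifest = yaml.safe_load("""
version: 1
suite: example
profiles:
  default:
    include_tags: [default-profile]
  nightly:
    include_tags: [default-profile, extended-profile, nightly-profile]
defaults:
  n_reps: 3
  input_type: numpy
benchmarks:
  - algorithm: LogisticRegression
    tags: [default-profile]
    test_split: 0.2
  - algorithm: KMeans
    tags: [nightly-profile]
""")

assert set(manifest) == {"version", "suite", "profiles", "defaults", "benchmarks"}
```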