Disable GPU resource processor for nodes using DRA for accelerator attachment by mtrqq · Pull Request #8547 · kubernetes/autoscaler

mtrqq · 2025-09-18T08:05:43Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

GpuCustomResourcesProcessor.FilterOutNodesWithUnreadyResources filters out nodes that carry GPU labels but whose allocatable and capacity fields are not yet populated with the GPU resource. This is a good readiness indicator for nodes using the device plugin, but it doesn't work for DRA-enabled nodes, because they don't expose device information in capacity or allocatable. Instead, there is another processor that needs to be configured for DRA infrastructure, introduced as part of #8109. This change disables the filtering for DRA-enabled nodes within the GPU custom resource processor, and it doesn't affect other places that use GPULabel instead of GetNodeGpuConfig.
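The readiness rule described above can be sketched roughly as follows. This is a minimal illustration with simplified stand-in types, not the actual cluster-autoscaler code: the `Node` struct, both label keys, and `filterOutNodesWithUnreadyGPUs` are hypothetical names chosen for this example, and the real processor operates on Kubernetes node objects. Only the `nvidia.com/gpu` resource name comes from the release note.

```go
package main

import "fmt"

// Node is a simplified stand-in for a Kubernetes node object (assumption:
// the real processor works on the Kubernetes API type, not on this struct).
type Node struct {
	Name        string
	Labels      map[string]string
	Allocatable map[string]int64
}

// Hypothetical label keys, used only in this sketch. The resource name is
// the one mentioned in the release note.
const (
	gpuLabel    = "example.com/gpu-accelerator"
	draGPULabel = "example.com/gpu-dra-driver"
	gpuResource = "nvidia.com/gpu"
)

// usesDRA reports whether the node is marked as exposing its GPUs via DRA.
func usesDRA(n Node) bool {
	return n.Labels[draGPULabel] == "true"
}

// filterOutNodesWithUnreadyGPUs splits nodes into ready and unready ones.
// A GPU-labeled node without the GPU resource in allocatable counts as
// unready, unless it uses DRA, in which case the check is skipped.
func filterOutNodesWithUnreadyGPUs(nodes []Node) (ready, unready []Node) {
	for _, n := range nodes {
		_, hasGPULabel := n.Labels[gpuLabel]
		if hasGPULabel && !usesDRA(n) && n.Allocatable[gpuResource] == 0 {
			unready = append(unready, n)
			continue
		}
		ready = append(ready, n)
	}
	return ready, unready
}

func main() {
	nodes := []Node{
		{Name: "device-plugin-pending", Labels: map[string]string{gpuLabel: "some-gpu"}},
		{Name: "device-plugin-ready", Labels: map[string]string{gpuLabel: "some-gpu"},
			Allocatable: map[string]int64{gpuResource: 1}},
		{Name: "dra-node", Labels: map[string]string{gpuLabel: "some-gpu", draGPULabel: "true"}},
	}
	ready, unready := filterOutNodesWithUnreadyGPUs(nodes)
	for _, n := range ready {
		fmt.Println("ready:", n.Name)
	}
	for _, n := range unready {
		fmt.Println("unready:", n.Name)
	}
}
```

In this sketch the DRA node passes through the filter even though its allocatable map is empty, which is the behavior the PR description asks for. Readiness of the DRA devices themselves would be the job of the separate DRA processor.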

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Nodes with GPUs exposed via DRA are no longer treated as unready if they don't have the nvidia.com/gpu custom resource in allocatable

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

mtrqq · 2025-09-18T08:05:59Z

/assign towca

towca

There are a couple more places where we assume that GPU == device plugin:

- GetGpuInfoForMetrics produces different metrics if ResourceName is missing from Node capacity, and returns the ResourceName to be used as a metric label. We need to skip the capacity checking in the DRA case, and return something DRA-specific to be used as the metric label. Something like dra_<dra_driver_name> would make sense. The name of the DRA Nvidia GPU driver is gpu.nvidia.com, but it feels weird hardcoding it in GetGpuInfoForMetrics. Maybe instead of adding AttachedUsingDra we add DraDriverName and rename ResourceName to DevicePluginResourceName? Only one of them would ever be non-empty, and that could be checked instead of the bool. WDYT?
- utilization.Calculate() assumes that if *GpuConfig is non-nil it should base the utilization on ResourceName from Node capacity. This should be a very easy fix, we just need to check the new bool/DRA driver name and go to the DRA util logic if it's true/non-empty.

Could you handle these parts as well? It can be in further commits in this PR, or in further PRs.
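The two suggestions in this review comment can be sketched as follows. This is a hedged illustration of the proposal only, not the code that was merged: the `GpuConfig` fields follow the names floated in the comment, while `exposedViaDra`, `gpuMetricLabel`, and `utilizationSource` are hypothetical helpers invented for this example.

```go
package main

import "fmt"

// GpuConfig follows the shape proposed in the review: exactly one of
// DevicePluginResourceName and DraDriverName is expected to be non-empty.
// Simplified sketch; other fields of the real struct are not modeled.
type GpuConfig struct {
	DevicePluginResourceName string
	DraDriverName            string
}

// exposedViaDra is a hypothetical helper that replaces a separate bool flag
// by checking which of the two names is set.
func (c *GpuConfig) exposedViaDra() bool {
	return c != nil && c.DraDriverName != ""
}

// gpuMetricLabel returns the metric label: a dra_<dra_driver_name> string
// for DRA, otherwise the device plugin resource name.
func gpuMetricLabel(c *GpuConfig) string {
	if c == nil {
		return ""
	}
	if c.exposedViaDra() {
		return "dra_" + c.DraDriverName
	}
	return c.DevicePluginResourceName
}

// utilizationSource names which utilization logic would be chosen. The
// returned strings are placeholders for the actual calculation paths.
func utilizationSource(c *GpuConfig) string {
	switch {
	case c == nil:
		return "default"
	case c.exposedViaDra():
		return "dra"
	default:
		return "device-plugin"
	}
}

func main() {
	configs := []*GpuConfig{
		nil,
		{DevicePluginResourceName: "nvidia.com/gpu"},
		{DraDriverName: "gpu.nvidia.com"},
	}
	for _, c := range configs {
		fmt.Printf("label=%q utilization=%s\n", gpuMetricLabel(c), utilizationSource(c))
	}
}
```

Keeping the driver name in the config, rather than a bool, means the metric label can be built without hardcoding gpu.nvidia.com in the metrics code, which is the concern raised in the comment.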

towca

Thanks for taking care of this! Just have 2 small comments before we can merge

towca · 2025-09-29T16:07:45Z

Thanks for addressing the comments!

/lgtm
/approve

towca · 2025-09-29T16:16:37Z

/release-note-edit

Nodes with GPUs exposed via DRA are no longer treated as unready if they don't have the nvidia.com/gpu custom resource in allocatable

k8s-ci-robot · 2025-09-29T16:18:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mtrqq, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~cluster-autoscaler/OWNERS~~ [towca]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jackfrancis · 2025-09-30T19:15:30Z

+
+// GpuDraDriverEnabled checks whether GPU driver is enabled on the node
+func GpuDraDriverEnabled(node *apiv1.Node) bool {
+	return node.Labels[DraGPULabel] == "true"


@mtrqq @towca are there any safe ways for us to implement this generally across providers, or is this necessarily a provider-specific + dra driver-specific story?

cc @nojnhuh see above how GCE is implementing disambiguation between device-plugin and DRA driver accelerator-enabled nodes

mtrqq · 2025-10-01

I think we could do something more generic than this, but it would still be an intermediate step before the DRA Extended Resource KEP lands, which may become the de facto new standard and completely replace the device plugin. So at this point it doesn't feel worth investing in any cross-cloud-provider refactoring.

WDYT, @towca?

towca · 2025-10-02

We discussed this during the SIG Autoscaling meeting; the conclusion was that detecting the enablement should be provider-specific, at least for the foreseeable future.
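One way to picture the provider-specific conclusion is a small per-provider hook. This is only a sketch under assumptions: the `draGPUDetector` interface, both provider types, and the label key are hypothetical and do not correspond to existing cluster-autoscaler interfaces. The label comparison mirrors the quoted `GpuDraDriverEnabled` diff.

```go
package main

import "fmt"

// Node is a simplified stand-in for a Kubernetes node object.
type Node struct {
	Labels map[string]string
}

// draGPUDetector is a hypothetical per-provider hook: each cloud provider
// decides for itself how to recognize nodes whose GPUs are exposed via DRA.
type draGPUDetector interface {
	GpuDraDriverEnabled(node Node) bool
}

// labelBasedProvider detects DRA through a provider-chosen node label,
// in the spirit of the label check in the quoted diff.
type labelBasedProvider struct {
	draLabel string // hypothetical label key, set by the provider
}

func (p labelBasedProvider) GpuDraDriverEnabled(node Node) bool {
	return node.Labels[p.draLabel] == "true"
}

// devicePluginOnlyProvider models a provider with no DRA detection, so all
// GPU nodes keep being treated as device plugin nodes.
type devicePluginOnlyProvider struct{}

func (devicePluginOnlyProvider) GpuDraDriverEnabled(Node) bool {
	return false
}

func main() {
	node := Node{Labels: map[string]string{"example.com/gpu-dra-driver": "true"}}
	providers := map[string]draGPUDetector{
		"label-based":        labelBasedProvider{draLabel: "example.com/gpu-dra-driver"},
		"device-plugin-only": devicePluginOnlyProvider{},
	}
	for _, name := range []string{"label-based", "device-plugin-only"} {
		fmt.Println(name, providers[name].GpuDraDriverEnabled(node))
	}
}
```

A provider that does nothing keeps the existing device plugin behavior, so the detection stays opt-in per provider.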

jackfrancis · 2026-01-12T16:17:39Z

/cherry-pick cluster-autoscaler-release-1.34

k8s-infra-cherrypick-robot · 2026-01-12T16:18:18Z

@jackfrancis: new pull request created: #9045

Details

In response to this:

/cherry-pick cluster-autoscaler-release-1.34

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the size/M label (changes 30-99 lines, ignoring generated files) Sep 18, 2025

k8s-ci-robot assigned towca Sep 18, 2025

k8s-ci-robot added the area/cluster-autoscaler label (issues or PRs related to the Cluster Autoscaler component) Sep 18, 2025

k8s-ci-robot requested review from feiskyer and towca September 18, 2025 08:06

k8s-ci-robot added area/provider/gce and removed do-not-merge/needs-area (the PR should not merge because it lacks an area label) labels Sep 18, 2025

towca requested changes Sep 19, 2025

Disable GPU resource processor for nodes using DRA for accelerator attachment

1acc8c2

mtrqq force-pushed the dra-gpu-processor branch from 87901dc to 1acc8c2 Compare September 23, 2025 09:00

k8s-ci-robot added area/provider/kwok (issues or PRs related to the kwok cloud provider for Cluster Autoscaler) and size/L (changes 100-499 lines, ignoring generated files), and removed size/M (changes 30-99 lines, ignoring generated files) labels Sep 23, 2025

mtrqq requested a review from towca September 23, 2025 09:08

Add handling for DRA GPUs exposed in GetGpuInfoForMetrics

fb6dca0

mtrqq force-pushed the dra-gpu-processor branch from d1604af to e4d55ef Compare September 23, 2025 12:10

towca reviewed Sep 26, 2025

Comment thread cluster-autoscaler/processors/customresources/gpu_processor.go Outdated

Comment thread cluster-autoscaler/utils/gpu/gpu_test.go

mtrqq force-pushed the dra-gpu-processor branch 2 times, most recently from 56744f1 to ef33683 Compare September 29, 2025 15:55

Handle resource utilization calculation for GPUs exposed using DRA

d529b17

mtrqq force-pushed the dra-gpu-processor branch from ef33683 to d529b17 Compare September 29, 2025 15:58

k8s-ci-robot added lgtm ("Looks good to me", the PR is ready to be merged) and approved (approved by an approver from all required OWNERS files) labels Sep 29, 2025

k8s-ci-robot added release-note (the PR will be considered when generating release notes) and removed do-not-merge/release-note-label-needed (the PR should not merge because it's missing one of the release note labels) labels Sep 29, 2025

k8s-ci-robot merged commit cef695b into kubernetes:master Sep 29, 2025
7 checks passed

towca approved these changes Sep 29, 2025

jackfrancis reviewed Sep 30, 2025

k8s-infra-cherrypick-robot mentioned this pull request Jan 12, 2026

[cluster-autoscaler-release-1.34] Disable GPU resource processor for nodes using DRA for accelerator attachment #9045

Merged

jackfrancis mentioned this pull request Jan 12, 2026

[cluster-autoscaler-1.34] Add Intel GPU (Habana Gaudi) autoscaler support #9049

Merged

mtrqq Oct 1, 2025

towca Oct 2, 2025

6 participants
