Disable GPU resource processor for nodes using DRA for accelerator attachment #8547

Merged
k8s-ci-robot merged 3 commits into kubernetes:master from mtrqq:dra-gpu-processor
Sep 29, 2025

Conversation

@mtrqq
Contributor

@mtrqq mtrqq commented Sep 18, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

GpuCustomResourcesProcessor.FilterOutNodesWithUnreadyResources filters out nodes that carry GPU labels while their allocatable and capacity are not yet populated with the GPU resource. This is a good readiness indicator for nodes using the device plugin, but it does not work for DRA-enabled nodes, because they do not expose device information in capacity or allocatable. Instead, a separate processor, introduced as part of #8109, needs to be configured for the DRA infrastructure. This change effectively disables the filtering for DRA-enabled nodes within the GPU custom resource processor, and it does not affect other places that use GPULabel instead of GetNodeGpuConfig.
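
For illustration only, here is a minimal sketch of the readiness rule described above. It is not the code from this PR: `Node` is a simplified stand-in for the Kubernetes Node object, and the label keys `example.com/gpu` and `example.com/gpu-dra-driver` as well as the helper names are hypothetical. Only the `nvidia.com/gpu` resource name comes from the release note.

```go
package main

import "fmt"

// Node is a simplified stand-in for the Kubernetes Node object; only the
// fields this sketch needs are modeled.
type Node struct {
	Name        string
	Labels      map[string]string
	Allocatable map[string]int64
}

// Hypothetical label keys, chosen for illustration only. The resource name
// is the one mentioned in the release note.
const (
	gpuLabel    = "example.com/gpu"
	draGPULabel = "example.com/gpu-dra-driver"
	gpuResource = "nvidia.com/gpu"
)

// usesDRA reports whether the node is marked as exposing its GPUs via DRA.
func usesDRA(n Node) bool {
	return n.Labels[draGPULabel] == "true"
}

// splitByGPUReadiness treats a GPU-labeled node as unready while the GPU
// resource is missing from allocatable, unless the node uses DRA, in which
// case the allocatable check is skipped entirely.
func splitByGPUReadiness(nodes []Node) (ready, unready []Node) {
	for _, n := range nodes {
		_, hasGPULabel := n.Labels[gpuLabel]
		if hasGPULabel && !usesDRA(n) && n.Allocatable[gpuResource] == 0 {
			unready = append(unready, n)
			continue
		}
		ready = append(ready, n)
	}
	return ready, unready
}

func main() {
	nodes := []Node{
		{Name: "plugin-pending", Labels: map[string]string{gpuLabel: "some-gpu"}},
		{Name: "dra-node", Labels: map[string]string{gpuLabel: "some-gpu", draGPULabel: "true"}},
		{Name: "cpu-only"},
	}
	ready, unready := splitByGPUReadiness(nodes)
	for _, n := range ready {
		fmt.Println("ready:", n.Name)
	}
	for _, n := range unready {
		fmt.Println("unready:", n.Name)
	}
}
```

In this sketch the DRA node counts as ready even though its allocatable never lists the GPU resource, which is the behavior the description asks for.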

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Nodes with GPUs exposed via DRA are no longer treated as unready if they don't have the nvidia.com/gpu custom resource in allocatable

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area Indicates that a PR should not merge because it lacks an area label. labels Sep 18, 2025
@mtrqq
Contributor Author

mtrqq commented Sep 18, 2025

/assign towca

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 18, 2025
@k8s-ci-robot k8s-ci-robot added the area/cluster-autoscaler Issues or PRs related to the Cluster Autoscaler component label Sep 18, 2025
@k8s-ci-robot k8s-ci-robot added area/provider/gce and removed do-not-merge/needs-area Indicates that a PR should not merge because it lacks an area label. labels Sep 18, 2025
Collaborator

@towca towca left a comment

There are a couple more places where we assume that GPU == device plugin:

  • GetGpuInfoForMetrics produces different metrics if ResourceName is missing from Node capacity, and returns the ResourceName to be used as a metric label. We need to skip the capacity checking in the DRA case, and return something DRA-specific to be used as the metric label. Something like dra_<dra_driver_name> would make sense. The name of the DRA Nvidia GPU driver is gpu.nvidia.com, but it feels weird hardcoding it in GetGpuInfoForMetrics. Maybe instead of adding AttachedUsingDra we add DraDriverName and rename ResourceName to DevicePluginResourceName? Only one of them would ever be non-empty, and that could be checked instead of the bool. WDYT?
  • utilization.Calculate() assumes that if *GpuConfig is non-nil it should base the utilization on ResourceName from Node capacity. This should be a very easy fix, we just need to check the new bool/DRA driver name and go to the DRA util logic if it's true/non-empty.

Could you handle these parts as well? This can be done in further commits in this PR, or in follow-up PRs.
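
A rough sketch of the shape suggested in the two points above, assuming a config struct where exactly one of the two names is set. The field names follow the suggestion in this comment; the helpers `ExposedViaDra`, `metricLabel`, and `utilizationSource` are hypothetical and are not claimed to match the merged code.

```go
package main

import "fmt"

// GpuConfig is a simplified, hypothetical version of the struct discussed in
// the review. At most one of the two name fields is expected to be non-empty.
type GpuConfig struct {
	DevicePluginResourceName string // e.g. "nvidia.com/gpu"
	DraDriverName            string // e.g. "gpu.nvidia.com"
}

// ExposedViaDra replaces a separate bool: a non-empty DRA driver name means
// the GPU is attached via DRA.
func (c GpuConfig) ExposedViaDra() bool {
	return c.DraDriverName != ""
}

// metricLabel returns a DRA-specific label of the form dra_<driver name> for
// DRA nodes and the device plugin resource name otherwise.
func metricLabel(c GpuConfig) string {
	if c.ExposedViaDra() {
		return "dra_" + c.DraDriverName
	}
	return c.DevicePluginResourceName
}

// utilizationSource names which utilization path a node would take in this
// sketch. A nil config means the node is not treated as a GPU node.
func utilizationSource(c *GpuConfig) string {
	switch {
	case c == nil:
		return "non-gpu"
	case c.ExposedViaDra():
		return "dra"
	default:
		return "device-plugin"
	}
}

func main() {
	dra := GpuConfig{DraDriverName: "gpu.nvidia.com"}
	plugin := GpuConfig{DevicePluginResourceName: "nvidia.com/gpu"}
	fmt.Println(metricLabel(dra), utilizationSource(&dra))
	fmt.Println(metricLabel(plugin), utilizationSource(&plugin))
}
```

Checking a name for emptiness instead of carrying an extra bool keeps the two attachment modes mutually exclusive by construction, which is the trade-off the comment is asking about.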

Comment thread cluster-autoscaler/cloudprovider/cloud_provider.go Outdated
Comment thread cluster-autoscaler/cloudprovider/gce/gce_cloud_provider.go Outdated
Comment thread cluster-autoscaler/cloudprovider/gce/gce_cloud_provider.go Outdated
Comment thread cluster-autoscaler/cloudprovider/cloud_provider.go Outdated
Comment thread cluster-autoscaler/processors/customresources/gpu_processor.go Outdated
Comment thread cluster-autoscaler/processors/customresources/gpu_processor.go
Comment thread cluster-autoscaler/processors/customresources/gpu_processor.go
@k8s-ci-robot k8s-ci-robot added area/provider/kwok Issues or PRs related to the kwok cloud provider for Cluster Autoscaler size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 23, 2025
@mtrqq mtrqq requested a review from towca September 23, 2025 09:08
Collaborator

@towca towca left a comment

Thanks for taking care of this! Just have 2 small comments before we can merge

Comment thread cluster-autoscaler/processors/customresources/gpu_processor.go Outdated
Comment thread cluster-autoscaler/utils/gpu/gpu_test.go
@mtrqq mtrqq force-pushed the dra-gpu-processor branch 2 times, most recently from 56744f1 to ef33683 on September 29, 2025 15:55
@towca
Collaborator

towca commented Sep 29, 2025

Thanks for addressing the comments!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 29, 2025
@towca
Collaborator

towca commented Sep 29, 2025

/release-note-edit

Nodes with GPUs exposed via DRA are no longer treated as unready if they don't have the nvidia.com/gpu custom resource in allocatable

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 29, 2025
@k8s-ci-robot k8s-ci-robot merged commit cef695b into kubernetes:master Sep 29, 2025
7 checks passed
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mtrqq, towca


// GpuDraDriverEnabled checks whether the GPU DRA driver is enabled on the node.
func GpuDraDriverEnabled(node *apiv1.Node) bool {
	return node.Labels[DraGPULabel] == "true"
}
Contributor

@mtrqq @towca are there any safe ways for us to implement this generally across providers, or is this necessarily a provider-specific + dra driver-specific story?

cc @nojnhuh see above how GCE is implementing disambiguation between device-plugin and DRA driver accelerator-enabled nodes

Contributor Author

I think we could do something more generic than this, but it would still be an intermediate step until the DRA Extended Resource KEP lands, which may become the de facto new standard and completely replace the device plugin. So at this point it doesn't feel worth investing in any cross-cloud-provider refactoring.

WDYT, @towca?

Collaborator

We discussed this during the SIG Autoscaling meeting; the conclusion was that detecting the enablement should be provider-specific, at least for the foreseeable future.
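
To make "provider-specific" concrete, one possible shape is a small per-provider hook, sketched below. Everything here is hypothetical: the `AcceleratorDetector` interface, the `labelDetector` type, and the label keys are illustrative names, and `Node` is a simplified stand-in for the Kubernetes Node object. The sketch also assumes that DRA nodes still carry the provider's GPU label.

```go
package main

import "fmt"

// Node is a simplified stand-in for the Kubernetes Node object.
type Node struct {
	Labels map[string]string
}

// AcceleratorMode says how a node exposes its GPUs.
type AcceleratorMode int

const (
	NoAccelerator AcceleratorMode = iota
	DevicePlugin
	DRA
)

// AcceleratorDetector is a hypothetical hook that each cloud provider would
// implement with its own detection logic.
type AcceleratorDetector interface {
	Detect(node Node) AcceleratorMode
}

// labelDetector is one possible implementation that relies on two node
// labels. Another provider could inspect entirely different signals.
type labelDetector struct {
	gpuLabel string
	draLabel string
}

// Detect returns DRA only for GPU-labeled nodes whose DRA label is "true".
func (d labelDetector) Detect(node Node) AcceleratorMode {
	if _, ok := node.Labels[d.gpuLabel]; !ok {
		return NoAccelerator
	}
	if node.Labels[d.draLabel] == "true" {
		return DRA
	}
	return DevicePlugin
}

func main() {
	var detector AcceleratorDetector = labelDetector{
		gpuLabel: "example.com/gpu",
		draLabel: "example.com/gpu-dra-driver",
	}
	node := Node{Labels: map[string]string{
		"example.com/gpu":            "some-gpu",
		"example.com/gpu-dra-driver": "true",
	}}
	fmt.Println("uses DRA:", detector.Detect(node) == DRA)
}
```

Callers would depend only on the interface, so a provider that later adopts a different signal can change its detector without touching shared code.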

@jackfrancis
Contributor

/cherry-pick cluster-autoscaler-release-1.34

@k8s-infra-cherrypick-robot

@jackfrancis: new pull request created: #9045

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler Issues or PRs related to the Cluster Autoscaler component area/provider/gce area/provider/kwok Issues or PRs related to the kwok cloud provider for Cluster Autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

6 participants