Disable GPU resource processor for nodes using DRA for accelerator attachment#8547
Conversation
|
/assign towca |
towca
left a comment
There was a problem hiding this comment.
There are a couple more places where we assume that GPU == device plugin:
GetGpuInfoForMetricsproduces different metrics if ResourceName is missing from Node capacity, and returns the ResourceName to be used as a metric label. We need to skip the capacity checking in the DRA case, and return something DRA-specific to be used as the metric label. Something likedra_<dra_driver_name>would make sense. The name of the DRA Nvidia GPU driver isgpu.nvidia.com, but it feels weird hardcoding it inGetGpuInfoForMetrics. Maybe instead of addingAttachedUsingDrawe addDraDriverNameand renameResourceNametoDevicePluginResourceName? Only one of them would ever be non-empty, and that could be checked instead of the bool. WDYT?utilization.Calculate()assumes that if*GpuConfigis non-nil it should base the utilization on ResourceName from Node capacity. This should be a very easy fix, we just need to check the new bool/DRA driver name and go to the DRA util logic if it's true/non-empty.
Could you handle these parts as well? Can be in in further commits in this PR, or further PRs.
87901dc to
1acc8c2
Compare
d1604af to
e4d55ef
Compare
towca
left a comment
There was a problem hiding this comment.
Thanks for taking care of this! Just have 2 small comments before we can merge
56744f1 to
ef33683
Compare
ef33683 to
d529b17
Compare
|
Thanks for addressing the comments! /lgtm |
|
/release-note-edit |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mtrqq, towca The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
|
||
| // GpuDraDriverEnabled checks whether GPU driver is enabled on the node | ||
| func GpuDraDriverEnabled(node *apiv1.Node) bool { | ||
| return node.Labels[DraGPULabel] == "true" |
There was a problem hiding this comment.
There was a problem hiding this comment.
I think we may do something more generic than this, but it'll still be intermediate step before DRA Extended Resource KEP lands and may become de facto a new standard completely replacing device plugin. So at this point it feels that it's not worth investing into any cross cloudprovider refactoring
WDYT, @towca?
There was a problem hiding this comment.
We discussed this during the SIG Autoscaling meeting, the conclusion was that detecting the enablement should be provider-specific, at least for the foreseeable future.
|
/cherry-pick cluster-autoscaler-release-1.34 |
|
@jackfrancis: new pull request created: #9045 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind bug
What this PR does / why we need it:
GpuCustomResourcesProcessor.FilterOutNodesWithUnreadyResourcesfilters out node objects with GPU labels while allocatable and capacity properties remain not populated. This is a good readiness indicator for nodes using device plugin, but for DRA-enabled nodes it doesn't work as they don't have device information in the capacity or allocatable. Instead there's another processor which needs to be configured for DRA infrastructure introduced as part of #8109. This CL effectively disables filtering for DRA-enabled nodes within GPU custom resource processor while doesn't affect other places usingGPULabelinstead ofGetNodeGpuConfigWhich issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: