Add Intel GPU (Habana Gaudi) autoscaler support #8853

Merged
k8s-ci-robot merged 2 commits into kubernetes:master from DorWeinstock:add-intel-gaudi-support on Nov 26, 2025

@DorWeinstock
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds support for Intel Habana Gaudi GPUs to the cluster autoscaler by:

  • Defining the ResourceIntelGPU resource name (habana.ai/gaudi)
  • Adding Intel GPU to the GPUVendorResourceNames list
  • Refactoring the GPU detection logic to iterate through all GPU vendor resource names instead of checking each vendor individually

This enables the autoscaler to properly detect and handle Intel GPU nodes alongside existing NVIDIA, AMD, and DirectX GPU support.
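
The refactor described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: `Node` is a hypothetical stand-in for the Kubernetes node type (its allocatable resources reduced to a plain map), `gpuResourceName` is a hypothetical helper, and the NVIDIA, AMD, and DirectX resource-name values are assumptions. Only `ResourceIntelGPU`, its value `habana.ai/gaudi`, and the `GPUVendorResourceNames` list name come from the description.

```go
package main

import "fmt"

// Resource names advertised by GPU device plugins. ResourceIntelGPU and its
// value are taken from the PR description; the other three values are
// assumptions made for this sketch.
const (
	ResourceNvidiaGPU = "nvidia.com/gpu"
	ResourceAMDGPU    = "amd.com/gpu"
	ResourceDirectX   = "microsoft.com/directx"
	ResourceIntelGPU  = "habana.ai/gaudi"
)

// GPUVendorResourceNames lists every resource name that counts as a GPU.
// Supporting a new vendor means adding one entry here.
var GPUVendorResourceNames = []string{
	ResourceNvidiaGPU,
	ResourceAMDGPU,
	ResourceDirectX,
	ResourceIntelGPU,
}

// Node is a hypothetical, simplified stand-in for a Kubernetes node:
// allocatable resources are modeled as a map from resource name to count.
type Node struct {
	Name        string
	Allocatable map[string]int64
}

// gpuResourceName (hypothetical helper) walks the vendor list once and
// returns the first GPU resource the node advertises with a nonzero count.
func gpuResourceName(n Node) (string, bool) {
	for _, name := range GPUVendorResourceNames {
		if qty, ok := n.Allocatable[name]; ok && qty > 0 {
			return name, true
		}
	}
	return "", false
}

func main() {
	nodes := []Node{
		{Name: "gaudi-0", Allocatable: map[string]int64{"cpu": 96, ResourceIntelGPU: 8}},
		{Name: "cpu-0", Allocatable: map[string]int64{"cpu": 16}},
	}
	for _, n := range nodes {
		name, ok := gpuResourceName(n)
		fmt.Printf("%s: hasGPU=%v resource=%q\n", n.Name, ok, name)
	}
}
```

The point of looping over one list is that detection code no longer needs a separate branch per vendor.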

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Changes have been tested against the IBM cloud provider.

Does this PR introduce a user-facing change?

Pods can now request habana.ai/gaudi as a valid resource.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note (denotes a PR that will be considered when it comes time to generate release notes) and kind/feature (categorizes issue or PR as related to a new feature) labels on Nov 24, 2025
@linux-foundation-easycla

linux-foundation-easycla Bot commented Nov 24, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot added the do-not-merge/needs-area (indicates that a PR should not merge because it lacks an area label), cncf-cla: no (indicates the PR's author has not signed the CNCF CLA), and area/cluster-autoscaler (issues or PRs related to the Cluster Autoscaler component) labels on Nov 24, 2025
@k8s-ci-robot
Contributor

Welcome @DorWeinstock!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Nov 24, 2025
@k8s-ci-robot
Contributor

Hi @DorWeinstock. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot removed the do-not-merge/needs-area label (indicates that a PR should not merge because it lacks an area label) on Nov 24, 2025
@k8s-ci-robot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) on Nov 24, 2025
Contributor

@elmiko left a comment

this makes sense to me, and i think it's a good upgrade. i have a couple questions about using the gpu package a little more.

Comment thread cluster-autoscaler/processors/customresources/gpu_processor.go Outdated
Comment thread cluster-autoscaler/processors/customresources/gpu_processor.go Outdated
Comment thread cluster-autoscaler/processors/customresources/gpu_processor.go
Add support for Intel Habana Gaudi GPUs in the cluster autoscaler by:
- Define ResourceIntelGPU resource name (habana.ai/gaudi)
- Add Intel GPU to GPUVendorResourceNames list
- Refactor GPU detection logic to iterate through all GPU vendor resource names
    instead of checking vendors individually

This enables the autoscaler to properly detect and handle Intel GPU nodes
alongside existing NVIDIA, AMD, and DirectX GPU support.
@DorWeinstock force-pushed the add-intel-gaudi-support branch from 486d3b7 to 5873c7f on November 25, 2025 17:42
@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) and removed the cncf-cla: no label (indicates the PR's author has not signed the CNCF CLA) on Nov 25, 2025
Extract the GPU allocatable detection loop into a new NodeHasGpuAllocatable
helper function in utils/gpu/gpu.go. This eliminates code duplication across
gpu_processor.go and makes the logic more maintainable.

The new function returns both the GPU allocatable value and whether it exists,
allowing callers to get both pieces of information in a single call.

Changes:
- Add NodeHasGpuAllocatable() helper in utils/gpu/gpu.go
- Update NodeHasGpu() to use the new helper
- Simplify FilterOutNodesWithUnreadyResources() in gpu_processor.go
- Simplify GetNodeGpuTarget() in gpu_processor.go
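
A rough sketch of the helper described in this commit message, and of how callers can reuse it, might look like the following. It is written under stated assumptions rather than copied from the change: `Node` is a hypothetical simplified type standing in for `*apiv1.Node`, the vendor list is an abbreviated assumption, a zero quantity is assumed to count as "no GPU allocatable", and `gpuNotReady` is a hypothetical caller that only hints at the kind of check `FilterOutNodesWithUnreadyResources()` performs.

```go
package main

import "fmt"

// gpuVendorResourceNames is an abbreviated, assumed copy of the vendor list;
// only habana.ai/gaudi is confirmed by the PR description.
var gpuVendorResourceNames = []string{"nvidia.com/gpu", "amd.com/gpu", "habana.ai/gaudi"}

// Node is a hypothetical, simplified stand-in for *apiv1.Node.
type Node struct {
	Name        string
	Labels      map[string]string
	Allocatable map[string]int64
}

// NodeHasGpuAllocatable returns the allocatable GPU count and whether any
// vendor GPU resource is present. Treating a zero count as absent is an
// assumption of this sketch.
func NodeHasGpuAllocatable(node *Node) (gpuAllocatableValue int64, hasGpuAllocatable bool) {
	for _, name := range gpuVendorResourceNames {
		if v, found := node.Allocatable[name]; found && v > 0 {
			return v, true
		}
	}
	return 0, false
}

// NodeHasGpu reports whether the node carries the GPU label or already
// exposes allocatable GPUs. It only needs the boolean from the helper.
func NodeHasGpu(gpuLabel string, node *Node) bool {
	_, hasLabel := node.Labels[gpuLabel]
	_, hasAllocatable := NodeHasGpuAllocatable(node)
	return hasLabel || hasAllocatable
}

// gpuNotReady (hypothetical) flags nodes that are labeled as GPU nodes but
// whose device plugin has not yet published any allocatable GPUs.
func gpuNotReady(gpuLabel string, node *Node) bool {
	_, hasLabel := node.Labels[gpuLabel]
	_, hasAllocatable := NodeHasGpuAllocatable(node)
	return hasLabel && !hasAllocatable
}

func main() {
	const gpuLabel = "example.com/accelerator" // hypothetical label key
	ready := &Node{
		Name:        "gaudi-ready",
		Labels:      map[string]string{gpuLabel: "gaudi"},
		Allocatable: map[string]int64{"habana.ai/gaudi": 8},
	}
	pending := &Node{
		Name:        "gaudi-pending",
		Labels:      map[string]string{gpuLabel: "gaudi"},
		Allocatable: map[string]int64{"cpu": 96},
	}
	for _, n := range []*Node{ready, pending} {
		count, _ := NodeHasGpuAllocatable(n)
		fmt.Printf("%s: hasGpu=%v notReady=%v count=%d\n",
			n.Name, NodeHasGpu(gpuLabel, n), gpuNotReady(gpuLabel, n), count)
	}
}
```

Returning the value and the presence flag together lets a caller that needs the count and a caller that needs only the yes/no answer share one loop over the vendor list.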
@k8s-ci-robot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) and removed the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) on Nov 25, 2025
@DorWeinstock
Contributor Author

DorWeinstock commented Nov 25, 2025

func NodeHasGpuAllocatable(node *apiv1.Node) (gpuAllocatableValue int64, hasGpuAllocatable bool) has been implemented and is now called in both gpu.go and gpu_processor.go.
@elmiko, @vadasambar please check again now.

@jackfrancis
Contributor

@yansun1996 do the non-Intel changes in this PR address your desired changes in #8865 ?

@yansun1996
Contributor

> @yansun1996 do the non-Intel changes in this PR address your desired changes in #8865 ?

yep @jackfrancis this PR makes the same changes as #8865

@jackfrancis
Contributor

/test pull-cluster-autoscaler-e2e-azure-master

@jackfrancis
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Nov 26, 2025
Contributor

@jackfrancis left a comment

/lgtm
/approve

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Nov 26, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DorWeinstock, jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Nov 26, 2025
@k8s-ci-robot merged commit ffcbfee into kubernetes:master on Nov 26, 2025
9 checks passed
@jackfrancis
Contributor

/cherry-pick cluster-autoscaler-release-1.34

@k8s-infra-cherrypick-robot

@jackfrancis: #8853 failed to apply on top of branch "cluster-autoscaler-release-1.34":

Applying: Add Intel GPU (Habana Gaudi) autoscaler support
Using index info to reconstruct a base tree...
M	cluster-autoscaler/processors/customresources/gpu_processor.go
Falling back to patching base and 3-way merge...
Auto-merging cluster-autoscaler/processors/customresources/gpu_processor.go
CONFLICT (content): Merge conflict in cluster-autoscaler/processors/customresources/gpu_processor.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 Add Intel GPU (Habana Gaudi) autoscaler support

Details

In response to this:

/cherry-pick cluster-autoscaler-release-1.34

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/cluster-autoscaler: Issues or PRs related to the Cluster Autoscaler component
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.

6 participants