fix(cluster-autoscaler): prevent panic in SimulateNodeRemoval by handling missing node info #8449

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from kincoy:fix/simulate-removal-nonexistent-node on Oct 27, 2025

Conversation

@kincoy
Member

@kincoy kincoy commented Aug 18, 2025

What type of PR is this?

/kind bug


What this PR does / why we need it:

This PR fixes a potential nil dereference issue in SimulateNodeRemoval when a node is missing from the clusterSnapshot.

Previously, if clusterSnapshot.GetNodeInfo failed, the function would continue and potentially panic when accessing nodeInfo.Pods().

This fix introduces:

  • A new UnremovableReason: NoNodeInfo, used to mark such nodes as unremovable.
  • An early return from SimulateNodeRemoval when the node is missing.
  • A defensive placeholder node (with only .Name set) to maintain observability and event compatibility.
  • A dedicated unit test case to verify this scenario is correctly handled.
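
The guard described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `snapshot`, `NodeInfo`, `Node`, `UnremovableNode`, `errNodeNotFound`, and the lowercase `simulateNodeRemoval` are simplified stand-ins invented for the example, and the real simulator types carry considerably more state.

```go
package main

import (
	"errors"
	"fmt"
)

// UnremovableReason is a simplified stand-in for the autoscaler's reason enum.
type UnremovableReason int

const (
	NoReason UnremovableReason = iota
	NoNodeInfo
)

// Node and NodeInfo are hypothetical, trimmed-down types for this sketch.
type Node struct{ Name string }

type NodeInfo struct {
	node *Node
	pods []string
}

// Pods dereferences the receiver, so calling it on a nil *NodeInfo panics.
func (n *NodeInfo) Pods() []string { return n.pods }

// UnremovableNode records which node cannot be removed and why.
type UnremovableNode struct {
	Node   *Node
	Reason UnremovableReason
}

// errNodeNotFound is a local sentinel used only by this example.
var errNodeNotFound = errors.New("node not found")

// snapshot is a toy cluster snapshot keyed by node name.
type snapshot map[string]*NodeInfo

func (s snapshot) GetNodeInfo(name string) (*NodeInfo, error) {
	info, ok := s[name]
	if !ok {
		return nil, errNodeNotFound
	}
	return info, nil
}

// simulateNodeRemoval returns the pods that would have to move, or an
// UnremovableNode when the node is absent from the snapshot.
func simulateNodeRemoval(s snapshot, name string) ([]string, *UnremovableNode) {
	info, err := s.GetNodeInfo(name)
	if err != nil {
		// Early return: info is nil here, so it must not be touched.
		// The placeholder node carries only the name, for events and logs.
		return nil, &UnremovableNode{Node: &Node{Name: name}, Reason: NoNodeInfo}
	}
	return info.Pods(), nil
}

func main() {
	s := snapshot{"n1": &NodeInfo{node: &Node{Name: "n1"}, pods: []string{"p1"}}}

	pods, unremovable := simulateNodeRemoval(s, "n1")
	fmt.Println(pods, unremovable == nil)

	_, unremovable = simulateNodeRemoval(s, "missing")
	fmt.Println(unremovable.Node.Name, unremovable.Reason == NoNodeInfo)
}
```

Without the early return, the lookup failure would leave `info` nil and the later `info.Pods()` call would panic, which is the failure mode the description refers to.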

Which issue(s) this PR fixes:

N/A


Special notes for your reviewer:

This PR fixes a potential nil dereference in SimulateNodeRemoval when a node is missing from the cluster snapshot.
It adds a new UnremovableReason (NoNodeInfo) to capture this edge case, along with test coverage.


Does this PR introduce a user-facing change?

fix: handle missing node info in SimulateNodeRemoval

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area Indicates that a PR should not merge because it lacks an area label. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 18, 2025
@k8s-ci-robot
Contributor

Hi @kincoy. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the area/cluster-autoscaler Issues or PRs related to the Cluster Autoscaler component label Aug 18, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-area Indicates that a PR should not merge because it lacks an area label. label Aug 18, 2025
@k8s-ci-robot k8s-ci-robot requested a review from elmiko August 18, 2025 09:35
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Aug 18, 2025
Contributor

@elmiko elmiko left a comment

thank you for the PR and adding the test too.

the changes here look generally good to me, but i am not overly familiar with the simulator code, specifically the returns for the failure condition.

would it be possible to add a unit test for the other return value as well? (i see the test for the unremovable node with the NoNodeInfo, but can we also test for UnexpectedError?)

@kincoy
Member Author

kincoy commented Aug 22, 2025

thank you for the PR and adding the test too.

the changes here look generally good to me, but i am not overly familiar with the simulator code, specifically the returns for the failure condition.

would it be possible to add a unit test for the other return value as well? (i see the test for the unremovable node with the NoNodeInfo, but can we also test for UnexpectedError?)

Thanks! I looked into GetNodeInfo — aside from ErrNodeNotFound, other errors only happen when draEnabled is true and WrapSchedulerNodeInfo fails, which is rare and hard to simulate without artificial mocks.

If you have a clean way to test this case, I’m happy to give it a try!
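
For readers following the two return paths discussed here, one way to separate the "node not found" case from any other lookup failure is a sentinel check with `errors.Is`. The sketch below is an assumption-laden illustration, not the PR's code: `ErrNodeNotFound` is redeclared locally, and `reasonFor`, `wrap`, `errOther`, and `failingSnapshot` are hypothetical names. It assumes, as the review comments suggest, that other errors map to `UnexpectedError`.

```go
package main

import (
	"errors"
	"fmt"
)

// UnremovableReason is a simplified stand-in for the autoscaler's reason enum.
type UnremovableReason int

const (
	NoNodeInfo UnremovableReason = iota + 1
	UnexpectedError
)

// ErrNodeNotFound is a local sentinel standing in for the snapshot's error.
var ErrNodeNotFound = errors.New("node not found")

// errOther models any other lookup failure (hypothetical message).
var errOther = errors.New("wrapping scheduler node info failed")

// wrap adds context while keeping the original error reachable via %w.
func wrap(err error) error { return fmt.Errorf("lookup failed: %w", err) }

// reasonFor maps a lookup error to a reason. errors.Is also matches the
// sentinel when it has been wrapped with %w.
func reasonFor(err error) UnremovableReason {
	if errors.Is(err, ErrNodeNotFound) {
		return NoNodeInfo
	}
	return UnexpectedError
}

// failingSnapshot is a hand-written fake whose lookup always returns err.
type failingSnapshot struct{ err error }

func (f failingSnapshot) GetNodeInfo(string) (any, error) { return nil, f.err }

func main() {
	for _, err := range []error{ErrNodeNotFound, wrap(ErrNodeNotFound), errOther} {
		_, gotErr := failingSnapshot{err: err}.GetNodeInfo("n1")
		fmt.Printf("%v -> %d\n", gotErr, reasonFor(gotErr))
	}
}
```

A fake like `failingSnapshot` is only practical if the code under test depends on a small interface; whether that holds for the real snapshot type is exactly the mocking cost raised in this comment.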

@elmiko
Contributor

elmiko commented Aug 27, 2025

Thanks! I looked into GetNodeInfo — aside from ErrNodeNotFound, other errors only happen when draEnabled is true and WrapSchedulerNodeInfo fails, which is rare and hard to simulate without artificial mocks.

thanks for investigating, i was worried it might take a complicated mock to make it work. i don't think it's worth the effort to create a test with a mock just for the error condition.

/lgtm

would be good to get another review from a maintainer

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 27, 2025
Comment thread cluster-autoscaler/simulator/cluster.go Outdated
Comment thread cluster-autoscaler/simulator/cluster.go Outdated
@kincoy kincoy force-pushed the fix/simulate-removal-nonexistent-node branch from 3d61fed to ed37b25 Compare August 28, 2025 02:10
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 28, 2025
@kincoy kincoy force-pushed the fix/simulate-removal-nonexistent-node branch from ed37b25 to 6fcb503 Compare August 28, 2025 02:14
@kincoy
Member Author

kincoy commented Sep 1, 2025

Friendly ping @jackfrancis — this PR has been idle for a while. Would appreciate a review when convenient 🙏

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 3, 2025
@jackfrancis
Contributor

@kincoy sorry for the delay, this lgtm, and I'll give it the official stamp after a rebase

thanks!

@k8s-ci-robot k8s-ci-robot added area/balancer Issues or PRs related to the Balancer component area/helm-charts area/provider/azure Issues or PRs related to azure provider area/provider/cluster-api Issues or PRs related to Cluster API provider area/provider/coreweave and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 6, 2025
@jackfrancis jackfrancis added the release-note-none Denotes a PR that doesn't merit a release note. label Oct 6, 2025
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Oct 6, 2025
@jackfrancis
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 6, 2025
@jackfrancis
Contributor

/cherry-pick cluster-autoscaler-release-1.32
/cherry-pick cluster-autoscaler-release-1.33
/cherry-pick cluster-autoscaler-release-1.34

@k8s-infra-cherrypick-robot

@jackfrancis: once the present PR merges, I will cherry-pick it on top of cluster-autoscaler-release-1.32, cluster-autoscaler-release-1.33, cluster-autoscaler-release-1.34 in new PRs and assign them to you.


@jackfrancis
Contributor

/lgtm
/approve
/hold

@towca @BigDarkClown @x13n @elmiko for another review

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 6, 2025
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 6, 2025
@jackfrancis
Contributor

/test pull-cluster-autoscaler-e2e-azure-master

@x13n
Member

x13n commented Oct 8, 2025

Hm... so #8473 removed unit tests along with the tested function, but the underlying SimulateNodeRemoval was only indirectly tested. In retrospect, the tests should probably have been rewritten instead of removed, so that this PR could extend them. @jackfrancis do you think you could bring the tests back, updating them to test SimulateNodeRemoval instead of the deleted FindNodesToRemove? Alternatively, @kincoy would you be able to do that as a part of this PR?

@kincoy
Member Author

kincoy commented Oct 27, 2025

@kincoy would you be able to do that as a part of this PR?

Yeah, I’ll add a new test file in this PR to cover SimulateNodeRemoval.

@kincoy kincoy force-pushed the fix/simulate-removal-nonexistent-node branch from 7e694d6 to a40418d Compare October 27, 2025 09:12
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 27, 2025
@kincoy
Member Author

kincoy commented Oct 27, 2025

@x13n hi, I’ve updated the PR and added the test file. The function’s coverage is now over 80%.

~ go tool cover -func=cover.out
k8s.io/autoscaler/cluster-autoscaler/simulator/cluster.go:111: NewRemovalSimulator 100.0%
k8s.io/autoscaler/cluster-autoscaler/simulator/cluster.go:126: SimulateNodeRemoval 81.8%
k8s.io/autoscaler/cluster-autoscaler/simulator/cluster.go:168: withForkedSnapshot 66.7%
k8s.io/autoscaler/cluster-autoscaler/simulator/cluster.go:184: findPlaceFor 88.9%

@jackfrancis
Contributor

/lgtm
/approve
/hold cancel

thank you @kincoy!

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, kincoy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-infra-cherrypick-robot

@jackfrancis: #8449 failed to apply on top of branch "cluster-autoscaler-release-1.33":

Applying: fix(cluster-autoscaler): prevent panic in SimulateNodeRemoval by handling missing node info
Using index info to reconstruct a base tree...
M	cluster-autoscaler/simulator/cluster.go
Falling back to patching base and 3-way merge...
CONFLICT (add/add): Merge conflict in cluster-autoscaler/simulator/cluster_test.go
Auto-merging cluster-autoscaler/simulator/cluster_test.go
Auto-merging cluster-autoscaler/simulator/cluster.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 fix(cluster-autoscaler): prevent panic in SimulateNodeRemoval by handling missing node info


@k8s-infra-cherrypick-robot

@jackfrancis: #8449 failed to apply on top of branch "cluster-autoscaler-release-1.34":

Applying: fix(cluster-autoscaler): prevent panic in SimulateNodeRemoval by handling missing node info
Using index info to reconstruct a base tree...
M	cluster-autoscaler/simulator/cluster.go
Falling back to patching base and 3-way merge...
CONFLICT (add/add): Merge conflict in cluster-autoscaler/simulator/cluster_test.go
Auto-merging cluster-autoscaler/simulator/cluster_test.go
Auto-merging cluster-autoscaler/simulator/cluster.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 fix(cluster-autoscaler): prevent panic in SimulateNodeRemoval by handling missing node info


@k8s-infra-cherrypick-robot

@jackfrancis: #8449 failed to apply on top of branch "cluster-autoscaler-release-1.32":

Applying: fix(cluster-autoscaler): prevent panic in SimulateNodeRemoval by handling missing node info
Using index info to reconstruct a base tree...
M	cluster-autoscaler/simulator/cluster.go
Falling back to patching base and 3-way merge...
CONFLICT (add/add): Merge conflict in cluster-autoscaler/simulator/cluster_test.go
Auto-merging cluster-autoscaler/simulator/cluster_test.go
Auto-merging cluster-autoscaler/simulator/cluster.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 fix(cluster-autoscaler): prevent panic in SimulateNodeRemoval by handling missing node info


@towca
Collaborator

towca commented Oct 31, 2025

@kincoy @jackfrancis @x13n Do we know why the NodeInfo is missing in the first place? The fix itself looks good, but IIUC there was a time when this logic worked correctly even without the fix - do we know what changed? I'm worried the fix might be masking a wider regression in ClusterSnapshot.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler Issues or PRs related to the Cluster Autoscaler component cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.


8 participants