fix(cluster-autoscaler): prevent panic in SimulateNodeRemoval by handling missing node info #8449
Hi @kincoy. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
elmiko
left a comment
thank you for the PR and adding the test too.
the changes here look generally good to me, but i am not overly familiar with the simulator code. in specific, the returns for the failure condition.
would it be possible to add a unit test for the other return value as well? (i see the test for the unremovable node with the NoNodeInfo, but can we also test for UnexpectedError?)
Thanks! I looked into If you have a clean way to test this case, I’m happy to give it a try! |
thanks for investigating, i was worried it might take a complicated mock to make it work. i don't think it's worth the effort to create a test with a mock just for the error condition. /lgtm would be good to get another review from a maintainer |
Force-pushed from 3d61fed to ed37b25.
Force-pushed from ed37b25 to 6fcb503.
|
Friendly ping @jackfrancis — this PR has been idle for a while. Would appreciate a review when convenient 🙏 |
|
@kincoy sorry for the delay, this lgtm, and I'll give it the official stamp after a rebase thanks! |
|
/ok-to-test |
|
/cherry-pick cluster-autoscaler-release-1.32 |
|
@jackfrancis: once the present PR merges, I will cherry-pick it on top of …
|
/lgtm @towca @BigDarkClown @x13n @elmiko for another review |
|
/test pull-cluster-autoscaler-e2e-azure-master |
|
Hm... so #8473 removed unit tests along with the tested function, but the underlying |
Yeah, I’ll add a new test file in this PR to cover |
Force-pushed from 7e694d6 to a40418d.
|
@x13n hi, I’ve updated the PR and added the test file. The function’s coverage is now over 80%. ~ go tool cover -func=cover.out |
|
/lgtm thank you @kincoy! |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jackfrancis, kincoy. The full list of commands accepted by this bot can be found here. The pull request process is described here.
|
@jackfrancis: #8449 failed to apply on top of branch "cluster-autoscaler-release-1.33".
|
@jackfrancis: #8449 failed to apply on top of branch "cluster-autoscaler-release-1.34".
|
@jackfrancis: #8449 failed to apply on top of branch "cluster-autoscaler-release-1.32".
|
@kincoy @jackfrancis @x13n Do we know why the NodeInfo is missing in the first place? The fix itself looks good, but IIUC there was a time when this logic worked correctly even without the fix - do we know what changed? I'm worried the fix might be masking a wider regression in ClusterSnapshot. |
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a potential nil dereference issue in SimulateNodeRemoval when a node is missing from the clusterSnapshot.
Previously, if clusterSnapshot.GetNodeInfo failed, the function would continue and potentially panic when accessing nodeInfo.Pods().
This fix introduces:
- UnremovableReason: NoNodeInfo, used to mark such nodes as unremovable.
- … SimulateNodeRemoval when the node is missing.
- … .Name set) to maintain observability and event compatibility.
Which issue(s) this PR fixes:
N/A
Special notes for your reviewer:
This PR fixes a potential nil dereference in SimulateNodeRemoval when a node is missing from the cluster snapshot.
It adds a new UnremovableReason (NoNodeInfo) to capture this edge case, and adds test coverage.
Does this PR introduce a user-facing change?