✨ Add scale from/to 0 support for CAPD #12572
Conversation
|
/hold for testing During initial testing with the autoscaler and CAPI v1beta2 APIs, it seems like the autoscaler needs to be updated for CAPI's apiVersion to apiGroup change in v1beta2. Will update more based on the test progress. |
|
/cc @sbueringer |
This should work for now as we are pinning the CAPI_VERSION here:
There is this other issue here though: kubernetes/autoscaler#7908 (comment). So overall, this should work at the moment with autoscaler v1.32.1. We have to make the following change to get the corresponding test coverage (cluster-api/test/e2e/autoscaler_test.go, Line 39 in 1636804):
Should be AutoscalerVersion: "v1.32.1",
ScaleToAndFromZero: true,
(Let's make that change in this PR, it's fine to downgrade for now until kubernetes/autoscaler#7908 (comment) is fixed.) (The autoscaler test is part of pull-cluster-api-e2e-main.) |
|
I think the approach here looks solid, although one thing I would want to double-check is that the system resources reported by the container runtime are correct. In the past, I have seen the in-container resources looking very similar to the system-wide resources, and this can cause issues when multiple containers are started as kubelets with the same resource capacity as the host. I know Docker and Podman allow limiting the resource capacity of a container, but I haven't seen it working as I would expect with Kubernetes. |
Hey, thanks for the feedback, I will update based on my observations during testing. But do you recommend any other way of fetching system resources apart from using the runtime? |
|
The current behavior is expected in my opinion. If you run CAPD and look at Node allocatable resources etc., it will show the entire host resources for every single Node. We want the capacity information on the DockerMachineTemplate to match that. We do not want to introduce limiting/reserving of CAPD Machine memory/CPU as part of enabling CAPD for autoscaling from/to 0. Let's please also not forget that CAPD exists for the sole purpose of testing core CAPI. If we started enforcing memory reservations/limits on CAPD containers so that the actually available resources are perfectly split up, we would not be able to run our e2e tests anymore (where we run a huge number of basically empty CAPD Machines at the same time). |
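For illustration only, a minimal sketch of how host-wide capacity could be read from the container runtime and turned into a Kubernetes ResourceList. This is not the code from this PR; the use of the Docker Go client and the function name are assumptions:

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/client"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// hostCapacity queries the container runtime for host-wide CPU and memory and
// converts them into a Kubernetes ResourceList. As noted above, every CAPD
// Machine intentionally reports the full host capacity.
func hostCapacity(ctx context.Context) (corev1.ResourceList, error) {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return nil, err
	}
	defer cli.Close()

	info, err := cli.Info(ctx)
	if err != nil {
		return nil, err
	}

	return corev1.ResourceList{
		corev1.ResourceCPU:    *resource.NewQuantity(int64(info.NCPU), resource.DecimalSI),
		corev1.ResourceMemory: *resource.NewQuantity(info.MemTotal, resource.BinarySI),
	}, nil
}

func main() {
	capacity, err := hostCapacity(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Printf("cpu: %s, memory: %s\n", capacity.Cpu(), capacity.Memory())
}
```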
|
/test pull-cluster-api-e2e-main (to run the autoscaler test) |
|
A general question I wanted to record before I miss it: what's the CAPI-recommended way of logging a CRD name in logs? |
|
The second option |
Thanks. Should that be recorded here: https://cluster-api.sigs.k8s.io/developer/core/logging#log-messages?
Makes sense. Feel free to open a PR. It should be something along the lines of: "If kinds are mentioned in log messages, always use the literal kind". Maybe with an example.
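For instance, a hypothetical snippet illustrating that convention (function and variable names are made up, not taken from the codebase):

```go
package example

import (
	"context"

	"k8s.io/klog/v2"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// logWithLiteralKind mentions the literal kind "DockerMachineTemplate" in the
// message and uses it as the structured key, with klog.KObj for namespace/name.
func logWithLiteralKind(ctx context.Context, dockerMachineTemplate client.Object) {
	log := ctrl.LoggerFrom(ctx)
	log.Info("Updating capacity in DockerMachineTemplate", "DockerMachineTemplate", klog.KObj(dockerMachineTemplate))
}
```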
|
Will debug and update on the failure, and will try to run it locally. From the autoscaler logs, I hope that the autoscaler did not remove the scaled nodes; just trying to better read the logs from the test artifacts. |
|
Hm yeah. Who knows what the problem is. In CAPV we are running the autoscaler on the mgmt cluster for test setup reasons. So there might be an additional problem when running the autoscaler on the workload cluster that we weren't aware of yet |
Not specifically, I more wanted to highlight the behavior. I absolutely defer to @sbueringer though; this makes sense to me:
On the autoscaler failure:
It looks like the autoscaler removed the nodes due to them being underutilized. This is where knowing the capacity of nodes created on CAPD will be important. By default, the autoscaler will want to remove nodes that are below 50% resource utilization, calculated by comparing the requests of the pods on a node against the allocatable and capacity resources of that node. So, if the node is taking its resource capacity from the host, it could be much larger than expected, and we need to account for that either by lowering the threshold for removal or by adjusting our test workloads to use higher requests. |
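Roughly, that check can be sketched as follows (an illustrative example for CPU only, not the autoscaler's actual code; 0.5 is the default value of its --scale-down-utilization-threshold flag):

```go
package example

import corev1 "k8s.io/api/core/v1"

// cpuUtilization approximates the autoscaler's scale-down check for CPU: the
// sum of the pod requests on a node divided by the node's allocatable CPU.
// With the default threshold of 0.5, a node that stays below 50% utilization
// becomes a scale-down candidate.
func cpuUtilization(node corev1.Node, pods []corev1.Pod) float64 {
	allocatable := node.Status.Allocatable.Cpu().MilliValue()
	if allocatable == 0 {
		return 0
	}
	var requested int64
	for _, pod := range pods {
		for _, c := range pod.Spec.Containers {
			requested += c.Resources.Requests.Cpu().MilliValue()
		}
	}
	return float64(requested) / float64(allocatable)
}
```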
|
Just for context. We are already testing CAPD with the autoscaler. The only part we did not test before was scale from/to 0. The way this works today in CAPD is that the Node comes up with some fantasy number for memory and sets it on
So my assumption would be that this should continue to work (In reality neither the Deployment nor the Machine will use anything close to the memory values we see) |
|
The test currently fails at the following point:
So we should have no Pods of that Deployment anymore in the workload cluster, but the autoscaler currently does not scale the MD to 0; it stays at 3. |
|
Just another update: observing machines getting provisioned and deleted too quickly. |
Did a quick review. But I think I found nothing that explains the test failure. I'll take a look
Review comments on test/infrastructure/docker/internal/controllers/dockermachinetemplate_controller.go (outdated, resolved)
|
@Karthik-K-N Probably this is the issue: #12572 (comment) I used |
Force-pushed from 1ea99d3 to 1c0c9f1
|
/test pull-cluster-api-e2e-main |
sbueringer left a comment:
Thank you!
Last nit.
Also verified in CI. cpu/memory is perfectly matching up between the Node and the DockerMachineTemplate:

Node:

```yaml
capacity:
  cpu: "16"
  ephemeral-storage: 292825676Ki
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 130810784Ki
  pods: "110"
```

DockerMachineTemplate:

```yaml
status:
  capacity:
    cpu: "16"
    memory: 130810784Ki
```
/cherry-pick release-1.11 |
|
@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.11. In response to this: |
|
/test pull-cluster-api-e2e-main |
|
LGTM label has been added. Git tree hash: 3fa8f57fa8318a6e8cc39fb93bbae5f9797a78e1 |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: sbueringer. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
|
@Karthik-K-N Thank you very much! I think this will help a lot with finding compatibility issues between Cluster API and cluster-autoscaler sooner. |
|
@sbueringer: new pull request created: #12591. In response to this: |
What this PR does / why we need it:
This PR adds scale from/to 0 support for CAPD.
Which issue(s) this PR fixes (optional, in fixes #<issue_number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #12505
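For context on how this is consumed, a hedged sketch of a MachineDeployment opted into autoscaling via the cluster-autoscaler clusterapi provider's node-group size annotations; the object name and the v1beta1 import are illustrative, and scaling from zero additionally relies on the capacity information this PR surfaces on the DockerMachineTemplate:

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// autoscalableMachineDeployment returns a MachineDeployment annotated so the
// cluster-autoscaler treats it as a node group that may scale between 0 and 5
// replicas; a min size of "0" is what enables scaling to/from zero.
func autoscalableMachineDeployment() *clusterv1.MachineDeployment {
	return &clusterv1.MachineDeployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "md-0",
			Namespace: "default",
			Annotations: map[string]string{
				"cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size": "0",
				"cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size": "5",
			},
		},
	}
}
```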