
Conversation

@Karthik-K-N
Contributor

What this PR does / why we need it:

This PR adds scale from 0 support for CAPD

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #12505

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area PR is missing an area label labels Aug 1, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 1, 2025
@Karthik-K-N
Contributor Author

/hold for testing

During initial testing with the autoscaler and the CAPI v1beta2 APIs, it seems the autoscaler needs to be updated for CAPI's change from apiVersion to apiGroup in v1beta2. Will update more based on the test progress.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 1, 2025
@Karthik-K-N
Contributor Author

/cc @sbueringer

@k8s-ci-robot k8s-ci-robot requested a review from sbueringer August 1, 2025 14:49
@sbueringer
Member

sbueringer commented Aug 1, 2025

During initial testing with the autoscaler and the CAPI v1beta2 APIs, it seems the autoscaler needs to be updated for CAPI's change from apiVersion to apiGroup in v1beta2. Will update more based on the test progress.

This should work for now as we are pinning the CAPI_VERSION here:

(until autoscaler is adjusted)

There is this other issue here though: kubernetes/autoscaler#7908 (comment)

So overall, this should work at the moment with autoscaler v1.32.1.

We have to make the following changes to get the corresponding test coverage.

AutoscalerVersion: "v1.33.0",

Should be

AutoscalerVersion:                     "v1.32.1",
ScaleToAndFromZero:                    true,

(Let's make that change in this PR, it's fine to downgrade for now until kubernetes/autoscaler#7908 (comment) is fixed)

(autoscaler test is part of pull-cluster-api-e2e-main)

@sbueringer sbueringer added the area/provider/infrastructure-docker Issues or PRs related to the docker infrastructure provider label Aug 1, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-area PR is missing an area label label Aug 1, 2025
@elmiko
Contributor

elmiko commented Aug 1, 2025

I think the approach here looks solid, although one thing I would want to double-check is that the system resources reported by the container runtime are correct. In the past, I have seen the inside-container resources looking very similar to the system-wide resources, and this can cause issues when multiple containers are started as kubelets with the same resource capacity as the host.

I know Docker and Podman allow limiting the resource capacity of a container, but I haven't seen it working as I would expect with Kubernetes.

@Karthik-K-N
Contributor Author

I think the approach here looks solid, although one thing I would want to double-check is that the system resources reported by the container runtime are correct. In the past, I have seen the inside-container resources looking very similar to the system-wide resources, and this can cause issues when multiple containers are started as kubelets with the same resource capacity as the host.

I know Docker and Podman allow limiting the resource capacity of a container, but I haven't seen it working as I would expect with Kubernetes.

Hey, thanks for the feedback. I will update based on my observations during testing. But do you recommend any other way of fetching system resources apart from using the runtime?

@sbueringer
Member

sbueringer commented Aug 4, 2025

The current behavior is expected in my opinion.

If you run CAPD and look at Node allocatable resources etc., it will show the entire host's resources for every single Node. We want the capacity information on the DockerMachineTemplate to match that.

We do not want to introduce limiting/reserving of CAPD Machine memory/CPU as part of enabling CAPD for autoscaling from/to 0.

Let's please also not forget that CAPD exists for the sole purpose of testing core CAPI. If we started enforcing memory reservations/limits on CAPD containers so the actual available resources are perfectly split up, we would not be able to run our e2e tests anymore (where we run a huge number of basically empty CAPD Machines at the same time).
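To make the point above concrete, here is a minimal sketch (hypothetical types and made-up numbers, not CAPD's actual code) of the behavior being described: the scale-from-zero capacity simply mirrors whatever the Node reports as allocatable, with no limiting or reserving introduced.

```go
package main

import "fmt"

// NodeResources is a hypothetical stand-in for what a CAPD Node reports as
// allocatable; the values below are illustrative, not from a real cluster.
type NodeResources struct {
	CPUMilli    int64
	MemoryBytes int64
}

// capacityForTemplate returns the allocatable resources unchanged, matching
// the behavior described above: the template capacity mirrors the Node.
func capacityForTemplate(allocatable NodeResources) NodeResources {
	return allocatable
}

func main() {
	host := NodeResources{CPUMilli: 8000, MemoryBytes: 32 << 30}
	capacity := capacityForTemplate(host)
	fmt.Println(capacity.CPUMilli, capacity.MemoryBytes) // prints: 8000 34359738368
}
```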

@sbueringer
Member

/test pull-cluster-api-e2e-main

(to run the autoscaler test)

@Karthik-K-N
Contributor Author

A general question I wanted to record before I miss it: what's the CAPI-recommended way of logging a CRD name in logs?
Say, for example, we have DockerMachineTemplate; how should that be logged?

log.Info("Calculating capacity for Docker Machine Template")

or

log.Info("Calculating capacity for DockerMachineTemplate")

or

log.Info("Calculating capacity for docker machine template")

@sbueringer
Member

The second option

@Karthik-K-N
Contributor Author

The second option

Thanks. Should that be recorded here: https://cluster-api.sigs.k8s.io/developer/core/logging#log-messages?

@sbueringer
Member

The second option

Thanks. Should that be recorded here: cluster-api.sigs.k8s.io/developer/core/logging#log-messages?

Makes sense. Feel free to open a PR. It should be something along the lines of: "If kinds are mentioned in log messages, always use the literal kind." Maybe with an example.

@Karthik-K-N
Contributor Author

Will debug and update on the failure; will try to run locally. But from the autoscaler logs I could see:

I0804 08:29:28.274247       1 actuator.go:166] Scale-down: removing empty node "autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr"
I0804 08:29:28.274883       1 actuator.go:286] Scale-down: waiting 5s before trying to delete nodes
W0804 08:29:33.288013       1 warnings.go:70] cluster.x-k8s.io/v1beta1 Machine is deprecated; use cluster.x-k8s.io/v1beta2 Machine
W0804 08:29:33.343132       1 warnings.go:70] cluster.x-k8s.io/v1beta1 Machine is deprecated; use cluster.x-k8s.io/v1beta2 Machine
W0804 08:29:39.092767       1 clusterstate.go:657] Nodegroup is nil for docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr
W0804 08:29:39.111636       1 static_autoscaler.go:821] No node group for node docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr, skipping
W0804 08:29:50.702257       1 clusterstate.go:657] Nodegroup is nil for docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr
W0804 08:29:50.720390       1 static_autoscaler.go:821] No node group for node docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr, skipping
W0804 08:30:02.303995       1 clusterstate.go:657] Nodegroup is nil for docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr
W0804 08:30:02.320868       1 static_autoscaler.go:821] No node group for node docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr, skipping
W0804 08:30:13.914242       1 clusterstate.go:657] Nodegroup is nil for docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr
W0804 08:30:13.933645       1 static_autoscaler.go:821] No node group for node docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr, skipping
W0804 08:30:25.519404       1 clusterstate.go:657] Nodegroup is nil for docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr
W0804 08:30:25.537134       1 static_autoscaler.go:821] No node group for node docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr, skipping
W0804 08:30:37.127578       1 clusterstate.go:657] Nodegroup is nil for docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr
W0804 08:30:37.147262       1 static_autoscaler.go:821] No node group for node docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr, skipping
W0804 08:30:48.930771       1 clusterstate.go:657] Nodegroup is nil for docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr
W0804 08:30:48.947725       1 static_autoscaler.go:821] No node group for node docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr, skipping
W0804 08:31:00.743607       1 clusterstate.go:657] Nodegroup is nil for docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr
W0804 08:31:00.765473       1 static_autoscaler.go:821] No node group for node docker:////autoscaler-d4p5q9-md-0-gqk4j-gv24j-h5gcr, skipping
I0804 08:31:02.124780       1 actuator.go:166] Scale-down: removing empty node "autoscaler-d4p5q9-md-0-gqk4j-gv24j-jhdhf"
I0804 08:31:02.125125       1 actuator.go:166] Scale-down: removing empty node "autoscaler-d4p5q9-md-0-gqk4j-gv24j-xkjcp"
I0804 08:31:02.125546       1 actuator.go:286] Scale-down: waiting 5s before trying to delete nodes
I0804 08:31:02.136484       1 actuator.go:259] Scale-down: removing node autoscaler-d4p5q9-md-0-gqk4j-gv24j-c9mx7, utilization: {0.00625 0.00039140503889954515 0 0 cpu 0.00625}, pods to reschedule: cluster-autoscaler-7cc5775ccc-6x94r
I0804 08:31:02.136816       1 actuator.go:286] Scale-down: waiting 5s before trying to delete nodes
W0804 08:31:07.139129       1 warnings.go:70] cluster.x-k8s.io/v1beta1 Machine is deprecated; use cluster.x-k8s.io/v1beta2 Machine

I hope that the autoscaler did not remove the scaled-up nodes; just trying to better read the logs in the test artifacts.

@sbueringer
Member

Hm yeah. Who knows what the problem is. In CAPV we are running the autoscaler on the management cluster for test setup reasons. So there might be an additional problem when running the autoscaler on the workload cluster that we weren't aware of yet.

@elmiko
Contributor

elmiko commented Aug 4, 2025

Hey, thanks for the feedback. I will update based on my observations during testing. But do you recommend any other way of fetching system resources apart from using the runtime?

Not specifically; I more wanted to highlight the behavior. I absolutely defer to @sbueringer though, and this makes sense to me:

We do not want to introduce limiting/reserving CAPD Machine memory/CPU as part of enabling CAPD for autoscaling from/to 0.

on the autoscaler failure

Will debug and update on the failure; will try to run locally. But from the autoscaler logs I could see:

It looks like the autoscaler removed the nodes due to them being underutilized. This is where knowing the capacity of nodes created on CAPD will be important. By default, the autoscaler will want to remove nodes that are below 50% resource utilization, as calculated by comparing the requests of the pods on the node against the allocatable and capacity resources of the node. So, if the node is taking its resource capacity from the host, it could be much larger than expected, and we need to account for that either by lowering the threshold for removal or by adjusting our test workloads to use higher requests.
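The utilization rule described above can be sketched numerically. This is an illustrative restatement with made-up numbers, not values from the actual test cluster:

```go
package main

import "fmt"

// utilization is the ratio described above: the sum of pod requests on a
// node divided by the node's allocatable amount of that resource.
func utilization(sumRequestsMilli, allocatableMilli int64) float64 {
	return float64(sumRequestsMilli) / float64(allocatableMilli)
}

func main() {
	// If a CAPD node advertises the whole host's 8 CPUs while the pods on it
	// only request 200m, the node sits at 2.5% utilization, far below the
	// default 50% scale-down threshold, so the autoscaler would remove it.
	u := utilization(200, 8000)
	fmt.Printf("%.3f below-threshold=%v\n", u, u < 0.5) // prints: 0.025 below-threshold=true
}
```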

@sbueringer
Member

sbueringer commented Aug 4, 2025

Just for context. We are already testing CAPD with the autoscaler. The only part we did not test before was scale from/to 0.

The way this works today in CAPD is that the Node comes up with some fantasy number for memory and sets it on Node.status.Allocatable.Memory. We then create a Deployment that takes 60% of that fantasy number and sets it as requested memory:

memoryRequired := int64(float64(memory.Value()) * 0.6)

So my assumption would be that this should continue to work

(In reality neither the Deployment nor the Machine will use anything close to the memory values we see)

@sbueringer
Member

sbueringer commented Aug 4, 2025

The test currently fails at the following point:

  • Scale that Deployment with the 60% requested memory down to 0
  • Checking the MachineDeployment finished scaling down to zero

So we should have no Pods of that Deployment anymore in the workload cluster, but the autoscaler currently does not scale the MD to 0, it stays at 3

@Karthik-K-N
Contributor Author

Just another update:

  1. Created a management cluster using Tilt with autoscaler support
  2. Created a ClusterClass from the Tilt UI and created a development cluster with 0 workers
  3. Edited the MachineDeployment and added the required annotations
  4. Created a workload with the following YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: busybox
  name: busybox-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
        - command:
            - sh
            - -c
            - echo Container 1 is Running ; sleep 3600
          image: busybox
          imagePullPolicy: IfNotPresent
          name: busybox
          resources:
            requests:
              cpu: "0.2"
              memory: 3G

  5. Cloned autoscaler and checked out the cluster-autoscaler-1.32.1 branch
  6. Built the binary and started the autoscaler with the following args:
export CAPI_VERSION=v1beta1

./cluster-autoscaler \
--cloud-provider=clusterapi \
--v=5 \
--namespace=default \
--max-nodes-total=30 \
--scale-down-delay-after-add=100s \
--scale-down-delay-after-delete=10s \
--scale-down-delay-after-failure=10s \
--scale-down-unneeded-time=5m \
--max-node-provision-time=30m \
--balance-similar-node-groups \
--expander=random \
--kubeconfig=/root/karthik-workspace/cluster-api/cmd/clusterctl/workload.conf \
--cloud-config=/root/karthik-workspace/cluster-api/cmd/clusterctl/management.conf

Observing machines getting provisioned and deleted too quickly:

kubectl get machines -w
development-19080-md-0-qgr98-f7tn6-m9hpw   development-19080                                                                              Pending        0s      v1.33.0
development-19080-md-0-qgr98-f7tn6-m9hpw   development-19080                                                                              Pending        0s      v1.33.0
development-19080-md-0-qgr98-f7tn6-m9hpw   development-19080                                                                              Pending        0s      v1.33.0
development-19080-md-0-qgr98-f7tn6-m9hpw   development-19080                                                                              Provisioning   0s      v1.33.0
development-19080-md-0-qgr98-f7tn6-m9hpw   development-19080                                                                              Provisioning   1s      v1.33.0
development-19080-md-0-qgr98-f7tn6-m9hpw   development-19080                                                                              Provisioning   1s      v1.33.0
development-19080-md-0-qgr98-f7tn6-m9hpw   development-19080                                                                              Provisioning   1s      v1.33.0
development-19080-md-0-qgr98-f7tn6-m9hpw   development-19080                                                                              Provisioning   1s      v1.33.0
development-19080-md-0-qgr98-f7tn6-m9hpw   development-19080                                                                              Provisioning   1s      v1.33.0
development-19080-md-0-qgr98-f7tn6-m9hpw   development-19080                                                                              Deleting       1s      v1.33.0
development-19080-md-0-qgr98-f7tn6-m9hpw   development-19080                                                                              Deleting       1s      v1.33.0
development-19080-md-0-qgr98-f7tn6-m9hpw   development-19080                                                                              Deleting       11s     v1.33.0
development-19080-md-0-qgr98-f7tn6-m9hpw   development-19080

@sbueringer sbueringer left a comment (Member)

Did a quick review, but I think I found nothing that explains the test failure. I'll take a look.

@sbueringer
Member

sbueringer commented Aug 5, 2025

@Karthik-K-N Probably this is the issue: #12572 (comment)

I used make tilt-up and then ran the autoscaler e2e test against it
(xref: https://cluster-api.sigs.k8s.io/developer/core/testing#test-execution-via-ide-1 including tips, not sure if it's 100% up-to-date)

@Karthik-K-N Karthik-K-N force-pushed the autoscale branch 2 times, most recently from 1ea99d3 to 1c0c9f1 Compare August 5, 2025 17:15
@sbueringer
Member

/test pull-cluster-api-e2e-main

@sbueringer sbueringer left a comment (Member)

@sbueringer
Member

/cherry-pick release-1.11

@k8s-infra-cherrypick-robot

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.11 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-1.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sbueringer sbueringer added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Aug 6, 2025
@sbueringer
Member

/test pull-cluster-api-e2e-main
/lgtm
/approve
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 6, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 6, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Details: Git tree hash: 3fa8f57fa8318a6e8cc39fb93bbae5f9797a78e1

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 6, 2025
@sbueringer
Member

@Karthik-K-N Thank you very much! I think this will help a lot with finding compatibility issues between Cluster API and cluster-autoscaler sooner.

@sbueringer sbueringer changed the title ✨ Add scale from 0 support for CAPD ✨ Add scale from/to 0 support for CAPD Aug 6, 2025
@k8s-ci-robot k8s-ci-robot merged commit bfb09ac into kubernetes-sigs:main Aug 6, 2025
24 of 25 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.11 milestone Aug 6, 2025
@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #12591

Details

In response to this:

/cherry-pick release-1.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


Development

Successfully merging this pull request may close these issues.

Extend CAPD to support autoscale from/to 0

5 participants