Implements maxAvailableComponentSets for accurate estimator #6876

mszacillo · 2025-10-27T02:56:20Z

What type of PR is this?
/kind feature

What this PR does / why we need it:
Implements maxAvailableComponentSets for the accurate estimator. I have a follow-up commented as a TODO in the implementation for the conversion of the node resource estimation into a plugin.

Which issue(s) this PR fixes:
Part of #6734

Special notes for your reviewer:
Tested E2E using code from #6857. At the moment the estimator returns 0 for the component set estimation, but this is expected since I have not linked this together with the resourcequota plugin (which is under review).

I1027 02:35:43.329736       1 server.go:240] Begin calculating available component sets of resource(kind=FlinkDeployment, name=s-spaaseng/a-5-ticker-demo), request: pb.MaxAvailableComponentSetsRequest{
    Cluster:    "spaas-kaas-tt-dev",
    Components: {
        {
            Name:                "jobmanager",
            ReplicaRequirements: pb.ReplicaRequirements{
                NodeClaim: &pb.NodeClaim{
                    NodeAffinity: (*v1.NodeSelector)(nil),
                    NodeSelector: {},
                    Tolerations:  {
                        {
                            Key:               "node.kubernetes.io/unreachable",
                            Operator:          "Exists",
                            Value:             "",
                            Effect:            "NoExecute",
                            TolerationSeconds: &int64(5),
                        },
                        {
                            Key:               "node.kubernetes.io/not-ready",
                            Operator:          "Exists",
                            Value:             "",
                            Effect:            "NoExecute",
                            TolerationSeconds: &int64(5),
                        },
                    },
                },
                ResourceRequest: {
                    "cpu": {
                        i:      resource.int64Amount{value:500, scale:-3},
                        d:      resource.infDecAmount{},
                        s:      "500m",
                        Format: "DecimalSI",
                    },
                    "memory": {
                        i:      resource.int64Amount{value:2048, scale:-3},
                        d:      resource.infDecAmount{},
                        s:      "2.048",
                        Format: "DecimalSI",
                    },
                },
                Namespace:         "s-spaaseng",
                PriorityClassName: "",
            },
            Replicas: 1,
        },
        {
            Name:                "taskmanager",
            ReplicaRequirements: pb.ReplicaRequirements{
                NodeClaim: &pb.NodeClaim{
                    NodeAffinity: (*v1.NodeSelector)(nil),
                    NodeSelector: {},
                    Tolerations:  {
                        {
                            Key:               "node.kubernetes.io/unreachable",
                            Operator:          "Exists",
                            Value:             "",
                            Effect:            "NoExecute",
                            TolerationSeconds: &int64(5),
                        },
                        {
                            Key:               "node.kubernetes.io/not-ready",
                            Operator:          "Exists",
                            Value:             "",
                            Effect:            "NoExecute",
                            TolerationSeconds: &int64(5),
                        },
                    },
                },
                ResourceRequest: {
                    "cpu": {
                        i:      resource.int64Amount{value:500, scale:-3},
                        d:      resource.infDecAmount{},
                        s:      "500m",
                        Format: "DecimalSI",
                    },
                    "memory": {
                        i:      resource.int64Amount{value:2048, scale:-3},
                        d:      resource.infDecAmount{},
                        s:      "2.048",
                        Format: "DecimalSI",
                    },
                },
                Namespace:         "s-spaaseng",
                PriorityClassName: "",
            },
            Replicas: 1,
        },
    },
}
I1027 02:35:43.330091       1 server.go:247] Finish calculating cluster available component sets of resource(kind=FlinkDeployment, name=s-spaaseng/a-5-ticker-demo), max component sets: 0, time elapsed: 1.688997ms

Does this PR introduce a user-facing change?:

`karmada-scheduler`/`karmada-scheduler-estimator`: Implemented maxAvailableComponentSets for the accurate estimator.

gemini-code-assist · 2025-10-27T02:56:33Z

Summary of Changes

Hello @mszacillo, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant feature to the estimator, allowing it to accurately determine the maximum number of multi-component sets that can be scheduled across various clusters. This enhancement is crucial for advanced resource allocation and scheduling decisions, providing a more granular and comprehensive view of cluster capacity beyond single replica estimations.

Highlights

Feature Implementation: This pull request implements the maxAvailableComponentSets functionality for the accurate estimator, enabling it to calculate the maximum number of complete multi-component sets a cluster can host.
Client-Side Logic: The client-side (pkg/estimator/client/accurate.go) now includes the MaxAvailableComponentSets method, which orchestrates concurrent estimation requests across multiple clusters and handles the conversion of API types to protobuf messages for gRPC communication.
Server-Side Estimation: On the server-side (pkg/estimator/server/estimate.go and pkg/estimator/server/server.go), new functions EstimateComponents and estimateComponents have been added to process incoming requests, update the scheduler cache snapshot, and run estimation plugins to determine the maximum component sets.
Metrics Integration: A new metric type, EstimatingTypeMaxAvailableComponentSets, has been introduced in pkg/estimator/server/metrics/metrics.go to track the performance and usage of this new estimation capability.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request implements the maxAvailableComponentSets functionality for the accurate estimator, which is a great addition. The changes are well-structured, with corresponding client and server implementations, as well as metrics.

My review focuses on a couple of points in the server-side estimation logic. I've identified a potential bug where the estimation would always return zero, and a minor naming issue that could affect code clarity. The proposed suggestions aim to fix the bug and improve maintainability. Overall, this is a solid contribution.

pkg/estimator/server/estimate.go

codecov-commenter · 2025-10-27T03:12:38Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 98 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.14%. Comparing base (f60f341) to head (621f845).
⚠️ Report is 4 commits behind head on master.

Files with missing lines	Patch %	Lines
pkg/estimator/client/accurate.go	0.00%	53 Missing ⚠️
pkg/estimator/server/estimate.go	0.00%	23 Missing ⚠️
pkg/estimator/server/server.go	0.00%	22 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6876      +/-   ##
==========================================
- Coverage   46.24%   46.14%   -0.11%     
==========================================
  Files         692      692              
  Lines       47194    47288      +94     
==========================================
- Hits        21826    21819       -7     
- Misses      23715    23816     +101     
  Partials     1653     1653

Flag	Coverage Δ
unittests	`46.14% <0.00%> (-0.11%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

mszacillo · 2025-10-27T17:15:57Z

@RainbowMango @zhzhuang-zju @seanlaii Should be ready for review!

RainbowMango

/assign

RainbowMango

/lgtm

RainbowMango · 2025-10-28T03:26:23Z

pkg/estimator/client/accurate.go

+// toPBReplicaRequirements converts the API ComponentReplicaRequirements to the pb.ReplicaRequirements value.
+func toPBReplicaRequirements(cr *workv1alpha2.ComponentReplicaRequirements, namespace string) pb.ReplicaRequirements {
+	var out pb.ReplicaRequirements
+	out.Namespace = namespace
+	if cr == nil {
+		return out
+	}
+	out.ResourceRequest = cr.ResourceRequest
+	out.PriorityClassName = cr.PriorityClassName
+	if cr.NodeClaim != nil {
+		out.NodeClaim = &pb.NodeClaim{
+			NodeAffinity: cr.NodeClaim.HardNodeAffinity,
+			NodeSelector: cr.NodeClaim.NodeSelector,
+			Tolerations:  cr.NodeClaim.Tolerations,
+		}
+	}
+	return out
+}


This is correct and not a blocker. But we should give it another look at the definition of the gRPC data structure. The conversion should be more straightforward.

@seanlaii Do you have any idea?

PS:
I added a follow-up task to #6734 for revising the design here.

pkg/estimator/server/estimate.go

Signed-off-by: mszacillo <[email protected]>

RainbowMango

/lgtm
/approve

karmada-bot · 2025-10-28T04:10:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/estimator/OWNERS~~ [RainbowMango]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

RainbowMango · 2025-10-28T07:56:27Z

@zhzhuang-zju Can you help look at the failing tests?

zhzhuang-zju · 2025-10-28T09:30:14Z

Failed Test Scenario:
After deploying a Deployment requesting cpu:10 to member cluster member1, the test checks whether the allUsed field of the quota equals cpu:10, but it is actually empty.

Event Timeline from Logs:

The federated-resource-quota-enforcement-controller receives the FederatedResourceQuota (FRQ) creation event. It lists all ResourceBindings (RBs) in the namespace, but at this moment rb.spec.clusters is empty, so it attempts to update the quota’s allUsed to an empty value.
Concurrently, the webhook tries to update the quota’s allUsed to cpu:10.

The webhook succeeds first:

2025-10-28T04:28:10.639030322Z stderr F {"ts":1761625690638.9297,"caller":"resourcebinding/validating.go:237","msg":"Updating status of FRQ karmadatest-hxsjh/frq-6pnxg with OverallUsed: map[cpu:{{10 -3} {<nil>}  DecimalSI}] for RB karmadatest-hxsjh/deploy-ctz9t-deployment","v":4}
2025-10-28T04:28:10.650167067Z stderr F {"ts":1761625690650.0215,"caller":"resourcebinding/validating.go:250","msg":"Successfully updated status of FRQ karmadatest-hxsjh/frq-6pnxg for RB karmadatest-hxsjh/deploy-ctz9t-deployment.","v":2}

The federated-resource-quota-enforcement-controller then fails with a conflict error and retries. During the retry, it fetches an outdated version of the ResourceBinding, and thus updates the quota’s allUsed back to empty:

2025-10-28T04:28:10.533934235Z stderr F {"ts":1761625690140.0073,"caller":"federatedresourcequota/federated_resource_quota_enforcement_controller.go:69","msg":"QuotaEnforcementController reconciling","v":4,"namespacedName":"karmadatest-hxsjh/frq-6pnxg"}
2025-10-28T04:28:10.631017162Z stderr F {"ts":1761625690630.2998,"caller":"federatedresourcequota/federated_resource_quota_enforcement_controller.go:240","msg":"Collecting FederatedResourceQuota status using ResourceBindings.","v":4}
2025-10-28T04:28:10.688433682Z stderr F {"ts":1761625690686.8625,"caller":"federatedresourcequota/federated_resource_quota_enforcement_controller.go:86","msg":"Failed to collect status for FederatedResourceQuota","federatedResourceQuota":"karmadatest-hxsjh/frq-6pnxg","err":"Operation cannot be fulfilled on federatedresourcequotas.policy.karmada.io \"frq-6pnxg\": the object has been modified; please apply your changes to the latest version and try again"}
2025-10-28T04:28:10.69834722Z stderr F {"ts":1761625690698.286,"caller":"federatedresourcequota/federated_resource_quota_enforcement_controller.go:69","msg":"QuotaEnforcementController reconciling","v":4,"namespacedName":"karmadatest-hxsjh/frq-6pnxg"}
2025-10-28T04:28:10.698493112Z stderr F {"ts":1761625690698.436,"caller":"federatedresourcequota/federated_resource_quota_enforcement_controller.go:240","msg":"Collecting FederatedResourceQuota status using ResourceBindings.","v":4}

At this point, the CI checks the quota and observes that allUsed does not match the expected value, causing the test to fail.

  In [BeforeEach] at: /home/runner/work/karmada/karmada/test/e2e/suites/base/federatedresourcequota_test.go:349 @ 10/28/25 04:28:13.927

Summary:
This is a very low-probability race condition that occurs only when both the webhook and the federated-resource-quota-enforcement-controller are triggered almost simultaneously, and the controller happens to read an outdated version of the ResourceBinding.

Is this a bug?
I don't think so. The quota system is eventually consistent, and this issue is resolved automatically by the recalculation mechanism. Kubernetes’ native ResourceQuota exhibits similar behavior (see: kubernetes/kubernetes#123434 (comment)).

Can the test be hardened?
Yes. The e2e test can wait until the federated-resource-quota-enforcement-controller has successfully updated the quota before proceeding to deploy the workload.

/retest

karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 27, 2025

karmada-bot requested review from Garrybest and RainbowMango October 27, 2025 02:56

karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 27, 2025

gemini-code-assist bot reviewed Oct 27, 2025

View reviewed changes

pkg/estimator/server/estimate.go Outdated Show resolved Hide resolved

pkg/estimator/server/estimate.go Outdated Show resolved Hide resolved

mszacillo force-pushed the accurate-estimator-impl branch 3 times, most recently from f9c29e1 to ed013e5 Compare October 27, 2025 16:55

RainbowMango reviewed Oct 28, 2025

View reviewed changes

karmada-bot assigned RainbowMango Oct 28, 2025

RainbowMango reviewed Oct 28, 2025

View reviewed changes

karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 28, 2025

RainbowMango mentioned this pull request Oct 28, 2025

[Umbrella] Multi-components workload scheduling - phase II #6734

Open

22 tasks

Implements maxAvailableComponentSets for accurate estimator

621f845

Signed-off-by: mszacillo <[email protected]>

mszacillo force-pushed the accurate-estimator-impl branch from ed013e5 to 621f845 Compare October 28, 2025 03:54

karmada-bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 28, 2025

RainbowMango added this to the v1.16 milestone Oct 28, 2025

RainbowMango approved these changes Oct 28, 2025

View reviewed changes

karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 28, 2025

karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 28, 2025

karmada-bot merged commit ea144d1 into karmada-io:master Oct 28, 2025
33 of 34 checks passed

RainbowMango mentioned this pull request Oct 28, 2025

[Feature] Add ResourceQuota plugin for multi-component scheduling #6875

Merged

zhzhuang-zju mentioned this pull request Oct 28, 2025

Flaky E2E tests – tracking intermittent failures #6841

Open

6 tasks

zhzhuang-zju mentioned this pull request Nov 5, 2025

flaky test: deploy workloads should be rejected in case of no enough quota left #6907

Merged

Implements maxAvailableComponentSets for accurate estimator #6876

Implements maxAvailableComponentSets for accurate estimator #6876

Uh oh!

Conversation

mszacillo commented Oct 27, 2025 • edited by RainbowMango Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot commented Oct 27, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Oct 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mszacillo commented Oct 27, 2025

Uh oh!

RainbowMango left a comment

Choose a reason for hiding this comment

Uh oh!

RainbowMango left a comment

Choose a reason for hiding this comment

Uh oh!

RainbowMango Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

RainbowMango Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

RainbowMango left a comment

Choose a reason for hiding this comment

Uh oh!

karmada-bot commented Oct 28, 2025

Uh oh!

RainbowMango commented Oct 28, 2025

Uh oh!

zhzhuang-zju commented Oct 28, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

mszacillo commented Oct 27, 2025 •

edited by RainbowMango

Loading

codecov-commenter commented Oct 27, 2025 •

edited

Loading