
Conversation

@RainbowMango RainbowMango commented Oct 20, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces support for multiple component estimation in the scheduler by adding logic to handle workloads with multiple pod templates that need to be scheduled to a single cluster. The feature is gated behind MultiplePodTemplatesScheduling and uses a new MaxAvailableComponentSets estimator API.

Which issue(s) this PR fixes:
Part of #6734

Special notes for your reviewer:
See test report below: #6857 (comment).

Currently, since replicas are not set for multi-template workloads, the system does not perform actual replica distribution. As a result, the ResourceBinding does not show a per-cluster replica count, which makes debugging and validation difficult. This behavior needs further improvement: the current replica-allocation logic relies solely on checking whether replicas > 0, a condition that is too fragile and insufficient for multi-template scenarios.
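To illustrate why a replicas > 0 guard is fragile, here is a minimal Go sketch of an applicability check that keys off the component count instead. The `Component` type and `multiTemplateApplies` helper are hypothetical illustrations, not the PR's actual code:

```go
package main

import "fmt"

// Component is a hypothetical stand-in for a pod-template component in a
// ResourceBinding spec: a name plus its desired replica count.
type Component struct {
	Name     string
	Replicas int32
}

// multiTemplateApplies sketches an applicability check based on the number
// of components rather than the fragile replicas > 0 condition: a workload
// is multi-template exactly when it carries more than one component,
// regardless of any per-component replica count.
func multiTemplateApplies(components []Component) bool {
	return len(components) > 1
}

func main() {
	single := []Component{{Name: "web", Replicas: 3}}
	multi := []Component{
		{Name: "job-nginx1", Replicas: 1},
		{Name: "job-nginx2", Replicas: 3},
	}
	fmt.Println(multiTemplateApplies(single)) // prints false
	fmt.Println(multiTemplateApplies(multi))  // prints true
}
```

Note that with this shape of check, a component whose replicas happen to be 0 still counts toward applicability, which the replicas > 0 condition would miss.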

Additionally, we should add comprehensive end-to-end (E2E) tests for multi-template workloads to ensure correctness and robustness.

Does this PR introduce a user-facing change?:

`karmada-scheduler`: Enabled multiple component estimation in the scheduler. The feature is gated behind `MultiplePodTemplatesScheduling`.

@karmada-bot karmada-bot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Oct 20, 2025
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 20, 2025
codecov-commenter commented Oct 20, 2025


Codecov Report

❌ Patch coverage is 70.37037% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.28%. Comparing base (f60f341) to head (39bfa3c).
⚠️ Report is 4 commits behind head on master.

Files with missing lines:
  • pkg/scheduler/core/util.go: patch coverage 5.88%, 16 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6857      +/-   ##
==========================================
+ Coverage   46.24%   46.28%   +0.03%     
==========================================
  Files         692      693       +1     
  Lines       47194    47237      +43     
==========================================
+ Hits        21826    21864      +38     
- Misses      23715    23721       +6     
+ Partials     1653     1652       -1     
Flag Coverage Δ
unittests 46.28% <70.37%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown.
@RainbowMango

/retest

@RainbowMango RainbowMango added this to the v1.16 milestone Oct 20, 2025
@RainbowMango RainbowMango force-pushed the pr_enable_multiple_template_scheduling branch 3 times, most recently from a640c75 to 45aa42b on October 22, 2025 04:28
@RainbowMango RainbowMango marked this pull request as ready for review October 22, 2025 04:29
Copilot AI review requested due to automatic review settings October 22, 2025 04:29
@karmada-bot karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 22, 2025
@karmada-bot karmada-bot requested a review from mrlihanbo October 22, 2025 04:29

Copilot AI left a comment


Pull Request Overview

This PR introduces support for multiple component estimation in the scheduler by adding logic to handle workloads with multiple pod templates that need to be scheduled to a single cluster. The feature is gated behind MultiplePodTemplatesScheduling and uses a new MaxAvailableComponentSets estimator API.

Key Changes:

  • Added conditional logic to use component set estimation for multi-template workloads
  • Implemented validation to determine when multi-template scheduling applies
  • Added comprehensive test coverage for the new estimation logic

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Reviewed files:
  • pkg/scheduler/core/util.go: integrates multi-template estimation into the main replica-calculation flow behind a feature gate check
  • pkg/scheduler/core/estimation.go: implements the core logic for multi-template scheduling validation and available-set calculation
  • pkg/scheduler/core/estimation_test.go: provides test coverage for multi-template scheduling applicability and calculation scenarios


@RainbowMango RainbowMango force-pushed the pr_enable_multiple_template_scheduling branch from 45aa42b to b9c71cb on October 22, 2025 06:35
@karmada-bot karmada-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 22, 2025
@RainbowMango

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the capability for multiple component estimation in the scheduler, which is a great feature. The changes are well-structured, with new logic encapsulated in estimation.go and corresponding tests.

I have a few suggestions to improve the implementation:

  • The applicability check for multi-template scheduling seems to have a logic error regarding the number of components, and a test case has a misleading name.
  • There's an opportunity to optimize a loop in calculateMultiTemplateAvailableSets for better performance.
  • The legacy replica calculation logic, which is now in an else block, has a potential bug related to cluster ordering that should be addressed for robustness.

@RainbowMango RainbowMango force-pushed the pr_enable_multiple_template_scheduling branch from b9c71cb to c779158 on October 22, 2025 07:34
@RainbowMango RainbowMango force-pushed the pr_enable_multiple_template_scheduling branch 2 times, most recently from 2f3fc59 to 3c4af68 on October 27, 2025 04:41
@RainbowMango RainbowMango force-pushed the pr_enable_multiple_template_scheduling branch from 3c4af68 to 39bfa3c on October 27, 2025 06:37

RainbowMango commented Oct 27, 2025

A basic manual test shows that the current patch can call the new estimator and propagate the workloads correctly.

The following tests were run with a Volcano Job, as we only recently implemented its default interpreter.

1: Install Volcano Job CRD
Apply the CRD on the Karmada control plane and on all member clusters with:

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/refs/heads/master/installer/helm/chart/volcano/crd/bases/batch.volcano.sh_jobs.yaml

2: Create a PropagationPolicy for Volcano Job:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: foo
spec:
  resourceSelectors:
    - apiVersion: batch.volcano.sh/v1alpha1
      kind: Job
      name: dk-job
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    replicaScheduling:
      replicaDivisionPreference: Aggregated    # declares that replicas should be aggregated
      replicaSchedulingType: Divided
    spreadConstraints:                         # but restricts placement to a single cluster
      - spreadByField: cluster
        minGroups: 1
        maxGroups: 1

3: Create a Volcano Job:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: dk-job
spec:
  maxRetry: 3
  minAvailable: 3
  plugins:
    env: []
    ssh: []
    svc:
    - --disable-network-policy=true
  queue: default
  schedulerName: volcano
  tasks:
  - minAvailable: 1
    name: job-nginx1
    replicas: 1
    template:
      metadata:
        name: nginx1
      spec:
        containers:
        - args:
          - sleep 10
          command:
          - bash
          - -c
          image: nginx:latest
          imagePullPolicy: IfNotPresent
          name: nginx
          resources:
            requests:
              cpu: 100m
        nodeSelector:
          kubernetes.io/os: linux
        restartPolicy: OnFailure
  - minAvailable: 2
    name: job-nginx2
    replicas: 3
    template:
      metadata:
        name: nginx2
      spec:
        containers:
        - args:
          - sleep 30
          command:
          - bash
          - -c
          image: nginx:latest
          imagePullPolicy: IfNotPresent
          name: nginx
          resources:
            requests:
              cpu: 100m
        nodeSelector:
          kubernetes.io/os: linux
        restartPolicy: OnFailure

4: Check the propagation state:

-bash-5.0# karmadactl get jobs.batch.volcano.sh --operation-scope=all
NAME     CLUSTER   STATUS   MINAVAILABLE   RUNNINGS   AGE   ADOPTION
dk-job   Karmada                                      34m   -
dk-job   member1                                      34m   Y

This shows that the job has been scheduled to member1.

5: Check the scheduling result from the ResourceBinding:

apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: dk-job-job
spec:
  clusters:
  - name: member1   # without replicas assigned
  components:
  - name: job-nginx1
    replicaRequirements:
      nodeClaim:
        nodeSelector:
          kubernetes.io/os: linux
      resourceRequest:
        cpu: 100m
    replicas: 1
  - name: job-nginx2
    replicaRequirements:
      nodeClaim:
        nodeSelector:
          kubernetes.io/os: linux
      resourceRequest:
        cpu: 100m
    replicas: 2
  conflictResolution: Abort
  placement:
    clusterAffinity:
      clusterNames:
      - member1
      - member2
    replicaScheduling:
      replicaDivisionPreference: Aggregated
      replicaSchedulingType: Divided
    spreadConstraints:
    - maxGroups: 1
      minGroups: 1
      spreadByField: cluster
  resource:
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    name: dk-job
    namespace: default
    resourceVersion: "2405"
    uid: 56ef1978-a1fc-44fa-a9ad-084a988d2f2b
  schedulerName: default-scheduler
status:
  aggregatedStatus:
  - applied: true
    clusterName: member1
    health: Unhealthy
    status: {}
  conditions:
  - lastTransitionTime: "2025-10-27T06:55:11Z"
    message: Binding has been scheduled successfully.
    reason: Success
    status: "True"
    type: Scheduled
  - lastTransitionTime: "2025-10-27T06:55:11Z"
    message: All works have been successfully applied
    reason: FullyAppliedSuccess
    status: "True"
    type: FullyApplied
  lastScheduledTime: "2025-10-27T06:55:11Z"
  schedulerObservedGeneration: 2

6: Check scheduler log:

{"ts":1761548111152.9927,"caller":"core/generic_scheduler.go:96","msg":"Feasible clusters scores: [{member1 100} {member2 0}]","v":4}
{"ts":1761548111153.1296,"caller":"core/estimation.go:97","msg":"The estimator(scheduler-estimator) missed estimation from cluster(member1) when estimating for workload(batch.volcano.sh/v1alpha1, kind=Job, default/dk-job).","v":0}
{"ts":1761548111153.307,"caller":"core/estimation.go:97","msg":"The estimator(scheduler-estimator) missed estimation from cluster(member2) when estimating for workload(batch.volcano.sh/v1alpha1, kind=Job, default/dk-job).","v":0}
{"ts":1761548111154.4978,"caller":"core/util.go:112","msg":"Target cluster calculated by estimators (available cluster && maxAvailableReplicas): [{member1 9} {member2 9}]","v":4}
{"ts":1761548111154.7004,"caller":"core/generic_scheduler.go:102","msg":"Selected clusters: [{member1 100 9 member1 9}]","v":4}
{"ts":1761548111154.7827,"caller":"core/generic_scheduler.go:108","msg":"Assigned Replicas: [{member1 0}]","v":4}
{"ts":1761548111154.852,"caller":"scheduler/scheduler.go:590","msg":"ResourceBinding(default/dk-job-job) scheduled to clusters [{member1 0}]","v":4}

This shows that the general estimator calculated the available replicas ({member1 9} {member2 9}), while the accurate estimator missed the estimation because it has not yet been implemented.
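For intuition, the idea behind component-set estimation can be sketched in a few lines of Go. This is not the actual MaxAvailableComponentSets API; the `componentDemand` type, the CPU-only model, and the numbers below are illustrative assumptions:

```go
package main

import "fmt"

// componentDemand is a hypothetical per-component demand: how many replicas
// one complete "set" of the workload needs, and the CPU request (millicores)
// of each replica. A real estimator would also consider memory, other
// resources, and node-level packing.
type componentDemand struct {
	replicas int64
	cpuMilli int64
}

// maxAvailableComponentSets computes how many complete component sets fit
// into a cluster's free CPU: the total CPU cost of one set bounds the count.
func maxAvailableComponentSets(freeCPUMilli int64, components []componentDemand) int64 {
	var perSet int64
	for _, c := range components {
		perSet += c.replicas * c.cpuMilli
	}
	if perSet == 0 {
		return 0
	}
	return freeCPUMilli / perSet
}

func main() {
	// Mirrors the Volcano Job above: one task with 1 replica and one with
	// 3 replicas, each replica requesting 100m CPU, so one set costs 400m.
	comps := []componentDemand{
		{replicas: 1, cpuMilli: 100},
		{replicas: 3, cpuMilli: 100},
	}
	fmt.Println(maxAvailableComponentSets(4000, comps)) // prints 10
}
```

The key difference from single-template estimation is that all components of a set must fit on the same cluster together, so the estimate is driven by the combined demand of one set rather than by any single template's replicas.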

@RainbowMango
Copy link
Member Author

@mszacillo @zhzhuang-zju @seanlaii
I guess this is ready for review now.

@RainbowMango RainbowMango added approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. labels Oct 30, 2025
@karmada-bot karmada-bot merged commit 9902350 into karmada-io:master Oct 30, 2025
24 checks passed
@karmada-bot

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@RainbowMango RainbowMango deleted the pr_enable_multiple_template_scheduling branch October 30, 2025 08:02