-
Notifications
You must be signed in to change notification settings - Fork 1k
[Feature] Add ResourceQuota plugin for multi-component scheduling
#6875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Add ResourceQuota plugin for multi-component scheduling
#6875
Conversation
65a7e6c to
46a3f5b
Compare
ResourceQuota plugin for EstimateComponents
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6875 +/- ##
==========================================
+ Coverage 46.26% 46.33% +0.06%
==========================================
Files 697 697
Lines 47523 47629 +106
==========================================
+ Hits 21988 22068 +80
- Misses 23878 23894 +16
- Partials 1657 1667 +10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
ResourceQuota plugin for EstimateComponentsResourceQuota plugin for multi-component scheduling
46a3f5b to
39428b9
Compare
|
Hi @RainbowMango @zhzhuang-zju @mszacillo, please help take a look when you get a chance! Thank you! |
|
Thanks |
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a ResourceQuota plugin for multi-component scheduling, a significant and valuable feature. The implementation is well-structured, with clear separation of concerns and extensive test coverage that handles many edge cases.
I've identified a couple of potential nil pointer dereferences that could lead to panics if a component's ReplicaRequirements are not provided. I've also suggested a refinement to the isSupportedResource function to make it more robust and prevent incorrect resource calculations.
Overall, this is a solid contribution. Addressing these points will further improve the robustness of the new plugin.
pkg/estimator/server/framework/plugins/resourcequota/resourcequota.go
Outdated
Show resolved
Hide resolved
pkg/estimator/server/framework/plugins/resourcequota/resourcequota.go
Outdated
Show resolved
Hide resolved
pkg/estimator/server/framework/plugins/resourcequota/resourcequota.go
Outdated
Show resolved
Hide resolved
pkg/estimator/server/framework/plugins/resourcequota/resourcequota.go
Outdated
Show resolved
Hide resolved
pkg/estimator/server/framework/plugins/resourcequota/resourcequota.go
Outdated
Show resolved
Hide resolved
f6f7e92 to
6ca8cd5
Compare
RainbowMango
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/assign
RainbowMango
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review is ongoing.
| return maxSets, framework.NewResult(framework.Noopperation, fmt.Sprintf("%s received empty components list", pl.Name())) | ||
| } | ||
|
|
||
| namespace := components[0].ReplicaRequirements.Namespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the thing that I feel is not right. Similar to #6876 (comment).
The namespace might not have to be stored inside of ReplicaRequirements. Furthermore, can we assume all components share the same namespace? If yes, can we put the namespace directly under MaxAvailableComponentSetsRequest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I observed the CRDs for SparkApplication and FlinkDeployment, and their different components do not explicitly support configuring the namespace. They only theoretically have the capability to configure the namespace because they include corev1.PodTemplateSpec.
- flink jobManager: https://github.com/apache/flink-kubernetes-operator/blob/3e3cb584a133c7ce0b310e4a1e5d986288fc0303/flink-kubernetes-operator-api/src/main/java/org/apache/flink/kubernetes/operator/api/spec/JobManagerSpec.java#L38-L47
- flink taskManager: https://github.com/apache/flink-kubernetes-operator/blob/3e3cb584a133c7ce0b310e4a1e5d986288fc0303/flink-kubernetes-operator-api/src/main/java/org/apache/flink/kubernetes/operator/api/spec/TaskManagerSpec.java#L42-L53
- spark driver and executor: https://github.com/kubeflow/spark-operator/blob/d0c8e69063e3eabf51cc5dc3052879cb5ef5a58b/api/v1beta2/sparkapplication_types.go#L419-L433
From the implementation of the Spark operator, it only focuses on the namespace of the SparkApplication. Additionally, I tried deploying a CR locally with namespaces set for both the driver and executor, and it ultimately failed to deploy. Therefore, SparkApplication does not support multi-components cross-namespace configurations.
For FlinkDeployment, I believe it is the same. @mszacillo @seanlaii, you are more knowledgeable about FlinkDeployment. Could you confirm this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed experiment and explanation!
I am not familiar with FlinkDeployment, but for RayJob, RayCluster, and RayService, they only support deploying to a single namespace.
6ca8cd5 to
00f9489
Compare
|
/retest |
| // Evaluate each priority class group separately and take the minimum across all groups. | ||
| for priorityClassName, groupComponents := range componentsByPriority { | ||
| klog.V(5).Infof("%s: evaluating %d components with priorityClassName %q", | ||
| pl.Name(), len(groupComponents), priorityClassName) | ||
|
|
||
| groupMaxSets := pl.evaluateComponentGroup(rqList, groupComponents, priorityClassName) | ||
|
|
||
| klog.V(5).Infof("%s: priorityClassName %q allows %d component sets", | ||
| pl.Name(), priorityClassName, groupMaxSets) | ||
|
|
||
| if groupMaxSets < maxSets { | ||
| maxSets = groupMaxSets | ||
| } | ||
|
|
||
| // Early exit if any priority group allows zero sets. | ||
| if maxSets == 0 { | ||
| break | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current logic in the loop might have a critical flaw in how it evaluates quota consumption across priority classes.
By evaluating each groupComponents separately against the same, unmodified rqList (the same total available quota), this can lead to over-admission.
Consider a workload with two components (A and B) under different PriorityClasses, and the total quota allows only one such component. Evaluating A against the full quota might allow 1 set. Evaluating B against the same full quota might also allow 1 set. The loop takes the minimum (maxSets = 1) and admits the workload. However, running both A and B would require resources for two components, exceeding the quota.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, the current algorithm calculates the available replicas using availableResources / requestedResource, which is convenient. However, considering that in the future we will need to pass in a batch of components requiring reservation, we will first need to subtract these reserved resources from the availableResources before performing the evaluation. Therefore, we still need to develop an algorithm that decrements the Quota accordingly. See the draft idea at #6812 (comment).
In summary, although we don't need it right now, should we prepare for future extensibility at this time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By evaluating each groupComponents separately against the same, unmodified rqList (the same total available quota), this can lead to over-admission.
+1. The workflow described in #6875 (comment) is correct in principle. However, during implementation, the first step—Determine which components match RQ's scopes—was moved before the For each ResourceQuota RQ loop, so that components are grouped by priorityClassName upfront.
Take a ResourceQuota without any scope defined as an example: it should originally compute the number of available replica sets based on the entire componentSet. But due to this change, the quota now calculates the available sets separately for each groupedComponents and then takes the minimum across groups—effectively over-constraining the result.
we will first need to subtract these reserved resources from the availableResources before performing the evaluation.
FYI, the SubResource function is specifically used to subtract these reserved resources from the availableResource. We can use it if needed.
Lines 75 to 95 in 9902350
| // SubResource is used to subtract two resources, if r < rr, set r to zero. | |
| func (r *Resource) SubResource(rr *Resource) *Resource { | |
| if r == nil || rr == nil { | |
| return r | |
| } | |
| r.MilliCPU = MaxInt64(r.MilliCPU-rr.MilliCPU, 0) | |
| r.Memory = MaxInt64(r.Memory-rr.Memory, 0) | |
| r.EphemeralStorage = MaxInt64(r.EphemeralStorage-rr.EphemeralStorage, 0) | |
| r.AllowedPodNumber = MaxInt64(r.AllowedPodNumber-rr.AllowedPodNumber, 0) | |
| for rrName, rrScalar := range rr.ScalarResources { | |
| if lifted.IsScalarResourceName(rrName) { | |
| rScalar, ok := r.ScalarResources[rrName] | |
| if ok { | |
| r.ScalarResources[rrName] = MaxInt64(rScalar-rrScalar, 0) | |
| } | |
| } | |
| } | |
| return r | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are totally correct! Sorry, I implemented in a wrong way. I have updated the implementation so that it will now first iterate the resource quota and then do the filtering based on the priority class name. I have also added a test case to validate this case.
Regarding the reservation, maybe we can have something like this:
func evaluateResourcesAgainstQuota(
availableResources corev1.ResourceList,
perSetRequirements corev1.ResourceList,
reservedResources corev1.ResourceList) int32 {
effectiveAvailable := availableResources.DeepCopy()
if reservedResources != nil {
effectiveAvailable.Sub(reserved)
}
filtered := filterConstrainedResources(effectiveAvailable, perSetRequirements)
resource := util.NewResource(effectiveAvailable)
allowed := resource.MaxDivided(filtered)
return int32(allowed)
}
Please feel free to correct me if my understanding is incorrect.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it sounds even better than what I proposed before.
What I was thinking is mostly for the estimation based on node resources, which can not go with the divide way.
00f9489 to
f300b54
Compare
2a42503 to
b6ccfc3
Compare
RainbowMango
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Looks pretty good.
Just left some warnings raised by my IDE, some of them are for the legacy code.
It would be great to get them fixed in this PR, but it shouldn't be a blocker.
I will wait for a few hours, and then move it forward.
pkg/estimator/server/framework/plugins/resourcequota/resourcequota.go
Outdated
Show resolved
Hide resolved
pkg/estimator/server/framework/plugins/resourcequota/resourcequota.go
Outdated
Show resolved
Hide resolved
pkg/estimator/server/framework/plugins/resourcequota/resourcequota.go
Outdated
Show resolved
Hide resolved
pkg/estimator/server/framework/plugins/resourcequota/resourcequota.go
Outdated
Show resolved
Hide resolved
Signed-off-by: seanlaii <[email protected]>
b6ccfc3 to
c17f8f3
Compare
Sure, rename them. Thanks! |
RainbowMango
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
I ran a quick test on my side, and it works well.
The test is similar to #6857 (comment), but I added a ResourceQuota to member1 cluster:
apiVersion: v1
kind: ResourceQuota
metadata:
name: pods-high
spec:
hard:
cpu: "0"
memory: "0Gi"
pods: "0"The ResourceQuota prevents anything from being provisioned on member1.
It works as expected:
- the final estimation result is
0fromkarmada-scheduler-estimator-member1
{"ts":1761880763617.0396,"caller":"resourcequota/resourcequota.go:198","msg":"ResourceQuotaEstimator: final estimation result: 0 component sets, status: Unschedulable","v":5}
- the job is scheduled to
member2asmember1can not accommodate it.
{"ts":1761880763651.3323,"caller":"core/generic_scheduler.go:102","msg":"Selected clusters: [{member2 100 9 member2 9}]","v":4}
{"ts":1761880763651.5518,"caller":"core/generic_scheduler.go:108","msg":"Assigned Replicas: [{member2 0}]","v":4}
{"ts":1761880763651.6294,"caller":"scheduler/scheduler.go:590","msg":"ResourceBinding(default/dk-job-job) scheduled to clusters [{member2 0}]","v":4}
PS: The log format should be adjusted and is tracked by #6880
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: RainbowMango The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Add
ResourceQuotaplugin for multi-component scheduling.Which issue(s) this PR fixes:
Part of #6734
Special notes for your reviewer:
Does this PR introduce a user-facing change?: