Skip to content

Conversation

@JesseStutler
Copy link
Contributor

@JesseStutler JesseStutler commented Apr 11, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it

Related to #496, provide the current implementation code for Gang Scheduling and Volcano Implementation. For coscheduling scheduler plugin, I can separate it in the other PR.

Which issue(s) this PR fixes

Related #407

Special notes for your reviewer

Does this PR introduce a user-facing change?

Added gang scheduling management support through the `Configuration` API. Gang scheduling can now be enabled by setting `gangSchedulingManagement.schedulerProvider` in the controller configuration file. For Helm installations, use `--set gangSchedulingManagement.schedulerProvider=[xxx]` to enable this feature. Currently, only volcano is supported. 

Validation

  1. Create a replicas=2, size=4 leaderWorkerSet:
image The pg's name is [lws name]-[group Index]-[revision] 2. Scale down to 1: image 3. Scale up to 3: image 4. Rolling Update: (1) Configure maxSurge=2, and then execute rolling update: image (2) Rolling update finished, the final state: image

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 11, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and yankay April 11, 2025 03:31
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 11, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @JesseStutler. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 11, 2025
@netlify
Copy link

netlify bot commented Apr 11, 2025

Deploy Preview for kubernetes-sigs-lws canceled.

Name Link
🔨 Latest commit a2cbb68
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-lws/deploys/6890c1a6ffda5200088112b0


func NewVolcanoProvider(client client.Client) *VolcanoProvider {
return &VolcanoProvider{
client: client,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing the defaultBaseResourceProvider right? Which would be the same for every provider

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. I add it now.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 15, 2025
@yankay
Copy link
Member

yankay commented Apr 15, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 15, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 22, 2025
@JesseStutler JesseStutler changed the title [WIP] Add Gang Scheduling Support for lws and volcano implementation example Add Gang Scheduling Support for lws and volcano implementation example Jul 22, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 22, 2025
@JesseStutler
Copy link
Contributor Author

/cc @Edwinhr716 @kerthcet
Please review it thanks, I'm still following the design in #496, using the ReplicaResourceProvider interface. The CreateHeadlessService and CreateResourceClaim methods are reserved, please check whether you want to delete these two methods and keep only the SchedulerProvider interface in this PR

@JesseStutler
Copy link
Contributor Author

JesseStutler commented Jul 22, 2025

I added a volcano implementation for you to refer to. I can also add coscheduling plugin implementation if we also need it in v0.7.0 😄

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 22, 2025
@kerthcet
Copy link
Contributor

kerthcet commented Aug 2, 2025

@JesseStutler can we finish this week? I don't want to last until next week. Thanks.

@JesseStutler
Copy link
Contributor Author

@JesseStutler can we finish this week? I don't want to last until next week. Thanks.

Sorry for the late reply, I have something to do on the weekend. Can I finish it on Monday and ask you to review again if it's okay? @kerthcet

@JesseStutler
Copy link
Contributor Author

@JesseStutler can we finish this week? I don't want to last until next week. Thanks.

@kerthcet You can review first about the e2e testing, I'll verify it on Monday. Is that OK?

Copy link
Contributor

@kerthcet kerthcet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you fix the env error in the e2e test, I'll create a new ci workflow in test-infra, so we can verify the gang scheduling tests.

@JesseStutler
Copy link
Contributor Author

/retest

@JesseStutler JesseStutler requested a review from kerthcet August 4, 2025 11:31
@JesseStutler
Copy link
Contributor Author

Once you fix the env error in the e2e test, I'll create a new ci workflow in test-infra, so we can verify the gang scheduling tests.

@kerthcet Now I have fixed all the CIs, please help me create a workflow, thanks. Does this require codes to merge?

func ExpectValidPodGroups(ctx context.Context, k8sClient client.Client, provider schedulerprovider.ProviderType, lws *leaderworkerset.LeaderWorkerSet, expectedCount int) {
gvk := getPodGroupGVK(provider)
if gvk.Empty() {
ginkgo.Skip("Unsupported scheduler provider for PodGroup validation")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return error here? Empty gvk is not allowed I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I have changed to ginkgo.Fail, please check

// UpdatePodGroupAtIndex manually updates the PodGroup for a specific group index
func UpdatePodGroupAtIndex(ctx context.Context, k8sClient client.Client, providerType schedulerprovider.ProviderType, schedulerProvider schedulerprovider.SchedulerProvider, lws *leaderworkerset.LeaderWorkerSet, groupIndex string) {
gvk := getPodGroupGVK(providerType)
if gvk.Empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I have changed to ginkgo.Fail, please check

// ExpectValidPodGroupAtIndex checks that a PodGroup exists for the specified group index and has the correct owner reference
func ExpectValidPodGroupAtIndex(ctx context.Context, k8sClient client.Client, provider schedulerprovider.ProviderType, lws *leaderworkerset.LeaderWorkerSet, groupIndex string) {
gvk := getPodGroupGVK(provider)
if gvk.Empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I have changed to ginkgo.Fail, please check

return defaultNamespace
}

// CalculatePGMinResources calculates the minimum resources needed for an entire PodGroup [1 Leader + (size-1) Worker pods]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to have a unit test for this later, it could be a follow up PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll do it in later PR, thanks for your reminder

@kerthcet
Copy link
Contributor

kerthcet commented Aug 4, 2025

/lgtm
/approve

Thanks @JesseStutler

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 4, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JesseStutler, kerthcet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 4, 2025
@k8s-ci-robot k8s-ci-robot merged commit 2d0e76b into kubernetes-sigs:main Aug 4, 2025
13 checks passed
@ardaguclu
Copy link
Contributor

@JesseStutler I was expecting changes under /config directory which aligns with the changes under /charts. Is it missing or intentional?. For example, where is clusterrole changes for volcano under config directory?

@ardaguclu
Copy link
Contributor

Currently, I don't think gang scheduler is supported via kustomize, unless user manually applies required configurations.

@JesseStutler
Copy link
Contributor Author

Currently, I don't think gang scheduler is supported via kustomize, unless user manually applies required configurations.

Thanks for your reminder @ardaguclu I was also thinking about this question. I'm not very familiar with kustomize. Are the configurations under /config enabled by default? Since gang scheduling is turned off by default, I didn't add any volcano-related configurations under /config. I don't know if kustomize can work like Helm, allowing you to specify which to enable and which to disable?

@ardaguclu
Copy link
Contributor

ardaguclu commented Aug 6, 2025

For example, if the only requirement is to define a ClusterRole, you need to add it in here https://github.com/kubernetes-sigs/lws/tree/main/config/rbac. After that you need to add this file

in commented format by mentioning "if you want to enable gang scheduling, you need to comment out this section, etc."

@JesseStutler
Copy link
Contributor Author

For example, if the only requirement is to define a ClusterRole, you need to add it in here https://github.com/kubernetes-sigs/lws/tree/main/config/rbac. After that you need to add this file

in commented format by mentioning "if you want to enable gang scheduling, you need to comment out this section, etc."

Good idea, thanks! @ardaguclu I will add it in later PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants