
Conversation

@lowang-bh (Member) commented Apr 6, 2024

/kind feature
fixes #2984
fixes #447
fixes #3317
There are several issues requesting this feature, such as #447, #2984, and #3317.

Motivation

We aim to make the scheduler network-topology aware so as to achieve the following:

  • Best effort to schedule all tasks of the same job onto devices in the same topology domain, such as the same IDC.

Goals

  • Support single-key topology configuration: try to schedule all of a job's tasks to nodes that have the same value for that key.
  • Support multiple-key topology policies: keys at the front of the list get a higher score (a minimal sketch follows the list).
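
For illustration, here is a minimal Go sketch of that front-of-list weighting; the key names and the scoreNode helper are hypothetical examples, not the plugin's actual code:

```go
package main

import "fmt"

// topoKeys is the configured topology key list; keys earlier in the
// list carry more weight (hypothetical example configuration).
var topoKeys = []string{"idc", "rack", "switch"}

// scoreNode returns a higher score when the node's label value for an
// earlier-listed key matches the value already chosen for the job.
// jobValues would be derived from the job's previously placed tasks.
func scoreNode(nodeLabels, jobValues map[string]string) float64 {
	weight := float64(len(topoKeys))
	var score float64
	for _, key := range topoKeys {
		if v, ok := nodeLabels[key]; ok && v == jobValues[key] {
			score += weight
		}
		weight-- // keys further back in the list contribute less
	}
	return score
}

func main() {
	node := map[string]string{"idc": "idc-a", "switch": "sw-3"}
	job := map[string]string{"idc": "idc-a", "switch": "sw-1"}
	fmt.Println(scoreNode(node, job)) // only the idc key matches
}
```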

Non-Goals

  • Not to find a globally optimal solution among nodes across all values of that key.

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 6, 2024
@lowang-bh lowang-bh changed the title from "Network aware" to "Network Topology Aware Plugin" Apr 6, 2024
@lowang-bh (Member, Author)

/assign @Monokaix @hwdef @william-wang

@lowang-bh lowang-bh closed this Apr 7, 2024
@lowang-bh lowang-bh reopened this Apr 7, 2024
@volcano-sh-bot volcano-sh-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 7, 2024
@hwdef (Member) commented Apr 8, 2024

We should talk about this in the weekly meeting.

@Thor-wl (Contributor) commented Apr 11, 2024

Sounds interesting! But it may be more complex than the given design. For example, network delay varies between different nodes. Also, it varies across time periods for the same node. Maybe considering network performance metrics for this feature would be a good choice. I think we should have a discussion in the community and complete the design first.

@Thor-wl Thor-wl requested review from hwdef and william-wang and removed request for archlitchi and hudson741 April 11, 2024 02:31

@lowang-bh (Member, Author)

Maybe considering network performance metrics for this feature would be a good choice.

I would recommend treating the network performance metrics as a kind of load, so it's more like load-aware scheduling.

network delay varies between different nodes

That is not considered for now. This plugin just considers the physical differences in topology.

@lowang-bh (Member, Author) commented Apr 11, 2024

https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/networkaware?

I think it is a lite version of that plugin; it just considers several physical topology levels, such as IDC, rack, and switch, and depends on those labels on nodes.

Advantage: simpler to use, relying only on node labels.
Shortcoming: no bandwidth or latency awareness, etc.

@Monokaix (Member)

https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/networkaware?

I think it is a lite version of that plugin; it just considers several physical topology levels, such as IDC, rack, and switch, and depends on those labels on nodes.

Advantage: simpler to use, relying only on node labels. Shortcoming: no bandwidth or latency awareness, etc.

We should collect more use cases: )

3. If a node has multiple keys matching the configured list, the first key that matches gets a higher score

```go
nodeOrderFn := func(task *api.TaskInfo, node *api.NodeInfo) (float64, error) {
```
@Monokaix (Member) commented Apr 11, 2024

Per the original user demand, a job should occupy nodes exclusively, so a nodeOrderFn alone seems unable to satisfy the original use case.

@lowang-bh (Member, Author)

You mean a predicateFn is needed?
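
For illustration, a minimal sketch of what such a hard filter could look like; it uses simplified stand-ins for Volcano's api.TaskInfo and api.NodeInfo rather than the real types, and the single topology key is a hypothetical configuration:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for Volcano's api.TaskInfo / api.NodeInfo.
type taskInfo struct{ jobTopoValue string } // topology value pinned by the job's first placed task
type nodeInfo struct{ labels map[string]string }

const topoKey = "idc" // hypothetical single-key configuration

// predicateFn rejects nodes whose topology value differs from the one the
// job is already pinned to, turning the soft nodeOrder preference into a
// hard constraint.
func predicateFn(task *taskInfo, node *nodeInfo) error {
	if task.jobTopoValue == "" {
		return nil // first task of the job: any topology domain is acceptable
	}
	if node.labels[topoKey] != task.jobTopoValue {
		return errors.New("node is outside the job's topology domain")
	}
	return nil
}

func main() {
	task := &taskInfo{jobTopoValue: "idc-a"}
	fmt.Println(predicateFn(task, &nodeInfo{labels: map[string]string{topoKey: "idc-b"}}))
}
```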

@lowang-bh lowang-bh closed this Apr 11, 2024
@lowang-bh lowang-bh reopened this Apr 11, 2024
@wangyizhi1

The first task returning a score of 0 for all nodes might cause issues. I suggest comparing the remaining nodes in each topology to see if they meet the job's requirements and assign a corresponding score. This score should be proportional to the maximum number of tasks that can be accommodated.

Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?
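
For illustration, a minimal sketch of that capacity-proportional idea, with hypothetical free-slot and domain inputs:

```go
package main

import "fmt"

// domainScores rates each topology domain by how much of the job's
// remaining tasks its free slots can hold (1.0 = fits the whole job).
func domainScores(freeSlots map[string]int, nodeDomain map[string]string, tasksNeeded int) map[string]float64 {
	capacity := map[string]int{}
	for node, slots := range freeSlots {
		capacity[nodeDomain[node]] += slots
	}
	scores := map[string]float64{}
	for domain, c := range capacity {
		if c > tasksNeeded {
			c = tasksNeeded // headroom beyond the job's size does not help
		}
		scores[domain] = float64(c) / float64(tasksNeeded)
	}
	return scores
}

func main() {
	free := map[string]int{"n1": 2, "n2": 1, "n3": 4}
	domains := map[string]string{"n1": "leaf-1", "n2": "leaf-1", "n3": "leaf-2"}
	fmt.Println(domainScores(free, domains, 4)) // leaf-1: 0.75, leaf-2: 1
}
```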

@elinx commented Jul 30, 2024

We have some thoughts on network-topology-aware scheduling as well, but limited to the same IDC.

User Story

Say we have a 10,000-GPU cluster where each node is connected to a leaf switch through IB or RoCE. Each leaf switch connects to spine switches, and there might be multiple paths between leaf switches. The network topology is a fat-tree.

End users would like to submit an LLM training job to the cluster. Typically, the job is an MPI job and the communication between nodes is frequent. In most cases, the communication between nodes in the same leaf switch is faster than the communication between nodes in different leaf switches (fewer jumps, less latency).

End users would like to schedule the tasks to the nodes that have optimal network connectivity at the moment. For example, if the tasks are scheduled to the nodes that are connected to the same leaf switch, the communication between the tasks is faster and the job can finish earlier.

We want to schedule the tasks to the nodes that are connected to the same leaf switch, or at least to the nodes that are connected to the same spine switch. This can reduce the network latency and improve the performance.

Design

We can add a new plugin to the scheduler to support network topology-aware scheduling. The plugin can get the network topology information from the network devices and use the information to schedule the tasks.

Network Topology Information

The network topology information can be stored in a configmap. The information includes the network topology, the network devices, the connections between the devices, and the latency between the devices.

Similar to how Slurm picks topology-aware nodes, we can use certain tools to discover the live network topology. For example, we can use ibnetdiscover to discover the InfiniBand network topology.

We can have another controller to watch the node changes and spin up a job to discover the network topology on the nodes with IB. The controller can update the network topology information in the configmap.

Network Topology Aware Plugin

The network topology-aware plugin can get the network topology information from the configmap and use the information to schedule the tasks. It works like a score plugin to score the nodes based on the network topology information and pick the best node for a task.
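
For illustration, a minimal sketch of such scoring, assuming the fat-tree has already been parsed out of the ConfigMap into simple node-to-leaf and leaf-to-spine maps (all names hypothetical):

```go
package main

import "fmt"

// Fat-tree topology as parsed from the hypothetical ConfigMap:
// node -> leaf switch, leaf switch -> spine switch.
var nodeLeaf = map[string]string{"n1": "leaf1", "n2": "leaf1", "n3": "leaf2"}
var leafSpine = map[string]string{"leaf1": "spine1", "leaf2": "spine1"}

// score prefers candidates on the same leaf as already-placed tasks,
// then the same spine, then anything else (fewer hops, lower latency).
func score(candidate string, placed []string) float64 {
	best := 0.0
	for _, p := range placed {
		switch {
		case nodeLeaf[candidate] == nodeLeaf[p]:
			return 100 // same leaf switch: shortest path
		case leafSpine[nodeLeaf[candidate]] == leafSpine[nodeLeaf[p]]:
			best = 50 // same spine: reachable via leaf-spine-leaf
		}
	}
	return best
}

func main() {
	fmt.Println(score("n2", []string{"n1"})) // 100: same leaf
	fmt.Println(score("n3", []string{"n1"})) // 50: same spine
}
```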

Initial Code Change

Please see: https://github.com/yeahdongcn/volcano/tree/topo

This is a PoC to show how to add a network topology-aware plugin to the scheduler. See https://github.com/yeahdongcn/volcano/blob/topo/pkg/scheduler/actions/allocate/allocate_test.go#L160 for the test case.

If anyone is interested in this topic, we can have further discussion.

@shinytang6 for awareness.

Can you elaborate on the topology mapping algorithm?

@lowang-bh (Member, Author) commented Jul 30, 2024

Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?

Yes, the current one just tries its best to find a locally optimal solution and is a simple implementation. There will be an official release with a globally optimal solution.

@Monokaix (Member)

It's a great feature.

Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?

Yes, the current one just tries its best to find a locally optimal solution and is a simple implementation. There will be an official release with a globally optimal solution.

+1
If we implement it using a dynamic programming algorithm, it will be more efficient, but we may need to modify the allocate action framework, which is a big change.

@Monokaix (Member)

The best way is to use a dynamic programming algorithm, but that needs a new design, which may be beyond the scope of this issue. v1.10 is about to release, so this feature can be released in v1.11. Even if it's not the best way, we can continue to optimize the algorithm: )

@volcano-sh-bot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign hwdef
You can assign the PR to them by writing /assign @hwdef in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 7, 2024
@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 7, 2024
Signed-off-by: lowang-bh <[email protected]>

fix testcase when enable gang

Signed-off-by: lowang-bh <[email protected]>
@william-wang (Member)

Similar to how Slurm picks topology-aware nodes, we can use certain tools to discover the live network topology. For example, we can use ibnetdiscover to discover the InfiniBand network topology.

We can have another controller to watch the node changes and spin up a job to discover the network topology on the nodes with IB. The controller can update the network topology information in the configmap.


ibnetdiscover is a tool to detect IB networking; how about RoCE network detection? @yeahdongcn

@yeahdongcn (Contributor)

ibnetdiscover is a tool to detect IB networking; how about RoCE network detection? @yeahdongcn

I'm currently working on using LLDP for comprehensive topology detection.

Since there’s no central management tool like ibnetdiscover in RoCE, we need to collect all LLDP information from each node and switch to aggregate these details and build a complete Slurm configuration.
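
For illustration, a minimal sketch of that aggregation step, assuming a collector reports one LLDP neighbor record per RoCE NIC (all names hypothetical):

```go
package main

import "fmt"

// lldpNeighbor is the per-node record a collector daemon could report:
// which switch (LLDP system name) each NIC is cabled to.
type lldpNeighbor struct {
	node, nic, switchName string
}

// aggregate merges per-node LLDP reports into a switch -> nodes mapping,
// the shape needed to emit a Slurm-style topology configuration.
func aggregate(reports []lldpNeighbor) map[string][]string {
	topo := map[string][]string{}
	for _, r := range reports {
		topo[r.switchName] = append(topo[r.switchName], r.node)
	}
	return topo
}

func main() {
	reports := []lldpNeighbor{
		{"node1", "mlx5_0", "leaf-1"},
		{"node2", "mlx5_0", "leaf-1"},
		{"node3", "mlx5_0", "leaf-2"},
	}
	fmt.Println(aggregate(reports)) // map[leaf-1:[node1 node2] leaf-2:[node3]]
}
```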

@william-wang (Member) commented Nov 13, 2024

ibnetdiscover is a tool to detect IB networking; how about RoCE network detection? @yeahdongcn

I'm currently working on using LLDP for comprehensive topology detection.

Since there’s no central management tool like ibnetdiscover in RoCE, we need to collect all LLDP information from each node and switch to aggregate these details and build a complete Slurm configuration.

Good to know. I am investigating network-aware scheduling these days as well. I think we can have a meeting to talk more about the requirements and design to cover more cases in this PR. @lowang-bh @yeahdongcn

@hwdef (Member) commented Nov 13, 2024

I've been using @yeahdongcn's solution in a production environment for a while, and it's working fine so far.

@OguzPastirmaci

I have the same requirement for improving the placement/locality of GPU/RDMA workloads. We're labeling our nodes with the switch information during provisioning, so the static option will work for us. Looking forward to seeing it in v1.11, but I'll also try @yeahdongcn's solution in the meantime.

@yeahdongcn (Contributor)

I have the same requirement for improving the placement/locality of GPU/RDMA workloads. We're labeling our nodes with the switch information during provisioning, so the static option will work for us. Looking forward to seeing it in v1.11, but I'll also try @yeahdongcn's solution in the meantime.

Cool! Upon further investigation, we identified scenarios involving both intra-node (GPU <-> NIC) and inter-node (NODE <-> NODE) connections. To address these comprehensively, network topology details are stored in a ConfigMap using the Slurm configuration format, while PCIe topology information is applied as labels on the nodes.
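
For illustration, a minimal sketch of reading one Slurm-style topology line out of such a ConfigMap; the exact stored format is an assumption, and Slurm's real topology.conf also supports Switches= entries and node ranges like node[1-4]:

```go
package main

import (
	"fmt"
	"strings"
)

// parseTopoLine parses one Slurm-topology-style line such as
// "SwitchName=leaf1 Nodes=node1,node2" into a switch name and node list.
func parseTopoLine(line string) (switchName string, nodes []string) {
	for _, field := range strings.Fields(line) {
		key, value, ok := strings.Cut(field, "=")
		if !ok {
			continue
		}
		switch key {
		case "SwitchName":
			switchName = value
		case "Nodes":
			nodes = strings.Split(value, ",")
		}
	}
	return switchName, nodes
}

func main() {
	sw, nodes := parseTopoLine("SwitchName=leaf1 Nodes=node1,node2")
	fmt.Println(sw, nodes) // leaf1 [node1 node2]
}
```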

@OguzPastirmaci

Cool! Upon further investigation, we identified scenarios involving both intra-node (GPU <-> NIC) and inter-node (NODE <-> NODE) connections. To address these comprehensively, network topology details are stored in a ConfigMap using the Slurm configuration format, while PCIe topology information is applied as labels on the nodes.

Great! We have the option of deploying Slurm as well, and we create the Slurm topology based on the switch/network block information available in our metadata service. Is static still going to be an option if we don't need to dynamically discover the network?

@yeahdongcn (Contributor)

Great! We have the option of deploying Slurm as well, and we create the Slurm topology based on the switch/network block information available in our metadata service. Is static still going to be an option if we don't need to dynamically discover the network?

This may depend on your workload profile. For example, in LLM training workloads, network health and connectivity are critical factors. By employing dynamic topology, while we might not always select the shortest path for a job, we can ensure network connectivity—particularly important given the frequent occurrence of optical module failures in IB or RoCE environments.

@Monokaix (Member) commented Dec 7, 2024

Hi guys, after a long discussion with community users we have proposed a revised doc here: #3850. We have also discussed the details in the weekly meeting. We believe it can cover more use cases, and we hope for your feedback!

@lowang-bh (Member, Author)

Closing, as there is a more ideal design in #3850.

@lowang-bh lowang-bh closed this Dec 22, 2024
