
Conversation

@lowang-bh (Member) commented Apr 6, 2024

/kind feature
fixes #2984
fixes #447
fixes #3317
There are several issues requesting this feature, such as #447, #2984, and #3317.

Motivation

We aim to make the scheduler network-topology aware so as to achieve the following:

  • Best effort to schedule all tasks of the same job onto devices in the same topology domain, such as the same IDC.

Goals

  • Support single-key topology configuration: try to schedule all of a job's tasks to nodes that have the same value for that key.
  • Support multiple-key topology policies: keys at the front of the list get a higher score (a minimal sketch follows the list).
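
For illustration, here is a minimal Go sketch of that front-of-list weighting; the key names and the scoreNode helper are hypothetical examples, not the plugin's actual code:

```go
package main

import "fmt"

// topoKeys is the configured topology key list; keys earlier in the
// list carry more weight (hypothetical example configuration).
var topoKeys = []string{"idc", "rack", "switch"}

// scoreNode returns a higher score when the node's label value for an
// earlier-listed key matches the value already chosen for the job.
// jobValues would be derived from the job's previously placed tasks.
func scoreNode(nodeLabels, jobValues map[string]string) float64 {
	weight := float64(len(topoKeys))
	var score float64
	for _, key := range topoKeys {
		if v, ok := nodeLabels[key]; ok && v == jobValues[key] {
			score += weight
		}
		weight-- // keys further back in the list contribute less
	}
	return score
}

func main() {
	node := map[string]string{"idc": "idc-a", "switch": "sw-3"}
	job := map[string]string{"idc": "idc-a", "switch": "sw-1"}
	fmt.Println(scoreNode(node, job)) // only the idc key matches
}
```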

Non-Goals

  • Not to find a globally optimal solution among nodes across all values of that key.

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 6, 2024
@lowang-bh lowang-bh changed the title from "Network aware" to "Network Topology Aware Plugin" Apr 6, 2024
@lowang-bh (Member, Author)

/assign @Monokaix @hwdef @william-wang

@lowang-bh lowang-bh closed this Apr 7, 2024
@lowang-bh lowang-bh reopened this Apr 7, 2024
@volcano-sh-bot volcano-sh-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 7, 2024
@hwdef (Member) commented Apr 8, 2024

We should talk about this in the weekly meeting.

@Thor-wl (Contributor) commented Apr 11, 2024

Sounds interesting! But it may be more complex than the given design. For example, network delay varies between different nodes. Also, it varies across time periods for the same node. Maybe considering network performance metrics for this feature would be a good choice. I think we should have a discussion in the community and complete the design first.

@Thor-wl Thor-wl requested review from hwdef and william-wang and removed request for archlitchi and hudson741 April 11, 2024 02:31

@lowang-bh (Member, Author)

Maybe considering network performance metrics for this feature would be a good choice.

I would recommend treating the network performance metrics as a kind of load, so it's more like load-aware scheduling.

network delay varies between different nodes

That is not considered for now. This plugin just considers the physical differences in topology.

@lowang-bh (Member, Author) commented Apr 11, 2024

https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/networkaware?

I think it is a lite version of that plugin; it just considers several physical topology levels, such as IDC, rack, and switch, and depends on those labels on nodes.

Advantage: simpler to use, relying only on node labels.
Shortcoming: no bandwidth or latency awareness, etc.

@Monokaix (Member)

https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/networkaware?

I think it is a lite version of that plugin; it just considers several physical topology levels, such as IDC, rack, and switch, and depends on those labels on nodes.

Advantage: simpler to use, relying only on node labels. Shortcoming: no bandwidth or latency awareness, etc.

We should collect more use cases: )

3. If a node has multiple keys matching the configured list, the first key that matches gets a higher score

```go
nodeOrderFn := func(task *api.TaskInfo, node *api.NodeInfo) (float64, error) {
```
@Monokaix (Member) commented Apr 11, 2024

Per the original user demand, a job should occupy nodes exclusively, so a nodeOrderFn alone seems unable to satisfy the original use case.

@lowang-bh (Member, Author)

You mean a predicateFn is needed?
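
For illustration, a minimal sketch of what such a hard filter could look like; it uses simplified stand-ins for Volcano's api.TaskInfo and api.NodeInfo rather than the real types, and the single topology key is a hypothetical configuration:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for Volcano's api.TaskInfo / api.NodeInfo.
type taskInfo struct{ jobTopoValue string } // topology value pinned by the job's first placed task
type nodeInfo struct{ labels map[string]string }

const topoKey = "idc" // hypothetical single-key configuration

// predicateFn rejects nodes whose topology value differs from the one the
// job is already pinned to, turning the soft nodeOrder preference into a
// hard constraint.
func predicateFn(task *taskInfo, node *nodeInfo) error {
	if task.jobTopoValue == "" {
		return nil // first task of the job: any topology domain is acceptable
	}
	if node.labels[topoKey] != task.jobTopoValue {
		return errors.New("node is outside the job's topology domain")
	}
	return nil
}

func main() {
	task := &taskInfo{jobTopoValue: "idc-a"}
	fmt.Println(predicateFn(task, &nodeInfo{labels: map[string]string{topoKey: "idc-b"}}))
}
```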

@lowang-bh lowang-bh closed this Apr 11, 2024
@lowang-bh lowang-bh reopened this Apr 11, 2024
@wangyizhi1

The first task returning a score of 0 for all nodes might cause issues. I suggest comparing the remaining nodes in each topology to see if they meet the job's requirements and assign a corresponding score. This score should be proportional to the maximum number of tasks that can be accommodated.

Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?
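
For illustration, a minimal sketch of that capacity-proportional idea, with hypothetical free-slot and domain inputs:

```go
package main

import "fmt"

// domainScores rates each topology domain by how much of the job's
// remaining tasks its free slots can hold (1.0 = fits the whole job).
func domainScores(freeSlots map[string]int, nodeDomain map[string]string, tasksNeeded int) map[string]float64 {
	capacity := map[string]int{}
	for node, slots := range freeSlots {
		capacity[nodeDomain[node]] += slots
	}
	scores := map[string]float64{}
	for domain, c := range capacity {
		if c > tasksNeeded {
			c = tasksNeeded // headroom beyond the job's size does not help
		}
		scores[domain] = float64(c) / float64(tasksNeeded)
	}
	return scores
}

func main() {
	free := map[string]int{"n1": 2, "n2": 1, "n3": 4}
	domains := map[string]string{"n1": "leaf-1", "n2": "leaf-1", "n3": "leaf-2"}
	fmt.Println(domainScores(free, domains, 4)) // leaf-1: 0.75, leaf-2: 1
}
```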

@elinx commented Jul 30, 2024

We have some thoughts on network-topology-aware scheduling as well, but limited to the same IDC.

User Story

Say we have a 10,000-GPU cluster where each node is connected to a leaf switch through IB or RoCE. Each leaf switch connects to spine switches, and there might be multiple paths between leaf switches. The network topology is a fat-tree.

End users would like to submit an LLM training job to the cluster. Typically, the job is an MPI job and the communication between nodes is frequent. In most cases, the communication between nodes in the same leaf switch is faster than the communication between nodes in different leaf switches (fewer jumps, less latency).

End users would like to schedule the tasks to the nodes that have optimal network connectivity at the moment. For example, if the tasks are scheduled to the nodes that are connected to the same leaf switch, the communication between the tasks is faster and the job can finish earlier.

We want to schedule the tasks to the nodes that are connected to the same leaf switch, or at least to the nodes that are connected to the same spine switch. This can reduce the network latency and improve the performance.

Design

We can add a new plugin to the scheduler to support network topology-aware scheduling. The plugin can get the network topology information from the network devices and use the information to schedule the tasks.

Network Topology Information

The network topology information can be stored in a configmap. The information includes the network topology, the network devices, the connections between the devices, and the latency between the devices.

Similar to how Slurm picks topology-aware nodes, we can use certain tools to discover the live network topology. For example, we can use ibnetdiscover to discover the InfiniBand network topology.

We can have another controller to watch the node changes and spin up a job to discover the network topology on the nodes with IB. The controller can update the network topology information in the configmap.

Network Topology Aware Plugin

The network topology-aware plugin can get the network topology information from the configmap and use the information to schedule the tasks. It works like a score plugin to score the nodes based on the network topology information and pick the best node for a task.
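
For illustration, a minimal sketch of such scoring, assuming the fat-tree has already been parsed out of the ConfigMap into simple node-to-leaf and leaf-to-spine maps (all names hypothetical):

```go
package main

import "fmt"

// Fat-tree topology as parsed from the hypothetical ConfigMap:
// node -> leaf switch, leaf switch -> spine switch.
var nodeLeaf = map[string]string{"n1": "leaf1", "n2": "leaf1", "n3": "leaf2"}
var leafSpine = map[string]string{"leaf1": "spine1", "leaf2": "spine1"}

// score prefers candidates on the same leaf as already-placed tasks,
// then the same spine, then anything else (fewer hops, lower latency).
func score(candidate string, placed []string) float64 {
	best := 0.0
	for _, p := range placed {
		switch {
		case nodeLeaf[candidate] == nodeLeaf[p]:
			return 100 // same leaf switch: shortest path
		case leafSpine[nodeLeaf[candidate]] == leafSpine[nodeLeaf[p]]:
			best = 50 // same spine: reachable via leaf-spine-leaf
		}
	}
	return best
}

func main() {
	fmt.Println(score("n2", []string{"n1"})) // 100: same leaf
	fmt.Println(score("n3", []string{"n1"})) // 50: same spine
}
```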

Initial Code Change

Please see: https://github.com/yeahdongcn/volcano/tree/topo

This is a PoC to show how to add a network topology-aware plugin to the scheduler. See https://github.com/yeahdongcn/volcano/blob/topo/pkg/scheduler/actions/allocate/allocate_test.go#L160 for the test case.

If anyone is interested in this topic, we can have further discussion.

@shinytang6 for awareness.

Can you elaborate on the topology mapping algorithm?

@lowang-bh (Member, Author) commented Jul 30, 2024

Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?

Yes, the current one just tries its best to find a locally optimal solution and is a simple implementation. There will be an official release with a globally optimal solution.

@Monokaix (Member)

It's a great feature.

Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score?

Yes, the current one just tries its best to find a locally optimal solution and is a simple implementation. There will be an official release with a globally optimal solution.

+1
If we implement it using a dynamic programming algorithm, it will be more efficient, but we may need to modify the allocate action framework, which is a big change.

@Monokaix (Member)

The best way is to use a dynamic programming algorithm, but that needs a new design, which may be beyond the scope of this issue. v1.10 is about to release, so this feature can be released in v1.11. Even if it's not the best way, we can continue to optimize the algorithm: )

@volcano-sh-bot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign hwdef
You can assign the PR to them by writing /assign @hwdef in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 7, 2024
@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 7, 2024
Signed-off-by: lowang-bh <[email protected]>

fix testcase when enable gang

Signed-off-by: lowang-bh <[email protected]>
@william-wang (Member)

Similar to how Slurm picks topology-aware nodes, we can use certain tools to discover the live network topology. For example, we can use ibnetdiscover to discover the InfiniBand network topology.

We can have another controller to watch the node changes and spin up a job to discover the network topology on the nodes with IB. The controller can update the network topology information in the configmap.


ibnetdiscover is a tool to detect IB networking; how about RoCE network detection? @yeahdongcn

@yeahdongcn (Contributor)

ibnetdiscover is a tool to detect IB networking; how about RoCE network detection? @yeahdongcn

I'm currently working on using LLDP for comprehensive topology detection.

Since there’s no central management tool like ibnetdiscover in RoCE, we need to collect all LLDP information from each node and switch to aggregate these details and build a complete Slurm configuration.
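
For illustration, a minimal sketch of that aggregation step, assuming a collector reports one LLDP neighbor record per RoCE NIC (all names hypothetical):

```go
package main

import "fmt"

// lldpNeighbor is the per-node record a collector daemon could report:
// which switch (LLDP system name) each NIC is cabled to.
type lldpNeighbor struct {
	node, nic, switchName string
}

// aggregate merges per-node LLDP reports into a switch -> nodes mapping,
// the shape needed to emit a Slurm-style topology configuration.
func aggregate(reports []lldpNeighbor) map[string][]string {
	topo := map[string][]string{}
	for _, r := range reports {
		topo[r.switchName] = append(topo[r.switchName], r.node)
	}
	return topo
}

func main() {
	reports := []lldpNeighbor{
		{"node1", "mlx5_0", "leaf-1"},
		{"node2", "mlx5_0", "leaf-1"},
		{"node3", "mlx5_0", "leaf-2"},
	}
	fmt.Println(aggregate(reports)) // map[leaf-1:[node1 node2] leaf-2:[node3]]
}
```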

@william-wang (Member) commented Nov 13, 2024

ibnetdiscover is a tool to detect IB networking; how about RoCE network detection? @yeahdongcn

I'm currently working on using LLDP for comprehensive topology detection.

Since there’s no central management tool like ibnetdiscover in RoCE, we need to collect all LLDP information from each node and switch to aggregate these details and build a complete Slurm configuration.

Good to know. I am investigating network-aware scheduling these days as well. I think we can have a meeting to talk more about the requirements and design to cover more cases in this PR. @lowang-bh @yeahdongcn

@hwdef (Member) commented Nov 13, 2024

I've been using @yeahdongcn's solution in a production environment for a while, and it's working fine so far.

@OguzPastirmaci

I have the same requirement for improving the placement/locality of GPU/RDMA workloads. We're labeling our nodes with the switch information during provisioning, so the static option will work for us. Looking forward to seeing it in v1.11, but I'll also try @yeahdongcn's solution in the meantime.

@yeahdongcn (Contributor)

I have the same requirement for improving the placement/locality of GPU/RDMA workloads. We're labeling our nodes with the switch information during provisioning, so the static option will work for us. Looking forward to seeing it in v1.11, but I'll also try @yeahdongcn's solution in the meantime.

Cool! Upon further investigation, we identified scenarios involving both intra-node (GPU <-> NIC) and inter-node (NODE <-> NODE) connections. To address these comprehensively, network topology details are stored in a ConfigMap using the Slurm configuration format, while PCIe topology information is applied as labels on the nodes.
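
For illustration, a minimal sketch of reading one Slurm-style topology line out of such a ConfigMap; the exact stored format is an assumption, and Slurm's real topology.conf also supports Switches= entries and node ranges like node[1-4]:

```go
package main

import (
	"fmt"
	"strings"
)

// parseTopoLine parses one Slurm-topology-style line such as
// "SwitchName=leaf1 Nodes=node1,node2" into a switch name and node list.
func parseTopoLine(line string) (switchName string, nodes []string) {
	for _, field := range strings.Fields(line) {
		key, value, ok := strings.Cut(field, "=")
		if !ok {
			continue
		}
		switch key {
		case "SwitchName":
			switchName = value
		case "Nodes":
			nodes = strings.Split(value, ",")
		}
	}
	return switchName, nodes
}

func main() {
	sw, nodes := parseTopoLine("SwitchName=leaf1 Nodes=node1,node2")
	fmt.Println(sw, nodes) // leaf1 [node1 node2]
}
```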

@OguzPastirmaci

Cool! Upon further investigation, we identified scenarios involving both intra-node (GPU <-> NIC) and inter-node (NODE <-> NODE) connections. To address these comprehensively, network topology details are stored in a ConfigMap using the Slurm configuration format, while PCIe topology information is applied as labels on the nodes.

Great! We have the option of deploying Slurm as well, and we create the Slurm topology based on the switch/network block information available in our metadata service. Is static still going to be an option if we don't need to dynamically discover the network?

@yeahdongcn (Contributor)

Great! We have the option of deploying Slurm as well, and we create the Slurm topology based on the switch/network block information available in our metadata service. Is static still going to be an option if we don't need to dynamically discover the network?

This may depend on your workload profile. For example, in LLM training workloads, network health and connectivity are critical factors. By employing dynamic topology, while we might not always select the shortest path for a job, we can ensure network connectivity—particularly important given the frequent occurrence of optical module failures in IB or RoCE environments.

@Monokaix (Member) commented Dec 7, 2024

Hi guys, after a long discussion with community users we have proposed a revised doc here: #3850. We have also discussed the details in the weekly meeting. We believe it can cover more use cases, and we hope for your feedback!

@lowang-bh (Member, Author)

Closing, as there is a more ideal design in #3850.

@lowang-bh lowang-bh closed this Dec 22, 2024
