Network Topology Aware Plugin #3388
/assign @Monokaix @hwdef @william-wang |
|
We should discuss this in the weekly meeting. |
|
Sounds interesting! But it may be more complex than the given design. For example, network delay varies between different nodes, and it also varies over time for the same node. Taking network performance metrics into account for this feature may be a good choice. I think we should discuss this in the community and complete the design first. |
|
What's the difference between this and https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/networkaware? |
I would recommend treating network performance metrics as a kind of load, which makes that more like load-aware scheduling.
Network performance is not considered here yet. This plugin only considers the physical differences in topology. |
I think this is a lite version of that plugin: it only considers a few physical topology levels, such as IDC, rack, and switch, and depends on the corresponding labels on nodes. Advantage: it is simpler to use, since it relies only on node labels. |
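To make the label-based idea concrete, here is a minimal sketch of how a candidate node could be scored against a node that already runs a task of the same job. It is not the plugin's code: the label keys under `example.com/`, the `topologyScore` function, and the 0 to 100 score range are all assumptions made for illustration.

```go
package main

import "fmt"

// topologyKeys is a hypothetical, ordered list of node label keys,
// from the closest network domain (switch) to the farthest (IDC).
var topologyKeys = []string{"example.com/switch", "example.com/rack", "example.com/idc"}

// topologyScore returns a score in [0, 100] for a candidate node, given the
// labels of a node that already runs a task of the same job.
// The earlier the first matching key is in keys, the higher the score;
// if no key matches, the score is 0.
func topologyScore(keys []string, placed, candidate map[string]string) float64 {
	for i, key := range keys {
		placedValue, ok := placed[key]
		if !ok || placedValue == "" {
			continue // the placed node does not carry this label
		}
		if candidate[key] == placedValue {
			return 100 * float64(len(keys)-i) / float64(len(keys))
		}
	}
	return 0
}

func main() {
	placed := map[string]string{
		"example.com/switch": "sw-1",
		"example.com/rack":   "rack-a",
		"example.com/idc":    "idc-1",
	}
	sameSwitch := map[string]string{"example.com/switch": "sw-1", "example.com/rack": "rack-a", "example.com/idc": "idc-1"}
	sameRack := map[string]string{"example.com/switch": "sw-2", "example.com/rack": "rack-a", "example.com/idc": "idc-1"}
	otherIDC := map[string]string{"example.com/switch": "sw-9", "example.com/rack": "rack-z", "example.com/idc": "idc-2"}

	fmt.Println(topologyScore(topologyKeys, placed, sameSwitch))
	fmt.Println(topologyScore(topologyKeys, placed, sameRack))
	fmt.Println(topologyScore(topologyKeys, placed, otherIDC))
}
```

Because only static labels are compared, this kind of scoring cannot react to changing network delay, which is the trade-off raised earlier in the thread.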
We should collect more use cases :) |
docs/design/net-aware.md
Outdated
3. If a node has several keys that appear in the configured list, the first key that matches the configured keys gets the higher score.

```go
nodeOrderFn := func(task *api.TaskInfo, node *api.NodeInfo) (float64, error){
```
According to the original user demand, a job should occupy nodes exclusively, so a nodeorder func alone does not seem to satisfy the original use case.
You mean a predicateFn is needed?
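As a rough illustration of what such a predicate could check, the sketch below rejects any node that already runs tasks of a different job. The `nodeState` type, the `exclusivePredicate` function, and the error value are hypothetical simplifications and do not reflect Volcano's actual predicate signature.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeState is a simplified, hypothetical view of a node.
type nodeState struct {
	name string
	jobs map[string]int // job ID -> number of tasks running on this node
}

// errNodeShared is returned when a node is already used by another job.
var errNodeShared = errors.New("node already runs tasks of another job")

// exclusivePredicate filters out nodes occupied by any job other than jobID,
// so that a job only lands on empty nodes or on nodes it already uses.
func exclusivePredicate(jobID string, n nodeState) error {
	for id, count := range n.jobs {
		if id != jobID && count > 0 {
			return errNodeShared
		}
	}
	return nil
}

func main() {
	busy := nodeState{name: "node-1", jobs: map[string]int{"job-a": 2}}
	empty := nodeState{name: "node-2"}

	fmt.Println(exclusivePredicate("job-a", busy))  // job-a may reuse its own node
	fmt.Println(exclusivePredicate("job-b", busy))  // job-b is rejected
	fmt.Println(exclusivePredicate("job-b", empty)) // empty nodes are accepted
}
```

A filter like this would run before scoring, so the node order function would only rank nodes that the job can occupy exclusively.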
|
The first task returning a score of 0 for all nodes might cause issues. I suggest comparing the remaining nodes in each topology to see if they meet the job's requirements and assign a corresponding score. This score should be proportional to the maximum number of tasks that can be accommodated. Moreover, the current implementation of this plugin resembles a greedy algorithm, which aims to find the optimal node for each task. However, the greedy algorithm doesn't necessarily yield the optimal solution. I'm curious about how dynamic programming or backtracking could be implemented within the Volcano framework. Is there a way to perform multiple pre-scheduling attempts for a job and apply the one with the highest total score? |
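The capacity-proportional scoring suggested above could look roughly like the following sketch. It groups nodes by topology domain, sums how many of the job's remaining tasks each domain can hold, and scales that to a score. The `node` type, its `freeSlots` field, and `domainScores` are hypothetical names, and the sketch remains a greedy per-domain heuristic rather than a globally optimal search.

```go
package main

import "fmt"

// node is a simplified, hypothetical stand-in for a scheduler node record.
type node struct {
	name      string
	domain    string // e.g. the value of a switch or rack label
	freeSlots int    // how many more tasks of this job the node can hold
}

// domainScores returns, per topology domain, a score in [0, 100] that is
// proportional to how many of the job's remaining tasks the domain can
// accommodate. A domain that fits all remaining tasks scores 100.
func domainScores(nodes []node, remainingTasks int) map[string]float64 {
	capacity := make(map[string]int)
	for _, n := range nodes {
		capacity[n.domain] += n.freeSlots
	}

	scores := make(map[string]float64, len(capacity))
	for domain, c := range capacity {
		if remainingTasks <= 0 {
			scores[domain] = 0
			continue
		}
		if c > remainingTasks {
			c = remainingTasks // extra capacity beyond the job's need adds nothing
		}
		scores[domain] = 100 * float64(c) / float64(remainingTasks)
	}
	return scores
}

func main() {
	nodes := []node{
		{name: "node-1", domain: "sw-1", freeSlots: 2},
		{name: "node-2", domain: "sw-1", freeSlots: 2},
		{name: "node-3", domain: "sw-2", freeSlots: 1},
	}
	scores := domainScores(nodes, 4)
	fmt.Println(scores["sw-1"], scores["sw-2"]) // prints "100 25"
}
```

With a score like this, the first task of a job no longer sees 0 for every node: it is steered toward the domain that can hold the most of the job.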
Can you elaborate on the topology mapping algorithm? |
Yes, the current one simply tries its best to find a locally optimal solution and is a simple implementation. There will be an official release with a globally optimal solution. |
|
It's a great feature and we ca
+1 |
|
The best approach is a dynamic programming algorithm, but that needs a new design, which may be beyond the scope of this issue. v1.10 is about to be released, so this feature can be released in v1.11. Even if it is not the best approach, we can continue to optimize the algorithm :) |
9b03bf8 to
db43093
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: lowang-bh <[email protected]>
db43093 to
9efc3a0
Compare
Signed-off-by: lowang-bh <[email protected]> fix testcase when enable gang Signed-off-by: lowang-bh <[email protected]>
9efc3a0 to
cf1a187
Compare
|
I'm currently working on using LLDP for comprehensive topology detection. Since there’s no central management tool like |
Good to know. I am investigating network-aware scheduling these days as well. I think we can have a meeting to talk more about the requirements and design, so that this PR covers more cases. @lowang-bh @yeahdongcn |
|
I've been using @yeahdongcn's solution in a production environment for a while and it's working fine so far. |
|
I have the same requirement for improving the placement/locality of GPU/RDMA workloads. We're labeling our nodes with the switch information during provisioning, so the static option will work for us. Looking forward to seeing it in v1.11, but I'll also try @yeahdongcn's solution in the meantime. |
Cool! Upon further investigation, we identified scenarios involving both intra-node ( |
Great! We have the option of deploying Slurm as well and we create the Slurm topology based on the switch/network block information available in our metadata service. Is static still going to be an option if we don't need to dynamically discover the network? |
This may depend on your workload profile. For example, in LLM training workloads, network health and connectivity are critical factors. By employing dynamic topology, while we might not always select the shortest path for a job, we can ensure network connectivity—particularly important given the frequent occurrence of optical module failures in IB or RoCE environments. |
|
Hi all, after a long discussion with community users, we have proposed a revised doc in #3850 and have also discussed the details in the weekly meeting. We believe it can cover more use cases, and we hope for your feedback! |
|
Closing this, as there is a better design in #3850. |
/kind feature
fixes #2984
fixes #447
fixes #3317
Several issues request this feature, such as #447, #2984, and #3317.
Motivation
We aim to make the scheduler network-topology aware so as to achieve the following:
Goals
Non-Goals