# Network Topology Aware Plugin

- [Network Topology Aware Plugin](#network-topology-aware-plugin)
  - [Backgrounds](#backgrounds)
  - [Motivation](#motivation)
  - [Proposal one](#proposal-one)
    - [Goals](#goals)
    - [Non-Goals](#non-goals)
    - [Design Action](#design-action)
      - [Pod scheduling process](#pod-scheduling-process)
      - [Usage](#usage)
    - [Drawbacks](#drawbacks)

## Backgrounds

A Kubernetes cluster typically comprises numerous nodes distributed across different IDCs, chassis, and switches.

Data-transfer performance varies across these components.

For latency-sensitive workloads, it is crucial to execute tasks within the same IDC, and ideally on the same chassis and switch.

## Motivation

The goal is to make the Kubernetes scheduler network-topology aware, so that tasks from the same job are scheduled onto nodes within the same topology domain, such as the same IDC, chassis, or switch.

There will be two types of network-topology awareness:

- **static**: `network-topology.type: static` derives the network topology from node labels.
- **dynamic**: `network-topology.type: dynamic` detects the network topology at runtime with external tools. For example, `ibnetdiscover` can be used to discover an InfiniBand network topology.

## Proposal one

This proposal requires cluster administrators to maintain network-topology labels on Kubernetes (K8s) nodes.

Nodes that carry the same value for a topology label are treated as belonging to the same topology domain.

### Goals

- **Single-Key Topology Configuration**: Support scheduling all tasks of a job onto nodes that share the same value for a specified label key.
- **Multiple-Key Topology Policies**: Give keys listed earlier higher scheduling preference; for example, with keys `rack,switch`, a node matching the first task's `rack` label scores higher than one matching only its `switch` label.

### Non-Goals

- **Global Solutions**: This proposal does not aim to search across all possible values of a topology key simultaneously for a globally optimal placement (see [Drawbacks](#drawbacks)).

### Design Action

#### Pod scheduling process

1. **Recording Topology Information**: When the first task of a job is assigned to a node, record that node's topology information in the scheduling plugin (a sketch of this step follows the scoring code below).
2. **Scoring Nodes for Subsequent Tasks**: When scheduling the job's subsequent tasks, give nodes with the same topology as the first allocated task a higher score; all other nodes score zero.
3. **Handling Multiple Keys**: If a node matches several keys from the configured list, only the earliest key in the list counts toward the score.

```go
nodeOrderFn := func(task *api.TaskInfo, node *api.NodeInfo) (float64, error) {
	...
	// tNode is the node that received the job's first task, recorded in step 1.
	score := 0
	weight := np.weight
	tlabels := tNode.Node.Labels
	labels := node.Node.Labels
	length := len(np.topologyKeys)
	for i, key := range np.topologyKeys {
		// Keys at the front of the list have higher priority; stop at the first match.
		if tv, ok := tlabels[key]; ok && tv == labels[key] {
			score += length - i
			break
		}
	}
	return float64(score * weight), nil
}
```
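
The recording in step 1 might look like the sketch below, assuming the same `api` package as the scoring snippet. The `networkTopologyPlugin` struct fields, the `jobFirstNode` map, the `api.JobID` identifier type, and the `onTaskAllocated` hook are illustrative assumptions rather than the plugin's actual API; in practice such a callback would be registered through the scheduler's event-handler mechanism so that it fires whenever a task is allocated.

```go
// Minimal sketch of step 1 (assumed names, not the plugin's actual API):
// remember the node that receives each job's first task.
type networkTopologyPlugin struct {
	weight       int
	topologyKeys []string
	jobFirstNode map[api.JobID]*api.NodeInfo // job -> node of its first allocated task
}

func (np *networkTopologyPlugin) onTaskAllocated(task *api.TaskInfo, node *api.NodeInfo) {
	if _, recorded := np.jobFirstNode[task.Job]; !recorded {
		// Only the first allocation fixes the job's topology; nodeOrderFn
		// scores later tasks against this node's labels.
		np.jobFirstNode[task.Job] = node
	}
}
```

To make the scoring concrete: with keys `rack,switch,idc` and weight `10`, a node matching the first task's `rack` label scores `(3 - 0) * 10 = 30`, a node matching only `idc` scores `(3 - 2) * 10 = 10`, and a node matching none of the keys scores `0`.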

#### Usage

1. Label nodes with key-value pairs (e.g. `switch=NvLink-A100`, `rack=rack1` / `rack=rack2`, `idc=bj` / `idc=sh`) to partition them into different topology zones; a sketch of such labels follows the configuration example below.
2. Add the `network-topology` plugin to the scheduler configuration to enable these policies:

```yaml
- plugins:
  - name: network-topology
    arguments:
      network-topology.type: static   # static: derive the topology from node labels
      network-topology.keys: rack,switch,idc # required when type is static; earlier keys have higher priority
      network-topology.weight: 10
```
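
For step 1, the resulting labels on a node object might look like the following sketch (the node name is hypothetical):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01   # hypothetical node name
  labels:
    idc: bj
    rack: rack1
    switch: NvLink-A100
```

All nodes sharing `rack: rack1` form one topology zone for the `rack` key, and likewise for the `switch` and `idc` keys.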

### Drawbacks

This is not a global solution that guarantees all tasks of a job land on nodes within the same topology domain. For example, if the nodes labeled `key=value1` lack sufficient resources while the nodes labeled `key=value2` have them, and the job's first task is assigned to a `key=value1` node, subsequent tasks will still prefer `key=value1` nodes despite the resource shortage.
