# Net Topology Aware Plugin

- [Net Topology Aware Plugin](#net-topology-aware-plugin)
  - [Background](#background)
  - [Motivation](#motivation)
  - [Proposal one](#proposal-one)
    - [Goals](#goals)
    - [Non-Goals](#non-goals)
    - [Design Details](#design-details)
      - [Pod scheduling process](#pod-scheduling-process)
      - [Usage](#usage)
    - [Drawbacks](#drawbacks)

## Background

A Kubernetes cluster usually has many nodes, and those nodes may sit in different IDCs and chassis, and even under different switches.
Data transfers across different IDCs, chassis, and switches have different performance characteristics, so some latency-sensitive workloads need to run in the same IDC, or even within the same topology device, such as a chassis or switch.

## Motivation

We aim to make the scheduler network-topology aware, so as to achieve the following:

- Best-effort scheduling of all tasks of a job to the same topology domain, such as the same IDC, chassis, or switch.

## Proposal one

This proposal requires the cluster administrator to manage network topology labels on Kubernetes nodes: nodes within the same topology domain are given the same value for a label key, as in the sketch below.

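As an illustration (the node name and the label values here are only examples, not required by the plugin), a node could carry topology labels like this:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-1
  labels:
    idc: bj
    rack: rack1
    switch: NvLink-A100
```
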
### Goals

- Support a single-key topology configuration: try to schedule all tasks of a job to nodes that share the same value for that key.
- Support multiple-key topology policies: keys earlier in the list yield a higher score.

### Non-Goals

- Finding a globally optimal placement across the nodes for all values of the configured keys is out of scope.

### Design Details

#### Pod scheduling process

1. When the first task of a job is allocated to a node, record that node's topology information in the plugin.
2. When scheduling the job's remaining tasks, a node with the same topology as the already-allocated tasks gets a higher score; otherwise it gets a score of zero.
3. If a node matches multiple keys in the configured list, the earliest matching key determines the score; keys nearer the front of the list carry higher priority. The scoring function below illustrates this.

```go
nodeOrderFn := func(task *api.TaskInfo, node *api.NodeInfo) (float64, error) {
    // ... (tNode is the node where the job's first task was allocated)
    score := 0
    weight := np.weight
    tlabels := tNode.Node.Labels
    labels := node.Node.Labels
    length := len(np.topologyKeys)
    for i, key := range np.topologyKeys {
        if tlabels[key] == labels[key] {
            // A key nearer the front of the list yields a higher score.
            score += length - i
            break
        }
    }
    return float64(score * weight), nil
}
```
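
Step 1 of the process above could be implemented with a session event handler. The sketch below is illustrative rather than the plugin's actual code: `np.firstNode` is an assumed plugin-local `map[api.JobID]*api.NodeInfo`, and `nodeInfoByName` is a hypothetical helper that resolves a node name to its `*api.NodeInfo`.

```go
// Illustrative sketch: remember the first allocated node per job so that
// nodeOrderFn can compare topology labels against it.
ssn.AddEventHandler(&framework.EventHandler{
    AllocateFunc: func(event *framework.Event) {
        jobID := event.Task.Job
        if _, found := np.firstNode[jobID]; !found {
            // nodeInfoByName is a hypothetical lookup, not a Volcano API.
            np.firstNode[jobID] = nodeInfoByName(event.Task.NodeName)
        }
    },
})
```

With this bookkeeping, `tNode` in `nodeOrderFn` above would simply be `np.firstNode[task.Job]`.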

#### Usage

1. Label nodes with key-value pairs, for example `switch=NvLink-A100`, `rack=rack1`/`rack=rack2`, and `idc=bj`/`idc=sh`, to partition nodes into different topology zones.
2. Add the net-topology plugin to the scheduler configuration:

```yaml
- plugins:
  - name: net-topology
    arguments:
      net-topology.type: static
      net-topology.keys: rack,switch,idc
      net-topology.weight: 10
```
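
With this configuration, the scoring function in the design section works out as follows: a candidate node that matches the first task's `rack` label scores (3 - 0) * 10 = 30, a `switch` match scores (3 - 1) * 10 = 20, an `idc` match scores 10, and a node matching none of the keys scores 0.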

### Drawbacks

This is not a globally optimal solution that guarantees all of a job's tasks land on nodes in the same topology domain. For example, suppose the nodes labeled with key=value1 do not have enough resources for the whole job, but the nodes labeled with key=value2 do: if the first task is bound to a node with key=value1, all remaining tasks will still prefer the key=value1 node list.