Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
90 changes: 90 additions & 0 deletions docs/design/network-topology-aware.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
# Network Topology Aware Plugin

- [Network Topology Aware Plugin](#network-topology-aware-plugin)
- [Backgrounds](#backgrounds)
- [Motivation](#motivation)
- [Proposal one](#proposal-one)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Design Action](#design-action)
- [Pod scheduling process](#pod-scheduling-process)
- [Usage](#usage)
- [Drawbacks](#drawbacks)

## Backgrounds

A Kubernetes cluster typically comprises numerous nodes distributed across different IDCs, chassis, and switches.

Data transformations vary in performance across these different components.

For latency-sensitive workloads, it's crucial to execute tasks within the same IDC and ideally on the same chassis and switch.

## Motivation

The goal is to make the Kubernetes scheduler network-topology aware to achieve the following:

Ensure optimal scheduling of tasks from the same job onto nodes within the same topology, such as the same IDC, chassis, or switch.

There will be two types of network-topology aware

- **static**: `network-topology.type: static` is aiming to aware the network topology by nodes' labels
- **dynamic**: `network-topology.type: dynamic` is aiming to use some tools to detect the network topology dynamically. For example, `ibnetdiscover` can be used to discover the InfiniBand network topology

## Proposal one

This proposal requires cluster administrators to manage network topology labels on Kubernetes (K8s) nodes.

Nodes can be labeled to indicate identical topologies with the same label value.

### Goals

- **Single-Key Topology Configuration**: Support scheduling all tasks of a job onto nodes that share the same value for a specified key.
- **Multiple-Key Topology Policies**: Prioritize keys listed earlier for better scheduling preference.

### Non-Goals

- **Global Solutions**: This proposal does not aim to find solutions across nodes with all possible values of a topology key simultaneously.

### Design Action

#### Pod scheduling process

1. **Recording Topology Information**: When the first task of a job is assigned to a node, record the node's topology information in the scheduling plugin.
2. **Scoring Nodes for Subsequent Tasks**: During scheduling of subsequent tasks, nodes with the same topology as the initially allocated task receive a higher score; others receive a score of zero.
3. **Handling Multiple Keys**: If a node matches multiple keys from the configured list, the first key in the list is prioritized for scoring.

```go
nodeOrderFn := func(task *api.TaskInfo, node *api.NodeInfo) (float64, error){
...
score := 0
weight := np.weight
tlabels := tNode.Node.Labels
labels := node.Node.Labels
lenth := len(np.topologyKeys)
for i, key := range np.topologyKeys {
if tlabels[key] == labels[key] {
score += (lenth - i) // key with more priority at front of which with less priority
break
}
}
return float64(score * weight), nil
}
```

#### Usage

1. Label nodes with key-value pairs (e.g., `switch=NvLink-A100`, `rack=rack1,rack2`, `idc=bj,sh`) to partition nodes into different topology zones
2. Add the `network-topology` plugin in the scheduler configuration to implement these policies.

```yaml
- plugins:
- name: network-topology
arguments:
network-topology.type: static # static means it will use the node's labels to aware network topology
network-topology.keys: rack,switch,idc # required when type is static
network-topology.weight: 10
```

### Drawbacks

One drawback is that it's not a global solution that ensures all tasks of a job are placed on nodes within the same topology. For example, if nodes labeled with key-value1 lack sufficient resources while nodes labeled with key-value2 have them, and the first task is assigned to key-value1 nodes, subsequent tasks will still attempt to use key-value1 nodes, despite the resource constraints.
2 changes: 2 additions & 0 deletions pkg/scheduler/plugins/factory.go
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ import (
"volcano.sh/volcano/pkg/scheduler/plugins/drf"
"volcano.sh/volcano/pkg/scheduler/plugins/extender"
"volcano.sh/volcano/pkg/scheduler/plugins/gang"
nettopology "volcano.sh/volcano/pkg/scheduler/plugins/network-topology"
"volcano.sh/volcano/pkg/scheduler/plugins/nodegroup"
"volcano.sh/volcano/pkg/scheduler/plugins/nodeorder"
"volcano.sh/volcano/pkg/scheduler/plugins/numaaware"
Expand Down Expand Up @@ -56,6 +57,7 @@ func init() {
framework.RegisterPluginBuilder(overcommit.PluginName, overcommit.New)
framework.RegisterPluginBuilder(sla.PluginName, sla.New)
framework.RegisterPluginBuilder(tasktopology.PluginName, tasktopology.New)
framework.RegisterPluginBuilder(nettopology.PluginName, nettopology.New)
framework.RegisterPluginBuilder(numaaware.PluginName, numaaware.New)
framework.RegisterPluginBuilder(cdp.PluginName, cdp.New)
framework.RegisterPluginBuilder(rescheduling.PluginName, rescheduling.New)
Expand Down
107 changes: 107 additions & 0 deletions pkg/scheduler/plugins/network-topology/netaware.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
/*
Copyright 2024 The Volcano Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package networktopology

import (
"strings"

"k8s.io/klog/v2"

"volcano.sh/volcano/pkg/scheduler/api"
"volcano.sh/volcano/pkg/scheduler/framework"
)

// PluginName indicates name of volcano scheduler plugin.
const (
PluginName = "network-topology"
networkTopologyWeight = "network-topology.weight"
networkTopologyType = "network-topology.type" // strategy type for network topology
// strategy value for network-topology.type
staticAware = "static"
dynamicAware = "dynamic"
)

type netTopPlugin struct {
pluginArguments framework.Arguments
weight int // network-topology plugin score weight
topologyType string // supported topology type: static, dynamic
staticTopAware staticTopAware // use node labels to generate topology
}

// New return gang plugin
func New(arguments framework.Arguments) framework.Plugin {
return &netTopPlugin{pluginArguments: arguments, weight: 1, staticTopAware: staticTopAware{records: map[api.JobID]string{}}}
}

func (np *netTopPlugin) Name() string {
return PluginName
}

func (np *netTopPlugin) parseArguments() {
np.pluginArguments.GetInt(&np.weight, networkTopologyWeight)
value, ok := np.pluginArguments[networkTopologyType]
if !ok {
klog.Warningf("%s is not set, use default strategy %s", networkTopologyType, staticAware)
np.topologyType = staticAware
return
}

v, ok := value.(string)
if !ok {
klog.Warningf("invalid value for %s, use default strategy %s", networkTopologyType, staticAware)
np.topologyType = staticAware
return
}
np.topologyType = strings.TrimSpace(v)
}

// parseStaticAwareArguments return a boolean value indicating whether staticAware is valid to be used
func (np *netTopPlugin) parseStaticAwareArguments() bool {
keys, exist := np.pluginArguments[networkTopologyKeys]
if !exist {
klog.Warningf("plugin %s (with type %s) arguments does not configure %s, skip", PluginName, np.topologyType, networkTopologyKeys)
return false
}
topKeys, ok := keys.(string)
if !ok {
klog.Warningf("plugin %s (with type %s) arguments %s should has a string value", PluginName, networkTopologyKeys, networkTopologyKeys)
return false
}
for _, key := range strings.Split(topKeys, ",") {
np.staticTopAware.topologyKeys = append(np.staticTopAware.topologyKeys, strings.TrimSpace(key))
}
np.staticTopAware.weight = np.weight
return true
}

func (np *netTopPlugin) OnSessionOpen(ssn *framework.Session) {
np.parseArguments()
if np.topologyType == staticAware {
valid := np.parseStaticAwareArguments()
if valid {
np.staticTopAware.OnSessionOpen(ssn)
}
} else {
klog.Warningf("strategy %s is not supported in plugin %s", np.topologyType, PluginName)
}
}

func (np *netTopPlugin) OnSessionClose(ssn *framework.Session) {
if np.topologyType == staticAware {
np.staticTopAware.OnSessionClose(ssn)
}
}
Loading