Commit e7d1686

General clarification and cleanning
Signed-off-by: Itamar Holder <iholder@redhat.com>
1 parent 9270b97 commit e7d1686

1 file changed

Lines changed: 47 additions & 31 deletions

enhancements/kubelet/virtualization-higher-workload-density.md

@@ -26,7 +26,7 @@ status: implementable
2626

2727
Fit more workloads onto a given node - achieve a higher workload
2828
density - by overcommitting it's memory resources. Due to timeline
29-
needs a two-phased approach is considered.
29+
needs a multi-phased approach is considered.
3030

3131
## Motivation
3232

@@ -67,9 +67,6 @@ memory utilization per node, in order to reduce the cost per virtual machine.
6767
* Fit more virtual machines onto a node once higher workload density
6868
is enabled
6969
* Integrate well with [KSM] and [FPR]
70-
* **Technology Preview** - Enable higher density at all, limited
71-
support for stressed clusters
72-
* **General Availability** - Improve handling of stressed clusters
7370

7471
#### Usability
7572

@@ -108,7 +105,7 @@ We expect to mitigate the following situations
108105

109106
#### Scope
110107

111-
Memory over-committment, and as such swapping, will be initially limited to
108+
Memory over-commitment, and as such swapping, will be initially limited to
112109
virtual machines running in the burstable QoS class.
113110
Virtual machines in the guaranteed QoS classes are not getting over
114111
committed due to alignment with upstream Kubernetes. Virtual machines
@@ -163,7 +160,7 @@ virtual machine in a cluster.
163160
kubelet configuration via a `KubeletConfig` CR, in order to ensure
164161
that the kubelet will start once swap has been rolled out.
165162
a. The cluster admin is calculating the amount of swap space to
166-
provision based on the amount of physical ram and overcommittment
163+
provision based on the amount of physical ram and overcommitment
167164
ratio
168165
b. The cluster admin is creating a `MachineConfig` for provisioning
169166
swap on worker nodes
@@ -177,13 +174,15 @@ virtual machine in a cluster.
177174

178175
The cluster is now set up for higher workload density.
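
The sizing step in the workflow above (swap space derived from physical RAM and the overcommitment ratio) can be illustrated with a small worked calculation. This is only a sketch: the proposal text shown here does not spell out the formula, so the rule used below (swap covers the share of advertised memory that exceeds physical RAM) and the function name `swap_size_gib` are assumptions made for illustration.

```python
def swap_size_gib(physical_ram_gib: float, overcommit_ratio: float) -> float:
    """Estimate the swap space to provision on a worker node.

    Assumption (not stated in the proposal text): swap must back the part
    of the overcommitted memory that does not fit into physical RAM, i.e.
    swap = RAM * (ratio - 1).
    """
    if physical_ram_gib <= 0:
        raise ValueError("physical_ram_gib must be positive")
    if overcommit_ratio < 1.0:
        raise ValueError("overcommit_ratio must be at least 1.0")
    return physical_ram_gib * (overcommit_ratio - 1.0)


# Example: a node with 64 GiB of RAM and a 150% overcommitment ratio.
print(swap_size_gib(64, 1.5))  # 32.0
```

With a ratio of exactly 1.0 the function returns 0, which matches the intuition that no swap is needed when memory is not overcommitted.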
179176

177+
In phase 3, deploying the WASP agent will not be needed.
178+
180179
#### Workflow: Leveraging higher workload density
181180

182181
1. The VM Owner is creating a regular virtual machine and is launching it.
183182

184183
### API Extensions
185184

186-
Phase 1 does not require any Kubernetes, OpenShift, or OpenShift
185+
This proposal does not require any Kubernetes, OpenShift, or OpenShift
187186
Virtualization API changes.
188187

189188
### Topology Considerations
@@ -195,7 +194,7 @@ not provide the `MachineConfig` APIs.
195194

196195
#### Standalone Clusters
197196

198-
Standalone regular, and compact clusters are the primary use-cases for
197+
Standalone, regular and compact clusters are the primary use-cases for
199198
swap.
200199

201200
#### Single-node Deployments or MicroShift
@@ -228,8 +227,16 @@ The design is driven by the following guiding principles:
228227
An OCI Hook to enable swap by setting the containers cgroup
229228
`memory.swap.max=max`.
230229

231-
* **Technology Preview** - Limited to virt launcher pods
232-
* **General Availability** - Limited to burstable QoS class pods
230+
* **Technology Preview**
231+
* Limited to virt launcher pods.
232+
* Uses `UnlimitedSwap`.
233+
* **General Availability**
234+
* Limited to burstable QoS class pods.
235+
* Uses `LimitedSwap`.
236+
* Limited to non-high-priority pods.
237+
238+
For more info, refer to the upstream documentation on how to calculate
239+
[limited swap](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap#steps-to-calculate-swap-limit).
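
As a rough illustration of the `LimitedSwap` calculation described in the linked upstream KEP, a container's swap limit is its memory request as a proportion of the node's memory, multiplied by the swap available to pods. The sketch below follows those two steps; the function and parameter names are hypothetical and are not taken from the kubelet source.

```python
GIB = 1024 ** 3


def limited_swap_bytes(
    container_memory_request: int,
    node_memory_capacity: int,
    pods_swap_available: int,
) -> int:
    """Return a container's swap limit in bytes under LimitedSwap.

    Step 1: proportion = container memory request / node memory capacity.
    Step 2: limit = proportion * swap available to pods.
    Integer arithmetic is used to avoid floating-point rounding.
    """
    if node_memory_capacity <= 0:
        raise ValueError("node_memory_capacity must be positive")
    return container_memory_request * pods_swap_available // node_memory_capacity


# Example: 4 GiB request on a 64 GiB node with 32 GiB of swap for pods.
print(limited_swap_bytes(4 * GIB, 64 * GIB, 32 * GIB) // GIB)  # 2
```

A container without a memory request gets a limit of 0 in this sketch, so it cannot swap.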
233240

234241
###### Provisioning swap
235242

@@ -272,15 +279,17 @@ This is, because by default, no other slice is configured to have
272279

273280
###### Critical workload protection
274281

275-
Even critical pod workloads are run in burstable QoS class pods, thus
276-
at **General Availability** time they will be eligible to swap.
282+
Even critical pod workloads are run in burstable QoS class pods.
277283
However, swapping can lead to increased latencies and response times.
278284
For example, if a critical pod is depending on `LivenessProbe`s, then
279285
these checks can start to fail, once the pod is starting to swap.
280286

281287
This is undesirable and can put a node or a cluster (i.e. if a critical
282288
Operator is affected) at risk.
283289

290+
Therefore, at **General Availability** time they will not be eligible to swap.
291+
This is aligned with the upstream behavior.
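
The General Availability eligibility rule described above (burstable QoS class, excluding critical priority classes) could be expressed as a simple predicate. This is a minimal sketch, not the kubelet's implementation: the function name is hypothetical, and it assumes that the two well-known critical classes are `system-node-critical` and `system-cluster-critical`, which are the built-in Kubernetes priority classes.

```python
from typing import Optional

# Assumed to be the two well-known critical priority classes.
CRITICAL_PRIORITY_CLASSES = frozenset(
    {"system-node-critical", "system-cluster-critical"}
)


def ga_swap_eligible(qos_class: str, priority_class_name: Optional[str]) -> bool:
    """Return True if a pod may swap under the GA rules sketched here.

    A pod is eligible when it is in the Burstable QoS class and does not
    use one of the critical priority classes.
    """
    if qos_class != "Burstable":
        return False
    return priority_class_name not in CRITICAL_PRIORITY_CLASSES


print(ga_swap_eligible("Burstable", None))                    # True
print(ga_swap_eligible("Burstable", "system-node-critical"))  # False
print(ga_swap_eligible("Guaranteed", None))                   # False
```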
292+
284293
In order to prevent this problem, swap will be selectively disabled
285294
for pod using the two well-known [critical `priorityClass`es]:
286295

@@ -299,23 +308,21 @@ Dealing with memory pressure on a node is differentiating the TP fom GA.
299308
* Pro
300309
* Simple to achieve.
301310
* Con
302-
* A lot of memory pressure has ot be present in order to trigger
311+
* A lot of memory pressure has to be present in order to trigger
303312
soft eviction.
313+
* Once `memory.high` is reached, the whole `kubepods.slice` is throttled
314+
and cannot allocate memory, which might lead to applications crashing.
304315

305-
* **General Availability** - Memory based soft and hard eviction is going to
306-
be disabled, in favor of enabling swap based hard evictions, based on new
316+
* **General Availability** - Memory-based soft eviction is going to
317+
be disabled, in favor of enabling swap-based hard evictions, based on new
307318
swap traffic and swap utilization eviction metrics.
308319

309320
* Pro
310-
* Simple mental model. With memory only, memory eviction is used.
311-
With swap, swap eviction is used.
321+
* Eviction on the basis of swap pressure, not only memory pressure.
312322
* [LLN] applies, because all pods share the nodes memory
313323
* Con
314-
* If there are no burstable QoS pods on a node, then no swapping
315-
can take place, and no swap related signal will be triggered.
316-
Only way to remove pressure is cgroup level OOM.
317-
This is considered to be an edge case and highly unlikely.
318-
Prometheus alerts for this edge case will be added.
324+
* Swap-based evictions are made through a 3rd party container, which means
325+
it has to be done through an API-initiated eviction.
319326

320327
###### Node memory reduction
321328

@@ -340,14 +347,14 @@ The mechanisms are:
340347

341348
##### Differences between Technology Preview vs GA
342349

343-
| | TP | GA |
344-
|------------------------------|---------------|--------------------|
345-
| SWAP Provisioning | MachineConfig | MachineConfig |
346-
| SWAP Eligibility | VM pods | burstable QoS pods |
347-
| Node service protection | Yes | Yes |
348-
| I/O saturation protection | Yes | Yes |
349-
| Critical workload protection | No | Yes |
350-
| Memory pressure handling | Memory based | Swap based |
350+
| | TP | GA |
351+
|------------------------------|---------------|---------------------|
352+
| SWAP Provisioning | MachineConfig | MachineConfig |
353+
| SWAP Eligibility | VM pods | burstable QoS pods |
354+
| Node service protection | Yes | Yes |
355+
| I/O saturation protection | Yes | Yes |
356+
| Critical workload protection | No | Yes |
357+
| Memory pressure handling | Memory based | Memory & Swap based |
351358

352359
### Risks and Mitigations
353360

@@ -359,7 +366,16 @@ The mechanisms are:
359366

360367
#### Phase 2
361368

362-
Handled by upstream Kubernetes.
369+
Swap is handled by upstream Kubernetes.
370+
371+
| Risk | Mitigation |
372+
|-----------------------------------------------------------|---------------------------------------------------|
373+
| Swap-based evictions are based on API-initiated evictions | Also rely on kubelet-level memory-based evictions |
374+
375+
#### Phase 3
376+
377+
Upstream Kubernetes handles both swap and evictions.
378+
Swap provision handled by OpenShift.
363379

364380
### Drawbacks
365381
