@@ -26,7 +26,7 @@ status: implementable
2626
2727Fit more workloads onto a given node - achieve a higher workload
2828density - by overcommitting its memory resources. Due to timeline
29- needs a two-phased approach is considered.
29+ needs, a multi-phased approach is considered.
3030
3131## Motivation
3232
@@ -67,9 +67,6 @@ memory utilization per node, in order to reduce the cost per virtual machine.
6767* Fit more virtual machines onto a node once higher workload density
6868 is enabled
6969* Integrate well with [KSM] and [FPR]
70- * **Technology Preview** - Enable higher density at all, limited
71- support for stressed clusters
72- * **General Availability** - Improve handling of stressed clusters
7370
7471#### Usability
7572
@@ -108,7 +105,7 @@ We expect to mitigate the following situations
108105
109106#### Scope
110107
111- Memory over-committment, and as such swapping, will be initially limited to
108+ Memory over-commitment, and as such swapping, will be initially limited to
112109virtual machines running in the burstable QoS class.
113110Virtual machines in the guaranteed QoS classes are not getting over
114111committed due to alignment with upstream Kubernetes. Virtual machines
@@ -163,7 +160,7 @@ virtual machine in a cluster.
163160 kubelet configuration via a `KubeletConfig` CR, in order to ensure
164161 that the kubelet will start once swap has been rolled out.
165162 a. The cluster admin is calculating the amount of swap space to
166- provision based on the amount of physical ram and overcommittment
163+ provision based on the amount of physical RAM and overcommitment
167164 ratio
168165 b. The cluster admin is creating a `MachineConfig` for provisioning
169166 swap on worker nodes
@@ -177,13 +174,15 @@ virtual machine in a cluster.
177174
178175The cluster is now set up for higher workload density.
179176
177+ In Phase 3, deploying the WASP agent will not be needed.
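
As a rough illustration of the swap sizing step in the workflow above, the following Python sketch derives the swap size from the physical RAM and the overcommitment ratio. The sizing rule (swap covers the memory promised beyond physical RAM) and the names `swap_size_bytes` and `overcommit_percent` are assumptions made for this example; the proposal itself does not prescribe them.

```python
GiB = 1024 ** 3


def swap_size_bytes(node_ram_bytes: int, overcommit_percent: int) -> int:
    """Return the swap size under the assumed sizing rule.

    Assumption: the overcommitment ratio is given as a percentage of
    physical RAM (150 means 1.5x), and swap backs the part above 100%.
    """
    if overcommit_percent < 100:
        raise ValueError("overcommit_percent must be at least 100")
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return node_ram_bytes * (overcommit_percent - 100) // 100


# Hypothetical node: 16 GiB of RAM, overcommitted to 150%.
print(swap_size_bytes(16 * GiB, 150) // GiB)  # prints 8
```

Under this assumed rule, a ratio of 100% yields no swap at all, which matches a node that is not overcommitted.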
178+
180179#### Workflow: Leveraging higher workload density
181180
1821811. The VM Owner is creating a regular virtual machine and is launching it.
183182
184183### API Extensions
185184
186- Phase 1 does not require any Kubernetes, OpenShift, or OpenShift
185+ This proposal does not require any Kubernetes, OpenShift, or OpenShift
187186Virtualization API changes.
188187
189188### Topology Considerations
@@ -195,7 +194,7 @@ not provide the `MachineConfig` APIs.
195194
196195#### Standalone Clusters
197196
198- Standalone regular, and compact clusters are the primary use-cases for
197+ Standalone, regular and compact clusters are the primary use-cases for
199198swap.
200199
201200#### Single-node Deployments or MicroShift
@@ -228,8 +227,16 @@ The design is driven by the following guiding principles:
228227An OCI Hook to enable swap by setting the container's cgroup
229228`memory.swap.max=max`.
230229
231- * **Technology Preview** - Limited to virt launcher pods
232- * **General Availability** - Limited to burstable QoS class pods
230+ * **Technology Preview**
231+   * Limited to virt launcher pods.
232+   * Uses `UnlimitedSwap`.
233+ * **General Availability**
234+   * Limited to burstable QoS class pods.
235+   * Uses `LimitedSwap`.
236+   * Limited to non-high-priority pods.
237+
238+ For more info, refer to the upstream documentation on how to calculate
239+ [limited swap](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap#steps-to-calculate-swap-limit).
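
The upstream calculation referenced above can be sketched in Python as follows. This is a minimal illustration that assumes the proportional formula described in the linked KEP (a container's swap limit is its memory request divided by the node's total memory, multiplied by the swap available to pods). The function and parameter names are hypothetical and not part of any Kubernetes API.

```python
GiB = 1024 ** 3


def limited_swap_bytes(container_memory_request: int,
                       node_total_memory: int,
                       total_pods_swap_available: int) -> int:
    """Swap limit for one container under the assumed proportional rule.

    Each container gets a share of the swap available to pods that is
    proportional to its memory request relative to the node's total memory.
    """
    if node_total_memory <= 0:
        raise ValueError("node_total_memory must be positive")
    # Multiply first, then floor-divide, to stay in exact integer arithmetic.
    return container_memory_request * total_pods_swap_available // node_total_memory


# Hypothetical node: 64 GiB of RAM, 32 GiB of swap; container requests 4 GiB.
print(limited_swap_bytes(4 * GiB, 64 * GiB, 32 * GiB) // GiB)  # prints 2
```

A container without a memory request would get a limit of zero under this rule, so it would not swap.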
233240
234241###### Provisioning swap
235242
@@ -272,15 +279,17 @@ This is, because by default, no other slice is configured to have
272279
273280###### Critical workload protection
274281
275- Even critical pod workloads are run in burstable QoS class pods, thus
276- at ** General Availability** time they will be eligible to swap.
282+ Even critical pod workloads are run in burstable QoS class pods.
277283However, swapping can lead to increased latencies and response times.
278284For example, if a critical pod depends on `LivenessProbe`s, then
279285these checks can start to fail once the pod starts to swap.
280286
281287This is undesirable and can put a node or a cluster (i.e. if a critical
282288Operator is affected) at risk.
283289
290+ Therefore, at **General Availability** time, critical pods will not be eligible to swap.
291+ This is aligned with the upstream behavior.
292+
284293In order to prevent this problem, swap will be selectively disabled
285294for pods using the two well-known [critical `priorityClass`es]:
286295
@@ -299,23 +308,21 @@ Dealing with memory pressure on a node is differentiating the TP from GA.
299308 * Pro
300309 * Simple to achieve.
301310 * Con
302- * A lot of memory pressure has ot be present in order to trigger
311+ * A lot of memory pressure has to be present in order to trigger
303312 soft eviction.
313+ * Once `memory.high` is reached, the whole `kubepods.slice` is throttled
314+ and cannot allocate memory, which might lead to applications crashing.
304315
305- * **General Availability** - Memory based soft and hard eviction is going to
306- be disabled, in favor of enabling swap based hard evictions, based on new
316+ * **General Availability** - Memory-based soft eviction is going to
317+ be disabled, in favor of enabling swap-based hard evictions, based on new
307318 swap traffic and swap utilization eviction metrics.
308319
309320 * Pro
310- * Simple mental model. With memory only, memory eviction is used.
311- With swap, swap eviction is used.
321+ * Eviction on the basis of swap pressure, not only memory pressure.
312322 * [LLN] applies, because all pods share the node's memory
313323 * Con
314- * If there are no burstable QoS pods on a node, then no swapping
315- can take place, and no swap related signal will be triggered.
316- Only way to remove pressure is cgroup level OOM.
317- This is considered to be an edge case and highly unlikely.
318- Prometheus alerts for this edge case will be added.
324+ * Swap-based evictions are made through a third-party container, which means
325+ they have to be done through API-initiated eviction.
319326
320327###### Node memory reduction
321328
@@ -340,14 +347,14 @@ The mechanisms are:
340347
341348##### Differences between Technology Preview vs GA
342349
343- | | TP | GA |
344- | ------------------------------| ---------------| --------------------|
345- | SWAP Provisioning | MachineConfig | MachineConfig |
346- | SWAP Eligibility | VM pods | burstable QoS pods |
347- | Node service protection | Yes | Yes |
348- | I/O saturation protection | Yes | Yes |
349- | Critical workload protection | No | Yes |
350- | Memory pressure handling | Memory based | Swap based |
350+ | | TP | GA |
351+ | ------------------------------| ---------------| --------------------- |
352+ | SWAP Provisioning | MachineConfig | MachineConfig |
353+ | SWAP Eligibility | VM pods | burstable QoS pods |
354+ | Node service protection | Yes | Yes |
355+ | I/O saturation protection | Yes | Yes |
356+ | Critical workload protection | No | Yes |
357+ | Memory pressure handling | Memory based | Memory & Swap based |
351358
352359### Risks and Mitigations
353360
@@ -359,7 +366,16 @@ The mechanisms are:
359366
360367#### Phase 2
361368
362- Handled by upstream Kubernetes.
369+ Swap is handled by upstream Kubernetes.
370+
371+ | Risk | Mitigation |
372+ | -----------------------------------------------------------| ---------------------------------------------------|
373+ | Swap-based evictions are based on API-initiated evictions | Also rely on kubelet-level memory-based evictions |
374+
375+ #### Phase 3
376+
377+ Upstream Kubernetes handles both swap and evictions.
378+ Swap provisioning is handled by OpenShift.
363379
364380### Drawbacks
365381