Commit 9783aa9

cdown authored and torvalds committed
mm, memcg: proportional memory.{low,min} reclaim
cgroup v2 introduces two memory protection thresholds: memory.low
(best-effort) and memory.min (hard protection). While they generally do
what they say on the tin, there is a limitation in their implementation
that makes them difficult to use effectively: cliff behaviour often
manifests when a cgroup becomes eligible for reclaim. This patch
implements more intuitive and usable behaviour, where we gradually mount
more reclaim pressure as cgroups further and further exceed their
protection thresholds.

This cliff edge behaviour happens because we only choose whether or not
to reclaim based on whether the memcg is within its protection limits
(see the use of mem_cgroup_protected in shrink_node), but we don't vary
our reclaim behaviour based on this information. Imagine the following
timeline, with the numbers being the lruvec size in this zone:

1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
   scanned. (?!)

* Of course, we won't usually scan all available pages in the zone even
  without this patch because of scan control priority, over-reclaim
  protection, etc. However, as shown by the tests at the end, these
  techniques don't sufficiently throttle such an extreme change in
  input, so cliff-like behaviour isn't really averted by their existence
  alone.

Here's an example of how this plays out in practice. At Facebook, we are
trying to protect various workloads from "system" software, like
configuration management tools, metric collectors, etc (see this[0] case
study). In order to find a suitable memory.low value, we start by
determining the expected memory range within which the workload will be
comfortable operating. This isn't an exact science -- memory usage
deemed "comfortable" will vary over time due to user behaviour,
differences in composition of work, etc. As such we need to ballpark
memory.low, but doing this is currently problematic:

1. If we end up setting it too low for the workload, it won't have *any*
   effect (see discussion above). The group will receive the full weight
   of reclaim and won't have any priority while competing with the less
   important system software, as if we had no memory.low configured at
   all.

2. Because of this behaviour, we end up erring on the side of setting it
   too high, such that the comfort range is reliably covered. However,
   protected memory is completely unavailable to the rest of the system,
   so we might cause undue memory and IO pressure there when we *know*
   we have some elasticity in the workload.

3. Even if we get the value totally right, smack in the middle of the
   comfort zone, we get extreme jumps between no pressure and full
   pressure that cause unpredictable pressure spikes in the workload due
   to the current binary reclaim behaviour.

With this patch, we can set it to our ballpark estimation without too
much worry. Any undesirable behaviour, such as too much or too little
reclaim pressure on the workload or system, will be proportional to how
far our estimation is off. This means we can set memory.low much more
conservatively and thus waste fewer resources *without* the risk of the
workload falling off a cliff if we overshoot.
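
To put rough numbers on the difference (this mirrors the arithmetic
added to get_scan_count() below, using the figures from the timeline
above): with proportional reclaim, step 3 no longer exposes the whole
lruvec. The first-pass scan target becomes approximately

  lruvec_size * memory.current / memory.low - lruvec_size
    = lruvec_size * (1000001/1000000 - 1)
    ≈ 1 page

rather than all 1000001 pages, and it grows smoothly as the overage
grows. (The implementation also floors the target at SWAP_CLUSTER_MAX
so reclaim always makes some forward progress.)
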
As a more abstract technical description, this unintuitive behaviour
results in having to give high-priority workloads a large protection
buffer on top of their expected usage to function reliably, as otherwise
we have abrupt periods of dramatically increased memory pressure which
hamper performance. Having to set these thresholds so high wastes
resources and generally works against the principle of work
conservation.

In addition, having proportional memory reclaim behaviour has other
benefits. Most notably, before this patch it's basically mandatory to
set memory.low to a higher than desirable value, because otherwise as
soon as you exceed memory.low, all protection is lost and all pages are
eligible to scan again. By contrast, having a gradual ramp in reclaim
pressure means that you still get some protection when thresholds are
exceeded, so one can be more comfortable setting memory.low to lower
values without worrying that all protection will be lost. This is
important because workingset size is really hard to know exactly,
especially with variable workloads, so at least getting *some*
protection if your workingset size grows larger than you expect
increases user confidence in setting memory.low without needing a huge
buffer on top.

Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
assistance in thinking about how to make this work better.

In testing these changes, I intended to verify that:

1. Changes in page scanning become gradual and proportional instead of
   binary.

   To test this, I experimented with stepping further and further down
   memory.low protection on a workload that floats around 19G workingset
   when under memory.low protection, watching page scan rates for the
   workload cgroup:

   +------------+-----------------+--------------------+--------------+
   | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
   +------------+-----------------+--------------------+--------------+
   |        21G |               0 |                  0 |          N/A |
   |        17G |             867 |               3799 |          23% |
   |        12G |            1203 |               3543 |          34% |
   |         8G |            2534 |               3979 |          64% |
   |         4G |            3980 |               4147 |          96% |
   |          0 |            3799 |               3980 |          95% |
   +------------+-----------------+--------------------+--------------+

   As you can see, the test kernel (a kernel containing this patch)
   ramps up page scanning significantly more gradually than the control
   kernel (without this patch).

2. More gradual ramp-up in reclaim aggression doesn't result in
   premature OOMs.

   To test this, I wrote a script that slowly increments the number of
   pages held by stress(1)'s --vm-keep mode until a production system
   entered severe overall memory contention. This script runs in a
   highly protected slice taking up the majority of available system
   memory. Watching vmstat revealed that page scanning continued
   essentially nominally between test and control, without causing
   forward reclaim progress to become arrested.

[0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project

[[email protected]: reflow block comments to fit in 80 cols]
[[email protected]: handle cgroup_disable=memory when getting memcg protection]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Down <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent 518a867 commit 9783aa9

File tree

4 files changed

+115
-12
lines changed

4 files changed

+115
-12
lines changed

Documentation/admin-guide/cgroup-v2.rst
Lines changed: 14 additions & 6 deletions

@@ -615,8 +615,8 @@ on an IO device and is an example of this type.
 Protections
 -----------
 
-A cgroup is protected to be allocated upto the configured amount of
-the resource if the usages of all its ancestors are under their
+A cgroup is protected upto the configured amount of the resource
+as long as the usages of all its ancestors are under their
 protected levels. Protections can be hard guarantees or best effort
 soft boundaries. Protections can also be over-committed in which case
 only upto the amount available to the parent is protected among
@@ -1096,7 +1096,10 @@ PAGE_SIZE multiple when read back.
 	is within its effective min boundary, the cgroup's memory
 	won't be reclaimed under any conditions. If there is no
 	unprotected reclaimable memory available, OOM killer
-	is invoked.
+	is invoked. Above the effective min boundary (or
+	effective low boundary if it is higher), pages are reclaimed
+	proportionally to the overage, reducing reclaim pressure for
+	smaller overages.
 
 	Effective min boundary is limited by memory.min values of
 	all ancestor cgroups. If there is memory.min overcommitment
@@ -1118,7 +1121,10 @@ PAGE_SIZE multiple when read back.
 	Best-effort memory protection. If the memory usage of a
 	cgroup is within its effective low boundary, the cgroup's
 	memory won't be reclaimed unless memory can be reclaimed
-	from unprotected cgroups.
+	from unprotected cgroups. Above the effective low boundary (or
+	effective min boundary if it is higher), pages are reclaimed
+	proportionally to the overage, reducing reclaim pressure for
+	smaller overages.
 
 	Effective low boundary is limited by memory.low values of
 	all ancestor cgroups. If there is memory.low overcommitment
@@ -2482,8 +2488,10 @@ system performance due to overreclaim, to the point where the feature
 becomes self-defeating.
 
 The memory.low boundary on the other hand is a top-down allocated
-reserve. A cgroup enjoys reclaim protection when it's within its low,
-which makes delegation of subtrees possible.
+reserve. A cgroup enjoys reclaim protection when it's within its
+effective low, which makes delegation of subtrees possible. It also
+enjoys having reclaim pressure proportional to its overage when
+above its effective low.
 
 The original high boundary, the hard limit, is defined as a strict
 limit that can not budge, even if the OOM killer has to be called.
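
As a concrete illustration of the proportional semantics documented
above (numbers chosen for illustration, not taken from the patch): a
cgroup with an effective memory.low of 10G that is currently using 13G
sits 30% above its protection, so on the first reclaim pass roughly 30%
of its LRU pages become eligible for scanning rather than all of them;
at 10.1G of usage only about 1% would be. The closer usage stays to the
protected amount, the gentler the reclaim pressure.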

include/linux/memcontrol.h
Lines changed: 20 additions & 0 deletions

@@ -356,6 +356,14 @@ static inline bool mem_cgroup_disabled(void)
 	return !cgroup_subsys_enabled(memory_cgrp_subsys);
 }
 
+static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	return max(READ_ONCE(memcg->memory.emin), READ_ONCE(memcg->memory.elow));
+}
+
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg);
 
@@ -537,6 +545,8 @@ void mem_cgroup_handle_over_high(void);
 
 unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
 
+unsigned long mem_cgroup_size(struct mem_cgroup *memcg);
+
 void mem_cgroup_print_oom_context(struct mem_cgroup *memcg,
 				  struct task_struct *p);
 
@@ -829,6 +839,11 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 {
 }
 
+static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline enum mem_cgroup_protection mem_cgroup_protected(
 	struct mem_cgroup *root, struct mem_cgroup *memcg)
 {
@@ -968,6 +983,11 @@ static inline unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg)
 	return 0;
 }
 
+static inline unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline void
 mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p)
 {
mm/memcontrol.c
Lines changed: 5 additions & 0 deletions

@@ -1567,6 +1567,11 @@ unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg)
 	return max;
 }
 
+unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
+{
+	return page_counter_read(&memcg->memory);
+}
+
 static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				     int order)
 {
mm/vmscan.c
Lines changed: 76 additions & 6 deletions

@@ -2459,17 +2459,80 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	*lru_pages = 0;
 	for_each_evictable_lru(lru) {
 		int file = is_file_lru(lru);
-		unsigned long size;
+		unsigned long lruvec_size;
 		unsigned long scan;
+		unsigned long protection;
+
+		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
+		protection = mem_cgroup_protection(memcg);
+
+		if (protection > 0) {
+			/*
+			 * Scale a cgroup's reclaim pressure by proportioning
+			 * its current usage to its memory.low or memory.min
+			 * setting.
+			 *
+			 * This is important, as otherwise scanning aggression
+			 * becomes extremely binary -- from nothing as we
+			 * approach the memory protection threshold, to totally
+			 * nominal as we exceed it. This results in requiring
+			 * setting extremely liberal protection thresholds. It
+			 * also means we simply get no protection at all if we
+			 * set it too low, which is not ideal.
+			 */
+			unsigned long cgroup_size = mem_cgroup_size(memcg);
+			unsigned long baseline = 0;
+
+			/*
+			 * During the reclaim first pass, we only consider
+			 * cgroups in excess of their protection setting, but if
+			 * that doesn't produce free pages, we come back for a
+			 * second pass where we reclaim from all groups.
+			 *
+			 * To maintain fairness in both cases, the first pass
+			 * targets groups in proportion to their overage, and
+			 * the second pass targets groups in proportion to their
+			 * protection utilization.
+			 *
+			 * So on the first pass, a group whose size is 130% of
+			 * its protection will be targeted at 30% of its size.
+			 * On the second pass, a group whose size is at 40% of
+			 * its protection will be
+			 * targeted at 40% of its size.
+			 */
+			if (!sc->memcg_low_reclaim)
+				baseline = lruvec_size;
+			scan = lruvec_size * cgroup_size / protection - baseline;
+
+			/*
+			 * Don't allow the scan target to exceed the lruvec
+			 * size, which otherwise could happen if we have >200%
+			 * overage in the normal case, or >100% overage when
+			 * sc->memcg_low_reclaim is set.
+			 *
+			 * This is important because other cgroups without
+			 * memory.low have their scan target initially set to
+			 * their lruvec size, so allowing values >100% of the
+			 * lruvec size here could result in penalising cgroups
+			 * with memory.low set even *more* than their peers in
+			 * some cases in the case of large overages.
+			 *
+			 * Also, minimally target SWAP_CLUSTER_MAX pages to keep
+			 * reclaim moving forwards.
+			 */
+			scan = clamp(scan, SWAP_CLUSTER_MAX, lruvec_size);
+		} else {
+			scan = lruvec_size;
+		}
+
+		scan >>= sc->priority;
 
-		size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
-		scan = size >> sc->priority;
 		/*
 		 * If the cgroup's already been deleted, make sure to
 		 * scrape out the remaining cache.
 		 */
 		if (!scan && !mem_cgroup_online(memcg))
-			scan = min(size, SWAP_CLUSTER_MAX);
+			scan = min(lruvec_size, SWAP_CLUSTER_MAX);
 
 		switch (scan_balance) {
 		case SCAN_EQUAL:
@@ -2489,7 +2552,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 		case SCAN_ANON:
 			/* Scan one type exclusively */
 			if ((scan_balance == SCAN_FILE) != file) {
-				size = 0;
+				lruvec_size = 0;
 				scan = 0;
 			}
 			break;
@@ -2498,7 +2561,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 			BUG();
 		}
 
-		*lru_pages += size;
+		*lru_pages += lruvec_size;
 		nr[lru] = scan;
 	}
 }
@@ -2742,6 +2805,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			memcg_memory_event(memcg, MEMCG_LOW);
 			break;
 		case MEMCG_PROT_NONE:
+			/*
+			 * All protection thresholds breached. We may
+			 * still choose to vary the scan pressure
+			 * applied based on by how much the cgroup in
+			 * question has exceeded its protection
+			 * thresholds (see get_scan_count).
+			 */
 			break;
 		}
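
For readers who want to experiment with the arithmetic outside the
kernel, here is a minimal userspace sketch of the scan-target
calculation introduced above. It is illustrative only: the names mirror
the patch but this is not kernel code, SWAP_CLUSTER_MAX is hard-coded to
its usual value, a 64-bit unsigned long is assumed, and the priority
shift and LRU balancing that follow in the real get_scan_count() are
omitted.

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL	/* assumed; matches the kernel's usual definition */

static unsigned long clamp_ul(unsigned long val, unsigned long lo, unsigned long hi)
{
	return val < lo ? lo : (val > hi ? hi : val);
}

/*
 * Sketch of the proportional scan target: on the first pass
 * (memcg_low_reclaim == 0) pressure is proportional to the overage, on
 * the second pass to protection utilization. As in the patch, the first
 * pass is only reached for cgroups already above their protection. The
 * result is clamped to [SWAP_CLUSTER_MAX, lruvec_size].
 */
static unsigned long scan_target(unsigned long lruvec_size,
				 unsigned long cgroup_size,
				 unsigned long protection,
				 int memcg_low_reclaim)
{
	unsigned long baseline = memcg_low_reclaim ? 0 : lruvec_size;
	unsigned long scan;

	if (!protection)
		return lruvec_size;	/* unprotected cgroup: full pressure */

	scan = lruvec_size * cgroup_size / protection - baseline;
	return clamp_ul(scan, SWAP_CLUSTER_MAX, lruvec_size);
}

int main(void)
{
	unsigned long lruvec_size = 100000;	/* pages, illustrative */
	unsigned long protection = 1000000;	/* max(emin, elow), illustrative */

	/* 130% of protection, first pass: ~30% of the lruvec (30000). */
	printf("%lu\n", scan_target(lruvec_size, 1300000, protection, 0));
	/* Same cgroup, second pass: 130% raw, clamped to the lruvec (100000). */
	printf("%lu\n", scan_target(lruvec_size, 1300000, protection, 1));
	/* Tiny overage: floored at SWAP_CLUSTER_MAX (32) to keep reclaim moving. */
	printf("%lu\n", scan_target(lruvec_size, 1000001, protection, 0));
	return 0;
}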
