Memory exhaustion test case takes an indeterminate amount of time to trigger Kernel panic.#11738
|
Did a little research on the history of this original test case, which indicates that its main purpose was to run the DUT out of memory and expect the kernel to hit an OOM event that reboots the DUT. Given that the kernel tries very hard to "survive" when running low on memory, it becomes harder to kill even with "tail /dev/zero": in a low-memory condition the kernel falls back on disk space (swap) and takes a lot longer to get killed. This may happen quickly on some platforms, while others linger longer in a "slow death". This PR "assists" the platform by disallowing disk swap, so the kernel cannot survive as long, which I think still fits what this test case is trying to verify. The alternative is to lengthen the time we wait for the DUT to crash. I feel either method is a good way to address this flaky test case while ensuring that the test still meets its original intent. With this in mind, can the community please review and see if this change is ok to take in? |
|
@kenneth-arista @mlok-nokia please check in platforms too. |
|
Looks okay to me. Thanks @gechiang for the background |
|
@bmridul, I agree with this approach for this test. Can we also revert the DUT to its original config when this test case is done? Is there a way to check whether swap is on or off on this DUT to start with? |
There is no need to revert, as this will cause the DUT to reboot and the change won't stick after it comes back up, so we should be good. |
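On the question of checking whether swap is on to start with: on Linux, `/proc/swaps` lists one line per active swap area below a header line, so an empty table means swap is off. A minimal sketch of that check follows; the helper names `swap_devices` and `swap_is_enabled` are hypothetical and are not part of this PR or of the test framework.

```python
def swap_devices(proc_swaps_text):
    """Return the names of the active swap areas found in /proc/swaps content."""
    lines = proc_swaps_text.strip().splitlines()
    # The first line is the column header; each later line is one active swap area.
    return [line.split()[0] for line in lines[1:] if line.strip()]


def swap_is_enabled(proc_swaps_text):
    """True if at least one swap area is active."""
    return len(swap_devices(proc_swaps_text)) > 0


if __name__ == "__main__":
    # Reads the local machine's swap table; on a DUT the same text would have
    # to be fetched remotely (for example with `cat /proc/swaps`).
    with open("/proc/swaps") as f:
        print("swap enabled:", swap_is_enabled(f.read()))
```

Because the DUT reboots at the end of the test and `swapoff` does not persist across reboots, a check like this would only be informational.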
…trigger Kernel panic. (sonic-net#11738) * Turned swapping off so kernel catches OOM in a shorter time.
|
Cherry-pick PR to 202205: #12000 |
…trigger Kernel panic. (#11738) * Turned swapping off so kernel catches OOM in a shorter time.
|
Cherry-pick PR to 202305: #12104 |
…trigger Kernel panic. (sonic-net#11738) * Turned swapping off so kernel catches OOM in a shorter time.
|
Cherry-pick PR to 202311: #12105 |
…trigger Kernel panic. (#11738) * Turned swapping off so kernel catches OOM in a shorter time.
Description of PR
Summary:
Fixes #11737
Type of change
Back port request
Approach
What is the motivation for this PR?
How did you do it?
It seems the time it takes for the kernel to raise an out-of-memory condition and trigger the OOM killer is not very deterministic in this test case. Once memory is exhausted, the node becomes very unresponsive because no new processes can be created. In most cases the test completes in 10 minutes; however, for some of the PIDs the test takes 20 or 30 minutes, or more.
The same issue appears to be seen on Linux in other scenarios - https://unix.stackexchange.com/questions/373312/oom-killer-doesnt-work-properly-leads-to-a-frozen-os
The solution seems to be to disable swapping so that the kernel raises the OOM condition much faster.
https://askubuntu.com/questions/1188024/how-to-test-oom-killer-from-command-line
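As a rough illustration of the approach described above, the sketch below builds a shell command that turns swap off with `swapoff -a` and then starts the `tail /dev/zero` memory hog in the background. The function name `build_exhaustion_command`, its parameters, and the 5-second delay are assumptions made for this example; the exact command string in the test may differ.

```python
import shlex


def build_exhaustion_command(disable_swap=True, delay_s=5):
    """Build the shell command that exhausts memory on the DUT.

    Hypothetical helper for illustration only.
    """
    # `tail /dev/zero` buffers an endless line of zero bytes and so keeps
    # allocating memory; the short sleep lets the calling session return first.
    hog = "sleep {} && tail /dev/zero".format(int(delay_s))
    background = "nohup bash -c {} &".format(shlex.quote(hog))
    if disable_swap:
        # With swap off the kernel cannot page out to disk, so it reaches the
        # OOM condition sooner instead of thrashing.
        return "sudo swapoff -a; " + background
    return background


if __name__ == "__main__":
    print(build_exhaustion_command())
```

Keeping `disable_swap` as a flag makes it easy to compare how long a platform takes to hit OOM with and without swap.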
How did you verify/test it?
Ran the test case on a number of PIDs.