Memory exhaustion test case takes an indeterminate amount of time to trigger Kernel panic. #11738

Merged
judyjoseph merged 2 commits into sonic-net:master from bmridul:memory_exhaustion
Mar 14, 2024

@bmridul
Contributor

@bmridul bmridul commented Feb 20, 2024

Description of PR

Summary:
Fixes #11737

Type of change

  • [x] Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012
  • 202205
  • 202305
  • 202311

Approach

What is the motivation for this PR?

How did you do it?

The time it takes for the kernel to raise an out-of-memory (OOM) condition and trigger the OOM killer does not seem to be deterministic in this test case. Once memory is exhausted, the node becomes very unresponsive because no new processes can be created. In most cases the test completes in 10 minutes; for some of the PIDs, however, it takes 20 or 30 minutes, or more.

The same issue seems to occur on Linux in other scenarios: https://unix.stackexchange.com/questions/373312/oom-killer-doesnt-work-properly-leads-to-a-frozen-os

The solution seems to be to disable swapping so that the kernel raises the OOM condition much faster.

https://askubuntu.com/questions/1188024/how-to-test-oom-killer-from-command-line
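
A minimal sketch of the approach described above, assuming the test issues a single shell command string on the DUT. The helper name `build_oom_command` and its parameters are hypothetical and are not taken from this PR's diff; `swapoff -a` and `tail /dev/zero` are standard Linux commands.

```python
def build_oom_command(disable_swap=True, delay_seconds=5):
    """Return a shell command that exhausts memory on the device under test.

    Hypothetical helper for illustration only; not the code from this PR.
    """
    # tail keeps buffering the stream from /dev/zero because it never sees a
    # newline, so its memory use grows until the kernel runs out of memory.
    exhaust = 'nohup bash -c "sleep {} && tail /dev/zero" &'.format(delay_seconds)
    if not disable_swap:
        return exhaust
    # With swap turned off the kernel cannot page anonymous memory out to
    # disk, so it is expected to reach the OOM condition sooner.
    return "sudo swapoff -a; " + exhaust
```

The short delay before `tail` starts gives the caller time to return from the remote command before the node becomes unresponsive.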

How did you verify/test it?

Ran the test case on a number of PIDs.

@bmridul bmridul requested a review from prgeor as a code owner February 20, 2024 19:46
Contributor

@gechiang gechiang left a comment

LGTM

@gechiang
Contributor

Did a little research on the history of the original test case:
259f540
I am copying its description here:

[test gap] Add memory exhaustion test (https://github.com/sonic-net/sonic-mgmt/pull/5260)
What is the motivation for this PR?
Fix https://github.com/sonic-net/sonic-mgmt/issues/3619 
Validate kernel will panic and reboot the DUT when kernel runs out of memory and hit oom event.

How did you do it?
Use tail /dev/zero to run out of memory completely and check the DUT reboot as expected.

How did you verify/test it?
Run tests on vSONiC and physical DUTs.

Signed-off-by: Zhijian Li <zhijianli@microsoft.com>

This indicates that the main purpose was to run the DUT out of memory and expect the kernel to hit an OOM event and reboot the DUT.

Given that the kernel tries very hard to "survive" when running low on memory, it is hard to kill even with "tail /dev/zero": the kernel "survives" by using more disk space in low-memory conditions, so it takes a lot longer to get killed. Some platforms may die quickly, while others may linger longer in a "slow death". This PR's change "assists" the condition by not allowing disk swap, which keeps the kernel from surviving longer, and I think it still fits what this test case is trying to verify. The alternative is to lengthen the time the test waits for the DUT to crash. I feel either method is a good way to address this flaky test case while ensuring that the test still meets its original intent. With this in mind, can the community please review and see if this change is OK to take in?
Thanks!
I feel that this change to modify the testcase to disable

@gechiang gechiang requested a review from lizhijianrd February 22, 2024 22:12
@judyjoseph
Contributor

@kenneth-arista @mlok-nokia please check on your platforms too.

@kenneth-arista
Contributor

Looks okay to me. Thanks @gechiang for the background.

@judyjoseph
Contributor

@bmridul, I agree with this approach for this test. Can we also revert the DUT to its original config when this test case is done?

Is there a way to check if the swap is on/off on this DUT to start with?

@gechiang
Contributor

@bmridul, I agree with this approach for this test. Can we also revert the DUT to its original config when this test case is done?

Is there a way to check if the swap is on/off on this DUT to start with?

There is no need to revert: the test causes the DUT to reboot, and the change won't stick after it comes back up, so we should be good.
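
On the question of checking whether swap is on or off to begin with, one possible sketch is to inspect the contents of `/proc/swaps`, which lists active swap areas on Linux. The function name `swap_is_enabled` is hypothetical, and the text passed in is assumed to be the output of `cat /proc/swaps` collected from the DUT.

```python
def swap_is_enabled(proc_swaps_text):
    """Return True if the given /proc/swaps content lists an active swap area.

    Hypothetical helper for illustration only.
    """
    entries = [
        line
        for line in proc_swaps_text.splitlines()
        # Skip blank lines and the column header (Filename, Type, Size, ...);
        # every remaining line describes one active swap device or file.
        if line.strip() and not line.startswith("Filename")
    ]
    return bool(entries)
```

A test could log this value before exhausting memory, so that a slow OOM on a given platform can be correlated with swap having been active.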

@judyjoseph judyjoseph merged commit 7764393 into sonic-net:master Mar 14, 2024
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Mar 14, 2024
…trigger Kernel panic. (sonic-net#11738)

* Turned swapping off so kernel catches OOM in a shorter time.
@mssonicbld
Collaborator

Cherry-pick PR to 202205: #12000

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Mar 22, 2024
…trigger Kernel panic. (sonic-net#11738)

* Turned swapping off so kernel catches OOM in a shorter time.
@mssonicbld
Collaborator

Cherry-pick PR to 202305: #12104

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Mar 22, 2024
…trigger Kernel panic. (sonic-net#11738)

* Turned swapping off so kernel catches OOM in a shorter time.
@mssonicbld
Collaborator

Cherry-pick PR to 202311: #12105

mssonicbld pushed a commit that referenced this pull request Mar 22, 2024
…trigger Kernel panic. (#11738)

* Turned swapping off so kernel catches OOM in a shorter time.
mssonicbld pushed a commit that referenced this pull request Mar 22, 2024
…trigger Kernel panic. (#11738)

* Turned swapping off so kernel catches OOM in a shorter time.