
pick memory_utilization related commits#702

Merged
Pterosaur merged 9 commits into Azure:202412 from lipxu:20250912_msft_memory
Sep 15, 2025

Conversation

@lipxu commented Sep 12, 2025

No description provided.

lipxu and others added 9 commits September 12, 2025 11:58
What is the motivation for this PR?
Add the memory threshold

How did you do it?
Add the initial memory threshold; it will need to be adjusted based on nightly test results.

How did you verify/test it?
Run nightly pipeline
Approach
What is the motivation for this PR?
The memory utilization plugin operates at the per-test-case level: it uses a pytest hook to collect memory usage before and after each test case. It then calculates the memory diff and compares it against a predefined threshold; if the difference exceeds the threshold, a pytest failure is triggered.

Because the hook is registered with "@pytest.hookimpl(tryfirst=True)", it executes before all the teardown fixtures. This means that if the memory check fails, the hook raises an exception immediately, interrupting the teardown process; as a result, the next test case fails with the following error message:

AssertionError: previous item was not torn down properly

How did you do it?
Do not raise the failure directly in the hook. Instead, save all the results and report any failures in the test case's own teardown fixture, so other test cases are not affected.
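The deferred-reporting pattern described above can be sketched as follows. All names here (check_memory, collected_failures, report_memory_failures) are illustrative assumptions, not the actual plugin's identifiers, and the memory check itself is stubbed out.

```python
import pytest

# Hypothetical store for failures detected in the hook; keyed by test nodeid.
collected_failures = {}


def check_memory(nodeid):
    # Placeholder for the real before/after memory comparison;
    # returns a list of human-readable failure messages.
    return ["memory usage 70.0 MB exceeds high threshold 16.0 MB"]


@pytest.hookimpl(tryfirst=True)
def pytest_runtest_teardown(item):
    # Record failures instead of calling pytest.fail() here, which
    # would abort the remaining teardown fixtures and break the
    # next test with "previous item was not torn down properly".
    failures = check_memory(item.nodeid)
    if failures:
        collected_failures[item.nodeid] = failures


@pytest.fixture(autouse=True)
def report_memory_failures(request):
    yield
    # Runs during this test's own teardown, so failing here does not
    # interrupt other fixtures or leak into the next test case.
    failures = collected_failures.pop(request.node.nodeid, None)
    if failures:
        pytest.fail("\n".join(failures))
```

With this split, the tryfirst hook only records results, and the autouse fixture converts them into a failure attributed to the right test.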

How did you verify/test it?
Run elastic test
https://elastictest.org/scheduler/testplan/684004dba52da0ec6421c4ad?testcase=ecmp%2Ftest_ecmp_sai_value.py&type=console

Co-authored-by: Liping Xu <108326363+lipxu@users.noreply.github.com>
What is the motivation for this PR?
There are many memory-above-threshold alarms in the nightly tests.

How did you do it?
Update the FRR memory thresholds and make the alarms more readable.

memory_increase_threshold: FRR has its own memory management system and does not return memory to the system immediately, so increase the thresholds:
1: top:zebra: update from 64 to 128 MB
2: frr_bgp: update from 32 to 64 MB
3: frr_zebra: update from 16 to 64 MB

memory_high_threshold: FRR bgp memory usage scales with the number of neighbors, so increase the threshold. In the future, the threshold should be set according to the neighbor count.
1: frr_bgp: update from 128 to 256 MB
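The updates above can be summarized as a config sketch. The dict and key names are illustrative, not the plugin's actual schema; only the numeric values come from this PR.

```python
# Hypothetical summary of the updated FRR thresholds (values in MB);
# key names are illustrative, not the plugin's real config schema.
memory_increase_threshold = {
    "top:zebra": 128,  # was 64
    "frr_bgp": 64,     # was 32
    "frr_zebra": 64,   # was 16
}
memory_high_threshold = {
    "frr_bgp": 256,    # was 128
}
```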

How did you verify/test it?
Run nightly test
https://elastictest.org/scheduler/testplan/685ac58d2461750d1f5a11c9
What is the motivation for this PR?
Tests failed on teardown with: Failed: [ALARM]: monit:memory_usage, Previous memory usage 74.8 MB exceeds high threshold 70.0 MB (previous: 74.8 MB, current: 74.8 MB)

How did you do it?
Enhance the plugin by adding a new threshold type, percentage_points. The three types are:

"type": "value": absolute values in MB.
Example: {"type": "value", "value": 128} means 128 MB.
Used for: top, free, frr_memory commands that return memory in megabytes.

"type": "percentage": relative percentage of the baseline value.
Example: {"type": "percentage", "value": "10%"} means 10% of current memory usage.
Calculation: if the baseline is 100 MB, the threshold becomes 10 MB.
Used for: dynamic thresholds that scale with current usage.

"type": "percentage_points": absolute percentage values.
Example: {"type": "percentage_points", "value": 75} means 75%.
Used for: monit, docker stats commands that return percentage data.
For increases: {"type": "percentage_points", "value": 10} means 10 percentage points (e.g., from 70% to 80%).
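The three types above can be sketched as a single resolver. The function name and signature are assumptions for illustration, not the plugin's actual API; the arithmetic matches the examples given.

```python
# Illustrative sketch of resolving the three threshold types;
# resolve_threshold is a hypothetical name, not the plugin's API.
def resolve_threshold(config, baseline):
    """Return the effective threshold for one measurement.

    config   -- e.g. {"type": "percentage", "value": "10%"}
    baseline -- current usage: MB for "value"/"percentage" sources,
                percent for "percentage_points" sources
    """
    ttype = config["type"]
    if ttype == "value":
        # Absolute MB value, used as-is (top, free, frr_memory).
        return float(config["value"])
    if ttype == "percentage":
        # Relative to the baseline: "10%" of 100 MB -> 10 MB.
        pct = float(str(config["value"]).rstrip("%"))
        return baseline * pct / 100.0
    if ttype == "percentage_points":
        # Absolute percentage, compared directly to percent
        # readings (monit, docker stats).
        return float(config["value"])
    raise ValueError(f"unknown threshold type: {ttype}")
```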
How did you verify/test it?
Temporarily hacked the thresholds down to force the alarms:

>               pytest.fail(failure_message)
E               Failed: [ALARM]: monit:memory_usage, Previous memory usage 50.4% exceeds high threshold 40% (previous: 50.4%, current: 50.1%)
E               [ALARM]: monit:memory_usage, Current memory usage 50.1% exceeds high threshold 40% (previous: 50.4%, current: 50.1%)
E               [ALARM]: docker:database, Previous memory usage 1.6% exceeds high threshold 1% (previous: 1.6%, current: 1.6%)
E               [ALARM]: docker:database, Current memory usage 1.6% exceeds high threshold 1% (previous: 1.6%, current: 1.6%)
E               [ALARM]: frr_bgp:used, Previous memory usage 70.0 MB exceeds high threshold 16.0 MB (previous: 70.0 MB, current: 70.0 MB)
E               [ALARM]: frr_bgp:used, Current memory usage 70.0 MB exceeds high threshold 16.0 MB (previous: 70.0 MB, current: 70.0 MB)
E               [ALARM]: frr_zebra:used, Previous memory usage 17.0 MB exceeds high threshold 16.0 MB (previous: 17.0 MB, current: 17.0 MB)
E               [ALARM]: frr_zebra:used, Current memory usage 17.0 MB exceeds high threshold 16.0 MB (previous: 17.0 MB, current: 17.0 MB)
Run elastic
https://elastictest.org/scheduler/testplan/68804464edf1bbac5171814b
…d (#19786)

In one pytest session, if all test cases are skipped, the teardown is not executed; if some test cases are not skipped, the teardown still runs for the skipped ones.
What is the motivation for this PR?
disk/test_disk_exhaustion.py creates a 1.7 GB file during the test and deletes it at the end.
But "monit status" is configured in /etc/monit/monitrc to check only once every 60 seconds.
This returns stale data, causing the memory high threshold to be breached.

How did you do it?
We should use "monit validate" instead of "monit status", since "monit validate" forces the checks to run immediately rather than reporting the last cached poll.

How did you verify/test it?
Verified by running the test.
Description of PR
Summary:

Fix the following error in pretest:

>               pytest.fail(failure_message)
E               Failed: [ALARM]: frr_bgp:used, Previous memory usage 273.0 MB exceeds high threshold 256.0 MB (previous: 273.0 MB, current: 273.0 MB)
E               [ALARM]: frr_bgp:used, Current memory usage 273.0 MB exceeds high threshold 256.0 MB (previous: 273.0 MB, current: 273.0 MB)
This is a new memory checking feature in 202505. Therefore, the fix is not applicable to 202412 branch.

A TH5 testbed may have 32 neighbors, each with 6400 prefixes; the total memory after deploy-mg is around 260-300 MB. Increase the threshold to 384 MB.

Signed-off-by: jianquanye@microsoft.com
Description of PR
Summary:
On Arista-7060X6-16PE-384C-B-t0-isolated-d96u32s2, bgp/test_bgp_gr_helper.py has high but expected memory usage. The memory increase threshold is causing the test to fail.

Relax the memory increase threshold.
Fixes # (issue)

Type of change
 Bug fix
 Testbed and Framework(new/improvement)
 New Test case
 Skipped for non-supported platforms
 Test case improvement

Approach
What is the motivation for this PR?
The memory increase threshold is causing the test to fail even though actual usage is below memory_high_threshold.

How did you do it?
Relax the memory increase threshold by value. Before the test, 191.0 MB is in use; during the test, peak usage is 316.0 MB, an increase of 125.0 MB.

The increase-threshold config fields dictate two different limits on the maximum allowed increase: an increase of X% of current usage, or an increase of Y MB over current usage. The higher of the two is used.

Without the changes in this review, the allowed increase would be max((50% * 191.0 MB), 64 MB) = 95.5 MB, which the observed 125.0 MB increase exceeds.

Before the test, actual memory usage is nowhere near memory_high_threshold (384 MB). Relaxing the increase threshold by percentage could over-relax the check for other tests, so the value limit is raised instead.
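The max-of-two-limits rule can be sketched as a small helper. The function name is illustrative, not the plugin's actual API; the numbers come from this PR description.

```python
# Sketch of the "higher of the two limits" rule: the allowed memory
# increase is the larger of X% of current usage or a fixed Y MB.
def allowed_increase(current_mb, pct_limit, value_limit_mb):
    return max(current_mb * pct_limit / 100.0, value_limit_mb)


before_mb = 191.0  # usage before the test (from the PR text)
peak_mb = 316.0    # peak usage during the test

# With the pre-change limits (50%, 64 MB) the check fails:
limit_mb = allowed_increase(before_mb, 50, 64)  # max(95.5, 64) = 95.5
fails = (peak_mb - before_mb) > limit_mb        # 125.0 > 95.5 -> True
```

Raising the Y MB value limit widens the allowance only by a fixed amount, whereas raising X% would scale the allowance with usage on every testbed.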

How did you verify/test it?
Test no longer fails on
@yutongzhang-microsoft (Contributor)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@Pterosaur Pterosaur merged commit c7b3b09 into Azure:202412 Sep 15, 2025
6 of 14 checks passed
Pterosaur added a commit that referenced this pull request Sep 20, 2025
Pterosaur added a commit that referenced this pull request Sep 20, 2025
8 participants