
pick memory_utilization related commits#702

Merged
Pterosaur merged 9 commits into Azure:202412 from lipxu:20250912_msft_memory
Sep 15, 2025

Conversation

@lipxu commented Sep 12, 2025

No description provided.

lipxu and others added 9 commits September 12, 2025 11:58
What is the motivation for this PR?
Add the memory threshold

How did you do it?
Add the initial memory threshold; it will need to be adjusted based on nightly test results.

How did you verify/test it?
Run nightly pipeline
Approach
What is the motivation for this PR?
The memory utilization plugin operates at the per-test-case level: it uses a pytest hook to collect memory usage before and after each test case. It then calculates the memory diff and compares it against a predefined threshold; if the difference exceeds the threshold, a pytest failure is triggered.

Because the hook is registered with "@pytest.hookimpl(tryfirst=True)", it executes before all the teardown fixtures. This means that if the memory check fails, the hook raises an exception immediately, interrupting the teardown process; as a result, the next test case fails with the following error message:

AssertionError: previous item was not torn down properly

How did you do it?
Do not raise the failure directly in the hook. Instead, save all the results and report any failures in the test case's own teardown fixture, so other test cases are not affected.
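The deferred-reporting pattern described above can be sketched as follows. All names here (check_memory, collected_failures, report_memory_failures) are illustrative assumptions, not the actual plugin's identifiers, and the memory check itself is stubbed out.

```python
import pytest

# Hypothetical store for failures detected in the hook; keyed by test nodeid.
collected_failures = {}


def check_memory(nodeid):
    # Placeholder for the real before/after memory comparison;
    # returns a list of human-readable failure messages.
    return ["memory usage 70.0 MB exceeds high threshold 16.0 MB"]


@pytest.hookimpl(tryfirst=True)
def pytest_runtest_teardown(item):
    # Record failures instead of calling pytest.fail() here, which
    # would abort the remaining teardown fixtures and break the
    # next test with "previous item was not torn down properly".
    failures = check_memory(item.nodeid)
    if failures:
        collected_failures[item.nodeid] = failures


@pytest.fixture(autouse=True)
def report_memory_failures(request):
    yield
    # Runs during this test's own teardown, so failing here does not
    # interrupt other fixtures or leak into the next test case.
    failures = collected_failures.pop(request.node.nodeid, None)
    if failures:
        pytest.fail("\n".join(failures))
```

With this split, the tryfirst hook only records results, and the autouse fixture converts them into a failure attributed to the right test.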

How did you verify/test it?
Run elastic test
https://elastictest.org/scheduler/testplan/684004dba52da0ec6421c4ad?testcase=ecmp%2Ftest_ecmp_sai_value.py&type=console

Co-authored-by: Liping Xu <108326363+lipxu@users.noreply.github.com>
What is the motivation for this PR?
There are many memory-above-threshold alarms in the nightly tests.

How did you do it?
Update the FRR memory thresholds and make the alarms more readable.

memory_increase_threshold: FRR has its own memory management system and does not return memory to the system immediately, so increase the thresholds:
1: top:zebra: update from 64 to 128 MB
2: frr_bgp: update from 32 to 64 MB
3: frr_zebra: update from 16 to 64 MB

memory_high_threshold: FRR bgp memory usage scales with the number of neighbors, so increase the threshold. In the future, the threshold should be set according to the neighbor count.
1: frr_bgp: update from 128 to 256 MB
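The updates above can be summarized as a config sketch. The dict and key names are illustrative, not the plugin's actual schema; only the numeric values come from this PR.

```python
# Hypothetical summary of the updated FRR thresholds (values in MB);
# key names are illustrative, not the plugin's real config schema.
memory_increase_threshold = {
    "top:zebra": 128,  # was 64
    "frr_bgp": 64,     # was 32
    "frr_zebra": 64,   # was 16
}
memory_high_threshold = {
    "frr_bgp": 256,    # was 128
}
```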

How did you verify/test it?
Run nightly test
https://elastictest.org/scheduler/testplan/685ac58d2461750d1f5a11c9
What is the motivation for this PR?
Tests failed on teardown with: Failed: [ALARM]: monit:memory_usage, Previous memory usage 74.8 MB exceeds high threshold 70.0 MB (previous: 74.8 MB, current: 74.8 MB)

How did you do it?
Enhance the plugin by adding a new threshold type, percentage_points. The three types are:

"type": "value": absolute values in MB.
Example: {"type": "value", "value": 128} means 128 MB.
Used for: top, free, frr_memory commands that return memory in megabytes.

"type": "percentage": relative percentage of the baseline value.
Example: {"type": "percentage", "value": "10%"} means 10% of current memory usage.
Calculation: if the baseline is 100 MB, the threshold becomes 10 MB.
Used for: dynamic thresholds that scale with current usage.

"type": "percentage_points": absolute percentage values.
Example: {"type": "percentage_points", "value": 75} means 75%.
Used for: monit, docker stats commands that return percentage data.
For increases: {"type": "percentage_points", "value": 10} means 10 percentage points (e.g., from 70% to 80%).
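The three types above can be sketched as a single resolver. The function name and signature are assumptions for illustration, not the plugin's actual API; the arithmetic matches the examples given.

```python
# Illustrative sketch of resolving the three threshold types;
# resolve_threshold is a hypothetical name, not the plugin's API.
def resolve_threshold(config, baseline):
    """Return the effective threshold for one measurement.

    config   -- e.g. {"type": "percentage", "value": "10%"}
    baseline -- current usage: MB for "value"/"percentage" sources,
                percent for "percentage_points" sources
    """
    ttype = config["type"]
    if ttype == "value":
        # Absolute MB value, used as-is (top, free, frr_memory).
        return float(config["value"])
    if ttype == "percentage":
        # Relative to the baseline: "10%" of 100 MB -> 10 MB.
        pct = float(str(config["value"]).rstrip("%"))
        return baseline * pct / 100.0
    if ttype == "percentage_points":
        # Absolute percentage, compared directly to percent
        # readings (monit, docker stats).
        return float(config["value"])
    raise ValueError(f"unknown threshold type: {ttype}")
```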
How did you verify/test it?
Temporarily hacked the thresholds down to force the alarms:

>               pytest.fail(failure_message)
E               Failed: [ALARM]: monit:memory_usage, Previous memory usage 50.4% exceeds high threshold 40% (previous: 50.4%, current: 50.1%)
E               [ALARM]: monit:memory_usage, Current memory usage 50.1% exceeds high threshold 40% (previous: 50.4%, current: 50.1%)
E               [ALARM]: docker:database, Previous memory usage 1.6% exceeds high threshold 1% (previous: 1.6%, current: 1.6%)
E               [ALARM]: docker:database, Current memory usage 1.6% exceeds high threshold 1% (previous: 1.6%, current: 1.6%)
E               [ALARM]: frr_bgp:used, Previous memory usage 70.0 MB exceeds high threshold 16.0 MB (previous: 70.0 MB, current: 70.0 MB)
E               [ALARM]: frr_bgp:used, Current memory usage 70.0 MB exceeds high threshold 16.0 MB (previous: 70.0 MB, current: 70.0 MB)
E               [ALARM]: frr_zebra:used, Previous memory usage 17.0 MB exceeds high threshold 16.0 MB (previous: 17.0 MB, current: 17.0 MB)
E               [ALARM]: frr_zebra:used, Current memory usage 17.0 MB exceeds high threshold 16.0 MB (previous: 17.0 MB, current: 17.0 MB)
Run elastic
https://elastictest.org/scheduler/testplan/68804464edf1bbac5171814b
…d (#19786)

In one pytest session, if all test cases are skipped, the teardown is not executed; if some test cases are not skipped, the teardown still runs for the skipped ones.
What is the motivation for this PR?
disk/test_disk_exhaustion.py creates a 1.7 GB file during the test and deletes it at the end.
But "monit status" is configured in /etc/monit/monitrc to check only once every 60 seconds.
This returns stale data, causing the memory high threshold to be breached.

How did you do it?
We should use "monit validate" instead of "monit status", since "monit validate" forces the checks to run immediately rather than reporting the last cached poll.

How did you verify/test it?
Verified by running the test.
Description of PR
Summary:

Fix the following error in pretest:

>               pytest.fail(failure_message)
E               Failed: [ALARM]: frr_bgp:used, Previous memory usage 273.0 MB exceeds high threshold 256.0 MB (previous: 273.0 MB, current: 273.0 MB)
E               [ALARM]: frr_bgp:used, Current memory usage 273.0 MB exceeds high threshold 256.0 MB (previous: 273.0 MB, current: 273.0 MB)
This is a new memory checking feature in 202505. Therefore, the fix is not applicable to 202412 branch.

A TH5 testbed may have 32 neighbors, each with 6400 prefixes; the total memory after deploy-mg is around 260-300 MB. Increase the threshold to 384 MB.

Signed-off-by: jianquanye@microsoft.com
Description of PR
Summary:
On Arista-7060X6-16PE-384C-B-t0-isolated-d96u32s2, bgp/test_bgp_gr_helper.py has high but expected memory usage. The memory increase threshold is causing the test to fail.

Relax the memory increase threshold.
Fixes # (issue)

Type of change
 Bug fix
 Testbed and Framework(new/improvement)
 New Test case
 Skipped for non-supported platforms
 Test case improvement

Approach
What is the motivation for this PR?
The memory increase threshold is causing the test to fail even though actual usage is below memory_high_threshold.

How did you do it?
Relax the memory increase threshold by value. Before the test, 191.0 MB is in use; during the test, peak usage is 316.0 MB, an increase of 125.0 MB.

The increase-threshold config fields dictate two different limits on the maximum allowed increase: an increase of X% of current usage, or an increase of Y MB over current usage. The higher of the two is used.

Without the changes in this review, the allowed increase would be max((50% * 191.0 MB), 64 MB) = 95.5 MB, which the observed 125.0 MB increase exceeds.

Before the test, actual memory usage is nowhere near memory_high_threshold (384 MB). Relaxing the increase threshold by percentage could over-relax the check for other tests, so the value limit is raised instead.
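The max-of-two-limits rule can be sketched as a small helper. The function name is illustrative, not the plugin's actual API; the numbers come from this PR description.

```python
# Sketch of the "higher of the two limits" rule: the allowed memory
# increase is the larger of X% of current usage or a fixed Y MB.
def allowed_increase(current_mb, pct_limit, value_limit_mb):
    return max(current_mb * pct_limit / 100.0, value_limit_mb)


before_mb = 191.0  # usage before the test (from the PR text)
peak_mb = 316.0    # peak usage during the test

# With the pre-change limits (50%, 64 MB) the check fails:
limit_mb = allowed_increase(before_mb, 50, 64)  # max(95.5, 64) = 95.5
fails = (peak_mb - before_mb) > limit_mb        # 125.0 > 95.5 -> True
```

Raising the Y MB value limit widens the allowance only by a fixed amount, whereas raising X% would scale the allowance with usage on every testbed.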

How did you verify/test it?
Test no longer fails on
@yutongzhang-microsoft (Contributor)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@Pterosaur Pterosaur merged commit c7b3b09 into Azure:202412 Sep 15, 2025
6 of 14 checks passed
Pterosaur added a commit that referenced this pull request Sep 20, 2025
Pterosaur added a commit that referenced this pull request Sep 20, 2025
8 participants