[test] Fix flaky FlexCounter.bulkChunksize under AddressSanitizer#1818

Open
rustiqly wants to merge 1 commit into sonic-net:master from rustiqly:fix/flexcounter-bulkchunksize-asan

Conversation

@rustiqly
Contributor

Description

Fix the FlexCounter.bulkChunksize unit test that fails intermittently under AddressSanitizer (Asan) builds.

Root Cause

Under Asan, the test thread runs 2-5x slower while the FlexCounter worker thread polls at its normal 1000ms interval. This creates a race condition:

  1. Test creates FlexCounter with status=enable and poll_interval=1000
  2. Test adds 6 port counters one by one
  3. FlexCounter worker starts polling — but under Asan, only 1 port may be registered
  4. Test sets BULK_CHUNK_SIZE=3 — but the worker has already consumed its initialCheckCount allowance of initialization polls

Result: mock_bulkGetStats receives object_count=1 when expecting 3, causing:

TestFlexCounter.cpp:1581: Failure
Expected equality of these values:
  object_count
    Which is: 1
  3

This failure is 100% reproducible on the BuildAsan CI pipeline.

Fix

Added an std::atomic<bool> chunkConfigReady flag to synchronize the test thread and FlexCounter worker:

  • In mock_bulkGetStats: after the initialCheckCount guard, if chunkConfigReady is false, treat the call as an extra initialization poll — populate counters but skip chunk-size assertions
  • In testAddRemoveCounter: added an optional onChunkConfigApplied callback parameter that fires after chunk config is applied to the FlexCounter
  • Per test case: flag is reset to false before calls where bulkChunkSizeAfterPort=true (config applied after ports → race possible), and pre-set to true for bulkChunkSizeAfterPort=false or forceSingleCall=true cases (no race)

This eliminates the timing dependency entirely — no sleeps, no increased timeouts, just explicit synchronization via atomic flag.

Testing

Under AddressSanitizer (Asan), the test thread runs 2-5x slower while the
FlexCounter worker thread polls at its normal interval. This creates a race
where the worker calls bulkGetStats before chunk-size configuration is
applied by the test thread, triggering spurious assertion failures like:
  object_count (1) != 3

Fix: Add an atomic flag (chunkConfigReady) that synchronizes the test
thread and the FlexCounter worker. The mock_bulkGetStats callback treats
any post-initialization poll that arrives before the flag is set as an
extra initialization call — populating counters without asserting chunk
sizes. The flag is set via a callback (onChunkConfigApplied) passed to
testAddRemoveCounter, which fires immediately after the chunk-size
configuration is applied to the FlexCounter.

For test cases where bulkChunkSizeAfterPort=false (config applied before
ports are added), the flag is pre-set to true since there is no race.
For forceSingleCall=true cases, bulkGetStats is not called post-init,
so the flag is also pre-set to true.

Signed-off-by: Rustiqly <[email protected]>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment


Pull request overview

Fixes a timing-dependent race in the FlexCounter.bulkChunksize unit test that becomes consistently reproducible under AddressSanitizer, by explicitly synchronizing the test thread with the FlexCounter polling thread.

Changes:

  • Add an atomic readiness flag used by the bulk stats mock to skip chunk-size assertions until chunk-size configuration is applied.
  • Extend testAddRemoveCounter with an optional callback to signal when chunk configuration has been applied.
  • Update FlexCounter.bulkChunksize test scenarios to set/reset the readiness flag appropriately depending on configuration order.

Comment on lines +1575 to +1579
if (!chunkConfigReady.load(std::memory_order_acquire))
{
    for (uint32_t i = 0; i < object_count; i++)
    {
        object_status[i] = SAI_STATUS_SUCCESS;

Copilot AI Mar 27, 2026


The counter-population logic in the !chunkConfigReady guard duplicates the runtime loop below (same per-object/per-counter map update + counter fill). Consider extracting that shared logic into a small local helper/lambda to keep the mock consistent and reduce maintenance risk if the generation logic changes later.

Comment on lines 234 to +238
  bool bulkChunkSizeAfterPort = true,
  const std::string pluginName = "",
  bool immediatelyRemoveBulkChunkSizePerCounter = false,
- bool forceSingleCreate = false)
+ bool forceSingleCreate = false,
+ std::function<void()> onChunkConfigApplied = nullptr)

Copilot AI Mar 27, 2026


testAddRemoveCounter is accumulating many positional parameters (now including onChunkConfigApplied), which makes call sites hard to read and easy to mis-order. Consider grouping the bulk/chunk configuration flags into a small options struct (or splitting into a few targeted helpers/overloads) so the bulkChunksize test can pass clearly named settings without a long argument list.

