[test] Fix flaky FlexCounter.bulkChunksize under AddressSanitizer#1818

Open
rustiqly wants to merge 1 commit into sonic-net:master from rustiqly:fix/flexcounter-bulkchunksize-asan

Conversation

@rustiqly
Contributor

Description

Fix the FlexCounter.bulkChunksize unit test that fails intermittently under AddressSanitizer (Asan) builds.

Root Cause

Under Asan, the test thread runs 2-5x slower while the FlexCounter worker thread polls at its normal 1000ms interval. This creates a race condition:

  1. Test creates FlexCounter with status=enable and poll_interval=1000
  2. Test adds 6 port counters one by one
  3. FlexCounter worker starts polling — but under Asan, only 1 port may be registered
  4. Test sets BULK_CHUNK_SIZE=3 — but the worker has already consumed its initialCheckCount allowance of initialization polls

Result: mock_bulkGetStats receives object_count=1 when expecting 3, causing:

TestFlexCounter.cpp:1581: Failure
Expected equality of these values:
  object_count
    Which is: 1
  3

This failure is 100% reproducible on the BuildAsan CI pipeline.

Fix

Added an std::atomic<bool> chunkConfigReady flag to synchronize the test thread and FlexCounter worker:

  • In mock_bulkGetStats: after the initialCheckCount guard, if chunkConfigReady is false, treat the call as an extra initialization poll — populate counters but skip chunk-size assertions
  • In testAddRemoveCounter: added an optional onChunkConfigApplied callback parameter that fires after chunk config is applied to the FlexCounter
  • Per test case: flag is reset to false before calls where bulkChunkSizeAfterPort=true (config applied after ports → race possible), and pre-set to true for bulkChunkSizeAfterPort=false or forceSingleCall=true cases (no race)

This eliminates the timing dependency entirely — no sleeps, no increased timeouts, just explicit synchronization via atomic flag.

Testing

Under AddressSanitizer (Asan), the test thread runs 2-5x slower while the
FlexCounter worker thread polls at its normal interval. This creates a race
where the worker calls bulkGetStats before chunk-size configuration is
applied by the test thread, triggering spurious assertion failures like:
  object_count (1) != 3

Fix: Add an atomic flag (chunkConfigReady) that synchronizes the test
thread and the FlexCounter worker. The mock_bulkGetStats callback treats
any post-initialization poll that arrives before the flag is set as an
extra initialization call — populating counters without asserting chunk
sizes. The flag is set via a callback (onChunkConfigApplied) passed to
testAddRemoveCounter, which fires immediately after the chunk-size
configuration is applied to the FlexCounter.

For test cases where bulkChunkSizeAfterPort=false (config applied before
ports are added), the flag is pre-set to true since there is no race.
For forceSingleCall=true cases, bulkGetStats is not called post-init,
so the flag is also pre-set to true.

Signed-off-by: Rustiqly <[email protected]>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment


Pull request overview

Fixes a timing-dependent race in the FlexCounter.bulkChunksize unit test that becomes consistently reproducible under AddressSanitizer, by explicitly synchronizing the test thread with the FlexCounter polling thread.

Changes:

  • Add an atomic readiness flag used by the bulk stats mock to skip chunk-size assertions until chunk-size configuration is applied.
  • Extend testAddRemoveCounter with an optional callback to signal when chunk configuration has been applied.
  • Update FlexCounter.bulkChunksize test scenarios to set/reset the readiness flag appropriately depending on configuration order.

Comment on lines +1575 to +1579
if (!chunkConfigReady.load(std::memory_order_acquire))
{
    for (uint32_t i = 0; i < object_count; i++)
    {
        object_status[i] = SAI_STATUS_SUCCESS;

Copilot AI Mar 27, 2026


The counter-population logic in the !chunkConfigReady guard duplicates the runtime loop below (same per-object/per-counter map update + counter fill). Consider extracting that shared logic into a small local helper/lambda to keep the mock consistent and reduce maintenance risk if the generation logic changes later.

Comment on lines 234 to +238
  bool bulkChunkSizeAfterPort = true,
  const std::string pluginName = "",
  bool immediatelyRemoveBulkChunkSizePerCounter = false,
- bool forceSingleCreate = false)
+ bool forceSingleCreate = false,
+ std::function<void()> onChunkConfigApplied = nullptr)

Copilot AI Mar 27, 2026


testAddRemoveCounter is accumulating many positional parameters (now including onChunkConfigApplied), which makes call sites hard to read and easy to mis-order. Consider grouping the bulk/chunk configuration flags into a small options struct (or splitting into a few targeted helpers/overloads) so the bulkChunksize test can pass clearly named settings without a long argument list.

