[test] Fix flaky FlexCounter.bulkChunksize under AddressSanitizer #1818
rustiqly wants to merge 1 commit into sonic-net:master from
Conversation
Under AddressSanitizer (Asan), the test thread runs 2-5x slower while the FlexCounter worker thread polls at its normal interval. This creates a race where the worker calls bulkGetStats before the chunk-size configuration is applied by the test thread, triggering spurious assertion failures like: object_count (1) != 3

Fix: add an atomic flag (chunkConfigReady) that synchronizes the test thread and the FlexCounter worker. The mock_bulkGetStats callback treats any post-initialization poll that arrives before the flag is set as an extra initialization call, populating counters without asserting chunk sizes. The flag is set via a callback (onChunkConfigApplied) passed to testAddRemoveCounter, which fires immediately after the chunk-size configuration is applied to the FlexCounter.

For test cases where bulkChunkSizeAfterPort=false (config applied before ports are added), the flag is pre-set to true since there is no race. For forceSingleCall=true cases, bulkGetStats is not called post-init, so the flag is also pre-set to true.

Signed-off-by: Rustiqly <[email protected]>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
Fixes a timing-dependent race in the FlexCounter.bulkChunksize unit test that becomes consistently reproducible under AddressSanitizer, by explicitly synchronizing the test thread with the FlexCounter polling thread.
Changes:
- Add an atomic readiness flag used by the bulk stats mock to skip chunk-size assertions until chunk-size configuration is applied.
- Extend testAddRemoveCounter with an optional callback to signal when chunk configuration has been applied.
- Update FlexCounter.bulkChunksize test scenarios to set/reset the readiness flag appropriately depending on configuration order.
```cpp
if (!chunkConfigReady.load(std::memory_order_acquire))
{
    for (uint32_t i = 0; i < object_count; i++)
    {
        object_status[i] = SAI_STATUS_SUCCESS;
```
The counter-population logic in the !chunkConfigReady guard duplicates the runtime loop below (same per-object/per-counter map update + counter fill). Consider extracting that shared logic into a small local helper/lambda to keep the mock consistent and reduce maintenance risk if the generation logic changes later.
```diff
     bool bulkChunkSizeAfterPort = true,
     const std::string pluginName = "",
     bool immediatelyRemoveBulkChunkSizePerCounter = false,
-    bool forceSingleCreate = false)
+    bool forceSingleCreate = false,
+    std::function<void()> onChunkConfigApplied = nullptr)
```
testAddRemoveCounter is accumulating many positional parameters (now including onChunkConfigApplied), which makes call sites hard to read and easy to mis-order. Consider grouping the bulk/chunk configuration flags into a small options struct (or splitting into a few targeted helpers/overloads) so the bulkChunksize test can pass clearly named settings without a long argument list.
Description
Fix the FlexCounter.bulkChunksize unit test that fails intermittently under AddressSanitizer (Asan) builds.

Root Cause
Under Asan, the test thread runs 2-5x slower while the FlexCounter worker thread polls at its normal 1000ms interval. This creates a race condition:
1. The test thread sets status=enable and poll_interval=1000, starting the worker.
2. The test thread sets BULK_CHUNK_SIZE=3, but the worker already consumed initialCheckCount.

Result: mock_bulkGetStats receives object_count=1 when expecting 3, causing the assertion failure. This failure is 100% reproducible on the BuildAsan CI pipeline.
Fix
Added an std::atomic&lt;bool&gt; chunkConfigReady flag to synchronize the test thread and FlexCounter worker:
- mock_bulkGetStats: after the initialCheckCount guard, if chunkConfigReady is false, treat the call as an extra initialization poll: populate counters but skip chunk-size assertions.
- testAddRemoveCounter: added an optional onChunkConfigApplied callback parameter that fires after chunk config is applied to the FlexCounter.
- The flag is reset to false before calls where bulkChunkSizeAfterPort=true (config applied after ports, so a race is possible), and pre-set to true for bulkChunkSizeAfterPort=false or forceSingleCall=true cases (no race).

This eliminates the timing dependency entirely: no sleeps, no increased timeouts, just explicit synchronization via an atomic flag.
Testing