[test] Fix flaky FlexCounter.bulkChunksize by replacing usleep with poll-wait #1766
@lguohan Fixed — the test failure was |
Pull request overview
This PR reduces CI flakiness in FlexCounter unit tests by replacing fixed-duration sleeps with deterministic poll-wait helpers that wait for COUNTERS_DB state to be populated.
Changes:
- Added waitForCounterKeys() and waitForCounterValue() helper functions for polling COUNTERS_DB with a timeout.
- Replaced multiple usleep(...) calls across FlexCounter-related tests with these poll-wait helpers.
- Updated the aspell personal word list to include usleep (used in code/comments).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| unittest/syncd/TestFlexCounter.cpp | Adds DB poll-wait helpers and replaces hardcoded sleeps to make FlexCounter tests timing-independent. |
| tests/aspell.en.pws | Adds usleep to the dictionary to avoid spellcheck noise. |
Addressed both Copilot review comments:
Force-pushed with both fixes. |
Updated and ready for review. Summary of changes since last push:
Local test results: 91/92 pass — the only failure is |
unittest/syncd/TestFlexCounter.cpp (outdated)
    usleep(1000*2000);
    waitForCounterValue(countersTable, expectedKey, "SAI_PORT_STAT_IF_IN_OCTETS", "100");
    countersTable.hget(expectedKey, "SAI_PORT_STAT_IF_IN_OCTETS", value);
This line seems unnecessary, given that line 1055 already checks the expected value.
Good catch — removed the redundant hget + EXPECT_EQ since waitForCounterValue already validates the exact value.
unittest/syncd/TestFlexCounter.cpp (outdated)
    waitForCounterValue(countersTable, expectedKey, "SAI_PORT_STAT_IF_IN_OCTETS", "100");
    countersTable.hget(expectedKey, "SAI_PORT_STAT_IF_IN_OCTETS", value);
    EXPECT_EQ(value, "100");
    countersTable.hget(expectedKey, "SAI_PORT_STAT_IF_IN_ERRORS", value);
Why not follow the same pattern as line 1055 and wait here as well?
Done — now uses waitForCounterValue(countersTable, expectedKey, "SAI_PORT_STAT_IF_IN_ERRORS", "200") matching the same pattern.
unittest/syncd/TestFlexCounter.cpp (outdated)
    usleep(1000*1050);
    waitForCounterValue(countersTable, expectedKey, "SAI_PORT_STAT_IF_IN_OCTETS", "100");
    counterVerifyFunc(countersTable,
why do we still need this 1855?
You're right — the value stays "100" from the previous step, so the wait returns immediately on the stale value anyway. Removed.
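To illustrate the point in this thread, here is a minimal sketch of why an equality wait on an already-satisfied value verifies nothing about a new poll cycle. The names WaitOutcome and waitForEqualSketch are hypothetical and are not the helpers from this PR; the reader callback stands in for a DB lookup.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical result type: whether the value matched, and how many reads it took.
struct WaitOutcome
{
    bool matched; // true if the expected value was observed before the deadline
    int reads;    // number of times the value was read
};

// Hypothetical equality wait: re-read the value until it equals `expected`
// or `timeout` elapses, sleeping `interval` between reads.
WaitOutcome waitForEqualSketch(const std::function<std::string()>& read,
                               const std::string& expected,
                               std::chrono::milliseconds timeout,
                               std::chrono::milliseconds interval)
{
    assert(interval.count() > 0); // a zero interval would busy-spin
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    int reads = 0;
    for (;;)
    {
        ++reads;
        if (read() == expected)
        {
            return {true, reads};
        }
        if (std::chrono::steady_clock::now() >= deadline)
        {
            return {false, reads};
        }
        std::this_thread::sleep_for(interval);
    }
}
```

If the reader already returns "100" from an earlier step, the call matches on the first read without sleeping, so it cannot tell a fresh counter update apart from the stale value. That is consistent with dropping the wait here.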
…oll-wait

Replace hardcoded usleep(1000*1050) calls in FlexCounter unit tests with deterministic poll-wait helpers that check for expected counter keys or values in the DB. The fixed-sleep approach is inherently racy under CI load because the FlexCounter polling thread may not complete within the 1.05s window.

Two helpers added:
- waitForCounterKeys: polls until expected number of counter keys appear
- waitForCounterValue: polls until a specific counter field is populated

Both poll every 100ms with a 5-second timeout, eliminating timing sensitivity while keeping tests fast on unloaded machines.

Fixes: sonic-net#1765
Signed-off-by: Rustiqly <[email protected]>
…oll-wait (sonic-net#1766) Signed-off-by: Vivek Reddy <[email protected]>
* [DPU] Add support for Flow bulk session get notifications
  Signed-off-by: Vivek Reddy <[email protected]>
* Remove obvious comments
  Signed-off-by: Vivek Reddy <[email protected]>
* Add processMetadata method to validate object id
  Signed-off-by: Vivek Reddy <[email protected]>
* Handled comments
  Signed-off-by: Vivek Reddy <[email protected]>
* Minor fixes
  Signed-off-by: Vivek Reddy <[email protected]>
* Use get_stats_ext instead of get_stats for switch counters (#1757)
  Signed-off-by: Ryan Garofano <[email protected]>
  Signed-off-by: Vivek Reddy <[email protected]>
* [test] Fix flaky FlexCounter.bulkChunksize by replacing usleep with poll-wait (#1766)
  Signed-off-by: Vivek Reddy <[email protected]>
* Add .github/copilot-instructions.md for AI-assisted development (#1764)
  Add .github/copilot-instructions.md to provide AI-assisted development guidance for contributors using GitHub Copilot, Copilot Chat, and other AI coding tools. This file helps AI tools understand the repo's architecture, coding conventions, and contribution workflow, leading to more accurate suggestions and fewer review cycles.
  What's included:
  - Repository architecture and component overview
  - Coding standards and naming conventions
  - Testing requirements and patterns
  - Build system integration notes
  - Common pitfalls and best practices
  This file has no impact on builds, tests, or runtime behavior; it is purely developer guidance metadata.
  Signed-off-by: Rustiqly <[email protected]>
  Signed-off-by: Vivek Reddy <[email protected]>
* vpp: support binding multiple ACL tables by priority (#1732)
  why
  currently vpp doesn't support binding multiple ACL tables. Each table is appended with default permit-all rules. With multiple tables, this may cause acl matched by such rules and skip the actual rule to make in the tables after this one.
  what this PR does
  - remove the default permit-all rules for each table
  - If a table is empty, create a dummy rule that won't match any traffic because vpp doesn't allow empty table. The dummy rule matches dest-ip to 0.0.0.0/32
  - sort all the tables by priority in the table group. vpp doesn't support parallel matching
  - added catch-all acl group to the end. vpp default behavior of no match is drop but sonic is accept.
  - Fix sonic-vpp crashing due to race condition during stats pull. If the interface to get stats has been removed, stat_segment_ls_r returns null.
  Signed-off-by: Yue Gao <[email protected]>
  Signed-off-by: Vivek Reddy <[email protected]>
* Update syncd/NotificationProcessor.cpp
  Co-authored-by: Copilot <[email protected]>
  Signed-off-by: Vivek Reddy <[email protected]>
* Update syncd/FlowDump.cpp
  Co-authored-by: Copilot <[email protected]>
  Signed-off-by: Vivek Reddy <[email protected]>
* Update syncd/FlowDump.cpp
  Co-authored-by: Copilot <[email protected]>
  Signed-off-by: Vivek Reddy <[email protected]>
* Update meta/SaiSerialize.cpp
  Co-authored-by: Copilot <[email protected]>
  Signed-off-by: Vivek Reddy <[email protected]>
* Handle co-pilot comments
  Signed-off-by: Vivek Reddy <[email protected]>

---------

Signed-off-by: Vivek Reddy <[email protected]>
Signed-off-by: Ryan Garofano <[email protected]>
Signed-off-by: Rustiqly <[email protected]>
Signed-off-by: Yue Gao <[email protected]>
Co-authored-by: Ryan Garofano <[email protected]>
Co-authored-by: rustiqly <[email protected]>
Co-authored-by: yue-fred-gao <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Lihua Yuan <[email protected]>
Description
Fix flaky FlexCounter.bulkChunksize unit test that intermittently fails in CI due to timing-dependent usleep(1000*1050).

Issue: #1765
Root Cause
The test uses usleep(1000*1050) (1.05s) to wait for the FlexCounter polling thread (1s poll interval) to complete counter collection. Under CI load, the polling thread may not finish within this window, causing assertion failures.

This failure reproduces on completely unrelated PRs (#1763, #1764), confirming it is timing-dependent and not caused by code changes.
Fix
Replace all usleep(1000*1050) calls with deterministic poll-wait helpers:
- waitForCounterKeys(table, expectedCount): polls until the expected number of counter keys appear in COUNTERS_DB
- waitForCounterValue(table, key, field): polls until a specific counter field has a non-empty value

Both helpers poll every 100ms with a 5-second timeout. This eliminates timing sensitivity while keeping tests fast on unloaded machines (typically completes in 1-2 polls).
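For readers who have not opened the diff, the poll-wait idea can be sketched roughly as follows. This is not the code from the PR: FakeCountersTable is a hypothetical in-memory stand-in (the real helpers take the test's countersTable and read COUNTERS_DB), and pollUntil plus the two ...Sketch functions are illustrative names. The 5-second timeout and 100ms interval are taken from the description above; the exact-match key count is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>

using Millis = std::chrono::milliseconds;

// Hypothetical in-memory stand-in for a COUNTERS_DB table; NOT the swss::Table API.
class FakeCountersTable
{
public:
    void hset(const std::string& key, const std::string& field, const std::string& value)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_rows[key][field] = value;
    }

    // Returns the field value, or an empty string if the key or field is absent.
    std::string hget(const std::string& key, const std::string& field) const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto row = m_rows.find(key);
        if (row == m_rows.end()) return std::string();
        auto cell = row->second.find(field);
        if (cell == row->second.end()) return std::string();
        return cell->second;
    }

    std::size_t keyCount() const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_rows.size();
    }

private:
    mutable std::mutex m_mutex;
    std::map<std::string, std::map<std::string, std::string>> m_rows;
};

// Re-check `ready` every `interval` until it holds or `timeout` elapses.
// Returns true as soon as the condition holds, false on timeout.
bool pollUntil(const std::function<bool()>& ready,
               Millis timeout = Millis(5000),
               Millis interval = Millis(100))
{
    assert(interval.count() > 0); // a zero interval would busy-spin
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!ready())
    {
        if (std::chrono::steady_clock::now() >= deadline)
        {
            return false;
        }
        std::this_thread::sleep_for(interval);
    }
    return true;
}

// Sketch: wait until the table holds the expected number of keys (exact match assumed).
bool waitForCounterKeysSketch(const FakeCountersTable& table, std::size_t expectedCount,
                              Millis timeout = Millis(5000), Millis interval = Millis(100))
{
    return pollUntil([&] { return table.keyCount() == expectedCount; }, timeout, interval);
}

// Sketch: wait until a specific field has a non-empty value.
bool waitForCounterValueSketch(const FakeCountersTable& table, const std::string& key,
                               const std::string& field,
                               Millis timeout = Millis(5000), Millis interval = Millis(100))
{
    return pollUntil([&] { return !table.hget(key, field).empty(); }, timeout, interval);
}
```

Because the condition is checked before the first sleep, a wait on data that is already present returns without sleeping at all, and a slow run only pays for as many 100ms intervals as it actually needs, up to the timeout. A test would typically assert on the boolean result so that a timeout fails loudly instead of falling through to a misleading value mismatch.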
Also replaces the similar usleep(1000*1000) and usleep(1000*2000) in queryCounterCapability/addRemoveCounterForPort, which have the same timing vulnerability.
Changes
unittest/syncd/TestFlexCounter.cpp:
- Added #include <chrono> for steady_clock
- Added waitForCounterKeys() helper
- Added waitForCounterValue() helper
- Replaced the usleep(1000*1050) calls, 1x usleep(1000*1000), and 1x usleep(1000*2000) across testAddRemoveCounter, FlexCounter.counterIdChange, FlexCounter.addRemoveCounterForPort, and testDashMeterAddRemoveCounter
- Left usleep(60*1000) in updateSwitchDebugCounterIdList unchanged (50ms poll interval, not flaky)