[test] Fix flaky FlexCounter.bulkChunksize by replacing usleep with poll-wait #1766

Merged
rustiqly merged 1 commit into sonic-net:master from rustiqly:fix/flaky-flexcounter-bulkchunksize
Feb 12, 2026

Conversation

@rustiqly
Contributor

Description

Fix flaky FlexCounter.bulkChunksize unit test that intermittently fails in CI due to timing-dependent usleep(1000*1050).

Issue: #1765

Root Cause

The test uses usleep(1000*1050) (1.05s) to wait for the FlexCounter polling thread (1s poll interval) to complete counter collection. Under CI load, the polling thread may not finish within this window, causing assertion failures like:

TestFlexCounter.cpp:1390: Failure
Expected equality of these values:
  object_count
    Which is: 6
  unifiedBulkChunkSize
    Which is: 3

This failure reproduces on completely unrelated PRs (#1763, #1764), confirming it is timing-dependent and not caused by code changes.
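To make the margin explicit, the arithmetic behind the root cause can be written out. This is only an illustration using the two numbers quoted above; the constant names are made up for this sketch and do not appear in the test.

```cpp
#include <chrono>

// Values taken from the description above; names are illustrative only.
// usleep() takes microseconds, so 1000*1050 is 1.05 s.
constexpr std::chrono::microseconds kFixedSleep{1000 * 1050};
// FlexCounter poll interval used by the test (1 s).
constexpr std::chrono::milliseconds kPollInterval{1000};

// Headroom left for the polling thread to wake up, collect counters,
// and write them to the DB before the test reads them back.
constexpr auto kSlack =
    std::chrono::duration_cast<std::chrono::milliseconds>(kFixedSleep) - kPollInterval;

static_assert(kSlack == std::chrono::milliseconds(50), "only 50 ms of slack");
```

With 50 ms of headroom, any scheduling delay longer than that on a loaded CI machine is enough for the assertions to run before the counters are written.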

Fix

Replace all usleep(1000*1050) calls with deterministic poll-wait helpers:

  • waitForCounterKeys(table, expectedCount) — polls until the expected number of counter keys appear in COUNTERS_DB
  • waitForCounterValue(table, key, field) — polls until a specific counter field has a non-empty value

Both helpers poll every 100ms with a 5-second timeout. This eliminates timing sensitivity while keeping tests fast on unloaded machines (typically completes in 1-2 polls).

Also replaces the similar usleep(1000*1000) and usleep(1000*2000) in queryCounterCapability / addRemoveCounterForPort which have the same timing vulnerability.
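For readers unfamiliar with the pattern, here is a minimal sketch of a poll-wait loop with the 100 ms interval and 5 s timeout described above. It is not the code from this PR: `pollUntil` and its parameters are hypothetical, and the real helpers query COUNTERS_DB rather than taking a generic predicate.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not the PR's code: polls `condition` every `interval`
// until it returns true or `timeout` elapses. Returns whether it was met.
bool pollUntil(
        const std::function<bool()>& condition,
        std::chrono::milliseconds timeout = std::chrono::milliseconds(5000),
        std::chrono::milliseconds interval = std::chrono::milliseconds(100))
{
    // steady_clock is monotonic, so wall-clock adjustments cannot
    // shorten or extend the deadline.
    const auto deadline = std::chrono::steady_clock::now() + timeout;

    while (true)
    {
        if (condition())
        {
            return true;
        }

        if (std::chrono::steady_clock::now() >= deadline)
        {
            return false;
        }

        std::this_thread::sleep_for(interval);
    }
}
```

The condition is checked before the first sleep, so a test whose data is already present returns immediately; the timeout only matters on a slow machine. A caller would normally treat a `false` return as a test failure rather than ignore it.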

Changes

  • unittest/syncd/TestFlexCounter.cpp:
    • Add #include <chrono> for steady_clock
    • Add waitForCounterKeys() helper
    • Add waitForCounterValue() helper
    • Replace 8x usleep(1000*1050), 1x usleep(1000*1000), 1x usleep(1000*2000) across testAddRemoveCounter, FlexCounter.counterIdChange, FlexCounter.addRemoveCounterForPort, and testDashMeterAddRemoveCounter
    • Retain usleep(60*1000) in updateSwitchDebugCounterIdList (50ms poll interval, not flaky)

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

rustiqly force-pushed the fix/flaky-flexcounter-bulkchunksize branch from 9de2821 to 5179023 on February 11, 2026 18:35
@rustiqly
Contributor Author

@lguohan Fixed — the test failure was aspellcheck.pl (spellcheck), not a unit test failure. All 56 unit tests passed including our updated FlexCounter tests. Added 'usleep' to tests/aspell.en.pws. Force-pushed.

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

This PR reduces CI flakiness in FlexCounter unit tests by replacing fixed-duration sleeps with deterministic poll-wait helpers that wait for COUNTERS_DB state to be populated.

Changes:

  • Added waitForCounterKeys() and waitForCounterValue() helper functions for polling COUNTERS_DB with a timeout.
  • Replaced multiple usleep(...) calls across FlexCounter-related tests with these poll-wait helpers.
  • Updated the aspell personal word list to include usleep (used in code/comments).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
unittest/syncd/TestFlexCounter.cpp | Adds DB poll-wait helpers and replaces hardcoded sleeps to make FlexCounter tests timing-independent.
tests/aspell.en.pws | Adds usleep to the dictionary to avoid spellcheck noise.

rustiqly force-pushed the fix/flaky-flexcounter-bulkchunksize branch from 5179023 to 0fcba53 on February 11, 2026 19:24
@mssonicbld
Collaborator

/azp run

@rustiqly
Contributor Author

Addressed both Copilot review comments:

  1. Silent timeout → ADD_FAILURE(): Both waitForCounterKeys() and waitForCounterValue() now call ADD_FAILURE() with diagnostics (expected vs actual keys, key+field) when the timeout expires, instead of silently returning.

  2. removeTimeStamp DB mutation in poll loop: Extracted countNonTimestampKeys() helper that filters TIME_STAMP from the in-memory vector without deleting it from the DB. No more write-per-iteration race with the producer thread.

Force-pushed with both fixes.
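A rough sketch of the in-memory filtering described in item 2, plus the kind of diagnostic string item 1 refers to. The PR names `countNonTimestampKeys()` but this thread does not show its signature, so the one below is an assumption; `describeKeyTimeout` is a hypothetical name used only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Assumed signature (the helper is named in the PR but not shown here).
// Counts every key except "TIME_STAMP" in a snapshot already read from the
// table, so the poll loop never writes to or deletes from the DB.
std::size_t countNonTimestampKeys(const std::vector<std::string>& keys)
{
    return static_cast<std::size_t>(std::count_if(
        keys.begin(), keys.end(),
        [](const std::string& key) { return key != "TIME_STAMP"; }));
}

// Hypothetical helper: builds a message a timeout path could stream into
// gtest's ADD_FAILURE(), e.g. ADD_FAILURE() << describeKeyTimeout(3, keys);
std::string describeKeyTimeout(std::size_t expected,
                               const std::vector<std::string>& keys)
{
    std::ostringstream msg;
    msg << "timed out waiting for " << expected
        << " counter keys, found " << countNonTimestampKeys(keys);
    return msg.str();
}
```

Because the filter works on a copy of the key list, repeated polling is read-only and cannot race with the thread that writes the counters.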

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

rustiqly force-pushed the fix/flaky-flexcounter-bulkchunksize branch from 0fcba53 to baf5355 on February 11, 2026 20:16
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly
Contributor Author

Updated and ready for review. Summary of changes since last push:

  • waitForCounterValue now takes an explicit expectedValue parameter (waits for exact match)
  • Added waitForNonZeroCounterValue helper for tests with bulk init phases that write zeros before real data
  • Fixed counterIdChange test: when switching between bulk/non-bulk modes, stale values persisted in Redis. Now waits for the specific expected value (e.g., IF_IN_UCAST_PKTS changing from "200" to "20") instead of just any non-empty value
  • Fixed addRemoveCounter: uses expectedValues[0] when available, falls back to waitForNonZeroCounterValue for tests with custom verify functions (like bulkChunksize)
  • Added usleep to aspell.en.pws dictionary

Local test results: 91/92 pass — the only failure is AttrVersionChecker.isSufficientVersion which is a pre-existing upstream issue.
All 12 FlexCounter tests pass (including bulkChunksize + counterIdChange).
aspellcheck + swsslogentercheck both pass.
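To illustrate why waiting for "any non-empty value" is not enough once a stale value is already in Redis, here is a small self-contained sketch. `FakeCounters`, `waitForExactValue`, and `hasNonEmptyValue` are hypothetical stand-ins written for this example; they are not `swss::Table` or the helpers added in this PR.

```cpp
#include <chrono>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical in-memory stand-in for one counters hash (not swss::Table).
class FakeCounters
{
public:
    void hset(const std::string& field, const std::string& value)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_fields[field] = value;
    }

    bool hget(const std::string& field, std::string& value) const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_fields.find(field);
        if (it == m_fields.end())
        {
            return false;
        }
        value = it->second;
        return true;
    }

private:
    mutable std::mutex m_mutex;
    std::map<std::string, std::string> m_fields;
};

// Returns true once `field` holds exactly `expected`; false on timeout.
bool waitForExactValue(
        const FakeCounters& counters,
        const std::string& field,
        const std::string& expected,
        std::chrono::milliseconds timeout = std::chrono::milliseconds(5000),
        std::chrono::milliseconds interval = std::chrono::milliseconds(100))
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    std::string value;

    while (true)
    {
        if (counters.hget(field, value) && value == expected)
        {
            return true;
        }
        if (std::chrono::steady_clock::now() >= deadline)
        {
            return false;
        }
        std::this_thread::sleep_for(interval);
    }
}

// Weaker check: true as soon as the field holds any non-empty value,
// which a stale value from an earlier phase also satisfies.
bool hasNonEmptyValue(const FakeCounters& counters, const std::string& field)
{
    std::string value;
    return counters.hget(field, value) && !value.empty();
}
```

If `IF_IN_UCAST_PKTS` still holds "200" from the previous mode, `hasNonEmptyValue` succeeds at once, while `waitForExactValue(..., "20")` keeps polling until the new value is written or the timeout expires.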


usleep(1000*2000);
waitForCounterValue(countersTable, expectedKey, "SAI_PORT_STAT_IF_IN_OCTETS", "100");
countersTable.hget(expectedKey, "SAI_PORT_STAT_IF_IN_OCTETS", value);
Contributor

This line seems unnecessary, given that line 1055 already checks the expected value.

Contributor Author

Good catch — removed the redundant hget + EXPECT_EQ since waitForCounterValue already validates the exact value.

waitForCounterValue(countersTable, expectedKey, "SAI_PORT_STAT_IF_IN_OCTETS", "100");
countersTable.hget(expectedKey, "SAI_PORT_STAT_IF_IN_OCTETS", value);
EXPECT_EQ(value, "100");
countersTable.hget(expectedKey, "SAI_PORT_STAT_IF_IN_ERRORS", value);
Contributor

Why not follow the same pattern as line 1055 and wait for the value?

Contributor Author

Done — now uses waitForCounterValue(countersTable, expectedKey, "SAI_PORT_STAT_IF_IN_ERRORS", "200") matching the same pattern.


usleep(1000*1050);
waitForCounterValue(countersTable, expectedKey, "SAI_PORT_STAT_IF_IN_OCTETS", "100");
counterVerifyFunc(countersTable,
Contributor

why do we still need this 1855?

Contributor Author

You're right — the value stays "100" from the previous step, so the wait returns immediately on the stale value anyway. Removed.

rustiqly force-pushed the fix/flaky-flexcounter-bulkchunksize branch from baf5355 to 51b6fec on February 11, 2026 20:58
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

rustiqly force-pushed the fix/flaky-flexcounter-bulkchunksize branch from 51b6fec to 3cf9be1 on February 11, 2026 21:57
@mssonicbld
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

…oll-wait

Replace hardcoded usleep(1000*1050) calls in FlexCounter unit tests with
deterministic poll-wait helpers that check for expected counter keys or
values in the DB. The fixed-sleep approach is inherently racy under CI
load because the FlexCounter polling thread may not complete within the
1.05s window.

Two helpers added:
- waitForCounterKeys: polls until expected number of counter keys appear
- waitForCounterValue: polls until a specific counter field is populated

Both poll every 100ms with a 5-second timeout, eliminating timing
sensitivity while keeping tests fast on unloaded machines.

Fixes: sonic-net#1765

Signed-off-by: Rustiqly <[email protected]>
rustiqly force-pushed the fix/flaky-flexcounter-bulkchunksize branch from 3cf9be1 to b153d7a on February 12, 2026 01:53
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

rustiqly merged commit d482f26 into sonic-net:master on Feb 12, 2026
16 checks passed
vivekrnv pushed a commit to vivekrnv/sonic-sairedis that referenced this pull request Feb 18, 2026
rustiqly added a commit to rustiqly/sonic-sairedis that referenced this pull request Mar 5, 2026
lihuay added a commit that referenced this pull request Mar 5, 2026
* [DPU] Add support for Flow bulk session get notifications

Signed-off-by: Vivek Reddy <[email protected]>

* Remove obvious comments

Signed-off-by: Vivek Reddy <[email protected]>

* Add processMetadata method to validate object id

Signed-off-by: Vivek Reddy <[email protected]>

* Handled comments

Signed-off-by: Vivek Reddy <[email protected]>

* Minor fixes

Signed-off-by: Vivek Reddy <[email protected]>

* Use get_stats_ext instead of get_stats for switch counters (#1757)

Signed-off-by: Ryan Garofano <[email protected]>
Signed-off-by: Vivek Reddy <[email protected]>

* [test] Fix flaky FlexCounter.bulkChunksize by replacing usleep with poll-wait (#1766)

Signed-off-by: Vivek Reddy <[email protected]>

* Add .github/copilot-instructions.md for AI-assisted development (#1764)

Add .github/copilot-instructions.md to provide AI-assisted development guidance for contributors using GitHub Copilot, Copilot Chat, and other AI coding tools.

This file helps AI tools understand the repo's architecture, coding conventions, and contribution workflow, leading to more accurate suggestions and fewer review cycles.

What's included:

  • Repository architecture and component overview
  • Coding standards and naming conventions
  • Testing requirements and patterns
  • Build system integration notes
  • Common pitfalls and best practices
This file has no impact on builds, tests, or runtime behavior — it is purely developer guidance metadata.

Signed-off-by: Rustiqly <[email protected]>
Signed-off-by: Vivek Reddy <[email protected]>

* vpp: support binding multiple ACL tables by priority (#1732)

why
Currently vpp doesn't support binding multiple ACL tables. Each table has default permit-all rules appended to it. With multiple tables, traffic may match these default rules and skip the actual rules in the tables that follow.

what this PR does
  • Remove the default permit-all rules from each table.
  • If a table is empty, create a dummy rule that won't match any traffic, because vpp doesn't allow an empty table. The dummy rule matches dest-ip 0.0.0.0/32.
  • Sort all the tables in the table group by priority, since vpp doesn't support parallel matching.
  • Add a catch-all ACL group at the end: vpp's default behavior on no match is drop, but SONiC's is accept.
  • Fix a sonic-vpp crash caused by a race condition during stats pull: if the interface whose stats are being fetched has been removed, stat_segment_ls_r returns null.

Signed-off-by: Yue Gao <[email protected]>
Signed-off-by: Vivek Reddy <[email protected]>

* Update syncd/NotificationProcessor.cpp

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Vivek Reddy <[email protected]>

* Update syncd/FlowDump.cpp

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Vivek Reddy <[email protected]>

* Update syncd/FlowDump.cpp

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Vivek Reddy <[email protected]>

* Update meta/SaiSerialize.cpp

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Vivek Reddy <[email protected]>

* Handle co-pilot comments

Signed-off-by: Vivek Reddy <[email protected]>

---------

Signed-off-by: Vivek Reddy <[email protected]>
Signed-off-by: Ryan Garofano <[email protected]>
Signed-off-by: Rustiqly <[email protected]>
Signed-off-by: Yue Gao <[email protected]>
Co-authored-by: Ryan Garofano <[email protected]>
Co-authored-by: rustiqly <[email protected]>
Co-authored-by: yue-fred-gao <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Lihua Yuan <[email protected]>
5 participants