
[action] [PR:1766] [test] Fix flaky FlexCounter.bulkChunksize by replacing usleep with poll-wait#1823

Open
mssonicbld wants to merge 1 commit into sonic-net:202511 from mssonicbld:cherry/202511/1766

Conversation

@mssonicbld
Collaborator

Description

Fix flaky FlexCounter.bulkChunksize unit test that intermittently fails in CI due to timing-dependent usleep(1000*1050).

Issue: #1765

Root Cause

The test uses usleep(1000*1050) (1.05s) to wait for the FlexCounter polling thread (1s poll interval) to complete counter collection. Under CI load, the polling thread may not finish within this window, causing assertion failures like:

TestFlexCounter.cpp:1390: Failure
Expected equality of these values:
  object_count
    Which is: 6
  unifiedBulkChunkSize
    Which is: 3

This failure reproduces on completely unrelated PRs (#1763, #1764), confirming it is timing-dependent and not caused by code changes.
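The thin margin described above can be made concrete with a small arithmetic model. This is a simplified sketch, not the FlexCounter implementation: it assumes results become visible one poll interval after the counter is added, plus some extra scheduling or collection delay. The names `slack` and `fixedSleepSuffices` are hypothetical.

```cpp
#include <chrono>

using std::chrono::milliseconds;

// Time left over after one poll interval when the test sleeps for sleepTime.
// For a 1000 ms interval and a 1050 ms sleep this is only 50 ms.
static milliseconds slack(milliseconds pollInterval, milliseconds sleepTime)
{
    return sleepTime - pollInterval;
}

// Assumed model: counters are ready at pollInterval + extraDelay, where
// extraDelay covers thread scheduling and collection time under load.
// A fixed sleep only works if it outlasts that moment.
static bool fixedSleepSuffices(milliseconds pollInterval,
                               milliseconds extraDelay,
                               milliseconds sleepTime)
{
    return sleepTime >= pollInterval + extraDelay;
}
```

Under this model, any extra delay above 50 ms (plausible on a loaded CI agent) makes the test read the database before the polling thread has written its results.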

Fix

Replace all usleep(1000*1050) calls with deterministic poll-wait helpers:

  • waitForCounterKeys(table, expectedCount) — polls until the expected number of counter keys appear in COUNTERS_DB
  • waitForCounterValue(table, key, field) — polls until a specific counter field has a non-empty value

Both helpers poll every 100ms with a 5-second timeout. This removes the dependence on a fixed sleep duration while keeping tests fast on unloaded machines (they typically complete within 1-2 polls).

Also replace the similar usleep(1000*1000) and usleep(1000*2000) calls in queryCounterCapability / addRemoveCounterForPort, which have the same timing vulnerability.
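The poll-wait pattern described in this section could look roughly like the sketch below. It is a generic illustration, not the code from this PR: `pollUntil` is a hypothetical name, and the actual helpers query COUNTERS_DB tables rather than taking an arbitrary predicate. Only the 100ms interval, the 5-second timeout, and the use of `steady_clock` come from the description.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical generic helper: re-check `condition` every `interval`
// until it returns true or `timeout` has elapsed.
// Returns true if the condition was met, false on timeout.
static bool pollUntil(
        const std::function<bool()>& condition,
        std::chrono::milliseconds timeout = std::chrono::milliseconds(5000),
        std::chrono::milliseconds interval = std::chrono::milliseconds(100))
{
    // steady_clock is monotonic, so wall-clock adjustments cannot
    // shorten or extend the deadline.
    const auto deadline = std::chrono::steady_clock::now() + timeout;

    while (true)
    {
        if (condition())
        {
            return true;
        }

        if (std::chrono::steady_clock::now() >= deadline)
        {
            return false;
        }

        std::this_thread::sleep_for(interval);
    }
}
```

A test would then assert on the return value before checking counters, so a timeout shows up as an explicit failure instead of a confusing mismatch in counter values.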

Changes

  • unittest/syncd/TestFlexCounter.cpp:
    • Add #include <chrono> for steady_clock
    • Add waitForCounterKeys() helper
    • Add waitForCounterValue() helper
    • Replace 8x usleep(1000*1050), 1x usleep(1000*1000), 1x usleep(1000*2000) across testAddRemoveCounter, FlexCounter.counterIdChange, FlexCounter.addRemoveCounterForPort, and testDashMeterAddRemoveCounter
    • Retain usleep(60*1000) in updateSwitchDebugCounterIdList (50ms poll interval, not flaky)

Signed-off-by: Sonic Build Admin <sonicbld@microsoft.com>

@mssonicbld
Collaborator Author

Original PR: #1766

@mssonicbld
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
