[mmu probing] pr02.probe: Add core probing algorithms with essential data structures by XuChen-MSFT · Pull Request #22540 · sonic-net/sonic-mgmt

XuChen-MSFT · 2026-02-23T13:25:31Z

Description of PR

Summary:
Implement platform-independent probing algorithms and supporting data types:

Data Structures:

IterationOutcome: Enumeration for single probe iteration results (SUCCESS/DROP/XOFF)
ProbingResult: Data class for final threshold detection results
ProbingExecutorProtocol: Protocol defining algorithm-executor interface

Algorithms:

LowerBoundProbingAlgorithm: Binary search for lower threshold bound
UpperBoundProbingAlgorithm: Binary search for upper threshold bound
ThresholdPointProbingAlgorithm: Precise threshold point detection
ThresholdRangeProbingAlgorithm: Threshold range detection with tolerance

All algorithms follow the executor protocol pattern for platform abstraction, enabling both physical hardware and simulation testing.
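As a rough illustration of the executor protocol pattern described above, the sketch below shows how an algorithm could be written against a structural interface and then run against a simulated backend. The class names, the `check` method, and its parameters are hypothetical; only the `(success, detected)` return pair is taken from the docstring fragment quoted later in this PR.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol, Tuple


class ExecutorSketch(Protocol):
    """Hypothetical stand-in for ProbingExecutorProtocol (method name assumed)."""

    def check(self, src_port: int, dst_port: int, value: int) -> Tuple[bool, bool]:
        """Return (success, detected) for one probe at `value` packets."""
        ...


@dataclass
class SimulatedExecutor:
    """Simulation backend: the threshold triggers once value >= true_threshold."""

    true_threshold: int

    def check(self, src_port: int, dst_port: int, value: int) -> Tuple[bool, bool]:
        return True, value >= self.true_threshold


def first_detected(executor: ExecutorSketch, src_port: int, dst_port: int,
                   candidates: List[int]) -> Optional[int]:
    """Platform-independent helper: return the first candidate that triggers."""
    for value in candidates:
        success, detected = executor.check(src_port, dst_port, value)
        if success and detected:
            return value
    return None
```

Because the helper depends only on the protocol, a hardware-backed executor with the same method could be substituted without changing the algorithm code.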

Fixes # (issue)

Type of change

Back port request

Approach

What is the motivation for this PR?

QoS refactoring

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

relevant PRs:
[mmu probing] pr01.docs: Add MMU threshold probing framework design
[mmu probing] pr02.probe: Add core probing algorithms with essential data structures
[mmu probing] pr03.probe: Add probing executors and executor registry
[mmu probing] pr04.probe: Add observer pattern for metrics tracking
[mmu probing] pr05.probe: Add stream manager and buffer occupancy controller
[mmu probing] pr06.probe: Add base framework and all probing implementations
[mmu probing] pr07.test: Add comprehensive unit tests for probe framework
[mmu probing] pr08.test: Add integration tests for end-to-end probing workflows
[mmu probing] pr09.test: Add production probe test and infrastructure updates

mssonicbld · 2026-02-23T13:25:39Z

/azp run

azure-pipelines · 2026-02-23T13:25:54Z

Azure Pipelines successfully started running 1 pipeline(s).

yxieca

Deep review done. Core probing algorithms and data structures look consistent; no functional issues found.

mssonicbld · 2026-02-24T15:54:03Z

/azp run

azure-pipelines · 2026-02-24T15:54:19Z

Azure Pipelines successfully started running 1 pipeline(s).

yxieca

Re-reviewed updates; no new issues found.

wsycqyz

lgtm

StormLiangMS

Review — PR #22540 (MMU Probing: Core Algorithms)

🔴 Bugs

1. Infinite loop in lower-bound algorithm when current reaches 1

current = max(current // 2, 1)  # line ~213

When current == 1 and threshold is still reached: max(1 // 2, 1) == max(0, 1) == 1. The while current >= 1 guard doesn't help because max(..., 1) clamps to 1. Loops until max_iterations hit.
Fix: Use current = current // 2 (allowing 0 to break loop) or add if current == 1: break.

2. Point algorithm continues on corrupted buffer after failure
When success == False in incremental mode, the code does continue but subsequent iterations still use drain_buffer=False. The buffer state is unknown after a failure, yet we add packets incrementally on top of it. Should abort or drain buffer on next iteration.

3. is_range property crashes on None bounds

def is_range(self) -> bool:
    return self.success and self.lower_bound < self.upper_bound

If bounds are None, raises TypeError. Add a None guard.
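The None guard suggested in item 3 could look roughly like the following. This is a minimal sketch: `ProbingResultSketch` and its field defaults are assumptions made for illustration, not the actual `ProbingResult` definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbingResultSketch:
    """Hypothetical reduced version of ProbingResult (fields assumed)."""

    success: bool
    lower_bound: Optional[int] = None
    upper_bound: Optional[int] = None

    @property
    def is_range(self) -> bool:
        # Guard first: comparing None with "<" raises TypeError in Python 3.
        if self.lower_bound is None or self.upper_bound is None:
            return False
        return self.success and self.lower_bound < self.upper_bound
```

With the guard in place, an inconsistent result (`success=True` with missing bounds) reports `False` rather than raising.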

⚠️ Design

Range algorithm backtracking can oscillate — stack grows unboundedly on success, and after pop() on failure, re-probes the same midpoint. With noisy hardware, this could oscillate between two ranges for all 50 iterations.
Precision check fails when candidate_threshold == 0 — range_size <= candidate_threshold * 0.05 becomes range_size <= 0, forcing binary search all the way to range=1.
Step size > 1 overshoot undocumented — with step_size=2, returned "precise" point has ±step_size tolerance. Should document this.
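To make the step-size concern concrete, here is a simplified model of an incremental scan. It assumes the point algorithm walks upward from a lower bound in increments of `step_size` and reports the first value that triggers; the function name and the `is_triggered` callback are hypothetical and the real algorithm may differ.

```python
from typing import Callable, Optional


def scan_for_point(lower_bound: int, upper_bound: int, step_size: int,
                   is_triggered: Callable[[int], bool]) -> Optional[int]:
    """Return the first probed value that triggers, or None if none does."""
    value = lower_bound
    while value <= upper_bound:
        if is_triggered(value):
            return value
        value += step_size  # values between steps are never probed
    return None


def triggers_at_7(value: int) -> bool:
    """Assumed ground truth for the example: threshold fires at 7 packets."""
    return value >= 7


exact = scan_for_point(4, 12, 1, triggers_at_7)   # probes 4, 5, 6, 7
coarse = scan_for_point(4, 12, 2, triggers_at_7)  # probes 4, 6, 8
```

In this model the coarse scan skips 7 and reports 8, so the reported point is only accurate to within the step size.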

Minor

Duplicate docstring parameter upper_bound in threshold_point_probing_algorithm.py
Upper-bound doubling can reach initial_value * 2^20 (~100B if initial is large)

mssonicbld · 2026-03-17T15:39:29Z

/azp run

XuChen-MSFT · 2026-03-17T15:39:46Z

@StormLiangMS Thanks for the review. All 3 bugs fixed + 1 additional fix found during analysis:

✅ Fixed (4 commits)

1. Lower-bound infinite loop at current=1 (ee789dc)
max(current // 2, 1) clamped to 1 forever. Changed to explicit if current <= 1: break since probing at 0 packets has no physical meaning.

2. is_range None guard (7efb4de)
Added lower_bound is not None and upper_bound is not None checks before < comparison. Prevents TypeError when success=True but bounds are None (inconsistent state).

3. Point algorithm buffer drain after failure (f734ef5)
After verification failure, buffer state is unknown but next iteration kept drain_buffer=False, sending incremental packets on corrupted state. Refactored to use drain_buffer as single state variable: starts True, set False on success, reset True on failure.

4. is_point None guard (40d0c97)
Same pattern as is_range — None == None returns True which is semantically wrong. Added None checks for consistency.
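The `drain_buffer` state variable from fix 3 can be illustrated in isolation. The sketch below only tracks which flag each iteration would use, given a sequence of verification outcomes; `drain_flags` is a hypothetical helper, not code from the PR.

```python
from typing import Iterable, List


def drain_flags(successes: Iterable[bool]) -> List[bool]:
    """Return the drain_buffer value used by each iteration.

    Starts True (first probe sends the full count on a clean buffer),
    becomes False after a success (incremental send is safe),
    and is reset to True after a failure (buffer state is unknown).
    """
    drain_buffer = True
    used = []
    for success in successes:
        used.append(drain_buffer)
        drain_buffer = not success
    return used
```

For outcomes `[True, True, False, True]`, the flags used are `[True, False, False, True]`: the iteration after the failure drains and resends the full count.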

⚠️ Acknowledged (Design Issues)

Range algorithm oscillation, precision check at 0, step_size docs — will address in follow-up if needed.
Upper-bound doubling overflow is protected by max_iterations=20 safety limit.

📝 Related PRs Updated

PR [mmu probing] pr07.test: Add comprehensive unit tests for probe framework #22545: +10 unit tests covering all 4 fixes (UT 390→400)
PR [mmu probing] pr08.test: Add integration tests for end-to-end probing workflows #22546: +3 integration tests (boundary threshold=1/2, intermittent failure recovery) + Python 3.12 compatibility fix for scapy mock (IT 62→65)

azure-pipelines · 2026-03-17T15:39:46Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-03-18T08:41:38Z

/azp run

azure-pipelines · 2026-03-18T08:41:52Z

Azure Pipelines successfully started running 1 pipeline(s).

XuChen-MSFT · 2026-03-18T08:42:36Z

Added anti-oscillation backtrack nudge to range algorithm (036f27c):

Problem: When verification fails at a specific candidate value, the algorithm pops back to the parent range and produces the same midpoint → same child → same failure → infinite oscillation until max_iterations (50).

Fix: On backtrack, nudge the parent boundary opposite to its last move direction, and merge boundaries to preserve the wider search space. Stack entries now carry direction metadata (init/left/right/nudge). Nudge size = max(1, range_size // 10).

Design choice: Stack-based backtrack with boundary nudge was chosen over a failed-candidates set because it handles a wider range of failure patterns (not just fixed bad values, but also region instability, timing issues, buffer state corruption).

Related:

BadSpot executors: PR [mmu probing] pr03.probe: Add probing executors and executor registry #22541 (dcc1c40) — deterministic failure simulation
UT coverage: PR [mmu probing] pr07.test: Add comprehensive unit tests for probe framework #22545 (d4fff81) — 7 backtrack scenario tests + 8 executor tests + registry fix
IT coverage: PR [mmu probing] pr08.test: Add integration tests for end-to-end probing workflows #22546 (4b59778) — 2 oscillation integration tests

Also addresses @StormLiangMS's design review: range algorithm oscillation concern.

azure-pipelines · 2026-03-18T14:17:59Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-03-18T14:43:57Z

/azp run

XuChen-MSFT · 2026-03-18T14:44:09Z

Added step_size precision documentation (ff14cc6):

Documented that step_size > 1 trades precision for speed (±step_size tolerance) in class docstring and __init__ arg
Removed duplicate upper_bound parameter in run() docstring

Addresses @StormLiangMS's review: step_size > 1 overshoot undocumented.

PR #22540 review items now fully addressed:

✅ 3 bugs fixed (infinite loop, is_range None, buffer drain)
✅ is_point None guard (found during analysis)
✅ Range oscillation anti-backtrack nudge
✅ Precision check max(1,...) guard
✅ step_size tolerance documented

azure-pipelines · 2026-03-18T14:44:14Z

Azure Pipelines successfully started running 1 pipeline(s).

Implement platform-independent probing algorithms and supporting data types:

Data Structures:
- IterationOutcome: Enumeration for single probe iteration results (SUCCESS/DROP/XOFF)
- ProbingResult: Data class for final threshold detection results
- ProbingExecutorProtocol: Protocol defining algorithm-executor interface

Algorithms:
- LowerBoundProbingAlgorithm: Binary search for lower threshold bound
- UpperBoundProbingAlgorithm: Binary search for upper threshold bound
- ThresholdPointProbingAlgorithm: Precise threshold point detection
- ThresholdRangeProbingAlgorithm: Threshold range detection with tolerance

All algorithms follow the executor protocol pattern for platform abstraction, enabling both physical hardware and simulation testing.

Signed-off-by: Xu Chen <[email protected]>

Signed-off-by: Xu Chen <[email protected]>

When current=1 and threshold is still detected, max(current // 2, 1) always evaluates to 1, causing the loop to spin until max_iterations. Replace with explicit break when current <= 1, since probing at 0 packets has no physical meaning. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>

When success=True but bounds are None (inconsistent state), the comparison lower_bound < upper_bound raises TypeError. Add explicit None checks before the comparison for defensive safety. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>

After a failed verification, buffer state is unknown. The next iteration must drain and resend the full packet count instead of incrementally adding on top of corrupted state. Simplified by using drain_buffer as the single state variable: starts True, set False on success, reset True on failure. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>

Same pattern as is_range fix: when success=True but bounds are None (inconsistent state), None == None returns True which is semantically wrong. Add explicit None checks for consistency. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>

When verification fails, the algorithm pops back to the parent range. Without adjustment, the same midpoint produces the same failing child, causing infinite oscillation until max_iterations. Fix: on backtrack, nudge the parent boundary in the opposite direction of its last move (soften the move that produced the failing child), and merge boundaries to preserve the wider search space explored by the failed descendant. Stack entries now carry direction metadata (init/left/right/nudge) to determine nudge direction. Nudge size is proportional to range: max(1, range_size // 10). Unified direction and step labels — removed redundant next_step variable. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>

When candidate_threshold is small (e.g. 10), precision target candidate * 0.05 = 0.5 < 1. With bad_spot at the threshold value, range_size stays at 1 but 1 <= 0.5 is never satisfied, burning all 50 max_iterations. Use max(1, ...) to ensure precision check can terminate when range narrows to 1 packet granularity. Validated by UT (PR sonic-net#22545) and IT (PR sonic-net#22546) — both FAIL without this fix (50 iterations), PASS with fix (~18 iterations). Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>

Add note that step_size > 1 trades precision for speed (±step_size tolerance). Remove duplicate upper_bound parameter in run() docstring. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>

mssonicbld · 2026-03-24T04:59:38Z

/azp run

azure-pipelines · 2026-03-24T04:59:51Z

Azure Pipelines successfully started running 1 pipeline(s).

tests/saitests/probe/probing_executor_protocol.py

+            src_port: Source port for traffic generation
+            dst_port: Destination port for threshold detection
+        """
+        ...


tests/saitests/probe/probing_executor_protocol.py

+                - success: True if verification completed without errors
+                - detected: True if threshold was triggered at this value
+        """
+        ...


StormLiangMS

⚠️ Approve with 3 findings — Solid algorithm design, a few edge cases to address.

[Critical] Lower Bound returns None instead of minimum value when threshold always triggered
lower_bound_probing_algorithm.py:215-220

When the threshold is triggered even at current=1, the algorithm breaks out of the loop and falls through to return (None, phase_time). The comment correctly identifies "Cannot reduce below 1 — threshold is reached even at minimum" but then treats this as a failure case. The lower bound should be 1 (or 0), not None.

if current <= 1:
    # Cannot reduce below 1 — threshold is reached even at minimum
    break
# ... falls through to:
return (None, phase_time)  # Should return (1, phase_time)

Suggested fix: Before the error return, check if the loop exited because current <= 1 with detected == True, and return (1, phase_time) instead of None.
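A minimal sketch of a halving loop with the suggested behavior is shown below. The function name, the `is_triggered` callback, and the starting point `upper_bound // 2` are assumptions for illustration; only the `current <= 1` handling and the `max_iterations` limit come from the discussion.

```python
from typing import Callable, Optional


def find_lower_bound(upper_bound: int, is_triggered: Callable[[int], bool],
                     max_iterations: int = 20) -> Optional[int]:
    """Halve from upper_bound until a value no longer triggers the threshold."""
    current = max(upper_bound // 2, 1)
    for _ in range(max_iterations):
        if not is_triggered(current):
            return current  # threshold not reached here: valid lower bound
        if current <= 1:
            # Still triggered at the minimum: 1 is the best lower bound
            # available, so report it instead of signalling failure.
            return 1
        current //= 2
    return None  # iteration budget exhausted without an answer
```

With a threshold that triggers at 100 packets and `upper_bound=256`, the loop probes 128 (triggered), then 64 (not triggered), and returns 64. With a threshold that always triggers, it walks down to 1 and returns 1.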

[High] Backtrack nudge can produce negative or zero range start
threshold_range_probing_algorithm.py:778

The anti-oscillation backtrack logic subtracts a nudge from merged_start without bounds checking:

if parent_dir in ('right', 'init'):
    merged_start -= nudge  # Can go negative if merged_start is small

Scenario: merged_start=1, nudge=2 → merged_start=-1, creating an invalid search range.

Suggested fix: merged_start = max(0, merged_start - nudge)
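The clamped nudge can be sketched as a standalone function. The `('right', 'init')` branch and the nudge size `max(1, range_size // 10)` follow the snippets quoted in this PR; the function name and the handling of the other directions (widening the end of the range) are assumptions.

```python
from typing import Tuple


def nudged_range(merged_start: int, merged_end: int,
                 parent_dir: str) -> Tuple[int, int]:
    """Apply an anti-oscillation nudge to a merged parent range."""
    range_size = merged_end - merged_start
    nudge = max(1, range_size // 10)
    if parent_dir in ('right', 'init'):
        # Parent last moved its start to the right: soften that move,
        # but never let the start fall below 0 packets.
        merged_start = max(0, merged_start - nudge)
    else:
        # Assumed mirror case: parent last moved its end to the left.
        merged_end += nudge
    return merged_start, merged_end
```

For the scenario in the finding, `nudged_range(1, 25, 'right')` computes a nudge of 2 and yields `(0, 25)` rather than a start of -1.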

[Medium] Float precision in termination check
threshold_range_probing_algorithm.py:744

precision_reached = range_size <= max(1, candidate_threshold * self.precision_target_ratio)

int * float = float. For typical MMU thresholds (thousands–millions) this won't matter, but for consistency: max(1, int(candidate_threshold * self.precision_target_ratio)).
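The effect of the two guards on the termination check can be seen in a small standalone version. The function below mirrors the expression quoted above; the default ratio of 0.05 is taken from the earlier design note, and the free-function form is an assumption for illustration.

```python
def precision_reached(range_size: int, candidate_threshold: int,
                      precision_target_ratio: float = 0.05) -> bool:
    """True when the search range is narrow enough relative to the candidate."""
    # int() keeps the comparison integer-to-integer; max(1, ...) ensures the
    # check can pass once the range has narrowed to a single packet.
    target = max(1, int(candidate_threshold * precision_target_ratio))
    return range_size <= target
```

Without `max(1, ...)`, a candidate of 10 gives a target of 0.5, so a range of 1 packet never satisfies the check. With the guard, the target is 1 and the search can terminate.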

@StormLiangMS

…t minimum When threshold is triggered even at current=1, the algorithm should return (1, phase_time) as a valid lower bound, not (None, phase_time) which incorrectly signals failure. This allows Phase 3 to continue narrowing the range [1, upper_bound] instead of aborting the entire probe. Addresses @StormLiangMS review (2026-03-25): lower bound returns None instead of minimum value when threshold always triggered. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>

mssonicbld · 2026-03-26T06:34:49Z

/azp run

azure-pipelines · 2026-03-26T06:35:03Z

Azure Pipelines successfully started running 1 pipeline(s).

3 new/updated UTs for LowerBoundProbingAlgorithm:
- test_lower_bound_returns_1_when_threshold_always_triggered (order 8350)
- test_lower_bound_returns_1_with_various_upper_bounds (order 8351)
- Updated test_run_no_infinite_loop_at_current_one (order 8340): assert 1 not None
- Updated test_run_maximum_iterations_exceeded (order 8260): assert 1 not None
- Updated test_run_reaches_minimum_value (order 8270): assert 1 not None

1 new IT for PfcXoffProbing:
- test_pfc_xoff_lower_bound_returns_value_not_none: threshold=1, verify probe succeeds with lower_bound=1 instead of failing

Validates fix in PR sonic-net#22540: lower bound returns 1 instead of None when threshold triggered at minimum packet count.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>

XuChen-MSFT · 2026-03-26T06:36:23Z

@StormLiangMS Re: [Critical] Lower Bound returns None instead of minimum value

Fixed (2d1592b): When threshold is triggered at current=1, the algorithm now returns (1, phase_time) instead of breaking out and falling through to return (None, phase_time). This allows Phase 3 to continue narrowing the range [1, upper_bound] instead of aborting the entire probe.

UT/IT coverage: PR #23341 — 2 new UTs + 3 updated UTs + 1 new IT (threshold=1 boundary test).

@StormLiangMS

1. Backtrack nudge: max(0, merged_start - nudge) prevents negative range start when nudging near lower_bound=0.
2. Precision check: int(candidate * ratio) ensures integer comparison with integer range_size for type consistency.

Addresses @StormLiangMS review (2026-03-25): backtrack nudge negative range start + float precision in termination check.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>

mssonicbld · 2026-03-26T07:40:43Z

/azp run

azure-pipelines · 2026-03-26T07:40:56Z

Azure Pipelines successfully started running 1 pipeline(s).

…istency

2 new UTs for ThresholdRangeProbingAlgorithm:
- test_backtrack_nudge_bounds_check (order 8890): verifies max(0,...) prevents negative merged_start when nudging near lower_bound=0
- test_precision_int_consistency (order 8891): verifies int() wrapper on precision target for type-consistent comparison

1 updated UT:
- test_precision_check_at_small_threshold_with_bad_spot (order 8870): updated expectation for max(0,...) bounds check interaction with backtrack near 0

Validates fix in PR sonic-net#22540: backtrack nudge bounds + precision int.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>

XuChen-MSFT · 2026-03-26T07:41:33Z

@StormLiangMS Re: [High] Backtrack nudge negative range + [Medium] Float precision

Fixed (00a5df7):

merged_start = max(0, merged_start - nudge) — prevents negative range start when nudging near lower_bound=0
int(candidate_threshold * self.precision_target_ratio) — ensures integer comparison with integer range_size

UT coverage: PR #23341 (cd5a61b) — 2 new UTs (bounds check + int consistency) + 1 updated UT.

XuChen-MSFT requested review from StormLiangMS, bingwang-ms, kperumalbfn, wsycqyz and yxieca February 23, 2026 13:26

yxieca previously approved these changes Feb 23, 2026

XuChen-MSFT dismissed yxieca’s stale review via 578a67d February 24, 2026 15:52

yxieca previously approved these changes Feb 24, 2026

wsycqyz previously approved these changes Feb 25, 2026

StormLiangMS reviewed Mar 17, 2026

XuChen-MSFT dismissed stale reviews from wsycqyz and yxieca via 40d0c97 March 17, 2026 15:39

XuChen-MSFT and others added 9 commits March 24, 2026 12:59

fix pre-commit errors

07c2900

Signed-off-by: Xu Chen <[email protected]>

XuChen-MSFT force-pushed the xuchen3/mmu_probe/pr02-algorithms branch from ff14cc6 to 24b3fdb Compare March 24, 2026 04:59

github-advanced-security bot found potential problems Mar 24, 2026

StormLiangMS reviewed Mar 25, 2026

XuChen-MSFT mentioned this pull request Mar 26, 2026

[mmu probing] pr10.test: Add supplementary UTs and IT for review findings #23341

Open

12 tasks

Copilot AI mentioned this pull request Mar 27, 2026

Review all human-authored PRs opened in the past 24 hours in sonic-net/sonic-mgmt #23365

Draft
