Skip to content

[mmu probing] pr02.probe: Add core probing algorithms with essential data structures#22540

Open
XuChen-MSFT wants to merge 11 commits intosonic-net:masterfrom
XuChen-MSFT:xuchen3/mmu_probe/pr02-algorithms
Open

[mmu probing] pr02.probe: Add core probing algorithms with essential data structures#22540
XuChen-MSFT wants to merge 11 commits intosonic-net:masterfrom
XuChen-MSFT:xuchen3/mmu_probe/pr02-algorithms

Conversation

@XuChen-MSFT
Copy link
Contributor

@XuChen-MSFT XuChen-MSFT commented Feb 23, 2026

Description of PR

Summary:
Implement platform-independent probing algorithms and supporting data types:

Data Structures:

  • IterationOutcome: Enumeration for single probe iteration results (SUCCESS/DROP/XOFF)
  • ProbingResult: Data class for final threshold detection results
  • ProbingExecutorProtocol: Protocol defining algorithm-executor interface

Algorithms:

  • LowerBoundProbingAlgorithm: Binary search for lower threshold bound
  • UpperBoundProbingAlgorithm: Binary search for upper threshold bound
  • ThresholdPointProbingAlgorithm: Precise threshold point detection
  • ThresholdRangeProbingAlgorithm: Threshold range detection with tolerance

All algorithms follow the executor protocol pattern for platform abstraction, enabling both physical hardware and simulation testing.

Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

qos refactoring

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

relevant PRs:
[mmu probing] pr01.docs: Add MMU threshold probing framework design
[mmu probing] pr02.probe: Add core probing algorithms with essential data structures
[mmu probing] pr03.probe: Add probing executors and executor registry
[mmu probing] pr04.probe: Add observer pattern for metrics tracking
[mmu probing] pr05.probe: Add stream manager and buffer occupancy controller
[mmu probing] pr06.probe: Add base framework and all probing implementations
[mmu probing] pr07.test: Add comprehensive unit tests for probe framework
[mmu probing] pr08.test: Add integration tests for end-to-end probing workflows
[mmu probing] pr09.test: Add production probe test and infrastructure updates

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

yxieca
yxieca previously approved these changes Feb 23, 2026
Copy link
Collaborator

@yxieca yxieca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep review done. Core probing algorithms and data structures look consistent; no functional issues found.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

yxieca
yxieca previously approved these changes Feb 24, 2026
Copy link
Collaborator

@yxieca yxieca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed updates; no new issues found.

wsycqyz
wsycqyz previously approved these changes Feb 25, 2026
Copy link
Contributor

@wsycqyz wsycqyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lğtm

Copy link
Collaborator

@StormLiangMS StormLiangMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — PR #22540 (MMU Probing: Core Algorithms)

🔴 Bugs

1. Infinite loop in lower-bound algorithm when current reaches 1

current = max(current // 2, 1)  # line ~213

When current == 1 and threshold is still reached: max(1 // 2, 1) == max(0, 1) == 1. The while current >= 1 guard doesn't help because max(..., 1) clamps to 1. Loops until max_iterations hit.
Fix: Use current = current // 2 (allowing 0 to break loop) or add if current == 1: break.

2. Point algorithm continues on corrupted buffer after failure
When success == False in incremental mode, the code does continue but subsequent iterations still use drain_buffer=False. The buffer state is unknown after a failure, yet we add packets incrementally on top of it. Should abort or drain buffer on next iteration.

3. is_range property crashes on None bounds

def is_range(self) -> bool:
    return self.success and self.lower_bound < self.upper_bound

If bounds are None, raises TypeError. Add a None guard.

⚠️ Design

  • Range algorithm backtracking can oscillate — stack grows unboundedly on success, and after pop() on failure, re-probes the same midpoint. With noisy hardware, this could oscillate between two ranges for all 50 iterations.

  • Precision check fails when candidate_threshold == 0range_size <= candidate_threshold * 0.05 becomes range_size <= 0, forcing binary search all the way to range=1.

  • Step size > 1 overshoot undocumented — with step_size=2, returned "precise" point has ±step_size tolerance. Should document this.

Minor

  • Duplicate docstring parameter upper_bound in threshold_point_probing_algorithm.py
  • Upper-bound doubling can reach initial_value * 2^20 (~100B if initial is large)

@XuChen-MSFT XuChen-MSFT dismissed stale reviews from wsycqyz and yxieca via 40d0c97 March 17, 2026 15:39
@mssonicbld
Copy link
Collaborator

/azp run

@XuChen-MSFT
Copy link
Contributor Author

@StormLiangMS Thanks for the review. All 3 bugs fixed + 1 additional fix found during analysis:

✅ Fixed (4 commits)

1. Lower-bound infinite loop at current=1 (ee789dc)
max(current // 2, 1) clamped to 1 forever. Changed to explicit if current <= 1: break since probing at 0 packets has no physical meaning.

2. is_range None guard (7efb4de)
Added lower_bound is not None and upper_bound is not None checks before < comparison. Prevents TypeError when success=True but bounds are None (inconsistent state).

3. Point algorithm buffer drain after failure (f734ef5)
After verification failure, buffer state is unknown but next iteration kept drain_buffer=False, sending incremental packets on corrupted state. Refactored to use drain_buffer as single state variable: starts True, set False on success, reset True on failure.

4. is_point None guard (40d0c97)
Same pattern as is_rangeNone == None returns True which is semantically wrong. Added None checks for consistency.

⚠️ Acknowledged (Design Issues)

  • Range algorithm oscillation, precision check at 0, step_size docs — will address in follow-up if needed.
  • Upper-bound doubling overflow is protected by max_iterations=20 safety limit.

📝 Related PRs Updated

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@XuChen-MSFT
Copy link
Contributor Author

Added anti-oscillation backtrack nudge to range algorithm (036f27c):

Problem: When verification fails at a specific candidate value, the algorithm pops back to the parent range and produces the same midpoint → same child → same failure → infinite oscillation until max_iterations (50).

Fix: On backtrack, nudge the parent boundary opposite to its last move direction, and merge boundaries to preserve the wider search space. Stack entries now carry direction metadata (init/left/right/nudge). Nudge size = max(1, range_size // 10).

Design choice: Stack-based backtrack with boundary nudge was chosen over a failed-candidates set because it handles a wider range of failure patterns (not just fixed bad values, but also region instability, timing issues, buffer state corruption).

Related:

Also addresses @StormLiangMS's design review: range algorithm oscillation concern.

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@XuChen-MSFT
Copy link
Contributor Author

Added step_size precision documentation (ff14cc6):

  • Documented that step_size > 1 trades precision for speed (±step_size tolerance) in class docstring and __init__ arg
  • Removed duplicate upper_bound parameter in run() docstring

Addresses @StormLiangMS's review: step_size > 1 overshoot undocumented.

PR #22540 review items now fully addressed:

  • ✅ 3 bugs fixed (infinite loop, is_range None, buffer drain)
  • ✅ is_point None guard (found during analysis)
  • ✅ Range oscillation anti-backtrack nudge
  • ✅ Precision check max(1,...) guard
  • ✅ step_size tolerance documented

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

XuChen-MSFT and others added 9 commits March 24, 2026 12:59
Implement platform-independent probing algorithms and supporting data types:

Data Structures:
- IterationOutcome: Enumeration for single probe iteration results (SUCCESS/DROP/XOFF)
- ProbingResult: Data class for final threshold detection results
- ProbingExecutorProtocol: Protocol defining algorithm-executor interface

Algorithms:
- LowerBoundProbingAlgorithm: Binary search for lower threshold bound
- UpperBoundProbingAlgorithm: Binary search for upper threshold bound
- ThresholdPointProbingAlgorithm: Precise threshold point detection
- ThresholdRangeProbingAlgorithm: Threshold range detection with tolerance

All algorithms follow the executor protocol pattern for platform abstraction,
enabling both physical hardware and simulation testing.

Signed-off-by: Xu Chen <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
When current=1 and threshold is still detected, max(current // 2, 1)
always evaluates to 1, causing the loop to spin until max_iterations.
Replace with explicit break when current <= 1, since probing at 0
packets has no physical meaning.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
When success=True but bounds are None (inconsistent state), the
comparison lower_bound < upper_bound raises TypeError. Add explicit
None checks before the comparison for defensive safety.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
After a failed verification, buffer state is unknown. The next iteration
must drain and resend the full packet count instead of incrementally
adding on top of corrupted state. Simplified by using drain_buffer as
the single state variable: starts True, set False on success, reset
True on failure.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
Same pattern as is_range fix: when success=True but bounds are None
(inconsistent state), None == None returns True which is semantically
wrong. Add explicit None checks for consistency.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
When verification fails, the algorithm pops back to the parent range.
Without adjustment, the same midpoint produces the same failing child,
causing infinite oscillation until max_iterations.

Fix: on backtrack, nudge the parent boundary in the opposite direction
of its last move (soften the move that produced the failing child),
and merge boundaries to preserve the wider search space explored by
the failed descendant. Stack entries now carry direction metadata
(init/left/right/nudge) to determine nudge direction.

Nudge size is proportional to range: max(1, range_size // 10).
Unified direction and step labels — removed redundant next_step variable.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
When candidate_threshold is small (e.g. 10), precision target
candidate * 0.05 = 0.5 < 1. With bad_spot at the threshold value,
range_size stays at 1 but 1 <= 0.5 is never satisfied, burning all
50 max_iterations. Use max(1, ...) to ensure precision check can
terminate when range narrows to 1 packet granularity.

Validated by UT (PR sonic-net#22545) and IT (PR sonic-net#22546) — both FAIL without
this fix (50 iterations), PASS with fix (~18 iterations).

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
Add note that step_size > 1 trades precision for speed (±step_size
tolerance). Remove duplicate upper_bound parameter in run() docstring.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
@XuChen-MSFT XuChen-MSFT force-pushed the xuchen3/mmu_probe/pr02-algorithms branch from ff14cc6 to 24b3fdb Compare March 24, 2026 04:59
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

src_port: Source port for traffic generation
dst_port: Destination port for threshold detection
"""
...

Check notice

Code scanning / CodeQL

Statement has no effect Note test

This statement has no effect.
- success: True if verification completed without errors
- detected: True if threshold was triggered at this value
"""
...

Check notice

Code scanning / CodeQL

Statement has no effect Note test

This statement has no effect.
Copy link
Collaborator

@StormLiangMS StormLiangMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Approve with 3 findings — Solid algorithm design, a few edge cases to address.


[Critical] Lower Bound returns None instead of minimum value when threshold always triggered
lower_bound_probing_algorithm.py:215-220

When the threshold is triggered even at current=1, the algorithm breaks out of the loop and falls through to return (None, phase_time). The comment correctly identifies "Cannot reduce below 1 — threshold is reached even at minimum" but then treats this as a failure case. The lower bound should be 1 (or 0), not None.

if current <= 1:
    # Cannot reduce below 1 — threshold is reached even at minimum
    break
# ... falls through to:
return (None, phase_time)  # Should return (1, phase_time)

Suggested fix: Before the error return, check if the loop exited because current <= 1 with detected == True, and return (1, phase_time) instead of None.


[High] Backtrack nudge can produce negative or zero range start
threshold_range_probing_algorithm.py:778

The anti-oscillation backtrack logic subtracts a nudge from merged_start without bounds checking:

if parent_dir in ('right', 'init'):
    merged_start -= nudge  # Can go negative if merged_start is small

Scenario: merged_start=1, nudge=2merged_start=-1, creating an invalid search range.

Suggested fix: merged_start = max(0, merged_start - nudge)


[Medium] Float precision in termination check
threshold_range_probing_algorithm.py:744

precision_reached = range_size <= max(1, candidate_threshold * self.precision_target_ratio)

int * float = float. For typical MMU thresholds (thousands–millions) this won't matter, but for consistency: max(1, int(candidate_threshold * self.precision_target_ratio)).

…t minimum

When threshold is triggered even at current=1, the algorithm should return
(1, phase_time) as a valid lower bound, not (None, phase_time) which
incorrectly signals failure. This allows Phase 3 to continue narrowing
the range [1, upper_bound] instead of aborting the entire probe.

Addresses @StormLiangMS review (2026-03-25): lower bound returns None
instead of minimum value when threshold always triggered.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

XuChen-MSFT added a commit to XuChen-MSFT/sonic-mgmt that referenced this pull request Mar 26, 2026
3 new/updated UTs for LowerBoundProbingAlgorithm:
- test_lower_bound_returns_1_when_threshold_always_triggered (order 8350)
- test_lower_bound_returns_1_with_various_upper_bounds (order 8351)
- Updated test_run_no_infinite_loop_at_current_one (order 8340): assert 1 not None
- Updated test_run_maximum_iterations_exceeded (order 8260): assert 1 not None
- Updated test_run_reaches_minimum_value (order 8270): assert 1 not None

1 new IT for PfcXoffProbing:
- test_pfc_xoff_lower_bound_returns_value_not_none: threshold=1, verify
  probe succeeds with lower_bound=1 instead of failing

Validates fix in PR sonic-net#22540: lower bound returns 1 instead of None when
threshold triggered at minimum packet count.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
@XuChen-MSFT
Copy link
Contributor Author

@StormLiangMS Re: [Critical] Lower Bound returns None instead of minimum value

Fixed (2d1592b): When threshold is triggered at current=1, the algorithm now returns (1, phase_time) instead of breaking out and falling through to return (None, phase_time). This allows Phase 3 to continue narrowing the range [1, upper_bound] instead of aborting the entire probe.

UT/IT coverage: PR #23341 — 2 new UTs + 3 updated UTs + 1 new IT (threshold=1 boundary test).

1. Backtrack nudge: max(0, merged_start - nudge) prevents negative range
   start when nudging near lower_bound=0.
2. Precision check: int(candidate * ratio) ensures integer comparison
   with integer range_size for type consistency.

Addresses @StormLiangMS review (2026-03-25): backtrack nudge negative
range start + float precision in termination check.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

XuChen-MSFT added a commit to XuChen-MSFT/sonic-mgmt that referenced this pull request Mar 26, 2026
…istency

2 new UTs for ThresholdRangeProbingAlgorithm:
- test_backtrack_nudge_bounds_check (order 8890): verifies max(0,...) prevents
  negative merged_start when nudging near lower_bound=0
- test_precision_int_consistency (order 8891): verifies int() wrapper on
  precision target for type-consistent comparison

1 updated UT:
- test_precision_check_at_small_threshold_with_bad_spot (order 8870): updated
  expectation for max(0,...) bounds check interaction with backtrack near 0

Validates fix in PR sonic-net#22540: backtrack nudge bounds + precision int.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
@XuChen-MSFT
Copy link
Contributor Author

@StormLiangMS Re: [High] Backtrack nudge negative range + [Medium] Float precision

Fixed (00a5df7):

  1. merged_start = max(0, merged_start - nudge) — prevents negative range start when nudging near lower_bound=0
  2. int(candidate_threshold * self.precision_target_ratio) — ensures integer comparison with integer range_size

UT coverage: PR #23341 (cd5a61b) — 2 new UTs (bounds check + int consistency) + 1 updated UT.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants