[mmu probing] pr02.probe: Add core probing algorithms with essential data structures#22540
[mmu probing] pr02.probe: Add core probing algorithms with essential data structures#22540XuChen-MSFT wants to merge 11 commits intosonic-net:masterfrom
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
yxieca
left a comment
There was a problem hiding this comment.
Deep review done. Core probing algorithms and data structures look consistent; no functional issues found.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
yxieca
left a comment
There was a problem hiding this comment.
Re-reviewed updates; no new issues found.
StormLiangMS
left a comment
There was a problem hiding this comment.
Review — PR #22540 (MMU Probing: Core Algorithms)
🔴 Bugs
1. Infinite loop in lower-bound algorithm when current reaches 1
current = max(current // 2, 1) # line ~213When current == 1 and threshold is still reached: max(1 // 2, 1) == max(0, 1) == 1. The while current >= 1 guard doesn't help because max(..., 1) clamps to 1. Loops until max_iterations hit.
Fix: Use current = current // 2 (allowing 0 to break loop) or add if current == 1: break.
2. Point algorithm continues on corrupted buffer after failure
When success == False in incremental mode, the code does continue but subsequent iterations still use drain_buffer=False. The buffer state is unknown after a failure, yet we add packets incrementally on top of it. Should abort or drain buffer on next iteration.
3. is_range property crashes on None bounds
def is_range(self) -> bool:
return self.success and self.lower_bound < self.upper_boundIf bounds are None, raises TypeError. Add a None guard.
⚠️ Design
-
Range algorithm backtracking can oscillate — stack grows unboundedly on success, and after
pop()on failure, re-probes the same midpoint. With noisy hardware, this could oscillate between two ranges for all 50 iterations. -
Precision check fails when
candidate_threshold == 0—range_size <= candidate_threshold * 0.05becomesrange_size <= 0, forcing binary search all the way to range=1. -
Step size > 1 overshoot undocumented — with
step_size=2, returned "precise" point has ±step_size tolerance. Should document this.
Minor
- Duplicate docstring parameter
upper_boundinthreshold_point_probing_algorithm.py - Upper-bound doubling can reach
initial_value * 2^20(~100B if initial is large)
|
/azp run |
|
@StormLiangMS Thanks for the review. All 3 bugs fixed + 1 additional fix found during analysis: ✅ Fixed (4 commits)1. Lower-bound infinite loop at current=1 ( 2. 3. Point algorithm buffer drain after failure ( 4.
|
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Added anti-oscillation backtrack nudge to range algorithm ( Problem: When verification fails at a specific candidate value, the algorithm pops back to the parent range and produces the same midpoint → same child → same failure → infinite oscillation until max_iterations (50). Fix: On backtrack, nudge the parent boundary opposite to its last move direction, and merge boundaries to preserve the wider search space. Stack entries now carry direction metadata (init/left/right/nudge). Nudge size = max(1, range_size // 10). Design choice: Stack-based backtrack with boundary nudge was chosen over a failed-candidates set because it handles a wider range of failure patterns (not just fixed bad values, but also region instability, timing issues, buffer state corruption). Related:
Also addresses @StormLiangMS's design review: range algorithm oscillation concern. |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Added step_size precision documentation (
Addresses @StormLiangMS's review: step_size > 1 overshoot undocumented. PR #22540 review items now fully addressed:
|
|
Azure Pipelines successfully started running 1 pipeline(s). |
Implement platform-independent probing algorithms and supporting data types: Data Structures: - IterationOutcome: Enumeration for single probe iteration results (SUCCESS/DROP/XOFF) - ProbingResult: Data class for final threshold detection results - ProbingExecutorProtocol: Protocol defining algorithm-executor interface Algorithms: - LowerBoundProbingAlgorithm: Binary search for lower threshold bound - UpperBoundProbingAlgorithm: Binary search for upper threshold bound - ThresholdPointProbingAlgorithm: Precise threshold point detection - ThresholdRangeProbingAlgorithm: Threshold range detection with tolerance All algorithms follow the executor protocol pattern for platform abstraction, enabling both physical hardware and simulation testing. Signed-off-by: Xu Chen <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
When current=1 and threshold is still detected, max(current // 2, 1) always evaluates to 1, causing the loop to spin until max_iterations. Replace with explicit break when current <= 1, since probing at 0 packets has no physical meaning. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
When success=True but bounds are None (inconsistent state), the comparison lower_bound < upper_bound raises TypeError. Add explicit None checks before the comparison for defensive safety. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
After a failed verification, buffer state is unknown. The next iteration must drain and resend the full packet count instead of incrementally adding on top of corrupted state. Simplified by using drain_buffer as the single state variable: starts True, set False on success, reset True on failure. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
Same pattern as is_range fix: when success=True but bounds are None (inconsistent state), None == None returns True which is semantically wrong. Add explicit None checks for consistency. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
When verification fails, the algorithm pops back to the parent range. Without adjustment, the same midpoint produces the same failing child, causing infinite oscillation until max_iterations. Fix: on backtrack, nudge the parent boundary in the opposite direction of its last move (soften the move that produced the failing child), and merge boundaries to preserve the wider search space explored by the failed descendant. Stack entries now carry direction metadata (init/left/right/nudge) to determine nudge direction. Nudge size is proportional to range: max(1, range_size // 10). Unified direction and step labels — removed redundant next_step variable. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
When candidate_threshold is small (e.g. 10), precision target candidate * 0.05 = 0.5 < 1. With bad_spot at the threshold value, range_size stays at 1 but 1 <= 0.5 is never satisfied, burning all 50 max_iterations. Use max(1, ...) to ensure precision check can terminate when range narrows to 1 packet granularity. Validated by UT (PR sonic-net#22545) and IT (PR sonic-net#22546) — both FAIL without this fix (50 iterations), PASS with fix (~18 iterations). Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
Add note that step_size > 1 trades precision for speed (±step_size tolerance). Remove duplicate upper_bound parameter in run() docstring. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
ff14cc6 to
24b3fdb
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| src_port: Source port for traffic generation | ||
| dst_port: Destination port for threshold detection | ||
| """ | ||
| ... |
Check notice
Code scanning / CodeQL
Statement has no effect Note test
| - success: True if verification completed without errors | ||
| - detected: True if threshold was triggered at this value | ||
| """ | ||
| ... |
Check notice
Code scanning / CodeQL
Statement has no effect Note test
StormLiangMS
left a comment
There was a problem hiding this comment.
[Critical] Lower Bound returns None instead of minimum value when threshold always triggered
lower_bound_probing_algorithm.py:215-220
When the threshold is triggered even at current=1, the algorithm breaks out of the loop and falls through to return (None, phase_time). The comment correctly identifies "Cannot reduce below 1 — threshold is reached even at minimum" but then treats this as a failure case. The lower bound should be 1 (or 0), not None.
if current <= 1:
# Cannot reduce below 1 — threshold is reached even at minimum
break
# ... falls through to:
return (None, phase_time) # Should return (1, phase_time)Suggested fix: Before the error return, check if the loop exited because current <= 1 with detected == True, and return (1, phase_time) instead of None.
[High] Backtrack nudge can produce negative or zero range start
threshold_range_probing_algorithm.py:778
The anti-oscillation backtrack logic subtracts a nudge from merged_start without bounds checking:
if parent_dir in ('right', 'init'):
merged_start -= nudge # Can go negative if merged_start is smallScenario: merged_start=1, nudge=2 → merged_start=-1, creating an invalid search range.
Suggested fix: merged_start = max(0, merged_start - nudge)
[Medium] Float precision in termination check
threshold_range_probing_algorithm.py:744
precision_reached = range_size <= max(1, candidate_threshold * self.precision_target_ratio)int * float = float. For typical MMU thresholds (thousands–millions) this won't matter, but for consistency: max(1, int(candidate_threshold * self.precision_target_ratio)).
…t minimum When threshold is triggered even at current=1, the algorithm should return (1, phase_time) as a valid lower bound, not (None, phase_time) which incorrectly signals failure. This allows Phase 3 to continue narrowing the range [1, upper_bound] instead of aborting the entire probe. Addresses @StormLiangMS review (2026-03-25): lower bound returns None instead of minimum value when threshold always triggered. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
3 new/updated UTs for LowerBoundProbingAlgorithm: - test_lower_bound_returns_1_when_threshold_always_triggered (order 8350) - test_lower_bound_returns_1_with_various_upper_bounds (order 8351) - Updated test_run_no_infinite_loop_at_current_one (order 8340): assert 1 not None - Updated test_run_maximum_iterations_exceeded (order 8260): assert 1 not None - Updated test_run_reaches_minimum_value (order 8270): assert 1 not None 1 new IT for PfcXoffProbing: - test_pfc_xoff_lower_bound_returns_value_not_none: threshold=1, verify probe succeeds with lower_bound=1 instead of failing Validates fix in PR sonic-net#22540: lower bound returns 1 instead of None when threshold triggered at minimum packet count. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
|
@StormLiangMS Re: [Critical] Lower Bound returns None instead of minimum value Fixed ( UT/IT coverage: PR #23341 — 2 new UTs + 3 updated UTs + 1 new IT (threshold=1 boundary test). |
1. Backtrack nudge: max(0, merged_start - nudge) prevents negative range start when nudging near lower_bound=0. 2. Precision check: int(candidate * ratio) ensures integer comparison with integer range_size for type consistency. Addresses @StormLiangMS review (2026-03-25): backtrack nudge negative range start + float precision in termination check. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…istency 2 new UTs for ThresholdRangeProbingAlgorithm: - test_backtrack_nudge_bounds_check (order 8890): verifies max(0,...) prevents negative merged_start when nudging near lower_bound=0 - test_precision_int_consistency (order 8891): verifies int() wrapper on precision target for type-consistent comparison 1 updated UT: - test_precision_check_at_small_threshold_with_bad_spot (order 8870): updated expectation for max(0,...) bounds check interaction with backtrack near 0 Validates fix in PR sonic-net#22540: backtrack nudge bounds + precision int. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
|
@StormLiangMS Re: [High] Backtrack nudge negative range + [Medium] Float precision Fixed (
UT coverage: PR #23341 ( |
Description of PR
Summary:
Implement platform-independent probing algorithms and supporting data types:
Data Structures:
Algorithms:
All algorithms follow the executor protocol pattern for platform abstraction, enabling both physical hardware and simulation testing.
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
qos refactoring
How did you do it?
How did you verify/test it?
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation
relevant PRs:
[mmu probing] pr01.docs: Add MMU threshold probing framework design
[mmu probing] pr02.probe: Add core probing algorithms with essential data structures
[mmu probing] pr03.probe: Add probing executors and executor registry
[mmu probing] pr04.probe: Add observer pattern for metrics tracking
[mmu probing] pr05.probe: Add stream manager and buffer occupancy controller
[mmu probing] pr06.probe: Add base framework and all probing implementations
[mmu probing] pr07.test: Add comprehensive unit tests for probe framework
[mmu probing] pr08.test: Add integration tests for end-to-end probing workflows
[mmu probing] pr09.test: Add production probe test and infrastructure updates