BROADCOM_LEGACY_SAI_COMPAT: Fix sai_query_stats_st_capability crash on Tomahawk-1 (BCM56960) legacy platforms#1788
Conversation
…_st_capability at runtime Signed-off-by: Liping Xu <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Liping Xu <[email protected]> Co-authored-by: Copilot <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…file key BROADCOM_LEGACY_SAI_COMPAT: SAI_STATS_ST_CAPABILITY_SUPPORTED=0 in sai.profile should be processed during apiInitialize without breaking initialization. Co-authored-by: Copilot <[email protected]> Signed-off-by: Liping Xu <[email protected]>
697b92e to
1e1a92f
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Co-authored-by: Copilot <[email protected]> Signed-off-by: Liping Xu <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Gfrom2016
left a comment
There was a problem hiding this comment.
Both PRs look good. Clean runtime guards via sai.profile keys -- no impact on XGS platforms that don't set these keys.
PR #1788: Minimal and correct. The #ifdef guard + null checks are proper. Unit test covers the profile key path.
PR #1789: Good design -- virtual method in SaiInterface.h keeps the change isolated to VendorSai. FlexCounter change is surgical (only COUNTER_TYPE_SWITCH context affected). 3 unit tests covering default, disabled, and st_capability scenarios.
Confirmed merge order: #1788 first, then rebase #1789.
LGTM on both -- approving.
StormLiangMS
left a comment
There was a problem hiding this comment.
LGTM ✅ — Clean runtime guard for the sai_query_stats_st_capability crash on TH1.
Positives:
- Properly
#ifdef HAVE_SAI_QUERY_STATS_ST_CAPABILITYguarded — no impact on platforms without the symbol - Null-checks both
m_globalApis.query_stats_st_capabilityandprofile_get_valuebefore dereferencing SWSS_LOG_NOTICEfor observability when the guard activates- Unit test covers the profile key path
BROADCOM_LEGACY_SAI_COMPATcomment tag for future searchability is a nice touch
Minor item:
- The PR description mentions restoring
sai_query_stats_st_capabilitytoAC_CHECK_FUNCSinconfigure.acand removing theAH_TEMPLATEworkaround, but this change doesn't appear in the diff. Was it dropped during a rebase, or is it in a separate commit? Please verify.
Overall: well-structured fix with proper guards, good comments, and test coverage.
… (BCM56960) legacy platforms (#1789) Problem On Arista 7060cx (BCM56960_B1 / Tomahawk-1, broadcom-legacy platform), syncd crashes during FlexCounter polling with a SIGSEGV when collecting switch counters. Root cause: PR #1775 added context->use_sai_stats_ext = true for the COUNTER_TYPE_SWITCH FlexCounter context, forcing sai_get_stats_ext to be used instead of sai_get_stats for switch objects. While this is required for TH5, on TH1 (broadcom-legacy) sai_get_stats_ext for switch objects hits uninitialized internal state in the legacy SAI binary -> SIGSEGV. Crash confirmed at the same address 0x8ab6120 in libsai.so via GDB analysis. Fix Add a runtime guard controlled by sai.profile key SAI_STATS_EXT_SWITCH_SUPPORTED. If set to 0, FlexCounter::createCounterContext() uses sai_get_stats instead of sai_get_stats_ext for switch objects. Implementation: meta/SaiInterface.h: Add virtual bool isSwitchStatsExtSupported() const { return true; } (default enabled for all platforms) syncd/VendorSai.h: Declare override + bool m_switchStatsExtSupported private member syncd/VendorSai.cpp: In apiInitialize(), read SAI_STATS_EXT_SWITCH_SUPPORTED from sai.profile; implement accessor syncd/FlexCounter.cpp: Use m_vendorSai->isSwitchStatsExtSupported() for COUNTER_TYPE_SWITCH context XGS platforms (TH2/TH3/TH4/TH5) are unaffected - they do not set this key so sai_get_stats_ext remains enabled. All changes are tagged BROADCOM_LEGACY_SAI_COMPAT for future searchability. Changes meta/SaiInterface.h: +7 lines - virtual isSwitchStatsExtSupported() with default true syncd/VendorSai.h: +4 lines - override declaration + private member syncd/VendorSai.cpp: +24 lines - sai.profile key read in apiInitialize() + method implementation syncd/FlexCounter.cpp: -1/+3 lines - conditional use_sai_stats_ext for COUNTER_TYPE_SWITCH Testing Arista 7060cx (BCM56960_B1, broadcom-legacy): SAI_STATS_EXT_SWITCH_SUPPORTED=0 in sai.profile -> syncd starts and FlexCounter runs without crash TH5 (XGS): No sai.profile key set -> sai_get_stats_ext still used for switch counters Related Fixes regression introduced by PR [action] [PR:1757] Fix switch stat counters by using get_stats_ext instead of get_stats #1775 Companion to BROADCOM_LEGACY_SAI_COMPAT: Fix sai_query_stats_st_capability crash on Tomahawk-1 (BCM56960) legacy platforms #1788 - sai_query_stats_st_capability fix (BROADCOM_LEGACY_SAI_COMPAT Issue 1); merge BROADCOM_LEGACY_SAI_COMPAT: Fix sai_query_stats_st_capability crash on Tomahawk-1 (BCM56960) legacy platforms #1788 first, then rebase this PR Companion sai.profile key change: BROADCOM_LEGACY_SAI_COMPAT: Fix sai_get_stats_ext crash on TH1 legacy image sonic-buildimage#26014 (SAI_STATS_EXT_SWITCH_SUPPORTED=0) Signed-off-by: Liping Xu <[email protected]> Co-authored-by: Copilot <[email protected]>
|
Conflict 202511, picked manually #1795 |
Problem
On Arista 7060cx (BCM56960_B1 / Tomahawk-1,
broadcom-legacyplatform), syncd crashes at startup with a SIGSEGV insidebrcm_sai_st_pd_ctr_cap_list_get+0x10.Root cause: The SONiC build always uses XGS SAI headers for all Broadcom syncd builds (buildimage issue #23387). After commit
4f1d7d99restoredsai_query_stats_st_capabilitytoAC_CHECK_FUNCS, the symbol is detected against XGS SAI 13.2.1+ →HAVE_SAI_QUERY_STATS_ST_CAPABILITY=1is defined → syncd calls the function at runtime. On TH1, the streaming telemetry platform driver (p_pdapi_st) is uninitialized → NULL vtable dereference at offset +0x10 → SIGSEGV.Crash confirmed via GDB core dump: crash address
0x8ab6120=brcm_sai_st_pd_ctr_cap_list_get+0x10in libsai.so.Fix
Add a runtime guard in
VendorSai::apiInitialize()that readsSAI_STATS_ST_CAPABILITY_SUPPORTEDfromsai.profile. If set to0, thequery_stats_st_capabilityfunction pointer is nulled before it can be called. This replaces the previous blunt compile-timeAH_TEMPLATEworkaround (commit4a96de71) with a proper per-platform runtime opt-out, and also restoressai_query_stats_st_capabilitytoAC_CHECK_FUNCSinconfigure.ac.XGS platforms (TH2/TH3/TH4/TH5) are unaffected — they do not set this key so the API remains enabled.
All changes are tagged with the comment marker
BROADCOM_LEGACY_SAI_COMPATfor future searchability.Changes
configure.ac: Restoresai_query_stats_st_capabilitytoAC_CHECK_FUNCS; removeAH_TEMPLATEworkaround; addBROADCOM_LEGACY_SAI_COMPATcommentsyncd/VendorSai.cpp: InapiInitialize(), readSAI_STATS_ST_CAPABILITY_SUPPORTEDfromsai.profileand nullm_globalApis.query_stats_st_capabilityif set to0Testing
SAI_STATS_ST_CAPABILITY_SUPPORTED=0insai.profile→ syncd starts without crash ✅/var/core/syncd.*.core.gzRelated
4f1d7d99sai.profilekey change: sonic-net/sonic-buildimage (TBD)BROADCOM_LEGACY_SAI_COMPATIssue 2 —sai_get_stats_extfor switch objects