BROADCOM_LEGACY_SAI_COMPAT: Fix sai_query_stats_st_capability crash on Tomahawk-1 (BCM56960) legacy platforms #1788

Merged
StormLiangMS merged 4 commits into sonic-net:master from lipxu:fix/brcm-legacy-compat-master-issue1
Mar 16, 2026

@lipxu lipxu commented Mar 11, 2026

Problem

On Arista 7060cx (BCM56960_B1 / Tomahawk-1, broadcom-legacy platform), syncd crashes at startup with a SIGSEGV inside brcm_sai_st_pd_ctr_cap_list_get+0x10.

Root cause: The SONiC build uses XGS SAI headers for all Broadcom syncd builds (buildimage issue #23387). After commit 4f1d7d99 restored sai_query_stats_st_capability to AC_CHECK_FUNCS, the symbol is detected against XGS SAI 13.2.1+ → HAVE_SAI_QUERY_STATS_ST_CAPABILITY=1 is defined → syncd calls the function at runtime. On TH1, the streaming telemetry platform driver (p_pdapi_st) is uninitialized, so the call dereferences a NULL vtable at offset +0x10 → SIGSEGV.

Crash confirmed via GDB core dump: crash address 0x8ab6120 = brcm_sai_st_pd_ctr_cap_list_get+0x10 in libsai.so.
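
The failure mode described above can be modelled in a few lines. This is only a hypothetical sketch: the real driver table inside libsai.so is not shown in this PR, so `StPlatformDriver`, `g_stDriver`, and the status constants are invented names for illustration. It shows why a successful configure-time symbol check says nothing about whether the call is safe on a given ASIC, and what a defensive check inside the callee would look like.

```cpp
// Hypothetical model of a platform driver table. The real structure inside
// libsai.so is not part of this PR; all names here are illustrative.
struct StPlatformDriver
{
    int (*ctr_cap_list_get)();
};

// On the legacy platform this table is never set up, so the pointer stays null.
StPlatformDriver* g_stDriver = nullptr;

constexpr int STATUS_OK = 0;
constexpr int STATUS_NOT_SUPPORTED = -1;

// The exported function exists in the binary, so a build-time symbol check
// succeeds. Calling through g_stDriver without the null check below would
// dereference a null pointer on a platform that never initialized the table.
int queryStatsStCapability()
{
    if (g_stDriver == nullptr || g_stDriver->ctr_cap_list_get == nullptr)
    {
        return STATUS_NOT_SUPPORTED;
    }

    return g_stDriver->ctr_cap_list_get();
}
```

Since the crash shows the legacy binary does not behave this way, the caller (syncd) has to avoid the call on those platforms.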

Fix

Add a runtime guard in VendorSai::apiInitialize() that reads SAI_STATS_ST_CAPABILITY_SUPPORTED from sai.profile. If set to 0, the query_stats_st_capability function pointer is nulled before it can be called. This replaces the previous blunt compile-time AH_TEMPLATE workaround (commit 4a96de71) with a proper per-platform runtime opt-out, and also restores sai_query_stats_st_capability to AC_CHECK_FUNCS in configure.ac.

XGS platforms (TH2/TH3/TH4/TH5) are unaffected — they do not set this key so the API remains enabled.
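
A minimal sketch of the guard logic, assuming the profile lookup returns a null pointer when the key is absent. `GlobalApis`, `profileGetValue`, and `applyStCapabilityGuard` are hypothetical stand-ins written for this example, not the code in syncd/VendorSai.cpp; the real change sits inside `#ifdef HAVE_SAI_QUERY_STATS_ST_CAPABILITY` and logs through SWSS_LOG_NOTICE rather than printf.

```cpp
#include <cstdio>
#include <cstring>
#include <map>
#include <string>

// Hypothetical stand-in for the SAI function pointer type.
using query_stats_st_capability_fn = int (*)();

// Hypothetical stand-in for the global API table held by VendorSai.
struct GlobalApis
{
    query_stats_st_capability_fn query_stats_st_capability = nullptr;
};

// Models a sai.profile lookup: returns nullptr when the key is not set.
const char* profileGetValue(const std::map<std::string, std::string>& profile,
                            const char* key)
{
    auto it = profile.find(key);
    return it == profile.end() ? nullptr : it->second.c_str();
}

// Placeholder for the vendor function that would crash on TH1.
int fakeQueryStatsStCapability()
{
    return 0;
}

// Null the function pointer only when the key is present and equal to "0".
// A missing key (the XGS case) leaves the API enabled.
void applyStCapabilityGuard(GlobalApis& apis,
                            const std::map<std::string, std::string>& profile)
{
    if (apis.query_stats_st_capability == nullptr)
    {
        return; // nothing to disable
    }

    const char* value = profileGetValue(profile, "SAI_STATS_ST_CAPABILITY_SUPPORTED");

    if (value != nullptr && std::strcmp(value, "0") == 0)
    {
        // The real code is described as logging via SWSS_LOG_NOTICE.
        std::printf("BROADCOM_LEGACY_SAI_COMPAT: disabling query_stats_st_capability\n");
        apis.query_stats_st_capability = nullptr;
    }
}
```

Callers then only need to check the pointer for null before invoking it, which is the same check they would need on a platform where the symbol was never resolved.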

All changes are tagged with the comment marker BROADCOM_LEGACY_SAI_COMPAT for future searchability.

Changes

  • configure.ac: Restore sai_query_stats_st_capability to AC_CHECK_FUNCS; remove AH_TEMPLATE workaround; add BROADCOM_LEGACY_SAI_COMPAT comment
  • syncd/VendorSai.cpp: In apiInitialize(), read SAI_STATS_ST_CAPABILITY_SUPPORTED from sai.profile and null m_globalApis.query_stats_st_capability if set to 0

Testing

  • Arista 7060cx (BCM56960_B1, broadcom-legacy): SAI_STATS_ST_CAPABILITY_SUPPORTED=0 in sai.profile → syncd starts without crash ✅
  • Crash on images without this fix confirmed via GDB analysis of /var/core/syncd.*.core.gz

Related

  • Fixes regression introduced by commit 4f1d7d99
  • Companion sai.profile key change: sonic-net/sonic-buildimage (TBD)
  • See also: BROADCOM_LEGACY_SAI_COMPAT Issue 2 — sai_get_stats_ext for switch objects

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Liping Xu <[email protected]>
Co-authored-by: Copilot <[email protected]>

…file key

BROADCOM_LEGACY_SAI_COMPAT: SAI_STATS_ST_CAPABILITY_SUPPORTED=0 in sai.profile
should be processed during apiInitialize without breaking initialization.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Liping Xu <[email protected]>
@lipxu lipxu force-pushed the fix/brcm-legacy-compat-master-issue1 branch from 697b92e to 1e1a92f Compare March 11, 2026 22:06

@Gfrom2016 Gfrom2016 left a comment

Both PRs look good. Clean runtime guards via sai.profile keys -- no impact on XGS platforms that don't set these keys.

PR #1788: Minimal and correct. The #ifdef guard + null checks are proper. Unit test covers the profile key path.

PR #1789: Good design -- virtual method in SaiInterface.h keeps the change isolated to VendorSai. FlexCounter change is surgical (only COUNTER_TYPE_SWITCH context affected). 3 unit tests covering default, disabled, and st_capability scenarios.

Confirmed merge order: #1788 first, then rebase #1789.

LGTM on both -- approving.

@StormLiangMS StormLiangMS left a comment

LGTM

@lolyu lolyu left a comment

LGTM

@StormLiangMS StormLiangMS left a comment

LGTM ✅ — Clean runtime guard for the sai_query_stats_st_capability crash on TH1.

Positives:

  • Properly #ifdef HAVE_SAI_QUERY_STATS_ST_CAPABILITY guarded — no impact on platforms without the symbol
  • Null-checks both m_globalApis.query_stats_st_capability and profile_get_value before dereferencing
  • SWSS_LOG_NOTICE for observability when the guard activates
  • Unit test covers the profile key path
  • BROADCOM_LEGACY_SAI_COMPAT comment tag for future searchability is a nice touch

Minor item:

  • The PR description mentions restoring sai_query_stats_st_capability to AC_CHECK_FUNCS in configure.ac and removing the AH_TEMPLATE workaround, but this change doesn't appear in the diff. Was it dropped during a rebase, or is it in a separate commit? Please verify.

Overall: well-structured fix with proper guards, good comments, and test coverage.

@StormLiangMS StormLiangMS merged commit 7847032 into sonic-net:master Mar 16, 2026
19 checks passed
StormLiangMS pushed a commit that referenced this pull request Mar 16, 2026
… (BCM56960) legacy platforms (#1789)

Problem
On Arista 7060cx (BCM56960_B1 / Tomahawk-1, broadcom-legacy platform), syncd crashes during FlexCounter polling with a SIGSEGV when collecting switch counters.

Root cause: PR #1775 added context->use_sai_stats_ext = true for the COUNTER_TYPE_SWITCH FlexCounter context, forcing sai_get_stats_ext to be used instead of sai_get_stats for switch objects. While this is required for TH5, on TH1 (broadcom-legacy) sai_get_stats_ext for switch objects hits uninitialized internal state in the legacy SAI binary -> SIGSEGV.

Crash confirmed at the same address 0x8ab6120 in libsai.so via GDB analysis.

Fix
Add a runtime guard controlled by sai.profile key SAI_STATS_EXT_SWITCH_SUPPORTED. If set to 0, FlexCounter::createCounterContext() uses sai_get_stats instead of sai_get_stats_ext for switch objects.

Implementation:

  • meta/SaiInterface.h: Add virtual bool isSwitchStatsExtSupported() const { return true; } (default enabled for all platforms)
  • syncd/VendorSai.h: Declare override + bool m_switchStatsExtSupported private member
  • syncd/VendorSai.cpp: In apiInitialize(), read SAI_STATS_EXT_SWITCH_SUPPORTED from sai.profile; implement accessor
  • syncd/FlexCounter.cpp: Use m_vendorSai->isSwitchStatsExtSupported() for COUNTER_TYPE_SWITCH context

XGS platforms (TH2/TH3/TH4/TH5) are unaffected - they do not set this key so sai_get_stats_ext remains enabled.

All changes are tagged BROADCOM_LEGACY_SAI_COMPAT for future searchability.
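
The shape of this change can be sketched as follows. The class and member names come from the description above, but the bodies are trimmed-down assumptions written for illustration: the constructor argument stands in for the sai.profile read done in apiInitialize(), and `CounterContext` and `createSwitchCounterContext` are hypothetical simplifications of the FlexCounter code.

```cpp
#include <string>

// Trimmed-down stand-in for meta/SaiInterface.h.
class SaiInterface
{
public:
    virtual ~SaiInterface() = default;

    // Default: sai_get_stats_ext is assumed to work for switch objects.
    virtual bool isSwitchStatsExtSupported() const { return true; }
};

// Trimmed-down stand-in for syncd/VendorSai.
class VendorSai : public SaiInterface
{
public:
    // profileValue models the value read for SAI_STATS_EXT_SWITCH_SUPPORTED;
    // nullptr means the key is not set in sai.profile.
    explicit VendorSai(const char* profileValue)
    {
        if (profileValue != nullptr && std::string(profileValue) == "0")
        {
            m_switchStatsExtSupported = false;
        }
    }

    bool isSwitchStatsExtSupported() const override
    {
        return m_switchStatsExtSupported;
    }

private:
    bool m_switchStatsExtSupported = true;
};

// Hypothetical simplification of a FlexCounter counter context.
struct CounterContext
{
    bool use_sai_stats_ext = false;
};

// Models the COUNTER_TYPE_SWITCH branch: use sai_get_stats_ext only when the
// SAI implementation reports that it is supported for switch objects.
CounterContext createSwitchCounterContext(const SaiInterface& sai)
{
    CounterContext context;
    context.use_sai_stats_ext = sai.isSwitchStatsExtSupported();
    return context;
}
```

Putting the default in the base interface means any other SaiInterface implementation keeps the existing behaviour without changes; only VendorSai consults the profile key.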

Changes
  • meta/SaiInterface.h: +7 lines - virtual isSwitchStatsExtSupported() with default true
  • syncd/VendorSai.h: +4 lines - override declaration + private member
  • syncd/VendorSai.cpp: +24 lines - sai.profile key read in apiInitialize() + method implementation
  • syncd/FlexCounter.cpp: -1/+3 lines - conditional use_sai_stats_ext for COUNTER_TYPE_SWITCH

Testing
  • Arista 7060cx (BCM56960_B1, broadcom-legacy): SAI_STATS_EXT_SWITCH_SUPPORTED=0 in sai.profile -> syncd starts and FlexCounter runs without crash
  • TH5 (XGS): No sai.profile key set -> sai_get_stats_ext still used for switch counters

Related
  • Fixes regression introduced by PR #1775 ([action] [PR:1757] Fix switch stat counters by using get_stats_ext instead of get_stats)
  • Companion to #1788 (BROADCOM_LEGACY_SAI_COMPAT: Fix sai_query_stats_st_capability crash on Tomahawk-1 (BCM56960) legacy platforms), the sai_query_stats_st_capability fix (BROADCOM_LEGACY_SAI_COMPAT Issue 1); merge #1788 first, then rebase this PR
  • Companion sai.profile key change: sonic-buildimage#26014, BROADCOM_LEGACY_SAI_COMPAT: Fix sai_get_stats_ext crash on TH1 legacy image (SAI_STATS_EXT_SWITCH_SUPPORTED=0)

Signed-off-by: Liping Xu <[email protected]>
Co-authored-by: Copilot <[email protected]>
lipxu commented Mar 17, 2026

Cherry-pick to 202511 hit a conflict; picked manually in #1795.
