hftelorch: minor improvements and cleanups for high frequency telemetry #4315

Open
Pterosaur wants to merge 10 commits into sonic-net:master from Pterosaur:fix/hft-orchagent-bugs

@Pterosaur Pterosaur commented Mar 9, 2026

What I did
A collection of minor improvements and cleanups for the high frequency telemetry orchagent code:

  1. Add handleSaiCreateStatus() calls in createNetlinkChannel() for consistency with createTAM().
  2. Add missing attrs.push_back() for SAI_TAM_REPORT_ATTR_REPORT_INTERVAL_UNIT.
  3. Use find() instead of operator[] in clearGroup() to avoid inserting default entries.
  4. Adjust clearGroup() cleanup order to delete counter subscriptions before telemetry type/report objects.
  5. Replace raw const char* fields in CounterNameMapUpdater::Message with owned std::string state for robustness.
  6. Initialize CounterNameMapUpdater::Message::m_oid to SAI_NULL_OBJECT_ID and keep it SET-only by contract.
  7. Use rvalue reference for updateStatsIDs() to enable true move semantics.
  8. Use u32 for SAI_TAM_COUNTER_SUBSCRIPTION_ATTR_STAT_ID to match the SAI metadata type.
  9. Clean up partially created hostif objects in createNetlinkChannel() if a later HOSTIF create step fails.
  10. Remove unused CONSTANTS_FILE macro and clean up commented-out code.

Why I did it
Identified during code review. Most of these are defensive robustness/maintainability fixes. The createNetlinkChannel() follow-up also improves failure-path behavior by avoiding leaked partially created HOSTIF objects when later create steps fail.
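The failure-path behavior described above can be sketched roughly as follows. This is a minimal illustration, not the orchagent code: `ObjectId`, `NetlinkChannel`, `CreateFn`, `RemoveFn`, `createChannel()` and `deleteChannel()` are hypothetical stand-ins for the SAI object IDs, the HOSTIF create/remove calls, and createNetlinkChannel()/deleteNetlinkChannel().

```cpp
#include <cstdint>
#include <functional>
#include <initializer_list>

// Hypothetical stand-ins for SAI object IDs; not the real SAI types.
using ObjectId = std::uint64_t;
constexpr ObjectId kNullObjectId = 0;

// The three objects a netlink channel needs, in creation order.
struct NetlinkChannel {
    ObjectId hostif = kNullObjectId;
    ObjectId trap = kNullObjectId;
    ObjectId table_entry = kNullObjectId;
};

// A create step returns true on success and writes the new ID to its argument.
// It is assumed to leave the argument untouched on failure.
using CreateFn = std::function<bool(ObjectId&)>;
using RemoveFn = std::function<void(ObjectId)>;

// Remove whatever exists, in reverse creation order, and reset the IDs.
void deleteChannel(NetlinkChannel& ch, const RemoveFn& removeObject) {
    for (ObjectId* id : {&ch.table_entry, &ch.trap, &ch.hostif}) {
        if (*id != kNullObjectId) {
            removeObject(*id);
            *id = kNullObjectId;
        }
    }
}

// Run the create steps in order; on the first failure, roll back the
// objects that were already created so nothing is leaked.
bool createChannel(NetlinkChannel& ch,
                   const CreateFn& createHostif,
                   const CreateFn& createTrap,
                   const CreateFn& createTableEntry,
                   const RemoveFn& removeObject) {
    if (createHostif(ch.hostif) &&
        createTrap(ch.trap) &&
        createTableEntry(ch.table_entry)) {
        return true;
    }
    deleteChannel(ch, removeObject);
    return false;
}
```

Passing the create and remove steps in as callables is only a convenience for the sketch; it makes the rollback path easy to exercise without a SAI implementation.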

How I verified it
Code review only. CI will validate build and unit tests.

Details if related
N/A

Remove the unused CONSTANTS_FILE macro with a typo in path.

Add handleSaiCreateStatus() calls to createNetlinkChannel() for
create_hostif, create_hostif_user_defined_trap, and
create_hostif_table_entry. Previously these SAI calls did not check
return values, which could lead to using invalid object IDs if
creation failed.

Also remove commented-out trap group code.

Signed-off-by: Ze Gan <[email protected]>
The SAI_TAM_REPORT_ATTR_REPORT_INTERVAL_UNIT attribute was set but
never pushed to the attrs vector, so it was not passed to the SAI
create call. While the SAI default happens to be USEC, this should
be explicitly set for correctness.

Signed-off-by: Ze Gan <[email protected]>
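For readers unfamiliar with the pattern, the bug class fixed by the commit above looks roughly like the sketch below. `Attr`, the `k...` attribute IDs, and `buildReportAttrs()` are hypothetical stand-ins, not SAI definitions; the point is only that a reused `attr` variable contributes nothing unless it is pushed after each assignment.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a SAI attribute: an ID plus a value.
struct Attr {
    int id;
    std::int32_t value;
};

// Hypothetical attribute IDs used only for this illustration.
enum : int { kReportType = 1, kReportInterval = 2, kReportIntervalUnit = 3 };

// Build the attribute list for a create call by reusing one `attr` variable.
std::vector<Attr> buildReportAttrs(std::int32_t type,
                                   std::int32_t interval,
                                   std::int32_t unit) {
    std::vector<Attr> attrs;
    Attr attr{};

    attr.id = kReportType;
    attr.value = type;
    attrs.push_back(attr);

    attr.id = kReportInterval;
    attr.value = interval;
    attrs.push_back(attr);

    attr.id = kReportIntervalUnit;
    attr.value = unit;
    // Without this push_back the unit is assigned but never reaches the
    // create call, and the implementation's default is used instead.
    attrs.push_back(attr);

    return attrs;
}
```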
clearGroup() used operator[] to access m_sai_tam_tel_type_objs which
inserts a default entry when the key doesn't exist, potentially
polluting the map. Use find() instead and only erase the state entry
when the key is actually present.

Signed-off-by: Ze Gan <[email protected]>
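The difference between the two lookups in the commit above can be shown with a small sketch. The key and value types of the real m_sai_tam_tel_type_objs map are not shown in this conversation, so `TelTypeMap` below is an assumption chosen for illustration.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical key/value types; the real map's types may differ.
using ObjectId = std::uint64_t;
using TelTypeMap = std::map<std::string, ObjectId>;

// operator[] default-inserts a missing key, so a plain lookup grows the map.
ObjectId lookupWithIndex(TelTypeMap& m, const std::string& key) {
    return m[key];
}

// find() leaves the map untouched when the key is absent and erases the
// entry only when it is actually present.
bool eraseIfPresent(TelTypeMap& m, const std::string& key) {
    auto it = m.find(key);
    if (it == m.end()) {
        return false;
    }
    m.erase(it);
    return true;
}
```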
Replace the union of SetPayload/DelPayload (containing raw const char*
pointers) with a std::string member that owns the counter name. This
prevents potential dangling pointer issues if the caller's local string
goes out of scope before the message is fully processed.

Update locallyNotify() in HFTelOrch to use the new Message fields.

Signed-off-by: Ze Gan <[email protected]>
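A rough sketch of the owning message shape described in the commit above follows. The field names m_table_name, m_counter_name and m_oid come from this conversation; `Op`, `m_operation`, `kNullObjectId` (standing in for SAI_NULL_OBJECT_ID), `makeSet()`, `makeDel()` and `buildFromTemporary()` are assumptions made for the example.

```cpp
#include <cstdint>
#include <string>

using ObjectId = std::uint64_t;
constexpr ObjectId kNullObjectId = 0;  // stand-in for SAI_NULL_OBJECT_ID

// The message owns its strings, so it stays valid after the caller's
// locals are destroyed. m_oid is meaningful for Set messages only.
struct Message {
    enum class Op { Set, Del };  // hypothetical operation tag
    Op m_operation = Op::Set;
    std::string m_table_name;
    std::string m_counter_name;
    ObjectId m_oid = kNullObjectId;
};

Message makeSet(const std::string& table, const std::string& counter, ObjectId oid) {
    Message msg;
    msg.m_operation = Message::Op::Set;
    msg.m_table_name = table;
    msg.m_counter_name = counter;
    msg.m_oid = oid;
    return msg;
}

Message makeDel(const std::string& table, const std::string& counter) {
    Message msg;
    msg.m_operation = Message::Op::Del;
    msg.m_table_name = table;
    msg.m_counter_name = counter;
    return msg;  // m_oid keeps its null default
}

// The counter name is a local here; with a raw const char* payload the
// returned message would point at destroyed storage.
Message buildFromTemporary() {
    std::string local = std::string("Ethernet") + "0";
    return makeDel("PORT", local);
}
```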
Change updateStatsIDs() parameter from const reference to rvalue
reference to enable true move semantics. The previous code used
std::move on a const reference which silently fell back to copy.

Update callers in HFTelProfile::setStatsIDs() to std::move the
local set into the call.

Signed-off-by: Ze Gan <[email protected]>
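The silent fallback to copy mentioned in the commit above can be observed with a small counting type. `Counted` stands in for the std::set of stat IDs, and `assignFromConstRef()` / `assignFromRvalue()` are hypothetical names mirroring the old and new updateStatsIDs() signatures.

```cpp
#include <utility>

// Counts how often it is copied or moved; a stand-in for the stat ID set.
struct Counted {
    static int copies;
    static int moves;
    Counted() = default;
    Counted(const Counted&) { ++copies; }
    Counted(Counted&&) noexcept { ++moves; }
    Counted& operator=(const Counted&) { ++copies; return *this; }
    Counted& operator=(Counted&&) noexcept { ++moves; return *this; }
};
int Counted::copies = 0;
int Counted::moves = 0;

// std::move on a const reference yields `const Counted&&`, which cannot
// bind to the move assignment operator, so the copy assignment is chosen.
void assignFromConstRef(Counted& dst, const Counted& src) {
    dst = std::move(src);
}

// With an rvalue-reference parameter, std::move yields `Counted&&` and the
// move assignment operator is selected.
void assignFromRvalue(Counted& dst, Counted&& src) {
    dst = std::move(src);
}
```

Callers of the rvalue-reference version have to opt in with std::move (or pass a temporary), which makes the ownership transfer visible at the call site.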
@Pterosaur Pterosaur force-pushed the fix/hft-orchagent-bugs branch from 0e0a618 to b67cef0 Compare March 9, 2026 23:12
@mssonicbld
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@Pterosaur Pterosaur changed the title from "hftelorch: fix multiple bugs in high frequency telemetry orchagent" to "hftelorch: minor improvements and cleanups for high frequency telemetry" Mar 9, 2026
SAI_TAM_COUNTER_SUBSCRIPTION_ATTR_STAT_ID is defined as sai_uint32_t
in the SAI metadata. Use attr.value.u32 instead of attr.value.oid
when creating TAM counter subscription objects.

Signed-off-by: Ze Gan <[email protected]>
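As an illustration of why the union member matters in the commit above: the attribute value is a union, so the writer and the reader must agree on which member is active. The sketch below uses a hypothetical two-member `AttrValue` union rather than the real SAI attribute value type. Writing a stat ID through the 64-bit `oid` member and reading it back as `u32` is not well-defined C++, and on typical implementations the result depends on byte order, so it may only appear to work on little-endian hosts.

```cpp
#include <cstdint>

// Hypothetical stand-in for the SAI attribute value union.
union AttrValue {
    std::uint32_t u32;
    std::uint64_t oid;
};

struct Attr {
    int id;
    AttrValue value;
};

// Store the stat ID through the member that matches the attribute's
// declared type (u32), so a reader of value.u32 sees the same value.
Attr makeStatIdAttr(int attr_id, std::uint32_t stat_id) {
    Attr attr{};
    attr.id = attr_id;
    attr.value.u32 = stat_id;
    return attr;
}
```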
Adjust clearGroup() cleanup order to erase TAM counter subscription
objects before removing the TAM telemetry type and report objects.
This better matches the object dependency order during teardown.

Signed-off-by: Ze Gan <[email protected]>
@mssonicbld
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@banidoru banidoru left a comment

Clean PR with several meaningful bug fixes. No blocking issues found.

Key fixes verified:

  • Dangling pointer fix in Message struct (union of const char* → std::string)
  • Missing attrs.push_back(attr) for report interval unit attribute
  • Type mismatch fix (oid → u32) for stat_id
  • Safe lookup in clearGroup preventing default-insertion via operator[]
  • Added handleSaiCreateStatus error checking on previously unchecked SAI calls
  • Removed dead code and unused macro with typo (/et/sonic/)

LGTM.

@banidoru banidoru left a comment

Clean set of fixes — the dangling-pointer fix, missing push_back, stat_id type correction, and defensive clearGroup are all solid. A few observations below, none blocking.

@banidoru banidoru left a comment

Clean improvements — good bug fixes (missing push_back, oid→u32 type mismatch, clearGroup defensive lookup) and solid safety improvement replacing the union with std::string. One concern about partial-creation cleanup in createNetlinkChannel; otherwise LGTM.

@banidoru banidoru left a comment

All reviewers approved. LGTM.

@Pterosaur Pterosaur marked this pull request as ready for review March 12, 2026 13:42
@Pterosaur Pterosaur requested a review from prsunny as a code owner March 12, 2026 13:42
Copilot AI review requested due to automatic review settings March 12, 2026 13:42
Copilot AI left a comment

Pull request overview

Updates the orchagent high-frequency telemetry (HFT) implementation with small robustness, correctness, and cleanup improvements across profile/group management, SAI TAM object creation, and counter-name update plumbing.

Changes:

  • Improve TAM/hostif creation and report configuration correctness (add missing report-interval-unit attribute; use handleSaiCreateStatus() for hostif creates).
  • Make group/profile bookkeeping safer (avoid operator[] insertion in clearGroup(), adjust cleanup ordering; enable actual move semantics for stats ID updates).
  • Harden counter-name update messaging by owning the counter name (std::string) instead of using non-owning const char* payloads.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

File: Description
  • orchagent/high_frequency_telemetry/hftelprofile.cpp: Moves stats ID sets into groups, fixes clearGroup() map handling/cleanup order, and pushes missing report interval unit attribute; adjusts counter subscription stat-id attribute type.
  • orchagent/high_frequency_telemetry/hftelorch.cpp: Uses handleSaiCreateStatus() for hostif object creations and updates local notification handling to use the new message shape.
  • orchagent/high_frequency_telemetry/hftelgroup.h: Changes updateStatsIDs() signature to accept an rvalue set to enable move semantics.
  • orchagent/high_frequency_telemetry/hftelgroup.cpp: Implements moved updateStatsIDs() and uses std::move into the member set.
  • orchagent/high_frequency_telemetry/counternameupdater.h: Replaces union payload with owned std::string counter name + m_oid field.
  • orchagent/high_frequency_telemetry/counternameupdater.cpp: Updates message construction accordingly and removes dependence on temporary-string c_str() lifetimes.

@banidoru banidoru left a comment

All fixes look correct and well-structured. Key changes verified:

  • Dangling pointer fix: Message struct now owns data via std::string for both m_table_name and m_counter_name. Union eliminated.
  • Missing push_back: SAI_TAM_REPORT_ATTR_REPORT_INTERVAL_UNIT now correctly added to attrs vector.
  • Type mismatch: SAI_TAM_COUNTER_SUBSCRIPTION_ATTR_STAT_ID correctly uses u32 instead of oid.
  • clearGroup safety: find() replaces operator[] preventing spurious map insertions.
  • Partial failure cleanup: createNetlinkChannel() now checks each SAI create and calls deleteNetlinkChannel() on failure.
  • Move semantics: updateStatsIDs() rvalue-ref parameter enables true move instead of silent copy.
  • Dead code removal: Unused CONSTANTS_FILE macro (with typo) and commented-out code removed.

No new issues found.

@banidoru banidoru left a comment

Clean set of bug fixes for the high-frequency telemetry subsystem. The latest iteration addresses all prior feedback well.

Key fixes verified:

  • Message struct now owns its strings via std::string, eliminating dangling-pointer hazards from the old const char*/union design
  • Missing attrs.push_back(attr) for SAI_TAM_REPORT_ATTR_REPORT_INTERVAL_UNIT — silent data loss bug
  • attr.value.oid → attr.value.u32 for SAI_TAM_COUNTER_SUBSCRIPTION_ATTR_STAT_ID: type mismatch fix
  • clearGroup defensive find-before-erase prevents accidental default-insertion into the map
  • createNetlinkChannel now checks return status and calls deleteNetlinkChannel() on partial failure
  • Dead code removal (CONSTANTS_FILE macro with typo, commented-out trap group code)
  • Move semantics for updateStatsIDs — correct and all call sites use std::move

No new issues found. LGTM.

@banidoru banidoru left a comment

Solid bug-fix PR. All changes are correct and well-targeted:

  • Dangling pointer fix (Message struct): Replacing raw const char* + union with owning std::string members eliminates a real use-after-scope hazard. Default-initializing m_oid to SAI_NULL_OBJECT_ID is clean.
  • Missing attrs.push_back for report interval unit: Clear bug — the attribute was configured but never added to the vector.
  • stat_id type mismatch (oid → u32): Correct per SAI spec; SAI_TAM_COUNTER_SUBSCRIPTION_ATTR_STAT_ID is a u32 enum, not an OID.
  • clearGroup defensive lookup: Using find-then-erase instead of operator[] prevents silent default-insertion of a zero OID key.
  • createNetlinkChannel error handling: Checking handleSaiCreateStatus return and calling deleteNetlinkChannel() on partial failure prevents SAI object leaks.
  • Move semantics for updateStatsIDs: Appropriate since all call sites build a temporary set.
  • Dead code removal: Typo-laden CONSTANTS_FILE macro and commented-out trap group code cleaned up.

LGTM.

@banidoru banidoru left a comment

All reviewers approved. LGTM.

@banidoru banidoru left a comment

Solid bugfix PR. All issues from prior review rounds have been addressed:

  • Dangling pointer fix: Message struct now owns data via std::string for both m_table_name and m_counter_name. m_oid properly default-initialized.
  • Missing attrs.push_back: SAI_TAM_REPORT_ATTR_REPORT_INTERVAL_UNIT was silently dropped — now correctly pushed.
  • Type mismatch: SAI_TAM_COUNTER_SUBSCRIPTION_ATTR_STAT_ID correctly uses u32 instead of oid.
  • clearGroup defensive lookup: operator[] replaced with find() to avoid spurious default-insertion.
  • createNetlinkChannel error handling: SAI create calls now checked with handleSaiCreateStatus and rollback via deleteNetlinkChannel on partial failure.
  • Move semantics: updateStatsIDs correctly takes rvalue ref, callers use std::move.
  • Dead code cleanup: Removed unused CONSTANTS_FILE macro and commented-out code.

No new issues found. LGTM.

@banidoru banidoru left a comment

All fixes look correct. The dangling-pointer hazard (union of raw const char* → owning std::string members) is properly resolved, m_oid has a safe default initializer, the missing attrs.push_back for the report interval unit is fixed, the stat_id type mismatch (oid → u32) is corrected, the defensive find-before-erase in clearGroup prevents accidental default-insertion, and createNetlinkChannel now properly checks SAI return status with rollback via deleteNetlinkChannel() on partial failure. Clean set of bug fixes.

@banidoru banidoru left a comment

Solid set of bug fixes. The union-to-struct migration eliminates dangling pointer hazards, the missing attrs.push_back for report interval unit was a real bug, the u32 vs oid type correction aligns with SAI spec, the defensive find before erase in clearGroup prevents silent default-insertion, and the handleSaiCreateStatus additions with rollback via deleteNetlinkChannel improve error handling. Previous review concerns (default m_oid init, m_table_name ownership) have been addressed in c470f11. LGTM.

@banidoru banidoru left a comment

All reviewers approved. LGTM.

@mssonicbld
/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@vmittal-msft
@Pterosaur please help with branch rebase and passing PR checkers
