Automated agent pool migration for branch master#1
Open
mssonicbld wants to merge 445 commits into master
* Remove RIF from m_rifsToAdd before deleting RIF What I did I extended the RIF removal functionality to also remove the port from the m_rifsToAdd list. Why I did it Typically, the counter and object handling logic follows a strict sequence: Create an object, then start counter polling. Stop counter polling, then remove the object. However, there is deferred logic for RIF counters, where counter polling starts based on a timer rather than immediately. This process generally works as follows: Create an object and add it to a list upon receiving an APP_DB update. Start counter polling for all objects in the list during the timer event. Stop counter polling for an object. Remove the object. If RIF creation and removal occur frequently, removal can happen before the timer event. As a result, the timer may start counter polling for an object that has just been removed, causing the following error message: ERR syncd#SDK: :- processFlexCounterEvent: port VID oid:0x600000000099d, was not found (probably port was removed/splitted) and will remove from counters now
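The deferred flow above can be sketched in a few lines of Python. This is a minimal model, not the orchagent code: the class name `RifCounterManager` and its method names are hypothetical, and only `rifs_to_add` stands in for the real `m_rifsToAdd` list.

```python
class RifCounterManager:
    """Toy model of deferred RIF counter polling (names are hypothetical)."""

    def __init__(self):
        self.rifs_to_add = set()  # stands in for m_rifsToAdd
        self.polled = set()       # RIFs whose counters are being polled

    def on_rif_created(self, rif_id):
        # Polling is deferred: only remember the RIF until the timer fires.
        self.rifs_to_add.add(rif_id)

    def on_timer(self):
        # Start polling for every RIF still pending, then clear the list.
        self.polled.update(self.rifs_to_add)
        self.rifs_to_add.clear()

    def on_rif_removed(self, rif_id):
        # The fix: drop the RIF from the pending list as well, so a later
        # timer event cannot start polling an object that no longer exists.
        self.rifs_to_add.discard(rif_id)
        self.polled.discard(rif_id)
```

Without the `discard` on the pending set, a RIF that is created and removed between two timer events would still be picked up by `on_timer`.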
…onic-net#3482) What I did use --add-tracefile option in debian/rules and tests/conftest.py to sanitize coverage.info generated by lcov Why I did it lcov generates an initial coverage.info file based on collected .gcno and .gcda files, this .info file contains coverage information for different source files (marked as SF). Sometimes you would observe that the same SF appears multiple times, it means lcov gets multiple copies of coverage information for this file, since this file may have appeared in multiple compilation units, and for each copy, the hit times of its lines are different. Then lcov_cobertura generates coverage.xml based on coverage.info. However, it can't deal with duplicate SF in coverage.info properly. If it sees duplicate coverage information for a source file from coverage.info, it always overwrites the old copy with the new copy, hence only the last copy would be counted. However, if the last copy considers the functions as missing, the function is considered as missing in coverage.xml, which is used to determine whether the new PR passes the coverage threshold. The proper way is to add the hit times of all the copies, which could be achieved by lcov add-tracefile option.
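To illustrate why summing matters, here is a rough Python sketch that adds up the `DA:` (line hit) records of duplicate `SF:` sections in a tracefile. It only models the effect described above; it is not how lcov implements `--add-tracefile`, and the function name `sum_line_hits` is made up for this example.

```python
from collections import defaultdict


def sum_line_hits(tracefile_text):
    """Sum DA:<line>,<hits> records across duplicate SF: sections."""
    totals = defaultdict(lambda: defaultdict(int))
    current = None
    for raw in tracefile_text.splitlines():
        line = raw.strip()
        if line.startswith("SF:"):
            current = line[3:]
        elif line.startswith("DA:") and current is not None:
            fields = line[3:].split(",")
            # fields[0] is the line number, fields[1] the execution count
            totals[current][int(fields[0])] += int(fields[1])
        elif line == "end_of_record":
            current = None
    return {sf: dict(lines) for sf, lines in totals.items()}
```

With an overwrite strategy, a last copy that reports zero hits would hide hits recorded by earlier copies; with summation, a line counts as covered if any copy hit it.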
* Add heartbeat interval parameter * Disable the feature when the interval is 0 Why I did it Allow this feature to be disabled, because it causes log spam on small-disk devices: sonic-net/sonic-buildimage#21157 Work item tracking Microsoft ADO: 30594076
… passes an invalid timestamp (sonic-net#3446) - What I did Prevent orchagent from hitting a segmentation fault when it receives a timestamp indicating a time in the far future (2^31 years later) in the ASIC/SDK health event from the vendor SAI. Passing such a large timestamp is a vendor SAI failure, but we need to protect against such invalid input. - Why I did it - How I verified it Mock test - Details if related If the vendor SAI passes a very large timestamp, put_time can cause a segmentation fault, which cannot be caught by the try/catch infra. We check the difference between the timestamp from SAI and the current time and force the use of the current time if the gap is too large. By doing so, we can avoid the segmentation fault. Signed-off-by: Stephen Sun <stephens@nvidia.com>
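The guard described above amounts to a bounds check before formatting. A minimal Python sketch follows; the threshold `MAX_SKEW_SECONDS` and the function name are assumptions for illustration, since the commit message does not state the actual limit used in orchagent.

```python
import time

# Hypothetical threshold: the real limit used by orchagent is not given above.
MAX_SKEW_SECONDS = 24 * 60 * 60


def sanitize_timestamp(sai_ts, now=None, max_skew=MAX_SKEW_SECONDS):
    """Return sai_ts unless it is implausibly far from the current time."""
    if now is None:
        now = time.time()
    if abs(sai_ts - now) > max_skew:
        # Invalid input from the vendor SAI: fall back to the current time.
        return now
    return sai_ts
```

The value is validated before it reaches the time-formatting call, so the formatting code never sees an out-of-range input.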
Git ignore .gcda and .gcno in all folders (sonic-net#3479) * Ignore .gcda and .gcno in all folders to avoid seeing a large number of untracked files in git status
…al IPv6 addresses of VLAN and Bridge interfaces (sonic-net#3476) What I did Added code to bring down the Bridge interface before changing the MAC address of a VLAN interface and the Bridge, and then starting up the Bridge after the MAC is changed. This automatically brings down and starts up all VLAN interfaces added to the Bridge. Added code to bring down and then immediately start up the Bridge after the dummy interface is started up. This is useful to ensure that MAC and link-local IPv6 addresses of the Bridge are consistent in case no VLANs are added to the Bridge later. Added a VS test to verify these behaviors. Why I did it After PR sonic-net#3370, the Bridge becomes operationally UP after the dummy interface is started up. As a result, all VLAN interfaces created under the Bridge are immediately operationally UP after creation. This can cause an issue later if their MAC address is changed since the kernel does not update the link-local IPv6 address of an interface if it is operationally UP. This behavior caused the IPv6 version of the following test cases in sonic-mgmt to fail for the dualtor topology, which were temporarily skipped: arp/test_arp_dualtor.py::test_arp_update_for_failed_standby_neighbor arp/test_arp_dualtor.py::test_standby_unsolicited_neigh_learning arp/test_arp_extended.py::test_proxy_arp
Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com> Co-authored-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com> Co-authored-by: Prince Sunny <prince.sunny@microsoft.com> Co-authored-by: abdosi <58047199+abdosi@users.noreply.github.com>
- What I did Added additional values to the debug log - Why I did it Enhance debug prints with additional info needed for offline debugging
…et#3391) Optimize the counter-polling performance in terms of polling interval accuracy Enable bulk counter-polling to run at a smaller chunk size There is one counter-polling thread for each counter group. All such threads can compete for the critical sections at the vendor SAI level, which means a counter-polling thread can wait for a critical section if another thread has been in it, which introduces latency for the waiting counter group. Collect the time stamp immediately after vendor SAI API returns. Currently, many counter groups require a Lua plugin to execute based on polling interval, to calculate rates, detect certain events, etc.
What I did Implementing code changes for sonic-net/SONiC#1425 Why I did it add nexthop group feature to fpmsyncd. How I verified it enable/disable nexthop group feature Klish will call REST API to configure feature next-hop-group enable. FEATURE|nexthop_group will be created in CONFIG_DB template zebra.conf.j2 will generate zebra.conf with fpm use-next-hop-groups if FEATURE|nexthop_group exists in CONFIG_DB. Else, it will generate zebra.conf with no fpm use-next-hop-groups (default behavior) Run the config save command to write to /etc/sonic/config_db.json restart SONiC: virsh reboot sonic-nhg /etc/frr/zebra.conf has fpm use-next-hop-groups instead of no fpm use-next-hop-groups
) * [neighsync] VXLAN EVPN neighbors not in NEIGH_TABLE VXLAN EVPN learned routes are not entered into NEIGH_TABLE as per Issue sonic-net#3384. The EVPN VXLAN HLD specifically states this should be populated so it triggers an update to the SAI database: https://github.com/sonic-net/SONiC/blob/master/doc/vxlan/EVPN/EVPN_VXLAN_HLD.md#438-mac-ip-route-handling
* [orchagent] implement ring buffer feature with a flag What I did add a ring thread for orchdaemon, which would be kicked off if gRingMode is turned on support ring buffer feature, currently only enabled for route table executor, which has a scaled use case fix the covariant return type issue of swss::TableBase* Consumer::getConsumerTable() const override it should return swss::ConsumerTableBase * Why I did it increase the speed for APP_ROUTE_TABLE consumers doing tasks
…et#3406) * bfdorch changes to support software bfd sessions What I did Added logic in bfdorch to check for switch_type value of dpu and if it's dpu, program BFD sessions in a new software BFD session table in STATE_DB, instead of programming sessions in the HW through ASIC_DB. This table will be monitored by bgpcfgd which will program BFD sessions in FRR accordingly. Added pytest testcases Why I did it As part of the Smartswitch project, BFD sessions need to be run between DPU and NPU but DPU doesn't currently support BFD hardware offload. HLD: https://github.com/kperumalbfn/SONiC/blob/kperumal/bfd/doc/smart-switch/BFD/SmartSwitchDpuLivenessUsingBfd.md
What I did Address the SRv6 test issue Why I did it The creation of SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER may be too slow
* [FC] process FC after apply view What I did Simplify approach to delaying counters on warm boot and fast boot. Removed FLEX_COUNTER_DELAY_STATUS_FIELD and instead postpone all FC processing to happen after apply view to not delay data plane configuration. The CONFIG_DB should not be updated in runtime anymore for counters to be delayed. Why I did it To address sonic-net/sonic-buildimage#20302. How I verified it Run warm-boot - make sure FC orch runs only after APPLY_VIEW.
* SRv6: add dscp_mode configuration for MySID entry * add a sync with CONFIG_DB to store MySID entry dscp mode * create a tunnel/tunnel term entry for uDT46 MySID entry (the tunnel is reused for the same dscp_mode) * add a new vs test Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com> * SRv6: set MySID behavior flavor only when required Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com> * SRv6: update to align with the latest configuration schema * align with the latest MySID config db schema * use reverse locator lookup to derive the locator in case of ambiguity Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com> * SRv6: update to use the default values for SRV6_MY_LOCATORS Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com> * SRv6: align with the latest spec for static configuration * align with new CONFIG_DB key format * use decap_dscp_mode for uN entry * update vs tests Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com> * SRv6: fix MySID prefix mask calculation * use func_len to calculate MySID entry prefix for CONFIG_DB key * update the vstest to test different func_len values * add a test for the "locator reverse lookup" Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com> * SRv6: fix log format Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com> * SRv6: remove a skip condition for the DSCP mode vs tests Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com> * SRv6: fix tunnels info bug Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com> --------- Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com> Co-authored-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com> Co-authored-by: abdosi <58047199+abdosi@users.noreply.github.com>
…net#3452) * [BufferOrch] Use SAI bulk API to configure port, PG and queue What I did Make use of SAI set bulk API to improve switch boot up performance, especially in warm-boot and fast-boot scenarios. The general concept: First, tasks are processed one by one by corresponding process* methods which add the SAI operation with a context to a bulk buffer. Bulk buffers are split by DB operation. Bulk buffer is flushed to syncd using SAI bulk API, first DELETE operations are pushed in bulk then SET operations are pushed. Status code for each operation is updated in the task context structure. Lastly, corresponding process*Post methods are invoked to handle SAI status code and perform post set operations like enabling FC counter for a PG/queue upon success. This design allows re-use of all existing code that is written to handle one task at a time and a small change is needed to maintain task context persistence throughout steps 1-3.
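The flush step of this design (DELETE operations first, then SET, with the status written back into each task context) can be sketched as follows. This is an illustrative Python model only: `flush_bulk`, the task dictionaries, and the `bulk_api` callable are hypothetical stand-ins for the bulk buffers and the SAI bulk API.

```python
def flush_bulk(tasks, bulk_api):
    """Flush queued tasks in two bulk calls: all DEL first, then all SET.

    tasks    -- list of dicts with "op" ("SET" or "DEL") and "key"
    bulk_api -- hypothetical callable (op, keys) -> list of status codes
    """
    for op in ("DEL", "SET"):
        batch = [t for t in tasks if t["op"] == op]
        if not batch:
            continue
        statuses = bulk_api(op, [t["key"] for t in batch])
        # Record the per-operation status in the task context so the
        # post-processing step can react to each result individually.
        for task, status in zip(batch, statuses):
            task["status"] = status
    return tasks
```

Because each task keeps its own context, the existing single-task handling code can run afterwards and inspect `status` without knowing the operation was batched.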
…-net#3505) * portsorch: don't call updateDbPortOperStatus on all port types PORT_TABLE contains PortChannel oper_status entries which are not expected by portsorch which leads to warm/fastreboot failures like:
```
2025 Feb 10 09:33:07.111055 sonic NOTICE swss#orchagent: :- bake: foundPortConfigDone = 1
2025 Feb 10 09:33:07.111080 sonic NOTICE swss#orchagent: :- bake: foundPortInitDone = 1
2025 Feb 10 09:33:07.111395 sonic NOTICE swss#orchagent: :- bake: m_portTable->getKeys 263
2025 Feb 10 09:33:07.111403 sonic NOTICE swss#orchagent: :- bake: portCount = 257, m_portCount = 0
2025 Feb 10 09:33:07.111403 sonic ERR swss#orchagent: :- bake: Invalid port table: portCount, expecting 257, got 261
```
Fixes sonic-net/sonic-buildimage#21688
* sonic-swss: Code changes for WRED and ECN statistics (sonic-net#2750) New flex counter group for per-Queue WRED and ECN statistics New flex counter group for per-Port WRED and ECN statistics Why I did it Implemented as per the HLD : https://github.com/sonic-net/SONiC/blob/master/doc/qos/ECN_and_WRED_statistics_HLD.md How I verified it Verified it using Marvell DUT and SWSS unit tests. Details if related Two new flex counters added for per-Queue and per-Port WRED ECN statistics. Build dependency on sonic-swss-common pull request : sonic-net/sonic-swss-common#777
…r ECMP/LAG switch hash configuration (sonic-net#3481) * added SAI_NATIVE_HASH_FIELD_IPV6_FLOW_LABEL to the hash-field table Why I did it Need to support SAI_NATIVE_HASH_FIELD_IPV6_FLOW_LABEL parameters for hash calculation How I verified it Configure SAI_NATIVE_HASH_FIELD_IPV6_FLOW_LABEL via CLI, check /var/log/syslog
* Code owners update for bufferorch, muxorch and acl
…agMember for strip tag (sonic-net#3343) What I did Added child_ports check in addLagMember and removeLagMember for strip tag Why I did it portorch sets the LAG member's strip tag when adding a subport: // Change hostif vlan tag for the parent port only when a first subport is created if (parentPort.m_child_ports.empty()) { if (!setHostIntfsStripTag(parentPort, SAI_HOSTIF_VLAN_TAG_KEEP)) However, if a new member is added later, the addLagMember function does not handle the strip tag, so the newly added LAG member ends up with the wrong tag mode.
…-net#3520) * What I did: Added a change to skip route programming if the NH is link/oper down. In scale testing with 60K+ routes, when we toggle all the interfaces (14+ interfaces back to back) as done here: https://github.com/sonic-net/sonic-mgmt/blob/master/tests/snappi_tests/multidut/bgp/test_bgp_outbound_uplink_multi_po_flap.py FRR's route processing through APP_DB is slower than link notification handling. Link notification handling has already updated the nexthop group to point to the default route (via sonic-net#3389, if eligible), so the slower FRR update can reprogram the route back to a nexthop whose link is down. This change is similar to sonic-net#3394, which was done for nexthop groups.
…-net#3517) * Set Port UPDATE_DSCP attribute when TC_TO_DSCP map is attached What I did Set Port SAI attribute SAI_PORT_ATTR_UPDATE_DSCP when TC_TO_DSCP map is attached to the port. Why I did it Some vendor SAI expects Sonic to set this attribute explicitly when TC_TO_DSCP map is attached to the port to modify DSCP value of the packet.
* Add appliance entry validation (sonic-net#3494) - Do not allow more than 1 entry in DASH Appliance table. - Do not allow DASH VNET creation before DASH Appliance entry creation. - DASH ENI already has similar check for Appliance entry.
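The two checks listed above can be modeled with a small state object. This is a toy Python sketch, not the DASH orch code; the class `DashState` and its methods are hypothetical names chosen for the example.

```python
class DashState:
    """Toy model of the appliance/VNET ordering checks (hypothetical names)."""

    def __init__(self):
        self.appliances = set()
        self.vnets = set()

    def add_appliance(self, name):
        # Reject a second, different appliance entry.
        if self.appliances and name not in self.appliances:
            return False
        self.appliances.add(name)
        return True

    def add_vnet(self, name):
        # A VNET may only be created once an appliance entry exists.
        if not self.appliances:
            return False
        self.vnets.add(name)
        return True
```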
* [smartswitch] Add support for ENI Based Forwarding HLD: sonic-net/SONiC#1842 Requires sonic-net/sonic-swss-common#976 Add DashEniFwdOrch which installs ACL rules to Redirect the DASH packet to corresponding DPU
What I did Initialize the port error status map only once Why I did it Any change in the CONFIG_DB PORT table updated the port error status map, which reset the error count to 0 How I verified it Verified by raising RF and LF error events on the port
…3502) * Use non-zero trap priority for default trap group This priority is used internally in some vendor SAI implementations and causes undesirable packet trapping behavior. How I verified it By running copp tests, which include rate-limiting tests for TTL 1 packets, on Cisco/Mellanox/Arista platforms. By manual tests with TTL 1 packets generated by scapy
…r poll calls and communication between swss/sairedis (sonic-net#3504) * Use flex counter manager for the following counter groups - priority group watermark - priority group drop - queue watermark - port counter group Signed-off-by: Stephen Sun <stephens@nvidia.com> * Fix compiling error Signed-off-by: Stephen Sun <stephens@nvidia.com> * Support bulk create also for WRED/ECN counter groups Signed-off-by: Stephen Sun <stephens@nvidia.com> * Fix review comments Signed-off-by: Stephen Sun <stephens@nvidia.com> --------- Signed-off-by: Stephen Sun <stephens@nvidia.com>
What I did Provide an explicit value for send_sci in the macsec vstest Why I did it The kernel 5.15 requires the send_sci to be true if the sci value was provided explicitly.
Add SAI MACSec POST support in SWSS When FIPS is enabled in SONiC, enable MACSec POST in switch creation. If MACSec POST is only supported in MACSec init, create MACSec objects and enable POST when initializing MACSecOrch. Set SAI POST status in StateDB accordingly based on the SAI POST status notification. MACSecMgr does not process any MACSec configuration if SAI POST fails. With the PR, the only case in which Orchagent declares SAI MACSec POST to have failed is: FIPS is enabled in SONiC; AND SAI supports MACSec POST; AND SAI returns failure on MACSec POST. We specifically verified the following on a switch with the current BRCM SAI (13.2.1.0), which does not support MACSec POST yet: MACSec ports came up fine when FIPS is not enabled in SONiC; MACSec ports came up fine when FIPS is enabled in SONiC and SAI does not support MACSec POST.
sonic-net#3979) * With ZMQ enabled between fpmsyncd to orchagent, default routes are set to DROP Issue# sonic-net#3978 Recently ZMQ was enabled in sonic-mgmt (refer to sonic-net/sonic-mgmt@111e635). fpmsyncd sends a DEL+SET for every route. This was coalesced by the producerStateTable infra. But with ZMQ enabled, this coalescing does not happen. As a result, orchagent gets a DEL+SET for every route. This works well normally. But for default routes, there is a bug in orchagent, where it adds a DROP action when it receives the DEL. But when the subsequent SET is received, it does not reset it to FORWARD action. This is due to a bug in its checking code. * ZMQ configuration does not work when mgmt VRF is configured
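To show what coalescing means for a DEL followed by a SET on the same key, here is a deliberately simplified Python model in which the last pending operation per key wins. It is an assumption-laden illustration, not the actual producerStateTable behavior, and `coalesce` is a hypothetical helper.

```python
def coalesce(ops):
    """Keep only the latest pending operation for each key.

    ops -- list of (op, key, fields) tuples in arrival order
    """
    latest = {}
    for op, key, fields in ops:
        # Re-insert so the key is ordered by its most recent update.
        latest.pop(key, None)
        latest[key] = (op, fields)
    return [(op, key, fields) for key, (op, fields) in latest.items()]
```

Under this model a consumer never observes the intermediate DEL for a route that is immediately re-added. With ZMQ, per the description above, both operations arrive, so the DEL handling path for default routes is exercised every time.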
When FRR was bumped up to 10.3 we started getting kernel routes for eth1-midplane. Similar to sonic-net#1606 we'll skip these routes to avoid failures like sonic-net/sonic-mgmt#18505
…net#3987) Co-authored-by: StormLiangMS <89824293+StormLiangMS@users.noreply.github.com>
What I did Fix issue reported in sonic-net#3961. Regression happened after move to bulk implementation. The error handling of no PGs, queues was incorrect. Bulk array preparation and reading should use a consistent approach. For example, if a port is skipped in the input array, it should also be skipped when reading from the output array. Why I did it Fix issue reported in sonic-net#3961.
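The consistency rule above (skip the same ports when building the input array and when reading the output array) is easiest to enforce by sharing one predicate between both steps. A minimal Python sketch, with hypothetical function names and a hypothetical port layout:

```python
def _has_queues(port):
    # Single predicate shared by both steps so they cannot drift apart.
    return bool(port["queues"])


def build_bulk_input(ports):
    """Return the port names that go into the bulk input array."""
    return [p["name"] for p in ports if _has_queues(p)]


def read_bulk_output(ports, statuses):
    """Map each non-skipped port to its status from the bulk output array."""
    results = iter(statuses)
    return {p["name"]: next(results) for p in ports if _has_queues(p)}
```

If the read step iterated over all ports instead, every status after the first skipped port would be attributed to the wrong port.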
* dot3 Stats collection What I did Implement RFC3635 dot3 statistics collection. Used by sonic-net/sonic-snmpagent#350 Fixes sonic-net/sonic-buildimage#22359 Why I did it RFC1284 defines dot3 stats that most switch vendors support. This RFC was superseded by RFC3635 which includes 64bit "HC" counters. We need to collect these statistics for use by sonic_snmpagent.
What I did Avoided updating the NHGroup with mux neighbor nexthops during mux state updates. Why I did it When a mux neighbor in an ECMP nexthop group changes state to standby, we point the prefix route that originally pointed to the ECMP nexthop group to any available active neighbor NH or tunnel NH, and the original ECMP nexthop group remains unused, so this update is useless and unnecessarily takes extra SAI calls. This fix also avoids creating nexthop groups with a mix of different nexthop types, which helps platforms that do not support such nexthop groups.
…et#4010) What I did Temporarily skip this to unblock PRs in multiple repos Why I did it test_port_add_remove is failing consistently since Nov 11 How I verified it By verifying tests via swss pipeline Details if related
sonic-net#3960) What I did Allow state db to take modified entries made to the tunnel decap table Why I did it Prevent DB inconsistency between state, asic and appl db How I verified it redis-cli -n 6 HGETALL "TUNNEL_DECAP_TABLE|IPINIP_V6_TUNNEL" "tunnel_type" "IPINIP" "dscp_mode" "pipe" "ecn_mode" "standard" "ttl_mode" "pipe" /var/log/syslog.1:2025 Nov 10 19:18:32.730385 bjw-can-2700-2 NOTICE swss#orchagent: :- doDecapTunnelTask: Fields for TUNNEL_DECAP_TABLE entry 'IPINIP_TUNNEL' have been synchronised in STATE_DB /var/log/syslog.1:2025 Nov 10 19:18:32.740298 bjw-can-2700-2 NOTICE swss#orchagent: :- doDecapTunnelTask: Fields for TUNNEL_DECAP_TABLE entry 'IPINIP_V6_TUNNEL' have been synchronised in STATE_DB
…sonic-net#3605) What I did Add related handling functions for SRv6 VPN Route and PIC Context in RouteSync. And add some associated fields for database. Why I did it To supplement the functionality of SRv6 VPN Route and PIC Context. How I verified it This change is part of the PhoenixWing project. The functionality has already been tested and running in the PhoenixWing. To improve coverage, unit tests using mock objects for the internal processing logic of Srv6 Vpn Route and Pic Context have been added.
What I did Support checking the capabilities of ingress/egress mirror before setting it in SAI. Switch orchagent fetches the mirror capabilities from SAI during initialization and exposes them in the fields PORT_INGRESS_MIRROR_CAPABLE and PORT_EGRESS_MIRROR_CAPABLE of the STATE_DB table SWITCH_CAPABILITY. Mirror orchagent checks whether the direction is supported before calling the corresponding SAI API. Why I did it This avoids SAI returning an error on platforms where mirroring is not supported in a given direction. A SAI SDK dump is collected on receiving a SAI error message, which is unnecessary in this case.
…#3933) If a fabric port repeatedly and rapidly transitions between the isolate and unisolate states, resulting in instability, the algorithm places the link in a permanent isolated state. Currently, the threshold for triggering this condition is when a link flaps three times within a two-hour period. Recovery from this state requires manual user intervention via a CLI command: config fabric port unisolate -n asicX --force HLD change is at:
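The threshold described above (three flaps within a two-hour period) is a sliding-window count. The following Python sketch shows one way to express it; `FlapTracker` and its methods are hypothetical and only model the rule, not the fabric port monitoring code.

```python
from collections import deque


class FlapTracker:
    """Sliding-window flap counter (illustrative sketch, hypothetical names)."""

    def __init__(self, max_flaps=3, window_seconds=2 * 3600):
        self.max_flaps = max_flaps
        self.window = window_seconds
        self.events = deque()
        self.permanently_isolated = False

    def record_flap(self, timestamp):
        """Record a flap; return True once the link is permanently isolated."""
        self.events.append(timestamp)
        # Drop flaps that fell out of the window ending at this flap.
        while self.events and timestamp - self.events[0] > self.window:
            self.events.popleft()
        if len(self.events) >= self.max_flaps:
            self.permanently_isolated = True
        return self.permanently_isolated

    def force_unisolate(self):
        # Models the manual recovery step performed through the CLI.
        self.events.clear()
        self.permanently_isolated = False
```

Once the flag is set it stays set regardless of later timestamps, mirroring the requirement for manual intervention.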
…ic-net#3847) What I did Portorch, Neighorch, and Intfsorch are updated to not access the chassis app DB. The chassis app DB is accessed only if chassisdb.conf is present, indicating it is a real chassis (not a FS VOQ). For LAG creation, system LAGs are created as the switch is a VOQ switch. Why I did it For a fixed system, there is no chassis DB present How I verified it Ran sonic-mgmt tests to verify BGP and LAG functionality
What I did Fix sonic-net/sonic-buildimage#24342 Why I did it Variable is uninitialized which is causing the error logs mentioned in the issue during startup
…on (sonic-net#3878) What I did Adding rx_monitor_timer and tx_monitor_timer handling per HLD: https://github.com/sonic-net/SONiC/blob/master/doc/vxlan/Overlay%20ECMP%20ehancements.md sign-off: Jing Zhang zhangjing@microsoft.com Why I did it It's needed for SSW HA scenario as DPU side bfd is a software solution, interval must be set to a reasonable value.
…net#3958) * [fpmsyncd]: Fix uA SID programming for link-local adjacencies A uA SID performs a shift and cross-connect to a direct neighbor over a specific interface. It is defined by two parameters: an interface and a nexthop IPv6 address. When FRR sends a uA SID to SONiC's fpmsyncd, it includes both of these parameters. However, fpmsyncd currently only extracts the nexthop IPv6 address. It then creates an entry in the SRV6_MY_SID_TABLE of ApplDB with action=ua and adj=<nexthop_ipv6_address>. Subsequently, OrchAgent retrieves this entry and attempts to resolve the adjacency to program the SID in the ASIC. The issue is that fpmsyncd extracts the nexthop IPv6 address from the message but does not extract the interface. In cases where the nexthop IPv6 address is a link-local address, the interface is essential for successful nexthop resolution. Without it, the resolution fails, and the SID is not programmed in the ASIC. For example, the following syslog messages show OrchAgent failing to resolve a link-local nexthop because the interface is missing:
```
Oct 27 08:31:19.345821 1cdc490d8ce2 INFO #orchagent: :- doTask: table name : SRV6_MY_SID_TABLE
Oct 27 08:31:19.345895 1cdc490d8ce2 INFO #orchagent: :- createUpdateMysidEntry: MY SID STRING fcbb:bbbb:1:fe10::
Oct 27 08:31:19.345912 1cdc490d8ce2 INFO #orchagent: :- createUpdateMysidEntry: MySid: sid fcbb:bbbb:1:fe10::, action ua, vrf , block 32, node 16, func 16, arg 0 dt_vrf , adj fe80::e822:daff:feab:3ee9
Oct 27 08:31:19.345946 1cdc490d8ce2 INFO #orchagent: :- createUpdateMysidEntry: Adjacency fe80::e822:daff:feab:3ee9
Oct 27 08:31:19.345965 1cdc490d8ce2 INFO #orchagent: :- createUpdateMysidEntry: Nexthop for adjacency fe80::e822:daff:feab:3ee9 doesn't exist in DB yet
Oct 27 08:31:19.345983 1cdc490d8ce2 ERR #orchagent: :- doTaskMySidTable: Failed to create/update my_sid entry for sid 32:16:16:0:fcbb:bbbb:1:fe10::
```
This commit fixes a 10-second startup delay during fast-reboot in dynamic buffer mode. What I did Add a check for the fast-reboot done flag m_bufferCompletelyInitialized in checkSharedBufferPoolSize() to avoid early calculation. This can save about 7s. One calculation is still kept, because the buffer pool needs to be ready before buffer profile creation. Also skip headroom validation in the startup phase. This can save about 2s. Why I did it There is a 10-second startup delay during fast-reboot, because in fast-reboot m_mmuSize is immediately available from STATE_DB (persisted from the previous boot), causing checkSharedBufferPoolSize() to execute expensive Redis operations before buffer system init completes.
This PR introduces OpenTelemetry (OTEL) support for exporting SAI (Switch Abstraction Interface) statistics to observability systems. It implements conversion logic from SAI statistics to OTLP gauge metrics and adds an actor for exporting these metrics. What I did New OTEL message types for converting SAI statistics to OpenTelemetry gauge format Initiated the integration of OpenTelemetry (OTEL) into the HFT components of sonic-swss OtelActor implementation for receiving SAI stats and exporting to OTEL collectors Command-line arguments for enabling and configuring OTEL export Established configuration for exporting metrics and traces to OTEL collector Why I did it To improve observability and monitoring of HFT processes within SONiC SWSS. OpenTelemetry provides standardized and extensible tracing which helps with debugging, performance analysis, and future integrations.
Adjust headroom calculation on SN6600 platform *Set egress mirror headroom to 0 on SN6600 platform (sonic-net#4005)
What I did Set flow_reconcile_pending, activate_role_pending, and brainsplit_recover_pending back to false after receiving the appropriate notification. For flow_reconcile_pending and activate_role_pending, this is after controller sends operation approval. For brainsplit, clearing the flag after DPU enters stable state again. Why I did it flow_reconcile_pending, activate_role_pending, and brainsplit_recover_pending were not being reset to false after being set true for the first time.
What I did Enable SAI_TAM_TEL_TYPE_ATTR_SWITCH_ENABLE_OUTPUT_QUEUE_STATS on TAM_TEL_TYPE Why I did it This attribute is needed if the HFT wants to support stats of IPG
What I did Support updating the peer IP in the HA set config. Why I did it During the repair process, the peer IP needs to be updated to the new DPU to be paired with.
sonic-platform-daemon side change: sonic-net/sonic-platform-daemons#643 What I did Support SAI_PORT_SERDES_ATTR_CUSTOM_COLLECTION (i.e. custom serdes attributes represented in JSON based string format) Use boost::variant for the serdes_attr map value to support both std::vector<uint32_t> and std::string in a type-safe way, with easy extensibility to add more types later.
This PR updates DASH (Data Processing Unit) orchagents to use the DPU application database instead of the standard application database for all orchestrator components. Changed database connection from APPL_DB to DPU_APPL_DB across DASH components Updated test infrastructure to support DPU application database validation Modified orchestrator daemon initialization to use DPU-specific database connections
What I did Add support for platforms based on the Clounix ASIC. The Clounix ASIC platform code has been merged into the sonic-buildimage repo. Please see details in PR: sonic-buildimage clounix PR How I did it Add support for platforms based on the Clounix ASIC
sonic-net#3982) What I did Added cleanup of COUNTERS_*_NAME_MAP entries for a port during its deinit phase, and regenerated the NAME_MAP tables with fresh OIDs during port init when queue flex counters are already enabled. Why I did it After dynamic port breakout of a port, queue name map tables in the COUNTERS_DB table are not regenerated leaving stale entries resulting in CLI crash
…Port is removed and created when the Speed is changed dynamically via GCU (sonic-net#3976) What I did Fixed the issue reported in sonic-net/sonic-buildimage#24417 Added code to populate the system_port information in the New Port structure in orchagent portsorch after the Port is removed and created when the Port speed is changed via GCU patch. Why I did it When the switch is created, swss queries all the SYSTEM_PORTS from SAI and updates the PORT class/structure with the corresponding system_port info after the PortInitDone event is received from portsyncd. Then the port speed is changed with 4 Lanes via GCU patch, the port is removed from SAI and created again in swss by calling deInitPort and initPort. But in initPort, the system_port info is not updated in the new PORT structure. So when the RIF is created on local interface, the voqSyncAddIntf adds an entry in SYSTEM_INTERFACE table in CHASSIS_APP_DB with empty key since the system_port info is not populated for the local port. For the same reason, the SYSTEM_NEIGH info is also not updated in CHASSIS_APP_DB. This breaks the basic VOQ functionality
What I did Support FNIC pipeline changes for DashEniFwd Orch (No match on Tunnel VNI and only match on INNER_DST_MAC) Update DashEniFwdOrch to handle the updated schema for DPU, VDPU, REMOTE_DPU & DASH_ENI_FORWARD_TABLE Delete the ACL Table if all the ENI acl rules are deleted While creating Tunnel NH, also pass VNI. Update Aclorch to accept Tunnel NH in the following format: endpoint_ip@tunnel_name[,vni][,mac] Simplify the logic by removing the reference counting logic for Remote NH Tunnel tracking Added REQ_T_STRING_LIST option to Request parser Allow Relaxed Attribute parsing to Request parser This is needed because the Request parser expects a strict schema of field-value pairs.
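As an illustration of the `endpoint_ip@tunnel_name[,vni][,mac]` format mentioned above, here is a rough Python parser. The disambiguation rule (an all-digit field is the VNI, a field containing `:` is the MAC) is an assumption made for this sketch, and `parse_tunnel_nh` is a hypothetical helper, not the Aclorch implementation.

```python
def parse_tunnel_nh(text):
    """Parse 'endpoint_ip@tunnel_name[,vni][,mac]' into a dict."""
    head, *extras = text.split(",")
    endpoint_ip, sep, tunnel_name = head.partition("@")
    if not sep or not endpoint_ip or not tunnel_name:
        raise ValueError("expected endpoint_ip@tunnel_name: %r" % text)
    vni = None
    mac = None
    for extra in extras:
        if extra.isdigit() and vni is None and mac is None:
            vni = int(extra)   # assumed: the VNI is a plain decimal number
        elif ":" in extra and mac is None:
            mac = extra        # assumed: the MAC is colon-separated
        else:
            raise ValueError("unexpected field %r in %r" % (extra, text))
    return {"endpoint_ip": endpoint_ip, "tunnel_name": tunnel_name,
            "vni": vni, "mac": mac}
```

Splitting on commas before looking for colons keeps IPv6 endpoint addresses intact, since they appear only in the first field.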
Signed-off-by: Sonic Automation <sonicbld@microsoft.com>
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
This PR is created for automated agent pool migration across branches.
Agent pools to be migrated:
Branches processed: