
Automated agent pool migration for branch master #1

Open
mssonicbld wants to merge 445 commits into master from
migrate-agent-pool-master

Conversation

@mssonicbld
Owner

This PR is created for automated agent pool migration across branches.
Agent pools to be migrated:

  • sonicso1ES-amd64 -> justForTesting

Branches processed:

  • master
  • main

tshalvi and others added 30 commits January 30, 2025 10:00
* Remove RIF from m_rifsToAdd before deleting RIF

What I did
I extended the RIF removal functionality to also remove the port from the m_rifsToAdd list.

Why I did it
Typically, the counter and object handling logic follows a strict sequence:

Create an object, then start counter polling.
Stop counter polling, then remove the object.
However, there is deferred logic for RIF counters, where counter polling starts based on a timer rather than immediately.

This process generally works as follows:

Create an object and add it to a list upon receiving an APP_DB update.
Start counter polling for all objects in the list during the timer event.
Stop counter polling for an object.
Remove the object.
If RIF creation and removal occur frequently, removal can happen before the timer event. As a result, the timer may start counter polling for an object that has just been removed, causing the following error message:
ERR syncd#SDK: :- processFlexCounterEvent: port VID oid:0x600000000099d, was not found (probably port was removed/splitted) and will remove from counters now
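
The race described above can be sketched as follows. This is a minimal illustration, not the orchagent code: the class `RifCounterScheduler` and its method names are hypothetical, and only the `m_rifsToAdd` list name comes from the commit message. The point is that removal erases the pending entry before the object is deleted, so a later timer event cannot start polling for it.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-in for a router interface object ID.
using RifId = std::uint64_t;

class RifCounterScheduler
{
public:
    // RIF created: polling is deferred until the timer fires.
    void onRifCreated(RifId rif) { m_rifsToAdd.insert(rif); }

    // RIF about to be deleted: drop it from the pending list first,
    // so the timer never starts polling for an object that is gone.
    void onRifRemoved(RifId rif)
    {
        m_rifsToAdd.erase(rif);
        m_polled.erase(rif);
    }

    // Timer event: start polling for everything still pending.
    std::vector<RifId> onTimer()
    {
        std::vector<RifId> started(m_rifsToAdd.begin(), m_rifsToAdd.end());
        m_polled.insert(m_rifsToAdd.begin(), m_rifsToAdd.end());
        m_rifsToAdd.clear();
        return started;
    }

    bool isPolled(RifId rif) const { return m_polled.count(rif) != 0; }

private:
    std::set<RifId> m_rifsToAdd; // created, waiting for the timer
    std::set<RifId> m_polled;    // counter polling already started
};
```

Without the `m_rifsToAdd.erase(rif)` call in `onRifRemoved`, a RIF created and removed between two timer events would still be picked up by `onTimer`.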
…onic-net#3482)

What I did
use --add-tracefile option in debian/rules and tests/conftest.py to sanitize coverage.info generated by lcov

Why I did it
lcov generates an initial coverage.info file based on the collected .gcno and .gcda files. This .info file contains coverage information for different source files (marked as SF). Sometimes the same SF appears multiple times. This means lcov has multiple copies of coverage information for that file, because the file may have appeared in multiple compilation units, and the line hit counts differ between copies.

Then lcov_cobertura generates coverage.xml based on coverage.info. However, it can't deal with duplicate SF in coverage.info properly. If it sees duplicate coverage information for a source file from coverage.info, it always overwrites the old copy with the new copy, hence only the last copy would be counted. However, if the last copy considers the functions as missing, the function is considered as missing in coverage.xml, which is used to determine whether the new PR passes the coverage threshold.

The proper way is to add up the hit counts of all the copies, which can be achieved with lcov's --add-tracefile option.
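
To illustrate the difference between overwriting and summing, here is a small sketch of the merge semantics only. It is not how lcov or lcov_cobertura are implemented; the `Record` type and `mergeRecords` function are made up for this example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Line number -> hit count for one copy of a source file (one SF record).
using LineHits = std::map<int, long>;
// Source file name plus the hits recorded for that copy.
using Record = std::pair<std::string, LineHits>;

// Sum the hit counts of every record that names the same source file,
// instead of letting the last record overwrite the earlier ones.
std::map<std::string, LineHits> mergeRecords(const std::vector<Record>& records)
{
    std::map<std::string, LineHits> merged;
    for (const auto& rec : records)
    {
        LineHits& dst = merged[rec.first];
        for (const auto& lineAndHits : rec.second)
        {
            dst[lineAndHits.first] += lineAndHits.second; // new lines start at 0
        }
    }
    return merged;
}
```

With two copies of the same file where only the first copy has hits, the merged result keeps those hits, whereas a last-copy-wins approach would report the lines as missed.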
* Add heart beat interval parameter
* Disable feature when interval is 0

Why I did it
Allow this feature to be disabled, because of a log spam issue on small-disk devices:
sonic-net/sonic-buildimage#21157

Work item tracking
Microsoft ADO: 30594076
… passes an invalid timestamp (sonic-net#3446)

- What I did
Prevent orchagent from crashing with a segmentation fault when it receives a timestamp in the far future (2^31 years later) in the ASIC/SDK health event from the vendor SAI.
Passing such a large timestamp is a vendor SAI failure, but orchagent must still guard against this invalid input.

- Why I did it

- How I verified it
Mock test

- Details if related
If the vendor SAI passes a very large timestamp, put_time can cause a segmentation fault, which cannot be caught by try/catch.
We check the difference between the timestamp from SAI and the current time, and use the current time if the gap is too large.
This avoids the segmentation fault.
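
A minimal sketch of this kind of guard is shown below. The function name `sanitizeTimestamp` and the one-day limit `kMaxSkewSeconds` are assumptions made for illustration; the commit message only says the gap must not be "too large" and does not state the actual threshold.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Hypothetical limit, chosen only for this example.
constexpr std::int64_t kMaxSkewSeconds = 24 * 60 * 60;

// Return the event timestamp if it is close to "now"; otherwise fall back
// to "now" so later formatting never sees an extreme value.
std::time_t sanitizeTimestamp(std::int64_t eventSeconds, std::time_t now)
{
    const std::int64_t nowSeconds = static_cast<std::int64_t>(now);
    // Compare against bounds rather than subtracting, so an extreme
    // input cannot overflow the arithmetic.
    if (eventSeconds > nowSeconds + kMaxSkewSeconds ||
        eventSeconds < nowSeconds - kMaxSkewSeconds)
    {
        return now;
    }
    return static_cast<std::time_t>(eventSeconds);
}
```

The sanitized value can then be handed to the formatting code in place of the raw value from SAI.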

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Git ignore .gcda and .gcno in all folders (sonic-net#3479)

* Ignore .gcda and .gcno in all folders to avoid seeing a large number of untracked files in git status
…al IPv6 addresses of VLAN and Bridge interfaces (sonic-net#3476)

What I did

Added code to bring down the Bridge interface before changing the MAC address of a VLAN interface and the Bridge, and then starting up the Bridge after the MAC is changed. This automatically brings down and starts up all VLAN interfaces added to the Bridge.
Added code to bring down and then immediately start up the Bridge after the dummy interface is started up. This is useful to ensure that MAC and link-local IPv6 addresses of the Bridge are consistent in case no VLANs are added to the Bridge later.
Added a VS test to verify these behaviors.
Why I did it
After PR sonic-net#3370, the Bridge becomes operationally UP after the dummy interface is started up. As a result, all VLAN interfaces created under the Bridge are immediately operationally UP after creation. This can cause an issue later if their MAC address is changed since the kernel does not update the link-local IPv6 address of an interface if it is operationally UP. This behavior caused the IPv6 version of the following test cases in sonic-mgmt to fail for the dualtor topology, which were temporarily skipped:

arp/test_arp_dualtor.py::test_arp_update_for_failed_standby_neighbor
arp/test_arp_dualtor.py::test_standby_unsolicited_neigh_learning
arp/test_arp_extended.py::test_proxy_arp
Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
Co-authored-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>
Co-authored-by: Prince Sunny <prince.sunny@microsoft.com>
Co-authored-by: abdosi <58047199+abdosi@users.noreply.github.com>
- What I did
Added additional values for debug log

- Why I did it
Enhance debug prints with additional info needed for offline debug
…et#3391)

Optimize the counter-polling performance in terms of polling interval accuracy

Enable bulk counter-polling to run at a smaller chunk size
There is one counter-polling thread for each counter group. All such threads can compete for the critical sections at the vendor SAI level, which means a counter-polling thread can wait for a critical section if another thread has been in it, which introduces latency for the waiting counter group.

Collect the time stamp immediately after vendor SAI API returns.
Currently, many counter groups require a Lua plugin to execute based on polling interval, to calculate rates, detect certain events, etc.
What I did
Implementing code changes for sonic-net/SONiC#1425

Why I did it
add nexthop group feature to fpmsyncd.

How I verified it

enable/disable nexthop group feature
Klish will call REST API to configure feature next-hop-group enable.
FEATURE|nexthop_group will be created in CONFIG_DB
template zebra.conf.j2 will generate zebra.conf with fpm use-next-hop-groups if FEATURE|nexthop_group exists in CONFIG_DB. Else, it will generate zebra.conf with no fpm use-next-hop-groups (default behavior)
Run the config save command to write to /etc/sonic/config_db.json
restart SONiC: virsh reboot sonic-nhg
/etc/frr/zebra.conf has fpm use-next-hop-groups instead of no fpm use-next-hop-groups
)

* [neighsync] VXLAN EVPN neighbors not in NEIGH_TABLE

VXLAN EVPN learned routes are not entered into NEIGH_TABLE as per
Issue sonic-net#3384.

The EVPN VXLAN HLD specifically states this should be populated so it triggers
an update to the SAI database:

https://github.com/sonic-net/SONiC/blob/master/doc/vxlan/EVPN/EVPN_VXLAN_HLD.md#438-mac-ip-route-handling
* [orchagent] implement ring buffer feature with a flag
What I did

add a ring thread for orchdaemon, which would be kicked off if gRingMode is turned on
support ring buffer feature, currently only enabled for route table executor, which has a scaled use case
fix the covariant return type issue of swss::TableBase* Consumer::getConsumerTable() const override
it should return swss::ConsumerTableBase *
Why I did it

increase the speed for APP_ROUTE_TABLE consumers doing tasks
…et#3406)

* bfdorch changes to support software bfd sessions

What I did

Added logic in bfdorch to check for switch_type value of dpu and if it's dpu, program BFD sessions in a new software BFD session table in STATE_DB, instead of programming sessions in the HW through ASIC_DB. This table will be monitored by bgpcfgd which will program BFD sessions in FRR accordingly.
Added pytest testcases
Why I did it

As part of the Smartswitch project, BFD sessions need to be run between DPU and NPU but DPU doesn't currently support BFD hardware offload.
HLD: https://github.com/kperumalbfn/SONiC/blob/kperumal/bfd/doc/smart-switch/BFD/SmartSwitchDpuLivenessUsingBfd.md
What I did

Address the SRv6 test issue

Why I did it

The creation of SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER may be too slow
* [FC] process FC after apply view

What I did
Simplify approach to delaying counters on warm boot and fast boot. Removed FLEX_COUNTER_DELAY_STATUS_FIELD and instead postpone all FC processing to happen after apply view to not delay data plane configuration.

The CONFIG_DB should not be updated in runtime anymore for counters to be delayed.

Why I did it
To address sonic-net/sonic-buildimage#20302.

How I verified it
Run warm-boot - make sure FC orch runs only after APPLY_VIEW.
* SRv6: add dscp_mode configuration for MySID entry

* add a sync with CONFIG_DB to store MySID entry dscp mode
* create a tunnel/tunnel term entry for uDT46 MySID entry (the tunnel is reused for the same dscp_mode)
* add a new vs test

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>

* SRv6: set MySID behavior flavor only when required

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>

* SRv6: update to align with the latest configuration schema

* align with the latest MySID config db schema
* use reverse locator lookup to derive the locator in case of ambiguity

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>

* SRv6: update to use the default values for SRV6_MY_LOCATORS

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>

* SRv6: align with the latest spec for static configuration

* align with new CONFIG_DB key format
* use decap_dscp_mode for uN entry
* update vs tests

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>

* SRv6: fix MySID prefix mask calculation

* use func_len to calculate MySID entry prefix for CONFIG_DB key
* update the vstest to test different func_len values
* add a test for the "locator reverse lookup"

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>

* SRv6: fix log format

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>

* SRv6: remove a skip condition for the DSCP mode vs tests

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>

* SRv6: fix tunnels info bug

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>

---------

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
Co-authored-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>
Co-authored-by: abdosi <58047199+abdosi@users.noreply.github.com>
…net#3452)

* [BufferOrch] Use SAI bulk API to configure port, PG and queue

What I did

Make use of SAI set bulk API to improve switch boot up performance, especially in warm-boot and fast-boot scenarios.

The general concept:

First, tasks are processed one by one by corresponding process* methods which add the SAI operation with a context to a bulk buffer. Bulk buffers are split by DB operation.
Bulk buffer is flushed to syncd using SAI bulk API, first DELETE operations are pushed in bulk then SET operations are pushed. Status code for each operation is updated in the task context structure.
Lastly, corresponding process*Post methods are invoked to handle SAI status code and perform post set operations like enabling FC counter for a PG/queue upon success.
This design allows re-use of all existing code that is written to handle one task at a time and a small change is needed to maintain task context persistence throughout steps 1-3.
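
The three steps above can be sketched roughly as follows. Everything here is a simplified stand-in: `TaskContext`, `BulkProcessor`, and `BulkFn` are hypothetical names, the bulk call is modeled as a plain callback rather than a SAI bulk API, and status 0 is assumed to mean success.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

enum class Op { Set, Del };

// Hypothetical task context that persists across all three steps.
struct TaskContext
{
    std::string key;
    Op op = Op::Set;
    int status = -1;       // filled in by the flush step
    bool postDone = false; // set by the post step on success
};

// Stand-in for a bulk call: takes a batch, returns one status per entry.
using BulkFn = std::function<std::vector<int>(const std::vector<std::string>&)>;

class BulkProcessor
{
public:
    // Step 1: queue the operation in the buffer matching its DB operation.
    void process(TaskContext& ctx)
    {
        (ctx.op == Op::Del ? m_dels : m_sets).push_back(&ctx);
    }

    // Step 2: flush DEL operations first, then SET, recording each status.
    void flush(const BulkFn& bulk)
    {
        flushOne(m_dels, bulk);
        flushOne(m_sets, bulk);
    }

    // Step 3: per-task post-processing, only on success.
    static void processPost(TaskContext& ctx)
    {
        ctx.postDone = (ctx.status == 0);
    }

private:
    static void flushOne(std::vector<TaskContext*>& buf, const BulkFn& bulk)
    {
        if (buf.empty()) return;
        std::vector<std::string> keys;
        for (const TaskContext* c : buf) keys.push_back(c->key);
        const std::vector<int> statuses = bulk(keys);
        for (std::size_t i = 0; i < buf.size() && i < statuses.size(); ++i)
            buf[i]->status = statuses[i];
        buf.clear();
    }

    std::vector<TaskContext*> m_dels;
    std::vector<TaskContext*> m_sets;
};
```

Because each task keeps its own context object, the existing one-task-at-a-time handling can be split into a "before flush" and an "after flush" half without changing its logic.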
…-net#3505)

* portsorch: don't call updateDbPortOperStatus on all port types

PORT_TABLE contains PortChannel oper_status entries which are not
expected by portsorch which leads to warm/fastreboot failures
like:
```
2025 Feb 10 09:33:07.111055 sonic NOTICE swss#orchagent: :- bake: foundPortConfigDone = 1
2025 Feb 10 09:33:07.111080 sonic NOTICE swss#orchagent: :- bake: foundPortInitDone = 1
2025 Feb 10 09:33:07.111395 sonic NOTICE swss#orchagent: :- bake: m_portTable->getKeys 263
2025 Feb 10 09:33:07.111403 sonic NOTICE swss#orchagent: :- bake: portCount = 257, m_portCount = 0
2025 Feb 10 09:33:07.111403 sonic ERR swss#orchagent: :- bake: Invalid port table: portCount, expecting 257, got 261
```

Fixes sonic-net/sonic-buildimage#21688
*sonic-swss: Code changes for WRED and ECN statistics (sonic-net#2750)

New flex counter group for per-Queue WRED and ECN statistics
New flex counter group for per-Port WRED and ECN statistics

Why I did it
Implemented as per the HLD : https://github.com/sonic-net/SONiC/blob/master/doc/qos/ECN_and_WRED_statistics_HLD.md

How I verified it
Verified it using Marvell DUT and SWSS unit tests.

Details if related

Two new flex counters added for per-Queue and per-Port WRED ECN statistics.
Build dependency on sonic-swss-common pull request : sonic-net/sonic-swss-common#777
…r ECMP/LAG switch hash configuration (sonic-net#3481)

* added SAI_NATIVE_HASH_FIELD_IPV6_FLOW_LABEL to the hash-field table

Why I did it
Need to support SAI_NATIVE_HASH_FIELD_IPV6_FLOW_LABEL parameters for hash calculation

How I verified it
Configure SAI_NATIVE_HASH_FIELD_IPV6_FLOW_LABEL via CLI, check /var/log/syslog
* Code owners update for bufferorch, muxorch and acl
…agMember for strip tag (sonic-net#3343)

What I did
Added child_ports check in addLagMember and removeLagMember for strip tag

Why I did it
portorch sets LAG member's strip tag when adding subport:

    // Change hostif vlan tag for the parent port only when a first subport is created
    if (parentPort.m_child_ports.empty())
    {
        if (!setHostIntfsStripTag(parentPort, SAI_HOSTIF_VLAN_TAG_KEEP))
However, if a new member is added later, the addLagMember function no longer handles the strip tag, so the newly added LAG member ends up with the wrong tag mode.
…-net#3520)

*What I did:
Added a change to skip route programming if the NH is link/oper down. During scale route testing with 60K+ routes, we toggle all the interfaces (14+ interfaces back to back), as done in this test (see https://github.com/sonic-net/sonic-mgmt/blob/master/tests/snappi_tests/multidut/bgp/test_bgp_outbound_uplink_multi_po_flap.py). FRR route APP_DB processing is slow compared to link notification handling, where the Nexthop Group has already been updated to point to the default route via sonic-net#3389 (if eligible). Because of this slowness, FRR can reprogram the route back to a nexthop that is link down.

This change is similar to sonic-net#3394 which was done for Nexthop Group.
…-net#3517)

* Set Port UPDATE_DSCP attribute when TC_TO_DSCP map is attached
What I did

Set Port SAI attribute SAI_PORT_ATTR_UPDATE_DSCP when TC_TO_DSCP map is attached to the port.
Why I did it

Some vendor SAI expects Sonic to set this attribute explicitly when TC_TO_DSCP map is attached to the port to modify DSCP value of the packet.
* Add appliance entry validation (sonic-net#3494)
- Do not allow more than 1 entry in DASH Appliance table.
- Do not allow DASH VNET creation before DASH Appliance entry creation.
- DASH ENI already has similar check for Appliance entry.
* [smartswitch] Add support for ENI Based Forwarding
HLD: sonic-net/SONiC#1842
Requires sonic-net/sonic-swss-common#976
Add DashEniFwdOrch which installs ACL rules to Redirect the DASH packet to corresponding DPU
What I did
Initialize the port error status map only once

Why I did it
Any change in the CONFIG_DB PORT table caused the port error status map to be updated, which reset the error count to 0

How I verified it
Verified by raising RF and LF error events on the port
…3502)

* Use non-zero trap priority for default trap group
This priority is used internally in some vendor SAI implementations and causes
undesirable packet trapping behavior.

How I verified it

By running copp tests which include rate-limiting tests for TTL 1 packets on Cisco/Mellanox/Arista platforms.
By manual tests with TTL 1 packets generated by scapy
…r poll calls and communication between swss/sairedis (sonic-net#3504)

* Use flex counter manager for the following counter groups

- priority group watermark
- priority group drop
- queue watermark
- port counter group

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Fix compiling error

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Support bulk create also for WRED/ECN counter groups

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Fix review comments

Signed-off-by: Stephen Sun <stephens@nvidia.com>

---------

Signed-off-by: Stephen Sun <stephens@nvidia.com>
What I did
Provide an explicit value for send_sci in the macsec vstest

Why I did it

The kernel 5.15 requires the send_sci to be true if the sci value was provided explicitly.
ysmanman and others added 29 commits November 11, 2025 15:14
Add SAI MACSec POST support in SWSS

When FIPS is enabled in SONiC, enable MACSecPOST in switch creation.
If MACSec POST is only supported in MACSec init, create MACSec objects and enable POST when initializing MACSecOrch.
Set SAI POST status in StateDB accordingly based on SAI POST status notification.
MACSecMgr does not process any MACSec configuration if SAI POST fails.
With the PR, the only case that Orchagent declares SAI MACSec POST to fail is:

FIPS is enabled in SONiC; AND
SAI supports MACSec POST; AND
SAI returns failure on MACSec POST.
We particularly verified the following on switch with current BRCM SAI (13.2.1.0) that does not support MACSec POST yet:

MACSec ports came up fine when FIPS is not enabled in SONiC;
MACSec ports came up fine when FIPS is enabled in SONiC and SAI does not support MACSec POST.
sonic-net#3979)

* With ZMQ enabled between fpmsyncd to orchagent, default routes are set to DROP

Issue# sonic-net#3978

Recently ZMQ was enabled in sonic-mgmt (refer to sonic-net/sonic-mgmt@111e635).
fpmsyncd sends a DEL+SET for every route. This was coalesced by the producerStateTable infra. But with ZMQ enabled, this coalescing does not happen. As a result, orchagent gets a DEL+SET for every route. This works well normally. But for default routes, there is a bug in orchagent, where it adds a DROP action when it receives the DEL. But when the subsequent SET is received, it does not reset it to FORWARD action. This is due to a bug in its checking code.

* ZMQ configuration does not work when mgmt VRF is configured
When FRR was bumped up to 10.3 we started getting kernel routes for
eth1-midplane.
Similar to sonic-net#1606 we'll skip
these routes to avoid failures like
sonic-net/sonic-mgmt#18505
…net#3987)

Co-authored-by: StormLiangMS <89824293+StormLiangMS@users.noreply.github.com>
What I did

Fix issue reported in sonic-net#3961.
Regression happened after move to bulk implementation. The error handling of no PGs, queues was incorrect.

Bulk array preparation and reading should use a consistent approach. For example, if a port is skipped in the input array, it should also be skipped when reading from the output array.
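
One way to keep the two passes consistent is to route both through a single predicate, as in the sketch below. The names `PortInfo`, `includedInBulk`, `buildInput`, and `readOutput` are hypothetical and are not taken from the actual fix; the skip condition (a port with no queues) is only an example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical port description: a name and how many queues it has.
struct PortInfo
{
    std::string name;
    int queueCount;
};

// One predicate decides skipping for both the input and the output pass.
bool includedInBulk(const PortInfo& port) { return port.queueCount > 0; }

// Build the bulk input array, leaving out skipped ports.
std::vector<std::string> buildInput(const std::vector<PortInfo>& ports)
{
    std::vector<std::string> input;
    for (const auto& p : ports)
        if (includedInBulk(p)) input.push_back(p.name);
    return input;
}

// Map results back to ports, advancing the output index only for ports
// that were actually placed in the input array.
std::map<std::string, int> readOutput(const std::vector<PortInfo>& ports,
                                      const std::vector<int>& output)
{
    std::map<std::string, int> result;
    std::size_t idx = 0;
    for (const auto& p : ports)
    {
        if (!includedInBulk(p)) continue;
        if (idx >= output.size()) break;
        result[p.name] = output[idx++];
    }
    return result;
}
```

If the read pass indexed the output by port position instead, every port after a skipped one would be paired with the wrong result.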

Why I did it

Fix issue reported in sonic-net#3961.
* dot3 Stats collection
What I did

Implement RFC3635 dot3 statistics collection.

Used by sonic-net/sonic-snmpagent#350
Fixes sonic-net/sonic-buildimage#22359

Why I did it

RFC1284 defines dot3 stats that most switch vendors support. This RFC was superseded by RFC3635 which includes 64bit "HC" counters. We need to collect these statistics for use by sonic_snmpagent.
What I did
Avoided updating the NHGroup with mux neighbor nexthops during mux state updates.

Why I did it
When a mux neighbor in an ECMP NexthopGroup changes state to standby, we point the prefix route that originally pointed to the ECMP nexthop group to any available active neighbor NH or tunnel NH, and the original ECMP nexthop group remains unused. This update is therefore useless and unnecessarily takes extra SAI calls. This fix also avoids creating a nexthop group with a mix of different types of NextHops, which accommodates platforms that do not support such nexthop groups.
…et#4010)

What I did
Temporarily skip this to unblock PRs in multiple repos
Why I did it
test_port_add_remove is failing consistently since Nov 11
How I verified it
By verifying tests via swss pipeline
Details if related
sonic-net#3960)

What I did
Allow state db to take modified entries made to the tunnel decap table

Why I did it
Prevent DB inconsistency between state, asic and appl db

How I verified it
redis-cli -n 6 HGETALL "TUNNEL_DECAP_TABLE|IPINIP_V6_TUNNEL"

"tunnel_type"
"IPINIP"
"dscp_mode"
"pipe"
"ecn_mode"
"standard"
"ttl_mode"
"pipe"
/var/log/syslog.1:2025 Nov 10 19:18:32.730385 bjw-can-2700-2 NOTICE swss#orchagent: :- doDecapTunnelTask: Fields for TUNNEL_DECAP_TABLE entry 'IPINIP_TUNNEL' have been synchronised in STATE_DB
/var/log/syslog.1:2025 Nov 10 19:18:32.740298 bjw-can-2700-2 NOTICE swss#orchagent: :- doDecapTunnelTask: Fields for TUNNEL_DECAP_TABLE entry 'IPINIP_V6_TUNNEL' have been synchronised in STATE_DB
…sonic-net#3605)

What I did
Add related handling functions for SRv6 VPN Route and PIC Context in RouteSync.
And add some associated fields for database.

Why I did it
To supplement the functionality of SRv6 VPN Route and PIC Context.

How I verified it
This change is part of the PhoenixWing project. The functionality has already been tested and running in the PhoenixWing.
To improve coverage, unit tests using mock objects for the internal processing logic of Srv6 Vpn Route and Pic Context have been added.
What I did

Support checking the capabilities of ingress/egress mirror before setting it to SAI.

Switch orchagent fetches the mirror capabilities from SAI during initialization and exposes to field PORT_INGRESS_MIRROR_CAPABLE and PORT_EGRESS_MIRROR_CAPABLE in STATE_DB table SWITCH_CAPABILITY.
Mirror orchagent check whether the direction is supported before calling the corresponding SAI API.
Why I did it

This avoids SAI returning an error on platforms where mirroring is not supported in a given direction.
On receiving a SAI error message, the system collects a SAI SDK dump, which is unnecessary in this case.
…#3933)

If a fabric port repeatedly and rapidly transitions between the isolate and unisolate states, resulting in instability, the algorithm places the link in a permanent isolated state. Currently, the threshold for triggering this condition is when a link flaps three times within a two-hour period.

Recovery from this state requires manual user intervention via a CLI command:
config fabric port unisolate -n asicX --force
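
The threshold logic described above can be sketched as a sliding window over flap timestamps. This is an illustration only: `FlapMonitor` and its methods are hypothetical names, and `forceUnisolate` merely mirrors the effect of the CLI command rather than its implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

class FlapMonitor
{
public:
    FlapMonitor(std::size_t maxFlaps, std::int64_t windowSeconds)
        : m_maxFlaps(maxFlaps), m_window(windowSeconds) {}

    // Record one flap at time `now` (in seconds). Returns true once the
    // link should be treated as permanently isolated.
    bool recordFlap(std::int64_t now)
    {
        if (m_isolated) return true;
        m_flaps.push_back(now);
        // Forget flaps that fell out of the observation window.
        while (!m_flaps.empty() && now - m_flaps.front() > m_window)
            m_flaps.pop_front();
        if (m_flaps.size() >= m_maxFlaps) m_isolated = true;
        return m_isolated;
    }

    // Manual recovery, analogous to the forced unisolate command.
    void forceUnisolate()
    {
        m_isolated = false;
        m_flaps.clear();
    }

private:
    std::size_t m_maxFlaps;
    std::int64_t m_window;
    std::deque<std::int64_t> m_flaps;
    bool m_isolated = false;
};
```

With the values from the description, the monitor would be constructed as `FlapMonitor(3, 2 * 60 * 60)`.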

HLD change is at:
…ic-net#3847)

What I did

Portorch, Neighorch, and Intfsorch are updated to not access the chassis app DB. The chassis app DB is accessed only if chassisdb.conf is present, indicating it's a real chassis (not a FS VOQ)
For LAG creation, system LAGs are created as the switch is a VOQ switch.
Why I did it

For a fixed system, there is no chassis DB present
How I verified it

Ran sonic-mgmt tests to verify BGP, LAG, functionality
What I did

Fix sonic-net/sonic-buildimage#24342

Why I did it

Variable is uninitialized which is causing the error logs mentioned in the issue during startup
…on (sonic-net#3878)

What I did
Adding rx_monitor_timer and tx_monitor_timer handling per HLD: https://github.com/sonic-net/SONiC/blob/master/doc/vxlan/Overlay%20ECMP%20ehancements.md

sign-off: Jing Zhang zhangjing@microsoft.com

Why I did it
It's needed for SSW HA scenario as DPU side bfd is a software solution, interval must be set to a reasonable value.
…net#3958)

* [fpmsyncd]: Fix uA SID programming for link-local adjacencies

A uA SID performs a shift and cross-connect to a direct neighbor
over a specific interface. It is defined by two parameters: an
interface and a nexthop IPv6 address.

When FRR sends a uA SID to SONiC's fpmsyncd, it includes both of
these parameters. However, fpmsyncd currently only extracts the
nexthop IPv6 address. It then creates an entry in the
SRV6_MY_SID_TABLE of ApplDB with action=ua and
adj=<nexthop_ipv6_address>. Subsequently, OrchAgent retrieves this
entry and attempts to resolve the adjacency to program the SID in
the ASIC.

The issue is that fpmsyncd extracts the nexthop IPv6 address from
the message but does not extract the interface. In cases where the
nexthop IPv6 address is a link-local address, the interface is
essential for successful nexthop resolution. Without it, the
resolution fails, and the SID is not programmed in the ASIC.

For example, the following syslog messages show OrchAgent failing to
resolve a link-local nexthop because the interface is missing:

```
Oct 27 08:31:19.345821 1cdc490d8ce2 INFO #orchagent: :- doTask: table name : SRV6_MY_SID_TABLE
Oct 27 08:31:19.345895 1cdc490d8ce2 INFO #orchagent: :- createUpdateMysidEntry: MY SID STRING fcbb:bbbb:1:fe10::
Oct 27 08:31:19.345912 1cdc490d8ce2 INFO #orchagent: :- createUpdateMysidEntry: MySid: sid fcbb:bbbb:1:fe10::, action ua, vrf , block 32, node 16, func 16, arg 0 dt_vrf , adj fe80::e822:daff:feab:3ee9
Oct 27 08:31:19.345946 1cdc490d8ce2 INFO #orchagent: :- createUpdateMysidEntry: Adjacency fe80::e822:daff:feab:3ee9
Oct 27 08:31:19.345965 1cdc490d8ce2 INFO #orchagent: :- createUpdateMysidEntry: Nexthop for adjacency fe80::e822:daff:feab:3ee9 doesn't exist in DB yet
Oct 27 08:31:19.345983 1cdc490d8ce2 ERR #orchagent: :- doTaskMySidTable: Failed to create/update my_sid entry for sid 32:16:16:0:fcbb:bbbb:1:fe10::
```
This commit fixes a 10-second startup delay during fast-reboot in dynamic buffer mode.

What I did
Added a check for the fast-reboot done flag m_bufferCompletelyInitialized in checkSharedBufferPoolSize() to avoid early calculation. This can save about 7s.
One calculation is still kept, because the buffer pool needs to be ready before buffer profile creation.
Headroom validation is also skipped in the startup phase. This can save about 2s.

Why I did it
There is a 10-second startup delay during fast-reboot.
This is because in fast-reboot, m_mmuSize is immediately available from STATE_DB (persisted from the previous boot), causing checkSharedBufferPoolSize() to execute expensive Redis operations before buffer system init completes.
This PR introduces OpenTelemetry (OTEL) support for exporting SAI (Switch Abstraction Interface) statistics to observability systems. It implements conversion logic from SAI statistics to OTLP gauge metrics and adds an actor for exporting these metrics.

What I did

New OTEL message types for converting SAI statistics to OpenTelemetry gauge format
Initiated the integration of OpenTelemetry (OTEL) into the HFT components of sonic-swss
OtelActor implementation for receiving SAI stats and exporting to OTEL collectors
Command-line arguments for enabling and configuring OTEL export
Established configuration for exporting metrics and traces to OTEL collector
Why I did it

To improve observability and monitoring of HFT processes within SONiC SWSS.
OpenTelemetry provides standardized and extensible tracing which helps with debugging, performance analysis, and future integrations.
Adjust headroom calculation on SN6600 platform

*Set egress mirror headroom to 0 on SN6600 platform (sonic-net#4005)
What I did
Set flow_reconcile_pending, activate_role_pending, and brainsplit_recover_pending back to false after receiving the appropriate notification. For flow_reconcile_pending and activate_role_pending, this is after controller sends operation approval. For brainsplit, clearing the flag after DPU enters stable state again.

Why I did it
flow_reconcile_pending, activate_role_pending, and brainsplit_recover_pending were not being reset to false after being set true for the first time.
What I did
Enable SAI_TAM_TEL_TYPE_ATTR_SWITCH_ENABLE_OUTPUT_QUEUE_STATS to TAM_TEL_TYPE

Why I did it
This attribute is needed for the HFT to support IPG stats
What I did
Supporting update to peer ip in ha set config.

Why I did it
During the repair process, the peer IP needs to be updated to the new DPU to be paired with.
sonic-platform-daemon side change: sonic-net/sonic-platform-daemons#643

What I did
Support SAI_PORT_SERDES_ATTR_CUSTOM_COLLECTION (i.e. custom serdes attributes represented in JSON based string format)

Use boost::variant for the serdes_attr map value to support both std::vector<uint32_t> and std::string in a type-safe way, with easy extensibility to add more types later.
This PR updates DASH (Disaggregated API for SONiC Hosts) orchagents to use the DPU application database instead of the standard application database for all orchestrator components.

Changed database connection from APPL_DB to DPU_APPL_DB across DASH components
Updated test infrastructure to support DPU application database validation
Modified orchestrator daemon initialization to use DPU-specific database connections
What I did
Add support for platforms based on the Clounix ASIC. The Clounix ASIC platform code has been merged into the sonic-buildimage repo. Please see details in PR: sonic-buildimage clounix PR

How I did it
Add support for platform based on Clounix asic
sonic-net#3982)

What I did
Added cleanup of COUNTERS_*_NAME_MAP entries for a port during its deinit phase, and regenerated the NAME_MAP tables with fresh OIDs during port init when queue flex counters are already enabled.

Why I did it
After dynamic port breakout of a port, queue name map tables in the COUNTERS_DB table are not regenerated leaving stale entries resulting in CLI crash
…Port is removed and created when the Speed is changed dynamically via GCU (sonic-net#3976)

What I did
Fixed the issue reported in sonic-net/sonic-buildimage#24417
Added code to populate the system_port information in the New Port structure in orchagent portsorch after the Port is removed and created when the Port speed is changed via GCU patch.

Why I did it
When the switch is created, swss queries all the SYSTEM_PORTS from SAI and updates the PORT class/structure with the corresponding system_port info after the PortInitDone event is received from portsyncd.
Then the port speed is changed with 4 Lanes via GCU patch, the port is removed from SAI and created again in swss by calling deInitPort and initPort. But in initPort, the system_port info is not updated in the new PORT structure.
So when the RIF is created on local interface, the voqSyncAddIntf adds an entry in SYSTEM_INTERFACE table in CHASSIS_APP_DB with empty key since the system_port info is not populated for the local port. For the same reason, the SYSTEM_NEIGH info is also not updated in CHASSIS_APP_DB. This breaks the basic VOQ functionality
What I did

Support FNIC pipeline changes for DashEniFwd Orch (No match on Tunnel VNI and only match on INNER_DST_MAC)
Update DashEniFwdOrch to handle the updated schema for DPU, VDPU, REMOTE_DPU & DASH_ENI_FORWARD_TABLE
Delete the ACL Table if all the ENI acl rules are deleted
While creating Tunnel NH, also pass VNI.
Update Aclorch to accept Tunnel NH in the following format: endpoint_ip@tunnel_name[,vni][,mac]
Simplify the logic by removing the reference counting logic for Remote NH Tunnel tracking
Added REQ_T_STRING_LIST option to Request parser
Allow Relaxed Attribute parsing to Request parser
This is needed because the Request parser expects a strict schema of field-value pairs.
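
As a rough illustration of the Tunnel NH string format mentioned in the list above (endpoint_ip@tunnel_name[,vni][,mac]), the sketch below splits such a string into its parts. It is not the Aclorch parser. The struct and function names are hypothetical, and the rule used to tell the two optional fields apart (a field containing ':' is taken as the MAC) is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical holder for the parsed fields; vni and mac may be empty.
struct TunnelNextHop
{
    std::string endpointIp;
    std::string tunnelName;
    std::string vni;
    std::string mac;
};

// Parse "endpoint_ip@tunnel_name[,vni][,mac]". Returns false on bad input.
bool parseTunnelNextHop(const std::string& text, TunnelNextHop& out)
{
    const std::size_t at = text.find('@');
    if (at == std::string::npos || at == 0) return false;

    out = TunnelNextHop();
    out.endpointIp = text.substr(0, at);

    // Everything after '@' is a comma-separated list.
    std::stringstream rest(text.substr(at + 1));
    std::string field;
    std::vector<std::string> fields;
    while (std::getline(rest, field, ',')) fields.push_back(field);
    if (fields.empty() || fields[0].empty() || fields.size() > 3) return false;

    out.tunnelName = fields[0];
    for (std::size_t i = 1; i < fields.size(); ++i)
    {
        // Assumption: a field containing ':' is the MAC, anything else the VNI.
        if (fields[i].find(':') != std::string::npos) out.mac = fields[i];
        else out.vni = fields[i];
    }
    return true;
}
```

For example, "10.0.0.1@tun0,1000" would yield an endpoint, a tunnel name, and a VNI, with the MAC left empty.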
Signed-off-by: Sonic Automation <sonicbld@microsoft.com>

@github-advanced-security github-advanced-security bot left a comment


CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
