Skip to content

Commit a181b5d

Browse files
Orchagent SAI error handling improvements
What I did This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following: handleSaiCreateStatus() / handleSaiSetStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type. Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL. Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiRemoveStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiGetStatus() changes Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls. handleSaiFailure() changes Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly. Mock test changes Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail. Fixed errors in existing portsorch_ut test cases What is not done There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2. Why I did it Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner. How I verified it By adding mock tests to verify that orchagent no longer exits when there are SAI API call failures. Details if related Sample error handling snippet showing eventd log and SAI dump invocation. 2025 Jun 6 17:53:24.448168 sonic ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"10.1.0.32/32","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000023"} already exists 2025 Jun 6 17:53:24.448547 sonic ERR swss#orchagent: :- flush_creating_entries: EntityBulker.flush create entries failed, number of entries to create: 1, status: SAI_STATUS_ITEM_ALREADY_EXISTS 2025 Jun 6 17:53:24.448750 sonic ERR swss#orchagent: :- addRoutePost: Failed to create route 10.1.0.32/32 with next hop(s) 30.1.0.2@PortChannel101 2025 Jun 6 17:53:24.448933 sonic ERR swss#orchagent: :- handleSaiFailure: Encountered failure in create operation, SAI API: SAI_API_ROUTE, status: SAI_STATUS_NOT_EXECUTED 2025 Jun 6 17:53:24.449276 sonic NOTICE syncd#syncd: :- processNotifySyncd: Invoking SAI failure dump 2025 Jun 6 17:53:24.449276 sonic NOTICE swss#orchagent: :- publish: EVENT_PUBLISHED: {"sonic-events-swss:sai-operation-failure":{"api":"SAI_API_ROUTE","operation":"create","status":"SAI_STATUS_NOT_EXECUTED","timestamp":"2025-06-06T17:53:24.447963Z"}} 2025 Jun 6 17:53:24.449309 sonic NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP 2025 Jun 6 17:53:24.465408 sonic NOTICE swss#orchagent: :- sai_redis_notify_syncd: invoked DUMP succeeded Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
1 parent 50048e7 commit a181b5d

8 files changed

Lines changed: 543 additions & 270 deletions

File tree

orchagent/main.cpp

Lines changed: 10 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -127,7 +127,8 @@ void syncd_apply_view()
127127
if (status != SAI_STATUS_SUCCESS)
128128
{
129129
SWSS_LOG_ERROR("Failed to notify syncd APPLY_VIEW %d", status);
130-
handleSaiFailure(true);
130+
handleSaiFailure(SAI_API_SWITCH, "set", status);
131+
return;
131132
}
132133
}
133134

@@ -701,7 +702,8 @@ int main(int argc, char **argv)
701702
if (status != SAI_STATUS_SUCCESS)
702703
{
703704
SWSS_LOG_ERROR("Failed to create a switch, rv:%d", status);
704-
handleSaiFailure(true);
705+
handleSaiFailure(SAI_API_SWITCH, "create", status);
706+
return EXIT_FAILURE;
705707
}
706708
SWSS_LOG_NOTICE("Create a switch, id:%" PRIu64, gSwitchId);
707709

@@ -732,7 +734,8 @@ int main(int argc, char **argv)
732734
if (status != SAI_STATUS_SUCCESS)
733735
{
734736
SWSS_LOG_ERROR("Failed to get MAC address from switch, rv:%d", status);
735-
handleSaiFailure(true);
737+
handleSaiFailure(SAI_API_SWITCH, "get", status);
738+
return EXIT_FAILURE;
736739
}
737740
else
738741
{
@@ -747,7 +750,8 @@ int main(int argc, char **argv)
747750
if (status != SAI_STATUS_SUCCESS)
748751
{
749752
SWSS_LOG_ERROR("Fail to get switch virtual router ID %d", status);
750-
handleSaiFailure(true);
753+
handleSaiFailure(SAI_API_SWITCH, "get", status);
754+
return EXIT_FAILURE;
751755
}
752756

753757
gVirtualRouterId = attr.value.oid;
@@ -789,7 +793,8 @@ int main(int argc, char **argv)
789793
if (status != SAI_STATUS_SUCCESS)
790794
{
791795
SWSS_LOG_ERROR("Failed to create underlay router interface %d", status);
792-
handleSaiFailure(true);
796+
handleSaiFailure(SAI_API_ROUTER_INTERFACE, "create", status);
797+
return EXIT_FAILURE;
793798
}
794799

795800
SWSS_LOG_NOTICE("Created underlay router interface ID %" PRIx64, gUnderlayIfId);

orchagent/orchdaemon.cpp

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -836,7 +836,7 @@ void OrchDaemon::flush()
836836
if (status != SAI_STATUS_SUCCESS)
837837
{
838838
SWSS_LOG_ERROR("Failed to flush redis pipeline %d", status);
839-
handleSaiFailure(true);
839+
handleSaiFailure(SAI_API_SWITCH, "set", status);
840840
}
841841

842842
for (auto* orch: m_orchList)

0 commit comments

Comments
 (0)