Add HLD for Orchagent error handling improvements #1698

Open
prabhataravind wants to merge 7 commits into sonic-net:master from prabhataravind:master

@prabhataravind
Contributor

This HLD change attempts to address the following:

  • Handle all ASIC/SAI programming errors gracefully without causing orchagent to crash or restart
  • Detect missed notifications from APP_DB to orchagent in SONiC systems that use redis-based communication channels
  • Detect out-of-sync entries between APP_DB and ASIC_DB

@prabhataravind prabhataravind marked this pull request as ready for review June 24, 2024 00:37
@zhangyanzhao
Collaborator

Please leave comments if you want to be a reviewer of this HLD. Thanks.

@zhangyanzhao
Collaborator

@prabhataravind can you please add the code PRs by referring to #806? Thanks.

@zhangyanzhao
Collaborator

HLD PR is not merged, no code PR. Move to backlog

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

![sai status handling](images/sai_status_handling.png)

Note that some combinations in the table above are not valid scenarios, for example SAI_STATUS_INSUFFICIENT_RESOURCES when removing an object or SAI_STATUS_ITEM_NOT_FOUND when creating an object. They are listed for completeness.

Contributor

Please add a section for Bulk API failure handling.

Contributor

Please check for bulk stats API failures

@anilpannala anilpannala moved this from 📋 In Plan Features to MovedToBacklog in SONiC 202505 Release May 30, 2025
No new SAI APIs are introduced as part of this functionality.

### Configuration and management
There are no configuration and management changes introduced as part of this functionality.

Will a failed config still be committed to config-db and remain there?
For example, if a SAI API returns EINVAL, is the invalid config that caused the error still saved in config-db, with no rollback? It looks that way; I just want to confirm.

prabhataravind added a commit to prabhataravind/sonic-swss that referenced this pull request Jun 13, 2025
What I did
This change aims to reduce self-induced orchagent exits when a SAI API call fails (i.e., returns anything other than SAI_STATUS_SUCCESS). It is the first set of changes and does the following:

handleSaiCreateStatus() / handleSaiSetStatus() changes
  • Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type.
  • Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL.
  • Call handleSaiFailure() and return 'task_failed' for other SAI errors. This logs a structured syslog message via eventd and also takes a SAI dump.

handleSaiRemoveStatus() changes
  • Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL.
  • Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE.
  • Call handleSaiFailure() and return 'task_failed' for other SAI errors. This logs a structured syslog message via eventd and also takes a SAI dump.
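
The create/set and remove rules listed above can be summarised in one small classification function. What follows is only a sketch under stated assumptions: `SaiStatus`, `TaskStatus`, `SaiOp` and `classifySaiStatus` are hypothetical stand-ins made up for illustration, not the real `sai_status_t`, `task_process_status` or orchagent handlers, and the real code may be organised quite differently.

```cpp
// Hypothetical stand-ins so the sketch compiles on its own.
// The real types come from the SAI headers and from orchagent.
enum class SaiStatus {
    Success,
    ItemAlreadyExists,
    ItemNotFound,
    AddrNotFound,
    ObjectInUse,
    InsufficientResources,
    TableFull,
    NoMemory,
    NvStorageFull,
    Failure  // stands for "any other SAI error"
};

enum class TaskStatus { Success, NeedRetry, Failed };

enum class SaiOp { Create, Set, Remove };

// Hypothetical helper mirroring the lists in the commit message.
inline TaskStatus classifySaiStatus(SaiOp op, SaiStatus status)
{
    const bool isRemove = (op == SaiOp::Remove);

    switch (status) {
    case SaiStatus::Success:
    case SaiStatus::ItemAlreadyExists:
    case SaiStatus::ItemNotFound:
    case SaiStatus::AddrNotFound:
        // Treated as success for create, set and remove alike.
        return TaskStatus::Success;

    case SaiStatus::ObjectInUse:
        // Success for create/set, retry for remove.
        return isRemove ? TaskStatus::NeedRetry : TaskStatus::Success;

    case SaiStatus::InsufficientResources:
    case SaiStatus::TableFull:
    case SaiStatus::NoMemory:
    case SaiStatus::NvStorageFull:
        // Resource exhaustion: retry for create/set, success for remove.
        return isRemove ? TaskStatus::Success : TaskStatus::NeedRetry;

    default:
        // Per the description, the real handlers call handleSaiFailure()
        // here before returning task_failed.
        return TaskStatus::Failed;
    }
}
```

Under these assumptions, `classifySaiStatus(SaiOp::Remove, SaiStatus::ObjectInUse)` yields `TaskStatus::NeedRetry`, while the same status on a create or set yields `TaskStatus::Success`.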

handleSaiGetStatus() changes
  • Log a NOTICE message and return 'task_failed'. This is similar to what is done today for GET calls.

handleSaiFailure() changes
  • Update handleSaiFailure() to take three arguments: the SAI API, the operation type string and the SAI API return status. These are used to craft a structured syslog error message when a failure happens. All callers of this function are updated accordingly.
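
To illustrate how the three arguments could feed a structured message, here is a minimal sketch that reproduces the text shapes visible in the sample log in this commit message. The struct and both function names are hypothetical; all three values are modelled as plain strings here, whereas the real handleSaiFailure() takes the SAI API, an operation string and the SAI return status, and event publishing via eventd is not modelled at all.

```cpp
#include <string>

// Hypothetical container for the three values passed to the failure handler.
struct SaiFailureInfo {
    std::string api;        // e.g. "SAI_API_ROUTE"
    std::string operation;  // e.g. "create"
    std::string status;     // e.g. "SAI_STATUS_NOT_EXECUTED"
};

// Builds the human-readable line seen in the sample log.
inline std::string formatSaiFailureLog(const SaiFailureInfo& f)
{
    return "Encountered failure in " + f.operation +
           " operation, SAI API: " + f.api +
           ", status: " + f.status;
}

// Builds the JSON body seen after EVENT_PUBLISHED in the sample log.
// Assumes the inputs are plain identifiers, so no JSON escaping is done.
inline std::string formatSaiFailureEvent(const SaiFailureInfo& f,
                                         const std::string& timestamp)
{
    return "{\"sonic-events-swss:sai-operation-failure\":{"
           "\"api\":\"" + f.api + "\","
           "\"operation\":\"" + f.operation + "\","
           "\"status\":\"" + f.status + "\","
           "\"timestamp\":\"" + timestamp + "\"}}";
}
```

Keeping the API, operation and status as separate fields, rather than one free-form string, is what lets a consumer of the event filter on any of them.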

Mock test changes
  • Added new tests for coverage and updated existing tests that make "ASSERT_DEATH" assertions when SAI API calls fail.
  • Fixed errors in existing portsorch_ut test cases.

What is not done
All orchs need changes to handle scenarios where orchagent no longer crashes when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown on SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase 2.

Why I did it
Crashing orchagent on every SAI error is overkill. Instead, we follow the approach called out in the HLD above to handle these errors more gracefully.

How I verified it
By adding mock tests to verify that orchagent no longer exits when SAI API calls fail.

Details if related
Sample error handling snippet showing the eventd log and the SAI dump invocation.

2025 Jun  6 17:53:24.448168 sonic ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"10.1.0.32/32","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000023"} already exists
2025 Jun  6 17:53:24.448547 sonic ERR swss#orchagent: :- flush_creating_entries: EntityBulker.flush create entries failed, number of entries to create: 1, status: SAI_STATUS_ITEM_ALREADY_EXISTS
2025 Jun  6 17:53:24.448750 sonic ERR swss#orchagent: :- addRoutePost: Failed to create route 10.1.0.32/32 with next hop(s) 30.1.0.2@PortChannel101
2025 Jun  6 17:53:24.448933 sonic ERR swss#orchagent: :- handleSaiFailure: Encountered failure in create operation, SAI API: SAI_API_ROUTE, status: SAI_STATUS_NOT_EXECUTED
2025 Jun  6 17:53:24.449276 sonic NOTICE syncd#syncd: :- processNotifySyncd: Invoking SAI failure dump
2025 Jun  6 17:53:24.449276 sonic NOTICE swss#orchagent: :- publish: EVENT_PUBLISHED: {"sonic-events-swss:sai-operation-failure":{"api":"SAI_API_ROUTE","operation":"create","status":"SAI_STATUS_NOT_EXECUTED","timestamp":"2025-06-06T17:53:24.447963Z"}}
2025 Jun  6 17:53:24.449309 sonic NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
2025 Jun  6 17:53:24.465408 sonic NOTICE swss#orchagent: :- sai_redis_notify_syncd: invoked DUMP succeeded
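
For anyone scanning syslog for these events, the JSON body can be pulled out of a line like the `publish: EVENT_PUBLISHED:` entry above with a simple marker search. This is a sketch only: `extractEventPayload` is a hypothetical helper, and it assumes the payload runs from the marker to the end of the line, as it does in the sample.

```cpp
#include <string>

// Returns the text following the "EVENT_PUBLISHED: " marker,
// or an empty string when the line does not contain the marker.
inline std::string extractEventPayload(const std::string& syslogLine)
{
    static const std::string marker = "EVENT_PUBLISHED: ";
    const std::string::size_type pos = syslogLine.find(marker);
    if (pos == std::string::npos) {
        return std::string();
    }
    return syslogLine.substr(pos + marker.size());
}
```

The returned string can then be handed to whatever JSON parser the surrounding tooling already uses.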

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
prabhataravind added a commit to prabhataravind/sonic-swss that referenced this pull request Jun 17, 2025
prsunny pushed a commit to sonic-net/sonic-swss that referenced this pull request Jun 17, 2025
yejianquan added a commit to sonic-net/sonic-swss that referenced this pull request Jun 18, 2025
[202505]: Orchagent SAI error handling improvements

Cherry-pick of master PR #3587

divyagayathri-hcl pushed a commit to divyagayathri-hcl/sonic-swss that referenced this pull request Jun 22, 2025
divyagayathri-hcl pushed a commit to divyagayathri-hcl/sonic-swss that referenced this pull request Jun 23, 2025
Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025
balanokia pushed a commit to balanokia/sonic-swss that referenced this pull request Nov 17, 2025
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Feb 4, 2026
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
baorliu pushed a commit to baorliu/sonic-swss that referenced this pull request Feb 23, 2026
Signed-off-by: Baorong Liu <96146196+baorliu@users.noreply.github.com>
@prabhataravind prabhataravind moved this from MovedToBacklog to ✅ Done in SONiC 202505 Release Mar 9, 2026