Add HLD for Orchagent error handling improvements#1698
Add HLD for Orchagent error handling improvements#1698prabhataravind wants to merge 7 commits intosonic-net:masterfrom
Conversation
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
|
Community review recording https://zoom.us/rec/share/BsfH3U7tgvGEZpbnJyWaMPfnkAxf9U_CpcK0h9czhnTX4OWX1Z4rLJ0zxqAfJc4.MERICW4R0gGPpjCA |
|
Please leave comments if you want to be a reviewer of this HLD. Thanks. |
|
@prabhataravind can you please add the code PRs by referring to #806? Thanks. |
|
HLD PR is not merged, no code PR. Move to backlog |
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
|
/azp run |
|
No pipelines are associated with this pull request. |
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
|
/azp run |
|
No pipelines are associated with this pull request. |
|  | ||
|
|
||
| It is to be noted that some combinations in the table above are not valid scenarios like for example: SAI_STATUS_INSUFFICIENT_RESOURCES when removing an object or SAI_STATUS_ITEM_NOT_FOUND when creating an object. They are however mentioned for completeness. | ||
|
|
There was a problem hiding this comment.
Please add a section for Bulk API failure handling.
There was a problem hiding this comment.
Please check for bulk stats API failures
| No new SAI APIs are introduced as part of this functionaility. | ||
|
|
||
| ### Configuration and management | ||
| There are no configuration and management changes introduced as part of this functionality. |
There was a problem hiding this comment.
failed config will be still committed in config-db and exists?
For example, if SAI API returns EINVAL, the invalid config which caused this error, still saved in the config-db and no rollback right? looks like it. just want to confirm.
What I did This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following: handleSaiCreateStatus() / handleSaiSetStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type. Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL. Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiRemoveStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiGetStatus() changes Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls. handleSaiFailure() changes Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly. Mock test changes Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail. Fixed errors in existing portsorch_ut test cases What is not done There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2. Why I did it Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner. How I verified it By adding mock tests to verify that orchagent no longer exits when there are SAI API call failures. Details if related Sample error handling snippet showing eventd log and SAI dump invocation. 2025 Jun 6 17:53:24.448168 sonic ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"10.1.0.32/32","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000023"} already exists 2025 Jun 6 17:53:24.448547 sonic ERR swss#orchagent: :- flush_creating_entries: EntityBulker.flush create entries failed, number of entries to create: 1, status: SAI_STATUS_ITEM_ALREADY_EXISTS 2025 Jun 6 17:53:24.448750 sonic ERR swss#orchagent: :- addRoutePost: Failed to create route 10.1.0.32/32 with next hop(s) 30.1.0.2@PortChannel101 2025 Jun 6 17:53:24.448933 sonic ERR swss#orchagent: :- handleSaiFailure: Encountered failure in create operation, SAI API: SAI_API_ROUTE, status: SAI_STATUS_NOT_EXECUTED 2025 Jun 6 17:53:24.449276 sonic NOTICE syncd#syncd: :- processNotifySyncd: Invoking SAI failure dump 2025 Jun 6 17:53:24.449276 sonic NOTICE swss#orchagent: :- publish: EVENT_PUBLISHED: {"sonic-events-swss:sai-operation-failure":{"api":"SAI_API_ROUTE","operation":"create","status":"SAI_STATUS_NOT_EXECUTED","timestamp":"2025-06-06T17:53:24.447963Z"}} 2025 Jun 6 17:53:24.449309 sonic NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP 2025 Jun 6 17:53:24.465408 sonic NOTICE swss#orchagent: :- sai_redis_notify_syncd: invoked DUMP succeeded Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
What I did This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following: handleSaiCreateStatus() / handleSaiSetStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type. Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL. Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiRemoveStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiGetStatus() changes Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls. handleSaiFailure() changes Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly. Mock test changes Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail. Fixed errors in existing portsorch_ut test cases What is not done There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2. Why I did it Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner. How I verified it By adding mock tests to verify that orchagent no longer exits when there are SAI API call failures. Details if related Sample error handling snippet showing eventd log and SAI dump invocation. 2025 Jun 6 17:53:24.448168 sonic ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"10.1.0.32/32","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000023"} already exists 2025 Jun 6 17:53:24.448547 sonic ERR swss#orchagent: :- flush_creating_entries: EntityBulker.flush create entries failed, number of entries to create: 1, status: SAI_STATUS_ITEM_ALREADY_EXISTS 2025 Jun 6 17:53:24.448750 sonic ERR swss#orchagent: :- addRoutePost: Failed to create route 10.1.0.32/32 with next hop(s) 30.1.0.2@PortChannel101 2025 Jun 6 17:53:24.448933 sonic ERR swss#orchagent: :- handleSaiFailure: Encountered failure in create operation, SAI API: SAI_API_ROUTE, status: SAI_STATUS_NOT_EXECUTED 2025 Jun 6 17:53:24.449276 sonic NOTICE syncd#syncd: :- processNotifySyncd: Invoking SAI failure dump 2025 Jun 6 17:53:24.449276 sonic NOTICE swss#orchagent: :- publish: EVENT_PUBLISHED: {"sonic-events-swss:sai-operation-failure":{"api":"SAI_API_ROUTE","operation":"create","status":"SAI_STATUS_NOT_EXECUTED","timestamp":"2025-06-06T17:53:24.447963Z"}} 2025 Jun 6 17:53:24.449309 sonic NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP 2025 Jun 6 17:53:24.465408 sonic NOTICE swss#orchagent: :- sai_redis_notify_syncd: invoked DUMP succeeded Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
What I did This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following: handleSaiCreateStatus() / handleSaiSetStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type. Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL. Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiRemoveStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiGetStatus() changes Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls. handleSaiFailure() changes Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly. Mock test changes Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail. Fixed errors in existing portsorch_ut test cases What is not done There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2. Why I did it Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner.
[202505]: Orchagent SAI error handling improvements Cherry-pick of master PR #3587 What I did This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following: handleSaiCreateStatus() / handleSaiSetStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type. Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL. Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiRemoveStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiGetStatus() changes Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls. handleSaiFailure() changes Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly. Mock test changes Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail. Fixed errors in existing portsorch_ut test cases What is not done There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2. Why I did it Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner. How I verified it By adding mock tests to verify that orchagent no longer exits when there are SAI API call failures. Details if related Sample error handling snippet showing eventd log and SAI dump invocation. 2025 Jun 6 17:53:24.448168 sonic ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"10.1.0.32/32","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000023"} already exists 2025 Jun 6 17:53:24.448547 sonic ERR swss#orchagent: :- flush_creating_entries: EntityBulker.flush create entries failed, number of entries to create: 1, status: SAI_STATUS_ITEM_ALREADY_EXISTS 2025 Jun 6 17:53:24.448750 sonic ERR swss#orchagent: :- addRoutePost: Failed to create route 10.1.0.32/32 with next hop(s) 30.1.0.2@PortChannel101 2025 Jun 6 17:53:24.448933 sonic ERR swss#orchagent: :- handleSaiFailure: Encountered failure in create operation, SAI API: SAI_API_ROUTE, status: SAI_STATUS_NOT_EXECUTED 2025 Jun 6 17:53:24.449276 sonic NOTICE syncd#syncd: :- processNotifySyncd: Invoking SAI failure dump 2025 Jun 6 17:53:24.449276 sonic NOTICE swss#orchagent: :- publish: EVENT_PUBLISHED: {"sonic-events-swss:sai-operation-failure":{"api":"SAI_API_ROUTE","operation":"create","status":"SAI_STATUS_NOT_EXECUTED","timestamp":"2025-06-06T17:53:24.447963Z"}} 2025 Jun 6 17:53:24.449309 sonic NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP 2025 Jun 6 17:53:24.465408 sonic NOTICE swss#orchagent: :- sai_redis_notify_syncd: invoked DUMP succeeded
What I did This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following: handleSaiCreateStatus() / handleSaiSetStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type. Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL. Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiRemoveStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiGetStatus() changes Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls. handleSaiFailure() changes Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly. Mock test changes Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail. Fixed errors in existing portsorch_ut test cases What is not done There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2. Why I did it Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner.
What I did This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following: handleSaiCreateStatus() / handleSaiSetStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type. Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL. Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiRemoveStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiGetStatus() changes Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls. handleSaiFailure() changes Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly. Mock test changes Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail. Fixed errors in existing portsorch_ut test cases What is not done There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2. Why I did it Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner.
What I did This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following: handleSaiCreateStatus() / handleSaiSetStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type. Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL. Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiRemoveStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiGetStatus() changes Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls. handleSaiFailure() changes Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly. Mock test changes Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail. Fixed errors in existing portsorch_ut test cases What is not done There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2. Why I did it Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner.
What I did This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following: handleSaiCreateStatus() / handleSaiSetStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type. Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL. Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiRemoveStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiGetStatus() changes Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls. handleSaiFailure() changes Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly. Mock test changes Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail. Fixed errors in existing portsorch_ut test cases What is not done There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2. Why I did it Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner.
What I did This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following: handleSaiCreateStatus() / handleSaiSetStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type. Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL. Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiRemoveStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiGetStatus() changes Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls. handleSaiFailure() changes Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly. Mock test changes Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail. Fixed errors in existing portsorch_ut test cases What is not done There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2. Why I did it Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner. Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
What I did This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following: handleSaiCreateStatus() / handleSaiSetStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type. Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL. Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiRemoveStatus() changes Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump. handleSaiGetStatus() changes Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls. handleSaiFailure() changes Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly. Mock test changes Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail. Fixed errors in existing portsorch_ut test cases What is not done There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2. Why I did it Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner. Signed-off-by: Baorong Liu <96146196+baorliu@users.noreply.github.com>
This HLD change attempts to address the following: