Orchagent SAI error handling improvements #3587
Merged
prsunny merged 11 commits into sonic-net:master on Jun 17, 2025
Collaborator
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
prabhataravind added a commit to prabhataravind/sonic-mgmt that referenced this pull request
Jun 20, 2025
* Following changes in sonic-net/sonic-swss#3587, test_duplicate_route.py needs to be updated accordingly.
* Skip the test temporarily until swss submodule update is complete to avoid failures due to a circular dependency.
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
yejianquan pushed a commit to sonic-net/sonic-mgmt that referenced this pull request
Jun 20, 2025
…es (#19104)
* Following changes in sonic-net/sonic-swss#3587, test_duplicate_route.py needs to be updated accordingly.
* Skip the test temporarily until swss submodule update is complete to avoid failures due to a circular dependency.
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
divyagayathri-hcl pushed a commit to divyagayathri-hcl/sonic-swss that referenced this pull request
Jun 22, 2025
What I did
This change aims to reduce self-induced orchagent exits when a SAI API call fails (i.e., returns anything other than SAI_STATUS_SUCCESS). It is the first set of changes and does the following:
handleSaiCreateStatus() / handleSaiSetStatus() changes
* Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type.
* Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL.
* Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump.
handleSaiRemoveStatus() changes
* Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL.
* Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE.
* Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump.
handleSaiGetStatus() changes
* Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls.
handleSaiFailure() changes
* Update handleSaiFailure() to take 3 arguments, namely the SAI API, the operation type string and the SAI API return status. These are used to craft a structured syslog error message when the failure happens. All callers of this function are updated accordingly.
Mock test changes
* Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail.
* Fixed errors in existing portsorch_ut test cases.
What is not done
All orchs need changes to handle scenarios where orchagent no longer crashes when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2.
Why I did it
Crashing orchagent on every SAI error is overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner.
@prabhataravind can you please backport this to the 202411 branch too?
@yejianquan can you please add the label for merging this commit to the 202411 branch too?
@kperumalbfn (release manager of 202411) for 202411 requests
Collaborator
@deekshabhandary13, 202411 is almost frozen. This is available from 202505.
nissampa pushed a commit to nissampa/sonic-mgmt_dpu_test that referenced this pull request
Aug 7, 2025
…es (sonic-net#19102)
* Following changes in sonic-net/sonic-swss#3587, test_duplicate_route.py needs to be updated accordingly.
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
Contributor
Hi @prabhataravind, due to the conflict, could you please create a PR for 202412?
This was referenced Dec 9, 2025
What I did
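This change aims to reduce self-induced orchagent exits when a SAI API call fails (i.e., returns anything other than SAI_STATUS_SUCCESS). It is the first set of changes and does the following:
handleSaiCreateStatus() / handleSaiSetStatus() changes
* Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type.
* Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL.
* Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump.
handleSaiRemoveStatus() changes
* Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and SAI_STATUS_NV_STORAGE_FULL.
* Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE.
* Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump.
handleSaiGetStatus() changes
* Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls.
handleSaiFailure() changes
* Update handleSaiFailure() to take 3 arguments, namely the SAI API, the operation type string and the SAI API return status. These are used to craft a structured syslog error message when the failure happens. All callers of this function are updated accordingly.
Mock test changes
* Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail.
* Fixed errors in existing portsorch_ut test cases.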
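The status-to-task mapping this PR describes for create/set and remove calls can be sketched roughly as below. This is an illustration only, not the merged code: the enums are local stand-ins (the real sai_status_t comes from the SAI headers and has different numeric values), the function names handleCreateOrSetStatus and handleRemoveStatus are hypothetical, and the call to handleSaiFailure() is reduced to a comment.

```cpp
#include <cassert>

namespace sketch {

// Local stand-ins so the example compiles on its own. The names mirror the
// statuses listed in this PR; the values are NOT the real SAI values.
enum sai_status_t {
    SAI_STATUS_SUCCESS,
    SAI_STATUS_FAILURE,  // used here to represent "any other error"
    SAI_STATUS_ITEM_ALREADY_EXISTS,
    SAI_STATUS_ITEM_NOT_FOUND,
    SAI_STATUS_ADDR_NOT_FOUND,
    SAI_STATUS_OBJECT_IN_USE,
    SAI_STATUS_INSUFFICIENT_RESOURCES,
    SAI_STATUS_TABLE_FULL,
    SAI_STATUS_NO_MEMORY,
    SAI_STATUS_NV_STORAGE_FULL
};

// Reduced stand-in for orchagent's task status enum.
enum task_process_status { task_success, task_failed, task_need_retry };

// "Already exists" / "not found" style results.
inline bool isDuplicateOrMissing(sai_status_t s) {
    return s == SAI_STATUS_ITEM_ALREADY_EXISTS ||
           s == SAI_STATUS_ITEM_NOT_FOUND ||
           s == SAI_STATUS_ADDR_NOT_FOUND;
}

// Resource exhaustion results.
inline bool isResourceExhaustion(sai_status_t s) {
    return s == SAI_STATUS_INSUFFICIENT_RESOURCES ||
           s == SAI_STATUS_TABLE_FULL ||
           s == SAI_STATUS_NO_MEMORY ||
           s == SAI_STATUS_NV_STORAGE_FULL;
}

// Create/set: exhaustion is retried, OBJECT_IN_USE is treated as success.
inline task_process_status handleCreateOrSetStatus(sai_status_t s) {
    if (s == SAI_STATUS_SUCCESS) return task_success;
    if (isDuplicateOrMissing(s) || s == SAI_STATUS_OBJECT_IN_USE) return task_success;
    if (isResourceExhaustion(s)) return task_need_retry;
    // The PR calls handleSaiFailure() here before returning.
    return task_failed;
}

// Remove: exhaustion is treated as success, OBJECT_IN_USE is retried.
inline task_process_status handleRemoveStatus(sai_status_t s) {
    if (s == SAI_STATUS_SUCCESS) return task_success;
    if (isDuplicateOrMissing(s) || isResourceExhaustion(s)) return task_success;
    if (s == SAI_STATUS_OBJECT_IN_USE) return task_need_retry;
    // The PR calls handleSaiFailure() here before returning.
    return task_failed;
}

// Usage example: the same status maps differently for create and remove.
inline void exampleChecks() {
    assert(handleCreateOrSetStatus(SAI_STATUS_TABLE_FULL) == task_need_retry);
    assert(handleRemoveStatus(SAI_STATUS_TABLE_FULL) == task_success);
    assert(handleCreateOrSetStatus(SAI_STATUS_OBJECT_IN_USE) == task_success);
    assert(handleRemoveStatus(SAI_STATUS_OBJECT_IN_USE) == task_need_retry);
}

}  // namespace sketch
```

Note the asymmetry: OBJECT_IN_USE is retried only on remove, and resource exhaustion is retried only on create/set.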
What is not done
There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2.
Why I did it
Crashing orchagent on every SAI error is overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner.
How I verified it
By adding mock tests to verify that orchagent no longer exits when there are SAI API call failures.
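The shape of such a test might look like the sketch below, which uses plain assert rather than the repository's actual test framework. Everything in it is hypothetical: FakeRouteApi stands in for a mocked SAI API table, addRoute stands in for an orch code path, and the status mapping is reduced to success versus failure.

```cpp
#include <cassert>

namespace mock_sketch {

// Reduced local stand-ins; not the real SAI or orchagent definitions.
enum sai_status_t { SAI_STATUS_SUCCESS, SAI_STATUS_FAILURE };
enum task_process_status { task_success, task_failed };

// Fake SAI API whose return status is injected by the test.
struct FakeRouteApi {
    sai_status_t next_status = SAI_STATUS_SUCCESS;
    int calls = 0;

    sai_status_t create_route_entry() {
        ++calls;
        return next_status;
    }
};

// Code under test: translate the SAI status into a task status
// instead of terminating the process.
inline task_process_status addRoute(FakeRouteApi& api) {
    sai_status_t status = api.create_route_entry();
    return status == SAI_STATUS_SUCCESS ? task_success : task_failed;
}

// Where an older test would expect the process to die, this one expects
// a normal return carrying task_failed.
inline void testCreateFailureDoesNotExit() {
    FakeRouteApi api;
    api.next_status = SAI_STATUS_FAILURE;

    task_process_status result = addRoute(api);

    // Reaching these assertions means the call returned normally.
    assert(result == task_failed);
    assert(api.calls == 1);
}

}  // namespace mock_sketch
```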
Details if related
Sample error handling snippet showing eventd log and SAI dump invocation.
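As a rough illustration of a failure handler that takes the SAI API, the operation type string and the return status, consider the sketch below. It is not the merged implementation: the message format is invented for this example, and the eventd publication and SAI dump are represented by injected callbacks (FailureSinks is a hypothetical type) because their real interfaces are not shown in this PR text.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sinks standing in for the eventd log and the SAI dump.
struct FailureSinks {
    std::function<void(const std::string&)> publishEvent;
    std::function<void()> takeSaiDump;
};

// Build one structured line from the three arguments.
// The key=value layout is an assumption made for this example.
inline std::string formatSaiFailure(const std::string& api,
                                    const std::string& operation,
                                    int status) {
    return "SAI failure: api=" + api +
           " operation=" + operation +
           " status=" + std::to_string(status);
}

// Report the failure and return to the caller, which is then expected
// to return task_failed rather than exit.
inline void handleSaiFailureSketch(const std::string& api,
                                   const std::string& operation,
                                   int status,
                                   const FailureSinks& sinks) {
    assert(sinks.publishEvent && sinks.takeSaiDump);
    sinks.publishEvent(formatSaiFailure(api, operation, status));
    sinks.takeSaiDump();
}
```

Passing the sinks in as callbacks keeps the sketch free of any real logging or dump dependency and makes it straightforward to check that each failure produces exactly one event and one dump.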