Skip to content

Orchagent SAI error handling improvements#3587

Merged
prsunny merged 11 commits intosonic-net:masterfrom
prabhataravind:paravind/oa_crash_handling
Jun 17, 2025
Merged

Orchagent SAI error handling improvements#3587
prsunny merged 11 commits intosonic-net:masterfrom
prabhataravind:paravind/oa_crash_handling

Conversation

@prabhataravind
Copy link
Contributor

@prabhataravind prabhataravind commented Apr 1, 2025

What I did
This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following:

  1. handleSaiCreateStatus() / handleSaiSetStatus() changes
  • Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type.
  • Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and
    SAI_STATUS_NV_STORAGE_FULL.
  • Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump.
  1. handleSaiRemoveStatus() changes
  • Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL
  • Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE
  • Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump.
  1. handleSaiGetStatus() changes
  • Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls.
  1. handleSaiFailure() changes
  • Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly.
  1. Mock test changes
  • Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail.
  • Fixed errors in existing portsorch_ut test cases

What is not done
There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2.
 
Why I did it
Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner.

How I verified it
By adding mock tests to verify that orchagent no longer exits when there are SAI API call failures.

Details if related
Sample error handling snippet showing eventd log and SAI dump invocation.

2025 Jun  6 17:53:24.448168 sonic ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"10.1.0.32/32","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000023"} already exists
2025 Jun  6 17:53:24.448547 sonic ERR swss#orchagent: :- flush_creating_entries: EntityBulker.flush create entries failed, number of entries to create: 1, status: SAI_STATUS_ITEM_ALREADY_EXISTS
2025 Jun  6 17:53:24.448750 sonic ERR swss#orchagent: :- addRoutePost: Failed to create route 10.1.0.32/32 with next hop(s) 30.1.0.2@PortChannel101
2025 Jun  6 17:53:24.448933 sonic ERR swss#orchagent: :- handleSaiFailure: Encountered failure in create operation, SAI API: SAI_API_ROUTE, status: SAI_STATUS_NOT_EXECUTED
2025 Jun  6 17:53:24.449276 sonic NOTICE syncd#syncd: :- processNotifySyncd: Invoking SAI failure dump
2025 Jun  6 17:53:24.449276 sonic NOTICE swss#orchagent: :- publish: EVENT_PUBLISHED: {"sonic-events-swss:sai-operation-failure":{"api":"SAI_API_ROUTE","operation":"create","status":"SAI_STATUS_NOT_EXECUTED","timestamp":"2025-06-06T17:53:24.447963Z"}}
2025 Jun  6 17:53:24.449309 sonic NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
2025 Jun  6 17:53:24.465408 sonic NOTICE swss#orchagent: :- sai_redis_notify_syncd: invoked DUMP succeeded

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prabhataravind prabhataravind force-pushed the paravind/oa_crash_handling branch from ab327ec to f9c4c00 Compare April 1, 2025 23:26
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prabhataravind prabhataravind force-pushed the paravind/oa_crash_handling branch from f9c4c00 to 6c90a6c Compare April 2, 2025 00:05
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prabhataravind prabhataravind force-pushed the paravind/oa_crash_handling branch from 6c90a6c to e367f46 Compare April 2, 2025 01:09
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prabhataravind prabhataravind force-pushed the paravind/oa_crash_handling branch from e367f46 to 30c96c8 Compare April 2, 2025 01:44
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prabhataravind prabhataravind force-pushed the paravind/oa_crash_handling branch from 30c96c8 to c81034c Compare April 2, 2025 03:42
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prabhataravind prabhataravind force-pushed the paravind/oa_crash_handling branch from c81034c to 7fc9304 Compare April 2, 2025 17:34
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prabhataravind prabhataravind force-pushed the paravind/oa_crash_handling branch from 7fc9304 to 762abd7 Compare April 2, 2025 19:43
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prabhataravind prabhataravind force-pushed the paravind/oa_crash_handling branch from 271558f to 724d58a Compare April 5, 2025 02:07
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prabhataravind prabhataravind force-pushed the paravind/oa_crash_handling branch from 724d58a to a87e3b0 Compare April 5, 2025 17:40
@mssonicbld
Copy link
Collaborator

/azp run

prabhataravind added a commit to prabhataravind/sonic-mgmt that referenced this pull request Jun 20, 2025
 * Following changes in sonic-net/sonic-swss#3587,
test_duplicate_route.py needs to be updated accordingly.

* Skip the test temporarily until swss submodule update is complete to avoid
  failures due to a circular dependency.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
prabhataravind added a commit to prabhataravind/sonic-mgmt that referenced this pull request Jun 20, 2025
 * Following changes in sonic-net/sonic-swss#3587,
test_duplicate_route.py needs to be updated accordingly.

* Skip the test temporarily until swss submodule update is complete to avoid
  failures due to a circular dependency.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
yejianquan pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Jun 20, 2025
…es (#19104)

* Following changes in sonic-net/sonic-swss#3587,
test_duplicate_route.py needs to be updated accordingly.

* Skip the test temporarily until swss submodule update is complete to avoid
  failures due to a circular dependency.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
divyagayathri-hcl pushed a commit to divyagayathri-hcl/sonic-swss that referenced this pull request Jun 22, 2025
What I did
This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following:

handleSaiCreateStatus() / handleSaiSetStatus() changes
Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type.
Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and
SAI_STATUS_NV_STORAGE_FULL.
Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump.
handleSaiRemoveStatus() changes
Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL
Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE
Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump.
handleSaiGetStatus() changes
Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls.
handleSaiFailure() changes
Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly.
Mock test changes
Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail.
Fixed errors in existing portsorch_ut test cases
What is not done
There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2.
 
Why I did it
Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner.
divyagayathri-hcl pushed a commit to divyagayathri-hcl/sonic-swss that referenced this pull request Jun 23, 2025
What I did
This change aims to reduce self-induced orchagent exit when any SAI API call fails (i.e returns anything other than SAI_STATUS_SUCCESS). This change is the first set of changes that does the following:

handleSaiCreateStatus() / handleSaiSetStatus() changes
Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND and SAI_STATUS_OBJECT_IN_USE irrespective of the object type.
Return 'task_need_retry' for SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY and
SAI_STATUS_NV_STORAGE_FULL.
Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump.
handleSaiRemoveStatus() changes
Return 'task_success' for SAI_STATUS_ITEM_ALREADY_EXISTS, SAI_STATUS_ITEM_NOT_FOUND, SAI_STATUS_ADDR_NOT_FOUND, SAI_STATUS_INSUFFICIENT_RESOURCES, SAI_STATUS_TABLE_FULL, SAI_STATUS_NO_MEMORY, SAI_STATUS_NV_STORAGE_FULL
Return 'task_need_retry' for SAI_STATUS_OBJECT_IN_USE
Call handleSaiFailure() and return 'task_failed' for other SAI errors. This will log a structured syslog via eventd and also take a SAI dump.
handleSaiGetStatus() changes
Log a NOTICE message and return task_failed. This is similar to what is being done today for GET calls.
handleSaiFailure() changes
Update handleSaiFailure() to take 3 arguments - namely the SAI API, operation type string and SAI API return status. This will be used in crafting a structured syslog error message when the failure happens. All callers of this function are updated accordingly.
Mock test changes
Added new tests for coverage and updated existing tests that do "ASSERT_DEATH" assertions when SAI API calls fail.
Fixed errors in existing portsorch_ut test cases
What is not done
There are changes needed to all orchs to handle scenarios where orchagent doesn't crash anymore when SAI API calls fail. There are also places in different orchs where an explicit exception is thrown in case of SAI errors. These and the remaining items in sonic-net/SONiC#1698 will be handled in phase-2.
 
Why I did it
Crashing orchagent on every SAI error is an overkill. Instead, we follow the approach that is called out in the HLD above to handle these errors in a more graceful manner.
@vrajeshe
Copy link

vrajeshe commented Jul 4, 2025

@prabhataravind can you please backport this to 202411 branch too ?

@deekshabhandary13
Copy link

@yejianquan can you please add the label for merging this commit to 202411 branch too?

@yejianquan
Copy link

@yejianquan can you please add the label for merging this commit to 202411 branch too?

@kperumalbfn (release manager of 202411) for 202411 requests

@prsunny
Copy link
Collaborator

prsunny commented Jul 14, 2025

@deekshabhandary13 , 202411 is almost frozen. This is available from 202505.

@Pterosaur
Copy link
Contributor

Hi @prabhataravind , Due to the conflict, could you please create a PR for 202412?
https://github.com/Azure/sonic-swss.msft/tree/202412

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.