Skip to content

Add failure notification for orchagent#1665

Merged
shi-su merged 9 commits intosonic-net:masterfrom
shi-su:on_par
Mar 31, 2021
Merged

Add failure notification for orchagent#1665
shi-su merged 9 commits intosonic-net:masterfrom
shi-su:on_par

Conversation

@shi-su
Copy link
Contributor

@shi-su shi-su commented Mar 8, 2021

What I did
Add SAI failure handling functions in aclorch, bufferorch, copporch, dtelorch, fdborch, fgnhgorch, intfsorch, mirrororch, natorch, policerorch, macsecorch, portsorch, qosorch, sfloworch, switchorch, tunneldecaporch, vrforch.

Why I did it
Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.

How I verified it

Details if related

@lgtm-com
Copy link

lgtm-com bot commented Mar 8, 2021

This pull request fixes 2 alerts when merging 5de44a6 into 03a0e21 - view on LGTM.com

fixed alerts:

  • 2 for FIXME comment

@lgtm-com
Copy link

lgtm-com bot commented Mar 10, 2021

This pull request fixes 2 alerts when merging eaed258 into 9e30abb - view on LGTM.com

fixed alerts:

  • 2 for FIXME comment

@shi-su shi-su changed the base branch from master to bmtor March 12, 2021 18:54
@shi-su shi-su changed the base branch from bmtor to master March 12, 2021 18:54
@lgtm-com
Copy link

lgtm-com bot commented Mar 12, 2021

This pull request fixes 2 alerts when merging 625dfbb into 1951365 - view on LGTM.com

fixed alerts:

  • 2 for FIXME comment

@shi-su shi-su changed the title Add failure notification for multiple orch Add failure notification for orchagent Mar 15, 2021
@shi-su shi-su requested a review from qiluo-msft March 15, 2021 16:53
if (status != SAI_STATUS_SUCCESS)
{
SWSS_LOG_ERROR("Failed to create table %s", flowWLTable.description.c_str());
handleSaiCreateStatus(SAI_API_ACL, status);
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleSaiCreateStatus [](start = 8, length = 21)

handleSaiCreateStatus [](start = 8, length = 21)

Check the return value? If successfully handled, you treat it as success? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few SAI calls that do not take any action in the case of SAI failures. For these ones, since our goal is to make it on-par with async mode at this stage, I am not checking the return value for now so that it has the same behavior as before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's assume this function magically retry and fix the issue, so the return value is task_success. You should return SAI_STATUS_SUCCESS after this function.


In reply to: 595609272 [](ancestors = 595609272)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, updated the logic to check if the return value is task_success, and proceed when it is.

{
SWSS_LOG_ERROR("Failed to create default host interface table, rv:%d", status);
throw "CoppOrch initialization failure";
handleSaiCreateStatus(SAI_API_HOSTIF, status);
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleSaiCreateStatus [](start = 8, length = 21)

handleSaiCreateStatus [](start = 8, length = 21)

Do you want to keep old behavior even with your new handle function? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the old behavior (async mode), this throw should not even be called since syncd should already trigger an swss restart before it reaches this throw. To mimic the old behavior close enough, I think a method could be running this handling function before this throw and keep this throw regardless of the return value of the handling function. I updated the code as such.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need to check return value and conditional throw.


In reply to: 595610233 [](ancestors = 595610233)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the logic to check return value before throwing.

if (sai_status != SAI_STATUS_SUCCESS)
{
SWSS_LOG_ERROR("Failed to apply attribute:%d to trap group:%" PRIx64 ", name:%s, error:%d\n", trap_gr_attr.id, m_trap_group_map[trap_group_name], trap_group_name.c_str(), sai_status);
handleSaiSetStatus(SAI_API_HOSTIF, sai_status);
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleSaiSetStatus [](start = 20, length = 18)

handleSaiSetStatus [](start = 20, length = 18)

Check the return value? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few SAI calls that do not take any action in the case of SAI failures. For these ones, since our goal is to make it on-par with async mode at this stage, I am not checking the return value for now so that it has the same behavior as before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check for the return value of the handling function and return only when it is not task_success

}
else
{
it++;
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it++; [](start = 24, length = 5)

I think you change the behavior. Old code will always erase(it) no matter success or not. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the handling function always exits orchagent to achieve on-par handling with async mode. So this change will not be activated at the moment and the behavior should be the same as async mode. Once we have a better logic for the failure handling functions, we can fully control whether it should retry or not.

SWSS_LOG_ERROR("Failed to create %s FDB %s in %s on %s, rv:%d",
fdbData.type.c_str(), entry.mac.to_string().c_str(),
vlan.m_alias.c_str(), port_name.c_str(), status);
return false; //FIXME: it should be based on status. Some could be retried, some not
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIXME [](start = 28, length = 5)

Did you truly fixed the FIXME? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not, added the FIXME back.

{
SWSS_LOG_ERROR("FdbOrch RemoveFDBEntry: Failed to remove FDB entry. mac=%s, bv_id=0x%" PRIx64,
entry.mac.to_string().c_str(), entry.bv_id);
return true; //FIXME: it should be based on status. Some could be retried. some not
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIXME [](start = 23, length = 5)

The same. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the FIXME back.

SWSS_LOG_ERROR("Failed to create router interface %s, rv:%d",
port.m_alias.c_str(), status);
throw runtime_error("Failed to create router interface.");
return handleSaiCreateStatus(SAI_API_ROUTER_INTERFACE, status);
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleSaiCreateStatus [](start = 15, length = 21)

handleSaiCreateStatus [](start = 15, length = 21)

To exactly mimic old behavior, you can derive the handleSaiCreateStatus function in this class, and throw inside the function. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the old behavior (async mode), this throw should not even be called since syncd should already trigger an swss restart before it reaches this throw. To mimic the old behavior close enough, I think a method could be running this handling function before this throw and keep this throw regardless of the return value of the handling function. I updated the code as such.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check for the return value of the handling function and throw exception only when it did not successfully handle the failure.

@shi-su shi-su requested a review from qiluo-msft March 24, 2021 17:25
shi-su added a commit to shi-su/sonic-swss that referenced this pull request Aug 17, 2021
What I did
Add SAI failure handling functions in aclorch, bufferorch, copporch, dtelorch, fdborch, fgnhgorch, intfsorch, mirrororch, natorch, policerorch, macsecorch, portsorch, qosorch, sfloworch, switchorch, tunneldecaporch, vrforch.

Why I did it
Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
shi-su added a commit to shi-su/sonic-swss that referenced this pull request Aug 17, 2021
What I did
Add SAI failure handling functions in aclorch, bufferorch, copporch, dtelorch, fdborch, fgnhgorch, intfsorch, mirrororch, natorch, policerorch, macsecorch, portsorch, qosorch, sfloworch, switchorch, tunneldecaporch, vrforch.

Why I did it
Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
shi-su added a commit that referenced this pull request Aug 18, 2021
What I did
Backport SAI failure handling related commits into the 202012 branch. The following is a list of backported commits:

941875a Deactivate mirror session only when session status is true in updateLagMember (#1666)
be12482 Ignore ALREADY_EXIST error in FDB creation (#1815)
c9c1aa2 Add failure handling for SAI get operations (#1768)
47b4276 [BufferOrch] Don't call SAI API for BUFFER_POOL/PROFILE handling in case the op is DEL and the SAI OID is NULL (#1786)
db9238f Add failure notification for orchagent (#1665)
fc8e43f [synchronous mode] Add failure notification for SAI failures in synchronous mode (#1596)

Why I did it
202012 image needs to include failure handling mechanism for enough notification in the presence of SAI failures.
raphaelt-nvidia pushed a commit to raphaelt-nvidia/sonic-swss that referenced this pull request Oct 5, 2021
What I did
Add SAI failure handling functions in aclorch, bufferorch, copporch, dtelorch, fdborch, fgnhgorch, intfsorch, mirrororch, natorch, policerorch, macsecorch, portsorch, qosorch, sfloworch, switchorch, tunneldecaporch, vrforch.

Why I did it
Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025
What I did
Add SAI failure handling functions in aclorch, bufferorch, copporch, dtelorch, fdborch, fgnhgorch, intfsorch, mirrororch, natorch, policerorch, macsecorch, portsorch, qosorch, sfloworch, switchorch, tunneldecaporch, vrforch.

Why I did it
Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants