[mux]: Implement rollback for failed mux switchovers#2714
[mux]: Implement rollback for failed mux switchovers#2714theasianpianist merged 15 commits intosonic-net:masterfrom
Conversation
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
- Make all switchover-related SAI operations idempotent (create and remove for ACL, next hop, neighbor, and route) - Implement rollback when a switchover fails Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
| // Check at the start of the function means we will never reach here | ||
| SWSS_LOG_ERROR("[%s] Rollback to %s not supported", mux_name_.c_str(), | ||
| muxStateValToString.at(prev_state_).c_str()); | ||
| return; |
There was a problem hiding this comment.
This will leave st_chg_in_progress_ as true, right?
There was a problem hiding this comment.
Yes, but we will never get here since the check on the first line of this function will return early for FAILED and PENDING states
| void bindAllPorts(AclTable &acl_table); | ||
|
|
||
| // class shared dict: ACL table name -> ACL table | ||
| static std::map<std::string, AclTable> acl_table_; |
There was a problem hiding this comment.
why is this removed? i think its changing the overall logic.
There was a problem hiding this comment.
based on my understanding, this was used to cache if the ACL table had already been created in hardware. I changed the behavior to delegate this check to AclOrch instead, since the static scope of this map was preventing the mock_tests from passing. The MuxAclHandler now always asks AclOrch to check if the table already exists, instead of storing that info in MuxAclHandler.
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
c78cc21 to
c28207c
Compare
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
| status = sai_acl_api->create_acl_entry(&m_ruleOid, gSwitchId, (uint32_t)rule_attrs.size(), rule_attrs.data()); | ||
| if (status != SAI_STATUS_SUCCESS) | ||
| { | ||
| if (status == SAI_STATUS_ITEM_ALREADY_EXISTS) |
There was a problem hiding this comment.
@theasianpianist, is this change required for mux idempotency/rollback? @bingwang-ms , can you please review?
There was a problem hiding this comment.
If the ACL entry already exists, the SAI API will return this status, when we rollback we might hit this scenario so want to continue normally if the entry does already exist.
orchagent/neighorch.cpp
Outdated
| { | ||
| /* When next hop is not found, we continue to remove neighbor entry. */ | ||
| if (status == SAI_STATUS_ITEM_NOT_FOUND) | ||
| if (status == SAI_STATUS_ITEM_NOT_FOUND || status == SAI_STATUS_INVALID_PARAMETER) |
There was a problem hiding this comment.
This seems to be changing existing flow. Are we getting SAI_STATUS_INVALID_PARAMETER here for non-existent entries?
There was a problem hiding this comment.
orchagent/aclorch.cpp
Outdated
| auto status = sai_acl_api->remove_acl_entry(m_ruleOid); | ||
| if (status != SAI_STATUS_SUCCESS) | ||
| { | ||
| if (status == SAI_STATUS_ITEM_NOT_FOUND || status == SAI_STATUS_INVALID_PARAMETER) |
There was a problem hiding this comment.
Let's get the sairedis merged and revisit this section
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Need to fix build failure |
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
|
@theasianpianist can you create separate PR for 202205 branch? |
- Make all SAI API operations needed for switchover idempotent - Implement rollback when a switchover fails Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
PR here: #2761 |
- Make all SAI API operations needed for switchover idempotent - Implement rollback when a switchover fails Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
- Make all SAI API operations needed for switchover idempotent - Implement rollback when a switchover fails Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Depends on sonic-net/sonic-sairedis#1224 being merged first
What I did
Why I did it
How I verified it
Details if related