Skip to content

Fix the issue of SAI next-hop member removal error when ports that accept routes that are eligible for default route swap are toggle multiple times#23

Merged
arlakshm merged 2 commits intoAzure:202405from
abdosi:202405
Jan 29, 2025
Merged

Fix the issue of SAI next-hop member removal error when ports that accept routes that are eligible for default route swap are toggle multiple times#23
arlakshm merged 2 commits intoAzure:202405from
abdosi:202405

Conversation

@abdosi
Copy link
Copy Markdown

@abdosi abdosi commented Jan 20, 2025

What I did:
Fix the issue of SAI next-hop member removal error when ports that accept routes that are eligible for default route swap are toggle multiple times

Sequence of flow:

  1. Port P1,P2 P3 accept routes R that are eligible for default Route D nexthop port P4
  2. R point to Nexthop Group (N) that contains P1, P2 P3
  3. P1, P2,P3 link goes down.
  4. R point to Nexthop Group (N) that now contains P4 and P1,P2,P3 are removed only from SAI
  5. P1,P2,P3 link comes up. We do not add back P1,P2,P3 to N again as accept R to get deleted
  6. Now P1,P2,P3 link goes down again. Since we have already remove P1,P2,P3 we get SAI error as we try to remove P1,P2,P3 again.

How I fix:
When scenario 6 happens check if Route is already pointing to Defafult Route Nexthop than we don;t need to delete that nexthop.

How I verify:
This scenario got uncovered running: https://github.com/sonic-net/sonic-mgmt/blob/master/tests/platform_tests/link_flap/test_cont_link_flap.py#L109.

After fix issue is not seen

accept routes that are eligible for default route swap are toggle
multiple times

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
@abdosi abdosi requested a review from prsunny as a code owner January 20, 2025 03:46
@abdosi abdosi requested a review from arlakshm January 20, 2025 03:49
@abdosi
Copy link
Copy Markdown
Author

abdosi commented Jan 20, 2025

@yejianquan for viz.

@yejianquan
Copy link
Copy Markdown

yejianquan commented Jan 20, 2025

nice catch @abdosi !
Do we need to create the fix to master branch also?

@arlakshm
Copy link
Copy Markdown

arlakshm commented Jan 21, 2025

@abdosi

  1. P1,P2,P3 link comes up. We do not add back P1,P2,P3 to N again as accept R to get deleted
    Now P1,P2,P3 link goes down again.
  2. Since we have already remove P1,P2,P3 we get SAI error as we try to remove P1,P2,P3 again.

For points 5 and 6 to occur, the link comes up before the control plane responds to the link down event. In the logs, do we observe that Orchagent gets a message to delete R after links P1, P2,P3 go down?

Since there are multiple Route with same nexthop group we might get message to delete R with [P1,P2,P3] but R' can still point to [P1,P2,P3]

Do the SAI errors seen in 6 cause Orchagent crash?

@arlakshm : Yes it can cause OA crash.

Is it possible to add UT for this case?

@arlakshm UT added

@prsunny
Copy link
Copy Markdown

prsunny commented Jan 22, 2025

nice catch @abdosi ! Do we need to create the fix to master branch also?

I think master PR is not merged yet. It can be updated to add this fix. Also @abdosi , can you share a link to master PR?

@prsunny : master pr: sonic-net/sonic-swss#3389

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
@arlakshm arlakshm merged commit 463df23 into Azure:202405 Jan 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants