Added change not to create ECMP Group in SAI and program the route if none of ECMP members are active/link-up#3394
Conversation
all the members of ECMP groups are inactive. Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
|
@prsunny reminder to review, thanks. |
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
|
@abdosi , i think the test failures are specific to this PR as I see it passing for all others |
|
/azp run Azure.sonic-swss |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
|
/azp run Azure.sonic-swss |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Readding label for cherry-pick |
… none of ECMP members are active/link-up (sonic-net#3394) What I did: Added change not to create ECMP Group in SAI and program the route if none of the ECMP members are active/link-up. Also do not program the Temp Route if Neigh is not active (Link Down) Also as part of this change if Route is not programmed and if we remove that route than decrement VRF Reference count in removeRoute as removeRoutePost will not be called in this case. Why I did: In scale setup of T2 it's possible all links can go down simultaneously which case we can get Route messages with all nexthops being in down state. In such case we might create empty Nexthop Group in SAI for the given route which causes not needed SAI call for Nexthop Group creation and also create traffic blackhole for the route where that route can still forward traffic via default route if eligible/applicable. Also in this case no point to add Temp Route if neighbor is link down.
|
Cherry-pick PR to 202405: #3442 |
… none of ECMP members are active/link-up (#3394) What I did: Added change not to create ECMP Group in SAI and program the route if none of the ECMP members are active/link-up. Also do not program the Temp Route if Neigh is not active (Link Down) Also as part of this change if Route is not programmed and if we remove that route than decrement VRF Reference count in removeRoute as removeRoutePost will not be called in this case. Why I did: In scale setup of T2 it's possible all links can go down simultaneously which case we can get Route messages with all nexthops being in down state. In such case we might create empty Nexthop Group in SAI for the given route which causes not needed SAI call for Nexthop Group creation and also create traffic blackhole for the route where that route can still forward traffic via default route if eligible/applicable. Also in this case no point to add Temp Route if neighbor is link down.
| { | ||
| SWSS_LOG_INFO("Failed to find route entry, vrf_id 0x%" PRIx64 ", prefix %s\n", vrf_id, | ||
| ipPrefix.to_string().c_str()); | ||
| if (it_route_table->second.size() == 0) |
There was a problem hiding this comment.
I suspect it's unsafe to remove m_syncdRoutes[vrf_id] here.
gRouteBulker.creating_entries_count(route_entry) can give unexpected result. it just counts the number of certain prefix that are pending creation.
however, if no such prefix but other prefixes belong to the VRF pending creation, it still returns 0 and the VRF routing table is removed.
After that, once the bulk returns, it can not find the VRF table when handling the routing entries, which causes an exception.
We saw the following uncaught C++ exception
(gdb) bt
#0 0x00007f5791aedebc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f5791a9efb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f5791a89472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f5791de0919 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007f5791debe1a in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007f5791debe85 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007f5791dec0d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007f5791de3240 in std::__throw_out_of_range(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8 0x00005594e856d956 in std::map<unsigned long, std::map<swss::IpPrefix, RouteNhg, std::less<swss::IpPrefix>, std::allocator<std::pair<swss::IpPrefix const, RouteNhg> > >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::map<swss::IpPrefix, RouteNhg, std::less<swss::IpPrefix>, std::allocator<std::pair<swss::IpPrefix const, RouteNhg> > > > > >::at (this=<optimized out>, __k=<optimized out>) at /usr/include/c++/12/bits/stl_map.h:551
#9 0x00005594e8564beb in RouteOrch::addRoutePost (this=this@entry=0x5594ea13e080, ctx=..., nextHops=...) at ./orchagent/routeorch.cpp:2145
#10 0x00005594e856b0b2 in RouteOrch::doTask (this=0x5594ea13e080, consumer=...) at ./orchagent/routeorch.cpp:1021
#11 0x00005594e85282d2 in Orch::doTask (this=0x5594ea13e080) at ./orchagent/orch.cpp:553
#12 0x00005594e851909a in OrchDaemon::start (this=this@entry=0x5594ea0a0950) at ./orchagent/orchdaemon.cpp:895
#13 0x00005594e8485632 in main (argc=<optimized out>, argv=<optimized out>) at ./orchagent/main.cpp:818
(gdb) frame 9
#9 0x00005594e8564beb in RouteOrch::addRoutePost (this=this@entry=0x5594ea13e080, ctx=..., nextHops=...) at ./orchagent/routeorch.cpp:2145
… none of ECMP members are active/link-up (sonic-net#3394) What I did: Added change not to create ECMP Group in SAI and program the route if none of the ECMP members are active/link-up. Also do not program the Temp Route if Neigh is not active (Link Down) Also as part of this change if Route is not programmed and if we remove that route than decrement VRF Reference count in removeRoute as removeRoutePost will not be called in this case. Why I did: In scale setup of T2 it's possible all links can go down simultaneously which case we can get Route messages with all nexthops being in down state. In such case we might create empty Nexthop Group in SAI for the given route which causes not needed SAI call for Nexthop Group creation and also create traffic blackhole for the route where that route can still forward traffic via default route if eligible/applicable. Also in this case no point to add Temp Route if neighbor is link down.
… none of ECMP members are active/link-up (sonic-net#3394) What I did: Added change not to create ECMP Group in SAI and program the route if none of the ECMP members are active/link-up. Also do not program the Temp Route if Neigh is not active (Link Down) Also as part of this change if Route is not programmed and if we remove that route than decrement VRF Reference count in removeRoute as removeRoutePost will not be called in this case. Why I did: In scale setup of T2 it's possible all links can go down simultaneously which case we can get Route messages with all nexthops being in down state. In such case we might create empty Nexthop Group in SAI for the given route which causes not needed SAI call for Nexthop Group creation and also create traffic blackhole for the route where that route can still forward traffic via default route if eligible/applicable. Also in this case no point to add Temp Route if neighbor is link down.
|
Cherry-pick PR to 202411: #3522 |
*What I did: Added Change to Skip Route Programming if NH is link/oper down. With Scale Route testing of 60K+ routes when we toggle all the interfaces[14+ interface back to back] as done here: https://github.com/sonic-net/sonic-mgmt/blob/master/tests/snappi_tests/multidut/bgp/test_bgp_outbound_uplink_multi_po_flap.py we see because of slowness of FRR Route APP_DB processing compare to Link Notification Handling where we have updated the Nexthop Group as part of Link Notification handling to point to default route via #3389 [if eligible] FRR slowness can reprogram the Route back to Nexthop which is link down. This change is similar to #3394 which was done for Nexthop Group.
… none of ECMP members are active/link-up (sonic-net#3394) What I did: Added change not to create ECMP Group in SAI and program the route if none of the ECMP members are active/link-up. Also do not program the Temp Route if Neigh is not active (Link Down) Also as part of this change if Route is not programmed and if we remove that route than decrement VRF Reference count in removeRoute as removeRoutePost will not be called in this case. Why I did: In scale setup of T2 it's possible all links can go down simultaneously which case we can get Route messages with all nexthops being in down state. In such case we might create empty Nexthop Group in SAI for the given route which causes not needed SAI call for Nexthop Group creation and also create traffic blackhole for the route where that route can still forward traffic via default route if eligible/applicable. Also in this case no point to add Temp Route if neighbor is link down.
…-net#3520) *What I did: Added Change to Skip Route Programming if NH is link/oper down. With Scale Route testing of 60K+ routes when we toggle all the interfaces[14+ interface back to back] as done here: https://github.com/sonic-net/sonic-mgmt/blob/master/tests/snappi_tests/multidut/bgp/test_bgp_outbound_uplink_multi_po_flap.py we see because of slowness of FRR Route APP_DB processing compare to Link Notification Handling where we have updated the Nexthop Group as part of Link Notification handling to point to default route via sonic-net#3389 [if eligible] FRR slowness can reprogram the Route back to Nexthop which is link down. This change is similar to sonic-net#3394 which was done for Nexthop Group.
… none of ECMP members are active/link-up (sonic-net#3394) What I did: Added change not to create ECMP Group in SAI and program the route if none of the ECMP members are active/link-up. Also do not program the Temp Route if Neigh is not active (Link Down) Also as part of this change if Route is not programmed and if we remove that route than decrement VRF Reference count in removeRoute as removeRoutePost will not be called in this case. Why I did: In scale setup of T2 it's possible all links can go down simultaneously which case we can get Route messages with all nexthops being in down state. In such case we might create empty Nexthop Group in SAI for the given route which causes not needed SAI call for Nexthop Group creation and also create traffic blackhole for the route where that route can still forward traffic via default route if eligible/applicable. Also in this case no point to add Temp Route if neighbor is link down. Signed-off-by: Baorong Liu <96146196+baorliu@users.noreply.github.com>
…-net#3520) *What I did: Added Change to Skip Route Programming if NH is link/oper down. With Scale Route testing of 60K+ routes when we toggle all the interfaces[14+ interface back to back] as done here: https://github.com/sonic-net/sonic-mgmt/blob/master/tests/snappi_tests/multidut/bgp/test_bgp_outbound_uplink_multi_po_flap.py we see because of slowness of FRR Route APP_DB processing compare to Link Notification Handling where we have updated the Nexthop Group as part of Link Notification handling to point to default route via sonic-net#3389 [if eligible] FRR slowness can reprogram the Route back to Nexthop which is link down. This change is similar to sonic-net#3394 which was done for Nexthop Group. Signed-off-by: Baorong Liu <96146196+baorliu@users.noreply.github.com>
What I did:
Added change not to create ECMP Group in SAI and program the route if none of the ECMP members are active/link-up.
Also do not program the Temp Route if Neigh is not active (Link Down)
Also as part of this change if Route is not programmed and if we remove that route than decrement VRF Reference count in
removeRouteasremoveRoutePostwill not be called in this case.Why I did:
In scale setup of T2 it's possible all links can go down simultaneously which case we can get Route messages with all nexthops being in down state. In such case we might create empty Nexthop Group in SAI for the given route which causes not needed SAI call for Nexthop Group creation and also create traffic blackhole for the route where that route can still forward traffic via default route if eligible/applicable.
Also in this case no point to add Temp Route if neighbor is link down.
How I verify:
Reference to full context of this changes:
Swss_route_enhancemnts.docx