[BGP Scale] Increase NHG Member Downtime Timeout #20843
ccroy-arista wants to merge 1 commit into sonic-net:master
Increase the downtime timeout for the nexthop group member scale test.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Hi @r12f, could you take a look? Is it acceptable to increase the timeout from 30 seconds to 120 seconds?
 MAX_DOWNTIME_ONE_PORT_FLAPPING = 30  # seconds
 MAX_DOWNTIME_UNISOLATION = 300  # seconds
-MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE = 30  # seconds
+MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE = 120  # seconds
Hi Chris, the downtime is estimated from the number of dropped packets and the TX PPS. Would you mind checking why so many packets are dropped in your case? This looks odd.
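For readers following the thread, the estimate described above can be sketched roughly as dropped packets divided by the transmit rate. This is only an illustration under stated assumptions: the function name `estimate_downtime_seconds`, its parameters, and the packet counts are hypothetical and are not taken from the test code or from any measured run.

```python
# Hypothetical sketch: estimate dataplane downtime from dropped packets and TX PPS.
# Names and numbers below are illustrative assumptions, not the actual test helper.

MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE = 30  # seconds, the bound before this PR


def estimate_downtime_seconds(tx_packets: int, rx_packets: int, tx_pps: float) -> float:
    """Return dropped packets divided by the transmit rate (packets per second)."""
    if tx_pps <= 0:
        raise ValueError("tx_pps must be positive")
    # Clamp at zero in case the receive counter is ahead of the transmit counter.
    dropped = max(tx_packets - rx_packets, 0)
    return dropped / tx_pps


# Made-up counters: 800,000 packets dropped while sending 10,000 packets per second.
downtime = estimate_downtime_seconds(tx_packets=1_000_000, rx_packets=200_000, tx_pps=10_000)
within_bound = downtime <= MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE
print(f"estimated downtime: {downtime:.1f}s, within bound: {within_bound}")
```

Under this model, a large estimated downtime means either many packets were really dropped or the counters were read in a way that overstates the drops, which is why the drop reason matters before raising the bound.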
+ @PriyanshTratiya here for viz and review.
Resetting my approval until we get the packet drop reason from @ccroy-arista.
Thanks for this PR. I believe we can keep MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE at its original 30s. The high dataplane downtime seen during the nexthop group member scale test is being addressed directly in the newly proposed PR #21939, which fixes the nexthop-related test behavior that was inflating the measured downtime.
With that fix in place, the calculated dataplane downtime should drop back to a level that fits within the existing 30s bound.
Closing this PR, as the downtime has been increased separately here: #22081
Description of PR
Increase the downtime timeout for the nexthop group member scale test.
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
The nexthop group member scale test fails the counters downtime check at the end with the timeout set at 30 seconds. We observed that it can take around 80 seconds for the counters to stabilize.
How did you do it?
Increased MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE from 30 seconds to 120 seconds.
How did you verify/test it?
Ran the test against the t0-isolated-d2u510s2 topology and confirmed that it now passes.
Any platform specific information?
Tested on Arista-7060X6-64PE-B-C512S2.