
[BGP Scale] Increase NHG Member Downtime Timeout#20843

Closed
ccroy-arista wants to merge 1 commit intosonic-net:masterfrom
ccroy-arista:fix-bgp-scale-nhg-member-counters-downtime

Conversation

@ccroy-arista
Contributor

Description of PR

Increase the downtime timeout for the nexthop group member scale test.

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202412
  • 202505

Approach

What is the motivation for this PR?

The nexthop group member scale test fails the counters downtime check at the end when the timeout is set to 30 seconds. We observed that it can take around 80 seconds for the counters to stabilize.

How did you do it?

Increased MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE from 30 seconds to 120 seconds.

How did you verify/test it?

Ran the test against the t0-isolated-d2u510s2 topology and confirmed that it now passes.

Any platform specific information?

Tested on Arista-7060X6-64PE-B-C512S2.

Increase the downtime timeout for the nexthop group member scale test.
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS
Collaborator

hi @r12f, could you help take a look? Is it acceptable to increase this timeout from 30 seconds to 120?

 MAX_DOWNTIME_ONE_PORT_FLAPPING = 30  # seconds
 MAX_DOWNTIME_UNISOLATION = 300  # seconds
-MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE = 30  # seconds
+MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE = 120  # seconds
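A constant like this typically bounds a polling loop that waits for counters to settle. A minimal sketch of how such a timeout could gate a stabilization check (hypothetical helper names, not the actual sonic-mgmt code):

```python
import time

MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE = 120  # seconds (raised from 30)

def wait_until_stable(read_counters, timeout, interval=5):
    """Poll read_counters() until two consecutive reads match,
    or give up once `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    previous = read_counters()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = read_counters()
        if current == previous:
            return True   # counters stabilized within the timeout
        previous = current
    return False  # counters still changing after `timeout` seconds
```

With a 30-second bound and counters that take ~80 seconds to settle, a loop like this would return False and fail the check, which matches the symptom described in this PR.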
@r12f
Collaborator

hi Chris, the downtime is estimated from the number of dropped packets and the TX PPS. Do you mind helping check why so many packets are dropped in your case? This looks weird.
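The estimation described above (downtime derived from the dropped-packet count and the transmit rate) can be sketched as follows; the function name and signature are hypothetical, not the actual test helper:

```python
def estimate_downtime_seconds(dropped_packets, tx_pps):
    """Estimate dataplane downtime: if traffic is sent at tx_pps
    packets per second, losing `dropped_packets` packets implies
    roughly dropped_packets / tx_pps seconds of outage."""
    if tx_pps <= 0:
        raise ValueError("tx_pps must be positive")
    return dropped_packets / tx_pps
```

Under this model, an observed ~80 seconds of downtime at, say, 1000 PPS would correspond to roughly 80,000 dropped packets, which is why the reviewer asks where the drops come from rather than simply raising the bound.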

@r12f r12f requested a review from PriyanshTratiya November 9, 2025 18:56
@r12f
Collaborator

r12f commented Nov 9, 2025

+ @PriyanshTratiya here for viz and review.

@r12f r12f self-requested a review November 9, 2025 18:58
@r12f
Collaborator

r12f commented Nov 9, 2025

Resetting my approval until we get the packet drop reason from @ccroy-arista.

Contributor

@PriyanshTratiya left a comment

Thanks for this PR. I believe we can keep MAX_DOWNTIME_NEXTHOP_GROUP_MEMBER_CHANGE at its original 30s. The high dataplane downtime seen during the nexthop group member scale test is being addressed directly in the newly proposed PR #21939, which fixes the nexthop-related test behavior that was inflating the measured downtime.

With that fix in place, the calculated dataplane downtime should drop back to a level that fits within the existing 30s bound.

@ccroy-arista
Contributor Author

Closing this PR, as the downtime has been increased separately here: #22081
In light of those changes, tests need to be re-run and results re-evaluated (against 202511 branch now).

@ccroy-arista ccroy-arista deleted the fix-bgp-scale-nhg-member-counters-downtime branch February 19, 2026 21:58