
[warm-reboot] Fix the issue where BGP info was not being extracted from neigh logs#974

Merged
neethajohn merged 5 commits intosonic-net:masterfrom
neethajohn:bgp_neigh_time
Jun 28, 2019

Conversation

@neethajohn
Contributor

@neethajohn neethajohn commented Jun 24, 2019

Description of PR

In warm reboot case, portchannel up/down event might not be recorded in the Arista logs all the time. With the check we had in place, no neigh logs were being printed. With the changes in this PR, we should see BGP info being printed in the neigh logs all the time and portchannel info if it exists in the Arista logs.

Type of change

  • [x] Bug fix
  • [ ] Testbed and Framework (new/improvement)
  • [ ] Test case (new/improvement)

Approach

How did you do it?

Added a reboot_type variable to determine whether the portchannel info can be treated as optional while parsing logs

How did you verify/test it?

Ran warm-reboot and fast-reboot tests on a T0-64 topology on 7260 and verified that logs are printed in both cases
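The gist of the fix can be sketched as follows. This is a simplified illustration with a hypothetical helper name, not the actual sonic-mgmt code; in the real script a timestamp of -1 means the event was not found in the logs.

```python
# Hypothetical sketch of the reboot_type-dependent log check; not the
# actual sonic-mgmt implementation. A timestamp of -1 means the event
# was not found in the logs.

def have_required_events(reboot_type, initial_time_bgp, initial_time_if):
    """Return True if the logs contain the events required for this reboot type."""
    if reboot_type == 'fast-reboot':
        # fast-reboot: both BGP and portchannel events must be present
        return initial_time_bgp != -1 and initial_time_if != -1
    if reboot_type == 'warm-reboot':
        # warm-reboot: BGP info must be present, portchannel events are optional
        return initial_time_bgp != -1
    return False
```

With this shape, warm-reboot runs still yield BGP neighbor info even when the Arista logs carry no portchannel up/down events.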

Contributor

@pavel-shirshov pavel-shirshov left a comment

Please check my comments

Contributor

I think we need a comment here to explain what the expression means semantically.

Contributor Author

put in some comments

Contributor

For WR we expect to see no flaps of any interfaces.

Contributor Author

@qiluo-msft mentioned that on Mellanox platforms I might see a diff of 1s for portchannel up/down. I looked at a couple of logs and saw the following behavior

  1. On Broadcom platforms, I might see the logs at times with the same timestamp for up/down, or not see them at all.
  2. On Mellanox, same as above. At times, I see logs with a diff of 1s between up and down.

Since we already have check_change_time(), which checks the Po status, it was decided to ignore the Po logs.

Contributor

@qiluo-msft qiluo-msft Jun 26, 2019

@yxieca Should we treat it as a bug?

on Broadcom platforms, I might see the logs at times with the same timestamp for up/down #Closed

Contributor

what if both v4 and v6 started later than we expected?

Contributor Author

Based on data collected from multiple platforms and discussions with @yxieca, there will be an additional check that I will be adding later. That will ensure that for each peer, the down-to-established time is not more than 165 sec. The plan now is to watch the peer down-to-established times in the logs for a couple of weeks to see if that is the right number to check against.
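The deferred per-peer check described above could look roughly like this; a sketch with hypothetical names, where 165 sec is the tentative threshold still being validated (this check was not merged in this PR):

```python
# Sketch of a per-peer down-to-established time check (hypothetical;
# deferred from this PR while the threshold is validated against logs).
MAX_DOWN_TO_ESTABLISHED_SEC = 165  # tentative limit under observation

def find_slow_peers(peer_events, limit=MAX_DOWN_TO_ESTABLISHED_SEC):
    """peer_events maps peer IP -> (down_ts, established_ts), in seconds.

    Return the peers whose down-to-established time exceeded the limit.
    """
    return [peer for peer, (down_ts, established_ts) in peer_events.items()
            if established_ts - down_ts > limit]
```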

Contributor

Each platform has its own CPU speed, so 165 sec could work for some platforms but might fail for others.
Also, I don't think we need to add the bgp test here.
If we want a bgp test, we can add it here: https://github.com/Azure/sonic-buildimage/tree/master/platform/vs/tests/bgp

Contributor Author

Removed the bgp time check; only fixing the log error in this PR

Contributor

why? magic constant

Contributor Author

In parse_logs(), we have the statement `result['route_timeout'] = result_rt` before returning result when we don't find any bgp or portchannel info in the logs. log_data will always have at least 1 key.

Contributor

How do we know we have the correct item when len(log_data) is bigger than 1?
Please check for particular keys.
Your check could stop working with another change in self.parse_logs()
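The concern can be illustrated like this; the key names below are hypothetical stand-ins for whatever keys the caller actually consumes from parse_logs():

```python
# Fragile: depends on how many keys parse_logs() happens to emit.
def has_useful_data_fragile(log_data):
    return len(log_data) > 1

# Robust: check for the particular keys the caller consumes, so the
# check survives unrelated changes to parse_logs(). The key names
# here are hypothetical.
def has_useful_data_robust(log_data):
    return any(key in log_data for key in ('bgp_neigh_time', 'portchannel_time'))
```

If parse_logs() later starts emitting an extra bookkeeping key, the len()-based check silently changes meaning, while the key-based check does not.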

Contributor

@qiluo-msft qiluo-msft Jun 26, 2019

@neethajohn If

log_data will have at least 1 key always

you may return a tuple, and the first element is result_rt #Closed

Contributor Author

Improved the non-empty check

Contributor

why? another magic constant

Contributor Author

same reason as above. I will put some comments before the check

Contributor

The same here

Contributor Author

fixed this

Contributor

@qiluo-msft qiluo-msft Jun 26, 2019

between v4 and v6 peer

Could you explain "between v4 and v6 peer" and why the time is useful? #Closed

Contributor Author

no longer having this check. only addressing the log issue in this PR

Contributor

@qiluo-msft qiluo-msft Jun 26, 2019

<

'<=' if you mean 'not more than' #Closed

Contributor Author

same comment as above

@neethajohn neethajohn changed the title [warm-reboot] Compare v4 and v6 peer establishment times and fix logging issue [warm-reboot] Fix the issue where BGP info was not being extracted from neigh logs Jun 27, 2019
```python
log_lines = log_output.split("\r\n")[1:-1]
log_data = self.parse_logs(log_lines)
if len(log_data) != 0:
    if (self.reboot_type == 'fast-reboot' and\
```
Contributor

\

Add one more blank before '\', or align them across multiple lines

Contributor Author

added space

Contributor

@qiluo-msft qiluo-msft left a comment

Only one minor comment left.

```python
# for fast-reboot, we expect to have both the bgp and portchannel events in the logs.
# for warm-reboot, portchannel events might not be present in the logs all the time.
if self.reboot_type == 'fast-reboot' and (initial_time_bgp == -1 or initial_time_if == -1):
    return result
elif self.reboot_type == 'warm-reboot' and (initial_time_bgp == -1 and initial_time_if == -1):
```
Contributor

Why is it an issue for WR if both bgp and LAG show no messages?
If just bgp shows nothing, or LAG shows nothing, is that alright?

Contributor Author

For WR, bgp info will always show up in the logs because of the time diff between the idle and established messages. So I expect it to be there all the time. So the 'bgp shows nothing' should always be false. But portchannel messages are seen sometimes with same timestamp for up to down or a diff of 1s (Mellanox).

Contributor

Why do we need anything which is constant in the formula in this case?

Contributor Author

I didn't get your point. Are you referring to the -1? That value is returned by extract_from_logs() if the regex match fails.
I couldn't figure out any other way to do it. Let me know if you have any suggestions.
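One alternative to the -1 sentinel (a sketch, not what the PR does) is returning None on a failed match, which makes the "not found" case explicit rather than a magic constant:

```python
import re

# Hypothetical extract helper returning None instead of a -1 sentinel
# when the pattern is absent; the real extract_from_logs() in this
# script returns -1 on a failed regex match.
def extract_timestamp(lines, pattern):
    regex = re.compile(pattern)
    for line in lines:
        match = regex.search(line)
        if match:
            return float(match.group(1))
    return None  # explicit "not found"
```

Callers then write `if ts is None: return result`, which cannot collide with any real timestamp value.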

Contributor

Correct me if I'm wrong:

From your words:

For WR, bgp info will always show up in the logs because of the time diff between the idle and established messages.

That means that initial_time_bgp == -1 will never be True?

Contributor Author

In the ideal case, it will never be true. If it is true, then I don't have the bgp info in the logs, and I need to return because I can't continue with any more bgp-related checks.

Contributor Author

As discussed, removed the portchannel check for warm-reboot case

@neethajohn neethajohn merged commit 9dd1fb7 into sonic-net:master Jun 28, 2019
@neethajohn neethajohn deleted the bgp_neigh_time branch June 28, 2019 22:13
yxieca pushed a commit that referenced this pull request Jul 10, 2019
fraserg-arista pushed a commit to fraserg-arista/sonic-mgmt that referenced this pull request Feb 24, 2026
### Description of PR

Summary:
Fixes # (issue)
This PR adds test cases and supporting utilities to measure **route programming time** under high‑scale BGP IPv6 scenarios, building on the refactoring in [PR sonic-net#21335](sonic-net#21335). It focuses on quantifying how long it takes for routes to be fully programmed after BGP/connection events (e.g., convergence, admin flaps), and on verifying that the measured route programming times stay within expected limits and remain comparable to the convergence time.

Key Points:

- Introduces new tests that:
  - Trigger BGP route updates (via scale, convergence, and flap scenarios), and
  - Measure end‑to‑end route programming time from control‑plane event to data‑plane readiness.
- Adds helper logic to:
  - Capture timestamps and event counts around BGP state/route changes and data‑plane verification,
  - Collect and aggregate per‑iteration route programming statistics,
  - Expose pass/fail criteria based on configured thresholds.
- Reuses the generalized connection and flap mechanisms introduced in PR sonic-net#21335 to generate realistic route programming events at scale.
- Enhances logging so that per‑iteration programming times are visible for debugging and trend analysis.
- Keeps tests parameter‑driven to allow different scales, iteration counts, and thresholds without duplicating logic.

### Type of change


- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
 - [ ] Skipped for non-supported platforms
- [x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
After adding flexible convergence and flap coverage, we also need **quantitative** visibility into route programming performance. Specifically:

- How long does it take for routes to be fully programmed after:
  - Initial BGP convergence under scale,
  - BGP admin flaps or port flaps,
  - Large‑scale route updates?
- How many NextHopGroup and Route events are generated?
- Do these programming times remain within agreed targets across iterations and scales?
- Can we detect regressions or platform‑specific performance issues in route programming latency?

#### How did you do it?

- Added new route‑programming‑time test flows (in `tests/bgp/test_ipv6_bgp_scale.py`) that:
  - Use existing mechanisms to:
    - Establish BGP sessions and advertise a high scale of IPv6 routes,
    - Trigger events (e.g., convergence, admin flap cycles) using the refactored connection APIs.
  - Around each event, record:
    - The time when the control‑plane action is taken (e.g., admin shut/no‑shut, route announcement),
    - The time when verification confirms that routes are fully programmed and forwarding is working.
  - Compute route programming time as the delta between these timestamps.
  - Count the NextHopGroup and Route events.
- Implemented helper functions to:
  - Configure thresholds for acceptable route programming time,
  - Aggregate and report per‑iteration statistics,
  - Mark the test as failed if the measured times exceed configured limits.
- Reused logging and verification utilities from the refactored BGP test infrastructure so that:
  - The tests share as much code as possible with the convergence and flap tests,
  - Route programming time measurements can be correlated with existing convergence logs.
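The measurement flow above boils down to a timestamp delta per event plus an aggregate threshold check; a minimal sketch with hypothetical names, not the actual test code:

```python
import time

# Hypothetical sketch of the measurement and pass/fail aggregation;
# the real tests use sonic-mgmt fixtures and verification utilities.
def measure_route_programming_time(trigger_event, verify_programmed):
    """Seconds from control-plane action to data-plane readiness."""
    start = time.monotonic()   # control-plane action timestamp
    trigger_event()            # e.g., admin no-shut or route announcement
    verify_programmed()        # poll until routes are fully programmed
    return time.monotonic() - start

def summarize_iterations(samples, threshold_sec):
    """Aggregate per-iteration samples and apply the pass/fail criterion."""
    return {
        'avg': sum(samples) / len(samples),
        'max': max(samples),
        'passed': max(samples) <= threshold_sec,
    }
```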

#### How did you verify/test it?

- Executed the new route programming time tests in the same environment used for the high‑bgp and flap tests:
 - Topology: `t0-isolated-d2u510s2`
 - Platform: Broadcom Arista-7060X6-64PE-B-C512S2
- Verified that:
 - Route programming times are collected and logged for each iteration,
 - Measured values are stable and within expected thresholds under high scale,
 - No unexpected failures or long‑tail outliers appear during repeated runs.

#### Any platform specific information?

- Verified on Broadcom Arista-7060X6-64PE-B-C512S2.
- The test logic itself is platform‑agnostic, but measured route programming times are naturally platform and scale dependent.

#### Supported testbed topology if it's a new test case?

### Documentation
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
5331ecd [vslib]: Fix MACsec bug in SCI and XPN (sonic-net#1003)
ac04509 Fix build issues on gcc-10 (sonic-net#999)
1b8ce97 [pipeline] Download swss common artifact in a separated directory (sonic-net#995)
7a2e096 Change sonic-buildimage.vs artifact source from CI build to official build. (sonic-net#992)
d5866a3 [vslib]: fix create MACsec SA error (sonic-net#986)
f36f7ce Added Support for enum query capability of Nexthop Group Type. (sonic-net#989)
323b89b Support for MACsec statistics (sonic-net#892)
26a8a12 Prevent other notification event storms to keep enqueue unchecked and drained all memory that leads to crashing the switch router (sonic-net#968)
0cb253a Fix object availability conversion (sonic-net#974)
