Add FRR failed route check in route_check.py#4119
Add FRR failed route check in route_check.py#4119StormLiangMS merged 6 commits intosonic-net:masterfrom
Conversation
Signed-off-by: Zhaohui Sun <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull Request Overview
This PR adds detection and reporting of failed FRR routes in addition to the existing missed routes tracking. The change enhances route checking by identifying routes with a "failed" state in FRR and reporting them separately.
- Tracks failed FRR routes alongside missed routes
- Updates function signature to return both missed and failed routes
- Adds separate error logging for failed routes
Comments suppressed due to low confidence (1)
scripts/route_check.py:706
- Surplus named argument for string format. An argument named 'indent' is provided, but it is not required by format "FRR failed routes: {}".
print_message(syslog.LOG_DEBUG, "FRR failed routes: {}".format(failed_rt, indent=4))
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
scripts/route_check.py
Outdated
| if not missed_rt: | ||
| if entry.get('failed', False): | ||
| failed_rt.append(route_prefix) | ||
| if not missed_rt or not failed_rt: |
There was a problem hiding this comment.
The loop termination condition uses or instead of and, causing the retry loop to exit prematurely if either list is empty. This means if there are failed routes but no missed routes (or vice versa), the function will stop retrying immediately. The condition should be if not missed_rt and not failed_rt: to only exit when both lists are empty.
| if not missed_rt or not failed_rt: | |
| if not missed_rt and not failed_rt: |
scripts/route_check.py
Outdated
| print_message(syslog.LOG_DEBUG, "FRR missed routes: {}".format(missed_rt, indent=4)) | ||
| return missed_rt | ||
| print_message(syslog.LOG_DEBUG, "FRR failed routes: {}".format(failed_rt, indent=4)) |
There was a problem hiding this comment.
The str.format() method doesn't accept an indent parameter. This appears to be attempting to use json.dumps() formatting style but is calling it on a plain format string. Either remove the indent=4 parameter or use json.dumps(missed_rt, indent=4) and json.dumps(failed_rt, indent=4) to properly format the output.
scripts/route_check.py
Outdated
| if is_suppress_fib_pending_enabled(namespace): | ||
| mitigate_installed_not_offloaded_frr_routes(namespace, rt_frr_miss, rt_appl) | ||
| if rt_frr_failed: | ||
| print_message(syslog.LOG_ERR, "Some routes have failed state in FRR {}: {}".format(namespace, rt_frr_failed)) |
There was a problem hiding this comment.
[nitpick] The error message format is inconsistent with line 908-909 which uses 'in FRR{}' (no space before the curly brace). This message uses 'in FRR {}' (with a space). For consistency, consider matching the format used in the similar error message above.
| print_message(syslog.LOG_ERR, "Some routes have failed state in FRR {}: {}".format(namespace, rt_frr_failed)) | |
| print_message(syslog.LOG_ERR, "Some routes have failed state in FRR{}: {}".format(namespace, rt_frr_failed)) |
Signed-off-by: Zhaohui Sun <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Zhaohui Sun <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Zhaohui Sun <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Zhaohui Sun <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Zhaohui Sun <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
What I did
It supports to detect the route with offload False or without offload, which can capture queued route, because queued route doesn’t have offload, but for rejected route, the offload is True. It can’t detect the rejected route.
So, we add a new detection check for the value of key failed for route entries, which can cover both rejected and queued routes.
It will help to detect rejected route and queued route on device.
How I did it
Append failed route prefix into failed list, if it's not empty, script will print error message into syslog
# Check for failed state
if entry.get('failed', False):
failed_rt.append(route_prefix)
How to verify it
For rejected route:
The output of route_check.py
Some routes have failed state in FRR : ['0.0.0.0/0', '192.168.128.0/25', '192.168.128.128/25', '192.168.136.0/25', '192.168.136.128/25', '192.168.144.0/25', '192.168.144.128/25', '192.168.152.0/25', '192.168.152.128/25', '192.168.160.0/25', '192.168.160.128/25', '192.168.168.0/25', '192.168.168.128/25', '192.168.176.0/25', '192.168.176.128/25', '192.168.184.0/25', '192.168.184.128/25', '192.168.192.0/25', '192.168.192.128/25', '192.168.200.0/25', '192.168.200.128/25', '192.168.208.0/25', '192.168.208.128/25', '192.168.216.0/25', '192.168.216.128/25', '192.168.224.0/25', '192.168.224.128/25', '192.168.232.0/25', '192.168.232.128/25', '192.168.240.0/25', '192.168.240.128/25', '192.168.248.0/25', '192.168.248.128/25', '192.169.0.0/25', '192.169.0.128/25', '192.169.104.0/25', '192.169.104.128/25', '192.169.112.0/25', '192.169.112.128/25', '192.169.120.0/25', '192.169.120.128/25', '192.169.128.0/25', '192.169.128.128/25', '192.169.136.0/25', '192.169.136.128/25', '192.169.144.0/25', '192.16
Failure results: {{
"": {
"failed_FRR_routes": [
"0.0.0.0/0",
"192.168.128.0/25",
"192.168.128.128/25",
"192.168.136.0/25",
"192.168.136.128/25",
"192.168.144.0/25",
"192.168.144.128/25",
"192.168.152.0/25",
"192.168.152.128/25",
"192.168.160.0/25",
"192.168.160.128/25",
"192.168.168.0/25",
"192.168.168.128/25",
"192.168.176.0/25",
"192.168.176.128/25",
"192.168.184.0/25",
"192.168.184.128/25",
"192.168.192.0/25",
"192.168.192.128/25",
"192.168.200.0/25",
"192.168.200.128/25",
"192.168.208.0/25",
"192.168.208.128/25",
"192.168.216.0/25",
"192.168.216.128/25",
"192.168.224.0/25",
"192.168.224.128/25",
"192.168.232.0/25",
"192.168.232
Failed. Look at reported mismatches above
For rejected route:
the output of route_check.py
Some routes have failed state in FRR : ['0.0.0.0/0', '192.168.128.0/25', '192.168.128.128/25', '192.168.136.0/25', '192.168.136.128/25', '192.168.144.0/25', '192.168.144.128/25', '192.168.152.0/25', '192.168.152.128/25', '192.168.160.0/25', '192.168.160.128/25', '192.168.168.0/25', '192.168.168.128/25', '192.168.176.0/25', '192.168.176.128/25', '192.168.184.0/25', '192.168.184.128/25', '192.168.192.0/25', '192.168.192.128/25', '192.168.200.0/25', '192.168.200.128/25', '192.168.208.0/25', '192.168.208.128/25', '192.168.216.0/25', '192.168.216.128/25', '192.168.224.0/25', '192.168.224.128/25', '192.168.232.0/25', '192.168.232.128/25', '192.168.240.0/25', '192.168.240.128/25', '192.168.248.0/25', '192.168.248.128/25', '192.169.0.0/25', '192.169.0.128/25', '192.169.104.0/25', '192.169.104.128/25', '192.169.112.0/25', '192.169.112.128/25', '192.169.120.0/25', '192.169.120.128/25', '192.169.128.0/25', '192.169.128.128/25', '192.169.136.0/25', '192.169.136.128/25', '192.169.144.0/25', '192.16
Failure results: {{
"": {
"missed_FRR_routes": [
{
"destSelected": true,
"distance": 20,
"failed": true,
"installedNexthopGroupId": 39146,
"internalFlags": 8,
"internalNextHopActiveNum": 4,
"internalNextHopNum": 4,
"internalStatus": 168,
"metric": 0,
"nexthopGroupId": 39146,
"nexthops": [
{
"active": true,
"afi": "ipv4",
"fib": true,
"flags": 3,
"interfaceIndex": 6,
"interfaceName": "PortChannel102",
"ip": "10.0.0.1",
"rmapSource": "10.1.0.32",
"weight": 1
},
{
"active": true,
Failed. Look at reported mismatches above
|
in 202405, this check is only run , when big_suppress_fib_pending feature is enabled(which is not).. Also this command internally executes, "show ip route json", which is risky and known to have memory spike on scaled route systems.. hence, this PR cant go in 202405 branch. |
What I did
It supports to detect the route with offload False or without offload, which can capture queued route, because queued route doesn’t have offload, but for rejected route, the offload is True. It can’t detect the rejected route.
So, we add a new detection check for the value of key failed for route entries, which can cover both rejected and queued routes.
It will help to detect rejected route and queued route on device.
How I did it
Append failed route prefix into failed list, if it's not empty, script will print error message into syslog
# Check for failed state
if entry.get('failed', False):
failed_rt.append(route_prefix)
How to verify it
For rejected route:
The output of route_check.py
Some routes have failed state in FRR : ['0.0.0.0/0', '192.168.128.0/25', '192.168.128.128/25', '192.168.136.0/25', '192.168.136.128/25', '192.168.144.0/25', '192.168.144.128/25', '192.168.152.0/25', '192.168.152.128/25', '192.168.160.0/25', '192.168.160.128/25', '192.168.168.0/25', '192.168.168.128/25', '192.168.176.0/25', '192.168.176.128/25', '192.168.184.0/25', '192.168.184.128/25', '192.168.192.0/25', '192.168.192.128/25', '192.168.200.0/25', '192.168.200.128/25', '192.168.208.0/25', '192.168.208.128/25', '192.168.216.0/25', '192.168.216.128/25', '192.168.224.0/25', '192.168.224.128/25', '192.168.232.0/25', '192.168.232.128/25', '192.168.240.0/25', '192.168.240.128/25', '192.168.248.0/25', '192.168.248.128/25', '192.169.0.0/25', '192.169.0.128/25', '192.169.104.0/25', '192.169.104.128/25', '192.169.112.0/25', '192.169.112.128/25', '192.169.120.0/25', '192.169.120.128/25', '192.169.128.0/25', '192.169.128.128/25', '192.169.136.0/25', '192.169.136.128/25', '192.169.144.0/25', '192.16
Failure results: {{
"": {
"failed_FRR_routes": [
"0.0.0.0/0",
"192.168.128.0/25",
"192.168.128.128/25",
"192.168.136.0/25",
"192.168.136.128/25",
"192.168.144.0/25",
"192.168.144.128/25",
"192.168.152.0/25",
"192.168.152.128/25",
"192.168.160.0/25",
"192.168.160.128/25",
"192.168.168.0/25",
"192.168.168.128/25",
"192.168.176.0/25",
"192.168.176.128/25",
"192.168.184.0/25",
"192.168.184.128/25",
"192.168.192.0/25",
"192.168.192.128/25",
"192.168.200.0/25",
"192.168.200.128/25",
"192.168.208.0/25",
"192.168.208.128/25",
"192.168.216.0/25",
"192.168.216.128/25",
"192.168.224.0/25",
"192.168.224.128/25",
"192.168.232.0/25",
"192.168.232
Failed. Look at reported mismatches above
For rejected route:
the output of route_check.py
Some routes have failed state in FRR : ['0.0.0.0/0', '192.168.128.0/25', '192.168.128.128/25', '192.168.136.0/25', '192.168.136.128/25', '192.168.144.0/25', '192.168.144.128/25', '192.168.152.0/25', '192.168.152.128/25', '192.168.160.0/25', '192.168.160.128/25', '192.168.168.0/25', '192.168.168.128/25', '192.168.176.0/25', '192.168.176.128/25', '192.168.184.0/25', '192.168.184.128/25', '192.168.192.0/25', '192.168.192.128/25', '192.168.200.0/25', '192.168.200.128/25', '192.168.208.0/25', '192.168.208.128/25', '192.168.216.0/25', '192.168.216.128/25', '192.168.224.0/25', '192.168.224.128/25', '192.168.232.0/25', '192.168.232.128/25', '192.168.240.0/25', '192.168.240.128/25', '192.168.248.0/25', '192.168.248.128/25', '192.169.0.0/25', '192.169.0.128/25', '192.169.104.0/25', '192.169.104.128/25', '192.169.112.0/25', '192.169.112.128/25', '192.169.120.0/25', '192.169.120.128/25', '192.169.128.0/25', '192.169.128.128/25', '192.169.136.0/25', '192.169.136.128/25', '192.169.144.0/25', '192.16
Failure results: {{
"": {
"missed_FRR_routes": [
{
"destSelected": true,
"distance": 20,
"failed": true,
"installedNexthopGroupId": 39146,
"internalFlags": 8,
"internalNextHopActiveNum": 4,
"internalNextHopNum": 4,
"internalStatus": 168,
"metric": 0,
"nexthopGroupId": 39146,
"nexthops": [
{
"active": true,
"afi": "ipv4",
"fib": true,
"flags": 3,
"interfaceIndex": 6,
"interfaceName": "PortChannel102",
"ip": "10.0.0.1",
"rmapSource": "10.1.0.32",
"weight": 1
},
{
"active": true,
Failed. Look at reported mismatches above
|
hi @ZhaohuiS do you mind to help take this change to 202412 too? |
|
@r12f PR for 202412 is here, please help review. Azure/sonic-utilities.msft#284 |
What I did
It supports to detect the route with offload False or without offload, which can capture queued route, because queued route doesn’t have offload, but for rejected route, the offload is True. It can’t detect the rejected route.
So, we add a new detection check for the value of key failed for route entries, which can cover both rejected and queued routes.
It will help to detect rejected route and queued route on device.
How I did it
Append failed route prefix into failed list, if it's not empty, script will print error message into syslog
# Check for failed state
if entry.get('failed', False):
failed_rt.append(route_prefix)
How to verify it
For rejected route:
The output of route_check.py
Some routes have failed state in FRR : ['0.0.0.0/0', '192.168.128.0/25', '192.168.128.128/25', '192.168.136.0/25', '192.168.136.128/25', '192.168.144.0/25', '192.168.144.128/25', '192.168.152.0/25', '192.168.152.128/25', '192.168.160.0/25', '192.168.160.128/25', '192.168.168.0/25', '192.168.168.128/25', '192.168.176.0/25', '192.168.176.128/25', '192.168.184.0/25', '192.168.184.128/25', '192.168.192.0/25', '192.168.192.128/25', '192.168.200.0/25', '192.168.200.128/25', '192.168.208.0/25', '192.168.208.128/25', '192.168.216.0/25', '192.168.216.128/25', '192.168.224.0/25', '192.168.224.128/25', '192.168.232.0/25', '192.168.232.128/25', '192.168.240.0/25', '192.168.240.128/25', '192.168.248.0/25', '192.168.248.128/25', '192.169.0.0/25', '192.169.0.128/25', '192.169.104.0/25', '192.169.104.128/25', '192.169.112.0/25', '192.169.112.128/25', '192.169.120.0/25', '192.169.120.128/25', '192.169.128.0/25', '192.169.128.128/25', '192.169.136.0/25', '192.169.136.128/25', '192.169.144.0/25', '192.16
Failure results: {{
"": {
"failed_FRR_routes": [
"0.0.0.0/0",
"192.168.128.0/25",
"192.168.128.128/25",
"192.168.136.0/25",
"192.168.136.128/25",
"192.168.144.0/25",
"192.168.144.128/25",
"192.168.152.0/25",
"192.168.152.128/25",
"192.168.160.0/25",
"192.168.160.128/25",
"192.168.168.0/25",
"192.168.168.128/25",
"192.168.176.0/25",
"192.168.176.128/25",
"192.168.184.0/25",
"192.168.184.128/25",
"192.168.192.0/25",
"192.168.192.128/25",
"192.168.200.0/25",
"192.168.200.128/25",
"192.168.208.0/25",
"192.168.208.128/25",
"192.168.216.0/25",
"192.168.216.128/25",
"192.168.224.0/25",
"192.168.224.128/25",
"192.168.232.0/25",
"192.168.232
Failed. Look at reported mismatches above
For rejected route:
the output of route_check.py
Some routes have failed state in FRR : ['0.0.0.0/0', '192.168.128.0/25', '192.168.128.128/25', '192.168.136.0/25', '192.168.136.128/25', '192.168.144.0/25', '192.168.144.128/25', '192.168.152.0/25', '192.168.152.128/25', '192.168.160.0/25', '192.168.160.128/25', '192.168.168.0/25', '192.168.168.128/25', '192.168.176.0/25', '192.168.176.128/25', '192.168.184.0/25', '192.168.184.128/25', '192.168.192.0/25', '192.168.192.128/25', '192.168.200.0/25', '192.168.200.128/25', '192.168.208.0/25', '192.168.208.128/25', '192.168.216.0/25', '192.168.216.128/25', '192.168.224.0/25', '192.168.224.128/25', '192.168.232.0/25', '192.168.232.128/25', '192.168.240.0/25', '192.168.240.128/25', '192.168.248.0/25', '192.168.248.128/25', '192.169.0.0/25', '192.169.0.128/25', '192.169.104.0/25', '192.169.104.128/25', '192.169.112.0/25', '192.169.112.128/25', '192.169.120.0/25', '192.169.120.128/25', '192.169.128.0/25', '192.169.128.128/25', '192.169.136.0/25', '192.169.136.128/25', '192.169.144.0/25', '192.16
Failure results: {{
"": {
"missed_FRR_routes": [
{
"destSelected": true,
"distance": 20,
"failed": true,
"installedNexthopGroupId": 39146,
"internalFlags": 8,
"internalNextHopActiveNum": 4,
"internalNextHopNum": 4,
"internalStatus": 168,
"metric": 0,
"nexthopGroupId": 39146,
"nexthops": [
{
"active": true,
"afi": "ipv4",
"fib": true,
"flags": 3,
"interfaceIndex": 6,
"interfaceName": "PortChannel102",
"ip": "10.0.0.1",
"rmapSource": "10.1.0.32",
"weight": 1
},
{
"active": true,
Failed. Look at reported mismatches above
Signed-off-by: Priyansh Tratiya <[email protected]>
What I did
It supports to detect the route with offload False or without offload, which can capture queued route, because queued route doesn’t have offload, but for rejected route, the offload is True. It can’t detect the rejected route.
So, we add a new detection check for the value of key failed for route entries, which can cover both rejected and queued routes.
It will help to detect rejected route and queued route on device.
How I did it
Append failed route prefix into failed list, if it's not empty, script will print error message into syslog
# Check for failed state
if entry.get('failed', False):
failed_rt.append(route_prefix)
How to verify it
For rejected route:
The output of route_check.py
Some routes have failed state in FRR : ['0.0.0.0/0', '192.168.128.0/25', '192.168.128.128/25', '192.168.136.0/25', '192.168.136.128/25', '192.168.144.0/25', '192.168.144.128/25', '192.168.152.0/25', '192.168.152.128/25', '192.168.160.0/25', '192.168.160.128/25', '192.168.168.0/25', '192.168.168.128/25', '192.168.176.0/25', '192.168.176.128/25', '192.168.184.0/25', '192.168.184.128/25', '192.168.192.0/25', '192.168.192.128/25', '192.168.200.0/25', '192.168.200.128/25', '192.168.208.0/25', '192.168.208.128/25', '192.168.216.0/25', '192.168.216.128/25', '192.168.224.0/25', '192.168.224.128/25', '192.168.232.0/25', '192.168.232.128/25', '192.168.240.0/25', '192.168.240.128/25', '192.168.248.0/25', '192.168.248.128/25', '192.169.0.0/25', '192.169.0.128/25', '192.169.104.0/25', '192.169.104.128/25', '192.169.112.0/25', '192.169.112.128/25', '192.169.120.0/25', '192.169.120.128/25', '192.169.128.0/25', '192.169.128.128/25', '192.169.136.0/25', '192.169.136.128/25', '192.169.144.0/25', '192.16
Failure results: {{
"": {
"failed_FRR_routes": [
"0.0.0.0/0",
"192.168.128.0/25",
"192.168.128.128/25",
"192.168.136.0/25",
"192.168.136.128/25",
"192.168.144.0/25",
"192.168.144.128/25",
"192.168.152.0/25",
"192.168.152.128/25",
"192.168.160.0/25",
"192.168.160.128/25",
"192.168.168.0/25",
"192.168.168.128/25",
"192.168.176.0/25",
"192.168.176.128/25",
"192.168.184.0/25",
"192.168.184.128/25",
"192.168.192.0/25",
"192.168.192.128/25",
"192.168.200.0/25",
"192.168.200.128/25",
"192.168.208.0/25",
"192.168.208.128/25",
"192.168.216.0/25",
"192.168.216.128/25",
"192.168.224.0/25",
"192.168.224.128/25",
"192.168.232.0/25",
"192.168.232
Failed. Look at reported mismatches above
For rejected route:
the output of route_check.py
Some routes have failed state in FRR : ['0.0.0.0/0', '192.168.128.0/25', '192.168.128.128/25', '192.168.136.0/25', '192.168.136.128/25', '192.168.144.0/25', '192.168.144.128/25', '192.168.152.0/25', '192.168.152.128/25', '192.168.160.0/25', '192.168.160.128/25', '192.168.168.0/25', '192.168.168.128/25', '192.168.176.0/25', '192.168.176.128/25', '192.168.184.0/25', '192.168.184.128/25', '192.168.192.0/25', '192.168.192.128/25', '192.168.200.0/25', '192.168.200.128/25', '192.168.208.0/25', '192.168.208.128/25', '192.168.216.0/25', '192.168.216.128/25', '192.168.224.0/25', '192.168.224.128/25', '192.168.232.0/25', '192.168.232.128/25', '192.168.240.0/25', '192.168.240.128/25', '192.168.248.0/25', '192.168.248.128/25', '192.169.0.0/25', '192.169.0.128/25', '192.169.104.0/25', '192.169.104.128/25', '192.169.112.0/25', '192.169.112.128/25', '192.169.120.0/25', '192.169.120.128/25', '192.169.128.0/25', '192.169.128.128/25', '192.169.136.0/25', '192.169.136.128/25', '192.169.144.0/25', '192.16
Failure results: {{
"": {
"missed_FRR_routes": [
{
"destSelected": true,
"distance": 20,
"failed": true,
"installedNexthopGroupId": 39146,
"internalFlags": 8,
"internalNextHopActiveNum": 4,
"internalNextHopNum": 4,
"internalStatus": 168,
"metric": 0,
"nexthopGroupId": 39146,
"nexthops": [
{
"active": true,
"afi": "ipv4",
"fib": true,
"flags": 3,
"interfaceIndex": 6,
"interfaceName": "PortChannel102",
"ip": "10.0.0.1",
"rmapSource": "10.1.0.32",
"weight": 1
},
{
"active": true,
Failed. Look at reported mismatches above
Signed-off-by: Priyansh Tratiya <[email protected]>
Co-authored-by: Zhaohui Sun <[email protected]>
What I did
It supports to detect the route with offload False or without offload, which can capture queued route, because queued route doesn’t have offload, but for rejected route, the offload is True. It can’t detect the rejected route.
So, we add a new detection check for the value of key failed for route entries, which can cover both rejected and queued routes.
It will help to detect rejected route and queued route on device.
How I did it
Append failed route prefix into failed list, if it's not empty, script will print error message into syslog
How to verify it
The output of route_check.py
the output of route_check.py
Previous command output (if the output of a command-line utility has changed)
New command output (if the output of a command-line utility has changed)