Skip to content

[generic_config_updater] Add diagnostic logging for vlanintf_validato…#4105

Merged
qiluo-msft merged 4 commits into
sonic-net:masterfrom
DavidZagury:master_service_validatior_log
Nov 17, 2025
Merged

[generic_config_updater] Add diagnostic logging for vlanintf_validato…#4105
qiluo-msft merged 4 commits into
sonic-net:masterfrom
DavidZagury:master_service_validatior_log

Conversation

@DavidZagury

Copy link
Copy Markdown
Contributor

…r failures

Add enhanced error logging to diagnose vlanintf_validator failures when flushing neighbor entries for deleted VLAN interface IPs.

These minimal logging additions only activate on errors, providing zero overhead during normal operation while enabling root cause analysis when the validator fails.

This could help in the investigatation of sonic-net/sonic-buildimage#23680

What I did

Added diagnostic logging to vlanintf_validator to identify the root cause when neighbor flush operations fail during VLAN interface IP removal. Previously, the validator would fail silently with only a generic error message, making it impossible to diagnose the actual failure reason.

How I did it

  1. Enhanced command_wrapper() function to:

    • Capture stdout and stderr from subprocess calls by setting capture_output=True
    • Log the exact command, return code, stdout, and stderr when commands fail
    • Only log when errors occur (returncode != 0), maintaining zero overhead during normal operation
  2. Added error logging in vlanintf_validator() to:

    • Log which specific interface and IP address failed during neighbor flush
    • Include the return code for correlation with command_wrapper logs
  3. Updated unit tests in service_validator_test.py to reflect the original command format

How to verify it

  1. Normal operation (no errors):

    • Apply a VLAN interface configuration change
    • Verify no additional log messages appear when operation succeeds
  2. Error scenario:

    • Trigger a vlanintf_validator failure (e.g., by manipulating interface state during config update)
    • Verify detailed error logs appear showing:
      • The exact command that failed
      • The return code
      • stdout/stderr output from the command

Previous command output (if the output of a command-line utility has changed)

2025 Oct 25 18:04:21.266397 r-sn4700-72 ERR GenericConfigUpdater: Change Applier: service invoked: generic_config_updater.services_validator.vlanintf_validator failed with ret=False

New command output (if the output of a command-line utility has changed)

2025 Oct 25 18:04:21.266397 r-sn4700-72 ERR GenericConfigUpdater: Service Validator: Command failed: 'ip neigh flush dev Vlan1000 192.168.0.1/21', returncode: 2
2025 Oct 25 18:04:21.266398 r-sn4700-72 ERR GenericConfigUpdater: Service Validator: stderr: Cannot find device "Vlan1000"
2025 Oct 25 18:04:21.266399 r-sn4700-72 ERR GenericConfigUpdater: Service Validator: vlanintf_validator: Failed to flush neighbors for Vlan1000 192.168.0.1/21, returncode=2
2025 Oct 25 18:04:21.266400 r-sn4700-72 ERR GenericConfigUpdater: Change Applier: service invoked: generic_config_updater.services_validator.vlanintf_validator failed with ret=False

…r failures

Add enhanced error logging to diagnose vlanintf_validator failures when
flushing neighbor entries for deleted VLAN interface IPs.

These minimal logging additions only activate on errors, providing
zero overhead during normal operation while enabling root cause
analysis when the validator fails.
@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@DavidZagury DavidZagury force-pushed the master_service_validatior_log branch from 4ee08fe to 44b049d Compare November 13, 2025 09:06
@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds diagnostic logging to vlanintf_validator to help diagnose failures when flushing neighbor entries for deleted VLAN interface IPs. The logging only activates on errors, maintaining zero overhead during normal operations.

Key Changes

  • Enhanced command_wrapper() to capture stdout/stderr and log detailed error information when commands fail
  • Added error logging in vlanintf_validator() to identify which interface and IP address failed
  • Updated test infrastructure to support stdout/stderr in mocked subprocess calls and added failure test cases

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
generic_config_updater/services_validator.py Added diagnostic error logging to command_wrapper and vlanintf_validator functions
tests/generic_config_updater/service_validator_test.py Updated mock infrastructure and added test cases for vlanintf_validator failure scenarios

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@qiluo-msft qiluo-msft merged commit dc7e71a into sonic-net:master Nov 17, 2025
13 checks passed
gordon-nexthop pushed a commit to nexthop-ai/sonic-utilities that referenced this pull request Nov 25, 2025
…r failures (sonic-net#4105)

Add enhanced error logging to diagnose vlanintf_validator failures when flushing neighbor entries for deleted VLAN interface IPs.

These minimal logging additions only activate on errors, providing zero overhead during normal operation while enabling root cause analysis when the validator fails.

This could help in the investigatation of sonic-net/sonic-buildimage#23680

What I did
Added diagnostic logging to vlanintf_validator to identify the root cause when neighbor flush operations fail during VLAN interface IP removal. Previously, the validator would fail silently with only a generic error message, making it impossible to diagnose the actual failure reason.

How I did it
Enhanced command_wrapper() function to:

Capture stdout and stderr from subprocess calls by setting capture_output=True
Log the exact command, return code, stdout, and stderr when commands fail
Only log when errors occur (returncode != 0), maintaining zero overhead during normal operation
Added error logging in vlanintf_validator() to:

Log which specific interface and IP address failed during neighbor flush
Include the return code for correlation with command_wrapper logs
Updated unit tests in service_validator_test.py to reflect the original command format

How to verify it
Normal operation (no errors):

Apply a VLAN interface configuration change
Verify no additional log messages appear when operation succeeds
Error scenario:

Trigger a vlanintf_validator failure (e.g., by manipulating interface state during config update)
Verify detailed error logs appear showing:
The exact command that failed
The return code
stdout/stderr output from the command
Previous command output (if the output of a command-line utility has changed)
2025 Oct 25 18:04:21.266397 r-sn4700-72 ERR GenericConfigUpdater: Change Applier: service invoked: generic_config_updater.services_validator.vlanintf_validator failed with ret=False
New command output (if the output of a command-line utility has changed)
2025 Oct 25 18:04:21.266397 r-sn4700-72 ERR GenericConfigUpdater: Service Validator: Command failed: 'ip neigh flush dev Vlan1000 192.168.0.1/21', returncode: 2
2025 Oct 25 18:04:21.266398 r-sn4700-72 ERR GenericConfigUpdater: Service Validator: stderr: Cannot find device "Vlan1000"
2025 Oct 25 18:04:21.266399 r-sn4700-72 ERR GenericConfigUpdater: Service Validator: vlanintf_validator: Failed to flush neighbors for Vlan1000 192.168.0.1/21, returncode=2
2025 Oct 25 18:04:21.266400 r-sn4700-72 ERR GenericConfigUpdater: Change Applier: service invoked: generic_config_updater.services_validator.vlanintf_validator failed with ret=False
saiarcot895 added a commit to saiarcot895/sonic-mgmt that referenced this pull request Dec 1, 2025
sonic-net/sonic-utilities#4105 added error
logging for when a GCU command exits with a non-zero exit code. There is
one case of a (known) failure where dhcp_relay fails to restart because
it hit the start limit. These error logs then need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
yxieca pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Dec 5, 2025
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
opcoder0 pushed a commit to opcoder0/sonic-mgmt that referenced this pull request Dec 8, 2025
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
nissampa pushed a commit to nissampa/sonic-mgmt_dpu_test that referenced this pull request Dec 9, 2025
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Nishanth Sampath Kumar <nissampa@cisco.com>
selldinesh pushed a commit to selldinesh/sonic-mgmt that referenced this pull request Dec 11, 2025
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: selldinesh <dinesh.sellappan@keysight.com>
saravanan-nexthop pushed a commit to saravanan-nexthop/sonic-mgmt that referenced this pull request Dec 15, 2025
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saravanan <saravanan@nexthop.ai>
saiarcot895 added a commit to saiarcot895/sonic-mgmt that referenced this pull request Dec 19, 2025
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 21, 2025
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 21, 2025
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
vmittal-msft pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Jan 7, 2026
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
YairRaviv pushed a commit to YairRaviv/sonic-utilities that referenced this pull request Jan 12, 2026
…r failures (sonic-net#4105)

Add enhanced error logging to diagnose vlanintf_validator failures when flushing neighbor entries for deleted VLAN interface IPs.

These minimal logging additions only activate on errors, providing zero overhead during normal operation while enabling root cause analysis when the validator fails.

This could help in the investigatation of sonic-net/sonic-buildimage#23680

What I did
Added diagnostic logging to vlanintf_validator to identify the root cause when neighbor flush operations fail during VLAN interface IP removal. Previously, the validator would fail silently with only a generic error message, making it impossible to diagnose the actual failure reason.

How I did it
Enhanced command_wrapper() function to:

Capture stdout and stderr from subprocess calls by setting capture_output=True
Log the exact command, return code, stdout, and stderr when commands fail
Only log when errors occur (returncode != 0), maintaining zero overhead during normal operation
Added error logging in vlanintf_validator() to:

Log which specific interface and IP address failed during neighbor flush
Include the return code for correlation with command_wrapper logs
Updated unit tests in service_validator_test.py to reflect the original command format

How to verify it
Normal operation (no errors):

Apply a VLAN interface configuration change
Verify no additional log messages appear when operation succeeds
Error scenario:

Trigger a vlanintf_validator failure (e.g., by manipulating interface state during config update)
Verify detailed error logs appear showing:
The exact command that failed
The return code
stdout/stderr output from the command
Previous command output (if the output of a command-line utility has changed)
2025 Oct 25 18:04:21.266397 r-sn4700-72 ERR GenericConfigUpdater: Change Applier: service invoked: generic_config_updater.services_validator.vlanintf_validator failed with ret=False
New command output (if the output of a command-line utility has changed)
2025 Oct 25 18:04:21.266397 r-sn4700-72 ERR GenericConfigUpdater: Service Validator: Command failed: 'ip neigh flush dev Vlan1000 192.168.0.1/21', returncode: 2
2025 Oct 25 18:04:21.266398 r-sn4700-72 ERR GenericConfigUpdater: Service Validator: stderr: Cannot find device "Vlan1000"
2025 Oct 25 18:04:21.266399 r-sn4700-72 ERR GenericConfigUpdater: Service Validator: vlanintf_validator: Failed to flush neighbors for Vlan1000 192.168.0.1/21, returncode=2
2025 Oct 25 18:04:21.266400 r-sn4700-72 ERR GenericConfigUpdater: Change Applier: service invoked: generic_config_updater.services_validator.vlanintf_validator failed with ret=False
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Jan 13, 2026
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
thisptr-sh pushed a commit to thisptr-sh/sonic-mgmt that referenced this pull request Jan 21, 2026
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Jan 29, 2026
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Feb 2, 2026
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Yael Tzur <ytzur@nvidia.com>
anilal-amd pushed a commit to anilal-amd/anilal-forked-sonic-mgmt that referenced this pull request Feb 19, 2026
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Zhuohui Tan <zhuohui.tan@amd.com>
selldinesh pushed a commit to selldinesh/sonic-mgmt that referenced this pull request Apr 1, 2026
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: selldinesh <dinesh.sellappan@keysight.com>
rraghav-cisco pushed a commit to rraghav-cisco/sonic-mgmt that referenced this pull request Apr 20, 2026
What is the motivation for this PR?
Update the loganalyzer ignore list to fix a regex issue, and update the GCU test to ignore new error syslogs.

The regex for ignoring systemd-networkd.socket failing to start included [1], which has a special meaning when interpreted as regex (match one character that has 1 in it; in other words, match a 1). Escape the brackets so that they get treated literally.

sonic-net/sonic-utilities#4105 started logging errors when a GCU command fails with a non-zero exit status. However, there is one case of a (known) failure where dhcp_relay fails to restart because it hit the start limit. These error logs need to be ignored.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants