-
Notifications
You must be signed in to change notification settings - Fork 1k
[sonic-mgmt] Fix sflow/test_sflow.py failures with expected sflow packets not received on collector interface #22186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
StormLiangMS
merged 4 commits into
sonic-net:master
from
vkjammala-arista:fix-sflow-packet-failures
Mar 26, 2026
+70
−11
Merged
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
0955385
[sonic-mgmt] Fix sflow/test_sflow.py failures with expected sflow pac…
vkjammala-arista 4c381d4
Addressing review comments
vkjammala-arista b458b30
Fix pre-commit check failures
vkjammala-arista ddb7e49
Revert PR#21674 partially to enable "sflow/test_sflow.py" test
vkjammala-arista File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have you verified that this works if we are hitting the case where SYSTEM_READY is not up yet? I'm pretty sure
hsflowd.autogets generated before the timeout for SYSTEM_READY runs, which is what we are trying to wait for here. There might be value in also checking for hsflowd to be up, but I don't think it solves that particular problem case.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From sflow HLD:
hsflowd daemon will initialize only after receiving the SYSTEM_READY|SYSTEM_STATE Status=Up from SONiC. The system ready state however depends on all the monitored daemons to be ready. Failure on any of these, will result in system state to be down. In such scenarios, sflow will wait until 180 seconds and if system ready is not up will proceed with initializationIIUC, hsflowd processes the collector configuration only after SYSTEM_READY is up, so that's the reason why I have added 240 secs (see
wait_until_hsflowd_ready) wait_until hsflowd processes the collector config (i.ehsflowd.autoto have collector_ip information)And I'm not seeing any issues around this with this PR fix. Please let me know if my understanding isn't correct.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my change #21195 which has a very similar hsflow_ready check, as well as an explicit SYSTEM_READY check. I found that 180s works fine for the timeout, is there a reason you went with 240s?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how helpful or reliable the SYSTEM_READY check is, and I don't know exactly how the sFlow/hsflowd process uses this SYSTEM_READY state. So, in my opinion, as far as the sFlow test is concerned, checking whether
hsflowdhas processed the collector IP configuration should be enough.I have gone through your
hsflow_readycheck, I see you aren't checking whether collector config is actually processed or not (checking on hsflowd process is up or not, might not be sufficient in some scenarios)is there a reason you went with 240s?No specific reason, I just wanted to give some extra delay than what HLD stated (i.e >180 secs) for test to be more reliable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SYSTEM_READY is a top-level concept in SONiC, and is definitely something we can and should use if we need to.
For this case, though, I agree with you.
We can inspect the hsflowd source directly and see what it does. Here's the SYSTEM_READY check (from https://github.com/sflow/host-sflow/blob/05be6bd61bd9926cd8a0e0d837a69165fd8add71/src/Linux/mod_sonic.c#L2355):
db_getSystemReady(mod)checks for SYTEM_READY|SYSTEM_STATE=UP, so waiting for SYSTEM_READY is an appropriate check. Howerver, the collectors are not processed until hsflowd is in the CONNECTED state, which doesn't happen until hsflowd sees that the SYSTEM_READY check succeeds or times out (the timeout is 180s, which is where the 3-minute timeline in the HLD comes from) (https://github.com/sflow/host-sflow/blob/05be6bd61bd9926cd8a0e0d837a69165fd8add71/src/Linux/mod_sonic.c#L2374):So, in this case, checking for collectors works as a stand-in for checking SYSTEM_READY, as long as we include a long enough timeout. I looked into different failure cases, but hsflowd truncates the .auto file when it starts, and we would only hit the SYSTEM_READY thing in the case of an hsflowd restart.
I will update my PR, it has some further fixes that are still valuable (I reworked the PTF sample timing to catch intermittent cases where startup delay fluctuations can occasionally case failures).
If you want to merge this one, can you get rid of the test skip and trigger the CI/CD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @anders-nexthop for the details and pointers to SYSTEM_READY checks. I have updated my PR to get rid of test skip.