[sonic-mgmt] Fix sflow/test_sflow.py failures with expected sflow packets not received on collector interface#22186
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
This PR fixes sFlow test failures caused by timing issues and assertion errors. The hsflowd daemon requires up to 3+ minutes to initialize and process collector configuration after being enabled or rebooted. Tests were failing because they attempted to verify sFlow traffic before the daemon was ready. Additionally, a KeyError would occur when flow samples weren't received but counter samples were.
Changes:
- Added wait logic to ensure hsflowd has processed collector configuration before running PTF tests
- Fixed assertion logic in PTF tests to check specific packet types (counter vs flow) instead of total samples
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/sflow/test_sflow.py | Added verify_hsflowd_ready and wait_until_hsflowd_ready functions to wait for hsflowd initialization; updated partial_ptf_runner fixture to wait for each collector before starting PTF tests; added ast import for parsing collector list |
| ansible/roles/test/files/ptftests/py3/sflow_test.py | Fixed assertions to check total_counter_count for polling tests and total_flow_count for flow tests instead of total_samples, preventing KeyError when flow_port_count doesn't exist; improved error messages to be more specific |
PR Review: Fix sflow/test_sflow.py failuresThe PR addresses two independent issues: (1) a ✅ Fix #1:
|
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Hi @StormLiangMS i have addressed the comments. Minor: start_time / elapsed Only Meaningful on Success Path Minor: Fixture Dependency Addition May Affect Test Collection |
772a1b7 to
a78f760
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@StormLiangMS Can you please help in merging this PR |
|
@vkjammala-arista for this change and for your other reviews,
|
| """ | ||
| return all( | ||
| duthost.shell( | ||
| f"docker exec sflow grep -q 'collector={ip}' /etc/hsflowd.auto 2>/dev/null", |
There was a problem hiding this comment.
have you verified that this works if we are hitting the case where SYSTEM_READY is not up yet? I'm pretty sure hsflowd.auto gets generated before the timeout for SYSTEM_READY runs, which is what we are trying to wait for here. There might be value in also checking for hsflowd to be up, but I don't think it solves that particular problem case.
There was a problem hiding this comment.
From sflow HLD: hsflowd daemon will initialize only after receiving the SYSTEM_READY|SYSTEM_STATE Status=Up from SONiC. The system ready state however depends on all the monitored daemons to be ready. Failure on any of these, will result in system state to be down. In such scenarios, sflow will wait until 180 seconds and if system ready is not up will proceed with initialization
IIUC, hsflowd processes the collector configuration only after SYSTEM_READY is up, so that's the reason why I have added 240 secs (see wait_until_hsflowd_ready) wait_until hsflowd processes the collector config (i.e hsflowd.auto to have collector_ip information)
And I'm not seeing any issues around this with this PR fix. Please let me know if my understanding isn't correct.
There was a problem hiding this comment.
See my change #21195 which has a very similar hsflow_ready check, as well as an explicit SYSTEM_READY check. I found that 180s works fine for the timeout, is there a reason you went with 240s?
There was a problem hiding this comment.
I'm not sure how helpful or reliable the SYSTEM_READY check is, and I don't know exactly how the sFlow/hsflowd process uses this SYSTEM_READY state. So, in my opinion, as far as the sFlow test is concerned, checking whether hsflowd has processed the collector IP configuration should be enough.
I have gone through your hsflow_ready check, I see you aren't checking whether collector config is actually processed or not (checking on hsflowd process is up or not, might not be sufficient in some scenarios)
is there a reason you went with 240s?
No specific reason, I just wanted to give some extra delay than what HLD stated (i.e >180 secs) for test to be more reliable.
There was a problem hiding this comment.
SYSTEM_READY is a top-level concept in SONiC, and is definitely something we can and should use if we need to.
For this case, though, I agree with you.
We can inspect the hsflowd source directly and see what it does. Here's the SYSTEM_READY check (from https://github.com/sflow/host-sflow/blob/05be6bd61bd9926cd8a0e0d837a69165fd8add71/src/Linux/mod_sonic.c#L2355):
case HSP_SONIC_STATE_WAIT_READY:
// all dbs connected - wait for SYSTEM_READY
{
time_t waiting = mdata->pollBus->now.tv_sec - mdata->waitReadyStart;
if(waiting < sp->sonic.waitReady) {
db_getSystemReady(mod);
}
else {
EVDebug(mod, 1, "sonic: waitReady timeout after %ld seconds", (long)waiting);
setSonicState(mod, HSP_SONIC_STATE_CONNECTED);
}
db_getSystemReady(mod) checks for SYTEM_READY|SYSTEM_STATE=UP, so waiting for SYSTEM_READY is an appropriate check. Howerver, the collectors are not processed until hsflowd is in the CONNECTED state, which doesn't happen until hsflowd sees that the SYSTEM_READY check succeeds or times out (the timeout is 180s, which is where the 3-minute timeline in the HLD comes from) (https://github.com/sflow/host-sflow/blob/05be6bd61bd9926cd8a0e0d837a69165fd8add71/src/Linux/mod_sonic.c#L2374):
case HSP_SONIC_STATE_CONNECTED:
// connected and ready - learn config
db_getMeta(mod);
dbEvt_subscribe(mod);
// the next steps read the starting agent/polling/collector
// config. Any subsequent changes will be detected via dbEvt.
setSonicState(mod, HSP_SONIC_STATE_SFLOWGLOBAL);
break;
So, in this case, checking for collectors works as a stand-in for checking SYSTEM_READY, as long as we include a long enough timeout. I looked into different failure cases, but hsflowd truncates the .auto file when it starts, and we would only hit the SYSTEM_READY thing in the case of an hsflowd restart.
I will update my PR, it has some further fixes that are still valuable (I reworked the PTF sample timing to catch intermittent cases where startup delay fluctuations can occasionally case failures).
If you want to merge this one, can you get rid of the test skip and trigger the CI/CD?
There was a problem hiding this comment.
Thanks @anders-nexthop for the details and pointers to SYSTEM_READY checks. I have updated my PR to get rid of test skip.
I'm not sure why this test is getting skipped in community CI/CD, in our testbeds these tests are running fine. |
Different topology maybe? |
I see the reason for confusion now, test is being skipped in the master (but not on 202511) because of PR #21674 changes. And my test results are from 202511 branch, so yeah we need to remove above piece of code. |
anders-nexthop
left a comment
There was a problem hiding this comment.
Going through the testcases, I a couple cases were we aren't passing any collectors when setting up the config that would have the potential to hide issues. so we may want some form of system ready check in a couple of spots. But overall this is good (I will layer my changes on top of this one).
Please make sure to revert the conditional skip though, and post a passing test result (the test results from the checks age out after awhile, and since we are re-enabling this one after some time it would be good to have a record that it did indeed pass successfully -- and not just skip).
| duthost, intf) | ||
| var['portmap'] = json.dumps(var['sflow_ports']) | ||
| ptfhost.copy(content=var['portmap'], dest="/tmp/sflow_ports.json") | ||
| partial_ptf_runner( |
There was a problem hiding this comment.
This call is trying to make sure that hsflowd didn't come up after a reboot when it's not supposed to. But there might be a wait of up to 3 minutes for hsflowd to start (the SYSTEM_READY check). The way is_hsflowd_ready() is written, it won't wait at all for hsflowd in this case, since there are no collectors to check for. If we don't wait for hsflowd to be ready here, then we aren't really testing anything with this ptf_runner call. So we would either need to wait the 3 minutes, or check for SYSTEM_READY ourselves here before continuing.
There was a problem hiding this comment.
I see, as you are planning to your PR which is having some further fixes, do you want to take care of this. Otherwise, I will take care of this scenario in my follow-up review. Thanks!
There was a problem hiding this comment.
Yes I will update my PR to cover this case.
| 'show sflow')['stdout'])) == 0) | ||
| verify_show_sflow(duthost, status='up', collector=[]) | ||
| wait_until(30, 5, 0, verify_sflow_config_apply, duthost) | ||
| partial_ptf_runner( |
There was a problem hiding this comment.
technically if we ran this test case as a standalone, this could fail if we are stalled waiting for SYSTEM_READY. That's not a very likely case, as the test needs the collector to exist to be very meaningful, but it's still a consideration.
StormLiangMS
left a comment
There was a problem hiding this comment.
Review (for aristanetworks/sonic-qual.msft#1121)
LGTM. Good fix for the sflow reboot flakiness:
verify_hsflowd_ready/wait_until_hsflowd_readyproperly waits for collector config in hsflowd.auto before running traffic verification.- PTF-side fix correctly distinguishes
total_counter_countvstotal_flow_countinstead of using the ambiguoustotal_samples. - 240s timeout is generous but reasonable for reboot scenarios where hsflowd can take 3+ minutes to initialize.
93372bf to
d8dad4a
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…kets not received on collector interface Issue sonic-net#1: In some cases (like sflow config enabled for first time, device reboot), hsflowd daemon is taking little over 3 mins to be fully initialized and process collector config. During this window, hsflowd service won't send sflow packets ('CounterSample', 'FlowSample' etc) to collector interface and thus test can fail with i) "Packets are not received in active collector, collector\d+" and ii) "Expected Number of samples are not collected from Interface Ethernet\d+ in collector collector\d+ , Received \d+" hsflowd service is writing to "/etc/hsflowd.auto" once it's processed collector configuration. Thus waiting for collector info to be present in "/etc/hsflowd.auto" seems to be safe option before proceeding with sflow traffic verfication. Issue sonic-net#2: If the test expects flow samples/packets on the collector interface but they aren't seen for some reason, then we are hitting "KeyError: 'flow_port_count'". Due to counter samples seen on collector interface, "data['total_samples']" will not be zero but "data['total_flow_count']" will be 0 and lead to KeyError when tried to access "data['flow_port_count']". Fix is to have assert on "total_flow_count" and "total_counter_count" before calling corresponding sample analyze functions. Signed-off-by: Vinod <vkjammala@arista.com>
1) Enhanced "wait_until_hsflowd_ready" to make it wait for all the
collector IPs (instead of calling it sequentially for each IP)
2) Add docstring for "wait_until_hsflowd_ready" function
3) Updated "ast.literal_eval" usage to handle the case where
"active_collectors" is passed as empty string ("" instead of "[]")
Signed-off-by: Vinod <vkjammala@arista.com>
Signed-off-by: Vinod <vkjammala@arista.com>
Signed-off-by: Vinod <vkjammala@arista.com>
a542928 to
ddb7e49
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…kets not received on collector interface (sonic-net#22186) * [sonic-mgmt] Fix sflow/test_sflow.py failures with expected sflow packets not received on collector interface Issue #1: In some cases (like sflow config enabled for first time, device reboot), hsflowd daemon is taking little over 3 mins to be fully initialized and process collector config. During this window, hsflowd service won't send sflow packets ('CounterSample', 'FlowSample' etc) to collector interface and thus test can fail with i) "Packets are not received in active collector, collector\d+" and ii) "Expected Number of samples are not collected from Interface Ethernet\d+ in collector collector\d+ , Received \d+" hsflowd service is writing to "/etc/hsflowd.auto" once it's processed collector configuration. Thus waiting for collector info to be present in "/etc/hsflowd.auto" seems to be safe option before proceeding with sflow traffic verfication. Issue #2: If the test expects flow samples/packets on the collector interface but they aren't seen for some reason, then we are hitting "KeyError: 'flow_port_count'". Due to counter samples seen on collector interface, "data['total_samples']" will not be zero but "data['total_flow_count']" will be 0 and lead to KeyError when tried to access "data['flow_port_count']". Fix is to have assert on "total_flow_count" and "total_counter_count" before calling corresponding sample analyze functions. Signed-off-by: Vinod <vkjammala@arista.com> * Addressing review comments 1) Enhanced "wait_until_hsflowd_ready" to make it wait for all the collector IPs (instead of calling it sequentially for each IP) 2) Add docstring for "wait_until_hsflowd_ready" function 3) Updated "ast.literal_eval" usage to handle the case where "active_collectors" is passed as empty string ("" instead of "[]") Signed-off-by: Vinod <vkjammala@arista.com> * Fix pre-commit check failures Signed-off-by: Vinod <vkjammala@arista.com> * Revert PR#21674 partially to enable "sflow/test_sflow.py" test Signed-off-by: Vinod <vkjammala@arista.com> --------- Signed-off-by: Vinod <vkjammala@arista.com> Signed-off-by: mssonicbld <sonicbld@microsoft.com>
|
Cherry-pick PR to 202511: #23335 |
|
@vkjammala-arista — CI is green and has 1 approval from @anders-nexthop. Needs additional approvals. Requesting @mramezani95 @xwjiang-ms @zypgithub @BYGX-wcr @yutongzhang-microsoft to please review. This fixes sflow packet reception, tracked in aristanetworks/sonic-qual.msft#1062. |
…kets not received on collector interface (#22186) (#23335) * [sonic-mgmt] Fix sflow/test_sflow.py failures with expected sflow packets not received on collector interface Issue #1: In some cases (like sflow config enabled for first time, device reboot), hsflowd daemon is taking little over 3 mins to be fully initialized and process collector config. During this window, hsflowd service won't send sflow packets ('CounterSample', 'FlowSample' etc) to collector interface and thus test can fail with i) "Packets are not received in active collector, collector\d+" and ii) "Expected Number of samples are not collected from Interface Ethernet\d+ in collector collector\d+ , Received \d+" hsflowd service is writing to "/etc/hsflowd.auto" once it's processed collector configuration. Thus waiting for collector info to be present in "/etc/hsflowd.auto" seems to be safe option before proceeding with sflow traffic verfication. Issue #2: If the test expects flow samples/packets on the collector interface but they aren't seen for some reason, then we are hitting "KeyError: 'flow_port_count'". Due to counter samples seen on collector interface, "data['total_samples']" will not be zero but "data['total_flow_count']" will be 0 and lead to KeyError when tried to access "data['flow_port_count']". Fix is to have assert on "total_flow_count" and "total_counter_count" before calling corresponding sample analyze functions. * Addressing review comments 1) Enhanced "wait_until_hsflowd_ready" to make it wait for all the collector IPs (instead of calling it sequentially for each IP) 2) Add docstring for "wait_until_hsflowd_ready" function 3) Updated "ast.literal_eval" usage to handle the case where "active_collectors" is passed as empty string ("" instead of "[]") * Fix pre-commit check failures * Revert PR#21674 partially to enable "sflow/test_sflow.py" test --------- Signed-off-by: Vinod <vkjammala@arista.com> Signed-off-by: mssonicbld <sonicbld@microsoft.com> Co-authored-by: vkjammala-arista <152394203+vkjammala-arista@users.noreply.github.com>


Description of PR
Summary: Fix sflow/test_sflow.py failures with expected sflow packets not received on collector interface
Fixes # #22180
Type of change
Back port request
Approach
What is the motivation for this PR?
Currenlty bunch of
sflow/test_sflow.pytest cases are failing with below signaturesAssertionError: False is not true : \.{2,}Packets are not received in active collector +,collector\d+AssertionError: False is not true : Expected Number of samples are not collected from Interface Ethernet\d+ in collector collector\d+ , Received \d+KeyError: 'flow_port_count'Issue #1:
In some cases (like sflow config enabled for first time, device reboot), hsflowd daemon is taking little over 3 mins (See HLD) be fully initialized and process collector config. During this window, hsflowd service won't send sflow packets ('CounterSample', 'FlowSample' etc) to collector interface and thus test expecting sample packets on sflowtool can fail with above two signatures.
hsflowd service is writing to "/etc/hsflowd.auto" once it's processed collector configuration. Thus waiting for collector info to be present in "/etc/hsflowd.auto" seems to be safe option before proceeding with sflow traffic verfication.
Issue #2:
If the test expects flow samples/packets on the collector interface but they aren't seen for some reason, then we are hitting
KeyError: 'flow_port_count'. Due to counter samples seen on collector interface,data['total_samples']" will not be zero butdata['total_flow_count']will be 0 and lead to KeyError when tried to accessdata['flow_port_count']`.How did you do it?
For Issue#1:
hsflowd service is writing to
/etc/hsflowd.autoonce it's processed collector configuration. Thus waiting for collector info to be present in/etc/hsflowd.auto"seems to be safe option before proceeding with sflow traffic verfication.For Issue#2:
Fix is to have assert on
total_flow_countandtotal_counter_countbefore calling corresponding sample analyzer functions.How did you verify/test it?
Test is passing with the fixes.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation