[sonic-mgmt] Fix sflow/test_sflow.py failures with expected sflow packets not received on collector interface #22186

Merged
StormLiangMS merged 4 commits into sonic-net:master from vkjammala-arista:fix-sflow-packet-failures
Mar 26, 2026

@vkjammala-arista
Contributor

Description of PR

Summary: Fix sflow/test_sflow.py failures with expected sflow packets not received on collector interface
Fixes #22180

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

Currently, a number of sflow/test_sflow.py test cases fail with the following signatures:

  1. AssertionError: False is not true : \.{2,}Packets are not received in active collector +,collector\d+

  2. AssertionError: False is not true : Expected Number of samples are not collected from Interface Ethernet\d+ in collector collector\d+ , Received \d+

  3. KeyError: 'flow_port_count'

Issue #1:
In some cases (such as when sflow config is enabled for the first time, or after a device reboot), the hsflowd daemon takes a little over 3 minutes (see the HLD) to fully initialize and process the collector config. During this window, the hsflowd service does not send sflow packets ('CounterSample', 'FlowSample', etc.) to the collector interface, so a test that expects sample packets in sflowtool can fail with the first two signatures above.

The hsflowd service writes to "/etc/hsflowd.auto" once it has processed the collector configuration. Waiting for the collector info to appear in "/etc/hsflowd.auto" therefore seems to be a safe option before proceeding with sflow traffic verification.

Issue #2:
If the test expects flow samples/packets on the collector interface but none are seen for some reason, we hit KeyError: 'flow_port_count'. Because counter samples are still seen on the collector interface, data['total_samples'] is non-zero, but data['total_flow_count'] is 0, which leads to a KeyError when the test tries to access data['flow_port_count'].

How did you do it?

For Issue#1:
The hsflowd service writes to /etc/hsflowd.auto once it has processed the collector configuration. The test therefore waits for the collector info to appear in /etc/hsflowd.auto before proceeding with sflow traffic verification.
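
A minimal sketch of this kind of readiness wait, for illustration only. It is not the code in this PR: `collector_configured` and `wait_for` are hypothetical names, the `collector=<ip>` entry format is assumed from the grep pattern used in the change, and the file content is passed in as a string rather than read from the DUT.

```python
import re
import time
from typing import Callable


def collector_configured(auto_text: str, collector_ip: str) -> bool:
    """Return True if the text has a 'collector=<ip>' entry (assumed format).

    The lookahead keeps 10.0.0.1 from matching a longer address
    such as 10.0.0.10.
    """
    pattern = r"collector=" + re.escape(collector_ip) + r"(?![\w.:])"
    return re.search(pattern, auto_text) is not None


def wait_for(
    predicate: Callable[[], bool],
    timeout: float,
    interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

In a real test the predicate would read /etc/hsflowd.auto from the sflow container. The `clock` and `sleep` parameters are injectable so the loop can be exercised without actually waiting.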

For Issue#2:
The fix is to assert on total_flow_count and total_counter_count before calling the corresponding sample analyzer functions.

How did you verify/test it?

Test is passing with the fixes.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes sFlow test failures caused by timing issues and assertion errors. The hsflowd daemon requires up to 3+ minutes to initialize and process collector configuration after being enabled or rebooted. Tests were failing because they attempted to verify sFlow traffic before the daemon was ready. Additionally, a KeyError would occur when flow samples weren't received but counter samples were.

Changes:

  • Added wait logic to ensure hsflowd has processed collector configuration before running PTF tests
  • Fixed assertion logic in PTF tests to check specific packet types (counter vs flow) instead of total samples

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
tests/sflow/test_sflow.py | Added verify_hsflowd_ready and wait_until_hsflowd_ready functions to wait for hsflowd initialization; updated partial_ptf_runner fixture to wait for each collector before starting PTF tests; added ast import for parsing collector list
ansible/roles/test/files/ptftests/py3/sflow_test.py | Fixed assertions to check total_counter_count for polling tests and total_flow_count for flow tests instead of total_samples, preventing KeyError when flow_port_count doesn't exist; improved error messages to be more specific

@StormLiangMS
Collaborator

StormLiangMS commented Feb 24, 2026

@vkjammala-arista

PR Review: Fix sflow/test_sflow.py failures

The PR addresses two independent issues: (1) a KeyError: flow_port_count crash in the PTF sflow test, and (2) flaky failures because hsflowd isn't fully initialized when traffic verification starts. Both root causes are correctly identified.


✅ Fix #1: sflow_test.py KeyError: flow_port_count (Correct)

The original guard data['total_samples'] != 0 before analyze_flow_sample() was insufficient — it allowed entry when only counter samples were received (total_samples > 0 but total_flow_count == 0), since flow_port_count is only populated when total_flow_count > 0:

if data['total_flow_count']:
    data['flow_port_count'] = Counter(...)  # not set if flow_count == 0

Changing the guard to data['total_flow_count'] != 0 is the correct fix. Similarly, guarding analyze_counter_sample() with total_counter_count != 0 is semantically accurate. Both changes are correct and minimal.
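
To make the failure mode concrete, here is a small self-contained sketch. The `summarize` helper and its (kind, port) input tuples are hypothetical stand-ins for the PTF test's parsing of sflowtool output; only the dictionary keys come from the discussion above.

```python
from collections import Counter


def summarize(samples):
    """Build a summary dict; `samples` is a list of (kind, port) tuples."""
    flow_ports = [port for kind, port in samples if kind == "flow"]
    counter_ports = [port for kind, port in samples if kind == "counter"]
    data = {
        "total_samples": len(samples),
        "total_flow_count": len(flow_ports),
        "total_counter_count": len(counter_ports),
    }
    # Mirrors the behaviour described above: the key is only set
    # when at least one flow sample was seen.
    if data["total_flow_count"]:
        data["flow_port_count"] = Counter(flow_ports)
    return data


def check_flow_samples(data):
    """Guard on the flow count, not on total_samples, before reading the key."""
    assert data["total_flow_count"] != 0, "no flow samples received"
    return data["flow_port_count"]


# Only counter samples arrived: total_samples is non-zero, yet the
# flow_port_count key is absent, so a total_samples guard would let a
# KeyError through while the total_flow_count guard fails cleanly.
only_counters = summarize([("counter", "Ethernet0"), ("counter", "Ethernet4")])
```

With `only_counters`, `check_flow_samples` raises an AssertionError with a readable message instead of a KeyError.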


🟡 Issue: ast.literal_eval on active_collectors is fragile

In partial_ptf_runner:

for collector in ast.literal_eval(kwargs['active_collectors']):
    wait_until_hsflowd_ready(duthost, var[collector]['ip_addr'])

If active_collectors is ever passed as an empty string "" (rather than "[]"), ast.literal_eval("") raises SyntaxError, because an empty string is not a valid expression, and crashes the runner for every test. Consider a safer parse that treats an empty or missing value as an empty list:

# Fall back to '[]' when the value is missing, None, or an empty string
collectors = kwargs.get('active_collectors') or '[]'
collector_list = ast.literal_eval(collectors) if isinstance(collectors, str) else collectors
for collector in collector_list:
    wait_until_hsflowd_ready(duthost, var[collector]['ip_addr'])

🟡 Issue: Sequential Wait for Each Collector Adds Unnecessary Delay

wait_until_hsflowd_ready is called sequentially for each active collector, each with a 240-second timeout. However, hsflowd processes all collectors at once from a single config file — once collector0 is ready, collector1 will typically also be ready immediately. The sequential approach adds no correctness benefit but could mislead on timing logs. Consider checking all collectors in a single wait_until call.
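
A rough sketch of what a single check over all collectors could look like. The function names are hypothetical, and the assumption that each entry looks like `collector=<ip>` followed by other fields is inferred from the grep pattern in the change rather than from the hsflowd documentation.

```python
import re


def configured_collectors(auto_text):
    """Return the set of addresses found on 'collector=' lines (assumed format)."""
    found = set()
    for line in auto_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key == "collector":
            # Take the leading run of address characters (IPv4 or IPv6).
            match = re.match(r"[0-9A-Fa-f.:]+", value)
            if match:
                found.add(match.group(0))
    return found


def missing_collectors(auto_text, collector_ips):
    """Return the expected collector IPs that are not configured yet."""
    found = configured_collectors(auto_text)
    return [ip for ip in collector_ips if ip not in found]
```

A readiness predicate can then be `not missing_collectors(text, ips)`, polled under one shared timeout, and the list of missing addresses can go into the failure message.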


🟡 Minor: start_time / elapsed Only Meaningful on Success Path

start_time = time.time()
pytest_assert(wait_until(...), "failed...")   # raises on timeout
elapsed = time.time() - start_time           # never reached on failure

On failure, elapsed time is never logged, though it would be the most useful debug info. Consider including it in the failure message:

pytest_assert(..., f"... within 240 seconds (elapsed: {time.time()-start_time:.1f}s)")
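
One way to do this, sketched with hypothetical names (`assert_ready` is not part of the PR, and `wait_fn` stands in for the `wait_until(...)` call), is to take the elapsed time before asserting so that it is available on both paths:

```python
import time


def assert_ready(wait_fn, timeout, clock=time.monotonic):
    """Run wait_fn, then assert on its result with the elapsed time attached."""
    start = clock()
    ok = wait_fn()
    elapsed = clock() - start
    assert ok, (
        f"hsflowd did not process collector config within {timeout}s "
        f"(elapsed: {elapsed:.1f}s)"
    )
    return elapsed
```

On success the caller can log the returned value; on failure the same number appears in the assertion message.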

🟡 Minor: Fixture Dependency Addition May Affect Test Collection

partial_ptf_runner now additionally depends on duthosts and rand_one_dut_hostname (typically module-scoped). This is valid in pytest, but worth noting — any test class using this fixture now implicitly depends on dut hostname selection, which could affect parametrization behavior if the fixture scope changes in the future.


🟡 Minor: Missing Docstring for wait_until_hsflowd_ready

verify_hsflowd_ready has a good docstring, but wait_until_hsflowd_ready has none. Since it contains retry logic and max timeout policy (240s), a docstring with Args/Raises would be helpful.


✅ What Looks Good

  • hsflowd readiness check approach is well-motivated: /etc/hsflowd.auto is explicitly documented in the SONiC HLD as the signal that hsflowd has processed collector config — the correct and canonical readiness indicator.
  • 240s timeout matches the HLD's documented 3+ minute initialization window.
  • Placement of the wait in partial_ptf_runner covers all callers uniformly without requiring per-test changes.
  • import ast correctly added.
  • Both fixes are targeted, minimal, and don't introduce new test logic.

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@vkjammala-arista
Contributor Author

Hi @StormLiangMS, I have addressed the comments.

Minor: start_time / elapsed Only Meaningful on Success Path
This doesn't make much sense: in the failure case we already know that hsflowd didn't process the collector config within 240 secs.

Minor: Fixture Dependency Addition May Affect Test Collection
Given that all the test cases depend on duthosts and rand_one_dut_hostname, I think it should be fine for partial_ptf_runner to depend on these fixtures. Otherwise we could have the test cases pass duthost to partial_ptf_runner, but that isn't really needed.

@vkjammala-arista force-pushed the fix-sflow-packet-failures branch from 772a1b7 to a78f760 on March 5, 2026 06:09

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@vkjammala-arista
Contributor Author

@StormLiangMS Can you please help in merging this PR

@anders-nexthop
Contributor

@vkjammala-arista for this change and for your other reviews, test_sflow.py is currently being skipped (which is counted as a pass) so none of your PRs have any actual coverage of the test. You would need to re-enable the test in order to verify through the community CI/CD that these PRs are actually addressing issues.

"""
return all(
duthost.shell(
f"docker exec sflow grep -q 'collector={ip}' /etc/hsflowd.auto 2>/dev/null",
Contributor

Have you verified that this works when we hit the case where SYSTEM_READY is not up yet? I'm pretty sure hsflowd.auto gets generated before the timeout for SYSTEM_READY runs, which is what we are trying to wait for here. There might be value in also checking for hsflowd to be up, but I don't think it solves that particular problem case.

Contributor Author

@vkjammala-arista vkjammala-arista Mar 12, 2026

From sflow HLD: hsflowd daemon will initialize only after receiving the SYSTEM_READY|SYSTEM_STATE Status=Up from SONiC. The system ready state however depends on all the monitored daemons to be ready. Failure on any of these, will result in system state to be down. In such scenarios, sflow will wait until 180 seconds and if system ready is not up will proceed with initialization

IIUC, hsflowd processes the collector configuration only after SYSTEM_READY is up (or the 180-second wait expires), which is why I added a 240-second wait (see wait_until_hsflowd_ready) for hsflowd to process the collector config (i.e. for hsflowd.auto to contain the collector_ip information).

And I'm not seeing any issues around this with this PR fix. Please let me know if my understanding isn't correct.

Contributor

@anders-nexthop anders-nexthop Mar 12, 2026

See my change #21195 which has a very similar hsflow_ready check, as well as an explicit SYSTEM_READY check. I found that 180s works fine for the timeout, is there a reason you went with 240s?

Contributor Author

@vkjammala-arista vkjammala-arista Mar 12, 2026

I'm not sure how helpful or reliable the SYSTEM_READY check is, and I don't know exactly how the sFlow/hsflowd process uses this SYSTEM_READY state. So, in my opinion, as far as the sFlow test is concerned, checking whether hsflowd has processed the collector IP configuration should be enough.

I have gone through your hsflow_ready check, and I see that you aren't checking whether the collector config has actually been processed (checking only whether the hsflowd process is up might not be sufficient in some scenarios).

is there a reason you went with 240s?
No specific reason; I just wanted to allow some extra delay beyond what the HLD states (i.e. >180 secs) to make the test more reliable.

Contributor

SYSTEM_READY is a top-level concept in SONiC, and is definitely something we can and should use if we need to.

For this case, though, I agree with you.

We can inspect the hsflowd source directly and see what it does. Here's the SYSTEM_READY check (from https://github.com/sflow/host-sflow/blob/05be6bd61bd9926cd8a0e0d837a69165fd8add71/src/Linux/mod_sonic.c#L2355):

    case HSP_SONIC_STATE_WAIT_READY:
      // all dbs connected - wait for SYSTEM_READY
      {
        time_t waiting = mdata->pollBus->now.tv_sec - mdata->waitReadyStart;
        if(waiting < sp->sonic.waitReady) {
          db_getSystemReady(mod);
        }
        else {
          EVDebug(mod, 1, "sonic: waitReady timeout after %ld seconds", (long)waiting);
          setSonicState(mod, HSP_SONIC_STATE_CONNECTED);
        }

db_getSystemReady(mod) checks for SYSTEM_READY|SYSTEM_STATE=UP, so waiting for SYSTEM_READY is an appropriate check. However, the collectors are not processed until hsflowd is in the CONNECTED state, which doesn't happen until hsflowd sees that the SYSTEM_READY check succeeds or times out (the timeout is 180s, which is where the 3-minute timeline in the HLD comes from) (https://github.com/sflow/host-sflow/blob/05be6bd61bd9926cd8a0e0d837a69165fd8add71/src/Linux/mod_sonic.c#L2374):

    case HSP_SONIC_STATE_CONNECTED:
      // connected and ready - learn config
      db_getMeta(mod);
      dbEvt_subscribe(mod);
      // the next steps read the starting agent/polling/collector
      // config. Any subsequent changes will be detected via dbEvt.
      setSonicState(mod, HSP_SONIC_STATE_SFLOWGLOBAL);
      break;

So, in this case, checking for collectors works as a stand-in for checking SYSTEM_READY, as long as we include a long enough timeout. I looked into different failure cases, but hsflowd truncates the .auto file when it starts, and we would only hit the SYSTEM_READY thing in the case of an hsflowd restart.
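
The transition described above can be modelled in a few lines. This is a simplified illustration of the behaviour as explained in this thread, not a translation of mod_sonic.c: the real module has more states and reads SYSTEM_READY asynchronously, and the names below (other than the 180 s and 240 s values quoted in the discussion) are hypothetical.

```python
from enum import Enum, auto

WAIT_READY_S = 180   # hsflowd's waitReady timeout, per the discussion above
TEST_TIMEOUT_S = 240  # the test-side wait chosen in this PR


class SonicState(Enum):
    WAIT_READY = auto()
    CONNECTED = auto()


def step_wait_ready(waiting_s, system_ready_up, wait_ready_s=WAIT_READY_S):
    """Return the next state for a daemon currently in WAIT_READY.

    It moves on when SYSTEM_READY is up or once the wait has timed out;
    collector config is only learned after reaching CONNECTED.
    """
    if system_ready_up or waiting_s >= wait_ready_s:
        return SonicState.CONNECTED
    return SonicState.WAIT_READY


# Margin the test leaves on top of the daemon's own worst-case wait.
margin_s = TEST_TIMEOUT_S - WAIT_READY_S
```

Under this model, a test-side timeout only works as a stand-in for a SYSTEM_READY check if it exceeds the daemon's own wait, which the 240 s value does by 60 s.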

I will update my PR, it has some further fixes that are still valuable (I reworked the PTF sample timing to catch intermittent cases where startup delay fluctuations can occasionally cause failures).

If you want to merge this one, can you get rid of the test skip and trigger the CI/CD?

Contributor Author

Thanks @anders-nexthop for the details and pointers to SYSTEM_READY checks. I have updated my PR to get rid of test skip.

@vkjammala-arista
Contributor Author

@vkjammala-arista for this change and for your other reviews, test_sflow.py is currently being skipped (which is counted as a pass) so none of your PRs have any actual coverage of the test. You would need to re-enable the test in order to verify through the community CI/CD that these PRs are actually addressing issues.

I'm not sure why this test is getting skipped in community CI/CD; in our testbeds these tests are running fine.

collected 15 items

sflow/test_sflow.py::TestSflowCollector::test_sflow_config PASSED                                                                                                                                              [  6%]
------------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------------
12/03/2026 02:11:44 memory_utilization._parse_threshold      L0161 WARNING| Percentage threshold outside normal range (0-100): 120.0%

sflow/test_sflow.py::TestSflowCollector::test_collector_del_add PASSED                                                                                                                                         [ 13%]
------------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------------
12/03/2026 02:15:49 memory_utilization._parse_threshold      L0161 WARNING| Percentage threshold outside normal range (0-100): 120.0%

sflow/test_sflow.py::TestSflowCollector::test_two_collectors PASSED                                                                                                                                            [ 20%]
------------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------------
12/03/2026 02:23:17 memory_utilization._parse_threshold      L0161 WARNING| Percentage threshold outside normal range (0-100): 120.0%

sflow/test_sflow.py::TestSflowPolling::testPolling PASSED                                                                                                                                                      [ 26%]
------------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------------
12/03/2026 02:25:23 memory_utilization._parse_threshold      L0161 WARNING| Percentage threshold outside normal range (0-100): 120.0%

sflow/test_sflow.py::TestSflowPolling::testDisablePolling PASSED                                                                                                                                               [ 33%]
------------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------------
12/03/2026 02:27:07 memory_utilization._parse_threshold      L0161 WARNING| Percentage threshold outside normal range (0-100): 120.0%

sflow/test_sflow.py::TestSflowPolling::testDifferentPollingInt PASSED                                                                                                                                          [ 40%]
------------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------------
12/03/2026 02:29:30 memory_utilization._parse_threshold      L0161 WARNING| Percentage threshold outside normal range (0-100): 120.0%

sflow/test_sflow.py::TestSflowInterface::testIntfRemoval PASSED                                                                                                                                                [ 46%]
------------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------------
12/03/2026 02:32:30 memory_utilization._parse_threshold      L0161 WARNING| Percentage threshold outside normal range (0-100): 120.0%

sflow/test_sflow.py::TestSflowInterface::testIntfSamplingRate PASSED                                                                                                                                           [ 53%]
------------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------------
12/03/2026 02:36:57 memory_utilization._parse_threshold      L0161 WARNING| Percentage threshold outside normal range (0-100): 120.0%

sflow/test_sflow.py::TestAgentId::testNonDefaultAgent PASSED                                                                                                                                                   [ 60%]
------------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------------
12/03/2026 02:39:10 memory_utilization._parse_threshold      L0161 WARNING| Percentage threshold outside normal range (0-100): 120.0%

sflow/test_sflow.py::TestAgentId::testDelAgent PASSED                                                                                                                                                          [ 66%]
------------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------------
12/03/2026 02:40:54 memory_utilization._parse_threshold      L0161 WARNING| Percentage threshold outside normal range (0-100): 120.0%

sflow/test_sflow.py::TestAgentId::testAddAgent PASSED                                                                                                                                                          [ 73%]
------------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------------
12/03/2026 02:42:35 memory_utilization._parse_threshold      L0161 WARNING| Percentage threshold outside normal range (0-100): 120.0%

sflow/test_sflow.py::TestReboot::testRebootSflowEnable PASSED                                                                                                                                                  [ 80%]
------------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------------
12/03/2026 02:51:02 memory_utilization._parse_threshold      L0161 WARNING| Percentage threshold outside normal range (0-100): 120.0%

sflow/test_sflow.py::TestReboot::testRebootSflowDisable PASSED                                                                                                                                                 [ 86%]
------------------------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------------------------
12/03/2026 02:58:38 memory_utilization._parse_threshold      L0161 WARNING| Percentage threshold outside normal range (0-100): 120.0%

12/03/2026 02:58:44 memory_utilization.check_memory_threshol L0064 WARNING| Skipping memory check for monit-memory_usage due to zero value

sflow/test_sflow.py::TestReboot::testFastreboot SKIPPED (Dualtor topology doesn't support advanced-reboot)                                                                                                     [ 93%]
sflow/test_sflow.py::TestReboot::testWarmreboot SKIPPED (Dualtor topology doesn't support advanced-reboot)                                                                                                     [100%]

@anders-nexthop
Contributor

Different topology maybe?

@vkjammala-arista
Contributor Author

vkjammala-arista commented Mar 12, 2026

@vkjammala-arista for this change and for your other reviews, test_sflow.py is currently being skipped (which is counted as a pass) so none of your PRs have any actual coverage of the test. You would need to re-enable the test in order to verify through the community CI/CD that these PRs are actually addressing issues.

I'm not sure why this test is getting skipped in community CI/CD, in our testbeds these tests are running fine.

I see the reason for the confusion now: the test is being skipped on master (but not on 202511) because of the changes in PR #21674.

  skip:
    reason: "The testcase is skipped due to github issue #21701"
    conditions:
      - "https://github.com/sonic-net/sonic-mgmt/issues/21701"

My test results are from the 202511 branch, so yes, we need to remove the above piece of code.

Contributor

@anders-nexthop anders-nexthop left a comment

Going through the test cases, I found a couple of cases where we aren't passing any collectors when setting up the config, which has the potential to hide issues, so we may want some form of system-ready check in a couple of spots. But overall this is good (I will layer my changes on top of this one).

Please make sure to revert the conditional skip though, and post a passing test result (the test results from the checks age out after a while, and since we are re-enabling this one after some time, it would be good to have a record that it did indeed pass successfully rather than just being skipped).

duthost, intf)
var['portmap'] = json.dumps(var['sflow_ports'])
ptfhost.copy(content=var['portmap'], dest="/tmp/sflow_ports.json")
partial_ptf_runner(
Contributor

This call is trying to make sure that hsflowd didn't come up after a reboot when it's not supposed to. But there might be a wait of up to 3 minutes for hsflowd to start (the SYSTEM_READY check). The way is_hsflowd_ready() is written, it won't wait at all for hsflowd in this case, since there are no collectors to check for. If we don't wait for hsflowd to be ready here, then we aren't really testing anything with this ptf_runner call. So we would either need to wait the 3 minutes, or check for SYSTEM_READY ourselves here before continuing.
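
One way to picture the second option is the sketch below. It is only an illustration: `hsflowd_ready`, `collectors_applied`, and `system_ready` are hypothetical names, not functions from this PR, and both probes are passed in as callables so the sketch makes no assumption about how SYSTEM_READY is actually read on the DUT.

```python
from typing import Callable, Sequence


def hsflowd_ready(
    collector_ips: Sequence[str],
    collectors_applied: Callable[[Sequence[str]], bool],
    system_ready: Callable[[], bool],
) -> bool:
    """Decide whether hsflowd can be treated as started (hypothetical helper).

    With collectors configured, readiness means hsflowd has applied them.
    With no collectors there is nothing to look for, so fall back to the
    system-ready probe instead of reporting ready immediately.
    """
    if collector_ips:
        return collectors_applied(collector_ips)
    return system_ready()


# Example: no collectors configured and the system is not ready yet,
# so the caller should keep waiting rather than run the PTF check.
still_waiting = not hsflowd_ready([], lambda ips: True, lambda: False)
```

With an empty collector list the function defers to the system-ready probe, so a caller can keep polling it instead of proceeding straight to the `partial_ptf_runner` call.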

Contributor Author

I see. Since you are planning a PR with some further fixes, do you want to take care of this? Otherwise, I will take care of this scenario in my follow-up review. Thanks!

Contributor

Yes I will update my PR to cover this case.

'show sflow')['stdout'])) == 0)
verify_show_sflow(duthost, status='up', collector=[])
wait_until(30, 5, 0, verify_sflow_config_apply, duthost)
partial_ptf_runner(
Contributor

Technically, if we ran this test case standalone, it could fail if we are stalled waiting for SYSTEM_READY. That's not a very likely case, as the test needs the collector to exist to be very meaningful, but it's still a consideration.

Collaborator

@StormLiangMS StormLiangMS left a comment

Review (for aristanetworks/sonic-qual.msft#1121)

LGTM. Good fix for the sflow reboot flakiness:

  • verify_hsflowd_ready / wait_until_hsflowd_ready properly waits for collector config in hsflowd.auto before running traffic verification.
  • PTF-side fix correctly distinguishes total_counter_count vs total_flow_count instead of using the ambiguous total_samples.
  • 240s timeout is generous but reasonable for reboot scenarios where hsflowd can take 3+ minutes to initialize.
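
The first and third points can be sketched roughly as follows. This is a standalone illustration, not the code from the PR: `collectors_in_auto_file` and `poll_until` are hypothetical helpers, the sample file contents are an assumed shape for `/etc/hsflowd.auto`, and `poll_until` merely stands in for the test framework's own wait helper.

```python
import time
from typing import Callable, Iterable


def collectors_in_auto_file(auto_text: str, collector_ips: Iterable[str]) -> bool:
    """Return True if every collector IP appears as a whole token in the text."""
    tokens = set()
    for line in auto_text.splitlines():
        # Treat '=' like whitespace so "collector=10.0.0.2 6343" yields the IP
        # as its own token; this avoids "10.0.0.2" matching "10.0.0.20".
        tokens.update(line.replace("=", " ").split())
    return all(ip in tokens for ip in collector_ips)


def poll_until(
    timeout_s: float,
    interval_s: float,
    condition: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call condition() every interval_s seconds until it is true or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Assumed example contents; the real file layout may differ.
SAMPLE_AUTO = "sampling=400\ncollector=10.0.0.2 6343\n"
```

In a test, the condition passed to `poll_until` would re-read the file from the DUT on each attempt, with a 240 second timeout as discussed above. Note that an empty collector list makes `collectors_in_auto_file` return True immediately, which is the no-collector case raised earlier in this review.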

@vkjammala-arista vkjammala-arista force-pushed the fix-sflow-packet-failures branch from 93372bf to d8dad4a Compare March 20, 2026 02:07
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…kets not received on collector interface

Issue #1:
In some cases (such as sflow config being enabled for the first time, or a
device reboot), the hsflowd daemon takes a little over 3 minutes to fully
initialize and process the collector config. During this window, hsflowd does
not send sflow packets ('CounterSample', 'FlowSample', etc.) to the collector
interface, so the test can fail with i) "Packets are not received in active
collector, collector\d+" and ii) "Expected Number of samples are not
collected from Interface Ethernet\d+ in collector collector\d+ , Received \d+"

hsflowd writes to "/etc/hsflowd.auto" once it has processed the
collector configuration. Waiting for the collector info to be present in
"/etc/hsflowd.auto" therefore seems to be a safe option before proceeding
with sflow traffic verification.

Issue #2:
If the test expects flow samples/packets on the collector interface but they aren't
seen for some reason, we hit "KeyError: 'flow_port_count'". Because counter
samples are seen on the collector interface, "data['total_samples']" is not
zero, but "data['total_flow_count']" is 0, which leads to a KeyError on access
to "data['flow_port_count']". The fix is to assert on "total_flow_count" and
"total_counter_count" before calling the corresponding sample analyze functions.
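
A rough sketch of the guard described for Issue #2, under stated assumptions: `check_and_analyze` is a hypothetical function written for illustration, not the PTF test code, and it only uses the dictionary keys named in this message.

```python
def check_and_analyze(data: dict, expect_flow: bool, expect_counter: bool) -> dict:
    """Assert on the per-type totals before touching per-type details."""
    results = {}
    if expect_flow:
        # Fail with a readable message instead of a KeyError further down.
        assert data.get("total_flow_count", 0) > 0, \
            "No flow samples received on collector"
        results["flow"] = data["flow_port_count"]
    if expect_counter:
        assert data.get("total_counter_count", 0) > 0, \
            "No counter samples received on collector"
        results["counter"] = data["total_counter_count"]
    return results


# Only counter samples arrived: total_samples is non-zero, but there is no
# flow data, so "flow_port_count" is absent from the dictionary.
counters_only = {"total_samples": 5, "total_counter_count": 5, "total_flow_count": 0}
```

Checking `total_samples` alone would let `counters_only` through to the flow analysis; checking `total_flow_count` turns that case into an assertion failure with a clear message.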

Signed-off-by: Vinod <vkjammala@arista.com>
1) Enhanced "wait_until_hsflowd_ready" to make it wait for all the
   collector IPs (instead of calling it sequentially for each IP)
2) Add docstring for "wait_until_hsflowd_ready" function
3) Updated "ast.literal_eval" usage to handle the case where
   "active_collectors" is passed as empty string ("" instead of "[]")
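
For item 3, a minimal sketch of the empty-string handling, assuming a hypothetical helper named `parse_collectors` (the PR's actual code may be structured differently). `ast.literal_eval("")` raises `SyntaxError`, so the empty case is handled before parsing.

```python
import ast


def parse_collectors(raw: str) -> list:
    """Parse a stringified list of collectors, treating "" like "[]"."""
    raw = (raw or "").strip()
    if not raw:
        # literal_eval cannot parse an empty string, so short-circuit here.
        return []
    return list(ast.literal_eval(raw))
```

With this, `parse_collectors("")` and `parse_collectors("[]")` both give an empty list, while a value such as `"['collector0']"` is parsed as usual.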

Signed-off-by: Vinod <vkjammala@arista.com>
Signed-off-by: Vinod <vkjammala@arista.com>
Signed-off-by: Vinod <vkjammala@arista.com>
@vkjammala-arista vkjammala-arista force-pushed the fix-sflow-packet-failures branch from a542928 to ddb7e49 Compare March 20, 2026 02:16
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Collaborator

@StormLiangMS StormLiangMS left a comment

LGTM

@StormLiangMS StormLiangMS merged commit 37981c7 into sonic-net:master Mar 26, 2026
16 checks passed
@StormLiangMS StormLiangMS added Request for 202511 branch Request to backport a change to 202511 branch Approved for 202511 branch labels Mar 26, 2026
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Mar 26, 2026
…kets not received on collector interface (sonic-net#22186)

* [sonic-mgmt] Fix sflow/test_sflow.py failures with expected sflow packets not received on collector interface

Issue #1:
In some cases (such as sflow config being enabled for the first time, or a
device reboot), the hsflowd daemon takes a little over 3 minutes to fully
initialize and process the collector config. During this window, hsflowd does
not send sflow packets ('CounterSample', 'FlowSample', etc.) to the collector
interface, so the test can fail with i) "Packets are not received in active
collector, collector\d+" and ii) "Expected Number of samples are not
collected from Interface Ethernet\d+ in collector collector\d+ , Received \d+"

hsflowd writes to "/etc/hsflowd.auto" once it has processed the
collector configuration. Waiting for the collector info to be present in
"/etc/hsflowd.auto" therefore seems to be a safe option before proceeding
with sflow traffic verification.

Issue #2:
If the test expects flow samples/packets on the collector interface but they aren't
seen for some reason, we hit "KeyError: 'flow_port_count'". Because counter
samples are seen on the collector interface, "data['total_samples']" is not
zero, but "data['total_flow_count']" is 0, which leads to a KeyError on access
to "data['flow_port_count']". The fix is to assert on "total_flow_count" and
"total_counter_count" before calling the corresponding sample analyze functions.

Signed-off-by: Vinod <vkjammala@arista.com>

* Addressing review comments

1) Enhanced "wait_until_hsflowd_ready" to make it wait for all the
   collector IPs (instead of calling it sequentially for each IP)
2) Add docstring for "wait_until_hsflowd_ready" function
3) Updated "ast.literal_eval" usage to handle the case where
   "active_collectors" is passed as empty string ("" instead of "[]")

Signed-off-by: Vinod <vkjammala@arista.com>

* Fix pre-commit check failures

Signed-off-by: Vinod <vkjammala@arista.com>

* Revert PR#21674 partially to enable "sflow/test_sflow.py" test

Signed-off-by: Vinod <vkjammala@arista.com>

---------

Signed-off-by: Vinod <vkjammala@arista.com>
Signed-off-by: mssonicbld <sonicbld@microsoft.com>
@mssonicbld
Collaborator

Cherry-pick PR to 202511: #23335

@StormLiangMS
Collaborator

@vkjammala-arista — CI is green and has 1 approval from @anders-nexthop. Needs additional approvals. Requesting @mramezani95 @xwjiang-ms @zypgithub @BYGX-wcr @yutongzhang-microsoft to please review. This fixes sflow packet reception, tracked in aristanetworks/sonic-qual.msft#1062.

mssonicbld added a commit that referenced this pull request Mar 26, 2026
…kets not received on collector interface (#22186) (#23335)

Signed-off-by: Vinod <vkjammala@arista.com>
Signed-off-by: mssonicbld <sonicbld@microsoft.com>
Co-authored-by: vkjammala-arista <152394203+vkjammala-arista@users.noreply.github.com>