Bug: countersyncd: Netlink receive buffer overflows with very large HFT counter set #26246

@DavidZagury

Description

Is it platform specific

generic

Importance or Severity

Critical

Description of the bug

Under high HFT load with a very large number of counters (e.g. full queue counters, as in test_hft_full_queue_counters), countersyncd can hit ENOBUFS (“No buffer space available”) on the netlink socket.

The kernel reports that the netlink socket’s receive buffer is full (“No buffer space available”) and that it had to drop data, so the overflow is at the kernel-to-userspace boundary: samples are arriving faster than the process is reading them. The problem sits upstream of our own processing: we are not draining the socket often enough.

The code waits a fixed interval (e.g. 10 ms) before each socket readiness check, then performs a single read, so we attempt to drain the kernel buffer at most once per interval. HFT samples are also sent every 10 ms, and with a very large number of counters each 10 ms burst is large. If we read only once per 10 ms, the kernel can receive the next burst before we have read the previous one, so its buffer fills and we see the “buffer full” error. Because the comm stats show that downstream is keeping up, the only remaining explanation is that we are not draining the kernel buffer often enough for this data rate and burst size.

I created a debug build with the poll interval reduced to 5 ms, which made the issue disappear under the same load.

Steps to Reproduce

  1. Configure SONiC with HFT enabled and a profile that creates a very large number of counters (e.g. full queue counters, as covered by test_hft_full_queue_counters on a switch with 512 ports and 8 QUEUES defined per port), with a 10 ms sample interval.
  2. Start countersyncd with:
    countersyncd -e --stats-interval 10
  3. Run under steady traffic so HFT samples are sent every 10 ms with the full counter set.
  4. Observe logs for ENOBUFS.

Actual Behavior and Expected Behavior

Expected behavior
No ENOBUFS when running with a very large number of HFT counters (e.g. full queue counters as in test_hft_full_queue_counters), as long as the poll interval is appropriate for the data rate.

Actual behavior
With a very large number of counters, logs repeatedly show:

[WARN] Netlink receive buffer full (ENOBUFS). Consider increasing buffer size or processing messages faster. Error: Os { code: 105, kind: Uncategorized, message: "No buffer space available" }
Some HFT samples are dropped; stats can be incomplete or lag during bursts.
With a smaller number of counters, the same setup typically does not show ENOBUFS.

Relevant log output

With 10 ms poll interval (reproduces the bug):

[crates/countersyncd/src/actor/data_netlink.rs:892] [WARN] Netlink receive buffer full (ENOBUFS). Consider increasing buffer size or processing messages faster. Error: Os { code: 105, kind: Uncategorized, message: "No buffer space available" }
[2026-03-18 12:10:46.408] [crates/countersyncd/src/actor/data_netlink.rs:892] [WARN] Netlink receive buffer full (ENOBUFS). ...
[2026-03-18 12:10:46.508] [crates/countersyncd/src/actor/data_netlink.rs:892] [WARN] Netlink receive buffer full (ENOBUFS). ...

Comm stats (data path itself is not backlogged; bottleneck is kernel→userspace):

[crates/countersyncd/src/utilities/mod.rs:140] [INFO] Comm stats [data_netlink.ipfix_records]: count=47754, avg_len=0.00, peak_len=0, min_len=0, last_len=0, rms_len=0.00, nonzero_count=0, capacity=1024, avg_util=0.00, peak_util=0.00
[crates/countersyncd/src/utilities/mod.rs:140] [INFO] Comm stats [ipfix.stats_reporter]: count=95507, avg_len=0.50, peak_len=3, min_len=0, last_len=1, rms_len=0.71, nonzero_count=47741, capacity=1024, avg_util=0.00, peak_util=0.00
[crates/countersyncd/src/utilities/mod.rs:140] [INFO] Comm stats [control_netlink.data_netlink_cmd]: count=10, avg_len=0.00, peak_len=0, min_len=0, last_len=0, rms_len=0.00, nonzero_count=0, capacity=10, avg_util=0.00, peak_util=0.00

Output of show version, show techsupport

Attach files (if any)

No response
