Introduce keepalives for ZmqClient and ZmqServer#1162

Draft
prabhataravind wants to merge 2 commits intosonic-net:masterfrom
prabhataravind:paravind/zmq_keepalive_fix

@prabhataravind
Contributor

@prabhataravind prabhataravind commented Mar 23, 2026

Fixes issue: sonic-net/sonic-buildimage#23110

When a DPU is powered off and back on, the ZMQ client on the switch still holds a stale TCP connection. The first message sent after the DPU restarts goes out over the dead connection, is answered with a TCP RST, and is silently lost. ZMQ then auto-reconnects, so subsequent messages succeed.

This patch enables:

  1. TCP keepalive on ZmqClient PUSH sockets to detect dead connections proactively (within ~8 seconds of the peer going down).
  2. TCP keepalive on ZmqServer PULL sockets as defense-in-depth.

With these changes, after DPU power-off:

  • TCP keepalive probes will fail, causing ZMQ to tear down the stale connection and reconnect.
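
For readers unfamiliar with TCP keepalive, here is a minimal sketch of what enabling it looks like at the OS level and where a figure such as "~8 seconds" can come from. The libzmq `ZMQ_TCP_KEEPALIVE*` socket options override the `SO_KEEPALIVE`, `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options used below. The `KeepaliveParams` struct, both function names, and the 5/1/3 values in the usage note are assumptions made for illustration. The values actually chosen by this patch are not shown in this thread. The option names are the Linux ones.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Hypothetical parameter bundle; not taken from the patch.
struct KeepaliveParams {
    int idleSec;   // idle time before the first probe is sent (TCP_KEEPIDLE)
    int intvlSec;  // gap between successive unanswered probes (TCP_KEEPINTVL)
    int probeCnt;  // unanswered probes before the connection is dropped (TCP_KEEPCNT)
};

// Rough upper bound on the time between the peer going silent and the
// kernel declaring the connection dead: idle time plus all probe intervals.
int worstCaseDetectionSeconds(const KeepaliveParams& p) {
    return p.idleSec + p.intvlSec * p.probeCnt;
}

// Enables keepalive on an existing TCP socket. Returns true only if all
// four options were applied.
bool enableKeepalive(int fd, const KeepaliveParams& p) {
    assert(p.idleSec > 0 && p.intvlSec > 0 && p.probeCnt > 0);
    const int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == 0
        && setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &p.idleSec, sizeof(p.idleSec)) == 0
        && setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &p.intvlSec, sizeof(p.intvlSec)) == 0
        && setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &p.probeCnt, sizeof(p.probeCnt)) == 0;
}
```

As one possible combination, `KeepaliveParams{5, 1, 3}` gives `worstCaseDetectionSeconds` of 5 + 1 * 3 = 8. Other combinations reach the same total, so this should not be read as the patch's configuration.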

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@prabhataravind prabhataravind force-pushed the paravind/zmq_keepalive_fix branch from 653fed6 to 9c68d7c March 23, 2026 18:51

Fixes issue: sonic-net/sonic-buildimage#23110

When a DPU is powered off and back on, the ZMQ client on the switch still
holds a stale TCP connection. The first message sent after DPU restart is
delivered over the dead connection, gets a TCP RST, and is silently lost.
ZMQ then auto-reconnects, so subsequent messages succeed.

This patch enables:
1. TCP keepalive on ZmqClient PUSH sockets to detect dead connections
   proactively (within ~8 seconds of peer going down).
2. ZMQ_IMMEDIATE on ZmqClient PUSH sockets to prevent queueing messages
   to peers whose underlying TCP connection is not yet completed.
3. TCP keepalive on ZmqServer PULL sockets as defense-in-depth.

With these changes, after DPU power-off:
- TCP keepalive probes will fail, causing ZMQ to tear down the stale
  connection and reconnect
- ZMQ_IMMEDIATE prevents the first message from being queued to a peer
  with an incomplete connection, so it stays in the send queue until
  the reconnection completes

Signed-off-by: Prabhat Aravind <[email protected]>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

2 participants