Fix DPU restart message drop by Zmq lazy bind. #3837

Merged
prsunny merged 4 commits into sonic-net:master from liuh-80:dev/liuh/zmq_lazy_bind
Aug 26, 2025

@liuh-80 liuh-80 commented Aug 19, 2025

Fix DPU restart message drop by Zmq lazy bind.

Why I did it

Fix issue:
sonic-net/sonic-buildimage#23110

When creating a ZmqServer followed by a ZmqProducerStateTable, there may be a time gap between the server starting to receive messages and the producer state table registering its handler.

This gap can lead to dropped messages. To avoid this, use lazy binding and invoke bind() only after the handler is registered.
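
To make the gap concrete, here is a minimal self-contained sketch. `ToyServer`, `simulateStartup`, and the table name `EXAMPLE_TABLE` are hypothetical stand-ins and are not the swss-common API. The sketch also assumes that a message sent while the endpoint is unbound waits on the sender side until `bind()` is called.

```cpp
#include <deque>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a ZMQ server; this is NOT the swss-common ZmqServer API.
class ToyServer {
public:
    using Handler = std::function<void(const std::string&)>;

    // With lazyBind == false the server accepts messages immediately.
    explicit ToyServer(bool lazyBind) : m_bound(!lazyBind) {}

    void registerHandler(const std::string& table, Handler handler) {
        m_handlers[table] = std::move(handler);
    }

    // Start accepting messages, then deliver whatever peers queued meanwhile.
    void bind() {
        m_bound = true;
        while (!m_queued.empty()) {
            std::pair<std::string, std::string> msg = std::move(m_queued.front());
            m_queued.pop_front();
            dispatch(msg.first, msg.second);
        }
    }

    // Models a peer sending `payload` for `table`.
    // Assumption: while the endpoint is unbound, the message waits on the sender side.
    void send(const std::string& table, const std::string& payload) {
        if (m_bound) {
            dispatch(table, payload);
        } else {
            m_queued.emplace_back(table, payload);
        }
    }

    int dropped() const { return m_dropped; }

private:
    // A message for a table without a registered handler is counted as dropped.
    void dispatch(const std::string& table, const std::string& payload) {
        auto it = m_handlers.find(table);
        if (it == m_handlers.end()) {
            ++m_dropped;
            return;
        }
        it->second(payload);
    }

    bool m_bound;
    int m_dropped = 0;
    std::map<std::string, Handler> m_handlers;
    std::deque<std::pair<std::string, std::string>> m_queued;
};

struct StartupResult {
    int received;
    int dropped;
};

// One message arrives between server creation and handler registration.
StartupResult simulateStartup(bool lazyBind) {
    ToyServer server(lazyBind);
    int received = 0;
    server.send("EXAMPLE_TABLE", "early message");
    server.registerHandler("EXAMPLE_TABLE",
                           [&received](const std::string&) { ++received; });
    if (lazyBind) {
        server.bind();
    }
    return {received, server.dropped()};
}
```

In this model, `simulateStartup(false)` loses the early message because no handler exists when it arrives, while `simulateStartup(true)` delivers it once `bind()` runs after registration.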

How I did it

Update orchagent to use lazy binding for ZMQ.
The ZmqServer in orchagent is now created with lazy bind, and the bind() call is deferred until all ZmqProducerStateTable instances have been initialized. This ensures handlers are registered before any messages are received, preventing potential data loss during startup.
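
The ordering described above can be sketched as follows. This is only an illustration of the call order: `RecordingServer`, `ExampleTable`, `startWithDeferredBind`, and the table names are hypothetical, and the sketch assumes each table registers its handler in its constructor.

```cpp
#include <string>
#include <vector>

// Hypothetical server that only records the order of calls made on it.
class RecordingServer {
public:
    void registerHandler(const std::string& table) {
        m_events.push_back("register:" + table);
    }
    void bind() { m_events.push_back("bind"); }
    const std::vector<std::string>& events() const { return m_events; }

private:
    std::vector<std::string> m_events;
};

// Hypothetical table; assumed to register its handler on construction.
class ExampleTable {
public:
    ExampleTable(RecordingServer& server, const std::string& name) {
        server.registerHandler(name);
    }
};

// Create the server unbound, build every table, and bind only at the end.
std::vector<std::string> startWithDeferredBind(
        const std::vector<std::string>& tableNames) {
    RecordingServer server;  // nothing is bound at this point
    std::vector<ExampleTable> tables;
    tables.reserve(tableNames.size());
    for (const auto& name : tableNames) {
        tables.emplace_back(server, name);
    }
    server.bind();  // single bind after all handlers are registered
    return server.events();
}
```

Calling `startWithDeferredBind({"TABLE_A", "TABLE_B"})` yields a log in which both registrations precede the single `bind` entry, which is the invariant the change relies on.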

This PR depends on sonic-net/sonic-swss-common#1068

Work item tracking
  • Microsoft ADO: 33995986

How to verify it

Pass all test cases.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog

Fix DPU restart message drop by Zmq lazy bind.

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@liuh-80
Contributor Author

liuh-80 commented Aug 21, 2025

The swss-common PR is merged.
Will build again when the new artifact is ready.

@liuh-80
Contributor Author

liuh-80 commented Aug 21, 2025

/azpw run Azure.sonic-swss

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-swss

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@liuh-80
Contributor Author

liuh-80 commented Aug 25, 2025

/azpw run Azure.sonic-swss

@liuh-80 liuh-80 marked this pull request as ready for review August 25, 2025 08:19
@liuh-80 liuh-80 requested a review from prsunny as a code owner August 25, 2025 08:19

@prabhataravind prabhataravind left a comment

lgtm

@prabhataravind prabhataravind requested a review from Copilot August 26, 2025 03:40

Copilot AI left a comment

Pull Request Overview

This PR fixes message drops during DPU restart by implementing lazy binding for ZMQ servers. Instead of binding immediately when the ZmqServer is created, the binding is deferred until after message handlers are registered, preventing messages from being lost during the initialization gap.

  • Introduces lazy binding for ZmqServer instances by passing a true parameter to the constructor
  • Updates the main orchestration agent to call bind() after all handlers are registered
  • Modifies unit tests to explicitly call bind() after registering handlers

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| lib/orch_zmq_config.cpp | Enables lazy binding by passing true to ZmqServer constructor |
| orchagent/main.cpp | Adds explicit bind() call after handler registration with logging |
| tests/mock_tests/zmq_orch_ut.cpp | Updates unit test to call bind() after registering message handler |

@prsunny prsunny merged commit f9bf770 into sonic-net:master Aug 26, 2025
15 checks passed
a114j0y pushed a commit to a114j0y/sonic-swss that referenced this pull request Aug 28, 2025
a114j0y added a commit to a114j0y/sonic-swss that referenced this pull request Aug 28, 2025
* upstream/master:
  [ssw][ha] set `SAI_HA_SCOPE_ATTR_ADMIN_STATE`   (sonic-net#3841)
  Fix DPU restart message drop by Zmq lazy bind. (sonic-net#3837)
  [ssw][ha] consume new ha_scope fields (sonic-net#3825)
  Add PFC historical statistics estimation to the PFCWD Orch (sonic-net#3533)
a114j0y pushed a commit to a114j0y/sonic-swss that referenced this pull request Aug 29, 2025
@mssonicbld
Collaborator

Cherry-pick PR to msft-202506: Azure/sonic-swss.msft#144

Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025
balanokia pushed a commit to balanokia/sonic-swss that referenced this pull request Nov 17, 2025
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Feb 4, 2026
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
baorliu pushed a commit to baorliu/sonic-swss that referenced this pull request Feb 23, 2026
Signed-off-by: Baorong Liu <96146196+baorliu@users.noreply.github.com>