Add lazy binding support to ZmqServer. #1068

Merged
qiluo-msft merged 9 commits into sonic-net:master from liuh-80:dev/liuh/fix_zmq_server_drop
Aug 21, 2025

@liuh-80
Contributor

@liuh-80 liuh-80 commented Aug 18, 2025

Add lazy binding support to ZmqServer.

Why I did it

When creating a ZmqServer followed by a ZmqProducerStateTable, there may be a time gap between the server starting to receive messages and the producer state table registering its handler.

This gap can lead to dropped messages. To avoid this, use lazy binding and invoke bind() only after the handler is registered.

How I did it

Add lazy binding support to ZmqServer.
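
The effect of deferring `bind()` can be illustrated with a small toy model. This is only a sketch and not the code from this PR: `ToyServer`, `droppedDuringStartup`, and the table name `EXAMPLE_TABLE` are hypothetical, no ZMQ sockets are involved, and the model assumes that a message sent to an endpoint that is not yet bound stays with the sender until the server binds.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only: this is NOT swss::ZmqServer and uses no ZMQ sockets.
class ToyServer
{
public:
    using Handler = std::function<void(const std::string&)>;

    // lazyBind == false: the server accepts messages as soon as it is constructed.
    // lazyBind == true:  the server accepts nothing until bind() is called.
    explicit ToyServer(bool lazyBind) : m_bound(!lazyBind) {}

    void registerHandler(const std::string& table, Handler handler)
    {
        m_handlers[table] = std::move(handler);
    }

    // Start accepting messages and process whatever the sender was holding.
    void bind()
    {
        m_bound = true;
        std::vector<std::pair<std::string, std::string>> held;
        held.swap(m_held);
        for (const auto& item : held)
        {
            dispatch(item.first, item.second);
        }
    }

    // Simulate a client sending one message to the given table.
    void send(const std::string& table, const std::string& message)
    {
        if (!m_bound)
        {
            // Assumption: an unbound endpoint leaves the message with the sender.
            m_held.emplace_back(table, message);
            return;
        }
        dispatch(table, message);
    }

    int delivered() const { return m_delivered; }
    int dropped() const { return m_dropped; }

private:
    void dispatch(const std::string& table, const std::string& message)
    {
        auto it = m_handlers.find(table);
        if (it == m_handlers.end())
        {
            ++m_dropped;   // no handler registered for this table yet
            return;
        }
        it->second(message);
        ++m_delivered;
    }

    bool m_bound;
    int m_delivered = 0;
    int m_dropped = 0;
    std::map<std::string, Handler> m_handlers;
    std::vector<std::pair<std::string, std::string>> m_held;
};

// Replays the same startup sequence and reports how many messages were lost.
int droppedDuringStartup(bool lazyBind)
{
    ToyServer server(lazyBind);
    server.send("EXAMPLE_TABLE", "sent before the handler exists");
    server.registerHandler("EXAMPLE_TABLE", [](const std::string&) {});
    if (lazyBind)
    {
        server.bind();   // deferred until the handler is in place
    }
    server.send("EXAMPLE_TABLE", "sent after the handler exists");
    return server.dropped();
}
```

In this model the eager server loses the first message because it arrives before any handler is registered, while the lazily bound server loses none.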

Work item tracking
  • Microsoft ADO: 33995986

How to verify it

Pass all test cases.
Add new test case for lazy binding.

Pass all sonic-mgmt tests with this PR:
sonic-net/sonic-buildimage#23741

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog

Add lazy binding support to ZmqServer.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

    SWSS_LOG_ENTER();
    if (m_socket)
    {
        SWSS_LOG_THROW("ZmqServer has already been bound to the endpoint: %s", m_endpoint.c_str());
Contributor

Do we need to treat this repeated bind as a no-op instead of throwing?

Contributor Author

A ZMQ server should only be bound once.
To catch potential issues during development, I recommend throwing an exception if a second bind attempt is detected. This will help identify incorrect usage early and ensure proper lifecycle management of the server.

Contributor

Gotcha! So it is an explicit lazy binding instead of an implicit one. Makes sense!
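
The bind-once behavior discussed in this thread can be sketched in isolation. This is a minimal illustration under stated assumptions rather than the PR's code: `BindOnceServer` and `secondBindThrows` are hypothetical names, a boolean flag stands in for the `m_socket` check, `std::logic_error` stands in for `SWSS_LOG_THROW`, and the endpoint string is only an example value.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a server that must be bound exactly once.
class BindOnceServer
{
public:
    explicit BindOnceServer(std::string endpoint) : m_endpoint(std::move(endpoint)) {}

    // Explicit lazy bind: the caller decides when to bind, and a second
    // call is reported as a usage error instead of being silently ignored.
    void bind()
    {
        if (m_bound)
        {
            throw std::logic_error("already bound to the endpoint: " + m_endpoint);
        }
        m_bound = true;
    }

    bool isBound() const { return m_bound; }

private:
    std::string m_endpoint;
    bool m_bound = false;
};

// Returns true if a second bind() call is rejected with an exception.
bool secondBindThrows()
{
    BindOnceServer server("tcp://127.0.0.1:5555");   // example endpoint only
    server.bind();
    try
    {
        server.bind();
    }
    catch (const std::logic_error&)
    {
        return true;
    }
    return false;
}
```

Throwing here surfaces a double bind during development, whereas a no-op would hide the incorrect call order.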

@qiluo-msft qiluo-msft merged commit 02a9ab4 into sonic-net:master Aug 21, 2025
15 checks passed
mssonicbld added a commit to mssonicbld/sonic-swss.msft that referenced this pull request Aug 30, 2025
Fix DPU restart message drop by Zmq lazy bind.

#### Why I did it
Fix issue:
sonic-net/sonic-buildimage#23110

When creating a ZmqServer followed by a ZmqProducerStateTable, there may be a time gap between the server starting to receive messages and the producer state table registering its handler.

This gap can lead to dropped messages. To avoid this, use lazy binding and invoke bind() only after the handler is registered.

#### How I did it
Update Orchagent to use lazy binding for ZMQ.
ZMQ is now created with lazy bind in Orchagent, and the bind() operation is deferred until all ZmqProducerStateTable instances have been initialized. This ensures handlers are registered before any messages are received, preventing potential data loss during startup.

This PR depends on sonic-net/sonic-swss-common#1068

##### Work item tracking
- Microsoft ADO: 33995986

#### How to verify it
Pass all test cases.

#### Which release branch to backport (provide reason below if selected)

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111

#### Description for the changelog
Fix DPU restart message drop by Zmq lazy bind.

#### Link to config_db schema for YANG module changes

#### A picture of a cute animal (not mandatory but encouraged)
mssonicbld added a commit to Azure/sonic-swss.msft that referenced this pull request Aug 30, 2025
Fix DPU restart message drop by Zmq lazy bind.

@mssonicbld
Collaborator

Cherry-pick PR to msft-202506: Azure/sonic-swss-common.msft#54
