Improve connection establishment. #1187

Open
gmartin82 wants to merge 8 commits into eclipse-zenoh:main from gmartin82:ZEN-838

Conversation

Contributor

@gmartin82 gmartin82 commented Mar 16, 2026

Description

This PR updates z_open() connection-establishment behaviour so configured listen/connect locators can be retried instead of being attempted only once.

It adds independent retry timeout and failure-policy configuration for listen and connect locators, and clarifies peer-mode behaviour when a session can be opened with only partial peer connectivity.

What does this PR do?

  • Adds unstable connect configuration options:

    • Z_CONFIG_CONNECT_TIMEOUT_KEY
    • Z_CONFIG_CONNECT_EXIT_ON_FAILURE_KEY
  • Adds unstable listen configuration options:

    • Z_CONFIG_LISTEN_TIMEOUT_KEY
    • Z_CONFIG_LISTEN_EXIT_ON_FAILURE_KEY
  • The new connect/listen timeout and exit-on-failure configuration keys are unstable and are only exposed when Z_FEATURE_UNSTABLE_API is enabled. The corresponding default string constants remain available so the implementation can keep default behaviour consistent when the unstable keys are not enabled.

  • Timeout values support:

    • 0: no retry; attempt once.
    • >0: retry retryable failures until the timeout expires.
    • -1: retry indefinitely.
  • Default behaviour:

    • Connect timeout defaults to 0.
    • Listen timeout defaults to 0.
    • Client connect exit-on-failure defaults to true.
    • Peer connect exit-on-failure defaults to false.
    • Listen exit-on-failure defaults to true.
  • Client mode:

    • Only connect locators are used.
    • Configured connect locators are treated as alternatives.
    • z_open() succeeds when one connect locator succeeds.
    • z_open() fails if no connect locator can establish a transport.
    • connect_exit_on_failure has no effect in client mode, as a session cannot be opened without a transport.
  • Peer mode:

    • z_open() requires a primary transport before returning successfully.
    • The primary transport can be established either by:
      • opening the configured listen locator, or
      • connecting to one configured connect locator.
    • If a listen locator is configured, it is attempted first.
    • If listening fails and listen_exit_on_failure=false, z_open() may still succeed by connecting to a configured connect locator.
    • If no listen or connect locator establishes a primary transport, z_open() fails.
    • Once the primary transport is established, remaining connect locators are added as additional peers.
    • If connect_exit_on_failure=true, z_open() fails unless all required peer connections are established within the configured timeout.
    • If connect_exit_on_failure=false, z_open() may return successfully with partial connectivity and leave remaining retryable peer connections to be attempted:
      • synchronously (single-thread mode, when session tasks are progressed), or
      • asynchronously (multi-thread mode, via background transport tasks).
  • Retry semantics:

    • Only retryable transport errors are retried.
    • Non-retryable errors permanently exclude a locator from further attempts.
    • Retryable failures use exponential backoff until timeout expires.
  • Adds _Z_ERR_TRANSPORT_OPEN_PARTIAL_CONNECTIVITY, returned when:

    • a primary transport has been established, but
    • required peer connectivity could not be completed under a strict (exit_on_failure=true) policy.
  • Adds tests covering:

    • default open behaviour without unstable config keys
    • connect timeout when no endpoint is available (Z_FEATURE_UNSTABLE_API)
    • partial peer connectivity (Z_FEATURE_UNSTABLE_API)
    • late-joining endpoints becoming available within the timeout (Z_FEATURE_UNSTABLE_API)
    • behaviour differences between strict and best-effort failure policies (Z_FEATURE_UNSTABLE_API)
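
The timeout and retry semantics listed above can be sketched as a small, deterministic retry loop. This is a minimal illustration under stated assumptions, not the PR's implementation: `open_with_retry`, `attempt_fn`, `late_peer_t`, and the backoff bounds are hypothetical names and values, and elapsed time is simulated by summing the backoff delays instead of sleeping.

```c
#include <stdint.h>

/* Hypothetical backoff bounds; the PR description does not state the real values. */
#define BACKOFF_MIN_MS 100u
#define BACKOFF_MAX_MS 3200u

typedef enum { ATTEMPT_OK, ATTEMPT_RETRYABLE, ATTEMPT_FATAL } attempt_result_t;

/* Hypothetical callback standing in for one open attempt on one locator. */
typedef attempt_result_t (*attempt_fn)(void *ctx);

/*
 * timeout_ms == 0 : attempt once, no retry
 * timeout_ms  > 0 : retry retryable failures until the timeout would be exceeded
 * timeout_ms  < 0 : retry indefinitely (the PR uses -1; any negative value
 *                   is treated the same way in this sketch)
 * Non-retryable (fatal) results always stop the loop immediately.
 */
static attempt_result_t open_with_retry(attempt_fn attempt, void *ctx,
                                        int32_t timeout_ms, uint32_t *attempts_out) {
    uint32_t sleep_ms = BACKOFF_MIN_MS;
    uint64_t elapsed_ms = 0;
    uint32_t attempts = 0;
    attempt_result_t res;

    for (;;) {
        res = attempt(ctx);
        attempts++;
        if (res != ATTEMPT_RETRYABLE) {
            break; /* success, or a fatal error that must not be retried */
        }
        if (timeout_ms == 0) {
            break; /* no-retry policy */
        }
        if (timeout_ms > 0 && elapsed_ms + sleep_ms > (uint64_t)timeout_ms) {
            break; /* the next wait would exceed the configured timeout */
        }
        elapsed_ms += sleep_ms; /* real code would sleep here */
        /* Exponential backoff, capped at BACKOFF_MAX_MS. */
        sleep_ms = (sleep_ms * 2u > BACKOFF_MAX_MS) ? BACKOFF_MAX_MS : sleep_ms * 2u;
    }
    *attempts_out = attempts;
    return res;
}

/* Test double: a late-joining peer that fails `remaining` times, then succeeds. */
typedef struct {
    uint32_t remaining;
} late_peer_t;

static attempt_result_t late_peer_attempt(void *ctx) {
    late_peer_t *p = (late_peer_t *)ctx;
    if (p->remaining == 0) {
        return ATTEMPT_OK;
    }
    p->remaining--;
    return ATTEMPT_RETRYABLE;
}
```

With a peer that becomes available on the fourth attempt, a timeout of 0 gives up after one attempt, while a sufficiently large or negative timeout eventually succeeds.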

Why is this change needed?

Previously, configured listen/connect locators were attempted once during z_open(). This made session establishment sensitive to startup ordering: if a peer was not yet listening, or a local listen endpoint was temporarily unavailable, z_open() could fail immediately.

This change makes connection establishment more robust by allowing applications to choose between:

  • existing no-retry behaviour,
  • bounded retry,
  • infinite retry,
  • strict failure behaviour, and
  • best-effort peer connectivity.
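
How the strict and best-effort policies could combine into a peer-mode z_open() outcome is sketched below. It is a simplified model, not the PR's code: the struct, enum, and function names are hypothetical, and the assumption that a failed listen with listen_exit_on_failure=true fails the open outright is inferred from the description rather than stated in it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome codes for illustration only. */
typedef enum {
    OPEN_OK,                       /* fully connected */
    OPEN_OK_PARTIAL,               /* best-effort: opened with partial connectivity */
    OPEN_ERR_LISTEN,               /* assumed: strict listen policy and listen failed */
    OPEN_ERR_NO_PRIMARY,           /* no listen or connect locator gave a transport */
    OPEN_ERR_PARTIAL_CONNECTIVITY  /* strict connect policy, some peers missing */
} open_result_t;

/* Hypothetical summary of what happened after all retries finished. */
typedef struct {
    bool listen_configured;
    bool listen_ok;
    bool listen_exit_on_failure;
    size_t connect_total; /* configured connect locators */
    size_t connect_ok;    /* locators connected within the timeout */
    bool connect_exit_on_failure;
} peer_open_outcome_t;

static open_result_t peer_open_result(const peer_open_outcome_t *o) {
    bool primary = false;

    /* A configured listen locator is attempted first. */
    if (o->listen_configured) {
        if (o->listen_ok) {
            primary = true;
        } else if (o->listen_exit_on_failure) {
            return OPEN_ERR_LISTEN;
        }
    }
    /* Otherwise the primary transport must come from a connect locator. */
    if (!primary && o->connect_ok == 0) {
        return OPEN_ERR_NO_PRIMARY;
    }
    /* A primary transport exists; check the remaining peer connectivity. */
    if (o->connect_ok < o->connect_total) {
        return o->connect_exit_on_failure ? OPEN_ERR_PARTIAL_CONNECTIVITY
                                          : OPEN_OK_PARTIAL;
    }
    return OPEN_OK;
}
```

In this model, OPEN_ERR_PARTIAL_CONNECTIVITY plays the role the description gives to _Z_ERR_TRANSPORT_OPEN_PARTIAL_CONNECTIVITY.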

Related Issues

ZEN-838


🏷️ Label-Based Checklist

Based on the labels applied to this PR, please complete these additional requirements:

Labels: enhancement

✨ Enhancement Requirements

Since this PR enhances existing functionality:

  • Enhancement scope documented - Clear description of what is being improved
  • Minimum necessary code - Implementation is as simple as possible, doesn't overcomplicate the system
  • Backwards compatible - Existing code/APIs still work unchanged
  • No new APIs added - Only improving existing functionality
  • Tests updated - Existing tests pass, new test cases added if needed
  • Performance improvement measured - If applicable, before/after metrics provided
  • Documentation updated - Existing docs updated to reflect improvements
  • User impact documented - How users benefit from this enhancement

Remember: Enhancements should not introduce new APIs or breaking changes.

Instructions:

  1. Check off items as you complete them (change - [ ] to - [x])
  2. The PR checklist CI will verify these are completed

This checklist updates automatically when labels change, but preserves your checked boxes.

@gmartin82 gmartin82 added the enhancement Existing things could work better label Mar 16, 2026


@github-advanced-security github-advanced-security AI left a comment

Cppcheck (reported by Codacy) found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@gmartin82 gmartin82 force-pushed the ZEN-838 branch 5 times, most recently from 3de806f to 3626566 March 18, 2026 18:24
@gmartin82 gmartin82 marked this pull request as ready for review March 18, 2026 18:27
@DenisBiryukov91
Contributor

I believe peer connection would be better done as an asynchronous background task based on #1190.

Comment thread include/zenoh-pico/api/types.h Outdated
Comment thread src/net/session.c
* If listen locators are configured, one of them is used as the primary open.
* Otherwise one connect locator is opened with retry/backoff.
*/
while (ret != _Z_RES_OK) {
Member

This retry loop currently treats every ret != _Z_RES_OK as non-fatal and retriable. But _z_open_inner() can also return non-retriable errors such as invalid locator/schema or protocol/handshake failures. I think we should not retry fatal errors.

Contributor

I believe locators should be validated in advance, upon initialization in the config.
Everything else should probably be considered retriable.

Contributor Author

As requested, the implementation only retries locators that are deemed to be retryable. The set of error codes considered retryable in _z_client_reopen_task_fn was used as a basis for this.
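
A classification of this kind can be sketched as a single predicate over result codes. The codes below are hypothetical stand-ins chosen for illustration; the actual retryable set is the one taken from _z_client_reopen_task_fn and is not reproduced here.

```c
#include <stdbool.h>

/* Hypothetical result codes; these are NOT zenoh-pico's real error values. */
typedef enum {
    EX_OK = 0,
    EX_ERR_OPEN_FAILED,         /* e.g. the peer is not listening yet */
    EX_ERR_RX_DURATION_EXPIRED, /* e.g. no reply arrived in time */
    EX_ERR_LOCATOR_INVALID,     /* malformed locator: retrying cannot help */
    EX_ERR_HANDSHAKE_REJECTED   /* protocol-level refusal */
} ex_result_t;

/* Returns true only for failures that may succeed on a later attempt. */
static bool ex_is_retryable(ex_result_t r) {
    switch (r) {
        case EX_ERR_OPEN_FAILED:
        case EX_ERR_RX_DURATION_EXPIRED:
            return true;
        default:
            /* Success and every unlisted error are treated as non-retryable,
             * so a newly added error code is never retried by accident. */
            return false;
    }
}
```

Defaulting to non-retryable keeps the retry loop conservative: an error has to be listed explicitly before it can extend the open attempt.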

Note: Zenoh retries all locators without considering returned errors.

Comment thread src/net/session.c Outdated

uint32_t sleep_ms = _Z_OPEN_BACKOFF_MIN_MS;

while (pending_mask != 0u) {
Member

@steils steils Apr 1, 2026

Same as for :293. _z_new_peer() failures are always retried until timeout. Should we distinguish fatal and non-fatal errors in this loop?

Contributor Author

Done.

Member

sashacmc commented Apr 3, 2026

@DenisBiryukov91 can we close this PR, or should it be updated after #1190?

@gmartin82
Contributor Author

@sashacmc It needs to be rebased and updated to add background connection establishment.

@gmartin82 gmartin82 force-pushed the ZEN-838 branch 3 times, most recently from ddcb73c to b13c620 April 13, 2026 16:24
@gmartin82 gmartin82 marked this pull request as draft April 17, 2026 18:12
Comment thread include/zenoh-pico/net/session.h Fixed
const _z_config_t *session_cfg);
void _z_free_transport(_z_transport_t **zt);

#if Z_FEATURE_UNICAST_PEER == 1
Comment thread src/transport/unicast/transport.c Fixed
Comment thread src/transport/unicast/transport.c Fixed
@gmartin82 gmartin82 force-pushed the ZEN-838 branch 2 times, most recently from e23879f to d108d4d April 29, 2026 18:02
Comment thread src/protocol/config.c Fixed
Comment thread src/protocol/config.c Fixed
@gmartin82 gmartin82 force-pushed the ZEN-838 branch 4 times, most recently from f81f46b to 66d79f9 April 30, 2026 10:34
@gmartin82 gmartin82 force-pushed the ZEN-838 branch 2 times, most recently from da614bb to 75fe43a April 30, 2026 12:40
@gmartin82 gmartin82 marked this pull request as ready for review April 30, 2026 13:40
Comment thread src/net/session.c Outdated
return _Z_ERR_CONFIG_LOCATOR_INVALID;
}

// This implementation uses a bitmask to track which connect locators remain retryable.
Contributor

I believe that if we ever receive a non-retriable error (e.g. a TLS certificate issue; I cannot think of anything else if locators are validated in advance, when they are inserted into the config), we should just terminate with an error, so using a mask seems unnecessary.

Contributor Author

The ticket is clearly scoped to improving connection establishment, not the config.

Please raise a separate issue if you want locators validated on insertion as this is not in scope.

Contributor

@DenisBiryukov91 DenisBiryukov91 Apr 30, 2026

Still, this does not change the fact that we do not need to track the retriability of locators: if an error is not retriable, we just abort everything, since the provided config is broken.

Contributor Author

Validating locators at config insertion time is a separate behavioral change and is outside the scope of this PR. This PR is about improving connection establishment resilience for configured locators.

The retryability handling was added specifically to address the earlier review comment that the loop should not retry fatal open errors indefinitely. The mask is how we implement that per locator: retryable failures remain pending, non-retryable failures are removed from the pending set, and successful locators are removed as completed.

The set of retryable transport errors used here is aligned with the existing _z_client_reopen_task_fn() behavior. Whether a non-retryable locator failure should cause z_open() itself to fail is governed by the existing exit_on_failure options; when that option is false, one failed locator should not make every configured locator mandatory.
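
The per-locator bookkeeping described above can be illustrated with an explicit state per locator instead of a bitmask. This is a sketch with hypothetical names (`loc_state_t`, `attempt_res_t`, `apply_pass`), not the code in src/net/session.c.

```c
#include <stddef.h>

/* Hypothetical per-locator state. */
typedef enum { LOC_PENDING, LOC_CONNECTED, LOC_EXCLUDED } loc_state_t;

/* Hypothetical result of one connection attempt. */
typedef enum { RES_OK, RES_RETRYABLE, RES_FATAL } attempt_res_t;

/*
 * Applies the results of one retry pass to the locators that are still pending:
 *   success           -> LOC_CONNECTED (completed, never attempted again)
 *   retryable failure -> stays LOC_PENDING
 *   fatal failure     -> LOC_EXCLUDED (permanently removed from the pending set)
 * Results for locators that are no longer pending are ignored.
 * Returns how many locators remain pending after the pass.
 */
static size_t apply_pass(loc_state_t *states, const attempt_res_t *results, size_t n) {
    size_t pending = 0;
    for (size_t i = 0; i < n; i++) {
        if (states[i] != LOC_PENDING) {
            continue;
        }
        switch (results[i]) {
            case RES_OK:
                states[i] = LOC_CONNECTED;
                break;
            case RES_FATAL:
                states[i] = LOC_EXCLUDED;
                break;
            case RES_RETRYABLE:
                pending++;
                break;
        }
    }
    return pending;
}
```

The retry loop ends when the returned count reaches zero or the timeout expires; whether excluded or still-pending locators then fail z_open() is a separate decision governed by exit_on_failure.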

Contributor

exit_on_failure should only be applied to retriable failures (as is the case in zenoh-rust); it has nothing to do with non-retriable ones, which makes retry redundant.

Contributor Author

When exit_on_failure is set, Zenoh exits on any error. The implementation matches this behaviour.

Comment thread docs/config.rst

The default listen timeout is `0`.

`Z_CONFIG_LISTEN_EXIT_ON_FAILURE_KEY` accepts `true` or `false`.
Member

Please mark the new config parameters as unstable (add a comment), since I'm not sure that they will be maintained in the near future in the context of the upcoming changes in P2P.

Contributor Author

Done.

Comment thread src/net/session.c Outdated
}

// This implementation uses a bitmask to track which connect locators remain retryable.
if (connect_len > (sizeof(uint64_t) * 8u)) {
Member

What does 8u mean?
Please avoid any "magic numbers" in the code.

Contributor Author

Use of magic numbers removed.
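
One conventional way to express the width of the mask without a literal 8 is CHAR_BIT from <limits.h>, sketched below. The macro and function names are hypothetical; the thread does not show how the number was actually removed.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of locators one uint64_t bitmask can track (one bit per locator). */
#define PENDING_MASK_BITS (sizeof(uint64_t) * CHAR_BIT)

/* Returns true if `connect_len` locators can each be given their own bit. */
static bool fits_in_pending_mask(size_t connect_len) {
    return connect_len <= PENDING_MASK_BITS;
}
```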

Comment thread src/net/session.c Outdated

// This implementation uses a bitmask to track which connect locators remain retryable.
if (connect_len > (sizeof(uint64_t) * 8u)) {
_Z_ERROR("Too many connect locators configured");
Member

Do we have a limit on the number of locators? What is it? Where is it configured?

Contributor Author

@gmartin82 gmartin82 May 7, 2026

Changed to use a dynamically sized svec of pending peers to avoid a limit.

Comment thread src/net/session.c Outdated
if (ret != _Z_RES_OK) {
    break;
}
if (retry_mask != 0u) {
Member

Frankly, the logic with the retry mask looks strange. In what case should we stop attempting to connect? If such a case arises, it shouldn't be hidden, but disclosed to the user. If this can be determined during the locator check, then it should be done there.

Contributor Author

The implementation has been changed to use a new data structure to track pending peers.

The validation comment is not relevant to this discussion and is not part of the scope of the ticket. Locators are currently validated by lower-level code which I haven't changed. If you wish validation to be addressed differently, a separate ticket should be raised for the issue.

@gmartin82 gmartin82 force-pushed the ZEN-838 branch 6 times, most recently from bf935b8 to b1280a9 May 6, 2026 16:51
gmartin82 added 2 commits May 7, 2026 13:21
- document new open retry config options as unstable
- replace the fixed retry bitmask with per-locator pending peer state
- improve open retry and failure behaviour documentation
- gate open retry config behind Z_FEATURE_UNSTABLE_API
- cover unstable API builds in single-thread CI
- treat _Z_ERR_TRANSPORT_RX_DURATION_EXPIRED as retryable
- propagate interests/declarations to newly added peers
- dispatch connectivity events for dynamically added peers
@gmartin82 gmartin82 force-pushed the ZEN-838 branch 2 times, most recently from 498130d to f1354dc May 7, 2026 16:35
@gmartin82
Contributor Author

Note: the ESP-IDF build failure is unrelated and is fixed by PR #1218.

@gmartin82 gmartin82 requested a review from sashacmc May 7, 2026 17:03

Labels

enhancement Existing things could work better

5 participants