Improve connection establishment. by gmartin82 · Pull Request #1187 · eclipse-zenoh/zenoh-pico

gmartin82 · 2026-03-16T18:20:33Z

Description

This PR updates z_open() connection-establishment behaviour so configured listen/connect locators can be retried instead of being attempted only once.

It adds independent retry timeout and failure-policy configuration for listen and connect locators, and clarifies peer-mode behaviour when a session can be opened with only partial peer connectivity.

What does this PR do?

Adds unstable connect configuration options:
- Z_CONFIG_CONNECT_TIMEOUT_KEY
- Z_CONFIG_CONNECT_EXIT_ON_FAILURE_KEY
Adds unstable listen configuration options:
- Z_CONFIG_LISTEN_TIMEOUT_KEY
- Z_CONFIG_LISTEN_EXIT_ON_FAILURE_KEY
The new connect/listen timeout and exit-on-failure configuration keys are unstable and are only exposed when Z_FEATURE_UNSTABLE_API is enabled. The corresponding default string constants remain available so the implementation can keep default behaviour consistent when the unstable keys are not enabled.
Timeout values support:
- 0: no retry; attempt once.
- >0: retry retryable failures until the timeout expires.
- -1: retry indefinitely.
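
The three timeout cases can be sketched as a single predicate. This is an illustrative model, not the code in this PR: `retry_allowed` is a hypothetical name, and treating the timeout as a millisecond count is an assumption that the description does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: may another attempt be made after a retryable failure?
 * Assumes the timeout is expressed in milliseconds. */
bool retry_allowed(int32_t timeout_ms, uint32_t elapsed_ms) {
    assert(timeout_ms >= -1); /* only -1, 0 and positive values are defined */
    if (timeout_ms == -1) {
        return true; /* retry indefinitely */
    }
    if (timeout_ms == 0) {
        return false; /* attempt once, never retry */
    }
    return elapsed_ms < (uint32_t)timeout_ms; /* retry until the timeout expires */
}
```

With a timeout of 500, a failure at 499 ms elapsed would still be retried, while one at 500 ms would not.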
Default behaviour:
- Connect timeout defaults to 0.
- Listen timeout defaults to 0.
- Client connect exit-on-failure defaults to true.
- Peer connect exit-on-failure defaults to false.
- Listen exit-on-failure defaults to true.
Client mode:
- Only connect locators are used.
- Configured connect locators are treated as alternatives.
- z_open() succeeds when one connect locator succeeds.
- z_open() fails if no connect locator can establish a transport.
- connect_exit_on_failure has no effect in client mode, as a session cannot be opened without a transport.
Peer mode:
- z_open() requires a primary transport before returning successfully.
- The primary transport can be established either by:
  - opening the configured listen locator, or
  - connecting to one configured connect locator.
- If a listen locator is configured, it is attempted first.
- If listening fails and listen_exit_on_failure=false, z_open() may still succeed by connecting to a configured connect locator.
- If no listen or connect locator establishes a primary transport, z_open() fails.
- Once the primary transport is established, remaining connect locators are added as additional peers.
- If connect_exit_on_failure=true, z_open() fails unless every required peer connection is established within the configured timeout.
- If connect_exit_on_failure=false, z_open() may return successfully with partial connectivity and leave remaining retryable peer connections to be attempted:
  - synchronously (single-thread mode, when session tasks are progressed), or
  - asynchronously (multi-thread mode, via background transport tasks).
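
The peer-mode rules above can be modelled as a pure function over simulated outcomes. This is a sketch under stated assumptions: every name is hypothetical, `OPEN_OK_PARTIAL` exists only so the example can distinguish the best-effort case (z_open() itself simply returns successfully), and failing immediately when listening fails with listen_exit_on_failure=true is inferred from the rule for the false case rather than stated outright.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    OPEN_OK,                       /* primary transport and all peers established */
    OPEN_OK_PARTIAL,               /* best-effort: success with some peers still pending */
    OPEN_ERR_NO_PRIMARY,           /* no listen or connect locator produced a transport */
    OPEN_ERR_PARTIAL_CONNECTIVITY  /* strict policy and at least one peer missing */
} open_result_t;

typedef struct {
    bool listen_configured;
    bool listen_ok;               /* simulated outcome of opening the listen locator */
    bool listen_exit_on_failure;
    bool connect_exit_on_failure;
    const bool *connect_ok;       /* simulated outcome per connect locator */
    size_t connect_len;
} peer_open_input_t;

open_result_t peer_open_outcome(const peer_open_input_t *in) {
    assert(in != NULL);
    bool have_primary = false;
    size_t connected = 0u;

    /* A configured listen locator is attempted first. */
    if (in->listen_configured) {
        if (in->listen_ok) {
            have_primary = true;
        } else if (in->listen_exit_on_failure) {
            return OPEN_ERR_NO_PRIMARY; /* assumption: strict listen failure is fatal */
        }
    }

    /* Connect locators either provide the primary transport or join as extra peers. */
    for (size_t i = 0; i < in->connect_len; i++) {
        if (in->connect_ok[i]) {
            have_primary = true;
            connected++;
        }
    }

    if (!have_primary) {
        return OPEN_ERR_NO_PRIMARY;
    }
    if (connected == in->connect_len) {
        return OPEN_OK;
    }
    return in->connect_exit_on_failure ? OPEN_ERR_PARTIAL_CONNECTIVITY : OPEN_OK_PARTIAL;
}
```

The model ignores timing on purpose: each `connect_ok` entry stands for "established within the configured timeout".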
Retry semantics:
- Only retryable transport errors are retried.
- Non-retryable errors permanently exclude a locator from further attempts.
- Retryable failures use exponential backoff until timeout expires.
Adds _Z_ERR_TRANSPORT_OPEN_PARTIAL_CONNECTIVITY, returned when:
- a primary transport has been established, but
- required peer connectivity could not be completed under a strict (exit_on_failure=true) policy.
Adds tests covering:
- default open behaviour without unstable config keys
- connect timeout when no endpoint is available (Z_FEATURE_UNSTABLE_API)
- partial peer connectivity (Z_FEATURE_UNSTABLE_API)
- late-joining endpoints becoming available within the timeout (Z_FEATURE_UNSTABLE_API)
- behaviour differences between strict and best-effort failure policies (Z_FEATURE_UNSTABLE_API)

Why is this change needed?

Previously, configured listen/connect locators were attempted once during z_open(). This made session establishment sensitive to startup ordering: if a peer was not yet listening, or a local listen endpoint was temporarily unavailable, z_open() could fail immediately.

This change makes connection establishment more robust by allowing applications to choose between:

- existing no-retry behaviour,
- bounded retry,
- infinite retry,
- strict failure behaviour, and
- best-effort peer connectivity.

Related Issues

ZEN-838

github-advanced-security

Cppcheck (reported by Codacy) found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

DenisBiryukov91 · 2026-03-23T10:06:49Z

I believe peer connection would be better done as an asynchronous background task, based on #1190.

steils · 2026-04-01T11:25:32Z

+     * If listen locators are configured, one of them is used as the primary open.
+     * Otherwise one connect locator is opened with retry/backoff.
+     */
+    while (ret != _Z_RES_OK) {


This retry loop currently treats every ret != _Z_RES_OK as non-fatal and retriable. But _z_open_inner() can also return non-retriable errors such as invalid locator/schema or protocol/handshake failures. I think we should not retry fatal errors.

I believe locators should be validated in advance upon initialization in the config.
Everything else should probably be considered as retriable

As requested, the implementation only retries locators that are deemed to be retryable. The set of error codes considered retryable in _z_client_reopen_task_fn was used as a basis for this.

Note: Zenoh retries all locators without considering returned errors.

steils · 2026-04-01T11:27:23Z

+
+    uint32_t sleep_ms = _Z_OPEN_BACKOFF_MIN_MS;
+
+    while (pending_mask != 0u) {


Same as for :293. _z_new_peer() failures are always retried until timeout. Should we distinguish fatal and non-fatal errors in this loop?

sashacmc · 2026-04-03T18:23:17Z

@DenisBiryukov91 can we close this PR, or should it be updated after #1190?

gmartin82 · 2026-04-08T09:26:45Z

@sashacmc It needs to be rebased and updated to add background connection establishment.

                       const _z_config_t *session_cfg);
 void _z_free_transport(_z_transport_t **zt);

+#if Z_FEATURE_UNICAST_PEER == 1


DenisBiryukov91 · 2026-04-30T18:22:41Z

+        return _Z_ERR_CONFIG_LOCATOR_INVALID;
+    }
+
+    // This implementation uses a bitmask to track which connect locators remain retryable.


I believe that if we ever receive a non-retriable error (e.g. a TLS certificate issue; I cannot think of anything else if locators are validated in advance when they are inserted into the config), we should just terminate with an error, so using a mask seems unnecessary.

The ticket is clearly scoped to improving the connection establishment not the config.

Please raise a separate issue if you want locators validated on insertion as this is not in scope.

Still, this does not change the fact that we do not need to track the retriability of locators: if an error is not retriable we just abort everything, since the provided config is broken.

Validating locators at config insertion time is a separate behavioral change and is outside the scope of this PR. This PR is about improving connection establishment resilience for configured locators.

The retryability handling was added specifically to address the earlier review comment that the loop should not retry fatal open errors indefinitely. The mask is how we implement that per locator: retryable failures remain pending, non-retryable failures are removed from the pending set, and successful locators are removed as completed.

The set of retryable transport errors used here is aligned with the existing _z_client_reopen_task_fn() behavior. Whether a non-retryable locator failure should cause z_open() itself to fail is governed by the existing exit_on_failure options; when that option is false, one failed locator should not make every configured locator mandatory.
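
The per-locator bookkeeping described above can be illustrated with a small mask model. This is a sketch only: the names are hypothetical and the code is not taken from the PR. It shows the three transitions (retryable stays pending, non-retryable is excluded, success is completed) and derives the mask width from `CHAR_BIT` rather than a literal.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { PEER_OK, PEER_RETRYABLE, PEER_FATAL } peer_result_t;

/* Number of locators a 64-bit mask can track. */
#define MAX_TRACKED (sizeof(uint64_t) * CHAR_BIT)

/* Marks the first n locators as pending. */
uint64_t initial_pending_mask(size_t n) {
    assert(n <= MAX_TRACKED);
    return (n == MAX_TRACKED) ? UINT64_MAX : ((UINT64_C(1) << n) - 1u);
}

/* Applies the outcome of one attempt on locator idx to the pending set. */
uint64_t update_pending_mask(uint64_t pending, size_t idx, peer_result_t res) {
    assert(idx < MAX_TRACKED);
    if (res == PEER_RETRYABLE) {
        return pending; /* stays pending for the next round */
    }
    /* Success (completed) and fatal failure (excluded) both leave the set. */
    return pending & ~(UINT64_C(1) << idx);
}
```

The retry loop then runs while the mask is non-zero and the timeout has not expired; whether an excluded locator fails the whole open is a separate decision made by the exit_on_failure policy.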

exit_on_failure should only be applied to retriable failures (as is the case in zenoh-rust); it has nothing to do with non-retriable ones, which makes retry redundant.

When exit_on_failure is set, Zenoh exits on any error. The implementation matches this behaviour.

sashacmc · 2026-05-06T13:46:33Z

+
+The default listen timeout is `0`.
+
+`Z_CONFIG_LISTEN_EXIT_ON_FAILURE_KEY` accepts `true` or `false`.


Please mark new config parameters as unstable (add a comment), since I'm not sure that they will be maintained in the near future in the context of the upcoming changes in P2P

sashacmc · 2026-05-06T13:50:16Z

+    }
+
+    // This implementation uses a bitmask to track which connect locators remain retryable.
+    if (connect_len > (sizeof(uint64_t) * 8u)) {


What does 8u mean?
Please avoid any "magic numbers" in the code.

Use of magic numbers removed.

sashacmc · 2026-05-06T13:51:07Z

+
+    // This implementation uses a bitmask to track which connect locators remain retryable.
+    if (connect_len > (sizeof(uint64_t) * 8u)) {
+        _Z_ERROR("Too many connect locators configured");


Do we have a limit on the number of locators? What is it? Where is it configured?

Changed to use a dynamically sized svec of pending peers to avoid a limit.
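
For illustration, a growable pending list removes the 64-entry ceiling that a `uint64_t` mask imposes. The sketch below is a generic stand-in written for this example, not zenoh-pico's svec type: all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    size_t locator_idx; /* index of the connect locator still to be retried */
} pending_peer_t;

typedef struct {
    pending_peer_t *items;
    size_t len;
    size_t cap;
} pending_vec_t;

/* Appends a pending locator, doubling the capacity when full. */
bool pending_push(pending_vec_t *v, size_t locator_idx) {
    if (v->len == v->cap) {
        size_t new_cap = (v->cap == 0u) ? 4u : v->cap * 2u;
        pending_peer_t *p = realloc(v->items, new_cap * sizeof(pending_peer_t));
        if (p == NULL) {
            return false; /* allocation failed, list left unchanged */
        }
        v->items = p;
        v->cap = new_cap;
    }
    v->items[v->len++].locator_idx = locator_idx;
    return true;
}

/* Removes an entry once its locator has connected or failed fatally.
 * Swap-remove: the order of pending entries is not preserved. */
void pending_remove_at(pending_vec_t *v, size_t pos) {
    assert(pos < v->len);
    v->items[pos] = v->items[v->len - 1u];
    v->len--;
}

/* Releases the storage and resets the list to empty. */
void pending_clear(pending_vec_t *v) {
    free(v->items);
    v->items = NULL;
    v->len = 0u;
    v->cap = 0u;
}
```

The open is complete when the list is empty; entries that remain after a best-effort open are the ones a later task would keep retrying.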

sashacmc · 2026-05-06T13:54:40Z

-                if (ret != _Z_RES_OK) {
-                    break;
-                }
+    if (retry_mask != 0u) {


Frankly, the logic with the retry mask looks strange. In what case should we stop attempting to connect? If such a case arises, it shouldn't be hidden, but disclosed to the user. If this can be determined during the locator check, then it should be done there.

The implementation has been changed to use a new data structure to track pending peers.

The validation comment is not relevant to this discussion and is not part of the scope of the ticket. Locators are currently validated by lower-level code which I haven't changed. If you wish validation to be addressed differently, a separate ticket should be raised for the issue.

- document new open retry config options as unstable
- replace the fixed retry bitmask with per-locator pending peer state
- improve open retry and failure behaviour documentation
- gate open retry config behind Z_FEATURE_UNSTABLE_API
- cover unstable API builds in single-thread CI

- treat _Z_ERR_TRANSPORT_RX_DURATION_EXPIRED as retryable
- propagate interests/declarations to newly added peers
- dispatch connectivity events for dynamically added peers

gmartin82 · 2026-05-07T16:41:14Z

Note: the ESP-IDF build failure is unrelated to this PR and is fixed by #1218.

gmartin82 added the enhancement Existing things could work better label Mar 16, 2026

github-advanced-security AI found potential problems Mar 16, 2026

gmartin82 force-pushed the ZEN-838 branch 5 times, most recently from 3de806f to 3626566 Compare March 18, 2026 18:24

gmartin82 marked this pull request as ready for review March 18, 2026 18:27

gmartin82 requested review from DenisBiryukov91 and steils March 18, 2026 18:29

DenisBiryukov91 reviewed Mar 23, 2026

Comment thread include/zenoh-pico/api/types.h Outdated

steils suggested changes Apr 1, 2026

gmartin82 force-pushed the ZEN-838 branch 3 times, most recently from ddcb73c to b13c620 Compare April 13, 2026 16:24

gmartin82 marked this pull request as draft April 17, 2026 18:12

github-advanced-security AI found potential problems Apr 24, 2026

gmartin82 force-pushed the ZEN-838 branch 2 times, most recently from e23879f to d108d4d Compare April 29, 2026 18:02

github-advanced-security AI found potential problems Apr 29, 2026

Comment thread src/protocol/config.c Fixed

Comment thread src/protocol/config.c Fixed

gmartin82 force-pushed the ZEN-838 branch 4 times, most recently from f81f46b to 66d79f9 Compare April 30, 2026 10:34

gmartin82 added 3 commits April 30, 2026 12:18

Improve peer open connection establishment

c4a97a5

Document z_open connection retry configuration

d4324b3

Add z_open peer establishment tests

f184d87

Ensure transport reopen task takes account of new error.

1032c54

gmartin82 force-pushed the ZEN-838 branch from 66d79f9 to 1032c54 Compare April 30, 2026 11:22

Disable new timeout when testing max peer connections.

a35fac3

gmartin82 force-pushed the ZEN-838 branch 2 times, most recently from da614bb to 75fe43a Compare April 30, 2026 12:40

Change connect timeout default to 0 in both peer and client modes.

2cfa666

gmartin82 force-pushed the ZEN-838 branch from 75fe43a to 2cfa666 Compare April 30, 2026 13:11

gmartin82 marked this pull request as ready for review April 30, 2026 13:40

DenisBiryukov91 reviewed Apr 30, 2026

sashacmc requested changes May 6, 2026

gmartin82 force-pushed the ZEN-838 branch 6 times, most recently from bf935b8 to b1280a9 Compare May 6, 2026 16:51

gmartin82 added 2 commits May 7, 2026 13:21

Retry expired RX durations and sync new peer state

f1354dc

gmartin82 force-pushed the ZEN-838 branch 2 times, most recently from 498130d to f1354dc Compare May 7, 2026 16:35

gmartin82 requested a review from sashacmc May 7, 2026 17:03

