Improve connection establishment.#1187
Conversation
There was a problem hiding this comment.
Cppcheck (reported by Codacy) found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
3de806f to
3626566
Compare
|
I believe peer connection is better to be done as asynchronous background task based on #1190. |
| * If listen locators are configured, one of them is used as the primary open. | ||
| * Otherwise one connect locator is opened with retry/backoff. | ||
| */ | ||
| while (ret != _Z_RES_OK) { |
There was a problem hiding this comment.
This retry loop currently treats every ret != _Z_RES_OK as non-fatal and retriable. But _z_open_inner() can also return non-retriable errors such as invalid locator/schema or protocol/handshake failures. I think we should not retry fatal errors.
There was a problem hiding this comment.
I believe locators should be validated in advance upon initialization in the config.
Everything else should probably be considered as retriable
There was a problem hiding this comment.
As requested, the implementation only retries locators that are deemed to be retryable. The set of error codes considered retryable in _z_client_reopen_task_fn was used as a basis for this.
Note: Zenoh retries all locators without considering returned errors.
|
|
||
| uint32_t sleep_ms = _Z_OPEN_BACKOFF_MIN_MS; | ||
|
|
||
| while (pending_mask != 0u) { |
There was a problem hiding this comment.
Same as for :293. _z_new_peer() failures are always retried until timeout. Should we distinguish fatal and non-fatal errors in this loop?
|
@DenisBiryukov91 can we close this PR, or should it be updated after the #1190 |
|
@sashacmc It needs to be rebased and updated to add background connection establishment. |
ddcb73c to
b13c620
Compare
| const _z_config_t *session_cfg); | ||
| void _z_free_transport(_z_transport_t **zt); | ||
|
|
||
| #if Z_FEATURE_UNICAST_PEER == 1 |
e23879f to
d108d4d
Compare
f81f46b to
66d79f9
Compare
da614bb to
75fe43a
Compare
| return _Z_ERR_CONFIG_LOCATOR_INVALID; | ||
| } | ||
|
|
||
| // This implementation uses a bitmask to track which connect locators remain retryable. |
There was a problem hiding this comment.
I believe if we ever receive a non-retriable error (i.e. tls certificate issue - I can not think of anything else, if locators are validated in advance when they are inserted in the config) we should just terminate with error and thus using mask seems to be unnecessary
There was a problem hiding this comment.
The ticket is clearly scoped to improving the connection establishment not the config.
Please raise a separate issue if you want locators validated on insertion as this is not in scope.
There was a problem hiding this comment.
Still this does not change the fact that we do not need to trace retriability of locators - if a error is not retriable we just abort everything - since the provided config is broken.
There was a problem hiding this comment.
Validating locators at config insertion time is a separate behavioral change and is outside the scope of this PR. This PR is about improving connection establishment resilience for configured locators.
The retryability handling was added specifically to address the earlier review comment that the loop should not retry fatal open errors indefinitely. The mask is how we implement that per locator: retryable failures remain pending, non-retryable failures are removed from the pending set, and successful locators are removed as completed.
The set of retryable transport errors used here is aligned with the existing _z_client_reopen_task_fn() behavior. Whether a non-retryable locator failure should cause z_open() itself to fail is governed by the existing exit_on_failure options; when that option is false, one failed locator should not make every configured locator mandatory.
There was a problem hiding this comment.
exit_on_failure should only be applied to retrievable failures (as it is the case zenoh-rust), it has nothing to do with non-retriable ones, which makes retry redundant.
There was a problem hiding this comment.
When exit_on_failure is set, Zenoh exits on any error. The implementation matches this behaviour.
|
|
||
| The default listen timeout is `0`. | ||
|
|
||
| `Z_CONFIG_LISTEN_EXIT_ON_FAILURE_KEY` accepts `true` or `false`. |
There was a problem hiding this comment.
Please mark new config parameters as unstable (add a comment), since I'm not sure that they will be maintained in the near future in the context of the upcoming changes in P2P
| } | ||
|
|
||
| // This implementation uses a bitmask to track which connect locators remain retryable. | ||
| if (connect_len > (sizeof(uint64_t) * 8u)) { |
There was a problem hiding this comment.
What does 8u means?
Please avoid any "magic numbers" in the code.
There was a problem hiding this comment.
Use of magic numbers removed.
|
|
||
| // This implementation uses a bitmask to track which connect locators remain retryable. | ||
| if (connect_len > (sizeof(uint64_t) * 8u)) { | ||
| _Z_ERROR("Too many connect locators configured"); |
There was a problem hiding this comment.
Do we have a limit on the number of locators? What is it? Where is it configured?
There was a problem hiding this comment.
Changed to use a dynamicly sized svec of pending peers to avoid a limit.
| if (ret != _Z_RES_OK) { | ||
| break; | ||
| } | ||
| if (retry_mask != 0u) { |
There was a problem hiding this comment.
Frankly, the logic with the retry mask looks strange. In what case should we stop attempting to connect? If such a case arises, it shouldn't be hidden, but disclosed to the user. If this can be determined during the locator check, then it should be done there.
There was a problem hiding this comment.
The implementation has been changed to use a new data structure to track pending peers.
The validation comment is not relevant to this discussion and is not part of the scope of the ticket. Locators are currently validated by lower-level code which I haven't changed. If you wish validation to be addressed differently, a separate ticket should be raised for the issue.
bf935b8 to
b1280a9
Compare
- document new open retry config options as unstable - replace the fixed retry bitmask with per-locator pending peer state - improve open retry and failure behaviour documentation - gate open retry config behind Z_FEATURE_UNSTABLE_API - cover unstable API builds in single-thread CI
- treat _Z_ERR_TRANSPORT_RX_DURATION_EXPIRED as retryable - propagate interests/declarations to newly added peers - dispatch connectivity events for dynamically added peers
498130d to
f1354dc
Compare
|
Note ESP-IDF build failure is unrelated and fixed by this PR: #1218 |
Description
This PR updates
z_open()connection-establishment behaviour so configured listen/connect locators can be retried instead of being attempted only once.It adds independent retry timeout and failure-policy configuration for listen and connect locators, and clarifies peer-mode behaviour when a session can be opened with only partial peer connectivity.
What does this PR do?
Adds unstable connect configuration options:
Z_CONFIG_CONNECT_TIMEOUT_KEYZ_CONFIG_CONNECT_EXIT_ON_FAILURE_KEYAdds unstable listen configuration options:
Z_CONFIG_LISTEN_TIMEOUT_KEYZ_CONFIG_LISTEN_EXIT_ON_FAILURE_KEYThe new connect/listen timeout and exit-on-failure configuration keys are unstable and are only exposed when
Z_FEATURE_UNSTABLE_APIis enabled. The corresponding default string constants remain available so the implementation can keep default behaviour consistent when the unstable keys are not enabled.Timeout values support:
0: no retry; attempt once.>0: retry retryable failures until the timeout expires.-1: retry indefinitely.Default behaviour:
0.0.true.false.true.Client mode:
z_open()succeeds when one connect locator succeeds.z_open()fails if no connect locator can establish a transport.connect_exit_on_failurehas no effect in client mode, as a session cannot be opened without a transport.Peer mode:
z_open()requires a primary transport before returning successfully.listen_exit_on_failure=false,z_open()may still succeed by connecting to a configured connect locator.z_open()fails.connect_exit_on_failure=true,z_open()fails if all required peer connections cannot be established within the configured timeout.connect_exit_on_failure=false,z_open()may return successfully with partial connectivity and leave remaining retryable peer connections to be attempted:Retry semantics:
Adds
_Z_ERR_TRANSPORT_OPEN_PARTIAL_CONNECTIVITY, returned when:exit_on_failure=true) policy.Adds tests covering:
Z_FEATURE_UNSTABLE_API)Z_FEATURE_UNSTABLE_API)Z_FEATURE_UNSTABLE_API)Z_FEATURE_UNSTABLE_API)Why is this change needed?
Previously, configured listen/connect locators were attempted once during
z_open(). This made session establishment sensitive to startup ordering: if a peer was not yet listening, or a local listen endpoint was temporarily unavailable,z_open()could fail immediately.This change makes connection establishment more robust by allowing applications to choose between:
Related Issues
ZEN-838
🏷️ Label-Based Checklist
Based on the labels applied to this PR, please complete these additional requirements:
Labels:
enhancement✨ Enhancement Requirements
Since this PR enhances existing functionality:
Remember: Enhancements should not introduce new APIs or breaking changes.
Instructions:
- [ ]to- [x])This checklist updates automatically when labels change, but preserves your checked boxes.