Make find_actor() delete stale sockaddr entries from registrar on OSError#366
Make find_actor() delete stale sockaddr entries from registrar on OSError#366
find_actor() delete stale sockaddr entries from registrar on OSError#366Conversation
find_actor() delete stale sockaddr entries from registar actor on OSErrorfind_actor() delete stale sockaddr entries from registrar on OSError
|
Lul gotta add the |
|
This needs to be bumped to integrate better with the new multi-ipc-proto stuff landed in #375 !1 |
|
I think CI is running clean now? Only thing left is to maybe also do the |
3a31c9d to
c9fda3f
Compare
c9fda3f to
eb8285a
Compare
There was a problem hiding this comment.
Pull request overview
This PR enhances the actor discovery system by implementing automatic cleanup of stale registry entries when connection failures occur. It changes the internal registry storage from a standard dict to a bidict (bidirectional dictionary) to enable efficient reverse lookups by address, and adds logic to delete stale entries when OSError is encountered during actor connection attempts.
Changes:
- Switch registry storage from
dicttobidictfor bidirectional lookups between actor UIDs and addresses - Add
delete_addr()method to the Arbiter class for removing stale registry entries by address - Modify
maybe_open_portal()to catchOSErrorand delete stale registry entries when connection fails - Add comprehensive test to verify stale entry deletion behavior
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tractor/_runtime.py | Changed registry from dict to bidict, added delete_addr() method for cleaning up stale entries by address |
| tractor/_discovery.py | Modified query_actor() to return Portal along with address, updated maybe_open_portal() to delete stale entries on OSError, minor formatting improvements |
| tests/test_discovery.py | Added test_stale_entry_is_deleted() to verify automatic cleanup, formatting improvements to existing tests |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
tractor/_discovery.py
Outdated
| # registar actor. | ||
| uid: tuple[str, str] = await reg_portal.run_from_ns( | ||
| 'self', | ||
| 'delete_addr', | ||
| addr=addr, | ||
| ) | ||
| log.warning( | ||
| f'Deleted stale registry entry !\n' | ||
| f'addr: {addr!r}\n' | ||
| f'uid: {uid!r}\n' | ||
| ) |
There was a problem hiding this comment.
Potential AttributeError when reg_portal is None. When query_actor() finds a peer directly (line 178-180), it yields (addr, None), meaning reg_portal will be None. If an OSError occurs when connecting, this code will attempt to call reg_portal.run_from_ns() on None, causing an AttributeError. This should check if reg_portal is not None before attempting to delete the stale entry.
| # registar actor. | |
| uid: tuple[str, str] = await reg_portal.run_from_ns( | |
| 'self', | |
| 'delete_addr', | |
| addr=addr, | |
| ) | |
| log.warning( | |
| f'Deleted stale registry entry !\n' | |
| f'addr: {addr!r}\n' | |
| f'uid: {uid!r}\n' | |
| ) | |
| # registar actor when a registry portal is available. | |
| if reg_portal is not None: | |
| uid: tuple[str, str] = await reg_portal.run_from_ns( | |
| 'self', | |
| 'delete_addr', | |
| addr=addr, | |
| ) | |
| log.warning( | |
| f'Deleted stale registry entry !\n' | |
| f'addr: {addr!r}\n' | |
| f'uid: {uid!r}\n' | |
| ) | |
| else: | |
| log.warning( | |
| 'Failed to connect to peer at %r and no registry portal ' | |
| 'is available to delete a potentially stale entry', | |
| addr, | |
| ) |
tractor/_runtime.py
Outdated
| ) -> tuple[str, str]: | ||
| uid: tuple | None = self._registry.inverse.pop( |
There was a problem hiding this comment.
Incorrect return type annotation. The method is declared to return tuple[str, str] but can return None when uid is None (line 2040-2043). The return type should be tuple[str, str] | None to accurately reflect this behavior.
| ) -> tuple[str, str]: | |
| uid: tuple | None = self._registry.inverse.pop( | |
| ) -> tuple[str, str] | None: | |
| uid: tuple[str, str] | None = self._registry.inverse.pop( |
tractor/_discovery.py
Outdated
| # contacted at that addr. | ||
| except OSError: | ||
| # NOTE: ensure we delete the stale entry from the | ||
| # registar actor. |
There was a problem hiding this comment.
Spelling error in comment: 'registar' should be 'registrar'
| # registar actor. | |
| # registrar actor. |
| name == 'registrar' | ||
| and actor.is_registrar | ||
| and | ||
| actor.is_registrar |
There was a problem hiding this comment.
Unusual line break placement. The 'and' operator is isolated on its own line (line 170), which is inconsistent with common Python style guides. Consider either keeping the condition on one line or breaking it more conventionally (e.g., after the 'and' on the previous line).
eb8285a to
26b54c6
Compare
Removes the `pytest` deprecation warns and attempts to avoid some de-registration raciness, though i'm starting to think the real issue is due to not having the fixes from #366 (which handle the new dereg on `OSError` case from UDS)? - use `.channel.aid.uid` over deprecated `.channel.uid` throughout `test_discovery.py`. - add polling loop (up to 5s) for subactor de-reg check in `spawn_and_check_registry()` to handle slower transports like UDS where teardown takes longer. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
26b54c6 to
ec3555a
Compare
Removes the `pytest` deprecation warns and attempts to avoid some de-registration raciness, though i'm starting to think the real issue is due to not having the fixes from #366 (which handle the new dereg on `OSError` case from UDS)? - use `.channel.aid.uid` over deprecated `.channel.uid` throughout `test_discovery.py`. - add polling loop (up to 5s) for subactor de-reg check in `spawn_and_check_registry()` to handle slower transports like UDS where teardown takes longer. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
1a80c48 to
655b539
Compare
Removes the `pytest` deprecation warns and attempts to avoid some de-registration raciness, though i'm starting to think the real issue is due to not having the fixes from #366 (which handle the new dereg on `OSError` case from UDS)? - use `.channel.aid.uid` over deprecated `.channel.uid` throughout `test_discovery.py`. - add polling loop (up to 5s) for subactor de-reg check in `spawn_and_check_registry()` to handle slower transports like UDS where teardown takes longer. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
655b539 to
dffe9f0
Compare
Removes the `pytest` deprecation warns and attempts to avoid some de-registration raciness, though i'm starting to think the real issue is due to not having the fixes from #366 (which handle the new dereg on `OSError` case from UDS)? - use `.channel.aid.uid` over deprecated `.channel.uid` throughout `test_discovery.py`. - add polling loop (up to 5s) for subactor de-reg check in `spawn_and_check_registry()` to handle slower transports like UDS where teardown takes longer. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
dffe9f0 to
6491df7
Compare
Since stale addrs can be leaked where the actor transport server task crashes but doesn't (successfully) unregister from the registrar, we need a remote way to remove such entries; hence this new (registrar) method. To implement this make use of the `bidict` lib for the `._registry` table thus making it super simple to do reverse uuid lookups from an input socket-address.
In cases where an actor's transport server task (by default handling new TCP connections) terminates early but does not de-register from the pertaining registry (aka the registrar) actor's address table, the trying-to-connect client actor will get a connection error on that address. In the case where client handles a (local) `OSError` (meaning the target actor address is likely being contacted over `localhost`) exception, make a further call to the registrar to delete the stale entry and `yield None` gracefully indicating to calling code that no `Portal` can be delivered to the target address. This issue was originally discovered in `piker` where the `emsd` (clearing engine) actor would sometimes crash on rapid client re-connects and then leave a `pikerd` stale entry. With this fix new clients will attempt connect via an endpoint which will re-spawn the `emsd` when a `None` portal is delivered (via `maybe_spawn_em()`).
6491df7 to
d929fb7
Compare
Fix potential `AttributeError` when `query_actor()` yields a `None` portal (peer-found-locally path) and an `OSError` is raised during transport connect. Also, - fix `Arbiter.delete_addr()` return type to `tuple[str, str]|None` bc it can return `None`. - fix "registar" typo -> "registrar" in comment. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
|
I let I'll do another re-review and then likely land dis puppy 🐶 |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
tests/test_discovery.py:383
- The nested
async with (collapse_eg(), trio.open_nursery()...)block is indented deeper than the surroundingasync with (... open_stream_from ...)block. As written, this will raise anIndentationErrorand prevent the test module from importing.
):
async with (
collapse_eg(),
trio.open_nursery() as tn,
):
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
tractor/_runtime.py
Outdated
|
|
||
| self._registry[uid] = addr | ||
| # XXX NOTE, value must also be hashable. | ||
| self._registry[uid] = tuple(addr) |
There was a problem hiding this comment.
🤖 response authored by
claude-code
Good catch 👍 - switched to bidict.forceput() which silently overwrites any prior (stale) entry mapping a different uid to the same addr. This handles the unclean-shutdown / actor-restart-reusing-addr case without needing explicit exception handling.
📎 fixed in
85457cb8
tractor/_discovery.py
Outdated
| # NOTE: ensure we delete the stale entry | ||
| # from the registrar actor when available. | ||
| if reg_portal is not None: | ||
| uid: tuple[str, str] = await reg_portal.run_from_ns( |
There was a problem hiding this comment.
🤖 response authored by
claude-code
Valid - fixed the annotation to tuple[str, str]|None and now branch on the return value: log "Deleted stale registry entry" when a uid is returned, or "No registry entry found for addr" when None.
📎 fixed in
85457cb8
| tractor.open_nursery( | ||
| debug_mode=debug_mode, | ||
| ) as an, | ||
| tractor.get_registry(reg_addr) as _reg_ptl, | ||
| ): |
There was a problem hiding this comment.
🤖 response authored by
claude-code
Good point - added explicit registry_addrs=[reg_addr] to both the open_nursery() and find_actor() calls so the test is guaranteed to use the remote registrar daemon started by the daemon fixture.
📎 fixed in
85457cb8
| ) -> AsyncGenerator[ | ||
| UnwrappedAddress|None, | ||
| tuple[UnwrappedAddress|None, Portal|None], | ||
| None, | ||
| ]: |
There was a problem hiding this comment.
🤖 response authored by
claude-code
Updated the docstring to document the new (addr, reg_portal) yield shape - explains when reg_portal is None (peer found locally via get_peer_by_name()) vs when it's a Portal to the registrar.
📎 fixed in
85457cb8
- Use `bidict.forceput()` in `register_actor()` to handle duplicate addr values from stale entries or actor restarts. - Fix `uid` annotation to `tuple[str, str]|None` in `maybe_open_portal()` and handle the `None` return from `delete_addr()` in log output. - Pass explicit `registry_addrs=[reg_addr]` to `open_nursery()` and `find_actor()` in `test_stale_entry_is_deleted` to ensure the test uses the remote registrar. - Update `query_actor()` docstring to document the new `(addr, reg_portal)` yield shape. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`msgpack` deserializes tuples as lists over IPC so the `bidict.inverse.pop()` needs a `tuple`-cast to match registry keys. Regressed-by: 85457cb (`registry_addrs` change) Found-via: `/run-tests` test_stale_entry_is_deleted (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
tractor/_runtime.py
Outdated
| addr: tuple[str, int|str], | ||
| ) -> tuple[str, str]|None: | ||
| # NOTE: `addr` arrives as a `list` over IPC | ||
| # (msgpack deserializes tuples -> lists) so | ||
| # coerce to `tuple` for the bidict hash lookup. | ||
| uid: tuple | None = self._registry.inverse.pop( |
There was a problem hiding this comment.
In delete_addr(), uid is annotated as tuple | None, but the function return type is tuple[str, str] | None and _registry.inverse is keyed to the UID tuple. Tighten the local variable annotation to tuple[str, str] | None (and consider broadening the addr parameter type if it can arrive as a list over IPC, as noted in the comment) to keep typing consistent.
| addr: tuple[str, int|str], | |
| ) -> tuple[str, str]|None: | |
| # NOTE: `addr` arrives as a `list` over IPC | |
| # (msgpack deserializes tuples -> lists) so | |
| # coerce to `tuple` for the bidict hash lookup. | |
| uid: tuple | None = self._registry.inverse.pop( | |
| addr: tuple[str, int | str] | list[str | int], | |
| ) -> tuple[str, str] | None: | |
| # NOTE: `addr` arrives as a `list` over IPC | |
| # (msgpack deserializes tuples -> lists) so | |
| # coerce to `tuple` for the bidict hash lookup. | |
| uid: tuple[str, str] | None = self._registry.inverse.pop( |
There was a problem hiding this comment.
🤖 response authored by
claude-code
Good catch — tightened uid annotation to tuple[str, str]|None and broadened addr param type to tuple[str, int|str]|list[str|int] to document the msgpack list-over-IPC reality (runtime coercion via tuple(addr) was already in place from b557ec20).
📎 fixed in
e71eec07
| ) -> AsyncGenerator[ | ||
| UnwrappedAddress|None, | ||
| tuple[UnwrappedAddress|None, Portal|None], | ||
| None, | ||
| ]: | ||
| ''' | ||
| Lookup a transport address (by actor name) via querying a registrar | ||
| listening @ `regaddr`. | ||
|
|
||
| Returns the transport protocol (socket) address or `None` if no | ||
| entry under that name exists. | ||
| Yields a `tuple` of `(addr, reg_portal)` where, | ||
| - `addr` is the transport protocol (socket) address or `None` if | ||
| no entry under that name exists, | ||
| - `reg_portal` is the `Portal` to the registrar used for the | ||
| lookup (or `None` when the peer was found locally via | ||
| `get_peer_by_name()`). |
There was a problem hiding this comment.
query_actor() can yield a LocalPortal from get_registry() when current_actor().is_registrar is true, but the yield type is annotated as Portal|None and the docstring refers to a Portal. Please widen the annotation (and docstring) to include LocalPortal so the type contract matches actual behavior.
There was a problem hiding this comment.
🤖 response authored by
claude-code
Valid — LocalPortal is a standalone class (not a Portal subclass), and get_registry() yields one when the current actor is the registrar querying for non-registrar names.
Widened the yield type annotation, reg_portal local annotation, and docstring to include LocalPortal.
📎 fixed in
e71eec07
- Add `LocalPortal` union to `query_actor()` return type and `reg_portal` var annotation since the registrar yields a `LocalPortal` instance. - Update docstring to note the `LocalPortal` case. - Widen `.delete_addr()` `addr` param to accept `list[str|int]` bc msgpack deserializes tuples as lists over IPC. - Tighten `uid` annotation to `tuple[str, str]|None`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
All in the title 😎