Skip to content

Make find_actor() delete stale sockaddr entries from registrar on OSError#366

Merged
goodboy merged 11 commits intomainfrom
dereg_on_oserror
Mar 25, 2026
Merged

Make find_actor() delete stale sockaddr entries from registrar on OSError#366
goodboy merged 11 commits intomainfrom
dereg_on_oserror

Conversation

@goodboy
Copy link
Copy Markdown
Owner

@goodboy goodboy commented Aug 28, 2023

All in the title 😎

@goodboy goodboy changed the title Make find_actor() delete stale sockaddr entries from registar actor on OSError Make find_actor() delete stale sockaddr entries from registrar on OSError Aug 28, 2023
@goodboy
Copy link
Copy Markdown
Owner Author

goodboy commented Aug 29, 2023

Lul gotta add the bidict dep?

@goodboy
Copy link
Copy Markdown
Owner Author

goodboy commented Jul 15, 2025

This needs to be bumped to integrate better with the new multi-ipc-proto stuff landed in #375 !1

@goodboy
Copy link
Copy Markdown
Owner Author

goodboy commented Sep 30, 2025

I think CI is running clean now?

Only thing left is to maybe also do the .delete_addr() call on ConnectionErrors with UDS as well?

@goodboy goodboy changed the base branch from asyncio_debugger_support to main September 30, 2025 05:26
Copilot AI review requested due to automatic review settings February 20, 2026 00:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the actor discovery system by implementing automatic cleanup of stale registry entries when connection failures occur. It changes the internal registry storage from a standard dict to a bidict (bidirectional dictionary) to enable efficient reverse lookups by address, and adds logic to delete stale entries when OSError is encountered during actor connection attempts.

Changes:

  • Switch registry storage from dict to bidict for bidirectional lookups between actor UIDs and addresses
  • Add delete_addr() method to the Arbiter class for removing stale registry entries by address
  • Modify maybe_open_portal() to catch OSError and delete stale registry entries when connection fails
  • Add comprehensive test to verify stale entry deletion behavior

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
tractor/_runtime.py Changed registry from dict to bidict, added delete_addr() method for cleaning up stale entries by address
tractor/_discovery.py Modified query_actor() to return Portal along with address, updated maybe_open_portal() to delete stale entries on OSError, minor formatting improvements
tests/test_discovery.py Added test_stale_entry_is_deleted() to verify automatic cleanup, formatting improvements to existing tests

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +226 to +236
# registar actor.
uid: tuple[str, str] = await reg_portal.run_from_ns(
'self',
'delete_addr',
addr=addr,
)
log.warning(
f'Deleted stale registry entry !\n'
f'addr: {addr!r}\n'
f'uid: {uid!r}\n'
)
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential AttributeError when reg_portal is None. When query_actor() finds a peer directly (line 178-180), it yields (addr, None), meaning reg_portal will be None. If an OSError occurs when connecting, this code will attempt to call reg_portal.run_from_ns() on None, causing an AttributeError. This should check if reg_portal is not None before attempting to delete the stale entry.

Suggested change
# registar actor.
uid: tuple[str, str] = await reg_portal.run_from_ns(
'self',
'delete_addr',
addr=addr,
)
log.warning(
f'Deleted stale registry entry !\n'
f'addr: {addr!r}\n'
f'uid: {uid!r}\n'
)
# registar actor when a registry portal is available.
if reg_portal is not None:
uid: tuple[str, str] = await reg_portal.run_from_ns(
'self',
'delete_addr',
addr=addr,
)
log.warning(
f'Deleted stale registry entry !\n'
f'addr: {addr!r}\n'
f'uid: {uid!r}\n'
)
else:
log.warning(
'Failed to connect to peer at %r and no registry portal '
'is available to delete a potentially stale entry',
addr,
)

Copilot uses AI. Check for mistakes.
Comment on lines +2039 to +2040
) -> tuple[str, str]:
uid: tuple | None = self._registry.inverse.pop(
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect return type annotation. The method is declared to return tuple[str, str] but can return None when uid is None (line 2040-2043). The return type should be tuple[str, str] | None to accurately reflect this behavior.

Suggested change
) -> tuple[str, str]:
uid: tuple | None = self._registry.inverse.pop(
) -> tuple[str, str] | None:
uid: tuple[str, str] | None = self._registry.inverse.pop(

Copilot uses AI. Check for mistakes.
# contacted at that addr.
except OSError:
# NOTE: ensure we delete the stale entry from the
# registar actor.
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error in comment: 'registar' should be 'registrar'

Suggested change
# registar actor.
# registrar actor.

Copilot uses AI. Check for mistakes.
Comment on lines 169 to +171
name == 'registrar'
and actor.is_registrar
and
actor.is_registrar
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unusual line break placement. The 'and' operator is isolated on its own line (line 170), which is inconsistent with common Python style guides. Consider either keeping the condition on one line or breaking it more conventionally (e.g., after the 'and' on the previous line).

Copilot uses AI. Check for mistakes.
goodboy added a commit that referenced this pull request Mar 13, 2026
Removes the `pytest` deprecation warns and attempts to avoid
some de-registration raciness, though i'm starting to think the
real issue is due to not having the fixes from #366 (which handle
the new dereg on `OSError` case from UDS)?

- use `.channel.aid.uid` over deprecated `.channel.uid`
  throughout `test_discovery.py`.
- add polling loop (up to 5s) for subactor de-reg check
  in `spawn_and_check_registry()` to handle slower
  transports like UDS where teardown takes longer.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added a commit that referenced this pull request Mar 13, 2026
Removes the `pytest` deprecation warns and attempts to avoid
some de-registration raciness, though i'm starting to think the
real issue is due to not having the fixes from #366 (which handle
the new dereg on `OSError` case from UDS)?

- use `.channel.aid.uid` over deprecated `.channel.uid`
  throughout `test_discovery.py`.
- add polling loop (up to 5s) for subactor de-reg check
  in `spawn_and_check_registry()` to handle slower
  transports like UDS where teardown takes longer.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
@goodboy goodboy force-pushed the dereg_on_oserror branch 2 times, most recently from 1a80c48 to 655b539 Compare March 13, 2026 20:58
goodboy added a commit that referenced this pull request Mar 13, 2026
Removes the `pytest` deprecation warns and attempts to avoid
some de-registration raciness, though i'm starting to think the
real issue is due to not having the fixes from #366 (which handle
the new dereg on `OSError` case from UDS)?

- use `.channel.aid.uid` over deprecated `.channel.uid`
  throughout `test_discovery.py`.
- add polling loop (up to 5s) for subactor de-reg check
  in `spawn_and_check_registry()` to handle slower
  transports like UDS where teardown takes longer.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added a commit that referenced this pull request Mar 14, 2026
Removes the `pytest` deprecation warns and attempts to avoid
some de-registration raciness, though i'm starting to think the
real issue is due to not having the fixes from #366 (which handle
the new dereg on `OSError` case from UDS)?

- use `.channel.aid.uid` over deprecated `.channel.uid`
  throughout `test_discovery.py`.
- add polling loop (up to 5s) for subactor de-reg check
  in `spawn_and_check_registry()` to handle slower
  transports like UDS where teardown takes longer.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added 7 commits March 13, 2026 21:51
Since stale addrs can be leaked where the actor transport server task
crashes but doesn't (successfully) unregister from the registrar, we
need a remote way to remove such entries; hence this new (registrar)
method.

To implement this make use of the `bidict` lib for the `._registry`
table thus making it super simple to do reverse uuid lookups from an
input socket-address.
In cases where an actor's transport server task (by default handling new
TCP connections) terminates early but does not de-register from the
pertaining registry (aka the registrar) actor's address table, the
trying-to-connect client actor will get a connection error on that
address. In the case where client handles a (local) `OSError` (meaning
the target actor address is likely being contacted over `localhost`)
exception, make a further call to the registrar to delete the stale
entry and `yield None` gracefully indicating to calling code that no
`Portal` can be delivered to the target address.

This issue was originally discovered in `piker` where the `emsd`
(clearing engine) actor would sometimes crash on rapid client
re-connects and then leave a `pikerd` stale entry. With this fix new
clients will attempt connect via an endpoint which will re-spawn the
`emsd` when a `None` portal is delivered (via `maybe_spawn_em()`).
By spawning an actor task that immediately shuts down the transport
server and then sleeps, verify that attempting to connect via the
`._discovery.find_actor()` helper delivers `None` for the `Portal`
value.

Relates to #184 and #216
Fix potential `AttributeError` when `query_actor()` yields
a `None` portal (peer-found-locally path) and an `OSError`
is raised during transport connect.

Also,
- fix `Arbiter.delete_addr()` return type to
  `tuple[str, str]|None` bc it can return `None`.
- fix "registar" typo -> "registrar" in comment.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
@goodboy
Copy link
Copy Markdown
Owner Author

goodboy commented Mar 17, 2026

I let claude address the copilot review in that last commit.

I'll do another re-review and then likely land dis puppy 🐶

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

tests/test_discovery.py:383

  • The nested async with (collapse_eg(), trio.open_nursery()...) block is indented deeper than the surrounding async with (... open_stream_from ...) block. As written, this will raise an IndentationError and prevent the test module from importing.
                    ):
                            async with (
                                collapse_eg(),
                                trio.open_nursery() as tn,
                            ):

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


self._registry[uid] = addr
# XXX NOTE, value must also be hashable.
self._registry[uid] = tuple(addr)
Copy link
Copy Markdown
Owner Author

@goodboy goodboy Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 response authored by claude-code

Good catch 👍 - switched to bidict.forceput() which silently overwrites any prior (stale) entry mapping a different uid to the same addr. This handles the unclean-shutdown / actor-restart-reusing-addr case without needing explicit exception handling.

📎 fixed in 85457cb8

# NOTE: ensure we delete the stale entry
# from the registrar actor when available.
if reg_portal is not None:
uid: tuple[str, str] = await reg_portal.run_from_ns(
Copy link
Copy Markdown
Owner Author

@goodboy goodboy Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 response authored by claude-code

Valid - fixed the annotation to tuple[str, str]|None and now branch on the return value: log "Deleted stale registry entry" when a uid is returned, or "No registry entry found for addr" when None.

📎 fixed in 85457cb8

Comment on lines +492 to +496
tractor.open_nursery(
debug_mode=debug_mode,
) as an,
tractor.get_registry(reg_addr) as _reg_ptl,
):
Copy link
Copy Markdown
Owner Author

@goodboy goodboy Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 response authored by claude-code

Good point - added explicit registry_addrs=[reg_addr] to both the open_nursery() and find_actor() calls so the test is guaranteed to use the remote registrar daemon started by the daemon fixture.

📎 fixed in 85457cb8

Comment on lines 155 to 158
) -> AsyncGenerator[
UnwrappedAddress|None,
tuple[UnwrappedAddress|None, Portal|None],
None,
]:
Copy link
Copy Markdown
Owner Author

@goodboy goodboy Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 response authored by claude-code

Updated the docstring to document the new (addr, reg_portal) yield shape - explains when reg_portal is None (peer found locally via get_peer_by_name()) vs when it's a Portal to the registrar.

📎 fixed in 85457cb8

goodboy and others added 2 commits March 25, 2026 00:21
- Use `bidict.forceput()` in `register_actor()` to handle
  duplicate addr values from stale entries or actor restarts.
- Fix `uid` annotation to `tuple[str, str]|None` in
  `maybe_open_portal()` and handle the `None` return from
  `delete_addr()` in log output.
- Pass explicit `registry_addrs=[reg_addr]` to `open_nursery()`
  and `find_actor()` in `test_stale_entry_is_deleted` to ensure
  the test uses the remote registrar.
- Update `query_actor()` docstring to document the new
  `(addr, reg_portal)` yield shape.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`msgpack` deserializes tuples as lists over IPC so
the `bidict.inverse.pop()` needs a `tuple`-cast to
match registry keys.

Regressed-by: 85457cb (`registry_addrs` change)
Found-via: `/run-tests` test_stale_entry_is_deleted

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +2045 to +2050
addr: tuple[str, int|str],
) -> tuple[str, str]|None:
# NOTE: `addr` arrives as a `list` over IPC
# (msgpack deserializes tuples -> lists) so
# coerce to `tuple` for the bidict hash lookup.
uid: tuple | None = self._registry.inverse.pop(
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In delete_addr(), uid is annotated as tuple | None, but the function return type is tuple[str, str] | None and _registry.inverse is keyed to the UID tuple. Tighten the local variable annotation to tuple[str, str] | None (and consider broadening the addr parameter type if it can arrive as a list over IPC, as noted in the comment) to keep typing consistent.

Suggested change
addr: tuple[str, int|str],
) -> tuple[str, str]|None:
# NOTE: `addr` arrives as a `list` over IPC
# (msgpack deserializes tuples -> lists) so
# coerce to `tuple` for the bidict hash lookup.
uid: tuple | None = self._registry.inverse.pop(
addr: tuple[str, int | str] | list[str | int],
) -> tuple[str, str] | None:
# NOTE: `addr` arrives as a `list` over IPC
# (msgpack deserializes tuples -> lists) so
# coerce to `tuple` for the bidict hash lookup.
uid: tuple[str, str] | None = self._registry.inverse.pop(

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner Author

@goodboy goodboy Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 response authored by claude-code

Good catch — tightened uid annotation to tuple[str, str]|None and broadened addr param type to tuple[str, int|str]|list[str|int] to document the msgpack list-over-IPC reality (runtime coercion via tuple(addr) was already in place from b557ec20).

📎 fixed in e71eec07

Comment on lines 155 to +168
) -> AsyncGenerator[
UnwrappedAddress|None,
tuple[UnwrappedAddress|None, Portal|None],
None,
]:
'''
Lookup a transport address (by actor name) via querying a registrar
listening @ `regaddr`.

Returns the transport protocol (socket) address or `None` if no
entry under that name exists.
Yields a `tuple` of `(addr, reg_portal)` where,
- `addr` is the transport protocol (socket) address or `None` if
no entry under that name exists,
- `reg_portal` is the `Portal` to the registrar used for the
lookup (or `None` when the peer was found locally via
`get_peer_by_name()`).
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query_actor() can yield a LocalPortal from get_registry() when current_actor().is_registrar is true, but the yield type is annotated as Portal|None and the docstring refers to a Portal. Please widen the annotation (and docstring) to include LocalPortal so the type contract matches actual behavior.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Owner Author

@goodboy goodboy Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 response authored by claude-code

Valid — LocalPortal is a standalone class (not a Portal subclass), and get_registry() yields one when the current actor is the registrar querying for non-registrar names.

Widened the yield type annotation, reg_portal local annotation, and docstring to include LocalPortal.

📎 fixed in e71eec07

- Add `LocalPortal` union to `query_actor()` return
  type and `reg_portal` var annotation since the
  registrar yields a `LocalPortal` instance.
- Update docstring to note the `LocalPortal` case.
- Widen `.delete_addr()` `addr` param to accept
  `list[str|int]` bc msgpack deserializes tuples as
  lists over IPC.
- Tighten `uid` annotation to `tuple[str, str]|None`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
@goodboy goodboy merged commit cc3bfac into main Mar 25, 2026
4 checks passed
@goodboy goodboy deleted the dereg_on_oserror branch March 25, 2026 07:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants