Skip to content

Coping with "finally footguns": avoiding trio.Cancelled exc masking as best we can..#387

Merged
goodboy merged 9 commits intomainfrom
the_finally_footgun
Jul 18, 2025
Merged

Coping with "finally footguns": avoiding trio.Cancelled exc masking as best we can..#387
goodboy merged 9 commits intomainfrom
the_finally_footgun

Conversation

@goodboy
Copy link
Copy Markdown
Owner

@goodboy goodboy commented Jul 15, 2025

As critiqued heavily (by me) over the past few months (after
discovering this issue during work on another dependent project) in
the trio chat room, i was able to hack together a near term soln
which at least resolves the bulk of critical potential issues (the
SC-cross-process semantics breaking kind) within the tractor
runtime - namely ensuring error propagation is maintained with
a (remote) child-task started with Portal.open_context() .


Summary of underlying "issue" in trio

It's actually been documented and discussed for much longer then
I had thought and is most simply detailed in this comment
which contains the following snippet.

  async def main():
     with trio.CancelScope() as scope:
         try:
             raise RuntimeError("Oooops")
         finally:
             scope.cancel()
             await trio.sleep(0)

Generally speaking python-trio/trio#455
seems to be the primary spot for following all this.

TLDR,

if you exec a checkpoint in a cancelled scope from some
exception handling block, any original propagating exc (in this case
an RTE) will be replaced (what I call "masked") due to normal python
exc control-flow semantics; the issue is that since trio.Cancelled
is often handled "gracefully", i.e. it is swallowed by some parent
scope which was .cancel_called, the original exc will never be
raised..

As you can imagine this is very problematic if you're trying to
relay errors from from one task to another when they're scheduled in
different processes!

Bp


Some other related discussions/resources I've found from pretty old
patches/issues in both trio core and/or surrounding ecosys,


What lands here

  • a WIP soln to this problem for the purposes of our
    ._rpc._invoke() machinery, namely ensuring we don't suppress
    @context-endpoint originating exceptions, those raised from our
    "user's code":

    • a new .trionics._taskc.maybe_raise_from_masking_exc() helper
      to, at least, warn about via the new Exception.add_note(), but
      also optionally directly (re)raise a(ny) "masked exception(s)",
      namely by default those masked by a trio.Cancelled (per all the
      preface above).

    • direct use in our ._rpc._invoke()-child side impl to avoid said
      trio.Cancelled masking, working kinda hard to avoid users
      shooting their toes (see test suite parametrizations).

  • a minor suite which audits the correctness of the unmasker util-fn:
    tests_trioisms::test_gatherctxs_with_memchan_breaks_multicancelled.

  • a more comprehensive suite which audits that we do not suppress
    exceptions raised by user code:
    test_remote_exc_relay::test_unmasked_remote_exc.


Deeper explanation why this is a bigger deal for tractor's RPC

Remote exception "masking" is a bigger deal in tractor's error relay
sys/semantics for a couple reasons,

  1. since there is no current way to override/customize
    trio.Cancelled and since that forces us to use
    CancelScope.cancel() and a cs.cancelled_caught check to determine if
    a side of a task-RPC-Context was gracefully cancelled (whether
    from a remote or local request), we require that,

    • a task-cancellation (taskc) has very special meaning when
      that taskc-requesting-scope is the Context._scope - the cs
      used in the RPC mgmt part of the runtime to relay/accept remote
      cancellations; this means knowing what scope is the requester of
      a taskc it that much more critical to ensure remote vs. local
      cancels are distinct.

    • when a remote task error's that error should never be masked
      bc otherwise (per the previous bullet) it can then result in
      a critical-remote-error masquerading as
      remote-graceful-cancellation, particularly if there is
      a Portal.cancel_actor()/Context.cancel() call or SIGINT
      emission which arrives out-of-band (meaning not called by a task
      in the Context pair) and propagates across peer actors at the
      same time as said critical-remote-error being masked.

  • you can see the various complex and often subtle use cases in our
    • test_context_stream_semantics and,
    • test_inter_peer_cancellation suites.
  1. since actors generally share no common state it is much more
    difficult to determine the source of a remote error or cancel
    request including (for the latter) there being multiple "runtime
    scopes of cancellation" as it were (not to overload the scope
    term but..), meaning a cancel request can happen at any of the
    multiple layers,

    • OS via signal (osc),
    • actor via Portal.cancel_actor() (actorc),
    • RPC via Context.cancel() (ctxc),
    • a normal trio.CancelScope (taskc)
    • any of these but relayed through inter-acting code at the python
      function level
      (for example, in a more complex actor tree setup
      this is a common case) here some remote peer has errored/cancelled and
      that condition is relayed through various multi-actor-spanning
      trio.Task-trees which are "distributed" over the actors/processes/sub-ints
      yet are SC-adhering in terms of task-tree/cancel-scope supervision semantics.

goodboy added 7 commits July 15, 2025 07:23
Deats are documented within, but basically a subtlety we already track
with `trio`'s masking of excs by a checkpoint-in-`finally` can cause
compounded issues with our `@context` endpoints, mostly in terms of
remote error and cancel-ack relay semantics.
Factor the `@acm`-closure it out of the
`test_trioisms::test_acm_embedded_nursery_propagates_enter_err` suite
for real use internally.
To handle captured non-egs (when the now optional `tn` isn't provided)
as well as yield up a `BoxedMaybeException` which contains any detected
and un-masked `exc_ctx` as its `.value`.

Also add some additional tooling,
- a `raise_unmasked: bool` toggle for when the caller just wants to
  report the masked exc and not raise-it-in-place of the masker.
- `extra_note: str` which by default is tuned to the default
  `unmask_from = (trio.Cancelled,)` but which can be used to deliver
  custom exception msg content.
- `always_warn_on: tuple[BaseException]` which will always emit
  a warning log of what would have been the raised-in-place-of
  `ctx_exc`'s msg for special cases where you want to report
  a masking case that might not be otherwise noticed by the runtime
  (cough like a `Cancelled` masking another `Cancelled) but which
  you'd still like to warn the caller about.
- factor out the masked-`ext_ctx` predicate logic into
  a `find_masked_excs()` and also use it for non-eg cases.

Still maybe todo?
- rewrapping multiple masked sub-excs in an eg back into an eg? left in
  #TODOs and a pause-point where applicable.
Namely that the more common-and-pertinent case is when
a `@context`-ep-fn contains the `finally`-footgun but without
a surrounding embedded `tn` (which currently still requires its own
scope embedded `trionics.maybe_raise_from_masking_exc()`) which can't
be compensated-for by `._rpc._invoke()` easily. Instead the test is
composed where the `._invoke()`-internal `tn` is the machinery being
addressed in terms of masking user-code excs with `trio.Cancelled`.

Deats,
- rename the test -> `test_unmasked_remote_exc` to reflect what the
  runtime should actually be addressing/solving.
- drop the embedded `tn` from `sleep_n_chkpt_in_finally()` (for now)
  since that case can't currently easily be addressed without the user
  code using its own `trionics.maybe_raise_from_masking_exc()` inside
  the nursery scope.
- as such drop all `tn` related params/logic/usage from the ep.
- add in a `Cancelled` handler block which checks for RTE masking and
  always prints the occurrence loudly.

Follow up,
- obvi this suite will currently fail until the appropriate adjustment
  is made to `._rpc._invoke()` to do the unmasking; coming next.
- we probably still need a case with an embedded user `tn` where if
  the default strict-eg mode is used then a ctxc from the parent might
  cause a non-graceful `Context.cancel()` outcome?
 |_since the embedded user-`tn` will raise
   `ExceptionGroup[trio.Cancelled]` upward despite the parent nursery's
   scope being the canceller, or will a `collapse_eg()` inside the
   `._invoke()` scope handle this as well?
To resolve the recently added and failing
`test_remote_exc_relay::test_unmasked_remote_exc`: never allow
`trio.Cancelled` to mask an underlying user-code exception, ever.

Our first real-world (runtime internal) use case for the new
`.trionics.maybe_raise_from_masking_exc()` such that the failing
test now passes with an properly relayed remote RTE unmasking B)

Details,
- flip the `Context._scope_nursery` to the default strict-eg behaviour
  and instead stack its outer scope with a `.trionics.collapse_eg()`.
- wrap the inner-most scope (after `msgops.maybe_limit_plds()`) with
  a `maybe_raise_from_masking_exc()` to ensure user-code errors are
  never masked by `trio.Cancelled`s.

Some err-reporting refinement,
- always capture any `scope_err` from the entire block for debug
  purposes; report it in the `finally` block's log.
- always capture any suppressed `maybe_re`, output from
  `ctx.maybe_raise()`, and `log.cancel()` report it.
@goodboy goodboy changed the title Coping with "the finally footgun": avoiding trio.Cancelled exc masking as best we can.. Coping with "finally footguns": avoiding trio.Cancelled exc masking as best we can.. Jul 15, 2025
goodboy added 2 commits July 16, 2025 15:49
Since it's for beg filtering, the current impl should be renamed anyway;
it's not just for filtering cancelled excs.

Deats,
- added a real doc string, links to official eg docs and fixed the
  return typing.
- adjust all internal imports to match.
@goodboy goodboy requested a review from guilledk July 16, 2025 22:37
@goodboy goodboy added cancellation SC teardown semantics and anti-zombie semantics trionics sc structurred concurrency related labels Jul 16, 2025
@goodboy goodboy merged commit cd16748 into main Jul 18, 2025
2 checks passed
@goodboy goodboy deleted the the_finally_footgun branch July 18, 2025 02:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cancellation SC teardown semantics and anti-zombie semantics sc structurred concurrency related trionics

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants