Coping with "finally footguns": avoiding trio.Cancelled exc masking as best we can..#387
Merged
Coping with "finally footguns": avoiding trio.Cancelled exc masking as best we can..#387
finally footguns": avoiding trio.Cancelled exc masking as best we can..#387Conversation
Deats are documented within, but basically a subtlety we already track with `trio`'s masking of excs by a checkpoint-in-`finally` can cause compounded issues with our `@context` endpoints, mostly in terms of remote error and cancel-ack relay semantics.
Factor the `@acm`-closure it out of the `test_trioisms::test_acm_embedded_nursery_propagates_enter_err` suite for real use internally.
To handle captured non-egs (when the now optional `tn` isn't provided) as well as yield up a `BoxedMaybeException` which contains any detected and un-masked `exc_ctx` as its `.value`. Also add some additional tooling, - a `raise_unmasked: bool` toggle for when the caller just wants to report the masked exc and not raise-it-in-place of the masker. - `extra_note: str` which by default is tuned to the default `unmask_from = (trio.Cancelled,)` but which can be used to deliver custom exception msg content. - `always_warn_on: tuple[BaseException]` which will always emit a warning log of what would have been the raised-in-place-of `ctx_exc`'s msg for special cases where you want to report a masking case that might not be otherwise noticed by the runtime (cough like a `Cancelled` masking another `Cancelled) but which you'd still like to warn the caller about. - factor out the masked-`ext_ctx` predicate logic into a `find_masked_excs()` and also use it for non-eg cases. Still maybe todo? - rewrapping multiple masked sub-excs in an eg back into an eg? left in #TODOs and a pause-point where applicable.
Namely that the more common-and-pertinent case is when a `@context`-ep-fn contains the `finally`-footgun but without a surrounding embedded `tn` (which currently still requires its own scope embedded `trionics.maybe_raise_from_masking_exc()`) which can't be compensated-for by `._rpc._invoke()` easily. Instead the test is composed where the `._invoke()`-internal `tn` is the machinery being addressed in terms of masking user-code excs with `trio.Cancelled`. Deats, - rename the test -> `test_unmasked_remote_exc` to reflect what the runtime should actually be addressing/solving. - drop the embedded `tn` from `sleep_n_chkpt_in_finally()` (for now) since that case can't currently easily be addressed without the user code using its own `trionics.maybe_raise_from_masking_exc()` inside the nursery scope. - as such drop all `tn` related params/logic/usage from the ep. - add in a `Cancelled` handler block which checks for RTE masking and always prints the occurrence loudly. Follow up, - obvi this suite will currently fail until the appropriate adjustment is made to `._rpc._invoke()` to do the unmasking; coming next. - we probably still need a case with an embedded user `tn` where if the default strict-eg mode is used then a ctxc from the parent might cause a non-graceful `Context.cancel()` outcome? |_since the embedded user-`tn` will raise `ExceptionGroup[trio.Cancelled]` upward despite the parent nursery's scope being the canceller, or will a `collapse_eg()` inside the `._invoke()` scope handle this as well?
To resolve the recently added and failing `test_remote_exc_relay::test_unmasked_remote_exc`: never allow `trio.Cancelled` to mask an underlying user-code exception, ever. Our first real-world (runtime internal) use case for the new `.trionics.maybe_raise_from_masking_exc()` such that the failing test now passes with an properly relayed remote RTE unmasking B) Details, - flip the `Context._scope_nursery` to the default strict-eg behaviour and instead stack its outer scope with a `.trionics.collapse_eg()`. - wrap the inner-most scope (after `msgops.maybe_limit_plds()`) with a `maybe_raise_from_masking_exc()` to ensure user-code errors are never masked by `trio.Cancelled`s. Some err-reporting refinement, - always capture any `scope_err` from the entire block for debug purposes; report it in the `finally` block's log. - always capture any suppressed `maybe_re`, output from `ctx.maybe_raise()`, and `log.cancel()` report it.
trio.Cancelled exc masking as best we can..finally footguns": avoiding trio.Cancelled exc masking as best we can..
Since it's for beg filtering, the current impl should be renamed anyway; it's not just for filtering cancelled excs. Deats, - added a real doc string, links to official eg docs and fixed the return typing. - adjust all internal imports to match.
guilledk
approved these changes
Jul 16, 2025
Open
5 tasks
This was referenced Sep 10, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
As critiqued heavily (by me) over the past few months (after
discovering this issue during work on another dependent project) in
the
triochat room, i was able to hack together a near term solnwhich at least resolves the bulk of critical potential issues (the
SC-cross-process semantics breaking kind) within the
tractorruntime - namely ensuring error propagation is maintained with
a (remote) child-task started with
Portal.open_context().Summary of underlying "issue" in
trioIt's actually been documented and discussed for much longer then
I had thought and is most simply detailed in this comment
which contains the following snippet.
Generally speaking python-trio/trio#455
seems to be the primary spot for following all this.
TLDR,
if you exec a checkpoint in a cancelled scope from some
exception handling block, any original propagating exc (in this case
an RTE) will be replaced (what I call "masked") due to normal python
exc control-flow semantics; the issue is that since
trio.Cancelledis often handled "gracefully", i.e. it is swallowed by some parent
scope which was
.cancel_called, the original exc will never beraised..
As you can imagine this is very problematic if you're trying to
relay errors from from one task to another when they're scheduled in
different processes!
Bp
Some other related discussions/resources I've found from pretty old
patches/issues in both
triocore and/or surrounding ecosys,pytest-trio rejected patch: Fix exception swallow in fixture due to parent cancel scope python-trio/pytest-trio#77 (comment)
relevant
flake8-asynclinter rules for which apply to for all asyncIO:finally: cancel_scope.cancel()python-trio/flake8-async#365What lands here
a WIP soln to this problem for the purposes of our
._rpc._invoke()machinery, namely ensuring we don't suppress@context-endpoint originating exceptions, those raised from our"user's code":
a new
.trionics._taskc.maybe_raise_from_masking_exc()helperto, at least, warn about via the new
Exception.add_note(), butalso optionally directly (re)raise a(ny) "masked exception(s)",
namely by default those masked by a
trio.Cancelled(per all thepreface above).
direct use in our
._rpc._invoke()-child side impl to avoid saidtrio.Cancelledmasking, working kinda hard to avoid usersshooting their toes (see test suite parametrizations).
a minor suite which audits the correctness of the unmasker util-fn:
tests_trioisms::test_gatherctxs_with_memchan_breaks_multicancelled.a more comprehensive suite which audits that we do not suppress
exceptions raised by user code:
test_remote_exc_relay::test_unmasked_remote_exc.Deeper explanation why this is a bigger deal for
tractor's RPCRemote exception "masking" is a bigger deal in tractor's error relay
sys/semantics for a couple reasons,
since there is no current way to override/customize
trio.Cancelledand since that forces us to useCancelScope.cancel()and acs.cancelled_caughtcheck to determine ifa side of a task-RPC-
Contextwas gracefully cancelled (whetherfrom a remote or local request), we require that,
a task-cancellation (taskc) has very special meaning when
that taskc-requesting-scope is the
Context._scope- the csused in the RPC mgmt part of the runtime to relay/accept remote
cancellations; this means knowing what scope is the requester of
a taskc it that much more critical to ensure remote vs. local
cancels are distinct.
when a remote task error's that error should never be masked
bc otherwise (per the previous bullet) it can then result in
a critical-remote-error masquerading as
remote-graceful-cancellation, particularly if there is
a
Portal.cancel_actor()/Context.cancel()call or SIGINTemission which arrives out-of-band (meaning not called by a task
in the
Contextpair) and propagates across peer actors at thesame time as said critical-remote-error being masked.
test_context_stream_semanticsand,test_inter_peer_cancellationsuites.since actors generally share no common state it is much more
difficult to determine the source of a remote error or cancel
request including (for the latter) there being multiple "runtime
scopes of cancellation" as it were (not to overload the scope
term but..), meaning a cancel request can happen at any of the
multiple layers,
Portal.cancel_actor()(actorc),Context.cancel()(ctxc),trio.CancelScope(taskc)function level (for example, in a more complex actor tree setup
this is a common case) here some remote peer has errored/cancelled and
that condition is relayed through various multi-actor-spanning
trio.Task-trees which are "distributed" over the actors/processes/sub-intsyet are SC-adhering in terms of task-tree/cancel-scope supervision semantics.