Distributed SC and remote cancellation, why Cancelled might not be enough? #400
From the original (shoddy) write-up from #387, now reworked (and with more rework to come).
The whole point of this content is to attempt to explain the
different caveats that come with the idea of enforcing distributed
structured concurrency versus the traditional
trio-single-threaded-scheduler assumptions we all take for
granted.

Note that this is all a WIP and I'll be incrementally updating it as
my mind experiences clear spells.
Why this is a big deal for `tractor`'s RPC semantics, and in
general the idea of "distributed SC"..
Remote exception "masking" is a bigger deal for us because we have to
worry about "remote error relay semantics" and maintaining SC rules
under various parallel, non-deterministic IPC msg arrival edge cases.
Particularly these (numbered) premises:

1. Since there is no current way to override `trio.Cancelled`
   (ideally by plugging into `trio` core somehow) it forces us to use
   a combo of:

   - calling `Context._scope.cancel()` to signal and interrupt ongoing
     user code inside a `Portal.open_context()` or `@context` body,
   - expecting any such `Context`-embedded `trio.Task`s (children of
     `ctx.parent_task`) to handle `trio.Cancelled` despite its
     "underlying cause" not necessarily being a local error or a
     `ctx.cancel()` call,
   - subsequently checking `Context._scope.cancelled_caught` alongside
     other internal `Context` state before re-raising or "silently
     absorbing" any received remote error / `ContextCancelled` which
     was the "underlying cause" of the `Context`'s termination,
   - (eventually, with the now released `trio>=0.31.0`, we can customize
     the `Cancelled`'s `.reason: str`, which may be a partial
     solution to the more general problem..)
2. To both perform and/or determine the reason for a `Context`'s
   termination we need to handle any given "side" being:

   - requested to be "gracefully cancelled" via `Context.cancel()`
     called in the peer task,
   - cancelled by the peer-actor's task (the other "side") raising an
     exception,
   - transitively cancelled by some local error raised by a task
     in the same scope as the `Context.parent_task`; in this case
     `Context._scope` is *not* `.cancel_called`,
   - cancelled due to a so-called "out-of-band" (local or remote) request
     at a different "layer" of the runtime (see the explainer below).

   Therefore, a task-cancellation (taskc) has special meaning both
   when the taskc-requesting-scope is specifically the
   `Context._scope` and when it is not.
Understanding the existing runtime machinery..

When an IPC msg arrives, the `._rpc` loop-task looks up the matching
`Context` by `.cid` and then calls `Context._deliver_msg()` which may
in turn call sub-methods to cancel the `._scope` and raise
a `RemoteActorError` or `ContextCancelled`.
- THUS, knowing which cancel-scope in each actor-process's
  `trio.run()` tree is the explicit requester of a raised
  `trio.Cancelled` is very important to ensure the distinction
  between remote vs. local cancellation; local requests and errors
  are not the same as their remote equivalents.
- When a remote task errors, that error should never be masked because
  otherwise (per the previous bullet) it can then result in
  a possibly critical remote error masquerading as what appears to
  be a remote-graceful-cancellation, particularly in various
  complex OoB cancellation cases.
- An OoB case is a cancellation requested by an "out-of-band"
  system/task; from the perspective of a `Context` it is
  a cancel request originating from a different layer of the
  runtime, such as
  a `Portal.cancel_actor()`/`Actor.cancel()`/OS-delivered-SIGINT,
  meaning the `.cancel()` request was not issued by one of the
  tasks (parent or child) in the cross-actor `Context` pair and is
  thus from "outside the distributed task tree" implemented by the
  user's `tractor` "app code".
- A masqueraded remote-cancellation (a `ContextCancelled` which is
  relayed due to `Cancelled` masking of a remote exc) can be
  particularly problematic if the remote cancellation is relayed
  across multiple peer actors, where it could confuse any
  distributed-sw supervision logic or resiliency subsystems.
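The masking hazard itself needs no IPC to demonstrate; here's a plain-python sketch (the exception classes are stand-ins, not the real `trio`/`tractor` types) showing how converting a critical error into a generic cancellation loses the "underlying cause" unless it is explicitly chained/boxed before relay:

```python
class Cancelled(Exception):
    '''Stand-in for `trio.Cancelled`.'''


class RemoteActorError(Exception):
    '''Stand-in for tractor's boxed remote error.'''


def bad_relay() -> None:
    try:
        raise RemoteActorError('critical remote fault')
    except RemoteActorError:
        # MASKED: the critical error is swallowed and replaced by
        # what looks like a graceful cancellation..
        raise Cancelled('ctx cancelled')


def good_relay() -> None:
    try:
        raise RemoteActorError('critical remote fault')
    except RemoteActorError as rae:
        # chained: downstream supervision logic can still recover
        # the true "underlying cause" via `.__cause__`.
        raise Cancelled('ctx cancelled') from rae


for fn in (bad_relay, good_relay):
    try:
        fn()
    except Cancelled as exc:
        print(fn.__name__, '->', type(exc.__cause__).__name__)
```

Running this prints `bad_relay -> NoneType` then `good_relay -> RemoteActorError`: after a bad relay the cause is simply gone from the explicit chain.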
You can see the various complex and often subtle use cases in our
various remote-cancellation test suites:

- `test_context_stream_semantics`,
- `test_inter_peer_cancellation`,
- `test_oob_cancellation` (from "OoB (out-of-band) cancellation testing, proper" #399).
Additionally,

- since actors generally share no common state it is much more
  difficult to determine the source of a remote error or cancellation,
  particularly when there are multiple "runtime
  scopes of cancellation" (i.e. various OoB cancel requesters).

To go into more detail, a cancel request can happen at any of many
"runtime layers" on each logical host:

- the OS via a signal (osc),
- an actor via
  `Portal.cancel_actor()`, `Actor.cancel()` or
  `ActorNursery.cancel()` (actorc),
- RPC-oriented distributed tasks via
  `Context.cancel()` (ctxc),
- a std `trio.CancelScope.cancel()` (taskc),
- any of these but relayed through inter-actor code at the python
  function level; for example (a fairly common pattern in a more
  complex system setup), a peer actor (which may or may not also be
  the child of a common parent) has errored/cancelled and that
  `RemoteActorError` (RAE) gets relayed through multiple peer actors
  (in turn spanning a set of process-distributed `trio.Task`-trees,
  each with its own SC-adhering "orthogonal hierarchy"); since each
  received RAE is boxing some remote process's (prior) exception
  state, the only available info to determine a fault's source is the
  extra data we pack as part of error packing/relay, such as
  `RemoteActorError.src_uid`/`.relay_uid`/`.tb_str`/`.sender`.. etc.
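A toy sketch of why that packed metadata matters: each relay hop forwards the boxed error, so only explicitly carried fields (named here after the real `RemoteActorError` attrs, though this dataclass is purely illustrative) let a downstream supervisor trace the fault back to its source.

```python
from dataclasses import dataclass


@dataclass
class BoxedRAE:
    src_uid: str    # actor where the original fault occurred
    relay_uid: str  # last peer that forwarded this boxed error
    tb_str: str     # flattened remote traceback text


def relay(err: BoxedRAE, via_uid: str) -> BoxedRAE:
    # a relaying peer updates only the relay info; the fault's
    # source identity is preserved across every hop.
    return BoxedRAE(src_uid=err.src_uid, relay_uid=via_uid, tb_str=err.tb_str)


rae = BoxedRAE(
    src_uid='worker-2',
    relay_uid='worker-2',
    tb_str='ZeroDivisionError..',
)
for hop in ('peer-b', 'peer-a', 'root'):
    rae = relay(rae, via_uid=hop)

print(rae.src_uid, rae.relay_uid)  # worker-2 root
```

Without `.src_uid` style fields, the root actor would only ever see the last relaying peer, not the actual faulting one.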