Distributed SC and remote cancellation, why Cancelled might not be enough? #400

@goodboy

Description

From the original (shoddy) write-up, now reworked (with more to
come) from #387.

The whole point of this content is to explain the
various caveats that come with enforcing distributed
structured concurrency
versus the traditional
trio-single-threaded-scheduler assumptions we all take for
granted..

Bp

Note that this is all a WIP and I'll be incrementally updating it as
my mind experiences clear spells.


why this is a big deal for tractor's RPC semantics, and in
general the idea of "distributed SC"..

Remote exception "masking" is a bigger deal for us because we have to
worry about "remote error relay semantics" and maintaining SC rules
under various parallel, non-deterministic IPC msg arrival edge cases.

Particularly these (numbered) premises,

  1. since there is currently no way to override trio.Cancelled
    (ideally by plugging into trio core somehow) we are forced to use
    a combo of,

    • calling Context._scope.cancel() to signal and interrupt ongoing
      user code inside a Portal.open_context() or @context body,

    • expecting any such Context-embedded trio.Tasks (children of
      ctx.parent_task) to handle trio.Cancelled despite its
      "underlying cause" not necessarily being a local error or
      ctx.cancel() call,

    • subsequently checking Context._scope.cancelled_caught alongside
      other internal Context state before re-raising or "silently
      absorbing" any received remote error/ContextCancelled which
      was the "underlying cause" of the Context's termination.

      • (eventually, with the now-released trio>=0.31.0, we can
        customize the Cancelled's .reason: str, which may be a partial
        solution to the more general problem..)

    To both perform, and determine the reason for,
    a Context's termination we need to handle either "side" being:

    • requested to be "gracefully cancelled" by Context.cancel()
      called in the peer task,

    • cancelled by the peer-actor's-task (the other "side") raising an
      exception,

    • transitively cancelled by some local error raised by a task
      in the same scope as the Context.parent_task; in this case
      Context._scope.cancel_called is False,

    • cancelled due to a so called "out-of-band" (local or remote) request
      at a different "layer" of the runtime (see explainer below).

    Therefore, a task-cancellation (taskc) has a different meaning
    depending on whether the taskc-requesting scope is specifically
    the Context._scope or some other scope.


Understanding the existing runtime machinery..

When an IPC msg arrives the ._rpc loop-task looks up the matching
Context by .cid and then calls Context._deliver_msg() which may
in turn call sub-methods to cancel the ._scope and raise
a RemoteActorError or ContextCancelled.
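A heavily simplified, hypothetical model of that lookup-and-deliver flow (the real method and attribute names differ; this only mirrors the shape described above, with plain dicts standing in for IPC msgs):

```python
# minimal mock of the per-actor msg-delivery step: the rpc loop maps
# an inbound msg's `cid` to its `Context` and hands the msg over,
# cancelling the ctx's scope when the msg boxes an error.

class MockScope:
    '''stand-in for the ctx's trio.CancelScope'''
    def __init__(self):
        self.cancel_called = False

    def cancel(self):
        self.cancel_called = True


class MockContext:
    def __init__(self, cid: str):
        self.cid = cid
        self._scope = MockScope()
        self._remote_error = None
        self.delivered: list[dict] = []

    def deliver_msg(self, msg: dict) -> None:
        if 'error' in msg:
            # stash the boxed remote error, then interrupt any
            # local tasks running inside this ctx's scope
            self._remote_error = msg['error']
            self._scope.cancel()
        else:
            self.delivered.append(msg)


def rpc_loop_step(contexts: dict[str, MockContext], msg: dict) -> None:
    # the rpc loop-task's per-msg step: route by `cid`
    contexts[msg['cid']].deliver_msg(msg)
```

Note how, in this shape, the *scope cancellation* is the only in-band signal the user task sees; the actual cause lives in the stashed error, which is exactly why the state checks above are needed.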

  • THUS, knowing which cancel-scope in each actor-process's
    trio.run() tree is the explicit requester of a raised
    trio.Cancelled
    is very important to maintain the distinction
    between remote vs. local cancellation; local requests and errors
    are not the same as their remote equivalents.

  • when a remote task errors, that error should never be masked
    because otherwise (per the previous bullet) it can result in
    a possibly-critical remote error masquerading as what appears to
    be a remote graceful-cancellation, particularly in various
    complex OoB cancellation cases.

    • an OoB case is a cancellation requested by an "out-of-band"
      system/task; from the perspective of a Context it is
      a cancel request originating from a different layer of the
      runtime, such as
      a Portal.cancel_actor()/Actor.cancel()/OS-delivered-SIGINT,
      meaning the .cancel() request was not issued by one of the
      tasks (parent or child) in the cross-actor Context pair and is
      thus from "outside the distributed task tree" implemented by the
      user's tractor "app code".

    • masqueraded remote-cancellation (a ContextCancelled which is
      relayed due to Cancelled masking of a remote exc) can be
      particularly problematic if the remote cancellation is relayed
      across multiple peer actors where it could confuse any
      distributed-sw supervision logic or resiliency subsystems.

You can see the many complex and often subtle use cases in our
various remote-cancellation test suites.


Additionally,

  1. since actors generally share no common state it is much more
    difficult to determine the source of a remote error or cancellation,
    particularly when there are multiple "runtime
    scopes of cancellation" (i.e. from various OoB cancel requesters).

    To go into more detail, a cancel request can happen at any of many
    "runtime layers" on each logical host,

    • OS via signal (osc),

    • actor via Portal.cancel_actor(), Actor.cancel() or
      ActorNursery.cancel() (actorc),

    • RPC-oriented distributed tasks via Context.cancel() (ctxc),

    • a std trio.CancelScope.cancel() (taskc)

    • any of these but relayed through interacting code at the python
      function level

      • for example (a fairly common pattern in more complex system
        setups),
        a peer actor (which may or may not also be the child of a common
        parent) has errored/cancelled and that RemoteActorError (RAE)
        gets relayed through multiple peer actors (in turn spanning
        a set of process-distributed trio.Task-trees each with their own
        SC-adhering "orthogonal hierarchy") and, since each received RAE
        is boxing some remote process's (prior) exception state, the
        only available info to determine a fault's source is the extra
        data we pack as part of error packing/relay, such as
        RemoteActorError.src_uid/.relay_uid/.tb_str/.sender.. etc.
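A minimal, hypothetical mock of that error-metadata packing: `MockRAE` is not tractor's actual `RemoteActorError` (which carries more fields and richer boxing), and `relay_path` is an illustrative variant of the `.relay_uid`-style hop tracking, but it shows why the packed metadata is the only way to trace a fault's origin across state-less peer actors.

```python
from dataclasses import dataclass, field


@dataclass
class MockRAE(Exception):
    '''
    illustrative stand-in for a boxed remote error: since peer
    actors share no state, the only way supervision logic can
    trace a fault across relays is the metadata packed alongside
    the boxed (prior) exception state.
    '''
    src_uid: tuple[str, str]   # actor where the error originated
    tb_str: str                # the original traceback, as text
    relay_path: list[tuple[str, str]] = field(default_factory=list)

    def relay_through(self, uid: tuple[str, str]) -> 'MockRAE':
        # each relaying hop appends its own uid so downstream
        # supervisors can distinguish the source from mere relayers
        self.relay_path.append(uid)
        return self
```

With this shape, a supervisor receiving the error after two hops can still tell that `src_uid` (not the last relayer) is where to apply any resiliency policy.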
