Distributed SC and remote cancellation, why Cancelled might not be enough? #400

@goodboy

Description

From the original (shoddy) write-up, now reworked (with more to
come) from #387.

The whole point of this content is to explain the
various caveats that come with enforcing distributed
structured concurrency
versus the traditional
trio-single-threaded-scheduler assumptions we all take for
granted..

Bp

Note that this is all a WIP and I'll be incrementally updating it as
my mind experiences clear spells.


why this is a big deal for tractor's RPC semantics, and in
general the idea of "distributed SC"..

Remote exception "masking" is a bigger deal for us because we have to
worry about "remote error relay semantics" and maintaining SC rules
under various parallel, non-deterministic IPC msg arrival edge cases.

Particularly these (numbered) premises,

  1. since there is currently no way to override trio.Cancelled
    (ideally by plugging into trio core somehow) we are forced to use
    a combo of,

    • calling Context._scope.cancel() to signal and interrupt ongoing
      user code inside a Portal.open_context() or @context body,

    • expecting any such Context-embedded trio.Tasks (children of
      ctx.parent_task) to handle trio.Cancelled despite its
      "underlying cause" not necessarily being a local error or
      ctx.cancel() call,

    • subsequently checking Context._scope.cancelled_caught alongside
      other internal Context state before re-raising or "silently
      absorbing" any received remote error/ContextCancelled which
      was the "underlying cause" of the Context's termination.

      • (eventually, with the now-released trio>=0.31.0, we can
        customize the Cancelled's .reason: str, which may be a partial
        solution to the more general problem..)

    To both perform, and determine the reason for,
    a Context's termination we need to handle either "side" being:

    • requested to be "gracefully cancelled" by Context.cancel()
      called in the peer task,

    • cancelled by the peer-actor's-task (the other "side") raising an
      exception,

    • transitively cancelled by some local error raised by a task
      in the same scope as the Context.parent_task; in this case
      Context._scope.cancel_called is False,

    • cancelled due to a so called "out-of-band" (local or remote) request
      at a different "layer" of the runtime (see explainer below).

    Therefore, a task-cancellation (taskc) has a different meaning
    depending on whether the taskc-requesting scope is specifically
    the Context._scope or some other scope.


Understanding the existing runtime machinery..

When an IPC msg arrives the ._rpc loop-task looks up the matching
Context by .cid and then calls Context._deliver_msg() which may
in turn call sub-methods to cancel the ._scope and raise
a RemoteActorError or ContextCancelled.
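A heavily simplified, hypothetical model of that lookup-and-deliver flow (the real method and attribute names differ; this only mirrors the shape described above, with plain dicts standing in for IPC msgs):

```python
# minimal mock of the per-actor msg-delivery step: the rpc loop maps
# an inbound msg's `cid` to its `Context` and hands the msg over,
# cancelling the ctx's scope when the msg boxes an error.

class MockScope:
    '''stand-in for the ctx's trio.CancelScope'''
    def __init__(self):
        self.cancel_called = False

    def cancel(self):
        self.cancel_called = True


class MockContext:
    def __init__(self, cid: str):
        self.cid = cid
        self._scope = MockScope()
        self._remote_error = None
        self.delivered: list[dict] = []

    def deliver_msg(self, msg: dict) -> None:
        if 'error' in msg:
            # stash the boxed remote error, then interrupt any
            # local tasks running inside this ctx's scope
            self._remote_error = msg['error']
            self._scope.cancel()
        else:
            self.delivered.append(msg)


def rpc_loop_step(contexts: dict[str, MockContext], msg: dict) -> None:
    # the rpc loop-task's per-msg step: route by `cid`
    contexts[msg['cid']].deliver_msg(msg)
```

Note how, in this shape, the *scope cancellation* is the only in-band signal the user task sees; the actual cause lives in the stashed error, which is exactly why the state checks above are needed.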

  • THUS, knowing which cancel-scope in each actor-process's
    trio.run() tree is the explicit requester of a raised
    trio.Cancelled
    is very important to maintain the distinction
    between remote vs. local cancellation; local requests and errors
    are not the same as their remote equivalents.

  • when a remote task errors, that error should never be masked
    because otherwise (per the previous bullet) it can result in
    a possibly-critical remote error masquerading as what appears to
    be a remote graceful-cancellation, particularly in various
    complex OoB cancellation cases.

    • an OoB case is a cancellation requested by an "out-of-band"
      system/task; from the perspective of a Context it is
      a cancel request originating from a different layer of the
      runtime, such as
      a Portal.cancel_actor()/Actor.cancel()/OS-delivered-SIGINT,
      meaning the .cancel() request was not issued by one of the
      tasks (parent or child) in the cross-actor Context pair and is
      thus from "outside the distributed task tree" implemented by the
      user's tractor "app code".

    • masqueraded remote-cancellation (a ContextCancelled which is
      relayed due to Cancelled masking of a remote exc) can be
      particularly problematic if the remote cancellation is relayed
      across multiple peer actors where it could confuse any
      distributed-sw supervision logic or resiliency subsystems.

You can see the many complex and often subtle use cases in our
various remote-cancellation test suites.


Additionally,

  1. since actors generally share no common state it is much more
    difficult to determine the source of a remote error or cancellation,
    particularly when there are multiple "runtime
    scopes of cancellation" (i.e. from various OoB cancel requesters).

    To go into more detail, a cancel request can happen at any of many
    "runtime layers" on each logical host,

    • OS via signal (osc),

    • actor via Portal.cancel_actor(), Actor.cancel() or
      ActorNursery.cancel() (actorc),

    • RPC-oriented distributed tasks via Context.cancel() (ctxc),

    • a std trio.CancelScope.cancel() (taskc)

    • any of these but relayed through interacting code at the python
      function level

      • for example (a fairly common pattern in more complex system
        setups),
        a peer actor (which may or may not also be the child of a common
        parent) has errored/cancelled and that RemoteActorError (RAE)
        gets relayed through multiple peer actors (in turn spanning
        a set of process-distributed trio.Task-trees each with their own
        SC-adhering "orthogonal hierarchy") and, since each received RAE
        is boxing some remote process's (prior) exception state, the
        only available info to determine a fault's source is the extra
        data we pack as part of error packing/relay, such as
        RemoteActorError.src_uid/.relay_uid/.tb_str/.sender.. etc.
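A minimal, hypothetical mock of that error-metadata packing: `MockRAE` is not tractor's actual `RemoteActorError` (which carries more fields and richer boxing), and `relay_path` is an illustrative variant of the `.relay_uid`-style hop tracking, but it shows why the packed metadata is the only way to trace a fault's origin across state-less peer actors.

```python
from dataclasses import dataclass, field


@dataclass
class MockRAE(Exception):
    '''
    illustrative stand-in for a boxed remote error: since peer
    actors share no state, the only way supervision logic can
    trace a fault across relays is the metadata packed alongside
    the boxed (prior) exception state.
    '''
    src_uid: tuple[str, str]   # actor where the error originated
    tb_str: str                # the original traceback, as text
    relay_path: list[tuple[str, str]] = field(default_factory=list)

    def relay_through(self, uid: tuple[str, str]) -> 'MockRAE':
        # each relaying hop appends its own uid so downstream
        # supervisors can distinguish the source from mere relayers
        self.relay_path.append(uid)
        return self
```

With this shape, a supervisor receiving the error after two hops can still tell that `src_uid` (not the last relayer) is where to apply any resiliency policy.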
