PF_EXITING goto-local path can strand exiting tasks on nohz_full cpus

## Summary
First time issue submitter here.

I've been hitting repeated `runnable task stall` watchdog ejections on a custom sched_ext scheduler under a wine/cargo workload, and I believe I've traced the enqueue-side trigger to the default `PF_EXITING` shortcut in `do_enqueue_task()` interacting badly with nohz_full iso cpus. Setting `SCX_OPS_ENQ_EXITING` and handling the PF_EXITING case explicitly in `ops.enqueue` with `SCX_ENQ_PREEMPT` empirically closes the window. Filing this for confirmation from maintainers — the enqueue-side race is clear-cut, but I don't have a complete model of the full dwell mechanism and would appreciate a sanity check before framing a patch.

The existing `SCX_OPS_ENQ_EXITING` kdoc describes a *different* failure mode (`bpf_task_from_pid()` lookup failure, RCU grace period stalls) and gives no indication that the default path is unsafe on nohz_full — that's the gap I'd like to close.

## Environment

- **Kernel:** 6.19.11-zen (verified the relevant paths are byte-identical on `torvalds/linux` master HEAD, see below)
- **Hardware:** Ryzen 7 5700X (8c/16t, single CCD)
- **Cmdline:** `nohz_full=1-7,9-15 rcu_nocbs=1-7,9-15 isolcpus=nohz,domain,managed_irq,1-7,9-15`
- **Scheduler:** custom `scx_cages` — strict-priority 4-tier DSQ scheduler. Iso cores (1-7, 9-15) run only game/"wrapped" and "promoted" tiers via custom DSQs; HK cores (0, 8) run janitor. Iso cores never run janitor, so there is no SCX slack work that would force `schedule()` on an idle iso cpu.
- **Workload that reproduces:** Steam/wine game (Helldivers 2) running on iso cores via the wrapped tier, plus a concurrent `cargo build --release -j16` under `iso.slice/promoted`. Reliably triggers within ~90s.

## Symptom

```
sched_ext: cages: runnable task stall (wineserver[285798] failed to run for 36.288s)
sched_ext: cages: runnable task stall (cargo[265224]    failed to run for 31.656s)
sched_ext: cages: runnable task stall (rustc[278231]    failed to run for 32.683s)
sched_ext: cages: runnable task stall (i386-linux-gnu-[287471] failed to run for 31.970s)
sched_ext: cages: runnable task stall (cc1[290477]      failed to run for 30.371s)
sched_ext: cages: runnable task stall (opt cgu.0[289296] failed to run for 30.789s)
```

10+ ejections in a single session across `wineserver`, `cargo`, `rustc`, `cc1`, `ld.lld`, `opt cgu` (rustc codegen units). The scheduler is auto-restarted by its service supervisor after each eject.

**Every stalled task is in the `do_exit → do_group_exit → __x64_sys_exit_group` family at eject time.**

## Evidence from `scx_dump_state`

I enabled the scx exit-time debug dump (set `exit_dump_len` on the struct_ops shadow before load — required, `scx_dump_state` at `kernel/sched/ext.c:4581` is a no-op with `len=0`). Sample from the wineserver[285798] eject:

```
CPU 2   : nr_run=1 flags=0x1 ops_qseq=89544392
          curr=swapper/2[0] class=idle_sched_class

  R wineserver[285798] -36288ms
      scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002
      dsq_vtime=0 slice=20000000 weight=100
      cpus=fefe no_mig=0

    do_exit+0x32a/0xa60
    do_group_exit+0x8b/0x90
    __x64_sys_exit_group+0x17/0x20
    x64_sys_call+0x15e0/0x1870
    do_syscall_64+0x73/0x290
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Event counters
--------------
SCX_EV_SELECT_CPU_FALLBACK:               26
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE:         0
SCX_EV_DISPATCH_KEEP_LAST:                 0
SCX_EV_ENQ_SKIP_EXITING:                1009
SCX_EV_ENQ_SKIP_MIGRATION_DISABLED:        0
SCX_EV_REFILL_SLICE_DFL:                1012
SCX_EV_BYPASS_DURATION:              9754621
SCX_EV_BYPASS_DISPATCH:                    3
SCX_EV_BYPASS_ACTIVATE:                    1
```

Five facts from this:

1. **`dsq_id = 0x8000000000000002` = `SCX_DSQ_LOCAL`** (= `SCX_DSQ_FLAG_BUILTIN | 2`, `include/linux/sched/ext.h:59`). The task is associated with *a* per-cpu local DSQ. Because the dump prints it while walking CPU 2's `rq->scx.runnable_list`, that is CPU 2's local DSQ specifically.

2. **`ops_state/qseq = 0/0` (SCX_OPSS_NONE)**. Tasks dispatched via `ops.enqueue` transition through `SCX_OPSS_QUEUEING` (set at `ext.c:1393`). `NONE` with `qseq=0` means `ops.enqueue` was **never called** for this task on this enqueue — i.e. the goto-local shortcut was taken, bypassing the scheduler.

3. **`slice = 20000000` = 20ms = `SCX_SLICE_DFL`** (`include/linux/sched/ext.h:30`). The scheduler's `tier_slice()` would have returned 5ms for the wrapped tier and 2ms for janitor. The slice here was set by `refill_task_slice_dfl` on the `enqueue:` label at `ext.c:1435`, which only fires when goto-local / goto-global / goto-bypass was taken. Matches the bypass conclusion from (2).

4. **`SCX_EV_ENQ_SKIP_EXITING = 1009`**. This counter is only incremented in one place — the PF_EXITING goto-local shortcut at `ext.c:1375`. 1009 PF_EXITING tasks were routed through it in the scheduler's lifetime window before the eject.

5. **Every stalled task's stack ends in `exit_group`**. Confirms the category.
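The field-by-field reading above can be condensed into a small userspace check. This is only a sketch for triaging dump output, not kernel code: the constants mirror the values cited above from `include/linux/sched/ext.h`, while `struct dump_fields` and `looks_like_enqueue_shortcut()` are hypothetical names introduced here for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as cited above from include/linux/sched/ext.h. */
#define DSQ_FLAG_BUILTIN (1ULL << 63)
#define DSQ_LOCAL        (DSQ_FLAG_BUILTIN | 2)  /* 0x8000000000000002 */
#define SLICE_DFL_NS     (20ULL * 1000000ULL)    /* 20ms in nanoseconds */

/* Hypothetical container for the fields printed per task in the dump. */
struct dump_fields {
	uint64_t dsq_id;
	uint64_t slice_ns;
	uint32_t ops_state; /* 0 corresponds to SCX_OPSS_NONE in the dump */
	uint64_t qseq;
};

/*
 * Heuristic only: true when a dumped task carries the signature described
 * above (local DSQ, ops.enqueue never entered, untouched default slice).
 */
static bool looks_like_enqueue_shortcut(const struct dump_fields *d)
{
	return d->dsq_id == DSQ_LOCAL &&
	       d->ops_state == 0 && d->qseq == 0 &&
	       d->slice_ns == SLICE_DFL_NS;
}
```

Feeding it the wineserver[285798] values (`dsq_id=0x8000000000000002`, `slice=20000000`, `ops_state/qseq=0/0`) returns true. A task that went through `ops.enqueue` with a scheduler-chosen 5ms slice would not match.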

## Analysis (enqueue-side)

With `SCX_OPS_ENQ_EXITING` unset (default), `do_enqueue_task()` at `kernel/sched/ext.c:1372-1377`:

```c
    /* see %SCX_OPS_ENQ_EXITING */
    if (!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
        unlikely(p->flags & PF_EXITING)) {
        __scx_add_event(sch, SCX_EV_ENQ_SKIP_EXITING, 1);
        goto local;
    }
```

Flow: skip `ops.enqueue`, `goto local`, fall through to the `enqueue:` label at `ext.c:1428`, which does:

```c
enqueue:
    touch_core_sched(rq, p);
    refill_task_slice_dfl(sch, p);
    dispatch_enqueue(sch, dsq, p, enq_flags);
```

`enq_flags` here is **whatever the wake path passed in**, typically `ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK | ...` — **no `SCX_ENQ_PREEMPT`** and no `SCX_ENQ_HEAD`.

`dispatch_enqueue` ends up calling `local_dsq_post_enq()` at `ext.c:987`:

```c
static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p,
                               u64 enq_flags)
{
    struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
    bool preempt = false;

    if (rq->scx.flags & SCX_RQ_IN_BALANCE)
        return;

    if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr &&
        rq->curr->sched_class == &ext_sched_class) {
        rq->curr->scx.slice = 0;
        preempt = true;
    }

    if (preempt || sched_class_above(&ext_sched_class, rq->curr->sched_class))
        resched_curr(rq);
}
```

Case analysis on `rq->curr` at the moment of this call:

- **`rq->curr` is ext-class** (common — under SCX ownership of the iso cpu, curr is usually a game thread). `(enq_flags & SCX_ENQ_PREEMPT) == 0`, so the PREEMPT branch doesn't fire (`preempt` stays false). `sched_class_above(ext_sched_class, ext_sched_class) == false`. Neither condition holds. **`resched_curr` is not called.** The exiting task lands in the local DSQ silently; no `TIF_NEED_RESCHED`, no IPI.

- **`rq->curr` is idle.** `sched_class_above(ext, idle) == true` → `resched_curr` is called. Normal case; no stranding.

- **`rq->curr` is RT/DL (higher class).** `sched_class_above(ext, rt) == false`. Neither condition holds. No resched. (Less relevant on SCX-owned cpus, but the same shape exists.)

So whenever an exiting task is ttwu'd (e.g., wakes from a mutex in `exit_mmap` / `release_task` / seccomp filter release, which happens during `do_exit` cleanup), and the target cpu's curr is ext-class at that exact instant, the exiting task is enqueued without any resched attempt. **This is what I believe the enqueue-side trigger is.**
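To make the case analysis easy to check, here is a toy userspace model of the decision in `local_dsq_post_enq()`. It is a simplification under stated assumptions: the `model_*` names are hypothetical, the class ordering is reduced to an enum, `MODEL_ENQ_PREEMPT` is an illustrative bit standing in for `SCX_ENQ_PREEMPT`, and the `p != rq->curr` check is omitted because the enqueued task is assumed not to be curr.

```c
#include <stdbool.h>
#include <stdint.h>

/* Scheduling classes in ascending priority; a stand-in for sched_class_above(). */
enum model_class { MODEL_IDLE, MODEL_EXT, MODEL_FAIR, MODEL_RT, MODEL_DL };

/* Illustrative bit standing in for SCX_ENQ_PREEMPT. */
#define MODEL_ENQ_PREEMPT (1ULL << 32)

static bool model_class_above(enum model_class a, enum model_class b)
{
	return a > b;
}

/* Returns true when the modelled path would end in resched_curr(). */
static bool model_post_enq_resched(enum model_class curr, uint64_t enq_flags,
				   bool in_balance)
{
	bool preempt = false;

	/* Mirrors the SCX_RQ_IN_BALANCE early return. */
	if (in_balance)
		return false;

	/* PREEMPT only takes effect when curr is itself an ext task. */
	if ((enq_flags & MODEL_ENQ_PREEMPT) && curr == MODEL_EXT)
		preempt = true;

	return preempt || model_class_above(MODEL_EXT, curr);
}
```

With `enq_flags == 0`, the model reports a resched only for `MODEL_IDLE`. Adding `MODEL_ENQ_PREEMPT` flips the `MODEL_EXT` case to true and leaves the `MODEL_RT` case false, which matches the three bullets above.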

## Open questions

This is the part I want to flag honestly and where I'd like maintainer input:

The enqueue-side race as described explains *how the task lands on a local DSQ without a resched*. It does not fully explain *why the task stays there for 30+ seconds*. Candidate continuations:

1. **Curr's slice eventually expires** → `schedule()` → `pick_task_scx` → `balance_one` → `first_local_task` picks the stranded task. At a 5-20ms slice this should bound the dwell at ~20ms, not 36s. So either:
   - Curr's slice doesn't decrement (nohz_full + `SCX_RQ_CAN_STOP_TICK`?), or
   - `balance_one` takes the `SCX_RQ_BAL_KEEP` path at `ext.c:2186-2189` (prev still has slice, keep running prev), and prev's slice is somehow indefinite, or
   - Curr goes idle without the pick ever seeing the stranded task. The idle transition itself goes through `schedule()`, so `first_local_task` should have picked the stranded task before idle was selected, yet the dump shows `curr=swapper/2` with `nr_run=1`.

2. **The enqueue happens *while* the curr is in the middle of its run**, then curr does enter `schedule()` later, but `pick_task_scx` somehow skips the stranded task. I don't have a plausible mechanism for this.

3. **There's a subsequent race**: curr hits its slice, enters `schedule()`, `balance_one()` runs, sets `SCX_RQ_IN_BALANCE`, and the cpu is mid-pick when the exiting-task enqueue arrives on a different rq lock window. In that case `local_dsq_post_enq` returns early at `ext.c:998-999` (IN_BALANCE short-circuit). The IN_BALANCE path is designed to let the ongoing `pick_task_scx` pick up the new arrival via `first_local_task`, but if the race window closes after the `first_local_task` call inside `balance_one` — the newly-arrived task would miss this pick. I'm not sure whether the code handles that case.

I don't have proof that any specific continuation explains the 36-second dwell. What I *do* know is (a) the enqueue-side race as described is real and counter-confirmed, and (b) the empirical fix below unambiguously stops the stalls, across ~60 minutes of the same workload that previously stalled within 90 seconds. I'd appreciate a maintainer confirming or correcting the dwell mechanism.

## Workaround

```c
.flags = ... | SCX_OPS_ENQ_EXITING,

void BPF_STRUCT_OPS(my_enqueue, struct task_struct *p, u64 enq_flags)
{
    if (p->flags & PF_EXITING) {
        scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, default_slice,
                           enq_flags | SCX_ENQ_PREEMPT);
        return;
    }
    ...
}
```

`SCX_OPS_ENQ_EXITING` routes PF_EXITING tasks through `ops.enqueue` instead of the goto-local shortcut. `SCX_ENQ_PREEMPT` makes `local_dsq_post_enq` take the first branch when curr is ext-class, zeroing curr's slice and calling `resched_curr`. That is exactly the case the default path misses; an idle curr is already covered by the `sched_class_above` check. The destination is the same as the kernel default (the task's current cpu's local DSQ, which is critical because a PF_EXITING task's `cpus_allowed` is locked to its cgroup cpuset at fork time and must not be routed cross-domain), but the resched now happens when curr is an ext task.

**Validation:**
- Before: `cargo build --release -j16` under `iso.slice/promoted` + a game running on `iso.slice/wrapped` → 4 ejects in 90 seconds (opt cgu, cc1, lto cgu, i386-linux-gnu-).
- After (same scheduler, only changes are setting `SCX_OPS_ENQ_EXITING` in `ops.flags` and inserting exiting tasks with `SCX_ENQ_PREEMPT`): same workload, 17.7s wallclock build, 110 cpu-seconds, **zero ejects**, `SCX_EV_ENQ_SKIP_EXITING` stays at 0 (kernel no longer takes the goto-local path for exiting tasks).

**Anti-pattern that bit me during debug:** I first tried routing PF_EXITING tasks to a separate "janitor" HK-only DSQ (CPUs 0, 8), thinking it would bypass nohz_full entirely. This made things *worse* because PF_EXITING tasks born under `iso.slice` have `cpus_allowed = 0xfefe` (iso mask). An HK-only DSQ drained only by HK cpus is unreachable for them — `task_can_run_on_remote_rq()` rejects the pick, and tasks pile up with no eligible consumer. 6+ stranded per iso cpu within 30 seconds. **Don't reroute exiting tasks across cpu domains.**
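The affinity mismatch behind that anti-pattern reduces to a mask intersection. The sketch below is illustrative userspace arithmetic only, assuming the 16-cpu layout from the Environment section; `mask_from_range()` and `has_eligible_consumer()` are hypothetical helpers, not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build a 16-bit cpu mask covering cpus lo..hi inclusive (hi must be < 16). */
static uint16_t mask_from_range(unsigned int lo, unsigned int hi)
{
	uint16_t mask = 0;

	for (unsigned int cpu = lo; cpu <= hi; cpu++)
		mask |= (uint16_t)(1u << cpu);
	return mask;
}

/*
 * A DSQ can only drain a task if at least one cpu that consumes from the
 * DSQ is also in the task's allowed mask.
 */
static bool has_eligible_consumer(uint16_t task_cpus, uint16_t dsq_consumers)
{
	return (task_cpus & dsq_consumers) != 0;
}
```

Here `mask_from_range(1, 7) | mask_from_range(9, 15)` yields `0xfefe`, the `cpus=fefe` value in the dump, and the HK mask for cpus 0 and 8 is `0x0101`. Their intersection is empty, so `has_eligible_consumer(0xfefe, 0x0101)` is false.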

## Mainline applicability

I verified the relevant functions are byte-identical on `torvalds/linux` master HEAD:

- `local_dsq_post_enq` body — identical
- The `PF_EXITING` goto-local shortcut (`SCX_EV_ENQ_SKIP_EXITING` / `goto local`) — identical
- The `enqueue:` label that does `refill_task_slice_dfl` + `dispatch_enqueue` — identical
- `SCX_OPS_ENQ_EXITING` kdoc comment in `kernel/sched/ext_internal.h` — same wording, same gap (mentions `bpf_task_from_pid` / RCU stalls as the motivation, does not mention nohz_full stranding)

So this is not a 6.19.x-only thing.

## Relationship to `SCX_ENQ_IMMED` (for-7.1)

Tejun's `SCX_ENQ_IMMED` patch queued in the `for-7.1` branch provides a general "never linger on local DSQ behind other tasks or on a cpu taken by a higher-priority class" primitive. From the public description:

> "Once a task is dispatched with IMMED, it either gets on the CPU immediately and stays on it, or gets reenqueued back to the BPF scheduler. It will never linger on a local DSQ behind other tasks or on a CPU taken by a higher-priority class."

This is a **caller opt-in** flag on `scx_bpf_dsq_insert`. From public sources I could not confirm whether the kernel-internal goto-local PF_EXITING path in `for-7.1` was also updated to pass `SCX_ENQ_IMMED`. If not, then:

- Schedulers on 7.1 that haven't been updated to use `SCX_ENQ_IMMED` remain vulnerable to the same race.
- Schedulers on pre-7.1 kernels (anyone on current stable) have no way to avoid the goto-local path except `SCX_OPS_ENQ_EXITING`.

Either way the kdoc change below is valuable independently of `SCX_ENQ_IMMED`. A maintainer who has the `for-7.1` tree open can tell me whether the goto-local path itself now passes `IMMED`, which would resolve the default case for 7.1+.

## Proposed resolutions

### Option A (low risk): kdoc patch

```diff
 /*
  * An exiting task may schedule after PF_EXITING is set. In such cases,
  * bpf_task_from_pid() may not be able to find the task and if the BPF
  * scheduler depends on pid lookup for dispatching, the task will be
  * lost leading to various issues including RCU grace period stalls.
  *
  * To mask this problem, by default, unhashed tasks are automatically
  * dispatched to the local DSQ on enqueue. If the BPF scheduler doesn't
  * depend on pid lookups and wants to handle these tasks directly, the
  * following flag can be used.
+ *
+ * Schedulers running on nohz_full cpus SHOULD set this flag. The default
+ * goto-local path passes the caller's enq_flags verbatim, which omits
+ * SCX_ENQ_PREEMPT. local_dsq_post_enq() only calls resched_curr() when
+ * PREEMPT is set and rq->curr is ext-class, or when ext_sched_class is
+ * above rq->curr's sched_class (i.e. rq->curr is idle). Under SCX
+ * ownership of an iso cpu, rq->curr is typically ext-class, so the
+ * exiting task can be enqueued without a resched on a nohz_full cpu and
+ * strand on the local DSQ until the 30s runnable-task-stall watchdog
+ * ejects the scheduler. Setting this flag routes exiting tasks through
+ * ops.enqueue(), where the scheduler can insert with SCX_ENQ_PREEMPT
+ * (or, on 7.1+, SCX_ENQ_IMMED) to force resched.
  */
 SCX_OPS_ENQ_EXITING = 1LLU << 2,
```

Narrow, acknowledges the existing escape hatch, low-risk. I'm happy to send this as a proper patch with `Signed-off-by` if the wording works.

### Option B (broader fix): force PREEMPT in the goto-local path

```diff
-    /* see %SCX_OPS_ENQ_EXITING */
-    if (!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
-        unlikely(p->flags & PF_EXITING)) {
-        __scx_add_event(sch, SCX_EV_ENQ_SKIP_EXITING, 1);
-        goto local;
-    }
+    /* see %SCX_OPS_ENQ_EXITING */
+    if (!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
+        unlikely(p->flags & PF_EXITING)) {
+        __scx_add_event(sch, SCX_EV_ENQ_SKIP_EXITING, 1);
+        /*
+         * Force resched on the target cpu. Without PREEMPT,
+         * local_dsq_post_enq() skips resched_curr() when
+         * rq->curr is ext-class, which can strand exiting
+         * tasks on nohz_full cpus.
+         */
+        enq_flags |= SCX_ENQ_PREEMPT;
+        goto local;
+    }
```

This makes behavior safe by default and obsoletes one reason for schedulers to set `SCX_OPS_ENQ_EXITING`. Risk: changes default preemption semantics for all exiting-task enqueues on all schedulers (preempts a running ext task mid-slice to run a dying task). I'd defer to maintainers on whether that's acceptable.

## What I'm asking for

1. **Am I reading the enqueue-side race correctly?** In particular, is there a reason the default goto-local path deliberately omits PREEMPT that I'm missing?
2. **Is the dwell-time gap (open question above) worth chasing?** Or is "it eventually strands" sufficient once we've established the missed-resched?
3. **Is the kdoc patch (Option A) acceptable as framed?** I'll send it as a formal patch if so.
4. **On for-7.1:** does the goto-local path itself pass `SCX_ENQ_IMMED` now? That would localize the problem to pre-7.1 stable kernels and tighten the doc patch scope.

Happy to provide more data, instrument further, or run experiments if anything is unclear.

Thanks.
