Summary
First time issue submitter here.
I've been hitting repeated runnable task stall watchdog ejections on a custom sched_ext scheduler under a wine/cargo workload, and I believe I've traced the enqueue-side trigger to the default PF_EXITING shortcut in do_enqueue_task() interacting badly with nohz_full iso cpus. Setting SCX_OPS_ENQ_EXITING and handling the PF_EXITING case explicitly in ops.enqueue with SCX_ENQ_PREEMPT empirically closes the window. Filing this for confirmation from maintainers — the enqueue-side race is clear-cut, but I don't have a complete model of the full dwell mechanism and would appreciate a sanity check before framing a patch.
The existing SCX_OPS_ENQ_EXITING kdoc describes a different failure mode (bpf_task_from_pid() lookup failure, RCU grace period stalls) and gives no indication that the default path is unsafe on nohz_full — that's the gap I'd like to close.
Environment
- Kernel: 6.19.11-zen (verified the relevant paths are byte-identical on torvalds/master HEAD, see below)
- Hardware: Ryzen 7 5700X (8c/16t, single CCD)
- Cmdline: nohz_full=1-7,9-15 rcu_nocbs=1-7,9-15 isolcpus=nohz,domain,managed_irq,1-7,9-15
- Scheduler: custom scx_cages, a strict-priority 4-tier DSQ scheduler. Iso cores (1-7, 9-15) run only game/"wrapped" and "promoted" tiers via custom DSQs; HK cores (0, 8) run janitor. Iso cores never run janitor, so there is no SCX slack work that would force schedule() on an idle iso cpu.
- Workload that reproduces: Steam/wine game (Helldivers 2) running on iso cores via the wrapped tier, plus a concurrent cargo build --release -j16 under iso.slice/promoted. Reliably triggers within ~90s.
Symptom
sched_ext: cages: runnable task stall (wineserver[285798] failed to run for 36.288s)
sched_ext: cages: runnable task stall (cargo[265224] failed to run for 31.656s)
sched_ext: cages: runnable task stall (rustc[278231] failed to run for 32.683s)
sched_ext: cages: runnable task stall (i386-linux-gnu-[287471] failed to run for 31.970s)
sched_ext: cages: runnable task stall (cc1[290477] failed to run for 30.371s)
sched_ext: cages: runnable task stall (opt cgu.0[289296] failed to run for 30.789s)
10+ ejections in a single session across wineserver, cargo, rustc, cc1, ld.lld, opt cgu (rustc codegen units). The scheduler is auto-restarted by its service supervisor after each eject.
Every stalled task is in the do_exit → do_group_exit → __x64_sys_exit_group family at eject time.
Evidence from scx_dump_state
I enabled the scx exit-time debug dump (set exit_dump_len on the struct_ops shadow before load — required, scx_dump_state at kernel/sched/ext.c:4581 is a no-op with len=0). Sample from the wineserver[285798] eject:
CPU 2 : nr_run=1 flags=0x1 ops_qseq=89544392
curr=swapper/2[0] class=idle_sched_class
R wineserver[285798] -36288ms
scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002
dsq_vtime=0 slice=20000000 weight=100
cpus=fefe no_mig=0
do_exit+0x32a/0xa60
do_group_exit+0x8b/0x90
__x64_sys_exit_group+0x17/0x20
x64_sys_call+0x15e0/0x1870
do_syscall_64+0x73/0x290
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Event counters
--------------
SCX_EV_SELECT_CPU_FALLBACK: 26
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE: 0
SCX_EV_DISPATCH_KEEP_LAST: 0
SCX_EV_ENQ_SKIP_EXITING: 1009
SCX_EV_ENQ_SKIP_MIGRATION_DISABLED: 0
SCX_EV_REFILL_SLICE_DFL: 1012
SCX_EV_BYPASS_DURATION: 9754621
SCX_EV_BYPASS_DISPATCH: 3
SCX_EV_BYPASS_ACTIVATE: 1
Five facts from this:
1. dsq_id = 0x8000000000000002 = SCX_DSQ_LOCAL (= SCX_DSQ_FLAG_BUILTIN | 2, include/linux/sched/ext.h:59). The task is associated with a per-cpu local DSQ. Because the dump reached it by walking rq->scx.runnable_list from CPU 2's rq, it is CPU 2's local DSQ specifically.
2. ops_state/qseq = 0/0 (SCX_OPSS_NONE). Tasks dispatched via ops.enqueue transition through SCX_OPSS_QUEUEING (set at ext.c:1393). NONE with qseq=0 means ops.enqueue was never called for this task on this enqueue, i.e. the goto-local shortcut was taken, bypassing the scheduler.
3. slice = 20000000 = 20ms = SCX_SLICE_DFL (include/linux/sched/ext.h:30). The scheduler's tier_slice() would have returned 5ms for the wrapped tier and 2ms for janitor. The slice here was set by refill_task_slice_dfl on the enqueue: label at ext.c:1435, which only fires when goto-local / goto-global / goto-bypass was taken. Matches the bypass conclusion from (2).
4. SCX_EV_ENQ_SKIP_EXITING = 1009. This counter is only incremented in one place: the PF_EXITING goto-local shortcut at ext.c:1375. 1009 PF_EXITING tasks were routed through it in the scheduler's lifetime window before the eject.
5. Every stalled task's stack ends in exit_group. Confirms the category.
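For anyone who wants to check facts 1-3 against a dump of their own, here is a minimal userspace sketch of the decoding. The constants mirror the values quoted above (SCX_DSQ_FLAG_BUILTIN as bit 63 is my reading of the header and should be checked against your tree); the helper names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as quoted in this report from include/linux/sched/ext.h.
 * Treat them as assumptions and verify against your own kernel tree. */
#define SCX_DSQ_FLAG_BUILTIN	(1ULL << 63)
#define SCX_DSQ_LOCAL		(SCX_DSQ_FLAG_BUILTIN | 2)
#define SCX_SLICE_DFL_NS	(20ULL * 1000 * 1000)	/* 20ms in ns */

/* The dumped dsq_id should be exactly this value if the task is on a local DSQ. */
static_assert(SCX_DSQ_LOCAL == 0x8000000000000002ULL,
	      "SCX_DSQ_LOCAL does not match the dumped dsq_id");

/* Hypothetical helper: does the dumped dsq_id name the per-cpu local DSQ? */
static bool dump_dsq_is_local(uint64_t dsq_id)
{
	return dsq_id == SCX_DSQ_LOCAL;
}

/* Hypothetical helper: ops_state 0 (SCX_OPSS_NONE) with qseq 0 plus the
 * kernel default slice is the signature this report attributes to the
 * goto-local shortcut. */
static bool dump_looks_like_goto_local(uint64_t ops_state, uint64_t qseq,
				       uint64_t slice_ns)
{
	return ops_state == 0 && qseq == 0 && slice_ns == SCX_SLICE_DFL_NS;
}
```

This is only a consistency check on the dump fields. A scheduler that itself hands out 20ms slices would match the slice test as well, so the three fields are worth reading together with the SCX_EV_ENQ_SKIP_EXITING counter.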
Analysis (enqueue-side)
With SCX_OPS_ENQ_EXITING unset (default), do_enqueue_task() at kernel/sched/ext.c:1372-1377:
	/* see %SCX_OPS_ENQ_EXITING */
	if (!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
	    unlikely(p->flags & PF_EXITING)) {
		__scx_add_event(sch, SCX_EV_ENQ_SKIP_EXITING, 1);
		goto local;
	}
Flow: skip ops.enqueue, goto local, fall through to the enqueue: label at ext.c:1428, which does:
enqueue:
	touch_core_sched(rq, p);
	refill_task_slice_dfl(sch, p);
	dispatch_enqueue(sch, dsq, p, enq_flags);
enq_flags here is whatever the wake path passed in, typically ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK | ... — no SCX_ENQ_PREEMPT and no SCX_ENQ_HEAD.
dispatch_enqueue ends up calling local_dsq_post_enq() at ext.c:987:
static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p,
			       u64 enq_flags)
{
	struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
	bool preempt = false;

	if (rq->scx.flags & SCX_RQ_IN_BALANCE)
		return;

	if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr &&
	    rq->curr->sched_class == &ext_sched_class) {
		rq->curr->scx.slice = 0;
		preempt = true;
	}

	if (preempt || sched_class_above(&ext_sched_class, rq->curr->sched_class))
		resched_curr(rq);
}
Case analysis on rq->curr at the moment of this call:
- rq->curr is ext-class (the common case: under SCX ownership of the iso cpu, curr is usually a game thread). (enq_flags & SCX_ENQ_PREEMPT) == 0, so the PREEMPT branch doesn't fire (preempt stays false). sched_class_above(ext_sched_class, ext_sched_class) == false. Neither condition holds. resched_curr is not called. The exiting task lands in the local DSQ silently; no TIF_NEED_RESCHED, no IPI.
- rq->curr is idle. sched_class_above(ext, idle) == true → resched_curr is called. Normal case; no stranding.
- rq->curr is RT/DL (higher class). sched_class_above(ext, rt) == false. Neither condition holds. No resched. (Less relevant on SCX-owned cpus, but the same shape exists.)
So whenever an exiting task is ttwu'd (e.g., wakes from a mutex in exit_mmap / release_task / seccomp filter release, which happens during do_exit cleanup), and the target cpu's curr is ext-class at that exact instant, the exiting task is enqueued without any resched attempt. This is what I believe the enqueue-side trigger is.
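The case analysis above can be reproduced with a small userspace model of the decision in local_dsq_post_enq(). This is a sketch of the quoted logic only, not kernel code: the enum, the MODEL_ENQ_PREEMPT bit and the function name are stand-ins I made up, and the class ordering (idle < ext < fair < rt < dl) is my understanding of where ext_sched_class sits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for sched classes; a higher value means a higher-priority class.
 * The ordering is an assumption about where ext_sched_class sits. */
enum model_class { MODEL_IDLE, MODEL_EXT, MODEL_FAIR, MODEL_RT, MODEL_DL };

static_assert(MODEL_IDLE < MODEL_EXT && MODEL_EXT < MODEL_RT,
	      "model assumes idle < ext < rt");

/* Placeholder bit standing in for SCX_ENQ_PREEMPT in this model. */
#define MODEL_ENQ_PREEMPT	(1ULL << 32)

/* Returns true if the modeled local_dsq_post_enq() would call resched_curr().
 * curr_is_p models the "p != rq->curr" test, in_balance models
 * SCX_RQ_IN_BALANCE being set on the rq. */
static bool model_post_enq_rescheds(enum model_class curr, bool curr_is_p,
				    bool in_balance, uint64_t enq_flags)
{
	bool preempt = false;

	if (in_balance)
		return false;

	/* PREEMPT only takes effect against a different, ext-class curr. */
	if ((enq_flags & MODEL_ENQ_PREEMPT) && !curr_is_p && curr == MODEL_EXT)
		preempt = true;

	/* Otherwise resched only if ext is above curr's class (i.e. curr is idle). */
	return preempt || MODEL_EXT > curr;
}
```

With enq_flags == 0 the model reschedules only for an idle curr; adding MODEL_ENQ_PREEMPT flips the ext-class case to a resched and leaves the RT/DL case unchanged.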
Open questions
This is the part I want to flag honestly and where I'd like maintainer input:
The enqueue-side race as described explains how the task lands on a local DSQ without a resched. It does not fully explain why the task stays there for 30+ seconds. Candidate continuations:
1. Curr's slice eventually expires → schedule() → pick_task_scx → balance_one → first_local_task picks the stranded task. At a 5-20ms slice this should bound the dwell at ~20ms, not 36s. So either:
   - Curr's slice doesn't decrement (nohz_full + SCX_RQ_CAN_STOP_TICK?), or
   - balance_one takes the SCX_RQ_BAL_KEEP path at ext.c:2186-2189 (prev still has slice, keep running prev), and prev's slice is somehow indefinite, or
   - Curr goes idle without calling schedule(). That should not be possible: the idle transition goes through schedule(), which should have picked the stranded task via first_local_task.
2. The enqueue happens while curr is in the middle of its run, then curr does enter schedule() later, but pick_task_scx somehow skips the stranded task. I don't have a plausible mechanism for this.
3. There's a subsequent race: curr hits its slice, enters schedule(), balance_one() runs, sets SCX_RQ_IN_BALANCE, and the cpu is mid-pick when the exiting-task enqueue arrives on a different rq lock window. In that case local_dsq_post_enq returns early at ext.c:998-999 (IN_BALANCE short-circuit). The IN_BALANCE path is designed to let the ongoing pick_task_scx pick up the new arrival via first_local_task, but if the race window closes after the first_local_task call inside balance_one, the newly arrived task would miss this pick. I'm not sure whether the code handles that case.
I don't have proof that any specific continuation explains the 36-second dwell. What I do know is (a) the enqueue-side race as described is real and counter-confirmed, and (b) the empirical fix below unambiguously stops the stalls, across ~60 minutes of the same workload that previously stalled within 90 seconds. I'd appreciate a maintainer confirming or correcting the dwell mechanism.
Workaround
.flags = ... | SCX_OPS_ENQ_EXITING,

void BPF_STRUCT_OPS(my_enqueue, struct task_struct *p, u64 enq_flags)
{
	if (p->flags & PF_EXITING) {
		/* Same destination as the kernel default, plus a forced resched. */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, default_slice,
				   enq_flags | SCX_ENQ_PREEMPT);
		return;
	}
	...
}
SCX_OPS_ENQ_EXITING routes PF_EXITING tasks through ops.enqueue instead of the goto-local shortcut. SCX_ENQ_PREEMPT makes local_dsq_post_enq take the first branch whenever curr is an ext-class task other than p: curr's slice is zeroed and resched_curr is called. The destination is the same as the kernel default (the task's current cpu's local DSQ), which is critical because a PF_EXITING task's cpus_allowed is locked to its cgroup cpuset at fork time and must not be routed cross-domain, but the missed resched in the ext-vs-ext case is closed.
Validation:
- Before: cargo build --release -j16 under iso.slice/promoted + a game running on iso.slice/wrapped → 4 ejects in 90 seconds (opt cgu, cc1, lto cgu, i386-linux-gnu-).
- After (same scheduler, only change is setting SCX_OPS_ENQ_EXITING and inserting exiting tasks with SCX_ENQ_PREEMPT): same workload, 17.7s wallclock build, 110 cpu-seconds, zero ejects, SCX_EV_ENQ_SKIP_EXITING stays at 0 (kernel no longer takes the goto-local path for exiting tasks).
Anti-pattern that bit me during debug: I first tried routing PF_EXITING tasks to a separate "janitor" HK-only DSQ (CPUs 0, 8), thinking it would bypass nohz_full entirely. This made things worse because PF_EXITING tasks born under iso.slice have cpus_allowed = 0xfefe (iso mask). An HK-only DSQ drained only by HK cpus is unreachable for them — task_can_run_on_remote_rq() rejects the pick, and tasks pile up with no eligible consumer. 6+ stranded per iso cpu within 30 seconds. Don't reroute exiting tasks across cpu domains.
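The reachability problem is easy to see with plain bitmask arithmetic. The sketch below rebuilds the iso and HK masks from the cmdline ranges and checks whether a DSQ's consumers overlap the task's allowed cpus; both helper names are hypothetical, and this only models the mask intersection, not what task_can_run_on_remote_rq() does in full.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: mask with cpus lo..hi (inclusive) set.
 * Limited to cpu numbers below 63 to keep the shifts well defined. */
static uint64_t cpu_range_mask(unsigned int lo, unsigned int hi)
{
	assert(lo <= hi && hi < 63);
	return ((1ULL << (hi + 1)) - 1) & ~((1ULL << lo) - 1);
}

/* Hypothetical helper: a DSQ drained only by consumer_cpus can serve a task
 * only if at least one consumer cpu is in the task's allowed mask. */
static bool dsq_reachable(uint64_t task_cpus, uint64_t consumer_cpus)
{
	return (task_cpus & consumer_cpus) != 0;
}
```

With the ranges from this report, cpu_range_mask(1, 7) | cpu_range_mask(9, 15) gives 0xfefe and the HK set {0, 8} gives 0x0101, so dsq_reachable() is false for an iso-born task on an HK-only DSQ.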
Mainline applicability
I verified the relevant functions are byte-identical on torvalds/linux master HEAD:
- local_dsq_post_enq body: identical
- The PF_EXITING goto-local shortcut (SCX_EV_ENQ_SKIP_EXITING / goto local): identical
- The enqueue: label that does refill_task_slice_dfl + dispatch_enqueue: identical
- SCX_OPS_ENQ_EXITING kdoc comment in kernel/sched/ext_internal.h: same wording, same gap (mentions bpf_task_from_pid / RCU stalls as the motivation, does not mention nohz_full stranding)
So this is not a 6.19.x-only thing.
Relationship to SCX_ENQ_IMMED (for-7.1)
Tejun's SCX_ENQ_IMMED patch queued in the for-7.1 branch provides a general "never linger on local DSQ behind other tasks or on a cpu taken by a higher-priority class" primitive. From the public description:
"Once a task is dispatched with IMMED, it either gets on the CPU immediately and stays on it, or gets reenqueued back to the BPF scheduler. It will never linger on a local DSQ behind other tasks or on a CPU taken by a higher-priority class."
This is a caller opt-in flag on scx_bpf_dsq_insert. From public searches I couldn't confirm whether the kernel-internal goto-local PF_EXITING path in for-7.1 was also updated to pass SCX_ENQ_IMMED. If not, then:
- Schedulers on 7.1 that haven't been updated to use SCX_ENQ_IMMED remain vulnerable to the same race.
- Schedulers on pre-7.1 kernels (anyone on current stable) have no way to avoid the goto-local path except SCX_OPS_ENQ_EXITING.
Either way the kdoc change below is valuable independently of SCX_ENQ_IMMED. A maintainer who has the for-7.1 tree open can tell me whether the goto-local path itself now passes IMMED, which would resolve the default case for 7.1+.
Proposed resolutions
Option A (low risk): kdoc patch
 	/*
 	 * An exiting task may schedule after PF_EXITING is set. In such cases,
 	 * bpf_task_from_pid() may not be able to find the task and if the BPF
 	 * scheduler depends on pid lookup for dispatching, the task will be
 	 * lost leading to various issues including RCU grace period stalls.
 	 *
 	 * To mask this problem, by default, unhashed tasks are automatically
 	 * dispatched to the local DSQ on enqueue. If the BPF scheduler doesn't
 	 * depend on pid lookups and wants to handle these tasks directly, the
 	 * following flag can be used.
+	 *
+	 * Schedulers running on nohz_full cpus SHOULD set this flag. The default
+	 * goto-local path passes the caller's enq_flags verbatim, which omits
+	 * SCX_ENQ_PREEMPT. local_dsq_post_enq() only calls resched_curr() when
+	 * PREEMPT is set or when rq->curr's sched_class is below ext_sched_class.
+	 * Under SCX ownership of an iso cpu, rq->curr is typically ext-class, so
+	 * the exiting task can be enqueued silently on a nohz_full cpu and
+	 * strand on the local DSQ until the 30s runnable-task-stall watchdog
+	 * ejects the scheduler. Setting this flag routes exiting tasks through
+	 * ops.enqueue(), where the scheduler can insert with SCX_ENQ_PREEMPT
+	 * (or, on 7.1+, SCX_ENQ_IMMED) to force resched.
 	 */
 	SCX_OPS_ENQ_EXITING	= 1LLU << 2,
Narrow, acknowledges the existing escape hatch, low-risk. I'm happy to send this as a proper patch with Signed-off-by if the wording works.
Option B (broader fix): force PREEMPT in the goto-local path
-	/* see %SCX_OPS_ENQ_EXITING */
-	if (!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
-	    unlikely(p->flags & PF_EXITING)) {
-		__scx_add_event(sch, SCX_EV_ENQ_SKIP_EXITING, 1);
-		goto local;
-	}
+	/* see %SCX_OPS_ENQ_EXITING */
+	if (!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
+	    unlikely(p->flags & PF_EXITING)) {
+		__scx_add_event(sch, SCX_EV_ENQ_SKIP_EXITING, 1);
+		/*
+		 * Force resched on the target cpu. The default
+		 * local_dsq_post_enq path skips resched_curr for
+		 * ext-vs-ext wakes without PREEMPT, which can strand
+		 * exiting tasks on nohz_full cpus.
+		 */
+		enq_flags |= SCX_ENQ_PREEMPT;
+		goto local;
+	}
This makes behavior safe by default and obsoletes one reason for schedulers to set SCX_OPS_ENQ_EXITING. Risk: changes default preemption semantics for all exiting-task enqueues on all schedulers (preempts a running ext task mid-slice to run a dying task). I'd defer to maintainers on whether that's acceptable.
What I'm asking for
- Am I reading the enqueue-side race correctly? In particular, is there a reason the default goto-local path deliberately omits PREEMPT that I'm missing?
- Is the dwell-time gap (open question above) worth chasing? Or is "it eventually strands" sufficient once we've established the missed-resched?
- Is the kdoc patch (Option A) acceptable as framed? I'll send it as a formal patch if so.
- On for-7.1: does the goto-local path itself pass SCX_ENQ_IMMED now? That would localize the problem to pre-7.1 stable kernels and tighten the doc patch scope.
Happy to provide more data, instrument further, or run experiments if anything is unclear.
Thanks.