Description
Deleting a NodeSet CRD while Slurm jobs still reference its partition/node causes slurmctld to crash with SIGSEGV on the next reconfigure, entering an infinite CrashLoopBackOff.
The crash loop is self-perpetuating because job_state persists on PVC. Every restart causes slurmctld to attempt state recovery from the same corrupted job_state, hit the same null pointer dereference, and crash again.
Root Cause
NodeSet deleted
→ Controller reconciler regenerates slurm.conf (partition/node removed)
→ ConfigMap updated → /etc/slurm hash changes
→ reconfigure.sh sidecar detects hash change
→ scontrol reconfigure
→ slurmctld forks child to reload config + recover job state
→ child reads job_state: job references partition no longer in config
→ NULL pointer dereference → SIGSEGV (child)
→ restart → reads same job_state → crash again ...
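The chain above can be simulated in a few lines. This is a hedged illustration, not slurmctld code: the names (`recover_job_state`, the dict-based `job_state`) are invented, and Python's `TypeError` on a `None` lookup stands in for the NULL pointer dereference in C.

```python
# Hypothetical simulation of the crash loop: job_state persists across
# restarts (PVC), so every recovery attempt fails on the same missing
# partition. Names here are illustrative, not slurmctld internals.

job_state = [{"job_id": 69, "partition": "test-partition"}]  # persisted on PVC

def recover_job_state(config_partitions):
    """Mimics state recovery after `scontrol reconfigure` reloads slurm.conf."""
    for job in job_state:
        part = config_partitions.get(job["partition"])  # None once removed
        # Unchecked dereference of the partition record, analogous to the
        # NULL pointer dereference that raises SIGSEGV in C:
        return part["nodes"]  # raises TypeError when part is None

partitions = {"debug": {"nodes": "node[0-3]"}}  # test-partition already deleted

crashes = 0
for restart in range(3):          # each container restart...
    try:
        recover_job_state(partitions)
    except TypeError:
        crashes += 1              # ...hits the identical failure
print(crashes)  # 3 -- the loop is self-perpetuating
```

Because the corrupted `job_state` is re-read unchanged on every restart, nothing inside the container can break the cycle; only changing the config (or the state) does.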
Note that this is not limited to running jobs. Slurm retains job records in job_state even after jobs have ended, for up to MinJobAge seconds (default 300s) for accounting sync and dependency resolution. Deleting the NodeSet while any job record still references its partition triggers the SIGSEGV.
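The MinJobAge window implies there is an earliest time at which removing the partition becomes safe. A minimal sketch, assuming the default `MinJobAge = 300` (the helper name and record shape are hypothetical, not slurm-operator APIs):

```python
from datetime import datetime, timedelta

MIN_JOB_AGE = timedelta(seconds=300)  # Slurm default MinJobAge

def earliest_safe_deletion(job_end_times):
    """Job records linger in job_state for MinJobAge after completion,
    so the partition is only safe to remove once the newest record for it
    has aged out. Illustrative helper, not part of slurm-operator."""
    if not job_end_times:
        return None  # no records referencing the partition -> safe now
    return max(job_end_times) + MIN_JOB_AGE

# Two completed jobs that referenced the partition:
ends = [datetime(2026, 3, 10, 6, 13, 0), datetime(2026, 3, 10, 6, 14, 30)]
print(earliest_safe_deletion(ends))  # 2026-03-10 06:19:30
```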
Steps to Reproduce
- Install slurm-operator with `StateSaveLocation` on a PVC
- Create a NodeSet (e.g., `test-partition`, `replicas=1`)
- Wait for the partition to appear in `sinfo`
- Submit a job: `sbatch --partition=test-partition --wrap="sleep 300"`
- Delete the NodeSet CRD directly (no `scancel`, no scale-down)
- Observe slurmctld crashing with SIGSEGV on the next reconfigure
Variant (MinJobAge race): Even if all jobs on the partition have already ended, deleting the NodeSet before MinJobAge expires causes the same crash.
Reproduction Log
Environment: slurm-operator v1.0.1, Slurm 25.11, Kubernetes v1.31.7
Timeline:
| Time (UTC) | Event |
|---|---|
| 06:13:26 | sbatch --partition=test-partition → job submitted (RUNNING) |
| 06:13:29 | kubectl delete nodeset test-partition |
| 06:14:03 | reconfigure.sh detects hash change → scontrol reconfigure |
| 06:14:08 | SIGSEGV — slurmctld child crashes |
| 06:15:03 | Container killed by Kubernetes (SIGTERM) |
| 06:15:04 | Restart #1 — on recovery reads stale job_state |
| 06:15:08 | error: Invalid partition (test-partition) for JobId=69 |
| 06:15~06:17 | Crash loop: 6 total restarts |
| 06:17:35 | Recovery: NodeSet re-created with replicas: 0 → partition restored → slurmctld stabilizes |
slurm-controller-0 supervisor log:
```
# Reconfigure triggered after NodeSet deletion
[2026-03-10 06:14:03+00:00] fakesystemd.sh: received PID=26947
2026-03-10 06:14:03,998 INFO reaped unknown pid 26804 (exit status 0)
2026-03-10 06:14:03,998 INFO reaped unknown pid 26838 (exit status 0)
# SIGSEGV — child process crashes during state recovery
2026-03-10 06:14:08,002 INFO reaped unknown pid 26947 (terminated by SIGSEGV (core dumped))
2026-03-10 06:14:08,002 INFO reaped unknown pid 26981 (exit status 0)
# Kubernetes kills the destabilized container
2026-03-10 06:15:03,060 WARN received SIGTERM indicating exit request
2026-03-10 06:15:04,062 WARN stopped: fakesystemd (terminated by SIGTERM)
```
slurmctld log on restart (reading stale job_state):
```
[2026-03-10T06:15:08] error: Invalid partition (test-partition) for JobId=69
[2026-03-10T06:15:08] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 4 partitions
[2026-03-10T06:15:08] Running as primary controller
```
Environment
- slurm-operator: `v1.0.1` (image: `ghcr.io/slinkyproject/slurm-operator:1.0.1`)
- Slurm: `25.11` (image: `ghcr.io/slinkyproject/slurmctld:25.11-ubuntu24.04`)
- Kubernetes: `v1.31.7`
- `StateSaveLocation` on persistent PVC
- `MinJobAge = 300` (default)
Expected Behavior
Deleting a NodeSet should not cause slurmctld to crash, regardless of whether jobs referencing its partition exist or have recently completed.
One possible approach would be to ensure that partition/node definitions are not removed from slurm.conf while job_state still holds records referencing them — for example, via a finalizer that defers deletion until stale job records have been purged.
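The finalizer idea above reduces to a simple gate: deletion may proceed only when `job_state` holds no record (running or within the MinJobAge window) referencing the NodeSet's partition. A hedged sketch of that check — the function name, record shape, and the way the reconciler would call it are all assumptions, not slurm-operator code:

```python
# Hypothetical finalizer gate: defer removing the partition from slurm.conf
# until no job record still references it. In the real operator this would
# run inside the reconcile loop before the finalizer is removed.

def can_finalize_nodeset(partition, job_records):
    """True only when job_state holds no record for the partition,
    regardless of job state (RUNNING, COMPLETED-but-within-MinJobAge, ...)."""
    blocking = [j for j in job_records if j["partition"] == partition]
    return len(blocking) == 0

records = [
    {"job_id": 69, "partition": "test-partition", "state": "COMPLETED"},
    {"job_id": 70, "partition": "debug", "state": "RUNNING"},
]

# Job 69 has ended but is still within MinJobAge -> deletion is deferred:
print(can_finalize_nodeset("test-partition", records))  # False
# Once the record ages out of job_state, deletion can proceed:
aged_out = [r for r in records if r["partition"] != "test-partition"]
print(can_finalize_nodeset("test-partition", aged_out))  # True
```

When the gate returns false, the reconciler would requeue and retry after the remaining MinJobAge interval instead of regenerating slurm.conf immediately.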
Additional Context
- The scale-in path (`processCondemned`) already has drain-before-delete logic. The NodeSet deletion path could potentially benefit from equivalent safety guarantees.
- This is distinct from PR #79 (Prevent job termination on Slurm node lookup failures; fail-closed for node drain lookup) and PR #134 (Document NodeSet drain design and ops guide; drain design docs): neither addresses the `job_state` race on full NodeSet deletion.