[Bug]: Deleting a NodeSet while jobs exist causes slurmctld SIGSEGV on reconfigure (job_state race with MinJobAge) #138

@pichiu

Description

Deleting a NodeSet CRD while Slurm jobs still reference its partition/node causes slurmctld to crash with SIGSEGV on the next reconfigure, entering an infinite CrashLoopBackOff.

The crash loop is self-perpetuating because job_state persists on the PVC. Every restart makes slurmctld attempt state recovery from the same stale job_state, hit the same NULL pointer dereference, and crash again.

Root Cause

```
NodeSet deleted
  → Controller reconciler regenerates slurm.conf (partition/node removed)
    → ConfigMap updated → /etc/slurm hash changes
      → reconfigure.sh sidecar detects hash change
        → scontrol reconfigure
          → slurmctld forks child to reload config + recover job state
            → child reads job_state: job references partition no longer in config
              → NULL pointer dereference → SIGSEGV (child)
                → restart → reads same job_state → crash again ...
```
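The sidecar's hash-watch step in the chain above can be sketched as follows. This is a minimal illustration only: the config directory default, poll interval, and use of `sha256sum` are assumptions, not the operator's actual reconfigure.sh implementation.

```shell
#!/usr/bin/env bash
# Sketch of a config-hash watch loop (illustrative; CONF_DIR, INTERVAL,
# and sha256sum are assumptions, not slurm-operator internals).
CONF_DIR="${CONF_DIR:-/etc/slurm}"

conf_hash() {
  # Hash file names and contents so edits, additions, and removals all
  # change the resulting digest.
  find "$CONF_DIR" -type f -print0 | sort -z \
    | xargs -0 sha256sum | sha256sum | cut -d' ' -f1
}

watch_loop() {
  local last cur
  last="$(conf_hash)"
  while sleep "${INTERVAL:-30}"; do
    cur="$(conf_hash)"
    if [ "$cur" != "$last" ]; then
      # This is the step that forwards the ConfigMap change to slurmctld,
      # which then forks a child to re-read config and recover job state.
      scontrol reconfigure
      last="$cur"
    fi
  done
}
```

The key point for this bug: the watch fires on any hash change, including one that removed a partition still referenced by job_state.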

Note that this is not limited to running jobs. Slurm retains job records in job_state even after jobs have ended, for up to MinJobAge seconds (default 300s) for accounting sync and dependency resolution. Deleting the NodeSet while any job record still references its partition triggers the SIGSEGV.
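The retention window is governed by MinJobAge in slurm.conf. A fragment matching the environment in this report (the StateSaveLocation path is illustrative; what matters here is only that it sits on the PVC):

```conf
# slurm.conf (fragment)
MinJobAge=300                           # keep ended-job records for 300 s (default)
StateSaveLocation=/var/spool/slurmctld  # illustrative path; mounted from the PVC,
                                        # so job_state survives container restarts
```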

Steps to Reproduce

  1. Install slurm-operator with a `StateSaveLocation` on a PVC
  2. Create a NodeSet (e.g., `test-partition`, `replicas=1`)
  3. Wait for the partition to appear in `sinfo`
  4. Submit a job: `sbatch --partition=test-partition --wrap="sleep 300"`
  5. Delete the NodeSet CRD directly (no `scancel`, no scale-down)
  6. Observe slurmctld crashing with SIGSEGV on the next reconfigure

Variant (MinJobAge race): Even if all jobs on the partition have already ended, deleting the NodeSet before MinJobAge expires causes the same crash.

Reproduction Log

Environment: slurm-operator v1.0.1, Slurm 25.11, Kubernetes v1.31.7

Timeline:

| Time (UTC) | Event |
| --- | --- |
| 06:13:26 | `sbatch --partition=test-partition` → job submitted (RUNNING) |
| 06:13:29 | `kubectl delete nodeset test-partition` |
| 06:14:03 | reconfigure.sh detects hash change → `scontrol reconfigure` |
| 06:14:08 | SIGSEGV: slurmctld child crashes |
| 06:15:03 | Container killed by Kubernetes (SIGTERM) |
| 06:15:04 | Restart #1: reads stale job_state on recovery |
| 06:15:08 | `error: Invalid partition (test-partition) for JobId=69` |
| 06:15–06:17 | Crash loop: 6 restarts in total |
| 06:17:35 | Recovery: NodeSet re-created with `replicas: 0` → partition restored → slurmctld stabilizes |
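The recovery step at 06:17:35 amounts to re-creating the NodeSet with zero replicas, so the partition definition returns to slurm.conf without scheduling any pods. Roughly the following manifest; the apiVersion and field names are assumptions here, so check them against the CRD installed by your slurm-operator version:

```yaml
# Hypothetical manifest sketch; verify apiVersion/kind/fields against
# the NodeSet CRD shipped with your slurm-operator release.
apiVersion: slinky.slurm.net/v1alpha1   # assumption
kind: NodeSet
metadata:
  name: test-partition
spec:
  replicas: 0   # restore the partition in slurm.conf without running any pods
```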

slurm-controller-0 supervisor log:

```
# Reconfigure triggered after NodeSet deletion
[2026-03-10 06:14:03+00:00] fakesystemd.sh: received PID=26947
2026-03-10 06:14:03,998 INFO reaped unknown pid 26804 (exit status 0)
2026-03-10 06:14:03,998 INFO reaped unknown pid 26838 (exit status 0)

# SIGSEGV: child process crashes during state recovery
2026-03-10 06:14:08,002 INFO reaped unknown pid 26947 (terminated by SIGSEGV (core dumped))
2026-03-10 06:14:08,002 INFO reaped unknown pid 26981 (exit status 0)

# Kubernetes kills the destabilized container
2026-03-10 06:15:03,060 WARN received SIGTERM indicating exit request
2026-03-10 06:15:04,062 WARN stopped: fakesystemd (terminated by SIGTERM)
```

slurmctld log on restart (reading stale job_state):

```
[2026-03-10T06:15:08] error: Invalid partition (test-partition) for JobId=69
[2026-03-10T06:15:08] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 4 partitions
[2026-03-10T06:15:08] Running as primary controller
```

Environment

  • slurm-operator: v1.0.1 (image: ghcr.io/slinkyproject/slurm-operator:1.0.1)
  • Slurm: 25.11 (image: ghcr.io/slinkyproject/slurmctld:25.11-ubuntu24.04)
  • Kubernetes: v1.31.7
  • StateSaveLocation on persistent PVC
  • MinJobAge = 300 (default)

Expected Behavior

Deleting a NodeSet should not cause slurmctld to crash, regardless of whether jobs referencing its partition exist or have recently completed.

One possible approach would be to ensure that partition/node definitions are not removed from slurm.conf while job_state still holds records referencing them — for example, via a finalizer that defers deletion until stale job records have been purged.
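The guard condition behind such a finalizer can be sketched as a small, testable helper. This is hypothetical code, not operator code: it reads job records from stdin in the one-line-per-job format of `scontrol show jobs -o`, so the logic can be exercised without a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical pre-deletion guard: a finalizer could keep the partition in
# slurm.conf until no job record (running OR ended-but-within-MinJobAge)
# references it. Job records arrive on stdin so this is testable offline;
# a real check would pipe in `scontrol show jobs -o`.

partition_in_use() {
  local partition="$1"
  # "Partition=<name>" must match as a whole field (followed by a space or
  # end of line); ended jobs still appear here until MinJobAge purges them.
  grep -qE "Partition=${partition}( |\$)"
}
```

A finalizer built on this would defer NodeSet deletion (and hence the slurm.conf regeneration) while `scontrol show jobs -o | partition_in_use test-partition` still succeeds.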
