Description
Deleting a NodeSet CRD while Slurm jobs still reference its partition/node causes slurmctld to crash with SIGSEGV on the next reconfigure, entering an infinite CrashLoopBackOff.
The crash loop is self-perpetuating because job_state persists on PVC. Every restart causes slurmctld to attempt state recovery from the same corrupted job_state, hit the same null pointer dereference, and crash again.
Root Cause
NodeSet deleted
→ Controller reconciler regenerates slurm.conf (partition/node removed)
→ ConfigMap updated → /etc/slurm hash changes
→ reconfigure.sh sidecar detects hash change
→ scontrol reconfigure
→ slurmctld forks child to reload config + recover job state
→ child reads job_state: job references partition no longer in config
→ NULL pointer dereference → SIGSEGV (child)
→ restart → reads same job_state → crash again ...
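The chain above can be simulated in a few lines. This is a hedged illustration, not slurmctld code: the names (`recover_job_state`, the dict-based `job_state`) are invented, and Python's `TypeError` on a `None` lookup stands in for the NULL pointer dereference in C.

```python
# Hypothetical simulation of the crash loop: job_state persists across
# restarts (PVC), so every recovery attempt fails on the same missing
# partition. Names here are illustrative, not slurmctld internals.

job_state = [{"job_id": 69, "partition": "test-partition"}]  # persisted on PVC

def recover_job_state(config_partitions):
    """Mimics state recovery after `scontrol reconfigure` reloads slurm.conf."""
    for job in job_state:
        part = config_partitions.get(job["partition"])  # None once removed
        # Unchecked dereference of the partition record, analogous to the
        # NULL pointer dereference that raises SIGSEGV in C:
        return part["nodes"]  # raises TypeError when part is None

partitions = {"debug": {"nodes": "node[0-3]"}}  # test-partition already deleted

crashes = 0
for restart in range(3):          # each container restart...
    try:
        recover_job_state(partitions)
    except TypeError:
        crashes += 1              # ...hits the identical failure
print(crashes)  # 3 -- the loop is self-perpetuating
```

Because the corrupted `job_state` is re-read unchanged on every restart, nothing inside the container can break the cycle; only changing the config (or the state) does.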
Note that this is not limited to running jobs. Slurm retains job records in job_state even after jobs have ended, for up to MinJobAge seconds (default 300s) for accounting sync and dependency resolution. Deleting the NodeSet while any job record still references its partition triggers the SIGSEGV.
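The MinJobAge window implies there is an earliest time at which removing the partition becomes safe. A minimal sketch, assuming the default `MinJobAge = 300` (the helper name and record shape are hypothetical, not slurm-operator APIs):

```python
from datetime import datetime, timedelta

MIN_JOB_AGE = timedelta(seconds=300)  # Slurm default MinJobAge

def earliest_safe_deletion(job_end_times):
    """Job records linger in job_state for MinJobAge after completion,
    so the partition is only safe to remove once the newest record for it
    has aged out. Illustrative helper, not part of slurm-operator."""
    if not job_end_times:
        return None  # no records referencing the partition -> safe now
    return max(job_end_times) + MIN_JOB_AGE

# Two completed jobs that referenced the partition:
ends = [datetime(2026, 3, 10, 6, 13, 0), datetime(2026, 3, 10, 6, 14, 30)]
print(earliest_safe_deletion(ends))  # 2026-03-10 06:19:30
```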
Steps to Reproduce
- Install slurm-operator with `StateSaveLocation` on a PVC
- Create a NodeSet (e.g., `test-partition`, `replicas=1`)
- Wait for the partition to appear in `sinfo`
- Submit a job: `sbatch --partition=test-partition --wrap="sleep 300"`
- Delete the NodeSet CRD directly (no `scancel`, no scale-down)
- Observe slurmctld crashing with SIGSEGV on the next reconfigure
Variant (MinJobAge race): Even if all jobs on the partition have already ended, deleting the NodeSet before MinJobAge expires causes the same crash.
Reproduction Log
Environment: slurm-operator v1.0.1, Slurm 25.11, Kubernetes v1.31.7
Timeline:
| Time (UTC) | Event |
|---|---|
| 06:13:26 | sbatch --partition=test-partition → job submitted (RUNNING) |
| 06:13:29 | kubectl delete nodeset test-partition |
| 06:14:03 | reconfigure.sh detects hash change → scontrol reconfigure |
| 06:14:08 | SIGSEGV — slurmctld child crashes |
| 06:15:03 | Container killed by Kubernetes (SIGTERM) |
| 06:15:04 | Restart #1 — on recovery reads stale job_state |
| 06:15:08 | error: Invalid partition (test-partition) for JobId=69 |
| 06:15~06:17 | Crash loop: 6 total restarts |
| 06:17:35 | Recovery: NodeSet re-created with replicas: 0 → partition restored → slurmctld stabilizes |
slurm-controller-0 supervisor log:
```
# Reconfigure triggered after NodeSet deletion
[2026-03-10 06:14:03+00:00] fakesystemd.sh: received PID=26947
2026-03-10 06:14:03,998 INFO reaped unknown pid 26804 (exit status 0)
2026-03-10 06:14:03,998 INFO reaped unknown pid 26838 (exit status 0)
# SIGSEGV — child process crashes during state recovery
2026-03-10 06:14:08,002 INFO reaped unknown pid 26947 (terminated by SIGSEGV (core dumped))
2026-03-10 06:14:08,002 INFO reaped unknown pid 26981 (exit status 0)
# Kubernetes kills the destabilized container
2026-03-10 06:15:03,060 WARN received SIGTERM indicating exit request
2026-03-10 06:15:04,062 WARN stopped: fakesystemd (terminated by SIGTERM)
```
slurmctld log on restart (reading stale job_state):
```
[2026-03-10T06:15:08] error: Invalid partition (test-partition) for JobId=69
[2026-03-10T06:15:08] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 4 partitions
[2026-03-10T06:15:08] Running as primary controller
```
Environment
- slurm-operator: `v1.0.1` (image: `ghcr.io/slinkyproject/slurm-operator:1.0.1`)
- Slurm: `25.11` (image: `ghcr.io/slinkyproject/slurmctld:25.11-ubuntu24.04`)
- Kubernetes: `v1.31.7`
- `StateSaveLocation` on persistent PVC
- `MinJobAge = 300` (default)
Expected Behavior
Deleting a NodeSet should not cause slurmctld to crash, regardless of whether jobs referencing its partition exist or have recently completed.
One possible approach would be to ensure that partition/node definitions are not removed from slurm.conf while job_state still holds records referencing them — for example, via a finalizer that defers deletion until stale job records have been purged.
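The finalizer idea above reduces to a simple gate: deletion may proceed only when `job_state` holds no record (running or within the MinJobAge window) referencing the NodeSet's partition. A hedged sketch of that check — the function name, record shape, and the way the reconciler would call it are all assumptions, not slurm-operator code:

```python
# Hypothetical finalizer gate: defer removing the partition from slurm.conf
# until no job record still references it. In the real operator this would
# run inside the reconcile loop before the finalizer is removed.

def can_finalize_nodeset(partition, job_records):
    """True only when job_state holds no record for the partition,
    regardless of job state (RUNNING, COMPLETED-but-within-MinJobAge, ...)."""
    blocking = [j for j in job_records if j["partition"] == partition]
    return len(blocking) == 0

records = [
    {"job_id": 69, "partition": "test-partition", "state": "COMPLETED"},
    {"job_id": 70, "partition": "debug", "state": "RUNNING"},
]

# Job 69 has ended but is still within MinJobAge -> deletion is deferred:
print(can_finalize_nodeset("test-partition", records))  # False
# Once the record ages out of job_state, deletion can proceed:
aged_out = [r for r in records if r["partition"] != "test-partition"]
print(can_finalize_nodeset("test-partition", aged_out))  # True
```

When the gate returns false, the reconciler would requeue and retry after the remaining MinJobAge interval instead of regenerating slurm.conf immediately.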
Additional Context
- The scale-in path (`processCondemned`) already has drain-before-delete logic. The NodeSet deletion path could potentially benefit from equivalent safety guarantees.
- This is distinct from PR #79 (Prevent job termination on Slurm node lookup failures; fail-closed for node drain lookup) and PR #134 (Document NodeSet drain design and ops guide; drain design docs): neither addresses the `job_state` race on full NodeSet deletion.