feat: implement node locking for NodeSet worker pods by giuliocalzo · Pull Request #130 · SlinkyProject/slurm-operator

giuliocalzo · 2026-02-24T13:35:37Z

Summary

Add `lockNodes` and `lockNodeLifetime` fields to `NodeSetSpec` to pin worker pods to their assigned Kubernetes nodes. When `lockNodes` is enabled, the controller records each pod-to-node mapping in `NodeSetStatus` and, when a pod is recreated, injects a `requiredDuringSchedulingIgnoredDuringExecution` node affinity so that each worker always returns to the same physical node.
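As a rough illustration of the affinity described above, the sketch below builds the pod-spec `affinity` fragment for a single recorded node using only the Go standard library. The helper name `nodeAffinityFor` is hypothetical, and matching on the `kubernetes.io/hostname` label is an assumption; the controller in this PR may select the node differently (for example through typed `corev1` structs or a field selector).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeAffinityFor returns the JSON-shaped "affinity" fragment of a pod spec
// that restricts scheduling to one node.
// Assumption: the node is matched through its kubernetes.io/hostname label.
func nodeAffinityFor(nodeName string) map[string]any {
	return map[string]any{
		"nodeAffinity": map[string]any{
			"requiredDuringSchedulingIgnoredDuringExecution": map[string]any{
				"nodeSelectorTerms": []any{
					map[string]any{
						"matchExpressions": []any{
							map[string]any{
								"key":      "kubernetes.io/hostname",
								"operator": "In",
								"values":   []string{nodeName},
							},
						},
					},
				},
			},
		},
	}
}

func main() {
	// Print the fragment for the node recorded for ordinal "0" in the
	// status example from the testing notes.
	out, err := json.MarshalIndent(nodeAffinityFor("node-gpu-1"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because the rule is a required (not preferred) affinity, a recreated pod stays Pending if its recorded node is unavailable, which is the trade-off node locking accepts.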

The `lockNodeLifetime` field controls how long the lock persists. A value of 0 makes the lock permanent; a positive value (in seconds) makes the lock expire that long after the pod stops running, after which the pod can reschedule freely. Running pods continuously refresh their assignment timestamp, so the countdown begins only once the pod is no longer active on the node.

Breaking Changes

None.

Testing Notes

Tested locally with kind. Example status:

status:
  nodeAssignments:
    "0":
      node: node-gpu-1
      at: 1740384000
    "1":
      node: node-gpu-2
      at: 1740384000

Additional Context

Document the lockNodes and lockNodeLifetime features in the workload isolation guide, nodeset controller concept page, and Helm chart README.

Use ordinal index as map key instead of full pod name, Unix epoch int64 instead of RFC 3339 timestamp, and shorter JSON field names (node, at) to reduce per-entry size from ~90 bytes to ~42 bytes (~53% reduction).

giuliocalzo · 2026-03-03T08:26:13Z

Good morning @vivian-hafener, I rebased the branch and adjusted it to pass the latest pre-commit checks. Feel free to review it.

giuliocalzo added 4 commits February 24, 2026 09:29

docs: add node locking documentation

b535fff

perf: optimize nodeAssignments status object size

bc24029

docs: fix stale field names in node locking documentation

c424fe2

vivian-hafener self-assigned this Mar 2, 2026

vivian-hafener self-requested a review March 2, 2026 22:44

giuliocalzo and others added 3 commits March 3, 2026 09:15

Merge branch 'main' into sync

bb556e1

doc: adjust docs

3d3afab

fix: restore helm unittest

9f1c4e3

fixx deepcopy

09c79ef

vivian-hafener removed their request for review March 3, 2026 15:13

vivian-hafener removed their assignment Mar 3, 2026

giuliocalzo wants to merge 8 commits into SlinkyProject:main from giuliocalzo:sync
