Description
When using Slurm on Kubernetes (Slinky) v1.0.0 and submitting jobs via Slurm REST API (slurmrestd), jobs are accepted successfully (HTTP 200) but remain in PENDING / JobHeldAdmin state and never start execution.
The job is created with:
UserId=nobody
and Slurm reports:
SystemComment=slurm_cred_create failure, holding job.
In the same environment, submitting jobs via the login module (CLI: sbatch/srun) works correctly, and jobs execute as expected.
This suggests that the REST API submission path cannot correctly map user identity, causing jobs to fall back to nobody, which then fails during credential creation.
Steps to Reproduce
Environment
- Slurm on Kubernetes (Slinky): v1.0.0
  - Helm chart slurm: 1.0.0
  - Helm chart slurm-operator: 1.0.0
- Slurm version (all images): 25.11.1-ubuntu24.04
- Login and compute modules are integrated with Active Directory (AD) via SSSD
- Jobs submitted via CLI on login nodes work correctly
Reproduction Steps
- Generate a valid Slurm REST API token.
- Submit a batch job via slurmrestd using the OpenAPI endpoint.
Example submission flow:
#!/bin/bash
SLURM_TOKEN=xxxxxx
JOB_SCRIPT=$1
SCRIPT_JSON=$(jq -Rs . "$JOB_SCRIPT")
REQ_BODY=$(jq -n \
  --arg name "llama2_7b-FSDP" \
  --argjson script "$SCRIPT_JSON" \
  --arg cwd "/mnt/home/user1" \
  '{
    job: {
      name: $name,
      script: $script,
      current_working_directory: $cwd,
      environment: ["test"]
    }
  }'
)
# Submit request (curl omitted for brevity)
- REST API returns HTTP 200 with a valid job ID.
Example response (excerpt):
{
"job_id": 44,
"errors": [],
"warnings": []
}
- Query job status via squeue or scontrol.
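The submission script above leaves out the request itself. A minimal sketch of that step follows, under stated assumptions: `SLURMRESTD_URL`, its default host and port, and the `API_VERSION` value are placeholders that must match the actual deployment, and `submit_url` / `submit_job` are hypothetical helper names. `X-SLURM-USER-TOKEN` is the header slurmrestd documents for JWT authentication.

```shell
#!/bin/bash
# Sketch only: endpoint URL and API version are assumptions, adjust to your cluster.
SLURMRESTD_URL="${SLURMRESTD_URL:-http://localhost:6820}"
API_VERSION="${API_VERSION:-v0.0.41}"

# Build the job submission URL, tolerating a trailing slash on the base URL.
submit_url() {
    printf '%s/slurm/%s/job/submit\n' "${SLURMRESTD_URL%/}" "$API_VERSION"
}

# POST a prepared JSON body ($1), authenticating with the token in $SLURM_TOKEN.
submit_job() {
    curl -sS -X POST "$(submit_url)" \
        -H "X-SLURM-USER-TOKEN: ${SLURM_TOKEN}" \
        -H "Content-Type: application/json" \
        -d "$1"
}
```

With the variables from the script above, the call would be `submit_job "$REQ_BODY"`.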
Observed Behavior
The job remains in a pending state and never starts:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
44 all llama2_7 nobody PD 0:00 1 (JobHeldAdmin)
Detailed job information:
$ scontrol show job 44
UserId=nobody(99) GroupId=nobody(99)
JobState=PENDING Reason=JobHeldAdmin
SystemComment=slurm_cred_create failure, holding job.
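To check for this condition from a script, the `scontrol show job` output can be scanned for the two fields shown above. The following is a rough sketch; `job_field` and `held_as_nobody` are hypothetical helper names, and only space-free values such as `UserId` and `Reason` are handled.

```shell
#!/bin/bash
# Print the value of field $1 from `scontrol show job` style KEY=VALUE text on stdin.
job_field() {
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            # Match tokens that start with "KEY=" and print what follows.
            if (index($i, key "=") == 1) { print substr($i, length(key) + 2); exit }
        }
    }'
}

# Succeed if the job on stdin belongs to nobody and is held by the administrator.
held_as_nobody() {
    local info user reason
    info="$(cat)"
    user="$(printf '%s\n' "$info" | job_field UserId)"
    reason="$(printf '%s\n' "$info" | job_field Reason)"
    # UserId looks like "name(uid)"; strip the "(uid)" suffix before comparing.
    [ "${user%%(*}" = "nobody" ] && [ "$reason" = "JobHeldAdmin" ]
}
```

For example, `scontrol show job 44 | held_as_nobody && echo "job 44 is held as nobody"`.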
Expected Behavior
Jobs submitted via Slurm REST API should:
- Be associated with a valid, mappable user identity (consistent with login/CLI submissions), or
- Provide a supported mechanism for identity mapping (e.g., JWT user claims, SSSD integration, or equivalent)
so that Slurm can successfully create credentials and execute the job.
The REST API submission path should behave consistently with the login module in multi-user / AD-integrated environments.
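When debugging which identity a token carries, it can help to decode the JWT payload locally. This is a sketch, assuming the token is a standard three-part JWT and that the user name appears in the `sun` claim described in Slurm's JWT documentation; `jwt_payload` is a hypothetical helper, and it does not verify the signature.

```shell
#!/bin/bash
# Print the decoded JSON payload of a JWT passed as $1 (no signature check).
jwt_payload() {
    local payload="${1#*.}"     # drop the header segment
    payload="${payload%%.*}"    # drop the signature segment
    # Convert base64url to standard base64.
    payload="$(printf '%s' "$payload" | tr '_-' '/+')"
    # Restore the padding that base64url encoding strips.
    while [ $(( ${#payload} % 4 )) -ne 0 ]; do
        payload="${payload}="
    done
    printf '%s' "$payload" | base64 -d
}
```

Running `jwt_payload "$SLURM_TOKEN"` prints the claims, so the user name in the token can be compared with the `UserId` that Slurm reports for the job.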
Additional Context
Identity Mapping Concern
- In the current Slinky design, the REST API Pod is forced to run as nobody.
- As a result, jobs submitted through slurmrestd are created with UserId=nobody.
- The nobody user cannot be credentialed by Slurm, leading to:
  slurm_cred_create failure
slurm-operator Implementation Detail
In slurm-operator-1.0.0, the REST API Pod security context is hard-coded:
const (
	slurmrestdUser    = "nobody"
	slurmrestdUserUid = int64(65534)
	slurmrestdUserGid = slurmrestdUserUid
)
and enforced via:
SecurityContext: &corev1.PodSecurityContext{
	RunAsNonRoot: true,
	RunAsUser:    slurmrestdUserUid,
	RunAsGroup:   slurmrestdUserGid,
	FSGroup:      slurmrestdUserGid,
}
This design prevents slurmrestd from participating in the same identity mapping model used by the login module (e.g., via SSSD / AD).
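The effective setting can be confirmed on a running cluster by reading the pod-level security context with `kubectl`. A small sketch follows; `pod_run_as_user` is a hypothetical helper, and the pod name and namespace passed to it are assumptions that depend on the Helm release.

```shell
#!/bin/bash
# Print the UID that pod $1 in namespace $2 runs as, from its pod securityContext.
pod_run_as_user() {
    kubectl get pod "$1" -n "$2" \
        -o jsonpath='{.spec.securityContext.runAsUser}'
}
```

For a REST API pod created by slurm-operator 1.0.0, this would be expected to print 65534, matching the constant above.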
Open Questions
- Is it an intentional design decision that slurmrestd cannot support user identity mapping?
- Is there a recommended or planned approach for:
  - Multi-user REST API submissions?
  - AD/SSSD-integrated environments?
- Are there plans to align the REST API module’s identity model with the login module (without SSH access, but with proper user semantics)?