[Bug]: slurmrestd running as nobody prevents identity mapping (SSSD/sackd), causing jobs to be held #118

@cchangxi

Description

When using Slurm on Kubernetes (Slinky) v1.0.0 and submitting jobs via the Slurm REST API (slurmrestd), jobs are accepted successfully (HTTP 200) but remain in the PENDING state with reason JobHeldAdmin and never start.

The job is created with:

UserId=nobody

and Slurm reports:

SystemComment=slurm_cred_create failure, holding job.

In the same environment, submitting jobs via the login module (CLI: sbatch/srun) works correctly, and jobs execute as expected.

This suggests that the REST API submission path cannot correctly map user identity, causing jobs to fall back to nobody, which then fails during credential creation.


Steps to Reproduce

Environment

  • Slurm on Kubernetes (Slinky): v1.0.0

    • Helm chart slurm: 1.0.0
    • Helm chart slurm-operator: 1.0.0
  • Slurm version (all images): 25.11.1-ubuntu24.04

  • Login and compute modules are integrated with Active Directory (AD) via SSSD

  • Jobs submitted via CLI on login nodes work correctly


Reproduction Steps

  1. Generate a valid Slurm REST API token.
  2. Submit a batch job via slurmrestd using the OpenAPI endpoint.

Example submission flow:

#!/bin/bash

SLURM_TOKEN=xxxxxx
JOB_SCRIPT=$1

SCRIPT_JSON=$(jq -Rs . "$JOB_SCRIPT")

REQ_BODY=$(jq -n \
  --arg name "llama2_7b-FSDP" \
  --argjson script "$SCRIPT_JSON" \
  --arg cwd "/mnt/home/user1" \
  '{
    job: {
      name: $name,
      script: $script,
      current_working_directory: $cwd,
      environment: ["test"]
    }
  }'
)

# Submit request (curl omitted for brevity)

  3. The REST API returns HTTP 200 with a valid job ID.

Example response (excerpt):

{
  "job_id": 44,
  "errors": [],
  "warnings": []
}

  4. Query job status via squeue or scontrol.
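
For reference, the request that was omitted from the submission script could look roughly like the sketch below. The service address, the API version in the path, and the user name are assumptions and are not taken from this environment; `X-SLURM-USER-NAME` and `X-SLURM-USER-TOKEN` are the headers slurmrestd reads for JWT authentication. The `CURL` variable is a hypothetical hook that allows a dry run without a server.

```shell
#!/bin/bash
# Sketch only: values marked "assumed" do not come from the report.
SLURMRESTD_URL="${SLURMRESTD_URL:-http://localhost:6820}"  # assumed slurmrestd address
API_VERSION="${API_VERSION:-v0.0.42}"                      # assumed; must be a version the server exposes
SLURM_USER="${SLURM_USER:-user1}"                          # assumed: the user the token was issued for
SLURM_TOKEN="${SLURM_TOKEN:-}"                             # token generated in step 1
CURL="${CURL:-curl}"                                       # set CURL=echo to print the arguments instead

submit_job() {
  # $1: JSON request body (REQ_BODY in the submission script)
  "$CURL" -s -X POST "${SLURMRESTD_URL}/slurm/${API_VERSION}/job/submit" \
    -H "X-SLURM-USER-NAME: ${SLURM_USER}" \
    -H "X-SLURM-USER-TOKEN: ${SLURM_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$1"
}

# Usage: submit_job "$REQ_BODY"
```

Running it with `CURL=echo` shows which user name and token would be sent, which may help rule out the request itself as the source of `UserId=nobody`.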

Observed Behavior

The job remains in a pending state and never starts:

$ squeue
JOBID PARTITION NAME        USER   ST  TIME NODES NODELIST(REASON)
44    all       llama2_7    nobody PD  0:00 1     (JobHeldAdmin)

Detailed job information:

$ scontrol show job 44
UserId=nobody(99) GroupId=nobody(99)
JobState=PENDING Reason=JobHeldAdmin
SystemComment=slurm_cred_create failure, holding job.
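
To spot affected jobs quickly, a small helper along these lines can test `scontrol show job` output for the two markers shown above. The function name `is_cred_hold` is made up for illustration; it only matches the strings from this report.

```shell
#!/bin/bash
# Returns 0 when the job text on stdin shows an admin hold caused by a
# slurm_cred_create failure, non-zero otherwise.
is_cred_hold() {
  local text
  text=$(cat)
  grep -q 'Reason=JobHeldAdmin' <<<"$text" &&
    grep -q 'slurm_cred_create failure' <<<"$text"
}

# Usage: scontrol show job 44 | is_cred_hold && echo "job 44 is held after a credential failure"
```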

Expected Behavior

Jobs submitted via Slurm REST API should:

  • Be associated with a valid, mappable user identity (consistent with login/CLI submissions), or
  • Provide a supported mechanism for identity mapping (e.g., JWT user claims, SSSD integration, or equivalent)

so that Slurm can successfully create credentials and execute the job.

The REST API submission path should behave consistently with the login module in multi-user / AD-integrated environments.


Additional Context

Identity Mapping Concern

  • In the current Slinky design, the REST API Pod is forced to run as nobody.
  • As a result, jobs submitted through slurmrestd are created with UserId=nobody.
  • The nobody user cannot be credentialed by Slurm, leading to:
slurm_cred_create failure

slurm-operator Implementation Detail

In slurm-operator-1.0.0, the REST API Pod security context is hard-coded:

const (
    slurmrestdUser    = "nobody"
    slurmrestdUserUid = int64(65534)
    slurmrestdUserGid = slurmrestdUserUid
)

and enforced via:

SecurityContext: &corev1.PodSecurityContext{
    RunAsNonRoot: true,
    RunAsUser:    slurmrestdUserUid,
    RunAsGroup:   slurmrestdUserGid,
    FSGroup:      slurmrestdUserGid,
}

This design prevents slurmrestd from participating in the same identity mapping model used by the login module (e.g., via SSSD / AD).
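
A possible diagnostic, assuming `getent` is available in the container images: check whether a numeric UID resolves to a passwd entry inside each pod. The helper name `resolve_uid` is hypothetical, and which pod's lookup actually matters for credential creation is not established here.

```shell
#!/bin/bash
# Prints "<name>:<uid>:<gid>" for a numeric UID, or returns non-zero
# when the container's NSS configuration has no entry for it.
resolve_uid() {
  local entry
  entry=$(getent passwd "$1") || return 1
  awk -F: '{ print $1 ":" $3 ":" $4 }' <<<"$entry"
}

# Usage (inside a pod): resolve_uid 65534 || echo "UID 65534 does not resolve here"
```

Comparing the result for 65534 (the UID the operator assigns) and 99 (the UID shown by scontrol) across the REST API, controller, and compute pods would show whether they agree on who nobody is.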


Open Questions

  1. Is it an intentional design decision that slurmrestd cannot support user identity mapping?

  2. Is there a recommended or planned approach for:

    • Multi-user REST API submissions?
    • AD/SSSD-integrated environments?
  3. Are there plans to align the REST API module’s identity model with the login module (without SSH access, but with proper user semantics)?
