Description
Hi,
I am trying to run the RLVR code on a single node using the provided grpo_fast_8b_single_node.sh script, but the job either hangs indefinitely or crashes with a Ray worker error when launched under Slurm.
Below is my Slurm-side Ray setup (single node, 8 GPUs):
```bash
# $nodes holds the hostnames of the Slurm allocation (single node here)
nodes_array=($nodes)
head_node=${nodes_array[0]}
NUM_GPUS=8

unset RAY_ADDRESS RAY_HEAD_NODE_ADDRESS RAY_NAMESPACE
export RAY_TEMP_DIR=/net/${SLURM_JOB_ID}
export MASTER_PORT=$(shuf -i 29500-29999 -n 1)
export RAY_NODE_PORT=$MASTER_PORT
export ip_head=$head_node:$MASTER_PORT
export RAY_ADDRESS=$head_node:$RAY_NODE_PORT
echo "IP Head: $ip_head"

ray stop --force
ray start --head \
  --node-ip-address=$head_node \
  --port=$RAY_NODE_PORT \
  --dashboard-host=0.0.0.0 \
  --num-gpus=$NUM_GPUS \
  --temp-dir=$RAY_TEMP_DIR
sleep 5
ray status
```
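For completeness, a quick sanity check like the sketch below (only standard Slurm variables and nvidia-smi, nothing open-instruct-specific) can confirm that Ray registered the CPUs/GPUs that Slurm actually allocated, since whatever placement bundle grpo_fast.py later requests has to fit inside those totals:

```bash
# Sanity check (assumes standard Slurm env vars and a single 8-GPU node):
# the CPU/GPU totals printed by `ray status` above must cover the placement
# bundle requested by grpo_fast.py, or scheduling will hang.
echo "Slurm allocation: ${SLURM_CPUS_ON_NODE:-unset} CPUs, ${SLURM_GPUS_ON_NODE:-unset} GPUs on $(hostname)"
echo "GPUs visible to this job: $(nvidia-smi -L | wc -l)"
```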
Issue 1: Job hangs waiting for placement group
When launching grpo_fast.py with the following configuration:
```bash
--num_learners_per_node 6
--vllm_num_engines 2
--vllm_tensor_parallel_size 1
```
The logs report a placement bundle of:
[{'GPU': 6, 'CPU': 60}]
and the job then hangs indefinitely with:
Waiting for placement group to be scheduled
This happens despite the node having 8 GPUs available.
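To make the hang reproducible outside the training script, the standalone sketch below (my own, not from open-instruct; the bundle shape is copied from the log above) tries to reserve the same bundle against the running Ray head and prints the cluster's resources if it cannot be placed. Since the bundle asks for 60 CPUs as well as 6 GPUs, the CPU count Ray registered may matter as much as the GPU count:

```bash
# Repro sketch (assumes the Ray head started above is still running).
# If the bundle can never be placed, pg.ready() times out and the printed
# available resources show what is missing.
python - <<'EOF'
import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init(address="auto")
print("Cluster resources :", ray.cluster_resources())
print("Available now     :", ray.available_resources())

pg = placement_group([{"GPU": 6, "CPU": 60}])  # same bundle as in the hanging log
try:
    ray.get(pg.ready(), timeout=60)
    print("Bundle schedulable on this node.")
except Exception as err:
    print("Bundle NOT schedulable within 60s:", type(err).__name__, err)
finally:
    remove_placement_group(pg)
EOF
```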
Issue 2: Worker crash with SYSTEM_ERROR
With a smaller configuration (e.g., num_learners_per_node=2, vllm_num_engines=1), the job progresses further but eventually crashes with:
A worker died or was killed while executing a task by an unexpected system error.
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2 (EOF).
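For reference, this is roughly how I dig for the underlying crash reason under the --temp-dir configured above (session_latest is Ray's standard session symlink; the grep keywords are just guesses at useful patterns):

```bash
# Log triage sketch after a SYSTEM_ERROR exit.
ls "$RAY_TEMP_DIR"/session_latest/logs/
grep -ril "error\|oom\|nccl" "$RAY_TEMP_DIR"/session_latest/logs/ | head
dmesg -T 2>/dev/null | grep -i "out of memory" | tail   # OOM kills often surface as EOF/SYSTEM_ERROR
```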
Questions:
a) Is there a recommended configuration for single-node RLVR runs (especially for num_learners_per_node, vllm_num_engines, and placement group sizing)?
b) Is the grpo_fast_8b_single_node.sh script expected to work as-is on Slurm, or are additional Ray/Slurm-specific adjustments required?
Thanks!