Single-node RLVR run hangs waiting for placement group / worker crash on Slurm #1431

@LalchandPandia

Hi,

I am trying to run the RLVR code on a single node using the provided grpo_fast_8b_single_node.sh script, but the job either hangs indefinitely or crashes with a Ray worker error when launched under Slurm.

Below is my Slurm-side Ray setup (single node, 8 GPUs):
nodes_array=($nodes)
head_node=${nodes_array[0]}
NUM_GPUS=8

unset RAY_ADDRESS RAY_HEAD_NODE_ADDRESS RAY_NAMESPACE
export RAY_TEMP_DIR=/net/${SLURM_JOB_ID}
export MASTER_PORT=$(shuf -i 29500-29999 -n 1)
export RAY_NODE_PORT=$MASTER_PORT
export ip_head=$head_node:$MASTER_PORT
export RAY_ADDRESS=$head_node:$RAY_NODE_PORT

echo "IP Head: $ip_head"

ray stop --force

ray start --head \
  --node-ip-address=$head_node \
  --port=$RAY_NODE_PORT \
  --dashboard-host=0.0.0.0 \
  --num-gpus=$NUM_GPUS \
  --temp-dir=$RAY_TEMP_DIR

sleep 5
ray status
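
A fixed `sleep 5` may not be long enough for the head node to come up on a busy cluster. One option is to poll until `ray status` succeeds instead. The sketch below is a generic retry helper, not part of the original script: `wait_until` is a hypothetical name, and the commented usage assumes `ray status` exits non-zero while the head is unreachable.

```bash
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or ATTEMPTS tries have been used.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
  local attempts=$1
  local delay=$2
  shift 2
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (in place of `sleep 5`; assumes a failing `ray status` means "not ready"):
#   wait_until 30 2 ray status || { echo "Ray head did not come up" >&2; exit 1; }
```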

Issue 1: Job hangs waiting for placement group
When launching grpo_fast.py with the following configuration:
--num_learners_per_node 6
--vllm_num_engines 2
--vllm_tensor_parallel_size 1
The logs report a placement bundle of:
[{'GPU': 6, 'CPU': 60}]
and then hang indefinitely with:
Waiting for placement group to be scheduled
This happens despite the node having 8 GPUs available.
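
The bundle in the log asks for 60 CPUs as well as 6 GPUs, so one possible cause (not confirmed here) is that Ray sees fewer than 60 CPUs inside the Slurm allocation, even though all 8 GPUs are visible. A minimal sketch for checking this before launch follows. `bundle_fits` is a hypothetical helper, and the commented usage assumes the standard Slurm variable `SLURM_CPUS_ON_NODE` reflects the CPUs available to the job.

```bash
# bundle_fits NEED_GPU NEED_CPU HAVE_GPU HAVE_CPU
# Returns 0 if the requested bundle fits within the given totals, 1 otherwise,
# and prints which resource (if any) falls short.
bundle_fits() {
  local need_gpu=$1 need_cpu=$2 have_gpu=$3 have_cpu=$4
  local short=0
  if [ "$need_gpu" -gt "$have_gpu" ]; then
    echo "GPU shortfall: need $need_gpu, have $have_gpu"
    short=1
  fi
  if [ "$need_cpu" -gt "$have_cpu" ]; then
    echo "CPU shortfall: need $need_cpu, have $have_cpu"
    short=1
  fi
  if [ "$short" -eq 0 ]; then
    echo "bundle fits"
  fi
  return "$short"
}

# Example for the bundle reported above ({'GPU': 6, 'CPU': 60}):
#   bundle_fits 6 60 "$NUM_GPUS" "$SLURM_CPUS_ON_NODE"
```

The totals Ray actually registered appear in the `ray status` output. If CPUs turn out to be the shortfall, requesting more CPUs from Slurm (for example with `--cpus-per-task`) or passing `--num-cpus` to `ray start` are two things to try; whether either resolves this hang is untested here.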
Issue 2: Worker crash with SYSTEM_ERROR

With a smaller configuration (e.g., num_learners_per_node=2, vllm_num_engines=1), the job progresses further but eventually crashes with:
A worker died or was killed while executing a task by an unexpected system error.
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2 (EOF).
Questions:

a) Is there a recommended configuration for single-node RLVR runs (especially for num_learners_per_node, vllm_num_engines, and placement group sizing)?

b) Is the grpo_fast_8b_single_node.sh script expected to work as-is on Slurm, or are additional Ray/Slurm-specific adjustments required?

Thanks!
