Description
Hi,
I am trying to run the RLVR code on a single node using the provided grpo_fast_8b_single_node.sh script, but the job either hangs indefinitely or crashes with a Ray worker error when launched under Slurm.
Below is my Slurm-side Ray setup (single node, 8 GPUs):
```bash
# $nodes holds the hostnames of the Slurm allocation (single node here)
nodes_array=($nodes)
head_node=${nodes_array[0]}
NUM_GPUS=8

unset RAY_ADDRESS RAY_HEAD_NODE_ADDRESS RAY_NAMESPACE
export RAY_TEMP_DIR=/net/${SLURM_JOB_ID}
export MASTER_PORT=$(shuf -i 29500-29999 -n 1)
export RAY_NODE_PORT=$MASTER_PORT
export ip_head=$head_node:$MASTER_PORT
export RAY_ADDRESS=$head_node:$RAY_NODE_PORT
echo "IP Head: $ip_head"

ray stop --force
ray start --head \
  --node-ip-address=$head_node \
  --port=$RAY_NODE_PORT \
  --dashboard-host=0.0.0.0 \
  --num-gpus=$NUM_GPUS \
  --temp-dir=$RAY_TEMP_DIR
sleep 5
ray status
```
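For completeness, a quick sanity check like the sketch below (only standard Slurm variables and nvidia-smi, nothing open-instruct-specific) can confirm that Ray registered the CPUs/GPUs that Slurm actually allocated, since whatever placement bundle grpo_fast.py later requests has to fit inside those totals:

```bash
# Sanity check (assumes standard Slurm env vars and a single 8-GPU node):
# the CPU/GPU totals printed by `ray status` above must cover the placement
# bundle requested by grpo_fast.py, or scheduling will hang.
echo "Slurm allocation: ${SLURM_CPUS_ON_NODE:-unset} CPUs, ${SLURM_GPUS_ON_NODE:-unset} GPUs on $(hostname)"
echo "GPUs visible to this job: $(nvidia-smi -L | wc -l)"
```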
Issue 1: Job hangs waiting for placement group
When launching grpo_fast.py with the following configuration:
```bash
--num_learners_per_node 6
--vllm_num_engines 2
--vllm_tensor_parallel_size 1
```
The logs report a placement bundle of:
[{'GPU': 6, 'CPU': 60}]
and the job then hangs indefinitely with:
Waiting for placement group to be scheduled
This happens despite the node having 8 GPUs available.
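To make the hang reproducible outside the training script, the standalone sketch below (my own, not from open-instruct; the bundle shape is copied from the log above) tries to reserve the same bundle against the running Ray head and prints the cluster's resources if it cannot be placed. Since the bundle asks for 60 CPUs as well as 6 GPUs, the CPU count Ray registered may matter as much as the GPU count:

```bash
# Repro sketch (assumes the Ray head started above is still running).
# If the bundle can never be placed, pg.ready() times out and the printed
# available resources show what is missing.
python - <<'EOF'
import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init(address="auto")
print("Cluster resources :", ray.cluster_resources())
print("Available now     :", ray.available_resources())

pg = placement_group([{"GPU": 6, "CPU": 60}])  # same bundle as in the hanging log
try:
    ray.get(pg.ready(), timeout=60)
    print("Bundle schedulable on this node.")
except Exception as err:
    print("Bundle NOT schedulable within 60s:", type(err).__name__, err)
finally:
    remove_placement_group(pg)
EOF
```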
Issue 2: Worker crash with SYSTEM_ERROR
With a smaller configuration (e.g., num_learners_per_node=2, vllm_num_engines=1), the job progresses further but eventually crashes with:
A worker died or was killed while executing a task by an unexpected system error.
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2 (EOF).
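For reference, this is roughly how I dig for the underlying crash reason under the --temp-dir configured above (session_latest is Ray's standard session symlink; the grep keywords are just guesses at useful patterns):

```bash
# Log triage sketch after a SYSTEM_ERROR exit.
ls "$RAY_TEMP_DIR"/session_latest/logs/
grep -ril "error\|oom\|nccl" "$RAY_TEMP_DIR"/session_latest/logs/ | head
dmesg -T 2>/dev/null | grep -i "out of memory" | tail   # OOM kills often surface as EOF/SYSTEM_ERROR
```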
Questions:
a) Is there a recommended configuration for single-node RLVR runs (especially for num_learners_per_node, vllm_num_engines, and placement group sizing)?
b) Is the grpo_fast_8b_single_node.sh script expected to work as-is on Slurm, or are additional Ray/Slurm-specific adjustments required?
Thanks!