Race Condition for vllm #133

@fyqqyf

Description


I asked Gemini 2.5 Pro and got this answer:

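OK, this is a typical error that occurs when running multiple vLLM or torch.compile worker processes in a parallel environment (for example, with Ray).

The core problem is a race condition: multiple processes try to create, write, and delete files in the same shared cache directory at the same moment, which causes conflicts.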
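One general way to avoid this kind of race is to serialize access to the shared directory with an advisory file lock. The sketch below illustrates that technique only; it is not how vLLM or torch.compile manage their caches, and the helper names `exclusive_dir_lock` and `build_once` are hypothetical. It assumes a POSIX system (`fcntl` is not available on Windows), and advisory locks may not behave reliably on every network filesystem.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_dir_lock(cache_dir):
    """Hold an exclusive advisory lock on cache_dir/.lock (hypothetical helper)."""
    os.makedirs(cache_dir, exist_ok=True)
    lock_path = os.path.join(cache_dir, ".lock")
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def build_once(cache_dir, name, build):
    """Create cache_dir/name from build() unless it already exists.

    The existence check and the write happen under the lock, so only one
    process runs build(); the others wait and then reuse the result.
    """
    target = os.path.join(cache_dir, name)
    with exclusive_dir_lock(cache_dir):
        if not os.path.exists(target):
            tmp = target + ".tmp"
            with open(tmp, "w") as f:
                f.write(build())
            # Atomic rename: readers never see a half-written file.
            os.replace(tmp, target)
    return target
```

Writing to a temporary name and renaming it keeps a concurrent reader from ever observing a partially written entry, which is the situation the FileNotFoundError and "resource busy" symptoms point to.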

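Error breakdown
Your error log shows a clear chain reaction:

Root cause (OSError and FileNotFoundError):

OSError: [Errno 16] Device or resource busy: this is the most important message. When torch.compile uses the Triton backend, it creates a cache for the compiled GPU kernels under ~/.cache/vllm/torch_compile_cache/... Your ActorWorker processes (such as actor_infer-3 and actor_infer-7) all try to compile the same model at startup, so they write to and modify this one shared cache directory at the same time. When one process has locked a file or directory, access from another process fails with "Device or resource busy".

FileNotFoundError: [Errno 2] No such file or directory: this is another symptom of the same race condition. One process may be trying to read a temporary file that another process has already deleted after finishing its own compile/cache step.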
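If the shared cache directory is indeed the problem, one possible mitigation is to give every worker its own cache location. The sketch below builds the environment variables for that. `VLLM_CACHE_ROOT`, `TORCHINDUCTOR_CACHE_DIR`, and `TRITON_CACHE_DIR` are, to my knowledge, the variables that vLLM, PyTorch Inductor, and Triton read for their cache paths. The helper names, the `/tmp/compile_cache` base directory, and the idea that the framework lets each worker set these before compilation starts are all assumptions, and this has not been verified to fix this issue.

```python
import os


def per_worker_cache_env(worker_id, base_dir="/tmp/compile_cache"):
    """Return cache-related env vars pointing at a directory unique to worker_id.

    base_dir and the directory layout are illustrative assumptions.
    """
    root = os.path.join(base_dir, f"worker_{worker_id}")
    return {
        "VLLM_CACHE_ROOT": os.path.join(root, "vllm"),
        "TORCHINDUCTOR_CACHE_DIR": os.path.join(root, "inductor"),
        "TRITON_CACHE_DIR": os.path.join(root, "triton"),
    }


def apply_cache_env(worker_id, base_dir="/tmp/compile_cache"):
    """Create the directories and export the variables in this process.

    This would have to run inside each worker before any compilation starts.
    """
    env = per_worker_cache_env(worker_id, base_dir)
    for path in env.values():
        os.makedirs(path, exist_ok=True)
    os.environ.update(env)
    return env
```

The trade-off is that every worker compiles and stores its own copy of the kernels, so startup does more work and uses more disk, in exchange for no cross-process contention on one directory.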

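Direct consequence (BackendCompilerFailed):

Because of the OSError above, compilation of the Triton kernels failed.

PyTorch's Dynamo caught this failure and raised a torch._dynamo.exc.BackendCompilerFailed exception. This means the backend (here, vllm) failed to carry out the compilation.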

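Final crash (ActorDiedError):

The unhandled BackendCompilerFailed exception crashed the ActorWorker process.

Ray, the distributed framework that manages these worker processes, detected that its child process (the Actor) had died unexpectedly and raised ray.exceptions.ActorDiedError to the main program, which stopped the whole script.

The OOM killer (out of memory) mentioned in the Ray log is a generic guess. Memory problems can also kill a process, but judging from your full error log, OSError: Device or resource busy is the more direct and more definite root cause.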
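The chain described above (a low-level OSError wrapped by a compiler failure, which in turn surfaces as a dead actor) can be illustrated with Python's exception chaining. This is a local sketch only: the two exception classes are stand-ins, not the real torch or Ray classes, and in an actual Ray run the worker's traceback may only be visible in the worker's own log rather than through `__cause__`.

```python
import errno


class BackendCompilerFailedStandIn(RuntimeError):
    """Stand-in for torch._dynamo.exc.BackendCompilerFailed (not the real class)."""


class ActorDiedStandIn(RuntimeError):
    """Stand-in for ray.exceptions.ActorDiedError (not the real class)."""


def exception_chain(exc):
    """Return [exc, its cause, the cause's cause, ...], outermost first."""
    chain, seen = [], set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        chain.append(exc)
        # Prefer the explicit cause ("raise ... from ..."), else the implicit context.
        exc = exc.__cause__ if exc.__cause__ is not None else exc.__context__
    return chain


def simulate_worker():
    """Raise the same three-level chain the issue describes, in one process."""
    try:
        try:
            raise OSError(errno.EBUSY, "Device or resource busy")
        except OSError as e:
            raise BackendCompilerFailedStandIn("backend compile failed") from e
    except BackendCompilerFailedStandIn as e:
        raise ActorDiedStandIn("actor died unexpectedly") from e


if __name__ == "__main__":
    try:
        simulate_worker()
    except ActorDiedStandIn as top:
        for level, exc in enumerate(exception_chain(top)):
            print(level, type(exc).__name__, exc)
```

Reading the chain from the last element backwards gives the order in which things went wrong, which is why the innermost OSError, not the outermost actor failure, is the one to investigate.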

My env:

8*H100

pip install -r requirements_torch260_vllm.txt completed successfully

My config:

defaults:
  - ../config/envs@_here_
  - ../config/deepspeed_zero@_here_
  - ../config/deepspeed_zero2@_here_
  - ../config/deepspeed_zero3@_here_
  - ../config/deepspeed_zero3_cpuoffload@_here_

hydra:
  run:
    dir: .
  output_subdir: null

exp_name: "agentic_pipeline"
seed: 42
logging_dir: ./output/logs
output_dir: ./output
render_save_dir: ./output/render
system_envs:
  USE_MODELSCOPE: '1'

#track_with: wandb
#tracker_kwargs:
#  api_key:
#  project: roll-agentic
#  name: ${exp_name}_sokoban
#  notes: "agentic_pipeline"
#  tags:
#    - agentic
#    - roll
#    - baseline

track_with: tensorboard
tracker_kwargs:
  log_dir: /data/home/zdhs0086/ROLL/log/qwen2.5-0.5B-Instruct/agentic_frozen_lake


checkpoint_config:
  type: file_system
  output_dir: /data/home/zdhs0086/ROLL/rl_examples/models/${exp_name}

num_gpus_per_node: 8

max_steps: 1024
save_steps: 10000
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false

rollout_batch_size: 1024
val_batch_size: 1024
sequence_length: 8192

advantage_clip: 0.2
ppo_epochs: 1
adv_estimator: "grpo"
#pg_clip: 0.1
#dual_clip_loss: True
init_kl_coef: 0.0
whiten_advantages: true
entropy_loss_coef: 0
max_grad_norm: 1.0

pretrain: /data/home/zdhs0086/zdhs0086_src_data/Qwen2.5-0.5B-Instruct
reward_pretrain: /data/home/zdhs0086/zdhs0086_src_data/Qwen2.5-0.5B-Instruct

actor_train:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 2
    gradient_accumulation_steps: 64
    warmup_steps: 10
    lr_scheduler_type: cosine
  data_args:
    template: qwen2_5
  strategy_args:
#    strategy_name: deepspeed_train
#    strategy_config: ${deepspeed_zero3}
    strategy_name: megatron_train
    strategy_config:
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
      use_distributed_optimizer: true
      recompute_granularity: full
  device_mapping: list(range(0,8))
  infer_batch_size: 2

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: 128 # single-turn response length
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: 1
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.8
      block_size: 16
      load_format: auto
  device_mapping: list(range(0,8))

reference:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: hf_infer
    strategy_config: ~
  device_mapping: list(range(0,8))
  infer_batch_size: 2


action_pattern: <answer>(.*?)</answer>
think_action_pattern: <think>(.*?)</think>\s*<answer>(.*?)</answer>
user_prompt_no_think_format: <answer> [your answer] </answer>
user_prompt_think_format: <think> [Your thoughts] </think> <answer> [your answer] </answer>
added_text_no_think: <answer>
added_text_think: <think>

max_tokens_per_step: 128
max_actions_per_traj: 10

reward_normalization:
  grouping: traj_group_id # one of tags (env_type) / traj_group_id (group) / batch (rollout_batch) ...; reward/adv are computed per group_by key
  method: mean_std # asym_clip / identity / mean_std

train_env_manager:
  format_penalty: -0.15 # sokoban env penalty_for_step=-0.1
  max_env_num_per_worker: 16
  num_env_groups: 128
  # under the same group, the env config and env seed are ensured to be equal
  group_size: 8
  tags: [FrozenLake]
  num_groups_partition: [128] # If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation

val_env_manager:
  max_env_num_per_worker: 32
  num_env_groups: 1024
  group_size: 1 # should be set to 1 because val temperature is set to 0 and same prompt leads to same output
  tags: [SimpleSokoban, LargerSokoban, SokobanDifferentGridVocab, FrozenLake]
  num_groups_partition: [256, 256, 256, 256] # TODO: If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation


custom_envs:
  SimpleSokoban:
    env_type: sokoban
    max_tokens_per_step: ${max_tokens_per_step}
    user_prompt_format: ${user_prompt_no_think_format}
    added_text: ${added_text_no_think}
    env_manager_cls: roll.pipeline.agentic.env_manager.traj_env_manager.TrajEnvManager
    use_thread_lock: true
    env_config: # keys should be a subset of SokobanConfig
      env_instruction: "You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>"
      action_pattern: ${action_pattern}
      max_steps: ${max_actions_per_traj}
      dim_x: 6
      dim_y: 6
      num_boxes: 1
  LargerSokoban:
    env_type: sokoban
    max_tokens_per_step: ${max_tokens_per_step}
    user_prompt_format: ${user_prompt_no_think_format}
    added_text: ${added_text_no_think}
    env_manager_cls: roll.pipeline.agentic.env_manager.traj_env_manager.TrajEnvManager
    use_thread_lock: true
    env_config:
      env_instruction: "You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>"
      action_pattern: ${action_pattern}
      max_steps: ${max_actions_per_traj}
      dim_x: 8
      dim_y: 8
      num_boxes: 2
      search_depth: 10
  SokobanDifferentGridVocab:
    env_type: sokoban
    max_tokens_per_step: ${max_tokens_per_step}
    user_prompt_format: ${user_prompt_no_think_format}
    added_text: ${added_text_no_think}
    env_manager_cls: roll.pipeline.agentic.env_manager.traj_env_manager.TrajEnvManager
    use_thread_lock: true
    env_config: # keys should be a subset of SokobanConfig
      env_instruction: "You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer must be one of action in a turn, format is <answer>Right</answer>"
      action_pattern: ${action_pattern}
      max_steps: ${max_actions_per_traj}
      search_depth: 30
      dim_x: 6
      dim_y: 6
      num_boxes: 1
      grid_lookup: { 0: "W", 1: ".", 2: "G", 3: "C", 4: "B", 5: "A", 6: "@" }
      grid_vocab: { "W": "wall", ".": "empty", "G": "target", "C": "box on target", "B": "box", "A": "player", "@": "player on target" }
  FrozenLake:
    env_type: frozen_lake
    max_tokens_per_step: ${max_tokens_per_step}
    user_prompt_format: ${user_prompt_no_think_format}
    added_text: ${added_text_no_think}
    env_manager_cls: roll.pipeline.agentic.env_manager.traj_env_manager.TrajEnvManager
    use_thread_lock: true
    env_config:
      env_instruction: "You are solving the FrozenLake puzzle. Forbid the whole and go to the target. You may move to the unintended direction due to the slippery ice. The answer must be one of action in a turn, format is <answer>Right</answer>"
      action_pattern: ${action_pattern}
      max_steps: ${max_actions_per_traj}
      is_slippery: false
  FrozenLakeThink:
    env_type: frozen_lake
    max_tokens_per_step: ${max_tokens_per_step}
    user_prompt_format: ${user_prompt_think_format}
    added_text: ${added_text_think}
    env_manager_cls: roll.pipeline.agentic.env_manager.traj_env_manager.TrajEnvManager
    use_thread_lock: true
    env_config:
      env_instruction: "You are solving the FrozenLake puzzle. Forbid the whole and go to the target. You may move to the unintended direction due to the slippery ice. The answer must be one of action in a turn, format is <answer>Right</answer>"
      action_pattern: ${think_action_pattern}
      max_steps: ${max_actions_per_traj}
      is_slippery: false
