OOM at the start of training on a custom multimodal dataset #125

@zpc2090

Description

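With the earlier code, training on a multimodal dataset ran for a few steps and then hit OOM. After updating the code, the program can no longer run at all and fails with an OOM error right away. I am training Qwen2.5_3b_vl_Instruct on 4×40 GB GPUs, with validation turned off so that training starts directly, and I have updated vLLM to 0.7.4.dev65+g22757848 as required by the latest code. The error output is as follows: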

Traceback (most recent call last):
  File "/nfs/zpc/envs/easyr1_new/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/nfs/zpc/envs/easyr1_new/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/nfs/zpc/EasyR1/verl/trainer/main.py", line 106, in <module>
    main()
  File "/nfs/zpc/EasyR1/verl/trainer/main.py", line 102, in main
    ray.get(main_task.remote(ppo_config))
  File "/nfs/zpc/envs/easyr1_new/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/nfs/zpc/envs/easyr1_new/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/nfs/zpc/envs/easyr1_new/lib/python3.10/site-packages/ray/_private/worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/nfs/zpc/envs/easyr1_new/lib/python3.10/site-packages/ray/_private/worker.py", line 908, in get_objects
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 10.129.166.50, ID: ccaef5975b56b1fafa55edbec0dec438af8cb81e9bc3fad0c80ec173) where the task (task ID: 6c0ba3bb901c4e5a21972ebd9f34749c86c13d3301000000, name=main_task, pid=782941, memory used=0.59GB) was running was 242.92GB / 251.53GB (0.965788), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: a97eb820c2c75e5623bfd06e71c3466b536930c3e3940de9b003c0f6) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.129.166.50`. To see the logs of the worker, use `ray logs worker-a97eb820c2c75e5623bfd06e71c3466b536930c3e3940de9b003c0f6*out -ip 10.129.166.50. Top 10 memory users:
PID     MEM(GB) COMMAND
785330  23.87   ray::WorkerDict.actor_rollout_init_model
784337  23.61   ray::WorkerDict.actor_rollout_init_model
785331  20.13   ray::WorkerDict.actor_rollout_init_model
785329  19.94   ray::WorkerDict.actor_rollout_init_model
3162739 12.94   ray::main_task
3162421 12.93   ray::main_task
3162547 12.93   ray::main_task
3162806 12.93   ray::main_task
3162673 12.93   ray::main_task
3162610 10.41   ray::main_task
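To make the numbers in the error message easier to read, here is a minimal sketch that recomputes the node memory ratio and groups the "Top 10 memory users" table by command. All figures are copied from the log above; the helper name `summarize` is only illustrative, and the script says nothing about where the remaining memory went.

```python
# Figures copied from the Ray OutOfMemoryError message above.
USED_GB = 242.92
TOTAL_GB = 251.53
THRESHOLD = 0.95

# "Top 10 memory users" table from the same message (PID, MEM(GB), COMMAND).
TOP_PROCESSES = """\
785330  23.87   ray::WorkerDict.actor_rollout_init_model
784337  23.61   ray::WorkerDict.actor_rollout_init_model
785331  20.13   ray::WorkerDict.actor_rollout_init_model
785329  19.94   ray::WorkerDict.actor_rollout_init_model
3162739 12.94   ray::main_task
3162421 12.93   ray::main_task
3162547 12.93   ray::main_task
3162806 12.93   ray::main_task
3162673 12.93   ray::main_task
3162610 10.41   ray::main_task
"""


def summarize(table: str) -> dict[str, float]:
    """Sum the reported memory (GB) per command name."""
    totals: dict[str, float] = {}
    for line in table.strip().splitlines():
        _pid, mem_gb, command = line.split(maxsplit=2)
        totals[command] = totals.get(command, 0.0) + float(mem_gb)
    return totals


totals = summarize(TOP_PROCESSES)
usage_ratio = USED_GB / TOTAL_GB      # compare against Ray's kill threshold
listed_gb = sum(totals.values())      # memory held by the ten listed processes
unlisted_gb = USED_GB - listed_gb     # used memory not covered by the table

print(f"node usage ratio: {usage_ratio:.4f} (threshold {THRESHOLD})")
for command, gb in totals.items():
    print(f"{command}: {gb:.2f} GB")
print(f"listed: {listed_gb:.2f} GB, not listed: {unlisted_gb:.2f} GB")
```

The ratio computed from the rounded GB values differs slightly from the 0.965788 that Ray prints, presumably because Ray works from unrounded byte counts.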

Here is my config file:

data:
  train_files: hiyouga/math12k@train
  val_files: hiyouga/math12k@test
  prompt_key: problem
  answer_key: answer
  image_key: images
  max_prompt_length: 6000 # originally 2048
  max_response_length: 2048
  rollout_batch_size: 8  # originally 512
  shuffle: true
  seed: 1
  max_pixels: 4194304
  min_pixels: 262144

algorithm:
  adv_estimator: grpo
  kl_coef: 0.0

worker:
  actor:
    global_batch_size: 8 # originally 128
    micro_batch_size_per_device_for_update: 1 # originally 4
    micro_batch_size_per_device_for_experience: 1 # originally 16
    max_grad_norm: 1.0
    entropy_coeff: 1.0e-3
    use_kl_loss: true
    kl_loss_coef: 1.0e-2
    kl_loss_type: low_var_kl
    padding_free: true
    ulysses_sequence_parallel_size: 1
    model:
      model_path: Qwen/Qwen2.5-7B-Instruct
      enable_gradient_checkpointing: true
      trust_remote_code: false
      freeze_vision_tower: false
    optim:
      lr: 1.0e-6
      weight_decay: 1.0e-2
      lr_warmup_ratio: 0.0
    fsdp:
      enable_full_shard: true
      enable_cpu_offload: false 
      enable_rank0_init: true
    offload:
      offload_params: true
      offload_optimizer: true

  rollout:
    temperature: 1.0
    n: 5
    gpu_memory_utilization: 0.5
    enforce_eager: false
    enable_chunked_prefill: false
    tensor_parallel_size: 2
    limit_images: 8 # number of input images

  ref:
    fsdp:
      enable_full_shard: true
      enable_cpu_offload: true
      enable_rank0_init: true
    offload:
      offload_params: false

  reward:
    reward_type: function
    compute_score: math

trainer:
  total_episodes: 1  # originally 15
  logger: ["console", "wandb"]
  project_name: easy_r1
  experiment_name: qwen2_5_7b_math
  n_gpus_per_node: 8
  nnodes: 1
  val_freq: 5
  val_before_train: true
  val_only: false
  val_generations_to_log: 1
  save_freq: 5
  remove_previous_ckpt: false
  remove_ckpt_after_load: false
  save_checkpoint_path: ./exp_result
  load_checkpoint_path: null
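As a rough sanity check on the sizes this config implies, the sketch below does the arithmetic on the values above. It rests on two assumptions that are not stated in this issue: that `rollout.n` is the number of responses sampled per prompt, and that the Qwen2.5-VL image processor produces about one visual token per 28×28 pixel area. Treat the output as an estimate, not as a description of what EasyR1 actually allocates.

```python
# Values copied from the config above.
MAX_PROMPT_LENGTH = 6000
MAX_RESPONSE_LENGTH = 2048
ROLLOUT_BATCH_SIZE = 8
ROLLOUT_N = 5
MAX_PIXELS = 4194304
MIN_PIXELS = 262144
LIMIT_IMAGES = 8

# ASSUMPTION (not from the issue): roughly one visual token per 28x28 pixels.
PIXELS_PER_VISUAL_TOKEN = 28 * 28


def visual_tokens(pixels: int) -> int:
    """Estimate the visual token count for an image with `pixels` pixels."""
    return pixels // PIXELS_PER_VISUAL_TOKEN


# ASSUMPTION: rollout.n responses are sampled for every prompt in the batch.
responses_per_step = ROLLOUT_BATCH_SIZE * ROLLOUT_N
max_sequence_tokens = MAX_PROMPT_LENGTH + MAX_RESPONSE_LENGTH

max_tokens_per_image = visual_tokens(MAX_PIXELS)
min_tokens_per_image = visual_tokens(MIN_PIXELS)
worst_case_image_tokens = max_tokens_per_image * LIMIT_IMAGES

print(f"responses per rollout step: {responses_per_step}")
print(f"max tokens per sequence:    {max_sequence_tokens}")
print(f"visual tokens per image:    {min_tokens_per_image} to {max_tokens_per_image}")
print(f"worst case for {LIMIT_IMAGES} images:    {worst_case_image_tokens}")
print(f"fits in max_prompt_length:  {worst_case_image_tokens <= MAX_PROMPT_LENGTH}")
```

Under these assumptions, a single image at `max_pixels` already comes close to `max_prompt_length`, so it may be worth checking how many images per sample and what resolutions the dataset really contains.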
