With the old code, training on a multimodal dataset would hit OOM after running for some steps; after updating the code, the program now hits the OOM error right away and cannot run at all. I am training Qwen2.5_3b_vl_Instruct on 4×40G GPUs with validation disabled (training starts directly), and vLLM has been updated to 0.7.4.dev65+g22757848 as required by the latest code. The error message is as follows:
```
Traceback (most recent call last):
  File "/nfs/zpc/envs/easyr1_new/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/nfs/zpc/envs/easyr1_new/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/nfs/zpc/EasyR1/verl/trainer/main.py", line 106, in <module>
    main()
  File "/nfs/zpc/EasyR1/verl/trainer/main.py", line 102, in main
    ray.get(main_task.remote(ppo_config))
  File "/nfs/zpc/envs/easyr1_new/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/nfs/zpc/envs/easyr1_new/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/nfs/zpc/envs/easyr1_new/lib/python3.10/site-packages/ray/_private/worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/nfs/zpc/envs/easyr1_new/lib/python3.10/site-packages/ray/_private/worker.py", line 908, in get_objects
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 10.129.166.50, ID: ccaef5975b56b1fafa55edbec0dec438af8cb81e9bc3fad0c80ec173) where the task (task ID: 6c0ba3bb901c4e5a21972ebd9f34749c86c13d3301000000, name=main_task, pid=782941, memory used=0.59GB) was running was 242.92GB / 251.53GB (0.965788), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: a97eb820c2c75e5623bfd06e71c3466b536930c3e3940de9b003c0f6) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.129.166.50`. To see the logs of the worker, use `ray logs worker-a97eb820c2c75e5623bfd06e71c3466b536930c3e3940de9b003c0f6*out -ip 10.129.166.50`. Top 10 memory users:
PID       MEM(GB)  COMMAND
785330    23.87    ray::WorkerDict.actor_rollout_init_model
784337    23.61    ray::WorkerDict.actor_rollout_init_model
785331    20.13    ray::WorkerDict.actor_rollout_init_model
785329    19.94    ray::WorkerDict.actor_rollout_init_model
3162739   12.94    ray::main_task
3162421   12.93    ray::main_task
3162547   12.93    ray::main_task
3162806   12.93    ray::main_task
3162673   12.93    ray::main_task
3162610   10.41    ray::main_task
```
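The kill here comes from Ray's memory monitor: once node memory usage crosses `RAY_memory_usage_threshold` (default 0.95, matching the 0.965788 in the message above), Ray kills the most recently scheduled task. As a minimal sketch (not something I have in my setup), the threshold can be raised, or the monitor disabled, via environment variables documented in Ray's OOM-prevention guide; they must be set before the Ray node starts:

```python
import os

# Ray memory-monitor settings (documented in Ray's OOM-prevention guide).
# They must be in the environment before the Ray node starts -- e.g. before
# `ray start` on each node, or before a ray.init() that launches a local
# cluster. The values below are illustrative, not a recommendation.
os.environ["RAY_memory_usage_threshold"] = "0.98"    # default: 0.95
# os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # 0 disables the monitor

import ray

ray.init()
```

Raising the threshold would only postpone the kill, though: per the table above, four `ray::WorkerDict.actor_rollout_init_model` workers (~20 GB each) and six `ray::main_task` processes (~13 GB each) are genuinely filling the node's 251 GB of RAM.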
Here is my configuration file:

```yaml
data:
  train_files: hiyouga/math12k@train
  val_files: hiyouga/math12k@test
  prompt_key: problem
  answer_key: answer
  image_key: images
  max_prompt_length: 6000  # originally 2048
  max_response_length: 2048
  rollout_batch_size: 8  # originally 512
  shuffle: true
  seed: 1
  max_pixels: 4194304
  min_pixels: 262144

algorithm:
  adv_estimator: grpo
  kl_coef: 0.0

worker:
  actor:
    global_batch_size: 8  # originally 128
    micro_batch_size_per_device_for_update: 1  # originally 4
    micro_batch_size_per_device_for_experience: 1  # originally 16
    max_grad_norm: 1.0
    entropy_coeff: 1.0e-3
    use_kl_loss: true
    kl_loss_coef: 1.0e-2
    kl_loss_type: low_var_kl
    padding_free: true
    ulysses_sequence_parallel_size: 1
    model:
      model_path: Qwen/Qwen2.5-7B-Instruct
      enable_gradient_checkpointing: true
      trust_remote_code: false
      freeze_vision_tower: false
    optim:
      lr: 1.0e-6
      weight_decay: 1.0e-2
      lr_warmup_ratio: 0.0
    fsdp:
      enable_full_shard: true
      enable_cpu_offload: false
      enable_rank0_init: true
    offload:
      offload_params: true
      offload_optimizer: true

  rollout:
    temperature: 1.0
    n: 5
    gpu_memory_utilization: 0.5
    enforce_eager: false
    enable_chunked_prefill: false
    tensor_parallel_size: 2
    limit_images: 8  # number of input images

  ref:
    fsdp:
      enable_full_shard: true
      enable_cpu_offload: true
      enable_rank0_init: true
    offload:
      offload_params: false

reward:
  reward_type: function
  compute_score: math

trainer:
  total_episodes: 1  # originally 15
  logger: ["console", "wandb"]
  project_name: easy_r1
  experiment_name: qwen2_5_7b_math
  n_gpus_per_node: 8
  nnodes: 1
  val_freq: 5
  val_before_train: true
  val_only: false
  val_generations_to_log: 1
  save_freq: 5
  remove_previous_ckpt: false
  remove_ckpt_after_load: false
  save_checkpoint_path: ./exp_result
  load_checkpoint_path: null
```
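Since the failure is host RAM rather than GPU memory, I can watch which phase (model init, rollout, update) pushes the node toward the 95% threshold. A minimal monitoring sketch, assuming `psutil` is installed in the environment (it is not part of EasyR1):

```python
import time

import psutil

# Sample node-wide memory every 5 seconds while training runs in another
# shell; compare the percentage against Ray's 0.95 kill threshold.
while True:
    mem = psutil.virtual_memory()
    print(
        f"used={mem.used / 2**30:.1f} GiB / total={mem.total / 2**30:.1f} GiB "
        f"({mem.percent:.1f}%)"
    )
    time.sleep(5)
```

If usage climbs step after step, that would suggest an accumulation across training steps; if it spikes during `actor_rollout_init_model`, initialization alone is already exceeding the node's RAM.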