
Bug: memory leak with vllm_grpo_trainer_modified.py #182

@zhangyuygss

Description

When training with vllm_grpo_trainer_modified.py, memory usage (system memory, not CUDA memory) keeps growing.
This leads to an OOM in the middle of training on a machine with 640 GB of RAM.
I tried to locate the leak with tracemalloc, and it shows that allocations at transformers/models/qwen2_vl/image_processing_qwen2_vl.py:455 grow quickly during training:
image_processing_qwen2_vl.py:455: pixel_values = np.array(pixel_values)
It seems the pixel_values arrays are never released. I tried manually deleting the related variables in the trainer, but that did not help.
Any ideas on this issue?
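
For reference, here is a minimal sketch of the kind of tracemalloc instrumentation I mean: take a baseline snapshot, then periodically diff new snapshots against it to see which allocation sites grow. The frame depth, report interval, and placement relative to the training loop are illustrative assumptions, not the actual trainer code.

```python
import gc
import tracemalloc

def report_top_growth(baseline, limit=10):
    """Print the allocation sites that grew the most since `baseline`."""
    gc.collect()  # drop objects that are merely awaiting collection
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.compare_to(baseline, "lineno")[:limit]:
        print(stat)

# Usage sketch (the training-step calls themselves are omitted here):
tracemalloc.start(25)                   # keep up to 25 frames per allocation
baseline = tracemalloc.take_snapshot()
# ... run a number of training steps ...
report_top_growth(baseline)             # image_processing_qwen2_vl.py:455 shows up near the top
```

If the same line keeps climbing across snapshots even after gc.collect(), the arrays are still referenced somewhere, which would explain why deleting the local variables in the trainer did not release the memory.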
