
[Feature]: improve sleep mode to not break torch.cuda memory counters #33625

@stas00

Description


🚀 The feature, motivation and pitch

Currently, when vllm is used in conjunction with other GPU programs, e.g. RL training (verl + vllm), and sleep mode is used, we end up with bogus torch.cuda memory counters.

The torch.cuda memory reporting is broken in that situation: vllm frees the kv-cache and weights when put to "sleep" (because the training step needs the same GPU memory to be free), but torch is none the wiser that the freeing happened. The same GPUs are shared between inference and training, with each side loading and releasing all the memory it uses, so only one of them is active at a time.

So after vllm does its unloading, torch.cuda.memory_allocated() still reports the memory from vllm's run, even though it has actually been freed, which makes memory-related problems quite difficult to debug.

Another weird, related thing: torch.cuda.memory_reserved and torch.cuda.max_memory_reserved report 198.68 GB and 199 GB on an H200, which only has ~140 GB! How can they possibly report more than the physical memory size? (In this particular use case vllm's memory usage was about 60 GB, so the diff 200 - 60 = 140 GB checks out against the physical size.)

So the workaround proposed in #11743 (comment) is to use mem_get_info() and manually calculate the used memory:

    mem_free, mem_total = get_torch_device().mem_get_info()
    mem_used = mem_total - mem_free

but all the other counters are still wrong, and having just the used memory is insufficient when dealing with memory issues.
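For comparison, here is a minimal monitoring sketch using plain torch.cuda (the snippet above goes through the RL framework's get_torch_device() helper, which is not a torch API): it prints the driver-level usage from mem_get_info() next to torch's allocator counters, so the mismatch after sleep is visible at a glance. The sleep/wake usage at the bottom is hypothetical and assumes an LLM created with enable_sleep_mode=True as introduced in #11743.

    import torch

    def report_gpu_memory(tag: str) -> None:
        # Driver-level truth: free/total bytes for the current device,
        # independent of which allocator did the freeing.
        mem_free, mem_total = torch.cuda.mem_get_info()
        mem_used = mem_total - mem_free

        # torch caching-allocator counters; these are the ones that can stay
        # stale after vllm's sleep mode releases weights and kv-cache.
        allocated = torch.cuda.memory_allocated()
        reserved = torch.cuda.memory_reserved()

        gib = 1024**3
        print(f"[{tag}] used={mem_used / gib:.2f} GiB  "
              f"allocated={allocated / gib:.2f} GiB  "
              f"reserved={reserved / gib:.2f} GiB")

    # Hypothetical usage around sleep mode (llm built with enable_sleep_mode=True):
    #   report_gpu_memory("before sleep")
    #   llm.sleep(level=1)
    #   report_gpu_memory("after sleep")  # 'used' drops; allocated/reserved may not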

Is it possible to fix sleep mode so that it correctly tells torch.cuda that the tensors used by vllm have been freed?

Thank you.

cc: @youkaichao, who created the sleep PR #11743
