
Conversation

@HollowMan6
Contributor

@HollowMan6 HollowMan6 commented Feb 3, 2025

FIX #10714 for AMD GPUs

Related to #11743

@github-actions

github-actions bot commented Feb 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Feb 3, 2025
@HollowMan6
Contributor Author

Hi @youkaichao! Thank you for your amazing work in #11743. I've checked all the APIs and found that all the CuMem calls used here actually have equivalent HIP versions, so I tried to migrate the feature to ROCm as well. I almost succeeded (all the code compiles), but we had a strange OOM issue with hipMemAddressReserve when we tried to allocate within allocator.use_memory_pool(). I just want to check with you / anyone familiar with this to see how we could resolve it.

For easier reproduction, I modified the demo and pushed it here: https://github.com/HollowMan6/vllm_allocator_adaptor

The error log when running test.py

amdgpu.ids: No such file or directory
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')
[vllm_allocator_adaptor_c] create_and_map: device=0x3, size=20971520
[vllm_allocator_adaptor_c] CUDA Error: out of memory at vllm_allocator_adaptor_c.cpp:197
[vllm_allocator_adaptor_c] CUDA Error: out of memory at vllm_allocator_adaptor_c.cpp:137
[vllm_allocator_adaptor_c] CUDA Error: invalid argument at vllm_allocator_adaptor_c.cpp:138
[vllm_allocator_adaptor_c] CUDA Error: invalid argument at vllm_allocator_adaptor_c.cpp:145
[vllm_allocator_adaptor_c] create_and_map: device=0, size=20971520, d_mem=0, p_memHandle=0x80b7a00
Traceback (most recent call last):
  File "vllm_allocator_adaptor/test.py", line 53, in <module>
    y = torch.empty(shape, device='cuda')
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 63.73 GiB is free. Of the allocated memory 4.00 MiB is allocated by PyTorch, and 18.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@hongxiayang hongxiayang added the rocm Related to AMD ROCm label Feb 3, 2025
@HollowMan6 HollowMan6 force-pushed the sleep_amd branch 2 times, most recently from 15e1489 to 261ee50 on February 3, 2025 21:19
@mreso

mreso commented Feb 4, 2025

Hi @HollowMan6, just a quick feedback data point as I was testing this PR. For V0, inference seems to work, and the "from vllm.device_allocator.cumem import CuMemAllocator" import error I was facing on main is fixed, but for V1 I get this error:

INFO 02-03 16:07:28 model_runner.py:1118] Loading model weights took 14.9888 GB
ERROR 02-03 16:07:28 core.py:210] EngineCore hit an exception: Traceback (most recent call last):
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/v1/engine/core.py", line 202, in run_engine_core
ERROR 02-03 16:07:28 core.py:210]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 02-03 16:07:28 core.py:210]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/v1/engine/core.py", line 156, in __init__
ERROR 02-03 16:07:28 core.py:210]     super().__init__(vllm_config, executor_class)
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/v1/engine/core.py", line 54, in __init__
ERROR 02-03 16:07:28 core.py:210]     num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
ERROR 02-03 16:07:28 core.py:210]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/v1/engine/core.py", line 75, in _initialize_kv_caches
ERROR 02-03 16:07:28 core.py:210]     kv_cache_spec = self.model_executor.get_kv_cache_spec()
ERROR 02-03 16:07:28 core.py:210]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/v1/executor/abstract.py", line 68, in get_kv_cache_spec
ERROR 02-03 16:07:28 core.py:210]     output = self.collective_rpc("get_kv_cache_spec")
ERROR 02-03 16:07:28 core.py:210]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
ERROR 02-03 16:07:28 core.py:210]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 02-03 16:07:28 core.py:210]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-03 16:07:28 core.py:210]   File "/home/mreso/vllm/vllm/utils.py", line 2206, in run_method
ERROR 02-03 16:07:28 core.py:210]     raise NotImplementedError(f"Method {method!r} is not"
ERROR 02-03 16:07:28 core.py:210] NotImplementedError: Method 'get_kv_cache_spec' is not implemented.
ERROR 02-03 16:07:28 core.py:210]
CRITICAL 02-03 16:07:28 core_client.py:158] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed

Not sure if you're aware of this, so I thought I'd quickly leave a comment. Let me know if you need more info to repro.

EDIT:
Oh and I had to remove "list(APPEND CUMEM_LIBS amdhip64)" in my env, otherwise I got a linker error: /usr/bin/ld: cannot find -lamdhip64.

@youkaichao
Member

but we had a strange OOM issue with hipMemAddressReserve when we tried to allocate within allocator.use_memory_pool()

This usually means the error happens in hipMemAddressReserve; it is not really an OOM error.

When I implemented the PR, the most difficult part was making sure the calls to the CUDA driver API succeed. I don't know whether simply replacing some functions is enough to migrate these API calls. We need ROCm experts to confirm the API usage, as these calls are quite low-level (and error-prone).
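
A minimal sketch of the status-checking pattern involved (the macro and function names here are illustrative, not the ones in vLLM's allocator): every HIP driver-style call returns a status, and a failing hipMemAddressReserve surfaces as an out-of-memory status even though no real allocation has happened yet.

```cpp
// Sketch only: check the status of each HIP virtual-memory call explicitly.
#include <hip/hip_runtime.h>
#include <cstdio>

#define HIP_CHECK(expr)                                              \
  do {                                                               \
    hipError_t _e = (expr);                                          \
    if (_e != hipSuccess) {                                          \
      std::fprintf(stderr, "HIP error: %s at %s:%d\n",               \
                   hipGetErrorString(_e), __FILE__, __LINE__);       \
      return _e;                                                     \
    }                                                                \
  } while (0)

hipError_t reserve_va(size_t size) {
  void* base = nullptr;
  // Reserves a virtual address range only; no physical memory is allocated yet,
  // so a failure here is a reservation problem, not a true OOM.
  HIP_CHECK(hipMemAddressReserve(&base, size, /*alignment=*/0,
                                 /*addr hint=*/nullptr, /*flags=*/0));
  return hipSuccess;
}
```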

@HollowMan6
Contributor Author

HollowMan6 commented Feb 4, 2025

Hi @HollowMan6, just a quick feedback data point as I was testing this PR. For V0, inference seems to work

EDIT: Oh and I had to remove "list(APPEND CUMEM_LIBS amdhip64)" in my env, otherwise I got a linker error: /usr/bin/ld: cannot find -lamdhip64.

Hi @mreso! Thank you for your feedback! Are you sure sleep/wake works without the strange OOM error on AMD with this PR? It looks like you failed to link to the ROCm library, so sleep/wake shouldn't work here. Removing list(APPEND CUMEM_LIBS amdhip64) is not the correct solution, as it will stop it from linking to the library correctly. The issue you have looks more like an environment issue, so maybe you can try something like export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH for now. There may be a better way to handle this properly in CMakeLists.txt, but let's first get hipMemAddressReserve working.

The " from vllm.device_allocator.cumem import CuMemAllocator" import error I was facing on main is fixed but for V1 I get this error:

Oh really? It sounds like the logic that checks whether you are using NVIDIA or AMD is failing somewhere, since in the original codebase CuMemAllocator is disabled for AMD GPUs. I can't reproduce this on my side, though, so maybe you want to file a separate issue about it. Regarding the V1 error, neither this PR nor #11743 touched that code path, so it should be a separate issue, too.

@HollowMan6
Contributor Author

Okay, it turns out to be a hardware limitation on my side: I'm using an MI250x, and it looks like it doesn't support virtual memory management, as confirmed by https://github.com/ROCm/hip-tests/blob/84a460d96bafb00615969304cc5eaddc3b20bc3d/catch/unit/memory/hip_vmm_common.hh#L27-L37

Maybe this can work on more advanced AMD hardware, such as MI300+? Unfortunately, I don't have access to such hardware, so I can't help with this PR any further. I will mark this PR as ready, since directly mapping those CUDA calls seems to be OK; we also have some test cases here for reference: https://github.com/ROCm/hip-tests/blob/84a460d96bafb00615969304cc5eaddc3b20bc3d/catch/unit/virtualMemoryManagement/hipMemSetGetAccess.cc#L213-L256
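
For reference, a minimal sketch of the support check that those hip-tests perform, followed by the basic reserve/create/map/set-access sequence the allocator relies on (illustrative single-device code; the function names are assumptions, only the hip* APIs are real):

```cpp
// Sketch only: query VMM support, then map one granularity-aligned chunk.
#include <hip/hip_runtime.h>

bool vmm_supported(int device) {
  int supported = 0;
  hipDeviceGetAttribute(&supported,
                        hipDeviceAttributeVirtualMemoryManagementSupported,
                        device);
  return supported != 0;  // reportedly 0 on MI250x, hence the OOM-looking failure
}

bool map_one_chunk(int device, size_t size) {
  hipMemAllocationProp prop{};
  prop.type = hipMemAllocationTypePinned;
  prop.location.type = hipMemLocationTypeDevice;
  prop.location.id = device;

  size_t gran = 0;
  if (hipMemGetAllocationGranularity(&gran, &prop,
                                     hipMemAllocationGranularityMinimum) != hipSuccess)
    return false;
  size = ((size + gran - 1) / gran) * gran;  // sizes must be granularity-aligned

  void* va = nullptr;
  hipMemGenericAllocationHandle_t handle{};
  hipMemAccessDesc access{};
  access.location = prop.location;
  access.flags = hipMemAccessFlagsProtReadWrite;

  // Reserve a VA range, back it with physical memory, map it, then enable access.
  return hipMemAddressReserve(&va, size, 0, nullptr, 0) == hipSuccess &&
         hipMemCreate(&handle, size, &prop, 0) == hipSuccess &&
         hipMemMap(va, size, 0, handle, 0) == hipSuccess &&
         hipMemSetAccess(va, size, &access, 1) == hipSuccess;
}
```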

Feel free to take this PR over, @youkaichao or anyone who is interested and has those GPUs. I have set this PR to allow edits and access to secrets by maintainers.

@HollowMan6 HollowMan6 marked this pull request as ready for review February 4, 2025 13:26
@DarkLight1337
Member

@hongxiayang are you able to help with this?

@hongxiayang
Collaborator

@hongxiayang are you able to help with this?

Thanks for your contribution. @HollowMan6

@DarkLight1337 I will check this when I have more bandwidth.

@mreso

mreso commented Feb 5, 2025

Thanks @HollowMan6, you're right, that will be an env issue. I was not testing sleep/wake explicitly, so it would most likely crash if I did.

Oh really? It sounds like the logic that checks whether you are using NVIDIA or AMD is failing somewhere, since in the original codebase CuMemAllocator is disabled for AMD GPUs.

Interesting that this does not work in my case. This line is causing an exception on my end, and I assume that if CUDA is not present it's supposed to swallow the ModuleNotFoundError and disable the CuMemAllocator that way, right? But in my case an AssertionError is thrown here, which is not caught. If I raise a ModuleNotFoundError instead, it works.

Is there another mechanism I am missing that should disable CuMemAllocator?

@HollowMan6
Contributor Author

HollowMan6 commented Feb 5, 2025

@mreso

I was not testing sleep/wake explicitly, so it would most likely crash if I did.

What hardware are you using? If it's MI300 or above, then it likely supports virtual memory management and sleep/wake mode will not crash in that OOM way. You are welcome to test this out and let us know what you find!

Interesting that this does not work in my case. This line is causing an exception on my end, and I assume that if CUDA is not present it's supposed to swallow the ModuleNotFoundError and disable the CuMemAllocator that way, right? But in my case an AssertionError is thrown here, which is not caught. If I raise a ModuleNotFoundError instead, it works.

Is there another mechanism I am missing that should disable CuMemAllocator?

In main, vllm.cumem_allocator shouldn't exist in the first place for AMD, which is why it catches only ModuleNotFoundError. That module comes from the C side, so you would need to find out why you have vllm/cumem_allocator.abi3.so (it gets compiled by CMakeLists.txt), even though that CMakeLists.txt specifies it should only be compiled for CUDA; most likely an env issue again. Did you somehow install CUDA on your AMD machine? I would also suggest cleaning up your workdir to ensure there is no leftover built cumem_allocator.abi3.so, especially if you switch from this PR to main.

@mreso

mreso commented Feb 5, 2025

In main, vllm.cumem_allocator shouldn't exist in the first place for AMD, which is why it catches only ModuleNotFoundError. […] I would also suggest cleaning up your workdir to ensure there is no leftover built cumem_allocator.abi3.so, especially if you switch from this PR to main.

That's it, thanks a lot! It was an unclean build env. There is no trace of CUDA on the machines, but the file must have been created anyway, so the ModuleNotFoundError was not raised... After removing all files and rebuilding, it works.

I am on MI300s and will give it a try.

@mreso

mreso commented Feb 5, 2025

Gave it a quick try using this test script:

import torch
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_sleep_mode=True)

def run_inference(prompt):
    outputs = llm.generate(prompt)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


print("CUDA Memory Usage (after inference):")
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated()=}")

run_inference("San Francisco is")
llm.sleep()

print("CUDA Memory Usage (after sleep):")
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated()=}")

llm.wake_up()

print("CUDA Memory Usage (after wakeup):")
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated()=}")

run_inference("Paris is")

but it seems like sleep did not free any memory, and after wakeup I got an OOM error:

INFO 02-05 11:19:42 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 54.63 seconds
CUDA Memory Usage (after inference):
torch.cuda.memory_allocated()=168654098432
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.85it/s, est. speed input: 15.41 toks/s, output: 61.64 toks/s]
Prompt: 'San Francisco is', Generated text: ' a top tourist destination, with millions of visitors each year. Whether you’re traveling'
INFO 02-05 11:19:46 worker.py:133] Sleep mode freed 0.00 GiB memory, 160.18 GiB memory is still in use.
CUDA Memory Usage (after sleep):
torch.cuda.memory_allocated()=168654097920
CUDA Error: out of memory at /home/mreso/vllm/csrc/cumem_allocator.cpp:56
Segmentation fault (core dumped)

This is my hw:

=========================================== Concise Hardware Info ============================================
GPU  NODE  DID     GUID   GFX VER  GFX RAS  SDMA RAS  UMC RAS  VBIOS             BUS           PARTITION ID
0    2     0x74a1  51140  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:08:00.0  0
1    5     0x74a1  36564  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:28:00.0  0
2    4     0x74a1  18917  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:48:00.0  0
3    3     0x74a1  28918  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:68:00.0  0
4    8     0x74a1  15111  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:88:00.0  0
5    9     0x74a1  57880  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:A8:00.0  0
6    7     0x74a1  44328  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:C8:00.0  0
7    6     0x74a1  13826  gfx942   ENABLED  ENABLED   ENABLED  113-M3000100-102  0000:E9:00.0  0
==============================================================================================================

Let me know if the test script is incorrect for this or I can try anything else.

@HollowMan6
Contributor Author

Thank you @mreso! It looks like MI300 supports virtual memory management, but it gives an OOM error at a different line and then ends up with a segmentation fault; maybe the HIP APIs need to be called differently than the CUDA ones. Unfortunately, I can't help with this PR any further as I don't have an MI300 at hand, but thank you anyway, @mreso!

@YangWang92

YangWang92 commented Feb 13, 2025

Gave it a quick try using this test script: […] but it seems like sleep did not free any memory, and after wakeup I got an OOM error: […] Let me know if the test script is incorrect for this or I can try anything else.

I can also confirm that the current solution does not work on MI300 (on v0.7.2).

INFO 02-13 15:46:13 model_runner.py:1110] Starting to load model meta-llama/Llama-3.1-8B-Instruct...
INFO 02-13 15:46:18 weight_utils.py:252] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  1.77it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:03,  1.66s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:05<00:02,  2.12s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00,  2.34s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00,  2.08s/it]

INFO 02-13 15:46:27 model_runner.py:1115] Loading model weights took 14.9888 GB
INFO 02-13 15:47:10 worker.py:284] Memory profiling takes 43.04 seconds
INFO 02-13 15:47:10 worker.py:284] the current vLLM instance can use total_gpu_memory (191.45GiB) x gpu_memory_utilization (0.90) = 172.31GiB
INFO 02-13 15:47:10 worker.py:284] model weights take 14.99GiB; non_torch_memory takes 2.21GiB; PyTorch activation peak memory takes 13.50GiB; the rest of the memory reserved for KV Cache is 141.61GiB.
INFO 02-13 15:47:10 executor_base.py:110] # CUDA blocks: 72501, # CPU blocks: 2048
INFO 02-13 15:47:10 executor_base.py:115] Maximum concurrency for 131072 tokens per request: 8.85x
INFO 02-13 15:48:44 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:20<00:00,  1.73it/s]
INFO 02-13 15:49:04 model_runner.py:1562] Graph capturing finished in 20 secs, took 0.20 GiB
INFO 02-13 15:49:04 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 157.36 seconds
CUDA Memory Usage (after inference):
torch.cuda.memory_allocated()=168163364864
Processed prompts: 100%|████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.84s/it, est. speed input: 1.41 toks/s, output: 5.64 toks/s]
Prompt: 'San Francisco is', Generated text: ' a top tourist destination, with millions of visitors each year. Whether you’re traveling'
CUDA Memory Usage (after sleep):
torch.cuda.memory_allocated()=168163364352
CUDA Memory Usage (after wakeup):
torch.cuda.memory_allocated()=168163364352
Processed prompts: 100%|████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.47s/it, est. speed input: 1.22 toks/s, output: 6.49 toks/s]
Prompt: 'Paris is', Generated text: ' a battleground for the culinary world. Street vendors, high-end restaurants, bakers'
[rank0]:[W213 15:49:11.474759635 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of an

@YangWang92
Copy link

By the way, feel free to tell me what I can help with on this pull request.

@mergify

mergify bot commented Mar 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @HollowMan6.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@tjtanaa
Contributor

tjtanaa commented Mar 13, 2025

Hi @HollowMan6, just a quick feedback data point as I was testing this PR. […] I had to remove "list(APPEND CUMEM_LIBS amdhip64)" in my env, otherwise I got a linker error: /usr/bin/ld: cannot find -lamdhip64.

Hi @mreso! […] Removing list(APPEND CUMEM_LIBS amdhip64) is not the correct solution, as it will stop it from linking to the library correctly. The issue you have looks more like an environment issue, so maybe you can try something like export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH for now. […]

@HollowMan6 @hongxiayang I can't seem to compile due to a missing linker library (-lamdhip64) when building with DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm-ellm:sleepamd .

 deprecated: This API is marked as deprecated and might not be supported in future releases. For more details please refer https://github.com/ROCm/HIP/blob/develop/docs/reference/deprecated_api_list.md [-Wdeprecated-declarations]
57.87    57 |   return hipDevicePrimaryCtxRetain(ctx, dev);
57.87       |          ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
57.87 In file included from /app/vllm/csrc/cumem_allocator_compat.h:7,
57.87                  from /app/vllm/csrc/cumem_allocator.cpp:6:
57.87 /opt/rocm/include/hip/hip_runtime_api.h:5437:12: note: declared here
57.87  5437 | hipError_t hipDevicePrimaryCtxRetain(hipCtx_t* pctx, hipDevice_t dev);
57.87       |            ^~~~~~~~~~~~~~~~~~~~~~~~~
58.04 [2/28] Linking CXX shared module cumem_allocator.abi3.so
58.04 FAILED: cumem_allocator.abi3.so
58.04 : && /usr/bin/c++ -fPIC -Wno-unused-result -O2 -g -DNDEBUG -shared -o cumem_allocator.abi3.so CMakeFiles/cumem_allocator.dir/csrc/cumem_allocator.cpp.o -Wl,-rpath,/usr/local/lib/python3.12/dist-packages/torch/lib:/opt/rocm-6.3.1/lib:/opt/rocm/lib: -lamdhip64 /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch.so /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so -Wl,--no-as-needed,"/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so" -Wl,--as-needed -Wl,--no-as-needed,"/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_hip.so" -Wl,--as-needed /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_hip.so /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so /opt/rocm-6.3.1/lib/libMIOpen.so.1.0.60301 /opt/rocm/lib/libhiprtc.so.6.3.60301 -ldl /opt/rocm-6.3.1/lib/libhipblas.so.2.3.60301 /opt/rocm-6.3.1/lib/libhipfft.so.0.1.60301 /opt/rocm-6.3.1/lib/libhiprand.so.1.1.60301 /opt/rocm-6.3.1/lib/librocrand.so.1.1.60301 /opt/rocm-6.3.1/lib/libhipsparse.so.1.1.0.60301 /opt/rocm-6.3.1/lib/libhipsolver.so.0.3.60301 /opt/rocm-6.3.1/lib/libhipblaslt.so.0.13 /opt/rocm/lib/libamdhip64.so.6.3.60301 -Wl,--no-as-needed,"/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch.so" -Wl,--as-needed -Wl,-rpath-link,/opt/rocm-6.3.1/lib && :
58.04 /usr/bin/ld: cannot find -lamdhip64: No such file or directory
58.04 collect2: error: ld returned 1 exit status

This compilation error is resolved through:

sudo ln -sf /opt/rocm/lib/libamdhip64.so.6.3.60301 /usr/lib/libamdhip64.so
sudo ldconfig

HIP-Tests for VirtualMemoryManagement

I am using the HIP and ROCm from the vLLM docker image to build the tests.

I am getting two failures when running the VirtualMemoryManagement tests:

# ./VirtualMemoryManagementTest 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
VirtualMemoryManagementTest is a Catch v2.13.4 host application.
Run with -? for options

-------------------------------------------------------------------------------
Unit_hipMemCreate_MapNonContiguousChunks
-------------------------------------------------------------------------------
/app/hip-tests/catch/unit/virtualMemoryManagement/hipMemCreate.cc:272
...............................................................................

/app/hip-tests/catch/unit/virtualMemoryManagement/hipMemCreate.cc:323: FAILED:
  REQUIRE( true == std::equal(B_h.begin(), B_h.end(), C_h.data()) )
with expansion:
  true == false

Memory access fault by GPU node-9 (Agent handle: 0xd13970) on address 0x7f3214c47000. Reason: Unknown.
GPU core dump created: gpucore.28281
-------------------------------------------------------------------------------
Unit_hipMemSetAccess_Vmm2UnifiedMemCpy
-------------------------------------------------------------------------------
/app/hip-tests/catch/unit/virtualMemoryManagement/hipMemSetGetAccess.cc:552
...............................................................................

/app/hip-tests/catch/unit/virtualMemoryManagement/hipMemSetGetAccess.cc:552: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

===============================================================================
test cases:    24 |    22 passed | 2 failed
assertions: 73852 | 73850 passed | 2 failed

Aborted

HIP-TEST setup

export ROCM_BRANCH=rocm-6.3.x
git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip-tests.git
export HIPTESTS_DIR="$(readlink -f hip-tests)"
cd "$HIPTESTS_DIR"
mkdir -p build; cd build
cmake ../catch -DHIP_PLATFORM=amd -DHIP_PATH=/opt/rocm
make build_tests
cd catch_tests/unit/virtualMemoryManagement/
./VirtualMemoryManagementTest

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request migrates the fully transparent sleep mode to the ROCm platform, enabling it for AMD GPUs. The changes are extensive, touching CMake build files, C++ allocator code, Python wrappers, and configuration files to support ROCm alongside CUDA. The core logic introduces chunked memory allocation for ROCm as a workaround for platform limitations. While the overall approach is sound, I've identified critical resource leak issues in the error handling paths of the new C++ code for ROCm. Specifically, if memory creation or mapping fails, previously allocated resources are not cleaned up, which will lead to GPU memory leaks. I've provided suggestions to fix these issues.
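
To illustrate the leak concern with a sketch of the reviewed pattern (assumed names, not the PR's actual code): when chunks are created and mapped in a loop, a failure partway through has to unmap and release the chunks that already succeeded, otherwise their handles and the physical memory behind them leak.

```cpp
// Sketch of chunked create-and-map with rollback on failure (illustrative only).
#include <hip/hip_runtime.h>
#include <algorithm>
#include <utility>
#include <vector>

bool map_in_chunks(void* base, size_t total, size_t chunk_size,
                   const hipMemAllocationProp& prop) {
  std::vector<std::pair<size_t, hipMemGenericAllocationHandle_t>> mapped;
  bool ok = true;
  for (size_t off = 0; off < total && ok; off += chunk_size) {
    size_t sz = std::min(chunk_size, total - off);
    hipMemGenericAllocationHandle_t h{};
    if (hipMemCreate(&h, sz, &prop, 0) != hipSuccess) {
      ok = false;        // nothing new to release for this chunk
    } else if (hipMemMap(static_cast<char*>(base) + off, sz, 0, h, 0) != hipSuccess) {
      hipMemRelease(h);  // the chunk that failed to map must still be released
      ok = false;
    } else {
      mapped.emplace_back(off, h);
    }
  }
  if (!ok) {
    // Roll back every chunk that did succeed; without this, the handles
    // (and their physical memory) leak whenever a later chunk fails.
    for (auto& [off, h] : mapped) {
      hipMemUnmap(static_cast<char*>(base) + off, std::min(chunk_size, total - off));
      hipMemRelease(h);
    }
  }
  return ok;
}
```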


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully migrates the fully transparent sleep mode to the ROCm platform, which is a significant feature enhancement. The changes span across CMake build scripts, C++ extensions, and Python code, and are generally well-implemented. However, I've identified a few critical issues in the C++ memory allocator (csrc/cumem_allocator.cpp) concerning resource leaks and incorrect memory management API usage. These issues could lead to memory corruption or crashes and should be addressed before merging.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Signed-off-by: Hollow Man <[email protected]>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully migrates the fully transparent sleep mode to the ROCm platform, which is a significant feature enhancement. The implementation is well-structured, utilizing a compatibility header for CUDA/HIP APIs and conditional compilation for platform-specific logic. The changes across CMake, Python, and C++ are consistent and well-executed. However, I've identified a critical issue in the C++ extension concerning the lack of validation for Python objects passed from the Python layer. This could lead to process crashes if malformed data is provided. I've included a detailed comment with a code suggestion to address this vulnerability. Overall, this is a great contribution, and with the suggested fix, it will be a robust addition to the project.
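
As an illustration of that kind of defensive check (a generic CPython C-API sketch with a hypothetical argument layout, not the code under review): validate the object's type before converting it and propagate conversion failures, instead of letting a malformed tuple crash the process later.

```cpp
// Sketch only: validate a (device, size) tuple handed in from Python.
#include <Python.h>

static int parse_device_and_size(PyObject* obj, int* device, size_t* size) {
  if (!PyTuple_Check(obj) || PyTuple_Size(obj) != 2) {
    PyErr_SetString(PyExc_TypeError, "expected a (device, size) tuple");
    return -1;
  }
  long dev = PyLong_AsLong(PyTuple_GetItem(obj, 0));
  if (dev == -1 && PyErr_Occurred()) return -1;    // not an int, or out of range
  size_t sz = PyLong_AsSize_t(PyTuple_GetItem(obj, 1));
  if (sz == (size_t)-1 && PyErr_Occurred()) return -1;
  *device = (int)dev;
  *size = sz;
  return 0;  // caller checks this instead of dereferencing garbage later
}
```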


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Signed-off-by: Hollow Man <[email protected]>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully migrates the fully transparent sleep mode to the ROCm platform for AMD GPUs, which is a significant feature enhancement. The changes involve conditional compilation for CUDA and ROCm, a new compatibility header for HIP APIs, and modifications to the build system. The core logic for memory management has been adapted to handle memory in chunks for ROCm, including robust error handling in most places. However, I've identified a few critical areas in the C++ extension where error handling for Python C-API calls is missing, which could lead to process crashes. Addressing these will make the implementation much more robust.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Breezy!


@HollowMan6
Contributor Author

I addressed all the potential memory leak issues above with the help of the LLM reviews above; now both sleep level 2 and level 1 look okay when testing. Increasing VLLM_SLEEP_MEM_CHUNK_SIZE to 256MB (the default) improved the sleep/wake-up speed a lot:

Test with default VLLM_SLEEP_MEM_CHUNK_SIZE (256MB) now:

Testing Qwen/Qwen3-4B-Instruct-2507 with 1 tensor parallel size and 0.9 GPU memory utilization
INFO 10-31 13:13:47 [utils.py:253] non-default args: {'max_model_len': 8192, 'max_num_batched_tokens': 8192, 'disable_log_stats': True, 'disable_custom_all_reduce': True, 'disable_mm_preprocessor_cache': True, 'compilation_config': {'level': 3, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': [], 'splitting_ops': None, 'use_inductor': None, 'compile_sizes': None, 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': None, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': None, 'local_cache_dir': None}, 'enable_sleep_mode': True, 'model': 'Qwen/Qwen3-4B-Instruct-2507'}
INFO 10-31 13:13:47 [model.py:658] Resolved architecture: Qwen3ForCausalLM
INFO 10-31 13:13:47 [model.py:1747] Using max model len 8192
INFO 10-31 13:13:47 [scheduler.py:211] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 10-31 13:13:48 [system_utils.py:103] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
(EngineCore_DP0 pid=1860) INFO 10-31 13:13:51 [core.py:93] Initializing a V1 LLM engine (v0.1.dev10896+gd8617ccc7) with config: model='Qwen/Qwen3-4B-Instruct-2507', speculative_config=None, tokenizer='Qwen/Qwen3-4B-Instruct-2507', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-4B-Instruct-2507, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': 3, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 1, 'local_cache_dir': None}
(EngineCore_DP0 pid=1860) INFO 10-31 13:13:52 [parallel_state.py:1325] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1860) INFO 10-31 13:13:52 [gpu_model_runner.py:2865] Starting to load model Qwen/Qwen3-4B-Instruct-2507...
(EngineCore_DP0 pid=1860) INFO 10-31 13:13:53 [rocm.py:352] Using Triton Attention backend on V1 engine.
model-00003-of-00003.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 99.6M/99.6M [00:00<00:00, 168MB/s]
model-00001-of-00003.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.96G/3.96G [00:02<00:00, 1.35GB/s]
model-00002-of-00003.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.99G/3.99G [00:02<00:00, 1.33GB/s]
(EngineCore_DP0 pid=1860) INFO 10-31 13:13:56 [weight_utils.py:440] Time spent downloading weights for Qwen/Qwen3-4B-Instruct-2507: 3.199524 seconds
model.safetensors.index.json: 32.8kB [00:00, 184MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:01<00:02,  1.35s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.56it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.05it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.06it/s]
(EngineCore_DP0 pid=1860) 
(EngineCore_DP0 pid=1860) INFO 10-31 13:13:59 [default_loader.py:314] Loading weights took 2.93 seconds
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:00 [gpu_model_runner.py:2930] Model loading took 7.6719 GiB and 6.916153 seconds
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:04 [gpu_worker.py:342] Available KV cache memory: 158.72 GiB
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:04 [kv_cache_utils.py:1229] GPU KV cache size: 1,155,776 tokens
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:04 [kv_cache_utils.py:1234] Maximum concurrency for 8,192 tokens per request: 141.09x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.55it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.13s/it]
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:05 [gpu_model_runner.py:3858] Graph capturing finished in 1 secs, took 0.02 GiB
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:05 [core.py:237] init engine (profile, create kv cache, warmup model) took 5.81 seconds
INFO 10-31 13:14:06 [llm.py:345] Supported tasks: ['generate']
CUDA Memory Usage (initial):
gpu memory used (GB): 0=173.79; 
WARNING 10-31 13:14:11 [model.py:1577] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 605.68it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.39it/s, est. speed input: 4.16 toks/s, output: 22.21 toks/s]
Prompt: 'San Francisco is', Generated text: ' a city in the U.S. state of California, located on the west coast'
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:12 [block_pool.py:388] Successfully reset prefix cache
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:14 [cumem.py:250] CuMemAllocator: sleep freed 166.41 GiB memory in total, of which 7.64 GiB is backed up in CPU and the rest 158.77 GiB is discarded directly.
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:14 [gpu_worker.py:133] Sleep mode freed 172.04 GiB memory, 0.98 GiB memory is still in use.
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:14 [abstract.py:288] It took 2.004416 seconds to fall asleep.
CUDA Memory Usage (after sleep):
gpu memory used (GB): 0=1.76; 
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:29 [abstract.py:306] It took 0.174478 seconds to wake up tags {'kv_cache', 'weights'}.
CUDA Memory Usage (after wakeup):
gpu memory used (GB): 0=168.16;
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 1960.87it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.72it/s, est. speed input: 11.45 toks/s, output: 91.58 toks/s]
Prompt: 'Paris is', Generated text: ' a city of contrasts and beauty, where the streets are lined with elegant buildings and'

Test with VLLM_SLEEP_MEM_CHUNK_SIZE=2 (2MB):

Testing Qwen/Qwen3-4B-Instruct-2507 with 1 tensor parallel size and 0.9 GPU memory utilization
INFO 10-31 13:18:10 [utils.py:253] non-default args: {'max_model_len': 8192, 'max_num_batched_tokens': 8192, 'disable_log_stats': True, 'disable_custom_all_reduce': True, 'disable_mm_preprocessor_cache': True, 'compilation_config': {'level': 3, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': [], 'splitting_ops': None, 'use_inductor': None, 'compile_sizes': None, 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': None, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': None, 'local_cache_dir': None}, 'enable_sleep_mode': True, 'model': 'Qwen/Qwen3-4B-Instruct-2507'}
INFO 10-31 13:18:10 [model.py:658] Resolved architecture: Qwen3ForCausalLM
INFO 10-31 13:18:10 [model.py:1747] Using max model len 8192
INFO 10-31 13:18:11 [scheduler.py:211] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 10-31 13:18:11 [system_utils.py:103] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:15 [core.py:93] Initializing a V1 LLM engine (v0.1.dev10896+gd8617ccc7) with config: model='Qwen/Qwen3-4B-Instruct-2507', speculative_config=None, tokenizer='Qwen/Qwen3-4B-Instruct-2507', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-4B-Instruct-2507, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': 3, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 1, 'local_cache_dir': None}
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:15 [parallel_state.py:1325] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:15 [gpu_model_runner.py:2865] Starting to load model Qwen/Qwen3-4B-Instruct-2507...
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:16 [rocm.py:352] Using Triton Attention backend on V1 engine.
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:01<00:02,  1.41s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.47it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.02it/s]
(EngineCore_DP0 pid=2459) 
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:19 [default_loader.py:314] Loading weights took 3.04 seconds
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:20 [gpu_model_runner.py:2930] Model loading took 7.6719 GiB and 3.822409 seconds
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:23 [gpu_worker.py:342] Available KV cache memory: 158.72 GiB
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:23 [kv_cache_utils.py:1229] GPU KV cache size: 1,155,776 tokens
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:23 [kv_cache_utils.py:1234] Maximum concurrency for 8,192 tokens per request: 141.09x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 22.61it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.26it/s]
(EngineCore_DP0 pid=2459) INFO 10-31 13:19:10 [gpu_model_runner.py:3858] Graph capturing finished in 1 secs, took 0.02 GiB
(EngineCore_DP0 pid=2459) INFO 10-31 13:19:10 [core.py:237] init engine (profile, create kv cache, warmup model) took 50.00 seconds
INFO 10-31 13:19:10 [llm.py:345] Supported tasks: ['generate']
CUDA Memory Usage (initial):
gpu memory used (GB): 0=173.79; 
WARNING 10-31 13:19:15 [model.py:1577] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 620.37it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.46it/s, est. speed input: 13.39 toks/s, output: 71.43 toks/s]
Prompt: 'San Francisco is', Generated text: ' a city in the U.S. state of California, located on the west coast'
(EngineCore_DP0 pid=2459) INFO 10-31 13:19:16 [block_pool.py:388] Successfully reset prefix cache
(EngineCore_DP0 pid=2459) INFO 10-31 13:19:25 [cumem.py:250] CuMemAllocator: sleep freed 166.41 GiB memory in total, of which 7.64 GiB is backed up in CPU and the rest 158.77 GiB is discarded directly.
(EngineCore_DP0 pid=2459) INFO 10-31 13:19:25 [gpu_worker.py:133] Sleep mode freed 172.04 GiB memory, 0.98 GiB memory is still in use.
(EngineCore_DP0 pid=2459) INFO 10-31 13:19:25 [abstract.py:288] It took 9.961728 seconds to fall asleep.
CUDA Memory Usage (after sleep):
gpu memory used (GB): 0=1.76; 
(EngineCore_DP0 pid=2459) INFO 10-31 13:20:38 [abstract.py:306] It took 57.746950 seconds to wake up tags {'weights', 'kv_cache'}.
CUDA Memory Usage (after wakeup):
gpu memory used (GB): 0=168.16; 
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2009.73it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.84it/s, est. speed input: 11.69 toks/s, output: 93.49 toks/s]
Prompt: 'Paris is', Generated text: ' a city of contrasts and beauty, where the streets are lined with elegant buildings and'
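
For a rough sense of why the chunk size matters this much: the chunk size directly sets how many create/map driver calls a wake-up needs. A back-of-the-envelope sketch using the 166.41 GiB figure from the logs above:

```cpp
// Back-of-the-envelope chunk counts for the ~166 GiB freed in the logs above.
#include <cstdio>

int main() {
  const double total_bytes = 166.41 * (1ull << 30);   // from the sleep log
  for (double chunk_mib : {256.0, 2.0}) {
    double chunks = total_bytes / (chunk_mib * (1ull << 20));
    std::printf("%6.0f MiB chunks -> ~%.0f chunks\n", chunk_mib, chunks);
  }
  // ~666 chunks at 256 MiB vs ~85,000 at 2 MiB, which lines up with the
  // 0.17 s vs 57.7 s wake-up times reported above.
}
```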

I think this PR is now ready to merge; please let me know if you have more concerns here @tjtanaa @kliuae @gshtras @mgoin

@HollowMan6 HollowMan6 requested review from gshtras and mgoin October 31, 2025 13:28
@tjtanaa
Contributor

tjtanaa commented Oct 31, 2025

I addressed all the potential memory leak issues above with the help of the LLM reviews above; now both sleep level 2 and level 1 look okay when testing. Increasing VLLM_SLEEP_MEM_CHUNK_SIZE to 256MB (the default) improved the sleep/wake-up speed a lot:

Test with default VLLM_SLEEP_MEM_CHUNK_SIZE (256MB) now:

Testing Qwen/Qwen3-4B-Instruct-2507 with 1 tensor parallel size and 0.9 GPU memory utilization
INFO 10-31 13:13:47 [utils.py:253] non-default args: {'max_model_len': 8192, 'max_num_batched_tokens': 8192, 'disable_log_stats': True, 'disable_custom_all_reduce': True, 'disable_mm_preprocessor_cache': True, 'compilation_config': {'level': 3, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': [], 'splitting_ops': None, 'use_inductor': None, 'compile_sizes': None, 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': None, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': None, 'local_cache_dir': None}, 'enable_sleep_mode': True, 'model': 'Qwen/Qwen3-4B-Instruct-2507'}
INFO 10-31 13:13:47 [model.py:658] Resolved architecture: Qwen3ForCausalLM
INFO 10-31 13:13:47 [model.py:1747] Using max model len 8192
INFO 10-31 13:13:47 [scheduler.py:211] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 10-31 13:13:48 [system_utils.py:103] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
(EngineCore_DP0 pid=1860) INFO 10-31 13:13:51 [core.py:93] Initializing a V1 LLM engine (v0.1.dev10896+gd8617ccc7) with config: model='Qwen/Qwen3-4B-Instruct-2507', speculative_config=None, tokenizer='Qwen/Qwen3-4B-Instruct-2507', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-4B-Instruct-2507, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': 3, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 1, 'local_cache_dir': None}
(EngineCore_DP0 pid=1860) INFO 10-31 13:13:52 [parallel_state.py:1325] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1860) INFO 10-31 13:13:52 [gpu_model_runner.py:2865] Starting to load model Qwen/Qwen3-4B-Instruct-2507...
(EngineCore_DP0 pid=1860) INFO 10-31 13:13:53 [rocm.py:352] Using Triton Attention backend on V1 engine.
model-00003-of-00003.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 99.6M/99.6M [00:00<00:00, 168MB/s]
model-00001-of-00003.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.96G/3.96G [00:02<00:00, 1.35GB/s]
model-00002-of-00003.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.99G/3.99G [00:02<00:00, 1.33GB/s]
(EngineCore_DP0 pid=1860) INFO 10-31 13:13:56 [weight_utils.py:440] Time spent downloading weights for Qwen/Qwen3-4B-Instruct-2507: 3.199524 seconds
model.safetensors.index.json: 32.8kB [00:00, 184MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:01<00:02,  1.35s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.56it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.05it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.06it/s]
(EngineCore_DP0 pid=1860) 
(EngineCore_DP0 pid=1860) INFO 10-31 13:13:59 [default_loader.py:314] Loading weights took 2.93 seconds
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:00 [gpu_model_runner.py:2930] Model loading took 7.6719 GiB and 6.916153 seconds
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:04 [gpu_worker.py:342] Available KV cache memory: 158.72 GiB
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:04 [kv_cache_utils.py:1229] GPU KV cache size: 1,155,776 tokens
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:04 [kv_cache_utils.py:1234] Maximum concurrency for 8,192 tokens per request: 141.09x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.55it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.13s/it]
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:05 [gpu_model_runner.py:3858] Graph capturing finished in 1 secs, took 0.02 GiB
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:05 [core.py:237] init engine (profile, create kv cache, warmup model) took 5.81 seconds
INFO 10-31 13:14:06 [llm.py:345] Supported tasks: ['generate']
CUDA Memory Usage (initial):
gpu memory used (GB): 0=173.79; 
WARNING 10-31 13:14:11 [model.py:1577] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 605.68it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.39it/s, est. speed input: 4.16 toks/s, output: 22.21 toks/s]
Prompt: 'San Francisco is', Generated text: ' a city in the U.S. state of California, located on the west coast'
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:12 [block_pool.py:388] Successfully reset prefix cache
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:14 [cumem.py:250] CuMemAllocator: sleep freed 166.41 GiB memory in total, of which 7.64 GiB is backed up in CPU and the rest 158.77 GiB is discarded directly.
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:14 [gpu_worker.py:133] Sleep mode freed 172.04 GiB memory, 0.98 GiB memory is still in use.
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:14 [abstract.py:288] It took 2.004416 seconds to fall asleep.
CUDA Memory Usage (after sleep):
gpu memory used (GB): 0=1.76; 
(EngineCore_DP0 pid=1860) INFO 10-31 13:14:29 [abstract.py:306] It took 0.174478 seconds to wake up tags {'kv_cache', 'weights'}.
CUDA Memory Usage (after wakeup):
gpu memory used (GB): 0=168.16; 
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1960.87it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.72it/s, est. speed input: 11.45 toks/s, output: 91.58 toks/s]
Prompt: 'Paris is', Generated text: ' a city of contrasts and beauty, where the streets are lined with elegant buildings and'

Test with VLLM_SLEEP_MEM_CHUNK_SIZE=2 (2MB):

Testing Qwen/Qwen3-4B-Instruct-2507 with 1 tensor parallel size and 0.9 GPU memory utilization
INFO 10-31 13:18:10 [utils.py:253] non-default args: {'max_model_len': 8192, 'max_num_batched_tokens': 8192, 'disable_log_stats': True, 'disable_custom_all_reduce': True, 'disable_mm_preprocessor_cache': True, 'compilation_config': {'level': 3, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': [], 'splitting_ops': None, 'use_inductor': None, 'compile_sizes': None, 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': None, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': None, 'local_cache_dir': None}, 'enable_sleep_mode': True, 'model': 'Qwen/Qwen3-4B-Instruct-2507'}
INFO 10-31 13:18:10 [model.py:658] Resolved architecture: Qwen3ForCausalLM
INFO 10-31 13:18:10 [model.py:1747] Using max model len 8192
INFO 10-31 13:18:11 [scheduler.py:211] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 10-31 13:18:11 [system_utils.py:103] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:15 [core.py:93] Initializing a V1 LLM engine (v0.1.dev10896+gd8617ccc7) with config: model='Qwen/Qwen3-4B-Instruct-2507', speculative_config=None, tokenizer='Qwen/Qwen3-4B-Instruct-2507', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-4B-Instruct-2507, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': 3, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 1, 'local_cache_dir': None}
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:15 [parallel_state.py:1325] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:15 [gpu_model_runner.py:2865] Starting to load model Qwen/Qwen3-4B-Instruct-2507...
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:16 [rocm.py:352] Using Triton Attention backend on V1 engine.
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:01<00:02,  1.41s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:01<00:00,  1.47it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00,  1.02it/s]
(EngineCore_DP0 pid=2459) 
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:19 [default_loader.py:314] Loading weights took 3.04 seconds
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:20 [gpu_model_runner.py:2930] Model loading took 7.6719 GiB and 3.822409 seconds
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:23 [gpu_worker.py:342] Available KV cache memory: 158.72 GiB
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:23 [kv_cache_utils.py:1229] GPU KV cache size: 1,155,776 tokens
(EngineCore_DP0 pid=2459) INFO 10-31 13:18:23 [kv_cache_utils.py:1234] Maximum concurrency for 8,192 tokens per request: 141.09x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 22.61it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.26it/s]
(EngineCore_DP0 pid=2459) INFO 10-31 13:19:10 [gpu_model_runner.py:3858] Graph capturing finished in 1 secs, took 0.02 GiB
(EngineCore_DP0 pid=2459) INFO 10-31 13:19:10 [core.py:237] init engine (profile, create kv cache, warmup model) took 50.00 seconds
INFO 10-31 13:19:10 [llm.py:345] Supported tasks: ['generate']
CUDA Memory Usage (initial):
gpu memory used (GB): 0=173.79; 
WARNING 10-31 13:19:15 [model.py:1577] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 620.37it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.46it/s, est. speed input: 13.39 toks/s, output: 71.43 toks/s]
Prompt: 'San Francisco is', Generated text: ' a city in the U.S. state of California, located on the west coast'
(EngineCore_DP0 pid=2459) INFO 10-31 13:19:16 [block_pool.py:388] Successfully reset prefix cache
(EngineCore_DP0 pid=2459) INFO 10-31 13:19:25 [cumem.py:250] CuMemAllocator: sleep freed 166.41 GiB memory in total, of which 7.64 GiB is backed up in CPU and the rest 158.77 GiB is discarded directly.
(EngineCore_DP0 pid=2459) INFO 10-31 13:19:25 [gpu_worker.py:133] Sleep mode freed 172.04 GiB memory, 0.98 GiB memory is still in use.
(EngineCore_DP0 pid=2459) INFO 10-31 13:19:25 [abstract.py:288] It took 9.961728 seconds to fall asleep.
CUDA Memory Usage (after sleep):
gpu memory used (GB): 0=1.76; 
(EngineCore_DP0 pid=2459) INFO 10-31 13:20:38 [abstract.py:306] It took 57.746950 seconds to wake up tags {'weights', 'kv_cache'}.
CUDA Memory Usage (after wakeup):
gpu memory used (GB): 0=168.16; 
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2009.73it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.84it/s, est. speed input: 11.69 toks/s, output: 93.49 toks/s]
Prompt: 'Paris is', Generated text: ' a city of contrasts and beauty, where the streets are lined with elegant buildings and'

I think this PR is now ready for merge. Please let me know if you have any more concerns here. @tjtanaa @kliuae @gshtras @mgoin

@HollowMan6 Yeah, we hope an upcoming ROCm release will fix the VMM API so that it can allocate memory without chunking, as on NVIDIA. In theory, that should be the fastest approach.
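
To make the cost of chunking concrete, here is a rough back-of-the-envelope sketch. The 166.41 GiB figure comes from the sleep log above; the one-call-per-chunk model is an assumption for illustration, not the allocator's actual bookkeeping.

```python
# Back-of-the-envelope: how many create+map round trips wake-up needs per
# chunk size. The 166.41 GiB total is taken from the sleep log above; the
# one-call-per-chunk model is an illustrative assumption.
GiB = 1024**3
MiB = 1024**2

total_bytes = int(166.41 * GiB)  # memory freed at sleep and re-mapped at wake-up

for chunk_mb in (2, 64, 256):
    chunks = -(-total_bytes // (chunk_mb * MiB))  # ceiling division
    print(f"{chunk_mb:>4} MB chunks -> ~{chunks:,} create+map calls")
```

With 2 MB chunks that is roughly 85k VMM round trips versus fewer than a thousand at 256 MB, which lines up with the ~58 s versus sub-second wake-up times reported above.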

@HollowMan6 HollowMan6 requested a review from YangWang92 October 31, 2025 13:55
@tjtanaa
Copy link
Contributor

tjtanaa commented Oct 31, 2025

@HollowMan6 do you mind if we help update the documentation on your branch directly?

@HollowMan6
Copy link
Contributor Author

@HollowMan6 do you mind if we help update the documentation on your branch directly?

That's okay, feel free to do so, and not only for the documentation but also for the code if you feel it's necessary.

Co-authored-by: kliuae <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
@mergify
Copy link

mergify bot commented Oct 31, 2025

Documentation preview: https://vllm--12695.org.readthedocs.build/en/12695/

@mergify mergify bot added the documentation Improvements or additions to documentation label Oct 31, 2025
Copy link
Contributor

@tjtanaa tjtanaa left a comment

LGTM
@mgoin @youkaichao PTAL


Development

Successfully merging this pull request may close these issues.

[Feature]: API for evicting all KV cache from GPU memory (or sleep mode)