
Conversation

@youkaichao (Member) commented Apr 11, 2024

It is observed in #3821 that every worker allocates memory on every GPU inside _can_p2p, because the test makes every process create a CUDA context on every GPU, leading to $n \times (n-1)$ CUDA contexts in total.

To avoid this, we can cache the result of the p2p test.
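
A minimal sketch of the caching idea (illustrative names and cache path, not the actual vLLM code): probe each GPU pair once, persist the result to a file, and let later runs read the file instead of touching every GPU again.

```python
import json
import os

import torch


def cached_can_p2p(src: int, tgt: int,
                   cache_path: str = os.path.expanduser(
                       "~/.cache/p2p_cache.json")) -> bool:
    """Return whether GPU `src` can access GPU `tgt`, reusing a file cache."""
    key = f"{src}->{tgt}"
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    if key in cache:
        # Cache hit: return without creating any CUDA context on the peer GPU.
        return cache[key]
    # Cache miss: pay the one-time probing cost, then persist the result.
    cache[key] = torch.cuda.can_device_access_peer(src, tgt)
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return cache[key]
```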

Before this PR (tp=4):

nvidia-smi
Thu Apr 11 15:43:50 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-32GB-LS        On  |   00000000:06:00.0 Off |                    0 |
| N/A   35C    P0             96W /  250W |   29767MiB /  32768MiB |     61%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-32GB-LS        On  |   00000000:07:00.0 Off |                    0 |
| N/A   38C    P0             96W /  250W |   29609MiB /  32768MiB |     62%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-SXM2-32GB-LS        On  |   00000000:0A:00.0 Off |                    0 |
| N/A   37C    P0             78W /  250W |   29717MiB /  32768MiB |     56%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-SXM2-32GB-LS        On  |   00000000:0B:00.0 Off |                    0 |
| N/A   35C    P0             75W /  250W |   29657MiB /  32768MiB |     33%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  Tesla V100-SXM2-32GB-LS        On  |   00000000:85:00.0 Off |                    0 |
| N/A   28C    P0             40W /  250W |       0MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  Tesla V100-SXM2-32GB-LS        On  |   00000000:86:00.0 Off |                    0 |
| N/A   29C    P0             41W /  250W |       0MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  Tesla V100-SXM2-32GB-LS        On  |   00000000:89:00.0 Off |                    0 |
| N/A   32C    P0             41W /  250W |       0MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  Tesla V100-SXM2-32GB-LS        On  |   00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0             47W /  250W |     343MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   1448586      C   python                                      28810MiB |
|    0   N/A  N/A   1456481      C   ray::RayWorkerVllm                            306MiB |
|    0   N/A  N/A   1458504      C   ray::RayWorkerVllm                            306MiB |
|    0   N/A  N/A   1460718      C   ray::RayWorkerVllm                            306MiB |
|    1   N/A  N/A   1448586      C   python                                        306MiB |
|    1   N/A  N/A   1456481      C   ray::RayWorkerVllm                          28652MiB |
|    1   N/A  N/A   1458504      C   ray::RayWorkerVllm                            306MiB |
|    1   N/A  N/A   1460718      C   ray::RayWorkerVllm                            306MiB |
|    2   N/A  N/A   1448586      C   python                                        306MiB |
|    2   N/A  N/A   1456481      C   ray::RayWorkerVllm                            306MiB |
|    2   N/A  N/A   1458504      C   ray::RayWorkerVllm                          28760MiB |
|    2   N/A  N/A   1460718      C   ray::RayWorkerVllm                            306MiB |
|    3   N/A  N/A   1448586      C   python                                        306MiB |
|    3   N/A  N/A   1456481      C   ray::RayWorkerVllm                            306MiB |
|    3   N/A  N/A   1458504      C   ray::RayWorkerVllm                            306MiB |
|    3   N/A  N/A   1460718      C   ray::RayWorkerVllm                          28700MiB |
+-----------------------------------------------------------------------------------------+

GPU blocks: 11762, CPU blocks: 2048
Throughput: 8.17 requests/s, 3932.45 tokens/s

After this PR (tp=4):

nvidia-smi
Thu Apr 11 15:52:19 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-32GB-LS        On  |   00000000:06:00.0 Off |                    0 |
| N/A   36C    P0             87W /  250W |   29717MiB /  32768MiB |     72%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-32GB-LS        On  |   00000000:07:00.0 Off |                    0 |
| N/A   38C    P0             87W /  250W |   29559MiB /  32768MiB |     66%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-SXM2-32GB-LS        On  |   00000000:0A:00.0 Off |                    0 |
| N/A   37C    P0             85W /  250W |   29667MiB /  32768MiB |     61%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-SXM2-32GB-LS        On  |   00000000:0B:00.0 Off |                    0 |
| N/A   35C    P0             79W /  250W |   29607MiB /  32768MiB |     64%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  Tesla V100-SXM2-32GB-LS        On  |   00000000:85:00.0 Off |                    0 |
| N/A   28C    P0             40W /  250W |       0MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  Tesla V100-SXM2-32GB-LS        On  |   00000000:86:00.0 Off |                    0 |
| N/A   29C    P0             41W /  250W |       0MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  Tesla V100-SXM2-32GB-LS        On  |   00000000:89:00.0 Off |                    0 |
| N/A   30C    P0             41W /  250W |       0MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  Tesla V100-SXM2-32GB-LS        On  |   00000000:8A:00.0 Off |                    0 |
| N/A   29C    P0             42W /  250W |       0MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   1491851      C   python                                      29706MiB |
|    1   N/A  N/A   1496915      C   ray::RayWorkerVllm                          29548MiB |
|    2   N/A  N/A   1497165      C   ray::RayWorkerVllm                          29656MiB |
|    3   N/A  N/A   1497328      C   ray::RayWorkerVllm                          29596MiB |
+-----------------------------------------------------------------------------------------+

GPU blocks: 12222, CPU blocks: 2048
Throughput: 8.21 requests/s, 3953.50 tokens/s

Conclusion

Fight really hard to save roughly 306MiB * tp * (tp - 1) of memory (one redundant CUDA context per worker for every other GPU).

@rkooo567 (Collaborator) left a comment


QQ: why do we write the cache to a file? Is it so that all the workers can access the cache info?

In that case, what happens if two vLLM instances are running at the same time? (Do the cache files overwrite each other, or is the check only about GPU-to-GPU accessibility, so they can share the same result?)

@youkaichao (Member, Author)

> why do we write the cache to a file? Is it so that all the workers can access the cache info?

If we don't cache it in a file, the master process still needs to initialize a CUDA context on every GPU, which costs tp * 306MiB of memory. On a given machine, p2p accessibility for a fixed set of GPUs should not change, so we can safely cache it in a file; on later runs even the master process does not need to pay the cost.

> what happens if two vLLM instances are running at the same time?

Note that the cache file name is suffixed with CUDA_VISIBLE_DEVICES, so two vLLM instances will not conflict.
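
For illustration only (the exact path and file name used in vLLM may differ), the disambiguation boils down to something like:

```python
import os

# Instances pinned to different GPU sets get different cache files; instances
# with the same visible devices can share one file, since the p2p result for
# that GPU set is identical.
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "all")
cache_path = os.path.expanduser(
    f"~/.cache/gpu_p2p_access_cache_for_{visible}.json")
```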

@WoosukKwon (Collaborator)

QQ: Can we directly use cudaIpcOpenMemHandle instead?

@youkaichao (Member, Author)

> QQ: Can we directly use cudaIpcOpenMemHandle instead?

We should seek help from @hanzhi713; I'm not familiar with this :(

@cadedaniel (Collaborator) commented Apr 12, 2024

> QQ: Can we directly use cudaIpcOpenMemHandle instead?
>
> We should seek help from @hanzhi713; I'm not familiar with this :(

We can use cudaDeviceCanAccessPeer, accessible with cupy: https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.runtime.deviceCanAccessPeer.html
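
For reference, a hedged example of that cupy call (assuming cupy is installed and GPUs 0 and 1 are visible):

```python
import cupy

# deviceCanAccessPeer mirrors the CUDA runtime API: it returns 1 if `device`
# can directly access memory on `peer`, and 0 otherwise.
def can_access_peer(device: int, peer: int) -> bool:
    return bool(cupy.cuda.runtime.deviceCanAccessPeer(device, peer))

print(can_access_peer(0, 1))
```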

@rkooo567 (Collaborator)

Makes sense! Maybe it'd be great to add a short comment in the function explaining the motivation for writing the cache to a file!

@youkaichao (Member, Author)

Technically, many functions can be used to detect p2p access, but it is extremely unclear which of them allocate additional CUDA contexts. cudaIpcOpenMemHandle or cudaDeviceCanAccessPeer might work, but it is hard to say whether they would also create tp * tp CUDA contexts.

That said, I think caching the result is a universal solution, regardless of how we detect p2p access.

@youkaichao (Member, Author)

Can anyone stamp my PR? Or is any additional modification required?

@esmeetu (Member) commented Apr 12, 2024

  1. Multi-node seems not to work?
  2. Should we delete this cache after the p2p test is done?

@rkooo567 (Collaborator)

Multi-node is not working with vLLM anyway, so I'm not sure whether we should handle it in this PR.

@youkaichao (Member, Author)

> Multi-node is not working with vLLM anyway

This is not correct: vLLM indeed supports multi-node.

@esmeetu I added handling for the multi-node case by letting one process per node (i.e. local_rank == 0) create the cache. Could you please test whether this works in a multi-node setting?
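
A rough sketch of that flow (hypothetical helper names; synchronization via torch.distributed is assumed, not necessarily what the PR uses):

```python
import torch.distributed as dist


def ensure_p2p_cache(local_rank: int, cache_path: str):
    # Only one process per node probes p2p access and writes the cache file.
    if local_rank == 0:
        write_p2p_cache(cache_path)  # hypothetical helper that runs the probe
    # The remaining local workers wait until the file exists, then read it.
    dist.barrier()
    return read_p2p_cache(cache_path)  # hypothetical helper
```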

@hanzhi713 (Contributor)

Technically, either cudaDeviceCanAccessPeer or cudaIpcOpenMemHandle would suffice. cudaDeviceCanAccessPeer is what torch.cuda.can_device_access_peer calls under the hood. However, the CUDA driver is sometimes buggy and may report p2p as supported even though it is not; this can occur on 3090 and 4090 GPUs. Thus, we need to perform actual p2p copies and check whether the result is correct to work around the driver bug.
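
A minimal sketch of such a copy-and-verify check (illustrative, not the exact vLLM code): consult the driver first, then do a real cross-device copy and confirm the data arrives intact.

```python
import torch


def p2p_copy_works(src: int, tgt: int, n: int = 1 << 20) -> bool:
    # The driver query alone can be wrong on some consumer GPUs (3090/4090).
    if not torch.cuda.can_device_access_peer(src, tgt):
        return False
    a = torch.randn(n, device=f"cuda:{src}")
    b = torch.empty(n, device=f"cuda:{tgt}")
    b.copy_(a)  # cross-device copy; uses p2p when the driver enables it
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(tgt)
    # If p2p is silently broken, the copied data will not match the source.
    return bool(torch.equal(a.cpu(), b.cpu()))
```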

@youkaichao (Member, Author)

I can confirm that cudaDeviceCanAccessPeer costs about tp * tp * 300MB of memory. I'm not sure whether cudaIpcOpenMemHandle behaves the same way.

Either way, the caching in this PR is reasonable: the p2p access pattern between GPUs seldom changes.

@esmeetu (Member) commented Apr 14, 2024

> > Multi-node is not working with vLLM anyway
>
> This is not correct: vLLM indeed supports multi-node.
>
> @esmeetu I added handling for the multi-node case by letting one process per node (i.e. local_rank == 0) create the cache. Could you please test whether this works in a multi-node setting?

Multi-node was not supported when using custom all-reduce anyway, so this PR looks good to me.



Development

Successfully merging this pull request may close these issues:

[Bug]: tp>1 every worker takes memory in every GPU after upgrade to 0.4.0
