
[Bug]: with tp>1, every worker takes memory on every GPU after upgrading to 0.4.0 #3821

@Ethan-yt

Description


Your current environment

$ python collect_env.py
Collecting environment information...
PyTorch version: 2.1.2+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 5.4.0
Clang version: Could not collect
CMake version: version 3.27.7
Libc version: glibc-2.17

Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB
GPU 5: Tesla V100-SXM2-32GB
GPU 6: Tesla V100-SXM2-32GB
GPU 7: Tesla V100-SXM2-32GB

Nvidia driver version: 515.65.01
cuDNN version: /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudnn.so.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    1
Core(s) per socket:    36
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Model name:            Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz
Stepping:              3
CPU MHz:               2499.998
BogoMIPS:              4999.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              28160K
NUMA node0 CPU(s):     0-35
NUMA node1 CPU(s):     36-71
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 arat

Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] torch==2.1.2+cu118
[pip3] torchaudio==2.1.2+cu118
[pip3] torchvision==0.15.2a0
[pip3] torchviz==0.0.2
[pip3] triton==2.1.0
[conda] blas                      1.0                         mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] cudatoolkit               11.8.0              h4ba93d1_12    conda-forge
[conda] libfaiss                  1.7.4            h2bc3f7f_0_cpu    pytorch
[conda] mkl                       2023.1.0         h213fc3f_46343    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl-service               2.4.0           py311h5eee18b_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_fft                   1.3.8           py311h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_random                1.2.4           py311hdb19cb5_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy                     1.26.3                   pypi_0    pypi
[conda] torch                     2.1.2+cu118              pypi_0    pypi
[conda] torchaudio                2.1.2+cu118              pypi_0    pypi
[conda] torchvision               0.15.2          cuda118py311h4cc2eb7_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] torchviz                  0.0.2                    pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      NV1     NV2     NV1     PHB     PHB     PHB     NV2     0-71    0-1
GPU1    NV1      X      NV1     NV2     PHB     PHB     NV2     PHB     0-71    0-1
GPU2    NV2     NV1      X      NV2     PHB     NV1     PHB     PHB     0-71    0-1
GPU3    NV1     NV2     NV2      X      NV1     PHB     PHB     PHB     0-71    0-1
GPU4    PHB     PHB     PHB     NV1      X      NV2     NV2     NV1     0-71    0-1
GPU5    PHB     PHB     NV1     PHB     NV2      X      NV1     NV2     0-71    0-1
GPU6    PHB     NV2     PHB     PHB     NV2     NV1      X      NV1     0-71    0-1
GPU7    NV2     PHB     PHB     PHB     NV1     NV2     NV1      X      0-71    0-1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

With tensor parallelism (tp=4) on vLLM 0.4.0.post1, each process holds its ~27 GiB model shard on its own GPU as expected, but every worker additionally allocates ~897 MiB on each of the other GPUs:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    995011      C   python                          27473MiB |
|    0   N/A  N/A    999164      C   ray::RayWorkerVllm                897MiB |
|    0   N/A  N/A    999315      C   ray::RayWorkerVllm                897MiB |
|    0   N/A  N/A    999400      C   ray::RayWorkerVllm                897MiB |
|    1   N/A  N/A    995011      C   python                            897MiB |
|    1   N/A  N/A    999164      C   ray::RayWorkerVllm              27577MiB |
|    1   N/A  N/A    999315      C   ray::RayWorkerVllm                897MiB |
|    1   N/A  N/A    999400      C   ray::RayWorkerVllm                897MiB |
|    2   N/A  N/A    995011      C   python                            897MiB |
|    2   N/A  N/A    999164      C   ray::RayWorkerVllm                897MiB |
|    2   N/A  N/A    999315      C   ray::RayWorkerVllm              27745MiB |
|    2   N/A  N/A    999400      C   ray::RayWorkerVllm                897MiB |
|    3   N/A  N/A    995011      C   python                            897MiB |
|    3   N/A  N/A    999164      C   ray::RayWorkerVllm                897MiB |
|    3   N/A  N/A    999315      C   ray::RayWorkerVllm                897MiB |
|    3   N/A  N/A    999400      C   ray::RayWorkerVllm              27585MiB |
+-----------------------------------------------------------------------------+
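
To make the pattern in the table above explicit, here is a short script over the rows transcribed from the `nvidia-smi` output in this report. It is only an illustration of the reported memory layout, not part of vLLM: each of the four processes appears on all four GPUs, and the three ~897 MiB entries per process on GPUs it does not own are the unexpected allocations.

```python
from collections import defaultdict

# Per-process GPU memory rows transcribed from the nvidia-smi output
# in this report: (gpu_index, pid, process_name, memory_mib).
ROWS = [
    (0, 995011, "python", 27473), (0, 999164, "ray::RayWorkerVllm", 897),
    (0, 999315, "ray::RayWorkerVllm", 897), (0, 999400, "ray::RayWorkerVllm", 897),
    (1, 995011, "python", 897), (1, 999164, "ray::RayWorkerVllm", 27577),
    (1, 999315, "ray::RayWorkerVllm", 897), (1, 999400, "ray::RayWorkerVllm", 897),
    (2, 995011, "python", 897), (2, 999164, "ray::RayWorkerVllm", 897),
    (2, 999315, "ray::RayWorkerVllm", 27745), (2, 999400, "ray::RayWorkerVllm", 897),
    (3, 995011, "python", 897), (3, 999164, "ray::RayWorkerVllm", 897),
    (3, 999315, "ray::RayWorkerVllm", 897), (3, 999400, "ray::RayWorkerVllm", 27585),
]

def memory_by_pid(rows):
    """Map each PID to {gpu_index: memory_mib}."""
    usage = defaultdict(dict)
    for gpu, pid, _name, mib in rows:
        usage[pid][gpu] = mib
    return dict(usage)

usage = memory_by_pid(ROWS)
# Every process shows up on all four GPUs; entries under 1000 MiB are
# the stray ~897 MiB allocations on GPUs the process does not own.
for pid, per_gpu in sorted(usage.items()):
    stray = {g: m for g, m in per_gpu.items() if m < 1000}
    print(pid, "GPUs:", sorted(per_gpu), "stray MiB:", stray)
```

Before the 0.4.0 upgrade, each process was expected to hold memory only on its own GPU, so `stray` would be empty for every PID.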

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
