Your current environment
root@3b4826375ab0:/workspace# python collect_env.py
Collecting environment information...
PyTorch version: 2.1.2
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (ppc64le)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.28.4
Libc version: glibc-2.35
Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 16:04:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-100-generic-ppc64le-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.91
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn.so.8
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-12.2/targets/ppc64le-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False
CPU:
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Model name: POWER9, altivec supported
Model: 2.2 (pvr 004e 1202)
Thread(s) per core: 4
Core(s) per socket: 16
Socket(s): 2
Frequency boost: enabled
CPU max MHz: 3800.0000
CPU min MHz: 2300.0000
L1d cache: 1 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 8 MiB (16 instances)
L3 cache: 160 MiB (16 instances)
NUMA node(s): 6
NUMA node0 CPU(s): 0-63
NUMA node8 CPU(s): 64-127
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; RFI Flush, L1D private per thread
Vulnerability Mds: Not affected
Vulnerability Meltdown: Mitigation; RFI Flush, L1D private per thread
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1: Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
Vulnerability Spectre v2: Mitigation; Indirect branch serialisation (kernel only)
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] cudatoolkit 11.8.0 hedcfb66_13 conda-forge
[conda] libmagma 2.7.2 he288b6c_2 conda-forge
[conda] libmagma_sparse 2.7.2 h5b5c57a_3 conda-forge
[conda] magma 2.7.2 h097a1ca_3 conda-forge
[conda] numpy 1.24.3 py310h87cc683_0
[conda] numpy-base 1.24.3 py310hac71eb6_0
[conda] torch 2.1.2 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.3.3
vLLM Build Flags:
CUDA Archs: 7.0; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV3 SYS SYS 0-63 0 N/A
GPU1 NV3 X SYS SYS 0-63 0 N/A
GPU2 SYS SYS X NV3 64-127 8 N/A
GPU3 SYS SYS NV3 X 64-127 8 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="./models",
    dtype="float16",
    tensor_parallel_size=4,
    enforce_eager=False,
    trust_remote_code=True,
    load_format='safetensors'
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

With enforce_eager=False I get the error below, but with enforce_eager=True the script runs through this stage without any error.
root@3b4826375ab0:/workspace# python3 example.py
WARNING 03-25 16:50:03 config.py:686] Casting torch.bfloat16 to torch.float16.
2024-03-25 16:50:06,721 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 03-25 16:50:09 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='./models', tokenizer='./models', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=safetensors, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Traceback (most recent call last):
File "/workspace/example.py", line 10, in <module>
llm = LLM(
File "/root/vllm/vllm/entrypoints/llm.py", line 109, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/root/vllm/vllm/engine/llm_engine.py", line 146, in from_engine_args
engine = cls(*engine_configs,
File "/root/vllm/vllm/engine/llm_engine.py", line 103, in __init__
self.model_executor = executor_class(model_config, cache_config,
File "/root/vllm/vllm/executor/ray_gpu_executor.py", line 60, in __init__
self._init_workers_ray(placement_group)
File "/root/vllm/vllm/executor/ray_gpu_executor.py", line 190, in _init_workers_ray
self._run_workers("init_device",
File "/root/vllm/vllm/executor/ray_gpu_executor.py", line 318, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/root/vllm/vllm/worker/worker.py", line 92, in init_device
init_distributed_environment(self.parallel_config, self.rank,
File "/root/vllm/vllm/worker/worker.py", line 276, in init_distributed_environment
cupy_utils.init_process_group(
File "/root/vllm/vllm/model_executor/parallel_utils/cupy_utils.py", line 90, in init_process_group
_NCCL_BACKEND = NCCLBackendWithBFloat16(world_size, rank, host, port)
File "/root/miniconda3/lib/python3.10/site-packages/cupyx/distributed/_nccl_comm.py", line 70, in __init__
self._init_with_tcp_store(n_devices, rank, host, port)
File "/root/miniconda3/lib/python3.10/site-packages/cupyx/distributed/_nccl_comm.py", line 93, in _init_with_tcp_store
shifted_nccl_id = bytes([b + 128 for b in nccl_id])
ValueError: bytes must be in range(0, 256)
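
For what it's worth, the failure itself is simple arithmetic: bytes() only accepts values in range(0, 256), so the b + 128 shift in cupyx/distributed/_nccl_comm.py overflows as soon as any element of nccl_id is 128 or larger. A minimal sketch with made-up values (my guess is that on ppc64le, where char is unsigned by default, the raw id contains values above 127):

# Minimal sketch of the failing shift from cupyx/distributed/_nccl_comm.py.
# The shift only works if every element behaves like a signed char (-128..127).
nccl_id_signed = (-3, 0, 42, 127)      # hypothetical values for which the shift works
nccl_id_unsigned = (253, 0, 42, 127)   # hypothetical values with an element >= 128

print(bytes([b + 128 for b in nccl_id_signed]))    # fine: every shifted value is < 256
print(bytes([b + 128 for b in nccl_id_unsigned]))  # ValueError: bytes must be in range(0, 256)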
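
If it helps, a quick way to check that guess directly on this machine, assuming the id that reaches _init_with_tcp_store comes from cupy.cuda.nccl.get_unique_id() as in cupyx.distributed._nccl_comm (I have not verified the call path beyond the traceback), would be something like:

# Hypothetical check: inspect the raw NCCL unique id as CuPy reports it.
# If any element is >= 128, the b + 128 shift above has to raise the same
# "bytes must be in range(0, 256)" error.
from cupy.cuda import nccl

nccl_id = nccl.get_unique_id()
print("min:", min(nccl_id), "max:", max(nccl_id))
print("would overflow:", any(b + 128 > 255 for b in nccl_id))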