Description
Your current environment
==============================
System Info
OS : macOS 15.5 (arm64)
GCC version : Could not collect
Clang version : 17.0.0 (clang-1700.0.13.5)
CMake version : version 4.0.3
Libc version : N/A
==============================
PyTorch Info
PyTorch version : 2.7.0
Is debug build : False
CUDA used to build PyTorch : None
ROCM used to build PyTorch : N/A
==============================
Python Environment
Python version : 3.12.11 (main, Jun 3 2025, 15:41:47) [Clang 17.0.0 (clang-1700.0.13.3)] (64-bit runtime)
Python platform : macOS-15.5-arm64-arm-64bit
==============================
CUDA / GPU Info
Is CUDA available : False
CUDA runtime version : No CUDA
CUDA_MODULE_LOADING set to : N/A
GPU models and configuration : No CUDA
Nvidia driver version : No CUDA
cuDNN version : No CUDA
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
Apple M3 Max
==============================
Versions of relevant libraries
[pip3] numpy==2.2.6
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.53.1
[conda] Could not collect
==============================
vLLM Info
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.2.dev442+gf73d02aad (git sha: f73d02a)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
==============================
Environment Variables
VLLM_CPU_KVCACHE_SPACE=5
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
Hello,
I have tried to use vLLM with google/gemma-3n-E4B-it; however, when I execute vllm serve "google/gemma-3n-E4B-it"
I get the following error: NotImplementedError: KV sharing is not supported in V0.
I get the same error irrespective of whether the VLLM_CPU_KVCACHE_SPACE variable is set.
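For reference, a minimal offline sketch that should hit the same failure (this assumes the standard vllm.LLM Python API, which goes through the same engine and model-loading path as vllm serve):

```python
# Minimal reproduction sketch (assumes the standard offline vllm.LLM API).
# On this CPU-only macOS setup, model loading should raise the same
# NotImplementedError: "KV sharing is not supported in V0."
from vllm import LLM

llm = LLM(model="google/gemma-3n-E4B-it")
```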
See information on my system below (running Python 3.12.11):
posix.uname_result(sysname='Darwin', nodename='mac.home', release='24.5.0', version='Darwin Kernel Version 24.5.0: Tue Apr 22 19:52:00 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6031', machine='arm64')
Many thanks in advance for your support,
Achilleas
ERROR 07-06 10:33:15 [engine.py:458] KV sharing is not supported in V0.
ERROR 07-06 10:33:15 [engine.py:458] Traceback (most recent call last):
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
ERROR 07-06 10:33:15 [engine.py:458] engine = MQLLMEngine.from_vllm_config(
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
ERROR 07-06 10:33:15 [engine.py:458] return cls(
ERROR 07-06 10:33:15 [engine.py:458] ^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 87, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.engine = LLMEngine(*args, **kwargs)
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/llm_engine.py", line 265, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.model_executor = executor_class(vllm_config=vllm_config)
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self._init_executor()
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/uniproc_executor.py", line 48, in _init_executor
ERROR 07-06 10:33:15 [engine.py:458] self.collective_rpc("load_model")
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 07-06 10:33:15 [engine.py:458] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/utils/__init__.py", line 2736, in run_method
ERROR 07-06 10:33:15 [engine.py:458] return func(*args, **kwargs)
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/worker/cpu_worker.py", line 239, in load_model
ERROR 07-06 10:33:15 [engine.py:458] self.model_runner.load_model()
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/worker/cpu_model_runner.py", line 486, in load_model
ERROR 07-06 10:33:15 [engine.py:458] self.model = get_model(vllm_config=self.vllm_config)
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/__init__.py", line 59, in get_model
ERROR 07-06 10:33:15 [engine.py:458] return loader.load_model(vllm_config=vllm_config,
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
ERROR 07-06 10:33:15 [engine.py:458] model = initialize_model(vllm_config=vllm_config,
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
ERROR 07-06 10:33:15 [engine.py:458] return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 774, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.model = Gemma3nModel(vllm_config=vllm_config,
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 737, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.language_model = Gemma3nTextModel(vllm_config=vllm_config,
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/compilation/decorators.py", line 152, in __init__
ERROR 07-06 10:33:15 [engine.py:458] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 579, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/utils.py", line 640, in make_layers
ERROR 07-06 10:33:15 [engine.py:458] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 581, in <lambda>
ERROR 07-06 10:33:15 [engine.py:458] lambda prefix: Gemma3nDecoderLayer(
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 389, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.self_attn = Gemma3nAttention(
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 331, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.attn = Attention(
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/attention/layer.py", line 140, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/attention/backends/torch_sdpa.py", line 415, in __init__
ERROR 07-06 10:33:15 [engine.py:458] raise NotImplementedError("KV sharing is not supported in V0.")
ERROR 07-06 10:33:15 [engine.py:458] NotImplementedError: KV sharing is not supported in V0.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/opt/homebrew/Cellar/[email protected]/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/homebrew/Cellar/[email protected]/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 460, in run_mp_engine
raise e from None
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
engine = MQLLMEngine.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
return cls(
^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 87, in init
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/llm_engine.py", line 265, in init
self.model_executor = executor_class(vllm_config=vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/executor_base.py", line 53, in init
self._init_executor()
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/uniproc_executor.py", line 48, in _init_executor
self.collective_rpc("load_model")
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/utils/init.py", line 2736, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/worker/cpu_worker.py", line 239, in load_model
self.model_runner.load_model()
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/worker/cpu_model_runner.py", line 486, in load_model
self.model = get_model(vllm_config=self.vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/init.py", line 59, in get_model
return loader.load_model(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
model = initialize_model(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
return model_class(vllm_config=vllm_config, prefix=prefix)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 774, in init
self.model = Gemma3nModel(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 737, in init
self.language_model = Gemma3nTextModel(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/compilation/decorators.py", line 152, in init
old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 579, in init
self.start_layer, self.end_layer, self.layers = make_layers(
^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/utils.py", line 640, in make_layers
maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 581, in
lambda prefix: Gemma3nDecoderLayer(
^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 389, in init
self.self_attn = Gemma3nAttention(
^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 331, in init
self.attn = Attention(
^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/attention/layer.py", line 140, in init
self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/attention/backends/torch_sdpa.py", line 415, in init
raise NotImplementedError("KV sharing is not supported in V0.")
NotImplementedError: KV sharing is not supported in V0.
Traceback (most recent call last):
File "/Users/achilleas.voutsas/Development/Tools/vllm/.venv-py312/bin/vllm", line 10, in
sys.exit(main())
^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/cli/main.py", line 65, in main
args.dispatch_function(args)
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/cli/serve.py", line 55, in cmd
uvloop.run(run_server(args))
File "/Users/achilleas.voutsas/Development/Tools/vllm/.venv-py312/lib/python3.12/site-packages/uvloop/init.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/Users/achilleas.voutsas/Development/Tools/vllm/.venv-py312/lib/python3.12/site-packages/uvloop/init.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
async with build_async_engine_client(args, client_config) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/openai/api_server.py", line 291, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.