
[Bug]: When using gemma-3n on Apple Silicon I get a NotImplementedError #20521

@avoutsas67

Description

Your current environment

==============================
System Info

OS : macOS 15.5 (arm64)
GCC version : Could not collect
Clang version : 17.0.0 (clang-1700.0.13.5)
CMake version : version 4.0.3
Libc version : N/A

==============================
PyTorch Info

PyTorch version : 2.7.0
Is debug build : False
CUDA used to build PyTorch : None
ROCM used to build PyTorch : N/A

==============================
Python Environment

Python version : 3.12.11 (main, Jun 3 2025, 15:41:47) [Clang 17.0.0 (clang-1700.0.13.3)] (64-bit runtime)
Python platform : macOS-15.5-arm64-arm-64bit

==============================
CUDA / GPU Info

Is CUDA available : False
CUDA runtime version : No CUDA
CUDA_MODULE_LOADING set to : N/A
GPU models and configuration : No CUDA
Nvidia driver version : No CUDA
cuDNN version : No CUDA
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True

==============================
CPU Info

Apple M3 Max

==============================
Versions of relevant libraries

[pip3] numpy==2.2.6
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.53.1
[conda] Could not collect

==============================
vLLM Info

ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.2.dev442+gf73d02aad (git sha: f73d02a)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

==============================
Environment Variables

VLLM_CPU_KVCACHE_SPACE=5
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

🐛 Describe the bug

Hello,

I tried to use vLLM with google/gemma-3n-E4B-it; however, when I run vllm serve "google/gemma-3n-E4B-it", I get the following error: NotImplementedError: KV sharing is not supported in V0.

I get the same error regardless of whether the VLLM_CPU_KVCACHE_SPACE environment variable is set.
See my system information below (running Python 3.12.11):
posix.uname_result(sysname='Darwin', nodename='mac.home', release='24.5.0', version='Darwin Kernel Version 24.5.0: Tue Apr 22 19:52:00 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6031', machine='arm64')

Many thanks in advance for your support,

Achilleas

ERROR 07-06 10:33:15 [engine.py:458] KV sharing is not supported in V0.
ERROR 07-06 10:33:15 [engine.py:458] Traceback (most recent call last):
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
ERROR 07-06 10:33:15 [engine.py:458] engine = MQLLMEngine.from_vllm_config(
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
ERROR 07-06 10:33:15 [engine.py:458] return cls(
ERROR 07-06 10:33:15 [engine.py:458] ^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 87, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.engine = LLMEngine(*args, **kwargs)
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/llm_engine.py", line 265, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.model_executor = executor_class(vllm_config=vllm_config)
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self._init_executor()
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/uniproc_executor.py", line 48, in _init_executor
ERROR 07-06 10:33:15 [engine.py:458] self.collective_rpc("load_model")
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 07-06 10:33:15 [engine.py:458] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/utils/__init__.py", line 2736, in run_method
ERROR 07-06 10:33:15 [engine.py:458] return func(*args, **kwargs)
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/worker/cpu_worker.py", line 239, in load_model
ERROR 07-06 10:33:15 [engine.py:458] self.model_runner.load_model()
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/worker/cpu_model_runner.py", line 486, in load_model
ERROR 07-06 10:33:15 [engine.py:458] self.model = get_model(vllm_config=self.vllm_config)
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/__init__.py", line 59, in get_model
ERROR 07-06 10:33:15 [engine.py:458] return loader.load_model(vllm_config=vllm_config,
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
ERROR 07-06 10:33:15 [engine.py:458] model = initialize_model(vllm_config=vllm_config,
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
ERROR 07-06 10:33:15 [engine.py:458] return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 774, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.model = Gemma3nModel(vllm_config=vllm_config,
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 737, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.language_model = Gemma3nTextModel(vllm_config=vllm_config,
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/compilation/decorators.py", line 152, in __init__
ERROR 07-06 10:33:15 [engine.py:458] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 579, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/utils.py", line 640, in make_layers
ERROR 07-06 10:33:15 [engine.py:458] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 581, in <lambda>
ERROR 07-06 10:33:15 [engine.py:458] lambda prefix: Gemma3nDecoderLayer(
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 389, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.self_attn = Gemma3nAttention(
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 331, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.attn = Attention(
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/attention/layer.py", line 140, in __init__
ERROR 07-06 10:33:15 [engine.py:458] self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
ERROR 07-06 10:33:15 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458] File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/attention/backends/torch_sdpa.py", line 415, in __init__
ERROR 07-06 10:33:15 [engine.py:458] raise NotImplementedError("KV sharing is not supported in V0.")
ERROR 07-06 10:33:15 [engine.py:458] NotImplementedError: KV sharing is not supported in V0.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.12/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/homebrew/Cellar/python@3.12/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 460, in run_mp_engine
raise e from None
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
engine = MQLLMEngine.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
return cls(
^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 87, in __init__
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/llm_engine.py", line 265, in __init__
self.model_executor = executor_class(vllm_config=vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/executor_base.py", line 53, in __init__
self._init_executor()
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/uniproc_executor.py", line 48, in _init_executor
self.collective_rpc("load_model")
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/utils/__init__.py", line 2736, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/worker/cpu_worker.py", line 239, in load_model
self.model_runner.load_model()
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/worker/cpu_model_runner.py", line 486, in load_model
self.model = get_model(vllm_config=self.vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/__init__.py", line 59, in get_model
return loader.load_model(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
model = initialize_model(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
return model_class(vllm_config=vllm_config, prefix=prefix)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 774, in __init__
self.model = Gemma3nModel(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 737, in __init__
self.language_model = Gemma3nTextModel(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/compilation/decorators.py", line 152, in __init__
old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 579, in __init__
self.start_layer, self.end_layer, self.layers = make_layers(
^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/utils.py", line 640, in make_layers
maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 581, in <lambda>
lambda prefix: Gemma3nDecoderLayer(
^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 389, in __init__
self.self_attn = Gemma3nAttention(
^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 331, in __init__
self.attn = Attention(
^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/attention/layer.py", line 140, in __init__
self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/attention/backends/torch_sdpa.py", line 415, in __init__
raise NotImplementedError("KV sharing is not supported in V0.")
NotImplementedError: KV sharing is not supported in V0.
Traceback (most recent call last):
File "/Users/achilleas.voutsas/Development/Tools/vllm/.venv-py312/bin/vllm", line 10, in <module>
sys.exit(main())
^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/cli/main.py", line 65, in main
args.dispatch_function(args)
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/cli/serve.py", line 55, in cmd
uvloop.run(run_server(args))
File "/Users/achilleas.voutsas/Development/Tools/vllm/.venv-py312/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/Users/achilleas.voutsas/Development/Tools/vllm/.venv-py312/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
async with build_async_engine_client(args, client_config) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/openai/api_server.py", line 291, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
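
The traceback bottoms out in the constructor of the V0 TorchSDPA attention backend (vllm/attention/backends/torch_sdpa.py, line 415), reached through cpu_worker.py and the V0 LLMEngine while the Gemma 3n decoder layers are being built. The kind of guard it trips can be illustrated with the simplified sketch below. It is hypothetical: SimplifiedV0SDPAImpl, build_layers, the layer-name format and the head sizes are made up for illustration. The only assumption about vLLM itself is that the attention layer hands the backend an optional name of another layer whose KV cache should be reused ("KV sharing"), and that a V0 backend rejects any non-None value.

```python
from typing import Dict, List, Optional


class SimplifiedV0SDPAImpl:
    """Hypothetical stand-in for a V0 attention backend (not vLLM's real class)."""

    def __init__(self, num_heads: int, head_size: int,
                 kv_sharing_target_layer_name: Optional[str] = None) -> None:
        # Assumption: a non-None name means "reuse the KV cache owned by that
        # other layer instead of allocating one for this layer".
        if kv_sharing_target_layer_name is not None:
            # This sketch has no way for one layer to address another layer's
            # cache, so it refuses at construction time (i.e. at model load).
            raise NotImplementedError("KV sharing is not supported in V0.")
        self.num_heads = num_heads
        self.head_size = head_size


def build_layers(num_layers: int,
                 shared_from: Dict[int, int]) -> List[SimplifiedV0SDPAImpl]:
    """Build one attention impl per layer.

    shared_from maps a layer index to the index of the earlier layer whose
    KV cache it reuses (hypothetical layout; head sizes are placeholders).
    """
    layers = []
    for idx in range(num_layers):
        target = shared_from.get(idx)
        target_name = (f"model.layers.{target}.self_attn.attn"
                       if target is not None else None)
        layers.append(SimplifiedV0SDPAImpl(
            num_heads=8, head_size=256,
            kv_sharing_target_layer_name=target_name))
    return layers


if __name__ == "__main__":
    # No sharing: every layer owns its cache, so construction succeeds.
    print(len(build_layers(4, shared_from={})))  # 4
    # Layer 3 reuses layer 1's cache: construction fails as in the log.
    try:
        build_layers(4, shared_from={3: 1})
    except NotImplementedError as exc:
        print(f"NotImplementedError: {exc}")
```

Failing in the constructor rather than at the first request is consistent with the log: the error surfaces while load_model builds the decoder layers, so the engine process exits before the API server starts, which is why the last message is the generic "RuntimeError: Engine process failed to start."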

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working), stale (Over 90 days of inactivity)
Type: No type
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests