[Bug]: When using gemma-3n in Apple Silicon I get a NotImplementedError

### Your current environment

==============================
        System Info
==============================
OS                           : macOS 15.5 (arm64)
GCC version                  : Could not collect
Clang version                : 17.0.0 (clang-1700.0.13.5)
CMake version                : version 4.0.3
Libc version                 : N/A

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.0
Is debug build               : False
CUDA used to build PyTorch   : None
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.11 (main, Jun  3 2025, 15:41:47) [Clang 17.0.0 (clang-1700.0.13.3)] (64-bit runtime)
Python platform              : macOS-15.5-arm64-arm-64bit

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False
CUDA runtime version         : No CUDA
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No CUDA
Nvidia driver version        : No CUDA
cuDNN version                : No CUDA
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Apple M3 Max

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.53.1
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.9.2.dev442+gf73d02aad (git sha: f73d02aad)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
  Could not collect

==============================
     Environment Variables
==============================
VLLM_CPU_KVCACHE_SPACE=5
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

### 🐛 Describe the bug

Hello, 

I tried to serve google/gemma-3n-E4B-it with vLLM, but running `vllm serve "google/gemma-3n-E4B-it"` fails with the following error:
`NotImplementedError: KV sharing is not supported in V0.`

I get the same error whether or not the VLLM_CPU_KVCACHE_SPACE variable is set.
My system information is below (running Python 3.12.11):
`posix.uname_result(sysname='Darwin', nodename='mac.home', release='24.5.0', version='Darwin Kernel Version 24.5.0: Tue Apr 22 19:52:00 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6031', machine='arm64')`
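For anyone who wants to reproduce both runs (with and without VLLM_CPU_KVCACHE_SPACE), here is a minimal sketch. The helper names `build_env`, `build_command`, and `reproduce` are hypothetical and only for illustration; the only real pieces are the `vllm serve` command and the environment variable from the report above. It assumes `vllm` is on the PATH.

```python
import os
import subprocess

MODEL = "google/gemma-3n-E4B-it"
ERROR_TEXT = "KV sharing is not supported in V0."


def build_env(kv_cache_space=None, base=None):
    """Copy the environment, setting or removing VLLM_CPU_KVCACHE_SPACE."""
    env = dict(os.environ if base is None else base)
    if kv_cache_space is None:
        env.pop("VLLM_CPU_KVCACHE_SPACE", None)
    else:
        env["VLLM_CPU_KVCACHE_SPACE"] = str(kv_cache_space)
    return env


def build_command(model=MODEL):
    """Return the serve command from the report as an argument list."""
    return ["vllm", "serve", model]


def reproduce(kv_cache_space=None, timeout_s=600):
    """Run `vllm serve` once and report whether the error text appeared.

    If the server starts successfully it keeps running, so subprocess.run
    raises subprocess.TimeoutExpired after timeout_s seconds.
    """
    proc = subprocess.run(
        build_command(),
        env=build_env(kv_cache_space),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.returncode, ERROR_TEXT in (proc.stdout + proc.stderr)


if __name__ == "__main__":
    for value in (None, 5):  # unset, then the value from my environment
        print(value, reproduce(value))
```

In my case both runs end with the same NotImplementedError.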

Many thanks in advance for your support, 

Achilleas
ERROR 07-06 10:33:15 [engine.py:458] KV sharing is not supported in V0.
ERROR 07-06 10:33:15 [engine.py:458] Traceback (most recent call last):
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
ERROR 07-06 10:33:15 [engine.py:458]     engine = MQLLMEngine.from_vllm_config(
ERROR 07-06 10:33:15 [engine.py:458]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
ERROR 07-06 10:33:15 [engine.py:458]     return cls(
ERROR 07-06 10:33:15 [engine.py:458]            ^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 87, in __init__
ERROR 07-06 10:33:15 [engine.py:458]     self.engine = LLMEngine(*args, **kwargs)
ERROR 07-06 10:33:15 [engine.py:458]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/llm_engine.py", line 265, in __init__
ERROR 07-06 10:33:15 [engine.py:458]     self.model_executor = executor_class(vllm_config=vllm_config)
ERROR 07-06 10:33:15 [engine.py:458]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-06 10:33:15 [engine.py:458]     self._init_executor()
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/uniproc_executor.py", line 48, in _init_executor
ERROR 07-06 10:33:15 [engine.py:458]     self.collective_rpc("load_model")
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 07-06 10:33:15 [engine.py:458]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-06 10:33:15 [engine.py:458]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/utils/__init__.py", line 2736, in run_method
ERROR 07-06 10:33:15 [engine.py:458]     return func(*args, **kwargs)
ERROR 07-06 10:33:15 [engine.py:458]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/worker/cpu_worker.py", line 239, in load_model
ERROR 07-06 10:33:15 [engine.py:458]     self.model_runner.load_model()
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/worker/cpu_model_runner.py", line 486, in load_model
ERROR 07-06 10:33:15 [engine.py:458]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 07-06 10:33:15 [engine.py:458]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/__init__.py", line 59, in get_model
ERROR 07-06 10:33:15 [engine.py:458]     return loader.load_model(vllm_config=vllm_config,
ERROR 07-06 10:33:15 [engine.py:458]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
ERROR 07-06 10:33:15 [engine.py:458]     model = initialize_model(vllm_config=vllm_config,
ERROR 07-06 10:33:15 [engine.py:458]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
ERROR 07-06 10:33:15 [engine.py:458]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 07-06 10:33:15 [engine.py:458]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 774, in __init__
ERROR 07-06 10:33:15 [engine.py:458]     self.model = Gemma3nModel(vllm_config=vllm_config,
ERROR 07-06 10:33:15 [engine.py:458]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 737, in __init__
ERROR 07-06 10:33:15 [engine.py:458]     self.language_model = Gemma3nTextModel(vllm_config=vllm_config,
ERROR 07-06 10:33:15 [engine.py:458]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/compilation/decorators.py", line 152, in __init__
ERROR 07-06 10:33:15 [engine.py:458]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 579, in __init__
ERROR 07-06 10:33:15 [engine.py:458]     self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 07-06 10:33:15 [engine.py:458]                                                     ^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/utils.py", line 640, in make_layers
ERROR 07-06 10:33:15 [engine.py:458]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 07-06 10:33:15 [engine.py:458]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 581, in <lambda>
ERROR 07-06 10:33:15 [engine.py:458]     lambda prefix: Gemma3nDecoderLayer(
ERROR 07-06 10:33:15 [engine.py:458]                    ^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 389, in __init__
ERROR 07-06 10:33:15 [engine.py:458]     self.self_attn = Gemma3nAttention(
ERROR 07-06 10:33:15 [engine.py:458]                      ^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 331, in __init__
ERROR 07-06 10:33:15 [engine.py:458]     self.attn = Attention(
ERROR 07-06 10:33:15 [engine.py:458]                 ^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/attention/layer.py", line 140, in __init__
ERROR 07-06 10:33:15 [engine.py:458]     self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
ERROR 07-06 10:33:15 [engine.py:458]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-06 10:33:15 [engine.py:458]   File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/attention/backends/torch_sdpa.py", line 415, in __init__
ERROR 07-06 10:33:15 [engine.py:458]     raise NotImplementedError("KV sharing is not supported in V0.")
ERROR 07-06 10:33:15 [engine.py:458] NotImplementedError: KV sharing is not supported in V0.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.12/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/homebrew/Cellar/python@3.12/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 460, in run_mp_engine
    raise e from None
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
    engine = MQLLMEngine.from_vllm_config(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
    return cls(
           ^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/multiprocessing/engine.py", line 87, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/engine/llm_engine.py", line 265, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/uniproc_executor.py", line 48, in _init_executor
    self.collective_rpc("load_model")
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/utils/__init__.py", line 2736, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/worker/cpu_worker.py", line 239, in load_model
    self.model_runner.load_model()
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/worker/cpu_model_runner.py", line 486, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/__init__.py", line 59, in get_model
    return loader.load_model(vllm_config=vllm_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
    model = initialize_model(vllm_config=vllm_config,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 774, in __init__
    self.model = Gemma3nModel(vllm_config=vllm_config,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 737, in __init__
    self.language_model = Gemma3nTextModel(vllm_config=vllm_config,
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/compilation/decorators.py", line 152, in __init__
    old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 579, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
                                                    ^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/utils.py", line 640, in make_layers
    maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 581, in <lambda>
    lambda prefix: Gemma3nDecoderLayer(
                   ^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 389, in __init__
    self.self_attn = Gemma3nAttention(
                     ^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/model_executor/models/gemma3n.py", line 331, in __init__
    self.attn = Attention(
                ^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/attention/layer.py", line 140, in __init__
    self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/attention/backends/torch_sdpa.py", line 415, in __init__
    raise NotImplementedError("KV sharing is not supported in V0.")
NotImplementedError: KV sharing is not supported in V0.
Traceback (most recent call last):
  File "/Users/achilleas.voutsas/Development/Tools/vllm/.venv-py312/bin/vllm", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/cli/main.py", line 65, in main
    args.dispatch_function(args)
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/cli/serve.py", line 55, in cmd
    uvloop.run(run_server(args))
  File "/Users/achilleas.voutsas/Development/Tools/vllm/.venv-py312/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/Users/achilleas.voutsas/Development/Tools/vllm/.venv-py312/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.11/Frameworks/Python.framework/Versions/3.12/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achilleas.voutsas/Development/Tools/vllm/vllm/entrypoints/openai/api_server.py", line 291, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
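
Since the log above is long and repeats the same trace, here is a small sketch for pulling out just the exception lines. The function name `exception_lines` and the two regular expressions are my own assumptions, based only on the log format shown in this report (an `ERROR MM-DD HH:MM:SS [file:line]` prefix on some lines).

```python
import re

# Prefix that vLLM's logger adds to some lines in the output above.
LOG_PREFIX = re.compile(r"^ERROR \d{2}-\d{2} \d{2}:\d{2}:\d{2} \[[^\]]+\] ")
# An unindented "SomeError: message" or "SomeException: message" line.
EXCEPTION_LINE = re.compile(r"^[A-Za-z_][\w.]*(?:Error|Exception): .+")


def exception_lines(log_text):
    """Return the distinct exception lines in order of first appearance."""
    found = []
    for raw in log_text.splitlines():
        line = LOG_PREFIX.sub("", raw)
        if EXCEPTION_LINE.match(line) and line not in found:
            found.append(line)
    return found


SAMPLE = """\
ERROR 07-06 10:33:15 [engine.py:458] KV sharing is not supported in V0.
ERROR 07-06 10:33:15 [engine.py:458]     raise NotImplementedError("KV sharing is not supported in V0.")
ERROR 07-06 10:33:15 [engine.py:458] NotImplementedError: KV sharing is not supported in V0.
    raise NotImplementedError("KV sharing is not supported in V0.")
NotImplementedError: KV sharing is not supported in V0.
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
"""

if __name__ == "__main__":
    for line in exception_lines(SAMPLE):
        print(line)
```

On the sample lines taken from this report, the first entry is the root cause (the NotImplementedError raised in torch_sdpa.py) and the second is the RuntimeError from the API server that wraps it.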


### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

[Bug]: When using gemma-3n in Apple Silicon I get a NotImplementedError #20521
