Bug: ggml-vulkan supports_op rejects view-only ops on devices with low maxStorageBufferRange (Dozen, MoltenVK, mobile)
Summary
ggml_backend_vk_device_supports_op has a tensor-size guard (maxStorageBufferRange / maxBufferSize) that runs before the per-op switch. The guard applies to every op including pure view ops (GGML_OP_NONE, GGML_OP_VIEW, GGML_OP_RESHAPE, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE) — but those ops don't bind storage buffers in any kernel dispatch, so the limit doesn't apply to them.
When a model contains a single tensor larger than maxStorageBufferRange and that tensor flows through a view (or is referenced as a leaf already pre-allocated in a Vulkan buffer), the scheduler aborts at ggml-backend.cpp:809:
pre-allocated tensor (leaf_N) in a buffer (Vulkan0) that cannot run the operation (NONE)
Discrete AMD/NVIDIA Vulkan drivers report 4 GB+ for this limit and never trip it. The bug shows up on:
- Dozen (Vulkan-on-D3D12 — the AMD/Intel WSL2 path on modern Windows): maxStorageBufferRange = 128 MiB
- MoltenVK (Vulkan-on-Metal): typically 256 MiB
- Some mobile drivers (Mali/Adreno) on older devices
Reproduction
Model: whisper-large-v3 (ggml-large-v3.bin, 3.1 GB, F16)
Driver: Mesa Dozen 26.0.5 (kisak-mesa) on AMD RX 6800 XT via WSL2 → D3D12
Image: ghcr.io/ggml-org/whisper.cpp:main-vulkan (latest, sha256:86cfd92...)
$ /app/build/bin/whisper-cli -m /models/ggml-large-v3.bin -f sample.wav -l he -t 4 -np
WARNING: dzn is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: 0 = Microsoft Direct3D12 (AMD Radeon RX 6800 XT) (Dozen) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
whisper_model_load: Vulkan0 total size = 3094.36 MB
whisper_init_state: compute buffer (encode) = 55.35 MB
operator(): processing 'sample.wav' (...), 4 threads, 1 processors, lang = he, task = transcribe ...
/app/ggml/src/ggml-backend.cpp:809: pre-allocated tensor (leaf_7) in a buffer (Vulkan0) that cannot run the operation (NONE)
Aborted (core dumped)
vulkaninfo confirms the limit:
maxStorageBufferRange = 134217728 (128 MiB)
maxMemoryAllocationSize = 0x80000000 (2 GiB)
Whisper-large-v3's token embedding matrix is 51866 × 1280 × 2 bytes (F16) = 132,776,960 bytes (~133 MB, ~127 MiB), which lands right at Dozen's 128 MiB limit (134,217,728 bytes). Same model on the same hardware via CPU (-ng) transcribes cleanly. Whisper-large-v3-turbo on the same Vulkan path also runs cleanly (its graph happens not to surface the offending leaf the same way).
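For the record, the size arithmetic as a standalone check (plain C++; every constant comes from the model dimensions and the vulkaninfo output above):

#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t embd  = 51866ULL * 1280ULL * 2ULL; // token embedding, F16: 132,776,960 bytes
    const uint64_t range = 134217728ULL;              // maxStorageBufferRange (128 MiB)
    const uint64_t alloc = 0x80000000ULL;             // maxMemoryAllocationSize (2 GiB)

    // ~126.6 MiB: about 1.4 MiB shy of the 128 MiB range limit,
    // and far under the 2 GiB allocation limit (which is why loading succeeds).
    std::printf("embedding: %llu bytes (%.1f MiB)\n",
                (unsigned long long) embd, embd / (1024.0 * 1024.0));
    std::printf("vs range limit: %+lld bytes\n", (long long) embd - (long long) range);
    std::printf("vs alloc limit: %+lld bytes\n", (long long) embd - (long long) alloc);
    return 0;
}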
Root cause
In ggml/src/ggml-vulkan/ggml-vulkan.cpp (current master, function starts at L15167):
static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
    const vk_device& device = ggml_vk_get_device(ctx->device);

    const bool uses_bda = (op->op == GGML_OP_IM2COL || op->op == GGML_OP_IM2COL_3D) &&
                          device->shader_int64 && device->buffer_device_address;

    auto const & tensor_size_supported = [&](size_t tensor_size) {
        if (tensor_size > device->max_buffer_size) { return false; }
        if (!uses_bda && !device->shader_64b_indexing) {
            if (tensor_size > device->properties.limits.maxStorageBufferRange) { return false; }
        }
        return true;
    };

    // reject any tensors larger than the max buffer size
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (op->src[i] && !tensor_size_supported(ggml_nbytes(op->src[i]))) { return false; }
    }
    if (!tensor_size_supported(ggml_nbytes(op))) { return false; }

    switch (op->op) {
        // ... ops that DO launch kernels each have their own type/shape checks ...
        case GGML_OP_NONE:
        case GGML_OP_RESHAPE:
        case GGML_OP_VIEW:
        case GGML_OP_PERMUTE:
        case GGML_OP_TRANSPOSE:
            return true;
        // ...
    }
}
The tensor_size_supported guard exists because most ops bind their operand tensors as Vulkan storage buffers in a kernel dispatch, and maxStorageBufferRange is the maximum addressable extent of a single descriptor. That reasoning is correct for compute ops.
It is not correct for view-only ops. OP_NONE is a leaf (no kernel runs). OP_VIEW/OP_RESHAPE/OP_PERMUTE/OP_TRANSPOSE only manipulate ne[]/nb[] strides on an existing buffer — they don't dispatch any shader, don't bind any descriptor, and don't read/write through maxStorageBufferRange-bound bindings. Note that the code already has explicit > maxStorageBufferRange fallbacks in the actual matmul kernels (e.g. lines 7460, 7778, 8084, 8295) — that's where the limit should be enforced, and it is.
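To make "metadata-only" concrete, here is a minimal sketch against ggml's public C API (a no_alloc context, so no data is allocated and nothing dispatches; tensor shape borrowed from the whisper embedding, field and function names per current ggml.h):

#include "ggml.h"
#include <stdio.h>

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,  // metadata only: no buffers, no kernels
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 1280, 51866);
    struct ggml_tensor * p = ggml_permute(ctx, t, 1, 0, 2, 3);

    // p->op is GGML_OP_PERMUTE and p->view_src points at t: the "result"
    // is the same storage with swapped ne[]/nb[]; there is nothing to run.
    printf("op=%s aliases_src=%d ne=[%lld,%lld] nb=[%zu,%zu]\n",
           ggml_op_name(p->op), p->view_src == t,
           (long long) p->ne[0], (long long) p->ne[1], p->nb[0], p->nb[1]);

    ggml_free(ctx);
    return 0;
}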
When supports_op returns false on a leaf that's already pre-allocated to the Vulkan buffer (which happens because the model loaded into VRAM successfully — the per-buffer allocation goes through maxBufferSize/maxMemoryAllocationSize, not maxStorageBufferRange), the scheduler reaches the abort path in ggml_backend_sched_backend_id_from_cur (ggml/src/ggml-backend.cpp:809).
Suggested fix
Skip the size guard for view-only ops. Minimal patch (verified end-to-end on Dozen with whisper-large-v3 — 9:52 audio transcribed in 87 s via whisper-server /inference):
 static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
+    // View-only ops don't dispatch kernels, so maxStorageBufferRange / maxBufferSize don't apply.
+    if (op->op == GGML_OP_NONE || op->op == GGML_OP_VIEW || op->op == GGML_OP_RESHAPE
+        || op->op == GGML_OP_PERMUTE || op->op == GGML_OP_TRANSPOSE) {
+        return true;
+    }
     ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
     ...
Notes:
- Deliberately does not include GGML_OP_RMS_NORM or GGML_OP_ROPE_BACK from the same case-block-returning-true — those do launch kernels, and the size guard is correct for them.
- An equivalent fix could hoist the existing case GGML_OP_NONE / VIEW / RESHAPE / PERMUTE / TRANSPOSE: return true; block above the size loop, short-circuiting it when op->op is in that set (see the sketch after this list). Same effect, slightly more invasive structurally.
- Other backends (CUDA, Metal, SYCL) likely have similar guard logic. Not yet verified whether they exhibit the same bug — Dozen surfaces it because of its unusually low limit. Worth a sweep.
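For illustration, that alternative shape would look roughly like this (a sketch only, not compiled against master):

static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    // Handle the view-only cases before the size guard, so the existing
    // case block stays the single source of truth for which ops are exempt.
    switch (op->op) {
        case GGML_OP_NONE:
        case GGML_OP_RESHAPE:
        case GGML_OP_VIEW:
        case GGML_OP_PERMUTE:
        case GGML_OP_TRANSPOSE:
            return true;
        default:
            break;
    }
    // ... tensor_size_supported guard and the rest of the per-op switch as before ...
}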
Affected populations
Anyone running a model with at least one tensor larger than the device's maxStorageBufferRange on Vulkan. Examples beyond whisper-large-v3 (a quick size check follows the list):
- Llama 3 8B+: token embedding 128256 × 4096 × F16 ≈ 1.05 GB → fails on any device with maxStorageBufferRange < 1 GB
- Gemma 2/3 (256K vocab × wide hidden): similarly large embeddings
- Any future model with a 50K+ vocab and F16 weights on Dozen/MoltenVK/mobile
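To gauge exposure, a small standalone check of F16 token-embedding sizes against the two low limits above (the Llama and Gemma dimensions are the commonly published ones; treat them as assumptions and re-check against the actual checkpoint):

#include <cstdint>
#include <cstdio>

int main() {
    // vocab x hidden, 2 bytes per F16 element; dimensions assumed, see note above
    struct { const char * name; uint64_t vocab, hidden; } models[] = {
        { "llama-3-8B", 128256, 4096 },
        { "gemma-2-9B", 256000, 3584 },
    };
    const uint64_t dzn_lim = 134217728ULL; // Dozen maxStorageBufferRange, 128 MiB
    const uint64_t mvk_lim = 268435456ULL; // MoltenVK, 256 MiB (typical)

    for (const auto & m : models) {
        const uint64_t bytes = m.vocab * m.hidden * 2;
        std::printf("%-12s %13llu bytes  Dozen: %s  MoltenVK: %s\n",
                    m.name, (unsigned long long) bytes,
                    bytes > dzn_lim ? "over" : "under",
                    bytes > mvk_lim ? "over" : "under");
    }
    return 0;
}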
Environment
whisper.cpp: ghcr.io/ggml-org/whisper.cpp:main-vulkan @ 86cfd92553a792b725d8788817fd2abcb487b090c9880955d6a83ea6e7b482c2 (current main, 2026-04-25)
ggml-vulkan: bundled, master
GPU: AMD Radeon RX 6800 XT (RDNA2)
Driver: Mesa Dozen 26.0.5 (kisak-mesa PPA on Ubuntu 24.04)
Vulkan API: 1.2.335
Host: Windows 11 + WSL2 + Docker Desktop, Linux container
Reproducible across:
- whisper-server with default flags
- whisper-server with -nfa (no flash-attn)
- whisper-cli direct
- vanilla ggml-large-v3.bin (no fine-tune) and ivrit-ai's Hebrew-tuned variant
- two different main-vulkan image SHAs (5 days old, 27 days old) — pre-dates any recent scheduler change
Workarounds for users hitting this before a fix lands
- Use whisper-large-v3-turbo instead of large-v3 on Dozen/MoltenVK (its graph happens not to expose the leaf the same way — empirical, not a guarantee for all builds).
- Run on CPU (-ng) — usable but slow.
- Build whisper.cpp with the patch above.
Happy to send a PR if a maintainer thinks the early-return-with-explicit-list shape is the right approach.