Bug: ggml-vulkan supports_op rejects view-only ops on devices with low maxStorageBufferRange (Dozen, MoltenVK, mobile) #3777

@Zuzutus

Description

Summary

ggml_backend_vk_device_supports_op has a tensor-size guard (maxStorageBufferRange / maxBufferSize) that runs before the per-op switch. The guard applies to every op including pure view ops (GGML_OP_NONE, GGML_OP_VIEW, GGML_OP_RESHAPE, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE) — but those ops don't bind storage buffers in any kernel dispatch, so the limit doesn't apply to them.

When a model contains a single tensor larger than maxStorageBufferRange and that tensor flows through a view (or is referenced as a leaf already pre-allocated in a Vulkan buffer), the scheduler aborts at ggml-backend.cpp:809:

pre-allocated tensor (leaf_N) in a buffer (Vulkan0) that cannot run the operation (NONE)

Discrete AMD/NVIDIA Vulkan drivers report 4 GB+ for this limit and never trip it. The bug shows up on:

  • Dozen (Vulkan-on-D3D12 — the AMD/Intel WSL2 path on modern Windows): maxStorageBufferRange = 128 MB
  • MoltenVK (Vulkan-on-Metal): typically 256 MB
  • Some mobile drivers (Mali/Adreno) on older devices

Reproduction

Model:  whisper-large-v3 (ggml-large-v3.bin, 3.1 GB, F16)
Driver: Mesa Dozen 26.0.5 (kisak-mesa) on AMD RX 6800 XT via WSL2 → D3D12
Image:  ghcr.io/ggml-org/whisper.cpp:main-vulkan (latest, sha256:86cfd92...)

$ /app/build/bin/whisper-cli -m /models/ggml-large-v3.bin -f sample.wav -l he -t 4 -np

WARNING: dzn is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: 0 = Microsoft Direct3D12 (AMD Radeon RX 6800 XT) (Dozen) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
whisper_model_load: Vulkan0 total size = 3094.36 MB
whisper_init_state: compute buffer (encode) = 55.35 MB
operator(): processing 'sample.wav' (...), 4 threads, 1 processors, lang = he, task = transcribe ...
/app/ggml/src/ggml-backend.cpp:809: pre-allocated tensor (leaf_7) in a buffer (Vulkan0) that cannot run the operation (NONE)
Aborted (core dumped)

vulkaninfo confirms the limit:

maxStorageBufferRange = 134217728  (128 MiB)
maxMemoryAllocationSize = 0x80000000  (2 GiB)

Whisper-large-v3's token embedding matrix is 51866 × 1280 × 2 bytes (F16) = 132,776,960 bytes (~127 MiB, ~133 MB decimal). That single tensor sits within ~1.4 MiB of Dozen's 134,217,728-byte (128 MiB) limit, and the model's largest leaves all live at this boundary; the one flagged in the trace (leaf_7) trips the guard. Same model on the same hardware via CPU (-ng) transcribes cleanly. Whisper-large-v3-turbo on the same Vulkan path also runs cleanly (its graph happens not to surface the offending leaf the same way).

Root cause

In ggml/src/ggml-vulkan/ggml-vulkan.cpp (current master, function starts at L15167):

static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
    const vk_device& device = ggml_vk_get_device(ctx->device);

    const bool uses_bda = (op->op == GGML_OP_IM2COL || op->op == GGML_OP_IM2COL_3D) &&
                          device->shader_int64 && device->buffer_device_address;

    auto const & tensor_size_supported = [&](size_t tensor_size) {
        if (tensor_size > device->max_buffer_size) { return false; }
        if (!uses_bda && !device->shader_64b_indexing) {
            if (tensor_size > device->properties.limits.maxStorageBufferRange) { return false; }
        }
        return true;
    };
    // reject any tensors larger than the max buffer size
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (op->src[i] && !tensor_size_supported(ggml_nbytes(op->src[i]))) { return false; }
    }
    if (!tensor_size_supported(ggml_nbytes(op))) { return false; }

    switch (op->op) {
        // ... ops that DO launch kernels each have their own type/shape checks ...
        case GGML_OP_NONE:
        case GGML_OP_RESHAPE:
        case GGML_OP_VIEW:
        case GGML_OP_PERMUTE:
        case GGML_OP_TRANSPOSE:
            return true;
        // ...
    }
}

The tensor_size_supported guard exists because most ops bind their operand tensors as Vulkan storage buffers in a kernel dispatch, and maxStorageBufferRange is the maximum addressable extent of a single descriptor. That reasoning is correct for compute ops.

It is not correct for view-only ops. OP_NONE is a leaf (no kernel runs). OP_VIEW/OP_RESHAPE/OP_PERMUTE/OP_TRANSPOSE only manipulate ne[]/nb[] strides on an existing buffer — they don't dispatch any shader, don't bind any descriptor, and don't read/write through maxStorageBufferRange-bound bindings. Note that the code already has explicit > maxStorageBufferRange fallbacks in the actual matmul kernels (e.g. lines 7460, 7778, 8084, 8295) — that's where the limit should be enforced, and it is.

When supports_op returns false on a leaf that's already pre-allocated to the Vulkan buffer (which happens because the model loaded into VRAM successfully — the per-buffer allocation goes through maxBufferSize/maxMemoryAllocationSize, not maxStorageBufferRange), the scheduler reaches the abort path in ggml_backend_sched_backend_id_from_cur (ggml/src/ggml-backend.cpp:809).

Suggested fix

Skip the size guard for view-only ops. Minimal patch (verified end-to-end on Dozen with whisper-large-v3 — 9:52 audio transcribed in 87 s via whisper-server /inference):

 static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
+    // View-only ops don't dispatch kernels, so maxStorageBufferRange / maxBufferSize don't apply.
+    if (op->op == GGML_OP_NONE || op->op == GGML_OP_VIEW || op->op == GGML_OP_RESHAPE
+     || op->op == GGML_OP_PERMUTE || op->op == GGML_OP_TRANSPOSE) {
+        return true;
+    }
     ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
     ...

Notes:

  • The patch deliberately omits GGML_OP_RMS_NORM and GGML_OP_ROPE_BACK, even though they sit in the same return-true case block: those ops do launch kernels, so the size guard is correct for them.
  • An equivalent fix would short-circuit the size loop whenever op->op is in the view-only set, keeping the early return inside the existing case GGML_OP_NONE / VIEW / RESHAPE / PERMUTE / TRANSPOSE: return true; block. Same effect, slightly more invasive structurally.
  • Other backends (CUDA, Metal, SYCL) likely have similar guard logic. Not yet verified whether they exhibit the same bug — Dozen surfaces it because of its unusually low limit. Worth a sweep.

Affected populations

Anyone running a model with at least one tensor larger than the device's maxStorageBufferRange on Vulkan. Examples beyond whisper-large-v3:

  • Llama 3 8B+: token embedding 128256 × 4096 × 2 bytes (F16) ≈ 1.05 GB → fails on any device with maxStorageBufferRange < ~1 GB
  • Gemma 2/3 (256K vocab × wide hidden): similarly large embeddings
  • Any future model with a 50K+ vocab and F16 weights on Dozen/MoltenVK/mobile

Environment

whisper.cpp:    ghcr.io/ggml-org/whisper.cpp:main-vulkan @ 86cfd92553a792b725d8788817fd2abcb487b090c9880955d6a83ea6e7b482c2 (current main, 2026-04-25)
ggml-vulkan:    bundled, master
GPU:            AMD Radeon RX 6800 XT (RDNA2)
Driver:         Mesa Dozen 26.0.5 (kisak-mesa PPA on Ubuntu 24.04)
Vulkan API:     1.2.335
Host:           Windows 11 + WSL2 + Docker Desktop, Linux container

Reproducible across:

  • whisper-server with default flags
  • whisper-server with -nfa (no flash-attn)
  • whisper-cli direct
  • vanilla ggml-large-v3.bin (no fine-tune) and ivrit-ai's Hebrew-tuned variant
  • two different main-vulkan image SHAs (5 days old, 27 days old) — pre-dates any recent scheduler change

Workarounds for users hitting this before a fix lands

  1. Use whisper-large-v3-turbo instead of large-v3 on Dozen/MoltenVK (its graph happens not to expose the leaf the same way — empirical, not a guarantee for all builds).
  2. Run on CPU (-ng) — usable but slow.
  3. Build whisper.cpp with the patch above.

Happy to send a PR if a maintainer thinks the early-return-with-explicit-list shape is the right approach.
