Bug: ggml-vulkan supports_op rejects view-only ops on devices with low maxStorageBufferRange (Dozen, MoltenVK, mobile)
Summary
ggml_backend_vk_device_supports_op has a tensor-size guard (maxStorageBufferRange / maxBufferSize) that runs before the per-op switch. The guard applies to every op including pure view ops (GGML_OP_NONE, GGML_OP_VIEW, GGML_OP_RESHAPE, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE) — but those ops don't bind storage buffers in any kernel dispatch, so the limit doesn't apply to them.
When a model contains a single tensor larger than maxStorageBufferRange and that tensor flows through a view (or is referenced as a leaf already pre-allocated in a Vulkan buffer), the scheduler aborts at ggml-backend.cpp:809:
pre-allocated tensor (leaf_N) in a buffer (Vulkan0) that cannot run the operation (NONE)
Discrete AMD/NVIDIA Vulkan drivers report 4 GB+ for this limit and never trip it. The bug shows up on:
- Dozen (Vulkan-on-D3D12 — the AMD/Intel WSL2 path on modern Windows): maxStorageBufferRange = 128 MiB
- MoltenVK (Vulkan-on-Metal): typically 256 MiB
- Some mobile drivers (Mali/Adreno) on older devices
Reproduction
Model: whisper-large-v3 (ggml-large-v3.bin, 3.1 GB, F16)
Driver: Mesa Dozen 26.0.5 (kisak-mesa) on AMD RX 6800 XT via WSL2 → D3D12
Image: ghcr.io/ggml-org/whisper.cpp:main-vulkan (latest, sha256:86cfd92...)
$ /app/build/bin/whisper-cli -m /models/ggml-large-v3.bin -f sample.wav -l he -t 4 -np
WARNING: dzn is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: 0 = Microsoft Direct3D12 (AMD Radeon RX 6800 XT) (Dozen) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
whisper_model_load: Vulkan0 total size = 3094.36 MB
whisper_init_state: compute buffer (encode) = 55.35 MB
operator(): processing 'sample.wav' (...), 4 threads, 1 processors, lang = he, task = transcribe ...
/app/ggml/src/ggml-backend.cpp:809: pre-allocated tensor (leaf_7) in a buffer (Vulkan0) that cannot run the operation (NONE)
Aborted (core dumped)
vulkaninfo confirms the limit:
maxStorageBufferRange = 134217728 (128 MiB)
maxMemoryAllocationSize = 0x80000000 (2 GiB)
Whisper-large-v3's token embedding matrix is 51866 × 1280 × 2 bytes (F16) = 132,776,960 bytes (~133 MB, ~127 MiB), which lands right at Dozen's 128 MiB limit (134,217,728 bytes). Same model on the same hardware via CPU (-ng) transcribes cleanly. Whisper-large-v3-turbo on the same Vulkan path also runs cleanly (its graph happens not to surface the offending leaf the same way).
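For the record, the size arithmetic as a standalone check (plain C++; every constant comes from the model dimensions and the vulkaninfo output above):

#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t embd  = 51866ULL * 1280ULL * 2ULL; // token embedding, F16: 132,776,960 bytes
    const uint64_t range = 134217728ULL;              // maxStorageBufferRange (128 MiB)
    const uint64_t alloc = 0x80000000ULL;             // maxMemoryAllocationSize (2 GiB)

    // ~126.6 MiB: about 1.4 MiB shy of the 128 MiB range limit,
    // and far under the 2 GiB allocation limit (which is why loading succeeds).
    std::printf("embedding: %llu bytes (%.1f MiB)\n",
                (unsigned long long) embd, embd / (1024.0 * 1024.0));
    std::printf("vs range limit: %+lld bytes\n", (long long) embd - (long long) range);
    std::printf("vs alloc limit: %+lld bytes\n", (long long) embd - (long long) alloc);
    return 0;
}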
Root cause
In ggml/src/ggml-vulkan/ggml-vulkan.cpp (current master, function starts at L15167):
static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
    const vk_device& device = ggml_vk_get_device(ctx->device);

    const bool uses_bda = (op->op == GGML_OP_IM2COL || op->op == GGML_OP_IM2COL_3D) &&
                          device->shader_int64 && device->buffer_device_address;

    auto const & tensor_size_supported = [&](size_t tensor_size) {
        if (tensor_size > device->max_buffer_size) { return false; }
        if (!uses_bda && !device->shader_64b_indexing) {
            if (tensor_size > device->properties.limits.maxStorageBufferRange) { return false; }
        }
        return true;
    };

    // reject any tensors larger than the max buffer size
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (op->src[i] && !tensor_size_supported(ggml_nbytes(op->src[i]))) { return false; }
    }
    if (!tensor_size_supported(ggml_nbytes(op))) { return false; }

    switch (op->op) {
        // ... ops that DO launch kernels each have their own type/shape checks ...
        case GGML_OP_NONE:
        case GGML_OP_RESHAPE:
        case GGML_OP_VIEW:
        case GGML_OP_PERMUTE:
        case GGML_OP_TRANSPOSE:
            return true;
        // ...
    }
}
The tensor_size_supported guard exists because most ops bind their operand tensors as Vulkan storage buffers in a kernel dispatch, and maxStorageBufferRange is the maximum addressable extent of a single descriptor. That reasoning is correct for compute ops.
It is not correct for view-only ops. OP_NONE is a leaf (no kernel runs). OP_VIEW/OP_RESHAPE/OP_PERMUTE/OP_TRANSPOSE only manipulate ne[]/nb[] strides on an existing buffer — they don't dispatch any shader, don't bind any descriptor, and don't read/write through maxStorageBufferRange-bound bindings. Note that the code already has explicit > maxStorageBufferRange fallbacks in the actual matmul kernels (e.g. lines 7460, 7778, 8084, 8295) — that's where the limit should be enforced, and it is.
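To make "metadata-only" concrete, here is a minimal sketch against ggml's public C API (a no_alloc context, so no data is allocated and nothing dispatches; tensor shape borrowed from the whisper embedding, field and function names per current ggml.h):

#include "ggml.h"
#include <stdio.h>

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,  // metadata only: no buffers, no kernels
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 1280, 51866);
    struct ggml_tensor * p = ggml_permute(ctx, t, 1, 0, 2, 3);

    // p->op is GGML_OP_PERMUTE and p->view_src points at t: the "result"
    // is the same storage with swapped ne[]/nb[]; there is nothing to run.
    printf("op=%s aliases_src=%d ne=[%lld,%lld] nb=[%zu,%zu]\n",
           ggml_op_name(p->op), p->view_src == t,
           (long long) p->ne[0], (long long) p->ne[1], p->nb[0], p->nb[1]);

    ggml_free(ctx);
    return 0;
}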
When supports_op returns false on a leaf that's already pre-allocated to the Vulkan buffer (which happens because the model loaded into VRAM successfully — the per-buffer allocation goes through maxBufferSize/maxMemoryAllocationSize, not maxStorageBufferRange), the scheduler reaches the abort path in ggml_backend_sched_backend_id_from_cur (ggml/src/ggml-backend.cpp:809).
Suggested fix
Skip the size guard for view-only ops. Minimal patch (verified end-to-end on Dozen with whisper-large-v3 — 9:52 audio transcribed in 87 s via whisper-server /inference):
 static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
+    // View-only ops don't dispatch kernels, so maxStorageBufferRange / maxBufferSize don't apply.
+    if (op->op == GGML_OP_NONE || op->op == GGML_OP_VIEW || op->op == GGML_OP_RESHAPE
+        || op->op == GGML_OP_PERMUTE || op->op == GGML_OP_TRANSPOSE) {
+        return true;
+    }
     ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
     ...
Notes:
- Deliberately does not include GGML_OP_RMS_NORM or GGML_OP_ROPE_BACK from the same case-block-returning-true — those do launch kernels, and the size guard is correct for them.
- An equivalent fix could hoist the existing case GGML_OP_NONE / VIEW / RESHAPE / PERMUTE / TRANSPOSE: return true; block above the size loop, short-circuiting it when op->op is in that set (see the sketch after this list). Same effect, slightly more invasive structurally.
- Other backends (CUDA, Metal, SYCL) likely have similar guard logic. Not yet verified whether they exhibit the same bug — Dozen surfaces it because of its unusually low limit. Worth a sweep.
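For illustration, that alternative shape would look roughly like this (a sketch only, not compiled against master):

static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    // Handle the view-only cases before the size guard, so the existing
    // case block stays the single source of truth for which ops are exempt.
    switch (op->op) {
        case GGML_OP_NONE:
        case GGML_OP_RESHAPE:
        case GGML_OP_VIEW:
        case GGML_OP_PERMUTE:
        case GGML_OP_TRANSPOSE:
            return true;
        default:
            break;
    }
    // ... tensor_size_supported guard and the rest of the per-op switch as before ...
}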
Affected populations
Anyone running a model with at least one tensor larger than the device's maxStorageBufferRange on Vulkan. Examples beyond whisper-large-v3 (a quick size check follows the list):
- Llama 3 8B+: token embedding 128256 × 4096 × F16 ≈ 1.05 GB → fails on any device with maxStorageBufferRange < 1 GB
- Gemma 2/3 (256K vocab × wide hidden): similarly large embeddings
- Any future model with a 50K+ vocab and F16 weights on Dozen/MoltenVK/mobile
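To gauge exposure, a small standalone check of F16 token-embedding sizes against the two low limits above (the Llama and Gemma dimensions are the commonly published ones; treat them as assumptions and re-check against the actual checkpoint):

#include <cstdint>
#include <cstdio>

int main() {
    // vocab x hidden, 2 bytes per F16 element; dimensions assumed, see note above
    struct { const char * name; uint64_t vocab, hidden; } models[] = {
        { "llama-3-8B", 128256, 4096 },
        { "gemma-2-9B", 256000, 3584 },
    };
    const uint64_t dzn_lim = 134217728ULL; // Dozen maxStorageBufferRange, 128 MiB
    const uint64_t mvk_lim = 268435456ULL; // MoltenVK, 256 MiB (typical)

    for (const auto & m : models) {
        const uint64_t bytes = m.vocab * m.hidden * 2;
        std::printf("%-12s %13llu bytes  Dozen: %s  MoltenVK: %s\n",
                    m.name, (unsigned long long) bytes,
                    bytes > dzn_lim ? "over" : "under",
                    bytes > mvk_lim ? "over" : "under");
    }
    return 0;
}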
Environment
whisper.cpp: ghcr.io/ggml-org/whisper.cpp:main-vulkan @ 86cfd92553a792b725d8788817fd2abcb487b090c9880955d6a83ea6e7b482c2 (current main, 2026-04-25)
ggml-vulkan: bundled, master
GPU: AMD Radeon RX 6800 XT (RDNA2)
Driver: Mesa Dozen 26.0.5 (kisak-mesa PPA on Ubuntu 24.04)
Vulkan API: 1.2.335
Host: Windows 11 + WSL2 + Docker Desktop, Linux container
Reproducible across:
- whisper-server with default flags
- whisper-server with -nfa (no flash-attn)
- whisper-cli direct
- vanilla ggml-large-v3.bin (no fine-tune) and ivrit-ai's Hebrew-tuned variant
- two different main-vulkan image SHAs (5 days old, 27 days old) — pre-dates any recent scheduler change
Workarounds for users hitting this before a fix lands
- Use whisper-large-v3-turbo instead of large-v3 on Dozen/MoltenVK (its graph happens not to expose the leaf the same way — empirical, not a guarantee for all builds).
- Run on CPU (-ng) — usable but slow.
- Build whisper.cpp with the patch above.
Happy to send a PR if a maintainer thinks the early-return-with-explicit-list shape is the right approach.