UPSTREAM PR #21048: metal: Fix dimension constraint violation in matmul2d descriptor by loci-dev · Pull Request #1304 · auroralabs-loci/llama.cpp

loci-dev · 2026-03-27T03:10:40Z

Note

Source pull request: ggml-org/llama.cpp#21048

Updates Metal tensor API test probes to fix the dimension constraint violation in the matmul2d descriptor (at least one value must be a multiple of 16).

Overview

An investigation detailed in ggml-org/llama.cpp#16634 (comment) found that the test probes for the Metal tensor API fail to compile on macOS 26.4, which causes tensor support in the Metal backend to be disabled erroneously. This is due to a change in Apple's APIs between the time ggml-org/llama.cpp#16634 was tested and merged by @ggerganov and today: they now require that at least one of the dimensions M and N be a multiple of 16.

Notably, the actual kernels used already respect this constraint (obviously, as they are compiling successfully today), and it is only these test probes which violate it.
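For readers who want to see the constraint in isolation, here is a minimal host-side C++ sketch of the condition quoted in the `static_assert` message. `ProbeDescriptor` and `satisfies_m_n_constraint` are hypothetical names used only for illustration; they are not part of MetalPerformancePrimitives or ggml, and the passing shapes are examples rather than the values this PR necessarily uses.

```cpp
// Hypothetical stand-in for the leading fields of
// mpp::tensor_ops::matmul2d_descriptor as printed in the compiler
// diagnostic ({m, n, k, ...}). Not the real type.
struct ProbeDescriptor {
    int m;
    int n;
    int k;  // -1 in the diagnostic; left uninterpreted here
};

// Mirrors the condition from the diagnostic:
// "At least one of M or N must be a multiple of 16".
constexpr bool satisfies_m_n_constraint(ProbeDescriptor d) {
    return (d.m % 16) == 0 || (d.n % 16) == 0;
}

// The 8x8 shape is the one reported in the failing diagnostic.
static_assert(!satisfies_m_n_constraint({8, 8, -1}),
              "8x8 violates the constraint");

// Illustrative shapes that would pass the same check.
static_assert(satisfies_m_n_constraint({16, 8, -1}),
              "M is a multiple of 16");
static_assert(satisfies_m_n_constraint({8, 16, -1}),
              "N is a multiple of 16");
```

Because the check is `constexpr`, a violating shape is rejected at compile time, which matches how the failure surfaces when the probe source is compiled.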

I verified this by turning on verbose logging in the calls to ggml_metal_library_init_from_source() and observing the following output when llama-server started up. Notice ggml_metal_device_init: has tensor = false in the output, which is unexpected on a device with an M5 chip.

Metal compilation error

$ ./build/bin/llama-server --version                                            
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_init_from_source: error compiling source: Error Domain=MTLLibraryErrorDomain Code=3 "In file included from program_source:4:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MetalPerformancePrimitives.h:10:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:389:
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/__impl/MPPTensorOpsMatMul2dImpl.h:3765:5: error: static_assert failed due to requirement '(mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.m % 16) == 0 || (mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.n % 16) == 0' "At least one of M or N must be a multiple of 16"
    static_assert((descriptor.m % 16) == 0 || (descriptor.n % 16) == 0, "At least one of M or N must be a multiple of 16");
    ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:415:24: note: in instantiation of function template specialization 'mpp::tensor_ops::__mutmul2d_detail::__run<{8, 8, -1, false, false, false, 0}, metal::execution_simdgroups<4>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{8, 8, -1, false, false, false, 0}, mpp::tensor_ops::__mutmul2d_detail::__matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, half, half, float, int>>>' requested here
    __mutmul2d_detail::__run<Descriptor, Scope, LeftOperandType,
                       ^
program_source:26:8: note: in instantiation of function template specialization 'mpp::tensor_ops::matmul2d<{8, 8, -1, false, false, false, 0}, metal::execution_simdgroups<4>>::run<metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{8, 8, -1, false, false, false, 0}, mpp::tensor_ops::__mutmul2d_detail::__matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, half, half, float, int>>, void>' requested here
    mm.run(sB, sA, cT); 
       ^
" UserInfo={NSLocalizedDescription=In file included from program_source:4:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MetalPerformancePrimitives.h:10:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:389:
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/__impl/MPPTensorOpsMatMul2dImpl.h:3765:5: error: static_assert failed due to requirement '(mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.m % 16) == 0 || (mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.n % 16) == 0' "At least one of M or N must be a multiple of 16"
    static_assert((descriptor.m % 16) == 0 || (descriptor.n % 16) == 0, "At least one of M or N must be a multiple of 16");
    ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:415:24: note: in instantiation of function template specialization 'mpp::tensor_ops::__mutmul2d_detail::__run<{8, 8, -1, false, false, false, 0}, metal::execution_simdgroups<4>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{8, 8, -1, false, false, false, 0}, mpp::tensor_ops::__mutmul2d_detail::__matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, half, half, float, int>>>' requested here
    __mutmul2d_detail::__run<Descriptor, Scope, LeftOperandType,
                       ^
program_source:26:8: note: in instantiation of function template specialization 'mpp::tensor_ops::matmul2d<{8, 8, -1, false, false, false, 0}, metal::execution_simdgroups<4>>::run<metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{8, 8, -1, false, false, false, 0}, mpp::tensor_ops::__mutmul2d_detail::__matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, half, half, float, int>>, void>' requested here
    mm.run(sB, sA, cT); 
       ^
}
ggml_metal_library_init_from_source: failed to create Metal library from source
ggml_metal_device_init: - the tensor API is not supported in this environment - disabling
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.014 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 19069.67 MB
version: 1 (0fac87b)
built with AppleClang 21.0.0.21000099 for Darwin arm64

With the change in this branch the error no longer reproduces and the following output is observed:

Successful initialization of metal backend

$ ./build/bin/llama-server --version                                            
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_init_from_source: compiled in 0.105 sec
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x104346e00 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_init_from_source: compiled in 0.070 sec
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x104374500 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 5.792 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 19069.67 MB
version: 1 (0fac87b)
built with AppleClang 21.0.0.21000099 for Darwin arm64
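The before/after comparison above comes down to a single log line. As a rough sketch of how that check could be scripted, the following C++ helper scans captured log text for the `has tensor` line and returns its value. `parse_has_tensor` is a hypothetical helper, not part of llama.cpp, and it assumes the line format shown in the two logs above.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: finds the "has tensor" line in captured log text.
// Returns true or false when the line is present, std::nullopt otherwise.
std::optional<bool> parse_has_tensor(const std::string& log) {
    static const std::string key = "ggml_metal_device_init: has tensor";
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line)) {
        // Only consider lines that start with the expected prefix.
        if (line.rfind(key, 0) != 0) continue;
        const auto eq = line.find('=', key.size());
        if (eq == std::string::npos) continue;
        // Trim whitespace around the value that follows '='.
        const auto first = line.find_first_not_of(" \t", eq + 1);
        if (first == std::string::npos) continue;
        const auto last = line.find_last_not_of(" \t\r");
        const std::string value = line.substr(first, last - first + 1);
        if (value == "true") return true;
        if (value == "false") return false;
    }
    return std::nullopt;
}
```

Returning `std::nullopt` when the line is missing keeps "tensor support disabled" distinct from "log did not contain the line", which matters if the output was truncated or logging was off.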

Additional information

Tests Performed

Verified the expected output above on an M5 device with the latest release of Apple's APIs on macOS 26.4
test-backend-ops: OK

Requirements

I have read and agree with the contributing guidelines: YES
AI usage disclosure: NO

loci-review · 2026-03-27T04:04:56Z

No meaningful performance changes were detected across 123810 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-tts, build.bin.llama-bench, build.bin.libllama.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.libggml.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

Fix dimension constraint violation in matmul2d descriptor

a307582

Updates Metal tensor API test probe to fix the dimension constraint violation in the matmul2d descriptor (at least one value must be a multiple of 16).

loci-dev temporarily deployed to PROD__AL_DEMO March 27, 2026 03:10 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 3 times, most recently from 89a1190 to 8fec234 Compare March 30, 2026 02:18

loci-dev wants to merge 1 commit into main from loci/pr-21048-lathrys-fix-matmul2d-descriptor-constraint-violati
