UPSTREAM PR #21048: metal: Fix dimension constraint violation in matmul2d descriptor #1304

Open
loci-dev wants to merge 1 commit into main from loci/pr-21048-lathrys-fix-matmul2d-descriptor-constraint-violati

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21048

Updates the Metal tensor API test probes to fix the dimension constraint violation in the matmul2d descriptor (at least one of M or N must be a multiple of 16).

Overview

An investigation detailed in ggml-org/llama.cpp#16634 (comment) indicated that the test probes for the Metal tensor API fail to compile on macOS 26.4, which causes tensor support in the Metal backend to be disabled erroneously. The cause is a change in the Apple APIs between the time ggml-org/llama.cpp#16634 was tested and merged by @ggerganov and today: the APIs now require that at least one of the dimensions M and N be a multiple of 16.

Notably, the actual kernels used already respect this constraint (obviously, as they are compiling successfully today), and it is only these test probes which violate it.
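As a rough illustration of the constraint, the sketch below restates the condition from the static_assert quoted in the compiler output in this description as plain C++. The names mn_constraint_ok and require_mn_constraint are hypothetical and exist only for this example; they are not part of ggml or of Apple's headers, and the passing shapes are illustrative rather than the values chosen by this PR.

```cpp
#include <cassert>

// Hypothetical helper, not part of ggml or Apple's headers. It mirrors the
// condition of the static_assert quoted in the compiler output:
//   (descriptor.m % 16) == 0 || (descriptor.n % 16) == 0
constexpr bool mn_constraint_ok(int m, int n) {
    return (m % 16) == 0 || (n % 16) == 0;
}

// The failing probe in the log used m = 8 and n = 8.
static_assert(!mn_constraint_ok(8, 8), "8x8 violates the constraint");

// Illustrative shapes only (not necessarily the values chosen by this PR):
// it is enough for either M or N to be a multiple of 16.
static_assert(mn_constraint_ok(16, 8),  "M is a multiple of 16");
static_assert(mn_constraint_ok(8, 16),  "N is a multiple of 16");
static_assert(mn_constraint_ok(16, 16), "both are multiples of 16");

// Runtime variant for dimensions that are only known at run time.
inline void require_mn_constraint(int m, int n) {
    assert(mn_constraint_ok(m, n) &&
           "At least one of M or N must be a multiple of 16");
}
```

Because the checks are static_asserts, a shape that violates the condition is rejected at compile time, which is the same way the failure shows up in the Metal compiler output.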

I verified this by turning on verbose logging in the calls to ggml_metal_library_init_from_source() and observing the following output when llama-server starts up. Note the line ggml_metal_device_init: has tensor = false in the output, which is unexpected on a device with an M5 chip.

Metal compilation error
$ ./build/bin/llama-server --version                                            
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_init_from_source: error compiling source: Error Domain=MTLLibraryErrorDomain Code=3 "In file included from program_source:4:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MetalPerformancePrimitives.h:10:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:389:
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/__impl/MPPTensorOpsMatMul2dImpl.h:3765:5: error: static_assert failed due to requirement '(mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.m % 16) == 0 || (mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.n % 16) == 0' "At least one of M or N must be a multiple of 16"
    static_assert((descriptor.m % 16) == 0 || (descriptor.n % 16) == 0, "At least one of M or N must be a multiple of 16");
    ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:415:24: note: in instantiation of function template specialization 'mpp::tensor_ops::__mutmul2d_detail::__run<{8, 8, -1, false, false, false, 0}, metal::execution_simdgroups<4>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{8, 8, -1, false, false, false, 0}, mpp::tensor_ops::__mutmul2d_detail::__matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, half, half, float, int>>>' requested here
    __mutmul2d_detail::__run<Descriptor, Scope, LeftOperandType,
                       ^
program_source:26:8: note: in instantiation of function template specialization 'mpp::tensor_ops::matmul2d<{8, 8, -1, false, false, false, 0}, metal::execution_simdgroups<4>>::run<metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{8, 8, -1, false, false, false, 0}, mpp::tensor_ops::__mutmul2d_detail::__matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, half, half, float, int>>, void>' requested here
    mm.run(sB, sA, cT); 
       ^
" UserInfo={NSLocalizedDescription=In file included from program_source:4:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MetalPerformancePrimitives.h:10:
In file included from /System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:389:
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/__impl/MPPTensorOpsMatMul2dImpl.h:3765:5: error: static_assert failed due to requirement '(mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.m % 16) == 0 || (mpp::tensor_ops::matmul2d_descriptor{8, 8, -1, false, false, false, 0}.n % 16) == 0' "At least one of M or N must be a multiple of 16"
    static_assert((descriptor.m % 16) == 0 || (descriptor.n % 16) == 0, "At least one of M or N must be a multiple of 16");
    ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/System/Library/Frameworks/MetalPerformancePrimitives.framework/Headers/MPPTensorOpsMatMul2d.h:415:24: note: in instantiation of function template specialization 'mpp::tensor_ops::__mutmul2d_detail::__run<{8, 8, -1, false, false, false, 0}, metal::execution_simdgroups<4>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{8, 8, -1, false, false, false, 0}, mpp::tensor_ops::__mutmul2d_detail::__matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, half, half, float, int>>>' requested here
    __mutmul2d_detail::__run<Descriptor, Scope, LeftOperandType,
                       ^
program_source:26:8: note: in instantiation of function template specialization 'mpp::tensor_ops::matmul2d<{8, 8, -1, false, false, false, 0}, metal::execution_simdgroups<4>>::run<metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::tensor<device half, metal::extents<int, 18446744073709551615, 18446744073709551615>, metal::tensor_handle, metal::tensor_offset>, metal::cooperative_tensor<float, metal::extents<int, 18446744073709551615, 18446744073709551615>, mpp::tensor_ops::__mutmul2d_detail::__operand_layout<{8, 8, -1, false, false, false, 0}, mpp::tensor_ops::__mutmul2d_detail::__matmul2d_cooperative_operand_index::destination, metal::execution_simdgroups<4>, half, half, float, int>>, void>' requested here
    mm.run(sB, sA, cT); 
       ^
}
ggml_metal_library_init_from_source: failed to create Metal library from source
ggml_metal_device_init: - the tensor API is not supported in this environment - disabling
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.014 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 19069.67 MB
version: 1 (0fac87b)
built with AppleClang 21.0.0.21000099 for Darwin arm64

With the change in this branch, the error no longer reproduces and the following output is observed:

Successful initialization of metal backend
$ ./build/bin/llama-server --version                                            
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_init_from_source: compiled in 0.105 sec
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x104346e00 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_init_from_source: compiled in 0.070 sec
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x104374500 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 5.792 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 19069.67 MB
version: 1 (0fac87b)
built with AppleClang 21.0.0.21000099 for Darwin arm64
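For anyone who wants to repeat this check without reading the whole log, the following is a minimal sketch that scans captured startup output for the has tensor line. The function parse_has_tensor and the tensor_state enum are hypothetical names introduced here; the sketch assumes only the line format visible in the two logs above.

```cpp
#include <sstream>
#include <string>

// Hypothetical result type for this example.
enum class tensor_state { not_found, disabled, enabled };

// Hypothetical helper: looks for a line containing "has tensor", then reads
// the value after '=' (the logs above print either "true" or "false").
inline tensor_state parse_has_tensor(const std::string & log) {
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line)) {
        const std::string::size_type key = line.find("has tensor");
        if (key == std::string::npos) {
            continue;
        }
        const std::string::size_type eq = line.find('=', key);
        if (eq == std::string::npos) {
            continue;
        }
        // Skip the padding between '=' and the value.
        const std::string::size_type val = line.find_first_not_of(" \t", eq + 1);
        if (val == std::string::npos) {
            continue;
        }
        if (line.compare(val, 4, "true") == 0) {
            return tensor_state::enabled;
        }
        if (line.compare(val, 5, "false") == 0) {
            return tensor_state::disabled;
        }
    }
    return tensor_state::not_found;
}
```

Fed the first log, this sketch would report the tensor API as disabled; fed the second, as enabled.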

Additional information

Tests Performed

  1. Verified the expected output above on an M5 device with the latest release of Apple's APIs on macOS 26.4
  2. test-backend-ops: OK

Requirements

Updates Metal tensor API test probe to fix the dimension constraint violation in the matmul2d descriptor (at least one value must be a multiple of 16).
@loci-review

loci-review bot commented Mar 27, 2026

No meaningful performance changes were detected across 123810 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-tts, build.bin.llama-bench, build.bin.libllama.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.libggml.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 89a1190 to 8fec234 Compare March 30, 2026 02:18