
Faster complex matmul#2571

Merged
angeloskath merged 10 commits into ml-explore:main from CC-Yeh:complex_matmul
Oct 3, 2025

Conversation

CC-Yeh (Contributor) commented Sep 6, 2025

Proposed changes

  • Introduce cblas_cgemm

  • Metal: Make gemv compatible with complex64_t

  • Metal: Add complex64 BlockMMA specialization to simplify gemm integration.

  • CUDA: Make gemv and gemm compatible with complex64_t.

Only tuned on a small chip; people with larger chips will need to tune the tile sizes.

Closes #2076
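For reference, cblas_cgemm computes C <- alpha * A @ B + beta * C on single-precision complex operands. A minimal NumPy sketch of those semantics that the new CPU path must reproduce (shapes and values here are illustrative, not MLX code):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 4, 5, 3

# complex64 operands, matching MLX's complex64 / BLAS's single complex type
A = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))).astype(np.complex64)
B = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))).astype(np.complex64)
C = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))).astype(np.complex64)

alpha, beta = np.complex64(1.0), np.complex64(0.5)

# cgemm semantics: C <- alpha * A @ B + beta * C
ref = alpha * (A @ B) + beta * C
```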

Benchmarks

Metal

Average 6x faster for new gemv (bench_gemv.py)

[benchmark plot]

Average 1.5x faster for new gemm (bench_gemm.py)

[benchmark plot]

CUDA

Average 6x faster for new gemv
[benchmark plots]

gemm: ~1.7x slower; needs help from someone with CUDA expertise.

[benchmark plot]

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

awni (Member) commented Sep 8, 2025

This looks like a very nice improvement. There are two issues we need to work through:

  1. The JIT tests are broken on macOS. Are you able to resolve that?
  2. We need a strategy for CUDA. cuBlas has support for complex 64-bit type, so maybe we can just route to that?

Are you up for making those two fixes? If you are unable to make the CUDA change we can help with that, let us know.

CC-Yeh (Contributor, Author) commented Sep 8, 2025

This looks like a very nice improvement. There are two issues we need to work through:

  1. The JIT tests are broken on macOS. Are you able to resolve that?

Hope the latest commit will fix this.

  1. We need a strategy for CUDA. cuBlas has support for complex 64-bit type, so maybe we can just route to that?

I’d love to add it. I don’t have access to an NVIDIA GPU, so I can’t test CUDA locally.

awni (Member) commented Sep 8, 2025

Hi Dan - could you DM me and we can discuss work-arounds? You can reach me on X or at awni at apple.com.

raishish commented:

I was planning to work on this issue as well but @Dan-Yeh beat me to it lol.

[benchmark plots: nn, nt, and tn transpose configs]

Anyway, I benchmarked the CPU backend on my M2 Air (cblas_cgemm vs the current op-based Karatsuba algorithm).

Observations:

  • Karatsuba is a bit quicker (20-40%) in some cases for the nn transpose config. I suppose I can scale this further to see if the trend continues.
  • However, for the nt and tn configs, the cblas implementation is much faster overall. My guess is that because it can intrinsically account for the transpose operation, it makes for a more efficient implementation.

I don't think it makes sense to retain the Karatsuba approach for CPU. In hindsight, it makes sense that cblas would run faster than the naive Karatsuba decomposition, but I'm glad I could actually benchmark it to satisfy my curiosity.

CC: @awni
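For context, a complex matmul naively needs 4 real matmuls; the Karatsuba/Gauss trick mentioned above gets that down to 3. A plausible NumPy reconstruction of the op-based route (not MLX's actual code):

```python
import numpy as np

def complex_mm_karatsuba(A, B):
    """Complex matmul using 3 real matmuls instead of 4 (Karatsuba/Gauss trick)."""
    Ar, Ai = A.real, A.imag
    Br, Bi = B.real, B.imag
    t1 = Ar @ Br
    t2 = Ai @ Bi
    t3 = (Ar + Ai) @ (Br + Bi)
    # real part: Ar@Br - Ai@Bi; imag part: Ar@Bi + Ai@Br == t3 - t1 - t2
    return (t1 - t2) + 1j * (t3 - t1 - t2)

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
out = complex_mm_karatsuba(A, B)
```

The savings come at the cost of extra adds and temporaries, which is one reason a fused cgemm can win despite doing 4 real multiplies' worth of work internally.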

CC-Yeh (Contributor, Author) commented Sep 13, 2025

@awni

  1. We need a strategy for CUDA. cuBlas has support for complex 64-bit type, so maybe we can just route to that?

I think CUDA_C_32F is already supported.
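For what it's worth, CUDA_C_32F is cuBLAS's interleaved single-precision complex type: each element is a (real, imag) pair of float32, the same memory layout as complex64, so buffers can be handed to cuBLAS without conversion. A small NumPy illustration of the layout:

```python
import numpy as np

# complex64 stores each element as an interleaved (real, imag) float32 pair,
# matching cuBLAS's CUDA_C_32F layout.
z = np.array([1 + 2j, 3 - 4j], dtype=np.complex64)

assert z.itemsize == 8          # two float32 per element
pairs = z.view(np.float32)      # interleaved reinterpretation, no copy
# pairs == [1., 2., 3., -4.]
```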

CC-Yeh requested a review from awni, September 13, 2025 16:43
awni (Member) commented Sep 13, 2025

Oh ok.. in that case we may just need to update the routing conditions so we don't route to the custom gemv when the input has complex type.

CC-Yeh force-pushed the complex_matmul branch 15 times, most recently from f632bbc to a210b14, September 17, 2025 21:51
CC-Yeh (Contributor, Author) commented Sep 19, 2025

Interestingly, the ops-based solution outperforms cublasLt in 18/21 cases; the exceptions are skinny or small matrices.

[benchmark plot]

CC-Yeh force-pushed the complex_matmul branch 3 times, most recently from 26c9421 to 2a2f6b1, September 19, 2025 22:05
CC-Yeh (Contributor, Author) commented Sep 19, 2025

Hey @awni, can you take a look again?

I believe the CPU, Metal, and CUDA parts are in good shape. One note: CUDA::gemm runs slower than the prior ops-based route and could use optimization by a CUDA specialist.

awni (Member) commented Sep 22, 2025

One note: CUDA::gemm runs slower than the prior ops-based route and could use optimization by a CUDA specialist.

It's most likely because it's no longer hitting tensor cores 🤔 . I am not really sure what's expected there when using complex. Tensor cores do matmuls in lower precision (tf32).
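A rough way to see the TF32 effect: the sketch below simulates TF32's 10 explicit mantissa bits by truncating float32 inputs before a matmul (actual tensor-core rounding and accumulation differ; this only illustrates the precision gap):

```python
import numpy as np

def truncate_to_tf32(x):
    """Simulate TF32 storage: float32 exponent, but only 10 mantissa bits.

    Drops the low 13 of float32's 23 mantissa bits by masking the raw bits.
    """
    x = np.asarray(x, dtype=np.float32)
    bits = x.view(np.int32) & np.int32(~0x1FFF)
    return bits.view(np.float32)

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64)).astype(np.float32)
B = rng.standard_normal((64, 64)).astype(np.float32)

exact = A.astype(np.float64) @ B.astype(np.float64)
fp32 = A @ B
tf32 = truncate_to_tf32(A) @ truncate_to_tf32(B)

err_fp32 = np.max(np.abs(fp32 - exact)) / np.max(np.abs(exact))
err_tf32 = np.max(np.abs(tf32 - exact)) / np.max(np.abs(exact))
```

The TF32 error is orders of magnitude larger than plain float32 but still small in relative terms, which is the usual justification for routing gemms through tensor cores.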

Comment on lines +111 to +112:

    array bias_arr = astype(*bias, out.dtype(), s);
    out = add(out, bias_arr, s);
A reviewer (Member) commented:

This is problematic here. We can't do MLX ops inside a primitive's evaluation.

Instead of doing the fallback this way, just fall back to the full addmm. So the condition could be

if (bias && a.dtype() != complex64)
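A sketch of that routing suggestion in Python, with hypothetical route names for illustration (the real dispatch lives in the C++ primitive):

```python
import numpy as np

def addmm_route(a_dtype, has_bias):
    """Hypothetical routing sketch: only take the fused-bias gemm path when
    the dtype is not complex64; otherwise fall back to the full addmm
    (alpha * A @ B + beta * C), avoiding MLX ops inside the primitive."""
    if has_bias and a_dtype != np.complex64:
        return "gemm_with_fused_bias"
    if has_bias:
        return "full_addmm_fallback"
    return "plain_gemm"
```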

awni (Member) commented Sep 22, 2025

Hey @Dan-Yeh this is looking good. Could you please rebase, address the latest remarks and then I will run the tests?

CC-Yeh force-pushed the complex_matmul branch 2 times, most recently from d558ac7 to 85d4b0e, September 22, 2025 19:12
CC-Yeh (Contributor, Author) commented Sep 22, 2025

@awni Done!

CC-Yeh requested a review from awni, September 23, 2025 15:36
CC-Yeh (Contributor, Author) commented Oct 1, 2025

Hi @awni
Just removed some redundancies and rebased again.
Can you take a look? Thanks.

angeloskath (Member) commented:
Thank you @Dan-Yeh . I will do some more checks and merge tonight. Thanks for the awesome PR and for your patience.

angeloskath (Member) left a review comment:

It's most likely because it's no longer hitting tensor cores 🤔 .

I enabled TF32 for the complex gemm and now it is 2x faster than before 🚀.

Thanks @Dan-Yeh for the awesome PR, I will merge after the tests clear.

angeloskath merged commit 22a5da7 into ml-explore:main, Oct 3, 2025
7 checks passed
faisalmemon pushed a commit to faisalmemon/mlx that referenced this pull request Oct 30, 2025

Successfully merging this pull request may close these issues.

Use cblas_cgemm for CPU complex matmul

4 participants