Low t/s for Qwen3 on M3 Ultra

## Describe the bug
I'm getting very low throughput (< 5 t/s for Qwen/Qwen3-30B-A3B) on a Mac Studio with an M3 Ultra and 512GB of unified memory. With llama.cpp, I get over 50 t/s.

Separately, loading the model in GGUF format fails with: "called `Result::unwrap()` on an `Err` value: Unknown GGUF architecture `qwen3moe`"

## Latest commit or version
380da230bd7cff64e5d22b2876f5841b662fba3b