Hi, regarding "which falls behind MKL's gemm beyond 70x70 or so": is that text mostly outdated? It wasn't obviously true from the graph. I also noticed you reran the benchmarks last month (before the ArrayInterface upgrade; would 3.0 improve speed?), and I couldn't zoom in without going to:
https://github.com/chriselrod/LoopVectorization.jl/blob/5ba0d186bcd2d6f4fed09fd6ca9f7817e8dd29e2/docs/src/assets/bench_AmulB_v2.png
Yes, around that size, and sometimes for larger matrices, MKL is only slightly faster (from memory, MKL used to have a much bigger edge), but you might want to change to more positive language. I have been pointing people to these graphs and your awesome work, and want to keep doing so.
I just recently noticed:
https://github.com/JuliaLinearAlgebra/Octavian.jl
Is it fair to say OpenBLAS will soon be replaced? Or could it be already? I know you target Intel with AVX-512. Do the concepts transfer to ARM and AMD, and is there even some code for AMD already?
As with:
https://github.com/JuliaGPU/GemmKernels.jl
do you need no assembly? I mean there is some at some level, but not for the high-level (multiply) functions.
I didn't see (or expect) any code shared with yours. I did notice it uses GPUifyLoops.jl, which is archived; should it use KernelAbstractions.jl instead?