Conversation
* removed flash-attention definition
…conv2d_tensor_core
CUDA: uint to int and added assertion
* Extra: reduces bank conflicts
…conv2d_tensor_core
Keeping this as a draft until the implicit or Vulkan changes are merged. I’ll integrate the tensor core kernel with that code.
Hey @Green-Sky, could we also get an sd.cpp perf analysis for this draft? I’ve exposed the tensor core kernel through conv2d_direct.
Ran a bench on this PR and added it here: #15805 (comment). Looks like this is now the fastest version! VAE decoding is also slightly faster than im2col+matmul (maybe, might be within error). Benchmarked:

- sd1 fp16 512x768
- sd1 fp16 768x1024 (like the old table)
- sdxl fp16/q8_0 1024x1280 (diffusion model is q8_0, VAE is fp16)
> __constant__ __device__ Params P;

> // see init_fastdiv_values in ggml-vulkan.cpp
> __inline__ __device__ uint fastdiv(uint n, uint mp, uint L) {
Already exists in common:
llama.cpp/ggml/src/ggml-cuda/common.cuh, line 653 in 1ae7488
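For context, this helper implements division by a runtime-constant divisor as a precomputed multiply-plus-shift (Granlund–Montgomery "magic number" division), which is much cheaper than a hardware integer divide inside a kernel. Below is a minimal host-side C++ sketch of the scheme; function names are illustrative, not the actual ggml API, and the device version would use `__umulhi` for the high-half multiply.

```cpp
#include <cstdint>

// Precompute (mp, L) so that fastdiv(n, mp, L) == n / d.
// Sketch of the Granlund–Montgomery magic-number scheme; requires d > 0.
static void init_fastdiv(uint32_t d, uint32_t & mp, uint32_t & L) {
    // L = ceil(log2(d))
    L = 0;
    while (L < 32 && (uint32_t(1) << L) < d) {
        ++L;
    }
    // magic multiplier: floor(2^32 * (2^L - d) / d) + 1
    mp = uint32_t(((uint64_t(1) << 32) * ((uint64_t(1) << L) - d)) / d + 1);
}

// Quotient via multiply-high, add, shift. On the device the multiply-high
// would be __umulhi(n, mp); here a 64-bit multiply emulates it, and the
// final add is done in 64 bits to sidestep overflow for very large n.
static uint32_t fastdiv(uint32_t n, uint32_t mp, uint32_t L) {
    uint32_t hi = uint32_t((uint64_t(n) * mp) >> 32);
    return uint32_t((uint64_t(hi) + n) >> L);
}

// Example use: precompute once on the host for a fixed divisor (e.g. a
// channel count), then divide repeatedly in the hot loop:
//   uint32_t mp, L;
//   init_fastdiv(840, mp, L);
//   fastdiv(1000000, mp, L);  // == 1000000 / 840
```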
> #define CEIL_DIV(M, N) (((M) + (N) - 1) / (N))
static uint32_t ceil_div(uint32_t M, uint32_t N);
(llama.cpp/ggml/src/ggml-sycl/common.hpp, line 532 in 1ae7488)
> #include "convert.cuh"
> #include "mma.cuh"

> #define CEIL_DIV(M, N) (((M) + (N) - 1) / (N))
Remove macro, and use function instead.
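The suggested change can be sketched as a small constexpr helper (a hypothetical replacement, not an exact signature from ggml): unlike the macro, it evaluates each argument exactly once, is type-checked, and inlines just as well. In CUDA source it would additionally be marked `__host__ __device__`.

```cpp
#include <cstdint>

// Function alternative to #define CEIL_DIV(M, N) (((M) + (N) - 1) / (N)).
// Computes ceil(m / n) for unsigned integers; requires n > 0.
static constexpr uint32_t ceil_div(uint32_t m, uint32_t n) {
    return (m + n - 1) / n;
}

// Typical use: deriving a kernel grid size from problem size and block size.
static_assert(ceil_div(1024, 256) == 4, "exact multiple");
static_assert(ceil_div(1000, 256) == 4, "rounds up");
static_assert(ceil_div(1, 256) == 1, "small input");
```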
Closing, as the perf of #15805 is better than this PR’s. Perf numbers for reference:
Added Tensor Core support to the code from #16088 and made modifications so that it gives the best results on tensor cores. The results below are from an RTX 2070 GPU.
FP16 Tensor Core perf
@etasnadi @Green-Sky @JohannesGaessler