Large standard deviation when profiling the cutlass profiler with the nsys profiler #748
quangnguyen-ai started this conversation in General
Replies: 1 comment
Hi, @Quangnguyengiabku. A few questions:
Hello everyone! Recently, I used the NVIDIA nsys profiler (https://docs.nvidia.com/nsight-systems/UserGuide/index.html) to profile some GEMMs run through the cutlass profiler. I found that the standard deviation of the cutlass kernels' performance is huge: approximately 10%, compared to about 0.3% for the same cutlass kernel invoked through PyTorch. I tried adjusting the profiler parameters to reduce the StdDev, but it didn't work. Could anyone help me?
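To make the two spreads above concrete, the run-to-run variation can be expressed as a coefficient of variation (StdDev divided by the mean). A minimal sketch in Python, using made-up timing samples purely for illustration (not real measurements):

```python
import statistics

# Made-up per-iteration kernel timings in microseconds, chosen only to
# illustrate a ~0.3%-style spread vs. a ~10%-style spread (not real data).
tight_us = [100.0, 100.2, 99.8, 100.1, 99.9]
wide_us = [100.0, 112.0, 91.0, 105.0, 95.0]

def cv_percent(samples):
    """Coefficient of variation: StdDev as a percentage of the mean."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)

print(f"tight spread: {cv_percent(tight_us):.2f}%")
print(f"wide spread:  {cv_percent(wide_us):.2f}%")
```

A 10% coefficient of variation on a GEMM usually points at external noise (clock boosting, other work on the GPU, short per-iteration runtimes) rather than the kernel itself.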
Another thing that confuses me is that the kernel _void cutlass::Kernel<cutlass_80_simt_sgemm_256x128_8x4_nn_align1>(T1::Params)_ (marked in blue) suddenly appears when I profile with nsys, whereas if I use the cutlass profiler alone, I don't see it. What does cutlass_80 mean? Could anyone explain it to me?
An example of the command I use:
/usr/local/cuda/bin/nsys profile ./tools/profiler/cutlass_profiler --operation=Gemm --m=1024 --n=1024 --k=2048 --kernels=cutlass_simt*nn --profiling-iterations=200 --dist=uniform,min:5,max:100,scale:2 --warmup-iterations=10 --alpha=1 --beta=0 --accum=f32 --save-workspace=always
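After a run like the one above, one way to compare the per-kernel spread is to export a kernel summary and compute StdDev as a fraction of the average time per kernel. A hedged sketch in Python: the CSV columns ("Avg (ns)", "StdDev (ns)", "Name") mimic the shape of an `nsys stats` kernel-summary export, but exact report and column names vary across Nsight Systems versions, and the numbers below are invented:

```python
import csv
import io

# Hypothetical excerpt of an Nsight Systems kernel-summary CSV export.
# Column names and all values here are assumptions for illustration only;
# check your nsys version's actual report names and columns.
sample_csv = """Avg (ns),StdDev (ns),Name
41230.0,4105.0,cutlass_simt_sgemm_256x128_8x4_nn
41510.0,120.0,some_other_kernel
"""

def cv_by_kernel(csv_text):
    """Map each kernel name to StdDev as a percentage of its average time."""
    return {
        row["Name"]: 100.0 * float(row["StdDev (ns)"]) / float(row["Avg (ns)"])
        for row in csv.DictReader(io.StringIO(csv_text))
    }

for name, cv in cv_by_kernel(sample_csv).items():
    print(f"{name}: CV = {cv:.1f}%")
```

With real data, sorting kernels by this ratio makes it easy to see whether the variance is global (all kernels noisy, suggesting clock or system noise) or specific to one kernel.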
