The elementwise add example at https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/notebooks/elementwise_add.ipynb uses float16 values, but the documentation in the notebook refers to "32-bit" wide values, and the code uses only 4-element vectorization instead of 8. Either the dtype should be changed to float32, or the docs and code should be updated to match float16.
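
For reference, the mismatch comes down to simple arithmetic. This is a minimal sketch, assuming the example targets 128-bit vectorized accesses (my assumption, inferred from 4 x 32 bits = 128 bits). The helper `elems_per_vector` is a hypothetical name for illustration and is not part of the CuTe DSL:

```python
# Assumption: one vectorized access moves 128 bits.
VECTOR_BITS = 128


def elems_per_vector(dtype_bits: int, vector_bits: int = VECTOR_BITS) -> int:
    """Return how many elements of width dtype_bits fit in one vector access."""
    if dtype_bits <= 0 or vector_bits % dtype_bits != 0:
        raise ValueError("dtype width must be positive and divide the vector width")
    return vector_bits // dtype_bits


# float32 (32-bit) elements, which is what the documentation describes
print("float32:", elems_per_vector(32))
# float16 (16-bit) elements, which is what the example actually uses
print("float16:", elems_per_vector(16))
```

Under that assumption, float32 gives 4 elements per access and float16 gives 8, so the 4-element layout in the notebook would use only half of the access width with float16.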