Support setting kernel block cluster dimensions

With the Hopper architecture (compute capability 9.0), NVIDIA introduced thread block "clusters": groups of blocks which can access each other's shared memory. Cluster dimensions can be set either at compile time, using a `__cluster_dims__(1,2,3)` attribute on the kernel's declaration, or at run time, as part of the launch configuration. We need to support the run-time setting in our `launch_configuration_t` class and in the launch config builder mechanism.
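
To make the request more concrete, here is a rough sketch of what the builder side might look like. Everything in the `sketch` namespace (`dims3`, `launch_config`, `launch_config_builder`, `cluster_dimensions()`) is a hypothetical stand-in invented for illustration, not the library's existing types or method names. The one rule it enforces is that each grid dimension, counted in blocks, should be a multiple of the corresponding cluster dimension. At the CUDA runtime API level, the optional value would presumably be passed on as a `cudaLaunchAttributeClusterDimension` attribute to `cudaLaunchKernelEx`; that part is left out here. Requires C++17 for `std::optional`.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>

// Hypothetical stand-ins for illustration only; NOT the library's actual types.
namespace sketch {

struct dims3 {
    unsigned x = 1, y = 1, z = 1;
    constexpr std::size_t volume() const { return std::size_t{x} * y * z; }
};

struct launch_config {
    dims3 grid;                   // grid dimensions, in blocks
    dims3 block;                  // block dimensions, in threads
    std::optional<dims3> cluster; // cluster dimensions, in blocks; empty means
                                  // "no run-time cluster setting"
};

class launch_config_builder {
public:
    launch_config_builder& grid_dimensions(dims3 d)    { grid_ = d;    return *this; }
    launch_config_builder& block_dimensions(dims3 d)   { block_ = d;   return *this; }
    // Assumed name for the new setter this issue asks for.
    launch_config_builder& cluster_dimensions(dims3 d) { cluster_ = d; return *this; }

    launch_config build() const {
        if (cluster_) {
            const dims3& c = *cluster_;
            if (c.x == 0 || c.y == 0 || c.z == 0) {
                throw std::invalid_argument("cluster dimensions must be non-zero");
            }
            // The grid is still specified in blocks, and each of its dimensions
            // should be a multiple of the corresponding cluster dimension.
            if (grid_.x % c.x != 0 || grid_.y % c.y != 0 || grid_.z % c.z != 0) {
                throw std::invalid_argument(
                    "grid dimensions must be multiples of the cluster dimensions");
            }
        }
        return launch_config{grid_, block_, cluster_};
    }

private:
    dims3 grid_, block_;
    std::optional<dims3> cluster_;
};

} // namespace sketch
```

With something like this, `launch_config_builder{}.grid_dimensions({8,4,6}).block_dimensions({128,1,1}).cluster_dimensions({1,2,3}).build()` would yield a configuration with 6-block clusters, while a grid of `{8,5,6}` with the same cluster dimensions would be rejected at build time. Keeping the cluster dimensions optional means existing launch configurations, which never set them, behave as before.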