TensorRT-LLM has added support for Skip Softmax Attention (paper: https://www.arxiv.org/pdf/2512.12087) via its fmha_v2 and xqa backends. Integrating this support and updating the API on the FlashInfer side could improve performance of the Hopper decode kernels. The PR that implemented it is NVIDIA/TensorRT-LLM#10264.
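For context, here is a minimal sketch of the block-skipping idea the name suggests. This is a hypothetical illustration under assumed semantics, not the paper's or TensorRT-LLM's actual algorithm: during online-softmax attention, a key/value tile whose maximum score falls far enough below the running maximum contributes at most `exp(-threshold)` per element to the softmax, so the exponentiation and accumulation for that tile can be skipped. The function name and threshold parameter below are made up for the example.

```python
import numpy as np

def skip_softmax_attention(q, K, V, threshold=10.0, block=4):
    """Single-query attention with hypothetical tile skipping.

    q: (d,) query, K: (n, d) keys, V: (n, dv) values.
    Tiles whose max score is below (running_max - threshold) are
    skipped: their softmax weight is bounded by exp(-threshold).
    """
    scale = 1.0 / np.sqrt(q.shape[0])
    m = -np.inf                    # running max of scores seen so far
    acc = np.zeros(V.shape[1])     # unnormalized output accumulator
    denom = 0.0                    # running softmax denominator
    for start in range(0, K.shape[0], block):
        s = (K[start:start + block] @ q) * scale
        blk_max = s.max()
        # Skip the tile: its contribution is negligibly small.
        if blk_max < m - threshold:
            continue
        new_m = max(m, blk_max)
        # Rescale previous partial sums to the new running max.
        correction = np.exp(m - new_m) if np.isfinite(m) else 0.0
        acc *= correction
        denom *= correction
        p = np.exp(s - new_m)
        acc += p @ V[start:start + block]
        denom += p.sum()
        m = new_m
    return acc / denom
```

With a large threshold no tile is skipped and the result matches exact softmax attention; shrinking the threshold trades accuracy for skipped work, which is where the decode-kernel speedup would come from.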