What's Changed
- perf: add fp4 GEMM tile configs and streamK scheduler for SM120 by @Yuening-wa in #2460
- refactor: Port upstream CUTLASS fixes and refactor grouped_gemm_nt_masked GEMM module location by @bkryu in #2503
- feat: cuteDSL fp4 moe for better DSR1 performance. by @nv-yunzheq in #2398
- ci: refactor PR tests to hide failed spot jobs from PR status by @yongwww in #2500
- Enable setting user in CI containers by @dierksen in #2515
- perf: cache cudaGetDeviceProperties in gdn_prefill to avoid per-call overhead by @xutizhou in #2509
- Revert "ci: refactor PR tests to hide failed spot jobs from PR status… by @yongwww in #2524
- feat: Add TRTLLM-Gen Skip-Softmax kernels for prefill and decode by @DomBrown in #2477
- add saltyminty (me) to authorized_codeowners, fix alphabetical ordering by @saltyminty in #2537
- chore: update benchmark scripts; fix trtllm-gen moe comments by @IwakuraRein in #2412
- Add sm90 guard to fence.acquire by @jhalabi-nv in #2535
- feat: Add MXFP8 GEMM mm_mxfp8 (cutlass) by @danisereb in #2464
- fall back to fa2 (instead of fa3) for unsupported configuration (bf16 Q, fp8 KV) by @saltyminty in #2536
- misc: point triton blackwell-ptxas to local cuda ptxas by @jimmyzho in #2543
- tests: bmm_fp8 for SM110 by @jimmyzho in #2538
- Add parallel testing to unit test script by @dierksen in #2531
- Add gen_gemm_sm100_module_cutlass_mxfp8 to jit-cache by @yongwww in #2549
- fix: Sampling: CUDA Graph fix by @IzzyPutterman in #2432
- fix: include fp8_blockscale_gemm_90 in AOT jit-cache by @Edward-lyz in #2533
- bugfix: fix the enum/int type mismatch mentioned in #2507 by @yzh119 in #2508
- Add test case for Qwen3N by @samuellees in #2532
- Chore: Cute dsl moe update (TMA.RED implementation) by @nv-yunzheq in #2529
- benchmarks: Add microbenchmark support for Mamba selective_state_update by @bkryu in #2512
- Update Docker CI tags to 20260209-a2d3b39 by @flashinfer-bot in #2528
- Ameyn/gdn decode cutedsl kernel by @ameynaik-hub in #2498
- [Bugfix][comm] Fix FP4 one-shot launch config instability in trtllm_allreduce_fusion by @baonudesifeizhai in #2557
- pick fa2 for BatchDecodeWithPagedKVCacheWrapper auto backend by @saltyminty in #2530
- Feat: Trtllm-gen MxFP8 MoE integration by @IwakuraRein in #2505
- [Bug] Fix spark unit test failures for test_add_rmsnorm_fp4_quant_cute_dsl by @kahyunnam in #2573
- fix: W4A8 autotune crash in cutlass_fused_moe profiler workspace by @ipnon in #2564
- Add Hopper to CI by @yongwww in #2552
- fix: allow fmha_v2_prefill_deepseek on SM121 (DGX Spark) by @blake-snc in #2559
- feat: Enable TRTLLM-Gen Skip-Softmax attention for MLA by @DomBrown in #2547
- docs: Add note on feature support for compute capabilities by @sricketts in #2578
- bump version to 0.6.4 by @aleozlx in #2565
New Contributors
- @Yuening-wa made their first contribution in #2460
- @xutizhou made their first contribution in #2509
- @DomBrown made their first contribution in #2477
- @saltyminty made their first contribution in #2537
- @IzzyPutterman made their first contribution in #2432
- @Edward-lyz made their first contribution in #2533
- @ameynaik-hub made their first contribution in #2498
- @baonudesifeizhai made their first contribution in #2557
- @ipnon made their first contribution in #2564
- @blake-snc made their first contribution in #2559
Full Changelog: v0.6.3...v0.6.4