Release v0.6.4

@github-actions released this 19 Feb 07:20 · commit f1e6fdc · 27 commits to main since this release

What's Changed

  • perf: add fp4 GEMM tile configs and streamK scheduler for SM120 by @Yuening-wa in #2460
  • refactor: Port upstream CUTLASS fixes and refactor grouped_gemm_nt_masked GEMM module location by @bkryu in #2503
  • feat: cuteDSL fp4 moe for better DSR1 performance. by @nv-yunzheq in #2398
  • ci: refactor PR tests to hide failed spot jobs from PR status by @yongwww in #2500
  • Enable setting user in CI containers by @dierksen in #2515
  • perf: cache cudaGetDeviceProperties in gdn_prefill to avoid per-call overhead by @xutizhou in #2509
  • Revert "ci: refactor PR tests to hide failed spot jobs from PR status… by @yongwww in #2524
  • feat: Add TRTLLM-Gen Skip-Softmax kernels for prefill and decode by @DomBrown in #2477
  • add saltyminty (me) to authorized_codeowners, fix alphabetical ordering by @saltyminty in #2537
  • chore: update benchmark scripts; fix trtllm-gen moe comments by @IwakuraRein in #2412
  • Add sm90 guard to fence.acquire by @jhalabi-nv in #2535
  • feat: Add MXFP8 GEMM mm_mxfp8 (cutlass) by @danisereb in #2464
  • fallback to fa2 (instead of fa3) for unsupported configuration (bf16 Q, Fp8 KV) by @saltyminty in #2536
  • misc: point triton blackwell-ptxas to local cuda ptxas by @jimmyzho in #2543
  • tests: bmm_fp8 for SM110 by @jimmyzho in #2538
  • Add parallel testing to unit test script by @dierksen in #2531
  • Add gen_gemm_sm100_module_cutlass_mxfp8 to jit-cache by @yongwww in #2549
  • fix: Sampling: CUDA Graph fix by @IzzyPutterman in #2432
  • fix: include fp8_blockscale_gemm_90 in AOT jit-cache by @Edward-lyz in #2533
  • bugfix: fix the enum/int type mismatch mentioned in #2507 by @yzh119 in #2508
  • Add test case for Qwen3N by @samuellees in #2532
  • Chore: Cute dsl moe update (TMA.RED implementation) by @nv-yunzheq in #2529
  • benchmarks: Add microbenchmark support for Mamba selective_state_update by @bkryu in #2512
  • Update Docker CI tags to 20260209-a2d3b39 by @flashinfer-bot in #2528
  • Ameyn/gdn decode cutedsl kernel by @ameynaik-hub in #2498
  • [Bugfix][comm] Fix FP4 one-shot launch config instability in trtllm_allreduce_fusion by @baonudesifeizhai in #2557
  • pick fa2 for BatchDecodeWithPagedKVCacheWrapper auto backend by @saltyminty in #2530
  • Feat: Trtllm-gen MxFP8 MoE integration by @IwakuraRein in #2505
  • [Bug] Fix spark unit test failures for test_add_rmsnorm_fp4_quant_cute_dsl by @kahyunnam in #2573
  • fix: W4A8 autotune crash in cutlass_fused_moe profiler workspace by @ipnon in #2564
  • Add Hopper to CI by @yongwww in #2552
  • fix: allow fmha_v2_prefill_deepseek on SM121 (DGX Spark) by @blake-snc in #2559
  • feat: Enable TRTLLM-Gen Skip-Softmax attention for MLA by @DomBrown in #2547
  • docs: Add note on feature support for compute capabilities by @sricketts in #2578
  • bump version to 0.6.4 by @aleozlx in #2565

Full Changelog: v0.6.3...v0.6.4