Release v0.6.4 · flashinfer-ai/flashinfer

What's Changed

perf: add fp4 GEMM tile configs and streamK scheduler for SM120 by @Yuening-wa in #2460
refactor: Port upstream CUTLASS fixes and refactor grouped_gemm_nt_masked GEMM module location by @bkryu in #2503
feat: cuteDSL fp4 moe for better DSR1 performance. by @nv-yunzheq in #2398
ci: refactor PR tests to hide failed spot jobs from PR status by @yongwww in #2500
Enable setting user in CI containers by @dierksen in #2515
perf: cache cudaGetDeviceProperties in gdn_prefill to avoid per-call overhead by @xutizhou in #2509
Revert "ci: refactor PR tests to hide failed spot jobs from PR status… by @yongwww in #2524
feat: Add TRTLLM-Gen Skip-Softmax kernels for prefill and decode by @DomBrown in #2477
add saltyminty (me) to authorized_codeowners, fix alphabetical ordering by @saltyminty in #2537
chore: update benchmark scripts; fix trtllm-gen moe comments by @IwakuraRein in #2412
Add sm90 guard to fence.acquire by @jhalabi-nv in #2535
feat: Add MXFP8 GEMM mm_mxfp8 (cutlass) by @danisereb in #2464
fallback to fa2 (instead of fa3) for unsupported configuration (bf16 Q, Fp8 KV) by @saltyminty in #2536
misc: point triton blackwell-ptxas to local cuda ptxas by @jimmyzho in #2543
tests: bmm_fp8 for SM110 by @jimmyzho in #2538
Add parallel testing to unit test script by @dierksen in #2531
Add gen_gemm_sm100_module_cutlass_mxfp8 to jit-cache by @yongwww in #2549
fix: Sampling: CUDA Graph fix by @IzzyPutterman in #2432
fix: include fp8_blockscale_gemm_90 in AOT jit-cache by @Edward-lyz in #2533
bugfix: fix the enum/int type mismatch mentioned in #2507 by @yzh119 in #2508
Add test case for Qwen3N by @samuellees in #2532
Chore: Cute dsl moe update (TMA.RED implementation) by @nv-yunzheq in #2529
benchmarks: Add microbenchmark support for Mamba selective_state_update by @bkryu in #2512
Update Docker CI tags to 20260209-a2d3b39 by @flashinfer-bot in #2528
Ameyn/gdn decode cutedsl kernel by @ameynaik-hub in #2498
[Bugfix][comm] Fix FP4 one-shot launch config instability in trtllm_allreduce_fusion by @baonudesifeizhai in #2557
pick fa2 for BatchDecodeWithPagedKVCacheWrapper auto backend by @saltyminty in #2530
Feat: Trtllm-gen MxFP8 MoE integration by @IwakuraRein in #2505
[Bug] Fix spark unit test failures for test_add_rmsnorm_fp4_quant_cute_dsl by @kahyunnam in #2573
fix: W4A8 autotune crash in cutlass_fused_moe profiler workspace by @ipnon in #2564
Add Hopper to CI by @yongwww in #2552
fix: allow fmha_v2_prefill_deepseek on SM121 (DGX Spark) by @blake-snc in #2559
feat: Enable TRTLLM-Gen Skip-Softmax attention for MLA by @DomBrown in #2547
docs: Add note on feature support for compute capabilities by @sricketts in #2578
bump version to 0.6.4 by @aleozlx in #2565

New Contributors

@Yuening-wa made their first contribution in #2460
@xutizhou made their first contribution in #2509
@DomBrown made their first contribution in #2477
@saltyminty made their first contribution in #2537
@IzzyPutterman made their first contribution in #2432
@Edward-lyz made their first contribution in #2533
@ameynaik-hub made their first contribution in #2498
@baonudesifeizhai made their first contribution in #2557
@ipnon made their first contribution in #2564
@blake-snc made their first contribution in #2559

Full Changelog: v0.6.3...v0.6.4
