
[Bug] KV cache offloading issue on GLM 4.6 FP8 with FlashInfer backend #18135

@naveenkumarmarri

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

I am trying to set up KV cache offloading for the GLM 4.6 model, but the server fails with a FlashInfer build error. Is there a workaround for this issue?

docker image

docker.artifactory.rbx.com/lmsysorg/sglang:latest 

launch command

python -m sglang.launch_server \
    --model-path zai-org/GLM-4.6-FP8 \
    --tp-size 4 \
    --kv-cache-dtype fp8_e4m3 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --enable-flashinfer-allreduce-fusion \
    --enable-fused-qk-norm-rope \
    --host 0.0.0.0 \
    --port 80 \
    --enable-hierarchical-cache \
    --hicache-ratio 2 \
    --hicache-write-policy write_through \
    --hicache-storage-backend nixl
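As a possible stopgap while this is unresolved, the same server could be relaunched without the FP8 KV cache and the FlashInfer attention path, which should sidestep the bf16-Q / e4m3-KV kernel that fails to compile. This is only a sketch and is not confirmed to work; it assumes sglang's `--attention-backend` flag accepts `triton` and that dropping `--kv-cache-dtype fp8_e4m3` and the speculative/fusion flags is acceptable:

```shell
# Hypothetical fallback launch (unverified): bf16 KV cache + triton
# attention backend instead of FlashInfer with fp8_e4m3 KV cache.
python -m sglang.launch_server \
    --model-path zai-org/GLM-4.6-FP8 \
    --tp-size 4 \
    --attention-backend triton \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --host 0.0.0.0 \
    --port 80 \
    --enable-hierarchical-cache \
    --hicache-ratio 2 \
    --hicache-write-policy write_through \
    --hicache-storage-backend nixl
```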

error

   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 509, in _capture_one_stream
     ) = self.capture_one_batch_size(bs, forward, stream_idx)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 278, in capture_one_batch_size
     self.model_runner.draft_attn_backend.init_forward_metadata_capture_cuda_graph(
   File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1591, in init_forward_metadata_capture_cuda_graph
     self.common_template(forward_batch, self.cuda_graph_kv_indices, call_fn)
   File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1542, in common_template
     call_fn(i, forward_batch)
   File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1581, in call_fn
     self.attn_backends[i].init_forward_metadata_capture_cuda_graph(
   File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 568, in init_forward_metadata_capture_cuda_graph
     self.indices_updater_decode.update(
   File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 964, in update_single_wrapper
     self.call_begin_forward(
   File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1145, in call_begin_forward
     wrapper.begin_forward(
   File "/usr/local/lib/python3.12/dist-packages/flashinfer/decode.py", line 1051, in plan
     self._cached_module = get_batch_prefill_module(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 404, in get_batch_prefill_module
     module = gen_batch_prefill_module(backend, *args).build_and_load()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/core.py", line 316, in build_and_load
     self.build(verbose, need_lock=False)
   File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/core.py", line 302, in build
     run_ninja(self.build_dir, self.ninja_path, verbose)
   File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/cpp_ext.py", line 340, in run_ninja
     raise RuntimeError(msg) from e
 RuntimeError: Ninja build failed. Ninja output:
 ninja: Entering directory `/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90'
 [1/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o
 FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o
 /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o
 /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
         static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
         ^
           detected during:
             instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
             instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu

 /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
             argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
     using TiledMmaQK = decltype(cute::make_tiled_mma(
                                 ^
 /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
   make_tiled_mma(MMA_Op const&,
   ^
 /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
   make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
   ^
           detected during:
             instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
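For triage, the failing kernel's configuration can be read straight off the cached-op directory name in the log (q is bf16 while the KV cache is e4m3, which is the combination the SM90 ragged prefill JIT rejects). A small helper to decode such names, written for this report and purely hypothetical (it is not part of FlashInfer's API; it just splits on the field names visible in the log):

```python
# Field names as they appear in FlashInfer JIT cached-op directory names
# in the log above. Parsed back-to-front because keys contain underscores.
FIELDS = [
    "dtype_q", "dtype_kv", "dtype_o", "dtype_idx",
    "head_dim_qk", "head_dim_vo", "posenc",
    "use_swa", "use_logits_cap", "f16qk",
]

def parse_flashinfer_op_name(name: str) -> dict:
    """Decode a cached-op name into {field: value} plus the op and arch."""
    out = {}
    # Trailing arch tag, e.g. "_sm90"
    name, _, arch = name.rpartition("_sm")
    out["arch"] = "sm" + arch
    for field in reversed(FIELDS):
        name, _, value = name.rpartition("_" + field + "_")
        out[field] = value
    out["op"] = name  # what remains is the op prefix
    return out

info = parse_flashinfer_op_name(
    "batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16"
    "_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0"
    "_use_swa_False_use_logits_cap_False_f16qk_False_sm90"
)
print(info["dtype_q"], info["dtype_kv"])  # bf16 e4m3 -> the failing mix
```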

Reproduction

python -m sglang.launch_server \
    --model-path zai-org/GLM-4.6-FP8 \
    --tp-size 4 \
    --kv-cache-dtype fp8_e4m3 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --speculative-algorithm NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --enable-flashinfer-allreduce-fusion \
    --enable-fused-qk-norm-rope \
    --host 0.0.0.0 \
    --port 80 \
    --enable-hierarchical-cache \
    --hicache-ratio 2 \
    --hicache-write-policy write_through \
    --hicache-storage-backend nixl

Environment

docker.artifactory.rbx.com/lmsysorg/sglang:latest
