Checklist
- I searched related issues but found no solution.
- The bug persists in the latest version.
- Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
- If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Describe the bug
I am trying to set up KV cache offloading for the GLM 4.6 model, but the server fails at startup with a FlashInfer JIT build error. Is there a workaround for this issue? Details below.
Docker image
docker.artifactory.rbx.com/lmsysorg/sglang:latest
Launch command
python -m sglang.launch_server \
--model-path zai-org/GLM-4.6-FP8 \
--tp-size 4 \
--kv-cache-dtype fp8_e4m3 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--speculative-algorithm NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--enable-flashinfer-allreduce-fusion \
--enable-fused-qk-norm-rope \
--host 0.0.0.0 \
--port 80 \
--enable-hierarchical-cache \
--hicache-ratio 2 \
--hicache-write-policy write_through \
--hicache-storage-backend nixl
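While waiting for guidance, these are the fallback launches I plan to try (the full error follows in the next section). Both are untested sketches: --kv-cache-dtype auto and --attention-backend are existing sglang server flags, but whether either actually avoids this failure, and whether the fa3 backend is compatible with the NEXTN speculative settings above, is an assumption on my part.

# Fallback 1: keep the FlashInfer backend but leave the KV cache in the model's
# native dtype, so the failing bf16-query x fp8-KV prefill kernel is never JIT-compiled.
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.6-FP8 \
  --tp-size 4 \
  --kv-cache-dtype auto  # only change vs. the command above; remaining flags unchanged

# Fallback 2: keep --kv-cache-dtype fp8_e4m3 but switch the attention backend
# away from FlashInfer, e.g. to FlashAttention-3, sidestepping the SM90 JIT path.
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.6-FP8 \
  --tp-size 4 \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend fa3  # remaining flags unchanged from the command above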
Error
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 509, in _capture_one_stream
) = self.capture_one_batch_size(bs, forward, stream_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 278, in capture_one_batch_size
self.model_runner.draft_attn_backend.init_forward_metadata_capture_cuda_graph(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1591, in init_forward_metadata_capture_cuda_graph
self.common_template(forward_batch, self.cuda_graph_kv_indices, call_fn)
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1542, in common_template
call_fn(i, forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1581, in call_fn
self.attn_backends[i].init_forward_metadata_capture_cuda_graph(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 568, in init_forward_metadata_capture_cuda_graph self.indices_updater_decode.update(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 964, in update_single_wrapper
self.call_begin_forward(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1145, in call_begin_forward
wrapper.begin_forward(
File "/usr/local/lib/python3.12/dist-packages/flashinfer/decode.py", line 1051, in plan
self._cached_module = get_batch_prefill_module(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 404, in get_batch_prefill_module
module = gen_batch_prefill_module(backend, *args).build_and_load()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/core.py", line 316, in build_and_load
self.build(verbose, need_lock=False)
File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/core.py", line 302, in build
run_ninja(self.build_dir, self.ninja_path, verbose)
File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/cpp_ext.py", line 340, in run_ninja
raise RuntimeError(msg) from e
RuntimeError: Ninja build failed. Ninja output:
ninja: Entering directory `/root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90'
[1/9] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o
FAILED: [code=1] /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/tvm_ffi/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -static-global-template-stub=false -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -c /root/.cache/flashinfer/0.6.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cu -o /root/.cache/flashinfer/0.6.1/90a/cached_ops/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_2.cuda.o
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(2239): error: static assertion failed with "No eligible GMMA operator for request configuration."
static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
^
detected during:
instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90:: GMMA::Major::K, Args=<>]" at line 76 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/ python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS, Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams:: DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW, SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, __nv_bool, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kCustom, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/0.6.1/ 90a/generated/ batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_e4m3_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90 /batch_prefill_ragged_sm90_kernel_mask_2.cu
/usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(75): error: no instance of overloaded function "cute:: make_tiled_mma" matches the argument list
argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
using TiledMmaQK = decltype(cute::make_tiled_mma(
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(548): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Op const&,
^
/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(531): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
^
detected during:
instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 435 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 563 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
Reproduction
python -m sglang.launch_server \
--model-path zai-org/GLM-4.6-FP8 \
--tp-size 4 \
--kv-cache-dtype fp8_e4m3 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--speculative-algorithm NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--enable-flashinfer-allreduce-fusion \
--enable-fused-qk-norm-rope \
--host 0.0.0.0 \
--port 80 \
--enable-hierarchical-cache \
--hicache-ratio 2 \
--hicache-write-policy write_through \
--hicache-storage-backend nixl
Environment
docker.artifactory.rbx.com/lmsysorg/sglang:latest
GPU: Hopper (sm_90a), inferred from the -gencode=arch=compute_90a flag in the build log above