[Bug] CUDA error: no kernel image is available for execution on the device #18108

@xiyuanyuan0506

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

Deploying the Qwen-Image model for inference with the official script/documentation fails with the following error:

[02-02 16:22:41] [DenoisingStage] Error during execution after 14353.3677 ms: Runtime check failed at :0: CUDA error: no kernel image is available for execution on the device
Traceback (most recent call last):
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/base.py", line 203, in call
result = self.forward(batch, server_args)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1023, in forward
noise_pred = self._predict_noise_with_cfg(
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1268, in _predict_noise_with_cfg
noise_pred_cond = self._predict_noise(
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1214, in _predict_noise
return current_model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 982, in forward
encoder_hidden_states, hidden_states = block(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 763, in forward
attn_output = self.attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 560, in forward
img_query, img_key = apply_qk_norm(
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/layers/layernorm.py", line 439, in apply_qk_norm
fused_inplace_qknorm(
File "/ossfs/workspace/sglang-main/python/sglang/jit_kernel/norm.py", line 77, in fused_inplace_qknorm
module.qknorm(q, k, q_weight, k_weight, eps)
File "python/tvm_ffi/cython/function.pxi", line 923, in tvm_ffi.core.Function.call
RuntimeError: Runtime check failed at :0: CUDA error: no kernel image is available for execution on the device
[02-02 16:22:41] [DenoisingStage] Error during execution after 14359.9784 ms: Runtime check failed at :0: CUDA error: no kernel image is available for execution on the device
Traceback (most recent call last):
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/base.py", line 203, in call
result = self.forward(batch, server_args)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1023, in forward
noise_pred = self._predict_noise_with_cfg(
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1268, in _predict_noise_with_cfg
noise_pred_cond = self._predict_noise(
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1214, in _predict_noise
return current_model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 982, in forward
encoder_hidden_states, hidden_states = block(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 763, in forward
attn_output = self.attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 560, in forward
img_query, img_key = apply_qk_norm(
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/layers/layernorm.py", line 439, in apply_qk_norm
fused_inplace_qknorm(
File "/ossfs/workspace/sglang-main/python/sglang/jit_kernel/norm.py", line 77, in fused_inplace_qknorm
module.qknorm(q, k, q_weight, k_weight, eps)
File "python/tvm_ffi/cython/function.pxi", line 923, in tvm_ffi.core.Function.call
RuntimeError: Runtime check failed at :0: CUDA error: no kernel image is available for execution on the device
[02-02 16:22:42] Error executing request 55a5e069-22c7-43bd-8860-9350ee3bb13e: Runtime check failed at :0: CUDA error: no kernel image is available for execution on the device
Traceback (most recent call last):
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/managers/gpu_worker.py", line 165, in execute_forward
result = self.pipeline.forward(req, self.server_args)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/composed_pipeline_base.py", line 356, in forward
return self.executor.execute_with_profiling(self.stages, batch, server_args)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/executors/pipeline_executor.py", line 57, in execute_with_profiling
batch = self.execute(stages, batch, server_args)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/executors/parallel_executor.py", line 97, in execute
batch = self._execute(stages, batch, server_args)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/executors/parallel_executor.py", line 88, in _execute
batch = stage(batch, server_args)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/base.py", line 203, in call
result = self.forward(batch, server_args)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1023, in forward
noise_pred = self._predict_noise_with_cfg(
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1268, in _predict_noise_with_cfg
noise_pred_cond = self._predict_noise(
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1214, in _predict_noise
return current_model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 982, in forward
encoder_hidden_states, hidden_states = block(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 763, in forward
attn_output = self.attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 560, in forward
img_query, img_key = apply_qk_norm(
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/layers/layernorm.py", line 439, in apply_qk_norm
fused_inplace_qknorm(
File "/ossfs/workspace/sglang-main/python/sglang/jit_kernel/norm.py", line 77, in fused_inplace_qknorm
module.qknorm(q, k, q_weight, k_weight, eps)
File "python/tvm_ffi/cython/function.pxi", line 923, in tvm_ffi.core.Function.call
RuntimeError: Runtime check failed at :0: CUDA error: no kernel image is available for execution on the device
[02-02 16:22:42] Error executing request 55a5e069-22c7-43bd-8860-9350ee3bb13e: Runtime check failed at :0: CUDA error: no kernel image is available for execution on the device
Traceback (most recent call last):
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/managers/gpu_worker.py", line 165, in execute_forward
result = self.pipeline.forward(req, self.server_args)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/composed_pipeline_base.py", line 356, in forward
return self.executor.execute_with_profiling(self.stages, batch, server_args)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/executors/pipeline_executor.py", line 57, in execute_with_profiling
batch = self.execute(stages, batch, server_args)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/executors/parallel_executor.py", line 97, in execute
batch = self._execute(stages, batch, server_args)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/executors/parallel_executor.py", line 88, in _execute
batch = stage(batch, server_args)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/base.py", line 203, in call
result = self.forward(batch, server_args)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1023, in forward
noise_pred = self._predict_noise_with_cfg(
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1268, in _predict_noise_with_cfg
noise_pred_cond = self._predict_noise(
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py", line 1214, in _predict_noise
return current_model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 982, in forward
encoder_hidden_states, hidden_states = block(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 763, in forward
attn_output = self.attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py", line 560, in forward
img_query, img_key = apply_qk_norm(
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/layers/layernorm.py", line 439, in apply_qk_norm
fused_inplace_qknorm(
File "/ossfs/workspace/sglang-main/python/sglang/jit_kernel/norm.py", line 77, in fused_inplace_qknorm
module.qknorm(q, k, q_weight, k_weight, eps)
File "python/tvm_ffi/cython/function.pxi", line 923, in tvm_ffi.core.Function.call
RuntimeError: Runtime check failed at :0: CUDA error: no kernel image is available for execution on the device
[02-02 16:22:42] Failed to generate output for prompt: Model generation returned no output. Error from scheduler: Error executing request 55a5e069-22c7-43bd-8860-9350ee3bb13e: Runtime check failed at :0: CUDA error: no kernel image is available for execution on the device
Traceback (most recent call last):
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/utils/logging_utils.py", line 470, in log_generation_timer
yield timer
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/entrypoints/openai/utils.py", line 216, in process_generation_batch
raise RuntimeError(
RuntimeError: Model generation returned no output. Error from scheduler: Error executing request 55a5e069-22c7-43bd-8860-9350ee3bb13e: Runtime check failed at :0: CUDA error: no kernel image is available for execution on the device
[2026-02-02 16:22:42] INFO: 127.0.0.1:50870 - "POST /v1/images/generations HTTP/1.1" 500 Internal Server Error
[2026-02-02 16:22:42] ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 410, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/opt/conda/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in call
return await self.app(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/fastapi/applications.py", line 1135, in call
await super().call(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/applications.py", line 107, in call
await self.middleware_stack(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 63, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/opt/conda/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in call
await self.app(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 716, in call
await self.middleware_stack(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 736, in app
await route.handle(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 290, in handle
await self.app(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 115, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 101, in app
response = await f(request)
File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 355, in app
raw_response = await run_endpoint_function(
File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 243, in run_endpoint_function
return await dependant.call(**values)
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/entrypoints/openai/image_api.py", line 135, in generations
save_file_path_list, result = await process_generation_batch(
File "/ossfs/workspace/sglang-main/python/sglang/multimodal_gen/runtime/entrypoints/openai/utils.py", line 216, in process_generation_batch
raise RuntimeError(
RuntimeError: Model generation returned no output. Error from scheduler: Error executing request 55a5e069-22c7-43bd-8860-9350ee3bb13e: Runtime check failed at :0: CUDA error: no kernel image is available for execution on the device

Reproduction

sglang serve --model-path /datacube_nas/noao_data/Qwen-Image --trust-remote-code --tp-size 2 --port 30010 --host 0.0.0.0 --diffusers-attention-backend native
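
For completeness, the request that hits the failing /v1/images/generations endpoint is an ordinary OpenAI-style image generation call. A minimal client sketch (the base URL follows from --port 30010 above; the api_key placeholder, model name, and prompt are arbitrary, and the field names assume the endpoint mirrors the OpenAI Images API):

from openai import OpenAI

# Point the OpenAI client at the local sglang server started above; no real key is needed.
client = OpenAI(base_url="http://127.0.0.1:30010/v1", api_key="EMPTY")

# Any prompt triggers the failure; the 500 response corresponds to the
# "POST /v1/images/generations HTTP/1.1" 500 line in the log above.
result = client.images.generate(model="Qwen-Image", prompt="a cup of coffee on a wooden desk", n=1)
print(result)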

Environment

python3 -m sglang.check_env
Python: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1: NVIDIA A100-SXM4-80GB
GPU 0,1 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.8, V12.8.61
CUDA Driver Version: 470.82.01
PyTorch: 2.9.1+cu128
sglang: 0.0.0.dev0
sgl_kernel: 0.3.21
flashinfer_python: 0.6.2
flashinfer_cubin: 0.6.2
flashinfer_jit_cache: Module Not Found
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.13.3
fastapi: 0.128.0
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.33.0
orjson: 3.11.5
outlines: 0.1.11
packaging: 24.2
psutil: 7.0.0
pydantic: 2.12.5
python-multipart: 0.0.21
pyzmq: 27.1.0
uvicorn: 0.40.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.75.0
litellm: Module Not Found
decord2: 3.0.0
NVIDIA Topology:
GPU0 GPU1 CPU Affinity NUMA Affinity
GPU0 X NV12 0-31,64-95 0
GPU1 NV12 X 32-63,96-127 1

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

ulimit soft: 655350
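
For reference, a quick check (run in the same environment) of whether this PyTorch build ships compute-capability-8.0 kernels for the A100s. Note that torch.cuda.get_arch_list() only covers PyTorch's own binaries, not the sglang JIT qknorm kernel that raises the error, so it only helps narrow down which side is missing the sm_80 image:

import torch

# Device identity and compute capability; expected (8, 0) for A100-SXM4-80GB.
print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))

# CUDA architectures this PyTorch wheel was compiled for, e.g. ['sm_80', 'sm_86', ...].
print(torch.cuda.get_arch_list())

# A trivial CUDA op; it raises the same "no kernel image is available" error
# if torch itself was built without sm_80 support.
print(torch.zeros(4, device="cuda") + 1)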
