[Bug]: nixl connector PD disagg stuck at INFO logging level #29369

Description

@herotai214

Your current environment

/

🐛 Describe the bug

cc EPD authors @fake0fan @knlnguyen1802 @herotai214 @khuonglmhw
cc @NickLucche

Following #29248, where I hit issues with the EPD disagg feature, I found that simply running PD disagg with the nixl connector and export VLLM_LOGGING_LEVEL=INFO also gets stuck at the decode instance when streaming the 1st request.

export VLLM_LOGGING_LEVEL=DEBUG avoids the issue and works fine, but I think there are still some defects worth paying attention to...
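
To summarize, the only difference between the failing and working runs is the logging level exported before launching both instances:

export VLLM_LOGGING_LEVEL=INFO   # decode instance hangs while streaming the 1st request
export VLLM_LOGGING_LEVEL=DEBUG  # works fine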

My startup script for PD disagg with the nixl connector, using /tests/v1/kv_connector/nixl_integration/toy_proxy_server.py:

#!/bin/bash
set -xe

echo "🚧🚧 Running NIXL CONNECTOR!!!!!!!! 🚧🚧"

MODEL_NAME=<YOUR_MODEL_NAME_HERE>
LOG_PATH=<YOUR_LOG_PATH_HERE>
mkdir -p $LOG_PATH

# Track background PIDs so the cleanup function can stop them
PIDS=()

# Cleanup function
cleanup() {
    echo "Stopping everything…"
    trap - INT TERM USR1   # prevent re-entrancy
    
    # Kill all tracked PIDs
    for pid in "${PIDS[@]}"; do
        if kill -0 "$pid" 2>/dev/null; then
            echo "Killing process $pid"
            kill "$pid" 2>/dev/null
        fi
    done
    
    # Wait a moment for graceful shutdown
    sleep 2
    
    # Force kill any remaining processes
    for pid in "${PIDS[@]}"; do
        if kill -0 "$pid" 2>/dev/null; then
            echo "Force killing process $pid"
            kill -9 "$pid" 2>/dev/null
        fi
    done
    
    # Kill the entire process group as backup
    kill -- -$$ 2>/dev/null
    
    echo "All processes stopped."
    exit 0
}

trap cleanup INT
trap cleanup USR1
trap cleanup TERM

export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')

# Wait for a vLLM server to start by polling its chat completions endpoint.
# Note: the GET probe returns "405 Method Not Allowed" (visible in the logs
# below); any HTTP response means the server is up, so that is expected.
wait_for_server() {
  local port=$1
  timeout 12000 bash -c "
    until curl -s localhost:${port}/v1/chat/completions > /dev/null; do
      sleep 1
    done" && return 0 || return 1
}

### NIXL
A=2        # prefiller GPU
B=3        # decoder GPU
C=6661     # prefiller port
D=6662     # decoder port
PORT=9775  # proxy port

# prefill instance, which is the KV producer
CUDA_VISIBLE_DEVICES=$A VLLM_NIXL_SIDE_CHANNEL_PORT=5559 vllm serve $MODEL_NAME \
    --port $C \
    --enforce-eager \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code \
    --max-num-batched-tokens 4096 \
    --kv-transfer-config \
    '{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}' \
    > $LOG_PATH/prefiller_$(date +"%Y%m%d_%H%M%S").log 2>&1 &
PIDS+=($!)

# decoding instance, which is the KV consumer
CUDA_VISIBLE_DEVICES=$B VLLM_NIXL_SIDE_CHANNEL_PORT=6000 vllm serve $MODEL_NAME \
    --port $D \
    --enforce-eager \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code \
    --max-num-batched-tokens 4096 \
    --kv-transfer-config \
    '{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}' \
    > $LOG_PATH/decoder_$(date +"%Y%m%d_%H%M%S").log 2>&1 &
PIDS+=($!)

# wait until prefill and decode instances are ready
wait_for_server $C
wait_for_server $D

# toy proxy server
python3 /workspace/hero/hero_EPD_1017/vllm/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py  \
    --prefiller-ports $C   \
    --decoder-ports $D   \
    --port $PORT \
    > $LOG_PATH/proxy_$(date +"%Y%m%d_%H%M%S").log 2>&1 &
PIDS+=($!)

sleep 1

## Vision benchmark
wait_for_server $PORT

vllm bench serve \
  --backend openai-chat \
  --model $MODEL_NAME \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 100 \
  --seed 40 \
  --save-result \
  --save-detailed \
  --result-dir $LOG_PATH \
  --result-filename pddisaggnixl_$(date +"%Y%m%d_%H%M%S").json \
  --port $PORT


# cleanup
echo "cleanup..."
cleanup
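
For a quick manual check, a single streaming request can be sent through the toy proxy once everything is up, to observe whether the stream ever starts. This is just a sketch: the text-only prompt below is a placeholder, while my actual runs used vllm bench serve with the VisionArena-Chat dataset as above; the port and model placeholder match the script.

curl -N http://localhost:9775/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<YOUR_MODEL_NAME_HERE>",
        "messages": [{"role": "user", "content": "Say hello."}],
        "stream": true,
        "max_tokens": 64
      }'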

Related logs:

At INFO level:

prefill instance

INFO 11-25 02:39:04 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=2048.
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:04 [api_server.py:2056] vLLM API server version 0.11.2.dev157+ge9af6ba62
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:04 [utils.py:253] non-default args: {'model_tag': '/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', 'port': 6661, 'model': '/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', 'trust_remote_code': True, 'enforce_eager': True, 'gpu_memory_utilization': 0.8, 'max_num_batched_tokens': 4096, 'kv_transfer_config': KVTransferConfig(kv_connector='NixlConnector', engine_id='966415c7-c559-48e7-a664-c287e16856be', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_producer', kv_rank=0, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={}, kv_connector_module_path=None, enable_permute_local_kv=False)}
�[0;36m(APIServer pid=248)�[0;0m The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:04 [model.py:630] Resolved architecture: Qwen2_5_VLForConditionalGeneration
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:04 [model.py:1745] Using max model len 128000
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:06 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=4096.
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:06 [vllm.py:508] Cudagraph is disabled under eager mode
�[0;36m(APIServer pid=248)�[0;0m WARNING 11-25 02:39:06 [vllm.py:729] Turning off hybrid kv cache manager because `--kv-transfer-config` is set. This will reduce the performance of vLLM on LLMs with sliding window attention or Mamba attention. If you are a developer of kv connector, please consider supporting hybrid kv cache manager for your connector by making sure your connector is a subclass of `SupportsHMA` defined in kv_connector/v1/base.py.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:16 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev157+ge9af6ba62) with config: model='/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 0, 'local_cache_dir': None}
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:20 [parallel_state.py:1217] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.90.67.85:54579 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:20 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
�[0;36m(EngineCore_DP0 pid=594)�[0;0m The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:25 [gpu_model_runner.py:3249] Starting to load model /mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct...
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:25 [vllm.py:508] Cudagraph is disabled under eager mode
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:25 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:25 [cuda.py:424] Using FLASH_ATTN backend.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.51it/s]
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.55it/s]
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.55it/s]
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:27 [default_loader.py:308] Loading weights took 1.33 seconds
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:27 [gpu_model_runner.py:3331] Model loading took 7.1556 GiB memory and 1.685600 seconds
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:28 [gpu_model_runner.py:4083] Encoder cache will be initialized with a budget of 114688 tokens, and profiled with 1 video items of the maximum feature size.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [gpu_worker.py:348] Available KV cache memory: 42.95 GiB
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [kv_cache_utils.py:1234] GPU KV cache size: 1,251,120 tokens
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [kv_cache_utils.py:1239] Maximum concurrency for 128,000 tokens per request: 9.77x
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [nixl_connector.py:73] NIXL is available
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [factory.py:64] Creating v1 connector with name: NixlConnector and engine_id: 966415c7-c559-48e7-a664-c287e16856be
�[0;36m(EngineCore_DP0 pid=594)�[0;0m WARNING 11-25 02:39:44 [base.py:163] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [nixl_connector.py:799] Initializing NIXL wrapper
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [nixl_connector.py:800] Initializing NIXL worker 966415c7-c559-48e7-a664-c287e16856be
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 2025-11-25 02:39:46 NIXL INFO    _api.py:361 Backend UCX was instantiated
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 2025-11-25 02:39:46 NIXL INFO    _api.py:251 Initialized NIXL agent: 72fca207-f8c6-4554-9663-6b9c3f44660b
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:46 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:46 [cuda.py:424] Using FLASH_ATTN backend.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:46 [nixl_connector.py:199] NixlConnector setting KV cache layout to HND for better xfer performance.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:47 [nixl_connector.py:1154] Registering KV_Caches. use_mla: False, kv_buffer_device: cuda, use_host_buffer: False
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:49 [core.py:253] init engine (profile, create kv cache, warmup model) took 21.55 seconds
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:49 [factory.py:64] Creating v1 connector with name: NixlConnector and engine_id: 966415c7-c559-48e7-a664-c287e16856be
�[0;36m(EngineCore_DP0 pid=594)�[0;0m WARNING 11-25 02:39:49 [base.py:163] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:49 [nixl_connector.py:364] Initializing NIXL Scheduler 966415c7-c559-48e7-a664-c287e16856be
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:50 [vllm.py:508] Cudagraph is disabled under eager mode
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [nixl_connector.py:73] NIXL is available
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [api_server.py:1804] Supported tasks: ['generate']
�[0;36m(APIServer pid=248)�[0;0m WARNING 11-25 02:39:50 [model.py:1568] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [serving_responses.py:154] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [serving_chat.py:131] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [serving_completion.py:73] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [serving_chat.py:131] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [api_server.py:2134] Starting vLLM API server 0 on http://0.0.0.0:6661
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:38] Available routes are:
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /docs, Methods: HEAD, GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /health, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /load, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /pause, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /resume, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /is_paused, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /tokenize, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /detokenize, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/models, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /version, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/messages, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/completions, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/embeddings, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /pooling, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /classify, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /score, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/score, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /rerank, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/rerank, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v2/rerank, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /ping, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /ping, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /invocations, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /metrics, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO:     Started server process [248]
�[0;36m(APIServer pid=248)�[0;0m INFO:     Waiting for application startup.
�[0;36m(APIServer pid=248)�[0;0m INFO:     Application startup complete.
�[0;36m(APIServer pid=248)�[0;0m INFO:     127.0.0.1:40172 - "GET /v1/chat/completions HTTP/1.1" 405 Method Not Allowed
�[0;36m(APIServer pid=248)�[0;0m The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:40:45 [chat_utils.py:574] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
�[0;36m(APIServer pid=248)�[0;0m INFO:     127.0.0.1:45690 - "POST /v1/chat/completions HTTP/1.1" 200 OK
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:40:50 [loggers.py:236] Engine 000: Avg prompt throughput: 26.2 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:41:00 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:46:09 [launcher.py:110] Shutting down FastAPI HTTP server.

decode instance

INFO 11-25 02:39:04 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=2048.
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:04 [api_server.py:2056] vLLM API server version 0.11.2.dev157+ge9af6ba62
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:04 [utils.py:253] non-default args: {'model_tag': '/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', 'port': 6662, 'model': '/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', 'trust_remote_code': True, 'enforce_eager': True, 'gpu_memory_utilization': 0.8, 'max_num_batched_tokens': 4096, 'kv_transfer_config': KVTransferConfig(kv_connector='NixlConnector', engine_id='027a18cd-ad69-4b86-92cb-2fd78674bced', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_consumer', kv_rank=1, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={}, kv_connector_module_path=None, enable_permute_local_kv=False)}
�[0;36m(APIServer pid=249)�[0;0m The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:04 [model.py:630] Resolved architecture: Qwen2_5_VLForConditionalGeneration
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:04 [model.py:1745] Using max model len 128000
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:06 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=4096.
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:06 [vllm.py:508] Cudagraph is disabled under eager mode
�[0;36m(APIServer pid=249)�[0;0m WARNING 11-25 02:39:06 [vllm.py:729] Turning off hybrid kv cache manager because `--kv-transfer-config` is set. This will reduce the performance of vLLM on LLMs with sliding window attention or Mamba attention. If you are a developer of kv connector, please consider supporting hybrid kv cache manager for your connector by making sure your connector is a subclass of `SupportsHMA` defined in kv_connector/v1/base.py.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:16 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev157+ge9af6ba62) with config: model='/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 0, 'local_cache_dir': None}
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:20 [parallel_state.py:1217] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.90.67.85:33843 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:20 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
�[0;36m(EngineCore_DP0 pid=590)�[0;0m The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:25 [gpu_model_runner.py:3249] Starting to load model /mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct...
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:25 [vllm.py:508] Cudagraph is disabled under eager mode
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:25 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:25 [cuda.py:424] Using FLASH_ATTN backend.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.53it/s]
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.59it/s]
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.58it/s]
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:27 [default_loader.py:308] Loading weights took 1.30 seconds
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:27 [gpu_model_runner.py:3331] Model loading took 7.1556 GiB memory and 1.701283 seconds
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:28 [gpu_model_runner.py:4083] Encoder cache will be initialized with a budget of 114688 tokens, and profiled with 1 video items of the maximum feature size.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [gpu_worker.py:348] Available KV cache memory: 42.95 GiB
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [kv_cache_utils.py:1234] GPU KV cache size: 1,251,120 tokens
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [kv_cache_utils.py:1239] Maximum concurrency for 128,000 tokens per request: 9.77x
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [nixl_connector.py:73] NIXL is available
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [factory.py:64] Creating v1 connector with name: NixlConnector and engine_id: 027a18cd-ad69-4b86-92cb-2fd78674bced
�[0;36m(EngineCore_DP0 pid=590)�[0;0m WARNING 11-25 02:39:44 [base.py:163] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [nixl_connector.py:799] Initializing NIXL wrapper
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [nixl_connector.py:800] Initializing NIXL worker 027a18cd-ad69-4b86-92cb-2fd78674bced
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 2025-11-25 02:39:46 NIXL INFO    _api.py:361 Backend UCX was instantiated
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 2025-11-25 02:39:46 NIXL INFO    _api.py:251 Initialized NIXL agent: e04da5c5-8fb1-476f-a589-d9e9fa2231d8
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:46 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:46 [cuda.py:424] Using FLASH_ATTN backend.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:46 [nixl_connector.py:199] NixlConnector setting KV cache layout to HND for better xfer performance.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:47 [nixl_connector.py:1154] Registering KV_Caches. use_mla: False, kv_buffer_device: cuda, use_host_buffer: False
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:49 [core.py:253] init engine (profile, create kv cache, warmup model) took 21.62 seconds
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:49 [factory.py:64] Creating v1 connector with name: NixlConnector and engine_id: 027a18cd-ad69-4b86-92cb-2fd78674bced
�[0;36m(EngineCore_DP0 pid=590)�[0;0m WARNING 11-25 02:39:49 [base.py:163] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:49 [nixl_connector.py:364] Initializing NIXL Scheduler 027a18cd-ad69-4b86-92cb-2fd78674bced
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:50 [vllm.py:508] Cudagraph is disabled under eager mode
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [nixl_connector.py:73] NIXL is available
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [api_server.py:1804] Supported tasks: ['generate']
�[0;36m(APIServer pid=249)�[0;0m WARNING 11-25 02:39:50 [model.py:1568] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [serving_responses.py:154] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [serving_chat.py:131] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [serving_completion.py:73] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [serving_chat.py:131] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [api_server.py:2134] Starting vLLM API server 0 on http://0.0.0.0:6662
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:38] Available routes are:
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /docs, Methods: HEAD, GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /health, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /load, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /pause, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /resume, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /is_paused, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /tokenize, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /detokenize, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/models, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /version, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/messages, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/completions, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/embeddings, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /pooling, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /classify, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /score, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/score, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /rerank, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/rerank, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v2/rerank, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /ping, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /ping, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /invocations, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /metrics, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO:     Started server process [249]
�[0;36m(APIServer pid=249)�[0;0m INFO:     Waiting for application startup.
�[0;36m(APIServer pid=249)�[0;0m INFO:     Application startup complete.
�[0;36m(APIServer pid=249)�[0;0m INFO:     127.0.0.1:43240 - "GET /v1/chat/completions HTTP/1.1" 405 Method Not Allowed
�[0;36m(APIServer pid=249)�[0;0m The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:40:48 [chat_utils.py:574] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
�[0;36m(APIServer pid=249)�[0;0m INFO:     127.0.0.1:49674 - "POST /v1/chat/completions HTTP/1.1" 200 OK
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:46:09 [launcher.py:110] Shutting down FastAPI HTTP server.

proxy

INFO:     Started server process [2512]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:9775 (Press CTRL+C to quit)
Initialized 1 prefill clients and 1 decode clients.
INFO:     127.0.0.1:51120 - "GET /v1/chat/completions HTTP/1.1" 405 Method Not Allowed
INFO:     127.0.0.1:54482 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [2512]

