[Bug]: nixl connector PD disagg stuck at INFO logging level #29369

Description

@herotai214

Your current environment

/

🐛 Describe the bug

cc EPD authors @fake0fan @knlnguyen1802 @herotai214 @khuonglmhw
cc @NickLucche

Following #29248, where I hit issues with the EPD disagg feature, I found that simply running PD disagg with the nixl connector and export VLLM_LOGGING_LEVEL=INFO also gets stuck at the decode instance when streaming the 1st request.

export VLLM_LOGGING_LEVEL=DEBUG avoids the issue and works fine, but I think there are still some defects worth paying attention to...
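
To summarize, the only difference between the failing and working runs is the logging level exported before launching both instances:

export VLLM_LOGGING_LEVEL=INFO   # decode instance hangs while streaming the 1st request
export VLLM_LOGGING_LEVEL=DEBUG  # works fine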

My startup script for PD disagg with the nixl connector, using /tests/v1/kv_connector/nixl_integration/toy_proxy_server.py:

#!/bin/bash
set -xe

echo "🚧🚧 Running NIXL CONNECTOR!!!!!!!! 🚧🚧"

MODEL_NAME=<YOUR_MODEL_NAME_HERE>
LOG_PATH=<YOUR_LOG_PATH_HERE>
mkdir -p $LOG_PATH

# Track background PIDs so the cleanup function can stop them
PIDS=()

# Cleanup function
cleanup() {
    echo "Stopping everything…"
    trap - INT TERM USR1   # prevent re-entrancy
    
    # Kill all tracked PIDs
    for pid in "${PIDS[@]}"; do
        if kill -0 "$pid" 2>/dev/null; then
            echo "Killing process $pid"
            kill "$pid" 2>/dev/null
        fi
    done
    
    # Wait a moment for graceful shutdown
    sleep 2
    
    # Force kill any remaining processes
    for pid in "${PIDS[@]}"; do
        if kill -0 "$pid" 2>/dev/null; then
            echo "Force killing process $pid"
            kill -9 "$pid" 2>/dev/null
        fi
    done
    
    # Kill the entire process group as backup
    kill -- -$$ 2>/dev/null
    
    echo "All processes stopped."
    exit 0
}

trap cleanup INT
trap cleanup USR1
trap cleanup TERM

export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')

# Wait for a vLLM server to start by polling its chat completions endpoint.
# Note: the GET probe returns "405 Method Not Allowed" (visible in the logs
# below); any HTTP response means the server is up, so that is expected.
wait_for_server() {
  local port=$1
  timeout 12000 bash -c "
    until curl -s localhost:${port}/v1/chat/completions > /dev/null; do
      sleep 1
    done" && return 0 || return 1
}

### NIXL
A=2        # prefiller GPU
B=3        # decoder GPU
C=6661     # prefiller port
D=6662     # decoder port
PORT=9775  # proxy port

# prefill instance, which is the KV producer
CUDA_VISIBLE_DEVICES=$A VLLM_NIXL_SIDE_CHANNEL_PORT=5559 vllm serve $MODEL_NAME \
    --port $C \
    --enforce-eager \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code \
    --max-num-batched-tokens 4096 \
    --kv-transfer-config \
    '{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}' \
    > $LOG_PATH/prefiller_$(date +"%Y%m%d_%H%M%S").log 2>&1 &
PIDS+=($!)

# decoding instance, which is the KV consumer
CUDA_VISIBLE_DEVICES=$B VLLM_NIXL_SIDE_CHANNEL_PORT=6000 vllm serve $MODEL_NAME \
    --port $D \
    --enforce-eager \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code \
    --max-num-batched-tokens 4096 \
    --kv-transfer-config \
    '{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}' \
    > $LOG_PATH/decoder_$(date +"%Y%m%d_%H%M%S").log 2>&1 &
PIDS+=($!)

# wait until prefill and decode instances are ready
wait_for_server $C
wait_for_server $D

# toy proxy server
python3 /workspace/hero/hero_EPD_1017/vllm/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py  \
    --prefiller-ports $C   \
    --decoder-ports $D   \
    --port $PORT \
    > $LOG_PATH/proxy_$(date +"%Y%m%d_%H%M%S").log 2>&1 &
PIDS+=($!)

sleep 1

## Vision benchmark
wait_for_server $PORT

vllm bench serve \
  --backend openai-chat \
  --model $MODEL_NAME \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 100 \
  --seed 40 \
  --save-result \
  --save-detailed \
  --result-dir $LOG_PATH \
  --result-filename pddisaggnixl_$(date +"%Y%m%d_%H%M%S").json \
  --port $PORT


# cleanup
echo "cleanup..."
cleanup
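
For a quick manual check, a single streaming request can be sent through the toy proxy once everything is up, to observe whether the stream ever starts. This is just a sketch: the text-only prompt below is a placeholder, while my actual runs used vllm bench serve with the VisionArena-Chat dataset as above; the port and model placeholder match the script.

curl -N http://localhost:9775/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<YOUR_MODEL_NAME_HERE>",
        "messages": [{"role": "user", "content": "Say hello."}],
        "stream": true,
        "max_tokens": 64
      }'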

Related logs:

At INFO level:

prefill instance

INFO 11-25 02:39:04 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=2048.
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:04 [api_server.py:2056] vLLM API server version 0.11.2.dev157+ge9af6ba62
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:04 [utils.py:253] non-default args: {'model_tag': '/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', 'port': 6661, 'model': '/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', 'trust_remote_code': True, 'enforce_eager': True, 'gpu_memory_utilization': 0.8, 'max_num_batched_tokens': 4096, 'kv_transfer_config': KVTransferConfig(kv_connector='NixlConnector', engine_id='966415c7-c559-48e7-a664-c287e16856be', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_producer', kv_rank=0, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={}, kv_connector_module_path=None, enable_permute_local_kv=False)}
�[0;36m(APIServer pid=248)�[0;0m The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:04 [model.py:630] Resolved architecture: Qwen2_5_VLForConditionalGeneration
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:04 [model.py:1745] Using max model len 128000
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:06 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=4096.
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:06 [vllm.py:508] Cudagraph is disabled under eager mode
�[0;36m(APIServer pid=248)�[0;0m WARNING 11-25 02:39:06 [vllm.py:729] Turning off hybrid kv cache manager because `--kv-transfer-config` is set. This will reduce the performance of vLLM on LLMs with sliding window attention or Mamba attention. If you are a developer of kv connector, please consider supporting hybrid kv cache manager for your connector by making sure your connector is a subclass of `SupportsHMA` defined in kv_connector/v1/base.py.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:16 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev157+ge9af6ba62) with config: model='/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 0, 'local_cache_dir': None}
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:20 [parallel_state.py:1217] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.90.67.85:54579 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:20 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
�[0;36m(EngineCore_DP0 pid=594)�[0;0m The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:25 [gpu_model_runner.py:3249] Starting to load model /mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct...
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:25 [vllm.py:508] Cudagraph is disabled under eager mode
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:25 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:25 [cuda.py:424] Using FLASH_ATTN backend.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.51it/s]
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.55it/s]
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.55it/s]
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:27 [default_loader.py:308] Loading weights took 1.33 seconds
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:27 [gpu_model_runner.py:3331] Model loading took 7.1556 GiB memory and 1.685600 seconds
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:28 [gpu_model_runner.py:4083] Encoder cache will be initialized with a budget of 114688 tokens, and profiled with 1 video items of the maximum feature size.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [gpu_worker.py:348] Available KV cache memory: 42.95 GiB
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [kv_cache_utils.py:1234] GPU KV cache size: 1,251,120 tokens
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [kv_cache_utils.py:1239] Maximum concurrency for 128,000 tokens per request: 9.77x
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [nixl_connector.py:73] NIXL is available
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [factory.py:64] Creating v1 connector with name: NixlConnector and engine_id: 966415c7-c559-48e7-a664-c287e16856be
�[0;36m(EngineCore_DP0 pid=594)�[0;0m WARNING 11-25 02:39:44 [base.py:163] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [nixl_connector.py:799] Initializing NIXL wrapper
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:44 [nixl_connector.py:800] Initializing NIXL worker 966415c7-c559-48e7-a664-c287e16856be
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 2025-11-25 02:39:46 NIXL INFO    _api.py:361 Backend UCX was instantiated
�[0;36m(EngineCore_DP0 pid=594)�[0;0m 2025-11-25 02:39:46 NIXL INFO    _api.py:251 Initialized NIXL agent: 72fca207-f8c6-4554-9663-6b9c3f44660b
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:46 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:46 [cuda.py:424] Using FLASH_ATTN backend.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:46 [nixl_connector.py:199] NixlConnector setting KV cache layout to HND for better xfer performance.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:47 [nixl_connector.py:1154] Registering KV_Caches. use_mla: False, kv_buffer_device: cuda, use_host_buffer: False
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:49 [core.py:253] init engine (profile, create kv cache, warmup model) took 21.55 seconds
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:49 [factory.py:64] Creating v1 connector with name: NixlConnector and engine_id: 966415c7-c559-48e7-a664-c287e16856be
�[0;36m(EngineCore_DP0 pid=594)�[0;0m WARNING 11-25 02:39:49 [base.py:163] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:49 [nixl_connector.py:364] Initializing NIXL Scheduler 966415c7-c559-48e7-a664-c287e16856be
�[0;36m(EngineCore_DP0 pid=594)�[0;0m INFO 11-25 02:39:50 [vllm.py:508] Cudagraph is disabled under eager mode
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [nixl_connector.py:73] NIXL is available
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [api_server.py:1804] Supported tasks: ['generate']
�[0;36m(APIServer pid=248)�[0;0m WARNING 11-25 02:39:50 [model.py:1568] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [serving_responses.py:154] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [serving_chat.py:131] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [serving_completion.py:73] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [serving_chat.py:131] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [api_server.py:2134] Starting vLLM API server 0 on http://0.0.0.0:6661
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:38] Available routes are:
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /docs, Methods: HEAD, GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /health, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /load, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /pause, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /resume, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /is_paused, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /tokenize, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /detokenize, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/models, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /version, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/messages, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/completions, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/embeddings, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /pooling, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /classify, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /score, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/score, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /rerank, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/rerank, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v2/rerank, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /ping, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /ping, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /invocations, Methods: POST
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /metrics, Methods: GET
�[0;36m(APIServer pid=248)�[0;0m INFO:     Started server process [248]
�[0;36m(APIServer pid=248)�[0;0m INFO:     Waiting for application startup.
�[0;36m(APIServer pid=248)�[0;0m INFO:     Application startup complete.
�[0;36m(APIServer pid=248)�[0;0m INFO:     127.0.0.1:40172 - "GET /v1/chat/completions HTTP/1.1" 405 Method Not Allowed
�[0;36m(APIServer pid=248)�[0;0m The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:40:45 [chat_utils.py:574] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
�[0;36m(APIServer pid=248)�[0;0m INFO:     127.0.0.1:45690 - "POST /v1/chat/completions HTTP/1.1" 200 OK
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:40:50 [loggers.py:236] Engine 000: Avg prompt throughput: 26.2 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:41:00 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
�[0;36m(APIServer pid=248)�[0;0m INFO 11-25 02:46:09 [launcher.py:110] Shutting down FastAPI HTTP server.

decode instance

INFO 11-25 02:39:04 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=2048.
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:04 [api_server.py:2056] vLLM API server version 0.11.2.dev157+ge9af6ba62
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:04 [utils.py:253] non-default args: {'model_tag': '/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', 'port': 6662, 'model': '/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', 'trust_remote_code': True, 'enforce_eager': True, 'gpu_memory_utilization': 0.8, 'max_num_batched_tokens': 4096, 'kv_transfer_config': KVTransferConfig(kv_connector='NixlConnector', engine_id='027a18cd-ad69-4b86-92cb-2fd78674bced', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_consumer', kv_rank=1, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={}, kv_connector_module_path=None, enable_permute_local_kv=False)}
�[0;36m(APIServer pid=249)�[0;0m The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:04 [model.py:630] Resolved architecture: Qwen2_5_VLForConditionalGeneration
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:04 [model.py:1745] Using max model len 128000
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:06 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=4096.
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:06 [vllm.py:508] Cudagraph is disabled under eager mode
�[0;36m(APIServer pid=249)�[0;0m WARNING 11-25 02:39:06 [vllm.py:729] Turning off hybrid kv cache manager because `--kv-transfer-config` is set. This will reduce the performance of vLLM on LLMs with sliding window attention or Mamba attention. If you are a developer of kv connector, please consider supporting hybrid kv cache manager for your connector by making sure your connector is a subclass of `SupportsHMA` defined in kv_connector/v1/base.py.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:16 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev157+ge9af6ba62) with config: model='/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 0, 'local_cache_dir': None}
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:20 [parallel_state.py:1217] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.90.67.85:33843 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:20 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
�[0;36m(EngineCore_DP0 pid=590)�[0;0m The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:25 [gpu_model_runner.py:3249] Starting to load model /mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct...
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:25 [vllm.py:508] Cudagraph is disabled under eager mode
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:25 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:25 [cuda.py:424] Using FLASH_ATTN backend.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.53it/s]
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.59it/s]
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.58it/s]
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:27 [default_loader.py:308] Loading weights took 1.30 seconds
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:27 [gpu_model_runner.py:3331] Model loading took 7.1556 GiB memory and 1.701283 seconds
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:28 [gpu_model_runner.py:4083] Encoder cache will be initialized with a budget of 114688 tokens, and profiled with 1 video items of the maximum feature size.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [gpu_worker.py:348] Available KV cache memory: 42.95 GiB
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [kv_cache_utils.py:1234] GPU KV cache size: 1,251,120 tokens
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [kv_cache_utils.py:1239] Maximum concurrency for 128,000 tokens per request: 9.77x
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [nixl_connector.py:73] NIXL is available
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [factory.py:64] Creating v1 connector with name: NixlConnector and engine_id: 027a18cd-ad69-4b86-92cb-2fd78674bced
�[0;36m(EngineCore_DP0 pid=590)�[0;0m WARNING 11-25 02:39:44 [base.py:163] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [nixl_connector.py:799] Initializing NIXL wrapper
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:44 [nixl_connector.py:800] Initializing NIXL worker 027a18cd-ad69-4b86-92cb-2fd78674bced
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 2025-11-25 02:39:46 NIXL INFO    _api.py:361 Backend UCX was instantiated
�[0;36m(EngineCore_DP0 pid=590)�[0;0m 2025-11-25 02:39:46 NIXL INFO    _api.py:251 Initialized NIXL agent: e04da5c5-8fb1-476f-a589-d9e9fa2231d8
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:46 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:46 [cuda.py:424] Using FLASH_ATTN backend.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:46 [nixl_connector.py:199] NixlConnector setting KV cache layout to HND for better xfer performance.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:47 [nixl_connector.py:1154] Registering KV_Caches. use_mla: False, kv_buffer_device: cuda, use_host_buffer: False
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:49 [core.py:253] init engine (profile, create kv cache, warmup model) took 21.62 seconds
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:49 [factory.py:64] Creating v1 connector with name: NixlConnector and engine_id: 027a18cd-ad69-4b86-92cb-2fd78674bced
�[0;36m(EngineCore_DP0 pid=590)�[0;0m WARNING 11-25 02:39:49 [base.py:163] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:49 [nixl_connector.py:364] Initializing NIXL Scheduler 027a18cd-ad69-4b86-92cb-2fd78674bced
�[0;36m(EngineCore_DP0 pid=590)�[0;0m INFO 11-25 02:39:50 [vllm.py:508] Cudagraph is disabled under eager mode
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [nixl_connector.py:73] NIXL is available
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [api_server.py:1804] Supported tasks: ['generate']
�[0;36m(APIServer pid=249)�[0;0m WARNING 11-25 02:39:50 [model.py:1568] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [serving_responses.py:154] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [serving_chat.py:131] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [serving_completion.py:73] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [serving_chat.py:131] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [api_server.py:2134] Starting vLLM API server 0 on http://0.0.0.0:6662
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:38] Available routes are:
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /docs, Methods: HEAD, GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /health, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /load, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /pause, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /resume, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /is_paused, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /tokenize, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /detokenize, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/models, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /version, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/messages, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/completions, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/embeddings, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /pooling, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /classify, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /score, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/score, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /rerank, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/rerank, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /v2/rerank, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /ping, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /ping, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /invocations, Methods: POST
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:39:50 [launcher.py:46] Route: /metrics, Methods: GET
�[0;36m(APIServer pid=249)�[0;0m INFO:     Started server process [249]
�[0;36m(APIServer pid=249)�[0;0m INFO:     Waiting for application startup.
�[0;36m(APIServer pid=249)�[0;0m INFO:     Application startup complete.
�[0;36m(APIServer pid=249)�[0;0m INFO:     127.0.0.1:43240 - "GET /v1/chat/completions HTTP/1.1" 405 Method Not Allowed
�[0;36m(APIServer pid=249)�[0;0m The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:40:48 [chat_utils.py:574] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
�[0;36m(APIServer pid=249)�[0;0m INFO:     127.0.0.1:49674 - "POST /v1/chat/completions HTTP/1.1" 200 OK
�[0;36m(APIServer pid=249)�[0;0m INFO 11-25 02:46:09 [launcher.py:110] Shutting down FastAPI HTTP server.

proxy

INFO:     Started server process [2512]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:9775 (Press CTRL+C to quit)
Initialized 1 prefill clients and 1 decode clients.
INFO:     127.0.0.1:51120 - "GET /v1/chat/completions HTTP/1.1" 405 Method Not Allowed
INFO:     127.0.0.1:54482 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [2512]

