INFO 11-25 02:39:04 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=248) INFO 11-25 02:39:04 [api_server.py:2056] vLLM API server version 0.11.2.dev157+ge9af6ba62
(APIServer pid=248) INFO 11-25 02:39:04 [utils.py:253] non-default args: {'model_tag': '/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', 'port': 6661, 'model': '/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', 'trust_remote_code': True, 'enforce_eager': True, 'gpu_memory_utilization': 0.8, 'max_num_batched_tokens': 4096, 'kv_transfer_config': KVTransferConfig(kv_connector='NixlConnector', engine_id='966415c7-c559-48e7-a664-c287e16856be', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_producer', kv_rank=0, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={}, kv_connector_module_path=None, enable_permute_local_kv=False)}
(APIServer pid=248) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=248) INFO 11-25 02:39:04 [model.py:630] Resolved architecture: Qwen2_5_VLForConditionalGeneration
(APIServer pid=248) INFO 11-25 02:39:04 [model.py:1745] Using max model len 128000
(APIServer pid=248) INFO 11-25 02:39:06 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=248) INFO 11-25 02:39:06 [vllm.py:508] Cudagraph is disabled under eager mode
(APIServer pid=248) WARNING 11-25 02:39:06 [vllm.py:729] Turning off hybrid kv cache manager because `--kv-transfer-config` is set. This will reduce the performance of vLLM on LLMs with sliding window attention or Mamba attention. If you are a developer of kv connector, please consider supporting hybrid kv cache manager for your connector by making sure your connector is a subclass of `SupportsHMA` defined in kv_connector/v1/base.py.
(EngineCore_DP0 pid=594) INFO 11-25 02:39:16 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev157+ge9af6ba62) with config: model='/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 0, 'local_cache_dir': None}
(EngineCore_DP0 pid=594) INFO 11-25 02:39:20 [parallel_state.py:1217] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.90.67.85:54579 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=594) INFO 11-25 02:39:20 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=594) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=594) INFO 11-25 02:39:25 [gpu_model_runner.py:3249] Starting to load model /mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct...
(EngineCore_DP0 pid=594) INFO 11-25 02:39:25 [vllm.py:508] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=594) INFO 11-25 02:39:25 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=594) INFO 11-25 02:39:25 [cuda.py:424] Using FLASH_ATTN backend.
(EngineCore_DP0 pid=594) Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(EngineCore_DP0 pid=594) Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.51it/s]
(EngineCore_DP0 pid=594) Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.55it/s]
(EngineCore_DP0 pid=594) INFO 11-25 02:39:27 [default_loader.py:308] Loading weights took 1.33 seconds
(EngineCore_DP0 pid=594) INFO 11-25 02:39:27 [gpu_model_runner.py:3331] Model loading took 7.1556 GiB memory and 1.685600 seconds
(EngineCore_DP0 pid=594) INFO 11-25 02:39:28 [gpu_model_runner.py:4083] Encoder cache will be initialized with a budget of 114688 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore_DP0 pid=594) INFO 11-25 02:39:44 [gpu_worker.py:348] Available KV cache memory: 42.95 GiB
(EngineCore_DP0 pid=594) INFO 11-25 02:39:44 [kv_cache_utils.py:1234] GPU KV cache size: 1,251,120 tokens
(EngineCore_DP0 pid=594) INFO 11-25 02:39:44 [kv_cache_utils.py:1239] Maximum concurrency for 128,000 tokens per request: 9.77x
(EngineCore_DP0 pid=594) INFO 11-25 02:39:44 [nixl_connector.py:73] NIXL is available
(EngineCore_DP0 pid=594) INFO 11-25 02:39:44 [factory.py:64] Creating v1 connector with name: NixlConnector and engine_id: 966415c7-c559-48e7-a664-c287e16856be
(EngineCore_DP0 pid=594) WARNING 11-25 02:39:44 [base.py:163] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
(EngineCore_DP0 pid=594) INFO 11-25 02:39:44 [nixl_connector.py:799] Initializing NIXL wrapper
(EngineCore_DP0 pid=594) INFO 11-25 02:39:44 [nixl_connector.py:800] Initializing NIXL worker 966415c7-c559-48e7-a664-c287e16856be
(EngineCore_DP0 pid=594) 2025-11-25 02:39:46 NIXL INFO _api.py:361 Backend UCX was instantiated
(EngineCore_DP0 pid=594) 2025-11-25 02:39:46 NIXL INFO _api.py:251 Initialized NIXL agent: 72fca207-f8c6-4554-9663-6b9c3f44660b
(EngineCore_DP0 pid=594) INFO 11-25 02:39:46 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=594) INFO 11-25 02:39:46 [cuda.py:424] Using FLASH_ATTN backend.
(EngineCore_DP0 pid=594) INFO 11-25 02:39:46 [nixl_connector.py:199] NixlConnector setting KV cache layout to HND for better xfer performance.
(EngineCore_DP0 pid=594) INFO 11-25 02:39:47 [nixl_connector.py:1154] Registering KV_Caches. use_mla: False, kv_buffer_device: cuda, use_host_buffer: False
(EngineCore_DP0 pid=594) INFO 11-25 02:39:49 [core.py:253] init engine (profile, create kv cache, warmup model) took 21.55 seconds
(EngineCore_DP0 pid=594) INFO 11-25 02:39:49 [factory.py:64] Creating v1 connector with name: NixlConnector and engine_id: 966415c7-c559-48e7-a664-c287e16856be
(EngineCore_DP0 pid=594) WARNING 11-25 02:39:49 [base.py:163] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
(EngineCore_DP0 pid=594) INFO 11-25 02:39:49 [nixl_connector.py:364] Initializing NIXL Scheduler 966415c7-c559-48e7-a664-c287e16856be
(EngineCore_DP0 pid=594) INFO 11-25 02:39:50 [vllm.py:508] Cudagraph is disabled under eager mode
(APIServer pid=248) INFO 11-25 02:39:50 [nixl_connector.py:73] NIXL is available
(APIServer pid=248) INFO 11-25 02:39:50 [api_server.py:1804] Supported tasks: ['generate']
(APIServer pid=248) WARNING 11-25 02:39:50 [model.py:1568] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=248) INFO 11-25 02:39:50 [serving_responses.py:154] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
(APIServer pid=248) INFO 11-25 02:39:50 [serving_chat.py:131] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
(APIServer pid=248) INFO 11-25 02:39:50 [serving_completion.py:73] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
(APIServer pid=248) INFO 11-25 02:39:50 [serving_chat.py:131] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
(APIServer pid=248) INFO 11-25 02:39:50 [api_server.py:2134] Starting vLLM API server 0 on http://0.0.0.0:6661
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:38] Available routes are:
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=248) INFO 11-25 02:39:50 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=248) INFO: Started server process [248]
(APIServer pid=248) INFO: Waiting for application startup.
(APIServer pid=248) INFO: Application startup complete.
(APIServer pid=248) INFO: 127.0.0.1:40172 - "GET /v1/chat/completions HTTP/1.1" 405 Method Not Allowed
(APIServer pid=248) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(APIServer pid=248) INFO 11-25 02:40:45 [chat_utils.py:574] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=248) INFO: 127.0.0.1:45690 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=248) INFO 11-25 02:40:50 [loggers.py:236] Engine 000: Avg prompt throughput: 26.2 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=248) INFO 11-25 02:41:00 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=248) INFO 11-25 02:46:09 [launcher.py:110] Shutting down FastAPI HTTP server.
INFO 11-25 02:39:04 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=249) INFO 11-25 02:39:04 [api_server.py:2056] vLLM API server version 0.11.2.dev157+ge9af6ba62
(APIServer pid=249) INFO 11-25 02:39:04 [utils.py:253] non-default args: {'model_tag': '/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', 'port': 6662, 'model': '/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', 'trust_remote_code': True, 'enforce_eager': True, 'gpu_memory_utilization': 0.8, 'max_num_batched_tokens': 4096, 'kv_transfer_config': KVTransferConfig(kv_connector='NixlConnector', engine_id='027a18cd-ad69-4b86-92cb-2fd78674bced', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_consumer', kv_rank=1, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={}, kv_connector_module_path=None, enable_permute_local_kv=False)}
(APIServer pid=249) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=249) INFO 11-25 02:39:04 [model.py:630] Resolved architecture: Qwen2_5_VLForConditionalGeneration
(APIServer pid=249) INFO 11-25 02:39:04 [model.py:1745] Using max model len 128000
(APIServer pid=249) INFO 11-25 02:39:06 [scheduler.py:207] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=249) INFO 11-25 02:39:06 [vllm.py:508] Cudagraph is disabled under eager mode
(APIServer pid=249) WARNING 11-25 02:39:06 [vllm.py:729] Turning off hybrid kv cache manager because `--kv-transfer-config` is set. This will reduce the performance of vLLM on LLMs with sliding window attention or Mamba attention. If you are a developer of kv connector, please consider supporting hybrid kv cache manager for your connector by making sure your connector is a subclass of `SupportsHMA` defined in kv_connector/v1/base.py.
(EngineCore_DP0 pid=590) INFO 11-25 02:39:16 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev157+ge9af6ba62) with config: model='/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 0, 'local_cache_dir': None}
(EngineCore_DP0 pid=590) INFO 11-25 02:39:20 [parallel_state.py:1217] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.90.67.85:33843 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=590) INFO 11-25 02:39:20 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=590) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=590) INFO 11-25 02:39:25 [gpu_model_runner.py:3249] Starting to load model /mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct...
(EngineCore_DP0 pid=590) INFO 11-25 02:39:25 [vllm.py:508] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=590) INFO 11-25 02:39:25 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=590) INFO 11-25 02:39:25 [cuda.py:424] Using FLASH_ATTN backend.
(EngineCore_DP0 pid=590) Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(EngineCore_DP0 pid=590) Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.53it/s]
(EngineCore_DP0 pid=590) Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.58it/s]
(EngineCore_DP0 pid=590) INFO 11-25 02:39:27 [default_loader.py:308] Loading weights took 1.30 seconds
(EngineCore_DP0 pid=590) INFO 11-25 02:39:27 [gpu_model_runner.py:3331] Model loading took 7.1556 GiB memory and 1.701283 seconds
(EngineCore_DP0 pid=590) INFO 11-25 02:39:28 [gpu_model_runner.py:4083] Encoder cache will be initialized with a budget of 114688 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore_DP0 pid=590) INFO 11-25 02:39:44 [gpu_worker.py:348] Available KV cache memory: 42.95 GiB
(EngineCore_DP0 pid=590) INFO 11-25 02:39:44 [kv_cache_utils.py:1234] GPU KV cache size: 1,251,120 tokens
(EngineCore_DP0 pid=590) INFO 11-25 02:39:44 [kv_cache_utils.py:1239] Maximum concurrency for 128,000 tokens per request: 9.77x
(EngineCore_DP0 pid=590) INFO 11-25 02:39:44 [nixl_connector.py:73] NIXL is available
(EngineCore_DP0 pid=590) INFO 11-25 02:39:44 [factory.py:64] Creating v1 connector with name: NixlConnector and engine_id: 027a18cd-ad69-4b86-92cb-2fd78674bced
(EngineCore_DP0 pid=590) WARNING 11-25 02:39:44 [base.py:163] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
(EngineCore_DP0 pid=590) INFO 11-25 02:39:44 [nixl_connector.py:799] Initializing NIXL wrapper
(EngineCore_DP0 pid=590) INFO 11-25 02:39:44 [nixl_connector.py:800] Initializing NIXL worker 027a18cd-ad69-4b86-92cb-2fd78674bced
(EngineCore_DP0 pid=590) 2025-11-25 02:39:46 NIXL INFO _api.py:361 Backend UCX was instantiated
(EngineCore_DP0 pid=590) 2025-11-25 02:39:46 NIXL INFO _api.py:251 Initialized NIXL agent: e04da5c5-8fb1-476f-a589-d9e9fa2231d8
(EngineCore_DP0 pid=590) INFO 11-25 02:39:46 [cuda.py:415] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=590) INFO 11-25 02:39:46 [cuda.py:424] Using FLASH_ATTN backend.
(EngineCore_DP0 pid=590) INFO 11-25 02:39:46 [nixl_connector.py:199] NixlConnector setting KV cache layout to HND for better xfer performance.
(EngineCore_DP0 pid=590) INFO 11-25 02:39:47 [nixl_connector.py:1154] Registering KV_Caches. use_mla: False, kv_buffer_device: cuda, use_host_buffer: False
(EngineCore_DP0 pid=590) INFO 11-25 02:39:49 [core.py:253] init engine (profile, create kv cache, warmup model) took 21.62 seconds
(EngineCore_DP0 pid=590) INFO 11-25 02:39:49 [factory.py:64] Creating v1 connector with name: NixlConnector and engine_id: 027a18cd-ad69-4b86-92cb-2fd78674bced
(EngineCore_DP0 pid=590) WARNING 11-25 02:39:49 [base.py:163] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
(EngineCore_DP0 pid=590) INFO 11-25 02:39:49 [nixl_connector.py:364] Initializing NIXL Scheduler 027a18cd-ad69-4b86-92cb-2fd78674bced
(EngineCore_DP0 pid=590) INFO 11-25 02:39:50 [vllm.py:508] Cudagraph is disabled under eager mode
(APIServer pid=249) INFO 11-25 02:39:50 [nixl_connector.py:73] NIXL is available
(APIServer pid=249) INFO 11-25 02:39:50 [api_server.py:1804] Supported tasks: ['generate']
(APIServer pid=249) WARNING 11-25 02:39:50 [model.py:1568] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=249) INFO 11-25 02:39:50 [serving_responses.py:154] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
(APIServer pid=249) INFO 11-25 02:39:50 [serving_chat.py:131] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
(APIServer pid=249) INFO 11-25 02:39:50 [serving_completion.py:73] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
(APIServer pid=249) INFO 11-25 02:39:50 [serving_chat.py:131] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 1e-06}
(APIServer pid=249) INFO 11-25 02:39:50 [api_server.py:2134] Starting vLLM API server 0 on http://0.0.0.0:6662
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:38] Available routes are:
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=249) INFO 11-25 02:39:50 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=249) INFO: Started server process [249]
(APIServer pid=249) INFO: Waiting for application startup.
(APIServer pid=249) INFO: Application startup complete.
(APIServer pid=249) INFO: 127.0.0.1:43240 - "GET /v1/chat/completions HTTP/1.1" 405 Method Not Allowed
(APIServer pid=249) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(APIServer pid=249) INFO 11-25 02:40:48 [chat_utils.py:574] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=249) INFO: 127.0.0.1:49674 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=249) INFO 11-25 02:46:09 [launcher.py:110] Shutting down FastAPI HTTP server.
Your current environment
/
🐛 Describe the bug
cc EPD authors @fake0fan @knlnguyen1802 @herotai214 @khuonglmhw
cc @NickLucche
Following #29248, where I ran into an issue with the EPD disagg feature, I found that simply running PD disagg with the NIXL connector under `export VLLM_LOGGING_LEVEL=INFO` also gets stuck at the decode instance when streaming the first request. `export VLLM_LOGGING_LEVEL=DEBUG` avoids the issue and works fine, but I think there are still some defects worth paying attention to.

My starting script for PD disagg with the NIXL connector, using `/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py`, is sketched below:
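The actual script is not reproduced here; what follows is only a minimal sketch of the setup, reconstructed from the non-default args in the logs above (model path, ports 6661/6662, producer/consumer roles, kv_port 14579). The GPU pinning, the proxy port, and the toy_proxy_server.py flag names are assumptions, not taken from the logs.

```bash
# Minimal sketch only, not the original starting script.
export VLLM_LOGGING_LEVEL=INFO   # switching this to DEBUG avoids the hang
MODEL=/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct

# Prefill instance (kv_producer, port 6661)
CUDA_VISIBLE_DEVICES=0 vllm serve "$MODEL" \
    --port 6661 --trust-remote-code --enforce-eager \
    --gpu-memory-utilization 0.8 --max-num-batched-tokens 4096 \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_ip":"127.0.0.1","kv_port":14579,"kv_buffer_device":"cuda","kv_buffer_size":1e9}' &

# Decode instance (kv_consumer, port 6662)
CUDA_VISIBLE_DEVICES=1 vllm serve "$MODEL" \
    --port 6662 --trust-remote-code --enforce-eager \
    --gpu-memory-utilization 0.8 --max-num-batched-tokens 4096 \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_ip":"127.0.0.1","kv_port":14579,"kv_buffer_device":"cuda","kv_buffer_size":1e9}' &

# Toy proxy in front of the two instances (port and flag names assumed; check the script's --help)
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
    --port 8192 \
    --prefiller-host 127.0.0.1 --prefiller-port 6661 \
    --decoder-host 127.0.0.1 --decoder-port 6662
```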
Related logs:

Details

At INFO level:

prefill instance

decode instance

proxy
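For illustration, the hang described above is hit on the first streaming request sent through the proxy. A request of roughly this shape shows what that first request looks like; the proxy port and the payload are assumptions, and only the `/v1/chat/completions` route and the model path come from the logs.

```bash
# Illustrative only: proxy port (8192) and payload are assumed.
curl -N http://127.0.0.1:8192/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "/mnt/nvme3n1/models/Qwen2.5-VL-3B-Instruct",
          "messages": [{"role": "user", "content": "Describe this image in one sentence."}],
          "stream": true
        }'
```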
Before submitting a new issue...