
Conversation


@R3hankhan123 (Contributor) commented Nov 18, 2025

Purpose

This PR fixes a BF16 (bfloat16) support issue by correcting the byte ordering, and adds comprehensive vectorized mathematical operations for the IBM Z s390x architecture using VXE.

Key Issues Addressed:

  1. BF16 Byte Ordering Bug: Fixed incorrect byte ordering in BF16↔FP32 conversions that caused model inference failures on the big-endian s390x architecture
  2. Missing Vectorization: Replaced slow scalar fallbacks (std::exp, std::tanh, std::erf) with optimized vector implementations
  3. FMA Operations: Implemented fused multiply-add intrinsics using IBM Z vector instructions for better performance
  4. Numerical Accuracy: Improved BF16 rounding from simple truncation to round-to-nearest-even (RNE); a minimal conversion sketch follows this list
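
As a rough illustration of items 1 and 4 (a minimal standalone sketch, not the PR's actual code): operating on the FP32 bit pattern as a 32-bit integer makes the conversion endian-safe, and biasing by 0x7FFF plus the retained mantissa's LSB before truncating gives round-to-nearest-even. NaN handling is omitted for brevity.

  #include <cstdint>
  #include <cstring>

  // Endian-safe FP32 -> BF16 with round-to-nearest-even (sketch).
  static inline uint16_t fp32_to_bf16_rne(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));  // view the float as raw bits
    // RNE: bias by 0x7FFF plus the LSB of the bits we keep, then truncate.
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>(bits >> 16);  // BF16 is the top half of FP32
  }

  static inline float bf16_to_fp32(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
  }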

Test Plan

Deploy the vLLM service and check the inference output.

Test Result

[root@b314lp81 vllm]# docker run --rm -p 9000:9000 --name vllm-optimized  --entrypoint python    quay.io/r3hankhan/vllm:bf16    -m vllm.entrypoints.openai.api_server   --model ibm-granite/granite-3.3-8b-instruct   --host 0.0.0.0   --port 9000   --dtype bfloat16   --max-model-len 2048 
INFO 11-18 06:34:51 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 11-18 06:34:53 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 11-18 06:34:53 [api_server.py:1977] vLLM API server version 0.1.dev11386+g16451b2ba.d20251118
(APIServer pid=1) INFO 11-18 06:34:53 [utils.py:253] non-default args: {'host': '0.0.0.0', 'port': 9000, 'model': 'ibm-granite/granite-3.3-8b-instruct', 'dtype': 'bfloat16', 'max_model_len': 2048}
(APIServer pid=1) INFO 11-18 06:35:06 [model.py:631] Resolved architecture: GraniteForCausalLM
(APIServer pid=1) INFO 11-18 06:35:06 [model.py:1745] Using max model len 2048
(APIServer pid=1) WARNING 11-18 06:35:06 [cpu.py:155] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
(APIServer pid=1) INFO 11-18 06:35:06 [arg_utils.py:1376] Chunked prefill is not supported for ARM and POWER, S390X and RISC-V CPUs; disabling it for V1 backend.
(APIServer pid=1) INFO 11-18 06:35:06 [arg_utils.py:1382] Prefix caching is not supported for ARM and POWER, S390X and RISC-V CPUs; disabling it for V1 backend.
(APIServer pid=1) WARNING 11-18 06:35:07 [cpu.py:390] Pin memory is not supported on CPU.
INFO 11-18 06:35:14 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore_DP0 pid=24) INFO 11-18 06:35:15 [core.py:93] Initializing a V1 LLM engine (v0.1.dev11386+g16451b2ba.d20251118) with config: model='ibm-granite/granite-3.3-8b-instruct', speculative_config=None, tokenizer='ibm-granite/granite-3.3-8b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=ibm-granite/granite-3.3-8b-instruct, enable_prefix_caching=False, enable_chunked_prefill=False, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.DYNAMO_TRACE_ONCE: 2>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': None, 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': None, 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'dce': True, 'size_asserts': False, 'nan_asserts': False, 'epilogue_fusion': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': None, 'local_cache_dir': None}
(EngineCore_DP0 pid=24) WARNING 11-18 06:35:18 [cpu.py:390] Pin memory is not supported on CPU.
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:188] auto thread-binding list (id, physical core): [(0, 0), (1, 0), (2, 1), (3, 1), (8, 4), (9, 4), (10, 5), (11, 5), (16, 8), (17, 8), (18, 9), (19, 9), (24, 12), (25, 12), (26, 13), (27, 13), (32, 16), (33, 16), (34, 17), (35, 17), (40, 20), (41, 20), (42, 21), (43, 21), (48, 24), (49, 24)]
get_mempolicy: Operation not permitted
[W1118 06:35:19.957288234 utils.cpp:67] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_threads_env)
set_mempolicy: Operation not permitted
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] OMP threads binding of Process 24:
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 24, core 0
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 35, core 1
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 36, core 2
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 37, core 3
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 38, core 8
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 39, core 9
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 40, core 10
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 41, core 11
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 42, core 16
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 43, core 17
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 44, core 18
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 45, core 19
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 46, core 24
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 47, core 25
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 48, core 26
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 49, core 27
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 50, core 32
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 51, core 33
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 52, core 34
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 53, core 35
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 54, core 40
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 55, core 41
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 56, core 42
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 57, core 43
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 58, core 48
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 	OMP tid: 59, core 49
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_worker.py:94] 
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.2:51829 backend=gloo
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=24) INFO 11-18 06:35:19 [cpu_model_runner.py:55] Starting to load model ibm-granite/granite-3.3-8b-instruct...
(EngineCore_DP0 pid=24) INFO 11-18 06:40:24 [weight_utils.py:441] Time spent downloading weights for ibm-granite/granite-3.3-8b-instruct: 305.037842 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.02it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:04<00:04,  2.41s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:07<00:02,  2.60s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  2.73s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  2.54s/it]
(EngineCore_DP0 pid=24) 
(EngineCore_DP0 pid=24) INFO 11-18 06:40:35 [default_loader.py:314] Loading weights took 10.15 seconds
(EngineCore_DP0 pid=24) INFO 11-18 06:40:35 [kv_cache_utils.py:1229] GPU KV cache size: 26,112 tokens
(EngineCore_DP0 pid=24) INFO 11-18 06:40:35 [kv_cache_utils.py:1234] Maximum concurrency for 2,048 tokens per request: 12.75x
(EngineCore_DP0 pid=24) INFO 11-18 06:40:35 [cpu_model_runner.py:65] Warming up model for the compilation...
(EngineCore_DP0 pid=24) INFO 11-18 06:43:59 [cpu_model_runner.py:75] Warming up done.
(EngineCore_DP0 pid=24) INFO 11-18 06:43:59 [core.py:249] init engine (profile, create kv cache, warmup model) took 204.83 seconds
(EngineCore_DP0 pid=24) WARNING 11-18 06:44:00 [cpu.py:155] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
(APIServer pid=1) [INFO] model_hosting_container_standards - decorators.py:76: [PING] Framework handler registered: ping
(APIServer pid=1) [INFO] model_hosting_container_standards.common.transforms.base_factory - base_factory.py:90: [INJECT_ADAPTER_ID] Transform decorator applied to: invocations
(APIServer pid=1) [INFO] model_hosting_container_standards.common.transforms.base_factory - base_factory.py:115: [INJECT_ADAPTER_ID] Registered transform handler for invocations
(APIServer pid=1) [INFO] model_hosting_container_standards.common.transforms.base_factory - base_factory.py:90: [STATEFUL_SESSION_MANAGER] Transform decorator applied to: decorated_func
(APIServer pid=1) [INFO] model_hosting_container_standards.common.transforms.base_factory - base_factory.py:115: [STATEFUL_SESSION_MANAGER] Registered transform handler for decorated_func
(APIServer pid=1) [INFO] model_hosting_container_standards - decorators.py:76: [INVOKE] Framework handler registered: decorated_func
(APIServer pid=1) [INFO] model_hosting_container_standards - __init__.py:156: Starting SageMaker bootstrap process
(APIServer pid=1) [INFO] model_hosting_container_standards - registry.py:109: [REGISTRY] Middleware resolution and registration complete
(APIServer pid=1) [INFO] model_hosting_container_standards - core.py:100: [MIDDLEWARE_LOADER] Middleware stack rebuilt successfully
(APIServer pid=1) [INFO] model_hosting_container_standards - core.py:102: [MIDDLEWARE_LOADER] Processed 3 middlewares
(APIServer pid=1) [WARNING] model_hosting_container_standards.common.custom_code_ref_resolver.function_loader - function_loader.py:73: Failed to load function from spec 'model:custom_sagemaker_invocation_handler': HandlerFileNotFoundError: File '/opt/ml/model/model.py' not found in search paths: ['/opt/ml/model/']
(APIServer pid=1) [WARNING] model_hosting_container_standards.common.custom_code_ref_resolver.function_loader - function_loader.py:73: Failed to load function from spec 'model:custom_sagemaker_ping_handler': HandlerFileNotFoundError: File '/opt/ml/model/model.py' not found in search paths: ['/opt/ml/model/']
(APIServer pid=1) [INFO] model_hosting_container_standards.sagemaker.sagemaker_router - sagemaker_router.py:93: Creating SageMaker router with unified route resolver
(APIServer pid=1) [INFO] model_hosting_container_standards.common.fastapi.routing - routing.py:172: Creating router with prefix='', tags=['sagemaker']
(APIServer pid=1) [INFO] model_hosting_container_standards.common.fastapi.routing - routing.py:110: Mounting 2 handlers to router
(APIServer pid=1) [INFO] model_hosting_container_standards.common.fastapi.routing - routing.py:184: Router created with 0 routes
(APIServer pid=1) [INFO] model_hosting_container_standards.sagemaker.sagemaker_router - sagemaker_router.py:101: SageMaker router created successfully with 0 routes
(APIServer pid=1) [INFO] model_hosting_container_standards.common.fastapi.routing - routing.py:287: Including router with conflict detection
(APIServer pid=1) [INFO] model_hosting_container_standards.common.fastapi.routing - routing.py:305: Successfully included router with 0 routes
(APIServer pid=1) [INFO] model_hosting_container_standards - __init__.py:168: SageMaker bootstrap completed successfully
(APIServer pid=1) INFO 11-18 06:44:01 [api_server.py:1725] Supported tasks: ['generate']
(APIServer pid=1) INFO 11-18 06:44:01 [api_server.py:2052] Starting vLLM API server 0 on http://0.0.0.0:9000
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:38] Available routes are:
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 11-18 06:44:01 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
(APIServer pid=1) INFO 11-18 06:44:41 [loggers.py:236] Engine 000: Avg prompt throughput: 1.9 tokens/s, Avg generation throughput: 0.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:44:51 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:45:01 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:45:11 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:45:21 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:45:31 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:45:41 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:45:51 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:46:01 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:46:11 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:46:21 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:46:31 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:46:41 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:46:51 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:47:01 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:47:11 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:47:21 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:47:31 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:47:41 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:47:51 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:48:01 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:48:11 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 11-18 06:48:21 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO:     172.17.0.1:34264 - "POST /v1/completions HTTP/1.1" 200 OK
[root@b314lp81 ~]# curl http://localhost:9000/v1/completions -H "Content-Type: application/json" -d '{ "model": "ibm-granite/granite-3.3-8b-instruct", "prompt": "Tell me a creative story about how artificial intelligence changed the world for the better...", "max_tokens": 150 }' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1234  100  1057    0   177      4      0  0:04:24  0:04:12  0:00:12   299
{
  "id": "cmpl-e65e31c8bc7041f0be92629c7ef4b68e",
  "object": "text_completion",
  "created": 1763449137,
  "model": "ibm-granite/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": " one of its major impacts on society is the advancement of the global healthcare system.\"\n\nTitle: The Symphony of Healing: AI's Orchestration in Global Healthcare\n\nIn the year 2050, the world had witnessed a transformative shift in the realm of healthcare, orchestrated by the harmonious melody of Artificial Intelligence (AI). This evolution was not a silent revolution, but a symphony of advancements that resonated across nations, healing the sick and mitigating global health crises.\n\nThe tale began with the AI-driven diagnostic tools that became the new stethoscopes of the",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 19,
    "total_tokens": 169,
    "completion_tokens": 150,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
---
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces significant performance improvements for the s390x architecture by fixing a BF16 byte-ordering bug and adding vectorized implementations for several math operations. The changes are generally good, but there are several opportunities for further performance optimization in the new vectorized functions. Specifically, some functions contain inefficient patterns like calling expensive operations inside loops or unnecessarily unpacking vectors to scalars. I've provided specific suggestions to improve these areas.

Comment on lines 207 to 287
  // Previous scalar fallbacks (removed by this PR):
  FP32Vec8 tanh() const {
    // TODO: Vectorize this
    AliasReg ar;
    ar.reg = reg;
    f32x4x4_t ret;
    ret.val[0][0] = std::tanh(ar.values[0]);
    ret.val[0][1] = std::tanh(ar.values[1]);
    ret.val[0][2] = std::tanh(ar.values[2]);
    ret.val[0][3] = std::tanh(ar.values[3]);
    ret.val[1][0] = std::tanh(ar.values[4]);
    ret.val[1][1] = std::tanh(ar.values[5]);
    ret.val[1][2] = std::tanh(ar.values[6]);
    ret.val[1][3] = std::tanh(ar.values[7]);
    return FP32Vec8(f32x4x2_t({ret.val[0], ret.val[1]}));
  }

  FP32Vec8 er() const {
    // TODO: Vectorize this
    AliasReg ar;
    ar.reg = reg;
    f32x4x4_t ret;
    ret.val[0][0] = std::erf(ar.values[0]);
    ret.val[0][1] = std::erf(ar.values[1]);
    ret.val[0][2] = std::erf(ar.values[2]);
    ret.val[0][3] = std::erf(ar.values[3]);
    ret.val[1][0] = std::erf(ar.values[4]);
    ret.val[1][1] = std::erf(ar.values[5]);
    ret.val[1][2] = std::erf(ar.values[6]);
    ret.val[1][3] = std::erf(ar.values[7]);
    return FP32Vec8(f32x4x2_t({ret.val[0], ret.val[1]}));
  }

  // New vectorized implementation added by this PR:
  FP32Vec8 tanh() const {
    // tanh(x) = (exp(2x) - 1) / (exp(2x) + 1)
    const __vector float one = vec_splats(1.0f);
    const __vector float two = vec_splats(2.0f);
    const __vector float zero = vec_splats(0.0f);
    const __vector float sat = vec_splats(9.0f);  // beyond this, tanh(x) ~ sign(x)

    f32x4x2_t out;

    for (int i = 0; i < 2; i++) {
      __vector float x = reg.val[i];
      __vector float ax = vec_abs(x);

      // sign(x): +1 or -1
      __vector float sign = vec_sel(vec_splats(-1.0f), one, vec_cmpgt(x, zero));

      // saturation mask: |x| > sat
      __vector __bool int saturated = vec_cmpgt(ax, sat);

      // 2x
      __vector float two_x = vec_mul(x, two);

      // Build a temporary FP32Vec8 with both lanes = 2x, reuse exp()
      f32x4x2_t tmp;
      tmp.val[0] = two_x;
      tmp.val[1] = two_x;
      FP32Vec8 exp_2x_vec(tmp);

      FP32Vec8 e2x = exp_2x_vec.exp();
      __vector float e = e2x.reg.val[i];

      // tanh(x) = (e - 1) / (e + 1)
      __vector float num = vec_sub(e, one);
      __vector float den = vec_add(e, one);

      __vector float t = vec_div(num, den);

      // For large |x|, clamp to sign(x)
      out.val[i] = vec_sel(t, sign, saturated);
    }

    return FP32Vec8(out);
  }

high

The exp() call is inside a loop, which is inefficient. It's called for each of the two lanes of FP32Vec8, and each call computes the exponential for a full temporary vector. This can be optimized by calling exp() once on the 2*x vector outside the loop.

  FP32Vec8 tanh() const {
    // tanh(x) = (exp(2x) - 1) / (exp(2x) + 1)
    const __vector float one   = vec_splats(1.0f);
    const __vector float two   = vec_splats(2.0f);
    const __vector float zero  = vec_splats(0.0f);
    const __vector float sat   = vec_splats(9.0f);  // beyond this, tanh(x) ~ sign(x)

    f32x4x2_t two_x_lanes;
    for (int i = 0; i < 2; i++) {
        two_x_lanes.val[i] = vec_mul(reg.val[i], two);
    }
    FP32Vec8 e2x = FP32Vec8(two_x_lanes).exp();

    f32x4x2_t out;
    for (int i = 0; i < 2; i++) {
        __vector float x  = reg.val[i];
        __vector float ax = vec_abs(x);

        // sign(x): +1 or -1
        __vector float sign = vec_sel(vec_splats(-1.0f), one,
                                      vec_cmpgt(x, zero));

        // saturation mask: |x| > sat
        __vector __bool int saturated = vec_cmpgt(ax, sat);

        // tanh(x) = (e - 1) / (e + 1)
        __vector float num = vec_sub(e2x.reg.val[i], one);
        __vector float den = vec_add(e2x.reg.val[i], one);

        __vector float t = vec_div(num, den);

        // For large |x|, clamp to sign(x)
        out.val[i] = vec_sel(t, sign, saturated);
    }

    return FP32Vec8(out);
}
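
For what it's worth, a quick sanity check of the vectorized path against std::tanh might look like this (a sketch; it assumes FP32Vec8's pointer-load constructor seen elsewhere in this diff and a save(float*) store, and the tolerance is illustrative):

  #include <cmath>
  #include <cstdio>

  bool check_tanh_vec8() {
    float in[8] = {-10.f, -2.5f, -0.5f, -0.01f, 0.f, 0.7f, 3.f, 12.f};
    float out[8];
    FP32Vec8 v(in);      // pointer-load constructor, as used in this file
    v.tanh().save(out);  // save() is assumed here
    for (int i = 0; i < 8; ++i) {
      if (std::fabs(out[i] - std::tanh(in[i])) > 1e-3f) {
        std::printf("mismatch at %d: %f vs %f\n", i, out[i], std::tanh(in[i]));
        return false;
      }
    }
    return true;
  }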

Comment on lines 290 to 359
  FP32Vec8 er() const {
    // A&S 7.1.26 approximation:
    // erf(x) = sign(x) * (1 - ((((a5*t + a4)*t + a3)*t + a2)*t + a1) * t * exp(-x^2))
    // t = 1 / (1 + p*|x|), p = 0.3275911

    const __vector float one = vec_splats(1.0f);
    const __vector float zero = vec_splats(0.0f);
    const __vector float p = vec_splats(0.3275911f);

    // Polynomial coeffs
    const __vector float a1 = vec_splats(0.254829592f);
    const __vector float a2 = vec_splats(-0.284496736f);
    const __vector float a3 = vec_splats(1.421413741f);
    const __vector float a4 = vec_splats(-1.453152027f);
    const __vector float a5 = vec_splats(1.061405429f);

    // Threshold where erf(x) ~ sign(x)
    const __vector float sat = vec_splats(6.0f);

    f32x4x2_t out;

    for (int lane = 0; lane < 2; lane++) {
      __vector float x = reg.val[lane];
      __vector float ax = vec_abs(x);

      // sign(x)
      __vector float sign = vec_sel(vec_splats(-1.0f), one, vec_cmpgt(x, zero));

      // |x| > 6 → erf(x) = ±1
      __vector __bool int saturated = vec_cmpgt(ax, sat);

      // t = 1 / (1 + p * |x|)
      __vector float t = vec_madd(p, ax, one);
      t = vec_div(one, t);

      // poly = a5
      __vector float poly = a5;
      poly = vec_madd(poly, t, a4);
      poly = vec_madd(poly, t, a3);
      poly = vec_madd(poly, t, a2);
      poly = vec_madd(poly, t, a1);

      // full polynomial: poly = poly * t
      poly = vec_mul(poly, t);

      // Compute exp(-x^2)
      __vector float x2 = vec_mul(x, x);
      __vector float neg_x2 = vec_neg(x2);

      f32x4x2_t tmp;
      tmp.val[0] = neg_x2;
      tmp.val[1] = neg_x2;
      FP32Vec8 exp_neg_x2(tmp);

      FP32Vec8 e = exp_neg_x2.exp();
      __vector float ex = e.reg.val[lane];

      // erf(x) = sign * (1 - poly * exp(-x^2))
      __vector float term = vec_mul(poly, ex);
      __vector float y = vec_sub(one, term);
      y = vec_mul(y, sign);

      // saturated → ±1
      __vector float sat_val = vec_mul(sign, one);
      out.val[lane] = vec_sel(y, sat_val, saturated);
    }

    return FP32Vec8(out);
  }

high

The exp() call is inside a loop, which is inefficient. It's called for each of the two lanes of FP32Vec8, and each call computes the exponential for a full temporary vector. This can be optimized by calling exp() once on the -x^2 vector outside the loop.

FP32Vec8 er() const {
    // A&S 7.1.26 approximation:
    // erf(x) = sign(x) * (1 - ((((a5*t + a4)*t + a3)*t + a2)*t + a1) * t * exp(-x^2))
    // t = 1 / (1 + p*|x|),  p = 0.3275911

    const __vector float one  = vec_splats(1.0f);
    const __vector float zero = vec_splats(0.0f);
    const __vector float p    = vec_splats(0.3275911f);

    // Polynomial coeffs
    const __vector float a1 = vec_splats(0.254829592f);
    const __vector float a2 = vec_splats(-0.284496736f);
    const __vector float a3 = vec_splats(1.421413741f);
    const __vector float a4 = vec_splats(-1.453152027f);
    const __vector float a5 = vec_splats(1.061405429f);

    // Threshold where erf(x) ~ sign(x)
    const __vector float sat = vec_splats(6.0f);

    f32x4x2_t neg_x2_lanes;
    for (int lane = 0; lane < 2; lane++) {
        __vector float x2 = vec_mul(reg.val[lane], reg.val[lane]);
        neg_x2_lanes.val[lane] = vec_neg(x2);
    }
    FP32Vec8 e = FP32Vec8(neg_x2_lanes).exp();

    f32x4x2_t out;

    for (int lane = 0; lane < 2; lane++) {
        __vector float x  = reg.val[lane];
        __vector float ax = vec_abs(x);

        // sign(x)
        __vector float sign = vec_sel(vec_splats(-1.0f), one,
                                      vec_cmpgt(x, zero));

        // |x| > 6 → erf(x) = ±1
        __vector __bool int saturated = vec_cmpgt(ax, sat);

        // t = 1 / (1 + p * |x|)
        __vector float t = vec_madd(p, ax, one);
        t = vec_div(one, t);

        // poly = a5
        __vector float poly = a5;
        poly = vec_madd(poly, t, a4);
        poly = vec_madd(poly, t, a3);
        poly = vec_madd(poly, t, a2);
        poly = vec_madd(poly, t, a1);

        // full polynomial: poly = poly * t
        poly = vec_mul(poly, t);

        __vector float ex = e.reg.val[lane];

        // erf(x) = sign * (1 - poly * exp(-x^2))
        __vector float term = vec_mul(poly, ex);
        __vector float y = vec_sub(one, term);
        y = vec_mul(y, sign);

        // saturated → ±1
        __vector float sat_val = vec_mul(sign, one);
        out.val[lane] = vec_sel(y, sat_val, saturated);
    }

    return FP32Vec8(out);
}

Comment on lines +429 to +434
  FP32Vec8 rcp() const {
    AliasReg in, out;
    in.reg = reg;

    for (int i = 0; i < VEC_ELEM_NUM; ++i) {
      out.values[i] = 1.0f / in.values[i];
    }
    return FP32Vec8(out.reg);
  }

high

The rcp function is implemented with a scalar loop, which is inefficient. It can be implemented using vectorized division for better performance.

  // Elementwise reciprocal: 1/x
  FP32Vec8 rcp() const {
    return FP32Vec8(1.0f) / (*this);
  }

Comment on lines 599 to 603
  float reduce_max() const {
    AliasReg ar;
    ar.reg = reg;
    float result = ar.values[0];
    unroll_loop<int, VEC_ELEM_NUM>([&result, &ar](int i) {
      if (ar.values[i] > result) result = ar.values[i];
    });
    return result;
  }

high

The reduce_max function is implemented with a scalar loop. This can be optimized by using vector instructions to perform a parallel reduction.

  float reduce_max() const {
    __vector float v01 = vec_max(reg.val[0], reg.val[1]);
    __vector float v23 = vec_max(reg.val[2], reg.val[3]);
    __vector float v = vec_max(v01, v23);
    v = vec_max(v, vec_sld(v, v, 8));
    v = vec_max(v, vec_sld(v, v, 4));
    return vec_extract(v, 0);
  }

Comment on lines +820 to +823
  for (; i + FP32Vec8::VEC_ELEM_NUM <= n; i += FP32Vec8::VEC_ELEM_NUM) {
    FP32Vec8 v(input + i);
    FP32Vec8::AliasReg ar;
    ar.reg = v.reg;
    for (int j = 0; j < FP32Vec8::VEC_ELEM_NUM; ++j) {
      if (ar.values[j] > max_val) max_val = ar.values[j];
    }
  }

high

The loop to find the maximum value is inefficient as it unpacks the vector and finds the maximum in a scalar loop. This pattern is repeated in other parts of softmax_fp32vec8 and rmsnorm_fp32vec8. To improve performance, these loops should be fully vectorized. For finding the max, you can use vector max operations and a final horizontal max reduction. This would require adding max() and reduce_max() methods to FP32Vec8, similar to what you've done for FP32Vec16.
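
A minimal sketch of the two helpers the comment asks for, assuming FP32Vec8 keeps two 4-float lanes in reg.val as elsewhere in this file:

  FP32Vec8 max(const FP32Vec8& b) const {
    f32x4x2_t out;
    out.val[0] = vec_max(reg.val[0], b.reg.val[0]);
    out.val[1] = vec_max(reg.val[1], b.reg.val[1]);
    return FP32Vec8(out);
  }

  float reduce_max() const {
    __vector float v = vec_max(reg.val[0], reg.val[1]);  // pairwise lane max
    v = vec_max(v, vec_sld(v, v, 8));                    // fold 4 -> 2
    v = vec_max(v, vec_sld(v, v, 4));                    // fold 2 -> 1
    return vec_extract(v, 0);
  }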

Comment on lines +895 to +898
  for (; i + FP32Vec8::VEC_ELEM_NUM <= n; i += FP32Vec8::VEC_ELEM_NUM) {
    FP32Vec8 x_vec(input + i);

    FP32Vec8 sq = x_vec * x_vec;

    FP32Vec8::AliasReg ar;
    ar.reg = sq.reg;
    for (int j = 0; j < FP32Vec8::VEC_ELEM_NUM; ++j) {
      sum_sq += ar.values[j];
    }
  }

high

This loop to compute the sum of squares is inefficient because it unpacks the vector sq to sum its elements in a scalar loop. This can be optimized by accumulating the sum in a vector and then doing a final horizontal sum (reduce_sum) after the loop.
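
A sketch of the accumulation this describes (note the summation order changes, which is normally acceptable for an rmsnorm reduction); it assumes the same FP32Vec8 layout and intrinsics already used in this file:

  __vector float acc = vec_splats(0.0f);
  size_t i = 0;
  for (; i + FP32Vec8::VEC_ELEM_NUM <= n; i += FP32Vec8::VEC_ELEM_NUM) {
    FP32Vec8 x_vec(input + i);
    acc = vec_madd(x_vec.reg.val[0], x_vec.reg.val[0], acc);  // acc += x0*x0
    acc = vec_madd(x_vec.reg.val[1], x_vec.reg.val[1], acc);  // acc += x1*x1
  }
  // Horizontal sum of the 4-lane accumulator after the loop.
  acc = vec_add(acc, vec_sld(acc, acc, 8));
  acc = vec_add(acc, vec_sld(acc, acc, 4));
  float sum_sq = vec_extract(acc, 0);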


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +844 to 853
#if !defined(__powerpc__) && !defined(__s390x__)
template <>
struct VecTypeTrait<c10::Half> {
  using vec_t = vec_op::FP16Vec16;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Restore VecTypeTrait<c10::Half> for s390x

The new guard in VecTypeTrait<c10::Half> (#if !defined(__powerpc__) && !defined(__s390x__)) now completely removes the specialization whenever __s390x__ is defined. The CPU attention kernels (cpu_attn_vec.hpp/cpu_attn_vec16.hpp) still instantiate VecTypeTrait<c10::Half>::vec_t when building float16 variants, so on s390 builds this alias now resolves to void and those translation units fail to compile, effectively dropping FP16 CPU support even though FP16Vec16 is still defined (aliased to FP32Vec16). The specialization should remain and use the FP32 alias instead of being disabled for s390.
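
A sketch of the fix being suggested (assuming, as the comment says, that the s390x header already aliases FP16Vec16 to FP32Vec16): keep the specialization on all targets instead of compiling it out.

  // Keep the specialization everywhere; on s390x/powerpc FP16Vec16 is
  // already an alias of FP32Vec16, so vec_t resolves to the FP32 path.
  template <>
  struct VecTypeTrait<c10::Half> {
    using vec_t = vec_op::FP16Vec16;
  };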


Comment on lines 214 to 220
__vector float y = vec_mul(x, log2e);

__vector float kf = vec_floor(y);
__vector float r = vec_sub(y, kf);

__vector signed int k = vec_signed(kf);
const __vector signed int min_k = vec_splats((signed int)-126);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: vec_signed intrinsic is undefined

FP32Vec8::exp() now calls vec_signed(kf) (lines 214‑220) to turn the floored exponent into an integer, but neither this file nor the toolchain headers provide a definition for vec_signed (it does not appear anywhere else in the repo). As written the file will not compile because the compiler cannot resolve this intrinsic. The conversion needs to use an existing helper (e.g. vec_cts or an explicit cast) instead of referencing an undefined function.
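
One way the conversion could be written without that intrinsic (a sketch; element subscripting on vector types is a GCC/Clang extension, and kf has already been through vec_floor, so truncation is exact here):

  __vector signed int k;
  for (int j = 0; j < 4; ++j) {
    k[j] = static_cast<signed int>(kf[j]);  // exact: kf holds whole numbers
  }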


@R3hankhan123 R3hankhan123 force-pushed the bf16-fix-s390x branch 4 times, most recently from 540a570 to 3df78d2 Compare November 19, 2025 06:48
@R3hankhan123 R3hankhan123 changed the title [CPU][IBM Z] Fix BF16 support and vectorize math operations for s390 [CPU][IBM Z] Fix BF16 support and vectorize math operations for s390x Nov 19, 2025
@R3hankhan123 R3hankhan123 force-pushed the bf16-fix-s390x branch 3 times, most recently from dcd9b63 to efe7993 Compare November 20, 2025 06:49
… (VXE)

- Fix BF16 byte ordering for big-endian architecture
- Vectorize exp(), tanh(), erf() functions with polynomial approximations
- Add FMA intrinsics (fma, fms, nfma, nfms) using vec_madd/vec_msub
- Improve BF16 rounding with round-to-nearest-even
- Fix prefetch implementation
- Add sigmoid, gelu_tanh, gelu_erf, rcp, rsqrt operations
- Implement softmax_fp32vec8 and rmsnorm_fp32vec8 kernels
- Fix FP16 support by aliasing to FP32Vec16
- Exclude s390x from FP16 vector trait in cpu_attn_impl.hpp

Signed-off-by: Rehan Khan <[email protected]>
@bigPYJ1151 bigPYJ1151 enabled auto-merge (squash) November 24, 2025 11:04
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 24, 2025
@bigPYJ1151 bigPYJ1151 merged commit 4de8786 into vllm-project:main Nov 24, 2025
19 checks passed
lpapavassiliou pushed a commit to lpapavassiliou/vllm that referenced this pull request Nov 24, 2025
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
MatthewBonanni pushed a commit to MatthewBonanni/vllm that referenced this pull request Nov 24, 2025
@R3hankhan123 R3hankhan123 deleted the bf16-fix-s390x branch November 25, 2025 14:56
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
charlotte12l pushed a commit to charlotte12l/vllm that referenced this pull request Dec 5, 2025
Zhathw pushed a commit to Zhathw/vllm that referenced this pull request Dec 6, 2025
