@yt0428
Contributor

@yt0428 yt0428 commented Oct 26, 2025

Purpose

Add support for openPangu_Ultra_MoE models
FIX #27019

Test Plan

Test for openPangu-Ultra-MoE-718B-V1.1

Start serving:

vllm serve $LOCAL_CKPT_DIR/openPangu-Ultra-MoE-718B-V1.1 \
    --data-parallel-size 4 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank $NODE_RANK \
    --data-parallel-address $MASTER_NODE_IP \
    --data-parallel-rpc-port 13389 \
    --tensor-parallel-size 8 \
    --served-model-name pangu_ultra_moe \
    --enable-expert-parallel \
    --trust-remote-code

Test for openPangu-Embedded-7B-V1.1

Start serving:

Master node:
vllm serve FreedomIntelligence/openPangu-Embedded-7B-V1.1 \
    --host 0.0.0.0 \
    --port 8000 \
    --max-num-batched-tokens 32768 \
    --max-model-len 32768 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --served-model-name pangu \
    --tensor-parallel-size 8 \
    --data-parallel-size 4 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank $NODE_RANK \
    --data-parallel-address $MASTER_NODE_IP \
    --data-parallel-rpc-port 13345
Other nodes:
vllm serve FreedomIntelligence/openPangu-Embedded-7B-V1.1 \
    --host 0.0.0.0 \
    --port 8000 \
    --headless \
    --max-num-batched-tokens 32768 \
    --max-model-len 32768 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --served-model-name pangu \
    --tensor-parallel-size 8 \
    --data-parallel-size 4 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank $NODE_RANK \
    --data-parallel-address $MASTER_NODE_IP \
    --data-parallel-rpc-port 13345
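
The only difference between the two commands is the `--headless` flag on the non-master nodes. A minimal sketch of generating the argument list per node is shown here; `build_serve_args` is a hypothetical helper, and treating rank 0 as the master node is an assumption, not something the test plan states.

```python
def build_serve_args(node_rank: int, master_ip: str) -> list[str]:
    """Build the `vllm serve` argument list for one node of the 4-node test.

    Assumption: the node with rank 0 is the master; all other nodes run headless.
    """
    args = [
        "vllm", "serve", "FreedomIntelligence/openPangu-Embedded-7B-V1.1",
        "--host", "0.0.0.0",
        "--port", "8000",
    ]
    if node_rank != 0:
        # Non-master nodes do not start an API server.
        args.append("--headless")
    args += [
        "--max-num-batched-tokens", "32768",
        "--max-model-len", "32768",
        "--trust-remote-code",
        "--gpu-memory-utilization", "0.9",
        "--served-model-name", "pangu",
        "--tensor-parallel-size", "8",
        "--data-parallel-size", "4",
        "--data-parallel-size-local", "1",
        "--data-parallel-start-rank", str(node_rank),
        "--data-parallel-address", master_ip,
        "--data-parallel-rpc-port", "13345",
    ]
    return args


# Example: print the command line for the node with rank 1.
print(" ".join(build_serve_args(1, "10.0.0.1")))
```

The resulting list can be passed directly to `subprocess.run` on each node.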

Test Result

Results for openPangu-Ultra-MoE-718B-V1.1

MATH500: 97.6

Results for openPangu-Embedded-7B-V1.1

Request test

python3 - <<'EOF'
import requests

response = requests.post(
    'http://localhost:8000/v1/chat/completions',
    headers={'Content-Type': 'application/json'},
    json={
        'model': 'pangu',
        'temperature': 0.6,
        'top_p': 0.95,
        'max_tokens': 500,
        'messages': [
            {
                'role': 'user',
                'content': 'Let $S$ be the set of points $(a,b)$ with $0 \\le a,$ $b \\le 1$ such that the equation\n\\[x^4 + ax^3 - bx^2 + ax + 1 = 0\\]has at least one real root. Determine the area of the graph of $S.$',
            }
        ],
    },
)
result = response.json()
# Print the first completion if the server returned one.
if 'choices' in result and result['choices']:
    print(result['choices'][0]['message']['content'])
else:
    print('No response')
EOF

Response Correctness

`
[unused16] The answer is ( \frac{2x + 1}{2} ), which matches the expected value ( \frac{2x + 1}{2} ).

So the final answer is ( \frac{2x + 1}{2} ).

But wait—there's a second way to compute the the number of real roots by checking all the possible combinations of x and y, but only counting those where the equation holds and there's at least one real root. So in the code, we can do:

count = 0
for x in x_list:
    for y in y_list:
        if (x^4 + ax^3 - b*x^2 + a*x + 1) == 0 and (x^4 + ax^3 - b*x^2 + a*x + 1) == x^2 + y^2:
            count += 1

But that's O(n*m), which is acceptable for small n and m......
`

Inference throughput

The throughput is ~180 tokens/s.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@mergify

mergify bot commented Oct 26, 2025

Documentation preview: https://vllm--27521.org.readthedocs.build/en/27521/

@mergify mergify bot added the documentation, new-model, speculative-decoding, and v1 labels Oct 26, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for the openPangu_Ultra_MoE model. The changes include a new model implementation file and updates to various configuration and registry files to integrate the new model. The implementation appears to be largely adapted from the existing deepseek_v2 model.

I've identified a critical issue in the scaling logic within the OpenPanguMoE module, which seems to have been carried over from the deepseek_v2 implementation. This logic flaw could lead to incorrect computations, particularly in float16 precision, potentially affecting the model's output. A detailed comment with a suggested fix is provided below. The other changes appear to be correct and consistent with adding a new model to the framework.

@Bye-legumes

Hi, can you give us (me and https://github.com/kcmnd) access to your fork repo? We tested it and it doesn't work right now. We can fix some of the code.

| `OLMoEForCausalLM` | OLMoE | `allenai/OLMoE-1B-7B-0924`, `allenai/OLMoE-1B-7B-0924-Instruct`, etc. | | ✅︎ |
| `OPTForCausalLM` | OPT, OPT-IML | `facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc. | ✅︎ | ✅︎ |
| `OrionForCausalLM` | Orion | `OrionStarAI/Orion-14B-Base`, `OrionStarAI/Orion-14B-Chat`, etc. | | ✅︎ |
| `PanguUltraMoEForCausalLM` |openpangu-ultra-moe-718b-model | | ✅︎ | ✅︎ |
Collaborator

Does this model have a publicly accessible link?

Contributor Author

There is a publicly accessible version at https://ai.gitcode.com/ascend-tribe/openPangu-Ultra-MoE-718B-V1.1. However, it has not been uploaded to Hugging Face yet. The config file in that repo needs to be modified to align with common practice in vLLM, so I have mainly tested the model in my local environments, where it works well.

This comment was marked as resolved.

Contributor Author

The model will be uploaded to Hugging Face soon :)

@yt0428
Contributor Author

yt0428 commented Oct 29, 2025

> Hi, can you give us (me and https://github.com/kcmnd) access to your fork repo? We tested it and it doesn't work right now. We can fix some of the code.

Sure, I have sent the invitation. By the way, the reason it isn't working may be the config file issue, as I mentioned above.

@Kishanthan

> > Hi, can you give us (me and https://github.com/kcmnd) access to your fork repo? We tested it and it doesn't work right now. We can fix some of the code.
>
> Sure, I have sent the invitation. By the way, the reason it isn't working may be the config file issue, as I mentioned above.

Could you also share your config.json file for the above changes? We are currently using the following and we could not load the model with the changes suggested in this PR.

{
  "architectures": [
    "PanguUltraMoEForCausalLM"
  ],
  "attention_bias": false,
  "auto_map": {
    "AutoConfig": "configuration_openpangu_moe.PanguUltraMoEConfig",
    "AutoModel": "modeling_openpangu_moe.PanguUltraMoEModel",
    "AutoModelForCausalLM": "modeling_openpangu_moe.PanguUltraMoEForCausalLM"
  },
  "num_dense_layers": 3,
  "bos_token_id": 0,
  "eos_token_id": 1,
  "ep_size": 1,
  "first_k_dense_replace": 3,
  "hidden_act": "silu",
  "hidden_size": 7680,
  "initializer_range": 0.02,
  "intermediate_size": 18432,
  "kv_lora_rank": 512,
  "attention_kv_lora_dim": 512,
  "max_position_embeddings": 131072, 
  "model_type": "pangu_ultra_moe",
  "moe_intermediate_size": 2048,
  "num_routed_experts": 256,
  "num_shared_experts": 1,
  "moe_layer_freq": 1,
  "n_group": 8,
  "n_routed_experts": 256,
  "n_shared_experts": 1,
  "norm_topk_prob": true,
  "num_attention_heads": 128,
  "num_experts_per_tok": 8,
  "num_hidden_layers": 62,
  "num_key_value_heads": 128,
  "num_nextn_predict_layers": 1,
  "q_lora_rank": 1536,
  "qk_nope_head_dim": 128,
  "qk_rope_head_dim": 64,
  "quantization_config": {
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "quant_method": "fp8",
    "weight_block_size": [
      128,
      128
    ]
  },
  "rope_scaling": {
    "beta_fast": 32,
    "beta_slow": 1,
    "factor": 40,
    "mscale": 1.0,
    "mscale_all_dim": 1.0,
    "original_max_position_embeddings": 4096,
    "type": "yarn"
  },
  "num_mtp_layers": 1,
  "attention_q_lora_dim": 1536,
  "attention_qk_dim": 128,
  "attention_qk_rope_dim": 64,
  "rms_norm_eps": 1e-05,
  "rope_theta": 25600000,
  "routed_scaling_factor": 2.5,
  "sandwich_norm": true,
  "tie_word_embeddings": false,
  "topk_group": 4,
  "topk_method": "noaux_tc",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.2",
  "use_cache": true,
  "attention_v_dim": 128,
  "v_head_dim": 128,
  "vocab_size": 153600
}


class OpenPanguForCausalLM(nn.Module, SupportsPP, MixtureOfExperts, SupportsLoRA):
    packed_modules_mapping = {
        "gate_up_proj": ["gate_proj", "up_proj"],
Contributor

I noticed that QKVParallelLinear is used to create the self.qkv_proj layer. Why don't we add the mapping here?

Contributor Author

Yes, you are right! We should add the mapping for qkv_proj here.
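
A minimal sketch of what adding that mapping could look like. The `q_proj`/`k_proj`/`v_proj` shard names are an assumption based on the usual naming convention, and `packed_name_for` is a hypothetical helper for illustration only; neither is taken from the PR.

```python
# Sketch only: the shard names listed under "qkv_proj" are assumed, not taken from the PR.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def packed_name_for(shard_name: str) -> str:
    """Return the packed module that a checkpoint shard is fused into."""
    for packed, shards in packed_modules_mapping.items():
        if shard_name in shards:
            return packed
    # Modules that are not packed keep their own name.
    return shard_name


print(packed_name_for("k_proj"))
print(packed_name_for("down_proj"))
```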

else:
    shared_output = None
    final_hidden_states = fused_moe_out

Contributor

Let's check shared_output to ensure it is not None when self.shared_experts is not None.

Contributor Author

Good point! We will add an assertion here.
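
A rough sketch of the assertion being discussed. It assumes the fused MoE call returns a `(shared_output, final_hidden_states)` pair when shared experts are configured; that return shape and the helper name `split_moe_output` are assumptions for illustration, not the PR's actual code.

```python
def split_moe_output(fused_moe_out, has_shared_experts: bool):
    """Unpack the fused MoE result into (shared_output, final_hidden_states).

    Assumption: with shared experts the fused layer returns a 2-tuple,
    otherwise it returns the routed output directly.
    """
    if has_shared_experts:
        shared_output, final_hidden_states = fused_moe_out
        # Guard from the review: shared experts exist, so their output must too.
        assert shared_output is not None, (
            "shared_output is None although shared experts are configured"
        )
    else:
        shared_output = None
        final_hidden_states = fused_moe_out
    return shared_output, final_hidden_states


print(split_moe_output(("shared", "routed"), True))
print(split_moe_output("routed", False))
```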

Comment on lines 205 to 206
self.tp_size = get_tensor_model_parallel_world_size()
self.tp_rank = get_tp_group().rank_in_group
Contributor

Let's move this to line 136 to keep the parallelism-related parameters in one place.

Contributor Author

Done!

self.num_heads = self.total_num_heads // tp_size
self.total_num_kv_heads = num_kv_heads
if (
    self.total_num_kv_heads >= tp_size
Contributor

Suggested change
- self.total_num_kv_heads >= tp_size
+ self.total_num_kv_heads > tp_size

Contributor Author

Good point! self.total_num_kv_heads cannot be equal to tp_size in this condition.

elif (
    self.total_num_kv_heads < tp_size and tp_size % self.total_num_kv_heads != 0
):
    # Number of KV heads is less than TP size, so we replicate
Contributor

Is this replication a TODO?

Contributor Author

No. When the number of KV heads is less than the TP size, we can simply set self.num_kv_heads to 1. The `QKVParallelLinear` module does the replication automatically, as explained in the description of `QKVParallelLinear`.

Contributor

Thanks for the details.
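
The head-partitioning rule discussed in this thread can be sketched as a standalone function. `kv_heads_per_rank` is a hypothetical name, and raising `ValueError` is an assumption about how the invalid cases are handled; the replication itself is left to the linear layer, as described above.

```python
def kv_heads_per_rank(total_num_kv_heads: int, tp_size: int) -> int:
    """Number of KV heads each tensor-parallel rank holds (sketch)."""
    if total_num_kv_heads > tp_size:
        # More KV heads than ranks: they must split evenly across ranks.
        if total_num_kv_heads % tp_size != 0:
            raise ValueError("total_num_kv_heads must be divisible by tp_size")
        return total_num_kv_heads // tp_size
    if total_num_kv_heads < tp_size and tp_size % total_num_kv_heads != 0:
        # Fewer KV heads than ranks: each head must map to a whole number of ranks.
        raise ValueError("tp_size must be divisible by total_num_kv_heads")
    # Equal to or fewer than the TP size: one KV head per rank.
    return 1


print(kv_heads_per_rank(128, 8))
print(kv_heads_per_rank(2, 8))
```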

config.hidden_size, eps=config.rms_norm_eps
)
self.tp_group = get_tp_group().device_group
if getattr(config, "sandwich_norm", False):
Contributor

We could just set self.sandwich_norm = getattr(config, "sandwich_norm", False) and create the pre- and post-MLP layers only when self.sandwich_norm is True.

Contributor Author

Yes! We can make this adjustment to make the code simpler.
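
A small sketch of the suggested pattern, using stand-in classes so it runs without vLLM or PyTorch. `NormStub`, `DecoderLayerSketch`, and the attribute names `pre_mlp_layernorm` and `post_mlp_layernorm` are hypothetical and chosen only for illustration.

```python
from types import SimpleNamespace


class NormStub:
    """Stand-in for an RMSNorm layer; it only records its constructor arguments."""

    def __init__(self, hidden_size, eps):
        self.hidden_size = hidden_size
        self.eps = eps


class DecoderLayerSketch:
    def __init__(self, config):
        # Read the flag once and reuse it wherever the layer needs it.
        self.sandwich_norm = getattr(config, "sandwich_norm", False)
        if self.sandwich_norm:
            # Hypothetical attribute names for the extra norms around the MLP.
            self.pre_mlp_layernorm = NormStub(config.hidden_size, eps=config.rms_norm_eps)
            self.post_mlp_layernorm = NormStub(config.hidden_size, eps=config.rms_norm_eps)


with_norm = DecoderLayerSketch(
    SimpleNamespace(hidden_size=7680, rms_norm_eps=1e-5, sandwich_norm=True)
)
without_norm = DecoderLayerSketch(SimpleNamespace(hidden_size=7680, rms_norm_eps=1e-5))
print(with_norm.sandwich_norm, hasattr(without_norm, "pre_mlp_layernorm"))
```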

Comment on lines +173 to +179
if hf_config.model_type in ("pangu_ultra_moe"):
    hf_config.model_type = "pangu_ultra_moe_mtp"
if hf_config.model_type == "pangu_ultra_moe_mtp":
    n_predict = getattr(hf_config, "num_nextn_predict_layers", None)
    hf_config.update(
        {"n_predict": n_predict, "architectures": ["OpenPanguMTPModel"]}
    )
Member

@hmellor hmellor Oct 29, 2025

Could this override be done in vllm/model_executor/models/openpangu.py and vllm/model_executor/models/openpangu_mtp.py instead of putting model specific config overrides in the global configs?

Member

Hmm I see this is currently done for quite a few models... We should do this in a follow up

Contributor Author

I followed the common practice (as in qwen3_next_mtp and longcat_flash_mtp) and placed the MTP override in vllm/config/speculative.py. I also thought about moving the override to the model definition file, but the MTP initialization workflow seems to make that infeasible. Could you please suggest how to implement it?

Member

For this PR, please follow the existing pattern as you have already done. Refactoring MTP config is a separate task.

Contributor Author

Cool! Looking forward to it.
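
For reference, the override under discussion can be exercised in isolation with a stub config. `ConfigStub` and `override_for_mtp` are hypothetical stand-ins, not vLLM or Transformers classes. The sketch writes the membership test with a one-element tuple (note the trailing comma), because a parenthesised string alone is still a string and `in` would then do a substring match.

```python
class ConfigStub:
    """Minimal stand-in for a Hugging Face config object (illustration only)."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def update(self, values: dict) -> None:
        self.__dict__.update(values)


def override_for_mtp(hf_config):
    """Rewrite a base Pangu Ultra MoE config so that it selects the MTP model."""
    # One-element tuple: the trailing comma makes this a real membership test.
    if hf_config.model_type in ("pangu_ultra_moe",):
        hf_config.model_type = "pangu_ultra_moe_mtp"
    if hf_config.model_type == "pangu_ultra_moe_mtp":
        n_predict = getattr(hf_config, "num_nextn_predict_layers", None)
        hf_config.update(
            {"n_predict": n_predict, "architectures": ["OpenPanguMTPModel"]}
        )
    return hf_config


cfg = override_for_mtp(
    ConfigStub(model_type="pangu_ultra_moe", num_nextn_predict_layers=1)
)
print(cfg.model_type, cfg.n_predict, cfg.architectures)
```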

@yt0428
Contributor Author

yt0428 commented Oct 30, 2025

> Could you also share your config.json file for the above changes?

Yes, the config file should be:

{
  "architectures": [
    "PanguUltraMoEForCausalLM"
  ],
  "attention_bias": false,
  "auto_map": {
    "AutoConfig": "configuration_openpangu_moe.PanguUltraMoEConfig",
    "AutoModel": "modeling_openpangu_moe.PanguUltraMoEModel",
    "AutoModelForCausalLM": "modeling_openpangu_moe.PanguUltraMoEForCausalLM"
  },
  "first_k_dense_replace": 3,
  "hidden_act": "silu",
  "hidden_size": 7680,
  "initializer_range": 0.02,
  "intermediate_size": 18432,
  "kv_lora_rank": 512,
  "max_position_embeddings": 131072, 
  "model_type": "pangu_ultra_moe",
  "moe_intermediate_size": 2048,
  "n_routed_experts": 256,
  "n_shared_experts": 1,
  "norm_topk_prob": true,
  "num_attention_heads": 128,
  "num_experts_per_tok": 8,
  "num_hidden_layers": 61,
  "num_key_value_heads": 128,
  "num_nextn_predict_layers": 1,
  "q_lora_rank": 1536,
  "qk_nope_head_dim": 128,
  "qk_rope_head_dim": 64,
  "rms_norm_eps": 1e-05,
  "rope_theta": 25600000,
  "routed_scaling_factor": 2.5,
  "sandwich_norm": true,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.2",
  "use_cache": true,
  "v_head_dim": 128,
  "vocab_size": 153600
}

You should also change the names in configuration_openpangu_moe.py so that they correspond to the new config file.
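
To see which keys differ between the config that failed to load and the one above, a small comparison helper may be useful. This is a sketch: `diff_configs` is a hypothetical helper, and the two dictionaries below contain only a few keys copied from the configs in this thread rather than the full files.

```python
def diff_configs(current: dict, expected: dict) -> dict:
    """Report keys present on one side only and keys whose values differ."""
    shared = current.keys() & expected.keys()
    return {
        "only_in_current": sorted(current.keys() - expected.keys()),
        "only_in_expected": sorted(expected.keys() - current.keys()),
        "changed": {
            key: (current[key], expected[key])
            for key in sorted(shared)
            if current[key] != expected[key]
        },
    }


# A few keys taken from the two config.json files quoted in this thread.
current = {"num_hidden_layers": 62, "num_dense_layers": 3, "first_k_dense_replace": 3}
expected = {"num_hidden_layers": 61, "first_k_dense_replace": 3}
report = diff_configs(current, expected)
print(report)
```

With the real files, each dictionary would come from `json.load` on the corresponding config.json.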

Signed-off-by: yuantao <[email protected]>
@Bye-legumes

Can you share the whole process, with documentation on how to use it? We have run into some problems and are trying to find which step is wrong, e.g. how to use the Pangu model, how to configure vLLM, and how to apply the current PR as a patch. Thanks! We still cannot run it with TP8 and PP4 because it hangs.

@Kishanthan

Kishanthan commented Oct 30, 2025

> Can you share the whole process, with documentation on how to use it? We have run into some problems and are trying to find which step is wrong, e.g. how to use the Pangu model, how to configure vLLM, and how to apply the current PR as a patch. Thanks! We still cannot run it with TP8 and PP4 because it hangs.

What we are trying to do is load this model with TP 8 and PP 4 on 4 nodes (32 H100 cards). We are not using DP. The model weights load fine on all four nodes, but when we send a request, inference fails with the error below. Our observation is that, although the weights are loaded on all nodes, only one node was seen using its GPUs for inference while the other nodes were idle. After some time, the request times out with the CGraph timeout error below.

(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.1rc5.dev3+gd2c33c397) with config: model='/home/original_models/openPangu-Ultra-MoE-718B-model', speculative_config=None, tokenizer='/home/original_models/openPangu-Ultra-MoE-718B-model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=4, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/original_models/openPangu-Ultra-MoE-718B-model, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 
'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 16, 'local_cache_dir': None}, 
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-838f640e-c966-48e6-ba3f-dd874733c65a'], resumed_from_preemption=[false], new_token_ids=[[45974]], resumed_req_token_ids=[null], new_block_ids=[null], num_computed_tokens=[15], num_output_tokens=[1]), num_scheduled_tokens={chatcmpl-838f640e-c966-48e6-ba3f-dd874733c65a: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=[], free_encoder_mm_hashes=[], structured_output_request_ids=[], grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=4.771675335213388e-05, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, spec_decoding_stats=None, kv_connector_stats=None, num_corrupted_reqs=0)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] Traceback (most recent call last):
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/dag/compiled_dag_node.py", line 2525, in _execute_until
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     result = self._dag_output_fetcher.read(timeout)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 312, in read
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     outputs = self._read_list(timeout)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]               ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 403, in _read_list
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     raise e
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 385, in _read_list
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     result = c.read(min(remaining_timeout, iteration_timeout))
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 776, in read
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     return self._channel_dict[self._resolve_actor_id()].read(timeout)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 480, in read
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     ret = self._worker.get_objects(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]           ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 998, in get_objects
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     ] = self.core_worker.get_objects(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "python/ray/_raylet.pyx", line 3141, in ray._raylet.CoreWorker.get_objects
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "python/ray/includes/common.pxi", line 120, in ray._raylet.check_status
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read. ObjectID: 005fed9a4d0da286e8b79c779c6273685d7b963e0200000002e1f505
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] 
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] 
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] Traceback (most recent call last):
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 772, in run_engine_core
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 799, in run_busy_loop
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     self._process_engine_step()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 828, in _process_engine_step
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 382, in step_with_batch_queue
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     model_output = future.result()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_utils.py", line 149, in result
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     return self.refs[0].get()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]            ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/compiled_dag_ref.py", line 115, in get
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     self._dag._execute_until(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/dag/compiled_dag_node.py", line 2535, in _execute_until
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     raise RayChannelTimeoutError(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] ray.exceptions.RayChannelTimeoutError: System error: If the execution is expected to take a long time, increase RAY_CGRAPH_get_timeout which is currently 300 seconds. Otherwise, this may indicate that the execution is hanging.
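
The last log line names `RAY_CGRAPH_get_timeout` as the setting to raise if the step is merely slow. A minimal sketch of preparing an environment with a larger value before launching the server follows; `env_with_cgraph_timeout` is a hypothetical helper and 600 seconds is an arbitrary example value. This would not help if the execution is genuinely hanging, which the idle nodes may indicate.

```python
import os


def env_with_cgraph_timeout(seconds: int, base_env=None) -> dict:
    """Return a copy of the environment with RAY_CGRAPH_get_timeout set."""
    env = dict(os.environ if base_env is None else base_env)
    # The variable name comes from the error message; the value is in seconds.
    env["RAY_CGRAPH_get_timeout"] = str(seconds)
    return env


launch_env = env_with_cgraph_timeout(600)
print(launch_env["RAY_CGRAPH_get_timeout"])
```

The returned dictionary can be passed as `env=` to `subprocess.Popen` when starting the server process.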

@yt0428
Contributor Author

yt0428 commented Oct 31, 2025

Can you share the whole process with the doc on how to use it? We have met some problems and trying to find which step is wrong. like how to use the pangu model and how to config the vllm and patch with current PR. Thanks! We still cannot run it with TP8 and PP4 as it will hang.

What we are trying is to load this model with TP 8 and PP 4 on 4 nodes (32 H100 cards). We are not using DP. The model weights loads fine on all for nodes but when sending a request the inference fails with the below error. Our observation is that, though the weights are loaded on all nodes, only one node were seen using the GPU for inference while other nodes were idle. And after sometime, the request timeouts with the below CCGraph timeout error.

(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.1rc5.dev3+gd2c33c397) with config: model='/home/original_models/openPangu-Ultra-MoE-718B-model', speculative_config=None, tokenizer='/home/original_models/openPangu-Ultra-MoE-718B-model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=4, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/original_models/openPangu-Ultra-MoE-718B-model, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 
'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 16, 'local_cache_dir': None}, 
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-838f640e-c966-48e6-ba3f-dd874733c65a'], resumed_from_preemption=[false], new_token_ids=[[45974]], resumed_req_token_ids=[null], new_block_ids=[null], num_computed_tokens=[15], num_output_tokens=[1]), num_scheduled_tokens={chatcmpl-838f640e-c966-48e6-ba3f-dd874733c65a: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=[], free_encoder_mm_hashes=[], structured_output_request_ids=[], grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=4.771675335213388e-05, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, spec_decoding_stats=None, kv_connector_stats=None, num_corrupted_reqs=0)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] Traceback (most recent call last):
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/dag/compiled_dag_node.py", line 2525, in _execute_until
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     result = self._dag_output_fetcher.read(timeout)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 312, in read
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     outputs = self._read_list(timeout)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]               ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 403, in _read_list
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     raise e
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 385, in _read_list
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     result = c.read(min(remaining_timeout, iteration_timeout))
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 776, in read
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     return self._channel_dict[self._resolve_actor_id()].read(timeout)
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 480, in read
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     ret = self._worker.get_objects(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]           ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 998, in get_objects
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     ] = self.core_worker.get_objects(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "python/ray/_raylet.pyx", line 3141, in ray._raylet.CoreWorker.get_objects
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "python/ray/includes/common.pxi", line 120, in ray._raylet.check_status
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read. ObjectID: 005fed9a4d0da286e8b79c779c6273685d7b963e0200000002e1f505
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] 
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] 
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] Traceback (most recent call last):
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 772, in run_engine_core
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 799, in run_busy_loop
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     self._process_engine_step()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 828, in _process_engine_step
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 382, in step_with_batch_queue
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     model_output = future.result()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_utils.py", line 149, in result
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     return self.refs[0].get()
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]            ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/compiled_dag_ref.py", line 115, in get
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     self._dag._execute_until(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]   File "/usr/local/lib/python3.12/dist-packages/ray/dag/compiled_dag_node.py", line 2535, in _execute_until
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781]     raise RayChannelTimeoutError(
(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] ray.exceptions.RayChannelTimeoutError: System error: If the execution is expected to take a long time, increase RAY_CGRAPH_get_timeout which is currently 300 seconds. Otherwise, this may indicate that the execution is hanging.
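
The prefixed traceback above is easier to read once the per-line logger prefix is removed. The following is a minimal Python sketch, not part of vLLM or Ray, that assumes every line has the shape `(process pid=N) LEVEL MM-DD HH:MM:SS [file:line] message` seen in this log. `strip_prefix` and `exception_lines` are hypothetical helper names introduced only for illustration.

```python
import re

# Prefix shape assumed from the log above, e.g.
# "(EngineCore_DP0 pid=771) ERROR 10-29 12:11:52 [core.py:781] "
LOG_PREFIX = re.compile(
    r"^\((?P<proc>[^)]*)\)\s+"                            # (EngineCore_DP0 pid=771)
    r"(?P<level>[A-Z]+)\s+"                               # ERROR
    r"(?P<date>\d{2}-\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"  # 10-29 12:11:52
    r"\[(?P<src>[^\]]+)\] ?"                              # [core.py:781] plus one separator space
)

# An unindented "package.SomeError: message" line, as Python prints at the end of a traceback.
EXCEPTION_LINE = re.compile(r"^[\w.]+(Error|Exception): ")


def strip_prefix(line: str) -> str:
    """Remove the logger prefix, keeping the traceback's own indentation."""
    return LOG_PREFIX.sub("", line.rstrip("\n"), count=1)


def exception_lines(log_text: str) -> list:
    """Return the final 'module.Error: message' lines found in a prefixed log."""
    found = []
    for raw in log_text.splitlines():
        body = strip_prefix(raw)
        if EXCEPTION_LINE.match(body):
            found.append(body)
    return found
```

Run over the log above, this should surface the two `ray.exceptions.RayChannelTimeoutError` lines, which are the parts that matter for diagnosis.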

It seems there is a problem with Ray communication. My test script runs the following command on all four nodes:

uv run vllm serve $LOCAL_CKPT_PATH \
        --host 0.0.0.0 \
        --port 8000 \
        --max-num-batched-tokens 32768 \
        --max-model-len 32768 \
        --trust-remote-code \
        --gpu-memory-utilization 0.85 \
        --served-model-name pangu \
        --tensor-parallel-size 8 \
        --data-parallel-size 4 \
        --data-parallel-size-local 1 \
        --data-parallel-rank $LOCAL_NODE_RANK \
        --data-parallel-address $MASTER_NODE_IP \
        --data-parallel-rpc-port 13345 \
        --enable-expert-parallel
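
Note that the command above sets `--tensor-parallel-size 8` with `--data-parallel-size 4`, whereas the failing setup in the report is TP 8 with PP 4 and no DP. Both layouts occupy the same 32 GPUs, so the two are easy to confuse. The Python sketch below only does that arithmetic. It assumes the GPU count is the product of the three parallel sizes and that each node has 8 GPUs; `ParallelLayout` is a hypothetical helper, not a vLLM class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical container for the three parallel sizes discussed in this thread."""
    tensor_parallel: int = 1
    pipeline_parallel: int = 1
    data_parallel: int = 1

    def gpus_required(self) -> int:
        # Assumption: total GPUs = TP x PP x DP.
        return self.tensor_parallel * self.pipeline_parallel * self.data_parallel

    def nodes_required(self, gpus_per_node: int) -> int:
        total = self.gpus_required()
        if total % gpus_per_node:
            raise ValueError(f"{total} GPUs do not fill whole nodes of {gpus_per_node}")
        return total // gpus_per_node


# Layout described in the report: TP 8, PP 4, no DP.
reported = ParallelLayout(tensor_parallel=8, pipeline_parallel=4)
# Layout used by the test command: TP 8, DP 4, no PP.
tested = ParallelLayout(tensor_parallel=8, data_parallel=4)

for name, layout in (("reported", reported), ("tested", tested)):
    print(name, layout.gpus_required(), "GPUs on", layout.nodes_required(8), "nodes")
```

Because the tested layout has `pipeline_parallel=1`, it does not exercise the pipeline-parallel path that the report says hangs.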

@Kishanthan

Can you share the whole process, with documentation on how to use it? We have run into some problems and are trying to find which step is wrong, for example how to use the Pangu model, how to configure vLLM, and how to patch it with the current PR. Thanks! We still cannot run it with TP 8 and PP 4 because it hangs.

Yes, there seems to be a bug on the Ray CGraph side when using PP with the latest vLLM version. Related issues: #26899 and ray-project/ray#58062. We applied the fixes proposed in those issues, and we can now load and invoke the model with TP and PP. Thanks for the help.
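
For completeness, the error text in the log names `RAY_CGRAPH_get_timeout` (300 seconds at the time of failure) as the setting to raise when execution is legitimately slow. In this thread the cause was a hang, so raising it would not have helped, but a minimal Python sketch of setting it for a launched process is shown here. It assumes the value is read from the environment of the process that runs the engine; `env_with_cgraph_timeout` is a hypothetical helper name.

```python
import os
from typing import Dict, Optional


def env_with_cgraph_timeout(seconds: int, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with RAY_CGRAPH_get_timeout set.

    Assumption: the launched process reads this variable at startup,
    as the error message in the log suggests.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    env = dict(os.environ if base is None else base)
    env["RAY_CGRAPH_get_timeout"] = str(seconds)
    return env


# The result can be passed as env= to subprocess.run([...]) when starting the server.
example_env = env_with_cgraph_timeout(600, base={})
print(example_env)
```

If the request still times out after the limit is raised, that points to a hang rather than slow execution, which matches what the linked issues describe.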

Collaborator

@jeejeelee left a comment

Thank you for the contribution.

@jeejeelee added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Nov 4, 2025
@vllm-bot vllm-bot merged commit 05cae69 into vllm-project:main Nov 4, 2025
52 of 55 checks passed
juliendenize pushed a commit to juliendenize/vllm that referenced this pull request Nov 6, 2025
zWaNg3 added a commit to fangyuchu/vllm that referenced this pull request Nov 7, 2025
* add fault_report_addr in FaultToleranceConfig

* add handle fault&get_fault_info api

Signed-off-by: w00689259 <[email protected]>

* remove fault_report_address in CoreEngineActorManager __init__

Signed-off-by: a798347923 <[email protected]>

* ruff format

Signed-off-by: a798347923 <[email protected]>

* add handle fault&get_fault_info api

Signed-off-by: w00689259 <[email protected]>

* fix one bug.

Signed-off-by: fangyuchu <[email protected]>

* add fault_report_port in FaultToleranceConfig

Signed-off-by: a798347923 <[email protected]>

* add zmq_addr concatenate with fault_report_addr and fault_report_port

Signed-off-by: a798347923 <[email protected]>

* fault reporter bug fix

Signed-off-by: w00689259 <[email protected]>

* fault reporter bug fix

Signed-off-by: w00689259 <[email protected]>

* fault reporter bug fix

Signed-off-by: w00689259 <[email protected]>

* fault reporter bug fix

Signed-off-by: w00689259 <[email protected]>

* fault reporter bug fix

Signed-off-by: w00689259 <[email protected]>

* fault reporter bug fix

Signed-off-by: w00689259 <[email protected]>

* fix some bug

* fault reporter bug fix

Signed-off-by: w00689259 <[email protected]>

* fault reporter bug fix

Signed-off-by: w00689259 <[email protected]>

* remove fault_report_addr in FaultToleranceConfig

Signed-off-by: a798347923 <[email protected]>

* refactor: relocate method serialization functions to serial_util.py

Signed-off-by: fangyuchu <[email protected]>

* fix actor bug

* fix actor bug

* add engine_core_cmd_addr in FaultToleranceConfig

Signed-off-by: a798347923 <[email protected]>

* add and use _stop_worker_execution in EngineCoreGuard

Signed-off-by: a798347923 <[email protected]>

* add and use run in WorkerGuard

Signed-off-by: a798347923 <[email protected]>

* fix actor bug

* fix bug

* fix sentinel

* fix bug vllm/v1/engine/core.py:847: error: Missing positional argument "tp_size" in call to "EngineCoreGuard"

Signed-off-by: a798347923 <[email protected]>

* fix bug error: Missing positional arguments "length", "byteorder" in call to "to_bytes" of "int"

Signed-off-by: a798347923 <[email protected]>

* fix bug in fault tolerance mode

Signed-off-by: w00689259 <[email protected]>

* fix bug in fault tolerance mode

Signed-off-by: w00689259 <[email protected]>

* change fault_report_port to internal_fault_report_port
add external_fault_notify_port

Signed-off-by: a798347923 <[email protected]>

* change fault_report_port to internal_fault_report_port
add external_fault_notify_port

Signed-off-by: a798347923 <[email protected]>

* add _recv_cmd func
use deserialize_method_call and run_method in run func

Signed-off-by: a798347923 <[email protected]>

* Update core.py

fix bug error: Need type annotation for "kwargs" (hint: "kwargs: dict[<type>, <type>] = ...")

Signed-off-by: a798347923 <[email protected]>

* add self.ctx.term() in shutdown()

Signed-off-by: a798347923 <[email protected]>

* changed import deserialize_method_call,serialize_method_call

Signed-off-by: a798347923 <[email protected]>

* changed init worker_guard in init_device

Signed-off-by: a798347923 <[email protected]>

* Update core.py

add import serialize_method_call

Signed-off-by: a798347923 <[email protected]>

* Update gpu_worker.py

changed init WorkerGuard in init_device

Signed-off-by: a798347923 <[email protected]>

* Update gpu_worker.py

FIX BUG self.worker_guard: WorkerGuard|None = None

Signed-off-by: a798347923 <[email protected]>

* Update gpu_worker.py

fix bug error: Argument 1 to "deserialize_method_call" has incompatible type "str | None"; expected "str"  [arg-type]

Signed-off-by: a798347923 <[email protected]>

* Update gpu_worker.py

ruff format

Signed-off-by: a798347923 <[email protected]>

* Update core.py

ruff-format

Signed-off-by: a798347923 <[email protected]>

* actively send exception information

Signed-off-by: w00689259 <[email protected]>

* actively send exception information

Signed-off-by: w00689259 <[email protected]>

* actively send exception information

Signed-off-by: w00689259 <[email protected]>

* change engine_core_cmd_addr(str) to engine_core_cmd_addrs(list[str]) in EngineZmqAddresses

Signed-off-by: a798347923 <[email protected]>

* change engine_core_cmd_addr(str) to engine_core_cmd_addrs(list[str]) in EngineZmqAddresses

Signed-off-by: a798347923 <[email protected]>

* Update utils.py

delete engine_core_cmd_addr in EngineZmqAddresses

Signed-off-by: a798347923 <[email protected]>

* Remove redundant configuration: fault-pub-port

Signed-off-by: fangyuchu <[email protected]>

* Send pause instructions after receiving fault info in ClientGuard

Signed-off-by: fangyuchu <[email protected]>

* change engine_core_guard_identities from dict[int, bytes] to list[bytes]

Signed-off-by: a798347923 <[email protected]>

* fix bug "only the worker guard of engine core 0 can receive messages sent from engine core guard"

Signed-off-by: a798347923 <[email protected]>

* change local_rank to rank_in_group in WorkerGuard

Signed-off-by: a798347923 <[email protected]>

* changed del self.client_cmd_registry[int(unhealthy_engine.engine_id)]

Signed-off-by: a798347923 <[email protected]>

* add gloo communication timeout

* fix some bug

* add  stateless_process_group gloo_comm_timeout

* reconstruct fault receiver&fault handler

Signed-off-by: w00689259 <[email protected]>

* fix some bug

* reconstruct fault receiver&fault handler

Signed-off-by: w00689259 <[email protected]>

* reconstruct fault receiver&fault handler

Signed-off-by: w00689259 <[email protected]>

* fix return format

Signed-off-by: w00689259 <[email protected]>

* fix return format

Signed-off-by: w00689259 <[email protected]>

* fix return format

Signed-off-by: w00689259 <[email protected]>

* add abort request

* fix some bug

* fix some bug

* fix some bug

* add dt for client guard

Signed-off-by: w00689259 <[email protected]>

* add dt for client guard

Signed-off-by: w00689259 <[email protected]>

* add dt for client guard

Signed-off-by: w00689259 <[email protected]>

* Implementation of two types of pause: a soft one by using flag signals and a hard one by aborting nccl communicators.

Signed-off-by: fangyuchu <[email protected]>

* Refine certain log forms and fix a minor bug in pause function.

Signed-off-by: fangyuchu <[email protected]>

* Refactor and abstract the recv_msg logic in CG,ECG,WG.

Signed-off-by: fangyuchu <[email protected]>

* [Frontend] Align finish_reason when tool is called with OpenAI (vllm-project#25054)

Signed-off-by: Sungyoon Jeong <[email protected]>
Co-authored-by: Chauncey <[email protected]>

* [Hybrid] Pass kernel block size to builders (vllm-project#27753)

Signed-off-by: Thomas Parnell <[email protected]>

* [Bugfix] Padded Eagle Specdec with Chunked Prefill (vllm-project#26263)

Signed-off-by: Rémi Delacourt <[email protected]>
Signed-off-by: Rémi Delacourt <[email protected]>
Signed-off-by: remi <[email protected]>
Co-authored-by: Benjamin Chislett <[email protected]>

* [XPU]Refine Dockerfile.xpu, avoid oneccl dependency issue (vllm-project#27964)

Signed-off-by: Kunshang Ji <[email protected]>

* Add and check method uuid when sending commands and receiving results.

Signed-off-by: fangyuchu <[email protected]>

* Add ORCA endpoint load metrics support (vllm-project#24905)

Signed-off-by: Misha Efimov <[email protected]>

* [CI/Build] Remove the flaky gpt-oss lora test (vllm-project#27966)

Signed-off-by: Jee Jee Li <[email protected]>

* Abstract the logic of sending instructions and waiting responses from FaultHandler

Signed-off-by: fangyuchu <[email protected]>

* [Model] Add PaddleOCR-VL Model Support  (vllm-project#27758)

Signed-off-by: zhangyue <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: zhangyue66 <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* Add options in EngineCoreGuard to recv execution results from WorkerGuard

Signed-off-by: fangyuchu <[email protected]>

* Early exit for MoE LoRA kernels (vllm-project#27131)

Signed-off-by: gnovack <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [Bugfix] Skip gs:// model paths for speculator detection (vllm-project#27846)

Signed-off-by: Peter Schuurman <[email protected]>

* [BUG] Make 'binary' default option for saving torch compile artifacts when using standalone_compile (vllm-project#27616)

Signed-off-by: ahao-anyscale <[email protected]>

* [CI/Testing] Add basic single node dual batch overlap test (vllm-project#27235)

Signed-off-by: Lucas Wilkinson <[email protected]>

* [Spec Decode] Integrate Suffix Decoding from Arctic Inference (vllm-project#25784)

Co-authored-by: Aurick Qiao <[email protected]>

* [Feature][Benchmarks] Support `inf` burstiness (vllm-project#26941)

Signed-off-by: Sophie du Couédic <[email protected]>

* [Bugfix][Qwen][Multimodal] Move Qwen2_5_vl sdpa to custom op and reenable compile (vllm-project#27764)

Signed-off-by: Lucas Kabela <[email protected]>

* [Bugfix] change FlashMLA reorder_batch_threshold (vllm-project#27777)

Signed-off-by: Matthew Bonanni <[email protected]>

* [Docs] add runai_streamer_sharded to LoadConfig (vllm-project#27937)

Signed-off-by: Andy Xie <[email protected]>

* Add TP parameter to attention tests (vllm-project#27683)

Signed-off-by: Matthew Bonanni <[email protected]>

* [Bugfix][plugin] fla crash on plugin (vllm-project#27322)

* [Bugfix] Fix MoE Routing Simulation (vllm-project#28002)

Signed-off-by: Tyler Michael Smith <[email protected]>

* Remove the tpu docker image nightly build. (vllm-project#27997)

Signed-off-by: Qiliang Cui <[email protected]>

* [Bugfix][ROCm] Fix ViT rotary embeddings for torch.compile compatibility on ROCm (vllm-project#27748)

Signed-off-by: vllmellm <[email protected]>

* [LoRA] Lora shrink swizzle (vllm-project#27694)

Signed-off-by: li2haipeng <[email protected]>
Signed-off-by: Haipeng Li <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [Refactor] Lazy import tool_parser (vllm-project#27974)

Signed-off-by: chaunceyjiang <[email protected]>

* [NIXL][XPU] Pin NIXL version to 0.7.0 (vllm-project#27849)

Signed-off-by: zhenwei-intel <[email protected]>

* [Metrics] Enable sleep state metric outside of dev mode (vllm-project#27867)

Signed-off-by: Mark McLoughlin <[email protected]>

* [Bug] Batch invariant: Fix flash attn MLA `RuntimeError: scheduler_metadata must have shape (metadata_size)` (vllm-project#27884)

* [CPU]Improve dynamic 4bit moe performance (vllm-project#27240)

Signed-off-by: Zhang Xiangze <[email protected]>

* [CI/Build] Update LM Eval Version in AMD CI (vllm-project#27944)

Signed-off-by: zhewenli <[email protected]>

* [KV Connector] Make KVCacheConfig an explicit constructor argument (vllm-project#27887)

Signed-off-by: Mark McLoughlin <[email protected]>

* [Model] fix ernie45 reasoning_parser (vllm-project#27973)

Signed-off-by: wangyafeng <[email protected]>

* [CI/Build] Fix OpenAI API correctness on AMD CI (vllm-project#28022)

Signed-off-by: zhewenli <[email protected]>

* [BugFix][Performance] Restore flashinfer autotuning for all scenarios (vllm-project#27904)

* Support worker reinitialization after hard pause; add task queue in FaultHandler to ensure sequential task execution

Signed-off-by: fangyuchu <[email protected]>

* resolve conflicts

Signed-off-by: w00689259 <[email protected]>

* resolve conflicts

Signed-off-by: w00689259 <[email protected]>

* resolve conflicts

Signed-off-by: w00689259 <[email protected]>

* Load tuned fused_moe_lora shrink and expand kernel configs separately (vllm-project#27435)

Signed-off-by: Yu Gong <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* resolve conflicts

Signed-off-by: w00689259 <[email protected]>

* resolve conflicts

Signed-off-by: w00689259 <[email protected]>

* resolve conflicts

Signed-off-by: w00689259 <[email protected]>

* Support using Int4PreshuffledTensor after loading (vllm-project#26066)

Signed-off-by: Jerry Zhang <[email protected]>

* [Core] Enable StatLogger in LLMEngine (vllm-project#28020)

Signed-off-by: Zhuohan Li <[email protected]>

* [Model][Bugfix] fix pipeline parallelism support for NemotronH (vllm-project#27968)

Signed-off-by: Tomer Asida <[email protected]>

* [Model] add optimal triton fused moe configs for NemotronH MoE (vllm-project#27967)

Signed-off-by: Tomer Asida <[email protected]>

* [Kernels] Isolate modular kernel code from FusedMoEMethodBase subclasses. (vllm-project#27123)

* [BugFix] Fix incorrect preallocated sampled_token_ids tensor size (vllm-project#28025)

Signed-off-by: Nick Hill <[email protected]>

* [Perf] SM100 - add swap AB optimization to CUTLASS FP8 GEMM (vllm-project#27284)

Signed-off-by: Faqin Zhong <[email protected]>
Co-authored-by: Faqin Zhong <[email protected]>
Co-authored-by: Michael Goin <[email protected]>

* [PERF] Decouple projections from GDN custom op (vllm-project#27512)

Signed-off-by: Vadim Gimpelson <[email protected]>

* [model] Add support for openPangu_Ultra_MoE (vllm-project#27521)

Signed-off-by: yuantao <[email protected]>
Signed-off-by: yt0428 <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [PerfFix] Avoid separate thread for MP executor shm spin (vllm-project#28012)

Signed-off-by: Nick Hill <[email protected]>

* [AsyncScheduling] Don't schedule past request max_tokens (vllm-project#27922)

Signed-off-by: Nick Hill <[email protected]>

* Remove deprecated `--rope-scaling` and `--rope-theta` (vllm-project#28006)

Signed-off-by: Harry Mellor <[email protected]>

* [ROCm][Perf] New design on ROCm AITER MHA backend Implementation (vllm-project#25763)

Signed-off-by: ganyi <[email protected]>

* Added disable rule to track files under benchmarks/lib (vllm-project#28048)

Signed-off-by: Nadav Kluger <[email protected]>

* [Multimodal] Make MediaConnector extensible. (vllm-project#27759)

Signed-off-by: Chenheli Hua <[email protected]>

* [ROCm] gemm_a16w16 upstreaming (vllm-project#26969)

Signed-off-by: Aleksandr Malyshev <[email protected]>
Co-authored-by: Aleksandr Malyshev <[email protected]>

* Revert "[PERF] Decouple projections from GDN custom op" (vllm-project#28080)

Signed-off-by: Vadim Gimpelson <[email protected]>

* add engine core ut

Signed-off-by: w00689259 <[email protected]>

* add engine core ut

Signed-off-by: w00689259 <[email protected]>

* [Qwen3-Next] MOE configs for A100-SXM4-80GB TP4 TP8 (vllm-project#27740)

* [XPU] Add gpt-oss model support for Intel GPU (vllm-project#27786)

Signed-off-by: Kunshang Ji <[email protected]>

* [CI/Build] Enable some fixed tests in AMD CI (vllm-project#28078)

Signed-off-by: zhewenli <[email protected]>

* [V0 deprecation] Remove VLLM_USE_V1 usage in most modules (vllm-project#27955)

Signed-off-by: wangxiyuan <[email protected]>

* [Bugfix] Fix encoder-only model support for transformers backend (vllm-project#28021)

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>

* [BugFix] Fix DCP Assert (AssertionError: DCP not support reorder_batch_threshold > 1 now.) (vllm-project#28100)

Signed-off-by: Lucas Wilkinson <[email protected]>

* [Model, Core] Support Granite Speech & LoRA for STT (vllm-project#24455)

* [Refactor] Lazy-loaded reasoning_parser (vllm-project#28092)

Signed-off-by: chaunceyjiang <[email protected]>

* [Refactor] to simplify and extract the shared logic between chat completion and responses (vllm-project#27961)

Signed-off-by: chaunceyjiang <[email protected]>

* [bugfix] fix wrong `dcp_local_seq_lens` calc (vllm-project#27518)

Signed-off-by: Qiu <[email protected]>

* [Hybrid allocator + kv connector] revert connector test changes related to hybrid allocator (vllm-project#28011)

Signed-off-by: KuntaiDu <[email protected]>

* [Misc] fix import error for DeepSeekR1ReasoningParser (vllm-project#28114)

Signed-off-by: chaunceyjiang <[email protected]>

* Fix excessive logging noise by reducing the log level of the MinimaxM2ToolParser import success message (vllm-project#27635)

Signed-off-by: minatoaquaMK2 <[email protected]>

* Bugfix: Cutlass FP8 FusedMoE bad scaling factors (vllm-project#27255)

Signed-off-by: Amir Klein <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Michael Goin <[email protected]>

* [Graph Partition][Cache] Use inductor partition ops config (vllm-project#27702)

Signed-off-by: Boyuan Feng <[email protected]>

* [XPU] Enable custom routing functions in IPEX for Llama4 (vllm-project#28004)

Signed-off-by: frost-intel <[email protected]>

* add kimi reasoning parser (vllm-project#28128)

Signed-off-by: wangzhengtao <[email protected]>
Co-authored-by: wangzhengtao <[email protected]>

* [DCP] check return_lse for all layers in dcp (vllm-project#27929)

Signed-off-by: Chen Zhang <[email protected]>

* [BugFix] Support EP/DP + EPLB with MTP (vllm-project#25311)

Signed-off-by: ilmarkov <[email protected]>
Signed-off-by: Sage Moore <[email protected]>
Co-authored-by: Sage Moore <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>

* Enabling cooperative multi-gpu tests on multi-gpu nodes (vllm-project#27986)

Signed-off-by: Alexei V. Ivanov <[email protected]>

* [ROCm][MLA] Support block-size > 1 for AITER MLA backend (vllm-project#27224)

Signed-off-by: ganyi <[email protected]>
Co-authored-by: wuhuikx <[email protected]>

* [Bugfix] Validate custom logits processor xargs for online serving (vllm-project#27560)

Signed-off-by: Isotr0py <[email protected]>

* [misc] add vLLM Beijing Meetup (vllm-project#28127)

Signed-off-by: Jiaju Zhang <[email protected]>

* [Kernel] Fuse computation of g and beta for Gated Delta Net (vllm-project#28095)

Signed-off-by: zjy0516 <[email protected]>

* [Core] add support for reasoning parser plugins (vllm-project#28075)

Signed-off-by: walter beller-morales <[email protected]>

* [Bugfix] vLLM should check Inductor config for compile cache enablement status (vllm-project#27637)

Signed-off-by: Yanan Cao <[email protected]>

* [FlashInfer] Avoid FlashInfer block_size 16 + head_size 256 on blackwell (vllm-project#27994)

Signed-off-by: Chen Zhang <[email protected]>

* [CI]: Add LMCacheConnector Unit Tests (vllm-project#27852)

Signed-off-by: Samuel Shen <[email protected]>
Co-authored-by: Samuel Shen <[email protected]>
Co-authored-by: Yihua Cheng <[email protected]>

* [Feature] Extend batch invariant torch.compile to B200 (vllm-project#27856)

Signed-off-by: PaulZhang12 <[email protected]>

* [Bugfix] Fix Qwen3-Reranker-8B load (vllm-project#28117)

Signed-off-by: wang.yuqi <[email protected]>

* [Docs] Clean up README_TUNING.md (vllm-project#28088)

Signed-off-by: windsonsea <[email protected]>

* [Hardware][IBM Z] Optimize s390x Dockerfile (vllm-project#28023)

Signed-off-by: Rehan Khan <[email protected]>

* [Chore] Remove Nemotron-Nano-VL config copy (vllm-project#28126)

Signed-off-by: Isotr0py <[email protected]>

* [Docs] Add guide to debugging vLLM-torch.compile integration (vllm-project#28094)

Signed-off-by: Richard Zou <[email protected]>

* [Feature]: Add corrupted request metric to V1 metrics system. (vllm-project#27306)

Signed-off-by: atalhens <[email protected]>

* [CI/Build] Update checking logic in cutlass_group_gemm_supported (vllm-project#27948)

Signed-off-by: zhewenli <[email protected]>

* [CI/Build] Fix `test_defaults_with_usage_context` in AMD CI (vllm-project#27926)

Signed-off-by: zhewenli <[email protected]>

* [Core][Hybrid allocator + connector 2/n] Unify `remove_skipped_blocks` by `get_last_useful_token` (vllm-project#25431)

Signed-off-by: KuntaiDu <[email protected]>

* [Debugging] Add annotation for easier trace analysis (vllm-project#22496)

* [PERF] Decouple projections from GDN custom op. Attempt 2 (vllm-project#28083)

Signed-off-by: Vadim Gimpelson <[email protected]>

* [Bug] Fix cpu disable shared_experts `VLLM_DISABLE_SHARED_EXPERTS_STREAM` (vllm-project#28157)

Signed-off-by: yewentao256 <[email protected]>

* [Bug] Fix env string `"0"` same to `True` (vllm-project#28159)

Signed-off-by: yewentao256 <[email protected]>

* Ensure WorkerGuard command execution returns result; fix missing set_device when TP>1

Signed-off-by: fangyuchu <[email protected]>

* [Feature] Enable TP + EP `shared_experts` overlap with router, 3.7% E2E performance improvement (vllm-project#28164)

Signed-off-by: yewentao256 <[email protected]>

* [CI Failure] `nm-testing/Qwen2-0.5B-Instruct-FP8-SkipQKV` was removed from HF. Skip it in tests (vllm-project#28170)

Signed-off-by: Vadim Gimpelson <[email protected]>

* [Misc] Remove the duplicate code (vllm-project#28111)

Signed-off-by: chaunceyjiang <[email protected]>

* rename & format logger

Signed-off-by: w00689259 <[email protected]>

* rename & format logger

Signed-off-by: w00689259 <[email protected]>

* feat(nccl): enable non-blocking NCCL communicators to support ncclCommAbort

Signed-off-by: fangyuchu <[email protected]>

---------

Signed-off-by: w00689259 <[email protected]>
Signed-off-by: a798347923 <[email protected]>
Signed-off-by: fangyuchu <[email protected]>
Signed-off-by: zWaNg3 <[email protected]>
Signed-off-by: a798347923 <[email protected]>
Signed-off-by: Sungyoon Jeong <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Rémi Delacourt <[email protected]>
Signed-off-by: Rémi Delacourt <[email protected]>
Signed-off-by: remi <[email protected]>
Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: Misha Efimov <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: zhangyue <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: zhangyue66 <[email protected]>
Signed-off-by: gnovack <[email protected]>
Signed-off-by: Peter Schuurman <[email protected]>
Signed-off-by: ahao-anyscale <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Sophie du Couédic <[email protected]>
Signed-off-by: Lucas Kabela <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Andy Xie <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Qiliang Cui <[email protected]>
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: li2haipeng <[email protected]>
Signed-off-by: Haipeng Li <[email protected]>
Signed-off-by: chaunceyjiang <[email protected]>
Signed-off-by: zhenwei-intel <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
Signed-off-by: Zhang Xiangze <[email protected]>
Signed-off-by: zhewenli <[email protected]>
Signed-off-by: wangyafeng <[email protected]>
Signed-off-by: Yu Gong <[email protected]>
Signed-off-by: Jerry Zhang <[email protected]>
Signed-off-by: Zhuohan Li <[email protected]>
Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Faqin Zhong <[email protected]>
Signed-off-by: Vadim Gimpelson <[email protected]>
Signed-off-by: yuantao <[email protected]>
Signed-off-by: yt0428 <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: ganyi <[email protected]>
Signed-off-by: Nadav Kluger <[email protected]>
Signed-off-by: Chenheli Hua <[email protected]>
Signed-off-by: Aleksandr Malyshev <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: Qiu <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
Signed-off-by: minatoaquaMK2 <[email protected]>
Signed-off-by: Amir Klein <[email protected]>
Signed-off-by: Boyuan Feng <[email protected]>
Signed-off-by: frost-intel <[email protected]>
Signed-off-by: wangzhengtao <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: ilmarkov <[email protected]>
Signed-off-by: Sage Moore <[email protected]>
Signed-off-by: Alexei V. Ivanov <[email protected]>
Signed-off-by: Jiaju Zhang <[email protected]>
Signed-off-by: zjy0516 <[email protected]>
Signed-off-by: walter beller-morales <[email protected]>
Signed-off-by: Yanan Cao <[email protected]>
Signed-off-by: Samuel Shen <[email protected]>
Signed-off-by: PaulZhang12 <[email protected]>
Signed-off-by: wang.yuqi <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Signed-off-by: Rehan Khan <[email protected]>
Signed-off-by: Richard Zou <[email protected]>
Signed-off-by: atalhens <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
Co-authored-by: fangyuchu <[email protected]>
Co-authored-by: a798347923 <[email protected]>
Co-authored-by: w00689259 <[email protected]>
Co-authored-by: fangyuchu <[email protected]>
Co-authored-by: TianZhuo <[email protected]>
Co-authored-by: a798347923 <[email protected]>
Co-authored-by: Sungyoon Jeong <[email protected]>
Co-authored-by: Chauncey <[email protected]>
Co-authored-by: Thomas Parnell <[email protected]>
Co-authored-by: Rémi Delacourt <[email protected]>
Co-authored-by: Benjamin Chislett <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: Misha Efimov <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: zhang-prog <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: gnovack <[email protected]>
Co-authored-by: pwschuurman <[email protected]>
Co-authored-by: ahao-anyscale <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Co-authored-by: Sophie du Couédic <[email protected]>
Co-authored-by: Lucas Kabela <[email protected]>
Co-authored-by: Matthew Bonanni <[email protected]>
Co-authored-by: Ning Xie <[email protected]>
Co-authored-by: Hank_ <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: QiliangCui <[email protected]>
Co-authored-by: vllmellm <[email protected]>
Co-authored-by: li2haipeng <[email protected]>
Co-authored-by: liuzhenwei <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Co-authored-by: Wentao Ye <[email protected]>
Co-authored-by: xiangze-arm <[email protected]>
Co-authored-by: Zhewen Li <[email protected]>
Co-authored-by: CSWYF3634076 <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: yugong333 <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Zhuohan Li <[email protected]>
Co-authored-by: tomeras91 <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: lyrisz <[email protected]>
Co-authored-by: Faqin Zhong <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Vadim Gimpelson <[email protected]>
Co-authored-by: yt0428 <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Pleaplusone <[email protected]>
Co-authored-by: nadavkluger <[email protected]>
Co-authored-by: Chenheli Hua <[email protected]>
Co-authored-by: Aleksandr Malyshev <[email protected]>
Co-authored-by: Aleksandr Malyshev <[email protected]>
Co-authored-by: tou <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Qiu <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Eric Yue <[email protected]>
Co-authored-by: amirkl94 <[email protected]>
Co-authored-by: Boyuan Feng <[email protected]>
Co-authored-by: Frost Mitchell <[email protected]>
Co-authored-by: bigmoyan <[email protected]>
Co-authored-by: wangzhengtao <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Ilya Markov <[email protected]>
Co-authored-by: Sage Moore <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: wuhuikx <[email protected]>
Co-authored-by: Jiaju Zhang <[email protected]>
Co-authored-by: Jiangyun Zhu <[email protected]>
Co-authored-by: Walter Beller-Morales <[email protected]>
Co-authored-by: gmagogsfm <[email protected]>
Co-authored-by: Samuel Shen <[email protected]>
Co-authored-by: Samuel Shen <[email protected]>
Co-authored-by: Yihua Cheng <[email protected]>
Co-authored-by: Paul Zhang <[email protected]>
Co-authored-by: wang.yuqi <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: R3hankhan <[email protected]>
Co-authored-by: Richard Zou <[email protected]>
Co-authored-by: Snehlata <[email protected]>
Co-authored-by: Dayeol Lee <[email protected]>
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Nov 13, 2025
Signed-off-by: yuantao <[email protected]>
Signed-off-by: yt0428 <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

* documentation: Improvements or additions to documentation
* new-model: Requests to new models
* ready: ONLY add when PR is ready to merge/full CI is needed
* speculative-decoding
* v1

Development

Successfully merging this pull request may close these issues.

[New Model]: Add support for openPangu-Ultra-MoE-718B

7 participants