
Conversation

@ISEEKYAN
Collaborator

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Support training Qwen3-VL with Megatron.

  1. Add a Docker image with vLLM 0.11 and NeMo's dedicated Megatron that supports gpt-oss with optimized fused kernels.
  2. Add a script for training Qwen3-VL-30B with Megatron (a usage sketch follows this list).
  3. Make the necessary changes to support Qwen3-VL with Megatron (only the forward functions are registered; the modeling itself goes through mbridge).
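A rough usage sketch: the image tag and script name below (iseekyan/verl:nemo.gptoss_vllm0.11.0 and qwen3vl-30b-megatron.sh) are the ones mentioned elsewhere in this thread, while the examples/ path is a guess, so adjust it to wherever the script actually lands:

```bash
# Hedged sketch: the image tag and script name come from this thread;
# the examples/ path is hypothetical.
docker pull iseekyan/verl:nemo.gptoss_vllm0.11.0

docker run --gpus all --rm -it \
  -v "$PWD":/workspace/verl -w /workspace/verl \
  iseekyan/verl:nemo.gptoss_vllm0.11.0 \
  bash examples/grpo_trainer/qwen3vl-30b-megatron.sh
```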

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

[image: training curve]

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Contributor

@gemini-code-assist bot left a comment


Code Review

This PR adds support for training qwen3vl with Megatron. The changes include a new Dockerfile, an example training script, and updates to model registries. The implementation correctly reuses existing forward functions for QWEN2_5_VL in most cases. However, there is a critical issue in the model registry where an incorrect forward function is assigned for the 'no padding' case, which will cause issues for this vision-language model. I've left a comment with details on the required fix.

@wuxibin89 merged commit 33eb86f into volcengine:main Oct 15, 2025
65 of 67 checks passed
@LuoXiaoHeics

Hi, I used the updated code, but found that mbridge does not support the model:

File "verl/workers/megatron_workers.py", line 161, in _init_hf_config_and_tf_config
bridge = AutoBridge.from_config(hf_config)
ValueError: Unregistered model type: qwen3_vl_moe, now only support dict_keys(['deepseek_v3', 'llama', 'qwen2', 'mimo', 'mixtral', 'qwen2_5_vl', 'qwen2_moe', 'qwen3', 'qwen3_moe', 'glm4_moe', 'glm4v', 'glm4v_moe', 'gemma3', 'internvl_chat'])

I don't know how to fix this problem. Thanks.

@ccilery
Contributor

ccilery commented Oct 15, 2025

Hi, I used the updated code, but found that mbridge does not support the model:

File "verl/workers/megatron_workers.py", line 161, in _init_hf_config_and_tf_config bridge = AutoBridge.from_config(hf_config) ValueError: Unregistered model type: qwen3_vl_moe, now only support dict_keys(['deepseek_v3', 'llama', 'qwen2', 'mimo', 'mixtral', 'qwen2_5_vl', 'qwen2_moe', 'qwen3', 'qwen3_moe', 'glm4_moe', 'glm4v', 'glm4v_moe', 'gemma3', 'internvl_chat'])

I don't know how to fix this problem. Thanks.

Use the latest mbridge repo.
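One way to do that, assuming you want the current main branch of the ISEEKYAN/mbridge GitHub repo rather than the PyPI release:

```bash
# Hedged sketch: reinstall mbridge from the latest main branch on GitHub.
# --no-deps avoids touching the other pinned packages in the container.
pip install --upgrade --force-reinstall --no-deps \
  "git+https://github.com/ISEEKYAN/mbridge.git"
```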

@ccilery
Contributor

ccilery commented Oct 15, 2025

ray.exceptions.RayTaskError(KeyError): ray::WorkerDict.ref_init_model() (pid=15741, ip=192.168.81.181, actor_id=9a63e9057b4c4a61808d92bd14000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f3ff7074ec0>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_4bebd38688bad966/verl/single_controller/ray/base.py", line 700, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_4bebd38688bad966/verl/single_controller/base/decorator.py", line 433, in inner
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_4bebd38688bad966/verl/workers/megatron_workers.py", line 517, in init_model
self.ref_module, self.ref_model_config = self._build_model_optimizer(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_4bebd38688bad966/verl/workers/megatron_workers.py", line 367, in _build_model_optimizer
self.bridge.load_weights(ref_module, local_model_path)
File "/usr/local/lib/python3.12/dist-packages/mbridge/core/bridge.py", line 193, in load_weights
hf_weights_map = self.safetensor_io.load_some_hf_weight(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/core/safetensor_io.py", line 58, in load_some_hf_weight
filename = index[name]
~~~~~^^^^^^
KeyError: 'model.visual.blocks.27.attn.proj.weight'

@ISEEKYAN The 235B model has this problem; the 30B is OK.

@ISEEKYAN
Collaborator Author

ray.exceptions.RayTaskError(KeyError): ray::WorkerDict.ref_init_model() [...]
File "/usr/local/lib/python3.12/dist-packages/mbridge/core/safetensor_io.py", line 58, in load_some_hf_weight
filename = index[name]
KeyError: 'model.visual.blocks.27.attn.proj.weight'

@ISEEKYAN The 235B model has this problem; the 30B is OK.

Hello @ccilery, could you provide more details, such as the 235B script, the number of GPUs, and the parallelism settings?

@ccilery
Contributor

ccilery commented Oct 15, 2025

@ISEEKYAN
8 nodes; the script I copied from qwen3-235b_megatron_96gb.sh:

adv_estimator=grpo

use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=True
kl_loss_coef=0.001

clip_ratio_low=0.2
clip_ratio_high=0.28

max_prompt_length=$((1024 * 2))
max_response_length=$((1204 * 8))
enable_overlong_buffer=True
overlong_buffer_len=$((1024 * 1))
overlong_penalty_factor=1.0

loss_agg_mode="token-mean"

train_prompt_bsz=${TRAIN_BS:-32}
n_resp_per_prompt=8
train_prompt_mini_bsz=16

# minimum nodes need for qwen3-235B-A22B
NNODES=${NNODES:-4}
# Paths

LOG_PATH=/mnt/qwen3vl/output
MODEL_PATH=/mnt/qwen3vl/Qwen3-VL-235B-A22B-Instruct

TRAIN_FILE=
TEST_FILE=

# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7
# Performance Related Parameter
use_dynamic_bsz=True
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 10 / 10))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
offload=True
OPTIM_OFFLOAD=${OPTIM_OFFLOAD:-True}
gen_tp=8
train_tp=${TP:-4}
train_pp=${PP:-8}

EP=${EP:-4}
ETP=1
CP=1
optimizer_offload_fraction=${OFFLOAD_FRACTION:-1.}
last_layer=${LAST_LAYER:-10}

project_name='verl-qwen3'
exp_name="235B-${NNODES}-pp${train_pp}-tp${train_tp}-ep${EP}-actor-length${actor_ppo_max_token_len}"
CKPTS_DIR=$LOG_PATH/ckpt/${project_name}/${exp_name}
ROLL_DUMP_DIR=$LOG_PATH/rollouts

# TODO: support cuda graph for rollout by setting the following config
# actor_rollout_ref.rollout.cudagraph_capture_sizes=[1,2,4,8,16,32]
# actor_rollout_ref.rollout.enforce_eager=False

ray job submit
--working-dir=$WORKING_DIR
--
python3 -m verl.trainer.main_ppo
--config-path=config
--config-name='ppo_megatron_trainer.yaml'
data.train_files="${TRAIN_FILE}"
data.val_files="${TEST_FILE}"
data.prompt_key=prompt
data.truncation='left'
data.max_prompt_length=${max_prompt_length}
data.max_response_length=${max_response_length}
data.train_batch_size=${train_prompt_bsz}
actor_rollout_ref.rollout.n=${n_resp_per_prompt}
actor_rollout_ref.rollout.name=vllm
actor_rollout_ref.rollout.enforce_eager=True
actor_rollout_ref.rollout.free_cache_engine=True
algorithm.adv_estimator=${adv_estimator}
algorithm.use_kl_in_reward=${use_kl_in_reward}
algorithm.kl_ctrl.kl_coef=${kl_coef}
actor_rollout_ref.model.use_fused_kernels=True
actor_rollout_ref.actor.megatron.use_mbridge=True
actor_rollout_ref.actor.use_kl_loss=${use_kl_loss}
actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef}
actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low}
actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high}
actor_rollout_ref.actor.clip_ratio_c=10.0
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4
actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz}
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz}
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz}
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len}
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len}
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len}
actor_rollout_ref.model.path="${MODEL_PATH}"
actor_rollout_ref.actor.optim.lr=1e-6
actor_rollout_ref.actor.optim.lr_warmup_steps=10
actor_rollout_ref.actor.optim.weight_decay=0.1
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
+actor_rollout_ref.actor.optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True
+actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz}
actor_rollout_ref.actor.megatron.param_offload=${offload}
actor_rollout_ref.actor.megatron.optimizer_offload=${OPTIM_OFFLOAD}
actor_rollout_ref.actor.megatron.grad_offload=${offload}
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp}
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp}
actor_rollout_ref.actor.megatron.expert_model_parallel_size=$EP
actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ETP
actor_rollout_ref.actor.megatron.context_parallel_size=${CP}
actor_rollout_ref.actor.entropy_coeff=0
actor_rollout_ref.actor.optim.clip_grad=1.0
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode}
actor_rollout_ref.rollout.gpu_memory_utilization=0.85
actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp}
actor_rollout_ref.rollout.enable_chunked_prefill=True
actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length))
actor_rollout_ref.rollout.temperature=${temperature}
actor_rollout_ref.rollout.top_p=${top_p}
actor_rollout_ref.rollout.top_k=${top_k}
actor_rollout_ref.nccl_timeout=1200
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature}
actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p}
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k}
actor_rollout_ref.rollout.val_kwargs.do_sample=True
actor_rollout_ref.rollout.val_kwargs.n=1
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=${train_pp}
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${train_tp}
actor_rollout_ref.ref.megatron.expert_model_parallel_size=$EP
actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$ETP
actor_rollout_ref.ref.megatron.context_parallel_size=${CP}
actor_rollout_ref.ref.megatron.param_offload=${offload}
+actor_rollout_ref.actor.megatron.override_transformer_config.apply_rope_fusion=True
+actor_rollout_ref.actor.megatron.override_transformer_config.masked_softmax_fusion=True
+actor_rollout_ref.actor.megatron.override_transformer_config.bias_activation_fusion=True
+actor_rollout_ref.actor.megatron.override_transformer_config.bias_dropout_fusion=True
+actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion=True
+actor_rollout_ref.actor.megatron.override_transformer_config.deallocate_pipeline_outputs=True
+actor_rollout_ref.actor.megatron.override_transformer_config.persist_layer_norm=True
+actor_rollout_ref.actor.megatron.override_transformer_config.moe_grouped_gemm=True
+actor_rollout_ref.actor.megatron.override_transformer_config.moe_permute_fusion=True
+actor_rollout_ref.actor.megatron.override_transformer_config.moe_token_dispatcher_type="flex"
+actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_dtype=fp32
+actor_rollout_ref.actor.megatron.override_transformer_config.moe_enable_deepep=True
+actor_rollout_ref.actor.megatron.override_transformer_config.account_for_loss_in_pipeline_split=True
+actor_rollout_ref.actor.megatron.override_transformer_config.account_for_embedding_in_pipeline_split=True
reward_model.reward_manager=dapo
+reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer}
+reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len}
+reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor}
+reward_model.reward_kwargs.overlong_buffer_cfg.log=False
+reward_model.reward_kwargs.max_resp_len=${max_response_length}
trainer.logger=['console']
trainer.project_name="${project_name}"
trainer.experiment_name="${exp_name}"
trainer.n_gpus_per_node=8
trainer.nnodes="${NNODES}"
trainer.val_before_train=False
trainer.test_freq=10
trainer.save_freq=100
trainer.total_epochs=10
trainer.default_local_dir="${CKPTS_DIR}"
trainer.rollout_data_dir=${ROLL_DUMP_DIR}
trainer.resume_mode=auto
trainer.log_val_generations=10

@ISEEKYAN
Collaborator Author

@ccilery Fixed, please update mbridge to the latest main branch. For more details about the bug, please refer to ISEEKYAN/mbridge@cf1a12f.

@ccilery
Contributor

ccilery commented Oct 16, 2025

@ccilery Fixed, please update mbridge to the latest main branch. For more details about the bug, please refer to ISEEKYAN/mbridge@cf1a12f.

@ISEEKYAN Thanks, it works now. A new error has occurred; have you encountered this problem before?

ray.exceptions.RayTaskError(TypeError): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=84545, ip=192.168.81.176, [57/152499]
c599bafbc7dc027b4707052e000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7fa74ef82a80>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_ffb110169af6a57e/verl/single
_controller/ray/base.py", line 700, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_ffb110169af6a57e/verl/single
_controller/base/decorator.py", line 433, in inner
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_ffb110169af6a57e/verl/utils/
profiler/performance.py", line 105, in f
return self.log(decorated_function, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_ffb110169af6a57e/verl/utils/
profiler/performance.py", line 118, in log
output = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_ffb110169af6a57e/verl/utils/
profiler/profile.py", line 256, in wrapper
return func(self_instance, *args, **kwargs_inner)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_ffb110169af6a57e/verl/worker
s/megatron_workers.py", line 746, in compute_log_prob
output, entropys = self.actor.compute_log_prob(data=data, calculate_entropy=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_ffb110169af6a57e/verl/utils/
profiler/performance.py", line 105, in f
return self.log(decorated_function, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_ffb110169af6a57e/verl/utils/
profiler/performance.py", line 118, in log
output = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^ File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_ffb110169af6a57e/verl/worker
s/actor/megatron_actor.py", line 217, in compute_log_prob output = self.forward_backward_batch(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_ffb110169af6a57e/verl/worker
s/actor/megatron_actor.py", line 588, in forward_backward_batch
losses_reduced = forward_backward_func(
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/pipeline_parallel/schedules.py", line 2023, in forward_backward_pipelining_without_interleav
ing
output_tensor, num_tokens = forward_step(
^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/pipeline_parallel/schedules.py", line 397, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_ffb110169af6a57e/verl/worker
s/actor/megatron_actor.py", line 524, in forward_step
output = forward_fn( ^^^^^^^^^^^
File "/tmp/ray/session_2025-10-15_06-42-39_805680_21265/runtime_resources/working_dir_files/_ray_pkg_ffb110169af6a57e/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl
output_orig: CausalLMOutputForPPO = model(
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/distributed/data_parallel_base.py", line 22, in forward
return self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/transformer/module.py", line 237, in forward
outputs = self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/model.py", line 253, in forward
vision_embeds, deepstack_feature_lists = self.vision_model(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/vision_model.py", line 249, in forward
hidden_states, deepstack_feature_lists = self.decoder(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/transformer_block.py", line 316, in forward
hidden_states, context = layer(
^^^^^^
File "/opt/megatron-lm/megatron/core/transformer/transformer_layer.py", line 898, in __call__
return super(MegatronModule, self).__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/megatron-lm/megatron/core/transformer/transformer_layer.py", line 458, in forward
hidden_states, context = self._forward_attention(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/transformer/transformer_layer.py", line 519, in _forward_attention
attention_output_with_bias = self.self_attention(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Qwen3VLSelfAttention.forward() got an unexpected keyword argument 'yarn_mscale'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(WorkerDict pid=109476, ip=192.168.82.42) WARNING 10-16 04:02:40 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not i
nitialized. No cudagraph will be used. [repeated 63x across cluster]
(WorkerDict pid=109477, ip=192.168.82.42) kwargs: {'n': 1, 'logprobs': 0, 'max_tokens': 8192, 'repetition_penalty': 1.0, 'detokeniz
e': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1.0, 'ignore_eos': False} [repeated 63x across cluster]

@ISEEKYAN
Collaborator Author

@ccilery You cannot use the Megatron inside that container, which is a dedicated version for gpt-oss. Just install another Megatron, as I wrote in the 30B script (see the sketch below).
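A hedged sketch of that step; the exact Megatron source and version are whatever the 30B script pins, so treat the version below as an assumption:

```bash
# Hedged sketch: install a standard megatron-core instead of relying on the
# container's gpt-oss-specific build; the ">=0.13.0" pin is an assumption,
# use the exact version from the 30B script.
pip install -U "megatron-core>=0.13.0"
# If the container still puts its own copy first (e.g. /opt/megatron-lm on
# PYTHONPATH), remove that entry so the freshly installed package is imported.
```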

@ccilery
Contributor

ccilery commented Oct 16, 2025

@ISEEKYAN Thanks. Another problem has occurred for the 235B-A22B model; could you please take a look? The traceback and script are below.

ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=176271, ip=192.168.81.180, actor_i
d=499fffe7686ef69e6022f68106000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f434e5be810>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-16_11-40-31_478381_281906/runtime_resources/working_dir_files/_ray_pkg_01a4e0a09a2365d0/verl/singl
e_controller/ray/base.py", line 700, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-16_11-40-31_478381_281906/runtime_resources/working_dir_files/_ray_pkg_01a4e0a09a2365d0/verl/singl
e_controller/base/decorator.py", line 433, in inner
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-16_11-40-31_478381_281906/runtime_resources/working_dir_files/_ray_pkg_01a4e0a09a2365d0/verl/utils
/profiler/performance.py", line 105, in f
return self.log(decorated_function, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-16_11-40-31_478381_281906/runtime_resources/working_dir_files/_ray_pkg_01a4e0a09a2365d0/verl/utils
/profiler/performance.py", line 118, in log
output = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-16_11-40-31_478381_281906/runtime_resources/working_dir_files/_ray_pkg_01a4e0a09a2365d0/verl/utils
/profiler/profile.py", line 256, in wrapper
return func(self_instance, *args, **kwargs_inner)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-16_11-40-31_478381_281906/runtime_resources/working_dir_files/_ray_pkg_01a4e0a09a2365d0/verl/worke
rs/megatron_workers.py", line 746, in compute_log_prob
output, entropys = self.actor.compute_log_prob(data=data, calculate_entropy=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-16_11-40-31_478381_281906/runtime_resources/working_dir_files/_ray_pkg_01a4e0a09a2365d0/verl/utils
/profiler/performance.py", line 105, in f
return self.log(decorated_function, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-16_11-40-31_478381_281906/runtime_resources/working_dir_files/_ray_pkg_01a4e0a09a2365d0/verl/utils
/profiler/performance.py", line 118, in log
output = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-16_11-40-31_478381_281906/runtime_resources/working_dir_files/_ray_pkg_01a4e0a09a2365d0/verl/worke
rs/actor/megatron_actor.py", line 217, in compute_log_prob
output = self.forward_backward_batch(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-16_11-40-31_478381_281906/runtime_resources/working_dir_files/_ray_pkg_01a4e0a09a2365d0/verl/worke
rs/actor/megatron_actor.py", line 588, in forward_backward_batch
losses_reduced = forward_backward_func(
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/megatron/core/pipeline_parallel/schedules.py", line 2135, in forward_backward_pipel
ining_without_interleaving
output_tensor, num_tokens = forward_step(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/megatron/core/pipeline_parallel/schedules.py", line 402, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-16_11-40-31_478381_281906/runtime_resources/working_dir_files/_ray_pkg_01a4e0a09a2365d0/verl/worke
rs/actor/megatron_actor.py", line 560, in forward_step
output = forward_fn(
^^^^^^^^^^^
File "/tmp/ray/session_2025-10-16_11-40-31_478381_281906/runtime_resources/working_dir_files/_ray_pkg_01a4e0a09a2365d0/verl/model
s/mcore/model_forward.py", line 113, in gptmodel_forward_qwen2_5_vl
output_orig = model(
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/megatron/core/distributed/data_parallel_base.py", line 22, in forward
return self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/megatron/core/transformer/module.py", line 429, in forward
outputs = self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/model.py", line 320, in forward
output = self.language_model(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/gpt_model.py", line 132, in forward
hidden_states = self.decoder(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/megatron/core/transformer/transformer_block.py", line 553, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/megatron/core/transformer/module.py", line 305, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/transformer_block.py", line 610, in forward
hidden_states = self._checkpointed_forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/transformer_block.py", line 446, in _checkpointed_forward
hidden_states, context = checkpoint_handler(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/transformer_block.py", line 429, in checkpoint_handler
return tensor_parallel.checkpoint(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/megatron/core/tensor_parallel/random.py", line 480, in checkpoint
return CheckpointFunction.apply(function, distribute_saved_activations, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/megatron/core/tensor_parallel/random.py", line 426, in forward
outputs = run_function(*args)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/transformer_block.py", line 399, in custom_forward
hidden_states, context = layer(
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/megatron/core/transformer/transformer_layer.py", line 852, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/megatron/core/transformer/module.py", line 305, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/megatron/core/transformer/transformer_layer.py", line 434, in forward
hidden_states, context = self._forward_attention(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/megatron/core/transformer/transformer_layer.py", line 499, in _forward_attention
attention_output_with_bias = self.self_attention(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/attention.py", line 192, in forward
query = apply_rotary_pos_emb_absolute(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/rope_utils.py", line 345, in apply_rotary_pos_emb_absolute
result = apply_rotary_pos_emb_thd_absolute(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/rope_utils.py", line 318, in apply_rotary_pos_emb_thd_absol
ute
return _apply_rotary_pos_emb_bshd(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/megatron/core/models/common/embeddings/rope_utils.py", line 125, in apply_rotary_p
os_emb_bshd
t = (t * cos
) + (rotate_half(t, rotary_interleaved) * sin)
~~^~~~~~
RuntimeError: The size of tensor a (20480) must match the size of tensor b (17956) at non-singleton dimension 0

python3 -m verl.trainer.main_ppo --config-path=config
--config-name='ppo_megatron_trainer.yaml'
algorithm.adv_estimator=grpo
data.train_files="$train_path"
data.val_files="$test_path"
data.train_batch_size=64
data.max_prompt_length=1024
data.max_response_length=2048
data.filter_overlong_prompts=True
data.truncation='error'
actor_rollout_ref.model.path=$HF_MODEL_PATH
actor_rollout_ref.actor.optim.lr=1e-6
actor_rollout_ref.actor.ppo_mini_batch_size=64
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1
actor_rollout_ref.actor.megatron.expert_model_parallel_size=8
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=8
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=8
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4
actor_rollout_ref.actor.megatron.expert_model_parallel_size=8
actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=1
actor_rollout_ref.actor.megatron.context_parallel_size=1
+actor_rollout_ref.actor.megatron.override_transformer_config.account_for_loss_in_pipeline_split=True
+actor_rollout_ref.actor.megatron.override_transformer_config.account_for_embedding_in_pipeline_split=True
actor_rollout_ref.actor.use_kl_loss=True
actor_rollout_ref.actor.kl_loss_coef=0.01
actor_rollout_ref.actor.kl_loss_type=low_var_kl
actor_rollout_ref.actor.entropy_coeff=0
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1
actor_rollout_ref.rollout.tensor_model_parallel_size=8
actor_rollout_ref.actor.use_dynamic_bsz=True
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=5120
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=20480
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=20480
actor_rollout_ref.rollout.name=$ENGINE
actor_rollout_ref.rollout.enforce_eager=True
actor_rollout_ref.rollout.free_cache_engine=True
+actor_rollout_ref.rollout.engine_kwargs.vllm.disable_mm_preprocessor_cache=True
actor_rollout_ref.rollout.gpu_memory_utilization=0.8
actor_rollout_ref.rollout.tensor_model_parallel_size=8
actor_rollout_ref.rollout.enable_chunked_prefill=True
actor_rollout_ref.rollout.n=5
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1
actor_rollout_ref.actor.megatron.use_mbridge=True
actor_rollout_ref.actor.megatron.param_offload=True
actor_rollout_ref.actor.megatron.optimizer_offload=True
actor_rollout_ref.actor.megatron.grad_offload=True
actor_rollout_ref.ref.megatron.param_offload=True
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4
actor_rollout_ref.ref.megatron.expert_model_parallel_size=8
actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=1
actor_rollout_ref.ref.megatron.context_parallel_size=1
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=8
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=1
+actor_rollout_ref.actor.optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True
+actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
+actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_dtype=fp32
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
+actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion=True
+actor_rollout_ref.actor.megatron.override_transformer_config.moe_permute_fusion=True
algorithm.use_kl_in_reward=False
trainer.critic_warmup=0
trainer.logger='["console"]'
trainer.project_name='verl_grpo_example_geo3k'
trainer.experiment_name='qwen3_vl_235b_megatron'
trainer.n_gpus_per_node=8
trainer.nnodes=8
trainer.save_freq=20
trainer.test_freq=5
trainer.val_before_train=False
trainer.total_epochs=15 $@

@ISEEKYAN
Collaborator Author

@ccilery Thank you for reporting these bugs. I will add another runnable qwen3vl-235B example script.

@LuoXiaoHeics

LuoXiaoHeics commented Oct 17, 2025

Hi, I met another problem in mcore:

output = self.forward_backward_batch(
/verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch                                                      
    losses_reduced = forward_backward_func(                                                                        
miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining       
    output_tensor, num_tokens = forward_step(
.......
verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl
    output_orig: CausalLMOutputForPPO = model(
......
mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward
    output = self.language_model(
envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks'

I found that mbridge correctly loads the Qwen3VLGPTModel, but it incorrectly uses the function _fused_GPTModel_forward in model_forward_fused (I guess). Have you met such a problem?

@ISEEKYAN
Collaborator Author

[...] TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks'

There is an additional argument in Qwen3VLGPTModel called visual_pos_masks. I think we should modify _fused_GPTModel_forward and pass that argument through it.

@ISEEKYAN
Collaborator Author

@ccilery The bug is fixed by updating to the latest mbridge, and I have confirmed it is runnable with the script in #3799.

@ccilery
Contributor

ccilery commented Oct 17, 2025

@ISEEKYAN Thanks, it works now.
By the way, how much memory do your machines have? Is there any CPU memory OOM when saving the checkpoint, such as #3360?

@LuoXiaoHeics

[...] TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks'

There is an additional argument in Qwen3VLGPTModel called visual_pos_masks. I think we should modify _fused_GPTModel_forward and pass that argument through it.

Yes, I think it is caused by the use of actor_rollout_ref.model.use_fused_kernels=True; if I unset this, it works correctly.
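For anyone hitting the same TypeError, the workaround described here amounts to turning off the fused forward path in the launch command. A trimmed sketch of the relevant overrides; all the other data, model, and parallelism overrides from the script stay unchanged:

```bash
# Workaround from this thread: the fused-kernels forward does not yet pass
# Qwen3-VL's extra arguments (e.g. visual_pos_masks), so disable it.
python3 -m verl.trainer.main_ppo \
  --config-path=config \
  --config-name='ppo_megatron_trainer.yaml' \
  actor_rollout_ref.model.use_fused_kernels=False \
  actor_rollout_ref.actor.megatron.use_mbridge=True
```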

@ISEEKYAN
Collaborator Author

@ccilery My CPU memory is 2 TB per node.

Anyone is welcome to contact me on WeChat for further questions. My WeChat ID is becauseyesterday.

masoudhashemi pushed a commit to masoudhashemi/verl that referenced this pull request Oct 19, 2025
@baobaohanhan21

@ISEEKYAN Hi, I got an error when running qwen3vl-30b-megatron.sh. Env: docker://iseekyan/verl:nemo.gptoss_vllm0.11.0, with mbridge, transformers, and megatron pip-installed as written in the script. But when it gets to updating the actor:

output = ray.get(output)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.

In the trace log I found:
[rank1]:W1020 09:45:39.088000 189458 torch/_dynamo/convert_frame.py:984] [3/8] torch._dynamo hit config.recompile_limit (8)
^[[36m(WorkerDict pid=189458)^[[0m [rank1]:W1020 09:45:39.088000 189458 torch/_dynamo/convert_frame.py:984] [3/8] function: 'bias_dropout_add_fused_train' (/usr/local/lib/python3.12/dist-packages/megatron/core/fusions/fused_bias_dropout.py:67)
^[[36m(WorkerDict pid=189458)^[[0m [rank1]:W1020 09:45:39.088000 189458 torch/_dynamo/convert_frame.py:984] [3/8] last reason: 3/7: 'NoneType' object has no attribute 'size'
^[[36m(WorkerDict pid=189458)^[[0m [rank1]:W1020 09:45:39.088000 189458 torch/_dynamo/convert_frame.py:984] [3/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
^[[36m(WorkerDict pid=189458)^[[0m [rank1]:W1020 09:45:39.088000 189458 torch/_dynamo/convert_frame.py:984] [3/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.

Is this a problem with the Megatron version?

@jiayouxiaoding

jiayouxiaoding commented Oct 21, 2025

File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/tensor_parallel/random.py", line 426, in forward
outputs = run_function(*args)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/transformer_block.py", line 100, in custom_forward
hidden_states, context = layer(
^^^^^^
File "/opt/megatron-lm/megatron/core/transformer/transformer_layer.py", line 898, in __call__
return super(MegatronModule, self).__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/transformer/transformer_layer.py", line 458, in forward
hidden_states, context = self._forward_attention(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/megatron-lm/megatron/core/transformer/transformer_layer.py", line 519, in _forward_attention
attention_output_with_bias = self.self_attention(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Qwen3VLSelfAttention.forward() got an unexpected keyword argument 'yarn_mscale'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

@ISEEKYAN Hi, I'm getting this error:
From the logs, I see that the TransformerConfig contains a whole set of Yarn-related parameters, which don’t match the HF model’s config.json at all. In the config file, rope_type is clearly set to "default", not Yarn. So where are these Yarn-related parameters (like yarn_mscale) coming from? Does anyone know what’s causing this issue?

@ISEEKYAN
Collaborator Author

Please try with a new docker image: docker.io/iseekyan/verl:megatron0.13_vllm0.11
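That is, assuming a plain pull of the image named above:

```bash
# Pull the newer image suggested in this comment.
docker pull docker.io/iseekyan/verl:megatron0.13_vllm0.11
```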

@baobaohanhan21

please try with a new docker image docker.io/iseekyan/verl:megatron0.13_vllm0.11

It does not work; it is missing too many libs (DeepEP, ...).

@huaiyizhao
Contributor

[...] TypeError: Qwen3VLSelfAttention.forward() got an unexpected keyword argument 'yarn_mscale'

Same. #3783

@huaiyizhao
Contributor

please try with a new docker image docker.io/iseekyan/verl:megatron0.13_vllm0.11

This image has too many missing dependencies

@spacegoing
Contributor

spacegoing commented Oct 23, 2025

I'm running dsv3 and hit the same issue:

(WorkerDict pid=596, ip=10.120.1.254) [rank311]:W1023 10:06:15.945000 596 torch/_inductor/codecache.py:1026] [8/1] fx graph cache unable to load compiled graph
(WorkerDict pid=596, ip=10.120.1.254) [rank311]:W1023 10:06:15.945000 596 torch/_inductor/codecache.py:1026] [8/1] Traceback (most recent call last):
(WorkerDict pid=596, ip=10.120.1.254) [rank311]:W1023 10:06:15.945000 596 torch/_inductor/codecache.py:1026] [8/1]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/codecache.py", line 1022, in iterate_over_candidates
(WorkerDict pid=596, ip=10.120.1.254) [rank311]:W1023 10:06:15.945000 596 torch/_inductor/codecache.py:1026] [8/1]     with open(os.path.join(subdir, path), "rb") as f:
(WorkerDict pid=596, ip=10.120.1.254) [rank311]:W1023 10:06:15.945000 596 torch/_inductor/codecache.py:1026] [8/1] FileNotFoundError: [Errno 2] No such file or directory: '/tmp/torchinductor_root/fxgraph/6b/f6b6da72xjev7qxu43phdbfydr7qxuydfhbzegovmalv6jjbw7tv/.592.139814949451072.tmp'
(WorkerDict pid=646, ip=10.120.2.95) /usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py:829: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. (Triggered internally at /pytorch/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:62.)
(WorkerDict pid=646, ip=10.120.2.95)   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass                                                                                                                                            
(WorkerDict pid=646, ip=10.120.2.95) /usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py:829: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /pytorch/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:62.) [repeated 32x across cluster]                                                           


(WorkerDict pid=591, ip=10.120.0.169)   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass [repeated 21x across cluster]                                                                                                             (WorkerDict pid=591, ip=10.120.0.135) [rank25]:W1023 10:14:40.923000 591 torch/_dynamo/convert_frame.py:1016] [6/8] torch._dynamo hit config.recompile_limit (8)                                                                                                                        (WorkerDict pid=591, ip=10.120.0.135) [rank25]:W1023 10:14:40.923000 591 torch/_dynamo/convert_frame.py:1016] [6/8]    function: 'bias_dropout_add_fused_train' (/workspace/Megatron-LM/megatron/core/fusions/fused_bias_dropout.py:67)                                                 
(WorkerDict pid=591, ip=10.120.0.135) [rank25]:W1023 10:14:40.923000 591 torch/_dynamo/convert_frame.py:1016] [6/8]    last reason: 6/7: tensor 'x_with_bias[0]' requires_grad mismatch. expected requires_grad=0                                                                       
(WorkerDict pid=591, ip=10.120.0.135) [rank25]:W1023 10:14:40.923000 591 torch/_dynamo/convert_frame.py:1016] [6/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".                                                                                                      (WorkerDict pid=591, ip=10.120.0.135) [rank25]:W1023 10:14:40.923000 591 torch/_dynamo/convert_frame.py:1016] [6/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.                                                            (WorkerDict pid=591, ip=10.120.0.137) /usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py:829: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If you
r operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /pytorch/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:62.) [repeated 73
x across cluster]                                                                                                                                                                                                                                                                       (WorkerDict pid=591, ip=10.120.0.137)   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass [repeated 73x across cluster]                                                                                                             (WorkerDict pid=592, ip=10.120.0.113) [rank9]:W1023 10:14:41.305000 592 torch/_dynamo/convert_frame.py:1016] [6/8] torch._dynamo hit config.recompile_limit (8) [repeated 31x across cluster]                                                                                           (WorkerDict pid=592, ip=10.120.0.113) [rank9]:W1023 10:14:41.305000 592 torch/_dynamo/convert_frame.py:1016] [6/8]    function: 'bias_dropout_add_fused_train' (/workspace/Megatron-LM/megatron/core/fusions/fused_bias_dropout.py:67) [repeated 31x across cluster]                    
(WorkerDict pid=592, ip=10.120.0.113) [rank9]:W1023 10:14:41.305000 592 torch/_dynamo/convert_frame.py:1016] [6/8]    last reason: 6/7: tensor 'x_with_bias[0]' requires_grad mismatch. expected requires_grad=0 [repeated 31x across cluster]                                          
(WorkerDict pid=592, ip=10.120.0.113) [rank9]:W1023 10:14:41.305000 592 torch/_dynamo/convert_frame.py:1016] [6/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". [repeated 31x across cluster]                                                                         (WorkerDict pid=592, ip=10.120.0.113) [rank9]:W1023 10:14:41.305000 592 torch/_dynamo/convert_frame.py:1016] [6/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. [repeated 31x across cluster]                               (WorkerDict pid=595, ip=10.120.0.123) /usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py:829: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If you
r operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /pytorch/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:62.) [repeated 7x
 across cluster]                      

The issue presents as three distinct warnings in the logs (a possible mitigation sketch follows the list):

1. Root Cause: Dynamo Recompile Limit (requires_grad mismatch)

(WorkerDict pid=591, ip=10.120.0.135) [rank25]:W1023 10:14:40.923000 591 torch/_dynamo/convert_frame.py:1016] [6/8] torch._dynamo hit config.recompile_limit (8)
(WorkerDict pid=591, ip=10.120.0.135) [rank25]:W1023 10:14:40.923000 591 torch/_dynamo/convert_frame.py:1016] [6/8]   function: 'bias_dropout_add_fused_train' (/workspace/Megatron-LM/megatron/core/fusions/fused_bias_dropout.py:67)
(WorkerDict pid=591, ip=10.120.0.135) [rank25]:W1023 10:14:40.923000 591 torch/_dynamo/convert_frame.py:1016] [6/8]   last reason: 6/7: tensor 'x_with_bias[0]' requires_grad mismatch. expected requires_grad=0

2. Critical Symptom: Autograd Warning (c10d::broadcast_)

(WorkerDict pid=646, ip=10.120.2.95) /usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py:829: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /pytorch/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:62.)
(WorkerDict pid=646, ip=10.120.2.95)   return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

3. Secondary Symptom: Inductor Cache Failure

(WorkerDict pid=596, ip=10.120.1.254) [rank311]:W1023 10:06:15.945000 596 torch/_inductor/codecache.py:1026] [8/1] fx graph cache unable to load compiled graph
(WorkerDict pid=596, ip=10.120.1.254) [rank311]:W1023 10:06:15.945000 596 torch/_inductor/codecache.py:1026] [8/1] Traceback (most recent call last):
(WorkerDict pid=596, ip=10.120.1.254) [rank311]:W1023 10:06:15.945000 596 torch/_inductor/codecache.py:1026] [8/1]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/codecache.py", line 1022, in iterate_over_candidates
(WorkerDict pid=596, ip=10.120.1.254) [rank311]:W1023 10:06:15.945000 596 torch/_inductor/codecache.py:1026] [8/1]     with open(os.path.join(subdir, path), "rb") as f:
(WorkerDict pid=596, ip=10.120.1.254) [rank311]:W1023 10:06:15.945000 596 torch/_inductor/codecache.py:1026] [8/1]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=596, ip=10.120.1.254) [rank311]:W1023 10:06:15.945000 596 torch/_inductor/codecache.py:1026] [8/1] FileNotFoundError: [Errno 2] No such file or directory: '/tmp/torchinductor_root/fxgraph/6b/f6b6da72xjev7qxu43phdbfydr7qxuydfhbzegovmalv6jjbw7tv/.592.139814949451072.tmp'
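For warnings 1 and 3 there are configuration-level workarounds worth trying. The sketch below is not part of this PR, just a hedged suggestion: the attribute and environment-variable names (`torch._dynamo.config.recompile_limit`, `TORCHINDUCTOR_CACHE_DIR`) are taken from the warning text and PyTorch's documentation, and the per-rank cache path is an assumption. Warning 2 (the `c10d::broadcast_` autograd message) likely needs a code-level fix (for example running the broadcast on detached tensors or under `torch.no_grad()`), so it is not covered here.

```python
# Hedged mitigation sketch; run on each worker before the first compiled forward pass.
import os

import torch
import torch._dynamo

# Warning 1: 'bias_dropout_add_fused_train' keeps recompiling because requires_grad
# flips between calls; raising the limit (default 8) avoids the fallback to eager.
torch._dynamo.config.recompile_limit = 32  # attribute name as printed in the warning

# Warning 3: the FileNotFoundError looks like a race on the shared /tmp Inductor cache;
# giving each rank its own cache directory sidesteps it (RANK assumed set by the launcher).
rank = os.environ.get("RANK", "0")
os.environ["TORCHINDUCTOR_CACHE_DIR"] = f"/tmp/torchinductor_rank{rank}"
```

Running with `TORCH_LOGS="recompiles"`, as the warning itself suggests, will show which guard actually triggers the recompiles before you commit to a higher limit.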

@ISEEKYAN
Copy link
Collaborator Author

Please try the new docker image docker.io/iseekyan/verl:megatron0.13_vllm0.11.

This image has too many missing dependencies

We are currently fixing the image issue; for now I recommend building an image with the correct dependencies yourself. For qwen3vl you need megatron-core 0.13 and the latest mbridge from my repo's main branch.
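A quick environment check before launching a full run can catch both requirements. This is a sketch, not part of the PR: the checkpoint path is a placeholder, and the `__version__` attribute on `megatron.core` is assumed to exist; `AutoBridge.from_config` is the mbridge call that fails when the package is too old.

```python
# Sanity check: confirm megatron-core 0.13.x and an mbridge new enough to know qwen3_vl_moe.
import megatron.core
from transformers import AutoConfig
from mbridge import AutoBridge

print("megatron-core:", megatron.core.__version__)  # expect 0.13.x

# Placeholder path: point this at your Qwen3-VL HF checkpoint directory.
hf_config = AutoConfig.from_pretrained("/path/to/Qwen3-VL-30B")

# An outdated mbridge raises "Unregistered model type: qwen3_vl_moe" here.
bridge = AutoBridge.from_config(hf_config)
print("resolved bridge:", type(bridge).__name__)
```

If the last call raises the unregistered-model-type error, reinstall mbridge from the main branch, e.g. `pip install -U git+https://github.com/ISEEKYAN/mbridge.git@main` (repo URL inferred from "my repo" above).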

@huaiyizhao
Copy link
Contributor

@ISEEKYAN Thank you for the great work. I have successfully built a customized environment. Anyone interested in training qwen3 vl can check out #3906.

@ISEEKYAN
Copy link
Collaborator Author

ISEEKYAN commented Oct 28, 2025

Hello folks, sorry for the buggy docker image I previously provided. I have a new, tested image, docker://iseekyan/verl:vllm011.stable, and have modified the scripts in #3937; please take a look.

mtian8 pushed a commit to mtian8/verl that referenced this pull request Nov 1, 2025
wangboxiong320 pushed a commit to wangboxiong320/verl that referenced this pull request Nov 1, 2025