[megatron] feat: support qwen3vl #3763
Code Review
This PR adds support for training qwen3vl with Megatron. The changes include a new Dockerfile, an example training script, and updates to model registries. The implementation correctly reuses existing forward functions for QWEN2_5_VL in most cases. However, there is a critical issue in the model registry where an incorrect forward function is assigned for the 'no padding' case, which will cause issues for this vision-language model. I've left a comment with details on the required fix.
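The registry concern raised in this review can be illustrated with a minimal sketch. Every name below (`FORWARD_REGISTRY`, `get_forward`, the stub forward functions, and the architecture keys) is hypothetical and is not verl's actual identifier. The sketch only assumes that forward functions are looked up per architecture and per padding mode, and that the fix is to point the no-padding entry for qwen3vl at a vision-language forward rather than a text-only one.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for the real forward functions; each returns a tag
# so the dispatch result is easy to inspect.
def text_forward(batch: dict) -> str:
    return "text"

def text_forward_no_padding(batch: dict) -> str:
    return "text-no-padding"

def vlm_forward(batch: dict) -> str:
    return "vlm"

def vlm_forward_no_padding(batch: dict) -> str:
    return "vlm-no-padding"

# Keyed by (architecture, no_padding). The architecture names are illustrative.
FORWARD_REGISTRY: Dict[Tuple[str, bool], Callable[[dict], str]] = {
    ("text_lm", False): text_forward,
    ("text_lm", True): text_forward_no_padding,
    ("qwen2_5_vl", False): vlm_forward,
    ("qwen2_5_vl", True): vlm_forward_no_padding,
    # Assumed fix: qwen3vl reuses the vision-language forwards in BOTH modes,
    # instead of falling back to the text-only no-padding forward.
    ("qwen3_vl", False): vlm_forward,
    ("qwen3_vl", True): vlm_forward_no_padding,
}

def get_forward(arch: str, no_padding: bool) -> Callable[[dict], str]:
    """Look up the forward function for an architecture and padding mode."""
    try:
        return FORWARD_REGISTRY[(arch, no_padding)]
    except KeyError as exc:
        raise KeyError(
            f"no forward function registered for {arch!r} (no_padding={no_padding})"
        ) from exc
```

Keeping both padding modes in one table makes a mismatched entry visible at a glance, which is the kind of error the review describes.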
Hi, I used the code you updated, but found that mbridge does not support the model: File "verl/workers/megatron_workers.py", line 161, in _init_hf_config_and_tf_config. Do you know how to fix this problem? Thanks.
Use the latest mbridge repo.
ray.exceptions.RayTaskError(KeyError): ray::WorkerDict.ref_init_model() (pid=15741, ip=192.168.81.181, actor_id=9a63e9057b4c4a61808d92bd14000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f3ff7074ec0>)
@ISEEKYAN The 235B model has this problem; the 30B model is OK.
Hello @ccilery, could you provide more details, such as the 235B script, the number of GPUs, and the parallelism settings?
@ISEEKYAN
adv_estimator=grpo
use_kl_in_reward=False
clip_ratio_low=0.2
max_prompt_length=$((1024 * 2))
loss_agg_mode="token-mean"
train_prompt_bsz=${TRAIN_BS:-32}
# minimum nodes need for qwen3-235B-A22B
LOG_PATH=/mnt/qwen3vl/output
TRAIN_FILE=
# Algorithm
EP=${EP:-4}
project_name='verl-qwen3'
# TODO: support cuda graph for rollout by setting the following config
ray job submit
@ccilery Fixed, please update.
@ISEEKYAN Thanks, that works now. A new error has occurred; have you encountered this problem before?
ray.exceptions.RayTaskError(TypeError): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=84545, ip=192.168.81.176, [57/152499] Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
@ccilery You cannot use the Megatron inside that container, which is a dedicated version for gpt-oss. Just install another Megatron as I wrote in the 30B script.
@ISEEKYAN Thanks. Another problem has occurred with the 235B-A22B model. Could you please take a look? The error and the script are as follows.
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=176271, ip=192.168.81.180, actor_i
python3 -m verl.trainer.main_ppo --config-path=config
@ccilery Thank you for reporting these bugs. I will add another runnable qwen3vl-235B example script.
Hi, I ran into another problem in mcore. mbridge correctly loads Qwen3VLGPTModel, but model_forward_fused appears to use the function _fused_GPTModel_forward incorrectly (this is my guess). Have you run into this problem?
Yes, I think it is caused by setting actor_rollout_ref.model.use_fused_kernels=True. If I unset it, training works correctly.
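The workaround reported here can be sketched as a command-line override. This is a minimal illustration, assuming the trainer is launched with `python3 -m verl.trainer.main_ppo --config-path=config` as in the script quoted earlier in this thread. `build_command` is a hypothetical helper, and a real launch needs many more overrides than the single one shown.

```python
import shlex
from typing import Dict, List

def build_command(overrides: Dict[str, object]) -> List[str]:
    """Build a trainer command line with key=value config overrides.

    Hypothetical helper for illustration; only the entry point and the
    override key are taken from the thread.
    """
    cmd = ["python3", "-m", "verl.trainer.main_ppo", "--config-path=config"]
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd

# Disable fused kernels, which the comments above report as the trigger
# for the wrong forward function being used with qwen3vl.
overrides = {"actor_rollout_ref.model.use_fused_kernels": False}
command = build_command(overrides)

# Print a shell-quoted version that can be pasted into a launch script.
print(shlex.join(command))
```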
@ccilery My CPU memory is 2 TB per node. Anyone is welcome to contact me on WeChat with further questions. My WeChat ID is
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

support training qwen3vl with megatron

1. add an image with vllm0.11 and nemo's dedicated megatron that support gpt-oss with optimized fused kernels.
2. add a script of training qwen3vl-30b with megatron
3. necessary changes to support qwen3vl megatron. (just register forward functions, the modeling is through mbridge)

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

<img width="372" height="314" alt="image" src="https://github.com/user-attachments/assets/f1126e46-51a9-4e00-958f-5d034b8f94bd" />

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
@ISEEKYAN Hi, I got an error when running qwen3vl-30b-megatron.sh. Environment: docker://iseekyan/verl:nemo.gptoss_vllm0.11.0, with mbridge, transformers, and megatron installed via pip as written in the script. The error occurs at the actor update step: output = ray.get(output)
File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 575, in apply
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
@ISEEKYAN Hi, I'm getting this error:
Please try a new Docker image.
It does not work. Too many libraries are missing, e.g. DeepEP ...
Same. #3783
This image has too many missing dependencies.
I'm running dsv3 and hit the same issue. It presents as three distinct warnings in the logs:
1. Root Cause: Dynamo Recompile Limit (requires_grad mismatch)
2. Critical Symptom: Autograd Warning (
3. Secondary Symptom: Inductor Cache Failure
We are currently fixing the image issue. For now I recommend building an image with the correct dependencies. For qwen3vl we need megatron-core 0.13 and the latest mbridge from the main branch of my repo.
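When building a custom image, a quick check of the installed Megatron version can catch a mismatch early. The following is a minimal sketch, assuming the package is installed under the distribution name `megatron-core` (an assumption; a source install may register a different name) and that "0.13" means a major.minor match. `check_requirement` and `major_minor` are hypothetical helpers.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Tuple

def major_minor(version_string: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '0.13.0' or '0.13.0rc2'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))

def check_requirement(
    dist_name: str,
    required: str,
    get_version: Callable[[str], str] = version,
) -> str:
    """Report whether the installed distribution matches the required major.minor."""
    try:
        installed = get_version(dist_name)
    except PackageNotFoundError:
        return f"{dist_name}: not installed"
    if major_minor(installed) == major_minor(required):
        return f"{dist_name}: {installed} matches {required}"
    return f"{dist_name}: {installed} does not match {required}"

# The distribution name is an assumption; adjust it to match your environment.
print(check_requirement("megatron-core", "0.13"))
```

The `get_version` parameter exists only so the check can be exercised without the package installed. mbridge is not checked here because the comment asks for the latest main branch rather than a released version.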
Hello folks, sorry for the buggy Docker image I previously provided. I have a new tested image