
Conversation

@ISEEKYAN (Collaborator) commented Apr 28, 2025

works with qwen2.5vl 3b + geo3k

[3 screenshots]

@ISEEKYAN marked this pull request as ready for review June 3, 2025 07:49
@ISEEKYAN (Collaborator, Author) commented Jun 5, 2025

> I have a general question. Does mcore not support qwen2.5vl? i.e. why is the implementation done in verl instead of in mcore (and verl imports mcore)?
> Another question: what do you think of works such as https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/qwen2_5_vl/README.md#Megatron-Core%E6%A8%A1%E5%9E%8B%E8%AE%AD%E7%BB%83%E6%B5%81%E7%A8%8B ?

  1. The features needed by qwen2.5vl in mcore have been under development since last year; some of them have been released (such as the mrope position embedding in mcore 0.12), but the timeline for complete support is not guaranteed.
  2. PAI's implementation was the reference for mine. The differences are that I use native mcore components where possible to avoid duplicated code, additionally support sequence packing for RL, and modify the vision preprocessing to be consistent with the FSDP backend. I have added an acknowledgment to the PAI team in the copyright header.

We at NVIDIA have developed mbridge, a universal solution for RL frameworks to seamlessly use Megatron with models downloaded from Hugging Face. Support for various model architectures, including qwen2.5vl, will be included in mbridge and maintained by the NeMo team. I will submit PRs to adopt mbridge after it is published; once verl works with mbridge, its Megatron-related code will be clean and concise. By maintaining mbridge we will support the Megatron experience of all RL frameworks, including NeMo-RL.
The release path and date are still under discussion, but it should not take very long. For now, we have to support qwen2.5vl in verl directly.
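For a sense of the intended workflow, here is a minimal hypothetical sketch; mbridge was unpublished at the time of this comment, so the class and method names below are assumptions, not confirmed API:

```python
# Hypothetical sketch of the bridge workflow described above; AutoBridge and
# its methods are assumed names, not mbridge's published interface.
from mbridge import AutoBridge

# Build a parallelized Megatron-Core model straight from a Hugging Face
# checkpoint, then load the HF weights into it.
bridge = AutoBridge.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
model = bridge.get_model()
bridge.load_weights(model, "Qwen/Qwen2.5-VL-3B-Instruct")
```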

@ISEEKYAN changed the title from "[mcore] qwen2.5vl" to "[megatron] qwen2.5vl" Jun 9, 2025
@ISEEKYAN changed the title from "[megatron] qwen2.5vl" to "[megatron] feat: qwen2.5vl" Jun 9, 2025
@eric-haibin-lin (Collaborator) left a comment


could u add a new dataset and a reference training record in https://verl.readthedocs.io/en/latest/algo/baseline.html ? thanks! (doing it in the next PR is fine).
i do not have further comments for now. i think it's crucial to refactor the worker classes according to #1560 and #1913 so that we can have standalone tests for model fwd/bwd.
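As an illustration, a minimal sketch of what such a standalone test could look like, assuming the refactor exposes a model constructor independent of the worker classes (hypothetical names, not existing verl test code):

```python
import torch

def test_forward_backward(build_model):
    # build_model is an assumed fixture returning the bare model,
    # with no worker/trainer glue around it.
    model = build_model()
    input_ids = torch.randint(0, 1000, (1, 128))
    # An HF-style output object with a .loss field is assumed here.
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    assert all(p.grad is not None for p in model.parameters() if p.requires_grad)
```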

@ISEEKYAN (Collaborator, Author)

> could u add a new dataset and a reference training record in https://verl.readthedocs.io/en/latest/algo/baseline.html ? thanks! (doing it in the next PR is fine). [...] i think it's crucial to refactor the worker classes according to #1560 and #1913 so that we can have standalone tests for model fwd/bwd.

Sure, I will add a record of qwen2.5vl + megatron in the next PR.

For the refactors, I will keep tracking them and contribute from the Megatron side.

@eric-haibin-lin (Collaborator)

@dataproblems do you want to take a final look?

@ETOgaosion merged commit 85fef90 into volcengine:main Jun 10, 2025
38 checks passed
@dataproblems

@eric-haibin-lin - my bad! I missed this! I'll try to make sure I get the notifications from now on!!

@ISEEKYAN (Collaborator, Author)

@eric-haibin-lin added a record and a recipe for training qwen2.5vl 7b, see #1969

@MaoChouHJM (Contributor)

@ISEEKYAN hi, thanks for your awesome work! it seems we have to guarantee there is at least one image in each mcore_batch_sz?
https://github.com/volcengine/verl/blob/main/verl/models/mcore/model_forward.py#L91-L92

@ISEEKYAN (Collaborator, Author)

> @ISEEKYAN hi, thanks for your awesome work! it seems we have to guarantee there is at least one image in each mcore_batch_sz? https://github.com/volcengine/verl/blob/main/verl/models/mcore/model_forward.py#L91-L92

multiple images are stacked in https://github.com/volcengine/verl/blob/main/verl/workers/actor/megatron_actor.py#L412
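For context, a minimal sketch of the stacking idea (a hypothetical helper; the actual implementation lives at the link above): each sample's tensors are concatenated along their first dimension so that a single forward pass sees every image in the micro-batch.

```python
import torch

def stack_multi_modal_inputs(samples):
    """Hypothetical sketch: merge per-sample multimodal inputs for one
    micro-batch. Each sample carries a [num_patches_i, dim] pixel_values
    tensor and a [num_images_i, 3] image_grid_thw tensor; both are
    concatenated along dim 0."""
    return {
        "pixel_values": torch.cat([s["pixel_values"] for s in samples], dim=0),
        "image_grid_thw": torch.cat([s["image_grid_thw"] for s in samples], dim=0),
    }
```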

@MaoChouHJM (Contributor)

> multiple images are stacked in https://github.com/volcengine/verl/blob/main/verl/workers/actor/megatron_actor.py#L412

thanks for your reply! but when we use plain-text data (no images or videos), there would be a KeyError at https://github.com/volcengine/verl/blob/main/verl/models/mcore/model_forward.py#L91-L92. maybe we should pad empty images in the forward function?

@ISEEKYAN (Collaborator, Author)

> thanks for your reply! but when we use plain-text data (no images or videos), there would be a KeyError at https://github.com/volcengine/verl/blob/main/verl/models/mcore/model_forward.py#L91-L92. maybe we should pad empty images in the forward function?

@MaoChouHJM It is a very good question. qwen2.5vl itself supports pure-language input, so this is a bug in the existing code.
I think we can fix it with code like:

# fall back to None when the micro-batch contains no multimodal inputs (plain-text data)
pv = multi_modal_inputs["pixel_values"].to(input_ids.device) if "pixel_values" in multi_modal_inputs else None
igt = multi_modal_inputs["image_grid_thw"].to(input_ids.device) if "image_grid_thw" in multi_modal_inputs else None
output_orig = model(
    input_ids=input_ids_rmpad,
    attention_mask=None,
    position_ids=position_ids,
    packed_seq_params=packed_seq_params,
    pixel_values=pv,
    image_grid_thw=igt,
)

would you please contribute a PR to fix this?

@MaoChouHJM (Contributor)

> @MaoChouHJM It is a very good question. qwen2.5vl itself supports pure-language input, so this is a bug in the existing code. [...] would you please contribute a PR to fix this?

of course, i will try to contribute my first PR. you mentioned "qwen2.5vl itself supports pure-language input"; do you mean the huggingface implementation? maybe I can refer to that code and fix it.

@ISEEKYAN (Collaborator, Author)

> of course, i will try to contribute my first PR. you mentioned "qwen2.5vl itself supports pure-language input"; do you mean the huggingface implementation? maybe I can refer to that code and fix it.

Both the HF implementation and this megatron implementation support pure-language input; the bug is that the existing code assumes multimodal input always exists.
Glad to see your contribution!

@MaoChouHJM (Contributor)

> Both the HF implementation and this megatron implementation support pure-language input; the bug is that the existing code assumes multimodal input always exists. Glad to see your contribution!

@ISEEKYAN here is my PR: #1999, would you please review it? :>

hiyouga pushed a commit that referenced this pull request Jun 18, 2025
…-text and image-text (#1999)

### What does this PR do?

fix qwen2_vl on plain-text data and mixed data of plain-text and image-text; refer to #1286.

### Test

tested on the gsm8k dataset and on mixed data of gsm8k and geo3k.

yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 23, 2025
…-text and image-text (volcengine#1999)
Tyizhanshen pushed a commit to HyperdriveHustle/verl that referenced this pull request Jul 1, 2025
…-text and image-text (volcengine#1999)
