
Conversation

@ISEEKYAN (Collaborator) commented Apr 28, 2025

works with qwen2.5vl 3b + geo3k

[3 screenshots]

@ISEEKYAN marked this pull request as ready for review June 3, 2025 07:49
@ISEEKYAN (Collaborator, Author) commented Jun 5, 2025

> I have a general question. Does mcore not support qwen2.5vl? i.e. why is the implementation done in verl instead of in mcore (and verl imports mcore)?
> Another question: what do you think of works such as https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/qwen2_5_vl/README.md#Megatron-Core%E6%A8%A1%E5%9E%8B%E8%AE%AD%E7%BB%83%E6%B5%81%E7%A8%8B ?

  1. The features needed by qwen2.5vl in mcore have been under development since last year; some of them have been released (such as the mrope position embedding in mcore 0.12), but the timeline for complete support is not guaranteed.
  2. PAI's implementation was the reference for mine. The differences are that I use native mcore components where possible to avoid duplicated code, additionally support sequence packing for RL, and modify the vision preprocessing to be consistent with the FSDP backend. I have added an acknowledgment to the PAI team in the copyright header.

We at NVIDIA have developed mbridge, a universal solution for RL frameworks to seamlessly use Megatron with models downloaded from Hugging Face. Support for various model architectures, including qwen2.5vl, will be included in mbridge and maintained by the NeMo team. I will submit PRs to adopt mbridge after it is published; once verl works with mbridge, its Megatron-related code will be clean and concise. By maintaining mbridge we will support the Megatron experience of all RL frameworks, including NeMo-RL.
The release path and date are still under discussion, but it should not take very long. For now, we have to support qwen2.5vl in verl directly.
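For a sense of the intended workflow, here is a minimal hypothetical sketch; mbridge was unpublished at the time of this comment, so the class and method names below are assumptions, not confirmed API:

```python
# Hypothetical sketch of the bridge workflow described above; AutoBridge and
# its methods are assumed names, not mbridge's published interface.
from mbridge import AutoBridge

# Build a parallelized Megatron-Core model straight from a Hugging Face
# checkpoint, then load the HF weights into it.
bridge = AutoBridge.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
model = bridge.get_model()
bridge.load_weights(model, "Qwen/Qwen2.5-VL-3B-Instruct")
```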

@ISEEKYAN changed the title from "[mcore] qwen2.5vl" to "[megatron] qwen2.5vl" Jun 9, 2025
@ISEEKYAN changed the title from "[megatron] qwen2.5vl" to "[megatron] feat: qwen2.5vl" Jun 9, 2025
@eric-haibin-lin (Collaborator) left a comment


could u add a new dataset and a reference training record in https://verl.readthedocs.io/en/latest/algo/baseline.html ? thanks! (doing it in the next PR is fine).
i do not have further comments for now. i think it's crucial to refactor the worker classes according to #1560 and #1913 so that we can have standalone tests for model fwd/bwd.
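As an illustration, a minimal sketch of what such a standalone test could look like, assuming the refactor exposes a model constructor independent of the worker classes (hypothetical names, not existing verl test code):

```python
import torch

def test_forward_backward(build_model):
    # build_model is an assumed fixture returning the bare model,
    # with no worker/trainer glue around it.
    model = build_model()
    input_ids = torch.randint(0, 1000, (1, 128))
    # An HF-style output object with a .loss field is assumed here.
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    assert all(p.grad is not None for p in model.parameters() if p.requires_grad)
```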

@ISEEKYAN (Collaborator, Author)

> could u add a new dataset and a reference training record in https://verl.readthedocs.io/en/latest/algo/baseline.html ? thanks! (doing it in the next PR is fine). [...] i think it's crucial to refactor the worker classes according to #1560 and #1913 so that we can have standalone tests for model fwd/bwd.

Sure, I will add a record of qwen2.5vl + megatron in the next PR.

For the refactors, I will keep tracking them and contribute from the Megatron side.

@eric-haibin-lin (Collaborator)

@dataproblems do you want to take a final look?

@ETOgaosion merged commit 85fef90 into volcengine:main Jun 10, 2025
38 checks passed
@dataproblems

@eric-haibin-lin - my bad! I missed this! I'll try to make sure I get the notifications from now on!!

@ISEEKYAN (Collaborator, Author)

@eric-haibin-lin added a record and a recipe for training qwen2.5vl 7b, see #1969

@MaoChouHJM (Contributor)

@ISEEKYAN hi, thanks for your awesome work! it seems we have to guarantee there is at least one image in each mcore_batch_sz?
https://github.com/volcengine/verl/blob/main/verl/models/mcore/model_forward.py#L91-L92

@ISEEKYAN (Collaborator, Author)

> @ISEEKYAN hi, thanks for your awesome work! it seems we have to guarantee there is at least one image in each mcore_batch_sz? https://github.com/volcengine/verl/blob/main/verl/models/mcore/model_forward.py#L91-L92

multiple images are stacked in https://github.com/volcengine/verl/blob/main/verl/workers/actor/megatron_actor.py#L412
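For context, a minimal sketch of the stacking idea (a hypothetical helper; the actual implementation lives at the link above): each sample's tensors are concatenated along their first dimension so that a single forward pass sees every image in the micro-batch.

```python
import torch

def stack_multi_modal_inputs(samples):
    """Hypothetical sketch: merge per-sample multimodal inputs for one
    micro-batch. Each sample carries a [num_patches_i, dim] pixel_values
    tensor and a [num_images_i, 3] image_grid_thw tensor; both are
    concatenated along dim 0."""
    return {
        "pixel_values": torch.cat([s["pixel_values"] for s in samples], dim=0),
        "image_grid_thw": torch.cat([s["image_grid_thw"] for s in samples], dim=0),
    }
```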

@MaoChouHJM (Contributor)

> multiple images are stacked in https://github.com/volcengine/verl/blob/main/verl/workers/actor/megatron_actor.py#L412

thanks for your reply! but when we use plain-text data (no images or videos), there would be a KeyError at https://github.com/volcengine/verl/blob/main/verl/models/mcore/model_forward.py#L91-L92. maybe we should pad empty images in the forward function?

@ISEEKYAN (Collaborator, Author)

> thanks for your reply! but when we use plain-text data (no images or videos), there would be a KeyError at https://github.com/volcengine/verl/blob/main/verl/models/mcore/model_forward.py#L91-L92. maybe we should pad empty images in the forward function?

@MaoChouHJM It is a very good question. qwen2.5vl itself supports pure-language input, so this is a bug in the existing code.
I think we can fix it with code like:

# fall back to None when the micro-batch contains no multimodal inputs (plain-text data)
pv = multi_modal_inputs["pixel_values"].to(input_ids.device) if "pixel_values" in multi_modal_inputs else None
igt = multi_modal_inputs["image_grid_thw"].to(input_ids.device) if "image_grid_thw" in multi_modal_inputs else None
output_orig = model(
    input_ids=input_ids_rmpad,
    attention_mask=None,
    position_ids=position_ids,
    packed_seq_params=packed_seq_params,
    pixel_values=pv,
    image_grid_thw=igt,
)

would you please contribute a PR to fix this?

@MaoChouHJM (Contributor)

> @MaoChouHJM It is a very good question. qwen2.5vl itself supports pure-language input, so this is a bug in the existing code. [...] would you please contribute a PR to fix this?

of course, i will try to contribute my first PR. you mentioned "qwen2.5vl itself supports pure-language input"; do you mean the huggingface implementation? maybe I can refer to that code and fix it.

@ISEEKYAN (Collaborator, Author)

> of course, i will try to contribute my first PR. you mentioned "qwen2.5vl itself supports pure-language input"; do you mean the huggingface implementation? maybe I can refer to that code and fix it.

Both the HF implementation and this megatron implementation support pure-language input; the bug is that the existing code assumes multimodal input always exists.
Glad to see your contribution!

@MaoChouHJM (Contributor)

> Both the HF implementation and this megatron implementation support pure-language input; the bug is that the existing code assumes multimodal input always exists. Glad to see your contribution!

@ISEEKYAN here is my PR: #1999, would you please review it? :>

hiyouga pushed a commit that referenced this pull request Jun 18, 2025
…-text and image-text (#1999)

### What does this PR do?

fix qwen2_vl on plain-text data and mixed data of plain-text and image-text; refer to #1286.

### Test

tested on the gsm8k dataset and on mixed data of gsm8k and geo3k.

yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 23, 2025
…-text and image-text (volcengine#1999)
Tyizhanshen pushed a commit to HyperdriveHustle/verl that referenced this pull request Jul 1, 2025
…-text and image-text (volcengine#1999)
