
Conversation

@huvunvidia

No description provided.

abhinavg4 and others added 30 commits September 30, 2025 14:23
* fix cpu init during export

Signed-off-by: yaoyu-33 <[email protected]>

* export env fix

Signed-off-by: yaoyu-33 <[email protected]>

* delete_extra_state for TE related during checkpoint loading for export

Signed-off-by: yaoyu-33 <[email protected]>

* paths fixes

Signed-off-by: yaoyu-33 <[email protected]>

* add override_provider option for checkpoint loading

Signed-off-by: yaoyu-33 <[email protected]>

* add unit test for override_provider option

Signed-off-by: yaoyu-33 <[email protected]>

* remove debug lines

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* unit test fix

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
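The `override_provider` option added in the commits above is not spelled out in this log; a minimal sketch of how such an option could behave, with every name here (`ModelProvider`, `load_checkpoint`) a hypothetical stand-in rather than the PR's actual API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelProvider:
    # Hypothetical stand-in for the provider config stored with a checkpoint.
    num_layers: int = 32
    seq_length: int = 4096


def load_checkpoint(path: str,
                    override_provider: Optional[ModelProvider] = None) -> ModelProvider:
    """Return the provider stored with the checkpoint unless the caller overrides it."""
    stored = ModelProvider()  # stand-in for deserializing the provider from `path`
    return override_provider if override_provider is not None else stored


# Without an override, the stored provider is used as-is.
default_provider = load_checkpoint("ckpt/")
# With an override, the caller's provider takes precedence.
custom_provider = load_checkpoint("ckpt/", override_provider=ModelProvider(num_layers=2))
```

This is the shape the accompanying unit test commit presumably exercises: one load path with the stored provider, one with a caller-supplied override.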
* chore: Add issue template for model requests

Signed-off-by: oliver könig <[email protected]>

* copying over remaining templates

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
* ci: Skip if `docs-only` label is attached

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* update

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
* cleanup process group at end of performance script

Signed-off-by: Ananth Subramaniam <[email protected]>

* Update scripts/performance/run_script.py

Signed-off-by: Ananth Subramaniam <[email protected]>

* destroy pg for other scripts

Signed-off-by: Ananth Subramaniam <[email protected]>

* update

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
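The commits above add process-group teardown at the end of the performance scripts. The underlying pattern can be sketched generically (torch-free; `init_fn`/`destroy_fn` stand in for `torch.distributed.init_process_group` and `torch.distributed.destroy_process_group`):

```python
import contextlib


@contextlib.contextmanager
def managed_process_group(init_fn, destroy_fn):
    """Run a script body between init and guaranteed teardown.

    init_fn/destroy_fn stand in for torch.distributed's
    init_process_group / destroy_process_group in the real scripts.
    """
    init_fn()
    try:
        yield
    finally:
        destroy_fn()  # runs even if the script body raises


calls = []
with managed_process_group(lambda: calls.append("init"),
                           lambda: calls.append("destroy")):
    calls.append("work")
```

Putting the destroy call in a `finally` (or at the end of `main`) avoids the hang-on-exit warnings that leaked process groups can cause in multi-rank jobs.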
* ci(fix): pre-flight

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* final

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
* initial gemma commit

Signed-off-by: Ananth Subramaniam <[email protected]>

* gemma provider

Signed-off-by: Ananth Subramaniam <[email protected]>

* patch tests

Signed-off-by: Ananth Subramaniam <[email protected]>

* add gemma bridge + tests

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix conftest

Signed-off-by: Ananth Subramaniam <[email protected]>

* reenable msc

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix gemma test fallback

Signed-off-by: Ananth Subramaniam <[email protected]>

* try simpler tokenizer

Signed-off-by: Ananth Subramaniam <[email protected]>

* upload assets

Signed-off-by: Ananth Subramaniam <[email protected]>

* use pre-downloaded config for model provider test

Signed-off-by: Ananth Subramaniam <[email protected]>

* lint

Signed-off-by: Ananth Subramaniam <[email protected]>

* address feedback -s

Signed-off-by: Ananth Subramaniam <[email protected]>

* rebase

Signed-off-by: Ananth Subramaniam <[email protected]>

* rebase

Signed-off-by: Ananth Subramaniam <[email protected]>

* use mcore activations

Signed-off-by: Ananth Subramaniam <[email protected]>

* update test

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix mock

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix conversion script reference

Signed-off-by: Ananth Subramaniam <[email protected]>

* subclass

Signed-off-by: Ananth Subramaniam <[email protected]>

* update tests

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <[email protected]>

* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <[email protected]>

* address feedback

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* gemma2 provider and bridge

Signed-off-by: Ananth Subramaniam <[email protected]>

* gemma2 model provider + bridge

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* [docs] placeholder page for performance summary

Signed-off-by: Ananth Subramaniam <[email protected]>

* add sections for releases

Signed-off-by: Ananth Subramaniam <[email protected]>

* improve description

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
… compatibility (NVIDIA-NeMo#829)

* save latest_checkpointed_iteration for compatibility

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix megatron fsdp test assertion

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
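"Save latest_checkpointed_iteration for compatibility" refers to the tracker file Megatron-LM tooling uses to locate the newest checkpoint. A minimal sketch of reading and writing it (the filename follows the Megatron-LM convention; the PR's exact implementation is not shown in this log):

```python
import os
import tempfile

# Tracker filename used by Megatron-LM checkpoint tooling.
TRACKER = "latest_checkpointed_iteration.txt"


def save_latest_iteration(ckpt_dir: str, iteration: int) -> None:
    # Writing this file keeps checkpoints discoverable by tools that
    # read the tracker to find the newest training step.
    with open(os.path.join(ckpt_dir, TRACKER), "w") as f:
        f.write(str(iteration))


def read_latest_iteration(ckpt_dir: str) -> int:
    with open(os.path.join(ckpt_dir, TRACKER)) as f:
        return int(f.read().strip())


ckpt_dir = tempfile.mkdtemp()
save_latest_iteration(ckpt_dir, 1000)
```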
* exit profiler context

Signed-off-by: Ananth Subramaniam <[email protected]>

* disable vocab size logging in flops calculation

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* Clear disk space before install check

Signed-off-by: Charlie Truong <[email protected]>

* Revert "Clear disk space before install check"

This reverts commit 2c085f5.

Signed-off-by: Charlie Truong <[email protected]>

* Run bare metal install on self-hosted runners

Signed-off-by: Charlie Truong <[email protected]>

---------

Signed-off-by: Charlie Truong <[email protected]>
…A-NeMo#607)

* update llama and qwen models to use auto bridge and update recipes test as well

Signed-off-by: yaoyu-33 <[email protected]>

* temporary remove llama4 as it's not fully tested or verified.

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "temporary remove llama4 as it's not fully tested or verified."

This reverts commit 5217084.

* temp save

Signed-off-by: yaoyu-33 <[email protected]>

* temp save

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "temp save"

This reverts commit 0c57e2b.

* Revert "temp save"

This reverts commit 0748d52.

* update qwen's recipes

Signed-off-by: yaoyu-33 <[email protected]>

* update llama recipes

Signed-off-by: yaoyu-33 <[email protected]>

* remove some old recipe files

Signed-off-by: yaoyu-33 <[email protected]>

* update recipe files to match old recipes

Signed-off-by: yaoyu-33 <[email protected]>

* update recipe file

Signed-off-by: yaoyu-33 <[email protected]>

* update qwen recipes

Signed-off-by: yaoyu-33 <[email protected]>

* update llama recipes

Signed-off-by: yaoyu-33 <[email protected]>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* recipe naming update

Signed-off-by: yaoyu-33 <[email protected]>

* update test

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* add TypedDict for args

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* update docstring

Signed-off-by: yaoyu-33 <[email protected]>

* unit test fix and license fix

Signed-off-by: yaoyu-33 <[email protected]>

* sync eval_interval and save_interval

Signed-off-by: yaoyu-33 <[email protected]>

* add comments

Signed-off-by: yaoyu-33 <[email protected]>

* set TRANSFORMERS_OFFLINE=1 in action.yml

Signed-off-by: yaoyu-33 <[email protected]>

* fix llama3 8b hf model path

Signed-off-by: yaoyu-33 <[email protected]>

* replay lr decay iters update on updated recipes

Signed-off-by: yaoyu-33 <[email protected]>

* Update action.yml

Signed-off-by: Yu Yao <[email protected]>

* add comments

Signed-off-by: yaoyu-33 <[email protected]>

* Add guard / mock for the places that need to download hf config in unit tests

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* add qwen functional test

Signed-off-by: yaoyu-33 <[email protected]>

* update recipe tests

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
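The recipe commits above set `TRANSFORMERS_OFFLINE=1` in CI and mock the places that download HF configs in unit tests. A sketch of that mocking pattern with `unittest.mock` (the `hf_utils`/`fetch_hf_config`/`build_provider` names are hypothetical; the real code presumably wraps `AutoConfig.from_pretrained`):

```python
from unittest import mock


class hf_utils:
    """Hypothetical holder for the download call the real recipes make."""

    @staticmethod
    def fetch_hf_config(model_name: str) -> dict:
        # With TRANSFORMERS_OFFLINE=1, an unmocked download fails loudly in CI.
        raise RuntimeError("network access is not allowed in unit tests")


def build_provider(model_name: str) -> dict:
    cfg = hf_utils.fetch_hf_config(model_name)
    return {"num_layers": cfg["num_hidden_layers"]}


# In tests, patch the download so no network (or HF cache) is needed.
with mock.patch.object(hf_utils, "fetch_hf_config",
                       return_value={"num_hidden_layers": 2}):
    provider = build_provider("some-org/some-model")
```

Patching at the seam where the config is fetched, rather than deep inside `transformers`, keeps the tests hermetic and fast.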
…ation support

- Introduced `pretrain_DiT_Model.py` for flexible pretraining using Megatron-Bridge.
- Updated `DITForwardStep` class to use `__call__` method for forward steps.
- Modified dataset configuration in `pretrain_config` to utilize `DiffusionDataModule`.
- Adjusted tensor and context parallelism settings in `llama3_8b.py`.

This commit enhances the pretraining capabilities and configuration flexibility for Llama3 models.
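The `DITForwardStep` change described above (switching to `__call__` for forward steps) can be sketched as follows; the class name comes from the commit, but its fields and internals are assumptions for illustration:

```python
class DITForwardStep:
    """Sketch of a callable forward step; internals are assumed, not the PR's code."""

    def __init__(self, loss_scale: float = 1.0):
        self.loss_scale = loss_scale

    def __call__(self, model, batch):
        # Implementing __call__ lets the training loop treat the step object
        # like a plain function: `loss = forward_step(model, batch)`.
        output = model(batch)
        return output * self.loss_scale


step = DITForwardStep(loss_scale=0.5)
# A lambda stands in for the model; the "loss" is just a scaled sum here.
loss = step(lambda b: sum(b), [1.0, 2.0, 3.0])
```

The design point is that a callable object carries configuration (here `loss_scale`) while still plugging into loops that expect a bare forward-step function.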
abhinavg4 and others added 11 commits October 6, 2025 09:33
- Commented out sections in `pretrain_DiT_Model.py` related to OmegaConf merging and command-line overrides for clarity.
- Added `backend` configuration in `llama3_8b_pretrain_override_example.yaml`.
- Updated `init_global_step` handling in `EnergonMultiModalDataModule` to simplify initialization.
- Introduced `DiffusionDataModuleConfig` for better dataset configuration management.
- Adjusted model parameters in `llama_provider.py` to set `num_layers` to 2 and added `seq_length` and `vocab_size` attributes in `DiTModelProvider`.
- Refined imports across various modules to ensure consistency and clarity.

This commit enhances the configuration structure and model initialization process, improving maintainability and usability.
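The `DiffusionDataModuleConfig` mentioned above is not defined in this log; a minimal dataclass sketch of what such a config object might look like (all field names are assumptions):

```python
from dataclasses import dataclass


@dataclass
class DiffusionDataModuleConfig:
    """Sketch of the dataset config named in the commit; fields are assumptions."""
    path: str = ""
    seq_length: int = 4096
    micro_batch_size: int = 1
    num_workers: int = 2


def build_datamodule(cfg: DiffusionDataModuleConfig) -> dict:
    # A config dataclass keeps dataset wiring declarative and easy to
    # override from YAML files like the pretrain override example above.
    return {"path": cfg.path, "batch": cfg.micro_batch_size}


cfg = DiffusionDataModuleConfig(path="/data/dit", micro_batch_size=4)
dm = build_datamodule(cfg)
```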