
Conversation

@huvunvidia

No description provided.

abhinavg4 and others added 30 commits September 30, 2025 14:23
* fix cpu init during export

Signed-off-by: yaoyu-33 <[email protected]>

* export env fix

Signed-off-by: yaoyu-33 <[email protected]>

* delete_extra_state for TE related during checkpoint loading for export

Signed-off-by: yaoyu-33 <[email protected]>

* paths fixes

Signed-off-by: yaoyu-33 <[email protected]>

* add override_provider option for checkpoint loading

Signed-off-by: yaoyu-33 <[email protected]>

* add unit test for override_provider option

Signed-off-by: yaoyu-33 <[email protected]>

* remove debug lines

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* unit test fix

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
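The `override_provider` option added in the commits above is not spelled out in this log; a minimal sketch of how such an option could behave, with every name here (`ModelProvider`, `load_checkpoint`) a hypothetical stand-in rather than the PR's actual API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelProvider:
    # Hypothetical stand-in for the provider config stored with a checkpoint.
    num_layers: int = 32
    seq_length: int = 4096


def load_checkpoint(path: str,
                    override_provider: Optional[ModelProvider] = None) -> ModelProvider:
    """Return the provider stored with the checkpoint unless the caller overrides it."""
    stored = ModelProvider()  # stand-in for deserializing the provider from `path`
    return override_provider if override_provider is not None else stored


# Without an override, the stored provider is used as-is.
default_provider = load_checkpoint("ckpt/")
# With an override, the caller's provider takes precedence.
custom_provider = load_checkpoint("ckpt/", override_provider=ModelProvider(num_layers=2))
```

This is the shape the accompanying unit test commit presumably exercises: one load path with the stored provider, one with a caller-supplied override.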
* chore: Add issue template for model requests

Signed-off-by: oliver könig <[email protected]>

* copying over remaining templates

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
* ci: Skip if `docs-only` label is attached

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* update

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
* cleanup process group at end of performance script

Signed-off-by: Ananth Subramaniam <[email protected]>

* Update scripts/performance/run_script.py

Signed-off-by: Ananth Subramaniam <[email protected]>

* destroy pg for other scripts

Signed-off-by: Ananth Subramaniam <[email protected]>

* update

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
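The commits above add process-group teardown at the end of the performance scripts. The underlying pattern can be sketched generically (torch-free; `init_fn`/`destroy_fn` stand in for `torch.distributed.init_process_group` and `torch.distributed.destroy_process_group`):

```python
import contextlib


@contextlib.contextmanager
def managed_process_group(init_fn, destroy_fn):
    """Run a script body between init and guaranteed teardown.

    init_fn/destroy_fn stand in for torch.distributed's
    init_process_group / destroy_process_group in the real scripts.
    """
    init_fn()
    try:
        yield
    finally:
        destroy_fn()  # runs even if the script body raises


calls = []
with managed_process_group(lambda: calls.append("init"),
                           lambda: calls.append("destroy")):
    calls.append("work")
```

Putting the destroy call in a `finally` (or at the end of `main`) avoids the hang-on-exit warnings that leaked process groups can cause in multi-rank jobs.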
* ci(fix): pre-flight

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* final

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
* initial gemma commit

Signed-off-by: Ananth Subramaniam <[email protected]>

* gemma provider

Signed-off-by: Ananth Subramaniam <[email protected]>

* patch tests

Signed-off-by: Ananth Subramaniam <[email protected]>

* add gemma bridge + tests

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix conftest

Signed-off-by: Ananth Subramaniam <[email protected]>

* reenable msc

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix gemma test fallback

Signed-off-by: Ananth Subramaniam <[email protected]>

* try simpler tokenizer

Signed-off-by: Ananth Subramaniam <[email protected]>

* upload assets

Signed-off-by: Ananth Subramaniam <[email protected]>

* use pre-downloaded config for model provider test

Signed-off-by: Ananth Subramaniam <[email protected]>

* lint

Signed-off-by: Ananth Subramaniam <[email protected]>

* address feedback -s

Signed-off-by: Ananth Subramaniam <[email protected]>

* rebase

Signed-off-by: Ananth Subramaniam <[email protected]>

* rebase

Signed-off-by: Ananth Subramaniam <[email protected]>

* use mcore activations

Signed-off-by: Ananth Subramaniam <[email protected]>

* update test

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix mock

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix conversion script reference

Signed-off-by: Ananth Subramaniam <[email protected]>

* subclass

Signed-off-by: Ananth Subramaniam <[email protected]>

* update tests

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <[email protected]>

* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <[email protected]>

* address feedback

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* gemma2 provider and bridge

Signed-off-by: Ananth Subramaniam <[email protected]>

* gemma2 model provider + bridge

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* [docs] placeholder page for performance summary

Signed-off-by: Ananth Subramaniam <[email protected]>

* add sections for releases

Signed-off-by: Ananth Subramaniam <[email protected]>

* improve description

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
… compatibility (NVIDIA-NeMo#829)

* save latest_checkpointed_iteration for compatibility

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix megatron fsdp test assertion

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
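"Save latest_checkpointed_iteration for compatibility" refers to the tracker file Megatron-LM tooling uses to locate the newest checkpoint. A minimal sketch of reading and writing it (the filename follows the Megatron-LM convention; the PR's exact implementation is not shown in this log):

```python
import os
import tempfile

# Tracker filename used by Megatron-LM checkpoint tooling.
TRACKER = "latest_checkpointed_iteration.txt"


def save_latest_iteration(ckpt_dir: str, iteration: int) -> None:
    # Writing this file keeps checkpoints discoverable by tools that
    # read the tracker to find the newest training step.
    with open(os.path.join(ckpt_dir, TRACKER), "w") as f:
        f.write(str(iteration))


def read_latest_iteration(ckpt_dir: str) -> int:
    with open(os.path.join(ckpt_dir, TRACKER)) as f:
        return int(f.read().strip())


ckpt_dir = tempfile.mkdtemp()
save_latest_iteration(ckpt_dir, 1000)
```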
* exit profiler context

Signed-off-by: Ananth Subramaniam <[email protected]>

* disable vocab size logging in flops calculation

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* Clear disk space before install check

Signed-off-by: Charlie Truong <[email protected]>

* Revert "Clear disk space before install check"

This reverts commit 2c085f5.

Signed-off-by: Charlie Truong <[email protected]>

* Run bare metal install on self-hosted runners

Signed-off-by: Charlie Truong <[email protected]>

---------

Signed-off-by: Charlie Truong <[email protected]>
…A-NeMo#607)

* update llama and qwen models to use auto bridge and update recipes test as well

Signed-off-by: yaoyu-33 <[email protected]>

* temporary remove llama4 as it's not fully tested or verified.

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "temporary remove llama4 as it's not fully tested or verified."

This reverts commit 5217084.

* temp save

Signed-off-by: yaoyu-33 <[email protected]>

* temp save

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "temp save"

This reverts commit 0c57e2b.

* Revert "temp save"

This reverts commit 0748d52.

* update qwen's recipes

Signed-off-by: yaoyu-33 <[email protected]>

* update llama recipes

Signed-off-by: yaoyu-33 <[email protected]>

* remove some old recipe files

Signed-off-by: yaoyu-33 <[email protected]>

* update recipe files to match old recipes

Signed-off-by: yaoyu-33 <[email protected]>

* update recipe file

Signed-off-by: yaoyu-33 <[email protected]>

* update qwen recipes

Signed-off-by: yaoyu-33 <[email protected]>

* update llama recipes

Signed-off-by: yaoyu-33 <[email protected]>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* recipe naming update

Signed-off-by: yaoyu-33 <[email protected]>

* update test

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* add TypedDict for args

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* update docstring

Signed-off-by: yaoyu-33 <[email protected]>

* unit test fix and license fix

Signed-off-by: yaoyu-33 <[email protected]>

* sync eval_interval and save_interval

Signed-off-by: yaoyu-33 <[email protected]>

* add comments

Signed-off-by: yaoyu-33 <[email protected]>

* set TRANSFORMERS_OFFLINE=1 in action.yml

Signed-off-by: yaoyu-33 <[email protected]>

* fix llama3 8b hf model path

Signed-off-by: yaoyu-33 <[email protected]>

* replay lr decay iters update on updated recipes

Signed-off-by: yaoyu-33 <[email protected]>

* Update action.yml

Signed-off-by: Yu Yao <[email protected]>

* add comments

Signed-off-by: yaoyu-33 <[email protected]>

* Add guard / mock for the places that need to download hf config in unit tests

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* add qwen functional test

Signed-off-by: yaoyu-33 <[email protected]>

* update recipe tests

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
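The recipe commits above set `TRANSFORMERS_OFFLINE=1` in CI and mock the places that download HF configs in unit tests. A sketch of that mocking pattern with `unittest.mock` (the `hf_utils`/`fetch_hf_config`/`build_provider` names are hypothetical; the real code presumably wraps `AutoConfig.from_pretrained`):

```python
from unittest import mock


class hf_utils:
    """Hypothetical holder for the download call the real recipes make."""

    @staticmethod
    def fetch_hf_config(model_name: str) -> dict:
        # With TRANSFORMERS_OFFLINE=1, an unmocked download fails loudly in CI.
        raise RuntimeError("network access is not allowed in unit tests")


def build_provider(model_name: str) -> dict:
    cfg = hf_utils.fetch_hf_config(model_name)
    return {"num_layers": cfg["num_hidden_layers"]}


# In tests, patch the download so no network (or HF cache) is needed.
with mock.patch.object(hf_utils, "fetch_hf_config",
                       return_value={"num_hidden_layers": 2}):
    provider = build_provider("some-org/some-model")
```

Patching at the seam where the config is fetched, rather than deep inside `transformers`, keeps the tests hermetic and fast.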
…ation support

- Introduced `pretrain_DiT_Model.py` for flexible pretraining using Megatron-Bridge.
- Updated `DITForwardStep` class to use `__call__` method for forward steps.
- Modified dataset configuration in `pretrain_config` to utilize `DiffusionDataModule`.
- Adjusted tensor and context parallelism settings in `llama3_8b.py`.

This commit enhances the pretraining capabilities and configuration flexibility for Llama3 models.
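The `DITForwardStep` change described above (switching to `__call__` for forward steps) can be sketched as follows; the class name comes from the commit, but its fields and internals are assumptions for illustration:

```python
class DITForwardStep:
    """Sketch of a callable forward step; internals are assumed, not the PR's code."""

    def __init__(self, loss_scale: float = 1.0):
        self.loss_scale = loss_scale

    def __call__(self, model, batch):
        # Implementing __call__ lets the training loop treat the step object
        # like a plain function: `loss = forward_step(model, batch)`.
        output = model(batch)
        return output * self.loss_scale


step = DITForwardStep(loss_scale=0.5)
# A lambda stands in for the model; the "loss" is just a scaled sum here.
loss = step(lambda b: sum(b), [1.0, 2.0, 3.0])
```

The design point is that a callable object carries configuration (here `loss_scale`) while still plugging into loops that expect a bare forward-step function.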
abhinavg4 and others added 11 commits October 6, 2025 09:33
- Commented out sections in `pretrain_DiT_Model.py` related to OmegaConf merging and command-line overrides for clarity.
- Added `backend` configuration in `llama3_8b_pretrain_override_example.yaml`.
- Updated `init_global_step` handling in `EnergonMultiModalDataModule` to simplify initialization.
- Introduced `DiffusionDataModuleConfig` for better dataset configuration management.
- Adjusted model parameters in `llama_provider.py` to set `num_layers` to 2 and added `seq_length` and `vocab_size` attributes in `DiTModelProvider`.
- Refined imports across various modules to ensure consistency and clarity.

This commit enhances the configuration structure and model initialization process, improving maintainability and usability.
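The `DiffusionDataModuleConfig` mentioned above is not defined in this log; a minimal dataclass sketch of what such a config object might look like (all field names are assumptions):

```python
from dataclasses import dataclass


@dataclass
class DiffusionDataModuleConfig:
    """Sketch of the dataset config named in the commit; fields are assumptions."""
    path: str = ""
    seq_length: int = 4096
    micro_batch_size: int = 1
    num_workers: int = 2


def build_datamodule(cfg: DiffusionDataModuleConfig) -> dict:
    # A config dataclass keeps dataset wiring declarative and easy to
    # override from YAML files like the pretrain override example above.
    return {"path": cfg.path, "batch": cfg.micro_batch_size}


cfg = DiffusionDataModuleConfig(path="/data/dit", micro_batch_size=4)
dm = build_datamodule(cfg)
```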