Change to enable full iteration CUDA graph for LLMs by vasunvidia · Pull Request #14077 · NVIDIA-NeMo/NeMo

vasunvidia · 2025-07-01T00:06:40Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Enables full-iteration CUDA graphs for NeMo models.

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this
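
The commit messages in this PR mention selecting full-iteration capture through enable_cuda_graph together with cuda_graph_scope. A minimal sketch of that check follows. It assumes a hypothetical GraphFlags stand-in rather than the real NeMo config class, and it assumes the scope string "full_iteration", which is not stated anywhere on this page.

```python
from dataclasses import dataclass


@dataclass
class GraphFlags:
    """Hypothetical stand-in for a model config; not the NeMo class."""

    enable_cuda_graph: bool = False
    cuda_graph_scope: str = "full"  # placeholder default, an assumption


# Assumed scope value, for illustration only; check the merged code for the real string.
FULL_ITERATION_SCOPE = "full_iteration"


def uses_full_iteration_graph(flags: GraphFlags) -> bool:
    """True only when CUDA graphs are enabled and the scope selects the full iteration."""
    return flags.enable_cuda_graph and flags.cuda_graph_scope == FULL_ITERATION_SCOPE


if __name__ == "__main__":
    flags = GraphFlags(enable_cuda_graph=True, cuda_graph_scope=FULL_ITERATION_SCOPE)
    print(uses_full_iteration_graph(flags))
```

With any other scope value, the same flag would presumably fall back to the partial (per-layer) CUDA graph path that the commit messages refer to.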

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

katec846

I’m not entirely sure what the final PR solution should look like, but for cases where pipeline parallelism is greater than 1, we currently have two options:

- Remove the sync() call in p2p_communication.py, though we would need to justify that removing it is safe.
- Add assertions to ensure that config.batch_p2p_sync == False and config.variable_seq_lengths == False.

Also, for cases where VP > 1, we need to disable optim.overlap_param_gather on H100 for the run to work.
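
The second option and the VP > 1 constraint could be sketched as a single pre-flight check. This is only an illustration: ParallelSettings and full_graph_conflicts are hypothetical names, the default values are placeholders, and the check does not detect the GPU type, so the H100-specific condition is applied unconditionally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParallelSettings:
    """Hypothetical stand-in for the real config objects; not a NeMo class."""

    pipeline_parallel_size: int = 1
    virtual_pipeline_size: int = 1  # "VP" in the comment above
    batch_p2p_sync: bool = True  # placeholder default
    variable_seq_lengths: bool = False  # placeholder default
    overlap_param_gather: bool = False  # placeholder default


def full_graph_conflicts(s: ParallelSettings) -> List[str]:
    """Return the settings the comment above flags as incompatible."""
    conflicts = []
    if s.pipeline_parallel_size > 1:
        if s.batch_p2p_sync:
            conflicts.append("batch_p2p_sync must be False when PP > 1")
        if s.variable_seq_lengths:
            conflicts.append("variable_seq_lengths must be False when PP > 1")
    if s.virtual_pipeline_size > 1 and s.overlap_param_gather:
        # GPU type is not checked here; the comment reports this for H100.
        conflicts.append("overlap_param_gather must be disabled when VP > 1")
    return conflicts


if __name__ == "__main__":
    settings = ParallelSettings(pipeline_parallel_size=2, batch_p2p_sync=False)
    problems = full_graph_conflicts(settings)
    assert not problems, "; ".join(problems)
```

Returning a list instead of asserting inside the function lets a caller report every conflict at once; the real change may well use plain assertions as the comment proposes.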

nemo/collections/diffusion/models/dit/dit_layer_spec.py

nemo/lightning/megatron_parallel.py

nemo/collections/llm/gpt/model/base.py

scripts/performance/helpers.py

Initial commit
Signed-off-by: Vasudevan Rengasamy <[email protected]>

Rename FullCGWrapper to FullCudaGraphWrapper
Signed-off-by: Vasudevan Rengasamy <[email protected]>

Change full_cuda_graph to enable_whole_network_cuda_graph
Signed-off-by: Vasudevan Rengasamy <[email protected]>

Change Full Iteration CUDA graph argument name
Change to check for partial CUDA graph condition with enable_cuda_graph flag
Apply isort and black reformatting
Signed-off-by: vasunvidia <[email protected]>
Signed-off-by: Vasudevan Rengasamy <[email protected]>

Use enable_cuda_graph+cuda_graph_scope for full iteration CG
Signed-off-by: Vasudevan Rengasamy <[email protected]>

Fix bug for VPP>1
Signed-off-by: Vasudevan Rengasamy <[email protected]>

Add comments.
Signed-off-by: Vasudevan Rengasamy <[email protected]>

Add comments.
Signed-off-by: Vasudevan Rengasamy <[email protected]>

Signed-off-by: Vasudevan Rengasamy <[email protected]>

Signed-off-by: vasunvidia <[email protected]> Signed-off-by: Vasudevan Rengasamy <[email protected]>

Signed-off-by: Kate Cheng <[email protected]> Signed-off-by: Vasudevan Rengasamy <[email protected]>

Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Vasudevan Rengasamy <[email protected]>

* Tutorial and doc update
Signed-off-by: Nune <[email protected]>
* Clear outputs from notebook
Signed-off-by: Nune <[email protected]>
* Update docs/source/asr/models.rst
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: nune-tadevosyan <[email protected]>
---------
Signed-off-by: Nune <[email protected]>
Signed-off-by: nune-tadevosyan <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Vasudevan Rengasamy <[email protected]>

This reverts commit 00751d5. Signed-off-by: Vasudevan Rengasamy <[email protected]>

This reverts commit af83c2d. Signed-off-by: Vasudevan Rengasamy <[email protected]>

vasunvidia force-pushed the full_cuda_graph2 branch 2 times, most recently from d12fabd to d7364e5 on July 12, 2025 15:31

gautham-kollu requested review from erhoo82 and removed request for erhoo82 July 16, 2025 16:27

vasunvidia force-pushed the full_cuda_graph2 branch 3 times, most recently from 0c1a110 to b11fd52 on July 17, 2025 21:18

github-actions bot added the NLP label Jul 29, 2025

vasunvidia force-pushed the full_cuda_graph2 branch 2 times, most recently from 04e60ed to 71f2c6b on July 29, 2025 03:14

katec846 reviewed Jul 31, 2025

nemo/collections/diffusion/models/dit/dit_layer_spec.py (resolved)

nemo/lightning/megatron_parallel.py (outdated, resolved)

gautham-kollu added the Run CICD label Aug 1, 2025

gautham-kollu temporarily deployed to test with GitHub Actions (inactive) on August 1, 2025 17:07