feat: add post-training and custom-training support #31
Merged
Conversation
qianlim
commented
Apr 15, 2025
```diff
 if isinstance(control_weight, torch.Tensor):
     if control_weight.ndim == 0:  # Single scalar tensor
-        control_weight = [float(control_weight)]
+        control_weight = [float(control_weight)] * len(guided_hints)
```
Contributor
Author
@tcwang0509 This line differs between the current repo and i4. Could you advise?
Contributor
Sure we can add it. Currently when ndim==0, len(guided_hints) is always 1, but it's good to make it general.
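The generalization discussed above can be sketched as follows. This is a minimal illustration, not the repo's actual code; the helper name `normalize_control_weight` and the stand-in `guided_hints` list are assumptions for the example:

```python
import torch

def normalize_control_weight(control_weight, guided_hints):
    """Sketch: expand a 0-dim tensor control weight to one float per hint.

    Broadcasting with len(guided_hints) (instead of assuming a single
    hint) keeps the behavior correct when more than one hint is present.
    """
    if isinstance(control_weight, torch.Tensor):
        if control_weight.ndim == 0:  # single scalar tensor
            control_weight = [float(control_weight)] * len(guided_hints)
    return control_weight

# Usage: a scalar tensor weight is replicated for each guided hint.
hints = [torch.zeros(1), torch.zeros(1), torch.zeros(1)]
print(normalize_control_weight(torch.tensor(0.5), hints))  # [0.5, 0.5, 0.5]
```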
qianlim
commented
Apr 15, 2025
```diff
 # Reshape to match THWBD format
 weight_map = weight_map.permute(2, 3, 4, 0, 1)  # [T, H, W, B, 1]
+weight_map = weight_map.view(T * H * W, 1, 1, B, 1)
 hint_val = control_feat * weight_map * gate
```
Contributor
Author
@tcwang0509 Also this line: i4 has an additional step that reshapes the weight_map, but it was removed in this repo. Is it needed?
Contributor
This is because in i4 we use TP but here we use CP.
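The shape transformations under discussion can be sketched with dummy tensors. The starting layout `[B, 1, T, H, W]` and the dimension sizes are assumptions for illustration; note that `reshape` is used instead of `view` here because a permuted tensor is generally non-contiguous, and `reshape` falls back to a copy in that case:

```python
import torch

# Hypothetical sizes; assume weight_map starts as [B, C=1, T, H, W].
B, T, H, W = 2, 4, 8, 8
weight_map = torch.rand(B, 1, T, H, W)

# Permute to THWBD order: [T, H, W, B, 1].
weight_map = weight_map.permute(2, 3, 4, 0, 1)
assert weight_map.shape == (T, H, W, B, 1)

# The extra i4 (TP) step: flatten the spatiotemporal axes so the map
# broadcasts against activations laid out as [T*H*W, ..., B, D].
weight_map_tp = weight_map.reshape(T * H * W, 1, 1, B, 1)
assert weight_map_tp.shape == (T * H * W, 1, 1, B, 1)
```

Under CP the activations keep the `[T, H, W, B, D]` layout, so the flattening step is unnecessary, which matches the explanation above.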
Contributor
Thanks Qianli. Can we please trigger CI?
Contributor
pjannaty
approved these changes
Apr 17, 2025
Contributor
pjannaty
left a comment
Let's add training tests in a follow up. Merging for now.
atmguille
pushed a commit
to atmguille/cosmos-transfer1
that referenced
this pull request
Jul 16, 2025
* feat: add post-training and custom-training support
* feat: add separate model definitions supporting tp/sp for training; update configs
* feat: add example Dataset class, add data augmentors, update config
* feat: add example data class, add misc improvements to data loading and config, add script to convert ckpt to tp
* fix: fix conflict in DiTEncoder
* cleanup
* feat: complete README in examples/ for post/pre-training; update the main README
* fix: multiple minor fixes on example dataset
* fix: multiple minor fixes + improve example dataset performance
* feat+fix: multiple fixes + refinements to README

This PR adds support for training the cosmos-transfer models. It supports single- and multi-node training with Tensor Parallelism and Sequence Parallelism, and covers both training customized models from scratch and post-training / fine-tuning from the released checkpoints.
It addresses Issue #3.
Since a large number of files are added and the model classes are also updated (to enable training support), careful review and testing are needed.
So far I've verified that the generated config YAML aligns with the one used to train the released models. I'll launch training jobs to verify the correctness of the implementation.
[WIP]: test the training script, improve the README.