feat: add post-training and custom-training support #31
Merged
Conversation
qianlim
commented
Apr 15, 2025
```diff
 if isinstance(control_weight, torch.Tensor):
     if control_weight.ndim == 0:  # Single scalar tensor
-        control_weight = [float(control_weight)]
+        control_weight = [float(control_weight)] * len(guided_hints)
```
Contributor
Author
@tcwang0509 This line differs between the current repo and i4. Could you advise?
Contributor
Sure we can add it. Currently when ndim==0, len(guided_hints) is always 1, but it's good to make it general.
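The generalization discussed above can be sketched as follows. This is a minimal illustration, not the repo's actual code; the helper name `normalize_control_weight` and the stand-in `guided_hints` list are assumptions for the example:

```python
import torch

def normalize_control_weight(control_weight, guided_hints):
    """Sketch: expand a 0-dim tensor control weight to one float per hint.

    Broadcasting with len(guided_hints) (instead of assuming a single
    hint) keeps the behavior correct when more than one hint is present.
    """
    if isinstance(control_weight, torch.Tensor):
        if control_weight.ndim == 0:  # single scalar tensor
            control_weight = [float(control_weight)] * len(guided_hints)
    return control_weight

# Usage: a scalar tensor weight is replicated for each guided hint.
hints = [torch.zeros(1), torch.zeros(1), torch.zeros(1)]
print(normalize_control_weight(torch.tensor(0.5), hints))  # [0.5, 0.5, 0.5]
```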
qianlim
commented
Apr 15, 2025
```diff
 # Reshape to match THWBD format
 weight_map = weight_map.permute(2, 3, 4, 0, 1)  # [T, H, W, B, 1]
+weight_map = weight_map.view(T * H * W, 1, 1, B, 1)
 hint_val = control_feat * weight_map * gate
```
Contributor
Author
@tcwang0509 Also this line: i4 has an additional step that reshapes the weight_map, but it was removed in this repo. Is it needed?
Contributor
This is because in i4 we use TP but here we use CP.
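The shape transformations under discussion can be sketched with dummy tensors. The starting layout `[B, 1, T, H, W]` and the dimension sizes are assumptions for illustration; note that `reshape` is used instead of `view` here because a permuted tensor is generally non-contiguous, and `reshape` falls back to a copy in that case:

```python
import torch

# Hypothetical sizes; assume weight_map starts as [B, C=1, T, H, W].
B, T, H, W = 2, 4, 8, 8
weight_map = torch.rand(B, 1, T, H, W)

# Permute to THWBD order: [T, H, W, B, 1].
weight_map = weight_map.permute(2, 3, 4, 0, 1)
assert weight_map.shape == (T, H, W, B, 1)

# The extra i4 (TP) step: flatten the spatiotemporal axes so the map
# broadcasts against activations laid out as [T*H*W, ..., B, D].
weight_map_tp = weight_map.reshape(T * H * W, 1, 1, B, 1)
assert weight_map_tp.shape == (T * H * W, 1, 1, B, 1)
```

Under CP the activations keep the `[T, H, W, B, D]` layout, so the flattening step is unnecessary, which matches the explanation above.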
Contributor
Thanks Qianli. Can we please trigger CI?
Contributor
pjannaty
approved these changes
Apr 17, 2025
Contributor
pjannaty
left a comment
Let's add training tests in a follow up. Merging for now.
atmguille
pushed a commit
to atmguille/cosmos-transfer1
that referenced
this pull request
Jul 16, 2025
* feat: add post-training and custom-training support
* feat: add separate model definitions supporting tp/sp for training; update configs
* feat: add example Dataset class, add data augmentors, update config
* feat: add example data class, add misc improvements to data loading and config, add script to convert ckpt to tp
* fix: fix conflict in DiTEncoder
* cleanup
* feat: complete README in examples/ for post/pre-training; update the main README
* fix: multiple minor fixes on example dataset
* fix: multiple minor fixes + improve example dataset performance
* feat+fix: multiple fixes + refinements to README

This PR adds support for training the cosmos-transfer models. It supports single- and multi-node training with Tensor Parallelism and Sequence Parallelism, and covers both training customized models from scratch and post-training / fine-tuning from the released checkpoints.
It addresses Issue #3.
Since a large number of files are added and the model classes are also updated (to enable training support), careful review and testing are needed.
So far I've verified that the generated config YAML aligns with the one used to train the released models. I'll launch training jobs to verify the correctness of the implementation.
[WIP]: test the training script, improve the README.