@ruisizhang123 (Member) commented on Nov 24, 2025

Validate DSV3 manual bucketing when EP/TP are enabled. Tested on the DSV3-16B model. Depends on a PyTorch PR.

(Single Node: BS = 1)

| Node | Method | Parallelism | Memory | TPS | Trace |
|---|---|---|---|---|---|
| 1-Node (8H100) | SimpleFSDP (aot_eager) | FSDP=4 EP=2 | 51.11 GiB (53.80%) | 5,136 | Link |
| 1-Node (8H100) | FSDP2-eager | FSDP=4 EP=2 | 59.54 GiB (62.68%) | 5,942 | Link |
| 1-Node (8H100) | SimpleFSDP (aot_eager) | FSDP=2 TP=2 EP=2 | 42.21 GiB (44.43%) | 2,285 | Link |
| 1-Node (8H100) | FSDP2-eager | FSDP=2 TP=2 EP=2 | 45.41 GiB (47.80%) | 2,349 | Link |
| 8-Node (64H100) | SimpleFSDP (aot_eager) | FSDP=4 EP=2 | | | Link |
| 8-Node (64H100) | FSDP2-eager | FSDP=4 EP=2 | | | Link |
| 8-Node (64H100) | SimpleFSDP (aot_eager) | FSDP=2 TP=2 EP=2 | | | Link |
| 8-Node (64H100) | FSDP2-eager | FSDP=2 TP=2 EP=2 | | | Link |
1. Example Trace

[Screenshot: example trace, 2025-12-10 7:51 PM]

@meta-cla bot added the CLA Signed label on Nov 24, 2025
@ruisizhang123 marked this pull request as draft on November 24, 2025 17:19
@ruisizhang123 force-pushed the ruisi/fix_manual_bucketing_dsv3 branch from f931aa9 to 88b700b on December 11, 2025 05:23
@ruisizhang123 marked this pull request as ready for review on December 11, 2025 05:24
),
"16B": DeepSeekV3ModelArgs(
    vocab_size=102400,
    dim=2048,

Member Author

@tianyu-l Should we have another config to allow users to turn FlexAttention on/off? Currently, FlexAttention doesn't work well with AC here. cc @soulitzer for the AC issue follow-up!

Contributor

what was the symptom?

also if it doesn't work why do we add an entry for it -- is it for repro?

Member Author

No, it's 16B_flexatten that doesn't work. But in the current DSV3 implementation, FlexAttention is turned on by default. I want to have a model config that turns FlexAttention off by default.
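
For context, a minimal sketch of what such a toggle could look like, assuming a boolean field on the model args; the field name use_flex_attn and the flavor contents here are illustrative, not the actual torchtitan definitions:

```python
# Illustrative sketch only -- assumes a `use_flex_attn` flag on the model args;
# the real DeepSeekV3ModelArgs in torchtitan has many more fields.
from dataclasses import dataclass


@dataclass
class DeepSeekV3ModelArgs:
    vocab_size: int = 102400
    dim: int = 2048
    # Assumed flag: lets a flavor opt out of FlexAttention when it interacts
    # badly with activation checkpointing (AC).
    use_flex_attn: bool = True


deepseekv3_configs = {
    # Default 16B flavor with FlexAttention disabled, so AC works.
    "16B": DeepSeekV3ModelArgs(vocab_size=102400, dim=2048, use_flex_attn=False),
    # Separate flavor kept around to reproduce the FlexAttention + AC issue.
    "16B_flexatten": DeepSeekV3ModelArgs(vocab_size=102400, dim=2048, use_flex_attn=True),
}
```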

for m in modules:
    if isinstance(m, list):
-       result.append(convert_modules_to_fqns(m, module_to_fqn_mapping))
+       if fqn_list := convert_modules_to_fqns(m, module_to_fqn_mapping):

Contributor

What does the syntax mean -- assigning to fqn_list and checking it's not None? It feels a bit unusual to read.

Also please add a comment on why we need this check

Member Author

yes, added a comment for it.
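
For readers following along, here is a hypothetical reconstruction of the helper with such a comment; the non-list branch and the stated rationale are assumptions based on the diff snippet and the discussion above, not the PR's exact code:

```python
# Illustrative reconstruction; the actual implementation in the PR may differ.
def convert_modules_to_fqns(modules, module_to_fqn_mapping):
    """Map (possibly nested lists of) modules to their FQN strings."""
    result = []
    for m in modules:
        if isinstance(m, list):
            # Recurse into nested module groups. Only keep the converted group
            # if it is non-empty: modules missing from module_to_fqn_mapping
            # would otherwise leave empty lists in the bucketing plan. The
            # walrus operator `:=` assigns and truth-tests in one step.
            if fqn_list := convert_modules_to_fqns(m, module_to_fqn_mapping):
                result.append(fqn_list)
        elif m in module_to_fqn_mapping:
            result.append(module_to_fqn_mapping[m])
    return result
```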

@ruisizhang123 force-pushed the ruisi/fix_manual_bucketing_dsv3 branch from 88b700b to 35ad842 on December 12, 2025 01:08
@ruisizhang123 force-pushed the ruisi/fix_manual_bucketing_dsv3 branch from 35ad842 to 3b0fdda on December 12, 2025 01:12
Comment on lines +69 to +75
VIEW_OPS = {
    torch.ops.aten.slice.Tensor,
    torch.ops.aten.view.default,
    torch.ops.aten.reshape.default,
    torch.ops.aten.transpose.int,
}

Member Author

@bdhirsh Following today's discussion, I updated reshard_after_fwd to enforce that all VIEW_OPS after the wait are recomputed.

In this fx graph in tlparse (link), view_63-view_65 + transpose_8 are forced to be recomputed, so I get the correct reshard_after_fwd semantics.

However, in the backward, the _grouped_mm is also recomputed, seemingly because we enforce this region to be MUST_RECOMPUTE. I feel like I'm in a rabbit hole: if I don't recompute transpose_8, I don't get correct FSDP semantics; but if I do recompute transpose_8, the follow-up _grouped_mm gets recomputed as well.

Do you think I should fix this on the SimpleFSDP side or on the partitioner side? 🤔

 _to_copy_32: "bf16[1, 256, 256][65536, 256, 1]cuda:0" = torch.ops.aten._to_copy.default(view_58, dtype = torch.bfloat16);  view_58 = None
        
all_gather_into_tensor_19: "bf16[4, 256, 256][65536, 256, 1]cuda:0" = torch.ops._c10d_functional.all_gather_into_tensor.default(_to_copy_32, 4, '1');  _to_copy_32 = None

wait_tensor_22: "bf16[4, 256, 256][65536, 256, 1]cuda:0" = torch.ops._c10d_functional.wait_tensor.default(all_gather_into_tensor_19);  all_gather_into_tensor_19 = None
        
view_63: "bf16[4, 256, 256][65536, 256, 1]cuda:0" = torch.ops.aten.view.default(wait_tensor_22, [4, 256, 256]);  wait_tensor_22 = None
        
view_64: "bf16[4, 256, 256][65536, 256, 1]cuda:0" = torch.ops.aten.view.default(view_63, [4, 256, 256]);  view_63 = None
        
view_65: "bf16[4, 256, 256][65536, 256, 1]cuda:0" = torch.ops.aten.view.default(view_64, [4, 256, 256]);  view_64 = None

transpose_8: "bf16[4, 256, 256][65536, 1, 256]cuda:0" = torch.ops.aten.transpose.int(view_65, -2, -1);  view_65 = None
        
_grouped_mm: "bf16[8*(((u2 + u3 + 39)//8)), 256][256, 1]cuda:0" = torch.ops.aten._grouped_mm.default(index_1, transpose_8, cumsum_2);  transpose_8 = None
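
As a reference for the discussion above, a minimal sketch of the kind of graph walk involved: collect the view-op chain downstream of each wait_tensor so those nodes can be marked for recompute. The function name and how the result would be consumed (i.e. how MUST_RECOMPUTE is actually recorded by SimpleFSDP or the partitioner) are assumptions, and this does not resolve the _grouped_mm over-recompute question.

```python
# Sketch only: shows the graph walk, not SimpleFSDP's actual implementation.
# Assumes a PyTorch build with distributed (for the _c10d_functional ops).
import torch

VIEW_OPS = {
    torch.ops.aten.slice.Tensor,
    torch.ops.aten.view.default,
    torch.ops.aten.reshape.default,
    torch.ops.aten.transpose.int,
}

WAIT_OP = torch.ops._c10d_functional.wait_tensor.default


def collect_views_after_wait(graph: "torch.fx.Graph"):
    """Return view-op nodes reachable from a wait_tensor through view ops only.

    Recomputing these nodes keeps the unsharded (all-gathered) parameter from
    being saved for backward, which is what reshard_after_forward needs.
    """
    to_recompute, seen = [], set()
    for node in graph.nodes:
        if node.op == "call_function" and node.target == WAIT_OP:
            stack = list(node.users)
            while stack:
                user = stack.pop()
                if user in seen:
                    continue
                seen.add(user)
                if user.op == "call_function" and user.target in VIEW_OPS:
                    to_recompute.append(user)
                    # Keep following the chain (e.g. view_63 -> view_64 ->
                    # view_65 -> transpose_8 in the excerpt above).
                    stack.extend(user.users)
    return to_recompute
```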
