Enable loss parallel, Ungate FP8 #2782
Conversation
Signed-off-by: Nathan Azrak <[email protected]>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2782
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit e275c88 with merge base 5b2e881.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ebsmothers left a comment
Thanks for the PR @nathan-az, these memory savings are really impressive! Lmk if any of the comments are unclear
```diff
-# TODO: expose this once tested
-def _fp8_llama_tp_plan() -> dict[str, ParallelStyle]:
+def fp8_llama_tp_plan(
```
I might be missing something here, but are we actually enabling this now? Like if I set enable_fp8_training=True and tensor_parallel_plan=base_llama_tp_plan, where do we actually hook this up?
I've been manually changing tensor_parallel_plan to torchtune.models.llama3.fp8_llama_tp_plan. Good call though - that's bad UX and we can just select the correct plan based on if fp8 is enabled. Will fix.
This commit addresses this. Required a bit of a refactor. Have also applied to LLaMA-4 so it raises an error more elegantly if someone tries FP8 + LLaMA-4 training.
Resolve this comment if you're happy with the solution :)
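As a sketch of the plan-selection idea being discussed, something like the following could pick the plan from the fp8 flag (all names and plan contents here are illustrative placeholders, not torchtune's actual API):

```python
# Hypothetical sketch: select the TP plan from the fp8 flag so users don't
# have to swap tensor_parallel_plan manually. Plan entries are placeholders.
base_llama_tp_plan = {"output": "ColwiseParallel"}
fp8_llama_tp_plan = {"output": "Float8ColwiseParallel"}


def resolve_tp_plan(enable_fp8_training: bool) -> dict:
    """Return the fp8-aware plan when fp8 training is enabled."""
    return fp8_llama_tp_plan if enable_fp8_training else base_llama_tp_plan
```

This keeps the UX to a single `enable_fp8_training` flag, with the plan chosen internally.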
```python
    reduction="sum",
    ignore_index=self.ignore_index,
)
# the all-reduce later complains if a DTensor is returned
```
That's exactly the allreduce I'm referring to. And yes, the loss parallel CE loss does return a Replicate DTensor. Without the full_tensor, when we reach the all-reduce in the recipe, we get:
```
[rank0]: AssertionError: found no DeviceMesh from dtensor args for c10d.allreduce_.default!
```
The loss that comes out is a DTensor, but it's a Replicate, so I saw no difference between using full_tensor and using to_local in the TP case, versus not using loss parallel at all. Differences from DP (which I use as a baseline) are very small. This is without packing, and with effective batch size adjusted.
I did notice that DP+TP+CP appears to have more drastic differences between grad norms and losses versus DP. I expect this is probably due to the grad norm scaling, not the loss parallelism, although CP with/without loss parallel does exhibit slight differences.
note: I tried to keep the run names short. dp refers to dp8, cp or tp use both dim 2, with shard dim and batch size adjusted accordingly.
Suggestion: Unless you have a clearer idea, I'd lean towards leaving this full_tensor in, then follow up to investigate the CP difference.
I won't resolve this comment yet, but please do so if you're happy to leave this as-is for now.
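A minimal sketch of the idea, duck-typed here so it runs without a distributed setup (`materialize_loss` and the stub class are hypothetical, not torchtune's actual helper):

```python
# Hypothetical helper: materialize a replicated DTensor loss before the
# cross-rank all-reduce; plain scalars/tensors pass through unchanged.
def materialize_loss(loss):
    # DTensor exposes .full_tensor(); a plain tensor does not.
    return loss.full_tensor() if hasattr(loss, "full_tensor") else loss


class FakeReplicateDTensor:
    """Stand-in for a Replicate-placement DTensor in this sketch."""

    def __init__(self, value):
        self._value = value

    def full_tensor(self):
        # For a Replicate placement this is just the local value.
        return self._value
```

Since the placement is Replicate, `full_tensor()` involves no extra communication, which matches the observation that it behaves the same as `to_local()` here.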
```python
layerwise_colwise_parallel_cls: type[ParallelStyle] = ColwiseParallel,
layerwise_rowwise_parallel_cls: type[ParallelStyle] = RowwiseParallel,
layerwise_prepare_module_input_cls: type[ParallelStyle] = PrepareModuleInput,
loss_parallel: bool = False,
```
I've been really trying to avoid this with the new loss functions and instead been having the loss functions modify the models directly. Otherwise we have to maintain a special TP plan for every kind of loss (see ligerloss). For the liger loss case you can just call full_tensor but I believe you can modify a model's output_layout after the fact too.
Mm yes this is a reasonable point. The current pattern was taken from torchtitan but we will soon support more loss functions than them.
We have two slightly tricky requirements:
- Our training step needs to be aware of whether to enter the loss parallel context manager. For losses that use the standard PyTorch cross entropy loss, this is the case. For ones like liger, probably not, even if they support loss parallelism.
- We should probably modify the TP plan based on loss parallelism rather than using `redistribute` in the forward step, since this would introduce an additional collective (PyTorch might be smart enough to optimise out consecutive redistributes, but I'm not sure).
I think this is the antithesis of what you're suggesting, but I wonder about relying more on the loss function here, using:
- a method `patch_tp_plan`, which makes any required modifications to the base TP plan prior to parallelisation in the model setup (as simple as a `dict.update`), defaulting to a no-op
- a property `use_loss_parallel_context_manager`, defaulted to false, which indicates whether to enter the context manager in the loss parallel case, used in the training loop
I'll address simple comments first, then implement this so it's easy to review, and simple to revert if you don't like it.
My rationale is that it turns loss parallelism into a first class citizen, and centralises any loss parallel-specific functionality to the loss class itself. Modifying the model I think will end up less straightforward depending on how loss is parallelised.
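A rough sketch of those two hooks (the classes and plan entries below are illustrative, not torchtune's actual SFTLoss):

```python
class SFTLossBase:
    # Defaults: no loss-parallel context manager and no TP-plan changes.
    use_loss_parallel_context_manager: bool = False

    def patch_tp_plan(self, tp_plan: dict) -> dict:
        # No-op by default; subclasses override to adjust the base plan.
        return tp_plan


class LossParallelCE(SFTLossBase):
    # Standard PyTorch CE under loss parallel needs the context manager.
    use_loss_parallel_context_manager = True

    def patch_tp_plan(self, tp_plan: dict) -> dict:
        # e.g. keep the output projection's result sharded so the parallel
        # CE can consume local shards (the value here is a placeholder).
        patched = dict(tp_plan)
        patched["output"] = "ColwiseParallel(use_local_output=False)"
        return patched
```

The recipe would call `loss.patch_tp_plan(base_plan)` before parallelising the model, and check the property when deciding whether to enter the context manager.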
Converting to draft, will update according to #2824 once it merges, then request review again.
I've made some changes to enable loss compilation in the non-loss parallel case. This did require taking some logic out of the function that is compiled, but it is only masking and slicing which I believe should be pretty trivial anyway, from both the memory and compute perspectives. I've confirmed the gains in memory utilisation against the standard PyTorch cross-entropy loss, as well as aligning loss curves.
Above is testing with LLaMA 3.1 8B, seq len 2**13, adjusting batch size for DP size, with 16 chunks (except "baseline", which does not use chunking). Very happy to see notable improvements in memory usage from the chunking, and more with loss parallelism, as well as minor tokens-per-second improvements from compile (in the DP case) and from loss parallelism in the TP case. We can experiment in the future with getting compile working in the TP case, but at least for now this is improved from the current state where it is fully disabled (in the DP case, I see 20% reduced peak memory usage currently).

p.s. this PR now takes heavily from @felipemello1's work in #2824. Would be good to get Felipe's thoughts too.
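As a pure-Python sketch of that split (the eager masking/slicing done outside the function that would be compiled; all names here are hypothetical):

```python
# Illustrative only: the cheap eager part of a chunked loss. Ignored
# positions are dropped and the remainder is split into chunks; the heavy
# per-chunk cross entropy would then run inside the compiled function.
def mask_and_chunk(logits, labels, ignore_index=-100, num_chunks=4):
    """Drop ignored positions, then split into at most num_chunks chunks."""
    kept = [(lg, lb) for lg, lb in zip(logits, labels) if lb != ignore_index]
    size = max(1, -(-len(kept) // num_chunks))  # ceil division
    return [kept[i : i + size] for i in range(0, len(kept), size)]
```

The point is that only trivial indexing lives outside the compiled region, so moving it out shouldn't cost anything in memory or compute.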
hey Nathan, thanks for the PR. I will leave the parallelism implementation to @ebsmothers and @pbontrager, since they were already on top of it. Regarding the loss, i like that you were able to port the other PR! Thanks for making it more modular and enabling compile. My two cents:
- In which scenarios would we want to have `mask_ignored_tokens=False`? If there isn't a strong one, maybe we should remove the flag.
- I personally don't like how SFTLoss has many args related to parallelism:
  a) `supports_loss_parallel`
  b) `loss_parallel_enabled`
  c) `loss_parallel_requires_ctx_manager`
  d) `use_loss_parallel_ctx_manager`

Maybe there is a way to simplify it? e.g. merge a) and c), and make b) and d) always True if the input is TP? I am OK with having only one way of doing things if it works for most cases.
@felipemello1 fair, I erred on the side of leaving options to the users.
Basically to have an option to enable compile, in cases where very few tokens are masked on average. When packing to very high seq len, if samples aren't very long, packing should be very effective, and very few tokens could be masked. Thus, the user may see more gain by enabling compile. Once we have

Agreed, this is a lot, again for the sake of adding options for users. You're right: there's no clear downside to LP, so we can just default to it with TP. I've just pushed another commit. It reduces the above to only one class-level property:
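A sketch of how a single class-level flag could drive the training step's choice of context manager (the flag name, class, and helper are hypothetical; in real code the nullcontext would be replaced by PyTorch's `loss_parallel()` when enabled):

```python
import contextlib


# Illustrative only: one class-level flag, from which the training step
# derives whether to enter the loss-parallel context manager.
class SFTLoss:
    tp_requires_loss_parallel_ctx: bool = True


def training_loss_ctx(loss_fn, tp_enabled: bool):
    """Return the context manager for the loss computation. With TP off, or
    for losses that don't need it, fall back to a no-op context."""
    use_lp = tp_enabled and getattr(loss_fn, "tp_requires_loss_parallel_ctx", False)
    # Placeholder values stand in for the real context managers.
    return contextlib.nullcontext("loss_parallel" if use_lp else "plain")
```

This keeps the decision in one place: losses declare the requirement, and the recipe never branches on loss type directly.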
ebsmothers left a comment
OK no remaining concerns from my side, this looks good to me. We need to get CI fixed (@felipemello1 is working on this in #2841 and elsewhere), once we can get a clean CI run in this should be good to merge.
please merge main so tests can pass

Context
What is the purpose of this PR? Is it to
Changelog
What are the changes made in this PR?
Loss parallelism is the main feature. Loss curves look healthy, but memory utilisation is significantly lower now. This is also compatible (tested) with FP8.
Peak active memory usage scales aggressively (linearly) as tensor parallelism increases without loss parallelism
Peak active memory usage scales much more generously (still linear) with loss parallelism enabled
In the tp8 case (bs=8, seq len=8192), loss parallelism decreases active memory peak by about 30GB, about 35%. TPS remains the same.
Compiling autograd doesn't work (removed autograd compile for now). Unfortunately compiling autograd doesn't seem to work and throws an FSDP error that I don't know how to debug. The error seems to indicate it's caused by some graph break during checkpointing. We can remove this from the PR, or leave it in for future debugging, defaulted to `false`. Running with `TORCHDYNAMO_VERBOSE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=0` yielded:

Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- run pre-commit hooks and linters (make sure you've first installed via `pre-commit install`)
- run unit tests via `pytest tests`
- run recipe tests via `pytest tests -m integration_test`

UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example