Add feature ligerceloss #2741

mananchawla2005 · 2025-05-16T13:37:54Z

PR: Add LigerFusedCrossEntropyLoss

Context

What is the purpose of this PR? Is it to

add a new feature
fix a bug
update tests and/or documentation
other

Closes #2692

Changelog

Added new LigerFusedCrossEntropyLoss class that provides memory-efficient cross entropy loss using fused CUDA kernels using liger-kernels
Implemented proper handling of distributed tensors (DTensor) with gradient hooks
Added comprehensive unit tests comparing against standard PyTorch cross entropy
Added docstrings with usage examples

Test plan

Added unit tests in test_liger_ce_loss.py comparing against F.cross_entropy
Tests verify correctness with:
- Regular tensor inputs
- Batched inputs requiring reshaping
- Ignored indices
Added proper docstrings with usage examples
Tests pass locally on CUDA device
Pre-commit hooks and linters pass

UX

Example usage in docstring:

# Initialize model and loss
model = Transformer(...)  # model with skip_output_layer=True
loss = LigerFusedCrossEntropyLoss()
loss.set_model_output(model)  # This captures model's output layer

# Forward pass
hidden_states = model(inputs)  # [batch_size, seq_len, hidden_dim]
targets = labels  # [batch_size, seq_len]

# If needed, reshape to [batch_size*seq_len, hidden_dim]
hidden_states = hidden_states.view(-1, hidden_states.size(-1))
targets = targets.view(-1)

loss_value = loss(hidden_states, targets)

The implementation provides better performance and memory efficiency compared to the chunked LinearCrossEntropyLoss by:

Using fused CUDA kernels
Combining linear projection and cross entropy in a single operation
Properly handling distributed tensors

All tests verify numerical correctness against PyTorch's native cross entropy within expected tolerances.

Hi PyTorch Team,

This is my first PR to a machine learning project, and I’ve tried to ensure the code is correct and well-structured. I’ve implemented the functionality for LigerCeLoss and included a test case that verifies its behavior with both masked and reshaped inputs.

Due to hardware limitations, I wasn't able to fully validate the distributed functionality or run the entire test suite across multiple GPUs. However, I’ve implemented support for DTensor-based weights, hidden states, and targets, and included the logic for gather → fuse → scatter to handle gradient computation for sharded weights in distributed settings.

Looking forward to your feedback! @pbontrager @joecummings

pytorch-bot · 2025-05-16T13:37:58Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2741

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Cancelled Jobs

As of commit a7a5fcb with merge base c7a92e4 ():

NEW FAILURE - The following job has failed:

GPU tests / gpu_test (3.10, stable) (gh)
tests/recipes/test_eleuther_eval.py::TestEleutherEval::test_hf_eval_vision

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-05-16T13:38:01Z

Hi @mananchawla2005!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

mananchawla2005 · 2025-05-16T13:58:46Z

CLA filled!

facebook-github-bot · 2025-05-16T14:10:13Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

pbontrager

Thank you for this contribution! Really clean first pass at the problem. I left some comments around testing and questions about how you're handling DTensors. For linting, you can checkout our contributing guide on how to setup precommit hooks.

pbontrager · 2025-05-16T13:57:09Z

tests/torchtune/modules/loss/test_liger_ce_loss.py

+        # Validate the results are close enough
+        assert_expected(fused_loss, standard_loss, rtol=1e-2, atol=1e-2)
+
+    def test_liger_fused_cross_entropy_loss_with_reshape(self):


For our SFTLoss type we can assume the input is "[bsz, seq_len, emb_dim]", so I don't think we need this second test.

I think we should have a second test, but it should be a distributed test. Same as the first test but with 4 gpus required and FSDP size 2 and TP size 2. If you need help on how to initialize the model that way I can give you the code.

pbontrager · 2025-05-16T14:00:56Z

tests/torchtune/modules/loss/test_liger_ce_loss.py

+
+
+class TestLigerFusedCrossEntropyLoss:
+    def test_liger_fused_cross_entropy_loss(self):


Since this test requires cuda you should add the "@gpu_test(gpu_count=1)" decorator from "from tests.test_utils import gpu_test". Along with testing the loss value, I think it would be good to test a single forward and backward pass with opt step to ensure all the gradients are propagating back correctly too. You can use "fixed_init_model" (also from test_utils) as well to make it easier to initialize the model the same way each time

I think it would be good to add @pytest.mark.parametrize("compile", [False, True]) to the test and pass in compile as an argument on whether to call apply_compile_strategy on the loss

pyproject.toml

torchtune/modules/loss/cross_entropy_loss.py

pbontrager · 2025-05-16T14:20:32Z

torchtune/modules/loss/cross_entropy_loss.py

+        orig_w = self.linear_projection.weight
+        if isinstance(orig_w, DTensor):
+            mesh, placements = orig_w.device_mesh, orig_w.placements
+            w = orig_w.full_tensor().detach().clone().requires_grad_(True)


Can you tell me more about what you're doing here? Does the liger loss require you to detach the weight? Why detach it only to manually register that gradients get reapplied? Also, I don't think you'd want to register a hook every forward pass.

torchtune/modules/loss/cross_entropy_loss.py

codecov-commenter · 2025-05-16T14:48:18Z

Codecov Report

Attention: Patch coverage is 86.66667% with 12 lines in your changes missing coverage. Please review.

Project coverage is 62.74%. Comparing base (c8e670b) to head (b4bf352).
Report is 7 commits behind head on main.

Files with missing lines	Patch %	Lines
torchtune/modules/loss/cross_entropy_loss.py	62.50%	12 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2741      +/-   ##
==========================================
+ Coverage   60.64%   62.74%   +2.09%     
==========================================
  Files         428      431       +3     
  Lines       26091    26479     +388     
==========================================
+ Hits        15823    16613     +790     
+ Misses      10268     9866     -402

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…wla2005/torchtune into add-feature-ligerceloss

mananchawla2005 · 2025-05-18T19:08:11Z

@pbontrager Hey thanks for an indepth review! I have tried to resolve most of the issues that were coming earlier however for the distributed test would love if you can help a little with the code as well as testing it cause I dont have a distributed gpu setup.

Andrei-Aksionov · 2025-05-19T16:51:44Z

torchtune/modules/loss/cross_entropy_loss.py

+        # self.forward = torch.compile(
+        #     self.forward, *args, **kwargs
+        # )
+        return self


Do you need to compile a liger kernel at all?

Hey it was added in response to @pbontrager

I think it would be good to add @pytest.mark.parametrize("compile", [False, True]) to the test and pass in compile as an argument on whether to call apply_compile_strategy on the loss

Oh, ok, now I see 🙂

So the problem is that the kernel is automatically compiled.
It was obvious from the README of their repo - there was no mentioning that a user needs to call torch.compile manually, which leaves only one option. Yes, Liger provides custom optimized triton kernels, but without compilation they won't work.

So, after digging a bit of their codebase, here is how it works:

LigerFusedLinearCrossEntropyFunction has custom forward and backward implementations. (Let's focus on forward variant for now.)

Inside it, a fused_linear_cross_entropy_forward is called ...

... which calls a triton kernel, that has triton.jit wrapper.

Of course, you can control even that with

with torch._dynamo.disable():

and a flag that is disabled by default and enabled in apply_compile_strategy, but since Triton kernel code is not valid Python code for direct execution on a GPU without a compilation, there is no much sense in it.
Perhaps maybe control compilation of service code inside a loss code around this kernel, but I believe it won't make that much difference 🤔.

In other words, since it was never intended to use liger without a compilation, perhaps just skip this method without any warnings?

I have removed the warning and added a doc string that this is JIT compiled

Andrei-Aksionov · 2025-05-20T10:32:45Z

torchtune/modules/loss/cross_entropy_loss.py

+        try:
+            import liger_kernel.ops.fused_linear_cross_entropy
+
+            self.fused_linear_ce = liger_kernel.ops.fused_linear_cross_entropy


A dummy question: why did you deciced to go this route instead of a loss class as described in README: https://github.com/linkedin/Liger-Kernel?tab=readme-ov-file#3-compose-your-own-model

Hey, you are right to point out that! I chose the lower-level ops approach for bias handling as one under transformers don't have a option for bias parameter and during distributed training we need to handle DTensor. If thats not required I can replace it with the one in readme.

But it looks like forward method of the class also accepts bias, which by default is None.

Basically, there is nothing wrong with your approach, just the one with the class looks, at least to me, slightly cleaner.

Since the class version was just a thin wrapper around LigerFusedLinearCrossEntropyFunction, and our loss class is also a thin wrapper around the same functionality, it feels more right to me that we just directly call LigerFusedLinearCrossEntropyFunction and operate at the same abstraction level as the nn.Module you linked to.

Ok, fair enough.
Linear loss calls F.cross_entropy and this one calls a function - a lil bit of uniformity.

intervitens · 2025-05-21T21:28:36Z

I think it makes more sense to reshape hidden_states and targets to [batch_size*seq_len] inside the loss forward and to set reduction to mean instead of sum in order to match the behavior of existing LinearCrossEntropyLoss and allow the new LigerLinearCrossEntropy to be used as a drop-in replacement in recipe configs.

I've tested a distributed LoRA finetune of llama3.1-8B with those changes, and it seems to work fine, amount of reserved memory was reduced and the difference in loss was minimal.
https://wandb.ai/intervitens/8B-lora-dist-liger

mananchawla2005 · 2025-05-22T11:45:40Z

@intervitens Hey thanks for checking out the distributed training! Glad that it works well. I have incorporated your changes. It would also be very helpful if you can provide an initial starting for adding the distributed test.

Andrei-Aksionov · 2025-05-22T13:44:24Z

torchtune/modules/loss/cross_entropy_loss.py

+        batch_size, seq_len, emb_dim = hidden_states.shape
+        hidden_states = hidden_states.reshape(
+            -1, emb_dim
+        )  # [batch_size*seq_len, emb_dim]
+        targets = targets.reshape(-1)  # [batch_size*seq_len]


Feels like since you don't reuse B, T, C values anywhere, it could be done simpler:

Suggested change

batch_size, seq_len, emb_dim = hidden_states.shape

hidden_states = hidden_states.reshape(

-1, emb_dim

) # [batch_size*seq_len, emb_dim]

targets = targets.reshape(-1) # [batch_size*seq_len]

hidden_states = hidden_states.flatten(0, 1) # (batch_size*seq_len, hidden_size)

targets = targets.flatten() # (batch_size*seq_len)

Andrei-Aksionov · 2025-05-22T13:44:59Z

torchtune/modules/loss/cross_entropy_loss.py

        )
+        if total_elements == 0:
+            return loss
        return loss


So, basically, return loss regardless? 🙂

pbontrager

Thank you for doing all the updates! I think it's close now. I'm going to help with the unit tests and then once you have a chance to make any changes based on Andrei's comments, we should be good to land.

pbontrager · 2025-05-22T14:32:35Z

pyproject.toml

    "wandb",
    "expecttest",
+    # Triton:
+    "triton>=2.3.1 ; platform_system != 'Windows'",


Is this dependency explicitly needed? Pytorch already includes triton I believe.

pbontrager · 2025-05-22T14:45:17Z

torchtune/modules/loss/cross_entropy_loss.py

+                b.register_hook(_scatter_b)
+                self._b_hook_registered = True
+
+        loss, _ = self.fused_linear_ce.LigerFusedLinearCrossEntropyFunction.apply(


nit: couldn't self.fused_linear_ce = LigerFusedLinearCrossEntropyFunction in your init and then here you'd just have self.fused_linear_ce.apply(...)?

One more nit: could we simplify the arguments list, since most of the values that are provided are actually equal to the default ones?

class LigerFusedLinearCrossEntropyFunction(torch.autograd.Function): @staticmethod @amp_custom_fwd def forward( ctx, _input, weight, target, bias=None, ce_weight=None, ignore_index=-100, lse_square_scale=0.0, label_smoothing=0.0, reduction="mean", softcap=None, return_z_loss: bool = False, ):

pbontrager · 2025-05-22T15:00:42Z

torchtune/modules/loss/cross_entropy_loss.py

+        if isinstance(w, DTensor):
+            mesh, placements = w.device_mesh, w.placements
+            w = w.full_tensor()
+            if not hasattr(self, "_w_hook_registered"):


I think "full_tensor" handles gradient placement and we don't need to do this ourselves link. I can test removing this though.

pbontrager · 2025-05-22T15:04:17Z

torchtune/modules/loss/cross_entropy_loss.py

+        try:
+            import liger_kernel.ops.fused_linear_cross_entropy
+
+            self.fused_linear_ce = liger_kernel.ops.fused_linear_cross_entropy


Since the class version was just a thin wrapper around LigerFusedLinearCrossEntropyFunction, and our loss class is also a thin wrapper around the same functionality, it feels more right to me that we just directly call LigerFusedLinearCrossEntropyFunction and operate at the same abstraction level as the nn.Module you linked to.

pbontrager · 2025-05-22T15:07:28Z

tests/torchtune/modules/loss/test_liger_ce_loss.py

+from torchtune.training.seed import set_seed
+
+
+@gpu_test(gpu_count=1)


Let me edit and push some changes to these tests. I can run them myself then for you.

mananchawla2005 · 2025-05-23T18:31:21Z

@pbontrager I have simplifed distributed handling as well done the refactoring as per suggested changes. Would be greatful if you can help me add tests for distributed.

mananchawla2005 · 2025-05-29T16:10:26Z

Hey any updates?

pbontrager · 2025-05-29T18:57:53Z

@mananchawla2005 I've gotten the unit tests working to test FSDP + TP for a single training step with the Liger loss. I haven't pushed the changes to your PR yet because the DTensor backward hook doesn't seem to work correctly so I'm trying to fix that and get the test to pass. I should be able to get back to this and get something to you by the end of this week.

pbontrager · 2025-05-30T20:57:18Z

I pushed the tests with some changes here but the tests aren't passing in the distributed case. I've tried playing around with getting the full tensor in a full TP setting and full FSDP setting (the default test is a mix of both) but I'm still getting numerical differences. @ebsmothers do you have any ideas here?

ebsmothers

Thanks @mananchawla2005 for getting this most of the way! I have a few more small comments based on @pbontrager's changes, then this should be good to go. @pbontrager please make sure to test on post-6/4 nightlies given the changes in pytorch/pytorch#154704. Stamping to unblock

ebsmothers · 2025-06-06T19:40:59Z

tests/torchtune/modules/loss/test_liger_ce_loss.py

+
+        dist.destroy_process_group()
+
+    @gpu_test(gpu_count=WORLD_SIZE)


nit: just say 4 explicitly here. I like to be able to look directly at the test and see how many GPUs it needs

ebsmothers · 2025-06-06T19:43:42Z

tests/torchtune/modules/loss/test_liger_ce_loss.py

+
+        # Verify:
+        # 1. Validate the results are close enough
+        assert_expected(fused_loss, standard_loss, rtol=1e-2, atol=1e-2)


Is this as close as we can get?

ebsmothers · 2025-06-06T19:50:47Z

torchtune/modules/loss/cross_entropy_loss.py

+    """Memory efficient Cross-entropy loss that uses fused CUDA kernels to compute the loss.
+    Combines the linear projection with the cross-entropy calculation for better performance
+    and memory efficiency. This is an approximation of CrossEntropyLoss and may have small
+    numerical differences compared to the standard implementation. This is a wrapper around
+    `LigerFusedLinearCrossEntropyFunction` from the `liger_kernel` package.


Can we add a reference to the Liger paper or repo here?

ebsmothers · 2025-06-06T19:58:43Z

tests/torchtune/modules/loss/test_liger_ce_loss.py

+        plan = {
+            "output": ColwiseParallel(
+                input_layouts=Replicate(), output_layouts=Replicate()
+            )
+        }


Is this actually a nontrivial application of TP? (Like should we at least have some shard for layer or something?)

nathan-az · 2025-06-11T00:46:44Z

torchtune/modules/loss/cross_entropy_loss.py

+            raise RuntimeError("Must call set_model_output() before forward()")
+
+        if isinstance(hidden_states, DTensor):
+            hidden_states = hidden_states.full_tensor()


So the compute isn't replicated across TP workers unnecessarily, could the hidden states and targets both be sharded on the sequence dimension, losses calculated per token, then correctly reduced after? (Basically to make better uses of the devices in a TP group).

Could be a follow-up PR, not MVP for liger. Just raising while I work on adding loss parallel for the standard LinearCrossEntropyLoss https://github.com/pytorch/torchtune/pull/2782/files

(EDIT: I can probably look to add this in future)

vadimkantorov · 2025-06-16T12:54:10Z

Any plans for contributing this great fused linear cross entropy loss impl directly into PyTorch core?

Fused Linear and Cross-Entropy Loss torch.nn.functional.linear_cross_entropy pytorch/pytorch#124480
Inductor vs. Liger Performance Track pytorch/pytorch#139908 (so far it shows that Inductor under-performs compared to Liger, so having the impl in torchtune run against this benchmark would be good)

As probably the same question is relevant for torchtitan, so would be best to have it upstreamed? (although, maybe a copy of Liger Triton (or CUDA?) kernel would probably need to be upstreamed, this PR is a wrapper around Liger package) And this way, it would also provide a good baseline for the Inductor codegen

I think it's been now quite some time that fused linear cross entropy has proved it's actually useful for warranting its inclusion in core (especially in the meanwhile Inductor underperforms codegen for this op)

ebsmothers · 2025-06-20T18:15:11Z

@vadimkantorov actually this is something that @ngimel has brought up as well, she may have a better idea on the latest status of things here.

Any plans for contributing this great fused linear cross entropy loss impl directly into PyTorch core?

Fused Linear and Cross-Entropy Loss torch.nn.functional.linear_cross_entropy pytorch#124480

Inductor vs. Liger Performance Track pytorch#139908 (so far it shows that Inductor under-performs compared to Liger, so having the impl in torchtune run against this benchmark would be good)

As probably the same question is relevant for torchtitan, so would be best to have it upstreamed? (although, maybe a copy of Liger Triton (or CUDA?) kernel would probably need to be upstreamed, this PR is a wrapper around Liger package) And this way, it would also provide a good baseline for the Inductor codegen

I think it's been now quite some time that fused linear cross entropy has proved it's actually useful for warranting its inclusion in core (especially in the meanwhile Inductor underperforms codegen for this op)

vadimkantorov · 2025-06-20T18:33:54Z

I wonder if both triton-code and triton-produced cubin / ptx can be included in core distribution for populating local artifact cache . Like so we can both preserve triton's hackability and zero wait time for eager mode...

mikaylagawarecki · 2025-06-20T19:41:54Z

Hey @vadimkantorov,

Any plans for contributing this great fused linear cross entropy loss impl directly into PyTorch core?

Fused Linear and Cross-Entropy Loss torch.nn.functional.linear_cross_entropy pytorch#124480

Inductor vs. Liger Performance Track pytorch#139908 (so far it shows that Inductor under-performs compared to Liger, so having the impl in torchtune run against this benchmark would be good)

As probably the same question is relevant for torchtitan, so would be best to have it upstreamed? (although, maybe a copy of Liger Triton (or CUDA?) kernel would probably need to be upstreamed, this PR is a wrapper around Liger package) And this way, it would also provide a good baseline for the Inductor codegen

I think it's been now quite some time that fused linear cross entropy has proved it's actually useful for warranting its inclusion in core (especially in the meanwhile Inductor underperforms codegen for this op)

Wanted to let you know that a fused linear cross entropy in core is on our roadmap, we plan to work on it in the next month or so

vadimkantorov · 2025-06-20T19:57:41Z

Any plans for a road forward for inclusion in core directly triton codes and torch.compile-d codes?

mikaylagawarecki · 2025-06-20T20:14:12Z

Any plans for a road forward for inclusion in core directly triton codes and torch.compile-d codes?

wonder if both triton-code and triton-produced cubin / ptx can be included in core distribution for populating local artifact cache . Like so we can both preserve triton's hackability and zero wait time for eager mode...

Not sure if this is what you're referring to, but I believe there is already infra to include triton code in core https://github.com/pytorch/pytorch/blob/a69e27ca5ad4287add73972ef1b34b469e3c7d23/torch/cuda/__init__.py#L1669-L1702 that is used for a very small number of ops

As to the triton-produced cubin / ptx I'm not sure how that would look like but I'll investigate when we look at upstreaming the fused linear cross entropy

vadimkantorov · 2025-06-20T20:42:09Z

Exactly, I wonder what is needed for Triton code to be used for more ops in core. If it's cold start time for eager, then shipping pre-cached / pre-generated / pre-compiled ptx could be the solution (and still allow hackability if one wants to copy-paste and modify the triton code)

ngimel · 2025-06-20T21:02:21Z

Precompiled triton is not really an option, triton is doing too much specialization, so startup time/unexpected reompiles will always be a problem for using triton in eager.

vadimkantorov · 2025-06-20T21:14:00Z

Is it possible to somehow precompile / pregenerate from triton some version of ptx which at least would run on all relevant hardwares?

Or are some new features in triton needed to control how it specializes / force it to specialize less? (I guess ideally one would like to have some sort of okay AOT ptx generation from triton which would at least run on many hardwares - even if not reaching the best possible perf. The PyTorch could ship ptx for a few popular important hardwares, and for others force the user to run triton compilation in their own machine if they want extra perf. This would allow users to write and maintain triton core where now they have to rewrite in cuda )

Or is it possible to lower triton into the cuda/c++ code?

nathan-az · 2025-07-07T02:39:15Z

@mikaylagawarecki If there are any public issues/PRs that we could follow to track progress for a fused linear loss in Core, that would be great. I don't see it currently in the 2.9.0 milestones.

vadimkantorov · 2025-07-08T17:17:45Z

Wanted to let you know that a fused linear cross entropy in core is on our roadmap, we plan to work on it in the next month or so

@mikaylagawarecki If maybe vllm's impls could be upstreamed, vllm could transition to use more of core modules. Hopefully, vllm/sglang can transition to be using these new pytorch core modules (I wonder if this should inform the design/numerics in any ways) and thus reducing fragmentation...

Yet another recent impl of fused linear cross entropy (indicating fragmentation): https://github.com/volcengine/verl/blob/main/verl/utils/kernel/kernels.py

Also, a tutorial is needed for implementing forward+backward losses fusing Linear + some cross-entropy-like loss (e.g. important case is GRPO)

vadimkantorov · 2025-07-08T23:21:47Z

Also curious if the pattern of fusing chunked Linear and some loss computation can be also implemented as a more generic/compilable higher-order op in PyTorch core. Seems the same pattern arises / is needed for GRPO loss and reducing peak VRAM for super-long reasoning traces (and maybe can be done with other loss variants as well)...

mananchawla2005 added 2 commits May 16, 2025 17:49

feat(loss): Add LigerFusedCrossEntropyLoss module

02b12d3

Merge branch 'main' of https://github.com/mananchawla2005/torchtune

b4bf352

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 16, 2025

pbontrager reviewed May 16, 2025

View reviewed changes

torchtune/modules/loss/cross_entropy_loss.py Show resolved Hide resolved

mananchawla2005 added 4 commits May 18, 2025 23:37

Merge branch 'pytorch:main' into add-feature-ligerceloss

283fef4

fixed liger naming dropped cloning and fixed tests

5273751

Merge branch 'add-feature-ligerceloss' of https://github.com/manancha…

16446ca

…wla2005/torchtune into add-feature-ligerceloss

linting

f340ea6

Andrei-Aksionov reviewed May 19, 2025

View reviewed changes

Andrei-Aksionov reviewed May 20, 2025

View reviewed changes

joecummings mentioned this pull request Mar 30, 2025

v0.7.0 tracker #2538

Open

4 tasks

fixed dimensions and removed compile

6f0ff3c

Andrei-Aksionov reviewed May 22, 2025

View reviewed changes

pbontrager reviewed May 22, 2025

View reviewed changes

vadimkantorov mentioned this pull request May 22, 2025

Fused Linear and Cross-Entropy Loss torch.nn.functional.linear_cross_entropy pytorch/pytorch#124480

Open

simplied distributed handling and remove total elements

af388bb

Added distributed unit tests and update loss

5373808

pbontrager added 3 commits June 5, 2025 12:57

fixed unit tests

8c54ef1

Merge branch 'main' into add-feature-ligerceloss

489bc4a

fix after merge

a7a5fcb

ebsmothers approved these changes Jun 6, 2025

View reviewed changes

pbontrager mentioned this pull request Jun 10, 2025

Enable loss parallel, Ungate FP8 #2782

Merged

13 tasks

nathan-az reviewed Jun 11, 2025

View reviewed changes

felipemello1 mentioned this pull request Jun 16, 2025

OOM handling and recovery #2830

Open

vadimkantorov mentioned this pull request Jul 17, 2025

Model implmenetation using Liger Kernel layers huggingface/transformers#38416

Open



		class TestLigerFusedCrossEntropyLoss:
		def test_liger_fused_cross_entropy_loss(self):

		from torchtune.training.seed import set_seed


		@gpu_test(gpu_count=1)

Add feature ligerceloss #2741

Are you sure you want to change the base?

Add feature ligerceloss #2741

Conversation

mananchawla2005 commented May 16, 2025

PR: Add LigerFusedCrossEntropyLoss

Context

Changelog

Test plan

UX

Uh oh!

pytorch-bot bot commented May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2741

❌ 1 New Failure, 2 Cancelled Jobs

Uh oh!

facebook-github-bot commented May 16, 2025

Action Required

Process

Uh oh!

mananchawla2005 commented May 16, 2025

Uh oh!

facebook-github-bot commented May 16, 2025

Uh oh!

pbontrager left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov-commenter commented May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mananchawla2005 commented May 18, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

intervitens commented May 21, 2025

Uh oh!

mananchawla2005 commented May 22, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pbontrager left a comment

Choose a reason for hiding this comment

Uh oh!

pytorch-bot bot commented May 16, 2025 •

edited

Loading

pbontrager left a comment •

edited

Loading

codecov-commenter commented May 16, 2025 •

edited

Loading

nathan-az Jun 11, 2025 •

edited

Loading

vadimkantorov commented Jun 16, 2025 •

edited

Loading