Uncontaminated Sample Packing #3525

djsaunde · 2025-10-29T17:01:05Z

This PR adds sample packing support. It uses TRL's SFTConfig packing=True and padding_free=True args to pack the sequences; we then compute packed_seq_lengths metadata and thread it through the model forward pass. We need to do this ourselves because of our custom forward pass, whereas TRL handles the sequence length metadata internally in its trainer. The metadata is used to build block causal masks for SDPA and xformers attention, and is passed to the flash attention varlen API, which handles the block causal masking itself under the hood.
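
To make the masking concrete, here is a minimal pure-Python sketch of what the packed sequence lengths are turned into. It is not the code in this PR: the function names are hypothetical and it uses plain lists rather than tensors. It assumes a block causal mask means "attend only to earlier tokens of the same sample", and that varlen-style attention APIs take cumulative sample boundaries instead of an explicit mask.

```python
from itertools import accumulate


def cu_seqlens(seq_lengths):
    """Cumulative sample boundaries, e.g. [3, 2] -> [0, 3, 5]."""
    return [0] + list(accumulate(seq_lengths))


def block_causal_mask(seq_lengths):
    """mask[q][k] is True when query token q may attend to key token k."""
    # Label every packed token with the index of the sample it came from.
    sample_id = [i for i, n in enumerate(seq_lengths) for _ in range(n)]
    total = len(sample_id)
    return [
        [sample_id[q] == sample_id[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


lengths = [3, 2]  # two samples packed into one row of 5 tokens
mask = block_causal_mask(lengths)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the first token of the second sample (row 3) can attend only to itself, which is what keeps packed samples from contaminating each other.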

I added a few unit tests. I also wrote a quick bash script for smoke testing some common model architectures: gist, which runs.

Below is a comparison of short unsloth/qwen2.5-0.5b training runs. The losses don't match because we're seeing more / different samples on each step. But the scale and trend match, which is the important bit.

Commands:

No sample packing:

python unsloth-cli.py --model_name unsloth/qwen2.5-0.5b --dataset yahma/alpaca-cleaned --per_device_train_batch_size 8 --max_steps 50 --max_seq_length 2048

Sample packing:

python unsloth-cli.py --model_name unsloth/qwen2.5-0.5b --dataset yahma/alpaca-cleaned --per_device_train_batch_size 1 --max_steps 50 --max_seq_length 2048 --sample_packing

Note that we use --per_device_train_batch_size 1 in the latter case since we are packing multiple examples into a single [1, max_seq_length] tensor.

The benefit of this approach is that we discard a lot of zero padding and therefore get higher tokens/s training throughput. The plot below shows that we get through our dataset ~20% faster. These gains depend on the dataset and the configured --max_seq_length; increasing it generally gives better packing efficiency and therefore higher throughput.
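
As a rough illustration of where the savings come from, the sketch below counts token slots for padded batches versus packed rows. The sample lengths are made up, and the first-fit-decreasing packer is a stand-in chosen for brevity, not necessarily the strategy TRL uses. It assumes every sample fits within max_seq_length.

```python
def padded_tokens(lengths, batch_size):
    """Token slots processed when each batch is padded to its longest sample."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


def pack_first_fit_decreasing(lengths, max_seq_length):
    """Greedy packing: place each sample in the first row with enough room."""
    rows = []
    for n in sorted(lengths, reverse=True):
        for row in rows:
            if sum(row) + n <= max_seq_length:
                row.append(n)
                break
        else:
            rows.append([n])  # no existing row had room, so open a new one
    return rows


lengths = [100, 900, 300, 700, 500, 500, 200, 800]  # made-up sample lengths
real = sum(lengths)
padded = padded_tokens(lengths, batch_size=4)
rows = pack_first_fit_decreasing(lengths, max_seq_length=2048)
print(f"padded batches: {padded} token slots for {real} real tokens")
print(f"packed: {len(rows)} rows, {sum(map(sum, rows))} tokens, no padding")
```

A larger max_seq_length gives the packer more room per row, which is one way to see why packing efficiency tends to improve as it grows.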

I manually tested on SDPA and flash attention, but I still need to test xformers attention since I couldn't get it to build for blackwell.

TODO

test xformers attention

djsaunde · 2025-10-30T16:42:20Z

Follow up: DRY up attention code. We re-implement a big if / else block for selecting / running the attention per modeling file. We can factor this out into a separate module and call a helper function. CC @Datta0

djsaunde · 2025-10-30T18:22:15Z

I added support for passing position IDs to RoPE (needed for correctness, just like attention), and a (fused QK) triton kernel for the RoPE embedding (similar to what exists currently for the non-packing case).
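
For context on why RoPE needs position IDs under packing, here is a small sketch. The helper names are hypothetical and it assumes the standard RoPE formulation, where the rotation angle for feature pair i at position m is m * base^(-2i/d). Without per-sample position IDs, the second sample in a packed row would be rotated as if it started mid-sequence.

```python
def packed_position_ids(seq_lengths):
    """Positions restart at 0 for every sample in the packed row."""
    return [pos for n in seq_lengths for pos in range(n)]


def rope_angle(position, pair_index, head_dim, base=10000.0):
    """Rotation angle for one feature pair under standard RoPE."""
    return position * base ** (-2.0 * pair_index / head_dim)


lengths = [3, 2]
position_ids = packed_position_ids(lengths)  # per-sample positions
naive_ids = list(range(sum(lengths)))        # positions if packing is ignored

# The first token of the second sample sits at packed index 3.
print("with position IDs:", rope_angle(position_ids[3], 0, head_dim=64))
print("without:          ", rope_angle(naive_ids[3], 0, head_dim=64))
```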

Benchmarks show the new kernel is competitive with the triton kernel for the non-packing case, and that it numerically ~matches and significantly beats the torch slow path:

RoPE kernel benchmark sweep (microseconds per call)

seqlen	varlen	dense	old	new	speedup	max abs Δ	mean abs Δ
256	False	198.501	–	–	–	–	–
256	True	–	429.066	223.670	1.918	4.768e-07	1.136e-08
512	False	413.377	–	–	–	–	–
512	True	–	1149.956	566.851	2.029	4.768e-07	1.170e-08
1024	False	1113.990	–	–	–	–	–
1024	True	–	2784.808	1140.053	2.443	4.768e-07	1.187e-08
2048	False	2341.204	–	–	–	–	–
2048	True	–	5525.063	2372.505	2.329	4.768e-07	1.214e-08
4096	False	4675.885	–	–	–	–	–
4096	True	–	11354.554	4681.061	2.426	4.768e-07	1.239e-08
8192	False	9285.158	–	–	–	–	–
8192	True	–	21901.080	9323.563	2.349	4.768e-07	1.256e-08

djsaunde · 2025-10-31T21:20:11Z

I added helpers for attention backend selection / running that each of the fast_forward methods calls (+ unit tests), as requested. This replaces a lot of duplicated if / elif / else code blocks with a single attention_dispatch.py module.
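
The general shape of such a dispatcher can be sketched as follows. This is not the API of attention_dispatch.py: the backend names, the preference order, and both function names are hypothetical, and the backends here are stubs that only report which path ran.

```python
from typing import Callable, Dict, Sequence

# Hypothetical backend names and preference order, for illustration only.
PREFERENCE = ("flash_varlen", "xformers", "sdpa")


def select_backend(available: Sequence[str]) -> str:
    """Pick the most preferred backend that is actually available."""
    for name in PREFERENCE:
        if name in available:
            return name
    raise RuntimeError("no attention backend available")


def run_attention(backends: Dict[str, Callable], available, q, k, v):
    """Single entry point that every modeling file could call."""
    return backends[select_backend(available)](q, k, v)


def make_stub(name):
    # Stand-in for a real attention call.
    return lambda q, k, v: f"{name} attention"


backends = {name: make_stub(name) for name in PREFERENCE}
print(run_attention(backends, ["sdpa", "xformers"], None, None, None))
```

Keeping selection and execution in one place means a new backend or a packing-specific argument only has to be handled once, rather than in every modeling file.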

djsaunde · 2025-11-02T01:42:02Z

Tested and pushed a fix for xformers attention; this PR should be good to go now.

One open question: should we make sample packing the default for pretrain / SFT workloads? It should always work and provides better throughput than without. It's a bit of a shift though; it reshapes samples to [1, max_seq_length] so we discard per_device_train_batch_size in favor of just changing max_seq_length.

One option is to reshape so that samples have shape [1, max_seq_length * per_device_train_batch_size]. This lets us keep per_device_train_batch_size > 1, but it's probably a little confusing for the user.

Another option is to strongly recommend using sample packing in a logged message on the command line (if not already enabled).

We can also explore this in a follow up PR if we don't want to make a decision now.

djsaunde · 2025-11-04T00:18:04Z

I added some utils and updated the CLI to work OOTB with DDP. Just use torchrun --nproc_per_node=N or accelerate launch on a multi-GPU machine and it should just work.

These utils should be reusable in our notebooks / scripts too!
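
For readers unfamiliar with how a script detects that it was launched this way: torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to each worker process. The sketch below reads them; the function name is hypothetical and this is not the utility code added in this PR.

```python
import os


def distributed_info(environ=None):
    """Read the launcher-provided environment; defaults mean single process."""
    env = os.environ if environ is None else environ
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "world_size": world_size,
        "is_distributed": world_size > 1,
    }


# Simulate the environment of the second worker in a 2-GPU launch.
info = distributed_info({"WORLD_SIZE": "2", "RANK": "1", "LOCAL_RANK": "1"})
print(info)
```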

PS: DDP currently relies on removing the @torch.compile decorator from unsloth_zoo/patch_torch_functions.py::cross_entropy, as it somehow results in a double compile. I think @danielhanchen is fixing this.

danielhanchen · 2025-11-04T08:41:37Z

> Tested and pushed a fix for xformers attention, this PR should be good to go now.
>
> One open question: should we make sample packing the default for pretrain / SFT workloads? It should always work and provides better throughput than without. It's a bit of a shift though; it reshapes samples to [1, max_seq_length] so we discard per_device_train_batch_size in favor of just changing max_seq_length.
>
> One option is just to reshape so samples have shape [1, max_seq_length * per_device_batch_size]. This allows us to keep per_device_batch_size > 1, but it's probably a little confusing for the user.
>
> Another option is to strongly recommend using sample packing in a logged message on the command line (if not already enabled).
>
> We can also explore this in a follow up PR if we don't want to make a decision now.

Yes, the goal is to enable the padding-free collator so that it automatically gets a perf boost :) We can do this in the next PR if that helps.

I also fixed the torch.compile issue for CE (verifying now)

djsaunde · 2025-11-05T19:26:52Z

Disabled the batch_size == 1 check. Now, if the user passes batch_size > 1, we use TRL's logic to flatten from (batch_size, max_seq_length) to (1, total_tokens), where total_tokens <= batch_size * max_seq_length. This makes it easier for folks to use without changing their batch_size / max_seq_length config.
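
The flattening step can be pictured with the sketch below. It is not TRL's implementation: the function name is hypothetical, and a batch is modeled as a list of packed rows, each holding its samples as lists of token ids with padding already dropped.

```python
def flatten_rows(rows):
    """Concatenate packed rows into one row and record per-sample lengths."""
    tokens, seq_lengths = [], []
    for row in rows:
        for sample in row:
            tokens.extend(sample)
            seq_lengths.append(len(sample))
    return [tokens], seq_lengths  # shape (1, total_tokens) plus metadata


# batch_size = 2, max_seq_length = 5; the second row is not full.
rows = [
    [[1, 2, 3], [4, 5]],
    [[6, 7, 8, 9]],
]
flat, seq_lengths = flatten_rows(rows)
print(flat, seq_lengths)
```

The per-sample lengths are what keep the flattened row usable: they carry the sample boundaries that the attention masks and position IDs are built from.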

djsaunde requested review from danielhanchen and mmathew23 October 29, 2025 17:01

djsaunde self-assigned this Oct 29, 2025

djsaunde force-pushed the packing branch from 6e45dad to fdebcef Compare October 29, 2025 17:04

djsaunde changed the title ~~Packing~~ sample packing Oct 29, 2025

djsaunde force-pushed the packing branch 2 times, most recently from c07d6bd to c23f676 Compare October 30, 2025 18:22

shimmyshimmer changed the title ~~sample packing~~ Uncontaminated packing Oct 30, 2025

shimmyshimmer changed the title ~~Uncontaminated packing~~ Uncontaminated Sample Packing Oct 30, 2025

djsaunde changed the title ~~Uncontaminated Sample Packing~~ sample packing Oct 31, 2025

djsaunde changed the title ~~sample packing~~ Uncontaminated Sample Packing Oct 31, 2025

Datta0 reviewed Nov 1, 2025

unsloth/kernels/rope_embedding.py (outdated, resolved)

unsloth/models/granite.py (outdated, resolved)

djsaunde force-pushed the packing branch from 43683c8 to 0cb1eb7 Compare November 2, 2025 01:35

djsaunde added 3 commits November 5, 2025 17:22

implement (sdpa, xformers, fa2) sample packing

7e84956

attention dispatching

3181c45

ddp working OOTB with CLI

f52456e

djsaunde force-pushed the packing branch from 64cb991 to c417db4 Compare November 5, 2025 17:22

packed SWA and softcap support

88d45aa

djsaunde force-pushed the packing branch from c417db4 to 88d45aa Compare November 5, 2025 17:45

enable batch flattening

c01eae8

djsaunde force-pushed the packing branch from 4e5f1ae to c01eae8 Compare November 5, 2025 19:35
