potential bfd improvement #4295
sachmatkris
started this conversation in
Ideas
Replies: 1 comment
anyone?
-
Hello,
I am working on a case where the goal is to fine-tune an LLM on QnA pairs. I decided to use BFD packing so that packed sequences do not get cut and pressed into buckets of max_seq_length (apart from the last one), since naive packing always produces such shifts, and with a small max_seq_length a big portion of the samples in my dataset get cut. BFD sounds like a perfect solution to this. However, BFD does truncate sequences that exceed max_seq_length. While this is reasonable, on resource-constrained instances it is not optimal, as data is lost.
I propose another approach: instead of truncating, the algorithm processes tokenized instances from longest to shortest (the default behaviour) and tries to fit them into appropriately sized bins, filling bins completely where possible and starting a new, unfilled bin when needed.
For example, a sequence of length 500 with max_seq_length 150 would be split into 150-150-150-50 bins. When a sequence of length 400 comes next, it is split into 150-150-100, BUT the 100-token chunk does NOT get allocated to the unfilled 50-token bin; i.e. when an over-length sequence is cut, its chunks are ALWAYS put into NEW bins. This is done to prevent the potential issues with BOS/EOS tokens that are observed when doing manual chunking. Only whole sequences that fit into the remaining space (case A: a bin with 100 tokens left; case B: a 100-token sequence) get packed into the unfinished bins.
I have prototype code for this, but it definitely needs to be reviewed before being considered as a potential feature.
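To make the proposal concrete, here is a minimal standalone sketch of the described behaviour (this is a hypothetical illustration, not the author's prototype, and `pack_bfd_split` is an invented name; sequences are plain token-ID lists):

```python
def pack_bfd_split(sequences, max_seq_length):
    """Variant of BFD packing where over-length sequences are split into
    max_seq_length chunks instead of truncated. Every chunk of a split
    sequence opens a NEW bin (so BOS/EOS handling stays predictable);
    only whole sequences that fit are best-fit packed into open bins."""
    bins = []       # each bin is a list of sequences/chunks
    remaining = []  # remaining token capacity per bin

    # process longest first, matching the default BFD order
    for seq in sorted(sequences, key=len, reverse=True):
        if len(seq) > max_seq_length:
            # split into full chunks plus a final partial chunk;
            # each one always starts a fresh bin, never fills an old one
            for start in range(0, len(seq), max_seq_length):
                chunk = seq[start:start + max_seq_length]
                bins.append([chunk])
                remaining.append(max_seq_length - len(chunk))
        else:
            # best fit: the open bin with the smallest remaining
            # space that still accommodates the whole sequence
            best = None
            for i, space in enumerate(remaining):
                if space >= len(seq) and (best is None or space < remaining[best]):
                    best = i
            if best is None:
                bins.append([seq])
                remaining.append(max_seq_length - len(seq))
            else:
                bins[best].append(seq)
                remaining[best] -= len(seq)
    return bins
```

With the example from above (lengths 500, 400, 100 and max_seq_length 150), the 500-token sequence yields four new bins (150-150-150-50), the 400-token sequence yields three more (150-150-100) without touching the half-empty 50-token bin, and only the whole 100-token sequence is best-fit packed into that bin's remaining space.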