potential bfd improvement #4295
sachmatkris
started this conversation in
Ideas
Replies: 1 comment
anyone?
-
Hello,
I am working on a case where the goal is to fine-tune an LLM on QnA pairs. I decided to use BFD packing so that packed sequences do not get cut and pressed into buckets of max_seq_length (apart from the last one), since naive packing always produces such shifts, and with a small max_seq_length a big portion of the samples in my dataset get cut. BFD sounds like a perfect solution to this. However, BFD does truncate sequences that exceed max_seq_length. While this is reasonable, on resource-constrained instances it is not optimal, as data is lost.
I propose another approach: instead of truncating, the algorithm processes tokenized instances from longest to shortest (the default behaviour) and tries to fit them into appropriately sized bins, filling bins completely where possible and starting a new, unfilled bin when needed.
For example, a sequence of length 500 with max_seq_length 150 would be split into 150-150-150-50 bins. When a sequence of length 400 comes next, it is split into 150-150-100, BUT the 100-token chunk does NOT get allocated to the unfilled 50-token bin; i.e. when an over-length sequence is cut, its chunks are ALWAYS put into NEW bins. This is done to prevent the potential issues with BOS/EOS tokens that are observed when doing manual chunking. Only whole sequences that fit into the remaining space (case A: a bin with 100 tokens left; case B: a 100-token sequence) get packed into the unfinished bins.
I have prototype code for this, but it definitely needs to be reviewed before being considered as a potential feature.
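To make the proposal concrete, here is a minimal standalone sketch of the described behaviour (this is a hypothetical illustration, not the author's prototype, and `pack_bfd_split` is an invented name; sequences are plain token-ID lists):

```python
def pack_bfd_split(sequences, max_seq_length):
    """Variant of BFD packing where over-length sequences are split into
    max_seq_length chunks instead of truncated. Every chunk of a split
    sequence opens a NEW bin (so BOS/EOS handling stays predictable);
    only whole sequences that fit are best-fit packed into open bins."""
    bins = []       # each bin is a list of sequences/chunks
    remaining = []  # remaining token capacity per bin

    # process longest first, matching the default BFD order
    for seq in sorted(sequences, key=len, reverse=True):
        if len(seq) > max_seq_length:
            # split into full chunks plus a final partial chunk;
            # each one always starts a fresh bin, never fills an old one
            for start in range(0, len(seq), max_seq_length):
                chunk = seq[start:start + max_seq_length]
                bins.append([chunk])
                remaining.append(max_seq_length - len(chunk))
        else:
            # best fit: the open bin with the smallest remaining
            # space that still accommodates the whole sequence
            best = None
            for i, space in enumerate(remaining):
                if space >= len(seq) and (best is None or space < remaining[best]):
                    best = i
            if best is None:
                bins.append([seq])
                remaining.append(max_seq_length - len(seq))
            else:
                bins[best].append(seq)
                remaining[best] -= len(seq)
    return bins
```

With the example from above (lengths 500, 400, 100 and max_seq_length 150), the 500-token sequence yields four new bins (150-150-150-50), the 400-token sequence yields three more (150-150-100) without touching the half-empty 50-token bin, and only the whole 100-token sequence is best-fit packed into that bin's remaining space.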