Skip to content

Conversation

@MeryylleA
Copy link

This commit refactors the BpeTrainer to address an Out-of-Memory (OOM) bug that occurred when training on large-alphabet corpora.

The count_pairs function previously loaded all unique token pairs and their counts into a single AHashMap, leading to excessive memory usage with large alphabets.

This fix introduces a new count_pairs_chunked function that processes words in chunks, significantly reducing memory consumption. The do_train function has been updated to use this new chunked approach.

The library compiles successfully with these changes, and this new strategy should prevent OOM errors during training on large datasets.

This commit refactors the `BpeTrainer` to address an Out-of-Memory (OOM) bug
that occurred when training on large-alphabet corpora.

The `count_pairs` function previously loaded all unique token pairs and their
counts into a single `AHashMap`, leading to excessive memory usage with
large alphabets.

This fix introduces a new `count_pairs_chunked` function that processes
words in chunks, significantly reducing memory consumption. The `do_train`
function has been updated to use this new chunked approach.

The library compiles successfully with these changes, and this new strategy
should prevent OOM errors during training on large datasets.
@MeryylleA MeryylleA closed this Jun 19, 2025
@MeryylleA MeryylleA deleted the feat/chunked-pair-counting branch June 19, 2025 12:48
@ArthurZucker
Copy link
Collaborator

sorry did not have time to go though it, did you end up finding a fix ?

@MeryylleA
Copy link
Author

@ArthurZucker Hi Arthur, this commit was actually a test involving Google Jules in a new version. I'm very sorry for having opened this PR without definitively correcting the problem. This PR is not complete, so I closed it. Sorry for the delay in responding and for any inconvenience.

@ArthurZucker
Copy link
Collaborator

No worries 🤗

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants