Fix: Refactor BpeTrainer to use chunked pair counting #1805

MeryylleA · 2025-06-19T12:47:59Z

This commit refactors the `BpeTrainer` to address an Out-of-Memory (OOM) bug that occurred when training on large-alphabet corpora.

The `count_pairs` function previously loaded all unique token pairs and their counts into a single `AHashMap`, leading to excessive memory usage with large alphabets.

This fix introduces a new `count_pairs_chunked` function that processes words in chunks, significantly reducing memory consumption. The `do_train` function has been updated to use this new chunked approach.

The library compiles successfully with these changes, and this new strategy should prevent OOM errors during training on large datasets.
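For illustration only, here is a minimal sketch of what counting pairs chunk by chunk could look like. It is not the code from this PR: the signature, the `Pair` alias, and the `chunk_size` parameter are assumptions, and it uses the standard library `HashMap` in place of `AHashMap` so that it stands alone.

```rust
use std::collections::HashMap;

/// A pair of adjacent token ids (hypothetical alias, for illustration).
type Pair = (u32, u32);

/// Count adjacent token pairs, visiting `words` in slices of `chunk_size`.
/// `words[i]` is the token-id sequence of one unique word and `counts[i]`
/// is how many times that word occurs in the corpus.
/// Hypothetical signature; the function in the PR may differ.
fn count_pairs_chunked(
    words: &[Vec<u32>],
    counts: &[u64],
    chunk_size: usize,
) -> HashMap<Pair, u64> {
    assert_eq!(words.len(), counts.len());
    let chunk_size = chunk_size.max(1); // `chunks` panics on a size of 0
    let mut totals: HashMap<Pair, u64> = HashMap::new();

    for (word_chunk, count_chunk) in words.chunks(chunk_size).zip(counts.chunks(chunk_size)) {
        // Scratch map for this chunk only; dropped at the end of the iteration.
        let mut local: HashMap<Pair, u64> = HashMap::new();
        for (word, &count) in word_chunk.iter().zip(count_chunk) {
            // Every adjacent pair in the word contributes the word's frequency.
            for window in word.windows(2) {
                *local.entry((window[0], window[1])).or_insert(0) += count;
            }
        }
        // Fold the chunk's counts into the running totals.
        for (pair, n) in local {
            *totals.entry(pair).or_insert(0) += n;
        }
    }
    totals
}
```

In this sketch the result is the same for any `chunk_size`; only the size of the per-chunk scratch map changes. The merged `totals` map still holds every unique pair, so chunking on its own bounds the intermediate maps rather than the final one.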

ArthurZucker · 2025-06-24T12:07:55Z

Sorry, I did not have time to go through it. Did you end up finding a fix?

MeryylleA · 2025-06-25T20:07:50Z

@ArthurZucker Hi Arthur, this commit was actually a test involving Google Jules in a new version. I'm very sorry for having opened this PR without definitively correcting the problem. This PR is not complete, so I closed it. Sorry for the delay in responding and for any inconvenience.

ArthurZucker · 2025-06-26T14:08:19Z

No worries 🤗

MeryylleA closed this Jun 19, 2025

MeryylleA deleted the feat/chunked-pair-counting branch June 19, 2025 12:48
