@oleksandrlukashov
Contributor

No description provided.

@urchade requested a review from @Ingvarstep on June 11, 2025, 20:44
@urchade
Owner

urchade commented Jun 11, 2025

LGTM

@Ingvarstep
Collaborator

Good job overall. I have a few suggestions to make it more flexible and efficient:

  • I think it would be better to create a dictionary of splitters, so that when a new language appears we can just add a new splitter.
  • By default, the splitter for other languages is WhitespaceTokenSplitter. I propose making this configurable; for example, it could be spaCy with the multi-language key xx.
  • Ideally, all tokenizer dependencies should be optional; please see how it was done for onnx-gpu.
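
To make the three suggestions concrete, here is a minimal sketch of what a splitter registry with lazily imported dependencies could look like. It is not the code from this PR: the names SimpleWhitespaceSplitter, SpacySplitter, SPLITTERS, and get_splitter are hypothetical, and the only third-party call assumed is spaCy's spacy.blank("xx"), which builds a tokenizer-only multi-language pipeline.

```python
import re
from typing import Callable, Dict, Iterator, Tuple

# (token text, start offset, end offset)
Token = Tuple[str, int, int]
Splitter = Callable[[str], Iterator[Token]]


class SimpleWhitespaceSplitter:
    """Hypothetical dependency-free fallback: one token per run of non-whitespace."""

    _pattern = re.compile(r"\S+")

    def __call__(self, text: str) -> Iterator[Token]:
        for match in self._pattern.finditer(text):
            yield match.group(), match.start(), match.end()


class SpacySplitter:
    """Hypothetical spaCy-backed splitter; spaCy is imported only when instantiated."""

    def __init__(self, lang: str = "xx") -> None:
        try:
            import spacy  # optional dependency, deliberately not imported at module level
        except ImportError as exc:
            raise ImportError(
                "This splitter needs spaCy; install it to use language "
                f"{lang!r}, or fall back to the whitespace splitter."
            ) from exc
        # spacy.blank creates a tokenizer-only pipeline, so no model download is needed.
        self._nlp = spacy.blank(lang)

    def __call__(self, text: str) -> Iterator[Token]:
        for token in self._nlp(text):
            yield token.text, token.idx, token.idx + len(token)


# Registry of factories: supporting a new language means adding one entry here.
SPLITTERS: Dict[str, Callable[[], Splitter]] = {
    "whitespace": SimpleWhitespaceSplitter,
    "xx": lambda: SpacySplitter("xx"),
}


def get_splitter(lang: str, default: str = "whitespace") -> Splitter:
    """Return the splitter registered for `lang`, else the configurable default."""
    factory = SPLITTERS.get(lang, SPLITTERS[default])
    return factory()
```

Because the registry stores factories instead of instances, nothing optional is imported until a caller actually asks for that language, and passing default="xx" would switch the fallback to the spaCy multi-language tokenizer without touching the lookup logic.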

@Ingvarstep merged commit 638310c into urchade:main on Jun 12, 2025