Contributor

@simllll simllll commented Nov 11, 2025

Description

This implements the blingfire tokenizer via a WASM wrapper.

Fixes #822

However, this currently works only for ESM builds, because the WASM module needs to be loaded asynchronously. There are probably ways to solve this, but I haven't looked into them yet. Do not merge: I have disabled the CJS build for the agent.
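For reviewers thinking about the CJS problem, here is a minimal sketch of one possible direction: defer the asynchronous WASM instantiation behind a memoized promise instead of loading it at import time, so no top-level `await` is required. This is not the code in this PR. The names `WASM_BYTES`, `loadAdd`, and `addViaWasm` are hypothetical, and the tiny hand-written `add` module only stands in for the real blingfire binary.

```typescript
// Hypothetical stand-in for the blingfire WASM binary: a minimal module
// that exports add(a: i32, b: i32): i32.
const WASM_BYTES = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00, // function section: one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body: local.get 0, local.get 1, i32.add
]);

type AddFn = (a: number, b: number) => number;

// Cached so the module is instantiated at most once per process.
let addPromise: Promise<AddFn> | undefined;

// Starts instantiation on first call and returns the same promise afterwards.
function loadAdd(): Promise<AddFn> {
  if (addPromise === undefined) {
    addPromise = WebAssembly.instantiate(WASM_BYTES).then(
      ({ instance }) => instance.exports.add as unknown as AddFn,
    );
  }
  return addPromise;
}

// Public entry point: async, so callers never need the module at import time.
async function addViaWasm(a: number, b: number): Promise<number> {
  const add = await loadAdd();
  return add(a, b);
}
```

With this shape, `await addViaWasm(2, 3)` resolves to `5`. The trade-off is that every caller-facing function becomes async, which may or may not fit the tokenizer interface the agent expects.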

Changes Made

  • added implementation of the blingfire tokenizer
  • added a test suite
  • added description of how we can build the wasm file

Pre-Review Checklist

  • Build passes: All builds (lint, typecheck, tests) pass locally
  • AI-generated code reviewed: Removed unnecessary comments and ensured code quality
  • Changes explained: All changes are properly documented and justified above
  • Scope appropriate: All changes relate to the PR title, or explanations provided for why they're included

Testing

  • Automated tests added/updated (if applicable)
  • All tests pass

Additional Notes


Note to reviewers: Please ensure the pre-review checklist is completed before starting your review.

changeset-bot bot commented Nov 11, 2025

⚠️ No Changeset found

Latest commit: cf6a166

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


Development

Successfully merging this pull request may close these issues.

blingfire sentence tokenizer missing
