blingfire sentence tokenizer missing #822

@simllll

Describe the bug

The blingfire sentence tokenizer is currently only available in Python. There is a "quite easy" way to bring it to TypeScript via WASM.

To have this written down somewhere, here are the steps I followed to get it running.

  1. clone blingfire repo
  2. follow https://github.com/microsoft/BlingFire/blob/master/wasm/readme.md
  3. change the Makefile to run: em++ ../blingfiretools/blingfiretokdll/blingfiretokdll.cpp ../blingfiretools/blingfiretokdll/*.cxx ../blingfireclient.library/src/*.cpp -s WASM=1 -s EXPORTED_FUNCTIONS="[_GetBlingFireTokVersion, _TextToSentences, _TextToWords, _TextToIds, _SetModel, _FreeModel, _WordHyphenationWithModel, _malloc, _free]" -s "EXPORTED_RUNTIME_METHODS=['lengthBytesUTF8', 'stackAlloc', 'stringToUTF8', 'UTF8ToString', 'cwrap']" -s ALLOW_MEMORY_GROWTH=1 -s DISABLE_EXCEPTION_CATCHING=0 -I ../blingfireclient.library/inc/ -I ../blingfirecompile.library/inc/ -DHAVE_ICONV_LIB -DHAVE_NO_SPECSTRINGS -D_VERBOSE -DBLING_FIRE_NOAP -DBLING_FIRE_NOWINDOWS -DNDEBUG -O3 -s MODULARIZE=1 -s EXPORT_ES6 --std=c++11 -o blingfire.js
    (this adds -s MODULARIZE=1 and -s EXPORT_ES6, and fixes the malloc/free exports)
  4. copy blingfire.js + blingfire.wasm to livekit :)
  5. get blingfire_wrapper and adapt how it loads the module:
    import createModule from './blingfire.js';

    const Module = await createModule();
  6. use the module wrapper:
    import { TextToSentences } from './blingfire_wrapper.js';

    console.log('TextToSentences', TextToSentences('This is a sentence. And another one.'));
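
For anyone consuming the result in TypeScript, here is a minimal sketch of turning the tokenizer output into an array. It assumes the wrapper's TextToSentences returns a single string with one sentence per line (as the Python binding does) and may return null on failure; both are assumptions about the wrapper, not something verified here. The helper name toSentenceList is hypothetical and is not part of blingfire_wrapper.js. The tokenizer is passed in as a parameter so the helper does not depend on how the WASM module is loaded.

```typescript
// Shape of the tokenizer function we expect to receive.
// Assumption: it returns sentences joined by '\n', or null on failure.
type SentenceFn = (text: string) => string | null;

// Hypothetical helper: split the newline-delimited tokenizer output
// into a clean array of sentences.
function toSentenceList(text: string, textToSentences: SentenceFn): string[] {
  const raw = textToSentences(text);
  if (!raw) {
    // Nothing came back (empty input or tokenizer failure).
    return [];
  }
  return raw
    .split('\n')
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);
}
```

With the wrapper from step 6 this would be called as toSentenceList('This is a sentence. And another one.', TextToSentences).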

Relevant log output

No response

Describe your environment

linux

Minimal reproducible example

No response

Additional information

No response

Labels

bug
