148 changes: 131 additions & 17 deletions bindings/nodejs/README.md
# Sentence segmenter

[![Rust Tests](https://github.com/wikimedia/sentencex/actions/workflows/rust.yml/badge.svg)](https://github.com/wikimedia/sentencex/actions/workflows/rust.yml)
[![Node.js Tests](https://github.com/wikimedia/sentencex/actions/workflows/node.yaml/badge.svg)](https://github.com/wikimedia/sentencex/actions/workflows/node.yaml)
[![Python Tests](https://github.com/wikimedia/sentencex/actions/workflows/python.yaml/badge.svg)](https://github.com/wikimedia/sentencex/actions/workflows/python.yaml)

A sentence segmentation library written in Rust, with wide language support, optimized for speed and utility.

## Bindings

Besides native Rust, bindings for the following programming languages are available:

* [Python](https://pypi.org/project/sentencex/)
* [Node.js](https://www.npmjs.com/package/sentencex)
* [Web (WASM)](https://www.npmjs.com/package/sentencex-wasm)

## Approach

The sentence segmentation in this library is **non-destructive**. This means that if you join the segmented sentences, you get the original text back.
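The non-destructive guarantee can be illustrated with a quick sketch. The segments below are hand-written stand-ins for what `segment` would return; the key point is that each segment keeps its surrounding whitespace, so concatenation reproduces the input exactly.

```python
text = "This is first sentence. This is another one."

# Hypothetical output of segment("en", text): trailing whitespace is
# preserved inside the segments rather than discarded.
segments = ["This is first sentence. ", "This is another one."]

# Non-destructive property: joining the segments yields the original text.
assert "".join(segments) == text
```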

## Usage

### Rust

Install the library using

```bash
cargo add sentencex
```

Then, any text can be segmented as follows.

```rust
use sentencex::segment;

fn main() {
    let text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";
    let sentences = segment("en", text);

    for (i, sentence) in sentences.iter().enumerate() {
        println!("{}. {}", i + 1, sentence);
    }
}
```

The first argument is the language code; the second is the text to segment. The `segment` method returns an array of identified sentences.

If you need more detailed information about sentence boundaries, use the `get_sentence_boundaries` method, demonstrated in the binding examples below.
### Python

Install from PyPI:

```bash
pip install sentencex
```

```python
import sentencex

text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development."

# Segment text into sentences
sentences = sentencex.segment("en", text)
for i, sentence in enumerate(sentences, 1):
    print(f"{i}. {sentence}")

# Get sentence boundaries with indices
boundaries = sentencex.get_sentence_boundaries("en", text)
for boundary in boundaries:
    print(f"Sentence: '{boundary['text']}' (indices: {boundary['start_index']}-{boundary['end_index']})")
```

See [bindings/python/example.py](bindings/python/example.py) for more examples.

### Node.js

Install from npm:

```bash
npm install sentencex
```

```javascript
import { segment, get_sentence_boundaries } from 'sentencex';

const text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";

// Segment text into sentences
const sentences = segment("en", text);
sentences.forEach((sentence, i) => {
  console.log(`${i + 1}. ${sentence}`);
});

// Get sentence boundaries with indices
const boundaries = get_sentence_boundaries("en", text);
boundaries.forEach(boundary => {
  console.log(`Sentence: '${boundary.text}' (indices: ${boundary.start_index}-${boundary.end_index})`);
});
```

For CommonJS usage:

```javascript
const { segment, get_sentence_boundaries } = require('sentencex');
```

See [bindings/nodejs/example.js](bindings/nodejs/example.js) for more examples.

### WebAssembly (Browser)

Install from npm:

```bash
npm install sentencex-wasm
```

or use a CDN like `https://esm.sh/sentencex-wasm`

```javascript
import init, { segment, get_sentence_boundaries } from 'https://esm.sh/sentencex-wasm';

async function main() {
  // Initialize the WASM module
  await init();

  const text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";

  // Segment text into sentences
  const sentences = segment("en", text);
  sentences.forEach((sentence, i) => {
    console.log(`${i + 1}. ${sentence}`);
  });

  // Get sentence boundaries with indices
  const boundaries = get_sentence_boundaries("en", text);
  boundaries.forEach(boundary => {
    console.log(`Sentence: '${boundary.text}' (indices: ${boundary.start_index}-${boundary.end_index})`);
  });
}

main();
```

The `get_sentence_boundaries` method returns an array of objects with the following properties:
- `start_index`: The starting position of the sentence in the original text
- `end_index`: The ending position of the sentence in the original text
- `text`: The actual sentence text
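Because segmentation is non-destructive, these indices map straight back into the input: slicing the original string with `start_index` and `end_index` recovers each sentence. A minimal sketch with hand-constructed boundary objects (the real ones come from `get_sentence_boundaries`):

```python
text = "First sentence. Second one."

# Hand-constructed stand-ins for what get_sentence_boundaries returns.
boundaries = [
    {"start_index": 0, "end_index": 15, "text": "First sentence."},
    {"start_index": 16, "end_index": 27, "text": "Second one."},
]

for b in boundaries:
    # Slicing the original text with the indices recovers the sentence.
    assert text[b["start_index"]:b["end_index"]] == b["text"]
```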

## Language support

The aim is to support every language that has a Wikipedia. Instead of falling back on English for languages not defined in the library, a fallback chain is used: the closest related language that is defined in the library is used instead. Fallbacks are defined for ~244 languages.
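The fallback-chain idea can be sketched as follows. This is a minimal illustration, not the library's actual data: the map and the set of implemented languages below are hypothetical, whereas the real chains cover ~244 languages.

```python
# Hypothetical fallback map: each language points to its closest relative.
FALLBACKS = {
    "sco": "en",  # Scots -> English
    "bho": "hi",  # Bhojpuri -> Hindi
    "hi": "en",
}

# Languages with dedicated segmentation rules in this sketch.
IMPLEMENTED = {"en", "hi"}

def resolve(lang: str) -> str:
    """Walk the fallback chain until an implemented language is found."""
    while lang not in IMPLEMENTED:
        lang = FALLBACKS.get(lang, "en")  # final fallback: English
    return lang
```

Under this sketch, `resolve("bho")` stops at `"hi"` and uses the Hindi rules rather than jumping straight to English.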

## Performance

The following is a sample output of sentence-segmenting [The Complete Works of William Shakespeare](https://www.gutenberg.org/files/100/100-0.txt). The file is 5.29 MB; as shown below, segmentation took about half a second.

```bash
$ curl https://www.gutenberg.org/files/100/100-0.txt | ./target/release/sentencex -l en > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5295k  100 5295k    0     0   630k      0  0:00:08  0:00:08 --:--:-- 1061k
Found 40923 paragraphs
Processing 540 chunks
Time taken for segment(): 521.071603ms
Total sentences: 153736
```

Accuracy is measured on the Golden Rule Set (GRS) for English. Lists are exempted (e.g. `1. sentence 2. another sentence`).

The following libraries are used for benchmarking:
The following libraries are used for benchmarking:
## Thanks

- <https://github.com/diasks2/pragmatic_segmenter> for test cases. The English golden rule set is also sourced from it.
- <https://github.com/mush42/tqsm> for an earlier Rust port of this library.

## License

2 changes: 1 addition & 1 deletion bindings/nodejs/package.json
"main": "index.cjs",
"repository": {
"type": "git",
"url": "git+https://github.com/wikimedia/sentencex.git"
},
"author": "Santhosh Thottingal <[email protected]>",
"license": "MIT",