Commit 3fcdcbf

Merge pull request #32 from wikimedia/20251120-node-typo-docs-fix

20251120 node typo docs fix

2 parents e95bbe2 + 26f8b85 commit 3fcdcbf

2 files changed

Lines changed: 132 additions & 18 deletions

File tree

bindings/nodejs/README.md

Lines changed: 131 additions & 17 deletions
@@ -1,8 +1,18 @@
 # Sentence segmenter
 
-[![tests](https://github.com/wikimedia/sentencex-rust/actions/workflows/test.yml/badge.svg)](https://github.com/wikimedia/sentencex-rust/actions/workflows/test.yml)
+[![Rust Tests](https://github.com/wikimedia/sentencex/actions/workflows/rust.yml/badge.svg)](https://github.com/wikimedia/sentencex/actions/workflows/rust.yml)
+[![Node.js Tests](https://github.com/wikimedia/sentencex/actions/workflows/node.yaml/badge.svg)](https://github.com/wikimedia/sentencex/actions/workflows/node.yaml)
+[![Python Tests](https://github.com/wikimedia/sentencex/actions/workflows/python.yaml/badge.svg)](https://github.com/wikimedia/sentencex/actions/workflows/python.yaml)
 
-A sentence segmentation library in Go language with wide language support optimized for speed and utility.
+A sentence segmentation library written in Rust, with wide language support, optimized for speed and utility.
+
+## Bindings
+
+Besides native Rust, bindings for the following programming languages are available:
+
+* [Python](https://pypi.org/project/sentencex/)
+* [Node.js](https://www.npmjs.com/package/sentencex)
+* [Web (Wasm)](https://www.npmjs.com/package/sentencex-wasm)
 
 ## Approach
 
@@ -29,45 +39,148 @@ The sentence segmentation in this library is **non-destructive**. This means, if
 
 ## Usage
 
+### Rust
+
 Install the library using
 
 ```bash
-npm install sentencex
+cargo add sentencex
 ```
 
 Then, any text can be segmented as follows.
 
-```javascript
-const sentencex = require(".");
+```rust
+use sentencex::segment;
+
+fn main() {
+    let text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";
+    let sentences = segment("en", text);
 
-console.log(
-  sentencex.segment("en", "This is first sentence. This is another one."),
-);
+    for (i, sentence) in sentences.iter().enumerate() {
+        println!("{}. {}", i + 1, sentence);
+    }
+}
 ```
 
 The first argument is the language code and the second is the text to segment. The `segment` method returns an array of identified sentences.
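As a side note for review: the README's central claim is that segmentation is non-destructive, i.e. the returned sentences are exact substrings of the input. A minimal hypothetical sketch of that contract (a naive splitter, not the library's actual algorithm) is:

```python
# Hypothetical sketch of the non-destructive contract: every segment is a
# verbatim substring of the input, so rejoining segments with the separating
# whitespace reproduces the original text exactly.
# This is NOT the library's algorithm; it is deliberately naive (it would
# mis-split abbreviations like "U.S.").

def naive_segment(text):
    """Split on '.', '!' or '?' followed by a single space, keeping all characters."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch in ".!?" and i + 1 < len(text) and text[i + 1] == " ":
            sentences.append(text[start : i + 1])
            start = i + 2  # skip the single separating space
    if start < len(text):
        sentences.append(text[start:])
    return sentences

text = "This is first sentence. This is another one."
parts = naive_segment(text)
# Non-destructive: re-inserting the separating spaces restores the input.
assert " ".join(parts) == text
```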

-If you need more detailed information about sentence boundaries, you can use the `get_sentence_boundaries` method:
+### Python
+
+Install from PyPI:
+
+```bash
+pip install sentencex
+```
+
+```python
+import sentencex
+
+text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development."
+
+# Segment text into sentences
+sentences = sentencex.segment("en", text)
+for i, sentence in enumerate(sentences, 1):
+    print(f"{i}. {sentence}")
+
+# Get sentence boundaries with indices
+boundaries = sentencex.get_sentence_boundaries("en", text)
+for boundary in boundaries:
+    print(f"Sentence: '{boundary['text']}' (indices: {boundary['start_index']}-{boundary['end_index']})")
+```
+
+See [bindings/python/example.py](bindings/python/example.py) for more examples.
+
+### Node.js
+
+Install from npm:
+
+```bash
+npm install sentencex
+```
 
 ```javascript
-const sentencex = require(".");
+import { segment, get_sentence_boundaries } from 'sentencex';
+
+const text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";
+
+// Segment text into sentences
+const sentences = segment("en", text);
+sentences.forEach((sentence, i) => {
+  console.log(`${i + 1}. ${sentence}`);
+});
 
-console.log(
-  sentencex.get_sentence_boundaries("en", "This is first sentence. This is another one."),
-);
+// Get sentence boundaries with indices
+const boundaries = get_sentence_boundaries("en", text);
+boundaries.forEach(boundary => {
+  console.log(`Sentence: '${boundary.text}' (indices: ${boundary.start_index}-${boundary.end_index})`);
+});
+```
+
+For CommonJS usage:
+
+```javascript
+const { segment, get_sentence_boundaries } = require('sentencex');
+```
+
+See [bindings/nodejs/example.js](bindings/nodejs/example.js) for more examples.
+
+### WebAssembly (Browser)
+
+Install from npm:
+
+```bash
+npm install sentencex-wasm
+```
+
+or use a CDN like `https://esm.sh/sentencex-wasm`
+
+```javascript
+import init, { segment, get_sentence_boundaries } from 'https://esm.sh/sentencex-wasm';
+
+async function main() {
+  // Initialize the WASM module
+  await init();
+
+  const text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";
+
+  // Segment text into sentences
+  const sentences = segment("en", text);
+  sentences.forEach((sentence, i) => {
+    console.log(`${i + 1}. ${sentence}`);
+  });
+
+  // Get sentence boundaries with indices
+  const boundaries = get_sentence_boundaries("en", text);
+  boundaries.forEach(boundary => {
+    console.log(`Sentence: '${boundary.text}' (indices: ${boundary.start_index}-${boundary.end_index})`);
+  });
+}
+
+main();
 ```
 
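A note on the boundary records used throughout these examples: each record carries `text`, `start_index`, and `end_index`, and because segmentation is non-destructive, slicing the original text with those indices yields the sentence back. A hypothetical sketch of that relationship (`boundaries_from_sentences` is an illustrative helper, not part of the library's API):

```python
# Hypothetical sketch: given non-destructive segments, recover boundary
# records so that text[start_index:end_index] == record["text"].

def boundaries_from_sentences(text, sentences):
    """Locate each segment in order and record its index range."""
    records, cursor = [], 0
    for s in sentences:
        start = text.index(s, cursor)  # segments appear verbatim, in order
        end = start + len(s)
        records.append({"text": s, "start_index": start, "end_index": end})
        cursor = end
    return records

text = "This is first sentence. This is another one."
sents = ["This is first sentence.", "This is another one."]
for r in boundaries_from_sentences(text, sents):
    # Slicing the original with the recorded indices returns the sentence.
    assert text[r["start_index"]:r["end_index"]] == r["text"]
```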
-This method returns an array of objects with the following properties:
-- `start_index`: The starting position of the sentence in the original text
-- `end_index`: The ending position of the sentence in the original text
-- `text`: The actual sentence text
 
 ## Language support
 
 The aim is to support all languages where there is a Wikipedia. Instead of falling back on English for languages not defined in the library, a fallback chain is used: the closest language that is defined in the library will be used. Fallbacks for ~244 languages are defined.
 
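The fallback-chain idea described above can be sketched as follows. The table here is illustrative only, not the library's actual ~244-entry fallback data, and `resolve_language` is a hypothetical helper, not the library's API:

```python
# Hypothetical sketch of a fallback chain: walk a parent map until a
# language with an implementation is found, instead of jumping straight
# to English. Illustrative data only.

FALLBACKS = {"simple": "en", "pt-br": "pt", "no": "nb"}
IMPLEMENTED = {"en", "pt", "nb"}

def resolve_language(code):
    """Return the nearest implemented language, following the fallback chain."""
    seen = set()
    while code not in IMPLEMENTED:
        if code in seen or code not in FALLBACKS:
            return "en"  # last-resort default in this sketch
        seen.add(code)
        code = FALLBACKS[code]
    return code

assert resolve_language("pt-br") == "pt"   # nearest defined language, not English
assert resolve_language("simple") == "en"
```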
 ## Performance
 
+Following is a sample output of sentence segmenting [The Complete Works of William Shakespeare](https://www.gutenberg.org/files/100/100-0.txt).
+This file is 5.29 MB. As you can see below, segmentation took about half a second.
+
+```bash
+$ curl https://www.gutenberg.org/files/100/100-0.txt | ./target/release/sentencex -l en > /dev/null
+  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
+                                 Dload  Upload   Total   Spent    Left  Speed
+100 5295k  100 5295k    0     0   630k      0  0:00:08  0:00:08 --:--:-- 1061k
+Found 40923 paragraphs
+Processing 540 chunks
+Time taken for segment(): 521.071603ms
+Total sentences: 153736
+```
+
 Measured on Golden Rule Set (GRS) for English. Lists are exempted (1. sentence 2. another sentence).
 
 The following libraries are used for benchmarking:
@@ -94,6 +207,7 @@ The following libraries are used for benchmarking:
 ## Thanks
 
 - <https://github.com/diasks2/pragmatic_segmenter> for test cases. The English golden rule set is also sourced from it.
+- <https://github.com/mush42/tqsm> for an earlier Rust port of this library.
 
 ## License

bindings/nodejs/package.json

Lines changed: 1 addition & 1 deletion

@@ -6,7 +6,7 @@
   "main": "index.cjs",
   "repository": {
     "type": "git",
-    "url": "https//github.com/wikimedia/sentencex"
+    "url": "git+https://github.com/wikimedia/sentencex.git"
   },
   "author": "Santhosh Thottingal <[email protected]>",
   "license": "MIT",
