@@ -29,45 +39,148 @@ The sentence segmentation in this library is **non-destructive**. This means, if

 ## Usage

+### Rust
+
 Install the library using

 ```bash
-npm install sentencex
+cargo add sentencex
 ```

 Then, any text can be segmented as follows.

-```javascript
-const sentencex = require(".");
+```rust
+use sentencex::segment;
+
+fn main() {
+    let text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";
+    let sentences = segment("en", text);

-console.log(
-  sentencex.segment("en", "This is first sentence. This is another one."),
-);
+    for (i, sentence) in sentences.iter().enumerate() {
+        println!("{}. {}", i + 1, sentence);
+    }
+}
 ```

 The first argument is language code, second argument is text to segment. The `segment` method returns an array of identified sentences.

-If you need more detailed information about sentence boundaries, you can use the `get_sentence_boundaries` method:
+### Python
+
+Install from PyPI:
+
+```bash
+pip install sentencex
+```
+
+```python
+import sentencex
+
+text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development."
const text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";
+
+// Segment text into sentences
+const sentences = segment("en", text);
+sentences.forEach((sentence, i) => {
+  console.log(`${i + 1}. ${sentence}`);
+});

-console.log(
-  sentencex.get_sentence_boundaries("en", "This is first sentence. This is another one."),
const text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";
This method returns an array of objects with the following properties:
-- `start_index`: The starting position of the sentence in the original text
-- `end_index`: The ending position of the sentence in the original text
-- `text`: The actual sentence text
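As a rough illustration of records carrying these three fields, here is a minimal sketch in plain Python. It does not call sentencex: the `toy_boundaries` helper is hypothetical, it splits naively after sentence-ending punctuation, and it assumes `end_index` is exclusive, which the README does not state. Trailing whitespace stays attached to the preceding sentence, so joining the pieces gives back the input unchanged, in the spirit of the non-destructive segmentation described earlier.

```python
import re


def toy_boundaries(text):
    """Hypothetical stand-in for a boundary API: naive split after '.', '!' or '?'."""
    boundaries = []
    start = 0
    # Each match ends right after the punctuation plus any following whitespace,
    # so no character of the input is dropped.
    for match in re.finditer(r"[.!?]+\s*", text):
        end = match.end()
        boundaries.append(
            {"start_index": start, "end_index": end, "text": text[start:end]}
        )
        start = end
    if start < len(text):
        # Trailing text without closing punctuation becomes the last record.
        boundaries.append(
            {"start_index": start, "end_index": len(text), "text": text[start:]}
        )
    return boundaries


sample = "This is first sentence. This is another one."
for record in toy_boundaries(sample):
    # Under the exclusive-end assumption, slicing the input reproduces `text`.
    assert sample[record["start_index"]:record["end_index"]] == record["text"]
    print(record["start_index"], record["end_index"], repr(record["text"]))
```

A real segmenter needs far more than a punctuation regex (abbreviations such as "U.S." would be split incorrectly here); the sketch only shows how the three fields relate to the original text.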

 ## Language support

 The aim is to support all languages where there is a wikipedia. Instead of falling back on English for languages not defined in the library, a fallback chain is used. The closest language which is defined in the library will be used. Fallbacks for ~244 languages are defined.
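The fallback idea can be sketched as follows. This is a minimal illustration and not the library's implementation: the `DEFINED` set, the `FALLBACKS` table and the `resolve_language` helper are all hypothetical, the entries are illustrative only, and using a default language when a chain runs out is an assumption.

```python
# Hypothetical data: languages that have their own rules, and fallback chains
# for languages that do not. Not the library's actual table.
DEFINED = {"en", "es", "hi", "pt"}
FALLBACKS = {
    "gl": ["pt"],
    "an": ["es"],
    "xx": ["yy"],  # made-up codes to show a multi-step chain
    "yy": ["gl"],
}


def resolve_language(code, default="en"):
    """Return the closest defined language for `code`, walking fallbacks breadth-first."""
    seen = set()
    queue = [code]
    while queue:
        current = queue.pop(0)
        if current in DEFINED:
            return current
        if current in seen:
            continue  # guard against cycles in the fallback table
        seen.add(current)
        queue.extend(FALLBACKS.get(current, []))
    # Assumption: fall back to a default once the chain is exhausted.
    return default


print(resolve_language("gl"))  # resolved in one step
print(resolve_language("xx"))  # resolved through xx -> yy -> gl -> pt
```

Walking the chain breadth-first means a language reachable in fewer steps wins over one further away, which is one plausible reading of "closest".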

 ## Performance

+Following is a sample output of sentence segmenting [The Complete Works of William Shakespeare](https://www.gutenberg.org/files/100/100-0.txt).
+This file is 5.29MB. As you can see below, it took half a second.
+
+```bash
+$ curl https://www.gutenberg.org/files/100/100-0.txt | ./target/release/sentencex -l en > /dev/null
+  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current