
Commit cc8bf62

* Fix Issue #360: Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens.
1 parent eab2376 commit cc8bf62
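
For context, the failure mode is easy to reproduce with plain `re`. The three-dot pattern below is only a stand-in for spaCy's real infix rules: on a long run of periods, `finditer` returns back-to-back matches, so from the second match onward `match.start()` equals the offset where the previous match ended, and the slice of text before the infix is empty.

    import re

    # Back-to-back matches on the regression string from Issue #360:
    # each match starts exactly where the previous one ended.
    for m in re.finditer(r'\.\.\.', u'$45...............Asking'):
        print(m.start(), m.end(), m.group())
    # 3 6 ...
    # 6 9 ...
    # 9 12 ...
    # 12 15 ...
    # 15 18 ...

When the tokenizer's infix loop slices the text between matches, that empty slice is what used to be pushed as a token; the two-line guard added below skips it.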


2 files changed: +8 -0 lines changed


spacy/tests/tokenizer/test_infix.py

Lines changed: 6 additions & 0 deletions
@@ -24,6 +24,12 @@ def test_ellipsis(en_tokenizer):
     tokens = en_tokenizer('best...known')
     assert len(tokens) == 3
 
+def test_big_ellipsis(en_tokenizer):
+    '''Test regression identified in Issue #360'''
+    tokens = en_tokenizer(u'$45...............Asking')
+    assert len(tokens) > 2
+
+
 
 def test_email(en_tokenizer):
     tokens = en_tokenizer('[email protected]')
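
As a rough usage sketch, the new test corresponds to the following stand-alone check. The `spacy.en.English` entry point reflects the library's loading API around the time of this commit and is an assumption about the surrounding version:

    from spacy.en import English  # 2015-era loading API; assumed here

    nlp = English()
    tokens = nlp.tokenizer(u'$45...............Asking')
    # Before this commit the tokenizer failed on this input; with the fix
    # it returns several tokens, so the weak `> 2` bound is enough to
    # catch a regression.
    assert len(tokens) > 2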

spacy/tokenizer.pyx

Lines changed: 2 additions & 0 deletions
@@ -227,6 +227,8 @@ cdef class Tokenizer:
             for match in matches:
                 infix_start = match.start()
                 infix_end = match.end()
+                if infix_start == start:
+                    continue
                 span = string[start:infix_start]
                 tokens.push_back(self.vocab.get(tokens.mem, span), False)
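
The guard is easiest to see in a pure-Python mimic of the infix loop. This sketch covers only the infix stage (no prefix or suffix handling, so '$45' stays one token here) and uses an illustrative three-dot regex rather than spaCy's real rules:

    import re

    INFIX_RE = re.compile(r'\.\.\.')  # stand-in for the real infix pattern

    def split_on_infixes(string):
        '''Mimic the patched loop from tokenizer.pyx.'''
        tokens = []
        start = 0
        for match in INFIX_RE.finditer(string):
            infix_start, infix_end = match.start(), match.end()
            if infix_start == start:
                # An infix match flush against the current offset would
                # make string[start:infix_start] empty; skipping it keeps
                # empty tokens out of the stream.
                continue
            tokens.append(string[start:infix_start])      # text before the infix
            tokens.append(string[infix_start:infix_end])  # the infix itself
            start = infix_end
        if start < len(string):
            tokens.append(string[start:])                 # trailing text
        return tokens

    print(split_on_infixes(u'$45...............Asking'))
    # ['$45', '...', '...', '...', '...', '...', 'Asking']

Without the two added lines, the second of the adjacent '...' matches produced an empty span, which is what made the tokenizer fail on multi-infix tokens.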
232234
