Skip to content

NLP silently hangs indefinitely on strings with many periods #957

@catalytic1618

Description

@catalytic1618

The following code block induces a hung state in the NLP module (this is text from a real-world corpus, similar errors happen on long Java namespaces (e.g. org.apache.x.y.z).

nlp = spacy.load("en_core_web_sm")
nlp("0.1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.41.42.43.44.45.46.47.48")

Our inspection suggests the tokenizer is thrashing, perhaps owing to a regex of exploding complexity. We've routed around the damage with a context manager that uses signal.SIGALRM to timeout if spacy takes too long, but this issue was the source of much confusion as regards seemingly simple jobs that were running for extended periods of time.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugBugs and behaviour differing from documentation

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions