inconsistent sentence boundaries before and after serialization #322

@thricedotted

I've been running into a problem where a parse's sentence boundaries change after converting it to a bytestring:

> text = u"I bought a couch from IKEA. It wasn't very comfortable."

> parse = nlp(text)

> parse_from_bytes = Doc(nlp.vocab).from_bytes(parse.to_bytes())

> [s.text for s in parse.sents]
[u"I bought a couch from IKEA. It wasn't very comfortable."]

> [s.text for s in parse_from_bytes.sents]
[u'I bought a couch from IKEA.', u"It wasn't very comfortable."]

> parse.to_bytes() == parse_from_bytes.to_bytes()
True

This happened to be one where the sentence boundaries were more correct after the conversion, but I have other examples where it actually breaks the parse. EDIT: the parse is already broken in the original, where two ROOTs appear in the same sentence, whereas in the from_bytes version the ROOTs are forced into different sentences.

Not sure if this means there is a bug in the serialization or initial sentence boundary detection!
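
To make the "two ROOTs in one sentence" symptom easier to check, here is a minimal sketch of a helper that flags any sentence containing more than one ROOT. The function name `sentences_with_multiple_roots` and the `Token` stand-in are hypothetical, and the dependency labels in the toy data are hand-written for illustration, not real parser output. The helper only assumes that each sentence is iterable and that each token exposes `.text` and `.dep_`.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed token; only the two attributes
# the helper reads (.text and .dep_) are modelled here.
Token = namedtuple("Token", ["text", "dep_"])


def sentences_with_multiple_roots(sents):
    """Return (sentence_index, root_texts) for each sentence with more than one ROOT."""
    flagged = []
    for i, sent in enumerate(sents):
        roots = [tok.text for tok in sent if tok.dep_ == "ROOT"]
        if len(roots) > 1:
            flagged.append((i, roots))
    return flagged


# Hand-labelled toy data (labels are illustrative assumptions).
first = [Token("I", "nsubj"), Token("bought", "ROOT"), Token("couch", "dobj")]
second = [Token("It", "nsubj"), Token("was", "ROOT"), Token("comfortable", "acomp")]

merged_segmentation = [first + second]  # one sentence holding both ROOTs
split_segmentation = [first, second]    # one ROOT per sentence

print(sentences_with_multiple_roots(merged_segmentation))
print(sentences_with_multiple_roots(split_segmentation))
```

With a real parse, the same helper should work on `parse.sents` and `parse_from_bytes.sents` directly, since iterating a sentence span yields tokens with `.text` and `.dep_`. A non-empty result for the original but an empty one for the deserialized copy would match the behaviour described above.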

    Labels

    bug: Bugs and behaviour differing from documentation
