Closed
Labels
bug (Bugs and behaviour differing from documentation)
Description
I've been running into a problem where a parse's sentence boundaries change after converting it to a bytestring:
> text = u"I bought a couch from IKEA. It wasn't very comfortable."
> parse = nlp(text)
> parse_from_bytes = Doc(nlp.vocab).from_bytes(parse.to_bytes())
> [s.text for s in parse.sents]
[u"I bought a couch from IKEA. It wasn't very comfortable."]
> [s.text for s in parse_from_bytes.sents]
[u'I bought a couch from IKEA.', u"It wasn't very comfortable."]
> parse.to_bytes() == parse_from_bytes.to_bytes()
True
In this example the sentence boundaries happen to be more correct after the conversion, but I have other examples where it actually breaks the parse.
EDIT: the parse is already broken before serialization. In the original, two ROOTs appear in the same sentence, whereas in the from_bytes version the ROOTs are forced into different sentences.
I'm not sure whether this points to a bug in the serialization or in the initial sentence boundary detection!
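A quick way to spot the condition described in the EDIT is to count ROOT tokens per sentence. The following is only a minimal sketch: the helper name `sentences_with_multiple_roots` and the `Tok` stand-in are hypothetical, and the dependency labels in the sample data are hand-assigned for illustration rather than real parser output. The helper only assumes that each sentence is iterable and that each token exposes `text` and `dep_`, so it should also accept `parse.sents` and `parse_from_bytes.sents`.

```python
from collections import namedtuple

# Hypothetical stand-in for a token: only the two attributes the check reads.
Tok = namedtuple("Tok", ["text", "dep_"])


def sentences_with_multiple_roots(sents):
    """Return (sentence_index, root_texts) for every sentence with more than one ROOT."""
    offenders = []
    for i, sent in enumerate(sents):
        roots = [tok.text for tok in sent if tok.dep_ == "ROOT"]
        if len(roots) > 1:
            offenders.append((i, roots))
    return offenders


# Hand-labelled data mimicking the two segmentations reported above
# (labels are illustrative assumptions, not actual parser output).
first = [Tok("I", "nsubj"), Tok("bought", "ROOT"), Tok("a", "det"),
         Tok("couch", "dobj"), Tok("from", "prep"), Tok("IKEA", "pobj"),
         Tok(".", "punct")]
second = [Tok("It", "nsubj"), Tok("was", "ROOT"), Tok("n't", "neg"),
          Tok("very", "advmod"), Tok("comfortable", "acomp"),
          Tok(".", "punct")]

original = [first + second]   # one sentence containing both ROOTs
roundtrip = [first, second]   # split into two sentences after from_bytes

print(sentences_with_multiple_roots(original))
print(sentences_with_multiple_roots(roundtrip))
```

With real data, a non-empty result for `parse.sents` and an empty one for `parse_from_bytes.sents` would match the behaviour described in the EDIT.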