inconsistent sentence boundaries before and after serialization #322

I've been running into a problem where a parse's sentence boundaries change after serializing it to a bytestring and loading it back:

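```python
>>> # Setup was not shown in the original report; these three lines are an assumption.
>>> import spacy
>>> from spacy.tokens import Doc
>>> nlp = spacy.load('en')

>>> text = u"I bought a couch from IKEA. It wasn't very comfortable."
>>> parse = nlp(text)

>>> # Round trip: serialize to bytes, then load into a fresh Doc.
>>> parse_from_bytes = Doc(nlp.vocab).from_bytes(parse.to_bytes())

>>> [s.text for s in parse.sents]
[u"I bought a couch from IKEA. It wasn't very comfortable."]

>>> [s.text for s in parse_from_bytes.sents]
[u'I bought a couch from IKEA.', u"It wasn't very comfortable."]

>>> # The serialized forms are identical even though the sentences differ.
>>> parse.to_bytes() == parse_from_bytes.to_bytes()
True
```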
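To pin down exactly where the two versions disagree, it may help to compare sentence start offsets rather than sentence text. This is a minimal sketch, assuming only that each doc exposes `.sents` yielding spans with a `.start` token index (as spaCy's `Span` does). The helper names `sentence_starts` and `boundary_diff` are hypothetical, not part of spaCy.

```python
def sentence_starts(doc):
    """Return the set of token indices at which a sentence begins."""
    return {sent.start for sent in doc.sents}


def boundary_diff(doc_a, doc_b):
    """Return (only_in_a, only_in_b): sorted sentence-start indices unique to each doc."""
    starts_a = sentence_starts(doc_a)
    starts_b = sentence_starts(doc_b)
    return sorted(starts_a - starts_b), sorted(starts_b - starts_a)
```

Calling `boundary_diff(parse, parse_from_bytes)` on the example above should list the extra sentence start (the token `It`) in the second element of the result.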

In this example the sentence boundaries happen to be more correct _after_ the conversion, but I have other examples where ~~it actually breaks the parse~~ EDIT: the parse is already broken. In the original, two ROOTs appear in the same sentence, whereas in the `from_bytes` version the ROOTs are forced into different sentences.
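A quick way to spot the affected parses is to count ROOT tokens per sentence. This is a rough sketch, assuming the root of each dependency tree carries the label `"ROOT"` in `token.dep_` (as spaCy's English models do) and that iterating a sentence yields its tokens. The helper names are hypothetical.

```python
def roots_per_sentence(doc):
    """Return one list per sentence holding the text of every ROOT token in it."""
    return [
        [token.text for token in sent if token.dep_ == "ROOT"]
        for sent in doc.sents
    ]


def sentences_with_multiple_roots(doc):
    """Return the indices of sentences that contain more than one ROOT token."""
    return [
        index
        for index, roots in enumerate(roots_per_sentence(doc))
        if len(roots) > 1
    ]
```

Running `sentences_with_multiple_roots(parse)` and `sentences_with_multiple_roots(parse_from_bytes)` side by side should show whether the multi-ROOT sentences exist only before the round trip.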

I'm not sure whether this points to a bug in the serialization or in the initial sentence boundary detection!

