Lemmatization errors when text contains contracted forms of 'be' #674

@gppatt

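I've noticed inconsistent lemmatization of the contracted forms of 'be'. In the example below, 'm is lemmatized to be, but 're is left unchanged and 's is truncated to a bare apostrophe: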

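import spacy
spacy_nlp = spacy.load('en')  # assumption: the default English model was loaded this way
nlp = spacy_nlp(u"I'm hungry. You're hungry. He's hungry. It's hungry. We're hungry. They're hungry.")
for tok in nlp:
    print tok, tok.lemma_  # Python 2 print statement, as in the original report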

This gives the following output:

I i
'm be
hungry hungry
. .
You you
're 're
hungry hungry
. .
He he
's '
hungry hungry
. .
It it
's '
hungry hungry
. .
We we
're 're
hungry hungry
. .
They they
're 're
hungry hungry
. .

A related error occurs with "won't" (and with the much rarer "shan't"), where "wo" is left as its own lemma instead of "will":

nlp = spacy_nlp(u"They won't move.")
for tok in nlp:
    print tok, tok.lemma_

They they
wo wo
n't not
move move
. .

I think I once saw a similar lemmatization error for "can't", but I have not been able to reproduce it.
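Until this is fixed in the lemmatizer itself, the outputs above could be patched after the fact. The following is only a minimal workaround sketch, not spaCy's API or its eventual fix: the names `BE_CONTRACTIONS`, `NEGATED_AUXILIARIES` and `patch_lemmas` are hypothetical, and it works on plain (text, lemma) pairs such as `[(tok.text, tok.lemma_) for tok in doc]`. It deliberately leaves 's alone, since that form is ambiguous between "is", "has" and the possessive.

```python
# Hypothetical lookup tables (not part of spaCy): contraction fragments
# and the lemmas one would expect for them.
BE_CONTRACTIONS = {"'m": "be", "'re": "be"}
# These fragments are only rewritten when directly followed by "n't".
NEGATED_AUXILIARIES = {"wo": "will", "sha": "shall", "ca": "can"}


def patch_lemmas(pairs):
    """Return a new list of (text, lemma) pairs with contraction lemmas fixed.

    `pairs` is a list of (text, lemma) tuples in sentence order.
    The ambiguous 's is intentionally left unchanged.
    """
    fixed = []
    for i, (text, lemma) in enumerate(pairs):
        key = text.lower()
        # Text of the following token, or "" at the end of the sequence.
        next_text = pairs[i + 1][0].lower() if i + 1 < len(pairs) else ""
        if key in BE_CONTRACTIONS:
            lemma = BE_CONTRACTIONS[key]
        elif key in NEGATED_AUXILIARIES and next_text == "n't":
            lemma = NEGATED_AUXILIARIES[key]
        fixed.append((text, lemma))
    return fixed


if __name__ == "__main__":
    # (text, lemma) pairs copied from the second output in this report.
    reported = [("They", "they"), ("wo", "wo"), ("n't", "not"),
                ("move", "move"), (".", ".")]
    for text, lemma in patch_lemmas(reported):
        print(text, lemma)
```

Checking the following token keeps an ordinary word such as "ca" or "wo" from being rewritten when it does not come from a split contraction.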

Your Environment

OSX 10.11.6
Spyder 3.0.0
spaCy 1.2.0
