NER misses - appears much worse than https://demos.explosion.ai/displacy-ent/ #977

@is55555

Bug (?)

python -m spacy info --markdown

Info about spaCy

  • spaCy version: 1.7.3
  • Platform: Darwin-16.5.0-x86_64-i386-64bit
  • Python version: 3.6.0
  • Installed models: en, en_core_web_md, en_depent_web_md, en_vectors_glove_md

(Reproduced on Linux as well.)


I'm trying spaCy for ORG and PERSON detection after seeing satisfactory results on the demo site (https://demos.explosion.ai/displacy-ent/).

As a test I ran spaCy over a large repository in which companies have been tagged by OpenCalais, and the overlap was really bad (around 85% complete misses). Going through examples manually, I found that the demo website does much better on the same text. I will post a concrete example in a comment to keep this description short.
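For anyone trying to reproduce a figure like the one above, here is a rough sketch of how such a miss rate could be computed. The matching rule (case-insensitive substring overlap) and the helper names `extract_entities` and `miss_rate` are assumptions made for illustration, not the exact procedure used for the 85% figure. The only spaCy features relied on are `doc.ents`, `ent.text` and `ent.label_`; the `nlp` object is passed in, so the block itself runs without spaCy installed.

```python
def extract_entities(nlp, text, labels=("ORG", "PERSON")):
    """Return the lower-cased entity strings that carry one of `labels`.

    `nlp` is any callable returning an object with an `ents` iterable whose
    items expose `.text` and `.label_` (as spaCy's Doc and Span do).
    """
    doc = nlp(text)
    return {ent.text.strip().lower() for ent in doc.ents if ent.label_ in labels}


def miss_rate(gold_names, predicted_names):
    """Fraction of gold names that overlap with no predicted name.

    Assumption: a gold name counts as found when it contains, or is
    contained in, some predicted string (case-insensitive).
    """
    gold = {name.strip().lower() for name in gold_names if name.strip()}
    if not gold:
        return 0.0
    predicted = {name.strip().lower() for name in predicted_names if name.strip()}
    missed = [g for g in gold if not any(g in p or p in g for p in predicted)]
    return len(missed) / len(gold)
```

With a loaded model this could be called as `miss_rate(gold_companies, extract_entities(nlp, text, labels=("ORG",)))`, where `gold_companies` is a hypothetical list of the OpenCalais-tagged names for that text. A stricter exact-match rule would give a higher miss rate, so the rule in use should be stated alongside any number reported.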

So I tried manually adding a few company names to see if I could improve this somewhat, and I found a (possibly) related problem that others have hit as well. In issue #105 (closed), the bottom comment by m93s (21 days ago) describes the same output I'm getting, rather than the output the example code is supposed to produce. The matcher misses even the entity that the example code itself adds, so something definitely seems off.

Running https://github.com/explosion/spaCy/blob/master/examples/matcher_example.py I get this output:

Before
Google Now PERSON ['NNP', 'RB']
After
Google Now PERSON ['NNP', 'RB']
Sydney True
sydney False
Sydney True
sydney True
SYDNEY True
the Brisbane Broncos ORG

And with 'en_depent_web_md' I get:
Before
Google PERSON ['NNP']
After
Google PERSON ['NNP']
Sydney True
sydney False
Sydney True
sydney True
SYDNEY True

Note that neither run produces the expected Google Now PRODUCT [u'NNP', u'RB'] in the "After" section, even though the example itself adds that entity to the matcher. Maybe there has been a change that affects the models?
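To make the failure easy to check in a script, the "Before" and "After" entity lists can be compared directly. The following is a minimal sketch, assuming the lists are built as `[(ent.text, ent.label_) for ent in doc.ents]`; the helper name `check_relabel` and its status strings are hypothetical and not part of spaCy or the example script.

```python
def check_relabel(before, after, text, expected_label):
    """Classify what happened to the entity `text` between two entity lists.

    `before` and `after` are lists of (entity_text, label) tuples.
    """
    after_labels = {label for ent_text, label in after if ent_text == text}
    if expected_label in after_labels:
        return "has expected label"
    if not after_labels:
        return "missing"
    before_labels = {label for ent_text, label in before if ent_text == text}
    return "unchanged" if after_labels == before_labels else "changed to other label"


# Entity lists transcribed from the output reported above (default 'en' model).
default_before = [("Google Now", "PERSON")]
default_after = [("Google Now", "PERSON")]
print(check_relabel(default_before, default_after, "Google Now", "PRODUCT"))  # unchanged

# Entity lists transcribed from the 'en_depent_web_md' output above.
depent_before = [("Google", "PERSON")]
depent_after = [("Google", "PERSON")]
print(check_relabel(depent_before, depent_after, "Google Now", "PRODUCT"))  # missing
```

The two cases are different failures: with the default model the span is found but keeps its PERSON label, while with `en_depent_web_md` the two-token span never shows up as an entity at all.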

Labels: lang / en (English language data and models); models (Issues related to the statistical models)
