Description
Bug (?)
python -m spacy info --markdown
Info about spaCy
- spaCy version: 1.7.3
- Platform: Darwin-16.5.0-x86_64-i386-64bit
- Python version: 3.6.0
- Installed models: en, en_core_web_md, en_depent_web_md, en_vectors_glove_md
(Reproduced on Linux as well.)
I'm trying spaCy for ORG and PERSON detection after seeing satisfactory results on the demo site (https://demos.explosion.ai/displacy-ent/).
As a test, I ran spaCy over a large repository with companies tagged by OpenCalais and got very poor overlap (around 85% complete misses). Going through examples manually, I found that the demo website does much better. I will include a concrete example in a comment so that this one doesn't get too lengthy.
So I tried to manually add a few company names to see if I could improve this somewhat, and I found a (possibly) related problem that affects others as well. In issue #105 (closed), the bottom comment by m93s from 21 days ago describes the same outputs I'm getting, rather than the ones the example code is supposed to produce. The matcher misses even the match that the code itself has just added, so something definitely seems off.
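A miss rate like the one quoted above could be measured along these lines. This is only a sketch, not the script used for the test: it assumes exact, case-insensitive matching of entity names, and `miss_rate` and the sample names are hypothetical.

```python
def miss_rate(reference, predicted):
    """Return the fraction of reference entity names that have no
    exact (case-insensitive) counterpart among the predicted names.

    Assumption: two entities "overlap" only if their stripped,
    lowercased names are identical. Partial matches count as misses.
    """
    ref = {name.strip().lower() for name in reference if name.strip()}
    pred = {name.strip().lower() for name in predicted if name.strip()}
    if not ref:
        # Nothing to find, so nothing can be missed.
        return 0.0
    missed = ref - pred
    return len(missed) / len(ref)


# Hypothetical sample: one of the two reference names is found.
reference_names = ["Google", "Brisbane Broncos"]  # e.g. names tagged by the reference tool
predicted_names = ["google"]                      # e.g. ORG entity texts from the model
print(miss_rate(reference_names, predicted_names))  # 0.5
```

Exact matching is strict: a prediction such as "the Brisbane Broncos" would count as a miss for "Brisbane Broncos", so a looser comparison may give a lower miss rate.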
Running https://github.com/explosion/spaCy/blob/master/examples/matcher_example.py I get this output:
Before
Google Now PERSON ['NNP', 'RB']
After
Google Now PERSON ['NNP', 'RB']
Sydney True
sydney False
Sydney True
sydney True
SYDNEY True
the Brisbane Broncos ORG
And with 'en_depent_web_md' I get:
Before
Google PERSON ['NNP']
After
Google PERSON ['NNP']
Sydney True
sydney False
Sydney True
sydney True
SYDNEY True
Note that this misses even the Google Now PRODUCT [u'NNP', u'RB'] entity that the example itself has just added. Maybe there has been a change that affects the models?