New Entity Recognition #937

@ghost

Description

Hi, I tried to add and use new entities.
Here is my code.

```python
import spacy

nlp = spacy.load('en')

def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches)-1:
        return None
    spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])

matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add(entity_key='company-transocean', label='company', attrs={}, specs=[[{spacy.attrs.ORTH: 'Transocean Ltd'}]], on_match=merge_phrases)
matcher.add(entity_key='company-transocean-ltd', label='company', attrs={}, specs=[[{spacy.attrs.ORTH: 'Transocean'}]], on_match=merge_phrases)
doc = nlp(u"""Tell me about Macys Inc in Japan and about Transocean Ltd.""")
matcher(doc)
print(['%s|%s' % (t.orth_, t.ent_type_) for t in doc])
```

Output:

['Tell|', 'me|', 'about|', 'Macys|ORG', 'Inc|ORG', 'in|', 'Japan|GPE', 'and|', 'about|', 'Transocean|company', 'Ltd.|ORG']

It starts to work, but not as I expect.

I have two questions:

  1. I want to register two names for the same company, "Transocean Ltd" and "Transocean". The system only recognizes "Transocean" and treats "Ltd." as a separate entity. I want the whole phrase tagged as one entity: Transocean Ltd|company.
  2. How can I save the added entities so that spaCy can use them the next time the script starts? I don't want to add them again every time the script starts.
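To make the two questions concrete, here is a minimal plain-Python sketch of the behaviour being asked for. It does not use spaCy. The names `PATTERNS`, `tokenize`, `find_entities`, `save_patterns`, and `load_patterns` are hypothetical and exist only for illustration. The sketch assumes that a multi-word name has to be described as one entry per token, and that the longest phrase should win when two phrases start at the same token. Note that the crude tokenizer below splits "Ltd." into "Ltd" and ".", whereas the output above shows spaCy keeping "Ltd." as a single token, so a real pattern would need to match that token text instead.

```python
import json
import re

# Hypothetical phrase table: label -> list of phrases, one list entry per token.
PATTERNS = {
    "company": [["Transocean", "Ltd"], ["Transocean"]],
}

def tokenize(text):
    # Crude stand-in for a real tokenizer: runs of word characters,
    # plus each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def find_entities(tokens, patterns):
    """Return (start, end, label) spans, preferring the longest phrase at each position."""
    spans = []
    i = 0
    while i < len(tokens):
        best = None
        for label, phrases in patterns.items():
            for phrase in phrases:
                n = len(phrase)
                if tokens[i:i + n] == phrase and (best is None or n > best[1] - best[0]):
                    best = (i, i + n, label)
        if best:
            spans.append(best)
            i = best[1]  # skip past the match so the shorter phrase cannot fire inside it
        else:
            i += 1
    return spans

def save_patterns(patterns, path):
    # Persist the phrase table so it does not have to be retyped in the script.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(patterns, f)

def load_patterns(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

tokens = tokenize("Tell me about Macys Inc in Japan and about Transocean Ltd.")
spans = find_entities(tokens, PATTERNS)
print([(" ".join(tokens[s:e]), label) for s, e, label in spans])
# prints [('Transocean Ltd', 'company')]
```

With this shape, the second question reduces to calling `save_patterns` once and `load_patterns` at startup. The loaded phrases would still have to be registered with the matcher on each run, which is an assumption about how the script would be organised rather than a statement about spaCy's own persistence features.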
