-
-
Notifications
You must be signed in to change notification settings - Fork 4.6k
Implementation of dependency pattern-matching algorithm #1120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for this! Trying to leave comments with GitHub's review functionality. Hopefully I don't mess things up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this is really handy, as discussed on Gitter. Thanks for submitting it!
There are a couple of fairly cosmetic things to fix to keep it in-style with the rest of the library. It might be easier for me or @ines to do this -- otherwise we might bounce back and forth asking for changes that are easier to make than they are to describe.
The main things:
-
Add short attribution comment to code drawn from jinja2. jinja2 is BSD licensed, so there's no problem shipping their code -- we just need to attribute it in a comment.
-
The
PatternParserclass is stateless, so to keep with the rest of the library style it should be flattened into top-level functions, with the regular expressions moved into global variables. -
spacy.strings.hash_stringshould be used instead ofmd5in parser.py -
Tests should be converted to use py.test, to match the library style.
-
Docs, which will probably involve a final iteration on the namings etc to match the rest of the library.
|
Hey! Sorry for my late reply. I can take care of 1. and 3. by myself. I think it's better than you deal with 2., as I'm not sure what changes are required to comply with the library style. Edit: I've solved issues 1. and 3. |
|
Is there a version of this patch for the spaCy 2.0 alpha? |
|
@honnibal Can you give me some indications about how you would like the documentation to be structured (new page distinct from 'Rule-based matching' or the same,...) ? And if you want the rename a few classes/functions, I can update them if you give me the new names. |
|
@raphael0202 Apolgies for taking so long on this. I retract the suggestion I made about class method vs global functions. It's actually better as you've implemented it, in a class --- this will make it easier to add a new parser if people want. I'm going to go ahead and merge this, and port to v2. This leaves the documentation outstanding, but I'll add the example you give to the |
Currently, there is no easy way in spaCy to match a complex pattern with respect to a given dependency tree. Stanford CoreNLP offers this kind of feature with its Semgrex patterns (https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html).
This PR contains an implementation of a simple pattern matching algorithm, along with some high level wrapping classes.
A quick demonstration snippet:
We start by creating a
DependencyTreefor theDoc. This class models the document dependency tree.Then we compile the query into a
Patternusing thePatternParser. The syntax is quite simple:orth_is 'fox'.dep_matching the regexam.*orth_matches the regexbrown|yellowhas fox as parent, with whatheverdep_DependencyTree.matchreturns a list ofPatternMatch. Notice that we can assign names to anonymous or defined nodes ([word:fox]=f). We can get theTokenmapped to the fox node usingmatch['f'].Unit tests are missing(edit: unit tests added), and the docstrings should be improved a bit. I'm open to any suggestion or remarks :)Disclaimer: To write the query parser, I took inspiration from jinja2, and used 3 classes from the project as baseline (Token, TokenStreamIterator and TokenStream). I don't know if there could be licensing issues because of this.
Types of changes
Checklist: