raphael0202 commented Jun 10, 2017

Currently, there is no easy way in spaCy to match a complex pattern with respect to a given dependency tree. Stanford CoreNLP offers this kind of feature with its Semgrex patterns (https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html).

This PR contains an implementation of a simple pattern matching algorithm, along with some high level wrapping classes.

A quick demonstration snippet:

import spacy
from spacy.pattern import PatternParser, DependencyTree

nlp = spacy.load('en')
doc = nlp("The quick brown fox jumped over the lazy dog.")
tree = DependencyTree(doc)

query = """fox [word:fox]=f
           [lemma:quick]=q >/am.*/ fox
           [word:/brown|yellow/] > fox"""

pattern = PatternParser.parse(query)
matches = tree.match(pattern)

assert len(matches) == 1
match = matches[0]

assert match['f'] == doc[3]

We start by creating a DependencyTree for the Doc. This class models the document dependency tree.
Then we compile the query into a Pattern using the PatternParser. The syntax is quite simple:

  • we define a node named 'fox', that must match in the dep tree a token whose orth_ is 'fox'.
  • an anonymous token whose lemma is 'quick' must have fox as parent, with a dep_ matching the regex am.*
  • another anonymous token whose orth_ matches the regex brown|yellow has fox as parent, with any dep_

DependencyTree.match returns a list of PatternMatch. Notice that we can assign names to anonymous or defined nodes ([word:fox]=f). We can get the Token mapped to the fox node using match['f'].
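To make the matching semantics above concrete, here is a minimal, self-contained sketch of the same query evaluated against a hand-written parse. It does not use spacy.pattern and is not the algorithm from this PR: Tok, SENT, NODES, EDGES and match are hypothetical names, the dependency labels are typed in by hand rather than produced by a model, and regexes are assumed to be anchored with re.fullmatch, which may differ from the real implementation.

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Tok:
    i: int      # position in the sentence
    word: str   # surface form (orth_)
    lemma: str
    dep: str    # dependency label to the head
    head: int   # index of the parent token (the root points to itself)


# Hand-written parse of "The quick brown fox jumped over the lazy dog."
SENT = [
    Tok(0, "The", "the", "det", 3),
    Tok(1, "quick", "quick", "amod", 3),
    Tok(2, "brown", "brown", "amod", 3),
    Tok(3, "fox", "fox", "nsubj", 4),
    Tok(4, "jumped", "jump", "ROOT", 4),
    Tok(5, "over", "over", "prep", 4),
    Tok(6, "the", "the", "det", 8),
    Tok(7, "lazy", "lazy", "amod", 8),
    Tok(8, "dog", "dog", "pobj", 5),
    Tok(9, ".", ".", "punct", 4),
]

# Pattern nodes: name -> {attribute: regex}
NODES = {
    "f": {"word": r"fox"},
    "q": {"lemma": r"quick"},
    "b": {"word": r"brown|yellow"},
}
# Pattern edges: (child node, parent node, regex on the child's dep label)
EDGES = [("q", "f", r"am.*"), ("b", "f", r".*")]


def node_matches(tok, constraints):
    """True if every attribute regex fully matches the token's attribute."""
    return all(re.fullmatch(rx, getattr(tok, attr))
               for attr, rx in constraints.items())


def match(tokens, nodes, edges):
    """Brute-force search: try every assignment of tokens to pattern nodes."""
    names = list(nodes)
    candidates = [[t for t in tokens if node_matches(t, nodes[n])]
                  for n in names]
    results = []
    for combo in product(*candidates):
        if len({t.i for t in combo}) != len(combo):
            continue  # each pattern node must bind a distinct token
        binding = dict(zip(names, combo))
        if all(binding[c].head == binding[p].i
               and re.fullmatch(rx, binding[c].dep)
               for c, p, rx in edges):
            results.append(binding)
    return results


matches = match(SENT, NODES, EDGES)
```

As in the demonstration snippet, there is exactly one match here, and matches[0]["f"] is the token at index 3. The brute-force product is only meant to show the semantics; it does not scale to long documents.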

Unit tests are missing (edit: unit tests added), and the docstrings should be improved a bit. I'm open to any suggestions or remarks :)

Disclaimer: To write the query parser, I took inspiration from jinja2, and used 3 classes from the project as baseline (Token, TokenStreamIterator and TokenStream). I don't know if there could be licensing issues because of this.

Types of changes

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality to spaCy)
  • Breaking change (fix or feature causing change to spaCy's existing functionality)
  • Documentation (addition to documentation of spaCy)

Checklist:

  • My change requires a change to spaCy's documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@honnibal

Thanks for this! Trying to leave comments with GitHub's review functionality. Hopefully I don't mess things up.

@honnibal honnibal left a comment

Overall this is really handy, as discussed on Gitter. Thanks for submitting it!

There are a couple of fairly cosmetic things to fix to keep it in-style with the rest of the library. It might be easier for me or @ines to do this -- otherwise we might bounce back and forth asking for changes that are easier to make than they are to describe.

The main things:

  1. Add short attribution comment to code drawn from jinja2. jinja2 is BSD licensed, so there's no problem shipping their code -- we just need to attribute it in a comment.

  2. The PatternParser class is stateless, so to keep with the rest of the library style it should be flattened into top-level functions, with the regular expressions moved into global variables.

  3. spacy.strings.hash_string should be used instead of md5 in parser.py

  4. Tests should be converted to use py.test, to match the library style.

  5. Docs, which will probably involve a final iteration on the namings etc to match the rest of the library.

raphael0202 commented Jul 1, 2017

Hey! Sorry for my late reply. I can take care of 1. and 3. by myself. I think it's better that you deal with 2., as I'm not sure what changes are required to comply with the library style.
About 4., the tests already use pytest; they are only organized in classes for convenience. As these classes are stateless, they can be split into functions too if you need it.
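For illustration, the two test layouts discussed in point 4 look roughly like this. This is a generic sketch, not code from the PR: normalize is a hypothetical stand-in for the code under test, and the test names are made up. With pytest's default collection rules, both the method on the Test-prefixed class and the test_-prefixed function are picked up.

```python
def normalize(query):
    """Hypothetical helper standing in for the code under test."""
    return " ".join(query.split())


# Class-based layout: the class only groups related tests and holds no state.
class TestNormalize:
    def test_collapses_whitespace(self):
        assert normalize("a   b\n c") == "a b c"


# Function-based layout: the same check as a top-level function.
def test_collapses_whitespace():
    assert normalize("a   b\n c") == "a b c"
```

Since the class is stateless, converting between the two layouts is a mechanical change.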
About 5., I can write the content, but I guess it will be easier if you take care of the formatting and the integration in spaCy documentation.

Edit: I've solved issues 1. and 3.

wadetb commented Jul 31, 2017

Is there a version of this patch for the spaCy 2.0 alpha?

@raphael0202

@honnibal Can you give me some guidance on how you would like the documentation to be structured (a new page distinct from 'Rule-based matching', or the same one)? And if you want to rename a few classes/functions, I can update them if you give me the new names.

@ines ines added enhancement Feature requests and improvements v2 port ❎ ⚠️ wip Work in progress labels Sep 26, 2017

honnibal commented Oct 24, 2017

@raphael0202 Apologies for taking so long on this.

I retract the suggestion I made about class method vs global functions. It's actually better as you've implemented it, in a class --- this will make it easier to add a new parser if people want.

I'm going to go ahead and merge this, and port to v2. This leaves the documentation outstanding, but I'll add the example you give to the examples/ directory, and add a docs ticket for the docs.

@honnibal honnibal merged commit 8775efb into explosion:master Oct 24, 2017
@ines ines added v1 spaCy v1.x and removed v2 port ❎ labels Mar 27, 2018