Implementation of dependency pattern-matching algorithm #1120

raphael0202 · 2017-06-10T23:17:39Z

Currently, there is no easy way in spaCy to match a complex pattern with respect to a given dependency tree. Stanford CoreNLP offers this kind of feature with its Semgrex patterns (https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html).

This PR contains an implementation of a simple pattern matching algorithm, along with some high level wrapping classes.

A quick demonstration snippet:

import spacy
from spacy.pattern import PatternParser, DependencyTree

nlp = spacy.load('en')
doc = nlp("The quick brown fox jumped over the lazy dog.")
tree = DependencyTree(doc)

query = """fox [word:fox]=f
           [lemma:quick]=q >/am.*/ fox
           [word:/brown|yellow/] > fox"""

pattern = PatternParser.parse(query)
matches = tree.match(pattern)

assert len(matches) == 1
match = matches[0]

assert match['f'] == doc[3]

We start by creating a DependencyTree for the Doc. This class models the document dependency tree.
Then we compile the query into a Pattern using the PatternParser. The syntax is quite simple:

We define a node named 'fox' that must match a token in the dependency tree whose orth_ is 'fox'.
An anonymous token whose lemma is 'quick' must have fox as its parent, with a dep_ matching the regex am.*
Another anonymous token whose orth_ matches the regex brown|yellow must have fox as its parent, with any dep_.

DependencyTree.match returns a list of PatternMatch. Notice that we can assign names to anonymous or defined nodes ([word:fox]=f). We can get the Token mapped to the fox node using match['f'].
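To make the matching semantics above concrete, here is a minimal pure-Python sketch. It is not the code from this PR: the token tuples, the hand-written parse, the `NODES`/`EDGES` structures and the `match` function are all assumptions made for illustration, and it brute-forces candidate assignments rather than using any particular algorithm. Whether the real implementation anchors regexes the way `re.fullmatch` does is also an assumption.

```python
import re
from itertools import product

# Toy tokens: (index, word, lemma, head index, dependency label).
# This parse is hand-written for illustration; it is not spaCy output.
TOKENS = [
    (0, "The", "the", 3, "det"),
    (1, "quick", "quick", 3, "amod"),
    (2, "brown", "brown", 3, "amod"),
    (3, "fox", "fox", 4, "nsubj"),
    (4, "jumped", "jump", 4, "ROOT"),
]

# Pattern nodes: name -> (attribute, regex on that attribute).
NODES = {
    "f": ("word", "fox"),
    "q": ("lemma", "quick"),
    "b": ("word", "brown|yellow"),
}
# Pattern edges: (child node, parent node, regex on the child's dep label).
EDGES = [("q", "f", "am.*"), ("b", "f", ".*")]


def node_matches(token, attr, regex):
    """Check one token against one pattern node."""
    value = token[1] if attr == "word" else token[2]
    return re.fullmatch(regex, value) is not None


def match(tokens, nodes, edges):
    """Return every assignment of pattern nodes to token indices."""
    names = list(nodes)
    # Candidate tokens for each pattern node, filtered by attribute.
    candidates = [[t for t in tokens if node_matches(t, *nodes[n])] for n in names]
    results = []
    for combo in product(*candidates):
        if len({t[0] for t in combo}) != len(combo):
            continue  # each pattern node must map to a distinct token
        assigned = dict(zip(names, combo))
        # Every edge requires the child's head to be the parent token
        # and the child's dep label to match the edge regex.
        if all(
            assigned[child][3] == assigned[parent][0]
            and re.fullmatch(dep, assigned[child][4]) is not None
            for child, parent, dep in edges
        ):
            results.append({n: t[0] for n, t in assigned.items()})
    return results


matches = match(TOKENS, NODES, EDGES)
```

With this toy parse there is a single assignment, mapping `f` to token 3 ("fox"), which mirrors the `match['f'] == doc[3]` assertion in the snippet.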

~~Unit tests are missing~~ (edit: unit tests added), and the docstrings should be improved a bit. I'm open to any suggestions or remarks :)

Disclaimer: To write the query parser, I took inspiration from jinja2, and used 3 classes from the project as baseline (Token, TokenStreamIterator and TokenStream). I don't know if there could be licensing issues because of this.

Types of changes

Bug fix (non-breaking change fixing an issue)
New feature (non-breaking change adding functionality to spaCy)
Breaking change (fix or feature causing change to spaCy's existing functionality)
Documentation (addition to documentation of spaCy)

Checklist:

My change requires a change to spaCy's documentation.
I have updated the documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.

honnibal · 2017-06-20T08:56:32Z

Thanks for this! Trying to leave comments with GitHub's review functionality. Hopefully I don't mess things up.

honnibal

Overall this is really handy, as discussed on Gitter. Thanks for submitting it!

There are a couple of fairly cosmetic things to fix to keep it in-style with the rest of the library. It might be easier for me or @ines to do this -- otherwise we might bounce back and forth asking for changes that are easier to make than they are to describe.

The main things:

1. Add a short attribution comment to code drawn from jinja2. jinja2 is BSD licensed, so there's no problem shipping their code -- we just need to attribute it in a comment.
2. The PatternParser class is stateless, so to keep with the rest of the library style it should be flattened into top-level functions, with the regular expressions moved into global variables.
3. spacy.strings.hash_string should be used instead of md5 in parser.py
4. Tests should be converted to use py.test, to match the library style.
5. Docs, which will probably involve a final iteration on the namings etc to match the rest of the library.
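For readers unfamiliar with the style described in the second point, a rough sketch of a "flattened" parser might look like this. The names `NODE_RE` and `parse_node` are hypothetical, and the grammar is a simplified guess at a single node spec such as `[word:fox]=f`, not the grammar implemented in the PR.

```python
import re

# Module-level (global) compiled regex, instead of a class attribute.
# Assumed node syntax: "[attr:value]" with an optional "=name" suffix.
NODE_RE = re.compile(r"\[(?P<attr>\w+):(?P<value>[^\]]+)\](?:=(?P<name>\w+))?")


def parse_node(text):
    """Parse one node spec into a dict with attr, value and name keys."""
    m = NODE_RE.fullmatch(text.strip())
    if m is None:
        raise ValueError("invalid node spec: %r" % text)
    return m.groupdict()


named = parse_node("[word:fox]=f")
anonymous = parse_node("[lemma:quick]")
```

Here `named` carries the name `f`, while `anonymous` has `None` under the `name` key.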

raphael0202 · 2017-07-01T11:05:00Z

Hey! Sorry for my late reply. I can take care of 1. and 3. by myself. I think it's better that you deal with 2., as I'm not sure what changes are required to comply with the library style.
About 4., the tests already use pytest; they are only organized in classes for convenience. As these classes are stateless, they can be split into functions too if you need it.
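As an illustration of the two test layouts being discussed, the sketch below shows the same checks written as a pytest-style class and as flat functions. The helper `is_regex_value` is hypothetical and stands in for whatever the real tests exercise; pytest collects both forms by name, so no import is needed in the test module itself.

```python
def is_regex_value(value):
    """Hypothetical helper: a value wrapped in slashes is treated as a regex."""
    return len(value) >= 2 and value.startswith("/") and value.endswith("/")


# Class-based grouping: pytest collects classes named Test* (without an
# __init__ method) and runs their test_* methods.
class TestRegexValue:
    def test_regex(self):
        assert is_regex_value("/brown|yellow/")

    def test_literal(self):
        assert not is_regex_value("fox")


# Equivalent flat layout: top-level test_* functions.
def test_regex_value():
    assert is_regex_value("/am.*/")


def test_literal_value():
    assert not is_regex_value("quick")
```

Since the class holds no state, moving between the two layouts is a mechanical change.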
About 5., I can write the content, but I guess it will be easier if you take care of the formatting and the integration in spaCy documentation.

Edit: I've solved issues 1. and 3.

wadetb · 2017-07-31T02:54:11Z

Is there a version of this patch for the spaCy 2.0 alpha?

raphael0202 · 2017-08-13T13:39:37Z

@honnibal Can you give me some indication of how you would like the documentation to be structured (a new page distinct from 'Rule-based matching', or the same one)? And if you want to rename a few classes/functions, I can update them if you give me the new names.

honnibal · 2017-10-24T09:43:46Z

@raphael0202 Apologies for taking so long on this.

I retract the suggestion I made about class method vs global functions. It's actually better as you've implemented it, in a class --- this will make it easier to add a new parser if people want.
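One way to read the extensibility argument: with a class, an alternative syntax can be added by subclassing and overriding a single attribute. The sketch below uses hypothetical names (`NodeParser`, `AngleNodeParser`) and a made-up alternative syntax; it is not the PatternParser from this PR.

```python
import re


class NodeParser:
    """Hypothetical stand-in for a class-based, stateless parser."""

    # Assumed node syntax: "[attr:value]".
    node_re = re.compile(r"\[(\w+):([^\]]+)\]")

    @classmethod
    def parse(cls, text):
        m = cls.node_re.fullmatch(text.strip())
        if m is None:
            raise ValueError("invalid node spec: %r" % text)
        return {"attr": m.group(1), "value": m.group(2)}


class AngleNodeParser(NodeParser):
    """A made-up variant syntax, "<attr=value>", added by overriding the regex."""

    node_re = re.compile(r"<(\w+)=([^>]+)>")


default_node = NodeParser.parse("[word:fox]")
angle_node = AngleNodeParser.parse("<lemma=quick>")
```

Both calls go through the same `parse` classmethod; only the regex differs.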

I'm going to go ahead and merge this, and port to v2. This leaves the documentation outstanding, but I'll add the example you give to the examples/ directory, and add a docs ticket for the docs.

raphael0202 added 8 commits June 11, 2017 01:06

Implementation of Pattern

e55199d

Check in PatternParser that the generated Pattern is valid

8ff4f51

Move add_node and add_edge methods to the Tree base class

d9c5673

Do not add the root token to the adjacency map

4ca8a39

Fix node matching bug caused by lower function

d010f5a

Add 'ent' to node matching key

4289a21

Improve logging

1849a11

Add basic unit tests for Pattern

4663736

honnibal requested changes Jun 20, 2017


raphael0202 added 3 commits July 1, 2017 13:09

Add a disclaimer about classes copied from the Jinja2 project

c3d722d

Use spacy hash_string function instead of md5

f474883

Fix fuzzy unit tests

8592f3d

ines added the enhancement, v2 port ❎ and ⚠️ wip labels Sep 26, 2017

honnibal merged commit 8775efb into explosion:master Oct 24, 2017

ines mentioned this pull request Nov 14, 2017

Add TokensRegex functionality to Spacy #1567

Closed

ines added the v1 label and removed the v2 port ❎ label Mar 27, 2018

skrcode mentioned this pull request Sep 4, 2018

WIP - Dependency Tree Pattern Matcher #2732

Merged
