perf: don't try to match rules that will never match #830

williballenthin · 2021-11-09T17:18:47Z

This PR adds an optimization to the rule matching engine that will avoid evaluating rules that it can prove will never match.

The optimization adds a pre-processing step that sorts rules into "easy" and "hard" buckets. Easy rules use only "easy" features: those that can be checked with a hash lookup, such as mnemonic, number, string, and API. Hard rules use at least one "hard" feature: those that require a scan, such as substring, regex, and bytes. Rules that depend on other rules via match statements are also hard.
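A minimal sketch of the bucketing step, for illustration only. The `Feature` and `Rule` classes and the kind names below are hypothetical stand-ins, not capa's actual classes:

```python
from dataclasses import dataclass

# Hypothetical feature kinds; capa's real feature classes differ.
EASY_KINDS = {"mnemonic", "number", "string", "api"}  # checked via hash lookup
HARD_KINDS = {"substring", "regex", "bytes", "match"}  # need a scan or another rule


@dataclass(frozen=True)
class Feature:
    kind: str
    value: str


@dataclass
class Rule:
    name: str
    features: list


def is_easy(rule):
    """A rule is easy only if every feature it references is easy."""
    return all(f.kind in EASY_KINDS for f in rule.features)


def partition(rules):
    """Split rules into (easy, hard) lists, preserving input order."""
    easy, hard = [], []
    for rule in rules:
        (easy if is_easy(rule) else hard).append(rule)
    return easy, hard
```

A single hard feature is enough to move a rule into the hard bucket, which keeps the classification conservative.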

For the easy bucket, rules are indexed by the features they reference. This lets the evaluation engine quickly find the rules that overlap with a given feature set. For example, given the features api: send and api: recv, the engine picks only the rules that reference api: send or api: recv, and none of the rules that deal with registry manipulation, etc.
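The index described here is essentially an inverted index from feature to rule names. A rough sketch, assuming rules are represented as plain sets of `(kind, value)` tuples (the rule names and representation are made up for illustration):

```python
from collections import defaultdict


def build_index(rules):
    """Map each feature to the names of the rules that reference it.

    `rules` maps a rule name to the set of features the rule references.
    """
    index = defaultdict(set)
    for name, features in rules.items():
        for feature in features:
            index[feature].add(name)
    return index


def candidates(index, observed):
    """Return names of rules that reference at least one observed feature."""
    names = set()
    for feature in observed:
        names |= index.get(feature, set())
    return names


# Hypothetical rules, named only for this example.
RULES = {
    "send data": {("api", "send")},
    "receive data": {("api", "recv")},
    "set registry value": {("api", "RegSetValueEx")},
}
INDEX = build_index(RULES)
SELECTED = candidates(INDEX, {("api", "send"), ("api", "recv")})
```

With these inputs, `SELECTED` contains the two socket rules and leaves out the registry rule, so only two of the three rules would be handed to the matcher.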

The rule engine evaluates the hard bucket the same way it did before. I expect (but have not measured) that scanning features are expensive to check (for each string feature, scan with each string statement, around 577 today), so we don't try to be clever here. Rules that depend on other rules are also tricky to handle correctly: I imagine they would need multiple rounds of matching and pruning, which would be difficult to implement and confusing to read. So we don't attempt anything special for rules with dependencies: we match them like before, evaluating each rule in topological order.
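To illustrate what "topological order" means for rules with dependencies, here is a small sketch using the standard library's `graphlib` (Python 3.9+). It is not how capa computes the order; the rule names and the dependency mapping are hypothetical:

```python
from graphlib import TopologicalSorter


def evaluation_order(dependencies):
    """Order rules so that every rule comes after the rules it depends on.

    `dependencies` maps a rule name to the set of rule names it references
    via match statements.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical dependency graph, for illustration only.
DEPS = {
    "create reverse shell": {"create pipe", "create process"},
    "create pipe": set(),
    "create process": set(),
}
ORDER = evaluation_order(DEPS)
```

Evaluating in this order means that by the time a dependent rule is checked, the match results of the rules it references are already available as features.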

In summary, the optimization drastically cuts down on the number of rules the engine has to evaluate, leading to much higher performance:

| label | count(evaluations) | avg(time) | min(time) | max(time) |
|---|---|---|---|---|
| `18c30e4` base | 108,121 | 0.28s | 0.26s | 0.31s |
| `1823775` better rule selection | 39,590 | 0.14s | 0.11s | 0.17s |
| `63b47ee` better rule selection + short circuiting | 24,110 | 0.08s | 0.06s | 0.09s |

(via: PMA01-01, 30 iterations)

Adding just this optimization reduces the evaluation count by 64% and matching runtime by 50%!

But even better: this optimization is independent and orthogonal to #827/#828 (short circuiting & query optimizer) so the performance improvements are complementary. When both are enabled, evaluation count is reduced by 78% and matching runtime by 66%!

Checklist

No CHANGELOG update needed
No new tests needed
No documentation update needed

capa/rules.py

williballenthin · 2021-11-10T19:58:25Z

tests/test_match.py

+    features1, matches1 = capa.engine.match(rules, features, va)
+
+    ruleset = capa.rules.RuleSet(rules)
+    features2, matches2 = ruleset.match(scope, features, va)
+
+    for feature, locations in features1.items():
+        assert feature in features2
+        assert locations == features2[feature]
+
+    for rulename, results in matches1.items():
+        assert rulename in matches2
+        assert len(results) == len(matches2[rulename])


here we assert that the new matching algorithm returns the same results as the older, well-tested algorithm

williballenthin · 2021-11-10T21:18:37Z

I identified a number of edge cases in feature indexing having to do with `not:`, `optional:`, `0 or more:`, and `count(foo): 0 or more`. In some of these cases, it isn't possible to compute the set of features that might match, so it's best to consider all of these as "hard" statements and evaluate the rule conservatively.
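A toy example of why these statements defeat feature indexing. The nested-tuple rule representation below is invented for illustration and is much simpler than capa's statement classes; it treats any `not` or `optional` node as hard:

```python
def evaluate(node, features):
    """Evaluate a tiny rule tree against a set of observed features."""
    op = node[0]
    if op == "feature":
        return node[1] in features
    if op == "and":
        return all(evaluate(child, features) for child in node[1:])
    if op == "or":
        return any(evaluate(child, features) for child in node[1:])
    if op == "not":
        return not evaluate(node[1], features)
    if op == "optional":
        return True  # matches whether or not the child matches
    raise ValueError(f"unknown statement: {op}")


def referenced(node):
    """Collect every feature mentioned anywhere in the rule tree."""
    if node[0] == "feature":
        return {node[1]}
    found = set()
    for child in node[1:]:
        found |= referenced(child)
    return found


def is_hard(node):
    """Conservatively flag trees containing `not` or `optional`."""
    if node[0] in ("not", "optional"):
        return True
    if node[0] == "feature":
        return False
    return any(is_hard(child) for child in node[1:])


RULE = ("not", ("feature", "api: send"))
FEATURES = {"api: recv"}
```

Here `evaluate(RULE, FEATURES)` is true because `api: send` is absent, yet `referenced(RULE)` shares nothing with `FEATURES`, so a lookup keyed on observed features would never select the rule. Flagging it as hard sends it down the unoptimized path instead.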

At this point, I'm not aware of any other edge cases or concerns, so I'll re-open this PR for review and discussion for merge (assuming no test failures).

williballenthin · 2021-11-10T21:29:50Z

scripts/lint.py

+class OptionalNotUnderAnd(Lint):
+    name = "rule contains an `optional` or `0 or more` statement that's not found under an `and` statement"
+    recommendation = "clarify the rule logic and ensure `optional` and `0 or more` is always found under `and`"
+    violation = False
+
+    def check_rule(self, ctx: Context, rule: Rule):
+        self.violation = False
+
+        def rec(statement):
+            if isinstance(statement, capa.engine.Statement):
+                if not isinstance(statement, capa.engine.And):
+                    for child in statement.get_children():
+                        if isinstance(child, capa.engine.Some) and child.count == 0:
+                            self.violation = True
+
+                for child in statement.get_children():
+                    rec(child)
+
+        rec(rule.statement)
+
+        return self.violation


this found the style issue here! mandiant/capa-rules@0dc67ab

mr-tz

awesome work

capa/rules.py

mr-tz · 2021-11-11T10:09:40Z

capa/rules.py

+        candidate_rules = [self.rules[name] for name in candidate_rule_names]
+        features2, easy_matches = ceng.match(candidate_rules, features, va)
+
+        # note that we've stored the updated feature set in `features2`.


thanks for the doc!

mr-tz · 2021-11-11T18:42:12Z

I ran a quick comparison across 50 files to compare the rule hits and did not notice any regressions!

William Ballenthin added 4 commits November 9, 2021 09:51

engine: document match routine

8badf22

rules: make Scope an enum

1311da9

rules: ruleset: add optimized match routine

e647ae2

main: use ruleset.match instead of engine.match

1823775

williballenthin added the enhancement label Nov 9, 2021

williballenthin requested review from Ana06, mike-hunhoff and mr-tz November 9, 2021 17:18

williballenthin added and removed the dont merge label Nov 9, 2021

changelog

e05f8c7

mr-tz reviewed Nov 9, 2021

capa/rules.py

mr-tz reviewed Nov 9, 2021

capa/rules.py (outdated)

William Ballenthin added 3 commits November 9, 2021 16:28

Merge branch 'master' into perf/rule-selection

8cb04e4

rules: index easy/hard: better handle not: statements

2bf05ac

rules: match: more documentation

67884dd

williballenthin requested a review from mr-tz November 9, 2021 23:51

williballenthin marked this pull request as draft November 10, 2021 00:14

William Ballenthin added 4 commits November 10, 2021 12:44

rules: ruleset: fix collection of features under not statements

1406dc2

tests: split out match tests and validate alternative algorithms

845df28

pep8

2d68fb2

engine: remove old import

6039a33

williballenthin commented Nov 10, 2021

William Ballenthin added 4 commits November 10, 2021 13:36

rules: code consistency

3a12722

rules: easy/hard rules: detect not/optional at the root

f7ab2fb

linter: add checks for not and optional not under and

72c2ffc

rules: easy/hard: simplify indexing by considering not: hard

1aaaa89

William Ballenthin added 3 commits November 10, 2021 14:13

linter: optional maps to some, not range

e550d48

rules: easy/hard: better detect edge cases in optional, some, and range

68c86cf

pep8

80fb9de

mypy

a6b3666

williballenthin commented Nov 10, 2021

Merge branch 'master' of github.com:fireeye/capa into perf/rule-selection

cdfacc6

williballenthin marked this pull request as ready for review November 10, 2021 21:32

mr-tz approved these changes Nov 11, 2021

williballenthin and others added 2 commits November 12, 2021 11:51

Update capa/rules.py

9b5e8ff

Co-authored-by: Moritz <[email protected]>

rules: better variable name

83253eb

williballenthin merged commit 57fe1e2 into master Nov 12, 2021

williballenthin deleted the perf/rule-selection branch November 12, 2021 18:54
