@williballenthin williballenthin commented Nov 9, 2021

This PR adds an optimization to the rule matching engine that will avoid evaluating rules that it can prove will never match.

The optimization adds a pre-processing step that groups rules into "easy" and "hard" buckets. Easy rules are those that use only "easy" features, that is, features implemented via hash lookup, such as mnemonic, number, string, API, etc. Hard rules are those that use at least one "hard" feature (one that requires a scan, like substring/regex/bytes) or that depend on other rules via match statements.
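A minimal sketch of the bucketing step, not the code in this PR. It assumes a toy model where a rule is just a name mapped to a flat list of `(kind, value)` features; `HARD_KINDS`, `is_hard`, and `bucket_rules` are hypothetical names, and a `match` entry stands in for a dependency on another rule.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, str]  # (kind, value), e.g. ("api", "send")

# Assumption: feature kinds that need a scan, plus "match" for rule dependencies.
HARD_KINDS = {"substring", "regex", "bytes", "match"}


def is_hard(features: List[Feature]) -> bool:
    """A rule is hard if it uses at least one hard feature kind."""
    return any(kind in HARD_KINDS for kind, _ in features)


def bucket_rules(rules: Dict[str, List[Feature]]) -> Tuple[List[str], List[str]]:
    """Split rule names into (easy, hard), preserving input order."""
    easy: List[str] = []
    hard: List[str] = []
    for name, features in rules.items():
        (hard if is_hard(features) else easy).append(name)
    return easy, hard


rules = {
    "send data": [("api", "send")],
    "receive data": [("api", "recv")],
    "reference user agent": [("substring", "Mozilla/")],
    "network communication": [("match", "send data"), ("match", "receive data")],
}
easy, hard = bucket_rules(rules)
```

In this toy data, the two API rules land in the easy bucket, while the substring rule and the rule with `match` dependencies land in the hard bucket.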

For the easy bucket, rules are indexed by the features they reference. This lets the evaluation engine quickly find the rules that overlap with a given feature set. For example, given the features api: send and api: recv, the engine picks only rules that reference api: send or api: recv and skips unrelated rules, such as those that deal with registry manipulation.
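The indexing idea can be sketched as an inverted index from feature to rule names. This is an illustration under assumptions, not the PR's implementation: features are modeled as `(kind, value)` tuples, and `index_rules_by_feature` and `candidate_rules` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Feature = Tuple[str, str]  # (kind, value), e.g. ("api", "send")


def index_rules_by_feature(rules: Dict[str, Set[Feature]]) -> Dict[Feature, Set[str]]:
    """Build an inverted index: feature -> names of rules that reference it."""
    index: Dict[Feature, Set[str]] = defaultdict(set)
    for name, features in rules.items():
        for feature in features:
            index[feature].add(name)
    return index


def candidate_rules(index: Dict[Feature, Set[str]], observed: Set[Feature]) -> Set[str]:
    """Collect only the rules that reference at least one observed feature."""
    candidates: Set[str] = set()
    for feature in observed:
        candidates |= index.get(feature, set())
    return candidates


easy_rules = {
    "send data": {("api", "send")},
    "receive data": {("api", "recv")},
    "set registry value": {("api", "RegSetValueEx")},
}
index = index_rules_by_feature(easy_rules)
selected = candidate_rules(index, {("api", "send"), ("api", "recv")})
```

Here `selected` contains only the two networking rules; the registry rule is never evaluated because none of its features were observed.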

The rule engine evaluates the hard bucket the same way it did before. I anticipate (but did not test) that scanning features are expensive to check (each string feature has to be scanned with each string statement, around 577 today), so we don't try to be clever here. Also, rules that depend on other rules are tricky to handle correctly. I'd imagine there would need to be multiple rounds of matching and pruning, which would be difficult to implement and confusing to read. So we don't attempt anything special for rules with dependencies: we match them like before, evaluating each rule in topological order.
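To illustrate what evaluating in topological order means, here is a small sketch using the standard library's `graphlib` (Python 3.9+). It assumes a toy model in which each rule is an AND over some required features and some required rule names; `match_in_dependency_order` is a hypothetical name, and the real engine's rule logic is richer than this.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set, Tuple

Feature = Tuple[str, str]
# Assumption: rule name -> (required features, names of rules it depends on)
ToyRules = Dict[str, Tuple[Set[Feature], Set[str]]]


def match_in_dependency_order(rules: ToyRules, features: Set[Feature]) -> Set[str]:
    """Evaluate each rule after all of the rules it depends on."""
    # TopologicalSorter expects: node -> set of predecessors (here, dependencies).
    graph = {name: set(deps) for name, (_, deps) in rules.items()}
    matched: Set[str] = set()
    for name in TopologicalSorter(graph).static_order():
        required, deps = rules[name]
        # By the time we get here, every dependency has already been decided.
        if required <= features and deps <= matched:
            matched.add(name)
    return matched


toy_rules: ToyRules = {
    "network communication": (set(), {"send data", "receive data"}),
    "send data": ({("api", "send")}, set()),
    "receive data": ({("api", "recv")}, set()),
}
both = match_in_dependency_order(toy_rules, {("api", "send"), ("api", "recv")})
only_send = match_in_dependency_order(toy_rules, {("api", "send")})
```

With both API features present all three rules match; with only `api: send`, the dependent rule does not, because one of its dependencies failed earlier in the ordering.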

In summary, the optimization drastically cuts down on the number of rules the engine has to evaluate, leading to much higher performance:

| label | count(evaluations) | avg(time) | min(time) | max(time) |
|---|---|---|---|---|
| 18c30e4 base | 108,121 | 0.28s | 0.26s | 0.31s |
| 1823775 better rule selection | 39,590 | 0.14s | 0.11s | 0.17s |
| 63b47ee better rule selection + short circuiting | 24,110 | 0.08s | 0.06s | 0.09s |

(via: PMA01-01, 30 iterations)

Adding just this optimization reduces the evaluation count by about 63% (108,121 to 39,590) and matching runtime by 50%!

But even better: this optimization is independent of and orthogonal to #827/#828 (short circuiting & query optimizer), so the performance improvements are complementary. When both are enabled, evaluation count is reduced by 78% and average matching runtime by roughly 71% (0.28s to 0.08s)!

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed

@williballenthin williballenthin added the enhancement label Nov 9, 2021
@williballenthin williballenthin added and removed the dont merge label Nov 9, 2021
@williballenthin williballenthin requested a review from mr-tz November 9, 2021 23:51
@williballenthin williballenthin marked this pull request as draft November 10, 2021 00:14
Comment on lines +26 to +37
features1, matches1 = capa.engine.match(rules, features, va)

ruleset = capa.rules.RuleSet(rules)
features2, matches2 = ruleset.match(scope, features, va)

for feature, locations in features1.items():
    assert feature in features2
    assert locations == features2[feature]

for rulename, results in matches1.items():
    assert rulename in matches2
    assert len(results) == len(matches2[rulename])

here we assert that the new matching algorithm returns the same results as the older, well-tested algorithm

@williballenthin

I identified a number of edge cases in feature indexing having to do with not:, optional:, 0 or more: and count(foo): 0 or more. In some of these cases, it isn't possible to compute the set of features that might match, so it's best to consider all of these as "hard" statements and evaluate the rule conservatively.
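A toy sketch of why these statements defeat feature indexing: a rule like not: api: send matches exactly when its referenced feature is absent, so a lookup keyed on observed features would never select it. The tuple-based rule encoding and the function names below are hypothetical, for illustration only, and do not mirror capa's classes.

```python
def evaluate(node, features):
    """Evaluate a toy rule tree against a set of observed features."""
    op, *args = node
    if op == "feature":
        return args[0] in features
    if op == "and":
        return all(evaluate(child, features) for child in args)
    if op == "or":
        return any(evaluate(child, features) for child in args)
    if op == "not":
        return not evaluate(args[0], features)
    if op == "optional":
        return True  # matches whether or not its child does
    raise ValueError(f"unknown operator: {op}")


def referenced_features(node):
    """All features mentioned anywhere in the tree."""
    op, *args = node
    if op == "feature":
        return {args[0]}
    return set().union(*(referenced_features(child) for child in args))


def must_evaluate_conservatively(node):
    """True if the tree contains a statement that can match on absent features."""
    op, *args = node
    if op == "feature":
        return False
    if op in ("not", "optional"):
        return True
    return any(must_evaluate_conservatively(child) for child in args)


rule = ("not", ("feature", ("api", "send")))
observed = {("api", "recv")}

overlap = referenced_features(rule) & observed  # empty: an index lookup finds nothing
matches = evaluate(rule, observed)              # yet the rule does match
```

Because `overlap` is empty while `matches` is true, an index-only selection would wrongly skip this rule, which is why treating such rules as hard and always evaluating them is the safe choice.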

At this point, I'm not aware of any other edge cases or concerns, so I'll re-open this PR for review and discussion for merge (assuming no test failures).

Comment on lines +365 to +385
class OptionalNotUnderAnd(Lint):
    name = "rule contains an `optional` or `0 or more` statement that's not found under an `and` statement"
    recommendation = "clarify the rule logic and ensure `optional` and `0 or more` is always found under `and`"
    violation = False

    def check_rule(self, ctx: Context, rule: Rule):
        self.violation = False

        def rec(statement):
            if isinstance(statement, capa.engine.Statement):
                if not isinstance(statement, capa.engine.And):
                    for child in statement.get_children():
                        if isinstance(child, capa.engine.Some) and child.count == 0:
                            self.violation = True

                for child in statement.get_children():
                    rec(child)

        rec(rule.statement)

        return self.violation

this found the style issue here! mandiant/capa-rules@0dc67ab

@williballenthin williballenthin marked this pull request as ready for review November 10, 2021 21:32
@mr-tz mr-tz left a comment

awesome work

candidate_rules = [self.rules[name] for name in candidate_rule_names]
features2, easy_matches = ceng.match(candidate_rules, features, va)

# note that we've stored the updated feature set in `features2`.

thanks for the doc!

mr-tz commented Nov 11, 2021

I ran a quick comparison across 50 files to compare the rule hits and did not notice any regressions!

@williballenthin williballenthin merged commit 57fe1e2 into master Nov 12, 2021
@williballenthin williballenthin deleted the perf/rule-selection branch November 12, 2021 18:54