-
Notifications
You must be signed in to change notification settings - Fork 632
perf: don't try to match rules that will never match #830
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
| features1, matches1 = capa.engine.match(rules, features, va) | ||
|
|
||
| ruleset = capa.rules.RuleSet(rules) | ||
| features2, matches2 = ruleset.match(scope, features, va) | ||
|
|
||
| for feature, locations in features1.items(): | ||
| assert feature in features2 | ||
| assert locations == features2[feature] | ||
|
|
||
| for rulename, results in matches1.items(): | ||
| assert rulename in matches2 | ||
| assert len(results) == len(matches2[rulename]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we assert that the new matching algorithm returns the same results as the older, well-tested algorithm
|
I identified a number of edge cases in feature indexing having to do with At this point, I'm not aware of any other edge cases or concerns, so I'll re-open this PR for review and discussion for merge (assuming no test failures). |
| class OptionalNotUnderAnd(Lint): | ||
| name = "rule contains an `optional` or `0 or more` statement that's not found under an `and` statement" | ||
| recommendation = "clarify the rule logic and ensure `optional` and `0 or more` is always found under `and`" | ||
| violation = False | ||
|
|
||
| def check_rule(self, ctx: Context, rule: Rule): | ||
| self.violation = False | ||
|
|
||
| def rec(statement): | ||
| if isinstance(statement, capa.engine.Statement): | ||
| if not isinstance(statement, capa.engine.And): | ||
| for child in statement.get_children(): | ||
| if isinstance(child, capa.engine.Some) and child.count == 0: | ||
| self.violation = True | ||
|
|
||
| for child in statement.get_children(): | ||
| rec(child) | ||
|
|
||
| rec(rule.statement) | ||
|
|
||
| return self.violation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this found the style issue here! mandiant/capa-rules@0dc67ab
mr-tz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome work
| candidate_rules = [self.rules[name] for name in candidate_rule_names] | ||
| features2, easy_matches = ceng.match(candidate_rules, features, va) | ||
|
|
||
| # note that we've stored the updated feature set in `features2`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the doc!
|
I ran a quick comparison across 50 files to compare the rule hits and did not notice any regressions! |
Co-authored-by: Moritz <[email protected]>
This PR adds an optimization to the rule matching engine that will avoid evaluating rules that it can prove will never match.
The optimization adds a pre-processing step that groups rules into "easy" and "hard" buckets. Easy rules are those that use only "easy" features, that is, features implemented via hash lookup, such as mnemonic, number, string, API, etc. Hard rules are those that use at least one "hard" feature, such as features that require a scan, like substring/regex/bytes, and rules that depend on other rules via
matchstatements.For the easy bucket, rules are indexed by the features they reference. This lets the evaluation engine quickly find the rules that overlap with a given ruleset. For example, given the features
api: sendandapi: recv, the engine picks only rules that referenceapi: sendorapi: recvand none of the rules that deal with registry manipulation, etc.The rule engine evaluates the hard bucket the same way it did before. I anticipate (but did not test) that scanning features are expensive to check (for each string feature, scan with each string statement, around 577 today), so we don't try to be clever here. Also, rules that depend on other rules are tricky to handle correctly. I'd imagine there would need to be multiple rounds of matching and pruning, and therefore it would be difficult to implement and confusing to read. Therefore, we don't attempt anything special for rules with dependencies: we match them like before, evaluating each rule in topological order.
In summary, the optimization drastically cuts down on the number of rules the engine has to evaluate, leading to much higher performance:
(via: PMA01-01, 30 iterations)
Adding just this optimization reduces the evaluation count by 64% and matching runtime by 50%!
But even better: this optimization is independent and orthogonal to #827/#828 (short circuiting & query optimizer) so the performance improvements are complementary. When both are enabled, evaluation count is reduced by 78% and matching runtime by 66%!
Checklist