Skip to content

New Static Analysis Scope Suggestion: Superblock #2657

@yelhamer

Description

@yelhamer

Summary

This issue is for adding a new scope that groups consecutive basic blocks together and allow rule authors to write logic that matches on such clusters of basic blocks. This scope would closely resemble the "span" scope in the dynamic analysis flavour, and would —in theory— reduce false positives —for function-scope rules— and false negatives —for bb-scope rules.

My suggestion for the new scope's name to be "superblock", since it is used in compiler and static binary analysis research to denote sequences of basic blocks that form an execution path.

This scope was initially suggested during a chat with @williballenthin.

Motivation

The idea is that sometimes the features necessary for a capability to match could be spread across a sequence (2 or more) of basic blocks, and would therefore be invisible to a rule that describes that capability using the "basic block" scope resulting in a false negative. One way to get around this is to set the rule's scope to "function", but this could generate false positives if the sample contains a large function with many basic blocks in it, since the capa matching engine could end up correlating features that are unrelated and that are from disjoint execution paths in that large function.

This issue can be observed more clearly in the case of "error checking" in malware for instance. Suppose a rule requires 2 API call features for it to match. If the 2 calls exist in the same block and our rule has the "basic block" scope, then there would be a match. However, if the malware author happens to have some error-checking code after the first API call in the form of an if-else statement, then the capability would instead be spread across two basic blocks rather than one, and the aforementioned rule with the "basic block" scope set would therefore not match. One possible solution would be to set the rule's scope to "function", but this would increase the possibility of false positives as previously mentioned.

Addition of this scope would also provide a parallel to the dynamic "span" scope, although the naming and placement of the scope (between function and bb) might be a bit confusing. Calling it "superblock" and not making a direct connection to the "span" scope might help prevent such confusion.

Implementation Suggestions

I tried to come up with an implementation plan that conforms with capa's current codebase (specifically the way the "span" scope is implemented), and I came up with the following:

  1. Add a "SuperblockMatcher" class to the capabilities.dynamic module that would be analogous to the "SpanMatcher" class on the capabilities.static module both in structure and integration to the capability extraction logic. Each time a basic block is analysed, the superblock matcher object would add that basic block to a binary tree that it hosts internally, while refraining from doing any rule matching until all of the basic blocks have been analysed and added to that tree.

  2. Once all of the basic blocks have been analysed and added to the previously mentioned binary tree, the superblock matcher would attempt to match superblock rules on the set of all features and rules extracted from the previously-analysed basic blocks without taking into consideration that these basic blocks could be part of disjoint superblocks.

  3. After obtaining a list of all superblock rules matched on the collected basic blocks (without discerning the superblock that each of these basic blocks are part of), the superblock matcher object would traverse the previously-constructed binary tree from top to bottom in a sliding window fashion, while retaining only the matches whose composing features can all be found together in sequence with no "filler" basic blocks in between. A "filler" basic block —relative to the rule being currently considered— is a basic block that has no features necessary for that rule to match, and therefore signals a "rupture" in that superblock.

  4. The remaining superblock rules are then passed forward for function-scope rules to use them for matching, and are also displayed to the user in either of the following suggested fashions:
    a. Denote each superblock match by sequence of basic block addresses that make it up.
    b. Denote each superblock match by the address of the basic block that starts the superblock, followed by how many basic blocks after that make up the match (might be less verbose at the cost of losing details).

Note 1: I also suggest that the superblock matcher's binary tree would be implemented on top of a dictionary with each node (basic block) being stored in that dictionary using its address as the key.

Note 2: I also suggest the superblock address to be an (ordered) set of the addresses of its constituent basic blocks.

Looking forward to hear your thoughts on this! :)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions