December 20, 2018

Release v0.1.0

Overview:

Support for Windows
Symbolic execution memory management rework
Support for wide operations in each jitter
Good bye SymbolPool, welcome LocationDB
IRCFG out of IntermediateRepresentation
SSA
Explicit operators
Linux environment emulation

Here is a short post to highlight some of the new features and main architectural changes of Miasm’s release v0.1.0. The comprehensive list of changes is available here.

Some of these recent changes will maybe be detailed later in a dedicated post.

Support for Windows

Thanks to the amazing job of @0vercl0k, Miasm now runs natively on Windows (without requiring WSL). It even supports LLVM, MSVC and each commit landing to master is now also tested on Windows thanks to AppVeyor.

Symbolic execution memory management rework

In Miasm, the symbolic execution engine keeps loads and writes addresses symbolic. To handle wrapping memory cases, target address was compared to addresses already used in the memory. This comparison, in order to manage both symbolic and concrete addresses, was done using the expression simplifier.

For instance, to store X (32 bits) at address A₁, Miasm tried to see if A₁, A₁ + 1, A₁ + 2, A₁ + 3 were overlapping with any other existing address. This process is quite slow, and was also limiting access size by a hardcoded constant (to avoid too many computations).

Let’s remark that an access is always in the form Symbolic + Constant. For instance, @[EAX + EBX + 12] is splitted in EAX + EBX (symbolic base) and 12 (offset). @[1234] is splitted in (empty) (symbolic base) and 1234 (offset).

The new engine is based on the following hypothesis: if two accesses are made on a different symbolic base, they “live” in different world and hence, they cannot overlap. Thus, overlapping detection is now done by first splitting the addresses into symbolic bases and offsets, and then by comparing the offsets, which is faster and avoid above limitation.

This mechanism allegedly avoids a common, hard problem: memory aliasing. Miasm doesn’t handle it for now. It will consider that in the initial symbolic execution environment, aliases are already resolved. If two symbols alias, it must be explicit. Otherwise, they will never alias.

Support for wide operations in each jitter

Miasm C-based jitters (GCC and the no longer supported TCC) were limited in operand sizes to 64 bits. Even if the LLVM jitter was not suffering from this limitation, the jitters shared common structures (such as the CPU registers). Thus, one can switch from one to another smoothly.

With the introduction of bignum, this is no longer an issue. To illustrate it, Miasm now supports most of the SSE2 instructions.

Good bye SymbolPool, welcome LocationDB

The symbol pool element was responsible of storing names and offsets. It has been replaced by the more modular LocationDB, which may be seen as a “database” for location information.

Now, a location is represented by a LocKey, which is basically a key for a LocationDB entry. Zero, one or several names can be associated with it. Zero or one offset can also be associated. Indeed, in Miasm, a location is not necessarily tight to an offset; for instance, generated IRBlocks must be “somewhere”, but must not collide with existing offsets.

Thus, an instruction or an expression can use a LocKey to represent a location, for instance the possible destination of an IRDst assignment. To respect the homogeneity of expressions, this LocKey is encapsulated in the brand new ExprLoc. The hack-ish ExprId containing an asm_label or a name no longer exists; an ExprId now always stands for a variable.

The modular structure should permit the use of LocationDB to store more information, such as comments or code architecture hints (ARM / thumb, etc.).

IRCFG out of IntermediateRepresentation

IntermediateRepresentation was playing two roles: translation from assembly to IR, and holding a bunch of related IRBlocks.

It is now split in:

IRCFG: a CFG for IR, made of IRBlocks. This is the IR version of the AsmBlock
IntermediateRepresentation: translation from assembly to IR. This is the IR version of the DisasmEngine

So now, one likely has only one IntermediateRepresentation instance, but several IRCFG ones. The different APIs have been updated to reflect this change, and this is likely the main source of script breaks.

SSA

The SSA (Single Static Assignment) has been implemented in Miasm in this PR. This has been implemented as a transformation which takes an IR control flow graph, and rewrites it in SSA form.

The SSA transformation is very common in the compilation field, and gives a lot of interesting properties to ease program analysis. This transformation is, and will be used in some Miasm analysis.

Today, the constant propagation in Miasm is achieved using symbolic emulation which is a bit heavy. A new transformation called PropagateExpr has been implemented using the SSA transformation. Without diving in the details, let’s say it propagates expressions based on rules in order to aggregate multiple little assignments into more readable expressions. The type propagation could also be rewritten using SSA in order to enhance result quality and improve speed analysis.

Explicit operators

Previously, Miasm fully expanded flags and conditional tests. For the following code:

cmp     eax, ebx
jg      XXX

The resulting IR was:

af = ((EAX ^ EBX) ^ (EAX + -EBX))[4:5]
pf = parity((EAX + -EBX) & 0xFF)
zf = (EAX + -EBX)?(0x0,0x1)
of = ((EAX ^ (EAX + -EBX)) & (EAX ^ EBX))[31:32]
nf = (EAX + -EBX)[31:32]
cf = (((EAX ^ EBX) ^ (EAX + -EBX)) ^ ((EAX ^ (EAX + -EBX)) & (EAX ^ EBX)))[31:32]

EIP = (zf | (nf + -of))?(loc_key_41,loc_key_40)
IRDst = (zf | (nf + -of))?(loc_key_41,loc_key_40)

The problem is that using this representation, it is hard to answer simple questions like:

which arguments are tested by the condition?
which kind of condition is it?

We decided to rewrite some mnemonics of each architecture semantic in order to generate the following IR:

af = ((EAX ^ EBX) ^ (EAX + -EBX))[4:5]
pf = parity((EAX + -EBX) & 0xFF)
zf = FLAG_EQ_CMP(EAX, EBX)
of = FLAG_SUB_OF(EAX, EBX)
nf = FLAG_SIGN_SUB(EAX, EBX)
cf = FLAG_SUB_CF(EAX, EBX)

EIP = CC_S>(nf, of, zf)?(loc_key_40,loc_key_41)
IRDst = CC_S>(nf, of, zf)?(loc_key_40,loc_key_41)

As you can see, we have added some new “high level” operators: their goal is to replace explicit computations or flags tests. The interesting thing is that in the previous code, if we operate some expression propagations, we will end up with:

IRDst = CC_S>(
        FLAG_SIGN_SUB(EAX, EBX),
        FLAG_SUB_OF(EAX, EBX),
        FLAG_EQ_CMP(EAX, EBX)
    )?(loc_key_40,loc_key_41)

And guess what? We have simplification rules which reduce this to:

IRDst = (EBX <s EAX)?(loc_key_40,loc_key_41)

Note that it is still possible to generate explicit flag computations, for z3 solving for example, using expr_simp_high_to_explicit simplifier.

Linux environment emulation

In the same way that library APIs are stubbed, it is now possible to go one step further. In this mode, the code runs from the first loader instruction and go through library loading to the main user code.

To do so, one can now stub syscalls, and then, the behavior of the emulated kernel. The targeted code also remains sandboxed (we hope). As before, it is also possible to emulate ARM code (including ld, stubing the corresponding syscalls) on an x86 host.

The support is in its early stages (spoiler: Miasm cannot simulate a full featured kernel). But still, this could be used to avoid stubing functions, for DSE, to test some behavior in response to a specific syscall output, etc.

Posted by commial, serpilliere

Tags: release, IR