This post really isn’t up to the standard of most others here. However, there comes a time when writing anything is better than nothing. I’m also glazing over a lot of details that would be important in implementing a Symbolic Execution Engine.
I am interested in the discovery of memory-corruption vulnerabilities. The two often-used methods for finding these vulnerabilities, fuzzing and reverse-engineering, are producing fewer results, and some people have stopped looking for memory-corruption vulnerabilities altogether. As our mainstream software is coded to higher, and more secure, standards, the identification of these vulnerabilities is becoming harder, and we need new methods to find them.
Enter Symbolic Execution. I’m still learning about this method of vulnerability identification and implementing it into rop_tools (source no longer available), but I thought I would share the basic concepts with you. Know that a joint effort between UC Berkely, Carnegie Mellon and the College of William and Mary has brought us BitBlaze, an academic pursuit using much of the technology I’ll be covering.
Let’s start with a simple vulnerability:
int main (int argc, char * argv[])
{
char buf[16];
int i;
for (i = 0; (i < 16) && (argv[1][i] != '\0'); i++)
buf[i] = argv[1][i];
return 0;
}
And here’s gcc’s output with no optimizations
0000000000400494 main: 400494 55.............. push rbp 400495 4889e5.......... mov rbp, rsp 400498 897ddc.......... mov [rbp-0x24], edi 40049b 488975d0........ mov [rbp-0x30], rsi 40049f c745fc00000000.. mov dword [rbp-0x4], 0x0 4004a6 eb23............ /--< jmp main+55 (4004cb) 4004a8 488b45d0........ | /> mov rax, [rbp-0x30] 4004ac 4883c008........ | | add rax, 0x8 4004b0 488b10.......... | | mov rdx, [rax] 4004b3 8b45fc.......... | | mov eax, [rbp-0x4] 4004b6 4898............ | | cdqe 4004b8 4801d0.......... | | add rax, rdx 4004bb 0fb610.......... | | movzx edx, byte [rax] 4004be 8b45fc.......... | | mov eax, [rbp-0x4] 4004c1 4898............ | | cdqe 4004c3 885405e0........ | | mov [rbp+rax-0x20], dl 4004c7 8345fc01........ | | add dword [rbp-0x4], 0x1 4004cb 837dfc0f........ \-+> cmp dword [rbp-0x4], 0xf 4004cf 7f1a............ /+< jg main+87 (4004eb) 4004d1 488b45d0........ || mov rax, [rbp-0x30] 4004d5 4883c008........ || add rax, 0x8 4004d9 488b10.......... || mov rdx, [rax] 4004dc 8b45fc.......... || mov eax, [rbp-0x4] 4004df 4898............ || cdqe 4004e1 4801d0.......... || add rax, rdx 4004e4 0fb600.......... || movzx eax, byte [rax] 4004e7 84c0............ || test al, al 4004e9 75bd............ |\< jnz main+20 (4004a8) 4004eb b800000000...... \-> mov eax, 0x0 4004f0 5d.............. pop rbp 4004f1 c3.............. ret
If we were to fuzz this executable, we would create several inputs in the hopes that one of these inputs would overwrite some critical location in memory, and that overwrite would cause the application to crash.
In Symbolic Execution we will run our code in a virtualized environment. Instead of using static values for inputs, such as with fuzzing, we will determine which values can change and we will use ranges for those values. For example, the valid values for argc are variable. When running your code during fuzzing, you pick a value for argc, say 2. With Symbolic Execution, your Symbolic Execution Engine would reason about what values are possible, say 1-255, and would attempt to identify the execution paths possible with this range of values.
For example, we know the string pointed to by [ebp+0xc]+0×4, or argv[1], can be ANYTHING up to the length our shell will allow. For me, this is roughly 2^15 bytes, minus some, but for our purposes we’ll say 10,000.
As we go through our example, we will focus on the instructions that control execution flow:
4004a6 eb23............ /--< jmp main+55 (4004cb) 4004cf 7f1a............ /+< jg main+87 (4004eb) 4004e9 75bd............ |\< jnz main+20 (4004a8) 4004f1 c3.............. ret
We’re also interested in locations that are VALID destinations to land back into. For the above instructions, they are:
4004a8 488b45d0........ | /> mov rax, [rbp-0x30] 4004cb 837dfc0f........ \-+> cmp dword [rbp-0x4], 0xf 4004eb b800000000...... \-> mov eax, 0x0
And the location in .text immediately following the call into main, IE the address held in [ebp+0x4], or the return pointer.
Now we get to the first jump.
4004a6 eb23............ /--< jmp main+55 (4004cb)
This jump is non-conditional, so we aren’t very concerned with it. We’re not going to be able to alter the flow of execution based on this JMP.
We skip to the next jump.
4004cb 837dfc0f........ \-+> cmp dword [rbp-0x4], 0xf 4004cf 7f1a............ /+< jg main+87 (4004eb)
This is a conditional jump. We can better express this jump as the following:
JUMP(+87, dword [rbp-0x4] <= 0xf)
The BitBlaze team choose to take x86 assembly and reassemble it to a simpler intermediate language. This was a very good idea, as it greatly reduces the complexity of analysis. While we’re going to glaze over some of the specifics implemented by BitBlaze, we will capture the intent or their intermediate language.
This jump is, also, for our-purposes right now, non-conditional, in that the value at [rbp-0x4] is clearly defined. We know exactly what it is at this point in the execution of our program, and can accurately predict where the control of our program will flow to. Let’s move to the next JMP.
4004d1 488b45d0........ || mov rax, [rbp-0x30] 4004d5 4883c008........ || add rax, 0x8 4004d9 488b10.......... || mov rdx, [rax] 4004dc 8b45fc.......... || mov eax, [rbp-0x4] 4004df 4898............ || cdqe 4004e1 4801d0.......... || add rax, rdx 4004e4 0fb600.......... || movzx eax, byte [rax] 4004e7 84c0............ || test al, al 4004e9 75bd............ |\< jnz main+20 (4004a8)
This JUMP can be rewritten as:
JUMP(+20, 0x00 != byte [[[rbp-0x30]+0x8] + [rbp-0x4]])
Well, this is the ith byte of our second argument from the command line. This is, for our purposes, conditional, as we do not know the value of that argument (let’s not go into how we would implement this for now). So what do we do?
We find the range of values that would cause the jump to execute, and the values that would not cause the jump to execute. We then set the memory/variables (where registers are variables) to the range of these values, and we try both flows of execution.
In other words, we create two models of our memory and variables. In one model, we set byte [[[rbp-0x30]+0×8] + [rbp-0x4]] to 0, and in the other model, we set byte [[[rbp-0x30]+0×8] + [rbp-0x4]] to all possible values other than 0.
Path selection becomes important and will determine how long it takes for your symbolic execution engine to find the memory corruption bug. You can also implement additional checks, such as ASSERT that certain values in memory do not change (return address would be a natural first pick).
Assuming we do not implement an ASSERT, which would not catch all memory corruption errors anyway, how would we determine when we have corrupted memory in a manner that corrupts the Instruction Pointer?
When we get to our ret instruction, we will rewrite it as:
JUMP([rsp], TRUE)
All values in [rsp] must equal the values we determined as valid destinations to land back into. Remember we determined this earlier. Depending on what you’re looking for, this may be the destination of all calls, jumps, and register jumps (you may do some special tweaking for register jumps in something so as the Procedure Linkage Table), or just addresses in .text in general (which may cause you to miss something important!).
As our symbolic execution engine continues to execute, we eventually hit our ret instruction where the value of [rsp] is a series of ranges, something like ((0×01-0xff){8}). This would obviously include addresses outside of our approved addresses. We would then raise an exception.