Paradyn Project
Paradyn / Dyninst Week, Madison, Wisconsin, April 12-14, 2004

Program Provenance: Guessing the Source Compiler from Binary Code
Nathan Rosenblum

Why compiler provenance?

[Figure: IDA Pro]

Why should this work?

Source:

    int bar(int foo) {
        int i, j;
        for (i = 0; i < foo; ++i) {
            i = j + i;
            j *= i;
        }
        return j;
    }

GCC:

    test edi,edi
    jle  4004ae
    mov  eax,0x0
    lea  eax,[rdx+rax]
    imul edx,eax
    add  eax,0x1
    cmp  edi,eax
    jg   4004a1
    mov  eax,edx
    ret

ICC:

    xor  edx,edx
    test edi,edi
    jle
    add  edx,eax
    imul eax,edx
    inc  edx
    cmp  edx,edi
    jl   40097e
    ret

Modeling binary code

[Figure: a program binary drawn as a sequence of positions i₋₁, i, i₊₁, i₊₂. Each position carries a compiler label (gcc, icc, or none) above its underlying bytes. Names shown in the binary: match_init, zp_init_keys, seekable. Other regions: padding, addrs., data.]
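
The figure treats a binary as a sequence of regions, each with underlying bytes and a compiler label. A minimal Python sketch of that representation follows. It is an illustration only, not the model from the talk: the `Region` type, the `label_runs` helper, the byte values, and the assignment of labels to the three names from the slide are all hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional, Tuple


@dataclass
class Region:
    name: str             # function name, or a tag such as "padding"
    raw: bytes            # underlying bytes (placeholder values below)
    label: Optional[str]  # "gcc", "icc", or None for unlabeled regions


def label_runs(regions: List[Region]) -> List[Tuple[Optional[str], List[str]]]:
    """Collapse adjacent regions that share a compiler label into runs."""
    return [
        (label, [r.name for r in group])
        for label, group in groupby(regions, key=lambda r: r.label)
    ]


# Hypothetical layout: the names appear on the slide, but the labels
# and bytes here are invented for the example.
binary = [
    Region("match_init", b"\x55\x89", "gcc"),
    Region("zp_init_keys", b"\x55\x89", "gcc"),
    Region("padding", b"\x90\x90", None),
    Region("seekable", b"\x55\x89", "icc"),
]

runs = label_runs(binary)
```

Grouping adjacent labels makes the mixed-compiler case explicit: a binary built by one compiler collapses to a single labeled run, while a mixed binary yields several.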

Describing code

Idiom features, e.g. 〈mov [IMM], RAX ; * ; sub [IMM], RAX〉
  - mov: abstracts several IA32 opcodes
  - *: single-instruction wildcard
  - [IMM]: hide immediate values

Feature levels: instruction-level; control flow-level (branch)

Example idioms:
  〈mov [IMM], RAX ; * ; sub [IMM], RAX〉
  〈add [IMM], RDX ; * ; sub RAX, RCX〉
  〈push EBP ; mov ESP, EBP〉
  〈shl [IMM], RAX ; shr [IMM], RAX〉
  〈* ; * ; sub [IMM], RAX〉

[math elided]
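
An idiom can be read as a short pattern over normalized instructions, where `*` stands for exactly one instruction and immediates are hidden. The Python sketch below shows one way to count such matches. It is not the authors' implementation: `normalize`, `matches`, `count_idiom`, the regular expression, and the use of the Intel-syntax GCC listing from the loop example are assumptions made for illustration, and the sketch does not attempt the opcode abstraction mentioned on the slide.

```python
import re

WILDCARD = "*"  # matches exactly one instruction

# Assumed scheme: any 0x-prefixed literal or any token that starts with
# a digit (including bare hex jump targets) is treated as an immediate.
_IMM = re.compile(r"\b(?:0x[0-9a-fA-F]+|[0-9][0-9a-fA-F]*)\b")


def normalize(instr: str) -> str:
    """Hide immediate values in one instruction string."""
    mnemonic, _, operands = instr.strip().partition(" ")
    operands = _IMM.sub("IMM", operands.strip())
    return f"{mnemonic} {operands}".strip()


def matches(pattern, window) -> bool:
    """True if every pattern slot is a wildcard or equals the instruction."""
    return len(pattern) == len(window) and all(
        p == WILDCARD or p == w for p, w in zip(pattern, window)
    )


def count_idiom(pattern, instrs) -> int:
    """Count the windows of the instruction stream that match the idiom."""
    norm = [normalize(i) for i in instrs]
    n = len(pattern)
    return sum(matches(pattern, norm[k:k + n]) for k in range(len(norm) - n + 1))


# GCC listing from the loop example, one instruction per entry.
GCC_LISTING = [
    "test edi,edi", "jle 4004ae", "mov eax,0x0", "lea eax,[rdx+rax]",
    "imul edx,eax", "add eax,0x1", "cmp edi,eax", "jg 4004a1",
    "mov eax,edx", "ret",
]

hits = count_idiom(("add eax,IMM", "*", "jg IMM"), GCC_LISTING)
```

Counts like `hits`, computed for many idioms, could serve as the feature vector for a classifier. The wildcard lets one idiom tolerate an unrelated instruction scheduled between the two of interest.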

Results [R, Miller, Zhu, PASTE '10]

[Figure: results for the single-compiler and mixed-compiler settings over GCC, ICC, and MSVC. Values shown: 92.5%, 93.7%. Error types: 5.3%, 2.3% or 2.8%, 6.4%.]

Finer detail: compiler versions, optimization

Question                                 Example          Difficulty  Accuracy
Major versions?                          GCC 3.x vs 4.x   easy        99%
Minor versions?                          GCC 4.2 vs 4.3   easy        85-99%
Low optimization vs. high optimization?  GCC -O0 vs -O3   easy        99%
Highly optimized code?                   GCC -O2 vs -O3   hard        60%

Future work

    int bar(int foo) {
        int i, j;
        for (i = 0; i < foo; ++i) {
            i = j + i;
            j *= i;
        }
        return j;
    }