Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation
  CK Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Kim Hazelwood (Intel)
  Vijay Janapa Reddi (University of Colorado)
  http://rogue.colorado.edu/Pin
  PLDI’05
Pin is a new dynamic binary instrumentation system
  Insert extra code into programs to collect information about execution
    Program analysis: code coverage, call-graph generation, memory-leak detection
    Architectural study: processor simulation, fault injection
  Existing binary-level instrumentation systems:
    Static: ATOM, EEL, Etch, Morph
    Dynamic: Dyninst, Vulcan, DTrace, Valgrind, Strata, DynamoRIO
Advantages of Pin Instrumentation
  Easy-to-use instrumentation API
    Instrumentation code written in C/C++/asm
    ATOM-like API, based on procedure calls
  Instrumentation tools portable across platforms
    Same tools work on IA32, EM64T (x86-64), Itanium, ARM
    Same tools work on Linux and Windows (ongoing work)
  Low instrumentation overhead
    Pin automatically optimizes instrumentation code
    Pin can attach instrumentation to a running process
  Robust
    Handles mixed code and data, variable-length instructions, dynamically generated code
  Transparent
    Application sees original addresses, values, and stack content
A Pintool for Tracing Memory Writes
    #include <cstdio>
    #include "pin.H"

    FILE* trace;

    // Analysis routine: executed immediately before a write is executed
    VOID RecordMemWrite(VOID* ip, VOID* addr, UINT32 size) {
        fprintf(trace, "%p: W %p %u\n", ip, addr, size);
    }

    // Instrumentation routine: executed when an instruction is dynamically compiled
    VOID Instruction(INS ins, VOID* v) {
        if (INS_IsMemoryWrite(ins))
            INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(RecordMemWrite),
                           IARG_INST_PTR, IARG_MEMORYWRITE_EA,
                           IARG_MEMORYWRITE_SIZE, IARG_END);
    }

    int main(int argc, char* argv[]) {
        PIN_Init(argc, argv);
        trace = fopen("atrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();  // never returns
        return 0;
    }
  Same source code works on the 4 architectures => Pin takes care of different addressing modes
  No need to manually save/restore application state => Pin does it for you automatically and efficiently
Dynamic Instrumentation
  [Figure: original code (blocks 1-7) beside the code cache holding the translated trace 1’, 2’, 7’; exits of the trace point back to Pin]
  Pin fetches the trace starting at block 1 and starts instrumenting it
Dynamic Instrumentation
  [Figure: original code (blocks 1-7) beside the code cache holding the translated trace 1’, 2’, 7’]
  Pin transfers control into the code cache (block 1)
Dynamic Instrumentation
  [Figure: original code (blocks 1-7) beside the code cache holding traces 1’, 2’, 7’ and 3’, 5’, 6’, connected by trace linking]
  Pin fetches and instruments a new trace
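The three steps above (compile a trace on demand, run it from the code cache, link trace exits) can be sketched as a toy dispatch loop. This is a minimal illustration under simplifying assumptions, not Pin's implementation: `Trace`, `ToyVM`, and their members are hypothetical names, each trace is a single block, and every block has exactly one successor.

```cpp
#include <cassert>
#include <unordered_map>

// Toy model only: Trace and ToyVM are hypothetical names, not Pin APIs.
struct Trace {
    int start;    // original block at which this trace starts
    int next;     // original block reached at the trace exit (-1 = program end)
    Trace* link;  // direct link to the successor trace, or nullptr if unlinked
};

struct ToyVM {
    std::unordered_map<int, int> successor;  // original program: block -> next block
    std::unordered_map<int, Trace> cache;    // code cache: original block -> trace
    int compiles = 0;    // traces compiled by the toy JIT
    int vm_entries = 0;  // times control came back to the VM

    explicit ToyVM(const std::unordered_map<int, int>& program) : successor(program) {}

    // Return the cached trace for `block`, compiling it on a miss.
    Trace* lookupOrCompile(int block) {
        auto it = cache.find(block);
        if (it == cache.end()) {
            ++compiles;
            it = cache.emplace(block, Trace{block, successor.at(block), nullptr}).first;
        }
        return &it->second;  // element pointers stay valid across later insertions
    }

    // Execute at most max_steps blocks starting at `entry`; return blocks executed.
    int run(int entry, int max_steps) {
        int executed = 0;
        Trace* prev = nullptr;
        int pc = entry;
        while (pc != -1 && executed < max_steps) {
            ++vm_entries;                // an unlinked exit pointed back to the VM
            Trace* t = lookupOrCompile(pc);
            if (prev) prev->link = t;    // trace linking: patch the exit we came from
            for (;;) {                   // run inside the code cache
                ++executed;
                if (executed >= max_steps || t->next == -1 || !t->link) break;
                t = t->link;             // linked exit: the VM is bypassed
            }
            prev = t;
            pc = t->next;
        }
        return executed;
    }
};
```

For a two-block loop (1 -> 2 -> 1) run for ten steps, the toy VM is entered only three times: twice to compile the two traces and once more to link the back edge. Every later transfer stays in the code cache.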
Pin’s Software Architecture
  [Figure: one address space containing the Pintool, Pin (Instrumentation APIs; Virtual Machine (VM) with JIT Compiler and Emulation Unit; Code Cache) and the Application, above the Operating System and Hardware]
  3 programs (Pin, Pintool, App) in the same address space: user-level only
  Instrumentation APIs: through which the Pintool communicates with Pin
  JIT compiler: dynamically compiles and instruments
  Emulation unit: handles instructions that can’t be directly executed (e.g., syscalls)
  Code cache: stores compiled code
  => Coordinated by the VM
Pin Internal Details
  Loading of Pin, Pintool, & Application
  An Improved Trace Linking Technique
  Register Re-allocation
  Instrumentation Optimizations
  Multithreading Support
Register Re-allocation
  Instrumented code needs extra registers. E.g.:
    Virtual registers available to the tool
    A virtual stack pointer pointing to the instrumentation stack
    Many more …
  Approaches to get extra registers:
    Ad-hoc (e.g., DynamoRIO, Strata, DynInst)
      Whenever you need a register, spill one and fill it afterward
    Re-allocate all registers during compilation
      Local allocation (e.g., Valgrind)
        Allocate registers independently within each trace
      Global allocation (Pin)
        Allocate registers across traces (can be inter-procedural)
Valgrind’s Register Re-allocation
  Original code:
      mov 1, %eax
      mov 2, %ebx
      cmp %ecx, %edx
      jz t
    t:
      add 1, %eax
      sub 2, %ebx
  After re-allocation:
    Trace 1:
      mov 1, %eax
      mov 2, %esi
      cmp %ecx, %edx
      mov %eax, SPILLeax
      mov %esi, SPILLebx
      jz t’
    Trace 2:
    t’:
      mov SPILLeax, %eax
      mov SPILLebx, %edi
      add 1, %eax
      sub 2, %edi
  Register binding (virtual -> physical): Trace 1 holds %ebx in %esi; Trace 2 holds %ebx in %edi
  => Simple but inefficient
    All modified registers are spilled at a trace’s end
    Registers are refilled at a trace’s beginning
Pin’s Register Re-allocation
  Scenario (1): Compiling a new trace at a trace exit
  Original code:
      mov 1, %eax
      mov 2, %ebx
      cmp %ecx, %edx
      jz t
    t:
      add 1, %eax
      sub 2, %ebx
  After re-allocation:
    Trace 1:
      mov 1, %eax
      mov 2, %esi
      cmp %ecx, %edx
      jz t’
    Trace 2:
    t’:
      add 1, %eax
      sub 2, %esi
  Compile Trace 2 using the binding at Trace 1’s exit (virtual %ebx -> physical %esi)
  => No spilling/filling needed across traces
Pin’s Register Re-allocation
  Scenario (2): Targeting an already generated trace at a trace exit
  Original code:
      mov 1, %eax
      mov 2, %ebx
      cmp %ecx, %edx
      jz t
    t:
      add 1, %eax
      sub 2, %ebx
  After re-allocation:
    Trace 1 (being compiled):
      mov 1, %eax
      mov 2, %esi
      cmp %ecx, %edx
      mov %esi, SPILLebx
      mov SPILLebx, %edi
      jz t’
    Trace 2 (in code cache):
    t’:
      add 1, %eax
      sub 2, %edi
  Register binding (virtual -> physical): Trace 1 holds %ebx in %esi; Trace 2 expects %ebx in %edi
  => Minimal spilling/filling code
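Scenario (2) amounts to reconciling two register bindings: only virtual registers whose physical location differs between the exiting trace and the target trace need compensation code. The sketch below illustrates that idea. `Binding` and `reconcile` are hypothetical names, and the spill-then-fill strategy is an assumption that simply mirrors the instructions shown on the slide rather than Pin's actual code generator.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical illustration: a binding maps a virtual register to the
// physical register that currently holds it (e.g. "%ebx" -> "%esi").
using Binding = std::map<std::string, std::string>;

// Emit compensation code for a trace exit whose binding is `exitB` when it
// targets an already compiled trace that expects `entryB`.
// Only registers whose bindings differ are spilled and refilled.
std::vector<std::string> reconcile(const Binding& exitB, const Binding& entryB) {
    std::vector<std::string> spills, fills;
    for (const auto& kv : exitB) {
        const std::string& virt = kv.first;
        const std::string& from = kv.second;
        auto it = entryB.find(virt);
        if (it == entryB.end() || it->second == from) continue;  // nothing to fix
        const std::string slot = "SPILL" + virt.substr(1);        // "%ebx" -> "SPILLebx"
        spills.push_back("mov " + from + ", " + slot);
        fills.push_back("mov " + slot + ", " + it->second);
    }
    // All spills come before all fills, so a fill never clobbers a register
    // that still has to be spilled.
    spills.insert(spills.end(), fills.begin(), fills.end());
    return spills;
}
```

With the slide's bindings (%ebx in %esi at the exit, %ebx in %edi at the target), this yields the two moves shown in Trace 1. Identical bindings, as in scenario (1), yield no code at all.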
Instrumentation Optimizations
  Inline instrumentation code into the application
  Avoid saving/restoring eflags with liveness analysis
  Schedule inlined instrumentation code
Example: Instruction Counting
  Instrument without applying any optimization
  Instrumentation: BBL_InsertCall(bbl, IPOINT_BEFORE, AFUNPTR(docount), IARG_UINT32, BBL_NumIns(bbl), IARG_END)
  Original code:
      cmov %esi, %edi
      cmp %edi, (%esp)
      jle <target1>
      add %ecx, %edx
      cmp %edx, 0
      je <target2>
  Trace:
      mov %esp,SPILLappsp
      mov SPILLpinsp,%esp
      call <bridge>
      cmov %esi, %edi
      mov SPILLappsp,%esp
      cmp %edi, (%esp)
      jle <target1’>
      mov %esp,SPILLappsp
      mov SPILLpinsp,%esp
      call <bridge>
      add %ecx, %edx
      cmp %edx, 0
      je <target2’>
  bridge():
      pushf
      push %edx
      push %ecx
      push %eax
      movl 0x3, %eax
      call docount
      pop %eax
      pop %ecx
      pop %edx
      popf
      ret
  docount():
      add %eax,icount
      ret
  => 33 extra instructions executed altogether
Example: Instruction Counting
  Inlining
  Original code:
      cmov %esi, %edi
      cmp %edi, (%esp)
      jle <target1>
      add %ecx, %edx
      cmp %edx, 0
      je <target2>
  Trace:
      mov %esp,SPILLappsp
      mov SPILLpinsp,%esp
      pushf
      add 0x3, icount
      popf
      cmov %esi, %edi
      mov SPILLappsp,%esp
      cmp %edi, (%esp)
      jle <target1’>
      mov %esp,SPILLappsp
      mov SPILLpinsp,%esp
      pushf
      add 0x3, icount
      popf
      add %ecx, %edx
      cmp %edx, 0
      je <target2’>
  => 11 extra instructions executed
Example: Instruction Counting
  Inlining + eflags liveness analysis
  Original code:
      cmov %esi, %edi
      cmp %edi, (%esp)
      jle <target1>
      add %ecx, %edx
      cmp %edx, 0
      je <target2>
  Trace:
      mov %esp,SPILLappsp
      mov SPILLpinsp,%esp
      pushf
      add 0x3, icount
      popf
      cmov %esi, %edi
      mov SPILLappsp,%esp
      cmp %edi, (%esp)
      jle <target1’>
      add 0x3, icount
      add %ecx, %edx
      cmp %edx, 0
      je <target2’>
  => 7 extra instructions executed
Example: Instruction Counting
  Inlining + eflags liveness analysis + scheduling
  Original code:
      cmov %esi, %edi
      cmp %edi, (%esp)
      jle <target1>
      add %ecx, %edx
      cmp %edx, 0
      je <target2>
  Trace:
      cmov %esi, %edi
      add 0x3, icount
      cmp %edi, (%esp)
      jle <target1’>
      add 0x3, icount
      add %ecx, %edx
      cmp %edx, 0
      je <target2’>
  => 2 extra instructions executed
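The last two steps hinge on knowing where the application's eflags value is dead, so that the flag-clobbering `add 0x3, icount` needs no pushf/popf. A minimal sketch of such a check over one basic block follows. The `Ins` summary and both function names are hypothetical, and treating the flags as live at the end of the block is a conservative assumption of this sketch, not a statement about Pin's analysis.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical instruction summary: only the eflags behaviour matters here.
struct Ins {
    const char* text;
    bool readsFlags;
    bool writesFlags;
};

// True if the application's eflags value is live just before bbl[i], i.e. some
// later instruction reads it before it is overwritten. Flags are conservatively
// treated as live at the end of the block.
bool flagsLiveBefore(const std::vector<Ins>& bbl, std::size_t i) {
    for (; i < bbl.size(); ++i) {
        if (bbl[i].readsFlags) return true;    // value is needed
        if (bbl[i].writesFlags) return false;  // value is overwritten first
    }
    return true;
}

// Scheduling: first position in the block where a flag-clobbering
// "add N, icount" can be inserted without pushf/popf, or -1 if there is none.
int firstFlagsDeadPoint(const std::vector<Ins>& bbl) {
    for (std::size_t i = 0; i < bbl.size(); ++i)
        if (!flagsLiveBefore(bbl, i)) return static_cast<int>(i);
    return -1;
}
```

For the first block of the example (cmov, cmp, jle), the flags are live at the top because cmov reads them, and the first dead point is just before cmp, which is where the scheduled trace places the counter update. For the second block the flags are already dead at the top, because add overwrites them.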
Pin Instrumentation Performance
  Runtime overhead of basic-block counting with Pin on IA32 (SPEC2K using reference data sets)
Comparison among Dynamic Instrumentation Tools
  Runtime overhead of basic-block counting with three different tools
  Valgrind is a popular instrumentation tool on Linux
    Call-based instrumentation, no inlining
  DynamoRIO is the performance leader in binary dynamic optimization
    Manually inlined; no eflags liveness analysis or scheduling
  => Pin automatically provides efficient instrumentation
Pin Applications
  Sample tools in the Pin distribution:
    Cache simulators, branch predictors, address tracer, syscall tracer, edge profiler, stride profiler
  Some tools developed and used inside Intel:
    Opcodemix (analyze code generated by compilers)
    PinPoints (find representative regions in programs to simulate)
    A tool for detecting memory bugs
  Some companies are writing their own Pintools:
    A major database vendor, a major search engine provider
  Some universities using Pin in teaching and research:
    U. of Colorado, MIT, Harvard, Princeton, U of Minnesota, Northeastern, Tufts, University of Rochester, …
Conclusions
  Pin
    A dynamic instrumentation system for building your own program analysis tools
    Easy to use, robust, transparent, efficient
    Tool source compatible on IA32, EM64T, Itanium, ARM
    Works on large applications: database, search engine, web browsers, …
    Available on Linux; Windows version coming soon
  Downloadable from http://rogue.colorado.edu/Pin
    User manual, many example tools, tutorials
    3300 downloads since July 2004
Acknowledgments
  Prof Dan Connors
    Hosting the Pin website at U of Colorado
  Intel Bistro Team
    Providing the Falcon decoder/encoder
    Suggesting instrumentation scheduling
  Mark Charney
    Providing the XED decoder/encoder
  Ramesh Peri
    Implementing part of Itanium instrumentation
Backup
Talk Outline
  A Sample Pintool
  Pin Internal Details
  Experimental Results
  Pin Applications
  Conclusions
Trace Linking
  Trace linking is a very effective optimization
    Bypasses the VM when transferring from one trace to another
    Slowdown without trace linking can be as much as 100x
  Linking direct branches/calls
    Straightforward, as targets are unique
  Linking indirect branches/calls & returns
    More challenging because the target can be different each time
    Our approach:
      For all indirect control transfers, use chaining
      For returns, further optimize with function cloning
Indirect Trace Linking
  Original indirect jump:
      jmp [%eax]
  Translated jump:
      mov [%eax], T
      jmp target_1’
  Chain of predicted targets:
    target_1’:
      if (T != target_1) jmp target_2’
      …
    target_N’:
      if (T != target_N) jmp LookupHtab
      …
  Slow path:
    LookupHtab:
      if (hit) jmp translated[T]
      else call Pin
  Chains are built incrementally
  Most recent target inserted at the chain’s head
  Hash table is local to each indirect jump
  => Improved prediction accuracy over existing schemes
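The lookup order above (chain of compares, then the per-jump hash table, then a call into Pin) can be modelled in a few lines. This is a toy sketch, not Pin's code: `IndirectSite`, `Path`, and the bound on the chain length are assumptions made for illustration. The slide says only that chains grow incrementally with the most recent target at the head.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <unordered_set>

enum class Path { Chain, HashTable, VM };

// Toy model of one indirect jump site (hypothetical names, not Pin code).
class IndirectSite {
    std::deque<int> chain_;         // predicted targets, most recent first
    std::unordered_set<int> htab_;  // per-site table of already translated targets
    std::size_t maxChain_;          // assumed bound on the chain length
public:
    explicit IndirectSite(std::size_t maxChain) : maxChain_(maxChain) {}

    // Report which path resolves a jump to original address `target`.
    Path resolve(int target) {
        for (int predicted : chain_)
            if (predicted == target) return Path::Chain;  // compare-and-jump hit
        if (htab_.count(target)) return Path::HashTable;  // LookupHtab hit
        // Miss everywhere: call into the VM, which translates the target,
        // records it in the local table, and puts it at the chain's head.
        htab_.insert(target);
        chain_.push_front(target);
        if (chain_.size() > maxChain_) chain_.pop_back();
        return Path::VM;
    }
};
```

With a chain bound of 2, jumping to targets 10, 20, 30 in turn enters the VM three times. A later jump to 10 misses the chain (10 was pushed out) but hits the local hash table, while a jump to 30 hits the chain directly.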
Return-Address Prediction
  Distinguish different callers to a function by cloning:
  Original code:
    A():
      call F()
    B():
      call F()
    F():
      ret
  No cloning:
    F’():
      pop T
      jmp A’
    A’:
      if (T != A) jmp B’
      …
    B’:
      if (T != B) jmp Lookuphtab1
      …
  Cloning:
    F_A’():
      pop T
      jmp A’
    A’:
      if (T != A) jmp Lookuphtab1
      …
    F_B’():
      pop T
      jmp B’
    B’:
      if (T != B) jmp Lookuphtab2
      …
  => Prediction accuracy further improved
Pin Multithreading Support
  For instrumenting multithreaded programs:
    Pin intercepts all threading-related system calls
      Creates and starts jitting a thread if a clone() is seen
    Pin provides a “thread id” for pintools to index thread-local storage
    Pin’s virtual registers are backed by a per-thread spilling area
  For writing multithreaded pintools:
    Pin cannot link libpthread into the pintool (to avoid conflicts between two libpthreads setting up signal handlers), so:
      Pin implements a subset of libpthread itself
      Pin can also redirect libpthread calls in the pintool to the application’s libpthread
Instrumenting Multithreaded Programs
  Pin instruments multithreaded programs:
    Spilling area has to be thread local
    Create a new per-thread spilling area when a thread-create system call (e.g., clone()) is intercepted
  How to access the per-thread spilling area?
    Steal a physical register to point to the per-thread spilling area
  x86-specific optimization:
    Initially assume a single-threaded program
      Access the spilling area via its absolute address
    If multiple threads are detected later:
      Flush the code cache
      Recompile with a physical register pointing to the per-thread spilling area
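The x86-specific optimization is essentially a one-way mode switch: absolute addressing until a second thread appears, then one flush and per-thread addressing from then on. The sketch below models only that bookkeeping. `ToySpillManager`, its members, and the slot count are hypothetical, and real thread creation and code generation are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

const std::size_t kSpillSlots = 8;  // assumed size of one spilling area

// Toy model (hypothetical names): tracks spilling areas and the addressing
// mode the JIT would use for them.
class ToySpillManager {
    std::map<int, std::vector<long>> areas_;  // thread id -> spilling area
    bool perThread_ = false;                  // false: absolute addressing
    int flushes_ = 0;                         // code-cache flushes so far
public:
    ToySpillManager() { areas_[0] = std::vector<long>(kSpillSlots); }

    // Called when a thread-create system call such as clone() is intercepted.
    void onThreadCreate(int tid) {
        areas_[tid] = std::vector<long>(kSpillSlots);
        if (!perThread_ && areas_.size() > 1) {
            perThread_ = true;  // recompile with a register pointing to the area
            ++flushes_;         // old code used absolute addresses: flush it
        }
    }

    // Spill slot `i` as seen by thread `tid` under the current addressing mode.
    long& slot(int tid, std::size_t i) {
        return areas_.at(perThread_ ? tid : 0).at(i);
    }

    bool perThread() const { return perThread_; }
    int flushes() const { return flushes_; }
};
```

In this model the flush happens exactly once, when the second thread is created. Later thread creations only allocate another spilling area.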
Optimizing Instrumentation Performance
  Observations:
    Slowdown is largely due to executing instrumentation code rather than dynamic compilation
    Makes sense to spend more time optimizing
  Focus on optimizing simple instrumentation tools:
    Performance depends on how fast we can transition between the application and the tool
    Simple yet commonly used (e.g., basic-block profiling)
Pin Source Code Organization
  Pin source organized into generic, architecture-dependent, OS-dependent modules:

    Architecture            #source files   #source lines
    Generic                 87 (48%)        53595 (47%)
    x86 (32-bit + 64-bit)   34 (19%)        22794 (20%)
    Itanium                                 20474 (18%)
    ARM                     27 (14%)        17933 (15%)
    TOTAL                   182 (100%)      114796 (100%)

  => ~50% code shared among architectures
Pin Instrumentation Performance
  Performance of basic-block counting with Pin/IA32

    Average slowdown                            INT     FP
    Without optimization                        10.4x   3.9x
    Inlining                                    7.8x    3.5x
    Inlining + eflags analysis                  2.8x    1.5x
    Inlining + eflags analysis + scheduling     2.5x    1.4x
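Read as ratios, the table's first and last rows give the combined effect of the three optimizations on the average slowdown. The figures below are plain arithmetic on the numbers in the table, rounded to one decimal place:

```latex
\[
\frac{10.4}{2.5} \approx 4.2 \ \text{(INT)}, \qquad
\frac{3.9}{1.4} \approx 2.8 \ \text{(FP)}
\]
\[
\text{Largest single step (adding eflags analysis to inlining):}\quad
\frac{7.8}{2.8} \approx 2.8 \ \text{(INT)}, \qquad
\frac{3.5}{1.5} \approx 2.3 \ \text{(FP)}
\]
```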
Comparison among Dynamic Instrumentation Tools
  Performance of basic-block counting with three different tools
  Valgrind is a popular instrumentation tool on Linux
    Call-based instrumentation, no inlining
  DynamoRIO is the performance leader in dynamic optimization
    Manually inlined; no eflags liveness analysis or scheduling
  => Pin automatically provides efficient instrumentation
Pin/IA32 Performance (no instrumentation)
Pin/EM64T Performance (no instrumentation)
Pin0/IPF Performance (no instrumentation)