Day 2: Building Process Virtualization Systems Kim Hazelwood ACACES Summer School July 2009
ACACES 2009 – Process Virtualization
Course Outline
- Day 1 – What is Process Virtualization?
- Day 2 – Building Process Virtualization Systems
- Day 3 – Using Process Virtualization Systems
- Day 4 – Symbiotic Optimization

JIT-Based Process Virtualization
[Figure: diagram with stages Application, Transform, Code Cache, Execute, Profile]

What are the Challenges?
- Performance! Solutions:
  - Code caches: only transform code once
  - Trace selection: focus on hot paths
  - Branch linking: only perform cache lookup once
  - Indirect branch hash tables / chaining
- Memory "management"
- Correctness: self-modifying code, munmaps, multithreading
- Transparency: context switching, eflags

What is the Overhead?
The latest Pin overhead numbers …

Sources of Overhead
- Internal
  - Compiling code & exit stubs (region detection, region formation, code generation)
  - Managing code (eviction, linking)
  - Managing directories and performing lookups
  - Maintaining consistency (SMC, DLLs)
- External
  - User-inserted instrumentation

Improving Performance: Code Caches
[Flowchart. Nodes: Start, Interpret, Branch Target Address, Hash Table Lookup (Hit / Miss), Counter++, Code is Hot? (Yes / No), Region Formation & Optimization, Room in Code Cache? (Yes / No), Evict Code, Delete, Insert, Update Hash Table, Code Cache, Exit Stub]

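One plausible reading of the flowchart above can be sketched as a small dispatcher model. This is an illustrative sketch, not Pin's implementation: the class name `CodeCacheModel`, the hot threshold, the capacity measured in regions, and the flush-everything eviction policy are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// What the dispatcher decides to do for one branch target.
enum class Action { ExecuteCached, Interpret, CompileAndInsert };

// Hypothetical model of the flowchart; names and policies are illustrative.
class CodeCacheModel {
public:
    CodeCacheModel(std::size_t capacity, unsigned hotThreshold)
        : capacity_(capacity), hotThreshold_(hotThreshold), flushes_(0) {
        assert(capacity_ > 0 && hotThreshold_ > 0);
    }

    Action dispatch(std::uint64_t target) {
        // Hash table lookup: a hit runs the cached region directly.
        if (cached_.count(target) != 0) return Action::ExecuteCached;

        // Miss: bump the counter; cold code keeps being interpreted.
        if (++counters_[target] < hotThreshold_) return Action::Interpret;

        // Hot code: make room if needed (assumed policy: flush everything).
        if (cached_.size() == capacity_) {
            cached_.clear();
            ++flushes_;
        }

        // Form the region, insert it, and update the directory.
        cached_.insert(target);
        counters_.erase(target);
        return Action::CompileAndInsert;
    }

    std::size_t size() const { return cached_.size(); }
    std::size_t flushes() const { return flushes_; }

private:
    std::size_t capacity_;
    unsigned hotThreshold_;
    std::size_t flushes_;
    std::unordered_set<std::uint64_t> cached_;             // directory of cached regions
    std::unordered_map<std::uint64_t, unsigned> counters_; // per-target execution counts
};
```

With a threshold of 2, the first visit to a target is interpreted, the second compiles it, and later visits hit the cache, which is where the one-time transformation cost is amortized.
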
Software-Managed Code Caches
- Store transformed code at run time to amortize overhead of process VMs
- Contain a (potentially altered) copy of application code
[Figure: diagram with stages Application, Transform, Code Cache, Execute, Profile]

Code Cache Contents
- Every application instruction executed is stored in the code cache (at least)
- Code regions
  - Altered copies of application code
  - Basic blocks and/or traces
- Exit stubs
  - Swap application and VM state
  - Return control to the process VM

Code Regions
- Basic blocks
- Traces
[Figure: a basic block (BBL A: Inst1, Inst2, Inst3, Branch B); a CFG over blocks A, B, C, D; the blocks as laid out in memory; and a trace built from them]

Exit Stubs
- One exit stub exists for every exit from every trace or basic block
- Functionality
  - Prepare for context switch
  - Return control to VM dispatch
- Details
  - Each exit stub ≈ 3 instructions
[Figure: trace A B D with exit stubs "Exit to C" and "Exit to E"]

Performance: Trace Selection
- Interprocedural path
- Single entry, multiple exit
[Figure: CFG over blocks A to I with a call and a return; the layout in memory; and the layout in the code cache as a trace (superblock) A B D E G H I with exit stubs "Exit to C" and "Exit to F"]

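The single-entry, multiple-exit shape of a trace can be illustrated with a small selection sketch. The slides do not say which selection heuristic is used, so this example assumes a simple one: start at a hot head and keep following the most frequently taken successor, turning every other successor into an exit. The edge counts in `exampleCfg` are invented, and the CFG edges are one reading of the figure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Edge {
    char to;         // successor block
    unsigned count;  // how often the edge was taken (hypothetical profile data)
};
using Cfg = std::map<char, std::vector<Edge>>;

struct Trace {
    std::string blocks;       // blocks in trace order (single entry at the front)
    std::vector<char> exits;  // off-trace targets, one exit stub each
};

// Assumed heuristic: follow the hottest successor until the path ends,
// the trace would loop back on itself, or the length limit is reached.
Trace selectTrace(const Cfg& cfg, char head, std::size_t maxBlocks) {
    assert(maxBlocks > 0);
    Trace trace;
    char current = head;
    while (true) {
        trace.blocks.push_back(current);
        Cfg::const_iterator it = cfg.find(current);
        if (it == cfg.end() || it->second.empty()) break;  // no successors
        const std::vector<Edge>& succ = it->second;
        std::vector<Edge>::const_iterator hottest = std::max_element(
            succ.begin(), succ.end(),
            [](const Edge& a, const Edge& b) { return a.count < b.count; });
        bool stop = trace.blocks.size() >= maxBlocks ||
                    trace.blocks.find(hottest->to) != std::string::npos;
        // Every successor that is not the next block on the trace is an exit.
        for (const Edge& e : succ)
            if (stop || e.to != hottest->to) trace.exits.push_back(e.to);
        if (stop) break;
        current = hottest->to;
    }
    return trace;
}

// CFG shaped like the slide's example; the counts are made up.
Cfg exampleCfg() {
    Cfg cfg;
    cfg['A'] = {{'B', 90}, {'C', 10}};
    cfg['B'] = {{'D', 90}};
    cfg['C'] = {{'D', 10}};
    cfg['D'] = {{'E', 100}};  // call
    cfg['E'] = {{'F', 20}, {'G', 80}};
    cfg['F'] = {{'H', 20}};
    cfg['G'] = {{'H', 80}};
    cfg['H'] = {{'I', 100}};  // return
    return cfg;
}
```

Calling `selectTrace(exampleCfg(), 'A', 16)` yields the blocks A B D E G H I with exits to C and F, matching the trace drawn on the slide. Note that the path crosses the call and the return, which is what makes it interprocedural.
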
Performance: Cache Linking
[Figure: Dispatch, Trace #1 with Exit #1a and Exit #1b, Trace #2, and Trace #3]

Linking Traces
- Proactive linking
- Lazy linking
[Figure: the CFG over blocks A to I with a call and a return, and three traces (A B D E G H I, C D E G H I, and F H I), shown first with exit stubs such as "Exit to C" and "Exit to F" and then with the exits linked]

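Lazy linking can be sketched as follows: an exit goes through the dispatcher's directory lookup the first time it is taken, and is then patched to point straight at the target trace. This is a simplified model with hypothetical names (`LinkedCache`, `takeExit`); a proactive scheme would instead patch all matching exits when a trace is inserted, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct TraceExit {
    std::uint64_t target;  // application address this exit branches to
    int linkedTrace;       // linked trace index, or -1 if it still goes to the dispatcher
};

// Hypothetical model of a code cache whose exits can be linked lazily.
class LinkedCache {
public:
    explicit LinkedCache(bool lazyLinking) : lazyLinking_(lazyLinking), lookups_(0) {}

    int addTrace(std::uint64_t start, const std::vector<std::uint64_t>& exitTargets) {
        std::vector<TraceExit> exits;
        for (std::uint64_t t : exitTargets) exits.push_back(TraceExit{t, -1});
        traces_.push_back(exits);
        int index = static_cast<int>(traces_.size()) - 1;
        directory_[start] = index;
        return index;
    }

    // Returns the index of the next trace, or -1 if the target is not cached.
    int takeExit(int trace, std::size_t exitIndex) {
        assert(trace >= 0 && static_cast<std::size_t>(trace) < traces_.size());
        assert(exitIndex < traces_[trace].size());
        TraceExit& e = traces_[trace][exitIndex];
        if (e.linkedTrace >= 0) return e.linkedTrace;  // linked: no dispatcher involved

        ++lookups_;  // unlinked: context switch to the dispatcher and look up the target
        std::unordered_map<std::uint64_t, int>::const_iterator it = directory_.find(e.target);
        if (it == directory_.end()) return -1;
        if (lazyLinking_) e.linkedTrace = it->second;  // patch the exit for next time
        return it->second;
    }

    std::size_t lookups() const { return lookups_; }

private:
    bool lazyLinking_;
    std::size_t lookups_;
    std::unordered_map<std::uint64_t, int> directory_;  // application address -> trace index
    std::vector<std::vector<TraceExit>> traces_;        // exits of each trace
};

// Two traces that branch to each other; returns how many lookups were needed.
std::size_t pingPongLookups(bool lazyLinking, int transitions) {
    LinkedCache cache(lazyLinking);
    int current = cache.addTrace(0x100, {0x200});
    cache.addTrace(0x200, {0x100});
    for (int i = 0; i < transitions; ++i) current = cache.takeExit(current, 0);
    return cache.lookups();
}
```

For 1000 transitions between the two traces, the linked cache performs 2 lookups and the unlinked cache performs 1000, which gives a feel for why removing the dispatcher from hot paths matters so much.
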
Are Links Highly Beneficial?

Benchmark   With Linking   Without Linking   Slowdown
gzip        230 sec        7951 sec          3357%
vpr         333 sec        2474 sec          643%
gcc         206 sec        3284 sec          1494%
mcf         368 sec        2014 sec          447%
crafty      215 sec        3547 sec          1550%
parser      350 sec        6795 sec          1841%
perlbmk     336 sec        6945 sec          1967%
gap         195 sec        4231 sec          2070%
vortex      382 sec        4655 sec          1119%
bzip2       287 sec        4294 sec          1396%
twolf       658 sec        6490 sec          886%

Code Cache Visualization

Challenge: Rewriting Instructions
- We must regularly rewrite branches
- No atomic branch write on x86
- Pin uses a neat trick*:
  1. "old" 5-byte branch
  2. 2-byte self branch
  3. n-2 bytes of "new" branch
  4. "new" 5-byte branch
* Sundaresan et al. 2006

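The ordering of the rewrite steps above can be illustrated on an ordinary byte array. This is only a simulation under stated assumptions: it uses a 5-byte `jmp rel32` (opcode `E9`) and the 2-byte self branch `EB FE`, and it does not perform real code patching. A real implementation would also need each 2-byte write to be a single atomic store, which this sketch does not enforce. The helper names `makeJmp` and `rewriteBranch` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Branch = std::array<std::uint8_t, 5>;  // jmp rel32: E9 + 4-byte displacement

// Encodes "jmp rel32" with a little-endian displacement.
Branch makeJmp(std::int32_t displacement) {
    Branch b{};
    b[0] = 0xE9;
    std::uint32_t u = static_cast<std::uint32_t>(displacement);
    for (int i = 0; i < 4; ++i) b[1 + i] = static_cast<std::uint8_t>(u >> (8 * i));
    return b;
}

// Rewrites `slot` into `newBranch` in three steps and returns the slot
// contents after each step, so the intermediate states can be inspected.
std::vector<Branch> rewriteBranch(Branch& slot, const Branch& newBranch) {
    std::vector<Branch> states;

    // Step 1: overwrite the first two bytes with a self branch (EB FE jumps
    // to itself), so a thread arriving here spins instead of running a
    // half-written instruction.
    slot[0] = 0xEB;
    slot[1] = 0xFE;
    states.push_back(slot);

    // Step 2: write the last n-2 bytes of the new branch behind the self branch.
    std::copy(newBranch.begin() + 2, newBranch.end(), slot.begin() + 2);
    states.push_back(slot);

    // Step 3: replace the self branch with the first two bytes of the new branch.
    slot[0] = newBranch[0];
    slot[1] = newBranch[1];
    states.push_back(slot);

    assert(slot == newBranch);
    return states;
}
```

At every intermediate state the first two bytes are either a complete self branch or the start of a fully written new branch, so no thread can observe a mix of old and new displacement bytes behind a live opcode.
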
Challenge: Achieving Transparency
- Pretend as though the original program is executing
- Original code:
    0x1000 call 0x4000
- Code cache address mapping (SPC → TPC):
    0x1000 → 0x7000 "caller"
    0x4000 → 0x8000 "callee"
- Translated code:
    0x7000 push 0x1006
    0x7006 jmp 0x8000
- Push 0x1006 on stack, then jump to 0x4000

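The call translation above can be sketched as a lookup in the SPC-to-TPC map: the value pushed on the stack stays an original (SPC) address so the application never sees code cache addresses, while the jump goes to the translated (TPC) callee. The names below are hypothetical, and the call length of 6 bytes is simply taken from the slide's numbers (0x1000 to 0x1006).

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using AddressMap = std::unordered_map<std::uint64_t, std::uint64_t>;  // SPC -> TPC

struct TranslatedCall {
    std::uint64_t pushedReturnAddress;  // original (SPC) return address
    std::uint64_t jumpTarget;           // TPC of the callee; valid only if calleeCached
    bool calleeCached;                  // false: control must go back to the VM first
};

// Translates "call calleeSpc" located at callSpc into "push return; jmp target".
TranslatedCall translateCall(const AddressMap& spcToTpc, std::uint64_t callSpc,
                             std::uint64_t callLength, std::uint64_t calleeSpc) {
    assert(callLength > 0);
    TranslatedCall out;
    // The application must observe the return address it would see natively.
    out.pushedReturnAddress = callSpc + callLength;
    AddressMap::const_iterator it = spcToTpc.find(calleeSpc);
    out.calleeCached = (it != spcToTpc.end());
    out.jumpTarget = out.calleeCached ? it->second : 0;
    return out;
}
```

With the slide's mapping, `translateCall(map, 0x1000, 6, 0x4000)` pushes 0x1006 and jumps to 0x8000. Because the pushed address is an SPC, the matching return is an indirect branch whose target has to be mapped back to a TPC at run time.
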
Challenge: Self-Modifying Code
- The problem
  - Code cache must detect SMC and invalidate corresponding cached traces
- Solutions
  - Many proposed … but without HW support, they are very expensive!
    - Changing page protection
    - Memory diff prior to execution
  - On ARM, there is an explicit instruction for SMC!

Self-Modifying Code Handler (Written by Alex Skaletsky)

    int main(int argc, char **argv)
    {
        PIN_Init(argc, argv);
        // Instrument every trace as it is placed in the code cache.
        TRACE_AddInstrumentFunction(InsertSmcCheck, 0);
        PIN_StartProgram();  // Never returns
        return 0;
    }

    // Save a copy of the trace's original bytes and check it before each execution.
    void InsertSmcCheck(TRACE trace, VOID *v)
    {
        ...  // elided on the slide
        memcpy(traceCopyAddr, traceAddr, traceSize);
        TRACE_InsertCall(trace, IPOINT_BEFORE, (AFUNPTR)DoSmcCheck,
                         IARG_PTR, traceAddr,
                         IARG_PTR, traceCopyAddr,
                         IARG_UINT32, traceSize,
                         IARG_CONTEXT,
                         IARG_END);
    }

    // If the application bytes changed, drop the stale trace and re-execute.
    void DoSmcCheck(VOID *traceAddr, VOID *traceCopyAddr, USIZE traceSize, CONTEXT *ctxP)
    {
        if (memcmp(traceAddr, traceCopyAddr, traceSize) != 0) {
            CODECACHE_InvalidateTrace((ADDRINT)traceAddr);
            PIN_ExecuteAt(ctxP);
        }
    }

Challenge: Parallel Applications
[Figure: Pin and the Pin Tool running threads T1 and T2, with regions marked Serialized and Parallel. Components shown: JIT Compiler, Syscall Emulator, Signal Emulator, Dispatcher, Instrumentation Code, Call-Back Handlers, Analysis Code, Code Cache]

Challenge: Code Cache Consistency
- Cached code must be removed for a variety of reasons:
  - Dynamically unloaded code
  - Ephemeral/adaptive instrumentation
  - Self-modifying code
  - Bounded code caches
[Figure: diagram with stages EXE, Transform, Code Cache, Execute, Profile]

Motivating a Bounded Code Cache
The Perl Benchmark

Flushing the Code Cache
- Option 1: All threads have a private code cache (oops, doesn't scale)
- Option 2: Shared code cache across threads
  - If one thread flushes the code cache, other threads may resume in stale memory

Naïve Flush
- Wait for all threads to return from the code cache to the VM
- Could wait indefinitely!
[Figure: timeline for Thread1, Thread2, Thread3. Each thread alternates between the VM and code cache CC1, stalls in the VM (VMstall) for the flush delay, then continues in CC2]

Generational Flush
- Allow threads to continue to make progress in a separate area of the code cache
- Requires a high water mark
[Figure: timeline for Thread1, Thread2, Thread3. Each thread moves from the VM and CC1 to the VM and CC2 without stalling]

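The bookkeeping behind a generational flush can be sketched with per-generation occupancy counts: a flush opens a new generation immediately, and the old one is reclaimed only when the last thread still executing in it returns to the VM. The class below is a single-threaded model with hypothetical names; a real system would need synchronization around these counters and would trigger `flush()` at the high water mark, before the cache is completely full.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical model of generational flushing for a shared code cache.
class GenerationalCache {
public:
    GenerationalCache() : current_(0), freed_(0) { occupants_[0] = 0; }

    // A thread enters the code cache; it always runs in the newest generation.
    int enter() {
        ++occupants_[current_];
        return current_;
    }

    // A thread returns to the VM from generation `gen`. A retired generation
    // is reclaimed as soon as its last occupant has left.
    void leave(int gen) {
        assert(occupants_.count(gen) != 0 && occupants_[gen] > 0);
        if (--occupants_[gen] == 0 && gen != current_) reclaim(gen);
    }

    // Flush request: retire the current generation and open a new one.
    // No thread has to stall; stale code stays mapped until it is unused.
    void flush() {
        int old = current_++;
        occupants_[current_] = 0;
        if (occupants_[old] == 0) reclaim(old);
    }

    std::size_t liveGenerations() const { return occupants_.size(); }
    int freedGenerations() const { return freed_; }

private:
    void reclaim(int gen) {
        occupants_.erase(gen);
        ++freed_;
    }

    int current_;                   // newest generation
    int freed_;                     // generations whose memory was reclaimed
    std::map<int, int> occupants_;  // generation -> threads executing in it
};
```

In contrast to the naive flush, a thread that stays in old code for a long time only delays reclaiming that one generation; it does not block the other threads.
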
Build-Your-Own Cache Replacement

    int main(int argc, char **argv)
    {
        PIN_Init(argc, argv);
        CODECACHE_CacheIsFull(FlushOnFull);  // call FlushOnFull when the cache fills up
        PIN_StartProgram();  // Never returns
        return 0;
    }

    // Replacement policy: flush the entire cache.
    void FlushOnFull()
    {
        CODECACHE_FlushCache();
        cout << "SWOOSH!" << endl;
    }

    % pin -cache_size -t flusher -- /bin/ls
    SWOOSH!

Eviction granularities
- Entire cache
- One cache block
- One trace
- Address range

A Graphical Front-End

Memory Scalability of the Code Cache
- Ensuring scalability also requires carefully configuring the code stored in the cache
- Trace lengths
  - First basic block is non-speculative, others are speculative
  - Longer traces = fewer entries in the lookup table, but more unexecuted code
  - Shorter traces = two off-trace paths at ends of basic blocks with conditional branches = more exit stub code

Effect of Trace Length on Trace Count

Effect of Trace Length on Memory

Sources of Overhead
- Internal
  - Compiling code & exit stubs (region detection, region formation, code generation)
  - Managing code (eviction, linking)
  - Managing directories and performing lookups
  - Maintaining consistency (SMC, DLLs)
- External
  - User-inserted instrumentation

Adding Instrumentation

"Normal Pin" Execution Flow
- Instrumentation is interleaved with application
[Figure: timeline comparing the uninstrumented application with the "Pinned" application, whose time is split into instrumented application, Pin overhead, and instrumentation overhead]

"SuperPin" Execution Flow
- SuperPin creates instrumented slices
[Figure: timeline comparing the uninstrumented application with the SuperPinned application and its instrumented slices]

Issues and Design Decisions
- Creating slices
  - How/when to start a slice
  - How/when to end a slice
- System calls
- Merging results

Execution Timeline
[Figure: timeline over CPU1 to CPU4 showing the original application (S1 to S6) and the instrumented application slices (S1+ to S6+), with events: fork, sleep, record sigr2 to sigr6, resume, detect sigr2 to sigr6, detect exit]

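The "record sig" and "detect sig" labels in the timeline suggest that each slice ends when it reaches the recorded signature of the next slice's starting state. Under that assumption, slicing and merging can be sketched for an instruction counter. This is a sequential simulation over a pre-recorded state sequence, not the fork-based mechanism itself, and the `State` signature (a program counter plus one register) is a deliberately tiny stand-in for real architectural state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal stand-in for architectural state; a real signature covers far more.
struct State {
    std::uint64_t pc;
    std::uint64_t reg;
};

inline bool sameSignature(const State& a, const State& b) {
    return a.pc == b.pc && a.reg == b.reg;
}

// Runs one instrumented slice: counts instructions from `start` until the
// signature of the next slice is detected, or until the program exits.
std::size_t runSlice(const std::vector<State>& execution, std::size_t start,
                     const State* nextSignature) {
    std::size_t icount = 0;
    for (std::size_t i = start; i < execution.size(); ++i) {
        if (nextSignature != nullptr && i != start &&
            sameSignature(execution[i], *nextSignature))
            break;  // the next slice takes over from here
        ++icount;
    }
    return icount;
}

// Splits the run at `sliceStarts`, runs every slice, and merges by summing.
// In the real system the slices would run concurrently on other CPUs.
std::size_t superPinIcount(const std::vector<State>& execution,
                           const std::vector<std::size_t>& sliceStarts) {
    assert(!sliceStarts.empty() && sliceStarts[0] == 0);
    std::size_t total = 0;
    for (std::size_t k = 0; k < sliceStarts.size(); ++k) {
        assert(sliceStarts[k] < execution.size());
        assert(k == 0 || sliceStarts[k - 1] < sliceStarts[k]);
        const State* next =
            (k + 1 < sliceStarts.size()) ? &execution[sliceStarts[k + 1]] : nullptr;
        total += runSlice(execution, sliceStarts[k], next);
    }
    return total;
}

// Example workload: a 4-instruction loop body; `reg` holds the iteration number.
std::vector<State> loopExecution(std::uint64_t iterations) {
    std::vector<State> states;
    for (std::uint64_t i = 0; i < iterations; ++i)
        for (std::uint64_t pc = 0x1000; pc < 0x1010; pc += 4)
            states.push_back(State{pc, i});
    return states;
}
```

The example shows why the signature cannot be the program counter alone: the loop revisits the same addresses on every iteration, so a slice needs additional state to recognize the exact point where its successor began.
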
Performance: icount1
    % pin -t icount1 --

What Did We Learn Today?
- Building Process VMs is only half the battle
  - Robustness, correctness, performance are paramount
- Lots of "tricks" are in play
  - Code caches, trace selection, etc.
- Knowing about these tricks is beneficial
  - Lots of research opportunities
  - Understanding the inner workings often helps you write better tools

Want More Info?
- Read the seminal Dynamo paper
- See the more recent papers by the Pin, DynamoRIO, Valgrind teams
- Relevant conferences: VEE, CGO, ASPLOS, PLDI, PACT

Course Outline
- Day 1 – What is Process Virtualization?
- Day 2 – Building Process Virtualization Systems
- Day 3 – Using Process Virtualization Systems
- Day 4 – Symbiotic Optimization