Scalable Support for Multithreaded Applications on Dynamic Binary Instrumentation Systems
Kim Hazelwood, Greg Lueck, Robert Cohn
Hazelwood – ISMM
Dynamic Binary Instrumentation
– Inserts or modifies arbitrary instructions in executing binaries, e.g., instruction count
[Example: counter++; inserted into the instruction sequence]
sub $0xff, %edx
cmp %esi, %edx
jle
mov $0x1, %edi
add $0x10, %eax
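To make the instruction-count idea concrete, here is a minimal sketch in Python. It is a toy model, not Pin's API: `Instr`, `Counter`, `instrument`, and `run` are hypothetical names, guest instructions are opaque strings that are never actually executed, and control flow is ignored (the `jle` target is not given on the slide, so it is left out here too).

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Instr:
    """One original guest instruction, kept as opaque text (hypothetical type)."""
    text: str


class Counter:
    """Analysis state: the number of guest instructions seen so far."""

    def __init__(self) -> None:
        self.value = 0

    def bump(self) -> None:
        self.value += 1


Item = Union[Instr, Callable[[], None]]


def instrument(program: List[Instr], counter: Counter) -> List[Item]:
    """Return a modified copy with a counter increment before each instruction."""
    out: List[Item] = []
    for ins in program:
        out.append(counter.bump)  # inserted analysis call (the "counter++")
        out.append(ins)           # original instruction, unchanged
    return out


def run(stream: List[Item]) -> int:
    """Walk the stream in order: call analysis code, count guest instructions."""
    executed = 0
    for item in stream:
        if isinstance(item, Instr):
            executed += 1         # stand-in for executing the guest instruction
        else:
            item()                # inserted analysis call
    return executed


program = [
    Instr("sub $0xff, %edx"),
    Instr("cmp %esi, %edx"),
    Instr("jle"),
    Instr("mov $0x1, %edi"),
    Instr("add $0x10, %eax"),
]
counter = Counter()
executed = run(instrument(program, counter))
```

The original `program` list is never modified; only the instrumented copy carries the extra calls, which mirrors the idea of leaving the binary on disk untouched.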
Instruction Count Output
$ /bin/ls
Makefile imageload.out itrace proccount imageload inscount atrace itrace.out
$ pin -t inscount.so -- /bin/ls
Makefile imageload.out itrace proccount imageload inscount atrace itrace.out
Count
How Does it Work?
– Generates and caches modified copies of instructions
– Modified (cached) instructions are executed in lieu of the original instructions
[Diagram labels: EXE, Transform, Code Cache, Execute, Profile]
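The translate-once, execute-from-cache loop can be sketched as follows. This is a simplified Python model under stated assumptions: the guest program is a dictionary from block address to (instructions, next address), `CodeCache` and `run` are hypothetical names, and "executing" a block just records its instruction strings.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A block is (instruction list, address of the next block or None to stop).
Block = Tuple[List[str], Optional[int]]


class CodeCache:
    """Caches modified copies of blocks, keyed by original address."""

    def __init__(self, translate: Callable[[int], Block]) -> None:
        self._translate = translate
        self._cache: Dict[int, Block] = {}
        self.translations = 0

    def lookup(self, addr: int) -> Block:
        # Translate only on a miss; later executions reuse the cached copy.
        if addr not in self._cache:
            self._cache[addr] = self._translate(addr)
            self.translations += 1
        return self._cache[addr]


def run(program: Dict[int, Block], entry: int, max_blocks: int) -> Tuple[List[str], int]:
    """Execute up to max_blocks blocks from the cache, never from the original."""

    def translate(addr: int) -> Block:
        instrs, nxt = program[addr]
        # The modified copy carries an inserted analysis marker.
        return (["count++"] + list(instrs), nxt)

    cache = CodeCache(translate)
    executed: List[str] = []
    pc: Optional[int] = entry
    steps = 0
    while pc is not None and steps < max_blocks:
        code, pc = cache.lookup(pc)
        executed.extend(code)
        steps += 1
    return executed, cache.translations


# A two-block loop: block 0 falls through to block 1, which jumps back to 0.
loop = {0: (["a"], 1), 1: (["b"], 0)}
trace, translations = run(loop, entry=0, max_blocks=6)
```

In this example the loop body runs three times but each block is translated only once, which is the reason a code cache amortizes translation cost.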
Why “Dynamic” Instrumentation?
– Robustness!
– No need to recompile or relink
– Discover code at runtime
– Handle dynamically-generated code
– Attach to running processes
The Code Discovery Problem on x86
[Diagram: memory layout of Instr 1, Instr 2, Instr 3, Jump Reg, DATA, Instr 5, Instr 6, Uncond Branch, PADDING, Instr 8; annotations: indirect jump to ??, data interspersed with code, pad for alignment]
Intel Pin
– A dynamic binary instrumentation system
– Easy-to-use instrumentation interface
– Supports multiple platforms
  – Four ISAs: IA32, Intel64, IPF, ARM
  – Four OSes: Linux, Windows, FreeBSD, MacOS
– Popular and well supported
  – 32,000+ downloads
  – 400+ citations
  – 500+ mailing list subscribers
Research Applications
– Gather profile information about applications
– Compare programs generated by competing compilers
– Generate a select stream of live information for event-driven simulation
– Add security features
– Emulate new hardware
– Anything and everything multicore
The Problem with Modern Tools
– Many research tools do not support multithreaded guest applications
– Providing support for MT apps is mostly straightforward
– Providing scalable support can be tricky!
Issues that Arise
– Gaining control of executing threads
– Determining what should be private vs. shared between threads
– Code cache maintenance and consistency
– Concurrent instruction writes
– Providing/handling thread-local storage
– Handling indirect branches
– Handling signals / system calls
The Pin Architecture
[Diagram labels: JIT Compiler, Syscall Emulator, Signal Emulator, Dispatcher, Instrumentation Code, Call-Back Handlers, Analysis Code, Code Cache, Pin, Pin Tool; components are marked Serialized or Parallel for threads T1 and T2]
Code Cache Consistency
Cached code must be removed for a variety of reasons:
– Dynamically unloaded code
– Ephemeral/adaptive instrumentation
– Self-modifying code
– Bounded code caches
[Diagram labels: EXE, Transform, Code Cache, Execute, Profile]
Motivating a Bounded Code Cache
[Figure: The Perl Benchmark]
Flushing the Code Cache
– Option 1: All threads have a private code cache (oops, doesn’t scale)
– Option 2: Shared code cache across threads
  – If one thread flushes the code cache, other threads may resume in stale memory
Naïve Flush
– Wait for all threads to return from the code cache to the VM
– Could wait indefinitely!
[Timeline diagram: Thread1, Thread2, Thread3 over time; segments labeled VM, CC1, VM stall, CC2; the stall is marked as flush delay]
Generational Flush
– Allow threads to continue to make progress in a separate area of the code cache
– Requires a high water mark
[Timeline diagram: Thread1, Thread2, Thread3 over time; segments labeled VM, CC1, CC2]
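One way to picture a generational flush is the bookkeeping sketch below. It is a single-threaded Python model, not Pin's implementation: `GenerationalCache` and its methods are hypothetical names, thread ids are plain labels, and the high water mark (which presumably triggers the flush while there is still room for the new generation) is left out.

```python
from typing import Dict, Set


class GenerationalCache:
    """Tracks which threads still execute in which code cache generation."""

    def __init__(self) -> None:
        self.current = 0
        self.live: Dict[int, Dict[int, str]] = {0: {}}   # generation -> translations
        self.residents: Dict[int, Set[int]] = {0: set()}  # generation -> thread ids
        self.where: Dict[int, int] = {}                   # thread id -> generation

    def enter(self, tid: int) -> int:
        """A thread leaves the VM and always enters the newest generation."""
        self.residents[self.current].add(tid)
        self.where[tid] = self.current
        return self.current

    def leave(self, tid: int) -> None:
        """A thread returns to the VM; its old generation may now be freed."""
        gen = self.where.pop(tid)
        self.residents[gen].discard(tid)
        self._reclaim()

    def flush(self) -> None:
        """Open a new generation immediately, without waiting for any thread."""
        self.current += 1
        self.live[self.current] = {}
        self.residents[self.current] = set()
        self._reclaim()

    def _reclaim(self) -> None:
        # Free stale generations only once no thread is executing in them.
        stale = [g for g in self.live if g < self.current and not self.residents[g]]
        for gen in stale:
            del self.live[gen]
            del self.residents[gen]


cache = GenerationalCache()
cache.enter(1)
cache.enter(2)
cache.flush()                      # threads 1 and 2 keep running in generation 0
after_flush = sorted(cache.live)   # both generations are live
cache.leave(1)
regen = cache.enter(1)             # thread 1 re-enters in the new generation
cache.leave(2)                     # last resident of generation 0 leaves
after_drain = sorted(cache.live)
```

The design point is that `flush` never blocks: stale code stays mapped until its last resident thread returns to the VM, so no thread resumes in freed memory and no thread stalls waiting for the others.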
Memory Scalability of the Code Cache
– Ensuring scalability also requires carefully configuring the code stored in the cache
– Trace Lengths
  – First basic block is non-speculative, others are speculative
  – Longer traces = fewer entries in the lookup table, but more unexecuted code
  – Shorter traces = two off-trace paths at ends of basic blocks with conditional branches = more exit stub code
Effect of Trace Length on Trace Count
Effect of Trace Length on Memory
Rewriting Instructions
– Pin must regularly rewrite branches
– No atomic branch write on x86
– We use a neat trick*:
  – “old” 5-byte branch
  – 2-byte self branch
  – n-2 bytes of “new” branch
  – “new” 5-byte branch
* Sundaresan et al. 2006
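The order of stores in the self-branch trick can be sketched on a byte buffer. This Python model only illustrates the sequence of writes; it assumes that on real hardware each 2-byte store is performed atomically, which Python does not model. `jmp_rel32` and `rewrite_branch` are hypothetical helper names. The encodings used are standard x86: `E9 rel32` is a 5-byte near jump and `EB FE` is a 2-byte jump to itself.

```python
from typing import List

SELF_BRANCH = bytes([0xEB, 0xFE])  # x86 "jmp $": 2-byte branch to itself


def jmp_rel32(displacement: int) -> bytes:
    """Encode a 5-byte x86 near jump (opcode E9 plus little-endian rel32)."""
    return bytes([0xE9]) + displacement.to_bytes(4, "little", signed=True)


def rewrite_branch(code: bytearray, offset: int, new_branch: bytes) -> List[bytes]:
    """Replace the n-byte branch at offset in three stores.

    Returns a snapshot of the n patched bytes after each store, so the
    intermediate states a concurrent thread could observe are visible.
    """
    n = len(new_branch)
    assert n > 2, "the trick needs room for a 2-byte self branch"
    snapshots: List[bytes] = []

    # Store 1: park any arriving thread on a 2-byte self branch.
    code[offset:offset + 2] = SELF_BRANCH
    snapshots.append(bytes(code[offset:offset + n]))

    # Store 2: write the trailing n-2 bytes of the new branch; threads
    # spinning on the self branch never decode these bytes.
    code[offset + 2:offset + n] = new_branch[2:]
    snapshots.append(bytes(code[offset:offset + n]))

    # Store 3: overwrite the self branch with the first 2 bytes of the
    # new branch, which releases the spinning threads onto the new target.
    code[offset:offset + 2] = new_branch[:2]
    snapshots.append(bytes(code[offset:offset + n]))
    return snapshots


old = jmp_rel32(0x10)
new = jmp_rel32(0x200)
buf = bytearray(b"\x90" + old + b"\x90")  # branch surrounded by NOPs
states = rewrite_branch(buf, 1, new)
```

At every intermediate state the first two bytes are either the self branch or the start of the complete new branch, so a thread never decodes a branch that mixes old and new displacement bytes.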
Performance Results
– We use the SPEC OMP 2001 benchmarks
  – OMP_NUM_THREADS environment variable
– We compare
  – Native performance and scalability
  – Pin (no Pintool) performance and scalability
  – Pin (lightweight Pintool) scalability
    – InsCount Pintool: counts instructions at basic block (BB) granularity
  – Pin (middleweight Pintool) scalability
    – MemTrace Pintool: records memory addresses
  – Pin (heavyweight Pintool) scalability
    – CMP$im: collects memory addresses and applies a software model of the CMP cache
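Counting at basic block granularity is what keeps the InsCount tool lightweight. The Python sketch below illustrates why under simple assumptions: a program is a dictionary from block name to instruction count, an execution is a list of block names, and both function names are hypothetical. It is not the InsCount Pintool's source.

```python
from typing import Dict, List, Tuple


def count_per_instruction(blocks: Dict[str, int], path: List[str]) -> Tuple[int, int]:
    """One analysis call per executed instruction.

    Returns (instructions counted, analysis calls made).
    """
    total = calls = 0
    for name in path:
        for _ in range(blocks[name]):
            total += 1
            calls += 1
    return total, calls


def count_per_basic_block(blocks: Dict[str, int], path: List[str]) -> Tuple[int, int]:
    """One analysis call per executed block, adding the block's size."""
    total = calls = 0
    for name in path:
        total += blocks[name]  # the block size is known at instrumentation time
        calls += 1
    return total, calls


blocks = {"A": 3, "B": 5}
path = ["A", "B", "B", "A"]
per_instruction = count_per_instruction(blocks, path)
per_block = count_per_basic_block(blocks, path)
```

Both schemes report the same instruction total, but the per-block scheme makes one analysis call per block instead of one per instruction.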
Native Scalability of SPEC OMP 2001
Performance Scalability (No Instrumentation)
Performance Scalability (LightWeight Instrumentation)
Performance Scalability (MiddleWeight Instrumentation)
Performance Scalability (HeavyWeight Instrumentation)
Memory Scalability
Summary
– Dynamic instrumentation tools are useful
– In the multicore era, we must provide support for MT application analysis and simulation
– Providing MT support in Pin was easy
– Making it robust and scalable was not easy