Day 3: Using Process Virtualization Systems Kim Hazelwood ACACES Summer School July 2009.

Slides:

Advertisements

Similar presentations

RAMP Gold : An FPGA-based Architecture Simulator for Multiprocessors Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic Parallel Computing Lab,

Advertisements

A Binary Agent Technology for COTS Software Integrity Richard Schooler Anant Agarwal InCert Software.

Profiler In software engineering, profiling ("program profiling", "software profiling") is a form of dynamic program analysis that measures, for example,

Instrumentation of Linux Programs with Pin Robert Cohn & C-K Luk Platform Technology & Architecture Development Enterprise Platform Group Intel Corporation.

Software & Services Group PinPlay: A Framework for Deterministic Replay and Reproducible Analysis of Parallel Programs Harish Patil, Cristiano Pereira,

Integrity & Malware Dan Fleck CS469 Security Engineering Some of the slides are modified with permission from Quan Jia. Coming up: Integrity – Who Cares?

- 1 - Copyright © 2006 Intel Corporation. All Rights Reserved. Fault Analysis Using Pin Srilatha (Bobbie) Manne Intel.

Accurately Approximating Superscalar Processor Performance from Traces Kiyeon Lee, Shayne Evans, and Sangyeun Cho Dept. of Computer Science University.

Pin : Building Customized Program Analysis Tools with Dynamic Instrumentation Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff.

Pin PLDI Tutorial Advantages of Pin Instrumentation Easy-to-use Instrumentation: Uses dynamic instrumentation –Do not need source code, recompilation,

SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood.

Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.

Enabling Efficient On-the-fly Microarchitecture Simulation Thierry Lafage September 2000.

Pin Tutorial Robert Cohn Intel. Pin Tutorial Academia Sinica About Me Robert Cohn –Original author of Pin –Senior Principal Engineer at Intel –Ph.D.

Software & Services Group PinADX: Customizable Debugging with Dynamic Instrumentation Gregory Lueck, Harish Patil, Cristiano Pereira Intel Corporation.

Pipelined Profiling and Analysis on Multi-core Systems Qin Zhao Ioana Cutcutache Weng-Fai Wong PiPA.

Microprocessors AMD Hammer AMD’s High Stakes RISC Entry May 2 nd, 2002.

Aamer Jaleel Intel® Corporation, VSSAD June 17, 2006

RUGRAT: Runtime Test Case Generation using Dynamic Compilers Ben Breech NASA Goddard Space Flight Center Lori Pollock John Cavazos University of Delaware.

Quiz Wei Hsu 8/16/2006. Which of the following instructions are speculative in nature? A)Data cache prefetch instruction B)Non-faulting loads C)Speculative.

Threads 1 CS502 Spring 2006 Threads CS-502 Spring 2006.

San Diego Supercomputer Center Performance Modeling and Characterization Lab PMaC Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation.

Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.

University of Colorado

Pin2 Tutorial1 Pin Tutorial Kim Hazelwood Robert Muth VSSAD Group, Intel.

Pin PLDI Tutorial Kim Hazelwood David Kaeli Dan Connors Vijay Janapa Reddi.

About Us Kim Hazelwood Vijay Janapa Reddi

Software & Services Group 1 Pin: Intel’s Dynamic Binary Instrumentation Engine Pin Tutorial Intel Corporation Presented By: Tevi Devor CGO ISPASS 2012.

Software Data Prefetching Mohammad Al-Shurman & Amit Seth Instructor: Dr. Aleksandar Milenkovic Advanced Computer Architecture CPE631.

1 Introduction to SimpleScalar (Based on SimpleScalar Tutorial) CPSC 614 Texas A&M University.

Analyzing parallel programs with Pin Moshe Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazelwood, Aamer Jaleel, Chi- Keung Luk,

Process Virtualization and Symbiotic Optimization Kim Hazelwood ACACES Summer School July 2009.

Operating Systems ECE344 Ashvin Goel ECE University of Toronto Threads and Processes.

Pin: Intel’s Dynamic Binary Instrumentation Engine Pin Tutorial

PMaC Performance Modeling and Characterization Performance Modeling and Analysis with PEBIL Michael Laurenzano, Ananta Tiwari, Laura Carrington Performance.

- 1 - Copyright © 2006 Intel Corporation. All Rights Reserved. Using the Pin Instrumentation Tool for Computer Architecture Research Aamer Jaleel, Chi-Keung.

Pin Tutorial Kim Hazelwood David Kaeli Dan Connors Vijay Janapa Reddi.

1 Instrumentation of Intel® Itanium® Linux* Programs with Pin download: Robert Cohn MMDC Intel * Other names and brands.

1 Software Instrumentation and Hardware Profiling for Intel® Itanium® Linux* CGO’04 Tutorial 3/21/04 Robert Cohn, Intel Stéphane Eranian, HP CK Luk, Intel.

Computer Science Detecting Memory Access Errors via Illegal Write Monitoring Ongoing Research by Emre Can Sezer.

Processes and Threads CS550 Operating Systems. Processes and Threads These exist only at execution time They have fast state changes -> in memory and.

NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA,SURATHKAL Presentation on ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS Publisher’s:

Dynamic Compilation and Modification CS 671 April 15, 2008.

Environment Selection Application  Firefox 1.0 or 2.0  Apache Operating System  Linux  Windows XP Instrumentation Package  JIT (DynamoRio,

Assembly Code Optimization Techniques for the AMD64 Athlon and Opteron Architectures David Phillips Robert Duckles Cse 520 Spring 2007 Term Project Presentation.

Scalable Support for Multithreaded Applications on Dynamic Binary Instrumentation Systems Kim Hazelwood Greg Lueck Robert Cohn.

Day 2: Building Process Virtualization Systems Kim Hazelwood ACACES Summer School July 2009.

Advanced Computer Architecture Lab University of Michigan Compiler Controlled Value Prediction with Branch Predictor Based Confidence Eric Larson Compiler.

JIT Instrumentation – A Novel Approach To Dynamically Instrument Operating Systems Marek Olszewski Keir Mierle Adam Czajkowski Angela Demke Brown University.

Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.

Part Two: Optimizing Pintools Robert Cohn Kim Hazelwood.

A Binary Agent Technology for COTS Software Integrity Anant Agarwal Richard Schooler InCert Software.

Source Level Debugging of Parallel Programs Roland Wismüller LRR-TUM, TU München Germany.

Performance Optimization of Pintools C K Luk Copyright © 2006 Intel Corporation. All Rights Reserved. Reducing Instrumentation Overhead Total Overhead.

CS717 1 Hardware Fault Tolerance Through Simultaneous Multithreading (part 2) Jonathan Winter.

1 ROGUE Dynamic Optimization Framework Using Pin Vijay Janapa Reddi PhD. Candidate - Electrical And Computer Engineering University of Colorado at Boulder.

Correct RelocationMarch 20, 2016 Correct Relocation: Do You Trust a Mutated Binary? Drew Bernat

C++ Functions A bit of review (things we’ve covered so far)

Qin Zhao1, Joon Edward Sim2, WengFai Wong1,2 1SingaporeMIT Alliance 2Department of Computer Science National University of Singapore

PINTOS: An Execution Phase Based Optimization and Simulation Tool) PINTOS: An Execution Phase Based Optimization and Simulation Tool) Wei Hsu, Jinpyo Kim,

15-740/ Computer Architecture Lecture 3: Performance

Pin: Intel’s Dynamic Binary Instrumentation Engine Pin Tutorial

Kim Hazelwood Robert Cohn Intel SPI-ST

PinADX: Customizable Debugging with Dynamic Instrumentation

Pin: Intel’s Dynamic Binary Instrumentation Engine Pin Tutorial

Workshop in Nihzny Novgorod State University Activity Report

Advantages of Pin Instrumentation

Hardware Multithreading

Advanced Computer Architecture

Dynamic Binary Translators and Instrumenters

Presentation transcript:

Day 3: Using Process Virtualization Systems Kim Hazelwood ACACES Summer School July 2009

ACACES 2009 – Process Virtualization Research Applications Computer Architecture Trace Generation Fault Tolerance Studies Emulating New Instructions Program Analysis Code coverage Call-graph generation Memory-leak detection Instruction profiling Multicore Thread analysis –Thread profiling –Race detection Cache simulations Compilers Compare programs from competing compilers Security Add security checks and features 1

ACACES 2009 – Process Virtualization 2 Specific Applications Sample tools in the Pin distribution: Cache simulators, branch predictors, address tracer, syscall tracer, edge profiler, stride profiler Some tools developed and used inside Intel: Opcodemix (analyze code generated by compilers) PinPoints (find representative regions in programs to simulate) A tool for detecting memory bugs Companies are writing their own Pintools Universities use Pin in teaching and research

ACACES 2009 – Process Virtualization 3 Compiler Bug Detection Opcodemix uncovered a compiler bug for crafty Instruction Type Compiler A Count Compiler B Count Delta *total712M618M-94M XORL94M 0M TESTQ94M 0M RET94M 0M PUSHQ94M0M-94M POPQ94M0M-94M JE94M0M-94M LEAQ37M 0M JNZ37M131M94M

ACACES 2009 – Process Virtualization 4 Thread Checker Basics Detect common parallel programming bugs: Data races, deadlocks, thread stalls, threading API usage violations Instrumentation used: Memory operations Synchronization operations (via function replacement) Call stack Pin-based prototype Runs on Linux, x86 and x86_64 A Pintool ~2500 C++ lines

ACACES 2009 – Process Virtualization 5 Thread Checker Results Potential errors in SPECOMP01 reported by Thread Checker (4 threads were used)

ACACES 2009 – Process Virtualization 6 a documented data race in the art benchmark is detected

ACACES 2009 – Process Virtualization 7 Fast exploratory studies Instrumentation ~= native execution Simulation speeds at MIPS Characterize complex applications E.g. Oracle, Java, parallel data-mining apps Simple to build instrumentation tools Tools can feed simulation models in real time Tools can gather instruction traces for later use Instrumentation-Driven Simulation

ACACES 2009 – Process Virtualization 8 Performance Models Branch Predictor Models: PC of conditional instructions Direction Predictor: Taken/not-taken information Target Predictor: PC of target instruction if taken Cache Models: Thread ID (if multi-threaded workload) Memory address Size of memory operation Type of memory operation (Read/Write) Simple Timing Models: Latency information

ACACES 2009 – Process Virtualization 9 Branch Predictor Model BP Model BPSim Pin Tool Pin Instrumentation RoutinesAnalysis RoutinesInstrumentation Tool API() Branch instr infoAPI data BPSim Pin Tool Instruments all branches Uses API to set up call backs to analysis routines Branch Predictor Model: Detailed branch predictor simulator

ACACES 2009 – Process Virtualization 10 BranchPredictor myBPU; VOID ProcessBranch(ADDRINT PC, ADDRINT targetPC, bool BrTaken) { BP_Info pred = myBPU.GetPrediction( PC ); if( pred.Taken != BrTaken ) { // Direction Mispredicted } if( pred.predTarget != targetPC ) { // Target Mispredicted } myBPU.Update( PC, BrTaken, targetPC); } VOID Instruction(INS ins, VOID *v) { if( INS_IsDirectBranchOrCall(ins) || INS_HasFallThrough(ins) ) INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) ProcessBranch, ADDRINT, INS_Address(ins), IARG_UINT32, INS_DirectBranchOrCallTargetAddress(ins), IARG_BRANCH_TAKEN, IARG_END); } int main() { PIN_Init(); INS_AddInstrumentationFunction(Instruction, 0); PIN_StartProgram(); } INSTRUMENT BP Implementation ANALYSIS MAIN

ACACES 2009 – Process Virtualization 11 Performance Model Inputs Branch Predictor Models: PC of conditional instructions Direction Predictor: Taken/not-taken information Target Predictor: PC of target instruction if taken Cache Models: Thread ID (if multi-threaded workload) Memory address Size of memory operation Type of memory operation (Read/Write) Simple Timing Models: Latency information

ACACES 2009 – Process Virtualization 12 Cache Model Cache Pin Tool Pin Instrumentation RoutinesAnalysis RoutinesInstrumentation Tool API() Mem Addr infoAPI data Cache Pin Tool Instruments all instructions that reference memory Use API to set up call backs to analysis routines Cache Model: Detailed cache simulator Cache Simulators

ACACES 2009 – Process Virtualization 13 CACHE_t CacheHierarchy[MAX_NUM_THREADS][MAX_NUM_LEVELS]; VOID MemRef(int tid, ADDRINT addrStart, int size, int type) { for(addr=addrStart; addr<(addrStart+size); addr+=LINE_SIZE) LookupHierarchy( tid, FIRST_LEVEL_CACHE, addr, type); } VOID LookupHierarchy(int tid, int level, ADDRINT addr, int accessType){ result = cacheHier[tid][cacheLevel]->Lookup(addr, accessType ); if( result == CACHE_MISS ) { if( level == LAST_LEVEL_CACHE ) return; LookupHierarchy(tid, level+1, addr, accessType); } VOID Instruction(INS ins, VOID *v) { if( INS_IsMemoryRead(ins) ) INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) MemRef, IARG_THREAD_ID, IARG_MEMORYREAD_EA, IARG_MEMORYREAD_SIZE, IARG_UINT32, ACCESS_TYPE_LOAD, IARG_END); if( INS_IsMemoryWrite(ins) ) INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) MemRef, IARG_THREAD_ID, IARG_MEMORYWRITE_EA, IARG_MEMORYWRITE_SIZE, IARG_UINT32, ACCESS_TYPE_STORE, IARG_END); } int main() { PIN_Init(); INS_AddInstrumentationFunction(Instruction, 0); PIN_StartProgram(); } INSTRUMENT Cache Implementation ANALYSIS MAIN

ACACES 2009 – Process Virtualization 14 Moving from 32-bit to 64-bit Applications How to identify the reasons for these performance results?  Profiling! BenchmarkLanguage64-bit vs. 32-bit speedup perlbenchC3.42% bzip2C15.77% gccC-18.09% mcfC-26.35% gobmkC4.97% hmmerC34.34% sjengC14.21% libquantumC35.38% h264refC35.35% omnetppC % astarC++8.46% xalancbmkC % Average7.16% Ye06, IISWC2006

ACACES 2009 – Process Virtualization 15 Main Observations In 64-bit mode: Code size increases (10%) Dynamic instruction count decreases Code density increases L1 icache request rate increases L1 dcache request rate decreases significantly Data cache miss rate increases

ACACES 2009 – Process Virtualization 16 Changing Program Code (Probe-mode) PIN_ReplaceProbed (RTN, AFUNPTR) –Redirect control flow to new functions in the Pin Tool PIN_ReplaceSignatureProbed (RTN, AFUNPTR, …) –(1) Redirect control flow (2) Rewrite function prototypes (3) Use Pin arguments (IARG’s) Original Stream foofoo’ foo() (Original)(Replacement)

ACACES 2009 – Process Virtualization 17 typedef VOID * (*FUNCPTR_MALLOC)(size_t); VOID * MyMalloc(FUNCPTR_MALLOC orgMalloc, UINT32 size, ADDRINT returnIp) { FUNCPTR_MALLOC poolMalloc = LookupMallocPool(returnIp, size); return (poolMalloc) ? poolMalloc(size) : orgMalloc(size); } VOID ImageLoad(IMG img, VOID *v) { RTN mallocRTN = RTN_FindByName(img, "malloc"); if (RTN_Valid(rtn)) { PROTO prototype = PROTO_Allocate(PIN_PARG(void *), CALLINGSTD_CDECL, "malloc", PIN_PARG(int), PIN_PARG_END()); RTN_ReplaceSignatureProbed(mallocRTN, (AFUNPTR) MyMalloc, IARG_PROTOTYPE, prototype, /* Function prototype */ IARG_ORIG_FUNCPTR, /* Handle to application’s malloc */ IARG_FUNCARG_ENTRYPOINT_VALUE, 0, /* First argument to malloc */ IARG_RETURN_IP, /* IP of caller */ IARG_END); PROTO_Free( proto_malloc ); } Replacing malloc() in an Application

ACACES 2009 – Process Virtualization 18 Total Overhead = Pin Overhead + Pintool Overhead ~5% for SPECfp and ~50% for SPECint Pin team’s job is to minimize this Usually much larger than pin overhead Pintool writers can help minimize this! Reducing Overhead in Your Tools

ACACES 2009 – Process Virtualization 19 Adding User Instrumentation

ACACES 2009 – Process Virtualization 20 Instrumentation Routines Overhead Pintool’s Overhead Frequency of calling an Analysis Routine Work required for transiting to Analysis Routine Reducing the Pintool’s Overhead Analysis Routines Overhead + Work required in the Analysis Routine x Work done inside Analysis Routine

ACACES 2009 – Process Virtualization 21 Reducing Work in Analysis Routines Key: Shift computation from analysis routines to instrumentation routines whenever possible This usually has the largest speedup

ACACES 2009 – Process Virtualization 22 Edge Counting: Slower Version... void docount2(ADDRINT src, ADDRINT dst, INT32 taken) { COUNTER *pedg = Lookup(src, dst); pedg->count += taken; } void Instruction(INS ins, void *v) { if (INS_IsBranchOrCall(ins)) { INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount2, IARG_INST_PTR, IARG_BRANCH_TARGET_ADDR, IARG_BRANCH_TAKEN, IARG_END); }...

ACACES 2009 – Process Virtualization 23 Edge Counting: Faster Version void docount(COUNTER* pedge, INT32 taken) { pedg->count += taken; } void docount2(ADDRINT src, ADDRINT dst, INT32 taken) { COUNTER *pedg = Lookup(src, dst); pedg->count += taken; } void Instruction(INS ins, void *v) { if (INS_IsDirectBranchOrCall(ins)) { COUNTER *pedg = Lookup(INS_Address(ins), INS_DirectBranchOrCallTargetAddress(ins)); INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) docount, IARG_ADDRINT, pedg, IARG_BRANCH_TAKEN, IARG_END); } else INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) docount2, IARG_INST_PTR, IARG_BRANCH_TARGET_ADDR, IARG_BRANCH_TAKEN, IARG_END); } …

ACACES 2009 – Process Virtualization 24 Key: Instrument at the largest granularity whenever possible –Instead of inserting one call per instruction –Insert one call per basic block or trace Analysis Routines: Reduce Call Frequency

ACACES 2009 – Process Virtualization 25 Slower Instruction Counting sub$0xff, %edx cmp%esi, %edx jle mov$0x1, %edi add$0x10, %eax counter++; Recall our standard example:

ACACES 2009 – Process Virtualization 26 Faster Instruction Counting sub$0xff, %edx cmp%esi, %edx jle mov$0x1, %edi add$0x10, %eax counter += 3 counter += 2 Counting at BBL level sub$0xff, %edx cmp%esi, %edx jle mov$0x1, %edi add$0x10, %eax counter += 5 Counting at Trace level counter+=3 L1

ACACES 2009 – Process Virtualization 27 Reducing Work for Analysis Transitions Reduce number of arguments to analysis routines Inline analysis routines Pass arguments in registers Instrumentation scheduling

ACACES 2009 – Process Virtualization 28 Reduce Number of Arguments Eliminate arguments only used for debugging Instead of passing TRUE/FALSE, create 2 analysis functions –Instead of inserting a call to: Analysis(BOOL val) –Insert a call to one of these: AnalysisTrue() AnalysisFalse() IARG_CONTEXT is very expensive (> 10 arguments)

ACACES 2009 – Process Virtualization 29 Inlining int docount0(int i) { x[i]++ return x[i]; } Inlinable int docount1(int i) { if (i == 1000) x[i]++; return x[i]; } Not-inlinable int docount2(int i) { x[i]++; printf(“%d”, i); return x[i]; } Not-inlinable void docount3() { for(i=0;i<100;i++) x[i]++; } Not-inlinable Pin will inline analysis functions into application code

ACACES 2009 – Process Virtualization 30 Inlining Inlining decisions are recorded in pin.log with log_inline pin –xyzzy –mesgon log_inline –t mytool – app Analysis function at 0x2a c CAN be inlined Analysis function at 0x2a a is not inlinable because the last instruction of the first bbl fetched is not a ret instruction. The first bbl fetched: ============================================================================ bbl[5:UNKN]: [p: ?,n: ? ] [____] rtn[ ? ] x x a a push rbp 32 0x x a b mov rbp, rsp 33 0x x a e mov rax, qword ptr [rip+0x3ce2b3] 34 0x x a inc dword ptr [rax] 35 0x x a mov rax, qword ptr [rip+0x3ce2aa] 36 0x x a e cmp dword ptr [rax], 0xf x x a965185a4 jnz 0x11

ACACES 2009 – Process Virtualization 31 Passing Arguments in Registers 32 bit platforms pass arguments on stack Passing arguments in registers helps small inlined functions VOID PIN_FAST_ANALYSIS_CALL docount(ADDRINT c) { icount += c; } BBL_InsertCall(bbl, IPOINT_ANYWHERE, AFUNPTR(docount), IARG_FAST_ANALYSIS_CALL, IARG_UINT32, BBL_NumIns(bbl), IARG_END);

ACACES 2009 – Process Virtualization 32 Conditional Inlining Inline a common scenario where the analysis routine has a single “if-then” The “If” part is always executed The “then” part is rarely executed Useful cases: 1.“If” can be inlined, “Then” is not 2.“If” has small number of arguments, “then” has many arguments (or IARG_CONTEXT) Pintool writer breaks analysis routine into two: INS_InsertIfCall (ins, …, (AFUNPTR)doif, …) INS_InsertThenCall (ins, …, (AFUNPTR)dothen, …)

ACACES 2009 – Process Virtualization 33 IP-Sampling (a Slower Version) VOID Instruction(INS ins, VOID *v) { INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)IpSample, IARG_INST_PTR, IARG_END); } VOID IpSample(VOID* ip) { --icount; if (icount == 0) { fprintf(trace, “%p\n”, ip); icount = N + rand()%M; //icount is between } const INT32 N = 10000; const INT32 M = 5000; INT32 icount = N;

ACACES 2009 – Process Virtualization 34 IP-Sampling (a Faster Version) VOID Instruction(INS ins, VOID *v) { // CountDown() is always called before an inst is executed INS_InsertIfCall(ins, IPOINT_BEFORE, (AFUNPTR)CountDown, IARG_END); // PrintIp() is called only if the last call to CountDown() // returns a non-zero value INS_InsertThenCall(ins, IPOINT_BEFORE, (AFUNPTR)PrintIp, IARG_INST_PTR, IARG_END); } INT32 CountDown() { --icount; return (icount==0); } VOID PrintIp(VOID *ip) { fprintf(trace, “%p\n”, ip); icount = N + rand()%M; //icount is between } inlined not inlined

ACACES 2009 – Process Virtualization 35 Instrumentation Scheduling If an instrumentation can be inserted anywhere in a basic block: Let Pin know via IPOINT_ANYWHERE Pin will find the best point to insert the instrumentation to minimize register spilling

ACACES 2009 – Process Virtualization 36 ManualExamples/inscount1.cpp #include #include "pin.H“ UINT64 icount = 0; void docount(INT32 c) { icount += c; } void Trace(TRACE trace, void *v) { for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) { BBL_InsertCall(bbl,IPOINT_ANYWHERE,(AFUNPTR)docount, IARG_UINT32, BBL_NumIns(bbl), IARG_END); } void Fini(INT32 code, void *v) { fprintf(stderr, "Count %lld\n", icount); } int main(int argc, char * argv[]) { PIN_Init(argc, argv); TRACE_AddInstrumentFunction(Trace, 0); PIN_AddFiniFunction(Fini, 0); PIN_StartProgram(); return 0; } analysis routine instrumentation routine

ACACES 2009 – Process Virtualization 37 Optimizing Your Tools - Summary Baseline system has fairly low overhead (~5-20%) Adding instrumentation can increase overhead significantly, but you can help! 1.Move work from analysis to instrumentation routines 2.Explore larger granularity instrumentation 3.Explore conditional instrumentation 4.Understand when inlining will occur

ACACES 2009 – Process Virtualization 38 How use of process VMs for: Program Analysis –Bug detection, thread analysis Computer architecture –Branch predictors, cache simulators, timing models, architecture width –Moving from 32-bit to 64-bit How to use the tools “well” Simple tricks for good performance What Did We Learn Today?

ACACES 2009 – Process Virtualization 39 Want More Info? Talk to your friends! (Lots of people writing tools with Pin, DynamoRIO, Valgrind) Each tool has a mailing list Pin DynamoRIOcode.google.com/p/dynamorio Valgrindwww.valgrind.org Day 1 – What is Process Virtualization? Day 2 – Building Process Virtualization Systems Day 3 – Using Process Virtualization Systems Day 4 – Symbiotic Optimization