Pin: Intel’s Dynamic Binary Instrumentation Engine Pin Tutorial


Pin: Intel's Dynamic Binary Instrumentation Engine
Pin Tutorial
Presented by: Robert Cohn, Tevi Devor (Intel Corporation)
CGO 2010

Agenda
  Part 1: Introduction to Pin
  Part 2: Larger Pin tools and writing efficient Pin tools
  Part 3: Deeper into the Pin API
  Part 4: Advanced Pin API
  Part 5: Performance numbers

Part 1: Introduction to Pin
  Dynamic binary instrumentation
  Pin capabilities
  Overview of how Pin works
  Sample Pin tools

What Does "Pin" Stand For?
Three-letter acronyms (TLAs) at Intel:
  26^3 possible TLAs
  26^3 - 1 are in use at Intel
  Only 1 is not approved for use at Intel. Guess which one.
Pin is not an acronym.
Pin is based on the post-link optimizer Spike:
  Use dynamic code generation to make a less intrusive profile-guided optimization and instrumentation system
  Pin is a small Spike
  Spike is EOL
http://www.cgo.org/cgo2004/papers/01_82_luk_ck.pdf

Pin is a dynamic binary instrumentation engine
Instrumentation: a technique that inserts code into a program to collect run-time information
  Program analysis: performance profiling, error detection, capture & replay
  Architectural study: processor and cache simulation, trace collection
Instrumentation approaches:
  Source-code instrumentation
  Static binary instrumentation
  Dynamic binary instrumentation
Dynamic binary instrumentation:
  Instrument code just before it runs (just in time, JIT)
  No need to recompile or re-link
  Discover code at runtime
  Handle dynamically generated code
  Attach to running processes

Advantages of Pin Instrumentation
Programmable instrumentation:
  Provides a rich set of APIs for writing your own instrumentation tools, called PinTools, in C, C++, or assembly
  APIs are designed to maximize ease of use and abstract away the underlying instruction set idiosyncrasies
Multiplatform:
  Supports IA-32, Intel64, IA-64
  Supports Linux, Windows, MacOS
Robust:
  Can instrument real-life applications: databases, web browsers, ...
  Can instrument multithreaded applications
  Supports signals and exceptions, self-modifying code, ...
  If you can run it, you can Pin it
Efficient:
  Applies compiler optimizations to instrumentation code
Pin can be used to instrument all the user-level code in an application

Pin Instrumentation Capabilities
Use Pin APIs to write PinTools that:
  Replace application functions with your own
    Call the original application function from within your replacement function
  Fully examine any application instruction, and insert a call to your instrumenting function to be executed whenever that instruction executes
    Pass parameters to your instrumenting function from a large set of supported parameters:
      Register values (including IP), register values by reference (for modification)
      Memory addresses read/written by the instruction
      Full register context
      ...
  Track function calls, including syscalls, and examine/change arguments
  Track application threads
  Intercept signals
  Instrument a process tree
  Many other capabilities...
If Pin doesn't have it, you don't want it

Usage of Pin at Intel
Profiling and analysis products: Intel Parallel Studio
  Amplifier (performance analysis)
    Lock and waits analysis
    Concurrency analysis
  Inspector (correctness analysis)
    Threading error detection (data race and deadlock)
    Memory error detection
Architectural research and enabling
  Emulating new instructions (Intel SDE)
  Trace generation
  Branch prediction and cache modeling
(Diagram labels: GUI, Algorithm, PinTool, Pin)

Example Pin-tools
  SDE: http://software.intel.com/en-us/articles/intel-software-development-emulator
  CMP$IM: http://www-mount.ece.umn.edu/~jjyi/MoBS/2008/program/02A-Jaleel.pdf
  PinPlay: paper presented at CGO 2010, http://www.cgo.org/cgo2010/program.html

Pin Usage Outside Intel
Popular and well supported
  30,000+ downloads, 400+ citations
Free download
  www.pintool.org
  Includes: detailed user manual, source code for 100s of Pin tools
Pin User Group (PinHeads)
  http://tech.groups.yahoo.com/group/pinheads/
  Pin users and Pin developers answer questions

Pin Invocation
Command line: pin.exe -t inscount.dll -- gzip.exe input.txt
(inscount.dll is a PinTool that counts application instructions executed and prints the count at the end, for example: Count 258743109)

Launcher process (PIN.EXE):
  CreateProcess(gzip.exe, input.txt, suspended)
  GetContext(&firstAppIp)
  Inject the Pin BootRoutine and its data into the application: WriteProcessMemory(BootRoutine, BootData)
  SetContext(BootRoutineIp)
  Resume at BootRoutine

Application process:
  Boot routine (data: firstAppIp, "inscount.dll"): load PINVM.DLL, then start PINVM.DLL running (firstAppIp, "inscount.dll")
  PINVM.DLL loads inscount.dll and runs its main()
  Starting at the first application IP:
    Read a trace from application code
    Jit it, adding instrumentation code from inscount.dll
    Encode the jitted trace into the code cache
    Execute the jitted code
  When execution of a trace ends, call into PINVM.DLL to jit the next trace, passing in the app IP of the trace's target
  The source trace's exit branch is then modified to branch directly to the destination trace

(Diagram labels for the application process: Application Code and Data; Boot Routine + Data; inscount.dll; PIN.LIB; Code Cache; PINVM.DLL with System Call Dispatcher, Event Dispatcher, Thread Dispatcher, Decoder, Encoder; NTDLL.DLL; Windows kernel)
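The jit / code cache / trace-linking cycle on the invocation slide can be modeled in a few lines of standalone C++. This is only a sketch under stated assumptions, not Pin code: `ToyVm`, `ToyTrace`, and the idea of describing the application as a map from a trace's start IP to its single successor IP are all hypothetical simplifications made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>

// Hypothetical toy model of a jitted trace (not a Pin data structure).
struct ToyTrace {
    std::uint64_t targetAppIp; // app IP the trace exits to
    bool exitLinked;           // true once the exit branches directly to the target trace
};

class ToyVm {
public:
    // The toy "application": start IP of each trace -> IP of its single successor.
    explicit ToyVm(std::unordered_map<std::uint64_t, std::uint64_t> successors)
        : successors_(std::move(successors)) {}

    std::uint64_t jitCount = 0;  // traces compiled into the code cache
    std::uint64_t vmEntries = 0; // times control entered the VM

    void run(std::uint64_t firstAppIp, std::uint64_t traceExecutions) {
        ++vmEntries; // initial entry, as from the boot routine
        lookupOrJit(firstAppIp);
        std::uint64_t ip = firstAppIp;
        for (std::uint64_t i = 0; i < traceExecutions; ++i) {
            if (!codeCache_.at(ip).exitLinked) {
                // Unlinked exit: go back to the VM with the target's app IP,
                // jit the target if needed, then patch the exit branch.
                ++vmEntries;
                lookupOrJit(codeCache_.at(ip).targetAppIp);
                codeCache_.at(ip).exitLinked = true;
            }
            ip = codeCache_.at(ip).targetAppIp; // "branch" to the next trace
        }
    }

private:
    void lookupOrJit(std::uint64_t appIp) {
        if (codeCache_.count(appIp) != 0) return; // already in the code cache
        assert(successors_.count(appIp) != 0);    // toy app must define every target
        ++jitCount;
        codeCache_[appIp] = ToyTrace{successors_.at(appIp), false};
    }

    std::unordered_map<std::uint64_t, std::uint64_t> successors_;
    std::unordered_map<std::uint64_t, ToyTrace> codeCache_;
};
```

For a two-trace loop, each trace is jitted once and each exit enters the VM once; after that, execution stays inside the code cache no matter how many iterations run.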

All code in this presentation is covered by the following: /*BEGIN_LEGAL Intel Open Source License Copyright (c) 2002-2010 Intel Corporation. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the Intel Corporation nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE INTEL OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. END_LEGAL */

Instruction Counting Tool (inscount.dll)

    #include <iostream>
    #include "pin.H"

    UINT64 icount = 0;

    // Execution time routine (analysis routine)
    void docount() { icount++; }

    // Jitting time routine (instrumentation routine): Pin callback
    void Instruction(INS ins, void *v) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
    }

    // Pin callback: called just before the process exits
    void Fini(INT32 code, void *v) {
        std::cerr << "Count " << icount << std::endl;
    }

    int main(int argc, char * argv[]) {
        PIN_Init(argc, argv);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram(); // Never returns
        return 0;
    }

Code sequence for a docount call that is not inlined:

    switch to pin stack
    save registers
    call docount
    restore registers
    switch to app stack

Jitted code with docount inlined:

    inc icount
    sub $0xff, %edx
    inc icount
    cmp %esi, %edx
    save eflags
    inc icount
    restore eflags
    jle <L1>
    inc icount
    mov 0x1, %edi

The slide shows the full inscount.dll source code. icount is the variable that accumulates the instruction count. The Pin tool gets Pin to insert a call to the docount function before each jitted instruction, and to call the Fini function just before the process exits.

pinvm.dll calls the tool's main().
PIN_Init: initializes parts of Pin.
INS_AddInstrumentFunction: directs Pin to call the Pin tool function Instruction as part of the jitting of each instruction. Pin passes to this function, as its first parameter, a handle to the instruction being jitted; later examples show uses of this handle.
INS_InsertCall: directs Pin to insert a call to the docount function before the ins being jitted. IARG_END marks the end of the argument list for docount; in this case there are no arguments.

The function Instruction is referred to as an "instrumentation routine". The function docount is referred to as an "analysis routine". In this example the analysis routine does not take any parameters, but Pin supports a large variety of arguments that can be passed to an analysis routine.

The jitted code example shows how Pin can inline the code of docount to produce more efficient instrumentation code.
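The distinction between the instrumentation routine (runs once, when an instruction is jitted) and the analysis routine (runs every time the jitted instruction executes) can be illustrated without Pin. The following is a standalone toy model, not Pin code; `ToyJit`, `JittedIns`, and `runLoop` are hypothetical names invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical toy model: a "jitted" instruction is just the list of
// analysis calls that were attached to it at jitting time.
struct JittedIns {
    std::vector<std::function<void()>> analysisCalls;
};

struct ToyJit {
    std::uint64_t icount = 0;          // updated by the analysis routine
    std::uint64_t instrumentCalls = 0; // how often the instrumentation routine ran

    // Analysis routine: runs each time an instrumented instruction executes.
    void docount() { ++icount; }

    // Instrumentation routine: runs once per instruction, at jitting time.
    void instrument(JittedIns& ins) {
        ++instrumentCalls;
        ins.analysisCalls.push_back([this] { docount(); });
    }

    // Jit a straight-line body of numIns instructions once, then execute the
    // jitted copy `iterations` times, as code in a code cache would be.
    void runLoop(std::size_t numIns, std::size_t iterations) {
        assert(numIns > 0);
        std::vector<JittedIns> jitted(numIns);
        for (JittedIns& ins : jitted) instrument(ins);
        for (std::size_t i = 0; i < iterations; ++i)
            for (const JittedIns& ins : jitted)
                for (const auto& call : ins.analysisCalls) call();
    }
};
```

In this model, a 4-instruction loop body executed 1000 times triggers the instrumentation routine 4 times and the analysis routine 4000 times, which is why the cost of the analysis routine dominates tool overhead.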

(Pin invocation diagram repeated for pin.exe -t inscount.dll -- gzip.exe input.txt, highlighting the step: read a trace from application code, jit it, adding instrumentation code from inscount.dll, and encode the jitted trace into the code cache.)

Traces and BBLs
(Diagram: original code, BBL#1 through BBL#7, with fall-through (FT) and taken (TK) edges, next to a trace built from BBL#1, BBL#2 and BBL#3, with an early exit via stub and a trace exit via stub.)

Trace: a sequence of continuous instructions with one entry point.
BBL: has one entry point and ends at the first control transfer instruction.

Pin jits application code one trace at a time; on the Pin invocation slide this was the orange rectangle being read into pinvm. The previous slide used INS_AddInstrumentFunction, which instruments at the INS level. Pin also provides TRACE_AddInstrumentFunction to instrument at the trace level.

A trace is a sequence of basic blocks that fall into each other. Once execution of a basic block starts, it executes until the end of the basic block.

Faster instruction counting: instead of incrementing a counter by 1 on each instruction executed, increment the counter by the number of instructions in the basic block each time the basic block is executed. Note: it does not matter where in the basic block the instrumentation code is inserted.
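The saving from counting per basic block can be made concrete with a standalone sketch. This is not Pin code: `ToyIns`, `splitIntoBbls`, and the two counting functions are hypothetical, and the sketch assumes a straight-line sequence in which every block runs to completion on each pass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy instruction: only records whether it transfers control.
struct ToyIns {
    bool isControlTransfer;
};

// Split an instruction sequence into basic-block sizes: each block ends at
// the first control transfer instruction.
std::vector<std::uint32_t> splitIntoBbls(const std::vector<ToyIns>& code) {
    std::vector<std::uint32_t> sizes;
    std::uint32_t current = 0;
    for (const ToyIns& ins : code) {
        ++current;
        if (ins.isControlTransfer) {
            sizes.push_back(current);
            current = 0;
        }
    }
    if (current != 0) sizes.push_back(current); // trailing block without a branch
    return sizes;
}

struct CountResult {
    std::uint64_t icount;        // instructions counted
    std::uint64_t analysisCalls; // analysis-routine invocations needed
};

// Per-instruction counting: one analysis call per executed instruction.
CountResult countPerInstruction(const std::vector<std::uint32_t>& bbls, std::uint64_t runs) {
    CountResult r{0, 0};
    for (std::uint64_t i = 0; i < runs; ++i)
        for (std::uint32_t n : bbls)
            for (std::uint32_t k = 0; k < n; ++k) {
                r.icount += 1;
                ++r.analysisCalls;
            }
    return r;
}

// Per-BBL counting: one analysis call per executed block, adding its size.
CountResult countPerBbl(const std::vector<std::uint32_t>& bbls, std::uint64_t runs) {
    CountResult r{0, 0};
    for (std::uint64_t i = 0; i < runs; ++i)
        for (std::uint32_t n : bbls) {
            r.icount += n;
            ++r.analysisCalls;
        }
    return r;
}
```

Both strategies produce the same instruction count; the per-BBL version simply needs one analysis call per block instead of one per instruction.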

ManualExamples/inscount2.cpp

    #include <cstdio>
    #include "pin.H"

    UINT64 icount = 0;

    // Analysis routine: adds the number of instructions in the BBL
    void PIN_FAST_ANALYSIS_CALL docount(UINT32 c) { icount += c; }

    // Jitting time routine: Pin callback
    void Trace(TRACE trace, void *v) {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
            BBL_InsertCall(bbl, IPOINT_ANYWHERE, (AFUNPTR)docount,
                           IARG_FAST_ANALYSIS_CALL,
                           IARG_UINT32, BBL_NumIns(bbl),
                           IARG_END);
    }

    // Pin callback
    void Fini(INT32 code, void *v) {
        fprintf(stderr, "Count %llu\n", (unsigned long long)icount);
    }

    int main(int argc, char * argv[]) {
        PIN_Init(argc, argv);
        TRACE_AddInstrumentFunction(Trace, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();
        return 0;
    }

TRACE_AddInstrumentFunction directs Pin to call the Pin tool function Trace at the beginning of the jitting of each trace. Pin passes the Trace function (the "instrumentation routine") a handle to the trace being jitted. The function uses this trace handle in calls to Pin APIs; here it is used to iterate through the BBLs of the trace.

For each BBL, the Trace function calls BBL_InsertCall to direct Pin to insert a call to the docount function. One parameter is passed to docount: the number of instructions in the BBL. This value is obtained, at jitting time, from the Pin API BBL_NumIns, which takes as its parameter the handle to the BBL, obtained initially from TRACE_BblHead and then from calls to BBL_Next.

IPOINT_ANYWHERE means Pin is free to insert the instrumentation code anywhere in the BBL; this may enable Pin to avoid the flag-saving code.

IARG_FAST_ANALYSIS_CALL and PIN_FAST_ANALYSIS_CALL specify that the docount function should receive its parameters in registers (FASTCALL). This may enable Pin to better optimize the code of docount.

Jitted code for inscount2
Application trace that Pin jits (2 BBLs in the trace, each BBL has 2 instructions):

    APP IP
    0x77ec4600  cmp rax, rdx
    0x77ec4603  jz 0x77f1eac9
    0x77ec4609  movzx ecx, [rax+0x2]
    0x77ec460d  call 0x77ef7870

Jitted code for this trace:

    0x001de0000  mov r14, 0xc5267d40      //inscount2.docount
    0x001de000a  add [r14], 0x2           //inscount2.docount
    0x001de0015  cmp rax, rdx             // app IP 0x77ec4600
    0x001de0018  jz 0x1deffa0 (L1)        //patched in future
    0x001de001e  mov r14, 0xc5267d40      //inscount2.docount
    0x001de0028  mov [r15+0x60], rax      // save status flags
    0x001de002c  lahf
    0x001de002e  seto al
    0x001de0031  mov [r15+0xd8], ax
    0x001de0039  mov rax, [r15+0x60]
    0x001de003d  add [r14], 0x2           //inscount2.docount
    0x001de0048  movzx edi, [rax+0x2]     // app IP 0x77ec4609, ecx alloced to edi
    0x001de004c  push 0x77ec4612          //push retaddr
    0x001de0051  nop
    0x001de0052  jmp 0x1deffd0 (L2)       //patched in future
    (L1)
    0x001deffa0  mov [r15+0x40], rsp      // save app rsp
    0x001deffa4  mov rsp, [r15+0x2d0]     // switch to pin stack
    0x001deffab  call [0x2f000000]        // call VmEnter
    // data used by VmEnter, pointed to by return-address of call
    0x001deffb8  _svc(VMSVC_XFER)
    0x001deffc0  _sct(0x00065f998)        // current register mapping
    0x001deffc8  _iaddr(0x077f1eac9)      // app target IP of jz at 0x77ec4603
    (L2)
    0x001deffd0  mov [r15+0x40], rsp      // save app rsp
    0x001deffd4  mov rsp, [r15+0x2d0]     // switch to pin stack
    0x001deffdb  call [0x2f000000]        // call VmEnter
    0x001deffe8  _svc(VMSVC_XFER)
    0x001defff0  _sct(0x00065fb60)        // current register mapping
    0x001defff8  _iaddr(0x077ef7870)      // app target IP of call at 0x77ec460d

The instrumentation code inserted for inscount2 is marked //inscount2.docount (shown in green on the original slide). The code that saves the application status flags is the sequence from 0x001de0028 to 0x001de0039.

SimpleExamples/inscount2_mt.cpp

    #include <cstdio>
    #include "pin.H"

    INT32 numThreads = 0;
    const INT32 MaxNumThreads = 10000;

    struct THREAD_DATA {
        UINT64 _count;
        UINT8 _pad[56]; // guess why?
    } icount[MaxNumThreads];

    // Analysis routine
    VOID PIN_FAST_ANALYSIS_CALL docount(ADDRINT c, THREADID tid) { icount[tid]._count += c; }

    // Pin callback
    VOID ThreadStart(THREADID threadid, CONTEXT *ctxt, INT32 flags, VOID *v) { numThreads++; }

    // Jitting time routine: Pin callback
    VOID Trace(TRACE trace, VOID *v) {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
            BBL_InsertCall(bbl, IPOINT_ANYWHERE, (AFUNPTR)docount,
                           IARG_FAST_ANALYSIS_CALL,
                           IARG_UINT32, BBL_NumIns(bbl),
                           IARG_THREAD_ID,
                           IARG_END);
    }

    // Pin callback
    VOID Fini(INT32 code, VOID *v) {
        for (INT32 t = 0; t < numThreads; t++)
            printf("Count[of thread#%d]= %llu\n", t, (unsigned long long)icount[t]._count);
    }

    int main(int argc, char * argv[]) {
        PIN_Init(argc, argv);
        for (INT32 t = 0; t < MaxNumThreads; t++) { icount[t]._count = 0; }
        PIN_AddThreadStartFunction(ThreadStart, 0);
        TRACE_AddInstrumentFunction(Trace, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();
        return 0;
    }

Per-thread instruction count. icount is now an array of structures. The _pad field keeps the _count fields of different threads from being in the same cache line.

In the main function of the tool, note PIN_AddThreadStartFunction: it registers the ThreadStart callback that Pin calls when each new thread starts.

Pin assigns each thread a THREADID; this is NOT the OS thread ID. Pin THREADIDs start at 0 and increase by 1 for each new thread.

The Trace function specifies that IARG_THREAD_ID be passed to the docount analysis function. This is the currently executing thread's Pin THREADID; docount uses it as the index into the icount array.
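The purpose of the `_pad` field can be checked with a small standalone sketch. It assumes 64-byte cache lines (an assumption, not something the slide states), and `PaddedCount`, `UnpaddedCount`, and the helper functions are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64; // assumption: 64-byte cache lines

// Same layout idea as THREAD_DATA on the slide: 8-byte counter + 56 bytes of padding.
struct PaddedCount {
    std::uint64_t count;
    std::uint8_t pad[kCacheLine - sizeof(std::uint64_t)];
};
static_assert(sizeof(PaddedCount) == kCacheLine, "one counter per cache-line-sized slot");

// For comparison: counters packed back to back.
struct UnpaddedCount {
    std::uint64_t count;
};

// True if both addresses fall into the same cache line.
inline bool sameCacheLine(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}

// Number of neighbouring array elements whose counters share a cache line.
template <typename T>
std::size_t adjacentCountersSharingLine(const T* counters, std::size_t n) {
    std::size_t shared = 0;
    for (std::size_t i = 0; i + 1 < n; ++i)
        if (sameCacheLine(&counters[i].count, &counters[i + 1].count)) ++shared;
    return shared;
}
```

Because consecutive padded counters are exactly 64 bytes apart, no two of them can land in the same line, so threads updating different slots do not invalidate each other's cache lines (false sharing). With the unpadded layout, nearly all neighbours share a line.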

Multi-Threading
Pin supports multi-threading
  Application threads execute jitted code, including instrumentation code (inlined and not inlined), without any serialization introduced by Pin
    Instrumentation code can use Pin and/or OS synchronization constructs to introduce serialization if needed (examples in Part 4)
  Pin provides APIs for thread-local storage (examples in Part 3)
  Pin callbacks are serialized
  Jitting is serialized
    Only one application thread can be jitting code at any time

Memory Read Logger Tool

    #include <cstdio>
    #include <map>
    #include <string>
    #include "pin.H"

    std::map<ADDRINT, std::string> disAssemblyMap;

    // Analysis routine
    VOID ReadsMem(ADDRINT applicationIp, ADDRINT memoryAddressRead, UINT32 memoryReadSize) {
        printf("0x%llx %s reads %u bytes of memory at 0x%llx\n",
               (unsigned long long)applicationIp,
               disAssemblyMap[applicationIp].c_str(),
               memoryReadSize,
               (unsigned long long)memoryAddressRead);
    }

    // Jitting time routine: Pin callback
    VOID Instruction(INS ins, void * v) {
        if (INS_IsMemoryRead(ins)) {
            disAssemblyMap[INS_Address(ins)] = INS_Disassemble(ins);
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)ReadsMem,
                           IARG_INST_PTR, // application IP
                           IARG_MEMORYREAD_EA,
                           IARG_MEMORYREAD_SIZE,
                           IARG_END);
        }
    }

    int main(int argc, char * argv[]) {
        PIN_Init(argc, argv);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();
        return 0;
    }

Jitted code for two memory-reading instructions:

    Switch to pin stack
    push 4
    push %eax
    push 0x7f083de
    call ReadsMem
    Pop args off pin stack
    Switch back to app stack
    inc DWORD_PTR[%eax]

    Switch to pin stack
    push 4
    lea %ecx,[%esi]0x8     (Pin has determined that it can overwrite ecx)
    push %ecx
    push 0x7f083e4
    call ReadsMem
    Pop args off pin stack
    Switch back to app stack
    inc DWORD_PTR[%esi]0x8

The instrumentation routine, Instruction, passes the INS handle to the Pin API INS_IsMemoryRead, which returns TRUE iff the ins reads memory.

disAssemblyMap is an STL map that saves the disassembly string of each memory-reading ins (INS_Disassemble(ins) returns the disassembly string of the ins). The key of the map is the appIP of the ins, retrieved by INS_Address(ins).

The Instruction routine tells Pin to insert a call to ReadsMem before each jitted instruction that reads memory. The parameters that Pin passes to ReadsMem are:
  IARG_INST_PTR: the appIP of the ins (the same value as INS_Address).
  IARG_MEMORYREAD_EA: the virtual address of the memory that the instruction is reading. NOTE: the tool writer does not need to concern herself with analyzing the operands of the instruction. Pin abstracts away the details, making life easier for the Pin tool writer.
  IARG_MEMORYREAD_SIZE: the number of bytes being read.

Malloc Replacement

    #include <cstring>
    #include "pin.H"

    bool TimeForOutOfMem(); // tool-defined helper, not shown on the slide

    void * MallocWrapper(CONTEXT * ctxt, AFUNPTR pf_malloc, size_t size) {
        // Simulate out-of-memory every so often
        void * res;
        if (TimeForOutOfMem()) return (NULL);
        PIN_CallApplicationFunction(ctxt, PIN_ThreadId(), CALLINGSTD_DEFAULT, pf_malloc,
                                    PIN_PARG(void *), &res,
                                    PIN_PARG(size_t), size,
                                    PIN_PARG_END());
        return res;
    }

    // Pin callback. Registered by IMG_AddInstrumentFunction
    VOID ImageLoad(IMG img, VOID *v) {
        if (strstr(IMG_Name(img).c_str(), "libc.so") ||
            strstr(IMG_Name(img).c_str(), "MSVCR80") ||
            strstr(IMG_Name(img).c_str(), "MSVCR90")) {
            RTN mallocRtn = RTN_FindByName(img, "malloc");
            if (RTN_Valid(mallocRtn)) {
                PROTO protoMalloc = PROTO_Allocate(PIN_PARG(void *), CALLINGSTD_DEFAULT,
                                                   "malloc", PIN_PARG(size_t), PIN_PARG_END());
                RTN_ReplaceSignature(mallocRtn, AFUNPTR(MallocWrapper),
                                     IARG_PROTOTYPE, protoMalloc,
                                     IARG_CONTEXT,
                                     IARG_ORIG_FUNCPTR,
                                     IARG_FUNCARG_ENTRYPOINT_VALUE, 0,
                                     IARG_END);
            }
        }
    }

    int main(int argc, CHAR *argv[]) {
        PIN_InitSymbols();
        PIN_Init(argc, argv);
        IMG_AddInstrumentFunction(ImageLoad, 0);
        PIN_StartProgram();
        return 0;
    }

Wrap the malloc function: MallocWrapper will, most of the time, call the original malloc function and return what the original malloc returns. Every so often it returns NULL instead. This is called error injection; it can be used to test how well an application handles the out-of-memory error condition.

PIN_InitSymbols instructs Pin to make use of any symbol information available for the process.

IMG_AddInstrumentFunction tells Pin to call the ImageLoad function each time an image is loaded: the main image itself and every shared library loaded. ImageLoad receives the IMG parameter, a handle to the image being loaded. It uses this handle to query the name of the image; if the image is one where the malloc function is expected, RTN_FindByName is used to get an RTN handle to the malloc routine.

RTN_ReplaceSignature specifies the tool function to be called instead of malloc. It also specifies which parameters to pass to this tool function:
  IARG_PROTOTYPE: the prototype of the function being replaced. This is used by Pin and not passed to the tool function.
  IARG_CONTEXT: pass the full application register context of the thread to the replacement function (needed in order to call the original malloc function with the proper application context).
  IARG_ORIG_FUNCPTR: a pointer to the original malloc function.
  IARG_FUNCARG_ENTRYPOINT_VALUE, 0: the argument to the original malloc function.

Within the MallocWrapper routine, PIN_CallApplicationFunction is used to call the original malloc function in the original application context of the currently executing thread.
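The slide does not show TimeForOutOfMem. One possible policy, sketched here as standalone C++ rather than Pin code, is to fail every Nth allocation. `FailEveryNth` and this `mallocWrapper` signature are hypothetical; the wrapper receives the original allocator as a plain function pointer, which stands in for the pointer that IARG_ORIG_FUNCPTR supplies in the tool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

using MallocFn = void* (*)(std::size_t);

// Hypothetical failure policy: report "out of memory" on every Nth call.
class FailEveryNth {
public:
    explicit FailEveryNth(unsigned n) : n_(n) { assert(n > 0); }
    bool timeForOutOfMem() { return ++calls_ % n_ == 0; }

private:
    unsigned n_;
    unsigned calls_ = 0;
};

// Error-injecting wrapper in the spirit of MallocWrapper: usually forwards to
// the original allocator, occasionally returns a null pointer instead.
void* mallocWrapper(FailEveryNth& policy, MallocFn origMalloc, std::size_t size) {
    if (policy.timeForOutOfMem()) return nullptr; // injected failure
    return origMalloc(size);
}
```

A deterministic policy like this makes failures reproducible from run to run; a randomized policy would cover more call sites but is harder to debug.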

Pin Probe-Mode
Probe mode is a method of using Pin to wrap or replace application functions with functions in the tool. A jump instruction (the probe), which redirects the flow of control to the replacement function, is placed at the start of the specified function. The bytes being overwritten are relocated, so that Pin can provide the replacement function with the address of the first relocated byte. This enables the replacement function to call the replaced (original) function.
In probe mode, the application and the replacement routine run natively (not jitted). This improves performance, but puts more responsibility on the tool writer.
Probes can only be placed on RTN boundaries, and should be inserted within the image load callback. Pin automatically removes the probes when an image is unloaded.
Many of the Pin APIs that are available in JIT mode are not available in probe mode.
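The mechanics of a probe can be sketched on a plain byte buffer. The slides do not say which jump encoding Pin uses, so this sketch assumes the common 5-byte x86 near jump (opcode 0xE9 followed by a 32-bit displacement relative to the end of the jump). `encodeJmpRel32`, `placeProbe`, and `ProbedRoutine` are hypothetical names, and the buffer is ordinary data, not executable code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

constexpr std::size_t kProbeSize = 5; // 0xE9 + 32-bit displacement

// Encode "jmp target" as it would appear at address probeAddr.
std::array<std::uint8_t, kProbeSize> encodeJmpRel32(std::uint64_t probeAddr,
                                                    std::uint64_t targetAddr) {
    // Displacement is relative to the address of the instruction after the jump.
    const std::int64_t disp = static_cast<std::int64_t>(targetAddr) -
                              static_cast<std::int64_t>(probeAddr + kProbeSize);
    assert(disp >= std::numeric_limits<std::int32_t>::min() &&
           disp <= std::numeric_limits<std::int32_t>::max()); // must be within +/-2 GiB
    const std::uint32_t rel = static_cast<std::uint32_t>(static_cast<std::int32_t>(disp));
    std::array<std::uint8_t, kProbeSize> bytes{};
    bytes[0] = 0xE9;
    for (std::size_t i = 0; i < 4; ++i) // little-endian displacement
        bytes[1 + i] = static_cast<std::uint8_t>((rel >> (8 * i)) & 0xFF);
    return bytes;
}

// Bytes displaced by the probe; a real tool would relocate them so that the
// original function can still be called.
struct ProbedRoutine {
    std::array<std::uint8_t, kProbeSize> savedBytes;
};

// Overwrite the first kProbeSize bytes of `code` (assumed to live at codeAddr)
// with a jump to replacementAddr, returning the bytes that were replaced.
ProbedRoutine placeProbe(std::uint8_t* code, std::uint64_t codeAddr,
                         std::uint64_t replacementAddr) {
    ProbedRoutine saved{};
    std::memcpy(saved.savedBytes.data(), code, kProbeSize);
    const auto jmp = encodeJmpRel32(codeAddr, replacementAddr);
    std::memcpy(code, jmp.data(), kProbeSize);
    return saved;
}
```

Real relocation is harder than saving five bytes: the overwritten bytes must end on an instruction boundary and any IP-relative instructions among them need fixing up, which is one reason a routine may not be safe to probe.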

Malloc Replacement Probe-Mode

    #include <cstring>
    #include "pin.H"

    bool TimeForOutOfMem(); // tool-defined helper, not shown on the slide

    void * MallocWrapper(AFUNPTR pf_malloc, size_t size) {
        // Simulate out-of-memory every so often
        void * res;
        if (TimeForOutOfMem()) return (NULL);
        res = ((void * (*)(size_t))pf_malloc)(size); // call the original malloc directly
        return res;
    }

    VOID ImageLoad(IMG img, VOID *v) {
        if (strstr(IMG_Name(img).c_str(), "libc.so") ||
            strstr(IMG_Name(img).c_str(), "MSVCR80") ||
            strstr(IMG_Name(img).c_str(), "MSVCR90")) {
            RTN mallocRtn = RTN_FindByName(img, "malloc");
            if (RTN_Valid(mallocRtn) && RTN_IsSafeForProbedReplacement(mallocRtn)) {
                PROTO proto_malloc = PROTO_Allocate(PIN_PARG(void *), CALLINGSTD_DEFAULT,
                                                    "malloc", PIN_PARG(size_t), PIN_PARG_END());
                RTN_ReplaceSignatureProbed(mallocRtn, AFUNPTR(MallocWrapper),
                                           IARG_PROTOTYPE, proto_malloc,
                                           IARG_ORIG_FUNCPTR,
                                           IARG_FUNCARG_ENTRYPOINT_VALUE, 0,
                                           IARG_END);
            }
        }
    }

    int main(int argc, CHAR *argv[]) {
        PIN_InitSymbols();
        PIN_Init(argc, argv);
        IMG_AddInstrumentFunction(ImageLoad, 0);
        PIN_StartProgramProbed();
        return 0;
    }

This is very similar to the JIT-mode malloc replacement example just presented.
Note PIN_StartProgramProbed instead of PIN_StartProgram.
Note that RTN_IsSafeForProbedReplacement must be used to determine whether a probe can safely be placed on the routine; more on this later.
Note that in the MallocWrapper routine the original function can be called directly, because the context is the original application context.

SDE
SDE: a fast functional simulator for applications with new instructions
  New instructions have been defined
  The compiler generates code with the new instructions
  What can be used to run the apps with the new instructions?
  Use a PinTool that emulates the new instructions
Example:
  vmovdqu ymm?, mem256
  vmovdqu mem256, ymm?
  16 new 256-bit ymm registers
  Read/write a ymm register from/to memory

(Pin invocation diagram repeated for pin.exe -t inscount.dll -- gzip.exe input.txt, highlighting the steps: read a trace from application code, jit it, adding instrumentation code from inscount.dll, encode the jitted trace into the code cache, and execute it.)

sde_emul.dll Schema

    #include "pin.H"

    // Emulated register file: 16 ymm registers of 256 bits (32 bytes) each
    static UINT8 ymmRegs[16][32];

    // Emulates vmovdqu ymm? <= mem256
    VOID EmVmovdquMem2Reg(unsigned int ymmDstRegNum, ADDRINT * ymmMemSrcPtr) {
        PIN_SafeCopy(ymmRegs[ymmDstRegNum], ymmMemSrcPtr, 32);
    }

    // Emulates vmovdqu mem256 <= ymm?
    VOID EmVmovdquReg2Mem(unsigned int ymmSrcRegNum, ADDRINT * ymmMemDstPtr) {
        PIN_SafeCopy(ymmMemDstPtr, ymmRegs[ymmSrcRegNum], 32);
    }

    VOID Instruction(INS ins, VOID *v) {
        switch (INS_Opcode(ins)) {
        // ... cases for other new instructions elided on the slide ...
        case XED_ICLASS_VMOVDQU:
            if (INS_IsMemoryRead(ins)) {         // vmovdqu ymm? <= mem256
                INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)EmVmovdquMem2Reg,
                               IARG_UINT32, REG(INS_OperandReg(ins, 0)) - REG_YMM0,
                               IARG_MEMORYREAD_EA,
                               IARG_END);
            } else if (INS_IsMemoryWrite(ins)) { // vmovdqu mem256 <= ymm?
                INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)EmVmovdquReg2Mem,
                               IARG_UINT32, REG(INS_OperandReg(ins, 1)) - REG_YMM0,
                               IARG_MEMORYWRITE_EA,
                               IARG_END);
            }
            INS_Delete(ins); // Processor does NOT execute this instruction
            break;
        }
    }

    int main(int argc, CHAR *argv[]) {
        PIN_Init(argc, argv);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();
        return 0;
    }

SDE replaces new instructions with calls to emulation functions. Instructions that are not new are jitted and executed directly. The decoder used by Pin is enhanced to be able to decode the new instructions, but the hardware that Pin is running on can NOT execute the new instructions.

The Instruction instrumentation function passes the ins handle to INS_Opcode to retrieve the ins's opcode enumerator. For opcodes that are new, calls to analysis functions that emulate them are inserted into the jitted code, and the necessary parameters to these functions are specified. The new instruction itself is then deleted from the instruction stream, so it is never executed.

In this case we show the emulation of vmovdqu ymmReg#, mem and vmovdqu mem, ymmReg#, which move 256 bits from memory into one of the new ymm registers (16 of them) and vice versa. The emulation function is passed the number of the ymm register being accessed and the address of the memory being accessed.

It uses PIN_SafeCopy to copy the 256 bits to/from memory from/to the emulated ymm register. PIN_SafeCopy is used because Pin overwrites some small sections of application memory; if the application accesses such an area, it should see the original contents of memory. PIN_SafeCopy ensures that the application always sees the original memory contents.

Child (Injector) gzip (Injectee) Pin (Injectee) fork Child (Injector) Pin (Injectee) gzip (Injectee) Linux Invocation+Injection exitLoop = FALSE; Ptrace TraceMe while(!exitLoop){} pin –t inscount.so – gzip input.txt gzip input.txt Pin stack PinTool that counts application instructions executed, prints Count at end Ptrace Injectee – Injectee Freezes gzip Code and Data Pin Code and Data MiniLoader Code to Save Pin Code and Data MiniLoader Code to Save Injectee.exitLoop = TRUE; MiniLoader Ptrace continue (unFreezes Injectee) execv(gzip); // Injectee Freezes Execution of Injector resumes after execv(gzip) in Injectee completes Ptrace Copy (save, gzip.CodeSegment, sizeof(MiniLoader)) PtraceGetContext (gzip.OrigContext) PtraceCopy (gzip.CodeSegment, MiniLoader, sizeof(MiniLoader)) Pin Code and Data MiniLoader Ptrace continue@MiniLoader (unFreezes Injectee) IP MiniLoader loads Pin+Tool, allocates Pin stack Kill(SigTrace, Injector): Freezes until Ptrace Cont Wait for MiniLoader complete (SigTrace from Injectee) gzip OrigCtxt gzip OrigCtxt Describes how an application process in Pinned on Linux. Ptrace Copy (gzip.CodeSegment, save, sizeof(MiniLoader)) Ptrace Copy (gzip.pin.stack, gzip.OrigCtxt, sizeof (ctxt)) Ptrace SetContext (gzip.IP=pin, gzip.SP=pin.Stack) Code to Save Inscount2.so Ptrace Detach

Part1 Summary Pin is Intel’s dynamic binary instrumentation engine Pin can be used to instrument all user level code Windows, Linux IA-32, Intel64, IA64 Product level robustness Jit-Mode for full instrumentation: Thread, Function, Trace, BBL, Instruction Probe-Mode for Function Replacement/Wrapping/Instrumentation only. Pin supports multi-threading, no serialization of jitted application nor of instrumentation code Pin API makes Pin Tools easy to write Presented 6 full Pin tools, each one fit on 1 ppt slide Popular and well supported 30,000+ downloads, 400+ citations Free DownLoad www.pintool.org Includes: Detailed user manual, source code for 100s of Pin tools Pin User Group http://tech.groups.yahoo.com/group/pinheads/ Pin users and Pin developers answer questions

Part2: Larger Pin tools and writing efficient Pin tools

CMP$im – A CMP Cache Simulation Pin Tool WORK LOAD Modeling an 8-core CMP using CMP$im Instrumentation Routines PIN ThreadID, Address, Size, Access Type DL1 ThreadID Address, Size Access Type L2 INTERCONNECT Cache model PRIVATE LLC/SHARED BANKED LLC LLC Params to configure # cache levels, size, threads/cache etc CMP$im author: Aamer.Jaleel@intel.com

CMP$im – Instrument Memory References
VOID Instruction(INS ins, VOID *v)
{
    if( INS_IsMemoryRead(ins) ) // If instruction reads from memory
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)MemoryReference,
                       IARG_THREAD_ID, IARG_MEMORYREAD_EA, IARG_MEMORYREAD_SIZE,
                       IARG_UINT32, ACCESS_TYPE_LOAD, IARG_END);
    if( INS_IsMemoryWrite(ins) ) // If instruction writes to memory
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)MemoryReference,
                       IARG_THREAD_ID, IARG_MEMORYWRITE_EA, IARG_MEMORYWRITE_SIZE,
                       IARG_UINT32, ACCESS_TYPE_STORE, IARG_END);
}

CMP$im – Analyze Memory References
#include "cache_model.h"
CACHE_t CacheHierarchy[MAX_NUM_THREADS][MAX_NUM_LEVELS];
VOID MemoryReference( int tid, ADDRINT addrStart, int size, int type)
{
    for(ADDRINT addr=addrStart; addr<(addrStart+size); addr+=LINE_SIZE)
        LookupHierarchy( tid, FIRST_LEVEL_CACHE, addr, type);
}
VOID LookupHierarchy( int tid, int level, ADDRINT addr, int accessType)
{
    result = CacheHierarchy[tid][level]->Lookup( addr, accessType );
    if( result == CACHE_MISS ) {
        if( level == LAST_LEVEL_CACHE ) return;
        if( IsShared(level) ) AcquireLock(&lock[level], tid); // Synchronization point
        LookupHierarchy(tid, level+1, addr, accessType);
        if( IsShared(level) ) ReleaseLock(&lock[level]);
    }
}
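The lookup recursion on this slide can be tried out without Pin. The following is a minimal sketch under stated assumptions: `SimpleCache` is a hypothetical direct-mapped cache, not CMP$im's cache model, and `lookupHierarchy` and `memoryReference` are illustrative names. Locking for shared levels is left out, and the reference is split by the cache lines it touches rather than by stepping the address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical direct-mapped cache: one tag per set, so no replacement policy is needed.
struct SimpleCache {
    std::vector<std::uint64_t> tags;
    std::vector<bool> valid;
    std::uint64_t lineSize;
    std::uint64_t hits = 0;
    std::uint64_t misses = 0;

    SimpleCache(std::size_t numSets, std::uint64_t lineBytes)
        : tags(numSets, 0), valid(numSets, false), lineSize(lineBytes) {}

    // Returns true on a hit. On a miss the line is installed in its set.
    bool lookup(std::uint64_t addr) {
        const std::uint64_t line = addr / lineSize;
        const std::size_t set = static_cast<std::size_t>(line % tags.size());
        if (valid[set] && tags[set] == line) { ++hits; return true; }
        valid[set] = true;
        tags[set] = line;
        ++misses;
        return false;
    }
};

// Walk the hierarchy starting at `level`; stop at the first hit or at the last level.
inline void lookupHierarchy(std::vector<SimpleCache>& levels, std::size_t level,
                            std::uint64_t addr) {
    if (levels[level].lookup(addr)) return;   // hit: nothing more to do
    if (level + 1 == levels.size()) return;   // missed in the last level
    lookupHierarchy(levels, level + 1, addr); // miss: try the next level
}

// Feed every cache line touched by [addrStart, addrStart + size) into the hierarchy.
inline void memoryReference(std::vector<SimpleCache>& levels, std::uint64_t addrStart,
                            std::uint64_t size) {
    assert(size > 0 && !levels.empty());
    const std::uint64_t lineSize = levels[0].lineSize;
    const std::uint64_t first = addrStart / lineSize;
    const std::uint64_t last = (addrStart + size - 1) / lineSize;
    for (std::uint64_t line = first; line <= last; ++line)
        lookupHierarchy(levels, 0, line * lineSize);
}
```

In a real tool the analysis routine inserted by INS_InsertCall would play the role of `memoryReference`, with one hierarchy per thread for private levels.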

2MB LLC Cache Behavior – 4 Threads, AMMP 10 mil phase cumulative Private Cache Miss Rate Miss Rate: Private Cache: 75% Shared Cache: 50% Shared caches have better hit rate when compared to private caches Shared Cache Miss Rate Instruction Count (billions)

Shared Refs & Shared Caches… 1 Thread 2 Thread 3 Thread 4 Thread Cache Miss (4 Threaded Run) Workloads have different phases of execution Shared caches BETTER during phases when shared data is referenced frequently A B % Total Accesses GeneNet – 16MB LLC Private LLC Miss Rate HPCA’06: Jaleel et al. Shared LLC Miss Rate

Intel Thread Checker (Paul Petersen, Zhiqiang Ma). Detects data races. Instrumentation: memory operations and synchronization operations. Analysis: uses the dynamic history of lock acquisition and release to form a partial order of memory references [Lamport 1978]. Unordered read/write and write/write pairs to the same location are races.

a documented data race in the art benchmark is detected

PinPlay : Workload capture and deterministic replay Problem : Multi-threaded programs are inherently non-deterministic making their analysis, simulation, debugging very challenging Solution: PinPlay : A Pin-based framework for capturing an execution of multi-threaded program and replaying it deterministically under Pin App and input not needed once we have the log PinPlay LOGS Deterministic replay on any machine Application logger Pin input Replayer Harish Patil & Cristiano Pereira Joint work with Brad Calder, UCSD

Logging to provide deterministic behavior. Start with a checkpoint: a memory image of code and data. A thread is deterministic if every load sees either data from the original checkpoint or a value computed and stored on the same thread. Potential non-determinism arises when a load sees a memory location written by an external agent: another thread, a system call, DMA, etc. Log these values with timestamps.
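The logging rule can be illustrated with a small Pin-free model. This is only a sketch of the idea stated on the slide, not PinPlay's implementation: `LoadLogger`, `kCheckpoint` and the simple global counter used as a timestamp are all assumptions made for the example. It is conservative in that it logs every load whose last writer was another thread.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical "writer id" for data that was already present in the checkpoint.
constexpr int kCheckpoint = -1;

struct LoggedValue {
    std::uint64_t timestamp;  // position in the global order of simulated accesses
    int thread;               // thread that performed the load
    std::uint64_t addr;
    std::uint64_t value;      // value the load must see again on replay
};

class LoadLogger {
public:
    // A store by `thread` makes that thread the last writer of `addr`.
    void store(int thread, std::uint64_t addr, std::uint64_t value) {
        ++clock_;
        Cell& c = mem_[addr];
        c.value = value;
        c.writer = thread;
    }

    // A load is logged only if the last writer is neither the checkpoint
    // nor the loading thread itself.
    std::uint64_t load(int thread, std::uint64_t addr) {
        ++clock_;
        const Cell& c = mem_[addr];  // untouched locations read as checkpoint data (0)
        if (c.writer != kCheckpoint && c.writer != thread)
            log_.push_back(LoggedValue{clock_, thread, addr, c.value});
        return c.value;
    }

    const std::vector<LoggedValue>& log() const { return log_; }

private:
    struct Cell {
        std::uint64_t value = 0;
        int writer = kCheckpoint;
    };
    std::unordered_map<std::uint64_t, Cell> mem_;
    std::vector<LoggedValue> log_;
    std::uint64_t clock_ = 0;
};
```

Loads of checkpoint data and of a thread's own stores produce no log entries, which is what keeps the log small for mostly thread-private data.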

Example Thread T1 Thread T2 DirEntry: [A:D] Last writer id: T1 T2 T1: 1: Store A T1 T2 1: Load F WAW T1: 1 2 T2: 2 Program execution 2: Store A DirEntry: [E:H] RAW 2: Load A 3: Load F Last writer id: T1 3: Store F WAR T1: 3 T2: 1 3 Last_writer SMO logs: Last access to the DirEntry Thread T2 cannot execute memory reference 2 until T1 executes its memory reference 1 T1 2 T2 2 T2 2 T1 1 T1 3 T2 3 Thread T1 cannot execute memory reference 2 until T2 executes its memory reference 2

Applying multi-threaded tracing to software tools Debugging. Customer interested in debugging tools derived from PinPlay Capture bug at customer, bring home log to debug Capture multi-threaded “heisenbug”, replay multiple times How: combine PinPlay tracing with transparent debugging PinPlay LOGS Pin Debug Agent debugger Replayer Standard protocol Pin debug agent enables custom debugger commands

Reducing Instrumentation Overhead Total Overhead = Pin Overhead + Pintool Overhead ~5% for SPECfp and ~50% for SPECint Pin team’s job is to minimize this Usually much larger than pin overhead Pintool writers can help minimize this!

Reducing the Pintool’s Overhead
Pintool Overhead = Instrumentation Routines Overhead + Analysis Routines Overhead
Analysis Routines Overhead = Frequency of calling an Analysis Routine x (Work required for transiting to Analysis Routine + Work done inside Analysis Routine)

Reducing Work in Analysis Routines Key: Shift computation from analysis routines to instrumentation routines whenever possible This usually has the largest speedup

Counting control flow edges jne 60 40 100 40 ret call jmp 40 60 jne 1

Edge Counting: a Slower Version
...
// Analysis
void docount2(ADDRINT src, ADDRINT dst, INT32 taken)
{
    COUNTER *pedg = Lookup(src, dst);
    pedg->count += taken;
}
// Instrumentation
void Instruction(INS ins, void *v)
{
    if (INS_IsBranchOrCall(ins))
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount2,
                       IARG_INST_PTR, IARG_BRANCH_TARGET_ADDR, IARG_BRANCH_TAKEN, IARG_END);
}

Edge Counting: a Faster Version
// Analysis
void docount(COUNTER* pedg, INT32 taken)
{
    pedg->count += taken;
}
void docount2(ADDRINT src, ADDRINT dst, INT32 taken)
{
    COUNTER *pedg = Lookup(src, dst);
    pedg->count += taken;
}
// Instrumentation
void Instruction(INS ins, void *v)
{
    if (INS_IsDirectBranchOrCall(ins)) {
        // Target is known at instrumentation time: look up the counter once
        COUNTER *pedg = Lookup(INS_Address(ins), INS_DirectBranchOrCallTargetAddress(ins));
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) docount,
                       IARG_ADDRINT, pedg, IARG_BRANCH_TAKEN, IARG_END);
    } else
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) docount2,
                       IARG_INST_PTR, IARG_BRANCH_TARGET_ADDR, IARG_BRANCH_TAKEN, IARG_END);
    …
}
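The gain from the faster version comes from doing the table lookup once per instrumented branch instead of once per execution. A minimal Pin-free sketch of that difference follows. `EdgeTable`, `docountSlow` and `docountFast` are hypothetical names, and a `std::map` stands in for whatever `Lookup` uses in the original tool.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

using Addr = std::uint64_t;

struct Counter {
    std::uint64_t count = 0;
};

// Hypothetical edge table. std::map is node-based, so a Counter* stays valid
// after later insertions, which is what lets the fast path cache the pointer.
class EdgeTable {
public:
    Counter* lookup(Addr src, Addr dst) {
        ++lookups_;
        return &edges_[std::make_pair(src, dst)];
    }
    std::uint64_t lookups() const { return lookups_; }

private:
    std::map<std::pair<Addr, Addr>, Counter> edges_;
    std::uint64_t lookups_ = 0;
};

// Slow analysis routine: pays for a lookup on every executed branch.
inline void docountSlow(EdgeTable& table, Addr src, Addr dst, bool taken) {
    table.lookup(src, dst)->count += taken ? 1 : 0;
}

// Fast analysis routine: the counter was resolved once, at "instrumentation time".
inline void docountFast(Counter* edge, bool taken) {
    edge->count += taken ? 1 : 0;
}
```

For a direct branch executed many times, the slow path performs one lookup per execution while the fast path performs exactly one lookup in total. Indirect branches still need the slow path because their target is only known at run time.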

Analysis Routines: Reduce Call Frequency Key: Instrument at the largest granularity whenever possible Instead of inserting one call per instruction Insert one call per basic block or trace

Slower Instruction Counting counter++; sub $0xff, %edx cmp %esi, %edx jle <L1> mov $0x1, %edi add $0x10, %eax

Faster Instruction Counting Counting at BBL level Counting at Trace level counter += 3 sub $0xff, %edx cmp %esi, %edx jle <L1> mov $0x1, %edi add $0x10, %eax sub $0xff, %edx cmp %esi, %edx jle <L1> mov $0x1, %edi add $0x10, %eax counter += 5 counter += 2 counter+=3 L1
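The effect of coarser granularity can be checked with a toy model that needs no Pin. In this sketch, which rests on assumed names (`CountResult`, `countPerInstruction`, `countPerBbl`), a program is just a list of basic-block sizes plus the order in which the blocks execute. Both strategies must reach the same instruction count, and only the number of analysis calls differs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct CountResult {
    std::uint64_t instructions;   // final value of the counter
    std::uint64_t analysisCalls;  // how many times the analysis routine ran
};

// One analysis call before every instruction: counter++ each time.
inline CountResult countPerInstruction(const std::vector<std::uint32_t>& bblSizes,
                                       const std::vector<std::size_t>& path) {
    CountResult r{0, 0};
    for (std::size_t b : path) {
        assert(b < bblSizes.size());
        for (std::uint32_t i = 0; i < bblSizes[b]; ++i) {
            r.instructions += 1;
            r.analysisCalls += 1;
        }
    }
    return r;
}

// One analysis call per basic block: counter += number of instructions in the block.
// The block size is a constant known when the block is instrumented.
inline CountResult countPerBbl(const std::vector<std::uint32_t>& bblSizes,
                               const std::vector<std::size_t>& path) {
    CountResult r{0, 0};
    for (std::size_t b : path) {
        assert(b < bblSizes.size());
        r.instructions += bblSizes[b];
        r.analysisCalls += 1;
    }
    return r;
}
```

With blocks of 3 and 2 instructions, as in the slide's example, the per-block version makes one call where the per-instruction version makes three or two.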

Reducing Work for Analysis Transitions Reduce number of arguments to analysis routines Inline analysis routines Pass arguments in registers Instrumentation scheduling

Reduce Number of Arguments Eliminate arguments only used for debugging Instead of passing TRUE/FALSE, create 2 analysis functions Instead of inserting a call to: Analysis(BOOL val) Insert a call to one of these: AnalysisTrue() AnalysisFalse() IARG_CONTEXT is very expensive (> 10 arguments)

Inlining
Pin will inline analysis functions into jitted application code
Embed analysis routines directly into the application
Avoid transiting through “bridges”
Current limitation: Only straight-line code can be inlined
Inlinable:
int docount0(int i) { x[i]++; return x[i]; }
Not-inlinable (contains a conditional branch):
int docount1(int i) { if (i == 1000) x[i]++; return x[i]; }
Not-inlinable (contains a call):
int docount2(int i) { x[i]++; printf("%d", i); return x[i]; }
Not-inlinable (contains a loop):
void docount3() { for (int i = 0; i < 100; i++) x[i]++; }

Inlining
Use the -log_inline invocation switch to record inlining decisions in pin.log
pin -log_inline -t mytool -- app
Look in pin.log:
Analysis function (0x2a9651854c) from mytool.cpp:53 INLINED
Analysis function (0x2a9651858a) from mytool.cpp:178 NOT INLINED
The last instruction of the first BBL fetched is not a ret instruction
Look at source or disassembly of the function in mytool.cpp at line 178:
0x0000002a9651858a push rbp
0x0000002a9651858b mov rbp, rsp
0x0000002a9651858e mov rax, qword ptr [rip+0x3ce2b3]
0x0000002a96518595 inc dword ptr [rax]
0x0000002a96518597 mov rax, qword ptr [rip+0x3ce2aa]
0x0000002a9651859e cmp dword ptr [rax], 0xf4240
0x0000002a965185a4 jnz 0x11
The function could not be inlined because it contains a control-flow changing instruction (other than ret)

Conditional Inlining Inline a common scenario where the analysis routine has a single “if-then” The “If” part is always executed The “then” part is rarely executed Useful cases: “If” can be inlined, “Then” is not “If” has small number of arguments, “then” has many arguments (or IARG_CONTEXT) Pintool writer breaks analysis routine into two: INS_InsertIfCall (ins, …, (AFUNPTR)doif, …) INS_InsertThenCall (ins, …, (AFUNPTR)dothen, …)

IP-Sampling (a Slower Version)
const INT32 N = 10000;
const INT32 M = 5000;
INT32 icount = N;
VOID IpSample(VOID* ip)
{
    --icount;
    if (icount == 0) {
        fprintf(trace, "%p\n", ip);
        icount = N + rand()%M; // icount is in the range [N, N+M-1]
    }
}
VOID Instruction(INS ins, VOID *v)
{
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)IpSample, IARG_INST_PTR, IARG_END);
}

IP-Sampling (a Faster Version)
// Inlined
ADDRINT CountDown()
{
    --icount;
    return (icount==0);
}
// Not inlined
VOID PrintIp(VOID *ip)
{
    fprintf(trace, "%p\n", ip);
    icount = N + rand()%M; // icount is in the range [N, N+M-1]
}
VOID Instruction(INS ins, VOID *v)
{
    // CountDown() is always called before an inst is executed
    INS_InsertIfCall(ins, IPOINT_BEFORE, (AFUNPTR)CountDown, IARG_END);
    // PrintIp() is called only if the last call to CountDown() returns a non-zero value
    INS_InsertThenCall(ins, IPOINT_BEFORE, (AFUNPTR)PrintIp, IARG_INST_PTR, IARG_END);
}
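The control flow that the If/Then pair sets up can be mimicked in plain C++. This is a sketch under stated assumptions: `IpSampler` is a hypothetical class, `beforeInstruction` stands in for the check Pin generates around the two routines, and a fixed period replaces `N + rand()%M` so the behavior is deterministic.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

class IpSampler {
public:
    explicit IpSampler(std::int32_t period) : period_(period), icount_(period) {
        assert(period > 0);
    }

    // "If" part: tiny, straight-line, takes no arguments. Runs for every instruction.
    bool countDown() {
        --icount_;
        return icount_ == 0;
    }

    // "Then" part: the rare, expensive work. Runs only when countDown() returned true.
    void recordIp(std::uint64_t ip) {
        samples_.push_back(ip);
        icount_ = period_;  // the original tool re-arms with N + rand() % M
    }

    // Stand-in for what the instrumented code does before each instruction.
    void beforeInstruction(std::uint64_t ip) {
        if (countDown()) recordIp(ip);
    }

    const std::vector<std::uint64_t>& samples() const { return samples_; }

private:
    std::int32_t period_;
    std::int32_t icount_;
    std::vector<std::uint64_t> samples_;
};
```

Only the cheap predicate sits on the hot path. The instruction pointer argument is needed by the "then" routine alone, so the common case passes no arguments at all.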

Optimizing Your Pintools - Summary Baseline Pin has fairly low overhead (~5-20%) Adding instrumentation can increase overhead significantly, but you can help! Move work from analysis to instrumentation routines Explore larger granularity instrumentation Explore conditional instrumentation Understand when Pin can inline instrumentation

Part3: Deeper into Pin API Agenda memtrace_simple tool membuffer_simple tool branchbuffer_simple tool Symbols DebugInfo Probe-Mode Multi-Threading

memtrace_simple Tool code collects pairs of {appIP, memAddr} of memory accessing instructions into a per-thread buffer Remember: It is the application thread(s) that execute the Pin Tool code. When the buffer becomes full, tool code processes the entries in the buffer, resets the collection to start at the beginning of the buffer, then execution continues – and so-forth. Is a representative of many memory-trace processing tools

memtrace_simple Tool code must Instrument each memory accessing instruction Determine where in the buffer the {appIP, memAddr} of the instruction should be written Determine when the buffer becomes full Will instrument instructions on Trace level – i.e. TRACE_AddInstrumentFunction(Trace, 0); Not all instructions in the trace will necessarily execute each time trace is executed – because of early exits. Will try to allocate, in the buffer, maximum space needed by trace at the trace start – if not enough space => buffer is full

memtrace_simple Instrumentation code for each memory accessing instruction in the trace will write its {appIP, memAddr} pair to a constant offset from the start of the trace in the buffer. Empty pairs (those instructions that were NOT executed) will be denoted by having an appIP==0.

memtrace_simple
[Diagram: a trace next to the per-thread buffer. At trace entry: If endOf(Previous)TraceReg + TotalSizeOccupiedByTraceInBuffer > endOfBufferReg Then Call BufferFull; then endOfTraceReg += TotalSizeOccupiedByTraceInBuffer. The trace consists of non memory access inss, memory access inss each preceded by its instrumentation code, an Early Exit and the Trace Exit. The buffer holds {appIP, memAddr} entries, with endOfTraceReg and endOfBufferReg pointing into it.]
The instrumentation code (in mauve) of each memory accessing ins (in green) writes the {appIP, memAddr} pair into the buffer at a constant offset from the start of this trace's space in the buffer. If an early exit is taken then the pairs of the following memory accessing inss in the trace will not be written. This slide shows the case when there is enough room in the buffer for the data this trace will write into it. At the beginning of the trace is the IF-THEN instrumentation construct. The IF part checks if there is NOT enough room in the buffer for the {appIP, memAddr} pairs of this trace. This code is inlined. If the answer is TRUE (i.e. there is not enough room), the THEN part is executed (not inlined): the BufferFull function will process the entries in the buffer and set the endOfTraceReg to contain the address of the start of the buffer. The update of the endOfTraceReg (in blue) is executed in any case.

memtrace_simple Trace If endOf(Previous)TraceReg + TotalSizeOccupiedByTraceInBuffer > endOfBufferReg Buffer Then Call BufferFull endOfTraceReg appIP memAddr endOfTraceReg += TotalSizeOccupiedByTraceInBuffer Early Exit Trace Exit endOfTraceReg Non memory access ins Instrumentation code for following memory access ins Memory access ins This shows the case when there is NOT enough room in the buffer for the {appIP, memAddr} pairs of this trace. TotalSizeOccupiedByTraceInBuffer endOfBufferReg

memtrace_simple Tool will: iterate thru all INSs of the Trace Record which ones need to be instrumented (access memory) Record the ins, the memop, the offset from start of the trace in the buffer where the {appIP, memAddr} pair of this ins should be written Get a sum of the TotalSizeOccupiedByTraceInBuffer Insert the IF-THEN sequence at the beginning of the trace Insert the update of endOfTraceReg just after the IF-THEN sequence iterate thru recorded (memory accessing) INSs of the Trace Insert the instrumentation code before each recorded memory accessing instruction this is the code that writes the {appIP, memAddr} pair into the buffer at the designated offset (from start of trace) for this INS. endOfTraceReg and endOfBufferReg are virtual registers allocated by Pin to the Pin tool.
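The buffer discipline listed above can be modeled without Pin. The following sketch is an illustration only: `TraceBufferSim` and its methods are hypothetical names, plain member variables take the place of the two Pin virtual registers, and buffer positions are indices rather than addresses. It reserves a trace's worth of slots at trace entry, writes each executed memory instruction at a fixed slot within that reservation, and treats slots left with appIP == 0 as not executed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct MemRef {
    std::uint64_t appIP;
    std::uint64_t memAddr;
};

class TraceBufferSim {
public:
    explicit TraceBufferSim(std::size_t capacity)
        : buf_(capacity, MemRef{0, 0}), end_(0), traceSlots_(0), flushes_(0), emptySlots_(0) {}

    // Trace entry: the IF-THEN check followed by the unconditional reservation.
    void enterTrace(std::size_t slots) {
        assert(slots <= buf_.size());
        if (end_ + slots > buf_.size()) bufferFull();  // THEN part, rarely taken
        end_ += slots;                                 // plays the role of endOfTraceReg
        traceSlots_ = slots;
    }

    // An executed memory instruction writes at its constant slot within the trace.
    void record(std::size_t slot, std::uint64_t appIP, std::uint64_t memAddr) {
        assert(slot < traceSlots_ && appIP != 0);
        buf_[end_ - traceSlots_ + slot] = MemRef{appIP, memAddr};
    }

    // Process all reserved entries, skip empty ones, and restart at the buffer start.
    void bufferFull() {
        for (std::size_t i = 0; i < end_; ++i) {
            if (buf_[i].appIP != 0) processed_.push_back(buf_[i]);
            else ++emptySlots_;        // instruction skipped by an early exit
            buf_[i] = MemRef{0, 0};    // clear so stale data is not reprocessed
        }
        end_ = 0;
        traceSlots_ = 0;
        ++flushes_;
    }

    const std::vector<MemRef>& processed() const { return processed_; }
    std::size_t flushes() const { return flushes_; }
    std::size_t emptySlots() const { return emptySlots_; }

private:
    std::vector<MemRef> buf_;
    std::size_t end_;         // index one past the current trace's reservation
    std::size_t traceSlots_;  // slots reserved by the current trace
    std::size_t flushes_;
    std::size_t emptySlots_;
    std::vector<MemRef> processed_;
};
```

The price of the constant offsets is visible in `emptySlots()`: a trace that exits early still consumes its full reservation.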

memtrace_simple
TLS_KEY appThreadRepresentitiveKey; // Pin TLS key
REG endOfTraceInBufferReg; // Pin virtual Reg that will hold the pointer to the end of the trace data in the buffer
REG endOfBufferReg; // Pin virtual Reg that will hold the pointer to the end of the buffer
struct MEMREF { ADDRINT appIP; ADDRINT memAddr; }; // structure of the {appIP, memAddr} pair of a memory accessing ins in the buffer
int main(int argc, char * argv[])
{
    PIN_Init(argc,argv);
    // Pin TLS slot for holding the object that represents the application thread
    appThreadRepresentitiveKey = PIN_CreateThreadDataKey(0);
    // get the registers to be used in each thread for managing the per-thread buffer
    endOfTraceInBufferReg = PIN_ClaimToolRegister();
    endOfBufferReg = PIN_ClaimToolRegister();
    TRACE_AddInstrumentFunction(TraceAnalysisCalls, 0);
    PIN_AddThreadStartFunction(ThreadStart, 0);
    PIN_AddThreadFiniFunction(ThreadFini, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();
}
Pin provides an API for Thread Local Storage (TLS). The TLS_KEY is used to store and retrieve thread local data. PIN_CreateThreadDataKey allocates a TLS_KEY. In this tool the data held in the Pin TLS will be a pointer to an object that contains information about this thread’s buffer. The THREAD_ID of the thread is used to access the Pin TLS of the thread. Pin can allocate virtual registers for tool use. The registers are allocated by the PIN_ClaimToolRegister API. The tool can then reference and set values in these registers. Pin takes care of the allocation of physical registers to these virtual registers.

memtrace_simple
KNOB<UINT32> KnobNumBytesInBuffer(KNOB_MODE_WRITEONCE, "pintool", "num_bytes_in_buffer", "0x100000", "number of bytes in buffer");
APP_THREAD_REPRESENTITVE::APP_THREAD_REPRESENTITVE(THREADID myTid)
{
    _buffer = new char[KnobNumBytesInBuffer.Value()]; // Allocate the buffer
    _numBuffersFilled = 0;
    _numElementsProcessed = 0;
    _myTid = myTid;
}
char * APP_THREAD_REPRESENTITVE::Begin() { return _buffer; }
char * APP_THREAD_REPRESENTITVE::End() { return _buffer + KnobNumBytesInBuffer.Value(); }
VOID ThreadStart(THREADID tid, CONTEXT *ctxt, INT32 flags, VOID *v) // Pin callback on thread creation
{
    // There is a new APP_THREAD_REPRESENTITVE object for every thread
    APP_THREAD_REPRESENTITVE * appThreadRepresentitive = new APP_THREAD_REPRESENTITVE(tid);
    // A thread will need to look up its APP_THREAD_REPRESENTITVE, so save pointer in Pin TLS
    PIN_SetThreadData(appThreadRepresentitiveKey, appThreadRepresentitive, tid);
    // Initialize endOfTraceInBufferReg to point at beginning of buffer
    PIN_SetContextReg(ctxt, endOfTraceInBufferReg, reinterpret_cast<ADDRINT>(appThreadRepresentitive->Begin()));
    // Initialize endOfBufferReg to point at end of buffer
    PIN_SetContextReg(ctxt, endOfBufferReg, reinterpret_cast<ADDRINT>(appThreadRepresentitive->End()));
}
Pin provides the KNOB type for processing command line arguments of the tool. In this case the tool takes one command line argument pair: -num_bytes_in_buffer <n> where <n> is the number of bytes allocated to each buffer. The default is 0x100000. APP_THREAD_REPRESENTITVE is a class that holds the information about this thread’s buffer. One object of this type is created for each thread, and the pointer to it is held in the Pin TLS of this thread, as described in the previous slide. Here we show the constructor of APP_THREAD_REPRESENTITVE. The ThreadStart function is the Pin callback function that the tool registered via the call to PIN_AddThreadStartFunction (in the tool’s main).
It allocates the APP_THREAD_REPRESENTITVE object of this thread, then stores the pointer to it in the Pin TLS using the PIN_SetThreadData API. The PIN_SetContextReg API is used to set the value of the virtual registers of this thread (allocated in the previous slide). The thread will start running with the register context CONTEXT *ctxt.

memtrace_simple
void TraceAnalysisCalls(TRACE trace, void *) /* TRACE_AddInstrumentFunction(TraceAnalysisCalls, 0) */
{
    // Go over all BBLs of the trace and for each BBL determine and record the INSs which need
    // to be instrumented - i.e. the ins requires an analysis call
    TRACE_ANALYSIS_CALLS_NEEDED traceAnalysisCallsNeeded;
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
        DetermineBBLAnalysisCalls(bbl, &traceAnalysisCallsNeeded);
    // If No memory accesses in this trace
    if (traceAnalysisCallsNeeded.NumAnalysisCallsNeeded() == 0) return;
    // APP_THREAD_REPRESENTITVE::CheckIfNoSpaceForTraceInBuffer will determine if there are NOT enough
    // available bytes in the buffer. If there are NOT then it returns TRUE and the BufferFull function is called
    TRACE_InsertIfCall(trace, IPOINT_BEFORE, AFUNPTR(APP_THREAD_REPRESENTITVE::CheckIfNoSpaceForTraceInBuffer),
                       IARG_FAST_ANALYSIS_CALL,
                       IARG_REG_VALUE, endOfTraceInBufferReg, // previous trace
                       IARG_REG_VALUE, endOfBufferReg,
                       IARG_UINT32, traceAnalysisCallsNeeded.TotalSizeOccupiedByTraceInBuffer(),
                       IARG_END);
    TRACE_InsertThenCall(trace, IPOINT_BEFORE, AFUNPTR(APP_THREAD_REPRESENTITVE::BufferFull),
                         IARG_FAST_ANALYSIS_CALL,
                         IARG_REG_VALUE, endOfTraceInBufferReg,
                         IARG_THREAD_ID,
                         IARG_RETURN_REGS, endOfTraceInBufferReg,
                         IARG_END);
    TRACE_InsertCall(trace, IPOINT_BEFORE, AFUNPTR(APP_THREAD_REPRESENTITVE::AllocateSpaceForTraceInBuffer),
                     IARG_REG_VALUE, endOfTraceInBufferReg,
                     IARG_UINT32, traceAnalysisCallsNeeded.TotalSizeOccupiedByTraceInBuffer(),
                     IARG_RETURN_REGS, endOfTraceInBufferReg,
                     IARG_END);
    // Insert Analysis Calls for each INS on the trace that was recorded as needing one
    traceAnalysisCallsNeeded.InsertAnalysisCalls();
}
TraceAnalysisCalls is the TRACE instrumentation callback. TRACE_ANALYSIS_CALLS_NEEDED is the class that holds the information on which INSs in the trace need to be instrumented (i.e. the memory accessing INSs). DetermineBBLAnalysisCalls will be shown on the following slides. It fills in info in the TRACE_ANALYSIS_CALLS_NEEDED object. TRACE_InsertIfCall and TRACE_InsertThenCall insert the IF-THEN construct at the beginning of the trace.
Note the IARG_RETURN_REGS argument, which is used to set the value of the endOfTraceInBufferReg to the value returned by the BufferFull function. (BufferFull will return the address of the start of the buffer.) The update of the endOfTraceInBufferReg is inserted next into the jitted trace, thereby allocating space for the {appIP, memAddr} pairs of this trace in the buffer. Finally traceAnalysisCallsNeeded.InsertAnalysisCalls() will insert the instrumentation code of all the INSs recorded by the above call to DetermineBBLAnalysisCalls.

memtrace_simple
static ADDRINT PIN_FAST_ANALYSIS_CALL APP_THREAD_REPRESENTITVE::CheckIfNoSpaceForTraceInBuffer ( // Pin will inline this function
    char * endOfPreviousTraceInBuffer, char * bufferEnd, ADDRINT totalSizeOccupiedByTraceInBuffer)
{
    return (endOfPreviousTraceInBuffer + totalSizeOccupiedByTraceInBuffer >= bufferEnd);
}
static char * PIN_FAST_ANALYSIS_CALL APP_THREAD_REPRESENTITVE::BufferFull ( // Pin will NOT inline this function
    char *endOfTraceInBuffer, ADDRINT tid)
{
    // Get this thread’s APP_THREAD_REPRESENTITVE from the Pin TLS
    APP_THREAD_REPRESENTITVE * appThreadRepresentitive =
        static_cast<APP_THREAD_REPRESENTITVE*>(PIN_GetThreadData(appThreadRepresentitiveKey, tid));
    appThreadRepresentitive->ProcessBuffer(endOfTraceInBuffer);
    // After processing the buffer, move the endOfTraceInBuffer back to the beginning of the buffer
    endOfTraceInBuffer = appThreadRepresentitive->Begin();
    return endOfTraceInBuffer;
}
static char * APP_THREAD_REPRESENTITVE::AllocateSpaceForTraceInBuffer ( // Pin will inline this function
    char * endOfPreviousTraceInBuffer, ADDRINT totalSizeOccupiedByTraceInBuffer)
{
    return (endOfPreviousTraceInBuffer + totalSizeOccupiedByTraceInBuffer);
}
This shows the analysis functions inserted at the beginning of each trace: the IF-THEN pair (CheckIfNoSpaceForTraceInBuffer, BufferFull) and AllocateSpaceForTraceInBuffer, which returns the value to be placed in the endOfTraceInBufferReg. Pin will place this value in the endOfTraceInBufferReg (remember the IARG_RETURN_REGS parameter in the previous slide). The BufferFull function retrieves the pointer to this thread’s APP_THREAD_REPRESENTITVE object from the Pin TLS using the TLS_KEY (appThreadRepresentitiveKey) and the thread’s THREAD_ID. It then calls the ProcessBuffer function of the thread. Finally it returns the address of the beginning of the buffer, which Pin will place into the endOfTraceInBufferReg (again via IARG_RETURN_REGS).

memtrace_simple
class ANALYSIS_CALL_INFO
{
  public:
    ANALYSIS_CALL_INFO(INS ins, UINT32 offsetFromTraceStartInBuffer, UINT32 memop)
        : _ins(ins), _offsetFromTraceStartInBuffer(offsetFromTraceStartInBuffer), _memop(memop) {}
    void InsertAnalysisCall(INT32 sizeofTraceInBuffer);
  private:
    INS _ins;
    INT32 _offsetFromTraceStartInBuffer;
    UINT32 _memop;
};
class TRACE_ANALYSIS_CALLS_NEEDED
{
  public:
    TRACE_ANALYSIS_CALLS_NEEDED() : _numAnalysisCallsNeeded(0), _currentOffsetFromTraceStartInBuffer(0) {}
    UINT32 NumAnalysisCallsNeeded() const { return _numAnalysisCallsNeeded; }
    UINT32 TotalSizeOccupiedByTraceInBuffer() const { return _currentOffsetFromTraceStartInBuffer; }
    void RecordAnalysisCallNeeded(INS ins, UINT32 memop)
    {
        _analysisCalls.push_back(ANALYSIS_CALL_INFO(ins, _currentOffsetFromTraceStartInBuffer, memop));
        _currentOffsetFromTraceStartInBuffer += sizeof(MEMREF);
        _numAnalysisCallsNeeded++;
    }
    void InsertAnalysisCalls();
  private:
    INT32 _currentOffsetFromTraceStartInBuffer;
    INT32 _numAnalysisCallsNeeded;
    vector<ANALYSIS_CALL_INFO> _analysisCalls;
};
void DetermineBBLAnalysisCalls(BBL bbl, TRACE_ANALYSIS_CALLS_NEEDED * traceAnalysisCallsNeeded)
{
    for (INS ins = BBL_InsHead(bbl); INS_Valid(ins); ins = INS_Next(ins)) {
        // Iterate over each memory operand of the instruction.
        for (UINT32 memOp = 0; memOp < INS_MemoryOperandCount(ins); memOp++)
            // Record that an analysis call is needed, along with the info needed to generate the analysis call
            traceAnalysisCallsNeeded->RecordAnalysisCallNeeded(ins, memOp);
    }
}
This slide shows the TRACE_ANALYSIS_CALLS_NEEDED class. Remember that an object of this type is used in the TraceAnalysisCalls function to record the information needed to insert the instrumentation code of each of the memory accessing INSs of the trace. The information held is the _currentOffsetFromTraceStartInBuffer, which is the constant offset from the start of the MEMREF pairs of this trace in the buffer.
The _analysisCalls member is an STL vector of ANALYSIS_CALL_INFO objects. Each element holds the information needed to generate the instrumentation code of the memory accessing ins it records: the _ins, the _offsetFromTraceStartInBuffer allotted to this ins's MEMREF pair, and the _memop identifier of the memory operand in the ins. DetermineBBLAnalysisCalls is called on each BBL of the trace (from the TraceAnalysisCalls function).

memtrace_simple
static void PIN_FAST_ANALYSIS_CALL APP_THREAD_REPRESENTITVE::RecordMEMREFInBuffer ( // Pin will inline this function
    char* endOfTraceInBuffer, ADDRINT offsetFromEndOfTrace, ADDRINT appIp, ADDRINT memAddr)
{
    *reinterpret_cast<ADDRINT*>(endOfTraceInBuffer + offsetFromEndOfTrace) = appIp;
    *reinterpret_cast<ADDRINT*>(endOfTraceInBuffer + offsetFromEndOfTrace + sizeof(ADDRINT)) = memAddr;
}
void ANALYSIS_CALL_INFO::InsertAnalysisCall(INT32 sizeofTraceInBuffer)
{
    /* the place in the buffer where the {appIp, memAddr} of this _ins should be recorded is computed by:
       endOfTraceInBufferReg - sizeofTraceInBuffer + _offsetFromTraceStartInBuffer (of this _ins) */
    INS_InsertCall(_ins, IPOINT_BEFORE, AFUNPTR(APP_THREAD_REPRESENTITVE::RecordMEMREFInBuffer),
                   IARG_FAST_ANALYSIS_CALL,
                   IARG_REG_VALUE, endOfTraceInBufferReg,
                   IARG_ADDRINT, ADDRINT(_offsetFromTraceStartInBuffer - sizeofTraceInBuffer),
                   IARG_INST_PTR,
                   IARG_MEMORYOP_EA, _memop,
                   IARG_END);
}
void TRACE_ANALYSIS_CALLS_NEEDED::InsertAnalysisCalls()
{
    // Iterate over the recorded ANALYSIS_CALL_INFO elements and insert the analysis call for each
    for (vector<ANALYSIS_CALL_INFO>::iterator c = _analysisCalls.begin(); c != _analysisCalls.end(); c++)
        c->InsertAnalysisCall(TotalSizeOccupiedByTraceInBuffer());
}
This slide shows the instrumentation code being inserted for each memory accessing INS in the trace. InsertAnalysisCalls iterates over the _analysisCalls vector (which was filled in by the DetermineBBLAnalysisCalls function). InsertAnalysisCall inserts a call to the analysis function RecordMEMREFInBuffer before the memory accessing _ins recorded in the _analysisCalls vector element. Because the offset argument is computed relative to the end of the trace's space, it is zero or negative.

membuffer_simple Since managing a per-thread buffer is a necessity for a large class of Pin tools, Pin provides APIs to make it easier. The Pin Buffering API abstracts away the need for a Pin tool to manage per-thread buffers. PIN_DefineTraceBuffer: define a per-thread buffer that each application trace can write data to. INS_InsertFillBuffer: instrumentation code is generated to write the desired data into the buffer; this code is inlined. Tool-defined BufferFull function: instrumentation code will cause this function to be called when the buffer becomes full.

membuffer_simple The Pin Buffering API actually works somewhat differently from memtrace_simple. Instrumentation code inserts the data generated by an INS into the buffer immediately after the data generated by the previously executed instrumented INS. This gives better buffer utilization, but requires the instrumentation to update the next buffer location to write to, which was not required in the memtrace_simple implementation. All this is invisible to the Pin tool writer. membuffer_simple is a Pin tool that uses the Pin Buffering API to do the same memory access recording that memtrace_simple does.
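The packed layout described here is easy to model without Pin. The sketch below illustrates the behavior only and makes no claim about how Pin implements it: `PackedBuffer`, `append` and `drain` are hypothetical names, and a `std::function` callback plays the part of the tool's BufferFull function.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Records are appended back to back, so only executed instructions take space.
// The cost is a "next free slot" index that must be updated on every write.
template <typename Record>
class PackedBuffer {
public:
    using FullCallback = std::function<void(const Record*, std::size_t)>;

    PackedBuffer(std::size_t capacity, FullCallback onFull)
        : buf_(capacity), next_(0), onFull_(onFull) {
        assert(capacity > 0);
    }

    // Write one record right after the previously written one.
    void append(const Record& r) {
        if (next_ == buf_.size()) drain();  // no room left: hand the buffer to the tool
        buf_[next_++] = r;
    }

    // Deliver whatever has been written so far (also used at thread exit).
    void drain() {
        if (next_ > 0) onFull_(buf_.data(), next_);
        next_ = 0;
    }

private:
    std::vector<Record> buf_;
    std::size_t next_;     // next buffer location to write to
    FullCallback onFull_;  // stand-in for the tool-defined BufferFull function
};
```

Compared with reserving a fixed slot per instruction of a trace, this layout never hands empty entries to the callback, so the callback needs no appIP==0 check.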

membuffer_simple KNOB<UINT32> KnobNumPagesInBuffer(KNOB_MODE_WRITEONCE, "pintool", "num_pages_in_buffer", "256", "number of pages in buffer"); // Struct of memory reference written to the buffer struct MEMREF { ADDRINT appIP; ADDRINT memAddr; }; // The buffer ID returned by the one call to PIN_DefineTraceBuffer BUFFER_ID bufId; TLS_KEY appThreadRepresentitiveKey; int main(int argc, char * argv[]) { PIN_Init(argc,argv) ; // Pin TLS slot for holding the object that represents an application thread appThreadRepresentitiveKey = PIN_CreateThreadDataKey(0); // Define the buffer that will be used – buffer is allocated to each thread when the thread starts //running bufId = PIN_DefineTraceBuffer(sizeof(struct MEMREF), KnobNumPagesInBuffer, BufferFull, // This Pin tool function will be called when buffer is full 0); INS_AddInstrumentFunction(Instruction, 0); // The Instruction function will use the Pin Buffering // API to insert the instrumentation code that writes // the MEMREF of a memory accessing INS into the buffer PIN_AddThreadStartFunction(ThreadStart, 0); PIN_AddThreadFiniFunction(ThreadFini, 0); PIN_AddFiniFunction(Fini, 0); PIN_StartProgram(); } Pin provides an API for Thread Local Storage (TLS). The TLS_KEY is used to store and retrieve thread local data. PIN_CreateThreadDataKey allocates a TLS_KEY. In this tool the data held in the Pin TLS will be a pointer to an object that contains information about this thread’s buffer. The THREAD_ID of the thread is used to access the Pin TLS of the thread. PIN_DefineTraceBuffer defines the per-thread buffer that will be used. Note that in the Pin buffering API the data written to the buffer MUST have the same structure in all writes. In this case it is the MEMREF structure. In the memtrace_simple, each write COULD be of different sizes (although we also used constant sizes in that example).

membuffer_simple
/*
 * Pin Callback called, by application thread, when a buffer fills up, or the thread exits
 * Pin will NOT inline this function
 * @param[in] id          buffer handle
 * @param[in] tid         id of owning thread
 * @param[in] ctxt        application context
 * @param[in] buf         actual pointer to buffer
 * @param[in] numElements number of records
 * @param[in] v           callback value
 * @return A pointer to the buffer to resume filling.
 */
VOID * BufferFull(BUFFER_ID id, THREADID tid, const CONTEXT *ctxt, VOID *buf, UINT64 numElements, VOID *v)
{
    // retrieve the APP_THREAD_REPRESENTITVE* of this thread from the Pin TLS
    APP_THREAD_REPRESENTITVE * appThreadRepresentitive =
        static_cast<APP_THREAD_REPRESENTITVE*>(PIN_GetThreadData(appThreadRepresentitiveKey, tid));
    appThreadRepresentitive->ProcessBuffer(buf, numElements);
    return buf;
}
This is the BufferFull function. Note that Pin inserts code into the start of each trace that is much like the code we saw in the memtrace_simple example: an IF-THEN pair that checks whether there is not enough room in the buffer and, if so, calls the BufferFull function. When using the Pin buffering API this is done by Pin without the Pin tool having to do it. This code always passes in the parameters listed above. Note that the BufferFull function is executed as a Pin Callback, which means it is a serializing function. The BufferFull function retrieves the pointer to this thread’s APP_THREAD_REPRESENTITVE object from the Pin TLS using the TLS_KEY (appThreadRepresentitiveKey) and the thread’s THREAD_ID. It then calls the ProcessBuffer function of the thread. Finally it returns the pointer to the buffer (VOID *buf) to be used as the next buffer to start filling. In this example there is only one buffer allocated per thread, we will see the use of multi-buffering in part 3.

membuffer_simple
    VOID Instruction(INS ins, VOID *v)
    {
        UINT32 numMemOperands = INS_MemoryOperandCount(ins);
        // Iterate over each memory operand of the instruction.
        for (UINT32 memOp = 0; memOp < numMemOperands; memOp++)
        {
            // Add the instrumentation code to write the appIP and memAddr
            // of this memory operand into the buffer
            // Pin will inline the code that writes to the buffer
            INS_InsertFillBuffer(ins, IPOINT_BEFORE, bufId,
                                 IARG_INST_PTR, offsetof(struct MEMREF, appIP),
                                 IARG_MEMORYOP_EA, memOp, offsetof(struct MEMREF, memAddr),
                                 IARG_END);
        }
    }
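Each value/offset pair passed to INS_InsertFillBuffer states which value lands in which field of the record. A rough plain C++ model of that placement is sketched here; StoreField and FillRecord are hypothetical helpers, std::uintptr_t stands in for ADDRINT, and nothing here claims to be how Pin generates the inlined code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for the MEMREF record; std::uintptr_t plays the role of ADDRINT.
struct MEMREF
{
    std::uintptr_t appIP;
    std::uintptr_t memAddr;
};

// Hypothetical helper: copy one value into the record at the given byte offset.
inline void StoreField(void* record, std::size_t offset, std::uintptr_t value)
{
    assert(offset + sizeof(value) <= sizeof(MEMREF));
    std::memcpy(static_cast<unsigned char*>(record) + offset, &value, sizeof(value));
}

// Models what one fill does for a single memory operand:
// the instruction pointer goes to appIP, the effective address to memAddr.
inline MEMREF FillRecord(std::uintptr_t ip, std::uintptr_t effectiveAddress)
{
    MEMREF rec = {};
    StoreField(&rec, offsetof(MEMREF, appIP), ip);
    StoreField(&rec, offsetof(MEMREF, memAddr), effectiveAddress);
    return rec;
}
```

Because every write uses the same offsets into the same struct, all records in the buffer share one layout, which is the constraint the buffering API imposes.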

branchbuffer_simple Use Pin Buffering API to collect a branch trace: For each executed branch instruction record: appIP of the branch instruction targetAddress of the branch instruction branchTaken boolean

branchbuffer_simple
    KNOB<UINT32> KnobNumPagesInBuffer(KNOB_MODE_WRITEONCE, "pintool", "num_pages_in_buffer", "256", "number of pages in buffer");
    // This is the structure of the data that will be written into the buffer
    struct BRANCH_INFO
    {
        ADDRINT appIP;
        ADDRINT targetAddress;
        BOOL branchTaken;
    };
    int main(int argc, char *argv[])
    {
        PIN_Init(argc, argv);
        bufId = PIN_DefineTraceBuffer(sizeof(BRANCH_INFO), KnobNumPagesInBuffer, BufferFull, 0);
        // Register function to be called to instrument traces
        TRACE_AddInstrumentFunction(Trace, 0);
        // Register function to be called when the application exits
        PIN_AddFiniFunction(Fini, 0);
        // Start the program, never returns
        PIN_StartProgram();
    }

branchbuffer_simple
    // Registered with TRACE_AddInstrumentFunction(Trace, 0)
    void Trace(TRACE tr, void* V)
    {
        for (BBL bbl = TRACE_BblHead(tr); BBL_Valid(bbl); bbl = BBL_Next(bbl))
        {
            // The branch instruction, if it exists, will always be the last in the BBL
            if (INS_IsBranchOrCall(BBL_InsTail(bbl)))
            {
                INS_InsertFillBuffer(BBL_InsTail(bbl), IPOINT_BEFORE, bufId,
                                     IARG_INST_PTR, offsetof(BRANCH_INFO, appIP),
                                     IARG_BRANCH_TARGET_ADDR, offsetof(BRANCH_INFO, targetAddress),
                                     IARG_BRANCH_TAKEN, offsetof(BRANCH_INFO, branchTaken),
                                     IARG_END);
            }
        }
    }

Symbols PIN_InitSymbols() Pin will use whatever symbol information is available Debug info in the app Pdb files Export Tables On Windows uses dbghelp Use symbols to instrument/wrap/replace specific functions wrap/replace: see malloc replacement examples in intro Access application debug information from a Pin tool

Symbols: Instrument malloc and free
    int main(int argc, char *argv[])
    {
        // Initialize pin symbol manager
        PIN_InitSymbols();
        PIN_Init(argc, argv);
        // Register the function ImageLoad to be called each time an image is loaded in the process
        // This includes the process itself and all shared libraries it loads (implicitly or explicitly)
        IMG_AddInstrumentFunction(ImageLoad, 0);
        // Never returns
        PIN_StartProgram();
    }

Symbols: Instrument malloc and free
    // Pin Callback, registered with IMG_AddInstrumentFunction(ImageLoad, 0)
    VOID ImageLoad(IMG img, VOID *v)
    {
        // Instrument the malloc() and free() functions. Print the input argument
        // of each malloc() or free(), and the return value of malloc().
        RTN mallocRtn = RTN_FindByName(img, "_malloc"); // Find the malloc() function.
        if (RTN_Valid(mallocRtn))
        {
            RTN_Open(mallocRtn);
            // Instrument malloc() to print the input argument value and the return value.
            RTN_InsertCall(mallocRtn, IPOINT_BEFORE, (AFUNPTR)MallocBefore,
                           IARG_FUNCARG_ENTRYPOINT_VALUE, 0, IARG_END);
            RTN_InsertCall(mallocRtn, IPOINT_AFTER, (AFUNPTR)MallocAfter,
                           IARG_FUNCRET_EXITPOINT_VALUE, IARG_END);
            RTN_Close(mallocRtn);
        }
        RTN freeRtn = RTN_FindByName(img, "_free"); // Find the free() function.
        if (RTN_Valid(freeRtn))
        {
            RTN_Open(freeRtn);
            // Instrument free() to print the input argument value.
            RTN_InsertCall(freeRtn, IPOINT_BEFORE, (AFUNPTR)FreeBefore,
                           IARG_FUNCARG_ENTRYPOINT_VALUE, 0, IARG_END);
            RTN_Close(freeRtn);
        }
    }
IMG_AddInstrumentFunction tells Pin to call the ImageLoad function each time an image is loaded: the main image itself and each and every shared library loaded. The ImageLoad function receives the IMG parameter, a handle to the image being loaded. It uses this handle to query for the name of the image, and if the name of the image is one where the malloc function is expected, then RTN_FindByName is used to get an RTN handle to the malloc routine. RTN_InsertCall with IPOINT_BEFORE specifies the tool function to be called before the malloc function. RTN_InsertCall with IPOINT_AFTER specifies the tool function to be called after the malloc function.

Symbols: Instrument malloc
Handling name-mangling and multiple symbols at same address
    // Registered with IMG_AddInstrumentFunction(Image, 0)
    VOID Image(IMG img, VOID *v)
    {
        // Walk through the symbols in the symbol table.
        for (SYM sym = IMG_RegsymHead(img); SYM_Valid(sym); sym = SYM_Next(sym))
        {
            string undFuncName = PIN_UndecorateSymbolName(SYM_Name(sym), UNDECORATION_NAME_ONLY);
            if (undFuncName == "malloc") // Find the malloc function.
            {
                RTN mallocRtn = RTN_FindByAddress(IMG_LowAddress(img) + SYM_Value(sym));
                if (RTN_Valid(mallocRtn))
                {
                    RTN_Open(mallocRtn);
                    // Instrument to print the input argument value and the return value.
                    RTN_InsertCall(mallocRtn, IPOINT_BEFORE, (AFUNPTR)MallocBefore,
                                   IARG_FUNCARG_ENTRYPOINT_VALUE, 0, IARG_END);
                    RTN_InsertCall(mallocRtn, IPOINT_AFTER, (AFUNPTR)MallocAfter,
                                   IARG_FUNCRET_EXITPOINT_VALUE, IARG_END);
                    RTN_Close(mallocRtn);
                }
            }
        }
    }
This shows the Pin APIs that can be used to get the unmangled names of symbols, and to locate RTNs via addresses.

Symbols: Accessing Application Debug Info from a Pin Tool
    // Registered with INS_AddInstrumentFunction(Instruction, 0)
    VOID Instruction(INS ins, VOID *v)
    {
        UINT32 numMemOperands = INS_MemoryOperandCount(ins);
        // Iterate over each memory operand of the instruction.
        for (UINT32 memOp = 0; memOp < numMemOperands; memOp++)
        {
            if (INS_MemoryOperandIsWritten(ins, memOp))
            {
                // Insert instrumentation code to catch a memory overwrite
                INS_InsertIfCall(ins, IPOINT_BEFORE, AFUNPTR(AnalyzeMemWrite),
                                 IARG_FAST_ANALYSIS_CALL,
                                 IARG_MEMORYOP_EA, memOp, IARG_MEMORYWRITE_SIZE, IARG_END);
                INS_InsertThenCall(ins, IPOINT_BEFORE, AFUNPTR(MemoryOverWriteAt),
                                   IARG_FAST_ANALYSIS_CALL,
                                   IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_MEMORYWRITE_SIZE, IARG_END);
            }
        }
    }
This is part of a watchpoint tool that also shows how to access debug info. The tool inserts instrumentation code before each memory-writing INS that checks whether it writes to a specified address (using the IF part of an IF-THEN construct: AnalyzeMemWrite). If it does, the THEN part is invoked (MemoryOverWriteAt).

Symbols: Accessing Application Debug Info from a Pin Tool
    KNOB<ADDRINT> KnobMemAddrBeingOverwritten(KNOB_MODE_WRITEONCE, "pintool", "mem_overwrite_addr", "256", "overwritten memaddr");
    // Pin will inline this function, it is the IF part
    static ADDRINT PIN_FAST_ANALYSIS_CALL AnalyzeMemWrite(ADDRINT memWriteAddr, UINT32 numBytesWritten)
    {
        // return 1 if this memory write overwrites the address specified by
        // KnobMemAddrBeingOverwritten
        return (memWriteAddr <= KnobMemAddrBeingOverwritten
                && (memWriteAddr + numBytesWritten) > KnobMemAddrBeingOverwritten);
    }
    // Pin will NOT inline this function, it is the THEN part
    static VOID PIN_FAST_ANALYSIS_CALL MemoryOverWriteAt(ADDRINT appIP, ADDRINT memWriteAddr, UINT32 numBytesWritten)
    {
        INT32 column, lineNum;
        string fileName;
        PIN_GetSourceLocation(appIP, &column, &lineNum, &fileName);
        printf("overwrite of %p from instruction at %p originating from file %s line %d col %d\n",
               KnobMemAddrBeingOverwritten, appIP, fileName.c_str(), lineNum, column);
        printf(" writing %d bytes starting at %p\n", numBytesWritten, memWriteAddr);
    }
KnobMemAddrBeingOverwritten contains the memory address at which the watchpoint is placed (this is the command line argument -mem_overwrite_addr <addr> specified to the tool). The call to PIN_GetSourceLocation retrieves the debug info associated with the specified appIP.
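The range test in AnalyzeMemWrite is easy to check in isolation. The sketch below repeats the same comparison as a standalone function; WriteCoversAddress is a hypothetical name, and the watched address is passed as a parameter instead of being read from the knob.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the write [writeAddr, writeAddr + numBytes) covers watchAddr.
// Same comparison as AnalyzeMemWrite, with the watched address passed in explicitly.
inline bool WriteCoversAddress(std::uint64_t writeAddr, std::uint32_t numBytes, std::uint64_t watchAddr)
{
    return writeAddr <= watchAddr && (writeAddr + numBytes) > watchAddr;
}
```

For a 4-byte write at 0x100, the addresses 0x100 through 0x103 are covered, while 0xff and 0x104 are not.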

Probe Mode JIT Mode Pin creates a modified copy of the application on-the-fly Original code never executes More flexible, more common approach Probe Mode Pin modifies the original application instructions Inserts jumps to instrumentation code (trampolines) Lower overhead (less flexible) approach

Pin Probe-Mode Probe mode is a method of using Pin to wrap or replace application functions with functions in the tool. A jump instruction (probe), which redirects the flow of control to the replacement function, is placed at the start of the specified function. The bytes being overwritten are relocated, so that Pin can provide the replacement function with the address of the first relocated byte. This enables the replacement function to call the replaced (original) function. In probe mode, the application and the replacement routine are run natively (not Jitted). This improves performance, but puts more responsibility on the tool writer. Probes can only be placed on RTN boundaries, and should be inserted within the Image load callback. Pin will automatically remove the probes when an image is unloaded. Many of the Pin APIs that are available in JIT mode are not available in Probe mode.

A Sample Probe
A probe is a jump instruction that overwrites original instruction(s) in the application
Instrumentation invoked with probes
Pin copies/translates original bytes so probed (replaced) functions can be called from the replacement function
Original function entry point:
    0x400113d4: push %ebp
    0x400113d5: mov %esp,%ebp
    0x400113d7: push %edi
    0x400113d8: push %esi
    0x400113d9: push %ebx
Entry point overwritten with probe:
    0x400113d4: jmp 0x41481064
    0x400113d9: push %ebx
Copy of entry point with original bytes:
    0x50000004: push %ebp
    0x50000005: mov %esp,%ebp
    0x50000007: push %edi
    0x50000008: push %esi
    0x50000009: jmp 0x400113d9
Tool wrapper function:
    0x41481064: push %ebp // tool wrapper func
    ::::::::::::::::::::
    0x414827fe: call 0x50000004 // call original func

PinProbes Instrumentation Advantages: Low overhead – few percent Less intrusive – execute original code Leverages Pin: API Instrumentation engine Disadvantages: More tool writer responsibility Routine-level granularity (RTN)

Using Probes to Replace/Wrap a Function RTN_ReplaceSignatureProbed() redirects all calls to application routine rtn to the specified replacementFunction Can add IARG_* types to be passed to the replacement routine, including pointer to original function and IARG_CONTEXT. Replacement function can call original function. To use: Must use PIN_StartProgramProbed() Application prototype is required

Malloc Replacement Probe-Mode
    #include "pin.H"
    void * MallocWrapper(AFUNPTR pf_malloc, size_t size)
    {
        // Simulate out-of-memory every so often
        void * res;
        if (TimeForOutOfMem()) return (NULL);
        // AFUNPTR is an untyped function pointer, so cast it to malloc's signature before calling
        res = ((void * (*)(size_t))pf_malloc)(size);
        return res;
    }
    VOID ImageLoad(IMG img, VOID *v)
    {
        if (strstr(IMG_Name(img).c_str(), "libc.so") ||
            strstr(IMG_Name(img).c_str(), "MSVCR80") ||
            strstr(IMG_Name(img).c_str(), "MSVCR90"))
        {
            RTN mallocRtn = RTN_FindByName(img, "malloc");
            if (RTN_Valid(mallocRtn) && RTN_IsSafeForProbedReplacement(mallocRtn))
            {
                PROTO proto_malloc = PROTO_Allocate(PIN_PARG(void *), CALLINGSTD_DEFAULT,
                                                    "malloc", PIN_PARG(size_t), PIN_PARG_END());
                RTN_ReplaceSignatureProbed(mallocRtn, AFUNPTR(MallocWrapper),
                                           IARG_PROTOTYPE, proto_malloc,
                                           IARG_ORIG_FUNCPTR,
                                           IARG_FUNCARG_ENTRYPOINT_VALUE, 0,
                                           IARG_END);
            }
        }
    }
    int main(int argc, CHAR *argv[])
    {
        PIN_InitSymbols();
        PIN_Init(argc, argv);
        IMG_AddInstrumentFunction(ImageLoad, 0);
        PIN_StartProgramProbed();
    }
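The wrapper pattern itself, a replacement routine that receives a pointer to the original function and decides whether to forward the call, can be tried without Pin. The sketch below is a plain C++ illustration only: MallocFn, g_callCount and the every-third-call policy in TimeForOutOfMem are assumptions made for the example, not part of the tool on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Signature of the original function, as a typed pointer.
typedef void* (*MallocFn)(std::size_t);

// Hypothetical failure policy for this sketch: fail every third call.
static int g_callCount = 0;
static bool TimeForOutOfMem()
{
    return (++g_callCount % 3) == 0;
}

// Replacement routine: receives the original function and its argument.
void* MallocWrapper(MallocFn originalMalloc, std::size_t size)
{
    if (TimeForOutOfMem()) return NULL; // simulate out-of-memory
    return originalMalloc(size);        // otherwise forward to the original
}
```

Calling MallocWrapper(std::malloc, 16) three times forwards the first two calls and returns NULL on the third, which is the kind of fault injection the slide's TimeForOutOfMem check is meant to enable.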

Using Probes to Call Analysis Functions RTN_InsertCallProbed() invokes the analysis routine before or after the specified rtn Use IPOINT_BEFORE or IPOINT_AFTER Pin may NOT be able to find all AFTER points on the function when it is running in Probe-Mode PIN IARG_TYPEs are used for arguments To use: Must use PIN_StartProgramProbed() Application prototype is required

Symbols: Instrument malloc
Handling name-mangling and multiple symbols at same address, Probe-Mode
    // Registered with IMG_AddInstrumentFunction(Image, 0)
    VOID Image(IMG img, VOID *v)
    {
        // Walk through the symbols in the symbol table.
        for (SYM sym = IMG_RegsymHead(img); SYM_Valid(sym); sym = SYM_Next(sym))
        {
            string undFuncName = PIN_UndecorateSymbolName(SYM_Name(sym), UNDECORATION_NAME_ONLY);
            if (undFuncName == "malloc") // Find the malloc function.
            {
                RTN mallocRtn = RTN_FindByAddress(IMG_LowAddress(img) + SYM_Value(sym));
                if (RTN_Valid(mallocRtn))
                {
                    RTN_Open(mallocRtn);
                    PROTO proto_malloc = PROTO_Allocate(PIN_PARG(void *), CALLINGSTD_DEFAULT,
                                                        "malloc", PIN_PARG(size_t), PIN_PARG_END());
                    // Instrument to print the input argument value and the return value.
                    RTN_InsertCallProbed(mallocRtn, IPOINT_BEFORE, (AFUNPTR)MallocBefore,
                                         IARG_PROTOTYPE, proto_malloc,
                                         IARG_FUNCARG_ENTRYPOINT_VALUE, 0, IARG_END);
                    RTN_InsertCallProbed(mallocRtn, IPOINT_AFTER, (AFUNPTR)MallocAfter,
                                         IARG_FUNCRET_EXITPOINT_VALUE, IARG_END);
                    RTN_Close(mallocRtn);
                }
            }
        }
    }

Tool Writer Responsibilities No control flow into the instruction space where probe is placed 6 bytes on IA-32, 7 bytes on Intel64, 1 bundle on IA64 Branch into “replaced” instructions will fail Probes at function entry point only Thread safety for insertion and deletion of probes During image load callback is safe Only loading thread has a handle to the image Replacement function has same behavior as original

Multi-Threading Have shown a number of examples of Pin tools supporting multi-threading Pin fully supports multi-threading Application threads execute jitted code including instrumentation code (inlined and not inlined), without any serialization introduced by Pin Instrumentation code can use Pin and/or OS synchronization constructs to introduce serialization if needed. Will see examples of this in Part4 System calls require serialized entry to the VM before and after execution, BUT actual execution is NOT serialized Pin does NOT create any threads of its own Pin callbacks are serialized Including the BufferFull callback Jitting is serialized Only one application thread can be jitting code at any time

Multi-Threading Pin Tools, in Jit-Mode, can: Track Threads ThreadStart, ThreadFini callbacks IARG_THREAD_ID Use Pin TLS for thread-specific data Use Pin Locks to synchronize threads Create threads to do Pin Tool work Use Pin provided APIs to do this Otherwise these threads would be Jitted Details in Part4

Part3 Summary Saw Examples of Allocating Pin Registers for Pin Tool Use Pin IF-THEN instrumentation Changing register values in instrumentation code Changing register values in CONTEXT Knobs Pin TLS Pin Buffering API Using Symbol and Debug Info Probe-Mode Multi-Threading support

Part4: Advanced Pin API “To boldly go where no PinHead has gone before…” Agenda membuffer_threadpool tool Using multiple buffers in the Pin Buffering API Using Pin Tool Threads Using Pin and OS locks to synchronize threads System call instrumentation Instrumenting a process tree CONTEXT* and IARG_CONTEXT Managing Exceptions and Signals Accessing Decode API Pin Code-Cache API Transparent debugging, and extending the debugger

membuffer_threadpool Recall membuffer_simple: Uses Pin Buffering API One buffer for each thread Inlined call to INS_InsertFillBuffer writes instrumentation data into the buffer Application threads execute jitted application and instrumentation code When buffer becomes full the Pin Tool defined BufferFull callback is called (by the application thread) Process the data in the buffer After the buffer is processed it is set to be re-filled from the top Application thread continues executing jitted application and instrumentation code All Pin callbacks are serialized Only one buffer is being processed at any time

membuffer_threadpool Improvement: Process buffers that become full asynchronously, allows application code to continue executing while buffers are being processed. Pin Buffering API supports multiple buffers per-thread Each application thread will allocate a number of buffers. The buffers allocated by the thread can only be used by the allocating thread so: Each application thread will have a buffers-free list, holding all buffers that are not currently full or being filled. Pin supports creating Pin Tool threads, these are NOT jitted and can be used to do Pin Tool work asynchronously. A number of these threads will be created, their job is: Process buffers that become full. These will be located on a global full-buffers list. After processing, return them to the buffers-free list of the application thread that filled them Application threads execute jitted application code and instrumentation code – the instrumentation code writes data into the buffers and when it detects that the buffer is full calls the BufferFull callback. The BufferFull callback function will NOT process the buffer Remember it is executed by an application thread It places the buffer on the global full-buffers list It retrieves a free buffer from this application thread’s free buffer list and returns it as the next buffer to fill.

membuffer_threadpool [Diagram: buffer flow between application threads and Pin Tool processing threads. Each application thread fills a buffer taken from its buffers-free list. When the buffer becomes full, the BufferFull function is executed and the buffer goes onto the buffers-full list. A Pin Tool processing thread processes it and, when buffer processing finishes, returns the buffer to its owner's buffers-free list.]

membuffer_threadpool
    int main(int argc, char *argv[])
    {
        PIN_Init(argc, argv);
        // Pin TLS slot for holding the object that represents an application thread
        appThreadRepresentitiveKey = PIN_CreateThreadDataKey(0);
        // Define the buffer that will be used
        bufId = PIN_DefineTraceBuffer(sizeof(struct MEMREF), KnobNumPagesInBuffer,
                                      BufferFull, // This Pin tool function will be called when the buffer is full
                                      0);
        TRACE_AddInstrumentFunction(Trace, 0); // add an instrumentation callback function
        // add callbacks
        PIN_AddThreadStartFunction(ThreadStart, 0);
        PIN_AddThreadFiniFunction(ThreadFini, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_AddFiniUnlockedFunction(FiniUnlocked, 0); // Used for Pin Tool thread termination
        /* It is safe to create internal threads in the tool's main procedure and spawn new
         * internal threads from existing ones. All other places, like Pin callbacks and
         * analysis routines in application threads, are not safe for creating internal threads. */
        // NOTE: These threads are NOT jitted. Need to discuss when the threads actually start running
        for (int i = 0; i < KnobNumProcessingThreads; i++)
        {
            THREADID threadId;
            PIN_THREAD_UID threadUid;
            threadId = PIN_SpawnInternalThread(BufferProcessingThread, NULL, 0, &threadUid);
            RecordToolThreadCreated(threadUid); /* Used for Pin Tool thread termination */
        }
        PIN_StartProgram(); /* Start the program, never returns */
    }
PIN_DefineTraceBuffer defines the per-thread buffer that will be used. Note that in the Pin buffering API the data written to the buffer MUST have the same structure in all writes. In this case it is the MEMREF structure. PIN_SpawnInternalThread creates Pin tool threads; these threads are NOT jitted. BufferProcessingThread is the function the thread will start executing when it begins to run. The threadUid is set by PIN_SpawnInternalThread; it gives the thread a unique ID that will NEVER be re-used during the lifetime of this process. This threadUid is used to manage the termination of these Pin tool threads.
Note that upon the return from PIN_SpawnInternalThread the Pin tool thread may or may not have actually started running; if it has not, it will start running eventually. RecordToolThreadCreated is a Pin tool function used to record the threadUids of the Pin tool threads.

membuffer_threadpool
    static void RecordToolThreadCreated(PIN_THREAD_UID threadUid)
    {
        // Record the unique ID of the Pin Tool thread
        BOOL insertStatus;
        insertStatus = (uidSet.insert(threadUid)).second;
    }
    // The thread function of Pin Tool threads; this code runs natively: NO Jitting
    static VOID BufferProcessingThread(VOID * arg)
    {
        processingThreadRunning = TRUE; // Indicate that thread has started running
        THREADID myThreadId = PIN_ThreadId();
        while (!doExit)
        {
            VOID *buf;
            UINT64 numElements;
            APP_THREAD_REPRESENTITVE *appThreadRepresentitive;
            // Get full buffer from the full buffer list
            fullBuffersListManager.GetBufferFromList(&buf, &numElements, &appThreadRepresentitive, myThreadId);
            if (buf == NULL)
            {
                // this will happen at process termination time, when there are NO buffers left to process
                ASSERTX(doExit);
                break;
            }
            // Process the full buffer
            ProcessBuffer(buf, numElements, appThreadRepresentitive);
            // Put the processed buffer back on the free buffer list of the application thread that owns it
            appThreadRepresentitive->FreeBufferListManager()->PutBufferOnList(buf, 0, appThreadRepresentitive, myThreadId);
        }
    }
The RecordToolThreadCreated function records the threadUid of a Pin tool thread in the STL set called uidSet. BufferProcessingThread is the start function of all Pin tool threads. It gets its Pin THREADID by calling PIN_ThreadId(). Then it goes into a loop of: get a full buffer from the full-buffers list; process the buffer; put the processed buffer on the free list of its owner thread. The fullBuffersListManager is a BUFFER_LIST_MANAGER object that manages the one global full-buffers list. The appThreadRepresentitive->FreeBufferListManager() call returns a pointer to the BUFFER_LIST_MANAGER that manages the free buffer list of the thread that owns the buffer that was retrieved from the full-buffers list. This loop terminates when GetBufferFromList sets the buf parameter to NULL, which indicates process termination.

membuffer_threadpool
    /*! Pin Callback
     * Called by instrumentation code when a buffer fills up, or the thread exits, so the buffer can be processed
     * Called in the context of the application thread
     * @param[in] id           buffer handle
     * @param[in] tid          id of owning thread
     * @param[in] ctxt         application context
     * @param[in] buf          actual pointer to buffer
     * @param[in] numElements  number of records
     * @param[in] v            callback value
     * @return A pointer to the buffer to resume filling.
     */
    VOID * BufferFull(BUFFER_ID id, THREADID tid, const CONTEXT *ctxt, VOID *buf, UINT64 numElements, VOID *v)
    {
        // get the APP_THREAD_REPRESENTITVE of this app thread from the Pin TLS
        APP_THREAD_REPRESENTITVE * appThreadRepresentitive =
            static_cast<APP_THREAD_REPRESENTITVE*>(PIN_GetThreadData(appThreadRepresentitiveKey, tid));
        // Enqueue the full buffer on the full-buffers list, and get the next buffer to fill
        // from this thread's free buffer list
        VOID *nextBuffToFill = appThreadRepresentitive->EnqueFullAndGetNextToFill(buf, numElements);
        return (nextBuffToFill);
    }
This is the BufferFull function. Note that Pin inserts code into the start of each trace that is much like the code we saw in the memtrace_simple example: an IF-THEN pair that checks whether there is not enough room in the buffer and, if so, calls the BufferFull function. When using the Pin buffering API this is done by Pin, without the Pin tool having to do it. This code always passes in the parameters listed above. Note that the BufferFull function is executed as a Pin Callback, which means it is a serializing function. The BufferFull function retrieves the pointer to this thread's APP_THREAD_REPRESENTITVE object from the Pin TLS using the TLS_KEY (appThreadRepresentitiveKey) and the thread's THREADID.
It then calls EnqueFullAndGetNextToFill to place the full buffer on the global full-buffers list and to get a buffer from this thread's free-buffers list. The buffer retrieved from the free-buffers list is returned to specify that this is the buffer to write data to in the following executions of the code inserted by INS_InsertFillBuffer.

membuffer_threadpool
    VOID * APP_THREAD_REPRESENTITVE::EnqueFullAndGetNextToFill(VOID *fullBuf, UINT64 numElements)
    {
        // cannot wait for Pin Tool threads to start running since this may cause deadlock
        // because this app thread may be holding some OS resource that the Pin Tool
        // thread needs to obtain in order to start - e.g. the LoaderLock
        if (!processingThreadRunning)
        {
            // process buffer in this app thread
            ProcessBuffer(fullBuf, numElements, this);
            return fullBuf;
        }
        if (!_buffersAllocated)
        {
            // now allocate the rest of the KnobNumBuffersPerAppThread buffers to be used
            for (int i = 0; i < KnobNumBuffersPerAppThread - 1; i++)
            {
                _freeBufferListManager->PutBufferOnList(PIN_AllocateBuffer(bufId), 0, this, _myTid);
            }
            _buffersAllocated = TRUE;
        }
        // put the fullBuf on the full buffers list; one of the Pin Tool processing threads will
        // pick it from there, process it, and then put it on this app-thread's free buffer list
        fullBuffersListManager.PutBufferOnList(fullBuf, numElements, this, _myTid);
        // return the next buffer to fill.
        // It is always taken from the free buffers list of this app thread. If the list is empty then this app
        // thread will be blocked until one is placed there (by one of the Pin Tool buffer processing threads).
        VOID *nextBufToFill;
        UINT64 numElementsDummy;
        APP_THREAD_REPRESENTITVE *appThreadRepresentitiveDummy;
        _freeBufferListManager->GetBufferFromList(&nextBufToFill, &numElementsDummy,
                                                  &appThreadRepresentitiveDummy, _myTid);
        ASSERTX(appThreadRepresentitiveDummy == this);
        return nextBufToFill;
    }
PIN_AllocateBuffer is used to allocate additional buffers to this application thread. The first buffer is allocated when the thread starts. Here we allocate the remaining buffers and place them on the free-buffers list of this application thread.

membuffer_threadpool
    VOID Instruction(INS ins, VOID *v)
    {
        UINT32 numMemOperands = INS_MemoryOperandCount(ins);
        // Iterate over each memory operand of the instruction.
        for (UINT32 memOp = 0; memOp < numMemOperands; memOp++)
        {
            // Add the instrumentation code to write the appIP and memAddr
            // of this memory operand into the buffer
            // Pin will inline the code that writes to the buffer
            INS_InsertFillBuffer(ins, IPOINT_BEFORE, bufId,
                                 IARG_INST_PTR, offsetof(struct MEMREF, appIP),
                                 IARG_MEMORYOP_EA, memOp, offsetof(struct MEMREF, memAddr),
                                 IARG_END);
        }
    }
Same as in membuffer_simple

membuffer_threadpool
    class BUFFER_LIST_MANAGER
    {
      public:
        BUFFER_LIST_MANAGER();
        VOID PutBufferOnList(VOID *buf, UINT64 numElements,
                             APP_THREAD_REPRESENTITVE *appThreadRepresentitive, THREADID tid)
        {
            // build the list element
            BUFFER_LIST_ELEMENT bufferListElement;
            bufferListElement.buf = buf;
            bufferListElement.numElements = numElements;
            bufferListElement.appThreadRepresentitive = appThreadRepresentitive;
            GetLock(&_bufferListLock, tid+1);              // lock the list, using a Pin lock
            _bufferList.push_back(bufferListElement);      // insert the element at the end of the list
            ReleaseLock(&_bufferListLock);                 // unlock the list
            WIND::ReleaseSemaphore(_bufferSem, 1, NULL);   // signal that there is a buffer on the list
        }
        VOID GetBufferFromList(VOID **buf, UINT64 *numElements,
                               APP_THREAD_REPRESENTITVE **appThreadRepresentitive, THREADID tid)
        {
            WIND::WaitForSingleObject(_bufferSem, INFINITE); // wait until there is a buffer on the list
            GetLock(&_bufferListLock, tid+1);                // lock the list
            if (_bufferList.empty())
            {
                // the list is empty only when the semaphore was signaled for process termination
                *buf = NULL;
                ReleaseLock(&_bufferListLock);
                return;
            }
            BUFFER_LIST_ELEMENT &bufferListElement = (_bufferList.front()); // retrieve the first element of the list
            *buf = bufferListElement.buf;
            *numElements = bufferListElement.numElements;
            *appThreadRepresentitive = bufferListElement.appThreadRepresentitive;
            _bufferList.pop_front();                         // remove the first element from the list
            ReleaseLock(&_bufferListLock);                   // unlock the list
        }
        VOID SignalBufferSem() { WIND::ReleaseSemaphore(_bufferSem, 1, NULL); }
        UINT32 NumBuffersOnList() { return (_bufferList.size()); }
      private:
        struct BUFFER_LIST_ELEMENT // structure of an element of the buffer list
        {
            VOID *buf;
            UINT64 numElements;
            APP_THREAD_REPRESENTITVE *appThreadRepresentitive; // the application thread that owns this buffer
        };
        WIND::HANDLE _bufferSem;   // counting semaphore, value is #of buffers on the list, value==0 => WaitForSingleObject blocks
        PIN_LOCK _bufferListLock;  // Pin Lock
        list<BUFFER_LIST_ELEMENT> _bufferList;
    };
This shows the buffer list manager.
It provides the functionality of inserting a buffer at the end of the list (PutBufferOnList) and popping one from the front of the list and returning it to the caller (GetBufferFromList). It provides mutually exclusive access to the list using a PIN_LOCK, which is taken for ownership using the THREADID of the thread (+1 because THREADIDs start at 0). A counting semaphore (OS specific) reflects the number of buffers on the list. If there are 0 buffers on the list, a thread calling GetBufferFromList is blocked on the counting semaphore until that semaphore enters the signaled state, at which point one of the blocked threads is unblocked. The unblocking almost always occurs because some other thread placed a buffer on the list (see ReleaseSemaphore called at the end of PutBufferOnList); the exception is when the process is terminating. In this case the FiniUnlocked function of the tool releases the semaphore once for each Pin tool thread, allowing it to wake up, see there is nothing on the list, and terminate. We will see this later.
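The same list discipline can be sketched with only the C++ standard library. This is an assumption-laden analogue, not the tool's code: BufferListModel, Put, SignalOnly and Get are hypothetical names, a std::mutex replaces the PIN_LOCK, and a counter guarded by a std::condition_variable stands in for the Windows counting semaphore.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <list>
#include <mutex>

// Portable model of the buffer list: a mutex guards the list, and a counter
// plus condition variable plays the role of the counting semaphore.
class BufferListModel
{
  public:
    // Append a buffer at the end of the list and signal one waiter.
    void Put(void* buf)
    {
        {
            std::lock_guard<std::mutex> guard(_lock);
            _list.push_back(buf);
            ++_signals;
        }
        _cv.notify_one();
    }

    // Signal without adding a buffer (used to wake a waiter at shutdown).
    void SignalOnly()
    {
        {
            std::lock_guard<std::mutex> guard(_lock);
            ++_signals;
        }
        _cv.notify_one();
    }

    // Block until signaled; return the oldest buffer, or nullptr if the list is empty.
    void* Get()
    {
        std::unique_lock<std::mutex> guard(_lock);
        _cv.wait(guard, [this] { return _signals > 0; });
        --_signals;
        if (_list.empty()) return nullptr;
        void* buf = _list.front();
        _list.pop_front();
        return buf;
    }

    std::size_t Size()
    {
        std::lock_guard<std::mutex> guard(_lock);
        return _list.size();
    }

  private:
    std::mutex _lock;
    std::condition_variable _cv;
    std::list<void*> _list;
    std::size_t _signals = 0;
};
```

Buffers come back in the order they were put on the list, and a Get that was woken by SignalOnly with nothing queued returns nullptr, which is the condition a processing thread can treat as the cue to exit.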

membuffer_threadpool
    VOID ThreadFini(THREADID tid, const CONTEXT *ctxt, INT32 code, VOID *v)
    {
        // get the APP_THREAD_REPRESENTITVE of this app thread from the Pin TLS
        APP_THREAD_REPRESENTITVE * appThreadRepresentitive =
            static_cast<APP_THREAD_REPRESENTITVE*>(PIN_GetThreadData(appThreadRepresentitiveKey, tid));
        // wait for all my buffers to be processed
        while (appThreadRepresentitive->_freeBufferListManager->NumBuffersOnList() != KnobNumBuffersPerAppThread-1)
        {
            PIN_Sleep(1);
        }
        delete appThreadRepresentitive;
        PIN_SetThreadData(appThreadRepresentitiveKey, 0, tid);
    }
    static VOID FiniUnlocked(INT32 code, VOID *v)
    {
        BOOL waitStatus;
        INT32 threadExitCode;
        doExit = TRUE; // indicate that process is exiting
        // signal all the Pin Tool threads to wake up and recognize the exit
        for (int i = 0; i < KnobNumProcessingThreads; i++)
        {
            fullBuffersListManager.SignalBufferSem();
        }
        // Wait until all Pin Tool threads exit
        for (set<PIN_THREAD_UID>::iterator it = uidSet.begin(); it != uidSet.end(); ++it)
        {
            waitStatus = PIN_WaitForThreadTermination(*it, PIN_INFINITE_TIMEOUT, &threadExitCode);
        }
    }
The ThreadFini function is a Pin Callback called by Pin when an application thread is about to terminate. Here it retrieves the APP_THREAD_REPRESENTITVE object of the thread from the Pin TLS. Then it executes a sleep loop to wait for all full buffers it placed on the full list to be processed and returned to its free list. The FiniUnlocked function is a Pin Callback called during process termination. It gives the tool the chance to terminate all the Pin tool threads gracefully. The full-buffers list is signaled once for each Pin tool thread. This causes each of the threads to become unblocked, see that the full-buffers list is empty, and exit. The second for loop waits for the termination of all the Pin tool threads.

System Call Instrumentation
    VOID SyscallEntry(THREADID threadIndex, CONTEXT *ctxt, SYSCALL_STANDARD std, VOID *v)
    {
        ADDRINT appIP = PIN_GetContextReg(ctxt, REG_INST_PTR);
        printf("syscall# %d at appIP %x param1 %x param2 %x param3 %x param4 %x param5 %x param6 %x\n",
               PIN_GetSyscallNumber(ctxt, std), appIP,
               PIN_GetSyscallArgument(ctxt, std, 0), PIN_GetSyscallArgument(ctxt, std, 1),
               PIN_GetSyscallArgument(ctxt, std, 2), PIN_GetSyscallArgument(ctxt, std, 3),
               PIN_GetSyscallArgument(ctxt, std, 4), PIN_GetSyscallArgument(ctxt, std, 5));
    }
    VOID SyscallExit(THREADID threadIndex, CONTEXT *ctxt, SYSCALL_STANDARD std, VOID *v)
    {
        printf(" returns: %x\n", PIN_GetSyscallReturn(ctxt, std));
    }
    int main(int argc, char *argv[])
    {
        PIN_Init(argc, argv);
        // Instrument system calls via these Pin Callbacks and not via analysis functions
        PIN_AddSyscallEntryFunction(SyscallEntry, 0);
        PIN_AddSyscallExitFunction(SyscallExit, 0);
        PIN_StartProgram(); // Never returns
    }

Instrumenting a Process Tree Process A creates Process B Process B creates Process C and D And so forth Can use Pin to instrument all or part of the processes of a process tree Use the -follow_execv Pin invocation switch to turn this on Can use different Pin modes (Jit or Probe) on the different processes in the process tree. Can use different Pin Tools on the different processes of a process tree. Architecture of processes in the process tree may be intermixed: e.g. Process A is 32 bit, Process B is 64 bit, Process C is 64 bit, Process D is 32 bit…

Instrumenting a Process Tree
// If this Pin Callback returns FALSE, then the child process will run Natively
BOOL FollowChild(CHILD_PROCESS childProcess, VOID * userData)
{
    BOOL res;
    INT appArgc;
    CHAR const * const * appArgv;
    OS_PROCESS_ID pid = CHILD_PROCESS_GetId(childProcess);
    // Get the command line that child process will be Pinned with, these are the Pin invocation switches
    // that were specified when this (parent) process was Pinned
    CHILD_PROCESS_GetCommandLine(childProcess, &appArgc, &appArgv);
    // The Pin invocation switches of the child can be modified
    INT pinArgc = 0;
    CHAR const * pinArgv[20];
    // :::: Put values in pinArgv, set pinArgc to be the number of entries in pinArgv that are to be used
    CHILD_PROCESS_SetPinCommandLine(childProcess, pinArgc, pinArgv);
    return TRUE; /* Specify Child process is to be Pinned */
}
int main(INT32 argc, CHAR **argv)
{
    PIN_Init(argc, argv);
    cout << " Process is running on Pin in " << (PIN_IsProbeMode() ? " Probe " : " Jit ") << " mode " << endl;
    // The FollowChild Pin Callback will be called when the application being Pinned is about to spawn
    // a child process
    PIN_AddFollowChildProcessFunction(FollowChild, 0);
    if (PIN_IsProbeMode())
        PIN_StartProgramProbed(); // Never returns
    else
        PIN_StartProgram();       // Never returns
    return 0;
}
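One detail of the mode banner printed in main is worth isolating: in C++ the << operator binds more tightly than ?:, so a conditional expression inside a stream chain has to be parenthesized to be parsed as intended. A small sketch, with ModeBanner as a hypothetical helper and a plain bool in place of PIN_IsProbeMode():

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Builds the banner text. The parentheses make the conditional a single
// operand of <<; without them the expression is grouped as
// (out << ... << probeMode) ? ... : ..., which is not what is meant.
std::string ModeBanner(bool probeMode) {
    std::ostringstream out;
    out << " Process is running on Pin in "
        << (probeMode ? " Probe " : " Jit ")
        << " mode ";
    return out.str();
}
```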

CONTEXT* and IARG_CONTEXT CONTEXT* is a handle to the full register context of the application at a particular point in the execution. CONTEXT* is passed by default to a number of Pin Callback functions, e.g. ThreadStart (registered by PIN_AddThreadStartFunction) and BufferFull (registered by PIN_DefineTraceBuffer). An analysis function can receive a CONTEXT* by requesting an IARG_CONTEXT argument.

CONTEXT* and IARG_CONTEXT Passing IARG_CONTEXT to an analysis function has implications: the analysis function will NOT be inlined, and passing the IARG_CONTEXT is time consuming. A CONTEXT* can NOT be dereferenced; it is a handle to be passed to Pin API functions. Pin API functions are supplied to Get and Set registers within the CONTEXT. Set has no effect on a CONTEXT* passed into an analysis function (via an IARG_CONTEXT request). Examples of both Get and Set appear in this tutorial. There are also Pin API functions to Get and Set the FP context.

Managing Exceptions and Signals

Exceptions Catch exceptions that occur in Pin Tool code. Global exception handler: PIN_AddInternalExceptionHandler. Guard a code section with an exception handler: PIN_TryStart, PIN_TryEnd. When an exception occurs in jitted code, Pin is given a chance to identify which application instruction the jitted code corresponds to. Pin will succeed when the jitted code causing the exception is the translation of an application instruction. When it succeeds, Pin re-raises the exception with the exception IP set to the appIP of the jitted code, so the exception can be handled as expected by the application. But when the jitted code is instrumentation code, Pin will NOT succeed in the mapping, and the exception will appear as an unhandled exception. Pin provides APIs to the Pin tool to manage this type of exception.

Exceptions
VOID InstrumentDivide(INS ins, VOID* v)
{
    if ((INS_Mnemonic(ins) == "DIV") && (INS_OperandIsReg(ins, 0)))
    {
        // Will emulate the div instruction with a register operand
        INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(EmulateIntDivide),
                       IARG_REG_REFERENCE, REG_GDX,
                       IARG_REG_REFERENCE, REG_GAX,
                       IARG_REG_VALUE, REG(INS_OperandReg(ins, 0)),
                       IARG_CONTEXT, IARG_THREAD_ID, IARG_END);
        INS_Delete(ins); // Delete the div instruction
    }
}
int main(int argc, char * argv[])
{
    PIN_Init(argc, argv);
    INS_AddInstrumentFunction(InstrumentDivide, 0);
    PIN_AddInternalExceptionHandler(GlobalHandler, NULL); // Registers a Global Exception Handler
    PIN_StartProgram(); // Never returns
    return 0;
}
Here is an example tool that emulates the "div" instruction. If an exception occurs in the emulating code, it must be re-raised as if it occurred at the appIP being emulated, in order to preserve the application behavior.

Exceptions
EXCEPT_HANDLING_RESULT DivideHandler(THREADID tid, EXCEPTION_INFO * pExceptInfo,
                                     PHYSICAL_CONTEXT * pPhysCtxt, // The context when the exception occurred
                                     VOID *appContextArg)          // The application context when the exception occurred
{
    if (PIN_GetExceptionCode(pExceptInfo) == EXCEPTCODE_INT_DIVIDE_BY_ZERO)
    {
        // Divide by zero occurred in the code emulating the divide, use PIN_RaiseException to raise this exception
        // at the appIP, for handling by the application
        cout << " DivideHandler : Caught divide by zero." << PIN_ExceptionToString(pExceptInfo) << endl;
        // Get the application IP where the exception occurred from the application context
        CONTEXT * appCtxt = (CONTEXT *)appContextArg;
        ADDRINT faultIp = PIN_GetContextReg(appCtxt, REG_INST_PTR);
        // raise the exception at the application IP, so the application can handle it as it wants to
        PIN_SetExceptionAddress(pExceptInfo, faultIp);
        PIN_RaiseException(appCtxt, tid, pExceptInfo); // never returns
    }
    return EHR_CONTINUE_SEARCH;
}
VOID EmulateIntDivide(ADDRINT * pGdx, ADDRINT * pGax, ADDRINT divisor, CONTEXT * ctxt, THREADID tid)
{
    PIN_TryStart(tid, DivideHandler, ctxt); // Register a Guard Code Section Exception Handler
    UINT64 dividend = *pGdx;
    dividend <<= 32;
    dividend += *pGax;
    *pGax = dividend / divisor;
    *pGdx = dividend % divisor;
    PIN_TryEnd(tid); /* Guarded Code Section ends */
}
EmulateIntDivide is the Pin tool emulation routine. It uses the PIN_Try* APIs to register the DivideHandler to be called by Pin if an exception occurs in the emulation code. Note that EmulateIntDivide has received a CONTEXT* from Pin (the IARG_CONTEXT was specified in the preceding slide). It tells Pin to pass this CONTEXT* to the DivideHandler should an exception occur in the emulation code, by specifying it as the 3rd parameter to PIN_TryStart. So if an exception occurs in the emulation code, the DivideHandler is called by Pin.
The DivideHandler handles a divide-by-zero exception by retrieving the appIP from the CONTEXT* (the appIP is the value contained in the REG_INST_PTR register, retrieved from the CONTEXT* using PIN_GetContextReg). Then the DivideHandler re-raises the exception, specifying that it occurred at the appIP, using the PIN_SetExceptionAddress and PIN_RaiseException APIs.
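The arithmetic inside EmulateIntDivide can be checked in isolation. The sketch below is a standalone model of a 32-bit unsigned DIV, not the tool's code. Instead of letting the hardware fault and catching it in a handler, it tests the fault conditions explicitly and throws a C++ exception. DivResult and EmulateDiv32 are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

struct DivResult {
    std::uint32_t quotient;   // what DIV leaves in eax
    std::uint32_t remainder;  // what DIV leaves in edx
};

// Models "div r32": divides the 64-bit value edx:eax by divisor.
// Throws std::domain_error in the two cases where the hardware raises a
// divide error: a zero divisor, or a quotient that does not fit in 32 bits.
DivResult EmulateDiv32(std::uint32_t edx, std::uint32_t eax, std::uint32_t divisor) {
    if (divisor == 0)
        throw std::domain_error("divide by zero");
    const std::uint64_t dividend = (static_cast<std::uint64_t>(edx) << 32) | eax;
    const std::uint64_t quotient = dividend / divisor;
    if (quotient > 0xFFFFFFFFull)
        throw std::domain_error("quotient overflow");
    return DivResult{static_cast<std::uint32_t>(quotient),
                     static_cast<std::uint32_t>(dividend % divisor)};
}
```

With edx = 0, eax = 100 and divisor = 7 the result is quotient 14 and remainder 2. A zero divisor throws, which corresponds to the case that DivideHandler re-raises at the appIP.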

Exceptions EXCEPT_HANDLING_RESULT GlobalHandler(THREADID threadIndex, EXCEPTION_INFO * pExceptInfo, PHYSICAL_CONTEXT * pPhysCtxt, VOID *v) { // Any Exception occurring in Pin Tool, or Pin that is not in a Guarded Code Section will cause this function to be // executed cout << "GlobalHandler: Caught unexpected exception. " << PIN_ExceptionToString(pExceptInfo) << endl; return EHR_UNHANDLED; }

Exceptions, Monitoring Application Exceptions
PIN_AddContextChangeFunction: can monitor and change the application state at application exceptions
int main(int argc, char **argv)
{
    PIN_Init(argc, argv);
    PIN_AddContextChangeFunction(OnContextChange, 0);
    PIN_StartProgram(); // Never returns
    return 0;
}
It is possible to monitor the exceptions occurring in jitted code that Pin CAN map directly to the application code. This is done by using PIN_AddContextChangeFunction to register a Pin Callback (in this case OnContextChange) that Pin will call whenever there is a context-changing event that is not the regular time-slice event.

Exceptions, Monitoring Application Exceptions
static void OnContextChange(THREADID tid, CONTEXT_CHANGE_REASON reason,
                            const CONTEXT *ctxtFrom, // Application's register state at exception point
                            CONTEXT *ctxtTo,         // Application's register state delivered to handler
                            INT32 info, VOID *v)
{
    if (CONTEXT_CHANGE_REASON_SIGRETURN == reason || CONTEXT_CHANGE_REASON_APC == reason ||
        CONTEXT_CHANGE_REASON_CALLBACK == reason || CONTEXT_CHANGE_REASON_FATALSIGNAL == reason ||
        ctxtTo == NULL)
    {
        // don't want to handle these
        return;
    }
    // change some register values in the context that the application will see at the handler
    FPSTATE fpContextFromPin;
    // change the bottom 4 bytes of xmm0
    PIN_GetContextFPState(ctxtFrom, &fpContextFromPin);
    fpContextFromPin.fxsave_legacy._xmm[3] = 0xde;
    fpContextFromPin.fxsave_legacy._xmm[2] = 0xad;
    fpContextFromPin.fxsave_legacy._xmm[1] = 0xbe;
    fpContextFromPin.fxsave_legacy._xmm[0] = 0xef;
    PIN_SetContextFPState(ctxtTo, &fpContextFromPin);
    // change rax
    PIN_SetContextReg(ctxtTo, REG_RAX, 0xbaadf00d);
}
Here we see the different reasons for which the registered OnContextChange function will be called. In this example the function wants to handle only the exception case. If it is an exception, this example shows how to change registers in the application state/context that the application exception handler will see. It also shows how to access FP registers in the application context.
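The stores into _xmm[0] through _xmm[3] fill the bottom four bytes of xmm0 one byte at a time. On x86 the lowest index holds the least significant byte. The following sketch shows how a 32-bit pattern such as 0xdeadbeef maps onto those bytes. It assumes a plain 16-byte array as a stand-in for the register image (XmmBytes, SetLowDword and LowDword are hypothetical names, not Pin types).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the 16 bytes of one XMM register in a saved FP state.
using XmmBytes = std::array<std::uint8_t, 16>;

// Stores value into bytes 0..3, least significant byte first.
void SetLowDword(XmmBytes& xmm, std::uint32_t value) {
    for (int i = 0; i < 4; ++i)
        xmm[i] = static_cast<std::uint8_t>(value >> (8 * i));
}

// Reassembles bytes 0..3 into a 32-bit value.
std::uint32_t LowDword(const XmmBytes& xmm) {
    std::uint32_t value = 0;
    for (int i = 0; i < 4; ++i)
        value |= static_cast<std::uint32_t>(xmm[i]) << (8 * i);
    return value;
}
```

After SetLowDword(x, 0xdeadbeef), x[3] holds 0xde and x[0] holds 0xef, the same layout as the four assignments on the slide.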

Signals Establish an interceptor function (with PIN_InterceptSignal) for signals delivered to the application. Tools should never call sigaction() directly to handle signals. The interceptor function is called whenever the application receives the requested signal, regardless of whether the application has a handler for that signal. The function can then decide whether the signal should be forwarded to the application.

Signals A tool can take over ownership of a signal in order to: (1) use the signal as an asynchronous communication mechanism to the outside world. For example, if a tool intercepts SIGUSR1, a user of the tool could send this signal to tell the tool to do something. In this usage model, the tool may call PIN_UnblockSignal() so that it will receive the signal even if the application attempts to block it. (2) "squash" certain signals that the application generates. For example, a tool that forces speculative execution in the application may want to intercept and squash exceptions generated in the speculative code. A tool can set only one "intercept" handler for a particular signal, so a new handler overwrites any previous handler for the same signal. To disable a handler, pass a NULL function pointer.

Signals
BOOL EnableInstrumentation = FALSE;
BOOL SignalHandler(THREADID, INT32, CONTEXT *, BOOL, const EXCEPTION_INFO *, void *)
{
    // When tool receives the signal, enable instrumentation. Tool calls
    // PIN_RemoveInstrumentation() to remove any existing instrumentation from Pin's code cache.
    EnableInstrumentation = TRUE;
    PIN_RemoveInstrumentation();
    return FALSE; /* Tell Pin NOT to pass the signal to the application. */
}
VOID Trace(TRACE trace, VOID *)
{
    if (!EnableInstrumentation)
        return;
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
        BBL_InsertCall(bbl, IPOINT_BEFORE, AFUNPTR(AnalysisFunc), IARG_INST_PTR, IARG_END);
}
int main(int argc, char * argv[])
{
    PIN_Init(argc, argv);
    // Tool should really determine which signal is NOT in use by the application
    PIN_InterceptSignal(SIGUSR1, SignalHandler, 0);
    PIN_UnblockSignal(SIGUSR1, TRUE);
    TRACE_AddInstrumentFunction(Trace, 0);
    PIN_StartProgram(); // Never returns
    return 0;
}
When a separate process sends the SIGUSR1 signal to the pinned process, that signal first arrives at this Pin tool's SignalHandler function. In this case the SIGUSR1 signal tells the tool to remove all the jitted code created up to this point (it was created un-instrumented because EnableInstrumentation is initially FALSE), so that subsequent jitted code is jitted with instrumentation. The Pin API PIN_RemoveInstrumentation removes from the code cache all the jitted code that was created up to this point.
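Why the handler has to flush the code cache, and not only set the flag, can be seen in a toy model. The sketch below is not Pin's code cache. ToyCodeCache is a hypothetical class that records, per application address, whether the cached translation was built with instrumentation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy model: address -> "was this block translated with instrumentation?"
class ToyCodeCache {
public:
    // Runs the block at addr, translating it first if it is not cached.
    // Returns true if the block that ran carries instrumentation.
    bool Execute(std::uintptr_t addr) {
        auto it = cache_.find(addr);
        if (it == cache_.end()) {
            it = cache_.emplace(addr, enabled_).first;  // translate under the current flag
            ++translations_;
        }
        return it->second;
    }
    // Models the signal handler on the slide: set the flag and drop every
    // cached translation so that blocks are re-translated on next use.
    void EnableInstrumentation() { enabled_ = true; cache_.clear(); }
    // Sets the flag only; blocks translated earlier stay un-instrumented.
    void EnableWithoutFlush() { enabled_ = true; }
    int Translations() const { return translations_; }
private:
    std::map<std::uintptr_t, bool> cache_;
    bool enabled_ = false;
    int translations_ = 0;
};
```

In this model a block translated before EnableWithoutFlush() keeps running without instrumentation, while after EnableInstrumentation() the same address is translated again with it.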

Accessing the Decode API The decoder/encoder used is called XED http://www.pintool.org/docs/24110/Xed/html/ Tool code can use the XED API E.g. decode an instruction inside an analysis routine.

Accessing the Decode API
extern "C" {
#include "xed-interface.h"
}
static VOID PIN_FAST_ANALYSIS_CALL MemoryOverWriteAt( // Pin will NOT inline this function, it is the THEN part
    ADDRINT appIP, ADDRINT memWriteAddr, UINT32 numBytesWritten)
{
    INT32 column, lineNum;
    string fileName;
    PIN_GetSourceLocation(appIP, &column, &lineNum, &fileName);
    static const xed_state_t dstate = { XED_MACHINE_MODE_LEGACY_32, XED_ADDRESS_WIDTH_32b };
    xed_decoded_inst_t xedd;
    xed_decoded_inst_zero_set_mode(&xedd, &dstate);
    xed_error_enum_t xed_code = xed_decode(&xedd, reinterpret_cast<UINT8*>(appIP), 15);
    char buf[256];
    xed_decoded_inst_dump_intel_format(&xedd, buf, 256, appIP);
    printf("overwrite of %p from instruction at %p %s originating from file %s line %d col %d\n",
           KnobMemAddrBeingOverwritten, appIP, buf, fileName.c_str(), lineNum, column);
    printf(" writing %d bytes starting at %p\n", numBytesWritten, memWriteAddr);
}
Here we show the analysis function of the watchpoint example using the Decode API to decode the instruction at the given appIP. The decoder is called XED.

Pin Code-Cache API The Code-Cache API allows a Pin Tool to: Inspect Pin's code cache and/or alter the code cache replacement policy Assume full control of the code cache Remove all or selected traces from the code cache Monitor code cache activity, including start/end of execution of code in the code cache

Pin Code-Cache API
VOID DoSmcCheck(VOID * traceAddr, VOID * traceCopyAddr, USIZE traceSize, CONTEXT * ctxP)
{
    if (memcmp(traceAddr, traceCopyAddr, traceSize) != 0) /* application code changed */
    {
        // the jitted trace is no longer valid
        free(traceCopyAddr);
        CODECACHE_InvalidateTraceAtProgramAddress((ADDRINT)traceAddr);
        PIN_ExecuteAt(ctxP); /* Continue jitted execution at this application trace */
    }
}
VOID InstrumentTrace(TRACE trace, VOID *v)
{
    VOID * traceAddr;
    VOID * traceCopyAddr;
    USIZE traceSize;
    traceAddr = (VOID *)TRACE_Address(trace); // The appIP of the start of the trace
    traceSize = TRACE_Size(trace);            // The size of the original application trace in bytes
    traceCopyAddr = malloc(traceSize);
    if (traceCopyAddr != 0)
    {
        memcpy(traceCopyAddr, traceAddr, traceSize); // Copy of original application code in trace
        // Insert a call to DoSmcCheck before every trace
        TRACE_InsertCall(trace, IPOINT_BEFORE, (AFUNPTR)DoSmcCheck,
                         IARG_PTR, traceAddr, IARG_PTR, traceCopyAddr,
                         IARG_UINT32, traceSize, IARG_CONTEXT, IARG_END);
    }
}
int main(int argc, char * argv[])
{
    PIN_Init(argc, argv);
    TRACE_AddInstrumentFunction(InstrumentTrace, 0);
    PIN_StartProgram(); // Never returns
    return 0;
}
Pin supports Self Modifying Code (SMC): code in writable pages is jitted as self-verifying traces. In essence, the jitted code of the trace contains a copy of the application code that it translates, and each time the trace is executed it verifies that the application code currently at the jitted appIPs is the same as it was when the trace was jitted. If NOT, the jitted trace is discarded and re-jitted. The above is an example of a tool that does this; it was used to run applications with SMC on Pin before Pin supported SMC. DoSmcCheck uses CODECACHE_InvalidateTraceAtProgramAddress to remove the jitted trace from the code cache when it finds that any of the bytes of the original application code in the trace have changed.
Then it uses the CONTEXT * parameter and the PIN_ExecuteAt API to continue execution at the application context specified in the CONTEXT * parameter. This means that the application trace starting at the appIP of the start of the trace will be jitted again and then executed.
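The snapshot-and-compare idea behind DoSmcCheck can be exercised without Pin. The sketch below is a toy model under stated assumptions: VerifiedTrace is a hypothetical class, a byte array plays the role of the application code, and "re-jitting" is reduced to taking a fresh snapshot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy model of a self-verifying trace: remembers the code bytes it was built from.
class VerifiedTrace {
public:
    VerifiedTrace(const std::uint8_t* code, std::size_t size)
        : code_(code), snapshot_(code, code + size) {}

    // True while the live code still matches the snapshot taken at build time.
    bool StillValid() const {
        return std::memcmp(code_, snapshot_.data(), snapshot_.size()) == 0;
    }

    // Check performed before each execution. If the code changed, the stale
    // translation is replaced (modelled by a new snapshot). Returns true
    // when a rebuild happened.
    bool CheckBeforeRun() {
        if (StillValid())
            return false;
        snapshot_.assign(code_, code_ + snapshot_.size());
        ++rebuilds_;
        return true;
    }

    int Rebuilds() const { return rebuilds_; }

private:
    const std::uint8_t* code_;            // live application code (not owned)
    std::vector<std::uint8_t> snapshot_;  // copy taken when the trace was built
    int rebuilds_ = 0;
};
```

Holding the snapshot in a std::vector means the copy is released together with the trace object, whereas the tool on the slide frees its malloc'ed copy only when the check fails.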

Transparent debugging, and extending the debugger Transparently debug the application while it is running on Pin + Pin Tool Have detailed explanation in the Pin User Manual Use Pin Tool to enhance/extend the debugger capabilities Watchpoint: Is order of magnitude faster when implemented using Pin Tool See previous “Symbols: Accessing Application Debug Info from a Pin Tool” Which branch is branching to address 0 Easy to write a Pin Tool that implements this

Part4 Summary Boldly went where no pin head has gone before… Lived to tell the tale. Topics covered: the membuffer_threadpool tool (using multiple buffers in the Pin Buffering API, using Pin Tool Threads, using Pin and OS locks to synchronize threads); System call instrumentation; Instrumenting a process tree; CONTEXT* and IARG_CONTEXT; Managing Exceptions and Signals; Decode API; Pin Code-Cache API; Transparent debugging, and extending the debugger.

Part5 Performance #s

Performance: Applications, Instrumentation, Workloads
Applications
SpecINT: SPEC CPU2000, SPEC integer benchmarks
SpecFP: SPEC floating point benchmarks
Cinebench: MAXON CINEBENCH 10, graphics performance benchmarks
Povray: POV-Ray, ray tracing
RealOne: RealOne Player 1.0, media player
Illustrator: Adobe Illustrator 9.1, graphics design
Dreamweaver: Adobe Dreamweaver 9, website design
Director: Macromedia Director 9, 3D games, demos
MediaEncoder: Microsoft Media Encoder 9, audio and video capturing
Word, Excel, PowerPoint, Access, Outlook: Microsoft Office XP, office applications
Instrumentation
BBCount (lightweight): counts executed basic blocks
MemTrace (middleweight): records memory references
MemError (Intel Parallel Inspector) (heavyweight): detects memory leaks, uninitialized variables, etc.
Workloads
SpecINT, SpecFP: reference input
Cinebench, Povray: rendering an image (scalable)
Other GUI applications: proprietary Visual Test scripts

Pin in Windows vs. Pin in Linux Without Instrumentation For CPU bound applications Pin shows nearly the same performance in Windows and Linux SPECINT overhead is higher than SPECFP - more control instructions and low-trip-count code Pin uses the same binary translation technique on Windows and Linux

Pin in Windows vs. Pin in Linux With Instrumentation Instrumentation overhead dominates the translation overhead

Windows Applications with Light and Middleweight Instrumentation (chart series: No Instrumentation, BBCount, MemTrace) As the number of instructions per branch shrinks, BBCount overhead grows. As the percentage of memory instructions grows, MemTrace overhead grows.

Windows Applications with Heavyweight Instrumentation (chart series: No Instrumentation, MemError) The code translation overhead in Pin is much smaller than the overhead of non-trivial analysis in tools. MemError overhead is proportional to the number of memory accesses.

Pin Performance on Scalable Workloads (chart series: Native, No Instrumentation, BBCount, MemTrace, MemError) Pin VMM serialization does not impact scalability of the application. Execution in the code cache is not serialized. Scalability may drop due to limited memory bandwidth (MemTrace) or contention for tool private data (MemError).

Kernel Interaction Overhead
Slowdown relative to native, per kernel interaction:
System Calls 12X
Exceptions 10.5X
APCs 3X
Callbacks 1.8X
The cost of a trip into the Pin VMM for each system call is high: ~3000 cycles in the VMM vs. ~500 cycles for the ring crossing. Future work: a faster path in the VMM for system calls.
Kernel interaction counts (columns: Illustrator, Excel, CINEBENCH, POV-Ray):
System Calls 1,659,298 658,683 101,700 75,313
Exceptions 1
APCs 6 24
Callbacks 73,062 68,767 961 7,682
Overhead vs. Total Runtime 3.3% 2.8% <1%
Total overhead for handling kernel interactions is relatively low. Kernel interactions are infrequent for the majority of applications.
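As a rough back-of-the-envelope check of the figures on this slide, the cycles spent in the VMM scale linearly with the number of system calls. The sketch below uses the slide's ~3000 cycles per system call and Illustrator's 1,659,298 system calls. The clock rate passed to CyclesToSeconds is an assumption made for illustration, not a figure from the slides, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Total cycles spent in the VMM for a given number of system calls.
constexpr std::uint64_t VmmCycles(std::uint64_t syscalls, std::uint64_t cyclesPerCall) {
    return syscalls * cyclesPerCall;
}

// Converts a cycle count to seconds at an assumed clock rate in Hz.
constexpr double CyclesToSeconds(std::uint64_t cycles, double clockHz) {
    return static_cast<double>(cycles) / clockHz;
}

// 1,659,298 system calls at ~3000 cycles each.
constexpr std::uint64_t kIllustratorCycles = VmmCycles(1659298, 3000);
static_assert(kIllustratorCycles == 4977894000ull, "about 5 billion cycles");
```

At a hypothetical 3 GHz clock this corresponds to roughly 1.66 seconds of VMM time over the whole run.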

Overall Summary Pin is Intel’s dynamic binary instrumentation engine Pin can be used to instrument all user level code Windows, Linux IA-32, Intel64, IA64 Product level robustness Jit-Mode for full instrumentation: Thread, Function, Trace, BBL, Instruction Probe-Mode for Function Replacement/Wrapping/Instrumentation only. Pin supports multi-threading, no serialization of jitted application nor of instrumentation code Pin API makes Pin tools easy to write Presented many tools, many fit on 1 ppt slide Pin performance is good Pin APIs provide for writing efficient Pin tools Popular and well supported 30,000+ downloads, 400+ citations Free DownLoad www.pintool.org Includes: Detailed user manual, source code for 100s of Pin tools Pin User Group http://tech.groups.yahoo.com/group/pinheads/ Pin users and Pin developers answer questions