Part Two: Optimizing Pintools Robert Cohn Kim Hazelwood.


Pin Tutorial – ISCA
Reducing Instrumentation Overhead
Total Overhead = Pin Overhead + Pintool Overhead
- Pin overhead: ~5% for SPECfp and ~50% for SPECint. Minimizing this is the Pin team's job.
- Pintool overhead: usually much larger than the Pin overhead. Pintool writers can help minimize this!

Pin Overhead: SPEC Integer 2006 (figure)

Adding User Instrumentation (figure)

Reducing the Pintool's Overhead
Pintool's Overhead = Instrumentation Routines Overhead + Analysis Routines Overhead
Analysis Routines Overhead = Frequency of calling an Analysis Routine x Work required in the Analysis Routine
Work required in the Analysis Routine = Work required for transiting to Analysis Routine + Work done inside Analysis Routine

Reducing Work in Analysis Routines
Key: Shift computation from analysis routines to instrumentation routines whenever possible.
This usually yields the largest speedup.

Edge Counting: a Slower Version
    ...
    // Analysis routine: looks up the edge counter on every executed branch
    void docount2(ADDRINT src, ADDRINT dst, INT32 taken) {
        COUNTER *pedg = Lookup(src, dst);
        pedg->count += taken;
    }

    // Instrumentation routine
    void Instruction(INS ins, void *v) {
        if (INS_IsBranchOrCall(ins)) {
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount2,
                           IARG_INST_PTR, IARG_BRANCH_TARGET_ADDR,
                           IARG_BRANCH_TAKEN, IARG_END);
        }
    }
    ...

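Edge Counting: a Faster Version
    // Analysis routine for direct branches: the counter was looked up at instrumentation time
    void docount(COUNTER *pedg, INT32 taken) {
        pedg->count += taken;
    }

    // Analysis routine for indirect branches: the target is known only at run time
    void docount2(ADDRINT src, ADDRINT dst, INT32 taken) {
        COUNTER *pedg = Lookup(src, dst);
        pedg->count += taken;
    }

    // Instrumentation routine
    void Instruction(INS ins, void *v) {
        if (INS_IsDirectBranchOrCall(ins)) {
            // Source and target are both known here, so do the lookup once
            COUNTER *pedg = Lookup(INS_Address(ins),
                                   INS_DirectBranchOrCallTargetAddress(ins));
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount,
                           IARG_ADDRINT, pedg,
                           IARG_BRANCH_TAKEN, IARG_END);
        } else if (INS_IsBranchOrCall(ins)) {
            // Same check as in the slower version: only branches and calls are instrumented
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount2,
                           IARG_INST_PTR, IARG_BRANCH_TARGET_ADDR,
                           IARG_BRANCH_TAKEN, IARG_END);
        }
    }
    ...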

Analysis Routines: Reduce Call Frequency
Key: Instrument at the largest granularity whenever possible.
- Instead of inserting one call per instruction
- Insert one call per basic block or trace

Slower Instruction Counting
    counter++;
    sub  $0xff, %edx
    counter++;
    cmp  %esi, %edx
    counter++;
    jle
    counter++;
    mov  $0x1, %edi
    counter++;
    add  $0x10, %eax

Faster Instruction Counting
Counting at BBL level:
    counter += 3
    sub  $0xff, %edx
    cmp  %esi, %edx
    jle
    counter += 2
    mov  $0x1, %edi
    add  $0x10, %eax
Counting at trace level:
    sub  $0xff, %edx
    cmp  %esi, %edx
    jle
    mov  $0x1, %edi
    add  $0x10, %eax
    counter += 5
    L1: counter += 3

Reducing Work for Analysis Transitions
- Reduce number of arguments to analysis routines
- Inline analysis routines
- Pass arguments in registers
- Instrumentation scheduling

Reduce Number of Arguments
- Eliminate arguments only used for debugging
- Instead of passing TRUE/FALSE, create 2 analysis functions
  - Instead of inserting a call to: Analysis(BOOL val)
  - Insert a call to one of these: AnalysisTrue(), AnalysisFalse()
- IARG_CONTEXT is very expensive (> 10 arguments)

Inlining
Pin will inline analysis functions into application code.
    // Inlinable: straight-line code
    int docount0(int i) {
        x[i]++;
        return x[i];
    }

    // Not inlinable: contains a conditional branch
    int docount1(int i) {
        if (i == 1000)
            x[i]++;
        return x[i];
    }

    // Not inlinable: calls another function
    int docount2(int i) {
        x[i]++;
        printf("%d", i);
        return x[i];
    }

    // Not inlinable: contains a loop
    void docount3() {
        for (int i = 0; i < 100; i++)
            x[i]++;
    }

Inlining
Inlining decisions are recorded in pin.log with log_inline:
    pin -xyzzy -mesgon log_inline -t mytool -- app
Sample pin.log output:
    Analysis function at 0x2a... CAN be inlined
    Analysis function at 0x2a... is not inlinable because the last instruction of the first bbl fetched is not a ret instruction.
    The first bbl fetched:
    ================================================================================
    bbl[5:UNKN]: [p: ?,n: ? ] [____] rtn[ ? ]
        push rbp
        mov rbp, rsp
        mov rax, qword ptr [rip+0x3ce2b3]
        inc dword ptr [rax]
        mov rax, qword ptr [rip+0x3ce2aa]
        cmp dword ptr [rax], 0xf...
        jnz 0x11

Passing Arguments in Registers
- 32-bit platforms pass arguments on the stack
- Passing arguments in registers helps small inlined functions
    VOID PIN_FAST_ANALYSIS_CALL docount(ADDRINT c) {
        icount += c;
    }

    BBL_InsertCall(bbl, IPOINT_ANYWHERE, AFUNPTR(docount),
                   IARG_FAST_ANALYSIS_CALL,
                   IARG_UINT32, BBL_NumIns(bbl), IARG_END);

Conditional Inlining
Inline a common scenario where the analysis routine has a single "if-then":
- The "if" part is always executed
- The "then" part is rarely executed
Useful cases:
1. "If" can be inlined, "then" cannot
2. "If" has a small number of arguments, "then" has many arguments (or IARG_CONTEXT)
The Pintool writer breaks the analysis routine into two:
    INS_InsertIfCall(ins, ..., (AFUNPTR)doif, ...)
    INS_InsertThenCall(ins, ..., (AFUNPTR)dothen, ...)

IP-Sampling (a Slower Version)
    const INT32 N = 10000;
    const INT32 M = 5000;
    INT32 icount = N;

    // Analysis routine: called before every instruction
    VOID IpSample(VOID *ip) {
        --icount;
        if (icount == 0) {
            fprintf(trace, "%p\n", ip);
            icount = N + rand() % M;  // icount is between N and N+M-1
        }
    }

    // Instrumentation routine
    VOID Instruction(INS ins, VOID *v) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)IpSample,
                       IARG_INST_PTR, IARG_END);
    }

IP-Sampling (a Faster Version)
    // "If" part: inlined
    INT32 CountDown() {
        --icount;
        return (icount == 0);
    }

    // "Then" part: not inlined
    VOID PrintIp(VOID *ip) {
        fprintf(trace, "%p\n", ip);
        icount = N + rand() % M;  // icount is between N and N+M-1
    }

    VOID Instruction(INS ins, VOID *v) {
        // CountDown() is always called before an instruction is executed
        INS_InsertIfCall(ins, IPOINT_BEFORE, (AFUNPTR)CountDown, IARG_END);
        // PrintIp() is called only if the last call to CountDown()
        // returns a non-zero value
        INS_InsertThenCall(ins, IPOINT_BEFORE, (AFUNPTR)PrintIp,
                           IARG_INST_PTR, IARG_END);
    }

Instrumentation Scheduling
If instrumentation can be inserted anywhere in a basic block:
- Let Pin know via IPOINT_ANYWHERE
- Pin will find the best point to insert the instrumentation to minimize register spilling

ManualExamples/inscount1.cpp
    #include <stdio.h>
    #include "pin.H"

    UINT64 icount = 0;

    // Analysis routine
    void docount(INT32 c) { icount += c; }

    // Instrumentation routine
    void Trace(TRACE trace, void *v) {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
            BBL_InsertCall(bbl, IPOINT_ANYWHERE, (AFUNPTR)docount,
                           IARG_UINT32, BBL_NumIns(bbl), IARG_END);
        }
    }

    void Fini(INT32 code, void *v) {
        fprintf(stderr, "Count %lld\n", icount);
    }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        TRACE_AddInstrumentFunction(Trace, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();
        return 0;
    }

Optimizing Your Pintools - Summary
Baseline Pin has fairly low overhead (~5-20%).
Adding instrumentation can increase overhead significantly, but you can help!
1. Move work from analysis to instrumentation routines
2. Explore larger granularity instrumentation
3. Explore conditional instrumentation
4. Understand when Pin can inline instrumentation

Part Three: Analyzing Parallel Programs Robert Cohn Kim Hazelwood

ManualExamples/inscount0.cpp
    #include <iostream>
    #include "pin.H"

    UINT64 icount = 0;

    // Analysis routine
    void docount() { icount++; }  // unsynchronized access to global variable

    // Instrumentation routine
    void Instruction(INS ins, void *v) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
    }

    void Fini(INT32 code, void *v) {
        std::cerr << "Count " << icount << std::endl;
    }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();
        return 0;
    }

Making Tools Thread Safe
- Pthreads/Windows thread functions are not safe to call from a tool
  - They interfere with the application
- Pin provides simple functions
  - Locks (be careful about deadlocks)
  - Thread local storage
  - Callbacks for thread begin/end
- More complicated threading calls should be done in a separate process

Using Locks
    UINT64 icount = 0;
    PIN_LOCK lock;

    void docount() {
        GetLock(&lock, 1);
        icount++;
        ReleaseLock(&lock);
    }

    void Instruction(INS ins, void *v) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
    }

    void Fini(INT32 code, void *v) {
        GetLock(&lock, 1);
        std::cerr << "Count " << icount << endl;
        ReleaseLock(&lock);
    }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();
        return 0;
    }

Thread Start/End Callbacks
    VOID ThreadStart(THREADID tid, CONTEXT *ctxt, INT32 flags, VOID *v) {
        cout << "Thread is starting: " << tid << endl;
    }

    VOID ThreadFini(THREADID tid, const CONTEXT *ctxt, INT32 code, VOID *v) {
        cout << "Thread is ending: " << tid << endl;
    }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        PIN_AddThreadStartFunction(ThreadStart, 0);
        PIN_AddThreadFiniFunction(ThreadFini, 0);
        PIN_StartProgram();
        return 0;
    }

Threadid
- ID assigned to each thread, never reused
- Starts from 0 and increments
- Passed with IARG_THREAD_ID
- Use it to help debug deadlocks
  - GetLock(&lock, threadid)
- Use it to index into an array (simple thread local storage)
  - Values[threadid]

Thread Local Storage
- Make access thread safe by using thread local storage
- Pin allocates thread local storage for each thread
- You can request a slot in thread local storage
- A slot typically holds a pointer to data that has been malloced

Thread Local Storage
    static UINT64 icount = 0;
    TLS_KEY key;

    // Analysis routine: increments this thread's private counter
    VOID docount(THREADID tid) {
        ADDRINT *counter = static_cast<ADDRINT *>(PIN_GetThreadData(key, tid));
        *counter = *counter + 1;
    }

    VOID Instruction(INS ins, VOID *v) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount,
                       IARG_THREAD_ID, IARG_END);
    }

    VOID ThreadStart(THREADID tid, CONTEXT *ctxt, INT32 flags, VOID *v) {
        ADDRINT *counter = new ADDRINT(0);  // start each thread's count at zero
        PIN_SetThreadData(key, counter, tid);
    }

    VOID ThreadFini(THREADID tid, const CONTEXT *ctxt, INT32 code, VOID *v) {
        ADDRINT *counter = static_cast<ADDRINT *>(PIN_GetThreadData(key, tid));
        icount += *counter;  // fold the thread's count into the global total
        delete counter;
    }

Thread Local Storage (continued)
    // This function is called when the application exits
    VOID Fini(INT32 code, VOID *v) {
        // Write to a file since cout and cerr may be closed by the application
        ofstream OutFile("icount.out");
        OutFile << "Count " << icount << endl;
        OutFile.close();
    }

    // argc, argv are the entire command line, including pin -t
    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        key = PIN_CreateThreadDataKey(0);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_AddThreadStartFunction(ThreadStart, 0);
        PIN_AddThreadFiniFunction(ThreadFini, 0);
        PIN_StartProgram();
        return 0;
    }