
1 Quiz Wei Hsu 8/16/2006

2 Which of the following instructions are speculative in nature?
A) Data cache prefetch instruction
B) Non-faulting loads
C) Speculative loads (e.g. ld.s)
D) Advanced loads (e.g. ld.a)
E) Stores
Answer: A, B, C, D

3 Which of the following motivate dynamic optimization?
A) When the underlying micro-architecture is different from the one the object code was compiled for.
B) When the program behaves very differently on different input data.
C) When the application is large and has a very flat profile.
D) When the application is written in C/C++.
Answer: A, B

4 Which of the following may increase MLP (memory-level parallelism)?
A) A larger instruction re-ordering window for OOO processors
B) Using code scheduling to overlap delinquent loads if the processor uses a stall-on-use model
C) Inserting cache prefetches for multiple delinquent loads
D) Decreasing the associativity of the cache
E) Using a helper thread running on the second core
Answer: A, B, C, E

5 Montecito is a dual-core CMP, but the two cores do not share on-chip caches (L1/L2/L3). How may we use helper threads?
A) We may use VMT, which switches the main thread to a helper thread on an L3 cache miss.
B) It is hard to use the other core for helper threads since the synchronization overhead is high.
C) It is possible to use the other core to warm up the off-chip shared L4 cache, if there is one.
D) It is possible to use the other core to warm up near memory side caches.
Answer: A, B, C, D

6 Dynamic Instrumentation Techniques and their Applications Wei Hsu 8/16/2006

7 Program Instrumentation
- Instrumentation: a technique for inserting extra code (or probes) into an application to observe its behavior
- Program measurements (profiles, value profiles)
- Trace generators (e.g. branch trace, memory trace)
- Protection (program introspection)
- Emulation (cache simulator, Shade)
- Migration (e.g. PA 1.1 to 2.0)
- Debugging tools (Pure Software, Purify, memory checker)

8 Instrumentation Time
- In source code
  - For communicating high-level, domain-specific abstractions to tools
  - Portable across multiple compilers and platforms
- At compile time
  - The compiler inserts instrumentation, e.g. using the -p, -pg, -a, -ax … flags
- At post-link time
  - Often referred to as binary editing tools, e.g. ATOM, EEL, Pixie
  - No need to recompile applications
  - Language independent
  - Will not be invalidated or affected by compiler optimization
- At runtime
  - Instrumentation during program execution
  - No recompilation, no re-linking, no restarting
  - Can be inserted and/or removed at runtime, so there is no disabled probe effect
  - Requires continuous porting efforts as computing platforms evolve

9 Static Binary Editing Tools
- ATOM on DEC (COMPAQ) Unix
- NTATOM on Windows NT
- HiProf/TracePoint (performance tools based on ATOM)
- Pixie for MIPS
- Etch (instrumentation/optimization for Win32/x86 apps)
- OM system (for link-time optimizations, developed at DEC)
  - Spike for Alpha
  - iSPike for Itanium
- CacheProf (now rolled into valgrind, a popular Linux tool set for profiling and debugging, based on dynamic instrumentation)
- UQBT (resourceable and retargetable binary translator)
- EEL

10 EEL: Machine-Independent Executable Editing
- EEL (Executable Editing Library) is a C++ library that hides much of the complexity and system-specific detail of editing executables.
- Applications appear unchanged, and data is collected as a side effect of execution.
- Qpt/Qpt2 are tracing tools based on EEL. Qpt’s performance is better than Shade’s (Qpt is 2-6x slower than native execution even with tracing).

11 Instrumentation Example
    if (A > B) {
        bb[0]++;
        A = 1;
    } else {
        bb[1]++;
        A = 0;
    }
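To make the slide's example concrete, here is a minimal C++ sketch of the same branch with one probe per basic block. The names `count_block` and `instrumented_branch` are illustrative assumptions, not part of any tool named in the slides; a real editor would insert the counter updates into the binary rather than the source.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One execution counter per basic block: 0 = "then" block, 1 = "else" block.
std::array<std::uint64_t, 2> bb{};

// The probe an instrumenting tool would insert at the top of a block.
void count_block(std::size_t id) {
    assert(id < bb.size());
    bb[id]++;
}

// The slide's if/else with one probe inserted in each block.
int instrumented_branch(int A, int B) {
    if (A > B) {
        count_block(0);
        A = 1;
    } else {
        count_block(1);
        A = 0;
    }
    return A;
}
```

After a run, `bb[0]` and `bb[1]` hold how often each side of the branch executed, which is the profile the probes exist to collect.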

12 Executable Editing
- Typical binary editing
  - Decompose
  - Build IR
  - Insert instrumentation
  - Convert IR to executable
- EEL’s approach
  - Abstractions: executable, routine, CFG, instruction, snippet
  - Add snippets to a routine’s CFG
  - Produce a new version of the routine from the edited CFG

13 Executable Editing
- Obstacles
  - Addresses are bound
  - Registers are bound
Example: if (a) a = b
    Bnz %r1, .+2
    Ld  _b, %r1
Insert:
    Ld  _counter, %rx
    Add %rx, 1, %rx
    St  %rx, _counter
Q: Is register %rx free? What about the branch instruction's offset?

14 Handling Registers
    Bnz %r1, .+2
    Ld  _b, %r1
Insert:
    Ld  _counter, %rx
    Add %rx, 1, %rx
    St  %rx, _counter
1) If there are free registers (dead at the point of insertion), the editor can replace %rx with a free register.
2) If there are no free registers, a wrapper routine must be used to spill %rx to the stack.
EEL uses data flow analysis to identify free registers, which is not trivial.
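The idea of a register being "dead at the point of insertion" can be sketched with a backward liveness pass. The C++ below is a simplified illustration that assumes a single straight-line block and eight registers; `RegInstr`, `live_before`, and `find_free_reg` are hypothetical names, and EEL's actual analysis works over a whole CFG.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

constexpr std::size_t kNumRegs = 8;
using RegSet = std::bitset<kNumRegs>;

struct RegInstr {
    RegSet defs;  // registers written by the instruction
    RegSet uses;  // registers read by the instruction
};

// Convenience helper to build a register set from register numbers.
RegSet regs(std::initializer_list<std::size_t> rs) {
    RegSet s;
    for (std::size_t r : rs) s.set(r);
    return s;
}

// result[i] = registers live just before instruction i (backward pass).
std::vector<RegSet> live_before(const std::vector<RegInstr>& code, RegSet live_at_exit) {
    std::vector<RegSet> live(code.size());
    RegSet cur = live_at_exit;
    for (std::size_t i = code.size(); i-- > 0;) {
        cur = (cur & ~code[i].defs) | code[i].uses;  // live-in = (live-out - defs) + uses
        live[i] = cur;
    }
    return live;
}

// Lowest-numbered register that is dead just before instruction `at`, or -1.
int find_free_reg(const std::vector<RegInstr>& code, RegSet live_at_exit, std::size_t at) {
    assert(at < code.size());
    const std::vector<RegSet> live = live_before(code, live_at_exit);
    for (std::size_t r = 0; r < kNumRegs; ++r)
        if (!live[at][r]) return static_cast<int>(r);
    return -1;  // nothing free: the editor would have to spill
}
```

A result of -1 corresponds to case 2 on the slide, where the snippet must save and restore a register itself.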

15 Handling Addresses
    Bnz %r1, .+2
    Ld  _b, %r1
Insert:
    Ld  _counter, %rx
    Add %rx, 1, %rx
    St  %rx, _counter
How do we change the address in the branch instruction?
- EEL uses control flow analysis to change addresses in branches, calls, and jumps.
- One alternative is to replace the Ld instruction with a branch to the instrumented code segment (like a procedure call) so that the addresses of other branch instructions remain the same.
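The first approach, adjusting branch displacements after an insertion, can be sketched as follows. This is a toy C++ model that assumes fixed-size instructions addressed by index and probes that contain no branches; `Insn` and `insert_and_fix` are hypothetical names, not EEL's interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Insn {
    std::string text;  // mnemonic and operands, kept only for display
    bool is_branch;    // true if `rel` is meaningful
    int rel;           // branch target index = own index + rel
};

// Insert `probe` before code[pos] and return the re-linked instruction list.
// Assumes probe instructions are not branches and all targets are in range.
std::vector<Insn> insert_and_fix(const std::vector<Insn>& code, std::size_t pos,
                                 const std::vector<Insn>& probe) {
    assert(pos < code.size());
    const int p = static_cast<int>(pos);
    const int k = static_cast<int>(probe.size());
    // New index of the original instruction that used to sit at `idx`.
    auto new_index = [&](int idx) { return idx >= p ? idx + k : idx; };
    // New index a branch to `idx` should reach; a branch to `p` lands on the probe.
    auto new_target = [&](int idx) { return idx > p ? idx + k : idx; };

    std::vector<Insn> out;
    for (int i = 0; i < static_cast<int>(code.size()); ++i) {
        if (i == p) out.insert(out.end(), probe.begin(), probe.end());
        Insn ins = code[static_cast<std::size_t>(i)];
        if (ins.is_branch) ins.rel = new_target(i + ins.rel) - new_index(i);
        out.push_back(ins);
    }
    return out;
}
```

With the slide's example, inserting the three counter instructions before the `Ld` turns the branch displacement `.+2` into `.+5`, so the branch still skips the load and now skips the probe as well.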

16 Handling Addresses
    Bnz  %r1, .+2
    Call xxxxx
Instrumented code segment at xxxxx:
    Ld  _counter, %rx
    Add %rx, 1, %rx
    St  %rx, _counter
    Ld  _b, %r1
    ret
Pros
- No need for a CFG, and no adjustment to addresses in branch/jump instructions
Cons
- Less efficient instrumented code
- Unclear how to handle variable-length instructions

17 EEL Abstractions
- Executable
  - Object file, library, statically or dynamically linked program
  - Uses symbol table information, but does not rely on it
- Analysis to identify all routines
  - Use the symbol table to form the initial set of routines
  - If there is no symbol table, the initial set contains only the program entry address and the first location in the text segment
  - Examine instructions to locate jumps out of a routine, or calls to routines not in the initial set
  - A CFG is constructed

18 Code Snippet
- A code snippet can be written in assembly or in a high-level language. It is usually written in assembly for efficiency, which makes it machine dependent.
- When a tool creates a snippet, it specifies the instructions, two register sets, and a call-back function.
  - Registers used in the snippet that need to be assigned unused registers
  - Particular registers that EEL should neither spill nor assign
  - The call-back function edits displacements

19 Editing Example:
    1* sethi 0x1, %g6
    2* ld    [%lo(0x1)+%g6], %g7
       add   %g7, 1, %g7
    3* st    %g7, [%lo(0x1)+%g6]
- EEL modifies calls, branches, and jumps to ensure correct control flow

20 CFG of a routine
- EEL represents a routine as a CFG
- Why a CFG?
  - A profiling tool, qpt, requires a CFG to place instrumentation code on CFG edges (what’s wrong with block counts?)
  - EEL uses the CFG to adjust addresses in branches and jumps
  - The CFG provides an architecture-independent view of control flow

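21 Representing Delayed branches
Delayed branch:
    Bne %icc, L1
    Add %r1, %r2, %r3
Annulled delayed branch:
    Bne,a %icc, L1
    Add %r1, %r2, %r3
[Figure: CFG representation of each case with branch target L1; in the annulled case one edge carries a nullified delay slot.]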

22 Incomplete CFG
- When control flow cannot be completely analyzed, runtime code ensures correct execution.
- The paper claims that most indirect jumps occur in case statements (actually, most indirect branches are return jumps, shared-library calls, and indirect calls). EEL uses backward slicing to find the jump table and complete the CFG.
- EEL’s backward slicing makes runtime translation a rare occurrence: there were no unanalyzable indirect jumps in SPEC92 using SunOS’s compilers.

23
    int num = 0;  // index of the next edge counter

    int main(int argc, char* argv[]) {
        executable* exec = new executable(argv[1]);
        exec->read_contents();
        routine* r;
        FOREACH_ROUTINE(r, exec->routines()) {
            instrument(r);
            ….
        }
        ….
    }

    // Add a counter along every out-edge of each block with more than one successor
    void instrument(routine* r) {
        cfg* g = r->control_flow_graph();
        bb* b;
        FOREACH_BB(b, g->blocks()) {
            if (1 < b->succ()->size()) {
                edge* e;
                FOREACH_EDGE(e, b->succ()) {
                    e->add_code_along(incr_count(num));
                    num++;
                }
            }
        }
    }
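The edge-instrumentation loop above can be mimicked without EEL. The following self-contained C++ sketch assumes a toy CFG type (`Cfg`, `Edge`, `instrument`, and `run` are all hypothetical names): it assigns a counter to every out-edge of a multi-successor block, then replays a path to show what those counters would record.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Edge {
    std::size_t to;   // successor block id
    int counter_id;   // -1 means the edge is not instrumented
};

struct Cfg {
    std::vector<std::vector<Edge>> succ;  // succ[b] = out-edges of block b
    explicit Cfg(std::size_t n) : succ(n) {}
    void add_edge(std::size_t from, std::size_t to) { succ.at(from).push_back({to, -1}); }
};

// Give every out-edge of a multi-successor block its own counter.
// Returns the number of counters allocated.
int instrument(Cfg& g) {
    int num = 0;
    for (auto& edges : g.succ)
        if (edges.size() > 1)
            for (Edge& e : edges) e.counter_id = num++;
    return num;
}

// Replay a path of block ids and bump the counter of each instrumented edge taken.
std::vector<std::uint64_t> run(const Cfg& g, int num_counters,
                               const std::vector<std::size_t>& path) {
    std::vector<std::uint64_t> counts(static_cast<std::size_t>(num_counters));
    for (std::size_t i = 0; i + 1 < path.size(); ++i) {
        bool found = false;
        for (const Edge& e : g.succ.at(path[i])) {
            if (e.to != path[i + 1]) continue;
            if (e.counter_id >= 0) counts[static_cast<std::size_t>(e.counter_id)]++;
            found = true;
            break;
        }
        assert(found && "path must follow CFG edges");
    }
    return counts;
}
```

Edges leaving single-successor blocks need no counter because their count equals the block's own execution count, which is why the loop skips them.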

24 Patching based dynamic instrumentation

25 Dynamic Instrumentation
- Many advantages over static instrumentation:
  - No need for a separate instrumentation pass
  - Can instrument all user-level code executed
    - Shared libraries
    - Dynamically generated code
  - Easy to distinguish code and data
  - Instrumentation can be turned on/off
  - Can attach to and instrument an already running process
  - No disabled probe effect

26 PIN: A VM based Dynamic Instrumentation Tool
- It uses dynamic code generation to make a less intrusive instrumentation system
- Pin has the following advantages:
  - Easy to use
  - Portable
  - Transparent
  - Efficient

27 Easy-to-use and Portable
- Instrumentation tools are written in C/C++ using PIN’s API
- It allows tool writers to analyze an application by inserting calls at arbitrary locations in the executable.
- Users do not need to manually inline calls or save/restore registers
- PIN’s API abstracts away instruction idiosyncrasies, so the tools can be portable. Various Pintools are available on IA32, Itanium, ARM, and EM64T
- The API also allows access to architecture-specific information

28 Efficient and Robust
- Code caching and trace linking
- Pin applies register re-allocation, inlining, liveness analysis, and instruction scheduling to instrumented code.
- Pin can dynamically attach to and detach from a process. This is important for large, long-running programs.
- Pin can handle mixed code and data, variable-length instructions, statically unknown indirect jump targets, dynamically loaded libraries, and dynamically generated code

29 A Sample Pintool for Tracing Memory Writes
    #include <stdio.h>
    #include "pin.H"

    FILE * trace;

    // Print a memory write record
    VOID RecordMemWrite(VOID * ip, VOID * addr, UINT32 size) {
        fprintf(trace, "%p: W %p %d\n", ip, addr, size);
    }

    // Called for every instruction
    VOID Instruction(INS ins, VOID *v) {
        // Instruments writes using a predicated call,
        // i.e. the call happens iff the store is
        // actually executed
        if (INS_IsMemoryWrite(ins))
            INS_InsertPredicatedCall(
                ins, IPOINT_BEFORE, AFUNPTR(RecordMemWrite),
                IARG_INST_PTR, IARG_MEMORYWRITE_EA,
                IARG_MEMORYWRITE_SIZE, IARG_END);
    }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        trace = fopen("atrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();  // Never returns
        return 0;
    }

30 Pin’s software architecture
[Figure: block diagram with the labels Application, Pintool, Instrumentation API, Virtual machine (VM), JIT Compiler, Emulation unit, Dispatcher, Code Cache, Operating System, Hardware.]

31 Execution Drives Instrumentation
[Figure: original code blocks 1-7 on one side, the code cache on the other, with the compiler in between; the code cache holds blocks 1’, 2’, and 7’.]

32 Execution Drives Instrumentation
[Figure: same diagram as the previous slide; the code cache now also holds blocks 3’, 5’, and 6’.]

33 Instruction-level Instrumentation
- Instrument relative to an instruction:
  - Before
  - After:
    - Fall-through edge
    - Taken edge (if it is a branch)
    cmp %esi, %edx
    jle
    mov $0x1, %edi
    : mov $0x8, %edi
Inserted calls shown in the figure: count(10), count(30), count(20)

34 Pin Instrumentation APIs
- Basic APIs are architecture independent:
  - Provide common functionality such as finding out:
    - Control-flow changes
    - Memory accesses
- Architecture-specific APIs for more detailed information
  - IA-32, EM64T, Itanium, XScale
- ATOM-based notion:
  - Instrumentation routines
  - Analysis routines

35 Instrumentation Routines
- The user writes instrumentation routines, which:
  - Walk the list of instructions, and
  - Insert calls to analysis routines
- Pin invokes instrumentation routines when placing new instructions in the code cache
- Repeated execution uses the already instrumented code in the code cache

36 Analysis Routines
- The user inserts calls to analysis routines:
  - User-specified arguments
  - E.g., increment counter, record data address, …
- The user writes them in C, C++, or ASM
- Pin provides isolation so analysis does not affect the application
- Optimizations like inlining, register allocation, and scheduling make it efficient
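The split between instrumentation routines (run once, when code enters the code cache) and analysis routines (run on every execution) can be illustrated with a toy model. The C++ below is only a sketch under that assumption; `ToyVm`, `InstrumentFn`, and `AnalysisCall` are hypothetical names and do not correspond to Pin's API or internals.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

using AnalysisCall = std::function<void()>;
// Instrumentation routine: given a block id, returns the analysis calls to attach.
using InstrumentFn = std::function<std::vector<AnalysisCall>(int block_id)>;

class ToyVm {
public:
    explicit ToyVm(InstrumentFn fn) : instrument_(std::move(fn)) { assert(instrument_); }

    // "Execute" a sequence of block ids.
    void execute(const std::vector<int>& block_trace) {
        for (int id : block_trace) {
            auto it = code_cache_.find(id);
            if (it == code_cache_.end()) {
                // Code cache miss: run the instrumentation routine once for this block.
                it = code_cache_.emplace(id, instrument_(id)).first;
                ++compiles_;
            }
            // Every execution runs the analysis calls already attached to the block.
            for (const AnalysisCall& call : it->second) call();
        }
    }

    // Dropping the cache forces re-instrumentation on the next execution.
    void invalidate() { code_cache_.clear(); }
    std::size_t compiles() const { return compiles_; }

private:
    InstrumentFn instrument_;
    std::unordered_map<int, std::vector<AnalysisCall>> code_cache_;
    std::size_t compiles_ = 0;
};
```

Executing the block sequence 1, 2, 1, 1, 3 invokes the instrumentation routine three times (once per distinct block) but runs the attached analysis call five times, which is why analysis routines dominate the overhead.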

37 Example: Instruction Count
    [rscohn1@shli0005 Tests]$ hello
    Hello world
    [rscohn1@shli0005 Tests]$ icount -- hello
    Hello world
    ICount 496890
    [rscohn1@shli0005 Tests]$

38 Example: Instruction Count
    mov r2 = 2
    add r3 = 4, r3
    (p2) br.cond L1
    add r4 = 8, r4
    br.cond L2
Inserted before each instruction: counter++;

39
    #include <stdio.h>
    #include "pinstr.H"

    UINT64 icount = 0;

    // Analysis routine
    void docount() { icount++; }

    // Instrumentation routine
    void Instruction(INS ins) {
        PIN_InsertCall(IPOINT_BEFORE, ins, (AFUNPTR)docount, IARG_END);
    }

    VOID Fini() { fprintf(stderr, "ICount %lld\n", icount); }

    int main(int argc, char *argv[]) {
        PIN_AddInstrumentInstructionFunction(Instruction);
        PIN_AddFiniFunction(Fini);
        PIN_StartProgram();
    }

40 Example: Instruction Trace
    [rscohn1@shli0005 Trace]$ itrace -e hello
    Hello world
    [rscohn1@shli0005 Trace]$ head prog.trace
    0x20000000000045c0
    0x20000000000045c1
    0x20000000000045c2
    0x20000000000045d0
    0x20000000000045d2
    0x20000000000045e0
    0x20000000000045e1
    0x20000000000045e2
    [rscohn1@shli0005 Trace]$

41 Example: Instruction Trace
    mov r2 = 2
    add r3 = 4, r3
    (p2) br.cond L1
    add r4 = 8, r4
    br.cond L2
Inserted before each instruction: traceInst(ip);

42
    #include <stdio.h>
    #include "pinstr.H"

    FILE *traceFile;

    // Analysis routine: print the address of the instruction about to execute
    void traceInst(long *ipsyll) {
        fprintf(traceFile, "%p\n", ipsyll);
    }

    // Instrumentation routine: insert a call to traceInst before every instruction
    void Instruction(INS ins) {
        PIN_InsertCall(IPOINT_BEFORE, ins, (AFUNPTR)traceInst, IARG_IP_SLOT, IARG_END);
    }

    int main(int argc, char *argv[]) {
        PIN_AddInstrumentInstructionFunction(Instruction);
        traceFile = fopen("prog.trace", "w");
        PIN_StartProgram();
    }

43 Example: Faster Instruction Count
Instead of counter++ before every instruction, insert one update per basic block:
    counter += 3;
    mov r2 = 2
    add r3 = 4, r3
    (p2) br.cond L1
    counter += 2;
    add r4 = 8, r4
    br.cond L2

44 ManualExamples/inscount1.C
    #include <stdio.h>
    #include "pin.H"

    UINT64 icount = 0;

    // Analysis routine: add the number of instructions in the block
    VOID docount(UINT32 c) { icount += c; }

    // Instrumentation routine: one call per basic block in the trace
    VOID Trace(TRACE trace, VOID *v) {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
            BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)docount,
                           IARG_UINT32, BBL_NumIns(bbl), IARG_END);
        }
    }

    VOID Fini(INT32 code, VOID *v) { fprintf(stderr, "Count %lld\n", icount); }

    int main(int argc, char * argv[]) {
        PIN_Init(argc, argv);
        TRACE_AddInstrumentFunction(Trace, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();
        return 0;
    }
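Why per-block counting is faster can be seen without Pin. The C++ sketch below assumes a trace given as the sizes of the executed basic blocks; `Counts`, `count_per_instruction`, and `count_per_block` are hypothetical names for illustration. Both schemes report the same instruction count, but the per-block scheme makes one analysis call per block rather than one per instruction.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Counts {
    std::uint64_t icount = 0;          // instructions counted
    std::uint64_t analysis_calls = 0;  // how often the analysis routine ran
};

// `trace` lists, in execution order, the size in instructions of each executed block.
Counts count_per_instruction(const std::vector<std::uint32_t>& trace) {
    Counts c;
    for (std::uint32_t n : trace)
        for (std::uint32_t i = 0; i < n; ++i) {
            c.icount += 1;          // counter++ before every instruction
            c.analysis_calls += 1;
        }
    return c;
}

Counts count_per_block(const std::vector<std::uint32_t>& trace) {
    Counts c;
    for (std::uint32_t n : trace) {
        assert(n > 0);              // a basic block has at least one instruction
        c.icount += n;              // counter += block size, once per block
        c.analysis_calls += 1;
    }
    return c;
}
```

For the two blocks on the earlier slide (3 and 2 instructions), executing them as 3, 2, 3 gives a count of 8 either way, with 8 analysis calls per instruction versus 3 per block.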

45 Instruction Information Accessed at Instrumentation Time
1. INS_Category(INS)
2. INS_Address(INS)
3. INS_Regr1, INS_Regr2, INS_Regr3, …
4. INS_Next(INS), INS_Prev(INS)
5. INS_BraType(INS)
6. INS_SizeType(INS)
7. INS_Stop(INS)

46 More Advanced Tools
- Instruction cache simulation: replace the itrace analysis function
- Data cache: like the icache, but instrument loads/stores and pass the effective address
- Malloc/free trace: instrument entry/exit points
- Detect out-of-bounds stack references
  - Instrument instructions that move the stack pointer
  - Instrument loads/stores to check that they are in bounds
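As an illustration of the data cache tool, the analysis function only needs the effective address of each load or store. The C++ sketch below assumes a direct-mapped cache with no write policy modeled; `DirectMappedCache` is a hypothetical class, not taken from any Pin example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class DirectMappedCache {
public:
    DirectMappedCache(std::size_t num_lines, std::size_t line_bytes)
        : line_bytes_(line_bytes), tags_(num_lines, 0), valid_(num_lines, false) {
        assert(num_lines > 0 && line_bytes > 0);
    }

    // Analysis routine: call once per load/store with its effective address.
    // Returns true on a hit; on a miss the line is filled.
    bool access(std::uint64_t addr) {
        const std::uint64_t line = addr / line_bytes_;
        const std::size_t index = static_cast<std::size_t>(line % tags_.size());
        const std::uint64_t tag = line / tags_.size();
        if (valid_[index] && tags_[index] == tag) {
            ++hits_;
            return true;
        }
        valid_[index] = true;
        tags_[index] = tag;
        ++misses_;
        return false;
    }

    std::uint64_t hits() const { return hits_; }
    std::uint64_t misses() const { return misses_; }

private:
    std::uint64_t line_bytes_;
    std::vector<std::uint64_t> tags_;
    std::vector<bool> valid_;
    std::uint64_t hits_ = 0;
    std::uint64_t misses_ = 0;
};
```

In a real tool the instrumentation routine would insert a call to something like `access` before each memory instruction, and the hit and miss totals would be printed when the program exits.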

47 Instrumentation is Transparent
- When the application looks at itself, it sees the same:
  - Code addresses
  - Data addresses
  - Memory contents
  Don’t want to change behavior or expose latent bugs
- When instrumentation looks at the application, it sees the original application:
  - Code addresses
  - Data addresses
  - Memory contents
  Observe original behavior

48 Pin Instruments All Code
- Execution-driven instrumentation:
  - Shared libraries
  - Dynamically generated code
- Self-modifying code
  - Instrumented the first time it is executed
  - Pin does not detect that code has been modified

49 Dynamic Instrumentation in Pin
- While the program is running:
  - Instrumentation can be turned on/off
  - The code cache can be invalidated
  - Code is reinstrumented the next time it is executed
  - Pin can detach and run the application natively
- Use this for fast skip

50 Pin is freely available at http://rogue.colorado.edu/Pin.

51 Additional Pintools
- PLR (Process Level Redundancy), which checks for transient faults in software, uses Pin to trace all system calls. It ensures that output data from redundant processes are consistent before execution continues.
- Path coverage expander selectively executes the NT-path (Not Taken Path) in order to increase execution path coverage and expose potential bugs. It uses Pin to modify the architecture state to force execution into the NT-path. All memory updates in the NT-path are redirected to a special region.

52 Additional Pintools
- PASS (Phase Aware Stratified Sampling) uses Pin to construct DCRs (Dynamic Code Regions). DCRs can be used to determine program phases and pass the information to compilers.
- PinPoint uses Pin to compute SimPoint on-the-fly.

