Statistical Profiling: Hardware, OS, and Analysis Tools


Statistical Profiling: Hardware, OS, and Analysis Tools

Profiling Tutorial /4/98

Joint Work
DIGITAL Continuous Profiling Infrastructure (DCPI) Project Members at:
• Systems Research Center: Lance Berc, Sanjay Ghemawat, Monika Henzinger, Shun-Tak Leung, Dick Sites (now at Adobe), Mitch Lichtenberg, Mark Vandevoorde, Carl Waldspurger, Bill Weihl
• Western Research Lab: Jennifer Anderson, Jeffrey Dean
Other Collaborators:
• Cambridge Research Lab: Jamey Hicks
• Alpha Engineering: George Chrysos, Scot Hildebrandt, Rick Kessler, Ed McLellan, Gerard Vernes, Jonathan White

Outline
• Statistical sampling
  – What is it?
  – Why use it?
• Data collection
  – Hardware issues
  – OS issues
• Data analysis
  – In-order processors
  – Out-of-order processors

Statistical Profiling
Based on periodic sampling:
• Hardware generates periodic interrupts
• OS handles the interrupts and stores data
  – Program Counter (PC) and any extra info
• Analysis tools convert data
  – for users
  – for compilers
Examples: DCPI, Morph, SGI Speedshop, Unix’s prof(), VTune
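The sampling loop above can be sketched in a few lines. This is a toy simulation (names and the toy program are hypothetical; real interrupts come from hardware counters, not a random-number generator), but it shows why the approach works: the PC histogram converges to the program's true distribution of cycles.

```python
import random
from collections import Counter

random.seed(0)

def sample_program(observe_pc, mean_period, n_samples):
    # Each "interrupt" fires after a randomized number of cycles
    # (randomization avoids lock-step correlation with program loops);
    # the handler just records the interrupted PC in a histogram.
    histogram = Counter()
    for _ in range(n_samples):
        period = random.randint(mean_period // 2, 3 * mean_period // 2)
        histogram[observe_pc(period)] += 1
    return histogram

# Toy "program" that spends ~90% of its cycles in one hot loop.
def toy_pc(_cycles):
    return "hot_loop" if random.random() < 0.9 else "setup"

hist = sample_program(toy_pc, mean_period=62976, n_samples=10_000)
# hist["hot_loop"] / 10_000 approaches the hot loop's true 0.9 cycle share
```

With enough samples the estimate is tight: the standard error of a 90% share at 10,000 samples is about 0.3%.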

Sampling vs. Instrumentation
• Much lower overhead than instrumentation
  – DCPI: program runs 1%–3% slower
  – Pixie: program runs 2–3 times slower
• Applicable to large workloads
  – 100,000 TPS on Alpha
  – AltaVista
• Easier to apply to whole systems (kernel, device drivers, shared libraries, ...)
  – Instrumenting kernels is very tricky
  – No source code needed

Information from Profiles
DCPI estimates:
• where CPU cycles went, broken down by image, procedure, and instruction
• how often code was executed (basic blocks and CFG edges)
• where peak performance was lost, and why

Example: Getting the Big Picture

Total samples for event type cycles =

cum%    load file
37.03%  /usr/shlib/X11/lib_dec_ffb_ev5.so
64.24%  /vmunix
79.47%  /usr/shlib/X11/libmi.so
90.14%  /usr/shlib/X11/libos.so

cum%    procedure               load file
33.87%  ffb8ZeroPolyArc         /usr/shlib/X11/lib_dec_ffb_ev5.so
42.35%  ReadRequestFromClient   /usr/shlib/X11/libos.so
47.36%  miCreateETandAET        /usr/shlib/X11/libmi.so
51.81%  miZeroArcSetup          /usr/shlib/X11/libmi.so
55.84%  bcopy                   /vmunix
59.28%  Dispatch                /usr/shlib/X11/libdix.so
62.34%  ffb8FillPolygon         /usr/shlib/X11/lib_dec_ffb_ev5.so
65.14%  in_checksum             /vmunix
67.78%  miInsertEdgeInET        /usr/shlib/X11/libmi.so
69.98%  miX1Y1X2Y2InRegion      /usr/shlib/X11/libmi.so

Example: Using the Microscope
Where peak performance is lost, and why

Example: Summarizing Stalls

I-cache (not ITB)     0.0% to  0.3%
ITB/I-cache miss      0.0% to  0.0%
D-cache miss         27.9% to 27.9%
DTB miss              9.2% to 18.3%
Write buffer          0.0% to  6.3%
Synchronization       0.0% to  0.0%
Branch mispredict     0.0% to  2.6%
IMUL busy             0.0% to  0.0%
FDIV busy             0.0% to  0.0%
Other                 0.0% to  0.0%
Unexplained stall     2.3% to  2.3%
Unexplained gain     -4.3% to -4.3%
Subtotal dynamic     44.1%

Slotting              1.8%
Ra dependency         2.0%
Rb dependency         1.0%
Rc dependency         0.0%
FU dependency         0.0%
Subtotal static       4.8%

Total stall          48.9%
Execution            51.2%
Net sampling error   -0.1%
Total tallied       100.0%  (35171, 93.1% of all samples)

Example: Sorting Stalls

%      cum%   blame   PC    file:line
10.0%  10.0%  dcache  957c  comp.c:
       19.8%  dcache  9530  comp.c:
       27.6%  dcache  959c  comp.c:488

Instruction-Level Information Matters
DCPI anecdotes:
• TPC-D: 10% speedup
• Duplicate filtering for AltaVista: part of a 19X speedup
• Compress program: 22%
• Compiler improvements: 20% in several SPEC benchmarks

Outline
• Statistical sampling
  – What is it?
  – Why use it?
• Data collection
  – Hardware issues
  – OS issues
• Data analysis
  – In-order processors
  – Out-of-order processors

Typical Hardware Support
• Timers
  – Clock interrupt after N units of time
• Performance counters
  – Interrupt after N cycles, issues, loads, L1 D-cache misses, branch mispredicts, uops retired, ...
  – Alpha 21064, 21164; Pentium Pro, PII; ...
  – Easy to measure total cycles, issues, CPI, etc.
Only extra information is the restart PC

Problem: Inaccurate Attribution
• Experiment
  – count data loads
  – loop: single load + hundreds of nops
• In-order processor (Alpha)
  – skew: samples land a constant distance from the load
  – large peak
• Out-of-order processor (Intel Pentium Pro)
  – skew
  – smear: samples spread over many instructions around the load

Ramifications of Misattribution
• No skew or smear
  – Instruction-level analysis is easy!
• Skew is a constant number of cycles
  – Instruction-level analysis is possible
  – Adjust the sampling period by the amount of skew
  – Infer execution counts, CPI, stalls, and stall explanations from cycle samples and the program
• Smear
  – Instruction-level analysis seems hopeless
  – Examples: PII, StrongARM
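When skew is constant, the correction is mechanical: reattribute each sample from the interrupted PC back to the instruction that caused the event. A minimal sketch, assuming the skew is known in whole instructions and fixed-size 4-byte instructions as on Alpha (the PCs and counts are hypothetical):

```python
def correct_skew(samples, skew_instructions, inst_size=4):
    # Reattribute each sample from the restart PC back by a constant
    # skew, merging counts that land on the same corrected PC.
    corrected = {}
    for pc, count in samples.items():
        target = pc - skew_instructions * inst_size
        corrected[target] = corrected.get(target, 0) + count
    return corrected

raw = {0x1200c: 957, 0x12010: 12}        # hypothetical restart PCs
fixed = correct_skew(raw, skew_instructions=1)
# fixed == {0x12008: 957, 0x1200c: 12}
```

This is exactly why smear is so much worse than skew: a constant offset is invertible, but a distribution over nearby instructions is not.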

Desired Hardware Support
• Sample fetched instructions
• Save the PC of the sampled instruction
  – e.g., interrupt handler reads an Internal Processor Register
  – makes skew and smear irrelevant
• Gather more information

ProfileMe: Instruction-Centric Profiling

[Pipeline diagram: fetch → map → issue → exec → retire, with the I-cache, branch predictor, D-cache, and arithmetic units. When the fetch counter overflows, a randomly selected fetched instruction is tagged ("ProfileMe tag!"); as the tagged instruction flows down the pipe, internal processor registers capture its PC, effective address, retired status, miss flags, branch history, mispredict flag, and stage latencies; when it is done, an interrupt lets software read them.]

Instruction-Level Statistics
• PC + retire status → execution frequency
• PC + cache miss flag → cache miss rates
• PC + branch mispredict → mispredict rates
• PC + event flag → event rates
• PC + branch direction → edge frequencies
• PC + branch history → path execution rates
• PC + latency → instruction stalls
  – “100-cycle dcache miss” vs. “dcache miss”
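Each of these statistics is a per-PC ratio over sampled flags. A sketch of the aggregation, with an illustrative reduced sample format (real ProfileMe samples also carry addresses, branch history, and stage latencies):

```python
from collections import defaultdict

def aggregate(samples):
    # samples: (pc, retired, dcache_miss, mispredict) tuples.
    stats = defaultdict(lambda: [0, 0, 0, 0])   # [n, retired, miss, mp]
    for pc, retired, miss, mp in samples:
        s = stats[pc]
        s[0] += 1
        s[1] += retired
        s[2] += miss
        s[3] += mp
    return {pc: {"retire_rate": s[1] / s[0],    # execution frequency proxy
                 "miss_rate":   s[2] / s[0],
                 "mp_rate":     s[3] / s[0]}
            for pc, s in stats.items()}

samples = [(0x100, 1, 1, 0), (0x100, 1, 0, 0),
           (0x100, 0, 0, 0), (0x100, 1, 1, 0)]
rates = aggregate(samples)[0x100]
# rates == {"retire_rate": 0.75, "miss_rate": 0.5, "mp_rate": 0.0}
```

Because instructions are selected at fetch, aborted instructions are sampled too, which is what makes the retire rate itself measurable.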

Kernel Device Driver
• Challenge: 1% of a 64K-cycle sampling period is only 655 cycles per sample
• Aggregate samples in a hash table
  – (PID, PC, event) → count
• Minimize cache misses
  – ~100 cycles to memory
  – pack data structures into cache lines
• Eliminate expensive synchronization operations
  – interprocessor interrupts for synchronization with the daemon
  – replicate the main data structures on each processor
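The hash-table idea can be sketched as follows. This is a toy model in Python, not the driver itself: the real implementation packs entries into cache lines and keeps one table per processor so the interrupt handler's common case is a single cheap increment with no cross-processor synchronization.

```python
class SampleBuffer:
    # Toy model of the driver's per-processor hash table: samples are
    # folded into (pid, pc, event) -> count instead of being logged
    # individually.
    def __init__(self):
        self.counts = {}

    def record(self, pid, pc, event):
        key = (pid, pc, event)
        self.counts[key] = self.counts.get(key, 0) + 1

    def flush(self):
        # In the real driver the daemon extracts entries with explicit
        # synchronization (interprocessor interrupts).
        out, self.counts = self.counts, {}
        return out

buf = SampleBuffer()
for _ in range(3):
    buf.record(pid=42, pc=0x957c, event="cycles")
buf.record(pid=42, pc=0x9530, event="cycles")
# buf.flush() == {(42, 0x957c, "cycles"): 3, (42, 0x9530, "cycles"): 1}
```

Aggregation is what makes the 655-cycle budget plausible: repeated hits on a hot PC cost an increment, not a buffer write per sample.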

Moving Samples to Disk
• User-space daemon
  – extracts raw samples from the driver
  – associates samples with compiled code
  – updates disk-based profiles for compiled code
• Mapping samples to compiled code
  – dynamic loader hook for dynamically loaded code
  – exec hook for statically linked code
  – other hooks for initializing the mapping at daemon start-up
• Profiles
  – text header + compact binary samples
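Associating a raw PC with compiled code amounts to a range lookup in the daemon's load map. One plausible sketch (class name, image layout, and addresses are illustrative, not DCPI's actual data structures):

```python
import bisect

class LoadMap:
    # Loaded images kept sorted by start address; a PC is resolved to
    # (image name, offset within image) by binary search.
    def __init__(self, images):
        # images: list of (start, end, name), non-overlapping.
        self.images = sorted(images)
        self.starts = [start for start, _, _ in self.images]

    def lookup(self, pc):
        i = bisect.bisect_right(self.starts, pc) - 1
        if i >= 0:
            start, end, name = self.images[i]
            if pc < end:
                return name, pc - start
        return None, pc          # unmapped sample

lm = LoadMap([(0x1000, 0x5000, "/vmunix"),
              (0x8000, 0x9000, "libos.so")])
# lm.lookup(0x1200) -> ("/vmunix", 0x200)
```

The loader and exec hooks exist precisely to keep this table current as images come and go.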

Performance of Data Collection (DCPI)
• Time
  – 1–3% total overhead for most workloads
  – often less than the variation from run to run
• Space
  – 512 KB kernel memory per processor
  – 2–10 MB resident for the daemon
  – 10 MB disk after one month of profiling on a heavily used timeshared 4-processor machine
• Non-intrusive enough to be run for many hours on production systems

Outline
• Statistical sampling
  – What is it?
  – Why use it?
• Data collection
  – Hardware issues
  – OS issues
• Data analysis
  – In-order processors
  – Out-of-order processors

Data Analysis

[Diagram: compiled code and samples feed the analysis, which produces frequency, cycles per instruction, and stall explanations.]

• Cycle samples are proportional to the total time spent at the head of the issue queue (at least on in-order Alphas)
• Frequency indicates frequent paths
• CPI indicates stalls

Estimating Frequency from Samples

[Figure: the same 1,000,000 cycles can come from 1,000,000 executions at 1 CPI or from 10,000 executions at 100 CPI — cycle samples alone cannot tell the two apart.]

• Problem
  – given cycle samples, compute frequency and CPI
• Approach
  – let F = Frequency / Sampling Period
  – E(Cycle Samples) = F × CPI
  – so F = E(Cycle Samples) / CPI
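The ambiguity on this slide is worth working through numerically. Rearranging E(Cycle Samples) = F × CPI gives frequency directly once CPI is known (the function name is mine; the period is normalized to 1 so the numbers match the slide):

```python
def estimated_frequency(cycle_samples, cpi, sampling_period=1):
    # E(cycle samples) = (frequency / sampling_period) * CPI
    # => frequency = cycle_samples * sampling_period / cpi
    return cycle_samples * sampling_period / cpi

# The same 1,000,000 cycles' worth of samples is consistent with two
# very different behaviors -- which is why CPI must be pinned down:
hot_and_fast = estimated_frequency(1_000_000, cpi=1)     # 1,000,000 execs
cool_and_slow = estimated_frequency(1_000_000, cpi=100)  # 10,000 execs
```

So the whole estimation problem reduces to finding instructions whose CPI is known, which is what the next two slides exploit.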

Estimating Frequency (cont.)
F = E(Cycle Samples) / CPI
• Idea
  – if there is no dynamic stall, CPI is known, so F can be estimated
  – so… assume some instructions have no dynamic stalls
• Consider a group of instructions with the same frequency (e.g., a basic block)
• Identify instructions without dynamic stalls; then average their sample counts for better accuracy
• Key insight: instructions without stalls have smaller sample counts

Estimating Frequency (Example)
• Compute MinCPI from the code
• Compute Samples / MinCPI
• Select data to average
• Does badly when:
  – few issue points
  – all issue points stall
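The three steps above can be sketched as one small function. The "keep the lowest third of the ratios" cutoff is my illustrative simplification of the real selection step, not DCPI's actual rule; the key property it preserves is that stall-free issue points have the smallest Samples/MinCPI ratios.

```python
def estimate_block_frequency(block):
    # block: list of (samples, min_cpi) per issue point in one basic
    # block.  Points with no dynamic stall have the smallest
    # samples/min_cpi ratio; average those to estimate frequency.
    ratios = sorted(s / c for s, c in block)
    k = max(1, len(ratios) // 3)    # assume ~1/3 of points are stall-free
    return sum(ratios[:k]) / k

# Hypothetical block: two stall-free points near 1000 samples/CPI,
# one point with a big dynamic stall, one 2-cycle MinCPI point.
block = [(1000, 1), (1020, 1), (5000, 1), (2100, 2)]
freq = estimate_block_frequency(block)   # close to 1000
```

The failure modes on the slide fall out directly: with few issue points the average is noisy, and if every point stalls, even the minimum ratio overestimates the frequency.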

Frequency Estimate Accuracy
• Compare frequency estimates for blocks to measured values obtained with a pixie-like tool
• Edge frequencies are a bit less accurate

Explaining Stalls
• Static stalls
  – schedule the instructions in each basic block optimistically, using a detailed pipeline model of the processor
• Dynamic stalls
  – start with all possible explanations: I-cache miss, D-cache miss, DTB miss, branch mispredict, ...
  – rule out unlikely explanations
  – list the remaining possibilities
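The dynamic-stall procedure is elimination over a fixed set of causes; whatever survives is what gets reported. A minimal sketch (the cause names mirror the stall-summary slide but are otherwise illustrative):

```python
ALL_CAUSES = {"icache", "dcache", "dtb", "mispredict",
              "write_buffer", "sync", "imul", "fdiv"}

def explain_dynamic_stall(ruled_out):
    # Start from every possible explanation and subtract the ones the
    # analysis can rule out; the remainder is listed for the user.
    return ALL_CAUSES - ruled_out

# If everything except a D-cache or DTB miss has been ruled out:
remaining = explain_dynamic_stall(ALL_CAUSES - {"dcache", "dtb"})
# remaining == {"dcache", "dtb"}
```

This is why the reports show ranges (e.g., "DTB miss 9.2% to 18.3%"): when more than one explanation survives, the blame cannot be split precisely.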

Profiling Tutorial /4/98  Is the previous occurrence of an operand register the destination of a load instruction?  Search backward across basic block boundaries  Prune by block and edge execution frequencies ldq t0,0(s1) subq t0,t1,t2 addq t3,t0,t4 OR subq t0,t1,t2 Ruling Out D-cache Misses

Out-of-Order Processors
• In-order processors
  – periodic interrupt lands on the “current” instruction, e.g., the next instruction to issue
  – peak performance = no wasted issue slots
  – any stall implies a loss in performance
• Out-of-order processors
  – many instructions in flight: no “current” instruction
  – some stalls are masked by concurrent execution (instructions issue around the stalled instruction)
• Example: does this stall matter?

    load r1,…          average latency: 15.0 cycles
    … other instructions …
    add  …,r1,…

Issue: Need to Measure Concurrency
• Interesting concurrency metrics
  – retired instructions per cycle
  – issue slots wasted while an instruction is in flight
  – pipeline stage utilization
How to measure concurrency?
• Special-purpose hardware
  – some metrics are difficult to measure, e.g., need retire/abort status
• Sample potentially-concurrent instructions
  – aggregate info from pairs of samples
  – statistically estimate metrics

Paired Sampling
• Sample two instructions
  – may be in flight simultaneously
  – replicate the ProfileMe hardware; add the intra-pair distance
• Nested sampling
  – sample a window around the first profiled instruction
  – randomly select the second profiled instruction
  – statistically estimate the frequency of F(first, second)

[Figure: a ±W window in time around the first profiled instruction, showing second samples that overlap its in-flight interval and ones that do not.]
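The statistical step at the end is just estimating a proportion from randomly selected pairs. A sketch under that assumption (the pair data here is synthetic; real pairs come from the replicated ProfileMe registers plus the intra-pair distance):

```python
import random

def estimate_overlap(pairs):
    # pairs: one boolean per sampled pair -- was the second profiled
    # instruction in flight at the same time as the first?  With
    # randomly selected pairs, the sample mean is an unbiased estimate
    # of the true overlap fraction.
    return sum(pairs) / len(pairs)

random.seed(1)
true_overlap = 0.3
pairs = [random.random() < true_overlap for _ in range(20_000)]
# estimate_overlap(pairs) is close to 0.3
```

The same machinery generalizes to the other concurrency metrics on the previous slide: each is some function F(first, second) averaged over sampled pairs.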

Explaining Lost Performance
• An open question
• Some in-order analysis is applicable
  – e.g., D-cache miss and branch-mispredict analysis
• Pipe-stage latencies from counters would help a lot

Summary & Conclusion
• Statistical profiling can be
  – inexpensive
  – effective
• Instruction-level analysis matters
• Performance counters
  – implementation details make a big difference
• Out-of-order processors require better counters