Statistical Profiling: Hardware, OS, and Analysis Tools

Joint Work
DIGITAL Continuous Profiling Infrastructure (DCPI) project members
– Systems Research Center: Lance Berc, Sanjay Ghemawat, Monika Henzinger, Shun-Tak Leung, Dick Sites (now at Adobe), Mitch Lichtenberg, Mark Vandevoorde, Carl Waldspurger, Bill Weihl
– Western Research Lab: Jennifer Anderson, Jeffrey Dean
Other collaborators
– Cambridge Research Lab: Jamey Hicks
– Alpha Engineering: George Chrysos, Scot Hildebrandt, Rick Kessler, Ed McLellan, Gerard Vernes, Jonathan White

Outline
Statistical sampling
– What is it?
– Why use it?
Data collection
– Hardware issues
– OS issues
Data analysis
– In-order processors
– Out-of-order processors

Statistical Profiling
Based on periodic sampling (see the sketch below)
– Hardware generates periodic interrupts
– OS handles the interrupts and stores data: the program counter (PC) plus any extra info
– Analysis tools convert the data for users and for compilers
Examples: DCPI, Morph, SGI SpeedShop, Unix prof, VTune

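To make the sampling loop concrete, here is a minimal user-space sketch of the idea, not DCPI itself: a periodic timer interrupt fires, a handler records the interrupted program counter, and the raw PCs are kept for later analysis. The register name REG_RIP assumes x86-64 Linux/glibc; DCPI drives the interrupt from hardware performance counters inside the kernel rather than from a timer.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 100000
static unsigned long pc_samples[MAX_SAMPLES];   /* raw PC values for later analysis */
static volatile int nsamples;

/* "OS handles the interrupts and stores data": record the interrupted PC.
 * REG_RIP is the x86-64 Linux/glibc name; other platforms differ. */
static void on_profile_tick(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si;
    ucontext_t *uc = ctx;
    if (nsamples < MAX_SAMPLES)
        pc_samples[nsamples++] = (unsigned long)uc->uc_mcontext.gregs[REG_RIP];
}

static double burn(void)                        /* workload being profiled */
{
    double x = 0.0;
    for (long i = 1; i < 50000000; i++)
        x += 1.0 / (double)i;
    return x;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_profile_tick;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    /* "Hardware generates periodic interrupts": here a 1 ms CPU-time timer
     * stands in for a performance-counter overflow interrupt. */
    struct itimerval it;
    it.it_interval.tv_sec = 0;
    it.it_interval.tv_usec = 1000;
    it.it_value = it.it_interval;
    setitimer(ITIMER_PROF, &it, NULL);

    printf("result=%f, collected %d PC samples\n", burn(), nsamples);
    return 0;
}
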
Sampling vs. Instrumentation
Much lower overhead than instrumentation
– DCPI: program 1%-3% slower
– Pixie: program 2-3 times slower
Applicable to large workloads
– 100,000 TPS on Alpha
– AltaVista
Easier to apply to whole systems (kernel, device drivers, shared libraries, ...)
– Instrumenting kernels is very tricky
– No source code needed

Information from Profiles
DCPI estimates
– Where CPU cycles went, broken down by image, procedure, and instruction
– How often code was executed (basic blocks and CFG edges)
– Where peak performance was lost, and why

Example: Getting the Big Picture
Total samples for event type cycles = 6095201

 cycles      %    cum%   load file
2257103  37.03%  37.03%  /usr/shlib/X11/lib_dec_ffb_ev5.so
1658462  27.21%  64.24%  /vmunix
 928318  15.23%  79.47%  /usr/shlib/X11/libmi.so
 650299  10.67%  90.14%  /usr/shlib/X11/libos.so

 cycles      %    cum%   procedure              load file
2064143  33.87%  33.87%  ffb8ZeroPolyArc        /usr/shlib/X11/lib_dec_ffb_ev5.so
 517464   8.49%  42.35%  ReadRequestFromClient  /usr/shlib/X11/libos.so
 305072   5.01%  47.36%  miCreateETandAET       /usr/shlib/X11/libmi.so
 271158   4.45%  51.81%  miZeroArcSetup         /usr/shlib/X11/libmi.so
 245450   4.03%  55.84%  bcopy                  /vmunix
 209835   3.44%  59.28%  Dispatch               /usr/shlib/X11/libdix.so
 186413   3.06%  62.34%  ffb8FillPolygon        /usr/shlib/X11/lib_dec_ffb_ev5.so
 170723   2.80%  65.14%  in_checksum            /vmunix
 161326   2.65%  67.78%  miInsertEdgeInET       /usr/shlib/X11/libmi.so
 133768   2.19%  69.98%  miX1Y1X2Y2InRegion     /usr/shlib/X11/libmi.so

Example: Using the Microscope
Where peak performance is lost and why

Example: Summarizing Stalls
I-cache (not ITB)        0.0% to  0.3%
ITB/I-cache miss         0.0% to  0.0%
D-cache miss            27.9% to 27.9%
DTB miss                 9.2% to 18.3%
Write buffer             0.0% to  6.3%
Synchronization          0.0% to  0.0%
Branch mispredict        0.0% to  2.6%
IMUL busy                0.0% to  0.0%
FDIV busy                0.0% to  0.0%
Other                    0.0% to  0.0%
Unexplained stall        2.3% to  2.3%
Unexplained gain        -4.3% to -4.3%
-------------------------------------------------------------
Subtotal dynamic        44.1%
Slotting                 1.8%
Ra dependency            2.0%
Rb dependency            1.0%
Rc dependency            0.0%
FU dependency            0.0%
-------------------------------------------------------------
Subtotal static          4.8%
-------------------------------------------------------------
Total stall             48.9%
Execution               51.2%
Net sampling error      -0.1%
-------------------------------------------------------------
Total tallied          100.0% (35171, 93.1% of all samples)

Example: Sorting Stalls
    %   cum%   cycles   cnt   cpi  blame    PC   file:line
10.0%  10.0%   109885  4998  22.0  dcache  957c  comp.c:484
 9.9%  19.8%   108776  5513  19.7  dcache  9530  comp.c:477
 7.8%  27.6%    85668  3836  22.3  dcache  959c  comp.c:488

Instruction-Level Information Matters
DCPI anecdotes
– TPC-D: 10% speedup
– Duplicate filtering for AltaVista: part of a 19x speedup
– Compress program: 22%
– Compiler improvements: 20% in several SPEC benchmarks

Outline
Statistical sampling
– What is it?
– Why use it?
Data collection
– Hardware issues
– OS issues
Data analysis
– In-order processors
– Out-of-order processors

Typical Hardware Support
Timers
– Clock interrupt after N units of time
Performance counters (see the sketch below)
– Interrupt after N cycles, issues, loads, L1 D-cache misses, branch mispredicts, uops retired, ...
– Alpha 21064, 21164; Pentium Pro, Pentium II; ...
– Easy to measure total cycles, issues, CPI, etc.
Only extra information is the restart PC

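The same "interrupt after N events" style of counter is exposed on current Linux kernels through perf_event_open; the sketch below is a modern analogue of the setup this slide describes, not the Alpha or Pentium Pro counter interface. Field and constant names come from the Linux perf_event ABI, and the sampling period of 100,000 cycles is an arbitrary choice for illustration.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.type           = PERF_TYPE_HARDWARE;
    attr.config         = PERF_COUNT_HW_CPU_CYCLES; /* could also be loads, misses, ... */
    attr.sample_period  = 100000;                   /* interrupt every N = 100,000 events */
    attr.sample_type    = PERF_SAMPLE_IP;           /* only extra info: the restart PC */
    attr.disabled       = 1;
    attr.exclude_kernel = 1;

    /* Count for the calling process on any CPU. */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (volatile long i = 0; i < 100000000; i++)   /* work being profiled */
        ;
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long total = 0;
    read(fd, &total, sizeof(total));
    printf("cycles counted: %lld\n", total);
    /* The sampled PCs themselves arrive through an mmap'd ring buffer (omitted here). */
    close(fd);
    return 0;
}
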
Problem: Inaccurate Attribution
Experiment
– Count data loads
– Loop: a single load plus hundreds of nops
In-order processor (Alpha 21164)
– Skew
– Large peak
Out-of-order processor (Intel Pentium Pro)
– Skew
– Smear

Ramifications of Misattribution
No skew or smear
– Instruction-level analysis is easy!
Skew is a constant number of cycles
– Instruction-level analysis is possible
– Adjust the sampling period by the amount of skew
– Infer execution counts, CPI, stalls, and stall explanations from the cycle samples and the program
Smear
– Instruction-level analysis seems hopeless
– Examples: Pentium II, StrongARM

Desired Hardware Support
Sample fetched instructions
Save the PC of the sampled instruction
– E.g., the interrupt handler reads an internal processor register
– Makes skew and smear irrelevant
Gather more information

ProfileMe: Instruction-Centric Profiling
[Pipeline diagram: when a fetch counter overflows, a randomly selected fetched instruction is tagged ("ProfileMe tag!"); as it moves through fetch, map, issue, execute, and retire, the hardware captures its PC, memory address, I-cache/D-cache miss flags, branch history and mispredict status, retired/aborted status, and stage latencies into internal processor registers, then raises an interrupt.]

Instruction-Level Statistics
– PC + retire status: execution frequency
– PC + cache miss flag: cache miss rates
– PC + branch mispredict: mispredict rates
– PC + event flag: event rates
– PC + branch direction: edge frequencies
– PC + branch history: path execution rates
– PC + latency: instruction stalls ("100-cycle dcache miss" vs. "dcache miss")

Kernel Device Driver
Challenge: 1% of the 64K-cycle sampling period is only 655 cycles per sample
Aggregate samples in a hash table (sketched below)
– (PID, PC, event) → count
Minimize cache misses
– ~100 cycles to memory
– Pack data structures into cache lines
Eliminate expensive synchronization operations
– Interprocessor interrupts for synchronization with the daemon
– Replicate the main data structures on each processor

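A sketch of the per-processor aggregation idea in C, with hypothetical field sizes and a stubbed eviction path rather than the actual driver code: entries are packed so several fit in one cache line, the common case touches a single line, and a conflicting key evicts the old entry toward the daemon's buffer.

#include <stdint.h>

/* 16-byte entry: four entries per 64-byte cache line (illustrative packing). */
struct sample_entry {
    uint64_t pc;        /* restart PC of the sample */
    uint32_t pid;       /* process ID */
    uint16_t event;     /* which counter overflowed */
    uint16_t count;     /* 0 means the slot is empty */
};

#define HASH_SIZE 4096  /* one table per processor: no cross-CPU sharing */
static struct sample_entry table[HASH_SIZE];

/* Stub: the real driver appends evicted entries to a buffer that the
 * user-space daemon drains. */
static void flush_to_daemon(const struct sample_entry *e) { (void)e; }

/* Called from the performance-counter interrupt handler. */
void record_sample(uint32_t pid, uint64_t pc, uint16_t event)
{
    uint64_t h = (pc ^ ((uint64_t)pid << 17) ^ event) % HASH_SIZE;
    struct sample_entry *e = &table[h];

    if (e->count != 0 && e->pid == pid && e->pc == pc && e->event == event) {
        if (e->count == UINT16_MAX) {   /* about to overflow: hand off and restart */
            flush_to_daemon(e);
            e->count = 0;
        }
        e->count++;                     /* common case: one cache line touched */
        return;
    }
    if (e->count != 0)
        flush_to_daemon(e);             /* conflict: evict the old entry */
    e->pid = pid; e->pc = pc; e->event = event; e->count = 1;
}
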
Moving Samples to Disk
User-space daemon
– Extracts raw samples from the driver
– Associates samples with compiled code
– Updates disk-based profiles for compiled code
Mapping samples to compiled code
– Dynamic loader hook for dynamically loaded code
– Exec hook for statically linked code
– Other hooks for initializing the mapping at daemon start-up
Profiles
– Text header + compact binary samples

Performance of Data Collection (DCPI)
Time
– 1-3% total overhead for most workloads
– Often less than the variation from run to run
Space
– 512 KB kernel memory per processor
– 2-10 MB resident for the daemon
– 10 MB of disk after one month of profiling on a heavily used, timeshared 4-processor machine
Non-intrusive enough to be run for many hours on production systems

Outline
Statistical sampling
– What is it?
– Why use it?
Data collection
– Hardware issues
– OS issues
Data analysis
– In-order processors
– Out-of-order processors

Data Analysis
[Diagram: compiled code and samples feed the analysis, which produces frequency, cycles per instruction, and stall explanations]
Cycle samples are proportional to the total time spent at the head of the issue queue (at least on in-order Alphas)
Frequency indicates frequent paths
CPI indicates stalls

Estimating Frequency from Samples
[Example: 1,000,000 cycle samples could mean 1,000,000 executions at 1 CPI or 10,000 executions at 100 CPI]
Problem
– Given cycle samples, compute frequency and CPI
Approach
– Let F = frequency / sampling period
– E(cycle samples) = F × CPI
– So F = E(cycle samples) / CPI

Estimating Frequency (cont.)
F = E(cycle samples) / CPI
Idea
– If there is no dynamic stall, the CPI is known, so F can be estimated
– So... assume some instructions have no dynamic stalls
Consider a group of instructions with the same frequency (e.g., a basic block)
Identify the instructions without dynamic stalls, then average their sample counts for better accuracy
Key insight
– Instructions without stalls have smaller sample counts

Estimating Frequency (Example)
– Compute MinCPI from the code
– Compute Samples / MinCPI
– Select the data to average (see the sketch below)
Does badly when:
– Few issue points
– All issue points stall

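A sketch of the estimation step described on the last two slides, written as illustrative C rather than the actual DCPI analysis code; the choice to average the smallest half of the ratios is an assumption standing in for however the real tool selects the issue points presumed stall-free.

#include <stdlib.h>

/* One issue point in a basic block. */
struct issue_point {
    double samples;   /* cycle samples attributed to this instruction */
    double min_cpi;   /* CPI assuming no dynamic stall (from the pipeline model) */
};

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Estimate the block's execution frequency F (in units of the sampling period). */
double estimate_frequency(const struct issue_point *ip, int n)
{
    double *ratio = malloc(n * sizeof(*ratio));
    if (ratio == NULL || n == 0)
        return 0.0;
    for (int i = 0; i < n; i++)
        ratio[i] = ip[i].samples / ip[i].min_cpi;   /* equals F if no dynamic stall */

    /* Instructions that stalled have inflated ratios, so average only the
     * smaller ratios (here: the smallest half, an illustrative choice). */
    qsort(ratio, n, sizeof(*ratio), cmp_double);
    int k = (n + 1) / 2;
    double sum = 0.0;
    for (int i = 0; i < k; i++)
        sum += ratio[i];
    free(ratio);
    return sum / k;
}

Multiplying the returned F by the sampling period gives the estimated execution count for the block; as the slide notes, the heuristic degrades when a block has few issue points or when every issue point stalls.
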
Frequency Estimate Accuracy
Compare frequency estimates for blocks against measured values obtained with a pixie-like tool
Edge frequencies are a bit less accurate

Explaining Stalls
Static stalls
– Schedule the instructions in each basic block optimistically, using a detailed pipeline model of the processor
Dynamic stalls
– Start with all possible explanations: I-cache miss, D-cache miss, DTB miss, branch mispredict, ...
– Rule out unlikely explanations
– List the remaining possibilities

Ruling Out D-cache Misses
Is the previous occurrence of an operand register the destination of a load instruction? (see the sketch below)
– Search backward across basic block boundaries
– Prune by block and edge execution frequencies
Example:
    ldq  t0, 0(s1)
    subq t0, t1, t2
    addq t3, t0, t4
  OR
    subq t0, t1, t2

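A toy version of the check within a single basic block, using a hypothetical instruction representation rather than the analysis tool's real IR; the real analysis also walks backward across block boundaries and prunes candidate paths by block and edge execution frequencies.

#include <stdbool.h>

enum opclass { OP_LOAD, OP_STORE, OP_ALU, OP_BRANCH };

struct insn {
    enum opclass op;
    int dest;       /* destination register, -1 if none */
    int src[2];     /* source registers, -1 if unused */
};

/* Could insns[i] be stalled by a D-cache miss? True if the most recent
 * writer of one of its operands is a load, as in
 *     ldq  t0, 0(s1)
 *     subq t0, t1, t2   */
bool maybe_dcache_stall(const struct insn *insns, int i)
{
    for (int s = 0; s < 2; s++) {
        int reg = insns[i].src[s];
        if (reg < 0)
            continue;
        for (int j = i - 1; j >= 0; j--) {
            if (insns[j].dest == reg) {      /* previous occurrence as a destination */
                if (insns[j].op == OP_LOAD)
                    return true;
                break;                       /* register overwritten by a non-load */
            }
        }
    }
    return false;
}
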
Out-of-Order Processors
In-order processors
– A periodic interrupt lands on the "current" instruction, e.g., the next instruction to issue
– Peak performance = no wasted issue slots
– Any stall implies a loss in performance
Out-of-order processors
– Many instructions in flight: there is no "current" instruction
– Some stalls are masked by concurrent execution: instructions issue around the stalled instruction
Example: does this stall matter?
    load r1, …          (average latency: 15.0 cycles)
    … other instructions …
    add  …, r1, …

Issue: Need to Measure Concurrency
Interesting concurrency metrics
– Retired instructions per cycle
– Issue slots wasted while an instruction is in flight
– Pipeline stage utilization
How to measure concurrency?
Special-purpose hardware
– Some metrics are difficult to measure, e.g., those that need retire/abort status
Sample potentially concurrent instructions
– Aggregate info from pairs of samples
– Statistically estimate metrics

Paired Sampling
Sample two instructions (aggregation sketched below)
– May be in flight simultaneously
– Replicate the ProfileMe hardware; add the intra-pair distance
Nested sampling
– Sample a window around the first profiled instruction
– Randomly select the second profiled instruction
– Statistically estimate the frequency of F(first, second)
[Diagram: a time window from -W to +W around the first sample; second samples inside the window can overlap with it in flight, samples outside do not]

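A sketch of how paired samples might be aggregated to estimate how often two program points are in flight together; this is illustrative C with hypothetical types, not the published ProfileMe analysis.

#include <stdbool.h>
#include <stdint.h>

/* One paired sample: PCs of the two profiled instructions, the randomized
 * fetch distance between them, and whether they overlapped in flight. */
struct pair_sample {
    uint64_t pc1, pc2;
    int      distance;    /* in [-W, +W] */
    bool     overlapped;
};

/* Estimate, for a given pair of PCs, the fraction of sampled pairs that
 * were in flight at the same time. */
double estimate_overlap(const struct pair_sample *s, int n,
                        uint64_t pc1, uint64_t pc2)
{
    int matches = 0, overlaps = 0;
    for (int i = 0; i < n; i++) {
        if (s[i].pc1 == pc1 && s[i].pc2 == pc2) {
            matches++;
            if (s[i].overlapped)
                overlaps++;
        }
    }
    return matches ? (double)overlaps / matches : 0.0;
}
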
Explaining Lost Performance
An open question
Some in-order analysis is applicable
– E.g., D-cache miss and branch mispredict analysis
Pipe-stage latencies from counters would help a lot

Summary & Conclusion
Statistical profiling can be
– Inexpensive
– Effective
Instruction-level analysis matters
Performance counters
– Implementation details make a big difference
Out-of-order processors require better counters