Statistical Profiling: Hardware, OS, and Analysis Tools

Joint Work
DIGITAL Continuous Profiling Infrastructure (DCPI) project members
– Systems Research Center: Lance Berc, Sanjay Ghemawat, Monika Henzinger, Shun-Tak Leung, Dick Sites (now at Adobe), Mitch Lichtenberg, Mark Vandevoorde, Carl Waldspurger, Bill Weihl
– Western Research Lab: Jennifer Anderson, Jeffrey Dean
Other collaborators
– Cambridge Research Lab: Jamey Hicks
– Alpha Engineering: George Chrysos, Scot Hildebrandt, Rick Kessler, Ed McLellan, Gerard Vernes, Jonathan White

Outline
Statistical sampling
– What is it?
– Why use it?
Data collection
– Hardware issues
– OS issues
Data analysis
– In-order processors
– Out-of-order processors

Statistical Profiling
Based on periodic sampling (see the sketch below)
– Hardware generates periodic interrupts
– OS handles the interrupts and stores data: the program counter (PC) plus any extra info
– Analysis tools convert the data for users and for compilers
Examples: DCPI, Morph, SGI SpeedShop, Unix prof, VTune

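To make the sampling loop concrete, here is a minimal user-space sketch of the idea, not DCPI itself: a periodic timer interrupt fires, a handler records the interrupted program counter, and the raw PCs are kept for later analysis. The register name REG_RIP assumes x86-64 Linux/glibc; DCPI drives the interrupt from hardware performance counters inside the kernel rather than from a timer.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 100000
static unsigned long pc_samples[MAX_SAMPLES];   /* raw PC values for later analysis */
static volatile int nsamples;

/* "OS handles the interrupts and stores data": record the interrupted PC.
 * REG_RIP is the x86-64 Linux/glibc name; other platforms differ. */
static void on_profile_tick(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si;
    ucontext_t *uc = ctx;
    if (nsamples < MAX_SAMPLES)
        pc_samples[nsamples++] = (unsigned long)uc->uc_mcontext.gregs[REG_RIP];
}

static double burn(void)                        /* workload being profiled */
{
    double x = 0.0;
    for (long i = 1; i < 50000000; i++)
        x += 1.0 / (double)i;
    return x;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_profile_tick;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    /* "Hardware generates periodic interrupts": here a 1 ms CPU-time timer
     * stands in for a performance-counter overflow interrupt. */
    struct itimerval it;
    it.it_interval.tv_sec = 0;
    it.it_interval.tv_usec = 1000;
    it.it_value = it.it_interval;
    setitimer(ITIMER_PROF, &it, NULL);

    printf("result=%f, collected %d PC samples\n", burn(), nsamples);
    return 0;
}
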
Sampling vs. Instrumentation
Much lower overhead than instrumentation
– DCPI: program 1%-3% slower
– Pixie: program 2-3 times slower
Applicable to large workloads
– 100,000 TPS on Alpha
– AltaVista
Easier to apply to whole systems (kernel, device drivers, shared libraries, ...)
– Instrumenting kernels is very tricky
– No source code needed

Information from Profiles
DCPI estimates
– Where CPU cycles went, broken down by image, procedure, and instruction
– How often code was executed (basic blocks and CFG edges)
– Where peak performance was lost, and why

Example: Getting the Big Picture
Total samples for event type cycles = 6095201

 cycles      %    cum%   load file
2257103  37.03%  37.03%  /usr/shlib/X11/lib_dec_ffb_ev5.so
1658462  27.21%  64.24%  /vmunix
 928318  15.23%  79.47%  /usr/shlib/X11/libmi.so
 650299  10.67%  90.14%  /usr/shlib/X11/libos.so

 cycles      %    cum%   procedure              load file
2064143  33.87%  33.87%  ffb8ZeroPolyArc        /usr/shlib/X11/lib_dec_ffb_ev5.so
 517464   8.49%  42.35%  ReadRequestFromClient  /usr/shlib/X11/libos.so
 305072   5.01%  47.36%  miCreateETandAET       /usr/shlib/X11/libmi.so
 271158   4.45%  51.81%  miZeroArcSetup         /usr/shlib/X11/libmi.so
 245450   4.03%  55.84%  bcopy                  /vmunix
 209835   3.44%  59.28%  Dispatch               /usr/shlib/X11/libdix.so
 186413   3.06%  62.34%  ffb8FillPolygon        /usr/shlib/X11/lib_dec_ffb_ev5.so
 170723   2.80%  65.14%  in_checksum            /vmunix
 161326   2.65%  67.78%  miInsertEdgeInET       /usr/shlib/X11/libmi.so
 133768   2.19%  69.98%  miX1Y1X2Y2InRegion     /usr/shlib/X11/libmi.so

Example: Using the Microscope
Where peak performance is lost and why

Example: Summarizing Stalls
I-cache (not ITB)        0.0% to  0.3%
ITB/I-cache miss         0.0% to  0.0%
D-cache miss            27.9% to 27.9%
DTB miss                 9.2% to 18.3%
Write buffer             0.0% to  6.3%
Synchronization          0.0% to  0.0%
Branch mispredict        0.0% to  2.6%
IMUL busy                0.0% to  0.0%
FDIV busy                0.0% to  0.0%
Other                    0.0% to  0.0%
Unexplained stall        2.3% to  2.3%
Unexplained gain        -4.3% to -4.3%
-------------------------------------------------------------
Subtotal dynamic        44.1%
Slotting                 1.8%
Ra dependency            2.0%
Rb dependency            1.0%
Rc dependency            0.0%
FU dependency            0.0%
-------------------------------------------------------------
Subtotal static          4.8%
-------------------------------------------------------------
Total stall             48.9%
Execution               51.2%
Net sampling error      -0.1%
-------------------------------------------------------------
Total tallied          100.0% (35171, 93.1% of all samples)

Example: Sorting Stalls
    %   cum%   cycles   cnt   cpi  blame    PC   file:line
10.0%  10.0%   109885  4998  22.0  dcache  957c  comp.c:484
 9.9%  19.8%   108776  5513  19.7  dcache  9530  comp.c:477
 7.8%  27.6%    85668  3836  22.3  dcache  959c  comp.c:488

Instruction-Level Information Matters
DCPI anecdotes
– TPC-D: 10% speedup
– Duplicate filtering for AltaVista: part of a 19x speedup
– Compress program: 22%
– Compiler improvements: 20% in several SPEC benchmarks

Outline
Statistical sampling
– What is it?
– Why use it?
Data collection
– Hardware issues
– OS issues
Data analysis
– In-order processors
– Out-of-order processors

Typical Hardware Support
Timers
– Clock interrupt after N units of time
Performance counters (see the sketch below)
– Interrupt after N cycles, issues, loads, L1 D-cache misses, branch mispredicts, uops retired, ...
– Alpha 21064, 21164; Pentium Pro, Pentium II; ...
– Easy to measure total cycles, issues, CPI, etc.
Only extra information is the restart PC

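The same "interrupt after N events" style of counter is exposed on current Linux kernels through perf_event_open; the sketch below is a modern analogue of the setup this slide describes, not the Alpha or Pentium Pro counter interface. Field and constant names come from the Linux perf_event ABI, and the sampling period of 100,000 cycles is an arbitrary choice for illustration.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.type           = PERF_TYPE_HARDWARE;
    attr.config         = PERF_COUNT_HW_CPU_CYCLES; /* could also be loads, misses, ... */
    attr.sample_period  = 100000;                   /* interrupt every N = 100,000 events */
    attr.sample_type    = PERF_SAMPLE_IP;           /* only extra info: the restart PC */
    attr.disabled       = 1;
    attr.exclude_kernel = 1;

    /* Count for the calling process on any CPU. */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (volatile long i = 0; i < 100000000; i++)   /* work being profiled */
        ;
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long total = 0;
    read(fd, &total, sizeof(total));
    printf("cycles counted: %lld\n", total);
    /* The sampled PCs themselves arrive through an mmap'd ring buffer (omitted here). */
    close(fd);
    return 0;
}
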
Problem: Inaccurate Attribution
Experiment
– Count data loads
– Loop: a single load plus hundreds of nops
In-order processor (Alpha 21164)
– Skew
– Large peak
Out-of-order processor (Intel Pentium Pro)
– Skew
– Smear

Ramifications of Misattribution
No skew or smear
– Instruction-level analysis is easy!
Skew is a constant number of cycles
– Instruction-level analysis is possible
– Adjust the sampling period by the amount of skew
– Infer execution counts, CPI, stalls, and stall explanations from the cycle samples and the program
Smear
– Instruction-level analysis seems hopeless
– Examples: Pentium II, StrongARM

Desired Hardware Support
Sample fetched instructions
Save the PC of the sampled instruction
– E.g., the interrupt handler reads an internal processor register
– Makes skew and smear irrelevant
Gather more information

ProfileMe: Instruction-Centric Profiling
[Pipeline diagram: when a fetch counter overflows, a randomly selected fetched instruction is tagged ("ProfileMe tag!"); as it moves through fetch, map, issue, execute, and retire, the hardware captures its PC, memory address, I-cache/D-cache miss flags, branch history and mispredict status, retired/aborted status, and stage latencies into internal processor registers, then raises an interrupt.]

Instruction-Level Statistics
– PC + retire status: execution frequency
– PC + cache miss flag: cache miss rates
– PC + branch mispredict: mispredict rates
– PC + event flag: event rates
– PC + branch direction: edge frequencies
– PC + branch history: path execution rates
– PC + latency: instruction stalls ("100-cycle dcache miss" vs. "dcache miss")

Kernel Device Driver
Challenge: 1% of the 64K-cycle sampling period is only 655 cycles per sample
Aggregate samples in a hash table (sketched below)
– (PID, PC, event) → count
Minimize cache misses
– ~100 cycles to memory
– Pack data structures into cache lines
Eliminate expensive synchronization operations
– Interprocessor interrupts for synchronization with the daemon
– Replicate the main data structures on each processor

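A sketch of the per-processor aggregation idea in C, with hypothetical field sizes and a stubbed eviction path rather than the actual driver code: entries are packed so several fit in one cache line, the common case touches a single line, and a conflicting key evicts the old entry toward the daemon's buffer.

#include <stdint.h>

/* 16-byte entry: four entries per 64-byte cache line (illustrative packing). */
struct sample_entry {
    uint64_t pc;        /* restart PC of the sample */
    uint32_t pid;       /* process ID */
    uint16_t event;     /* which counter overflowed */
    uint16_t count;     /* 0 means the slot is empty */
};

#define HASH_SIZE 4096  /* one table per processor: no cross-CPU sharing */
static struct sample_entry table[HASH_SIZE];

/* Stub: the real driver appends evicted entries to a buffer that the
 * user-space daemon drains. */
static void flush_to_daemon(const struct sample_entry *e) { (void)e; }

/* Called from the performance-counter interrupt handler. */
void record_sample(uint32_t pid, uint64_t pc, uint16_t event)
{
    uint64_t h = (pc ^ ((uint64_t)pid << 17) ^ event) % HASH_SIZE;
    struct sample_entry *e = &table[h];

    if (e->count != 0 && e->pid == pid && e->pc == pc && e->event == event) {
        if (e->count == UINT16_MAX) {   /* about to overflow: hand off and restart */
            flush_to_daemon(e);
            e->count = 0;
        }
        e->count++;                     /* common case: one cache line touched */
        return;
    }
    if (e->count != 0)
        flush_to_daemon(e);             /* conflict: evict the old entry */
    e->pid = pid; e->pc = pc; e->event = event; e->count = 1;
}
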
Moving Samples to Disk
User-space daemon
– Extracts raw samples from the driver
– Associates samples with compiled code
– Updates disk-based profiles for compiled code
Mapping samples to compiled code
– Dynamic loader hook for dynamically loaded code
– Exec hook for statically linked code
– Other hooks for initializing the mapping at daemon start-up
Profiles
– Text header + compact binary samples

Performance of Data Collection (DCPI)
Time
– 1-3% total overhead for most workloads
– Often less than the variation from run to run
Space
– 512 KB kernel memory per processor
– 2-10 MB resident for the daemon
– 10 MB of disk after one month of profiling on a heavily used, timeshared 4-processor machine
Non-intrusive enough to be run for many hours on production systems

Outline
Statistical sampling
– What is it?
– Why use it?
Data collection
– Hardware issues
– OS issues
Data analysis
– In-order processors
– Out-of-order processors

Data Analysis
[Diagram: compiled code and samples feed the analysis, which produces frequency, cycles per instruction, and stall explanations]
Cycle samples are proportional to the total time spent at the head of the issue queue (at least on in-order Alphas)
Frequency indicates frequent paths
CPI indicates stalls

Estimating Frequency from Samples
[Example: 1,000,000 cycle samples could mean 1,000,000 executions at 1 CPI or 10,000 executions at 100 CPI]
Problem
– Given cycle samples, compute frequency and CPI
Approach
– Let F = frequency / sampling period
– E(cycle samples) = F × CPI
– So F = E(cycle samples) / CPI

Estimating Frequency (cont.)
F = E(cycle samples) / CPI
Idea
– If there is no dynamic stall, the CPI is known, so F can be estimated
– So... assume some instructions have no dynamic stalls
Consider a group of instructions with the same frequency (e.g., a basic block)
Identify the instructions without dynamic stalls, then average their sample counts for better accuracy
Key insight
– Instructions without stalls have smaller sample counts

Estimating Frequency (Example)
– Compute MinCPI from the code
– Compute Samples / MinCPI
– Select the data to average (see the sketch below)
Does badly when:
– Few issue points
– All issue points stall

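A sketch of the estimation step described on the last two slides, written as illustrative C rather than the actual DCPI analysis code; the choice to average the smallest half of the ratios is an assumption standing in for however the real tool selects the issue points presumed stall-free.

#include <stdlib.h>

/* One issue point in a basic block. */
struct issue_point {
    double samples;   /* cycle samples attributed to this instruction */
    double min_cpi;   /* CPI assuming no dynamic stall (from the pipeline model) */
};

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Estimate the block's execution frequency F (in units of the sampling period). */
double estimate_frequency(const struct issue_point *ip, int n)
{
    double *ratio = malloc(n * sizeof(*ratio));
    if (ratio == NULL || n == 0)
        return 0.0;
    for (int i = 0; i < n; i++)
        ratio[i] = ip[i].samples / ip[i].min_cpi;   /* equals F if no dynamic stall */

    /* Instructions that stalled have inflated ratios, so average only the
     * smaller ratios (here: the smallest half, an illustrative choice). */
    qsort(ratio, n, sizeof(*ratio), cmp_double);
    int k = (n + 1) / 2;
    double sum = 0.0;
    for (int i = 0; i < k; i++)
        sum += ratio[i];
    free(ratio);
    return sum / k;
}

Multiplying the returned F by the sampling period gives the estimated execution count for the block; as the slide notes, the heuristic degrades when a block has few issue points or when every issue point stalls.
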
Frequency Estimate Accuracy
Compare frequency estimates for blocks against measured values obtained with a pixie-like tool
Edge frequencies are a bit less accurate

Explaining Stalls
Static stalls
– Schedule the instructions in each basic block optimistically, using a detailed pipeline model of the processor
Dynamic stalls
– Start with all possible explanations: I-cache miss, D-cache miss, DTB miss, branch mispredict, ...
– Rule out unlikely explanations
– List the remaining possibilities

Ruling Out D-cache Misses
Is the previous occurrence of an operand register the destination of a load instruction? (see the sketch below)
– Search backward across basic block boundaries
– Prune by block and edge execution frequencies
Example:
    ldq  t0, 0(s1)
    subq t0, t1, t2
    addq t3, t0, t4
  OR
    subq t0, t1, t2

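A toy version of the check within a single basic block, using a hypothetical instruction representation rather than the analysis tool's real IR; the real analysis also walks backward across block boundaries and prunes candidate paths by block and edge execution frequencies.

#include <stdbool.h>

enum opclass { OP_LOAD, OP_STORE, OP_ALU, OP_BRANCH };

struct insn {
    enum opclass op;
    int dest;       /* destination register, -1 if none */
    int src[2];     /* source registers, -1 if unused */
};

/* Could insns[i] be stalled by a D-cache miss? True if the most recent
 * writer of one of its operands is a load, as in
 *     ldq  t0, 0(s1)
 *     subq t0, t1, t2   */
bool maybe_dcache_stall(const struct insn *insns, int i)
{
    for (int s = 0; s < 2; s++) {
        int reg = insns[i].src[s];
        if (reg < 0)
            continue;
        for (int j = i - 1; j >= 0; j--) {
            if (insns[j].dest == reg) {      /* previous occurrence as a destination */
                if (insns[j].op == OP_LOAD)
                    return true;
                break;                       /* register overwritten by a non-load */
            }
        }
    }
    return false;
}
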
Out-of-Order Processors
In-order processors
– A periodic interrupt lands on the "current" instruction, e.g., the next instruction to issue
– Peak performance = no wasted issue slots
– Any stall implies a loss in performance
Out-of-order processors
– Many instructions in flight: there is no "current" instruction
– Some stalls are masked by concurrent execution: instructions issue around the stalled instruction
Example: does this stall matter?
    load r1, …          (average latency: 15.0 cycles)
    … other instructions …
    add  …, r1, …

Issue: Need to Measure Concurrency
Interesting concurrency metrics
– Retired instructions per cycle
– Issue slots wasted while an instruction is in flight
– Pipeline stage utilization
How to measure concurrency?
Special-purpose hardware
– Some metrics are difficult to measure, e.g., those that need retire/abort status
Sample potentially concurrent instructions
– Aggregate info from pairs of samples
– Statistically estimate metrics

Paired Sampling
Sample two instructions (aggregation sketched below)
– May be in flight simultaneously
– Replicate the ProfileMe hardware; add the intra-pair distance
Nested sampling
– Sample a window around the first profiled instruction
– Randomly select the second profiled instruction
– Statistically estimate the frequency of F(first, second)
[Diagram: a time window from -W to +W around the first sample; second samples inside the window can overlap with it in flight, samples outside do not]

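A sketch of how paired samples might be aggregated to estimate how often two program points are in flight together; this is illustrative C with hypothetical types, not the published ProfileMe analysis.

#include <stdbool.h>
#include <stdint.h>

/* One paired sample: PCs of the two profiled instructions, the randomized
 * fetch distance between them, and whether they overlapped in flight. */
struct pair_sample {
    uint64_t pc1, pc2;
    int      distance;    /* in [-W, +W] */
    bool     overlapped;
};

/* Estimate, for a given pair of PCs, the fraction of sampled pairs that
 * were in flight at the same time. */
double estimate_overlap(const struct pair_sample *s, int n,
                        uint64_t pc1, uint64_t pc2)
{
    int matches = 0, overlaps = 0;
    for (int i = 0; i < n; i++) {
        if (s[i].pc1 == pc1 && s[i].pc2 == pc2) {
            matches++;
            if (s[i].overlapped)
                overlaps++;
        }
    }
    return matches ? (double)overlaps / matches : 0.0;
}
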
Explaining Lost Performance
An open question
Some in-order analysis is applicable
– E.g., D-cache miss and branch mispredict analysis
Pipe-stage latencies from counters would help a lot

Summary & Conclusion
Statistical profiling can be
– Inexpensive
– Effective
Instruction-level analysis matters
Performance counters
– Implementation details make a big difference
Out-of-order processors require better counters