Slide 1: Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data
Canturk Isci & Margaret Martonosi, Princeton University, MICRO-36

Slide 2: Motivation
- Power is important!
- Measurement/modeling techniques help guide the design of power-aware and temperature-aware systems
- Real-system measurements: help observe long time periods; help guide on-the-fly adaptive management
- Our work: live, run-time, per-component power/thermal measurements

Slide 3: Simulation vs. Multimeters…
- Simulation: + arbitrary detail; + common base; - slow; - possibly inaccurate
- Multimeter: + fast; + fairly accurate; +/- existing systems; - no on-chip detail
- Counter-based power estimation: + fast (real-time); + offers an estimated view of on-chip detail. But: 1) Are the "right" counters available? 2) How accurate is it?

Slide 4: Questions We Answer
- To what extent can CPU performance counters act as proxies to estimate CPU power?
- Can counter-based power estimations offer useful accuracy compared to multimeter measurements?
- What are some interesting uses of this measurement approach?

Slide 5: Power Simulation: Overview
- Idealized view, for all components I in a processor chip:
  Power[I] = MaxPower[I] × ArchScaling[I] × AccessRate[I]
- MaxPower comes from die area and capacitance estimates; a cycle-level simulator gathers event counts and activity factors and adjusts scaling…

Slide 6: Counter-Based Power Estimation: An Overview of Our Approach
- Idealized view, for all components I in a processor chip:
  Power[I] = MaxPower[I] × ArchScaling[I] × AccessRate[I]
- More realistic view, handling non-linear scaling:
  Power[I] = MaxPower[I] × ArchScaling[I] × AccessRate[I] + NonGatedPower[I]
- MaxPower: from die area + stressmarks (empirical multimeter measurement)
- ArchScaling: from microarchitectural properties
- AccessRate: from CPU performance counters!
(A minimal code sketch follows.)
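A minimal sketch of this per-component model in C (not the authors' actual code; the struct fields and any values are placeholders for illustration):

```c
/* Per-component model coefficients.  Values would come from die-area
   estimates, microarchitectural properties, and stressmark tuning. */
typedef struct {
    double max_power;    /* W, from die area + stressmark calibration */
    double arch_scaling; /* from microarchitectural properties        */
    double non_gated;    /* W, drawn even when the unit is not used   */
} PowerCoeffs;

/* Power of one component, given its counter-derived access rate in [0,1]. */
double component_power(const PowerCoeffs *c, double access_rate)
{
    return c->max_power * c->arch_scaling * access_rate + c->non_gated;
}
```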

Slide 7: Counter-Based Power Estimation: General Implementation
- Define events: develop per-component access heuristics and scaling factors
- Performance monitoring: build performance-monitoring SW to sample hardware event counts
- Power model: use stressmarks to calibrate the scaling factors
- Multimeter: use multimeter measurements to validate the counter-based power estimates

Slide 8: Specific Example: Intel Pentium 4
- Define P4 components: the components (L1 cache, BPU, Regs, …) whose powers we'll model, taken from the annotated die layout
- Define P4 events: determine the combination of P4 events that best represents accesses to each component
- P4 perf. counter reader: gather counter info with minimal power overhead and program interruption, using a purpose-built Linux loadable kernel module
- Power modeling: convert counter-based access rates into component powers
- Real power measurement: verify total power against measured processor power

Slide 9: Complete Example: Retirement Logic
- Retirement logic is defined from the annotated die layout
- Initial MaxPower: area-based estimate: MaxPower = Area% × max processor power = 4.7W
- Estimates after stressmark tuning: MaxPower = 1.5W; ClkPower = 2W
- Access-rate approximation based on performance counters (the unit can retire at most 3 uops/cycle): AccessRate = ΔUopsRetired / (3 × ΔCycles)
- Final hardcoded power equation: RetirementPower = AccessRate × MaxPower + ClkPower (a worked sketch follows)
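As a worked sketch, the retirement-logic equation above can be evaluated from raw counter deltas like this (event plumbing omitted; the constants are the slide's own numbers):

```c
/* Retirement-logic power from counter deltas.  Uses the slide's values:
   tuned MaxPower = 1.5 W, ClkPower = 2 W, and a peak retirement rate of
   3 uops/cycle to normalize the access rate into [0,1]. */
double retirement_power(unsigned long long d_uops_retired,
                        unsigned long long d_cycles)
{
    const double MAX_POWER = 1.5;   /* W, after stressmark tuning */
    const double CLK_POWER = 2.0;   /* W, conditional clock power */
    double access_rate =
        (double)d_uops_retired / (3.0 * (double)d_cycles);
    return access_rate * MAX_POWER + CLK_POWER;
}
```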

Slide 10: Questions We Answer
- To what extent can CPU performance counters act as proxies to estimate CPU power?
- Can counter-based power estimations offer useful accuracy compared to multimeter measurements?
  - Validation setup
  - Stressmark measurements
  - SPEC measurements
  - Component-wise validation
- What are some interesting uses of this measurement approach?

Slide 11: Validation Measurement Setup
- Power client sends counter-based access rates over Ethernet; the power server converts the access rates to modeled powers
- Current probe performs a 1 mV/A DC conversion; voltage readings go via RS232 to the logging machine, which converts voltage to measured power
- The two streams are synchronized in a common time window
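On the measurement side, converting a logged probe voltage into watts is a one-liner. A sketch, assuming the clamp probe's 1 mV/A DC conversion and a known, roughly constant supply voltage (the rig's exact electrical details are not on the slide):

```c
/* Convert a logged probe reading (mV) to measured CPU power.
   Assumes a clamp-on current probe with 1 mV per A output and a known
   supply voltage; both are illustrative assumptions. */
double measured_power_watts(double probe_mv, double supply_volts)
{
    double amps = probe_mv;         /* 1 mV per A, so mV reading == A */
    return amps * supply_volts;     /* P = I * V */
}
```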

Slide 12: Questions We Answer (outline repeated; next: stressmark measurements)

Slide 13: Counter-based Power Estimation: Validation Step 1
[Measured vs. modeled power traces for the stressmarks "Fast", "Branch exercise" (taken rate: 1), "High-Low", and "L1Dcache" (hit rate: 0.1), using die-area proportions for max power]

Slide 14: Counter-based Power Estimation: Validation Step 2
[Measured vs. modeled power traces after adjusting the max-power coefficients on stressmarks]

Slide 15: Questions We Answer (outline repeated; next: SPEC measurements)

Slide 16: Validation for Accuracy: SPEC Results
[Measured vs. modeled power traces for Gcc, Gzip, Vpr, Vortex, Gap, and Crafty]

Slide 17: Average SPEC Total Powers
- First bar set: overall power; second bar set: non-idle power
- Average difference between measurement and estimation: 3W
- Worst case: Equake (5.8W)

Slide 18: Questions We Answer (outline repeated; next: component-wise validation)

Slide 19: Per-Component Validation
- What we would like to do: compare against a gold-standard simulator, or compare against detailed published per-component measured data
- What we can do: validate using per-component stressmarks; sanity-check trends in real benchmarks

Slide 20: Validation for Fidelity: Benchmark Power Breakdowns
[Per-component power breakdowns; annotated regions show high issue/execution/branch power, high L2 cache power, high L1 cache power, and high bus power]

Slide 21: Questions We Answer (outline repeated; next: uses of this measurement approach)

Slide 22: Counter-Based Power Estimation: Uses and Big Picture
[Diagram: performance counters + real power measurement → per-component power estimation → power phases and thermal modeling]

Slide 23: Conclusions
- Contributions: a performance-counter-based runtime power model, verified at runtime against synchronized real power measurements over arbitrarily long timescales; physical-component-based power estimates for the processor, usable in power-phase analysis and thermal modeling
- Outcome: we can produce reasonably accurate runtime power estimates without adding any significant overhead to the power profile

Slide 24: IF TIME PERMITS:

Slide 25: Live CPU Performance Monitoring with Hardware Counters
- Most CPUs have hardware performance counters
- Intel P4 performance-monitoring HW: 18 event counters that can count hundreds of possible events
- How to count? 18 counter-configuration control registers (CCCRs)
- What to count? 45 event-selection control registers (ESCRs)
- Plus additional control registers… [Sprunt, 2002]

Slide 26: Our Event Counter: Performance Reader
- Implemented as a Linux loadable kernel module
- Implements 6 syscalls: select_events(), reset_event_counter(), start_event_counter(), stop_event_counter(), get_event_counts(), set_replay_MSRs()
- User-level interface: defines the events; starts and stops the counters; reads the counters & TSC (a hypothetical usage sketch follows)
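A hypothetical user-level client of these syscalls. The syscall numbers and argument conventions below are invented for illustration; the module's real interface may differ:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Placeholder syscall numbers; the LKM's real assignments are unknown. */
#define SYS_select_events        444
#define SYS_start_event_counter  445
#define SYS_stop_event_counter   446
#define SYS_get_event_counts     447

int main(void)
{
    unsigned long long counts[18] = {0}; /* the P4 has 18 event counters */
    long events[2] = {0, 0};             /* ESCR/CCCR event codes go here */

    syscall(SYS_select_events, events, 2L);
    syscall(SYS_start_event_counter);
    /* ... run the code region of interest ... */
    syscall(SYS_stop_event_counter);
    syscall(SYS_get_event_counts, counts, 18L);

    printf("counter 0: %llu\n", counts[0]);
    return 0;
}
```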

Slide 27: Multimeter Measurements: Stressmarks
- "Fast"
- "Branch exercise" (taken rate: 1)
- "High-Low"
- "L1Dcache", array size 1/100 of L1
- "L1Dcache", array size 25x of L1 (~L2)
- "L1Dcache", array size 4x of L2
[Power traces show distinct initialization and benchmark-execution phases]

Slide 28: Component Breakdowns for "branch_exercise"
[Per-component power breakdown, colored by 4 CPU subsystems; visible labels: issue-retire, execution]

Slide 29: SPEC2000 Results: Example
- Equake: an FP benchmark
- Distinct initialization and computation phases
- Initialization dominated by complex IA32 instructions
- FP-intensive mesh computation phase

Slide 30: Desktop Applications
- We aim to track low power utilizations as well
- Desktop applications are usually low power with intermittent power bursts
- 3 applications, with common operations such as open/close application, web, streaming media, text editing, save to disk, and statistical computations

Slide 31: Related Work
- Implementing counter readers: PCL [Berrendorf 1998], Intel VTune, Brink & Abyss [Sprunt 2002]
- Using counters for power: CASTLE [Joseph 2001], power profilers, event-driven OS/cruise control [Bellosa 2000, 2002]
- Real power measurement: compiler optimizations [Seng 2003], cycle-accurate measurement with switched capacitors [Chang 2002]
- Power management and modeling support: instruction-level energy [Tiwari 1994], PowerScope: procedure-level energy [Flinn 1999], event-counter-driven energy coprocessor [Haid 2003], virtual energy counters for memory [Kadayif 2001], ECOsystem: OS energy accounting [Ellis 2002]

Slide 32: Our Work in Comparison
- Power estimation for a complex, aggressively clock-gated processor
- Component power estimates with physical binding to the die layout
- Laying the groundwork for thermal modeling
- Portable implementation with current probe and power-server LKM
- Power-oriented phase analysis with the acquired power vectors

Slide 33: EOP

Slide 34: EXTRA SLIDES
- Counter access heuristics: the complete set of event metrics for access rates
- Power numbers: area ratios | area-based | final max powers & clocks
- Description of phase analysis: a run-through with similarity analysis & metrics; grouping matrices & thresholding algorithm; final reconstructed powers
- Current & future research: the 'real' big picture; phase analysis; thermal modeling
- Questions & rebuttals: verification? events wish list; also provides some answers to reviewers' questions

Slide 35: If not in any of these slides, it is probably in: Presentation Part II

Slide 36: Counter Access Heuristics - 1) Bus Control
- No 3rd-level cache; BSQ allocations ~ IOQ allocations
- Metric 1: bus accesses from all agents. Event: IOQ_allocation (counts various types of bus transactions; should account for BSQ as well; access-based rather than duration-based). Mask: default request type, all read (128B) and write (64B) types; include OWN, OTHER, and PREFETCH
- Metric 2: bus utilization (the % of time the bus is utilized). Event: FSB_data_activity (counts DataReaDY and DataBuSY events on the bus). Mask: count when the processor or other agents drive/read/reserve the bus. Expression: FSB_data_activity × BusRatio / ClocksElapsed, to account for clock ratios (a sketch follows)
- Final access rate: a combination of the two metrics (formula not preserved)
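The bus-utilization expression translates directly to code; a sketch with illustrative names:

```c
/* Bus utilization from the slide's expression.  FSB_data_activity ticks
   at the bus clock, so multiply by the core:bus clock ratio before
   dividing by elapsed core clocks. */
double bus_utilization(unsigned long long fsb_data_activity,
                       double bus_ratio,
                       unsigned long long clocks_elapsed)
{
    return (double)fsb_data_activity * bus_ratio / (double)clocks_elapsed;
}
```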

Slide 37: Counter Access Heuristics - 2) L2 Cache; 3) 2nd-Level BPU
- L2 cache metric: 2nd-level cache references. Event: BSQ_cache_reference (counts cache references as seen by the bus unit). Mask: all MESI reads (LD & RFO) & 2nd-level WR misses. Final expression: (not preserved)
- 2nd-level BPU metric 1: instructions fetched from L2 (predict). Event: ITLB_Reference (counts ITLB translations). Mask: all hits. Expression: 8 × ITLB_Reference (min. 8 instructions per 128B L2 line)
- 2nd-level BPU metric 2: branches retired (history update). Event: branch_retired. Mask: count all taken/not-taken/predicted/mispredicted. Final expression: (not preserved)

Slide 38: Max IA32 Instruction Length (for 2nd-Level BPU)
What this boils down to is that the "official" maximum instruction length is:
    4 prefix
  + 2 opcode
  + 1 modrm
  + 1 sib
  + 4 displacement
  + 4 immediate
  = 16 bytes
…though one may want to assume a 24-byte instruction length to avoid possible buffer overruns when disassembling instructions with more than 4 prefix bytes.

Slide 39: Counter Access Heuristics - 4) ITLB & I-Fetch
- Metric 1: ITLB translations performed. Event: ITLB_Reference (counts ITLB translations). Mask: all hits, misses
- Metric 2: instruction fetch requests by the front-end BPU. Event: BPU_fetch_requests (counts I-fetch requests from the BPU). Mask: TC lookup misses
- Final expression: (not preserved)

Slide 40: Counter Access Heuristics - 5) L1 Cache
- Metric 1: loads & stores retired. Event: front-end event (counts tagged uops that retired). Mask: count speculative (BOGUS) uops as well. Supporting event: uop type (tags load and store instructions). Mask: tag both loads and stores
- Metric 2: replays (for data speculation). Events: 1) LD port replay (counts replayed events at the load port; mask: split LD); 2) ST port replay, same as memory complete (counts replayed events at the store port; mask: SSC / split ST)
- Final expression: (not preserved)

Slide 41: Counter Access Heuristics - 6) MOB
- Metric 1: LD replays triggered by the MOB. Event: MOB load replay (counts load operations replayed by the MOB). Mask: all replays due to unknown address/data, partial data match, misaligned addresses
- No metric for MOB accesses!
- Final expression: (not preserved)

Slide 42: Counter Access Heuristics - 7) Memory Control
- Metric 1: non-idle cycles. Event: machine clear (counts cycles when the pipeline is flushed). Mask: machine clears due to any cause. Expression: TSC count − machine-clear cycles
- Final expression: (not preserved)

Slide 43: Counter Access Heuristics - 8) DTLB
- Metric 1: accesses to either L1 or to the MOB. Expression: L1 accesses + MOB accesses
- Final expression: (not preserved)

Slide 44: Counter Access Heuristics - 10) FP Execution
- Metric: FP instructions executed
- Event 1: packed_SP_uop (packed single-precision uops)
- Event 2: packed_DP_uop (packed double-precision uops)
- Event 3: scalar_SP_uop (scalar single-precision uops)
- Event 4: scalar_DP_uop (scalar double-precision uops)
- Event 5: 64bit_MMX_uop (MMX uops with 64-bit SIMD operands)
- Event 6: 128bit_MMX_uop (integer SSE2 uops with 128-bit SIMD operands)
- Event 7: x87_FP_uop (x87 FP uops); masks 1-7: count ALL
- Event 8: x87_SIMD_moves_uop (x87, FP, MMX, SSE, SSE2 ld/st/mov uops); mask 8: count SIMD moves/stores/loads

Slide 45: Counter Access Heuristics - 10) … FP Execution
- Final expression: (not preserved)

Slide 46: Counter Access Heuristics - 9) Integer Execution
- Metric: integer uops executed. Event: none exists
- Substitute metric: total speculative uops executed. Event: uop queue writes (number of uops written to the uop queue in front of the TC). Mask: all uops from TC, decoder, and microcode ROM. Expression: uop rate − FP uop rate, with a postfix for simple vs. complex ALU operations
- ~Final expression: (not preserved; the FP uop rates are actually rescaled, since packed, SIMD, and MMX uops perform multiple concurrent FP operations). A sketch follows.
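A sketch of the substitute expression. The FP rescaling the slide alludes to is left to the caller, since its exact factors are not given:

```c
/* Substitute integer-execution access rate: total speculative uops
   (uop queue writes from TC, decoder and microcode ROM) minus the FP
   uop rate.  fp_uop_rate is assumed to be pre-rescaled for packed/
   SIMD/MMX uops, per the slide's note. */
double int_exec_access_rate(unsigned long long uop_queue_writes,
                            double fp_uop_rate,
                            unsigned long long d_cycles)
{
    double uop_rate = (double)uop_queue_writes / (double)d_cycles;
    return uop_rate - fp_uop_rate;
}
```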

Slide 47: Counter Access Heuristics - 11) Integer Regfile; 12) FP Regfile
- Integer regfile metric: integer uops executed (no direct metric for total physical regfile accesses). Final expression: (not preserved)
- FP regfile metric: FP uops executed (no direct metric for total physical regfile accesses). Final expression: (not preserved)

Slide 48: Counter Access Heuristics - 13) Instruction Decode
- Metric 1: cycles spent in trace building. Event: TC deliver mode (counts the cycles the processor spends in the specified mode). Mask: logical processor 0 in build mode
- Final expression: (not preserved)

Slide 49: Counter Access Heuristics - 14, 15) Trace Cache
- Metric: uop queue writes from either mode. Event: uop queue writes (counts uops written to the uop queue in front of the TC). Mask: all uops from TC, decoder, and ROM
- Final expression: (not preserved)

Slide 50: Counter Access Heuristics - 16) 1st-Level BPU
- Metric 1: branches retired. Event: branch_retired (counts branches retired). Mask: count all taken/not-taken/predicted/mispredicted
- Final expression: (not preserved)

Slide 51: Counter Access Heuristics - 17) Microcode ROM
- Metric 1: uops originating from the ROM. Event: uop queue writes (counts uops written to the uop queue in front of the TC). Mask: uops only from the microcode ROM
- Final expression: (not preserved)

Slide 52: Counter Access Heuristics - 18) Allocation; 19) Rename; 20) Instruction Queue 1; 21) Schedule; 22) Instruction Queue 2
- Metric: uops that start their flight. Event: uop queue writes (counts uops written to the uop queue). Mask: all uops from TC, decoder, and ROM
- Final expression: (not preserved)

Slide 53: Counter Access Heuristics - 23) Retirement Logic
- Metric: uops that arrive at retirement. Event: uops retired (counts the number of uops retired in a cycle). Mask: also count speculative uops
- Final expression: (not preserved)

Slide 54: Evolution of Power Numbers (powers in W)

Unit      | Area % | Area-Based Max Power Estimate | Max Power after Tuning | Conditional Clk Power
L2 BPU    | 3.4%   | 2.5                           | 15.5                   | -
L1 BPU    | 4.9%   | 3.5                           | 10.5                   |
Tr. Cache | 8.6%   | 6.2                           | 4.0                    | 2.0
L1 Cache  | 5.8%   | 4.2                           | 12.4 (/2)              |
L2 Cache  | 14.7%  | 10.6                          | 30                     | 0.6 (/7)
Int EXE   | 2.0%   | 1.4                           | 3.4                    |
FP EXE    | 6.2%   | 4.5                           |                        |
Rename    | 2.3%   | 1.7                           | 0.4                    | 1.5
Retire    | 6.5%   | 4.7 (/3)                      | 0.5                    | 2.0

Slide 55: Our Power Phase Analysis
- Goal: identify phases in program power behavior
- Determine execution points that correspond to these phases
- Define a small set of power signatures that represent overall power behavior

Slide 56: Our Approach - Outline
- Collect samples of estimated power values for processor sub-units at application runtime
- Define a power-vector similarity metric
- Group sampled program execution into phases
- Determine execution points and representative signature vectors for each phase group
- Analyze the accuracy of our approximation

Slide 57: Power Vector Similarity Metric
- How to quantify the 'power behavior dissimilarity' between two execution points?
  1. Consider solely the total power difference
  2. Consider the Manhattan distance between the corresponding two vectors
  3. Consider the Manhattan distance between the corresponding two vectors, normalized
  4. Consider a combination of (2) & (3)
- Construct a "similarity matrix" to represent similarity among all pairs of execution points; each entry is the distance between the corresponding pair of sampled power vectors (a formalization follows)
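A plausible formalization of metrics (2) and (3), assuming Manhattan distance over n-component power vectors P_i (the slide's formula images are not preserved):

```latex
d_{\mathrm{abs}}(i,j)  = \sum_{k=1}^{n} \bigl| P_{i,k} - P_{j,k} \bigr|
\qquad
d_{\mathrm{norm}}(i,j) = \sum_{k=1}^{n}
  \left| \frac{P_{i,k}}{\sum_{m} P_{i,m}} - \frac{P_{j,k}}{\sum_{m} P_{j,m}} \right|
```

Metric (4) then asks that both distances be small, which is how the thresholding algorithm below uses them.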

Slide 58: Similarity Based on Both Absolute and Normalized Power Vectors
[Similarity-matrix plots]

Slide 59: Grouping Execution Points
- "Thresholding algorithm":
  - Define a threshold of similarity
  - Start from the first execution point (0,0) and identify the ones in the forward execution path that lie within the threshold for both the normalized and absolute metrics
  - Tag the corresponding execution points (j,j) as the same group
  - Find the next untagged execution point (r,r) and do the same along the forward path
  - Rule: a tagged execution point cannot add new elements to its group!
- We demonstrate the outcome of thresholding with grouping matrices (a code sketch follows)
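A compact C sketch of this thresholding pass, assuming precomputed absolute and normalized distance functions (function names and thresholds are illustrative):

```c
/* Group n power samples by thresholding.  dist_abs/dist_norm return the
   absolute and normalized Manhattan distances between samples.  Only
   untagged points may start a group, so tagged points never add new
   elements to their group, matching the slide's rule. */
void threshold_group(int n, int *group,
                     double (*dist_abs)(int, int),
                     double (*dist_norm)(int, int),
                     double th_abs, double th_norm)
{
    for (int i = 0; i < n; i++)
        group[i] = -1;                      /* -1 means untagged */

    int next_group = 0;
    for (int r = 0; r < n; r++) {
        if (group[r] != -1)
            continue;                       /* already tagged: skip */
        group[r] = next_group;              /* start a new group    */
        for (int j = r + 1; j < n; j++) {   /* forward path only    */
            if (group[j] == -1 &&
                dist_abs(r, j)  <= th_abs &&
                dist_norm(r, j) <= th_norm)
                group[j] = next_group;
        }
        next_group++;
    }
}
```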

Slide 60: Gzip Grouping Matrices
- Gzip has 974 power vectors
- Cluster vectors based on similarity using "thresholding"
- Max Gzip power dissimilarity: 47.35W

Slide 61: Representative Vectors & Execution Points
- Each execution point is now assigned to a group; we can represent the whole execution with as many power vectors as there are groups
- For each group: define a representative vector as the average of all instances of that group, and select the execution point that started the group (the earliest point in each group)
- For each execution point: assign the corresponding group's representative vector as that point's power vector, or assign the power vector of the group's selected execution point
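In symbols, for a group g with member set G_g and power vectors P_i, the representative vector is simply the group mean:

```latex
R_g = \frac{1}{\lvert G_g \rvert} \sum_{i \in G_g} P_i
```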

Slide 62: Reconstructing the Power Trace with Representative Vectors
[Original vs. reconstructed power trace]

Slide 63: Reconstructing the Power Trace with Selected Execution Points
[Original vs. reconstructed power trace]

Slide 64: Similarity Matrix Example
- Consider 4 vectors, each with 4 dimensions; log all pairwise distances in the similarity matrix:
    0 6 3 7
    6 0 3 6
    3 3 0 7
    7 6 7 0
- Color-scale from black to white (only for the upper diagonal)

Slide 66: Interpreting the Similarity Matrix Plot
- The level of darkness at any location (r,c) shows the amount of similarity between vectors (samples) r & c, e.g. 0 & 2
- All samples are perfectly similar to themselves: all (r,r) are black
- Vertically above the diagonal: similarity of the sample at the diagonal to previous samples, e.g. 1 vs. 0
- Horizontally right of the diagonal: similarity of the sample at the diagonal to future samples, e.g. 1 vs. 2, 3

Slide 67: Grouping Matrix Example
- Consider the same 4 vectors, with the similarity matrix:
    0 6 3 7
    6 0 3 6
    3 3 0 7
    7 6 7 0
- Mark execution pairs with distance ≤ threshold; successive thresholding steps over this matrix yield the grouping matrix

Slide 68: Current & Future Research
The following slides discuss ongoing research related to power estimation and phases; plans for future research are also discussed.

Slide 69: 0) The Real Big Picture
[Diagram: performance monitoring + real power measurement → power modeling → thermal modeling; power modeling → power phases; program profiling → program structure]
- Bottom line: estimate component power & temperature breakdowns for the P4 at runtime, and analyze how power phase behavior relates to program structure

Slide 70: 1) Phase Branch
- Power phase behavior: similarity based on power vectors; identifying similar program regions
- Profiling execution flow: sampling the process's execution; "PCsampler" LKM
- Program structure: execution vs. code space; power phases → exec. phases (NOT YET?)

Slide 71: Power Phase Behavior (outline repeated; this part: similarity based on power vectors, identifying similar program regions)

Slide 72: Identifying Power Phases
- Most of the methodology is ready: complete Gzip case in [Isci & Martonosi, WWC-6]
- Extensibility to other benchmarks: generated similarity metrics for several; performed phase identification with thresholding for all; repeatability of the experiment
- Several other possible ideas: thresholding + k-means clustering; two-pass thresholding; PCA for dimension reduction (or SVD?)
- Manhattan = L1 norm; Euclidean (L2): not interesting; Chebyshev (L-inf): ??

Slide 73: Program Execution Profile (outline repeated; this part: profiling execution flow, program structure)

Slide 74: Program Execution Profile
- Sample program flow simultaneously with power
- Our LKM implementation: "PCsampler" (not finished…)
- Generate code-space similarity in parallel with power-space similarity
- Relative comparisons of methods for complexity, accuracy, applicability, etc.

Slide 75: Current State
- Sample the PC; bind samples to functions; reacquire the PID
- SPEC quirks: runspec always sits at a fixed address in the ELF program interpreter; benchmarks change PID between datasets
- Verify the PC with objdump, so we can make sure it is the PC we're sampling

Slide 76: Initial Data: PC Trace for gzip-source
[Sampled PC trace; value bands correspond to functions]

Slide 77: 2) Thermal Modeling
- Related work
- Performance monitoring: P4 performance counters; Performance Reader LKM
- Real power measurement: P4 power measurement setup; examples
- Power modeling: P4 power model; model + measurement sync setup, verification
- Thermal modeling: refined thermal model; ex: PPro thermal model

Slide 78: Thermal Modeling: A Basic Model
- Based on the lumped R-C model of the packaging heat transfer; built on top of the power modeling
- Inputs: sampled component powers, the respective component areas, and physical processor/packaging parameters
- t: sampling interval; T_i: the temperature difference between block i and the heatsink (a sketch of the update equation follows)
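A minimal sketch of the implied lumped R-C update, with P_i the sampled component power and R_i, C_i the block's thermal resistance and capacitance; the slide's exact discretization is not preserved, so this is an assumption about its shape:

```latex
C_i \frac{dT_i}{dt} = P_i - \frac{T_i}{R_i}
\quad\Longrightarrow\quad
T_i[n+1] = T_i[n] + \frac{t}{C_i}\left(P_i[n] - \frac{T_i[n]}{R_i}\right)
```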

Slide 79: Refined Thermal Model
- Steady-state analysis reveals that the heatsink-die abstraction is not sufficient for real systems
- Proceeding to a multilayer thermal model: active die thickness, metalization/insulation, chip-package interface, package, heatsink
- Requires searching over several materials/dimensions and thermal properties
- Multiple layers → multiple T nodes → multiple DEs
- Baseline heat-removal structure: heatsink | thermal grease | heat spreader | package | die

Slide 80: Physical Structure vs. Thermal Model
[RC network from die to ambient: each die component i has a node T_die,i (R_die,i, C_die,i) driven by its power P_i; package nodes T_p,i (R_p,i, C_p,i); heat spreader node T_spr (R_spr, C_spr) coupled to the package through the thermal grease (R_gr×spr); heatsink node T_h (R_h, C_h) coupled to the ambient temperature T_A through R_h×A under ambient airflow; total power P_total]

Slide 81: Analytical Derivation
- 4 nodes → 4 differential equations
- 1) T_spr: (formula not preserved; a plausible sketch follows)
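One plausible form of the heat-spreader node equation, using the R/C names from the slide-80 diagram; the original formula is an image, so this is an assumption about its shape, not a transcription:

```latex
C_{spr}\frac{dT_{spr}}{dt}
 = \sum_i \frac{T_{p,i} - T_{spr}}{R_{gr \times spr}}
 - \frac{T_{spr} - T_h}{R_{spr}}
```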

Slide 82: Ex: PPro Thermal Model
- Use CASTLE [Joseph, 2001] computed component powers
- Determine component areas from the die photo
- Determine processor/packaging physical parameters
- Generate the numerical thermal model
- Apply the component difference equations recursively along the power flow: update T_die,i → T_p,i → T_spr → T_h

Slide 83: Simulation Outputs
- Thermal nodes updated every t ≈ 20 ms
- Component temperatures build up to ~350K in ~5 hrs
- T_heatsink moves very slowly, as expected

Slide 84: Verification of Results (1/3)
- Full validation requires comparing component power estimates to real measured component powers over full execution traces
- No such published data is available. Is it possible to acquire such data? How would one probe intra-chip components? The closest would be HSPICE simulations, and acquiring traces minutes long is probably infeasible
- There is no P4 power simulator; one benchmark would take ~1 CPU-month at current power-simulation speeds

Slide 85: Verification of Results (2/3)
- We use proxies:
- 1) Verifying against total measured power at runtime: provides an immediate comparison; most tested benchmarks show close approximations
- 2) Behavior of component powers for simpler benchmarks: we show that power trends follow expectations under different corners of execution

Slide 86: Verification of Results (3/3)
- Why is an RMS error measure between measured and estimated total power not included?
- The two data sources are completely independent (measurement arrives over the serial interface, modeling over Ethernet), so they are not perfectly synchronized; this produces spikes at power jumps, and removing them by hand would undermine fidelity
- Art RMS error (overall): 4.38W; Art RMS error (100-1000s): 4.21W

Slide 87: Possibility for Other Processors?
- Most recent processors are keen on power management, so there should be enough power variability to exploit for power modeling and phase analysis
- Porting the power estimation to another architecture requires significant effort: define power-related metrics; implement counter-reader and power-estimation user and kernel SW
- Porting to the same architecture, different implementation, is more straightforward: re-evaluate max/idle/gated power estimates
- Experiences with other architectures: CASTLE project for the Pentium Pro (P6): a few watts of variation, low dimensionality; IBM POWER3-II: very low measured power variation

Slide 88: Event Counter Wish List (1/2)
- Problems from experience:
- Memory-related metrics: L2 metrics are complicated and do not correspond to L2 hits/misses (see the optimization reference manual); granularity issues; no MOB-accesses metric; no memory-control metric
- Integer and FP accesses: the FPE has 8 separate events (with 2 dedicated ESCRs), needing at least 4 rotations to collect; the INTE has no direct measure and cannot differentiate multiplies, shifts, logic, arithmetic
- Out-of-order engine: cannot differentiate between allocate, rename, instruction queues, schedule

Slide 89: Event Counter Wish List (2/2)
- Suggestions for the future:
- Ultimately: specific counters that represent component-wise utilizations; switching bookkeepers for singly-ended lines
- Specific to the P4 & Xeon generation: metrics directly related to memory components; single aggregate metrics for FP and integer execution; metrics that expose out-of-order engine components

Slide 90: Applications of Our Technique
- Already discussed: power phase analysis; thermal modeling
- OS-based energy accounting, e.g. ECOsystem [Duke; C. Ellis et al.]: fine-grained CPU power (rather than ON/OFF)
- Detailed SW energy mappings, e.g. PowerScope [CMU; Flinn & Satyanarayanan]
- Dynamic power management, e.g. Process Cruise Control [Weissel & Bellosa]: event-driven / power-model-driven clock scaling

Slide 91: Statistical Methods (1/2)
- Several optimization/curve-fitting tools: JMP, R, SPSS, S+, Stata
- Regression: used by some previous examples
- 2^k factorial design: k factors, each at 2 levels (hi & lo); run the design at each of the 2^k corners; analyze the interactions and individual factors' effects by turning factors ON/OFF

Slide 92: Statistical Methods (2/2)
- We believe we could match measured total power much more closely with curve-fitting techniques and PCA (to define a new set of orthogonal activity factors); this is a topic for future research
- We avoid this in order to preserve the binding of activity factors to physical component powers, as required for thermal analysis
- Note on the 2^k factorial design: our factors would be access rates or component power estimates, and our response variable measured total power; however, we cannot independently turn accesses to individual components ON/OFF

