1
Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data Canturk Isci & Margaret Martonosi Princeton University MICRO-36
2
2 Motivation Power is important! Measurement/Modeling techniques help guide design of power-aware and temperature-aware systems Real-system measurements: Help observe long time periods Help guide on-the-fly adaptive management Our work: live, run-time, per-component power/thermal measures
3
3 Simulation vs. Multimeters… Multimeter: + Fast + Fairly Accurate +/- Existing systems - No on-chip detail Counter-Based Power Estimation: + Fast (Real-time) + Offers estimated view of on-chip detail But: 1) Are the “right” counters available? 2) How accurate is it? Simulation: + Arbitrary detail + Common base - Slow - Possibly inaccurate
4
4 Questions We Answer To what extent can CPU performance counters act as proxies to estimate CPU power? Can counter-based power estimations offer useful accuracy compared to multimeter measurements? What are some interesting uses of this measurement approach?
5
5 Power Simulation: Overview MaxPower[I] * ArchScaling[I] * AccessRate[I] Power of component I = Idealized view: For all components in a processor chip … Die area and Capacitance estimate Cycle-level simulator gathers event counts, activity factors and adjusts scaling…
6
6 Counter-Based Power Estimation: An Overview of Our Approach MaxPower[I] * ArchScaling[I] * AccessRate[I] Power of component I = Idealized view: For all components in a processor chip … … + NonGatedPower[I] More realistic view: Handle non-linear scaling … Empirical Multimeter measurement CPU Performance Counters! From microarch. properties Die area + Stressmarks
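A minimal sketch of the per-component model stated on this slide (max power times architectural scaling times access rate, plus non-gated power, summed over components). The struct layout, field names, and the clamping of access rates are illustrative assumptions, not the authors' code.

```c
/* Sketch of the per-component power model described above.
 * All names are illustrative, not the authors' actual implementation. */
#include <stddef.h>

typedef struct {
    double max_power;       /* empirically tuned maximum power (W)        */
    double arch_scaling;    /* microarchitectural scaling factor          */
    double non_gated_power; /* power drawn even when the unit is idle (W) */
} component_t;

/* access_rate comes from performance-counter heuristics, in [0, 1] */
double component_power(const component_t *c, double access_rate)
{
    return c->max_power * c->arch_scaling * access_rate + c->non_gated_power;
}

double total_power(const component_t *c, const double *access_rate, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += component_power(&c[i], access_rate[i]);
    return sum;
}
```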
7
7 Counter-Based Power Estimation: General Implementation Define Events: develop per-component access heuristics and scaling factors Performance Monitoring: build performance monitoring SW to sample hardware event counts Power Model: use stressmarks to calibrate scaling factors Multimeter: use multimeter measurements to validate counter-based power estimates
8
8 Specific Example: Intel Pentium 4 Define P4 Components: define components (L1 cache, BPU, Regs, …) whose powers we'll model, from the annotated layout Define P4 Events: determine the combination of P4 events that best represent component accesses P4 Perf. Cntr Reader: gather counter info with minimal power overhead and program interruption using a purpose-built Linux loadable kernel module Power Modeling: convert counter-based access rates into component powers Real Power Measurement: verify total power against measured processor power
9
9 Complete Example: Retirement Logic Retirement Logic defined from annotated die layout Initial MaxPower: area-based estimate, MaxPower = Area% x Max Processor Power = 4.7W Estimates after stressmark tuning: MaxPower = 1.5W; ClkPower = 2W Access rate approximation based on performance counters: AccessRate ≈ ΔUopsRetired / (3 x ΔCycles), since the unit can retire at most 3 uops/cycle Final hardcoded power equation for retirement logic combines these terms
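An illustrative sketch of this retirement-logic example. The 3-uops/cycle cap, the 1.5 W tuned MaxPower, and the 2 W conditional clock power are taken from the slide; how the clock term is gated (here: whenever any uop retires in the interval) is an assumption.

```c
/* Retirement power from counter deltas over one sampling interval. */
double retirement_power(unsigned long long d_uops_retired,
                        unsigned long long d_cycles)
{
    if (d_cycles == 0)
        return 0.0;
    double access_rate = (double)d_uops_retired / (3.0 * (double)d_cycles);
    if (access_rate > 1.0)
        access_rate = 1.0;                  /* cannot exceed 3 uops/cycle   */
    double clk_power = (access_rate > 0.0) ? 2.0 : 0.0;  /* conditional clk */
    return 1.5 * access_rate + clk_power;   /* tuned MaxPower x access rate */
}
```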
10
10 Questions We Answer To what extent can CPU performance counters act as proxies to estimate CPU power? Can counter-based power estimations offer useful accuracy compared to multimeter measurements? Validation Setup Stressmark measurements SPEC measurements Component-wise validation What are some interesting uses of this measurement approach?
11
11 Validation Measurement Setup POWER CLIENT POWER SERVER Counter based access rates over ethernet 1mV/Adc conversion Voltage readings via RS232 to logging machine Convert voltage to measured power Convert access rates to modeled powers Sync together in time window
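A minimal sketch of the power-server conversion step described above: the current probe produces 1 mV per ampere DC, so a DMM reading in millivolts equals the current in amperes, and multiplying by the (separately known) supply rail voltage gives power. The rail voltage and the example numbers are purely illustrative assumptions.

```c
/* Convert a 1 mV/A probe reading into watts on an assumed supply rail. */
double measured_power_watts(double dmm_reading_mV, double rail_voltage_V)
{
    double current_A = dmm_reading_mV;   /* 1 mV/A probe conversion */
    return current_A * rail_voltage_V;   /* P = I * V               */
}
/* e.g. a 35 mV reading on an assumed 1.5 V rail -> 52.5 W (values illustrative) */
```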
12
12 Questions We Answer To what extent can CPU performance counters act as proxies to estimate CPU power? Can counter-based power estimations offer useful accuracy compared to multimeter measurements? Validation Setup Stressmark measurements SPEC measurements Component-wise validation What are some interesting uses of this measurement approach?
13
13 Counter-based Power Estimation: Validation Step 1 Using die area proportions for max power Stressmarks: “Fast”, “Branch exercise” (taken rate: 1), “High-Low”, “L1Dcache” (hit rate: 0.1) (Chart: measured vs. modeled power)
14
14 Counter-based Power Estimation: Validation Step 2 Adjusting max power coefficients on stressmarks (Chart: measured vs. modeled power)
15
15 Questions We Answer To what extent can CPU performance counters act as proxies to estimate CPU power? Can counter-based power estimations offer useful accuracy compared to multimeter measurements? Validation Setup Stressmark measurements SPEC measurements Component-wise validation What are some interesting uses of this measurement approach?
16
16 Validation for Accuracy: SPEC Results Gcc, Gzip, Vpr, Vortex, Gap, Crafty (Charts: measured vs. modeled power)
17
17 Average SPEC Total Powers 1st set: overall power, 2nd set: non-idle power Average difference between measurement and estimation: 3W Worst case: Equake (5.8W)
18
18 Questions We Answer To what extent can CPU performance counters act as proxies to estimate CPU power? Can counter-based power estimations offer useful accuracy compared to multimeter measurements? Validation Setup Stressmark measurements SPEC measurements Component-wise validation What are some interesting uses of this measurement approach?
19
19 Per-Component Validation What we would like to do: compare against a gold-standard simulator, or compare against detailed published per-component measured data What we can do: validate using per-component stressmarks; sanity check on trends in real benchmarks
20
20 Validation for Fidelity: Benchmark Power Breakdowns (Chart annotations: high issue, exec. & branch power; high L2 cache power; high L1 cache power; high bus power)
21
21 Questions We Answer To what extent can CPU performance counters act as proxies to estimate CPU power? Can counter-based power estimations offer useful accuracy compared to multimeter measurements? What are some interesting uses of this measurement approach?
22
22 Counter-Based Power Estimation: Uses and Big Picture Thermal Modeling Power Phases Per-Component Power Estimation Real Power Measurement Performance Counters
23
23 Conclusions Contributions: Performance counter based runtime power model, with runtime verification against synchronous real power measurement over arbitrarily long timescales! Physical component based power estimates for the processor, which can be used in power phase analyses and thermal modeling Outcome: We can produce reasonably accurate runtime power estimates without adding any significant overhead to the power profile being measured
24
24 IF TIME PERMITS:
25
25 Live CPU Performance Monitoring with Hardware Counters Most CPUs have hardware performance counters Intel P4 Performance Monitoring HW: 18 Event Counters to count hundreds of possible events How to Count? 18 Counter Config Control Registers What to Count? 45 Event Selection Control Registers Plus additional control registers … [Sprunt, 2002]
26
26 Our Event Counter: Performance Reader Performance Reader implemented as Linux Loadable Kernel Module Implements 6 syscalls: select_events() reset_event_counter() start_event_counter() stop_event_counter() get_event_counts() set_replay_MSRs() User Level Interface: Defines the events Starts counters Stops counters Reads counters & TSC
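A hypothetical user-level usage pattern for the six syscalls named above. The syscall numbers, argument types, and event identifiers are placeholders for illustration only; they are not the module's real interface.

```c
/* Hypothetical user-level driver for the performance-reader module. */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define NR_select_events        440   /* placeholder syscall numbers */
#define NR_reset_event_counter  441
#define NR_start_event_counter  442
#define NR_stop_event_counter   443
#define NR_get_event_counts     444

int main(void)
{
    unsigned long events[2] = { 0x1, 0x2 };   /* hypothetical event ids  */
    unsigned long long counts[2];

    syscall(NR_select_events, events, 2);     /* define the events       */
    syscall(NR_reset_event_counter);
    syscall(NR_start_event_counter);          /* start counters          */
    /* ... run the code region being measured ... */
    syscall(NR_stop_event_counter);           /* stop counters           */
    syscall(NR_get_event_counts, counts, 2);  /* read counters (and TSC) */

    printf("event0=%llu event1=%llu\n", counts[0], counts[1]);
    return 0;
}
```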
27
27 Multimeter Measurements: Stressmarks “Fast” “Branch exercise” (taken rate: 1) “High-Low” “L1Dcache” with array size 1/100 of L1 “L1Dcache” with array size ~25x L1 (~L2) “L1Dcache” with array size 4x L2 (Chart phases: initialization, benchmark execution)
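A sketch of an "L1Dcache"-style stressmark: repeatedly walk an array whose size is chosen relative to the cache level being exercised (1/100 of L1, ~25x L1, 4x L2, ...). The loop structure, sizes, and iteration counts are illustrative assumptions, not the authors' stressmark code.

```c
/* Cache stressmark sketch: initialization phase, then a long access loop. */
#include <stdlib.h>

volatile long sink;   /* keep the loads from being optimized away */

void cache_stressmark(size_t array_bytes, long iterations)
{
    size_t n = array_bytes / sizeof(long);
    long *a = malloc(n * sizeof(long));
    for (size_t i = 0; i < n; i++)
        a[i] = (long)i;                      /* initialization phase        */

    for (long it = 0; it < iterations; it++) /* benchmark execution phase   */
        for (size_t i = 0; i < n; i++)
            sink += a[i];

    free(a);
}
```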
28
28 Component Breakdowns Component breakdowns for “branch_exercise” (Chart: colors for 4 CPU subsystems — issue/retire, execution, …)
29
29 SPEC2000 Results: Example Equake: FP benchmark Initialization and computation phases Initialization with high-complexity IA32 instructions FP-intensive mesh computation phase
30
30 Desktop Applications We aim to track low power utilizations as well. Desktop applications are usually low power with intermittent power bursts 3 applications, with common operations such as open/close application, web, streaming media, text editing, save to disk, statistical computations.
31
31 Related Work Implementing counter readers: PCL [Berrendorf 1998], Intel VTune, Brink & Abyss [Sprunt 2002] Using counters for power: CASTLE [Joseph 2001], power profilers, event-driven OS / cruise control [Bellosa 2000, 2002] Real power measurement: compiler optimizations [Seng 2003], cycle-accurate measurement with switch caps [Chang 2002] Power management and modeling support: instruction-level energy [Tiwari 1994], PowerScope: procedure-level energy [Flinn 1999], event-counter-driven energy coprocessor [Haid 2003], virtual energy counters for memory [Kadayif 2001], ECOsystem: OS energy accounting [Ellis 2002]
32
32 Our Work in Comparison Power estimation for a complex, aggressively clock-gated processor Component power estimates with physical binding to die layout Laying the groundwork for thermal modeling Portable implementation with current probe and power server LKM Power oriented phase analysis with acquired power vectors
33
33 EOP
34
34 EXTRA SLIDES Counter Access Heuristics: the complete set of event metrics for access rates Power Numbers: area ratios | area-based | final max powers & clocks Description of Phase Analysis: a run-through with similarity analysis & metrics, grouping matrices & thresholding algorithm, final reconstructed powers Current & Future Research: the ‘Real’ Big Picture, phase analysis, thermal modeling Questions & Rebuttals: verification? events wish list Also provides some answers to reviewers’ questions
35
35 IF NOT IN ANY OF THESE SLIDES, IT IS PROBABLY IN: PRESENTATION PART II
36
36 Counter Access Heuristics 1) BUS CONTROL: No 3rd-level cache, so BSQ allocations ~ IOQ allocations Metric 1: Bus accesses from all agents Event: IOQ_allocation Counts various types of bus transactions Should account for BSQ as well; access based rather than duration MASK: Default req. type, all read (128B) and write (64B) types, include OWN, OTHER and PREFETCH Metric 2: Bus utilization (the % of time the bus is utilized) Event: FSB_data_activity Counts DataReaDY and DataBuSY events on the bus Mask: Count when processor or other agents drive/read/reserve the bus Expression: FSB_data_activity x BusRatio / Clocks Elapsed, to account for clock ratios Final access rate:
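A small sketch of the bus-utilization expression above. FSB_data_activity is counted in bus clocks, so it is scaled by the bus ratio (presumably core clocks per bus clock) before dividing by elapsed core clocks; the variable names are illustrative.

```c
/* Bus utilization = FSB_data_activity x BusRatio / Clocks Elapsed */
double bus_utilization(unsigned long long fsb_data_activity,
                       double bus_ratio,               /* core clocks per bus clock */
                       unsigned long long clocks_elapsed)
{
    return (double)fsb_data_activity * bus_ratio / (double)clocks_elapsed;
}
```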
37
37 Counter Access Heuristics 2) L2 Cache: Metric: 2nd-level cache references Event: BSQ_cache_reference Counts cache references as seen by the bus unit MASK: All MESI reads (LD & RFO) & 2nd-level WR misses Final Expression: 3) 2nd-Level BPU: Metric 1: Instructions fetched from L2 (predict) Event: ITLB_Reference Counts ITLB translations Mask: All hits Expression: 8 x ITLB_Reference (min. 8 instructions per 128B L2 line) Metric 2: Branches retired (history update) Event: branch_retired Counts branches retired Mask: Count all Taken/NT/Predicted/MissP Final expression:
38
(Max IA32 Instruction Length – for 2nd-Level BPU) What this boils down to is that the "official" maximum instruction length is: 4 prefix + 2 opcode + 1 modrm + 1 sib + 4 displacement + 4 immediate = 16 bytes ...though one may want to assume a 24-byte instruction length to avoid possible buffer overruns when disassembling instructions with more than 4 prefix bytes.
39
39 Counter Access Heuristics 4) ITLB & I-Fetch: Metric 1: ITLB translations performed Event: ITLB_Reference Counts ITLB translations Mask: All hits, misses Metric 2: Instruction fetch requests by the front-end BPU Event: BPU_fetch_requests Counts I-fetch requests from the BPU Mask: TC lookup misses Final expression:
40
40 Counter Access Heuristics 5) L1 Cache: Metric 1: Load & Store retired Event: Front End Event Counts tagged uops that retired Mask: Count also speculatives (BOGUS) Supporting event: Uop type Tags Load and Store instructions Mask: Tag both Loads and Stores Metric 2: Replays (For Data Speculation) Events: 1) LD port replay Counts replayed events at load port Mask: Split LD 2) ST port replay (Same as Memory Complete, Mask: SSC) Counts replayed events at store port Mask: Split ST Final expression:
41
41 Counter Access Heuristics 6) MOB: Metric 1: LD Replays triggered by MOB Event: MOB load replay Counts the load operations replayed by MOB Mask: All replays due to unknown address/data, partial data match, misaligned addresses No metric for MOB accesses! Final expression:
42
42 Counter Access Heuristics 7) Memory Control: Metric 1: Non-idle cycle Event: Machine Clear Counts cycles when the pipeline is flushed Mask: Machine clears due to any cause Expression: TSC count – Machine Clear Cycles Final Expression:
43
43 Counter Access Heuristics 8) DTLB: Metric 1: Accesses to either L1 or to MOB Expression: L1 Accesses + MOB Accesses Final expression:
44
44 Counter Access Heuristics 10) FP Execution: Metric: FP instructions executed event1: packed_SP_uop counts packed single precision uops event2: packed_DP_uop counts packed double precision uops event3: scalar_SP_uop counts scalar single precision uops event4: scalar_DP_uop counts scalar double precision uops event5: 64bit_MMX_uop counts MMX uops with 64bit SIMD operands event6: 128bit_MMX_uop counts integer SSE2 uops with 128bit SIMD operands event7: x87_FP_UOP counts x87 FP uops Masks 1-7: Count ALL event8: x87_SIMD_moves_uop counts x87, FP, MMX, SSE, SSE2 ld/st/mov uops Mask 8: Count SIMD moves/stores/lds
45
45 Counter Access Heuristics 10) … FP Execution: Final Expression:
46
46 Counter Access Heuristics 9) Integer Execution: Metric 1: Integer uops executed — no direct event; substitute metric: total speculative uops executed Event: Uop queue writes Counts number of uops written to the uop queue in front of the TC Mask: All uops from TC, Decoder and Microcode ROM Expression: Uop rate – FP uop rate Postfix for simple vs. complex ALU operations ~ Final expression: (Actually, FP uop rates are rescaled, as packed, SIMD and MMX uops do multiple concurrent FP operations)
47
47 Counter Access Heuristics 11) Integer Regfile Metric: Integer uops executed No direct metric for total physical regfile accesses Final expression: 12) FP Regfile Metric: FP uops executed No direct metric for total physical regfile accesses Final expression:
48
48 Counter Access Heuristics 13) Instruction Decode: Metric 1: Cycles spent in trace building Event: TC Deliver Mode Counts the cycles processor spends in the specified mode Mask: Logical processor 0 in build mode Final expression:
49
49 Counter Access Heuristics 14,15) Trace Cache: Metric: Uop queue writes from either mode Event: Uop queue writes Counts number of uops written to the uop queue in front of TC Mask: All uops from TC and Decoder and ROM Final expression:
50
50 Counter Access Heuristics 16) 1st-Level BPU: Metric 1: Branches retired Event: branch_retired Counts branches retired Mask: Count all Taken/NT/Predicted/MissP Final expression:
51
51 Counter Access Heuristics 17) Microcode ROM: Metric 1: Uops originating from ROM Event: Uop queue writes Counts Number of uops written to the uop queue in front of TC Mask: Uops only from microcode ROM Final expression:
52
52 Counter Access Heuristics 18) Allocation & 19) Rename & 20) Instruction queue1 & 21) Schedule & 22) Instruction queue2: Metric: Uops that started their flight Event: Uop queue writes Counts Number of uops written to the uop queue Mask: All uops from TC and Decoder and ROM Final expression:
53
53 Counter Access Heuristics 23) Retirement Logic: Metric: Uops that arrive at retirement Event: Uops retired Counts number of uops retired in a cycle Mask: Consider also speculative Final expression:
54
54 Evolution of Power Numbers

Unit      | Area % | Area-based max power estimate | Max power after tuning | Conditional clk power
L2 BPU    | 3.4%   | 2.5                           | 15.5                   | –
L1 BPU    | 4.9%   | 3.5                           | 10.5                   |
Tr. Cache | 8.6%   | 6.2                           | 4.0                    | 2.0
L1 Cache  | 5.8%   | 4.2                           | 12.4 (/2)              |
L2 Cache  | 14.7%  | 10.6                          | 30                     | 0.6 (/7)
Int EXE   | 2.0%   | 1.4                           | 3.4                    |
FP EXE    | 6.2%   | 4.5                           |                        |
Rename    | 2.3%   | 1.7                           | 0.4                    | 1.5
Retire    | 6.5%   | 4.7 (/3)                      | 0.5                    | 2.0
55
55 Our Power Phase Analysis Goal: Identify phases in program power behavior Determine execution points that correspond to these phases Define small set of power signatures that represent overall power behavior
56
56 Our Approach Our Approach – Outline: Collect samples of estimated power values for processor sub-units at application runtime Define a power vector similarity metric Group sampled program execution into phases Determine execution points and representative signature vectors for each phase group Analyze the accuracy of our approximation
57
57 Power Vector Similarity Metric How to quantify the ‘power behavior dissimilarity’ between two execution points? 1. Consider solely the total power difference 2. Consider the Manhattan distance between the corresponding 2 vectors 3. Consider the Manhattan distance between the corresponding 2 normalized vectors 4. Consider a combination of (2) & (3) Construct a “similarity matrix” to represent similarity among all pairs of execution points Each entry in the similarity matrix:
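A sketch of one similarity-matrix entry using the metric above: Manhattan distance on both the absolute and the normalized (unit-sum) power vectors. The number of components and the equal 0.5/0.5 weighting of the combined metric are illustrative assumptions.

```c
/* Dissimilarity between two sampled component-power vectors. */
#include <math.h>

#define NCOMP 22   /* number of modeled P4 components (illustrative) */

static double manhattan(const double *a, const double *b, int n)
{
    double d = 0.0;
    for (int i = 0; i < n; i++)
        d += fabs(a[i] - b[i]);
    return d;
}

static void normalize(const double *v, double *out, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += v[i];
    for (int i = 0; i < n; i++) out[i] = (sum > 0.0) ? v[i] / sum : 0.0;
}

/* One entry of the similarity matrix for samples r and c */
double dissimilarity(const double pr[NCOMP], const double pc[NCOMP])
{
    double nr[NCOMP], nc[NCOMP];
    normalize(pr, nr, NCOMP);
    normalize(pc, nc, NCOMP);
    return 0.5 * manhattan(pr, pc, NCOMP) + 0.5 * manhattan(nr, nc, NCOMP);
}
```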
58
58 Similarity Based on Both Absolute and Normalized Power Vectors
59
59 Grouping Execution Points “Thresholding Algorithm”: Define a threshold of similarity Start from the first execution point (0,0) and identify ones in the forward execution path that lie within the threshold for both normalized and absolute metrics Tag the corresponding execution points (j,j) as the same group Find the next untagged execution point (r,r) and do the same along the forward path Rule: A tagged execution point cannot add new elements to its group! We demonstrate the outcome of thresholding with grouping matrices
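A minimal sketch of this thresholding pass: walk forward from the first untagged sample, tag every later untagged sample within both thresholds as the same group, and never let a tagged sample seed or extend a group. The precomputed distance matrices and parameter names are illustrative assumptions.

```c
/* Group execution points by forward thresholding (C99 VLA parameters). */
void threshold_group(int n, double abs_d[n][n], double norm_d[n][n],
                     double th_abs, double th_norm, int group[n])
{
    for (int i = 0; i < n; i++)
        group[i] = -1;                       /* -1 means untagged */

    int next_group = 0;
    for (int r = 0; r < n; r++) {
        if (group[r] != -1)
            continue;                        /* tagged points cannot seed groups */
        group[r] = next_group;
        for (int j = r + 1; j < n; j++)
            if (group[j] == -1 &&
                abs_d[r][j] <= th_abs && norm_d[r][j] <= th_norm)
                group[j] = next_group;       /* same group as point r */
        next_group++;
    }
}
```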
60
60 Gzip Grouping Matrices Gzip has 974 power vectors Cluster vectors based on similarity using “thresholding” Max Gzip power dissimilarity: 47.35W
61
61 Representative Vectors & Execution Points We have each execution point assigned to a group For each group: define a representative vector as the average of all instances of that group; select the execution point that started the group (the earliest point in the group) For each execution point: assign the corresponding group's representative vector as that point's power vector, or assign the power vector of the selected execution point for that group as that point's power vector We can represent the whole execution with as many power vectors as the number of generated groups
62
62 Reconstructing Power Trace with Representative Vectors:
63
63 Reconstructing Power Trace with Selected Execution Points:
64
Similarity Matrix Example Consider 4 vectors, each with 4 dimensions. Log all pairwise distances in the similarity matrix:
0 6 3 7
6 0 3 6
3 3 0 7
7 6 7 0
Color-scale from black to white (only for the upper diagonal)
66
66 Interpreting Similarity Matrix Plot The level of darkness at any location (r,c) shows the amount of similarity between vectors – samples – r & c, e.g. 0 & 2 All samples are perfectly similar to themselves, so all (r,r) are black Vertically above the diagonal shows similarity of the sample at the diagonal to previous samples, e.g. 1 vs. 0 Horizontally right of the diagonal shows similarity of the sample at the diagonal to future samples, e.g. 1 vs. 2,3
67
Grouping Matrix Example Consider the same 4 vectors, with distance matrix:
0 6 3 7
6 0 3 6
3 3 0 7
7 6 7 0
Mark execution pairs with distance ≤ Threshold
68
68 Current & Future Research FOLLOWING SLIDES DISCUSS ONGOING RESEARCH RELATED TO POWER ESTIMATION AND PHASES. PLANS FOR FUTURE RESEARCH ARE ALSO DISCUSSED
69
69 0) THE REAL BIG PICTURE Performance Monitoring, Real Power Measurement, Power Modeling, Thermal Modeling; Power Phases, Program Profiling, Program Structure Bottom line… To estimate component power & temperature breakdowns for P4 at runtime… To analyze how power phase behavior relates to program structure
70
70 1) Phase Branch Power Modeling: Power Phases, Program Profiling, Program Structure Power Phase Behavior: similarity based on power vectors; identifying similar program regions Profiling Execution Flow: sampling a process' execution; “PCsampler” LKM Program Structure: execution vs. code space; Power Phases vs. Exec. Phases: NOT YET?
71
71 POWER PHASE BEHAVIOR Power Phase Behavior: similarity based on power vectors; identifying similar program regions Profiling Execution Flow: sampling a process' execution; “PCsampler” LKM Program Structure: execution vs. code space; Power Phases vs. Exec. Phases: NOT YET
72
72 Identifying Power Phases Most of the methodology is ready Complete Gzip case in [Isci & Martonosi WWC-6] Extensibility to other benchmarks: generated similarity metrics for several; performed phase identification with thresholding for all Repeatability of the experiment Several other possible ideas such as: Thresholding + k-means clustering Two-pass thresholding PCA for dimension reduction (or SVD?) Manhattan = L1 norm Euclidean (L2) – not interesting Chebyshev (L∞) - ??
73
73 PROGRAM PROFILING: Program Execution Profile, Program Structure Power Phase Behavior: similarity based on power vectors; identifying similar program regions Profiling Execution Flow: sampling a process' execution; “PCsampler” LKM Program Structure: execution vs. code space; Power Phases vs. Exec. Phases: NOT YET
74
74 Program Execution Profile Sample program flow simultaneously with power Our LKM implementation: “PCsampler” – not finished… Generate code-space similarity in parallel with power-space similarity Relative comparisons of methods for: complexity, accuracy, applicability, etc.
75
75 CURRENT STATE Sample PC; binding to functions; reacquire PID Those SPECs — runspec: always at a fixed address in the ELF program interpreter; benchmarks change PID between datasets Verify PC with objdump, so we can make sure it is the PC we're sampling
76
76 Initial Data: PC trace for gzip-source; samples correspond to functions
77
77 2) Thermal Modeling Thermal Modeling: related work Performance Monitoring: P4 performance counters; Performance Reader LKM Real Power Measurement: P4 power measurement setup; examples Power Modeling: P4 power model; model + measurement sync setup, verification Thermal Modeling: refined thermal model; ex: Ppro thermal model
78
78 THERMAL MODELING: A Basic Model Based on a lumped R-C model from packaging Built upon power modeling: sampled component powers, respective component areas, physical processor parameters, packaging heat transfer t: sampling interval T_i: the temperature difference between block i and the heatsink
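A minimal sketch of the lumped R-C update implied above: each block's temperature rise over the heatsink is advanced one sampling interval using its sampled power, thermal resistance, and thermal capacitance. The explicit forward-Euler form and the parameter names are assumptions, not the authors' exact equations.

```c
/* One forward-Euler step of C_i * dT_i/dt = P_i - T_i / R_i. */
double thermal_step(double T_i,      /* block temp. rise over heatsink (K) */
                    double P_i,      /* sampled component power (W)        */
                    double R_i,      /* thermal resistance (K/W)           */
                    double C_i,      /* thermal capacitance (J/K)          */
                    double dt)       /* sampling interval (s)              */
{
    return T_i + (dt / C_i) * (P_i - T_i / R_i);
}
```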
79
79 Refined Thermal Model Steady-state analysis reveals the heatsink–die abstraction is not sufficient for real systems Proceeding to a multilayer thermal model: active die thickness, metallization/insulation, chip–package interface, package, heatsink Requires searching over several materials/dimensions and thermal properties Multiple layers, multiple T nodes, multiple DEs Baseline heat removal structure: heatsink, thermal grease, heat spreader, package, die
80
80 Physical Structure vs. Thermal Model (Diagram: ambient airflow at T_A; heatsink node T_h with R_h, C_h and heatsink-to-ambient resistance R_h-A; heat spreader node T_spr with R_spr, C_spr and grease interface resistance R_gr-spr; package nodes T_p,i with R_p,i, C_p,i; die nodes T_die,i with R_die,i, C_die,i; component power inputs P_i and total power P_total; thermal grease layer between spreader and heatsink)
81
81 Analytical Derivation 4 nodes, 4 DEs 1) T_spr:
82
82 EX: Ppro Thermal Model Use CASTLE [Joseph, 2001] computed component powers Determine component areas from the die photo Determine processor/packaging physical parameters Generate a numerical thermal model Apply component difference equations recursively along the heat flow path: update T_die,i → T_p,i → T_spr → T_h
83
83 Simulation Outputs Thermal nodes updated every t ~ 20ms Component temperatures build up to ~350K in ~5hrs T_heatsink moves very slowly, as expected
84
84 Verification of Results (1/3) Full validation requires comparing component power estimations to real measured component powers for all demonstrated full execution traces No such published data available Is it possible to acquire such data? How to probe intra-chip components? Closest would be Hspice simulations Probably infeasible to acquire traces of minutes No P4 power simulator 1 benchmark would take ~1 CPU-month with current power simulation speeds
85
85 Verification of Results (2/3) We use proxies: 1) Verifying against total measured power at runtime Provides us with an immediate comparison Most tested benchmarks show close approximations 2) Behavior of component powers for simpler benchmarks We show power trends follow expectations under different corners of execution
86
Verification of Results (3/3) Why is an RMS error measure between measurement and estimate for total power not included? As the two sources of data are completely independent (the measurement branch comes from the serial interface and the modeling part from ethernet), they are not perfectly synchronized; there are spikes at power jumps, and removing those by hand would call the fidelity into question Art RMS error (overall): 4.38W Art RMS error (100–1000s): 4.21W
87
87 Possibility for Other Processors? Most recent processors place heavy emphasis on power management, so there will be enough power variability to exploit for power modeling and phase analysis Porting the power estimation to other architectures requires significant effort to define power-related metrics and to implement the counter reader and power estimation user and kernel SW Porting to the same architecture, different implementation, is more straightforward: reevaluate max/idle/gated power estimates Experiences with other architectures: Castle project for Pentium Pro (P6): few watts of variation, low dimensionality IBM Power3-II: very low measured power variation
88
88 Event Counter Wish List (1/2) Problems from experience: Memory-related metrics: L2 metrics are complicated and do not correspond to L2 hits/misses (see the optimization reference manual); granularity issues; MOB accesses metric? Memory control? Integer and FP accesses: FPE has 8 separate events (with 2 dedicated ESCRs), needing at least 4 rotations to collect; INTE has no direct measure and cannot differentiate multiplies, shifts, logic, arithmetic Out-of-order engine: cannot differentiate between allocate, rename, instruction queues, schedule
89
89 Event Counter Wish List (2/2) Suggestions for Future: Ultimately: Specific counters that represent component- wise utilizations Switching bookkeepers for singly ended lines Specific to P4 & Xeon Generation: Metrics directly related to memory components Single aggregate metrics for FP and Int execution Metrics that explore out of order engine components
90
90 Applications of our Technique Already discussed: Power phase analysis Thermal modeling In addition: OS based energy accounting i.e. ECOSystem [Duke; C. Ellis et al.] Fine grained CPU power (rather than ON/OFF) Detailed SW energy mappings i.e. PowerScope [CMU; Flinn & Satyanarayanan] Dynamic Power Management i.e. Process Cruise Control [Weissel & Bellosa] Event driven / Power model driven clock scaling
91
91 Statistical Methods (1/2) Several optimization/curve-fitting tools: JMP, R, SPSS, S+, Stata Regression: used by some previous examples 2^k factorial design: K factors, each at 2 levels (Hi & Lo); run the design at each of the 2^k corners; analyze the interactions and individual factors' effects; turn factors ON/OFF
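An illustrative sketch of one piece of a 2^k factorial analysis: given the measured response at each of the 2^k corners (factor f is at its high level in corner c when bit f of c is set), the main effect of a factor is the mean response at its high level minus the mean at its low level. This is a generic textbook computation, not the authors' analysis code.

```c
/* Main effects from a full 2^k factorial design.
 * response has length (1 << k); effect has length k. */
void main_effects(int k, const double *response, double *effect)
{
    int corners = 1 << k;
    for (int f = 0; f < k; f++) {
        double hi = 0.0, lo = 0.0;
        for (int c = 0; c < corners; c++) {
            if (c & (1 << f)) hi += response[c];   /* factor f at high level */
            else              lo += response[c];   /* factor f at low level  */
        }
        effect[f] = (hi - lo) / (corners / 2);
    }
}
```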
92
92 Statistical Methods (2/2) We believe we can produce a much closer match to measured total power with curve-fitting techniques and PCA (to define a new set of orthogonal activity factors) — a topic for future research We avoid this in order to preserve the binding of activity factors to physical component powers, as required for thermal analysis Note on 2^k factorial design: our factors are access rates OR component power estimations; our response variable is measured total power; however, we cannot independently turn ON/OFF the accesses to individual components