Download presentation
Presentation is loading. Please wait.
Published byPhilippa Bond Modified over 8 years ago
1
Runtime Processor Power Monitoring VICE VERSA 07/18/2003 Talk Canturk ISCI
2
2 INTRODUCTION Runtime processor power Measurement with HW Estimation with Performance counters CPU Unit Power Breakdowns Runtime verification Processor thermal modeling Power Phase Behavior of programs Mapping between power behavior and program structure
3
3 THE BIG PICTURE Performance Monitoring Real Power Measurement Power Modeling Thermal Modeling Performance Monitoring Real Power Measurement Power Modeling Thermal Modeling Power Phases Program Profiling Program Structure To Estimate component power & temperature breakdowns for P4 at runtime… To analyze how power phase behavior relates to program structure Bottom line…
4
4 Performance Monitoring Real Power Measurement Power Modeling Thermal Modeling Agenda Performance Monitoring Real Power Measurement Power Modeling Thermal Modeling Performance Monitoring P4 Performance Counters Performance Reader LKM Real Power Measurement P4 Power Measurement Setup Examples Power Modeling P4 Power Model Model + Measurement Sync Setup, Verification Thermal Modeling Brief Thermal Model Intro Ppro Thermal Model Results
5
5 Power Modeling Power Phases Program Profiling Program Structure Power Phases Program Profiling Program Structure Bonus Material Power Phase Behavior Similarity Based on Power Vectors Identifying similar program regions Profiling Execution Flow Sampling process ’ execution “ PCsampler ” LKM Program Structure Execution vs. Code space Power Phases Exec. Phases
6
6 Performance Monitoring Real Power Measurement Power Modeling Thermal Modeling Performance Monitoring Performance Monitoring Related Work Performance Monitoring P4 Performance Counters Performance Reader LKM Real Power Measurement P4 Power Measurement Setup Examples Power Modeling P4 Power Model Model + Measurement Sync Setup, Verification Thermal Modeling Refined Thermal Model Ex: Ppro Thermal Model
7
7 Live CPU Performance Monitoring with Hardware Counters Most CPUs have hardware performance counters P4 Performance Monitoring HW: P4Performance Monitoring HW 18 Event Counters 18 Counter Configuration Control Registers Configure how to count 45 Event Selection Control Registers Configure what to count Additional Control Registers
8
8 Our Event-Counter: Performance Reader Performance Reader implemented as Linux Loadable Kernel Module Implements 6 syscalls: select_events() reset_event_counter() start_event_counter() stop_event_counter() get_event_counts() set_replay_MSRs() User Level Interface: Defines the events Starts counters Stops counters Reads counters & TSC Event Types: 59 event classes 100s of events to count
9
9 Performance Reader: Example Validation L1_Dcache benchmark Controls cache hit behavior Validated against measured cache events Vary hit rate from 0-100%
10
10 Performance Monitoring Real Power Measurement Power Modeling Thermal Modeling Processor Power Measurement Real Power Measurement Related Work Performance Monitoring P4 Performance Counters Performance Reader LKM Real Power Measurement P4 Power Measurement Setup Examples Power Modeling P4 Power Model Model + Measurement Sync Setup, Verification Thermal Modeling Refined Thermal Model Ex: Ppro Thermal Model
11
11 P4 Power Measurement Setup 1mV/Adc conversion Clamp ammeter on 12V lines on measured CPU Voltage readings via RS232 to logging machine Serial Reader (PowerMeter) (PowerPlotter) Convert to Power vs. time window DMM reading clamp voltages
12
12 PowerPlotter: Example “Branch exercise” (Taken rate: 1) “High-Low” “L1Dcache” Array Size 1/100 of L1 “L1Dcache” Array Size x25 of L1~L2 “L1Dcache” Array Size x4 of L2 Initialization Benchmark Execution “Fast”
13
13 SPEC Power Examples Different programs show very different power characteristics Timescale of interest can be huge => inaccessible via simulation
14
14 Performance Monitoring Real Power Measurement Power Modeling Thermal Modeling Processor Power Modeling Power Modeling Related Work Performance Monitoring P4 Performance Counters Performance Reader LKM Real Power Measurement P4 Power Measurement Setup Examples Power Modeling P4 Power Model Model + Measurement Sync Setup, Verification Thermal Modeling Refined Thermal Model Ex: Ppro Thermal Model
15
15 Define Components Performance Monitoring Real Power Measurement Power Modeling Define Events Real Power Measurement Verify total power against measured processor power Power Modeling Convert counter info into component power breakdowns Performance Monitoring Gather counter info with minimal power overhead and program interruption Define Events Determine combination of P4 events that represent component accesses best Define Components Define components (I.e. L1 cache, BPU, Regs, etc.), whose powers we ’ ll model: from annotated layout P4 POWER MODEL
16
16 Defining Components
17
17 Defining Events Access Rates We determined 24 events to approximate access rates for 22 components Used Several Heuristics to represent each access rateHeuristics Examples: Need to rotate counters 4 times to collect all event data Used 15 counters & 4 rotations to collect all event datarotations
18
18 Access Rates Component Powers “ Performance Counter based Access Rate estimations are used as proxy for max component power weighting together with microarchitectural details in order to estimate processor sub-unit powers ” EX: Trace cache delivers 3 uops/cycle in deliver mode and 1 uop/cycle in build mode: Power(TC)=[Access-Rate(TC)/3 + Access-Rate(ID)] x MaxPower(TC) + Non-gated TC CLK power Total power is computed as the sum of all 22 component powers + measured idle power (8W):
19
19 Experiment Setup – Recall: 1mV/Adc conversion Clamp ammeter on 12V lines on measured CPU Voltage readings via RS232 to logging machine Serial Reader (PowerMeter) (PowerPlotter) Convert to Power vs. time window DMM reading clamp voltages
20
20 Experiment Setup Voltage readings via RS232 to logging machine 1mV/Adc conversion
21
21 Experiment Setup POWER CLIENT POWER SERVER Voltage readings via RS232 to logging machine Convert voltage to measured power Convert access rates to modeled powers Sync together in time window 1mV/Adc conversion Component access rates over ethernet
22
22 Tuning Benchmarks “Fast” “Branch exercise” (Taken rate: 1) “High-Low” “L1Dcache” (Hit Rate : 0.1) Measured Modeled
23
23 Component Breakdowns Component Breakdowns for “branch_exercise” Colors for 4 CPU subsystems Issue - Retire Execution
24
24 Benchmark Power Breakdowns High issue, exec. & branch power High L2 Cache PowerHigh L1 Cache PowerHigh Bus Power
25
25 SPEC2000 Results VPR Elaboration: Integer benchmark 2 runs: 1 st Placement, 2 nd Route 1 st run much stable power, 2 nd more variable Placement has higher miss than route Significant FPE power due to x87_SIMD_moves Twolf Elaboration:(Integer benchmark) Several loop computations traversing memory Although ~const. Total power, component powers have slight gradients Equake Elaboration: (FP benchmark) Initialization and computation phases FP intensive mesh computation phase Initialization with high complex IA32 instructions
26
26 Average SPEC Total Powers 1 st set: Overall, 2 nd set: Non-idle power Average difference between measurement and estimation: 3W Worst case: Equake (5.8W)
27
27 Stdev of SPEC Total Powers 1 st set: Overall, 2 nd set: Non-idle power Average difference: 2W Worst case: Vortex (3.5W)
28
28 Desktop Applications We aim to track low power utilizations as well. Desktop applications are usually low power with intermittent power bursts 3 applications, with common operations such as open/close application, web, streaming media, text editing, save to disk, statistical computations.
29
29 Performance Monitoring Real Power Measurement Power Modeling Thermal Modeling Thermal Model Thermal Modeling Related Work Performance Monitoring P4 Performance Counters Performance Reader LKM Real Power Measurement P4 Power Measurement Setup Examples Power Modeling P4 Power Model Model + Measurement Sync Setup, Verification Thermal Modeling Brief Thermal Model Intro Ppro Thermal Model Results Skip Thermal
30
30 THERMAL MODELING: A Basic Model Based on lumped R-C model from packaging Built upon power modeling Sampled Component Powers Respective component areas Physical processor Parameters Packaging Heat Transfer
31
31 Physical Structure vs. Thermal Model Ambient Temperature Heatsink Heat Spreader Package Die ThTh RhRh ChCh R _hXA TATA T spr R spr C spr R _grXspr T p,i R p,i C p,i T die,i R die,i C die,i Pi P total Thermal Grease Ambient Airflow
32
32 Simulation Outputs Thermal nodes updated every t~20ms Component Temperatures Build up to ~350K in ~5hrs T heatsink moves very slowly as expected
33
33 Power Modeling Power Phases Program Profiling Program Structure Power Phases Power Phase Behavior Power Phase Behavior Similarity Based on Power Vectors Identifying similar program regions Profiling Execution Flow Sampling process ’ execution “ PCsampler ” LKM Program Structure Execution vs. Code space Power Phases Exec. Phases
34
34 Power Vectors for Similarity Similar to basic block vectors Use component power vector samples to represent program phases Consider Manhattan distance between 2 vectors as the ‘ measure of dissimilarity ’ between the corresponding execution points Construct a similarity matrix to represent similarity among all pairs of execution points Each entry in the similarity matrix:
35
35 Gcc & Gzip Similarity Matrices Gcc Elaboration: Very variant power Almost identical power behavior at 30, 50, 180s. Although 88s, 110s, 140s, 210s and 230s show similar total power; 88, 210 and 230 share higher similarity. Gzip Elaboration: Much regular power behavior Spurious similarities such as 100-150s and 200-280 are distinguished by the similarity analysis
36
36 Generating representative vectors Gzip has ~1000 power vectors Cluster vectors based on similarity Could we represent power behavior with reasonable accuracy, with a small number of ‘ signature ’ vectors? Ex: 26 representative vectors with “ Thresholding Algorithm ” :
37
37 Power Modeling Power Phases Program Profiling Program Structure Program Profiling Program Execution Profile Program Structure Power Phase Behavior Similarity Based on Power Vectors Identifying similar program regions Profiling Execution Flow Sampling process ’ execution “ PCsampler ” LKM Program Structure Execution vs. Code space Power Phases Exec. Phases
38
38 Program Execution Profile Sample program flow simultaneously with power Our LKM implementation: “ PCsampler ” Generate code space similarity in parallel with power space similarity Relative comparisons for: Complexity Accuracy Applicability, etc.
39
39 EOP
40
40 Where all this is useful? Measurement/Modeling for microarchitectural details Compiler level power SW power profiling Power Aware OS Dynamic power/thermal/March. Configuration Dynamic memory allocate, Process cruise control, etc. Demonstrates modern processor power Need for speed! Long Timescales, thermal constants Identify program phases w/o knowledge of code, no basic block info whatsoever Program signatures for detailed simulation, say: “ power points rather than simpoints ”
41
41
42
42 Counter Access Heuristics 1) BUS CONTROL: No 3 rd Level cache BSQ allocations ~ IOQ allocations Metric1: Bus accesses from all agents Event: IOQ_allocation Counts various types of bus transactions Should account for BSQ as well access based rather than duration MASK: Default req. type, all read (128B) and write (64B) types, include OWN,OTHER and PREFETCH Metric2: Bus Utilization(The % of time Bus is utilized) Event: FSB_data_activity Counts DataReaDY and DataBuSY events on Bus Mask: Count when processor or other agents drive/read/reserve the bus Expression: FSB_data_activity x BusRatio / Clocks Elapsed To account for clock ratios
43
43 Counter Access Heuristics 2) L2 Cache: Metric: 2 nd Level cache references Event: BSQ_cache_reference Counts cache ref-s as seen by bus unit MASK: All MESI read misses (LD & RFO) 2 nd level WR misses 3) 2 nd Level BPU: Metric 1: Instructions fetched from L2 (predict) Event: ITLB_Reference Counts ITLB translations Mask: All hits, misses & UC hits Metric 2: Branches retired (history update) Event: branch_retired Counts branches retired Mask: Count all Taken/NT/Predicted/MissP
44
44 Counter Access Heuristics 4) ITLB & I-Fetch: etc ……… 10) FP Execution: Metric: FP instructions executed event1: packed_SP_uop counts packed single precision uops event2: packed_DP_uop counts packed single precision uops event3: scalar_SP_uop counts scalar double precision uops event4: scalar_DP_uop counts scalar double precision uops event5: 64bit_MMX_uop counts MMX uops with 64bit SIMD operands event6: 128bit_MMX_uop counts integer SSE2 uops with 128bit SIMD operands event7: x87_FP_UOP counts x87 FP uops event8: x87_SIMD_moves_uop counts x87, FP, MMX, SSE, SSE2 ld/st/mov uops Bac k
45
45
46
46 P4 Details Karelian.ee: P4 – 1.4GHz 0.18, C4-FC-PGA-423 Heatsink Folded Fin M6, Al interconnect Die Size: 217 mm 2 Package Size: 5.34cm x 5.17cm Power: Idle/typ./max=??/51.8/71W D$1&T$1/L2: 8K&12KUops/256K Voltage: 1.7/1.75V
47
47 P4 Details 1 st LKM: Implements syscall: getCPUinfo() Gathers CPU info from: /asm/processor.h Intel control registers (CR4) CPUID instruction Reveals: Debug Store mechanism exists for PEBS TSC exists MSRs implemented We can read/write performance counters EX: karelian (P4,willamette): UserLevel_CPUinfo viale (P4, Northwood): UserLevel_CPUinfo Bac k
48
48 P4 Detector - Counter Clusters Event DetectorsEvent Counters 4 bit wide bus P4 Components EVENTS
49
49 Counters, ESCRs & CCCRs Simplified Recipe: 1.Select Event to count 2.Select a counter (also defines CCCR) 3.Select an ESCR 4.Set ESCR fields 5.Set CCCR fields 6.Enable CCCR
50
50 Counting Mechanisms Counting Types Non-retirement: Events occur any time during execution At-Retirement: Events at the retirement of instruction Can count BOGUS vs NBOGUS, Tag uops to count, etc. Terminology Mechanisms: Front end tagging (i.e. LD/ST retired) Execution tagging (i.e. packed_DP_retired) Replay Tagging (i.e. L1 misses) No Tags (i.e. uops retired) Also: Event Counting | IEBS | PEBS Bac k
51
51 At Retirement Counting Terminology Bac k BOGUS/NBOGUS (speculative) Tagging (count uops that encounter event) Replay (Data speculation)
52
52 Verifying Counter Reader 1) L1Dcache_exercise: Uses pointer assignment L1=8K, L2=256K Array Size = (L1 Size/Hit Rate) i.e. for 10% Hit rate: 80K 20K entries Array Size < L2 size Array elements PRBS of array indices Bench loop: new index array[old index] However, gcc puts 5 LDs in the bench loop 4 static Hit rate ~ 100% 1 our load our desired hit rate
53
53 …Verifying Counter Reader 1) L1Dcache_exercise results: Ex: L1Dcache_exercise Hit Rate = 0.25
54
54 …Verifying Counter Reader 2) branch_exercise: Uses random number comparison Assigns 400K PRBS array outside bench loop To avoid rand() instructions in bench loop bench loop: Compares array index to threshod Threshold = RAND_MAX*TakenRate Repeats 1000 reseeding each time However gcc adds 2 more branches into bench loop: Loop exit condition (Prediction ~ 100%) Unconditional JMP (Prediction ~ 100%) Our Branch ’ s Expected Mispredict Rate: ~ (0.5 - |TakenRate – 0.5| )
55
55 …Verifying Counter Reader 2) branch_exercise results: Ex: branch_exercise Taken Rate=0.5 Bac k
56
56 MEASUREMENT Method Select Power lines that reflect CPU powerPower lines P4 uses 12 V lines Clamp the current probe over the 12V lines 1mV/Adc conversion Connect the clamp into DMM Send Voltage reading over serial Log the voltage readings Convert to instantaneous power as: 12 x V sample x 1000 Log Power values Plot Power values
57
57 MEASUREMENT Tools Poll serial port ~20ms quicker overkill, slower overlook Compute running average sample every t you select Easier to sync with Power Model PowerMeter: Convert voltage reading to power and log P=12 x V read x 1000 PowerPlotter: Plot Power samples over sliding time window 100 s history with 1000 samples (t = 100ms)
58
58 Current Probe Fluke i410 Uses Hall Voltage to measure current and convert to Voltage: 1mV / Adc Range: 0.5 – 400A Accuracy: 3.5%+0.5A Generated voltage is fed to DMM Compared against the Ppro Amoeba shunt setup for verification
59
59 DMM Agilent 34401A Measurement Motive: We should sample as quick as possible (grep case) Measurement Setup: Fast 4 digit, Autozero OFF, Display OFF From [8], 1000 readings/s (x150 faster than fast 6 digit) Serial Interface: From [9] 55 ASCII readings /s Polling serial port faster than 20ms is overkill Bac k
60
60 P4 Power Lines Which power lines should we cut / clamp? [5] shows the power lines: 1-CPU power connector 13-System power connector P1 13 & P2 1 [6],[7] say P4 uses 12V lines for CPU, rather than 5V lines Both P1 & P2 have 12, 5 and 3.3 V lines I run branch_exercise (takenRate=1) and gzip_static obtain the current variation on the lines
61
61 Current on Power Lines Reveals ALL 3 12V lines’ currents follow CPU activity All add to CPU Power! Bac k
62
62 About the ripples Add ripple stuff here … !!!!!!!!!!!!!!!!!!!!!!!!!!!
63
63 P4 Architecture vs Layout Components to Model: 1)Bus Control 2)L2 Cache 3)2 nd Level BPU 4)ITLB & Ifetch 5)L1 Cache 6)MOB 7)Mem Control 8)DTLB 9)Int EXE 10)FP EXE 11)Int RF 12)FP RF 13)Decode 14)Trace $ 15)1 st Level BPU 16)Microcode ROM 17)Allocation 18)Rename 19)Inst-n Qs 20)Schedule 21)Inst-n Qs 22)Retirement Bac k
64
64 Counter Rotations Bac k
65
65 EOP
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.