Canturk ISCI Gilberto CONTRERAS Margaret MARTONOSI


1 Canturk ISCI Gilberto CONTRERAS Margaret MARTONOSI
Hardware Performance Counters for Detailed Runtime Power and Thermal Estimations: Experiences & Proposals Canturk ISCI Gilberto CONTRERAS Margaret MARTONOSI

2 Hardware Performance Counters (HPCs) Go beyond Performance
Several explored research avenues:
  Runtime power/thermal estimations
  Dynamic management
  Workload phases and application behavior prediction
HPCs provide value beyond simulations:
  Long timescales
  Real-system behavior
Notes: Recent interest in HPCs for performance and beyond.

3 Hardware Performance Counters (HPCs) Go beyond Performance
Runtime power: Isci & Martonosi [MICRO 2003]; Contreras & Martonosi [Submitted 2005]
Runtime thermal: Lee & Skadron [HP-PAC in IPDPS 2005]
Dynamic power management: Choi et al. [ISLPED 2004]; Weißel & Bellosa [CASES 2002]
Dynamic thermal management: Bellosa et al. [COLP 2003]
Workload phases and application behavior prediction: Isci & Martonosi [WWC 2003]; Duesterwald et al. [PACT 2003]
Notes: These are examples from the recent literature on using HPCs for performance and beyond; they also outline our own examples for this talk.

4 High-Performance Corner: P4 Power Estimation
Idea: Power of component I = MaxPower[I] x ArchScaling[I] x AccessRate[I] + NonGatedPower[I], where the access rate comes from HPCs.
Motivation: Fast (real-time) estimated view of on-chip detail (per physical component).
Design: Developed heuristics using 24 events to approximate access rates for 22 chip components. Used 15 counters with 4 rotations to collect all event data.
Validation: Real-time estimates against real-time measured power.
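The per-component formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Component` fields, the `access_rate` helper, and every numeric value are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical per-unit parameters; the values used below are made up."""
    name: str
    max_power_w: float        # MaxPower[I]
    arch_scaling: float       # ArchScaling[I]
    non_gated_power_w: float  # NonGatedPower[I]


def access_rate(event_count: int, cycles: int) -> float:
    """Approximate AccessRate[I] as counted events per cycle, capped at 1.0."""
    if cycles <= 0:
        return 0.0
    return min(event_count / cycles, 1.0)


def component_power(c: Component, rate: float) -> float:
    """Power[I] = MaxPower[I] x ArchScaling[I] x AccessRate[I] + NonGatedPower[I]."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("access rate must be within [0, 1]")
    return c.max_power_w * c.arch_scaling * rate + c.non_gated_power_w


# Illustrative use with made-up numbers.
trace_cache = Component("trace_cache", max_power_w=4.0, arch_scaling=0.5,
                        non_gated_power_w=0.25)
rate = access_rate(event_count=500, cycles=1000)
estimate_w = component_power(trace_cache, rate)
```

Summing `component_power` over all tracked units would give a total estimate to compare against measured power.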

5 P4 Power Estimator Results
[Figure: measured vs. modeled power for SPEC CPU2000 benchmarks (Gcc, Gzip, Vpr, Vortex, Gap, Crafty) and desktop apps (AbiWord, Gnumeric, xmms, mozilla + file download, mplayer, ...)]
Average difference: ~5% among all benchmarks (SPEC CPU2000 & other applications).

6 Embedded Corner: PXA255 Power Estimation
Idea:
  CPU Power (n x 1) = PerformanceEvents (n x 5) x LinearParameters (5 x 1) + IdlePower
  Mem Power (n x 1) = PerformanceEvents (n x 2) x LinearParameters (2 x 1) + IdlePower
  The linear parameters are the power weights; the performance events are the scaling factors.
Motivation: Runtime power optimizations under DVFS.
Design: Parameter estimation (OLS) using dominant counter readings and live power measurements. Power estimation at various CPU configurations.
Validation: Comparison between estimates and real-time measured power.
Notes: Runtime optimizations include DVFS configuration, OS scheduling, JIT compilation levels, and garbage collection (allocating memory or compacting the heap).
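As a rough sketch of the OLS step, the following pure-Python code fits linear parameters from counter samples and power measurements by solving the normal equations. Treating IdlePower as a fitted intercept is an assumption made here for illustration (it could equally be measured separately), and the function names and sample data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; event columns may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_power_model(events, power):
    """OLS fit of power ~ events x weights + idle; returns (weights, idle)."""
    rows = [list(e) + [1.0] for e in events]  # extra column for the intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, power)) for i in range(p)]
    coef = solve_linear(xtx, xty)
    return coef[:-1], coef[-1]


def estimate_power(event_row, weights, idle):
    """Apply the fitted model to one sample of event rates."""
    return sum(w * e for w, e in zip(weights, event_row)) + idle


# Synthetic data only: power = 2*e1 + 3*e2 + 1, with no measurement noise.
samples = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
measured = [1, 3, 4, 6, 8]
weights, idle = fit_power_model(samples, measured)
```

The same routine would apply to the five-event CPU model and the two-event memory model; only the number of event columns changes.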

7 PXA255 Results
Benchmarks across 3 domains:
  Java CDC (Connected Device Configuration, SpecJVM98): DB, Compress
  Java CLDC (Connected Limited Device Configuration): Rex, Crypto
  SPEC2000: Bzip2, Vortex, Gap
5% average error across the 3 domains.

8 Proposals from Experiences
1. Track each physical unit individually for power & thermal
Ex: Dispatch Ports, Trace Cache, Instr-n Queue1 (MEM), μop Queue, Allocate, Rename, Schedulers, μCode ROM, Instr-n Queue2 (EXE)
All tracked with in-flight μops written to the μop queue.
Need individual utilization counts for each physical unit available on die for power and hotspot analyses.
Notes: During this research and other work, we gained a lot of experience with the limitations of counters for power. From here on, we discuss the major ones, list our proposals, and finish with an ultimate wishlist. The two instruction queues are distinct: one is for memory operations (ld/st) and one is for the rest, but we track in-flight μops instead of retired bogus + non-bogus loads and stores.

9 Proposals from Experiences
2. Need bitline activity counts
Utilization is not complete information; power depends in part on the switching factor.
Not necessarily fully detailed counts:
  Accumulate bitwise XOR of current and previous input/output ports
  Sample RegFile ports/bit populations
Implementation can be a Wallace tree/CSA of XOR results.
[Figure: 30 mW (10%) swing on a 400 MHz, 1.3 V PXA255 processor]

10 Proposals from Experiences
2. Need bitline activity counts (continued)
[Figure: additions with all-zeros and all-ones operand patterns (cases A and B) showing a 20 mW swing on a 400 MHz, 1.3 V PXA255 processor]
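A software model of the proposed bitline activity count might look like the sketch below: it XORs each value seen on a port with the previous value and accumulates the number of flipped bits. The class name, port width, and sample values are assumptions for illustration; the talk proposes doing this in hardware, for example with a Wallace tree/CSA over the XOR results.

```python
class ToggleCounter:
    """Accumulates bit flips between consecutive values on one port."""

    def __init__(self, width: int = 32):
        self.mask = (1 << width) - 1
        self.prev = 0      # assumed initial port state: all zeros
        self.toggles = 0

    def observe(self, value: int) -> None:
        value &= self.mask
        # Population count of the XOR = number of bitlines that switched.
        self.toggles += bin(value ^ self.prev).count("1")
        self.prev = value


# Same number of accesses, very different switching activity.
busy = ToggleCounter(width=8)
for v in (0xFF, 0x00, 0xFF, 0x00):
    busy.observe(v)

quiet = ToggleCounter(width=8)
for v in (0x00, 0x01, 0x00, 0x00):
    quiet.observe(v)
```

Both counters see four accesses, so a utilization count would treat them as equal; the toggle totals differ, which is the extra information the proposal asks for.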

11 Proposals from Experiences
3. More detailed off-chip/memory access support in the embedded domain
Mem power is ~40% of system power.
Tracking memory hierarchy transactions may help render better memory power estimates:
  Main memory reads/writes
  Core + DMA
  Transaction length in bytes
  Activity factors can be shared with RegFile
[Figure: REX memory power consumption (one 16b bank)]
Notes: The plot looks this way because we run Rex in a loop. The high memory power comes from the first Rex method, which incurs many I$ misses. The revision of XScale that goes to 733 MHz has a memory-access metric, but it still does not differentiate between access types or lengths. The P4 has a pretty good handle on this with BUS_UTILIZATION.

12 Proposals from Experiences
4. Metrics related to queue occupancy
Modern processor ≡ several queues.
Depending on implementation, power ∝ queue occupancy.
Reference: Buyuktosunoglu et al. [ISLPED'02], "Tradeoffs in Power-Efficient Issue Queue Design".
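One way to picture an occupancy counter is as a running sum of valid entries per cycle, from which an average occupancy and a proportional power estimate follow. The sketch below assumes a simple linear relation with hypothetical `watts_per_entry` and `base_watts` parameters; the real relation depends on the queue implementation.

```python
class OccupancyCounter:
    """Accumulates the number of valid queue entries seen each cycle."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.occupancy_sum = 0
        self.cycles = 0

    def tick(self, valid_entries: int) -> None:
        if not 0 <= valid_entries <= self.capacity:
            raise ValueError("occupancy outside queue capacity")
        self.occupancy_sum += valid_entries
        self.cycles += 1

    def average_occupancy(self) -> float:
        return self.occupancy_sum / self.cycles if self.cycles else 0.0


def queue_power(avg_occupancy: float, watts_per_entry: float,
                base_watts: float = 0.0) -> float:
    """Assumed linear model: power grows in proportion to occupancy."""
    return base_watts + watts_per_entry * avg_occupancy


# Made-up trace of per-cycle occupancies for an 8-entry queue.
counter = OccupancyCounter(capacity=8)
for entries in (2, 4, 6):
    counter.tick(entries)
estimate_w = queue_power(counter.average_occupancy(), watts_per_entry=0.25,
                         base_watts=0.5)
```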

13 Proposals from Experiences
5. General/aggregate metrics in addition to specialized cases/breakdowns simplify runtime sampling for unit accesses
P4 ex1. MOB: The only event is MOB_load_replays
  Counts replays for unknown store address/data and partial/unaligned address matches
  No info for MOB entries/accesses/updates
P4 ex2. FPU: Has 8 separate events (with 2 dedicated ESCRs)
  Need at least 4 rotations to collect
P4 ex3. INT ALU: No dedicated event

14 Additional Comments for HPC Design
General/aggregate metrics in addition to specialized cases/breakdowns simplify runtime sampling for unit accesses
Metrics related to RegFile accesses vs. forwarding
Semi-distributed implementations will always induce dependencies among simultaneously countable events
Higher parallelism among (power-oriented) metrics for minimal counter rotations at runtime
Implementations that allow counter rotations without need for intermediate logging
  Partitioned / dual-mode / buffered counters
Different events for different types of accesses to the same units with different-magnitude power implications
  e.g., branch scan < BHT update < BTA update
Different API/SW demands:
  Lightweight implementations for runtime analyses
  Per-thread for application profiling vs. global for real-time measurement comparisons and hotspots
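To make the cost of counter rotations concrete, here is a minimal sketch of time-multiplexing events onto a limited number of counters and scaling each raw count back to the full interval. It assumes any event can use any counter, which real hardware does not guarantee (the dependencies noted above are exactly what forces extra rotations), and the event names are placeholders rather than real P4 event names.

```python
def schedule_rotations(events, num_counters):
    """Split events into groups that can be counted simultaneously.

    Assumes no event-to-counter constraints, so this gives a lower bound
    on the number of rotations needed.
    """
    if num_counters <= 0:
        raise ValueError("need at least one counter")
    return [events[i:i + num_counters]
            for i in range(0, len(events), num_counters)]


def extrapolate(raw_count, cycles_observed, cycles_total):
    """Scale a count seen during one rotation to the whole interval.

    Assumes the event rate is steady across the interval.
    """
    if cycles_observed <= 0:
        return 0.0
    return raw_count * cycles_total / cycles_observed


# Eight placeholder events on two counters need four rotations.
fp_events = ["fp_event_%d" % i for i in range(8)]
rotations = schedule_rotations(fp_events, num_counters=2)

# Each rotation sees a quarter of a 1000-cycle interval.
scaled = extrapolate(raw_count=100, cycles_observed=250, cycles_total=1000)
```

Fewer rotations mean each event is observed for a larger share of the interval, so the steady-rate assumption matters less; this is the motivation for higher parallelism among power-oriented events.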

15 Wishlist for Power/Thermal
1) For each physical unit on die, separate events to track utilization rates
  Sub-events for different types of accesses with different power costs
2) Bitline activity counters for switching units
3) Occupancy counters for related queues
4) Counter support for off-core memory accesses
5) High parallelism among power events for minimal counter rotations
Notes: This sums up everything said before. Not all of it may be practically doable, so the list goes in order of importance.

16 Conclusions
New opportunities remain to be explored in future PMC designs for power and thermal studies:
  Direct correspondence to physical units
  Bitline and occupancy counters
We believe these additions are feasible given the continuing emphasis on counter design, as long as power is also considered a primary design target.
Notes: P6 (P3, PPro): 2 counters; P4: 18 counters, with many more events, different modes, and many features. POWER3-II: 8 counters; POWER4: also 8 counters, but more than 3x the events.

