Canturk ISCI Gilberto CONTRERAS Margaret MARTONOSI

Presentation transcript:

Hardware Performance Counters for Detailed Runtime Power and Thermal Estimations: Experiences & Proposals
Canturk Isci, Gilberto Contreras, Margaret Martonosi

Hardware Performance Counters (HPCs) Go Beyond Performance
Several research avenues have been explored: runtime power/thermal estimation, dynamic management, and workload phase and application behavior prediction.
HPCs provide value beyond simulations: long timescales and real-system behavior.
There has been recent interest in HPCs for performance and beyond.

Hardware Performance Counters (HPCs) Go Beyond Performance
Runtime power: Isci & Martonosi [MICRO 2003]; Contreras & Martonosi [Submitted 2005]
Runtime thermal: Lee & Skadron [HP-PAC in IPDPS 2005]
Dynamic power management: Choi et al. [ISLPED 2004]; Weißel & Bellosa [CASES 2002]
Dynamic thermal management: Bellosa et al. [COLP 2003]
Workload phases and application behavior prediction: Isci & Martonosi [WWC 2003]; Duesterwald et al. [PACT 2003]
These are examples from the recent literature on using HPCs for performance and beyond; they also outline the examples used in this talk.

High-Performance Corner: P4 Power Estimation
Idea: Power of component I = MaxPower[I] × ArchScaling[I] × AccessRate[I] + NonGatedPower[I], where the access rate comes from HPCs.
Motivation: fast (real-time) estimation, with a view of on-chip detail (per physical component).
Design: developed heuristics using 24 events to approximate access rates for 22 chip components; used 15 counters with 4 rotations to collect all event data.
Validation: real-time estimates against real-time measured power.
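The per-component formula on the slide above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the `Component` type, the field names, and all numeric values are hypothetical, not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One physical chip component (all values here are illustrative)."""
    name: str
    max_power_w: float        # MaxPower[I]
    arch_scaling: float       # ArchScaling[I]
    non_gated_power_w: float  # NonGatedPower[I]


def component_power(c: Component, access_rate: float) -> float:
    """MaxPower x ArchScaling x AccessRate + NonGatedPower for one component."""
    if access_rate < 0.0:
        raise ValueError("access rate must be non-negative")
    return c.max_power_w * c.arch_scaling * access_rate + c.non_gated_power_w


def total_power(components, access_rates) -> float:
    """Sum the component estimates; access_rates maps name -> HPC-derived rate."""
    return sum(component_power(c, access_rates[c.name]) for c in components)


# Hypothetical component and access rate, for illustration only.
trace_cache = Component("trace_cache", 4.0, 0.5, 1.0)
estimate = component_power(trace_cache, 0.25)
```

In practice the access rate for each component would be derived from counter readings over a sampling interval; here it is simply passed in.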

P4 Power Estimator Results
Measured vs. modeled power for SPEC CPU2000 and other applications: Gcc, Gzip, Vpr, Vortex, Gap, Crafty; desktop apps: AbiWord, Gnumeric, xmms, mozilla + file download, mplayer, ...
Average difference: ~5% among all benchmarks.

Embedded Corner: PXA255 Power Estimation
Idea:
CPU Power (n×1) = PerformanceEvents (n×5) × LinearParameters (5×1) + IdlePower
Mem Power (n×1) = PerformanceEvents (n×2) × LinearParameters (2×1) + IdlePower
The LinearParameters are the power weights; the PerformanceEvents are the scaling factors.
Motivation: runtime power optimizations under DVFS. Runtime optimizations include DVFS configuration, OS scheduling, JIT compilation levels, and garbage collection (allocating memory or compacting the heap).
Design: parameter estimation (OLS) using dominant counter readings and live power measurements; power estimation at various CPU configurations.
Validation: comparison between estimates and real-time measured power.
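A minimal sketch of the OLS step for the linear model above, in plain Python. It assumes the idle power is measured separately and subtracted before fitting, which the slide does not state; the function names and the sample data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: events are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_weights(events, power, idle_power):
    """OLS via the normal equations: (X^T X) w = X^T (power - idle)."""
    k = len(events[0])
    y = [p - idle_power for p in power]
    xtx = [[sum(row[i] * row[j] for row in events) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(events, y)) for i in range(k)]
    return solve_linear(xtx, xty)


def estimate_power(event_row, weights, idle_power):
    """PerformanceEvents x LinearParameters + IdlePower for one sample."""
    return sum(e * w for e, w in zip(event_row, weights)) + idle_power


# Hypothetical memory-power samples: two events per sample (n x 2).
events = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
idle = 0.1
power = [2.0 * a + 0.5 * b + idle for a, b in events]  # synthetic "measurements"
weights = fit_weights(events, power, idle)
```

The same routine covers the CPU model by passing five event columns instead of two.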

PXA255 Results
Java CDC (Connected Device Configuration, SpecJVM98): DB, Compress
Java CLDC (Connected Limited Device Configuration): Rex, Crypto
SPEC2000: Bzip2, Vortex, Gap
5% average error across the 3 domains (Java CDC, Java CLDC, SPEC2000).

Proposals from Experiences
1. Track each physical unit individually for power & thermal.
Ex (labels from the slide's diagram): Dispatch Ports, Trace Cache, Instr-n Queue1, MEM, μop Queue, Allocate, Rename, Schedulers, μCode ROM, Instr-n Queue2, EXE. All are tracked with in-flight μops written to the μop queue.
Need individual utilization counts for each physical unit available on die for power and hotspot analyses.
During this research and other work, we gained a lot of experience with the limitations of counters for power. From here on, we discuss the major ones, list our proposals, and finish with an ultimate wishlist. For the instruction queues there is a distinction: one is for memory operations (ld/st) and one is for the rest, but we track in-flight μops instead of retired bogus + non-bogus loads and stores.

Proposals from Experiences
2. Need bitline activity counts.
Utilization is not complete information; power depends in part on the switching factor.
Not necessarily fully detailed counts: accumulate the bitwise XOR of current and previous input/output ports; sample RegFile ports/bit populations.
Implementation can be a Wallace tree/CSA of the XOR results.
Measurement: 30 mW (10%) swing on a 400 MHz, 1.3 V PXA255 processor.

Proposals from Experiences
2. Need bitline activity counts.
Utilization is not complete information; power depends in part on the switching factor.
Not necessarily fully detailed counts: accumulate the bitwise XOR of current and previous input/output ports; sample RegFile ports/bit populations.
[Figure: example operand bit patterns for additions (e.g. 000…01 + 111…11, 111…11 + 111…11, 000…00 + 000…00) and two operand sequences, A and B; 20 mW swing.]
Implementation can be a Wallace tree/CSA of the XOR results.
Measured on a 400 MHz, 1.3 V PXA255 processor.
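The proposed counter, which accumulates the bitwise XOR of the current and previous port values, can be modeled in software as below. This is a behavioral sketch only: the 32-bit width, the function names, and the two value sequences are assumptions for illustration, and a hardware version would use the adder tree mentioned on the slide rather than a software popcount.

```python
def toggle_count(prev: int, curr: int, width: int = 32) -> int:
    """Number of bit positions that differ between two consecutive port values."""
    mask = (1 << width) - 1
    return bin((prev ^ curr) & mask).count("1")


def accumulate_activity(values, width: int = 32) -> int:
    """Accumulate toggles over a sequence of values seen on one port.

    The first value only initializes the 'previous' register, so it
    contributes no toggles.
    """
    total = 0
    for prev, curr in zip(values, values[1:]):
        total += toggle_count(prev, curr, width)
    return total


# Two hypothetical sequences with the same utilization (one access per step)
# but very different switching activity.
steady = [0x00000001] * 5
swinging = [0xFFFFFFFF, 0x00000000, 0x7FFFFFFF, 0x00000000]

steady_toggles = accumulate_activity(steady)      # no bits change
swinging_toggles = accumulate_activity(swinging)  # many bits change each step
```

A utilization counter would report both sequences as equally busy, while the toggle count separates them, which is the gap this proposal targets.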

Proposals from Experiences
3. More detailed off-chip/memory access support in the embedded domain.
Mem power is ~40% of system power.
Tracking memory hierarchy transactions may help render better memory power estimates: main memory reads/writes, core + DMA, transaction length in bytes.
Activity factors can be shared with RegFile.
[Figure: REX memory power consumption (one 16b bank).]
The plot looks like this because we run Rex in a loop; the high memory power comes from the first Rex method, which incurs a lot of I$ misses. The 80200 revision of XScale (the one that goes up to 733 MHz) has a memory-access metric, but it still doesn't differentiate between access types/lengths. The P4 has a pretty good handle on this with BUS_UTILIZATION.

Proposals from Experiences
4. Metrics related to queue occupancy.
Modern processor ≡ several queues.
Depending on implementation, power ∝ queue occupancy.
Reference: Buyuktosunoglu et al. [ISLPED’02], Tradeoffs in Power-Efficient Issue Queue Design.
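One way to picture an occupancy counter is as an accumulator that adds the number of valid queue entries every cycle, so that software can later divide by elapsed cycles. The class below is a hypothetical software model of that idea, not a description of any existing counter; the per-cycle entry counts are made up.

```python
class OccupancyCounter:
    """Software model of a per-queue occupancy accumulator (illustrative)."""

    def __init__(self) -> None:
        self.entry_cycles = 0  # sum of valid entries over all cycles
        self.cycles = 0        # number of cycles observed

    def tick(self, valid_entries: int) -> None:
        """Record the number of occupied entries for one cycle."""
        if valid_entries < 0:
            raise ValueError("occupancy cannot be negative")
        self.entry_cycles += valid_entries
        self.cycles += 1

    def average_occupancy(self) -> float:
        """Average entries in use per cycle over the sampled interval."""
        if self.cycles == 0:
            raise ValueError("no cycles recorded")
        return self.entry_cycles / self.cycles


# Hypothetical per-cycle occupancy of an issue queue.
counter = OccupancyCounter()
for entries in [0, 2, 4, 2]:
    counter.tick(entries)
average = counter.average_occupancy()
```

Where power is proportional to occupancy, the average could be used as the access-rate term of a per-component power estimate.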

Proposals from Experiences
5. General/aggregate metrics, in addition to specialized cases/breakdowns, simplify runtime sampling for unit accesses.
P4 ex. 1, MOB: the only event is MOB_load_replays, which counts replays for unknown store address/data and partial/unaligned address matches. No info for MOB entries/accesses/updates.
P4 ex. 2, FPU: has 8 separate events (with 2 dedicated ESCRs); need at least 4 rotations to collect.
P4 ex. 3, INT ALU: no dedicated event.

Additional Comments for HPC Design
General/aggregate metrics, in addition to specialized cases/breakdowns, simplify runtime sampling for unit accesses.
Metrics related to RegFile accesses vs. forwarding.
Semi-distributed implementations will always induce dependencies among simultaneously countable events.
Higher parallelism among (power-oriented) metrics for minimal counter rotations at runtime.
Implementations that allow counter rotations without the need for intermediate logging: partitioned / dual-mode / buffered counters.
Different events for different types of accesses to the same units with different-magnitude power implications, e.g. branch scan < BHT update < BTA update.
Different API/SW demands: lightweight implementations for runtime analyses; per-thread for application profiling vs. global for real-time measurement comparisons and hotspots.
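To see why counter rotations cost accuracy and bookkeeping, consider how per-event rates are typically recovered when only a subset of events can be counted at a time. The sketch below assumes simple time-multiplexing, where each rotation logs its raw counts together with the cycles it covered; the sample format, the event names, and the numbers are hypothetical and do not describe the authors' tooling.

```python
def rates_from_rotations(samples):
    """Turn logged rotation samples into per-event rates (events per cycle).

    samples: list of (cycles_in_slice, {event_name: raw_count}) tuples, one
    per rotation slice. An event's rate is averaged only over the slices in
    which it was actually being counted.
    """
    counts = {}
    cycles = {}
    for slice_cycles, events in samples:
        for name, count in events.items():
            counts[name] = counts.get(name, 0) + count
            cycles[name] = cycles.get(name, 0) + slice_cycles
    return {name: counts[name] / cycles[name] for name in counts}


# Hypothetical log: two event groups rotated over three slices.
log = [
    (1000, {"fp_ops": 500}),
    (1000, {"mem_loads": 100}),
    (1000, {"fp_ops": 300}),
]
rates = rates_from_rotations(log)
```

Each event is blind during the slices assigned to other groups, so fewer rotations, or counters that rotate without intermediate logging, reduce both the estimation error and the runtime overhead.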

Wishlist for Power/Thermal
1) For each physical unit on die, separate events to track utilization rates, with sub-events for different types of accesses with different power costs
2) Bitline activity counters for switching units
3) Occupancy counters for related queues
4) Counter support for off-core memory accesses
5) High parallelism among power events for minimal counter rotations
This pretty much sums up everything we said before. Not all of it may be practically doable, so the list goes in order of importance.

Conclusions
New opportunities remain to be explored in future PMC designs for power and thermal studies: direct correspondence to physical units; bitline and occupancy counters.
We believe these additions are feasible given the continuing emphasis on counter design, as long as power is also considered a primary design target.
P6 (P3, PPro): 2 counters → P4: 18 counters, with many more events, different modes, and many more features.
POWER3-II: 8 counters → POWER4: also 8 counters, but more than 3× the events.