Parapet Research Group, Princeton University EE Workshop on Hardware Performance Monitor Design and Functionality HPCA-11 Feb 13, 2005 Hardware Performance.

Presentation transcript:

Slide 1: Hardware Performance Counters for Detailed Runtime Power and Thermal Estimations: Experiences & Proposals
Canturk Isci, Gilberto Contreras, Margaret Martonosi
Parapet Research Group, Princeton University EE
Workshop on Hardware Performance Monitor Design and Functionality, HPCA-11, Feb 13, 2005

Slide 2: Hardware Performance Counters (HPCs) Go beyond Performance

- Several explored research avenues:
  - Runtime power/thermal estimations
  - Dynamic management
  - Workload phases and application behavior prediction
- HPCs provide value beyond simulations:
  - Long timescales
  - Real-system behavior

Slide 3: Hardware Performance Counters (HPCs) Go beyond Performance

- Runtime power
  - Isci & Martonosi [MICRO 2003]
  - Contreras & Martonosi [Submitted 2005]
- Runtime thermal
  - Lee & Skadron [HP-PAC in IPDPS 2005]
- Dynamic power management
  - Choi et al. [ISLPED 2004]
  - Weißel & Bellosa [CASES 2002]
- Dynamic thermal management
  - Bellosa et al. [COLP 2003]
- Workload phases and application behavior prediction
  - Isci & Martonosi [WWC 2003]
  - Duesterwald et al. [PACT 2003]

Slide 4: High-Performance Corner: P4 Power Estimation

- Idea:
  Power of component I = MaxPower[I] × ArchScaling[I] × AccessRate[I] + NonGatedPower[I]
- Motivation:
  - Fast (real-time)
  - Estimated view of on-chip detail (per physical component)
- Design:
  - Developed heuristics using 24 events to approximate access rates for 22 chip components
  - Used 15 counters with 4 rotations to collect all event data
- Validation:
  - Real-time estimates against real-time measured power
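The per-component formula on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the `Component` class, the function names, and every numeric value are hypothetical and do not come from the presentation, which does not list the actual per-component constants.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One physical chip component; all field values are illustrative."""
    name: str
    max_power_w: float        # MaxPower[I]
    arch_scaling: float       # ArchScaling[I]
    non_gated_power_w: float  # NonGatedPower[I]


def component_power(c: Component, access_rate: float) -> float:
    """Power of component I = MaxPower * ArchScaling * AccessRate + NonGatedPower."""
    return c.max_power_w * c.arch_scaling * access_rate + c.non_gated_power_w


def total_power(components, access_rates) -> float:
    """Sum the per-component estimates; access_rates maps component name to rate."""
    return sum(component_power(c, access_rates[c.name]) for c in components)


# Hypothetical numbers, chosen only to show the calculation.
alu = Component("int_alu", max_power_w=4.0, arch_scaling=0.5, non_gated_power_w=0.25)
l1d = Component("l1_dcache", max_power_w=2.0, arch_scaling=1.0, non_gated_power_w=0.5)
estimate = total_power([alu, l1d], {"int_alu": 0.5, "l1_dcache": 0.25})
```

In a real estimator the access rates would be derived from counter readings by the heuristics the slide mentions, not passed in directly.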

Slide 5: P4 Power Estimator Results

- Average difference: ~5% among all benchmarks
- SPEC CPU2000 & other applications

[Figure: measured vs. modeled power for Gcc, Gzip, Vpr, Vortex, Gap, Crafty]

Slide 6: Embedded Corner: PXA255 Power Estimation

- Idea:
  CPU Power (n×1) = PerformanceEvents (n×5) × LinearParameters (5×1) + IdlePower
  Mem Power (n×1) = PerformanceEvents (n×2) × LinearParameters (2×1) + IdlePower
- Motivation:
  - Runtime power optimizations under DVFS
- Design:
  - Parameter estimation (OLS) using dominant counter readings and live power measurements
  - Power estimation at various CPU configurations
- Validation:
  - Comparison between estimates and real-time measured power
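The linear model above can be fitted with ordinary least squares. The sketch below is a minimal pure-Python version that solves the normal equations. It assumes the idle power is known separately and subtracted before fitting, which the slide does not spell out, and it uses two made-up event columns rather than the five CPU events of the actual model. All function names and data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: event columns are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= f * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(events, power, idle_power):
    """Fit LinearParameters so that events @ params + idle_power ~= power."""
    k = len(events[0])
    y = [p - idle_power for p in power]
    xtx = [[sum(row[i] * row[j] for row in events) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(events, y)) for i in range(k)]
    return solve_linear(xtx, xty)


def estimate_power(event_row, params, idle_power):
    """Apply the fitted model to one sample of counter readings."""
    return sum(e * p for e, p in zip(event_row, params)) + idle_power


# Made-up training data: rows are samples, columns are normalized event rates.
events = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
power = [0.6, 0.35, 0.85, 1.35]  # watts, generated from params (0.5, 0.25), idle 0.1
params = fit_ols(events, power, idle_power=0.1)
```

With DVFS, one set of parameters per CPU configuration would presumably be fitted the same way from measurements taken at that configuration.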

Slide 7: PXA255 Results

- 5% average error across 3 domains:
  - Java CDC
  - Java CLDC
  - SPEC2000

[Figure label: DB CDC Java]

Slide 8: Proposals from Experiences

- 1. Track each physical unit individually for power & thermal:
  - Ex: Trace Cache, μCode ROM, μop Queue, Allocate, Rename, Instr-n Queue1, Instr-n Queue2, Schedulers, MEM, EXE, Dispatch Ports
  - All tracked with in-flight μops written to μop queue
- Need individual utilization counts for each physical unit available on die for power and hotspot analyses

Slide 9: Proposals from Experiences

- 2. Need bitline activity counts
  - Utilization is not complete information; power in part depends on switching factor
  - Not necessarily fully detailed counts:
    - Accumulate bitwise XOR of current and previous input/output ports
    - Sample RegFile ports/bit populations

[Figure: 30 mW (10%) power swing, 400 MHz, 1.3 V PXA255 processor]

Slide 10: Proposals from Experiences

- 2. Need bitline activity counts
  - Utilization is not complete information; power in part depends on switching factor
  - Not necessarily fully detailed counts:
    - Accumulate bitwise XOR of current and previous input/output ports
    - Sample RegFile ports/bit populations

[Figure: 20 mW power swing on the 400 MHz, 1.3 V PXA255 processor for A + B. Operand A stays at 000…01. Operand B alternates between 000…00 and a pattern that steps through 111…11, 011…11, 001…11, …, 000…11, 000…01.]
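The "accumulate bitwise XOR of current and previous" idea can be modeled in software as follows. This is a sketch of one plausible reading, in which the counter accumulates the number of bits that differ between consecutive values on a port. The `ToggleCounter` class and the 8-bit width are assumptions made for the example, not part of the proposal.

```python
class ToggleCounter:
    """Software model of a hypothetical bitline activity counter for one port."""

    def __init__(self, width: int = 32):
        self.mask = (1 << width) - 1
        self.prev = 0      # last value seen on the port (assumed to reset to 0)
        self.toggles = 0   # accumulated number of bit flips

    def observe(self, value: int) -> None:
        """Record one value on the port and add the bits that changed."""
        value &= self.mask
        self.toggles += bin(value ^ self.prev).count("1")
        self.prev = value


# Two operand streams with equal utilization but very different switching,
# loosely mirroring the alternating and constant patterns in the figure.
alternating = ToggleCounter(width=8)
for v in (0xFF, 0x00, 0xFF, 0x00):
    alternating.observe(v)

constant = ToggleCounter(width=8)
for v in (0x01, 0x01, 0x01, 0x01):
    constant.observe(v)
```

Both streams make four accesses, so a utilization count cannot tell them apart, while the toggle counts differ by a large factor (32 against 1 here).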

Slide 11: Proposals from Experiences

- 3. More detailed off-chip/memory access support in the embedded domain
  - Mem power ~40% of system power
  - Tracking memory hierarchy transactions may help render better memory power estimates:
    - Main memory reads/writes: core + DMA
    - Transaction length in bytes
  - Activity factors can be shared with RegFile

[Figure: REX memory power consumption (one 16b bank)]

Slide 12: Proposals from Experiences

- 4. Metrics related to queue occupancy
  - Modern processor ≡ several queues
  - Depending on implementation, power ∝ queue occupancy
  - Buyuktosunoglu et al. [ISLPED'02], Tradeoffs in Power-Efficient Issue Queue Design
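One simple way to picture an occupancy metric is a counter that adds the number of valid queue entries every cycle, so that dividing by the cycle count gives average occupancy. The sketch below models that behavior in software. The `OccupancyCounter` class is hypothetical and is not a description of any existing counter.

```python
class OccupancyCounter:
    """Software model of a hypothetical per-queue occupancy counter."""

    def __init__(self):
        self.cycles = 0
        self.entry_cycles = 0  # sum over cycles of valid entries in the queue

    def tick(self, valid_entries: int) -> None:
        """Called once per cycle with the number of occupied queue entries."""
        self.cycles += 1
        self.entry_cycles += valid_entries

    def average_occupancy(self) -> float:
        """Mean number of occupied entries per cycle (0.0 before any cycle)."""
        return self.entry_cycles / self.cycles if self.cycles else 0.0


issue_queue = OccupancyCounter()
for entries in (0, 4, 8, 4):
    issue_queue.tick(entries)
avg = issue_queue.average_occupancy()
```

If power scales with occupancy for a given queue implementation, the accumulated entry-cycles value can feed a power model directly as its activity term.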

Slide 13: Proposals from Experiences

- 5. General/aggregate metrics in addition to specialized cases/breakdowns simplify runtime sampling for unit accesses
  - P4 ex1. MOB: only event is MOB_load_replays
    - Counts replays for unknown st addr./data, partial/unaligned addr. match
    - No info for MOB entries/accesses/updates
  - P4 ex2. FPU: has 8 separate events (with 2 dedicated ESCRs)
    - Need at least 4 rotations to collect
  - P4 ex3. INT ALU: no dedicated event
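The FPU example is simple arithmetic: 8 events that share 2 event selectors need at least ceil(8 / 2) = 4 rotations. The sketch below shows that bound together with the usual scaling step applied to a count that was only collected during part of the run. It assumes any event can use any of the selectors and that activity is uniform across sampling intervals. Both are simplifications, and the function names are hypothetical.

```python
import math


def min_rotations(num_events: int, num_selectors: int) -> int:
    """Lower bound on rotations when num_selectors events fit per rotation."""
    if num_events < 0 or num_selectors <= 0:
        raise ValueError("need num_events >= 0 and num_selectors > 0")
    return math.ceil(num_events / num_selectors)


def extrapolate(raw_count: int, active_intervals: int, total_intervals: int) -> float:
    """Scale a count seen in some intervals to the whole run (uniform activity assumed)."""
    if active_intervals <= 0 or total_intervals < active_intervals:
        raise ValueError("need 0 < active_intervals <= total_intervals")
    return raw_count * total_intervals / active_intervals


fpu_rotations = min_rotations(num_events=8, num_selectors=2)
# An event counted in 1 of every 4 intervals is scaled up by 4.
scaled = extrapolate(raw_count=250, active_intervals=1, total_intervals=4)
```

A single aggregate event per unit would remove the rotations, and with them the extrapolation error, which is the point this slide makes.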

Slide 14: Additional Comments for HPC Design

- General/aggregate metrics in addition to specialized cases/breakdowns simplify runtime sampling for unit accesses
- Metrics related to RegFile accesses vs. forwarding
- Semi-distributed implementations will always induce dependencies among simultaneously countable events
  - Higher parallelism among (power-oriented) metrics for minimal counter rotations at runtime
  - Implementations that allow counter rotations without need for intermediate logging: partitioned / dual-mode / buffered counters
- Different events for different types of accesses to the same units with power implications of different magnitude
  - e.g. branch scan < BHT update < BTA update
- Different API/SW demands:
  - Lightweight implementations for runtime analyses
  - Per-thread for application profiling vs. global for real-time measurement comparisons and hotspots

Slide 15: Wishlist for Power/Thermal

- 1) For each physical unit on die, separate events to track utilization rates
  - Sub-events for different types of accesses with different power costs
- 2) Bitline activity counters for switching units
- 3) Occupancy counters for related queues
- 4) Counter support for off-core memory accesses
- 5) High parallelism among power events for minimal counter rotations

Slide 16: Conclusions

- New opportunities remain to be explored in future PMC designs for power and thermal studies
  - Direct correspondence to physical units
  - Bitline and occupancy counters
- We believe in the feasibility of these additions with the continuing emphasis given to counter design, as long as power is also considered a primary design target.