Power and Temperature-Aware Microarchitecture: A Preliminary Look
Karthik Ramani, University of Utah, September 28, 2004



Motivation
- ITRS Roadmap: reasons for increasing power consumption
- Higher chip operating frequencies
- Increased gate leakage of transistors
- Higher interconnect capacitances and resistances
  - Lack of an interconnect architecture design tool until 2009
  - Inability of the interconnect to scale for performance beyond 2009

Heterogeneous Interconnects: A starting point
- Two sets of interconnects
  - Low-delay, high-power wires
  - Low-power, high-delay wires
- Easier to target instructions
- Augurs well for a more sophisticated model

Interconnect transfers: Types
- Bypassed register value
- Ready register value
- Address transfer
- Store value
- Load value

Bypassed Register Values
- Operands produced in one cluster that are immediately required by another cluster
- Criticality based on two factors
  - Operand arrival time at the cluster
  - Actual issue time of the sourcing instruction
- Criticality changes at runtime
  - Needs a dynamic predictor
[Figure: Rename & Dispatch feeding three clusters, each with an IQ, Regfile, and FU. The producing instruction completes execution at cycle 120; the consumer instruction is dispatched at cycle 100.]

The Data Criticality Predictor
- A table indexed by the low-order bits of the instruction address, updated dynamically to indicate the criticality of data
- The difference between arrival time and usage time is calculated for each operand of an instruction
- Difference < Threshold: Critical
- Difference > Threshold: Non-Critical
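The predictor described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name, the table size (`index_bits`), the threshold value, the initial "critical" state, and the handling of a difference exactly equal to the threshold are all hypothetical choices that the slide does not specify.

```python
class CriticalityPredictor:
    """One criticality bit per table entry, indexed by low-order PC bits."""

    def __init__(self, index_bits=10, threshold=2):
        # index_bits and threshold are illustrative values, not from the slides.
        self.mask = (1 << index_bits) - 1
        self.threshold = threshold
        # Assumption: unseen instructions start out as critical.
        self.table = [True] * (1 << index_bits)

    def _index(self, pc):
        # Low-order bits of the instruction address select the entry.
        return pc & self.mask

    def predict(self, pc):
        """Return True if the operand transfer is predicted critical."""
        return self.table[self._index(pc)]

    def update(self, pc, arrival_cycle, use_cycle):
        """Record the observed gap between operand arrival and its use."""
        difference = use_cycle - arrival_cycle
        # Difference < threshold: critical; otherwise non-critical.
        # Assumption: a difference equal to the threshold counts as non-critical.
        self.table[self._index(pc)] = difference < self.threshold


predictor = CriticalityPredictor(index_bits=4, threshold=2)
predictor.update(0x40, arrival_cycle=120, use_cycle=130)  # operand sat idle
print(predictor.predict(0x40))
```

Because only the low-order address bits are used, two instructions can share an entry (here `0x40` and `0x50` alias with `index_bits=4`); a larger table reduces that interference at the cost of area and power.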

Summary of transfers
Critical                          Non-Critical
Load values                       Store value
Effective address (unpredicted)   Effective address (predicted)
Bypassed register value           Ready register value
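As a rough sketch of how the classification above could drive wire selection: the function and string labels below are hypothetical names chosen for illustration. Letting the dynamic predictor demote a bypassed register value to the low-power wires is an assumption, based on the earlier slide stating that the criticality of bypassed values changes at runtime.

```python
from enum import Enum


class Wire(Enum):
    FAST = "low-delay, high-power"
    LOW_POWER = "low-power, high-delay"


# Transfer kinds as listed in the summary table (labels are illustrative).
CRITICAL = {
    "load_value",
    "unpredicted_effective_address",
    "bypassed_register_value",
}
NON_CRITICAL = {
    "store_value",
    "predicted_effective_address",
    "ready_register_value",
}


def select_wire(kind, predicted_critical=True):
    """Pick a wire set for one transfer.

    predicted_critical is only consulted for bypassed register values
    (assumption: the dynamic predictor can demote them at runtime).
    """
    if kind in NON_CRITICAL:
        return Wire.LOW_POWER
    if kind == "bypassed_register_value" and not predicted_critical:
        return Wire.LOW_POWER
    if kind in CRITICAL:
        return Wire.FAST
    raise ValueError(f"unknown transfer kind: {kind}")


print(select_wire("store_value").value)
print(select_wire("bypassed_register_value", predicted_critical=False).value)
```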

Result summary
- Two kinds of non-critical transfers
  - Data that are not immediately used: 36%
  - Verification of address predictions: 13%
- Criticality-based case
  - 49% of all data transfers go through the power-optimized wires
  - Performance penalty: only 2.5%
  - Potential energy savings of around 50% in the interconnects

Things that are missing
- Power modeling for the processor as a whole
- Implications for transient temperature variations under varying workloads
- Lack of a good on-chip interconnect power/temperature simulator
- Complexity-effective design for the criticality predictor

Interconnect simulator: Problems
- Should account for:
  - Number of wires in the particular process
  - A 3-D space for routing of wires
  - Design rule constraints
- More of a layout optimization problem

What we propose to do
- Wattch: incorporated into a scalable 16-cluster system
- HotSpot: transient temperature model
- HotLeakage: leakage power model
- Build a prototype layout to satisfy the above requirements

Wattch
- Power model from Princeton University
- Simulates an out-of-order processor (Alpha 21264)
- Caveat: interconnects are not accurately simulated
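For context, architectural power models of this kind build on the standard CMOS dynamic power relation, P = a * C * Vdd^2 * f. The sketch below only evaluates that relation; the function name and the numeric values are hypothetical and are not taken from Wattch or from the slides.

```python
def dynamic_power(capacitance_f, vdd_v, frequency_hz, activity):
    """Dynamic power in watts: P = a * C * Vdd^2 * f.

    capacitance_f: switched capacitance of the unit, in farads
    vdd_v:         supply voltage, in volts
    frequency_hz:  clock frequency, in hertz
    activity:      fraction of cycles in which the unit switches (0..1)
    """
    return activity * capacitance_f * vdd_v ** 2 * frequency_hz


# Illustrative numbers only: a 1 nF unit at 1.0 V and 1 GHz, active half the time.
print(dynamic_power(1e-9, 1.0, 1e9, 0.5))
```

The quadratic dependence on supply voltage is why voltage has more leverage on dynamic power than frequency or activity, each of which enters linearly.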

Wattch Modified
- Wattch uses a single instruction window logic
- Issue queue model
  - Separate Int and FP wakeup logic
  - Separate Int and FP selection logic
  - Helps in efficient distribution

Wattch Modified
- Single result bus, FUs, and register files
- Distributed units
  - Separate integer and floating-point register files
  - Separate integer and floating-point execution and result bus units

Wattch Modified
- Wattch: simple Alpha
- Modified for a scalable 16-cluster system
  - Modular: easy to adapt and test
  - Caveat: there is a lot of scope for improvement

Visual Feature Recognition: Elastic Bunch Graph Matching (EBGM)

History
- No particular algorithm known
- Many algorithms for face and object recognition
- Few feature recognition benchmarks, such as FERET
- Eigenfaces: traditionally known for face recognition

Motivation: EBGM
- Steps in face recognition: flesh toning, segmentation, face detection, face recognition
- No segmentation needed in EBGM!

EBGM
- Steps involved in EBGM: normalization/preprocessing, face graph creation, face identification
- Looks easy

EBGM: Mathematically
- Image descriptions are based on a wavelet transform
- Gabor jets are extracted from each landmark
- Local image information around each node is the key
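To make "Gabor jet" concrete, here is a minimal pure-Python sketch that computes one jet at a single pixel. It assumes the commonly used EBGM parameterization (5 scales by 8 orientations, sigma = 2*pi), which the slides do not state; the function names, the window radius, and the list-of-lists image format are hypothetical choices for illustration.

```python
import cmath
import math

SIGMA = 2 * math.pi  # assumed kernel width parameter


def gabor_kernel_value(dx, dy, nu, mu):
    """Complex Gabor wavelet at offset (dx, dy) for scale nu, orientation mu."""
    k = 2 ** (-(nu + 2) / 2) * math.pi      # wave number for this scale
    phi = mu * math.pi / 8                  # orientation angle
    kx, ky = k * math.cos(phi), k * math.sin(phi)
    k2 = k * k
    envelope = (k2 / SIGMA ** 2) * math.exp(-k2 * (dx * dx + dy * dy) / (2 * SIGMA ** 2))
    # Subtracting exp(-sigma^2 / 2) removes the kernel's DC response.
    carrier = cmath.exp(1j * (kx * dx + ky * dy)) - math.exp(-SIGMA ** 2 / 2)
    return envelope * carrier


def gabor_jet(image, x, y, radius=8, scales=5, orientations=8):
    """Return scales * orientations complex coefficients at pixel (x, y)."""
    height, width = len(image), len(image[0])
    jet = []
    for nu in range(scales):
        for mu in range(orientations):
            acc = 0j
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        # Convolution: the kernel is evaluated at the negated offset.
                        acc += image[yy][xx] * gabor_kernel_value(-dx, -dy, nu, mu)
            jet.append(acc)
    return jet


# Vertical stripes whose spatial frequency matches scale nu = 0.
stripes = [[math.cos(math.pi / 2 * col) for col in range(17)] for _ in range(17)]
jet = gabor_jet(stripes, 8, 8)
print(len(jet), abs(jet[0]) > abs(jet[4]))
```

Every landmark needs a full window sum for each of the 40 kernels, which is one way to see why the later slide calls Gabor jets compute intensive.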

EBGM: What is missing?
- Landmark localization is less reliable
- Difficult to track small differences in face orientation now
- Compute-intensive Gabor jets

Questions? Thank you