UPC Reducing Power Consumption of the Issue Logic Daniele Folegnani and Antonio González Universitat Politècnica de Catalunya.


MOTIVATION
- Power consumption
  - High-performance microarchitecture: cooling systems, reliability
  - Embedded systems: battery life

OUTLINE
- Power Consumption in Superscalar Processors
- IPC-based Instruction Queue Resize
- Results
- Conclusions

Power Evaluation Methodology
- Dynamic Power Estimator [Cai,Lim MICRO32]
- Architectural design partition
  - Each architectural block fits a circuit block
- Power consumption evaluation at block level
  - Power density of blocks (SPICE, input sets, technology and circuit styles definition)
  - Block and sub-block activity (execution-driven)
  - Area (feedback from VLSI design)

The Power Model
- 0.18 µm CMOS
- 5 types of logic (static, dynamic, SRAM, clock, PLA)
- 32 blocks, each with an associated area
- Custom design
- Power densities (APD, IPD)
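The block-level estimate described on the two slides above can be sketched as area times an activity-weighted power density. This is only an illustration: the slides do not define APD and IPD, so the sketch assumes they mean active and idle (inactive) power density, and the `Block` type, the units, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One architectural block of the power model (all fields hypothetical)."""
    name: str
    area_mm2: float  # block area, e.g. fed back from the VLSI design
    apd: float       # ASSUMED: power density while active, W/mm^2
    ipd: float       # ASSUMED: power density while idle, W/mm^2


def block_power(block: Block, activity: float) -> float:
    """Average power of a block that is active for `activity` of the cycles."""
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity must be in [0, 1]")
    density = activity * block.apd + (1.0 - activity) * block.ipd
    return block.area_mm2 * density


def total_power(blocks, activity_by_name) -> float:
    """Sum the block-level estimates; activities come from the simulator."""
    return sum(block_power(b, activity_by_name[b.name]) for b in blocks)
```

In this reading, the execution-driven simulator supplies the per-block activity, while the densities and areas are fixed per block by the circuit-level characterization.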

EXPERIMENTAL FRAMEWORK
- 4-instruction fetch, issue and commit
- 128-entry instruction queue
- I-Cache: 128 Kbytes, direct mapped, 32-byte line, 1-cycle hit, 3-cycle miss
- D-Cache: 128 Kbytes, 4-way set associative, 32-byte line, 1-cycle hit, 3-cycle miss
- UL2-Cache: 1024 Kbytes, 4-way set associative, 64-byte line, 3-cycle hit
- Combined predictor of 1K entries, with gshare (1K 2-bit counters, 8-bit global history) and bimodal predictor (2K entries with 2-bit counters)
- 4 int ALU, 4 fp ALU, 1 int mul/div, 1 fp mul/div
- Out-of-order issue, oldest-ready-first selection policy

Power Consumption in Superscalar Processors

ANALYSIS
- Power analysis
  - IQ + ROB = 53% of total consumption
  - Almost independent of the instruction mix
- Trends in superscalar processors
  - Increasing IW and entries in the window
  - IQ power contribution may grow in the future

ANALYSIS
- Considering:
  - Periods of execution with low parallelism: some parts of the IQ have negligible impact on total IPC
  - Periods of execution with high parallelism: a few parts of the IQ can satisfy the issue width

ISSUE IN THE IQ

COMMIT IN THE IQ

IPC-based Instruction Queue Resize
- IQ resize based on IPC contribution
- Avoid wake-up on disabled parts
- The IQ is a circular FIFO without collapsing

IPC-based Instruction Queue Resize
- IQ resize
  - IQ physically divided into 16 parts of 8 entries
  - Add a limit pointer, updated along with the head pointer
  - At resize time, move the limit by one part
  - If the limit reaches the tail, stop inserting new instructions
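The limit-pointer mechanism on this slide can be sketched as a small model of a non-collapsing circular queue. This is an illustration under assumptions, not the authors' hardware: the class and method names are hypothetical, the occupied region is tracked as a `span` (holes left by issued instructions included) to avoid the usual full/empty ambiguity, and at least one part is assumed to stay enabled.

```python
PARTS = 16                   # the IQ is divided into 16 parts...
PART_ENTRIES = 8             # ...of 8 entries each
SIZE = PARTS * PART_ENTRIES  # 128 entries in total


class ResizableIQ:
    """Circular, non-collapsing queue whose usable size is set by a limit."""

    def __init__(self):
        self.head = 0              # oldest entry still held in the queue
        self.span = 0              # entries from head to tail, holes included
        self.active_parts = PARTS  # parts currently enabled

    @property
    def tail(self):
        return (self.head + self.span) % SIZE

    @property
    def limit(self):
        # The limit trails the head by the enabled capacity, so it moves
        # whenever the head moves.
        return (self.head + self.active_parts * PART_ENTRIES) % SIZE

    def insert(self):
        """Return the slot for a new instruction, or None if dispatch stalls."""
        if self.span >= self.active_parts * PART_ENTRIES:
            return None  # the tail has met the limit: stop inserting
        slot = self.tail
        self.span += 1
        return slot

    def release_head(self):
        """Free the oldest entry; head and limit advance together."""
        if self.span == 0:
            raise IndexError("queue is empty")
        self.head = (self.head + 1) % SIZE
        self.span -= 1

    def shrink(self):
        # ASSUMPTION: keep at least one part enabled.
        self.active_parts = max(1, self.active_parts - 1)

    def grow(self):
        self.active_parts = min(PARTS, self.active_parts + 1)
```

If the queue is shrunk while it holds more entries than the new capacity, `insert` simply keeps returning `None` until enough old entries are released, which matches the "stop inserting" rule without evicting anything.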

Heuristics
- Heuristic to reduce size
  - Statistics on committed instructions from the youngest part, gathered every quantum => add a bit to each ROB entry
  - Threshold-based resize decision
  - No limit on the size that can be disabled
- Heuristic to grow size
  - Grow by one portion every 5 quanta
  - The threshold-based scheme decides in the next quantum whether the growth was correct
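The two heuristics above can be sketched as a per-quantum controller. The slide gives no threshold value or quantum length, so `threshold` is a hypothetical parameter; letting the periodic growth take precedence over shrinking in the same quantum, and keeping a floor of `min_parts`, are also assumptions made only for this illustration.

```python
class ResizeController:
    """Decides, once per quantum, how many IQ parts stay enabled."""

    def __init__(self, threshold, parts=16, grow_period=5, min_parts=1):
        self.threshold = threshold      # hypothetical: the slide gives no value
        self.max_parts = parts
        self.grow_period = grow_period  # grow one portion every 5 quanta
        self.min_parts = min_parts      # ASSUMPTION: floor on the enabled size
        self.active_parts = parts
        self.youngest_commits = 0
        self.quanta = 0

    def on_commit(self, from_youngest_part):
        """Called per committed instruction with the extra ROB bit."""
        if from_youngest_part:
            self.youngest_commits += 1

    def end_quantum(self):
        """Apply the resize decision and return the new number of parts."""
        self.quanta += 1
        if self.quanta % self.grow_period == 0:
            # Periodic growth; the next quantum's count shows whether the
            # re-enabled part contributes enough to be kept.
            self.active_parts = min(self.max_parts, self.active_parts + 1)
        elif self.youngest_commits < self.threshold:
            # The youngest part contributed too little: disable one part.
            self.active_parts = max(self.min_parts, self.active_parts - 1)
        self.youngest_commits = 0
        return self.active_parts
```

A simulator would call `on_commit` for every committed instruction and `end_quantum` at each quantum boundary, then apply the returned size to the queue.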

Results

Conclusions
- The IQ is a critical point for power consumption in superscalar processors
- Dynamically adapting the IQ size based on IPC contribution can save about 15% of total power with negligible impact on performance

Q & A