Integrating Adaptive On-Chip Storage Structures for Reduced Dynamic Power

Presentation transcript:

1 Integrating Adaptive On-Chip Storage Structures for Reduced Dynamic Power Steve Dropsho, Alper Buyuktosunoglu, Rajeev Balasubramonian, David H. Albonesi, Sandhya Dwarkadas, Greg Semeraro, Grigorios Magklis, and Michael Scott ECE and CS Departments University of Rochester

2 Why Adaptive Structures? General-purpose microprocessors are “one size fits all,” but needs vary across (and within) applications. Matching resources to the application can save considerable energy. Objective: less energy for the same performance by adapting storage structures to the application

3 Related Work Adaptable cache –Balasubramonian et al., MICRO 2000 –Dhodapkar and Smith, ISCA 2002 Adaptable issue logic –Buyuktosunoglu et al., GLSVLSI 2001 –Folegnani and Gonzalez, ISCA 2000

4 Common Themes A single adaptive structure Use of global information for feedback Exploration-based (caches)

5 Related Work (cont) Adaptable IQ, LSQ, and ROB –Ponomarev et al., MICRO 2001 –Three (3) adaptable structures –Reconfigurations based on local state

6 Integrating Multiple Adaptive Structures [Figure: processor block diagram. Labeled structures: L1 Icache, Branch predict, FetchQ, Rename map, ROB, IIQ, FPQ, LSQ, IPREG, FPREG, Int FUs, FP FUs, L1 Dcache, L2 Unified Cache. Labeled paths: Integer, Memory, Floating Pt]

7 Challenges Multiple (9) adaptive structures create a state explosion problem. Use of global information makes assigning cause and effect difficult. Potential for additive performance effects among the structures

8 Approach: Local Management Local information for configuration decisions Tight control over performance variance

9 Part I: The Caches [Figure: processor block diagram, as on slide 6]

10 The Accounting Cache A access (primary) B access (secondary) Sequential accesses, A then B Save energy on A access hit Swap blocks on A access miss [Figure: cache ways split between the A and B partitions. Labels: Swap; A1 B3; A2 B2; A3 B1; A4 B0]
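
The A-then-B access sequence can be illustrated with a small sketch of a single cache set. This is only an illustration under stated assumptions, not the authors' hardware: the class name `AccountingSet`, the use of one MRU-ordered list per set, and the rule that a block found in B moves to the front of A (pushing A's least recently used block into B) are all hypothetical choices made here to model "swap blocks on A access miss".

```python
from dataclasses import dataclass, field


@dataclass
class AccountingSet:
    """One cache set kept in most-recently-used order (index 0 = MRU).

    Assumption: the first `a_ways` positions form the primary (A)
    partition and the remaining positions form the secondary (B) partition.
    """
    ways: int = 4
    a_ways: int = 2
    blocks: list = field(default_factory=list)

    def access(self, tag):
        """Probe A, then B; return 'A', 'B', or 'miss' and update the set."""
        if tag in self.blocks:
            pos = self.blocks.index(tag)
            # Moving the block to the front models the swap into A:
            # on a B hit, the block that was last in A slides into B.
            self.blocks.remove(tag)
            self.blocks.insert(0, tag)
            return "A" if pos < self.a_ways else "B"
        # Miss in both partitions: fill at the MRU position, evict the LRU block.
        self.blocks.insert(0, tag)
        del self.blocks[self.ways:]
        return "miss"


cache_set = AccountingSet(ways=4, a_ways=2)
outcomes = [cache_set.access(tag) for tag in "wxywwx"]
print(outcomes)
```

Only an access that returns 'A' avoids probing the B ways, which is where the energy saving on an A hit comes from in this model.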

11 Most-Recently-Used Statistics [Figure: a four-way set with rows 0 1 2 3, Way 1 2 3 4, and Line A B C D; MRU state transitions; MRU state counters MRU[0], MRU[1], MRU[2], MRU[3], and Misses. Counter entries shown: A A A B B C]

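12 Configuration Evaluation [Figure: counters MRU[0], MRU[1], MRU[2], MRU[3], and Misses, with axis labels (lru) and (mru), and the delay and energy computed for each candidate configuration. Delay = 6 D_A + 3 D_B; Delay = 6 D_A + 1 D_B; Delay = 6 D_A. Energy = 6 E_1 + 3 E_3; Energy = 7 E_2; Energy = 6 E_3; Energy = 6 E_4 (BASE)]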
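
A sketch of how per-configuration delay and energy might be derived from the MRU counters follows. The cost model is an assumption made for illustration: every access probes the A ways, an access that does not hit in A also probes the B ways, and `e_way[k]` is a hypothetical energy for probing `k` ways. The function name `evaluate_configs` and its parameters are not from the slides. With counters (3, 2, 1, 0) and no misses, this model gives 6 D_A + 3 D_B for a one-way A partition, which agrees with the first delay expression on the slide.

```python
def evaluate_configs(mru, misses, d_a, d_b, e_way):
    """Estimate (delay, energy) for every A-partition size.

    mru[i]   -- hits observed at MRU position i during the interval
    misses   -- accesses that missed in the whole cache
    d_a, d_b -- assumed latencies of an A probe and a B probe
    e_way[k] -- assumed energy of probing k ways (e_way[0] must be 0)
    """
    ways = len(mru)
    total = sum(mru) + misses
    results = {}
    for a_ways in range(1, ways + 1):
        b_ways = ways - a_ways
        # Accesses that fall outside A must also probe B (if B exists).
        b_probes = sum(mru[a_ways:]) + misses if b_ways else 0
        delay = total * d_a + b_probes * d_b
        energy = total * e_way[a_ways] + b_probes * e_way[b_ways]
        results[a_ways] = (delay, energy)
    return results


# Illustrative numbers only: latencies and energies are made up.
for a_ways, (delay, energy) in evaluate_configs(
        mru=[3, 2, 1, 0], misses=0, d_a=2, d_b=3, e_way=[0, 1, 2, 3, 4]).items():
    print(a_ways, delay, energy)
```

Because the total capacity is the same in every configuration, the miss count does not change with the A/B split in this model, so the miss penalty is left out of the comparison.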

13 Tolerance and the Bank Account Tolerance allows more delay than BASE –D_TOL = D_BASE × (1 + TOL) –TOL = {0.015, 0.062, 0.25} (1/64, 1/16, 1/4) Bank account allows accumulation of unused tolerance Use account credits in later intervals –Allows aggressive resizing –Amortizes mistakes over many intervals
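
The interaction between the tolerance and the bank account can be sketched as below. This is a simplified reading, not the authors' exact bookkeeping: it assumes the allowed delay for an interval is D_BASE × (1 + TOL) plus any saved credit, that the lowest-energy configuration within that budget is chosen, and that unused budget is carried forward. The function `choose_config` and the `(delay, energy)` table are hypothetical.

```python
def choose_config(results, base_ways, tol, account):
    """Pick the lowest-energy configuration within the delay budget.

    results   -- {a_ways: (delay, energy)} estimates for one interval
    base_ways -- the BASE configuration (largest A partition)
    tol       -- tolerance, e.g. 1/64, 1/16, or 1/4
    account   -- delay credit carried over from earlier intervals
    Returns (chosen a_ways, updated account).
    """
    d_base = results[base_ways][0]
    budget = d_base * (1 + tol) + account   # D_TOL plus saved credit
    feasible = {k: v for k, v in results.items() if v[0] <= budget}
    best = min(feasible, key=lambda k: feasible[k][1])
    # Unused budget is deposited; spent credit is withdrawn.
    return best, budget - results[best][0]


# Illustrative estimates only: {a_ways: (delay, energy)}.
estimates = {1: (21, 15), 2: (15, 14), 3: (12, 18), 4: (12, 24)}
account = 0.0
history = []
for _ in range(5):
    ways, account = choose_config(estimates, base_ways=4, tol=1 / 16, account=account)
    history.append(ways)
print(history, account)
```

In this run the tolerance alone never admits the two-way configuration, but credit saved over three intervals does, which is the sense in which the account allows more aggressive resizing.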

14 Memory Hierarchy L1 I-Cache (A/B) L1 D-Cache (A, no B) L2 Unified Cache (A/B) One Possible Configuration

15 Environment SimpleScalar simulator Microarchitecture is similar to Alpha Benchmarks are a mix of SPEC95, SPEC2K, and Olden Energy models for buffers and caches from Buyuktosunoglu et al., GLSVLSI 2001 and Balasubramonian et al., MICRO 2000

16 Cache Results

17 Part II: Queues, Regs, and ROB [Figure: processor block diagram, as on slide 6]

18 Resizable Queues/Reg File [Figure: a buffer built from N partitions, P_1 through P_N, of m elements each]

19 Buffer Sizing [Figure: distribution of buffer size, with labels: Full, Grow buffer, Proper size, Precise shrink, ave] 8K cycle period. Tolerances: 1.5% (1/64), 6.2% (1/16), 25.0% (1/4). With Limited Histogramming

20 Resizing the Register File Issue: Do not know when registers expire Solution: To make reg file smaller, move values out of partition (P) to be turned off –First, inhibit new assignments to P –Next, use a software interrupt routine to move values via normal rename logic (mov r1 r1) –Register mappings automatically updated
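
The two steps for emptying a register-file partition can be sketched in software as follows. This is a toy model under loud assumptions: the rename map is a plain dictionary from architectural to physical register, the free list is a Python list, in-flight instructions are ignored, and the function name `drain_partition` is hypothetical. Each self-move (`mov r1 r1`) is modeled as allocating a fresh physical register outside the partition and copying the value.

```python
def drain_partition(rename_map, free_list, values, partition):
    """Move live values out of `partition` so it can be turned off.

    rename_map -- {architectural register: physical register}
    free_list  -- physical registers available for allocation
    values     -- {physical register: value}
    partition  -- set of physical registers to be disabled
    """
    # Step 1: inhibit new assignments to the partition.
    free_list[:] = [p for p in free_list if p not in partition]

    # Step 2: for each architectural register still mapped into the
    # partition, perform the equivalent of "mov rX rX". The rename step
    # allocates a new destination, so the mapping updates as a side effect.
    for arch, phys in list(rename_map.items()):
        if phys in partition:
            if not free_list:
                raise RuntimeError("not enough registers outside the partition")
            new_phys = free_list.pop(0)
            values[new_phys] = values.pop(phys)
            rename_map[arch] = new_phys
    return rename_map


# Illustrative state: physical registers 4 to 7 form the partition to disable.
rmap = {"r1": 0, "r2": 5, "r3": 6}
free = [1, 2, 3, 4, 7]
vals = {0: 10, 5: 20, 6: 30}
drain_partition(rmap, free, vals, partition={4, 5, 6, 7})
print(rmap, free, vals)
```

The point of routing the moves through the rename logic is that no separate mechanism is needed to fix up the mappings; the sketch mirrors that by updating `rename_map` in the same step that allocates the new register.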

21 Floating Point App Results

22 Summary Results

23 Conclusion Simultaneous adaptation of all major regular structures –Accounting cache –Limited histogramming for buffers –Adaptable register file Local control yet tolerable performance loss Future work –Augment local control with global control for bounded performance loss