Dynamically Trading Frequency for Complexity in a GALS Microprocessor
Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott
University of Rochester
The gist of the paper…
- Radical idea: trade off frequency and hardware complexity dynamically at runtime rather than statically at design time
- The new twist: a Globally-Asynchronous, Locally-Synchronous (GALS) microarchitecture is key to making this worthwhile
Application phase behavior
- Applications exhibit varying behavior over time [Sherwood, Sair, Calder, ISCA 2003]
- This can be exploited to save power, e.g., with an adaptive issue queue [Buyuktosunoglu et al., GLSVLSI 2001]
(figure: gcc traces over time of L2 misses, IPC, L1 I-cache misses, L1 D-cache misses, branch mispredictions, and energy per interval)
What about performance?
- Downsizing gives lower power and faster access time!
(figure: relative delay versus number of entries for RAM and CAM structures [Buyuktosunoglu, GLSVLSI 2001])
What about performance?
- How do we exploit the faster speed?
  - Variable latency
  - Increase frequency when downsizing; decrease frequency when upsizing
What about performance? [Albonesi, ISCA 1998]
(figure: a fully synchronous pipeline — fetch unit, branch predictor, L1 I-cache, dispatch/rename/ROB, integer and FP issue queues with ALUs & register files, load/store unit, L1 D-cache, L2 cache, main memory — all driven by a single global clock)
Enter GALS…
(figure: the same pipeline partitioned into five clock domains — front-end, integer, floating-point, memory, and external — each with its own clock) [Semeraro et al., HPCA 2002] [Iyer and Marculescu, ISCA 2002]
Outline
- Motivation and background
- Adaptive GALS microarchitecture
- Control mechanisms
- Evaluation methodology
- Results
- Conclusions and future work
Adaptive GALS microarchitecture
(figure: the MCD pipeline with resizable structures — branch predictor, L1 I-cache, issue queues, L1 D-cache, and L2 cache — within the front-end, integer, floating-point, memory, and external domains)
Adaptive GALS operation
(figure: the same pipeline, shown as a structure such as the L1 I-cache resizes while its domain's clock frequency changes to match)
Resizable cache organization
- Access the A part first, then the B part on a miss
- Swap the A and B blocks on an A miss that hits in B
- Select the A/B split according to application phase behavior (a sketch of this access protocol follows)
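A minimal Python sketch of the A/B access protocol above, assuming the set's ways are kept in MRU order; the ABCache name and interface are ours, for illustration only.

```python
class ABCache:
    """Toy model of one set of the resizable cache.

    The set's ways are split into a primary A partition (probed first)
    and a secondary B partition (probed only on an A miss). Ways are
    kept in MRU order, so ways[0] is the most recently used block.
    """

    def __init__(self, ways, a_size):
        self.ways = list(ways)   # block tags, index 0 = MRU
        self.a_size = a_size     # number of ways in the A partition

    def access(self, tag):
        """Return 'A', 'B', or 'miss' for this reference."""
        if tag in self.ways:
            pos = self.ways.index(tag)
            hit = 'A' if pos < self.a_size else 'B'
            # Moving the hit block to the MRU slot realizes the slide's
            # swap rule: a B hit migrates into A, displacing an A block
            # toward B.
            self.ways.pop(pos)
            self.ways.insert(0, tag)
            return hit
        # Miss: fill into the MRU slot, evicting the LRU block.
        self.ways.pop()
        self.ways.insert(0, tag)
        return 'miss'

# e.g. ABCache("ABCD", a_size=1).access("C") returns 'B',
# and C then occupies the single A way.
```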
Resizable cache control
- Maintain MRU-position hit counters MRU[0..3], ordered from MRU to LRU; each hit increments the counter for the hit block's current MRU position (e.g., a hit on the MRU block increments MRU[0])
- The counters give the A and B hits for every possible split:
  - Config A1/B3: hits A = MRU[0]; hits B = MRU[1] + MRU[2] + MRU[3]
  - Config A2/B2: hits A = MRU[0] + MRU[1]; hits B = MRU[2] + MRU[3]
  - Config A3/B1: hits A = MRU[0] + MRU[1] + MRU[2]; hits B = MRU[3]
  - Config A4/B0: hits A = MRU[0] + MRU[1] + MRU[2] + MRU[3]; hits B = 0
- Calculate the cost of each possible configuration:
  - A access cost = (hits A + hits B + misses) * Cost A  (every access probes A first)
  - B access cost = (hits B + misses) * Cost B
  - Miss access cost = misses * Cost Miss
  - Total access cost = A + B + Miss (normalized to frequency)
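This per-interval bookkeeping is straightforward to express in code. Below is a minimal Python sketch of the selection step, assuming cost_a[k] and cost_b[k] are per-access latencies indexed by the number of A ways and already normalized to the frequency each split permits; the function name and argument layout are ours, not the paper's hardware.

```python
def best_split(mru, misses, cost_a, cost_b, cost_miss):
    """Choose the A partition size (in ways) with the lowest total cost.

    mru[i]  -- hits to the block in MRU position i over the interval
    cost_a[k], cost_b[k] -- per-access latency of the A and B parts
        when A holds k ways (frequency-normalized); cost_b[len(mru)]
        should be 0, since the B part is empty in the A4/B0 split.
    """
    accesses = sum(mru) + misses         # every reference probes A
    best_k, best_cost = None, None
    for k in range(1, len(mru) + 1):     # splits A1/B3 .. A4/B0
        hits_b = sum(mru[k:])            # hits that would land in B
        total = (accesses * cost_a[k]              # A probes
                 + (hits_b + misses) * cost_b[k]   # A misses probe B
                 + misses * cost_miss)             # true misses
        if best_cost is None or total < best_cost:
            best_k, best_cost = k, total
    return best_k
```

Because the MRU counters summarize every possible split at once, one pass over four candidate configurations per interval suffices; no trial reconfiguration is needed.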
Resizable issue queue control
- Measures the exploitable ILP for each possible queue size
- A timestamp counter is reset at the start of each interval and incremented every cycle
- During rename, a destination register is given a timestamp equal to the timestamp of its slowest source operand plus the execution latency
- The maximum timestamp MAX_N is maintained for each of the four possible queue sizes, over N fetched instructions (N = 16, 32, 48, 64)
- ILP is estimated as N / MAX_N
- The queue size with the highest ILP (normalized to frequency) is selected
- Read the paper for the details
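A schematic Python version of this estimate, assuming instructions arrive in fetch order as (dest, sources, latency) tuples; that encoding and the function name are our illustration of the mechanism, not the paper's hardware.

```python
def estimate_ilp(instrs, sizes=(16, 32, 48, 64)):
    """Estimate exploitable ILP for each candidate queue size.

    instrs -- iterable of (dest_reg, src_regs, latency) in fetch order.
    A destination's timestamp is its slowest source's timestamp plus
    the execution latency, i.e., the dataflow critical-path height.
    After N instructions, the maximum timestamp MAX_N bounds how long
    any schedule must take, so ILP is estimated as N / MAX_N.
    """
    ts = {}                          # register -> ready timestamp
    max_ts = 0
    ilp = {}
    for n, (dest, srcs, latency) in enumerate(instrs, start=1):
        ready = max((ts.get(s, 0) for s in srcs), default=0)
        ts[dest] = ready + latency
        max_ts = max(max_ts, ts[dest])
        if n in sizes:               # snapshot MAX_N at each queue size
            ilp[n] = n / max_ts
    return ilp  # pick the size with highest ILP, normalized to frequency
```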
Resizable hardware – some details
- Front-end domain
  - I-cache "A": 16KB 1-way, 32KB 2-way, 48KB 3-way, 64KB 4-way
  - Branch predictor sized with the I-cache
    - gshare PHT: 16KB–64KB
    - Local BHT: 2KB–8KB
    - Local PHT: 1024 entries
    - Meta: 16KB–64KB
- Load/store domain
  - D-cache "A": 32KB 1-way, 64KB 2-way, 128KB 4-way, 256KB 8-way
  - L2 cache "A" sized with the D-cache: 256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way
- Integer and floating-point domains
  - Issue queue: 16, 32, 48, or 64 entries
Evaluation methodology
- SimpleScalar and Cacti
- 40 benchmarks from SPEC, Mediabench, and Olden
- Baseline: best overall performing fully synchronous design out of 1,024 simulated configurations
- Adaptive MCD costs imposed:
  - Additional branch penalty of 2 integer domain cycles and 1 front-end domain cycle (overpipelined)
  - Frequency penalty of as much as 31%
  - Mean PLL locking time of 15 µs
- Program-Adaptive: profile the application and pick the best adaptive configuration for the whole program
- Phase-Adaptive: use the online cache and issue queue control mechanisms
Performance improvement
(figure: per-benchmark performance improvement for the Mediabench, Olden, and SPEC suites)
Phase behavior – art
(figure: issue queue entries selected over a 100 million instruction window)
Phase behavior – apsi
(figure: D-cache "A" size selected — 32KB, 64KB, 128KB, or 256KB — over a 100 million instruction window)
Performance summary
- Program-Adaptive: 17% performance improvement
- Phase-Adaptive: 20% performance improvement
  - Automatic
  - Never degrades performance across the 40 applications
  - Few phases in the chosen application windows – could perhaps do better
- Distribution of chosen configurations for Program-Adaptive:
  - Integer IQ: 16 entries 85%, 32 entries 5%, 48 entries 5%, 64 entries 5%
  - FP IQ: 16 entries 73%, 32 entries 15%, 48 entries 8%, 64 entries 5%
  - D/L2 cache: 32KB/256KB 50%, 64KB/512KB 18%, 128KB/1MB 23%, 256KB/2MB 10%
  - I-cache: 16KB 55%, 32KB 18%, 48KB 8%, 64KB 20%
Domain frequency versus IQ size
(figure: domain clock frequency as a function of issue queue size)
Conclusions
- Application phase behavior can be exploited to improve performance in addition to saving power
- The GALS approach is key to localizing the impact of slowing the clock
- The cache and queue control mechanisms can evaluate all possible configurations within a single interval
- The phase-adaptive approach improves performance by as much as 48% and by an average of 20%
Future work
- Explore multiple adaptive structures in each domain
- Better account for the branch predictor
- Resize the instruction cache by sets rather than ways
- Explore better issue queue design alternatives
- Build circuits
- Dynamically customized heterogeneous multi-core architectures using phase-adaptive GALS cores