Reconfigurable Microprocessors Lih Wen Koh 05s1 COMP4211 presentation 18 May 2005.

2 Presentation Overview Current Research Direction Related Work Experiments What Next?

3 Current Research Direction [Yeager96] Wide superscalar, out-of-order execution processor cores exploit ILP, but true data dependencies are inherent in application programs. MIPS R10k, NetBurst, AMD etc. use a bypass network to forward just-computed results, which allows back-to-back issue of dependent instructions. The complexity of the bypass network grows quadratically w.r.t. issue width.
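
The quadratic growth can be illustrated with a first-order count of forwarding paths. This is only a sketch under stated assumptions: one functional unit per issue slot, two operand inputs per unit, and every unit able to forward to every operand input. The function name and parameters are illustrative and do not come from the slides.

```python
def bypass_paths(issue_width, operands_per_unit=2, forwarding_stages=1):
    """First-order count of bypass paths (assumed model, not a measured figure).

    Each of the `issue_width` result buses must reach every operand input
    of every functional unit, once per pipeline stage that forwards results.
    """
    return operands_per_unit * issue_width ** 2 * forwarding_stages


if __name__ == "__main__":
    for width in (2, 4, 8):
        print(width, bypass_paths(width))
```

Under this model, doubling the issue width from 4 to 8 quadruples the number of paths, which is the scaling the slide refers to.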

4 Current Research Direction Observation 1: Multi-cycle broadcast. Wire delays are accounted for in Intel NetBurst. This allows a higher processor clock frequency at the cost of reduced IPC. Observation 2: The FP execution unit is idle most of the time, even in FP-intensive applications (5-10%) [Sassone04].

5 Literature Survey [Epalza04]: Dynamic Reallocation of Functional Units in Superscalar Processors. Switches the execution mode of idle floating-point units to four additional integer ALUs. The added bypass networks add 1 cycle of latency to the modified FPU. 19% speedup for SPECint % speedup for SPECfp2000. Issues: need to improve control for mode switching.

6 Plans Other patterns:

7 Related Work [Palacharla97] Dependence-based (FIFO queues + clustered execution units)
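
A minimal sketch of dependence-based steering into FIFO queues, as commonly described for this style of design: an instruction follows its producer only if that producer is at the tail of a FIFO, and otherwise takes an empty FIFO. The trace format, the function name `steer`, and the choice to record a stall as `None` and carry on are assumptions made for illustration. Issue and draining of the FIFOs are not modelled.

```python
from collections import deque


def steer(instructions, num_fifos=4):
    """Assign each (dest, sources) instruction to a FIFO index, or None on stall."""
    fifos = [deque() for _ in range(num_fifos)]
    producer_fifo = {}  # register name -> FIFO holding its latest producer
    placement = []
    for dest, sources in instructions:
        target = None
        # Try each source operand in order: follow a producer sitting at a tail.
        for src in sources:
            f = producer_fifo.get(src)
            if f is not None and fifos[f] and fifos[f][-1][0] == src:
                target = f
                break
        # Otherwise start a new dependence chain in an empty FIFO.
        if target is None:
            target = next((i for i, q in enumerate(fifos) if not q), None)
        if target is None:
            placement.append(None)  # no empty FIFO: real hardware would stall
            continue
        fifos[target].append((dest, sources))
        producer_fifo[dest] = target
        placement.append(target)
    return placement


trace = [
    ("r1", []),
    ("r2", ["r1"]),
    ("r3", []),
    ("r4", ["r1"]),
    ("r5", ["r2", "r3"]),
]

if __name__ == "__main__":
    print(steer(trace))
```

Because dependent instructions line up behind each other, only the FIFO heads need to be checked for readiness, which is where the complexity saving comes from.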

8 Related Work Extension to rePLay framework [Yehia04]

9 Experiment: Chaining pairs of dependent instructions. [Intel01] Double-speed ALUs. [Vassiliadis96] 3-1 Interlock Collapsing ALUs.

10 Experiment: Chaining pairs of dependent instructions. Modifications to sim-outorder for SimpleScalar PISA.

11 Experiment: Chaining pairs of dependent instructions. 2 CIALUs are sufficient. IPC improvement of ~8%, solely due to savings of broadcast cycles. Reduces utilization of IALUs by ~50%. Reduces the number of queue entries waiting for a result by up to 45%. Up to 25% speedup when broadcast cycles = 4.
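
Why chaining saves broadcast cycles can be seen in a toy timing model. It is a sketch under strong assumptions: a single chain of dependent one-cycle operations, a broadcast latency of 1 meaning ideal back-to-back issue, and a chained pair passing its intermediate result over a local path with no broadcast. The function names are illustrative. The model is not meant to reproduce the measured figures above, since real programs are not one long dependent chain.

```python
import math


def chain_cycles(n, broadcast_cycles, group_size=1, exec_cycles=1):
    """Cycles to complete n dependent operations.

    group_size=1 is the baseline; group_size=2 models chained pairs, where
    only the result leaving each pair pays the extra broadcast latency.
    """
    groups = math.ceil(n / group_size)
    extra_broadcast = (groups - 1) * (broadcast_cycles - 1)
    return n * exec_cycles + extra_broadcast


def speedup(n, broadcast_cycles):
    """Baseline cycles divided by cycles with chained pairs."""
    return chain_cycles(n, broadcast_cycles, 1) / chain_cycles(n, broadcast_cycles, 2)


if __name__ == "__main__":
    for b in (1, 2, 4):
        print(b, chain_cycles(8, b), chain_cycles(8, b, 2), round(speedup(8, b), 2))
```

In this model the benefit is zero with single-cycle broadcast and grows as the broadcast takes more cycles, which matches the direction of the trend reported on the slide.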

12 What Next? Chaining sequences of 3 dependent instructions and other patterns out of the 80. Architectural impact of adding chained units: complexity of the local bypass network etc. Replace chained units by xALUs converted from the CSA trees in an FP multiply/divide unit; this requires exploring the hardware circuits of FP multiply/divide. Develop an adaptive configuration scheme to best match the interconnections of the swappable xALUs to the patterns of in-flight instructions; this requires determining the most frequent subset of patterns.
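
One way the most frequent patterns might be determined is by counting opcode sequences along producer-consumer chains in an instruction trace. The sketch below assumes a hypothetical trace format of (opcode, dest, sources) tuples and a made-up function name. It is not the tooling used in the experiments, which modified sim-outorder.

```python
from collections import Counter


def dependent_patterns(trace, length=3):
    """Count opcode sequences of `length` in which each instruction
    consumes the result of the previous one."""
    last_writer = {}       # register -> index of its most recent producer
    chains_ending_at = []  # per instruction: opcode chains that end at it
    counts = Counter()
    for i, (opcode, dest, sources) in enumerate(trace):
        chains = [(opcode,)]
        # Deduplicate producers so "add r2, r1, r1" is not counted twice.
        producers = {last_writer[s] for s in sources if s in last_writer}
        for p in producers:
            for chain in chains_ending_at[p]:
                if len(chain) < length:
                    chains.append(chain + (opcode,))
        chains_ending_at.append(chains)
        counts.update(c for c in chains if len(c) == length)
        last_writer[dest] = i
    return counts


trace = [
    ("add", "r1", ["r0"]),
    ("sll", "r2", ["r1"]),
    ("and", "r3", ["r2"]),
    ("add", "r4", ["r9"]),
    ("sll", "r5", ["r4"]),
    ("and", "r6", ["r5", "r3"]),
]

if __name__ == "__main__":
    print(dependent_patterns(trace).most_common())
```

Ranking the resulting counts would give a candidate subset of patterns for the configurable units to support.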

References
[Vassiliadis96] James Phillips and Stamatis Vassiliadis. High-Performance 3-1 Interlock Collapsing ALUs.
[Yeager96] Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro.
[Palacharla97] Subbarao Palacharla, Norman P. Jouppi, J.E. Smith. Complexity-Effective Superscalar Processors. ISCA.
[Intel01] Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, Patrice Roussel. The Microarchitecture of the Pentium® 4 Processor. Intel Technology Journal Q
[Epalza04] Marc Epalza, Paolo Ienne, Daniel Mlynek. Dynamic Reallocation of Functional Units in Superscalar Processors. In the 9th Asia-Pacific Computer Systems Architecture Conference (ACSAC).
[Yehia04] Sami Yehia and Olivier Temam. From Sequences of Dependent Instructions to Functions: A Complexity-Effective Approach for Improving Performance without ILP or Speculation.
[Sassone04] Peter G. Sassone and D. Scott Wills. Multicycle Broadcast Bypass: Too Readily Overlooked. Proceedings of the Workshop on Complexity-Effective Design (WCED), May 2004.

Thank You

15 Overview of Research Topic Goal of this research: “investigate the feasibility and potential benefit of effective, automated runtime compilation and execution of software binaries on reconfigurable microprocessors”

16 Motivations Improved execution performance by exploiting parallelism and redundancy in hardware. Adaptation of hardware resources based on the dynamic behaviour of programs. Availability of runtime profile allows exploitation of runtime optimizations otherwise difficult to exploit at compile time. Compilation at the binary level allows execution of legacy software binaries. Runtime compilation allows transparent migration of software code to hardware.