The CAE Architecture: Decoupled Program Control for Energy-Efficient Performance
Ronny Krashinsky and Michael Sung


Change in project direction from the original proposal
- The initial idea of energy-efficient program control relied largely on existing techniques.
- New idea: combine existing work on decoupled architectures to simplify program control and issue logic for high-performance microprocessors.
- Basic premise: program control is inherently separate from the other types of instructions (access and execute). We propose to decouple program control from the rest of the instruction stream(s) to uncover ILP.

Motivation for Decoupled Architectures
- Increased performance by exploiting fine-grained parallelism between access and execute functions.
- Decoupling access from execution allows the access processor to run ahead, or "slip", with respect to the execute processor: dynamic reordering.
  - Memory latency toleration.
  - Dynamic loop unrolling: exposes ILP between loop iterations and hides functional-unit latencies by overlapping the execution of different iterations.
- Simplified issue/decode logic: much simpler than complex superscalar structures (issue window, ROB, bypass network).
- Scalability, a direct consequence of the simplified logic:
  - Superscalar processors need ever-larger issue windows, which do not scale (Palacharla/Agarwal papers).
  - Decoupled machines simply lengthen their queues to allow more slip.
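The slip behavior above can be illustrated with a minimal sketch. The model is hypothetical (the processor names, queue depth, and latency are invented for illustration, not taken from any concrete decoupled machine): an access processor (AP) streams loads into a bounded queue while the execute processor (EP) drains them once the memory latency has elapsed.

```python
from collections import deque

MEM_LATENCY = 4      # cycles for a load to return (illustrative)
QUEUE_DEPTH = 8      # architectural "slip" limit (illustrative)

def run(n_loads):
    load_q = deque()              # (ready_cycle, tag) pairs in flight
    ap_pc = ep_done = cycle = max_slip = 0
    while ep_done < n_loads:
        # AP runs ahead as long as the queue has room
        if ap_pc < n_loads and len(load_q) < QUEUE_DEPTH:
            load_q.append((cycle + MEM_LATENCY, ap_pc))
            ap_pc += 1
        # EP consumes the head once its data has returned
        if load_q and load_q[0][0] <= cycle:
            load_q.popleft()
            ep_done += 1
        max_slip = max(max_slip, ap_pc - ep_done)
        cycle += 1
    return cycle, max_slip

print(run(32))  # → (36, 4)
```

In this toy model the slip settles at exactly the memory latency, so the AP hides that latency entirely and the run finishes in n_loads + MEM_LATENCY cycles, without any issue window or reorder buffer.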

Prior Work
- Separation of access and execute functions: IBM 360/370, CDC 6600, CDC 7600, CRAY-1.
- Explicit partitioning of access and computation functions:
  - J. Smith, PIPE (compile-time splitting)
  - G. Tyson, MISC (Multiple Instruction Stream Computer), a descendant of PIPE
  - A. Pleszkun, SMA (Structured Memory Access) architecture
  - J. Smith, Astronautics Corporation's ZS-1 (splits fixed-point/addressing operations from floating-point operations)
  - A. Wulf, WM architecture
  - IBM's FOM (FORTRAN Oriented Machine)
- Shares similarities with trace processors.
- Technically, superscalar machines already have a slightly decoupled nature.

Decoupled Program Control: CAE Architecture
- Hierarchy of decoupling: three levels (Control, Access, and Execute).
- Control flow is the most elastic, providing ample instructions for both the access and execute pipelines.
- [Block diagram: control (CP), access (AP), and execute (EP) processors, each with its own cache (C$, A$, E$), sharing a common memory (MEM).]
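One way to picture the three-way split is to partition a simple reduction loop into control, access, and execute streams that communicate only through queues. This is a sketch of the idea, not the CAE ISA: the function and queue names (cae_sum, c2a_q, c2e_q, a2e_q) are invented for this example.

```python
from collections import deque

def cae_sum(a):
    """Partition `for i in range(len(a)): s += a[i]` into three streams."""
    c2a_q, c2e_q, a2e_q = deque(), deque(), deque()
    # Control stream: resolves the loop branch; needs no data from below
    for i in range(len(a)):
        c2a_q.append(i)        # tell AP which iteration to access
        c2e_q.append("add")    # tell EP which operation to execute
    c2e_q.append("halt")
    # Access stream: turns iteration numbers into loaded operands
    while c2a_q:
        a2e_q.append(a[c2a_q.popleft()])
    # Execute stream: pure computation on queued operands
    s = 0
    while (op := c2e_q.popleft()) != "halt":
        s += a2e_q.popleft()
    return s

print(cae_sum([3, 1, 4, 1, 5]))  # → 14
```

Because this loop's trip count is known up front, the control stream here can run arbitrarily far ahead of both data pipelines, which is exactly the "elastic" property the slide claims for control flow.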

Decoupled Program Control Benefits
- Program control flow can be determined early, without waiting on the data pipelines.
- Provides "out-of-order" execution without the complexity; inherits the memory-latency toleration of DAE architectures.
- Simplified issue logic: can be implemented with small structures/queues.
- Allows non-speculative instruction prefetching.
  - Because of prefetching, we can shrink structures like caches, potentially reducing critical paths as well as power.
  - Still provides a performance advantage for streaming applications, etc., where the lack of locality mitigates the performance advantage of caches.

Issues with Decoupled Program Control
- Deadlock between queues:
  - Queue size determines how much slip can occur.
  - May require draining queues or explicit queue-manipulation (push/pop) instructions.
- Performance issues from feedback of values that program control depends on from the execute/access pipelines:
  - Such dependencies limit how far the control pipeline can slip ahead of the access and execute pipelines.
  - Likewise for feedback dependencies from the execute pipeline back to the access pipeline.
- Complexity in queue interactions (correctness, verification, ease of programming):
  - Fundamentally an issue of how to synchronize the instruction streams correctly.
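The feedback problem can be made concrete with a small model (hypothetical: the depths, latency, and feedback pattern are arbitrary choices for illustration). When some branches are data-dependent on the execute pipeline's result, the control processor must stall until that result arrives, collapsing the achievable slip.

```python
from collections import deque

def max_slip(feedback_every):
    """Max slip CP achieves when every `feedback_every`-th branch is
    data-dependent on EP's latest result (0 = no feedback at all)."""
    QUEUE_DEPTH, EP_LATENCY, N = 8, 4, 64
    inflight = deque()            # (retire_cycle, index) issued by CP
    cp_pc = ep_done = cycle = best = 0
    while ep_done < N:
        # CP stalls when the next branch needs a result EP hasn't produced
        blocked = (feedback_every and cp_pc % feedback_every == 0
                   and cp_pc > ep_done)
        if cp_pc < N and len(inflight) < QUEUE_DEPTH and not blocked:
            inflight.append((cycle + EP_LATENCY, cp_pc))
            cp_pc += 1
        # EP retires the oldest instruction once its latency has elapsed
        if inflight and inflight[0][0] <= cycle:
            inflight.popleft()
            ep_done += 1
        best = max(best, cp_pc - ep_done)
        cycle += 1
    return best

print(max_slip(0), max_slip(1))  # → 4 1
```

With no feedback the slip grows to the full execute latency; with a data-dependent branch every iteration it collapses to one, serializing the streams — which is why classifying control-flow dependencies in applications (see the roadmap) matters for CAE.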

Progress and Roadmap
Completed:
- Literature search of decoupled architectures.
- Initial ISA exploration and microarchitectural development for the proposed CAE architecture.
Roadmap:
- Classify control flow and dependencies in applications.
- ISA development for each instruction stream (control, access, execute).
- Complete architectural specification of the CAE architecture (TRS).
- Implementation of an RTL-level simulator (SyCHOSys).
- Simulation and performance analysis (SyCHOSys).