ASH: A Substrate for Scalable Architectures Mihai Budiu Seth Copen Goldstein CALCM Seminar, March 19, 2002.

Presentation transcript:

ASH: A Substrate for Scalable Architectures Mihai Budiu Seth Copen Goldstein CALCM Seminar, March 19, 2002

Resources

CPU Problems
Complexity
Power
Global signals
Limited issue window => limited ILP
We propose an architecture with none of these limits.

Outline
Scalability
Reconfigurable hardware advantages
A hybrid RH + CPU architecture
CPU and RH as peers
Application Specific Hardware

Computational Bandwidth
CPU: FU * clock freq
RH: unbounded
[Figure: operators *, +, / and the statements a=a+b; b=b+c]

Registers
CPU: fixed
RH: unbounded
[Figure: variables i, j, k, l, m; CPU registers eax, ebx, ecx, edx; spill to sp[0]]

Register Bandwidth
CPU: fixed
RH: unbounded
[Figure: CPU register file with ports R1, R2, R3, W1, W2]

Out-of-Order Execution
CPU: Fetch, Decode, Dispatch, Execute, Commit. In-order. Limited by window.
RH: compiler’s window is unbounded.

Outline
Scalability
Reconfigurable hardware advantages
A hybrid RH + CPU architecture
CPU and RH as peers
Application Specific Hardware

Hybrid system: CPU+RH
CPU: low ILP + OS + VM; generic
RH: high ILP; application-specific
Tight coupling
[Figure: CPU, RH, Memory]

Problem
[Figure: HLL Program, Compiler, CPU, RH, Memory]

Our Solution
General: applicable to today’s software
Automatic: compiler-driven [RISC approach]
Scalable: with clock, hardware and program size
Parallelism: exploit application parallelism
  – bit-level
  – ILP
  – pipeline
  – loop-level

Outline
Scalability
Reconfigurable hardware advantages
A hybrid RH + CPU architecture
CPU and RH as peers
Application Specific Hardware

Peering
Program:
    a( ) { b( ); }
    b( ) { c( ); }
    c( ) { d( ); }
    d( ) { }
[Figure: procedures a, b, c, d placed on CPU and RH]

RH “RPC”
Figure labels: marshalling, control transfer; software procedure call; hardware dependent
[Figure: procedures a, b, c, d and stubs b’, c’, d’ on CPU and RH]
Stubs built automatically.

Stub Synthesis
[Figure: tool flow with the stages Program, Partitioning, Procedures for CPU, Procedures for RH, Stubs, RH Compiler, Configuration, Linker, Executable]

Outline
Scalability
Reconfigurable hardware advantages
A hybrid RH + CPU architecture
CPU and RH as peers
Application Specific Hardware

Application-Specific Hardware
[Figure: HLL program, Compiler, Circuit, Reconfigurable hardware; shown with the hybrid system of HLL Program, Compiler, CPU, RH, Memory]

CASH: Compiling for ASH
[Figure: C Program compiled onto RH as Circuits, Memory partitioning, Interconnection net]

Asynchronous Computation
[Figure: a + operator with data, ready, and ack signals]
Can extend to locally synchronous, globally asynchronous.
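The data/ready/ack signalling on this slide can be modelled in software. The following is a minimal sketch, not the circuit from the talk: the `Wire` type and `adder_step` function are hypothetical names, and a single `ready` flag per wire stands in for both the "ready" signal (set by the producer) and the "ack" signal (cleared by the consumer).

```c
#include <stdbool.h>

/* A value travelling on a wire, together with its "data ready" signal.
 * (Hypothetical model; not taken from the slides.) */
typedef struct {
    int  data;
    bool ready;
} Wire;

/* Model of an asynchronous adder. It fires only when both inputs are
 * ready and the output wire is free, then acknowledges the inputs by
 * clearing their ready flags. Returns true if the operator fired. */
bool adder_step(Wire *a, Wire *b, Wire *out)
{
    if (!a->ready || !b->ready || out->ready)
        return false;          /* wait: no clock forces progress */
    out->data  = a->data + b->data;
    out->ready = true;         /* signal "data ready" downstream */
    a->ready   = false;        /* "ack": producers may send again */
    b->ready   = false;
    return true;
}
```

Calling `adder_step` repeatedly plays the role of the operator idling until its operands arrive; it does nothing when an input is missing or the previous result has not been consumed.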

Dataflow Graphs
    int plus(int x, int y) { return x + y; }

From Control Flow to Data Flow

Conditionals = Speculation
    int cond(int p, int x, int y) {
        int z;
        if (p) z = x;
        else z = y;
        return z;
    }
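One way to read the slide title is that, in a dataflow circuit, both sides of the conditional are computed and a multiplexer picks the result once the predicate is known. The sketch below writes that reading out in C; `mux` and `cond_speculative` are hypothetical names, and the transformation is only safe here because neither side has side effects.

```c
/* Multiplexer: selects one of two values that were both computed. */
int mux(int p, int if_true, int if_false)
{
    return p ? if_true : if_false;
}

/* Same result as cond() on the slide, but with no branch on p before
 * the two candidate values exist (assumed dataflow-style reading). */
int cond_speculative(int p, int x, int y)
{
    int z_then = x;   /* "then" side, computed regardless of p */
    int z_else = y;   /* "else" side, computed regardless of p */
    return mux(p, z_then, z_else);
}
```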

Critical Paths
    if (x > 0) y = -x; else y = b*x;
[Figure: dataflow graph with inputs x, b, 0, operators *, -, >, !, and output y]

Executing Lenient Operators
    if (x > 0) y = -x; else y = b*x;
[Figure: dataflow graph with inputs x, b, 0, operators *, -, >, !, and output y]
Up to 40% performance improvement.
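The slide does not define "lenient", so the sketch below assumes the usual meaning: an operator that may produce its output before all of its inputs have arrived. For the example on the slide, that lets y be produced from the fast -x side without waiting for the slower multiply when x > 0. `Value`, `mux_strict`, and `mux_lenient` are hypothetical names.

```c
#include <stdbool.h>

/* A value plus a flag saying whether it has been computed yet. */
typedef struct {
    int  data;
    bool ready;
} Value;

/* Strict mux: waits until the predicate and both inputs are ready. */
bool mux_strict(Value p, Value t, Value f, int *out)
{
    if (!p.ready || !t.ready || !f.ready)
        return false;
    *out = p.data ? t.data : f.data;
    return true;
}

/* Lenient mux: needs only the predicate and the input it selects. */
bool mux_lenient(Value p, Value t, Value f, int *out)
{
    if (!p.ready)
        return false;
    Value selected = p.data ? t : f;
    if (!selected.ready)
        return false;
    *out = selected.data;
    return true;
}
```

With x = 5, the predicate and -x are available early while b*x is still in flight: the strict mux stalls, the lenient one already delivers -5.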

Pipelining
Pipelined  Cycles
N          903
Y          653

Loop Pipelining
Pipe  FIFO  Cycles
N     0     903
N     1
Y     0     653
Y     1     474
Y     2     408
Y     3

ASH Features
What you code is what you get
  – no hidden control logic
  – really lean hardware (no CAM, decoders, multiported files, etc.)
Compiler has complete control
Dynamic scheduling => latency tolerant
Naturally exploits ILP, even across loop iterations

Conclusions
ASH = Compiler-synthesized hardware
ASH matches program parallelism
Dynamically scheduled RH
ASH scales with
  – clock frequency
  – transistors
  – program size

Backup Slides

Reconfigurable Hardware
Universal gates and/or storage elements
Interconnection network
Programmable switches

Main RH Ingredient: RAM Cell
Switch controlled by a 1-bit RAM cell
Universal gate = RAM
[Figure labels: a0, a1, data, a1 & a2, 0, data in, control]
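The "universal gate = RAM" idea can be illustrated with a two-input lookup table: the inputs form an address, and the addressed RAM cell holds the gate's output. This is a generic sketch of that idea, not the cell design from the slide; `Lut2`, `lut2_eval`, `LUT_AND`, and `LUT_XOR` are hypothetical names.

```c
#include <stdint.h>

/* A 2-input lookup table: four 1-bit RAM cells packed into the low
 * bits of a byte. Bit i holds the output for input address i. */
typedef struct {
    uint8_t cells;
} Lut2;

/* Read the cell addressed by the inputs (a1, a0). */
int lut2_eval(Lut2 lut, int a1, int a0)
{
    unsigned addr = ((unsigned)(a1 & 1) << 1) | (unsigned)(a0 & 1);
    return (lut.cells >> addr) & 1;
}

/* "Configuring" a gate means writing its truth table into the RAM. */
const Lut2 LUT_AND = { 0x8 };  /* only address 11 holds a 1       */
const Lut2 LUT_XOR = { 0x6 };  /* addresses 01 and 10 hold a 1    */
```

Changing the stored bits changes the gate, which is what makes the same hardware reusable for different circuits.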

Stubs
Program:
    a( ) { r = b(b_args); }
    b(b_args) { }
With b on RH:
    a( ) { r = b’(b_args); }
    b’(b_args) {
        send_rh(b_args);
        invoke_rh(b);
        r = receive_rh( );
        return r;
    }

Dispatcher Stubs
Program:
    a( ) { r = b(b_args); }
    b(b_args) { if (x) c( ); return r; }
    c( ) { }
Stub for b:
    b’(b_args) {
        send_rh(b_args);
        invoke_rh(b);
        while (1) {
            com = get_rh_command( );
            if (! com) break;
            (*com)( );
        }
        r = receive_rh( );
        return r;
    }
Annotations on the slide: “Independent of b”, “c’s stub”.

C’s Stub
Program:
    a( ) { r = b(b_args); }
    b(b_args) { if (x) c( ); return r; }
    c( ) { }
Stub for c:
    c’( ) {
        receive_rh(c_args);
        r = c(c_args);
        send_rh(r);
        invoke_rh(return_to_rh);
    }
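The stub pseudocode can be exercised in plain C by replacing the RH with a small software stand-in. Everything below is an assumption made for illustration: the single `int` argument, the rule that b calls c when its argument is positive, the result `arg * 2`, and the way `invoke_rh_b` and `get_rh_command` are simulated. Only the shape of the dispatch loop follows the slides.

```c
#include <stddef.h>

typedef void (*Command)(void);

/* --- Hypothetical software stand-in for the RH side --- */
static int rh_arg;      /* argument marshalled to the RH            */
static int rh_result;   /* result marshalled back to the CPU        */
static int rh_wants_c;  /* 1 while b (on RH) still needs c() to run */
static int c_calls;     /* counts how often c ran on the CPU        */

/* c stays on the CPU. */
void c(void) { c_calls++; }

/* c's stub: runs c on behalf of the RH, then hands control back. */
void c_stub(void)
{
    c();
    rh_wants_c = 0;     /* stands in for invoke_rh(return_to_rh) */
}

void send_rh(int arg) { rh_arg = arg; }

/* Stands in for invoke_rh(b): b calls c only if its argument is > 0. */
void invoke_rh_b(void)
{
    rh_wants_c = (rh_arg > 0);
    rh_result  = rh_arg * 2;
}

/* The RH asks the CPU to run a stub, or returns NULL when b is done. */
Command get_rh_command(void) { return rh_wants_c ? c_stub : NULL; }

int receive_rh(void) { return rh_result; }

/* b's stub: the dispatch loop never names c, so it does not depend
 * on which procedures b happens to call. */
int b_stub(int arg)
{
    send_rh(arg);
    invoke_rh_b();
    for (;;) {
        Command com = get_rh_command();
        if (!com)
            break;
        (*com)();
    }
    return receive_rh();
}
```

In this model `b_stub(3)` triggers one call to `c` through `c_stub`, while `b_stub(0)` returns without any callback.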

Input to Output
    int io(int x) { return x; }

Loops
    int loop() {
        int w = 10;
        while (w > 0) w--;
        return w;
    }

Pointers and Arrays
    int a[10];
    void pointer(int *p) { a[2] += a[4] + *p; }

Pointers and Loops
    int sum() {
        int s = 0;
        int i;
        for (i=0; i < 10; i++)
            s += a[i];
        return s;
    }