Application-Specific Hardware Computing Without Processors Mihai Budiu October 6, 2001 SOCS-4.

Presentation transcript:

2 Complexity Pentium Pentium III

3 A New Approach CPU Compiler Program Executable Reconfigurable hw Configuration Compiler CAD Tool Program

4 Outline Reconfigurable Hardware Application-Specific Hardware (ASH) ASH Properties Conclusions

5 Reconfigurable Hardware Universal gates and/or storage elements Interconnection network Programmable switches

6 Main RH Ingredient: RAM Cell Switch controlled by a 1-bit RAM cell. Universal gate = RAM: the inputs a0, a1 address the stored data. [Figure: RAM cell with data in and control lines]

7 Place and Route
int reverse(int x) { int k,r=0; for (k=0; k<64; k++) { r |= x&1; x = x >> 1; r = r << 1; } }
int func(int* a,int *b) { int j,sum=0; for (j=0; *a>0; j++) sum+=reverse(*b

8 RH Today [Figure: an application split into a C program (compiler) and Verilog (CAD); labels: OS support, communication, manual]

9 Three Models of Computation CPU: universal, interpretation. ASIC: custom, direct execution. RH: universal, direct execution, defect tolerance.

10 Outline Reconfigurable Hardware Application-Specific Hardware (ASH) ASH Properties Conclusions

11 Application-Specific Hardware Reconfigurable hardware HLL program Compiler Circuit

12 CASH: Compiling for ASH Memory partitioning Interconnection net Circuits C Program RH

13 Stages of Compilation 1. Program: int reverse(int x) { int k,r=0; for (k=0; k<64; k++) { r |= x&1; x = x >> 1; r = r << 1; } } 2. Split-phase Abstract Machines: computations & local storage, unknown latency ops. 3. Configurations placed independently 4. Placement on chip

14 Split-phase Abstract Machines SAM 1 SAM 2 SAM 3 CFG

15 Hyperblock => SAM Single-entry, multiple exit May contain loops

16 Computation = Dataflow Programs: x = a & 7; ... y = x >> 2; Circuits: [Figure: a and 7 feed an & node; its output x and the constant 2 feed a >> node] Variables become wires.

17 Speculation if (x > 0) y = -x; else y = b*x; [Figure: dataflow circuit with inputs x, b, 0 and output y; regions labeled Computation and Predicates]

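18 Loops = Pipelining for (i=0; i < 10; i++) a[i] += i; [Figure: loop circuit with load, add, and store nodes addressed from &a[0], an increment node for i starting at 0, and elements a[0], a[1], a[2], a[3] in flight]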

19 Example int f(void) { int i=0, j = 0; for (; i < 10; i++) j += i; return j; }

20 Outline Reconfigurable Hardware Application-Specific Hardware (ASH) ASH Properties Conclusions

21 Defect Tolerance CPU One defect: chip useless ASH Can reconfigure around defects

22 Power Consumption CPU: 100+ W, 30M transistors, 2 GHz. ASH: 1 SAM active, all others idle.

23 Verification CPU: huge effort, extremely complex. ASH: program translation validation is feasible. [Figure: program -> compiler -> program, checking P in = P out]

24 CAD Tools CPU Lots of exceptions handled manually Very long time ASH Local structures Interactive compilation

25 Circuit Size [Plot: number of operations per circuit] All circuits for all programs in SpecINT95 and Mediabench

26 Total Size (Largest 2 Programs)
Benchmark     jpeg_e       147.vortex
Lines         26,881       67,210
SAMs          1,331        1,433
FP
Load/store    8,693        24,913
Call/ret      1,964        9,602
Predicates    8,167        39,195
Arithmetic    1,022,023    1,448,933
Mux           200, ,839
Registers     76,722       32,850
Row groups: Units, Bit-ops

27 Implications Enough resources in the near future A case for datapath oriented RH: –a better match for computation –high density –fast configuration –more amenable to compilation –few predicate operations

28 Instruction-Level Parallelism CPU Wide execution path (4-6) Low sustained ILP (~1.5) ASH ILP statically extracted Sustained >3

29 ILP [Plot: average ILP per circuit]

30 Cost CPU Plant = 3B$ Mask = 100K$ (need 20+) Design = ?M$ ASH Can use defective chips Same masks for all chips Design = free

31 Summary Microprocessor complexity becomes overwhelming Application-Specific Hardware (ASH) translates applications into hardware ASH has novel properties and promises to scale well with increasing resources

32 Extras CPU + ASH Speculation and critical paths Computing predicates

33 CPU+ASH core computation support computation + OS + VM CPU ASH Memory HLL Program

34 Speculation and Eager Muxes if (x > 0) y = -x; else y = b*x; [Figure: dataflow circuit with inputs x, b, 0 and output y; one path is marked slow; regions labeled Computation and Predicates]

35 Computing Predicates Correct for irreducible graphs Correct even when speculatively computed Can be eagerly computed