Mihai Budiu, Microsoft Research – Silicon Valley. Joint work with Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein, Carnegie Mellon University.

Presentation transcript:

Spatial Computation: Computing without General-Purpose Processors. Mihai Budiu, Microsoft Research – Silicon Valley. Joint work with Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein, Carnegie Mellon University. May 10, 2005

2 Outline: Intro: Problems of current architectures; Compiling Application-Specific Hardware; ASH Evaluation; Conclusions

3 Resources: We do not worry about not having hardware resources. We worry about being able to use hardware resources. [Intel]

4 Complexity: ALUs. Cannot rely on global signals (clock is a global signal). [Figure labels: gate 5 ps, wire 20 ps]

5 Complexity: ALUs. Cannot rely on global signals (clock is a global signal). [Figure labels: gate 5 ps, wire 20 ps] Automatic translation C → HW. Simple, short, unidirectional interconnect. No interpretation. Distributed control, asynchronous. Simple hw, mostly idle.

6 Our Proposal: Application-Specific Hardware. ASH addresses these problems. ASH is not a panacea. ASH complementary to CPU. [Figure: ASH runs high-ILP computation; CPU runs low-ILP computation + OS + VM; both connect to Memory ($)]

7 Outline: Problems of current architectures; CASH: Compiling Application-Specific Hardware; ASH Evaluation; Conclusions

8 Application-Specific Hardware [Figure: C program, Compiler, Dataflow IR, HW backend, Reconfigurable/custom hw]

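9 Computation Dataflow. Program: x = a & 7; ... y = x >> 2; IR: inputs a and 7 feed an & node; its result x and the constant 2 feed a >> node. Circuits: an "& 7" stage feeding a ">> 2" stage. No interpretation.
Program / IR / Circuits
Operations / Nodes / Pipeline stages
Variables / Def-use edges / Channels (wires)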
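The program-to-IR mapping on slide 9 can be sketched in C as a tiny dataflow graph: each operation is a node and each variable is a def-use edge, stored here as the index of the producing node. The type and function names (`Node`, `OpKind`, `eval_graph`, `slide9_example`) are hypothetical and exist only for this illustration; this is not CASH's actual IR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical node kinds for this sketch (not CASH's real IR). */
typedef enum { OP_INPUT, OP_CONST, OP_AND, OP_SHR } OpKind;

typedef struct {
    OpKind   kind;
    int      lhs, rhs;  /* def-use edges: indices of the producer nodes */
    uint32_t value;     /* constant payload, or the result once fired   */
} Node;

/* Fire every node once in index order; producers must precede consumers. */
static uint32_t eval_graph(Node *g, int n, uint32_t a)
{
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        switch (g[i].kind) {
        case OP_INPUT: g[i].value = a; break;
        case OP_CONST: break;  /* value already holds the constant */
        case OP_AND:   g[i].value = g[g[i].lhs].value & g[g[i].rhs].value; break;
        case OP_SHR:   g[i].value = g[g[i].lhs].value >> g[g[i].rhs].value; break;
        }
    }
    return g[n - 1].value;     /* last node is the graph's output */
}

/* Graph for the slide's program: x = a & 7; y = x >> 2; */
static uint32_t slide9_example(uint32_t a)
{
    Node g[] = {
        { OP_INPUT, -1, -1, 0 },  /* 0: a          */
        { OP_CONST, -1, -1, 7 },  /* 1: 7          */
        { OP_AND,    0,  1, 0 },  /* 2: x = a & 7  */
        { OP_CONST, -1, -1, 2 },  /* 3: 2          */
        { OP_SHR,    2,  3, 0 },  /* 4: y = x >> 2 */
    };
    return eval_graph(g, 5, a);
}
```

In this sketch the variable x never exists as storage; it is only the edge from node 2 to node 4, which is the sense in which variables become def-use edges and then wires.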

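10 Basic Computation = Pipeline Stage [Figure: a + operation followed by a latch, with data, valid, and ack signals]

11 Asynchronous Computation [Figure: a + operation followed by a latch, with data, valid, and ack signals]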
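A minimal software sketch of the data/valid/ack handshake from slides 10 and 11, assuming one latch per stage. The `Stage` type and the functions `stage_offer` and `stage_forward` are hypothetical names for this illustration; the real circuits are asynchronous hardware, not function calls.

```c
#include <assert.h>

/* One pipeline stage: a latch plus a "full" bit standing in for valid. */
typedef struct {
    int full;  /* 1 while the latch holds an unconsumed value (valid high) */
    int data;  /* the latched value */
} Stage;

/* Producer side: offer a value. Returns the ack: 1 if latched, 0 if busy. */
static int stage_offer(Stage *s, int data)
{
    assert(s != 0);
    if (s->full)
        return 0;       /* no ack: the previous value is not consumed yet */
    s->data = data;
    s->full = 1;        /* raise valid toward the consumer */
    return 1;
}

/* Consumer side: move the latched value into the next stage if it acks. */
static int stage_forward(Stage *s, Stage *next)
{
    if (!s->full)
        return 0;       /* nothing valid to send */
    if (!stage_offer(next, s->data))
        return 0;       /* downstream busy: keep holding the value */
    s->full = 0;        /* acked: the latch is free again */
    return 1;
}
```

Each stage only talks to its immediate neighbours, so no global signal is needed to decide when a value may move.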

12 Distributed Control Logic [Figure: + and - stages connected by ack and rdy signals; labels: global FSM; short, local wires]

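13 MUX: Forward Branches. if (x > 0) y = -x; else y = b*x; [Figure: dataflow graph with inputs x, b, 0 and nodes *, -, >, ! feeding a mux that produces y; the critical path is marked] Conditionals ⇒ Speculation. SSA = no arbitration.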
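The "Conditionals ⇒ Speculation" idea on slide 13 can be sketched in C: both sides of the branch are computed unconditionally and a multiplexer picks one using the predicate. The function names are hypothetical, and the arithmetic is widened to `long long` (an assumption of this sketch) so that the side which is not selected cannot overflow.

```c
#include <assert.h>

/* Reference: the branching form from the slide (widened to long long). */
static long long branch_version(int x, int b)
{
    long long y;
    if (x > 0) y = -(long long)x;
    else       y = (long long)b * x;
    return y;
}

/* Speculative form: evaluate both sides, then select with a mux. */
static long long mux_version(int x, int b)
{
    long long neg  = -(long long)x;     /* "then" side, always computed */
    long long mul  = (long long)b * x;  /* "else" side, always computed */
    int       pred = x > 0;             /* predicate drives the mux     */
    return pred ? neg : mul;
}

/* Exhaustive check over a small range that both forms agree. */
static void check_equivalence(void)
{
    for (int x = -8; x <= 8; x++)
        for (int b = -8; b <= 8; b++)
            assert(branch_version(x, b) == mux_version(x, b));
}
```

Because each value has a single definition (SSA), the two sides never write the same location, which is why the mux needs no arbitration.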

14 Control Flow ⇒ Data Flow [Figure: Merge (label); Gateway with data and predicate inputs; Split (branch) with predicate p and its negation]

15 Loops. int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [Figure: loop dataflow graph with i, +1, <, *, + and sum (initially 0), a ! on the predicate, and ret]
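A rough C sketch of how the loop on slide 15 can be expressed with the merge and gateway nodes of slide 14. Tokens carry a valid bit; `merge` forwards whichever input is present, and `gateway` passes its data only when the predicate is true. The names `Token`, `merge`, `gateway`, and `loop_dataflow` are hypothetical, and the sequential `for (;;)` only stands in for tokens circulating around the back edges.

```c
#include <assert.h>

typedef struct { int valid; int value; } Token;

/* Gateway: let the data token through only if the predicate holds. */
static Token gateway(Token data, int pred)
{
    Token none = {0, 0};
    return (data.valid && pred) ? data : none;
}

/* Merge: forward whichever input token is present (entry or back edge). */
static Token merge(Token a, Token b)
{
    return a.valid ? a : b;
}

/* Dataflow form of: sum = 0; for (i = 0; i < n; i++) sum += i*i; */
static int loop_dataflow(int n)
{
    assert(n >= 0 && n <= 1000);        /* keep the sum within int range */
    Token i_init = {1, 0}, sum_init = {1, 0};
    Token i_back = {0, 0}, sum_back = {0, 0};
    for (;;) {
        Token i   = merge(i_init, i_back);
        Token sum = merge(sum_init, sum_back);
        i_init.valid = sum_init.valid = 0;   /* entry tokens are consumed */
        int p = i.value < n;                 /* loop predicate            */
        Token out = gateway(sum, !p);        /* exit edge, taken when !p  */
        if (out.valid)
            return out.value;
        Token i_in = gateway(i, p);          /* tokens entering the body  */
        Token s_in = gateway(sum, p);
        sum_back = (Token){1, s_in.value + i_in.value * i_in.value};
        i_back   = (Token){1, i_in.value + 1};
    }
}
```

With n = 100, as on the slide, the exit gateway releases the sum of squares of 0 through 99.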

16 Pipelining. int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [Figure, step 1: loop graph with i, +, <=, * (pipelined multiplier, 8 stages), + and sum]

17 Pipelining [Figure, step 2]

18 Pipelining [Figure, step 3]

19 Pipelining [Figure, step 4]

20 Pipelining [Figure, step 5: values i=1 and i=0 in flight]

21 Pipelining [Figure, step 6: values i=1 and i=0 in flight]

22 Pipelining [Figure, step 7: i's loop and sum's loop; labels: long latency pipe, predicate]

23 Pipelining: Predicate ack edge is on the critical path. [Figure: critical path marked across i's loop and sum's loop]

24 Pipeline balancing [Figure, step 7: i's loop, sum's loop, decoupling FIFO]

25 Pipeline balancing [Figure: i's loop, sum's loop, critical path, decoupling FIFO]

26 Procedures [Figure: Caller and Callee connected by Call, Argument, Return, and Continuation]

27 Memory Access [Figure: LD, ST, LD operations reach a monolithic memory through a pipelined arbitrated network] Local communication, global structures. Future work: fragment this!

28 Outline: Problems of current architectures; Compiling ASH; ASH Evaluation; Conclusions

29 Evaluating ASH. [Figure, tool flow: C, CASH core, Verilog back-end, Synopsys and Cadence P/R (commercial tools), ASIC; Mem] 180nm std. cell library, 2V (~1999 technology). Mediabench kernels (1 hot function/benchmark). ModelSim (Verilog simulation) for performance numbers.

30 Compile Time. [Figure, tool flow: C, CASH core, Verilog back-end, Synopsys and Cadence P/R, ASIC; Mem. Figure labels: 200 lines; 20 seconds; 10 seconds; 20 minutes; 1 hour]

31 ASH Area (mm²) [Chart reference points: P4: 217; minimal RISC core]

32 ASH vs 600MHz CPU [4-wide OOO, 0.18 μm]

33 Bottleneck: Memory Protocol. Enabling dependent operations requires round-trip to memory. Exploring novel memory access protocols. [Figure: LD and ST, LSQ, Memory]

34 Power (mW) [Chart reference points: DSP 110; μP 4000; Xeon [+cache] 67000]

35 Energy-delay

36 Energy Efficiency (op/nJ)

37 Energy Efficiency [Operations/nJ] [Chart categories: General-purpose DSP, Dedicated hardware, ASH media kernels, FPGA, Microprocessors, Asynchronous μP; a 1000x gap is marked]

38 Outline: Problems of current architectures; Compiling ASH; Evaluation; Related work, Conclusions

39 Bibliography
Dataflow: A Complement to Superscalar. Mihai Budiu, Pedro Artigas, and Seth Copen Goldstein. ISPASS 2005.
Spatial Computation. Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein. ASPLOS 2004.
C to Asynchronous Dataflow Circuits: An End-to-End Toolflow. Girish Venkataramani, Mihai Budiu, Tiberiu Chelcea, and Seth Copen Goldstein. IWLS 2004.
Optimizing Memory Accesses For Spatial Computation. Mihai Budiu and Seth Copen Goldstein. CGO 2003.
Compiling Application-Specific Hardware. Mihai Budiu and Seth Copen Goldstein. FPL 2002.

40 Related Work: Optimizing compilers; High-level synthesis; Reconfigurable computing; Dataflow machines; Asynchronous circuits; Spatial computation. We target an extreme point in the design space: no interpretation, fully distributed computation and control.

41 ASH Design Point: Design an ASIC in a day. Fully automatic synthesis to layout. Fully distributed control and computation (spatial computation) – Replicate computation to simplify wires. Energy/op rivals custom ASIC. Performance rivals superscalar. E × t 100 times better than any processor.

42 Conclusions: Spatial computation strengths
Feature: Advantages
No interpretation: Energy efficiency, speed
Spatial layout: Short wires, no contention
Asynchronous: Low power, scalable
Distributed: No global signals
Automatic compilation: Designer productivity

43 Backup Slides: Absolute performance; Control logic; Exceptions; Leniency; Normalized area; ASH weaknesses; Splitting memory; Recursive calls; Leakage; Why not compare to…; Targeting FPGAs

44 Absolute Performance [Chart; CPU range marked]

45 Pipeline Stage [Figure: Reg and C elements; signals rdy in, ack out, rdy out, ack in, data in, data out]

46 Exceptions. Strictly speaking, C has no exceptions. In practice hard to accommodate exceptions in hardware implementations. An advantage of software flexibility: PC is single point of execution control. [Figure: ASH runs high-ILP computation; CPU runs low-ILP computation + OS + VM + exceptions; both connect to Memory ($)]

47 Critical Paths. if (x > 0) y = -x; else y = b*x; [Figure: dataflow graph with inputs x, b, 0 and nodes *, -, >, ! producing y]

48 Lenient Operations. if (x > 0) y = -x; else y = b*x; [Figure: dataflow graph with inputs x, b, 0 and nodes *, -, >, ! producing y] Solves the problem of unbalanced paths.

49 Normalized Area

50 ASH Weaknesses
Both branch and join not free
Static dataflow (no re-issue of same instr)
Memory is far
Fully static
– No branch prediction
– No dynamic unrolling
– No register renaming
Calls/returns not lenient

51 Branch Prediction. for (i=0; i < N; i++) {... if (exception) break; } CPU annotations: Predicted not taken: effectively a noop for CPU! Predicted taken. Result available before inputs. [Figure: loop graph with i, +, <, 1, &, ! and exception, marking the ASH crit path and the CPU crit path]

52 Memory Partitioning. MIT RAW project: Babb FCCM 99, Barua HiPC 00, Lee ASPLOS 00. Stanford SpC: Semeria DAC 01, TVLSI 02. Illinois FlexRAM: Fraguela PPoPP 03. Hand-annotations: #pragma

53 Recursion [Figure: around a recursive call, save live values to the stack before the call and restore live values after it]
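The save/restore scheme on slide 53 can be illustrated in C by replacing a recursive function with an explicit stack: the live value is pushed before each "recursive call" and popped on the way back. Factorial is chosen here only as a convenient example, and the function name and the stack size are assumptions of this sketch, not taken from the talk.

```c
#include <assert.h>

/* Factorial with an explicit stack in place of recursive calls. */
static unsigned long long fact_explicit_stack(unsigned n)
{
    unsigned stack[32];   /* explicit stack holding saved live values */
    int sp = 0;

    assert(n <= 20);      /* 20! is the largest factorial that fits in 64 bits */

    /* "Call" phase: save the live value n before each recursive call. */
    while (n > 1) {
        stack[sp++] = n;
        n--;
    }

    /* "Return" phase: restore each saved value and finish its multiply. */
    unsigned long long result = 1;
    while (sp > 0)
        result *= stack[--sp];
    return result;
}
```

The only state that has to survive across the call is the live value n, so that is all the stack holds.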

54 Leakage Power. $P_s = k \cdot \mathrm{Area} \cdot e^{-V_T}$
Employ circuit-level techniques
Cut power supply of idle circuit portions
– most of the circuit is idle most of the time
– strong locality of activity

55 Why Not Compare To…
In-order processor
– Worse in all metrics than superscalar, except power
– We beat it in all metrics, including performance
DSP
– We expect roughly the same results as for superscalar (Wattch maintains high IPC for these kernels)
ASIC
– No available tool-flow supports C to the same degree
Asynchronous ASIC
– We compared with a Balsa synthesis system
– We are 15 times better in E×t compared to resulting ASIC
Async processor
– We are 350 times better in E×t than Amulet (scaled to 0.18 μm)

56 Why not target FPGA
Do not support asynchronous circuits
Very inefficient in area, power, delay
Too fine-grained for datapath circuits
We are designing an async FPGA