Spatial Computation Computing without General-Purpose Processors Mihai Budiu Carnegie Mellon University July 8, 2004.

Spatial Computation Computing without General-Purpose Processors Mihai Budiu mihaib@cs.cmu.edu Carnegie Mellon University July 8, 2004

2 Spatial Computation A computation model based on: application-specific hardware; no interpretation; minimal resource sharing

3 The Engine Behind This Talk main() { signal(SIGINT, welcome); while (slides() && time()) { talk(); } }

4 Research Scope Object: future architectures Tool: compilers Evaluation: simulators

5 Research Methodology Constraint Space state-of-the-art X (e.g., power) Y (e.g., cost) “reasonable limits” incremental evolution new solutions

6 Outline Introduction: problems of current architectures Compiling Application-Specific Hardware Pipelining ASH Evaluation Conclusions

7 Resources We do not worry about not having hardware resources We worry about being able to use hardware resources [Intel]

8 Design Complexity [Figure: chip size in transistors (log scale, from 10^4 upward) versus designer productivity, 1981 to 2009]

9 Communication vs. Computation [Figure: gate 5 ps, wire 20 ps] Power consumption on wires is also dominant

10 Power Consumption Toasted CPU: about 2 sec after removing cooler. (Tom’s Hardware Guide)

11 Energy Efficiency ALUs Pentium 4

12 Clock Speed Cannot rely on global signals (clock is a global signal) 3GHz 6GHz 10GHz

13 Instruction-Set Architecture Software Hardware ISA VERY rigid to changes (e.g. x86 vs Itanium)

14 Our Proposal ASH addresses these problems ASH is not a panacea ASH “complementary” to CPU [Figure: ASH (high-ILP computation) and CPU (low-ILP computation + OS + VM) sharing Memory and $]

15 Outline Problems of current architectures CASH: Compiling ASH –program representation –compiling C programs Pipelining ASH Evaluation Conclusions

16 Application-Specific Hardware C program Compiler Dataflow IR Reconfigurable/custom hw SW HW ISA HW backend

17 Application-Specific Hardware C program Compiler Dataflow IR CPU [predication] SW backend Soft

18 Key: Intermediate Representation Traditionally: CFG. Our IR: Dataflow IR. SSA + predication + speculation; uniform for scalars and memory; explicitly encodes may-depend; executable; precise semantics; close to asynchronous target. [Figure legend: def-use, may-dep.]

19 Computation = Dataflow Operations ⇒ functional units; Variables ⇒ wires; No interpretation. Programs: x = a & 7; ... y = x >> 2; Circuits: [Figure: a and 7 feed &, producing x; x and 2 feed >>]
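The mapping on this slide can be mimicked in software. The following is only a minimal sketch: the names `and_unit`, `shr_unit`, and `spatial_y` are hypothetical and do not come from CASH. Each function stands in for one functional unit, and each value handed from one function to the next stands in for a wire.

```c
#include <stdint.h>

/* Hypothetical model: one function per functional unit. */
static uint32_t and_unit(uint32_t a, uint32_t b) { return a & b; }
static uint32_t shr_unit(uint32_t a, uint32_t n) { return a >> n; }

/* The "circuit" for:  x = a & 7;  y = x >> 2;
 * The local variable x plays the role of the wire between the units. */
uint32_t spatial_y(uint32_t a)
{
    uint32_t x = and_unit(a, 7);   /* wire x */
    return shr_unit(x, 2);         /* output wire y */
}
```

For example, `spatial_y(13)` computes `13 & 7 = 5` and then `5 >> 2 = 1`.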

20 Basic Computation + data valid ack latch

21 Asynchronous Computation [Figure: eight animation steps (1 to 8) showing data, valid, and ack signals around an adder and its latch]
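The slides name the signals (data, valid, ack) and the latch but do not spell out the protocol. As a rough sketch, assuming a single-slot latch and a simple "accept only when empty" rule, the handshake could be modeled as below; `latch_t`, `latch_put`, and `latch_get` are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical model of one latch between two operations. */
typedef struct {
    int  data;   /* latched value */
    bool full;   /* true while the latch holds an unconsumed value */
} latch_t;

/* Producer side: offer a value. Returns true (ack) if the latch took it. */
bool latch_put(latch_t *l, int value)
{
    if (l->full)
        return false;            /* consumer has not taken the old value yet */
    l->data = value;
    l->full = true;              /* "valid" is now visible to the consumer */
    return true;
}

/* Consumer side: take a value if one is valid. Returns true on success. */
bool latch_get(latch_t *l, int *out)
{
    if (!l->full)
        return false;            /* nothing valid yet */
    *out = l->data;
    l->full = false;             /* frees the latch so the producer can proceed */
    return true;
}
```

A producer that calls `latch_put` twice in a row is refused the second time until the consumer has called `latch_get`, which is the back-pressure the ack signal provides.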

22 Distributed Control Logic +- ack rdy global FSM asynchronous control short, local wires

23 Outline Problems of current architectures CASH: Compiling ASH –program representation –compiling C programs Pipelining ASH Evaluation Conclusions

24 MUX: Forward Branches if (x > 0) y = -x; else y = b*x; [Figure: x, b, and 0 feed *, -, and >; a multiplexor selects y; critical path marked] Conditionals ⇒ Speculation; SSA = no arbitration
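In C terms, turning the conditional into speculation plus a multiplexor looks roughly like the sketch below. The function name `mux_select` is hypothetical; both arms are computed unconditionally and the predicate only chooses between the results.

```c
/* Speculative form of:  if (x > 0) y = -x; else y = b*x; */
int mux_select(int x, int b)
{
    int p      = x > 0;    /* predicate */
    int then_v = -x;       /* computed regardless of p */
    int else_v = b * x;    /* computed regardless of p */
    return p ? then_v : else_v;   /* the multiplexor */
}
```

Both arms are free of side effects here, which is what makes computing them speculatively safe.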

25 Control Flow ⇒ Data Flow Merge (label); Gateway (data, predicate); Split (branch) [Figure labels: data, p, !]

26 Loops int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [Figure: dataflow graph with a ring for i (+1, < 100) and a ring for sum (*, +), ending in ret]
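One way to read the loop graph is as two loop-carried values circulating until the predicate turns false. The sketch below restates the slide's loop in that shape. It is an illustration only, not compiler output, and `sum_squares_dataflow` is a hypothetical name.

```c
/* The slide's loop, written so that each loop-carried value has its own
 * update and the predicate explicitly steers the result out of the loop. */
int sum_squares_dataflow(void)
{
    int i = 0, sum = 0;          /* initial values entering the loop */
    for (;;) {
        int p = i < 100;         /* loop predicate */
        if (!p)
            return sum;          /* predicate false: sum leaves toward ret */
        sum = sum + i * i;       /* sum's ring */
        i   = i + 1;             /* i's ring */
    }
}
```

The function returns 328350, the sum of the squares of 0 through 99.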

27 Predication and Side-Effects [Figure: Load with inputs addr, pred, and token and output data, connected to memory] pred: no speculation; token: sequencing of side-effects
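A software analogy for the predicated load might look like the following sketch. The names `load_out_t` and `predicated_load` are hypothetical, and the rule that the token is forwarded even when the predicate is false is an assumption made here so that later side effects are not blocked.

```c
#include <stdbool.h>

typedef struct {
    int  data;       /* loaded value; meaningful only if the predicate held */
    bool token_out;  /* released to the next side-effecting operation */
} load_out_t;

/* Hypothetical predicated load with token-based sequencing. */
load_out_t predicated_load(const int *addr, bool pred, bool token_in)
{
    load_out_t out = { 0, false };
    if (!token_in)
        return out;              /* earlier side effects not done: do not fire */
    if (pred)
        out.data = *addr;        /* memory is touched only when pred is true */
    out.token_out = true;        /* assumption: pass the token on either way */
    return out;
}
```

Because the address is dereferenced only under a true predicate, a call such as `predicated_load(NULL, false, true)` is harmless in this model.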

28 Memory Access [Figure: LD, ST, LD operations reach a Monolithic Memory through a pipelined arbitrated network] local communication; global structures. Future work: fragment this!

29 CASH Optimizations SSA-based optimizations –unreachable/dead code, gcse, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining Memory optimizations –dependence & alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling Boolean optimizations –Espresso CAD tool, bitwidth analysis

30 Outline Problems of current architectures Compiling ASH Pipelining Evaluation: CASH vs. clocked designs Conclusions

31 Pipelining int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [Figure, step 1: dataflow graph of the loop (i, +1, <= 100, *, +, sum) with a pipelined multiplier (8 stages)]

32 Pipelining [Figure, step 2]

35 Pipelining [Figure, step 5: iterations i=0 and i=1 in flight]

36 Pipelining [Figure, step 6: iterations i=0 and i=1 in flight]

37 Pipelining [Figure, step 7: i's loop, sum's loop, long latency pipe, predicate]

38 Pipelining Predicate ack edge is on the critical path. [Figure: critical path through i's loop and sum's loop]

39 Pipeline balancing [Figure, step 7: i's loop and sum's loop with a decoupling FIFO]

40 Pipeline balancing [Figure: critical path with the decoupling FIFO in place]
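The decoupling FIFO can be pictured as a small bounded queue: the producer stalls only when the queue is full, so it may run several items ahead of a slow consumer. The sketch below is a generic ring buffer, not the circuit CASH emits. The names are hypothetical, and the depth of 8 is an assumption chosen to echo the 8-stage multiplier mentioned earlier.

```c
#include <stdbool.h>
#include <stddef.h>

#define FIFO_DEPTH 8   /* assumption: echoes the 8-stage multiplier */

typedef struct {
    int    slot[FIFO_DEPTH];
    size_t head;    /* index of the oldest element */
    size_t count;   /* number of stored elements */
} fifo_t;

/* Producer side: returns false when the FIFO is full (producer must stall). */
bool fifo_put(fifo_t *f, int v)
{
    if (f->count == FIFO_DEPTH)
        return false;
    f->slot[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
    return true;
}

/* Consumer side: returns false when the FIFO is empty (consumer must wait). */
bool fifo_get(fifo_t *f, int *v)
{
    if (f->count == 0)
        return false;
    *v = f->slot[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

With a depth of 1 the producer would be acknowledged only after each item is consumed; a deeper queue removes that acknowledgment from the per-iteration path.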

41 Outline Problems of current architectures Compiling ASH Pipelining Evaluation: CASH vs. clocked designs Conclusions

42 Evaluating ASH Flow: Mediabench kernels (1 hot function/benchmark) in C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC. 180 nm std. cell library, 2 V (~1999 technology). ModelSim (Verilog simulation, with Mem) → performance numbers

43 ASH Area [Figure: normalized area of ASH kernels, with a minimal RISC core for reference; P4: 217]

44 ASH vs 600MHz CPU [0.18 μm]

45 Bottleneck: Memory Protocol [Figure: LD, ST, LSQ, Memory] Token release to dependents requires a round-trip to memory. Limit study: round trip zero time ⇒ up to 6x speed-up. Exploring protocol for in-order data delivery & fast token release.

46 Power [Figure, units not preserved: DSP 110; μP 4000; Xeon [+cache] 67000]

47 Energy Efficiency [Figure: energy efficiency in operations/nJ, log scale from 0.01 to 1000, comparing microprocessors, asynchronous μP, general-purpose DSP, FPGAs, dedicated hardware, and ASH media kernels; a 1000x gap is marked]

48 Outline Problems of current architectures +Compiling ASH +Pipelining +ASH Evaluation =Future/related work & conclusions

49 Related Work Nanotechnology Dataflow machines High-level synthesis Reconfigurable computing Computer architecture Embedded systems Asynchronous circuits Compilation

50 Future Work Optimizations for area/speed/power Memory partitioning Concurrency Compiler-guided layout Explore extensible ISAs Hybridization with superscalar mechanisms Reconfigurable hardware support for ASH Formal verification

51 How far can you go? Grand Vision: Certified Circuit Generation. Translation validation: input ≡ output. Preserve input properties –e.g., C programs cannot deadlock –e.g., type-safe programs cannot crash. Debug, test, verify only at source-level. HLL → IR → IR opt → Verilog → gates → layout (formally validated)

52 Conclusions Spatial computation strengths
Feature | Advantages
No interpretation | Energy efficiency, speed
Spatial layout | Short wires, no contention
Asynchronous | Low power, scalable
Distributed | No global signals
Automatic compilation | Design productivity, no ISA

53 Backup Slides Reconfigurable hardware Critical paths Control logic ASH vs... ASH weaknesses Exceptions Normalized area Why C? Splitting memory More performance Recursive calls

54 Reconfigurable Hardware Universal gates and/or storage elements Interconnection network Programmable switches

55 Main RH Ingredient: RAM Cell Switch controlled by a 1-bit RAM cell. Universal gate = RAM. [Figure labels: 0001, a0, a1, a1 & a2, data, data in, control]
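"Universal gate = RAM" is the lookup-table idea: a two-input gate is a 4-bit memory addressed by its inputs, and changing the stored bits changes the gate. A minimal sketch, with hypothetical names (`lut2`, `LUT_AND`, `LUT_XOR`) and an assumed bit ordering in which the address is `(a1 << 1) | a0`:

```c
#include <stdint.h>

/* Truth tables: bit k holds the output for address k = (a1 << 1) | a0. */
#define LUT_AND 0x8u   /* 1000b: only address 3 (a1=1, a0=1) yields 1 */
#define LUT_XOR 0x6u   /* 0110b: addresses 1 and 2 yield 1 */

/* A 2-input universal gate modeled as a 4-bit RAM read. */
int lut2(uint8_t table, int a0, int a1)
{
    int address = ((a1 & 1) << 1) | (a0 & 1);
    return (table >> address) & 1;
}
```

Calling `lut2(LUT_AND, 1, 1)` reads bit 3 of the table and yields 1; swapping in `LUT_XOR` reprograms the same "gate" without changing the code that evaluates it.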

56 Critical Paths if (x > 0) y = -x; else y = b*x; [Figure: x, b, and 0 feed *, -, and >; a multiplexor produces y]

57 Lenient Operations if (x > 0) y = -x; else y = b*x; [Figure: x, b, and 0 feed *, -, and >; a multiplexor produces y] Solves the problem of unbalanced paths
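A lenient multiplexor can be sketched as one that fires as soon as the predicate and the selected input have arrived, without waiting for the input it is going to discard. In the conditional on this slide, that means `-x` can be delivered when `x > 0` even if the slower multiply has not finished. The types and names below (`wire_t`, `lenient_mux`) are hypothetical.

```c
#include <stdbool.h>

/* A value together with a flag saying whether it has arrived yet. */
typedef struct {
    int  value;
    bool ready;
} wire_t;

/* Lenient 2-way multiplexor: needs only the predicate and the chosen input. */
wire_t lenient_mux(wire_t pred, wire_t if_true, wire_t if_false)
{
    wire_t out = { 0, false };
    if (!pred.ready)
        return out;                          /* cannot choose yet */
    wire_t chosen = pred.value ? if_true : if_false;
    if (chosen.ready)
        out = chosen;                        /* the other input is ignored */
    return out;
}
```

A strict multiplexor would instead require all three inputs to be ready, so its latency would always be that of the slowest arm.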

58 Asynchronous Control [Figure labels: rdy in, ack out, rdy out, ack in, data in, data out, Reg, C, =]

59 HLL to HW High-level Synthesis Behavioral HDL Synchronous Hardware Reconfigurable Computing C [subsets] Hardware configuration (spatial computation) Asynchronous circuits Concurrent Language Asynchronous Hardware Prior work This research

60 CASH vs High-Level Synthesis CASH: the only existing tool to translate complete ANSI C to hardware CASH generates asynchronous circuits CASH does not treat C as an HDL –no annotations required –no reactivity model –does not handle non-C, e.g., concurrency

61 ASH Weaknesses Low efficiency for low-ILP code Does not adapt at runtime Monolithic memory Resource waste Not flexible No support for exceptions

62 ASH Weaknesses (2) Both branch and join not free Static dataflow (no re-issue of same instr) Memory is “far” Fully static – No branch prediction – No dynamic unrolling – No register renaming Calls/returns not lenient

63 Branch Prediction for (i=0; i < N; i++) { ... if (exception) break; } [Figure: i, +, <, 1, &, !, exception; ASH crit path versus CPU crit path] Predicted not taken: effectively a noop for CPU! Predicted taken. Result available before inputs.

64 Exceptions Strictly speaking, C has no exceptions. In practice, hard to accommodate exceptions in hardware implementations. An advantage of software flexibility: PC is single point of execution control. [Figure: ASH (high-ILP computation) and CPU (low-ILP computation + OS + VM + exceptions) sharing Memory; $$$]

65 Why C Huge installed base Embedded specifications written in C Small and simple language –Can leverage existing tools –Simpler compiler Techniques generally applicable Not a toy language

66 Performance

67 Parallelism Profile

68 Normalized Area

69 Memory Partitioning MIT RAW project: Babb FCCM ‘99, Barua HiPC ‘00, Lee ASPLOS ‘00 Stanford SpC: Semeria DAC ‘01, TVLSI ‘02 Berkeley CCured: Necula POPL ‘02 Illinois FlexRAM: Fraguela PPoPP ‘03 Hand-annotations #pragma

70 Memory Complexity [Figure: LSQ, RAM, addr, data]

71 Recursion recursive call: save live values; restore live values; stack
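The save/restore pattern named on this slide can be illustrated in plain C by replacing a recursive call with an explicit stack. The sketch below uses factorial purely as a stand-in; `fact_explicit_stack` and `STACK_MAX` are hypothetical, and the slides do not say how CASH lays out its stack.

```c
#define STACK_MAX 64   /* assumption: fixed-size stack for the illustration */

/* Factorial with the recursion unrolled onto an explicit stack.
 * Returns 0 if the stack would overflow (outside this simple model). */
unsigned long fact_explicit_stack(unsigned n)
{
    unsigned stack[STACK_MAX];
    int sp = 0;

    /* "Call" phase: save the live value n before each recursive call. */
    while (n > 1) {
        if (sp == STACK_MAX)
            return 0;
        stack[sp++] = n;
        n--;
    }

    /* "Return" phase: restore each saved value and finish its work. */
    unsigned long result = 1;
    while (sp > 0)
        result *= stack[--sp];
    return result;
}
```

For instance, `fact_explicit_stack(5)` saves 5, 4, 3, 2 on the way down and multiplies them back on the way up, giving 120.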

72 Me?
