Spatial Computation Computing without General-Purpose Processors Mihai Budiu Carnegie Mellon University July 8, 2004.


1 Spatial Computation Computing without General-Purpose Processors Mihai Budiu mihaib@cs.cmu.edu Carnegie Mellon University July 8, 2004

2 2 Mihai Budiu mihaib@cs.cmu.edu Carnegie Mellon University Spatial Computation A computation model based on: application-specific hardware no interpretation minimal resource sharing Spatial Computation

3 3 The Engine Behind This Talk main( ) { signal(SIGINT, welcome); while (slides( ) && time( )) { talk( ); } }

4 4 Research Scope Object: future architectures Tool: compilers Evaluation: simulators

5 5 Research Methodology Constraint Space state-of-the-art X (e.g., power) Y (e.g., cost) “reasonable limits” incremental evolution new solutions

6 6 Outline Introduction: problems of current architectures Compiling Application-Specific Hardware Pipelining ASH Evaluation Conclusions 1000 Performance

7 7 Resources We do not worry about not having hardware resources We worry about being able to use hardware resources [Intel]

8 8 Design Complexity [Figure: transistors per chip on a log scale starting at 10^4, plotted against year from 1981 to 2009; curves for chip size and designer productivity]

9 9 Communication vs. Computation [Figure labels: 5ps, 20ps; gate, wire] Power consumption on wires is also dominant

10 10 Power Consumption Toasted CPU: about 2 sec after removing cooler. (Tom’s Hardware Guide)

11 11 Energy Efficiency ALUs Pentium 4

12 12 Clock Speed Cannot rely on global signals (clock is a global signal) 3GHz 6GHz 10GHz

13 13 Instruction-Set Architecture Software Hardware ISA VERY rigid to changes (e.g. x86 vs Itanium)

14 14 Our Proposal ASH addresses these problems ASH is not a panacea ASH “complementary” to CPU High-ILP computation Low ILP computation + OS + VM CPU ASH Memory $

15 15 Outline Problems of current architectures CASH: Compiling ASH –program representation –compiling C programs Pipelining ASH Evaluation Conclusions

16 16 Application-Specific Hardware C program Compiler Dataflow IR Reconfigurable/custom hw SW HW ISA HW backend

17 17 Application-Specific Hardware C program Compiler Dataflow IR CPU [predication] SW backend Soft

18 18... def-use may-dep. Key: Intermediate Representation Traditionally SSA + predication + speculation Uniform for scalars and memory Explicitly encodes may-depend Executable Precise semantics Dataflow IR Close to asynchronous target Our IR CFG

19 19 Computation = Dataflow Operations ⇒ functional units Variables ⇒ wires No interpretation Programs: x = a & 7; ... y = x >> 2; Circuits: [Figure: a and 7 feed &, whose output x feeds >> together with the constant 2]
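The mapping on this slide can be pictured in C itself. The following is only an illustrative sketch, not CASH output: each small function stands in for one functional unit, each local variable stands in for a wire, and the names `and_unit`, `shr_unit`, and `spatial_y` are hypothetical.

```c
#include <stdint.h>

/* Each function stands in for one dedicated functional unit. */
static uint32_t and_unit(uint32_t a, uint32_t b) { return a & b; }
static uint32_t shr_unit(uint32_t a, uint32_t n) { return a >> n; }

/* Hypothetical name: the "circuit" for  x = a & 7;  y = x >> 2;
 * No instruction is fetched or decoded; values just flow along wires. */
uint32_t spatial_y(uint32_t a)
{
    uint32_t x = and_unit(a, 7u);  /* wire x: output of the & unit  */
    uint32_t y = shr_unit(x, 2u);  /* wire y: output of the >> unit */
    return y;
}
```

Calling `spatial_y(13u)` pushes 13 through the two units in sequence, the way a value would propagate through the corresponding circuit.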

20 20 Basic Computation + data valid ack latch

21 21 + Asynchronous Computation data valid ack 1 + 2 + 3 + 4 + 8 + 7 + 6 + 5 latch
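A rough software model of the data/valid/ack handshake may help here. This is a sketch under simplifying assumptions (one latch per wire, no timing), and the type `stage_t` and the functions `stage_put`, `stage_get`, and `add_fire` are hypothetical names, not part of ASH.

```c
#include <stdbool.h>

/* One output latch with valid/ack flow control (hypothetical model). */
typedef struct {
    bool valid;  /* the latch currently holds unconsumed data */
    int  data;
} stage_t;

/* Producer side: returns true (the "ack") if the latch accepted the value. */
bool stage_put(stage_t *s, int value)
{
    if (s->valid)
        return false;       /* latch still full: no ack, producer waits */
    s->data = value;
    s->valid = true;
    return true;
}

/* Consumer side: returns true if a value was available and was removed. */
bool stage_get(stage_t *s, int *out)
{
    if (!s->valid)
        return false;
    *out = s->data;
    s->valid = false;       /* frees the latch for the next producer */
    return true;
}

/* An adder fires only when both inputs are valid and its output latch
 * is free; firing consumes (acknowledges) both inputs. */
bool add_fire(stage_t *a, stage_t *b, stage_t *sum)
{
    if (!a->valid || !b->valid || sum->valid)
        return false;
    sum->data = a->data + b->data;
    sum->valid = true;
    a->valid = false;
    b->valid = false;
    return true;
}
```

In this model nothing is driven by a clock: an operation proceeds as soon as its inputs are valid and its consumer has acknowledged the previous result.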

22 22 Distributed Control Logic +- ack rdy global FSM asynchronous control short, local wires

23 23 Outline Problems of current architectures CASH: Compiling ASH –program representation –compiling C programs Pipelining ASH Evaluation Conclusions

24 24 MUX: Forward Branches if (x > 0) y = -x; else y = b*x; [Figure: dataflow circuit with nodes *, -, >, ! over inputs x, b, 0 and output y; critical path marked] Conditionals ⇒ Speculation SSA = no arbitration
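The idea that conditionals become speculation can be sketched in C by computing both arms unconditionally and selecting the result with the predicate. This is an illustration only: the function names `branchy` and `muxed` are hypothetical, and the 64-bit intermediates are an assumption added so that the speculated negation and multiplication cannot overflow in C.

```c
#include <stdint.h>

/* Control-flow form, as written on the slide. */
int64_t branchy(int32_t x, int32_t b)
{
    int64_t y;
    if (x > 0)
        y = -(int64_t)x;
    else
        y = (int64_t)b * x;
    return y;
}

/* If-converted form: both arms are evaluated speculatively and a
 * multiplexor, driven by the predicate, picks the live value. */
int64_t muxed(int32_t x, int32_t b)
{
    int     p   = x > 0;            /* predicate */
    int64_t neg = -(int64_t)x;      /* "then" arm, always computed */
    int64_t mul = (int64_t)b * x;   /* "else" arm, always computed */
    return p ? neg : mul;           /* the MUX */
}
```

Both arms here are free of side effects, which is what makes the speculation safe in this example.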

25 25 Control Flow ⇒ Data Flow [Figure: Merge (label), Gateway, and Split (branch) nodes, with data and predicate (p, !p) inputs]

26 26 i +1 < 100 0 * + sum 0 Loops int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; ! ret
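One way to read the loop circuit is as tokens circulating through merge and split nodes. The sketch below imitates that in sequential C; it is not how CASH represents the loop internally, and `split` and `sum_squares_dataflow` are hypothetical names.

```c
#include <stdbool.h>

/* Split (branch) node: steers a value into the loop body when p holds,
 * otherwise to the loop exit. Returns p so the caller knows the route. */
static bool split(bool p, int value, int *to_body, int *to_exit)
{
    if (p)
        *to_body = value;
    else
        *to_exit = value;
    return p;
}

int sum_squares_dataflow(void)
{
    int i = 0, sum = 0;   /* initial tokens entering the two merge nodes */
    int result = 0;

    for (;;) {
        bool p = i < 100;                  /* loop predicate */
        int i_body = 0, sum_body = 0, i_dead = 0;

        split(p, i, &i_body, &i_dead);     /* i is not needed after the loop */
        if (!split(p, sum, &sum_body, &result))
            return result;                 /* sum reaches the "ret" node */

        sum = sum_body + i_body * i_body;  /* back edge into sum's merge */
        i   = i_body + 1;                  /* back edge into i's merge   */
    }
}
```

The function returns the same value as the `for` loop on the slide, the sum of i*i for i from 0 to 99.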

27 27 no speculation sequencing of side-effects Predication and Side-Effects Load addr data pred token to memory
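A predicated load with a sequencing token might be modeled as follows. This is a hedged sketch: the struct `load_result_t`, the function `pred_load`, and the choice of a boolean to represent the token are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  data;       /* loaded value; meaningless when pred was false   */
    bool token_out;  /* tells later memory operations they may proceed  */
} load_result_t;

/* The load touches memory only when its predicate is true (side effects
 * are never speculated), and only after the incoming token shows that
 * earlier, possibly dependent memory operations have been sequenced. */
load_result_t pred_load(const int *mem, size_t addr, bool pred, bool token_in)
{
    load_result_t r = { 0, false };

    if (!token_in)
        return r;            /* predecessors not done yet: must wait */
    if (pred)
        r.data = mem[addr];  /* real memory access */
    r.token_out = true;      /* the token is passed on in either case */
    return r;
}
```

Note that a false predicate still forwards the token, so operations that depend on this load are not blocked by a load that never executes.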

28 28 Memory Access LD ST LD Monolithic Memory local communication global structures pipelined arbitrated network Future work: fragment this!

29 29 CASH Optimizations SSA-based optimizations –unreachable/dead code, gcse, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining Memory optimizations –dependence & alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling Boolean optimizations –Espresso CAD tool, bitwidth analysis

30 30 Outline Problems of current architectures Compiling ASH Pipelining Evaluation: CASH vs. clocked designs Conclusions

31 31 Pipelining i + <= 100 1 * + sum pipelined multiplier (8 stages) int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; step 1

32 32 Pipelining i + <= 100 1 * + sum step 2

33 33 Pipelining i + <= 100 1 * + sum step 3

34 34 Pipelining i + <= 100 1 * + sum step 4

35 35 Pipelining i + <= 100 1 i=1 i=0 + sum step 5

36 36 Pipelining i + <= 100 1 * i=1 i=0 + sum step 6

37 37 Pipelining i + <= 100 1 * + sum i’s loop sum’s loop Long latency pipe predicate step 7

38 38 Predicate ack edge is on the critical path. Pipelining i + <= 100 1 * + sum critical path i’s loop sum’s loop

39 39 Pipeline balancing i + <= 100 1 * + sum i’s loop sum’s loop decoupling FIFO step 7

40 40 Pipeline balancing i + <= 100 1 * + sum i’s loop sum’s loop critical path decoupling FIFO
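The effect of the decoupling FIFO can be estimated with a toy cycle-count model. Everything in this sketch is an assumption made for illustration: time is counted in abstract steps even though ASH circuits are asynchronous, i's loop can issue one iteration per step, the multiplier is a pipeline of `latency` stages, and the predicate edge into sum's loop holds at most `pred_capacity` unconsumed predicates (1 models a single latch, larger values model a FIFO). The function name `simulate` is hypothetical.

```c
#include <stdbool.h>

/* Returns the number of steps needed to complete n loop iterations,
 * or 0 if the parameters are outside what this sketch supports. */
unsigned simulate(unsigned n, unsigned latency, unsigned pred_capacity)
{
    enum { MAX_LAT = 64 };
    bool pipe[MAX_LAT] = { false };  /* occupancy of multiplier stages */
    unsigned issued = 0, done = 0, queued = 0, steps = 0;

    if (latency == 0 || latency > MAX_LAT || pred_capacity == 0)
        return 0;

    while (done < n) {
        /* sum's loop fires when a product and a predicate are both ready */
        if (pipe[latency - 1] && queued > 0) {
            pipe[latency - 1] = false;
            queued--;
            done++;
        }
        /* products advance one multiplier stage where there is room */
        for (unsigned s = latency - 1; s > 0; s--) {
            if (!pipe[s]) {
                pipe[s] = pipe[s - 1];
                pipe[s - 1] = false;
            }
        }
        /* i's loop issues only if the predicate edge can take one more */
        if (issued < n && !pipe[0] && queued < pred_capacity) {
            pipe[0] = true;
            queued++;
            issued++;
        }
        steps++;
    }
    return steps;
}
```

Under these assumptions, comparing `simulate(100, 8, 1)` with `simulate(100, 8, 8)` shows the point of the slide: with a single latch on the predicate edge, i's loop stalls for the full multiplier latency on every iteration, while a FIFO as deep as the multiplier lets it run ahead and keep the pipeline full.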

41 41 Outline Problems of current architectures Compiling ASH Pipelining Evaluation: CASH vs. clocked designs Conclusions

42 42 Evaluating ASH C CASH core Verilog back-end Synopsys, Cadence P/R ASIC 180nm std. cell library, 2V ~1999 technology Mediabench kernels (1 hot function/benchmark) ModelSim (Verilog simulation) performance numbers Mem

43 43 ASH Area P4: 217 normalized area minimal RISC core

44 44 ASH vs 600MHz CPU [0.18 µm]

45 45 Bottleneck: Memory Protocol LD ST Memory LSQ Token release to dependents: requires round-trip to memory. Limit study: round trip zero time ⇒ up to 6x speed-up. Exploring protocol for in-order data delivery & fast token release.

46 46 Power DSP 110, µP 4000, Xeon [+cache] 67000

47 47 Energy Efficiency [Figure: energy efficiency in operations/nJ on a log scale from 0.01 to 1000, comparing microprocessors, asynchronous µP, general-purpose DSP, FPGAs, ASH media kernels, and dedicated hardware; annotated "1000x"]

48 48 Outline Problems of current architectures +Compiling ASH +Pipelining +ASH Evaluation =Future/related work & conclusions

49 49 Related Work Nanotechnology Dataflow machines High-level synthesis Reconfigurable computing Computer architecture Embedded systems Asynchronous circuits Compilation

50 50 Future Work Optimizations for area/speed/power Memory partitioning Concurrency Compiler-guided layout Explore extensible ISAs Hybridization with superscalar mechanisms Reconfigurable hardware support for ASH Formal verification

51 51 How far can you go? Grand Vision: Certified Circuit Generation Translation validation: input ≡ output Preserve input properties –e.g., C programs cannot deadlock –e.g., type-safe programs cannot crash Debug, test, verify only at source-level [Figure: HLL, IR, IR opt, Verilog, gates, layout; formally validated]

52 52 Conclusions Spatial computation strengths (Feature: Advantages) No interpretation: energy efficiency, speed. Spatial layout: short wires, no contention. Asynchronous: low power, scalable. Distributed: no global signals. Automatic compilation: design productivity, no ISA.

53 53 Backup Slides Reconfigurable hardware Critical paths Control logic ASH vs... ASH weaknesses Exceptions Normalized area Why C? Splitting memory More performance Recursive calls

54 54 Reconfigurable Hardware Universal gates and/or storage elements Interconnection network Programmable switches

55 55 Main RH Ingredient: RAM Cell Switch controlled by a 1-bit RAM cell Universal gate = RAM [Figure: RAM contents 0 0 0 1 addressed by a0 and a1, output labeled "a1 & a2"; switch with data in, data, and control]
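The statement that a universal gate is just a RAM can be illustrated with a two-input lookup table. This is a generic sketch of the idea rather than a model of any particular reconfigurable fabric; `lut2` and the bit ordering of `config` are assumptions.

```c
#include <stdint.h>

/* A 2-input universal gate modeled as a 4x1-bit RAM.
 * Bit k of config is the RAM cell at address k, where the address is
 * formed from the inputs as (a1 << 1) | a0. */
unsigned lut2(uint8_t config, unsigned a0, unsigned a1)
{
    unsigned address = ((a1 & 1u) << 1) | (a0 & 1u);
    return (config >> address) & 1u;
}
```

With this encoding, `config = 0x8` (only the cell at address 3 holds 1) behaves as AND, and `config = 0x6` behaves as XOR; reprogramming the gate means rewriting the four RAM cells.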

56 56 Critical Paths if (x > 0) y = -x; else y = b*x; [Figure: dataflow circuit with nodes *, -, >, ! over inputs x, b, 0 and output y]

57 57 Lenient Operations if (x > 0) y = -x; else y = b*x; [Figure: dataflow circuit with nodes *, -, >, ! over inputs x, b, 0 and output y] Solves the problem of unbalanced paths

58 Asynchronous Control [Figure: blocks labeled =, Reg, and C connecting rdy in, ack out, and data in to rdy out, ack in, and data out]

59 59 HLL to HW High-level Synthesis Behavioral HDL Synchronous Hardware Reconfigurable Computing C [subsets] Hardware configuration (spatial computation) Asynchronous circuits Concurrent Language Asynchronous Hardware Prior work This research

60 60 CASH vs High-Level Synthesis CASH: the only existing tool to translate complete ANSI C to hardware CASH generates asynchronous circuits CASH does not treat C as an HDL –no annotations required –no reactivity model –does not handle non-C, e.g., concurrency back

61 61 ASH Weaknesses Low efficiency for low-ILP code Does not adapt at runtime Monolithic memory Resource waste Not flexible No support for exceptions

62 62 ASH Weaknesses (2) Both branch and join not free Static dataflow (no re-issue of same instr) Memory is “far” Fully static – No branch prediction – No dynamic unrolling – No register renaming Calls/returns not lenient back

63 63 Branch Prediction for (i=0; i < N; i++) {... if (exception) break; } [Figure: loop circuit with nodes i, +, <, 1, &, !, and exception; ASH critical path and CPU critical path marked] Predicted not taken: effectively a noop for CPU! Predicted taken. Result available before inputs.

64 64 Exceptions Strictly speaking, C has no exceptions In practice hard to accommodate exceptions in hardware implementations An advantage of software flexibility: PC is single point of execution control High-ILP computation Low ILP computation + OS + VM + exceptions CPU ASH Memory $$$

65 65 Why C Huge installed base Embedded specifications written in C Small and simple language –Can leverage existing tools –Simpler compiler Techniques generally applicable Not a toy language back

66 66 Performance

67 67 Parallelism Profile

68 68 Normalized Area

69 69 Memory Partitioning MIT RAW project: Babb FCCM ‘99, Barua HiPC ‘00, Lee ASPLOS ‘00 Stanford SpC: Semeria DAC ‘01, TVLSI ‘02 Berkeley CCured: Necula POPL ‘02 Illinois FlexRAM: Fraguela PPoPP ‘03 Hand-annotations #pragma

70 70 Memory Complexity LSQ RAM addr data

71 71 Recursion: recursive call; save live values; restore live values; stack
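The save/restore scheme for recursion can be sketched with an explicit stack in C. This is only an illustration of the general technique using factorial as a stand-in example; the slide does not say which function was used, and `fact_explicit_stack` and `MAX_DEPTH` are hypothetical names.

```c
#include <stdint.h>

#define MAX_DEPTH 20u  /* 20! is the largest factorial that fits in 64 bits */

/* Factorial written the way a recursive call could be lowered when there
 * is no implicit call stack: live values are saved before each "call"
 * and restored after each "return". */
uint64_t fact_explicit_stack(unsigned n)
{
    unsigned stack[MAX_DEPTH];
    unsigned sp = 0;
    uint64_t result = 1;

    if (n > MAX_DEPTH)
        return 0;                  /* out of range for this sketch */

    while (n > 1) {
        stack[sp++] = n;           /* save the live value n */
        n--;                       /* "recursive call" on n - 1 */
    }
    while (sp > 0)
        result *= stack[--sp];     /* restore n and finish that frame */
    return result;
}
```

The only state that survives across the recursive call is `n`, so that is the only value pushed; a compiler would compute this set with liveness analysis.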

72 72 Me?

