1
Spatial Computation: Computing without General-Purpose Processors. Mihai Budiu, mihaib@cs.cmu.edu, Carnegie Mellon University, July 8, 2004
2
2 Spatial Computation. A computation model based on: application-specific hardware; no interpretation; minimal resource sharing.
3
3 The Engine Behind This Talk: main( ) { signal(SIGINT, welcome); while (slides( ) && time( )) { talk( ); } }
4
4 Research Scope Object: future architectures Tool: compilers Evaluation: simulators
5
5 Research Methodology. [Figure: constraint space with axes X (e.g., power) and Y (e.g., cost), marking the state of the art, “reasonable limits”, incremental evolution, and new solutions.]
6
6 Outline. Introduction: problems of current architectures; Compiling Application-Specific Hardware; Pipelining; ASH Evaluation; Conclusions. [Figure residue: Performance axis, tick 1000.]
7
7 Resources. We do not worry about not having hardware resources. We worry about being able to use hardware resources. [Intel]
8
8 Design Complexity. [Figure: designer productivity vs. chip size in transistors (log scale, 10^4 to 10^10), for years 1981 to 2009.]
9
9 Communication vs. Computation. [Figure: delays of 5ps and 20ps, labeled gate and wire.] Power consumption on wires is also dominant.
10
10 Power Consumption Toasted CPU: about 2 sec after removing cooler. (Tom’s Hardware Guide)
11
11 Energy Efficiency ALUs Pentium 4
12
12 Clock Speed Cannot rely on global signals (clock is a global signal) 3GHz 6GHz 10GHz
13
13 Instruction-Set Architecture Software Hardware ISA VERY rigid to changes (e.g. x86 vs Itanium)
14
14 Our Proposal. ASH addresses these problems. ASH is not a panacea. ASH is “complementary” to the CPU. [Figure: CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation) sharing memory through a cache ($).]
15
15 Outline Problems of current architectures CASH: Compiling ASH –program representation –compiling C programs Pipelining ASH Evaluation Conclusions
16
16 Application-Specific Hardware. [Figure: C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw; labels: SW, HW, ISA.]
17
17 Application-Specific Hardware. [Figure: C program → Compiler → Dataflow IR → SW backend → CPU [predication]; label: Soft.]
18
18 Key: Intermediate Representation. Traditionally: CFG. Our IR: Dataflow IR. SSA + predication + speculation; uniform for scalars and memory; explicitly encodes may-depend; executable; precise semantics; close to asynchronous target. [Figure key: def-use, may-dep.]
19
19 Computation = Dataflow. Operations ⇒ functional units. Variables ⇒ wires. No interpretation. Programs: x = a & 7; ... y = x >> 2; Circuits: [Figure: a and 7 feed an & unit; its output wire x and the constant 2 feed a >> unit.]
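The mapping on this slide can be mimicked in software. The following is only a sketch: the names `and_unit`, `shr_unit`, and `circuit_y` are hypothetical, with each function standing in for one functional unit and each local variable standing in for a wire.

```c
#include <stdint.h>

/* Hypothetical names: each function plays the role of one functional unit. */
static uint32_t and_unit(uint32_t lhs, uint32_t rhs) { return lhs & rhs; }
static uint32_t shr_unit(uint32_t value, uint32_t amount) { return value >> amount; }

/* The "circuit" for: x = a & 7; ... y = x >> 2;
   Each local variable plays the role of a wire between two units. */
uint32_t circuit_y(uint32_t a) {
    uint32_t x = and_unit(a, 7u);   /* wire x: output of the & unit */
    uint32_t y = shr_unit(x, 2u);   /* wire y: output of the >> unit */
    return y;
}
```

For example, `circuit_y(13)` computes 13 & 7 = 5 and then 5 >> 2 = 1. No instruction is fetched or decoded along the way, which is the sense of "no interpretation" here.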
20
20 Basic Computation + data valid ack latch
21
21 + Asynchronous Computation data valid ack 1 + 2 + 3 + 4 + 8 + 7 + 6 + 5 latch
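The data/valid/ack handshake around a latched adder can be approximated in software. This is a rough analogy under assumed semantics, not the circuit itself: `adder_stage`, `stage_fire`, and `stage_take` are hypothetical names, and a Boolean return value stands in for the ack wire.

```c
#include <stdbool.h>

/* Hypothetical model of one "+" stage with an output latch. */
typedef struct {
    int  data;   /* latched result */
    bool valid;  /* true while the latch holds an unconsumed result */
} adder_stage;

/* Producer side: fire only when both operands are valid and the latch is free.
   Returns true (the "ack" to the producers) when the inputs were consumed. */
bool stage_fire(adder_stage *s, int a, bool a_valid, int b, bool b_valid) {
    if (!a_valid || !b_valid || s->valid) return false;
    s->data = a + b;
    s->valid = true;
    return true;
}

/* Consumer side: take the result and free the latch.
   Returns true (the consumer's "ack") when a result was available. */
bool stage_take(adder_stage *s, int *out) {
    if (!s->valid) return false;
    *out = s->data;
    s->valid = false;
    return true;
}
```

A stage that already holds a result refuses new inputs until the consumer takes it, so no clock is needed to keep neighbouring stages in step.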
22
22 Distributed Control Logic. [Figure: + and - units with rdy/ack signals; a global FSM contrasted with asynchronous control.] Short, local wires.
23
23 Outline Problems of current architectures CASH: Compiling ASH –program representation –compiling C programs Pipelining ASH Evaluation Conclusions
24
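24 MUX: Forward Branches. if (x > 0) y = -x; else y = b*x; Conditionals ⇒ speculation. SSA = no arbitration. [Figure: x, b, and 0 feed the *, -, and > units; a multiplexor selects y using the predicate and its negation (!); the critical path is marked.]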
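The slide's conditional can be written both ways in C to show what "conditionals become speculation" means. This is a sketch with hypothetical function names; it assumes small inputs so that `-x` and `b * x` do not overflow.

```c
/* Control-flow form, as on the slide. */
int branch_version(int x, int b) {
    int y;
    if (x > 0) y = -x; else y = b * x;
    return y;
}

/* Dataflow form: both arms are evaluated unconditionally (speculation),
   and the predicate only steers the multiplexor. */
int mux_version(int x, int b) {
    int neg  = -x;          /* computed even when x <= 0 */
    int prod = b * x;       /* computed even when x > 0 */
    int p    = x > 0;       /* predicate */
    return p ? neg : prod;  /* multiplexor picks one of the two results */
}
```

Both functions return the same value for every input. The difference is that the second never waits for the comparison before starting the negation and the multiplication.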
25
25 Control Flow ⇒ Data Flow. [Figure: Merge (label), Gateway (data, predicate), and Split (branch, on p and !p) operators.]
26
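26 Loops. int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [Figure: dataflow circuit with an i loop (initial 0, +1, < 100) and a sum loop (initial 0, *, +); the negated predicate (!) releases ret.]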
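The loop can be restated in C so that the two loop-carried values and the predicate are explicit. This is a sketch only: the function name and the `next_*` temporaries are hypothetical, and the sequential loop does not model the concurrency of the circuit.

```c
#include <stdbool.h>

/* Dataflow-style rendering of: for (i=0; i < 100; i++) sum += i*i; */
int sum_of_squares_dataflow(void) {
    int i = 0, sum = 0;               /* initial values entering the loop */
    for (;;) {
        bool p = i < 100;             /* loop predicate */
        if (!p) return sum;           /* !p releases sum to ret */
        int next_sum = sum + i * i;   /* sum's loop: * then + */
        int next_i   = i + 1;         /* i's loop: +1 */
        sum = next_sum;               /* back edges carry the new values */
        i   = next_i;
    }
}
```

The function returns 328350, the sum of the squares of 0 through 99.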
27
27 Predication and Side-Effects. [Figure: Load operator with addr, data, pred, and token ports, connected to memory.] pred: no speculation. token: sequencing of side-effects.
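A predicated, token-ordered load might be modeled as below. The behavior shown is an assumption for illustration: `token_t` is a hypothetical counter standing in for the token, and the sketch assumes that a load with a false predicate skips the memory access but still forwards the token.

```c
#include <stdbool.h>

typedef unsigned token_t;   /* hypothetical: a counter standing in for the token */

/* Assumed behavior: with a false predicate the load does not touch memory,
   but it still passes the token on so later side-effects are not blocked. */
token_t predicated_load(const int *addr, bool pred, token_t token_in, int *data) {
    if (pred) {
        *data = *addr;      /* memory is accessed only when pred is true */
    }
    return token_in + 1;    /* token handed to the next side-effect in order */
}
```

Because the address is dereferenced only under a true predicate, a load on an untaken path can never fault, which is why side-effects are predicated rather than speculated.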
28
28 Memory Access. [Figure: LD, ST, and LD operators (local communication) reach a monolithic memory through a pipelined arbitrated network (global structures).] Future work: fragment this!
29
29 CASH Optimizations
–SSA-based optimizations: unreachable/dead code, gcse, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining
–Memory optimizations: dependence & alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling
–Boolean optimizations: Espresso CAD tool, bitwidth analysis
30
30 Outline Problems of current architectures Compiling ASH Pipelining Evaluation: CASH vs. clocked designs Conclusions
31
31 Pipelining. int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [Figure, step 1: loop circuit with nodes i, +1, <= 100, *, +, and sum; the multiplier is pipelined (8 stages).]
32
32 Pipelining i + <= 100 1 * + sum step 2
33
33 Pipelining i + <= 100 1 * + sum step 3
34
34 Pipelining i + <= 100 1 * + sum step 4
35
35 Pipelining i + <= 100 1 i=1 i=0 + sum step 5
36
36 Pipelining i + <= 100 1 * i=1 i=0 + sum step 6
37
37 Pipelining. [Figure, step 7: i’s loop and sum’s loop, with the predicate edge and the long-latency pipe marked.]
38
38 Pipelining. Predicate ack edge is on the critical path. [Figure: critical path marked across i’s loop and sum’s loop.]
39
39 Pipeline balancing. [Figure, step 7: the circuit with i’s loop and sum’s loop, with a decoupling FIFO added.]
40
40 Pipeline balancing. [Figure: the circuit with i’s loop, sum’s loop, and the decoupling FIFO; the new critical path is marked.]
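A decoupling FIFO is, in software terms, a bounded queue between a fast producer and a slow consumer. The sketch below assumes the FIFO buffers predicate values between i's loop and sum's loop, and it picks a depth of 8 to match the 8-stage multiplier. The names and the depth are assumptions, not taken from the slides.

```c
#include <stdbool.h>
#include <stddef.h>

#define FIFO_DEPTH 8   /* assumption: sized to match the 8-stage multiplier */

typedef struct {
    bool   slots[FIFO_DEPTH];  /* buffered predicate values */
    size_t head, count;        /* index of oldest entry, number of entries */
} pred_fifo;

/* Producer side: returns false when full, meaning the producer must stall. */
bool fifo_push(pred_fifo *f, bool pred) {
    if (f->count == FIFO_DEPTH) return false;
    f->slots[(f->head + f->count) % FIFO_DEPTH] = pred;
    f->count++;
    return true;
}

/* Consumer side: returns false when empty, meaning the consumer must wait. */
bool fifo_pop(pred_fifo *f, bool *pred) {
    if (f->count == 0) return false;
    *pred = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

With the buffer in place, the producer can run up to `FIFO_DEPTH` iterations ahead before it has to wait for an acknowledgment from the slower loop.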
41
41 Outline Problems of current architectures Compiling ASH Pipelining Evaluation: CASH vs. clocked designs Conclusions
42
42 Evaluating ASH. Flow: Mediabench kernels (1 hot function/benchmark) in C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC (180nm std. cell library, 2V; ~1999 technology). ModelSim (Verilog simulation, with Mem) → performance numbers.
43
43 ASH Area. [Figure: normalized area, with a minimal RISC core as reference; P4: 217.]
44
44 ASH vs 600MHz CPU [.18 μm]
45
45 Bottleneck: Memory Protocol. [Figure: LD and ST operators, LSQ, and Memory.] Token release to dependents requires a round-trip to memory. Limit study: round trip in zero time ⇒ up to 6x speed-up. Exploring a protocol for in-order data delivery & fast token release.
46
46 Power. [Figure: power comparison with reference values DSP 110, μP 4000, Xeon [+cache] 67000.]
47
47 Energy Efficiency. [Figure: energy efficiency in operations/nJ, log scale from 0.01 to 1000, comparing microprocessors, asynchronous μP, general-purpose DSP, FPGAs, ASH media kernels, and dedicated hardware; a 1000x gap is marked.]
48
48 Outline Problems of current architectures +Compiling ASH +Pipelining +ASH Evaluation =Future/related work & conclusions
49
49 Related Work: Nanotechnology; Dataflow machines; High-level synthesis; Reconfigurable computing; Computer architecture; Embedded systems; Asynchronous circuits; Compilation
50
50 Future Work: Optimizations for area/speed/power; Memory partitioning; Concurrency; Compiler-guided layout; Explore extensible ISAs; Hybridization with superscalar mechanisms; Reconfigurable hardware support for ASH; Formal verification
51
51 How far can you go? Grand Vision: Certified Circuit Generation. Translation validation: input ≡ output. Preserve input properties –e.g., C programs cannot deadlock –e.g., type-safe programs cannot crash. Debug, test, verify only at source-level. [Figure: HLL → IR → IR opt → Verilog → gates → layout, formally validated.]
52
52 Conclusions. Spatial computation strengths (Feature: Advantages)
–No interpretation: Energy efficiency, speed
–Spatial layout: Short wires, no contention
–Asynchronous: Low power, scalable
–Distributed: No global signals
–Automatic compilation: Design productivity, no ISA
53
53 Backup Slides: Reconfigurable hardware; Critical paths; Control logic; ASH vs...; ASH weaknesses; Exceptions; Normalized area; Why C?; Splitting memory; More performance; Recursive calls
54
54 Reconfigurable Hardware Universal gates and/or storage elements Interconnection network Programmable switches
55
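55 Main RH Ingredient: RAM Cell. Universal gate = RAM. [Figure: a RAM addressed by a0 and a1 with contents 0, 0, 0, 1; its data output is labeled “a1 & a2”.] Switch controlled by a 1-bit RAM cell. [Figure: a switch with data in and a control input driven by a RAM cell holding 0.]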
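The "universal gate = RAM" idea is a lookup table: the stored bits are the truth table and the gate inputs form the read address. A minimal sketch for two inputs follows. The names `lut2`, `LUT_AND`, and `LUT_XOR` are hypothetical, and the bit order (bit k of `ram` holds the contents of address k) is a choice made for this example.

```c
#include <stdint.h>

/* A 2-input lookup table: the 4 RAM bits are the truth table, and the
   inputs a1, a0 form the read address. Bit k of ram is address k. */
unsigned lut2(uint8_t ram, unsigned a1, unsigned a0) {
    unsigned address = ((a1 & 1u) << 1) | (a0 & 1u);
    return (ram >> address) & 1u;
}

/* RAM contents 0,0,0,1 at addresses 0..3 implement AND (only bit 3 set). */
#define LUT_AND 0x8u
/* RAM contents 0,1,1,0 at addresses 0..3 implement XOR (bits 1 and 2 set). */
#define LUT_XOR 0x6u
```

Reprogramming the gate means rewriting the four stored bits. The lookup logic never changes, which is what makes the element universal.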
56
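56 Critical Paths. if (x > 0) y = -x; else y = b*x; [Figure: x, b, and 0 feed the *, -, and > units; a multiplexor selects y using the predicate and its negation (!).]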
57
57 Lenient Operations. if (x > 0) y = -x; else y = b*x; Solves the problem of unbalanced paths. [Figure: the same circuit as on the Critical Paths slide.]
58
58 Asynchronous Control. [Figure labels: =, rdy in, ack out, rdy out, ack in, data in, data out, Reg, C.]
59
59 HLL to HW
–High-level Synthesis: Behavioral HDL → Synchronous Hardware
–Reconfigurable Computing: C [subsets] → Hardware configuration (spatial computation)
–Asynchronous circuits: Concurrent Language → Asynchronous Hardware
[Figure legend: Prior work, This research.]
60
60 CASH vs High-Level Synthesis CASH: the only existing tool to translate complete ANSI C to hardware CASH generates asynchronous circuits CASH does not treat C as an HDL –no annotations required –no reactivity model –does not handle non-C, e.g., concurrency
61
61 ASH Weaknesses Low efficiency for low-ILP code Does not adapt at runtime Monolithic memory Resource waste Not flexible No support for exceptions
62
62 ASH Weaknesses (2): Both branch and join not free; Static dataflow (no re-issue of same instr); Memory is “far”; Fully static –No branch prediction –No dynamic unrolling –No register renaming; Calls/returns not lenient
63
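63 Branch Prediction. for (i=0; i < N; i++) {... if (exception) break; } CPU: the exception branch is predicted not taken (effectively a noop for the CPU!) and the loop branch is predicted taken, so the result is available before the inputs. [Figure: circuit with nodes i, +1, <, &, !, and exception; the ASH critical path and the CPU critical path are marked.]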
64
64 Exceptions. Strictly speaking, C has no exceptions. In practice, it is hard to accommodate exceptions in hardware implementations. An advantage of software flexibility: the PC is a single point of execution control. [Figure: CPU (low-ILP computation + OS + VM + exceptions) and ASH (high-ILP computation) sharing memory through caches ($).]
65
65 Why C Huge installed base Embedded specifications written in C Small and simple language –Can leverage existing tools –Simpler compiler Techniques generally applicable Not a toy language
66
66 Performance
67
67 Parallelism Profile
68
68 Normalized Area
69
69 Memory Partitioning. MIT RAW project: Babb FCCM ’99, Barua HiPC ’00, Lee ASPLOS ’00. Stanford SpC: Semeria DAC ’01, TVLSI ’02. Berkeley CCured: Necula POPL ’02. Illinois FlexRAM: Fraguela PPoPP ’03. Hand-annotations: #pragma
70
70 Memory Complexity. [Figure labels: LSQ, RAM, addr, data.]
71
71 Recursion. [Figure: around a recursive call, live values are saved to and restored from a stack.]
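The save/restore pattern can be sketched in C by turning a recursive function into a loop with an explicit stack. Factorial is used here only as a stand-in example; the function name and the fixed `STACK_DEPTH` are assumptions, and inputs that overflow `unsigned long` or exceed the stack depth are not handled.

```c
#define STACK_DEPTH 64   /* assumption: fixed-size stack for the sketch */

/* Recursive factorial rewritten with an explicit stack: the live value n is
   saved before each "recursive call" and restored after it "returns". */
unsigned long factorial_explicit_stack(unsigned n) {
    unsigned stack[STACK_DEPTH];
    int top = 0;
    while (n > 1 && top < STACK_DEPTH) {
        stack[top++] = n;        /* save live value */
        n = n - 1;               /* recursive call with n - 1 */
    }
    unsigned long result = 1;    /* base case */
    while (top > 0) {
        result *= stack[--top];  /* restore live value, finish the multiply */
    }
    return result;
}
```

For example, `factorial_explicit_stack(5)` saves 5, 4, 3, 2 on the way down and multiplies them back in on the way up, yielding 120.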
72
72 Me?