1
Presentation, May 17, 2004. Mihai Budiu (mihaib@cs.cmu.edu), Carnegie Mellon University. Spatial Computation: Computing without General-Purpose Processors
2
2 Spatial Computation. Mihai Budiu (mihaib@cs.cmu.edu), Carnegie Mellon University. A computation model based on: application-specific hardware, no interpretation, minimal resource sharing.
3
3 Research Scope Object: future architectures Tool: compilers Evaluation: simulators
4
4 Three Spatial Computation Projects: virtual reconfigurable hardware, nanoFabrics, and Application-Specific Hardware (ASH); a C compiler targeting reconfigurable hardware. Publications: [FPGA 99] [ISCA 99] [IEEE Computer 00] [Euro-Par 00] [2 licenses] [ISCA 01] [ASAP 03] [Chapter 03] [FCCM 01] [FPL 02] [FPL 02a] [CGO 03] [IWLS 04] [MSP 04] [3 submitted].
5
5 Main Results of My Research (1): Developed the DIL compiler. Completely replaces the CAD tool-chain; 700 times faster than commercial tools; new optimizations (BitValue, place-and-route); streaming kernels execute 20-300 times faster than on a µP.
6
6 Main Results of My Research (2): nanoFabrics. Identified strengths & limitations of nanodevices; proposed a new reconfigurable architecture & HLL → HW compilation for spatial computation; studied first-order properties of spatial computation.
7
7 Main Results of My Research (3): Application-Specific Hardware (ASH), a compiler-synthesized architecture. Fast prototyping: automatic from ANSI C → HW. High performance: sustained > 0.8 GOPS [180nm]. Low power: energy/op 100-1000× better than a µP.
8
8 Related Work Nanotechnology Dataflow machines High-level synthesis Reconfigurable computing Computer architecture Embedded systems Asynchronous circuits Compilation
9
9 Outline Research overview Problems of current architectures Compiling Application-Specific Hardware ASH Evaluation New compiler optimizations Conclusions
10
10 Resources: We do not worry about not having hardware resources; we worry about being able to use hardware resources. [Intel]
11
11 Complexity: [Chart: number of ALUs.] Cannot rely on global signals (the clock is a global signal). Delays: gate 5 ps, wire 10 ps.
12
12 Instruction-Set Architecture: the ISA layer between software and hardware is very rigid to change (e.g., x86 vs. Itanium).
13
13 Our Proposal: ASH addresses these problems. ASH is not a panacea; ASH is “complementary” to the CPU. [Diagram: ASH handles high-ILP computation; the CPU handles low-ILP computation + OS + VM; both share the memory hierarchy ($).]
14
14 What’s New? Investigate a new computational model: source is full ANSI C, result is an asynchronous circuit. Build spatial dataflow hardware: no resource limitations. New compiler algorithms. End-to-end results: –C to structural Verilog in seconds –high performance results –excellent power efficiency
15
15 Outline Research overview Problems of current architectures CASH: Compiling ASH –program representation –compiling C programs ASH Evaluation New compiler optimizations Conclusions
16
16 Application-Specific Hardware: C program → Compiler → Dataflow IR → HW backend → reconfigurable/custom hw. [Diagram labels: SW, HW, ISA.]
17
17 Application-Specific Hardware: the same compiler also has a software backend: C program → Compiler → Dataflow IR → SW backend [predication] → CPU.
18
18 Key: Intermediate Representation. Traditionally a CFG; our IR is a dataflow IR: SSA + predication + speculation; uniform for scalars and memory (def-use and may-dep. edges); explicitly encodes may-depend; executable, with precise semantics; close to the asynchronous target.
19
19 Computation = Dataflow. Operations ⇒ functional units; variables ⇒ wires; no interpretation. Example: x = a & 7; ... y = x >> 2; [Diagram: programs become circuits; an AND-with-7 unit feeds a shift-right-by-2 unit through wire x.]
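A minimal C sketch of this mapping (my own illustration, not the compiler's output), where each operator becomes its own "functional unit" and the variable x is just the wire between them:

#include <stdio.h>

static int and_unit(int a)   { return a & 7;  }   /* AND unit with hard-wired constant 7 */
static int shift_unit(int x) { return x >> 2; }   /* shift unit with hard-wired constant 2 */

static int circuit(int a) {
    int x = and_unit(a);      /* wire x */
    return shift_unit(x);     /* wire y */
}

int main(void) {
    printf("%d\n", circuit(29));   /* (29 & 7) >> 2 = 1 */
    return 0;
}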
20
20 Basic Computation: an adder with data and valid signals, an ack signal, and an output latch.
21
21 Asynchronous Computation: data, valid, and ack signals around the adder and latch (animation frames 1-8).
22
22 Distributed Control Logic: no global FSM; asynchronous control using ack/rdy handshakes and short, local wires.
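The following is a minimal C sketch, under my own simplified encoding, of the local rdy/ack handshake that replaces a global FSM: a stage latches a new value only when its input is valid and its previous output has been acknowledged (channel_t, stage_t, and stage_step are illustrative names, not from CASH):

typedef struct { int data; int valid; } channel_t;   /* producer-side rdy/data   */
typedef struct { int data; int full;  } stage_t;     /* one latch with its state */

/* Returns 1 (ack to the producer) when the stage accepts the input. */
static int stage_step(stage_t *s, channel_t in, int downstream_ack) {
    if (s->full && downstream_ack)
        s->full = 0;              /* consumer acknowledged: slot frees up */
    if (!s->full && in.valid) {
        s->data = in.data;        /* latch new data                       */
        s->full = 1;
        return 1;                 /* ack back to the producer             */
    }
    return 0;
}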
23
23 Outline Research overview Problems of current architectures CASH: Compiling ASH –program representation –compiling C programs ASH Evaluation New compiler optimizations Conclusions
24
24 MUX: Forward Branches. if (x > 0) y = -x; else y = b*x; Conditionals ⇒ speculation: both arms execute and a MUX selects y (SSA = no arbitration); the multiplier arm is the critical path.
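A hedged before/after illustration in C of the speculative form this slide describes (select_y is my own name): both arms are computed unconditionally and the predicate only drives the final MUX:

#include <stdio.h>

int select_y(int x, int b) {
    int y_then = -x;        /* speculated: the negate unit always runs */
    int y_else = b * x;     /* speculated: the multiplier always runs  */
    int p      = (x > 0);   /* predicate drives the MUX select         */
    return p ? y_then : y_else;
}

int main(void) {
    printf("%d %d\n", select_y(3, 5), select_y(-2, 5));   /* -3 -10 */
    return 0;
}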
25
25 Control Flow ⇒ Data Flow: a merge (label) combines data; a split/gateway (branch) steers data according to a predicate p.
26
26 Loops. int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [Diagram: dataflow loop; i is incremented and compared against 100, i*i feeds the sum accumulator, and the circuit can be pipelined.]
27
27 Predication and Side-Effects: no speculation; sequencing of side-effects. A Load node has addr, data, pred, and token ports; the token sequences side-effects to memory.
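A rough C model, with assumed names (token_t, pred_load), of how a predicated load might behave: the memory access happens only when the predicate is true, and a token is passed along so that may-dependent memory operations keep their program order:

typedef struct { int seq; } token_t;   /* ordering token threaded between memory ops */

static token_t pred_load(int pred, const int *addr, int *out, token_t t) {
    if (pred)
        *out = *addr;   /* the side effect is never speculated          */
    t.seq += 1;         /* token released so dependent LD/ST may fire  */
    return t;
}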
28
28 Memory Access: LD/ST operations reach a monolithic memory through a pipelined, arbitrated network (local communication vs. global structures). Future work: fragment this! (see related work; complexity)
29
29 CASH Optimizations SSA-based optimizations –unreachable/dead code, gcse, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining Memory optimizations –dependence & alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling Boolean optimizations –Espresso CAD tool, bitwidth analysis
30
30 Outline Research overview Problems of current architectures Compiling ASH Evaluation: CASH vs. clocked designs New compiler optimizations Conclusions
31
31 Evaluating ASH: C → CASH → core Verilog → back-end (Synopsys, Cadence P/R) → ASIC; 180nm std. cell library, 2V, ~1999 technology. Mediabench kernels (1 hot function/benchmark); ModelSim (Verilog simulation) provides the performance numbers.
32
32 ASH Area: [Chart: normalized area; the P4 is 217, compared against a minimal RISC core.]
33
33 ASH vs 600MHz CPU [0.18 µm]
34
34 Bottleneck: Memory Protocol. Token release to dependents requires a round-trip to memory (LD/ST through the LSQ). Limit study: a zero-time round trip ⇒ up to 6x speed-up. Exploring a protocol for in-order data delivery & fast token release.
35
35 Power: DSP 110; µP 4000; Xeon [+cache] 67000.
36
36 Energy Efficiency [Operations/nJ]: [Chart, log scale from 0.01 to 1000: microprocessors, asynchronous µP, general-purpose DSP, dedicated hardware, and ASH media kernels; ASH is roughly 1000x better than microprocessors.]
37
37 Outline Research overview Nanotechnology and architecture Compiling ASH ASH Evaluation New compiler optimizations BitValue dataflow analysis Optimizing memory accesses SIDE: static instantiation, dynamic evaluation Conclusions
38
38 Detecting Constant Bits: b = a >> 4; the top four bits of b are known to be 0000.
39
39 Detecting Useless Bits: b = a >> 4; the four bits of a that are shifted out are don’t-care (XXXX) bits.
40
40 BitValue Dataflow Analysis: dataflow on the bit lattice {U, 0, 1, X} (in practice on 32-bit vectors). Forward ⇒ generalizes constant propagation; backward ⇒ generalizes dead-code elimination; the transfer functions are non-trivial. Example: b = a >> 4 gives b = 0000…, a = …XXXX.
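As a sketch of the forward half of this analysis (my own simplified encoding, tracking only known-zero bits rather than the full lattice), here is how b = a >> 4 yields four constant zero bits:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t known_zero; } bits_t;   /* mask of bits proven to be 0 */

static bits_t shr_transfer(bits_t a, unsigned k) {
    bits_t b;
    /* a logical right shift brings in k zeros at the top and slides a's zeros down */
    b.known_zero = (a.known_zero >> k) | ~(UINT32_MAX >> k);
    return b;
}

int main(void) {
    bits_t a = { 0 };                    /* nothing known about a */
    bits_t b = shr_transfer(a, 4);       /* b = a >> 4            */
    printf("known-zero mask of b: 0x%08" PRIx32 "\n", b.known_zero);   /* 0xf0000000 */
    return 0;
}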
41
41 BitValue on C Programs: [Chart: % useless bits in integer arithmetic for Mediabench, SpecInt95, and SpecInt2K; up to about 27%.]
42
42 Outline [...] New compiler optimizations BitValue dataflow analysis Memory access optimization Static Instantiation, Dynamic Evaluation Conclusions
43
43 Meaning of Token Edges [the token graph is maintained transitively reduced]: a token edge between *p=… and …=*q means “maybe dependent, with no intervening memory operation”; operations without a token path are independent.
44
44 Dead Code Elimination: a memory operation whose predicate is false, e.g. *p=… (false), is removed.
45
45 ≈ PRE: two loads …=*p with predicates p1 and p2 become a single load …=*p with predicate p1 ∨ p2. This corresponds in the CFG to lifting the load to a basic block dominating the original loads.
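A C-level before/after picture of this rewrite (function names are illustrative); both versions compute the same result, but the second issues at most one load:

/* before: two loads, predicated on p1 and p2 */
int pre_before(const int *p, int p1, int p2) {
    int x = p1 ? *p : 0;
    int y = p2 ? *p : 0;
    return x + y;
}

/* after: a single load predicated on p1 || p2 */
int pre_after(const int *p, int p1, int p2) {
    int v = (p1 || p2) ? *p : 0;
    int x = p1 ? v : 0;
    int y = p2 ? v : 0;
    return x + y;
}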
46
46 Register Promotion: a store *p=… (predicate p1) followed by a load …=*p (predicate p2) becomes the store plus a load with predicate p2 ∧ ¬p1: the load is executed only if the store is not.
47
47 Register Promotion (2): when p2 ⇒ p1, the load’s predicate p2 ∧ ¬p1 is false, so the load becomes dead... i.e., when the store dominates the load in the CFG.
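A sketch of both register-promotion slides as source-level rewrites (my own rendering; the compiler works on the dataflow IR, not on C): the load executes only under p2 && !p1, and when p2 implies p1 it disappears entirely:

/* before: store under p1, load under p2 */
int promote_before(int *p, int p1, int p2, int v) {
    if (p1) *p = v;
    int x = 0;
    if (p2) x = *p;
    return x;
}

/* after: forward the stored value; memory is read only when the store did not run */
int promote_after(int *p, int p1, int p2, int v) {
    if (p1) *p = v;
    int x = 0;
    if (p2) x = p1 ? v : *p;   /* load predicate is effectively p2 && !p1 */
    return x;
}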
48
48 Outline [...] New compiler optimizations BitValue dataflow analysis Memory access optimization A SIDE dish: dataflow analysis Static Instantiation, Dynamic Evaluation Conclusions
49
49 Availability Dataflow Analysis: y = a*b; ... if (x) { ... ... = a*b; } — the second a*b is available in y and can be reused.
50
50 Dataflow Analysis Is Conservative: if (x) { ... y = a*b; } ... ... = a*b; here the analysis cannot tell whether y holds a*b (y?).
51
51 Static Instantiation, Dynamic Evaluation: flag = false; if (x) { ... y = a*b; flag = true; } ... ... = flag ? y : a*b;
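A runnable version of the slide's example, with assumed surrounding code, showing why the dynamic flag recovers the availability fact the static analysis could not prove:

#include <stdbool.h>
#include <stdio.h>

int side_example(int x, int a, int b) {
    bool flag = false;          /* statically instantiated availability bit */
    int y = 0;
    if (x) {
        y = a * b;
        flag = true;
    }
    return flag ? y : a * b;    /* dynamically evaluated: recompute only if needed */
}

int main(void) {
    printf("%d %d\n", side_example(1, 3, 4), side_example(0, 3, 4));   /* 12 12 */
    return 0;
}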
52
52 SIDE Register Promotion Effect: [Chart: % reduction in loads and stores.]
53
53 Outline: Research overview; Problems of current architectures; Compiling ASH; ASH Evaluation; New compiler optimizations; Future work & conclusions.
54
54 Future Work Optimizations for area/speed/power Memory partitioning Concurrency Compiler-guided layout Explore extensible ISAs Hybridization with superscalar mechanisms Reconfigurable hardware support for ASH Formal verification
55
55 How far can you go? Grand Vision: Certified Circuit Generation. Translation validation: input ≡ output. Preserve input properties –e.g., C programs cannot deadlock –e.g., type-safe programs cannot crash. Debug, test, verify only at source level. Pipeline: HLL → IR → IR opt → Verilog → gates → layout, formally validated.
56
56 Conclusions: spatial computation strengths. Feature → Advantages: no interpretation → energy efficiency, speed; spatial layout → short wires, no contention; asynchronous → low power, scalable; distributed → no global signals; automatic compilation → design productivity, no ISA.
57
57 Backup Slides Reconfigurable hardware Critical paths Software pipelining Control logic More on PipeRench ASH vs... ASH weaknesses Exceptions Research methodology Normalized area Why C? Splitting memory More performance Recursive calls Nanotech and architecture
58
58 Reconfigurable Hardware Universal gates and/or storage elements Interconnection network Programmable switches
59
59 Main RH Ingredient: the RAM Cell. A switch is controlled by a 1-bit RAM cell; a universal gate is a RAM addressed by its inputs (a0, a1), whose stored contents (e.g., 00010001) determine the function computed (e.g., a1 & a2). (back to talk)
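A small C sketch of why a RAM cell array acts as a universal gate (lut2 and the truth-table encoding are mine): the inputs address the stored truth table, so any two-input boolean function is obtained just by changing the stored bits:

#include <stdio.h>

static int lut2(unsigned truth_table, int a0, int a1) {
    return (truth_table >> ((a1 << 1) | a0)) & 1;   /* inputs select one stored bit */
}

int main(void) {
    unsigned and_tt = 0x8;   /* binary 1000: output is 1 only when a1 = a0 = 1 */
    printf("%d %d\n", lut2(and_tt, 1, 1), lut2(and_tt, 0, 1));   /* 1 0 */
    return 0;
}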
60
60 Pipeline Stage: a C-element combines rdy in / ack out and rdy out / ack in to control the data register (data in → Reg → data out). (back to talk)
61
61 Critical Paths: if (x > 0) y = -x; else y = b*x; [Circuit: the multiplier arm b*x is the long path.]
62
62 Lenient Operations: if (x > 0) y = -x; else y = b*x; lenient operations solve the problem of unbalanced paths. (back to talk)
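One way to picture leniency, as a hedged simulation in C (wire_t and lenient_mux are illustrative names): the MUX can fire as soon as the predicate and the selected arm are ready, without waiting for the slow multiplier on the untaken arm:

typedef struct { int ready; int value; } wire_t;

static wire_t lenient_mux(wire_t pred, wire_t a, wire_t b) {
    wire_t out = { 0, 0 };
    if (!pred.ready)
        return out;                       /* need the predicate first            */
    wire_t sel = pred.value ? a : b;      /* only the selected arm must be ready */
    if (sel.ready) {
        out.ready = 1;
        out.value = sel.value;
    }
    return out;
}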
63
63 Pipelining i + <= 100 1 * + sum pipelined multiplier (8 stages) int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; step 1
64
64 Pipelining i + <= 100 1 * + sum step 2
65
65 Pipelining i + <= 100 1 * + sum step 3
66
66 Pipelining i + <= 100 1 * + sum step 4
67
67 Pipelining i + <= 100 1 i=1 i=0 + sum step 5
68
68 Pipelining i + <= 100 1 * i=1 i=0 + sum step 6 back
69
69 Pipelining i + <= 100 1 * + sum i’s loop sum’s loop Long latency pipe predicate step 7
70
70 Predicate ack edge is on the critical path. Pipelining i + <= 100 1 * + sum critical path i’s loop sum’s loop
71
71 Pipeline balancing i + <= 100 1 * + sum i’s loop sum’s loop decoupling FIFO step 7
72
72 Pipeline balancing i + <= 100 1 * + sum i’s loop sum’s loop critical path decoupling FIFO back back to talk
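A sequential C caricature of the decoupling FIFO (array size and names are mine): the fast i-loop runs ahead and queues i*i, and the slower sum-loop drains the queue, which is the effect the FIFO has on the critical path:

#include <stdio.h>

#define N 100
int main(void) {
    int fifo[N], head = 0, tail = 0;
    for (int i = 0; i < N; i++)     /* producer: i's loop                   */
        fifo[tail++] = i * i;
    int sum = 0;
    while (head < tail)             /* consumer: sum's loop drains the FIFO */
        sum += fifo[head++];
    printf("%d\n", sum);            /* 328350 */
    return 0;
}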
73
73 Process: 0.18 µm, 6 Al metal layers; Area: 49 mm²; Clock: 60 MHz I/O, 120 MHz internal; Power: < 4 W; Stripes: 16 physical, 256 virtual. Compiler functional on first silicon; licensed by two companies.
74
74 Hardware Virtualization: overlap configuration with computation; configurations are paged in and out of the configuration hardware while computation proceeds.
75
75 PipeRench Hardware: [Diagram: alternating rows of ALUs with registers and interconnection networks; data flows from stripe to stripe.]
76
76 Mapping Computation: [Diagram: operators (+, concat, <<, substr, >>, &~) mapped onto registers and networks; bit-shuffling uses the network itself for computation.]
77
77 Compiler-Controlled Clock: [Diagram: register/network stages driven by either a slow clock or a fast clock, chosen by the compiler.]
78
78 Time-Multiplexing Wires: one channel is available for two wires; one computes in even cycles, the other in odd cycles.
79
79 Compilation Times (sec on PII/400)
80
80 Compilation Speed (PII/400)
81
81 Placed Circuit Utilization
82
82 PipeRench Performance: speed-up vs. a 300 MHz UltraSparc.
83
83 PipeRench Compiler Role Classical optimizations Partial evaluation Data width inference (~ type inference) Module generation (~ macro expansion) Placement (~ VLIW scheduling) Routing (~irregular register allocation) Network link multiplexing (~ spilling) Clock-cycle management Technology mapping (~ instruction selection) Code generation back
84
84 HLL to HW. Prior work: High-level Synthesis (behavioral HDL → synchronous hardware); Reconfigurable Computing (C [subsets] → hardware configuration, spatial computation); Asynchronous circuits (concurrent language → asynchronous hardware). This research: C → asynchronous spatial-computation hardware.
85
85 CASH vs High-Level Synthesis CASH: the only existing tool to translate complete ANSI C to hardware CASH generates asynchronous circuits CASH does not treat C as an HDL –no annotations required –no reactivity model –does not handle non-C, e.g., concurrency back
86
86 ASH Weaknesses Low efficiency for low-ILP code Does not adapt at runtime Monolithic memory Resource waste Not flexible No support for exceptions
87
87 ASH Weaknesses (2) Both branch and join not free Static dataflow (no re-issue of same instr) Memory is “far” Fully static – No branch prediction – No dynamic unrolling – No register renaming Calls/returns not lenient back
88
88 Branch Prediction: for (i=0; i < N; i++) { ... if (exception) break; } On a CPU the exit branch is predicted not taken, effectively a no-op (its result is available before its inputs); in ASH the i+1, < N, and !exception chain sits on the critical path. (back)
89
89 Research Methodology: [Diagram: a constraint space with axes X (e.g., power) and Y (e.g., cost), “reasonable limits”, the state of the art, incremental evolution, and new solutions.] (back)
90
90 Exceptions: strictly speaking, C has no exceptions; in practice it is hard to accommodate exceptions in hardware implementations. An advantage of software flexibility: the PC is a single point of execution control. [Diagram: ASH handles high-ILP computation; the CPU handles low-ILP computation + OS + VM + exceptions; shared memory hierarchy ($$$).] (back)
91
91 Why C Huge installed base Embedded specifications written in C Small and simple language –Can leverage existing tools –Simpler compiler Techniques generally applicable Not a toy language back
92
92 Performance
93
93 Parallelism Profile
94
94 Normalized Area back back to talk
95
95 Memory Partitioning: MIT RAW project: Babb FCCM ‘99, Barua HiPC ‘00, Lee ASPLOS ‘00; Stanford SpC: Semeria DAC ‘01, TVLSI ‘02; Berkeley CCured: Necula POPL ‘02; Illinois FlexRAM: Fraguella PPoPP ‘03; hand-annotations (#pragma). (back to talk)
96
96 Memory Complexity: [Diagram: an LSQ in front of the RAM, with addr and data paths.] (back to talk)
97
97 Recursion: a recursive call saves live values to a stack before the call and restores them afterwards. (back)
98
98 Nanotechnology and Architecture
99
99 Nanotechnology Implications new devices new manufacturing new architectures new compilers my work
100
100 CAEN: study the computer-architecture implications of Chemically-Assembled Electronic Nanotechnology. [Diagram: a molecular gate with VDD, Input 1, Input 2, and Output, contrasted with lithography.]
101
101 No Complex Irregular Structures
102
102 Regular Substrate: on the order of 10^11 gates, plus control.
103
103 High Defect Rate
104
104 Paradigm Shift: from a complex fixed chip + program (executable) to a dense, regular structure + configuration (with defects).
105
105 New Computer Architecture: CMOS → self-assembled circuits; transistor → new molecular devices; custom hardware → reconfigurable hardware; yield (defect) control → defect tolerance through reconfiguration; synchronous circuits → asynchronous computation; microprocessors → application-specific hardware + CPU.
106
106 Exploiting Nanotechnology. Nanotechnology: + cheap, + high density, + low power, – unreliable. Computer architecture: + vast body of knowledge, – expensive, – high power. Reconfigurable computing: + defect tolerant, + high performance, – low density.
107
107 Research Convergence: as feature sizes decrease, deep sub-micron CMOS and chemically-assembled electronic nanotechnology face the same systems research issues; my work targets that convergence. (back)
108
108 Venues