1
Computing Without Processors
Thesis Proposal
Mihai Budiu
July 30, 2001
This presentation uses TeXPoint by George Necula
Thesis Committee:
  Seth Goldstein, chair
  Todd Mowry
  Peter Lee
  Babak Falsafi, ECE
  Nevin Heintze, Agere Systems
2
2 Four Types of Research
Solve nonexistent problems
Solve past problems
Solve current problems
Solve future problems
3
3 The Law (source: Intel)
4
4 The Crossover Phenomenon
[Plot; axes: time, technology]
5
5 Example Crossover
[Plot of access speed (ns) over time; curves: DRAM, CPU; regions: no caches, caches; marks: 1980, 200]
6
Trouble Ahead for Microarchitecture
7
7 Signal Propagation
[Plot of distance (mm) over time; curves: die size, distance in 1 clock; marks: now, 20]
8
8 Reliability & Yield
[Plot of defects/chip over time; curves: tolerable, occurring; marks: new process, now]
9
9 Energy
[Plot of power over time; curves: CPU consumption, thermal dissipation; marks: now, 100W]
10
10 Instruction-Level Parallelism (ILP)
[Plot of instructions over time; curves: fetch, commit; mark: now]
11
11 Premises of this Research
We will have lots of gates
  –Moore’s law continues
  –Nanotechnology
Contemporary architectures do not scale
12
12 Outline Motivation ASH: Application-Specific Hardware The spatial model of computation CASH: Compiling for ASH Evolutionary path Conclusions Future work
13
13 ASH: Application-Specific Hardware
[Figure labels: HLL program; Compiler; Circuit; Reconfigurable hardware]
14
14 ASH: A Scalable Architecture
-- Thesis Statement --
Application-specific hardware on a reconfigurable-hardware substrate is a solution for the smooth evolution of computer architecture.
We can provide scalable compilers for translating high-level languages into hardware.
15
15 Example
int f(void)
{
    int i = 0, j = 0;
    for (; i < 10; i++)
        j += i;
    return j;
}
16
16 Outline Motivation ASH: Application-Specific Hardware The spatial model of computation CASH: Compiling for ASH Evolutionary path Conclusions Future work
17
17 ASH and Nanotechnology
Build reconfigurable hardware using nanotechnology
Huge structures
Low power: 10^10 gates use less than 2 W
Low cost: nanocents/gate
High density: 10^5 x over CMOS
[Figure: nano-RAM cell. In yellow: a CMOS RAM cell.]
18
18 A Limit Study of Performance
A graph of the whole program execution:
[Figure legend: Memory word; Basic block; Memory write; Memory read; Control-flow transfer]
19
19 Typical Program Graph (g721_e)
[Figure labels: Control flow transfer; Memory reads; 100% memory cluster; 100% code cluster; memcpy]
20
20 Program Graph After Inlining memcpy
[Figure label: memcpy]
21
21 Application Slowdown
22
22 How Time Is Spent
No caches: reads expensive
No speculation
23
23 Lesson The spatial model of computation has different properties.
24
24 Outline Motivation ASH: Application-Specific Hardware The spatial model of computation CASH: Compiling for ASH Evolutionary path Future work
25
25 CASH: Compiling for ASH
Memory partitioning
Interconnection net
Program to circuits
26
26 Compilation
1. Program
   int reverse(int x) { int k,r=0; for (k=0; k > 1; r = r << 1; } }
2. Split-phase Abstract Machines
3. Configurations placed independently
4. Placement on chip
[Figure labels: Unknown latency ops.; Computations & local storage]
Reliability
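The loop body of reverse() on the slide above did not survive extraction. As a rough sketch only, assuming the slide showed a conventional 32-iteration bit reversal (a guess based on the function name and the surviving shifts), the input program might look like the following. The name reverse_bits and the fixed 32-bit width are assumptions made here for illustration, not the author's code.

```c
#include <stdint.h>

/* Assumed reconstruction: reverse the bit order of a 32-bit word.
 * Each iteration moves the lowest remaining bit of x into r. */
uint32_t reverse_bits(uint32_t x)
{
    uint32_t r = 0;
    for (int k = 0; k < 32; k++) {
        r = (r << 1) | (x & 1u);  /* append the low bit of x to r */
        x >>= 1;                  /* expose the next bit of x */
    }
    return r;
}
```

Every operation in the loop is a shift, mask, or OR on local values, which is the kind of computation the slide assigns to a single abstract machine with local storage.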
27
27 Split-phase Abstract Machines SAM 1 SAM 2 SAM 3 CFG Power
28
28 Hyperblock => SAM
Single-entry, multiple exit
May contain loops
29
29 SAM => FSM StartLoop Exit Remote Memory Local memory
30
30 Implementing SAMs - interesting details -
31
31 The SAM FSM
[Figure labels: Computation; Predicates (control); Combinational logic; Register; start; exit; args; results]
32
32 Computation = Dataflow
Variables => wires + tokens
No token store; no token matching
Local communication only
Programs: x = a & 7; ... y = x >> 2;
Circuits: [Figure labels: &, a, 7, >>, 2, x]
Signals
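The mapping from variables to wires plus tokens can be imitated in software. The sketch below is only a model, not the hardware the slides describe: wire_t, op_and_const, op_shr_const, and circuit_y are hypothetical names, and a boolean flag stands in for the token.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: a wire is a value plus a one-bit "token present" flag. */
typedef struct {
    uint32_t value;
    bool     valid;
} wire_t;

/* An operator emits an output token only once its input token has arrived. */
wire_t op_and_const(wire_t in, uint32_t k)
{
    wire_t out = { 0u, false };
    if (in.valid) {
        out.value = in.value & k;
        out.valid = true;
    }
    return out;
}

wire_t op_shr_const(wire_t in, unsigned k)
{
    wire_t out = { 0u, false };
    if (in.valid) {
        out.value = in.value >> k;
        out.valid = true;
    }
    return out;
}

/* x = a & 7; y = x >> 2; the producer feeds the consumer directly. */
wire_t circuit_y(wire_t a)
{
    wire_t x = op_and_const(a, 7u);
    return op_shr_const(x, 2u);
}
```

Each value travels straight from the operator that produces it to the one that consumes it, so there is no central token store and nothing to match, which mirrors the "local communication only" point.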
33
33 Tokens & Synchronization
Tokens signal operation completion
Possible implementations:
  Local: data, valid, ack
  Global: data, valid, reset
  Static: data, valid
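Of the listed implementations, the one with data, valid, and ack signals behaves like a one-place handshake channel. The following is a minimal software sketch of that idea under stated assumptions: channel_t, channel_put, and channel_get are hypothetical names, and consuming the data is treated as the ack.

```c
#include <stdbool.h>

/* Hypothetical one-place channel between a producer and a consumer. */
typedef struct {
    int  data;
    bool valid;   /* raised by the producer: data is ready */
} channel_t;

/* Producer side: may drive new data only after the previous token was acked. */
bool channel_put(channel_t *ch, int value)
{
    if (ch->valid)
        return false;      /* previous token not yet consumed: stall */
    ch->data  = value;
    ch->valid = true;
    return true;
}

/* Consumer side: taking the data acts as the ack and clears valid. */
bool channel_get(channel_t *ch, int *out)
{
    if (!ch->valid)
        return false;      /* no token yet: wait */
    *out = ch->data;
    ch->valid = false;
    return true;
}
```

Because both sides wait on the handshake, neither needs to know the other's latency, which suits operations whose completion time is unknown in advance.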
34
34 Speculation
if (x > 0) y = -x; else y = b*x;
[Figure labels: *, x, b, 0, y, !, slow, Computation, Predicates]
Eager muxes
Static-Single Assignment implemented in hardware
ILP
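One way to read the speculation example: both arms of the conditional are computed unconditionally, and the predicate only steers the final selection. The sketch below restates that in C; speculative_y is a hypothetical name, and sequential C cannot express the timing benefit of an eager mux (forwarding the selected input without waiting for the slow one).

```c
/* Both arms are evaluated speculatively; the predicate picks the result. */
int speculative_y(int x, int b)
{
    int neg_x = -x;          /* fast arm: negation */
    int b_x   = b * x;       /* slow arm: multiplication */
    int p     = (x > 0);     /* predicate, computed alongside the arms */

    return p ? neg_x : b_x;  /* the mux: the merge point where y is chosen */
}
```

The computation is safe to run speculatively because neither arm has side effects; only the selection depends on control flow.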
35
35 Predicates
Guard side-effects
  –Memory access
  –Procedure calls
Control looping
Decide exit branch
Select variable definition
[Figure labels: *q = 2; x=...; ...=x]
36
36 Computing Predicates
Correct for irreducible graphs
Correct even when speculatively computed
Can be eagerly computed
[Figure labels: st, b]
37
37 Loops + Dataflow
for (i=0; i < 10; i++) a[i] += i;
[Figure labels: load; +; store; &a[0]; 1; i; 0; a[0], a[1], a[2], a[3]]
= Pipelining
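The claim that loops plus dataflow yield pipelining can be illustrated with a cycle-by-cycle software model of the loop above. This is a sketch under assumptions, not the circuit from the slide: it assumes three stages (load, add, store) of one cycle each, and pipelined_update and PIPE_N are hypothetical names.

```c
#define PIPE_N 10

/* Simulates a[i] += i as a 3-stage pipeline: in steady state, iteration i
 * is loading while i-1 is adding and i-2 is storing. */
void pipelined_update(int a[PIPE_N])
{
    int loaded = 0, loaded_i = -1;   /* latch between load and add; -1 = empty */
    int summed = 0, summed_i = -1;   /* latch between add and store; -1 = empty */

    /* Stages run last-to-first so each reads the previous cycle's latches. */
    for (int cycle = 0; cycle < PIPE_N + 2; cycle++) {
        /* Stage 3 (store): retire the sum produced in the previous cycle. */
        if (summed_i >= 0)
            a[summed_i] = summed;

        /* Stage 2 (add): consume the value loaded in the previous cycle. */
        summed_i = loaded_i;
        if (loaded_i >= 0)
            summed = loaded + loaded_i;

        /* Stage 1 (load): start the next iteration while any remain. */
        if (cycle < PIPE_N) {
            loaded   = a[cycle];
            loaded_i = cycle;
        } else {
            loaded_i = -1;
        }
    }
}
```

The overlap is legal here because each iteration touches a different array element, so no store feeds a later load. The model finishes in PIPE_N + 2 cycles rather than 3 * PIPE_N.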
38
38 Outline Motivation ASH: Application-Specific Hardware The spatial model of computation CASH: Compiling for ASH Evolutionary path Conclusions Future work
39
39 Evolutionary Path
[Figure labels: Microprocessors; ASH]
The problem with ASH: Resources
40
40 Virtualization
41
41 CPU+ASH
[Figure labels: core computation; support computation + OS + VM; CPU; ASH; Memory]
42
42 Outline Motivation ASH: Application-Specific Hardware The spatial model of computation CASH: Compiling for ASH Evolutionary path Conclusions Future work
43
43 ASH Benefits
Problem      Solution
Reliability  Configuration around defects
Power        Only “useful” gates switching
Signals      Localized computation
ILP          Statically extracted
44
44 Scalable Performance
[Plot of performance over time; curves: CPU, ASH; mark: now]
45
45 Summary
Contemporary CPU architecture faces lots of problems
Application-Specific Hardware (ASH) provides a scalable technology
Compiling HLL into hardware dataflow machines is an effective solution
46
46 Timeline
[Chart; date marks: 06/01, 09/01, 12/01, 04/02, 06/02, 09/02, 12/02; "now" marked]
Tasks: CASH core; Write thesis; Hw/sw partitioning (ASH + CPU); Cost models; ASH Simulation; Loop parallelization; Explore architectural/compiler trade-offs; Memory partitioning
47
47 Extras
Related work
Reconfigurable hardware
Other cross-over phenomena
A CPU + ASH study
More about predicates
48
48 Related Work
Hardware synthesis from HLL
Reconfigurable hardware
Predicated execution
Dataflow machines
Speculative execution
Predicated SSA
49
49 Reconfigurable Hardware
Universal gates and/or storage elements
Interconnection network
Programmable Switches
50
50 Main RH Ingredient: RAM Cell
Universal gate = RAM
[Figure labels: 00010001; a0, a1; data; a1 & a2]
Switch controlled by a 1-bit RAM cell
[Figure labels: data in; control; 0]
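The "universal gate = RAM" idea is the familiar lookup table: the inputs form an address, and the stored bit at that address is the gate's output. Below is a minimal sketch assuming a 2-input table; lut2 and the LUT_* constants are hypothetical names introduced for illustration.

```c
/* Hypothetical 2-input lookup table: the low 4 bits of `table` are the RAM
 * contents, and (a1, a0) form the 2-bit read address. */
unsigned lut2(unsigned table, unsigned a1, unsigned a0)
{
    unsigned addr = ((a1 & 1u) << 1) | (a0 & 1u);
    return (table >> addr) & 1u;
}

/* Reprogramming the same cell changes the gate it implements. */
enum {
    LUT_AND = 0x8,  /* only address 3 (a1=1, a0=1) stores a 1 */
    LUT_OR  = 0xE,  /* addresses 1, 2, 3 store a 1 */
    LUT_XOR = 0x6   /* addresses 1 and 2 store a 1 */
};
```

For example, lut2(LUT_AND, a1, a0) computes the AND of its two inputs; writing a different 4-bit pattern into the table turns the same structure into OR or XOR without changing any wiring.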
51
51 Reconfigurable Computing
Back to ENIAC-style computation
Synthesize one machine to solve one problem
52
52 Efficiency
[Plot of hardware resources over time; curves: idle, used; mark: now]
53
53 Manufacturing Cost
[Plot of cost over time; curves: cost, affordable cost; marks: 3x10^9 $, now]
54
54 Complexity
[Plot of transistors over time; curves: manageable, available; marks: 10^9, 10^8, 10, now]
55
55 CAD Tools
[Plot of manual interventions over time; curves: feasible, necessary; mark: now]
56
56 ASH Benefits
Problem      Solution
Reliability  Configuration around defects
Power        Only “useful” gates switching
Signals      Localized computation
ILP          Statically extracted
Complexity   Hierarchy of abstractions
CAD          Compiler + local place & route
Efficiency   Circuit customized to application
Cost         No masks, no physics, same substrate
Performance  Scalable
57
57 CPU+ASH Study
Reconfigurable functional unit on processor pipeline
Adapted SimpleScalar 3.0
ASH & CPU use the same memory hierarchy (incl. L1)
ASH can access CPU registers
CPU pipeline interlocked with ASH
Results pending
58
58 Simplifying Predicates
Shared implementations
Control equivalence
[Figure labels: a, b, c]
59
59 Deep Speculation
if (p)
    if (q) x = a;
    else x = b;
else x = c;
[Figure: x selected from inputs a, b, c; predicate labels: !p, p&!q, p&q]
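The nested conditional above can be flattened into one selection with a fully decoded predicate per definition of x. The sketch below pairs a with p&q, b with p&!q, and c with !p, which follows from the source code rather than from the figure layout; deep_select is a hypothetical name.

```c
#include <stdint.h>

/* Flattened form of: if (p) { if (q) x = a; else x = b; } else x = c; */
uint32_t deep_select(int p, int q, uint32_t a, uint32_t b, uint32_t c)
{
    /* One decoded predicate per definition of x; exactly one is all-ones. */
    uint32_t sel_a = (p && q)  ? UINT32_MAX : 0u;
    uint32_t sel_b = (p && !q) ? UINT32_MAX : 0u;
    uint32_t sel_c = (!p)      ? UINT32_MAX : 0u;

    /* Decoded mux: AND each input with its predicate, then OR the results. */
    return (a & sel_a) | (b & sel_b) | (c & sel_c);
}
```

Because the three predicates are mutually exclusive and cover every case, a, b, and c can all be computed speculatively and merged in a single step, regardless of how deeply the original branches were nested.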
60
60 Predicates & Tokens
Predicated tokens
Eliminate speculation
Eliminate wires: P, P_ready => P & P_ready
[Figure labels: *q = 2; q; x; ~x; ready; safe; ready & safe; 1]