Spatial Computation. Mihai Budiu, CMU CS. CALCM Seminar, Oct 21, 2003
2 CPU Problems: design complexity; power; global signals; limited issue window ⇒ limited ILP
3 Communication vs. Computation: gate delay 5 ps, wire delay 20 ps. Power consumption on wires is also dominant.
4 Global Communication (diagram labels: Network, Reg, Instruction unit)
5 Our Approach: ASH (Application-Specific Hardware)
6 1) Unroll Pipeline (diagram labels: Network, Reg, Instruction unit; original processor)
7 Resource Binding Time: CPU vs. ASH (diagram labels: Programs; 1, 2)
8 2) Specialize Pipeline (diagram labels: Network, Reg, Instruction unit; fixed program)
9 2) Specialize Pipeline: Functional Units
10 2) Specialize Pipeline: Interconnection Network
11 2) Specialize Pipeline: Register Files
12 2) Specialize Pipeline: Shrink Wires
13 2) Specialize Pipeline: No Instruction Fetch, Decode, Issue
14 Loops
15 Memory (diagram labels: LSQ, to memory)
16 Outline: Introduction; CASH: Compiling for ASH; ASH vs CPU; Analyzing the Results; Conclusions
17 Application-Specific Hardware: C program → Compiler → Dataflow IR → Reconfigurable/custom hw
18 Asynchronous Computation (diagram labels: + operator; data, valid, ack; latch)
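The data/valid/ack handshake named on this slide can be modelled in software. What follows is only a minimal sketch, assuming a single-entry latch between a producer and a consumer; the names `channel_t`, `channel_send`, and `channel_recv` are hypothetical and do not come from the talk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-entry channel: a latch plus a "valid" bit. */
typedef struct {
    int  data;   /* latched value */
    bool valid;  /* true while the latch holds an unconsumed value */
} channel_t;

/* Producer side: succeeds only once the previous value was acknowledged. */
bool channel_send(channel_t *ch, int value) {
    assert(ch != NULL);
    if (ch->valid) return false;   /* no ack yet, so the producer must wait */
    ch->data = value;
    ch->valid = true;              /* raise "data valid" */
    return true;
}

/* Consumer side: takes the value and acknowledges it by clearing "valid". */
bool channel_recv(channel_t *ch, int *out) {
    assert(ch != NULL && out != NULL);
    if (!ch->valid) return false;  /* nothing to consume yet */
    *out = ch->data;
    ch->valid = false;             /* "ack": the latch may now be overwritten */
    return true;
}
```

In this model a second `channel_send` fails until `channel_recv` has run, which mirrors the rule that a producer may not overwrite its output latch before the ack arrives.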
19 Distributed Control Logic (diagram labels: +, -; ack, rdy; FSM)
20 Forward Branches: if (x > 0) y = -x; else y = b*x; (dataflow graph nodes: x, b, 0, *, -, >, !, y). Conditionals ⇒ Speculation
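The "Conditionals ⇒ Speculation" idea can be illustrated in C: both arms of the branch are evaluated unconditionally and the predicate only selects the result. This is a sketch of the general technique, not the CASH compiler's output; the function names are hypothetical, and the inputs are assumed small enough that neither arm overflows.

```c
#include <assert.h>
#include <limits.h>

/* Reference: the branching version shown on the slide. */
int forward_branch(int x, int b) {
    int y;
    if (x > 0) y = -x; else y = b * x;
    return y;
}

/* Speculative version: both arms are computed, then the predicate selects.
 * Assumption: x and b are small enough that -x and b*x cannot overflow. */
int forward_branch_speculative(int x, int b) {
    assert(x > INT_MIN);       /* -x must be representable */
    int p    = (x > 0);        /* predicate */
    int neg  = -x;             /* "then" arm, computed speculatively */
    int prod = b * x;          /* "else" arm, computed speculatively */
    return p ? neg : prod;     /* multiplexer driven by the predicate */
}
```

Both functions return the same value for any in-range input; the second one simply has no control dependence between the comparison and the two arithmetic operations.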
21 Control Flow ⇒ Data Flow (diagram labels: Merge; Gateway with data and predicate inputs; Split (branch) with data input and predicates p, !)
22 Loops: int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; (dataflow graph nodes: i, +1, < 100, 0, *, +, sum, !, ret)
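To make the loop's dataflow form more concrete, here is a small C sketch that mimics it: a merge chooses between the initial value and the back-edge value, and the loop predicate decides whether values circulate again or `sum` is released to the return. The helper `merge` and the function name are hypothetical, and the loop bound is made a parameter purely for testing.

```c
#include <assert.h>

/* Merge node: initial value on loop entry, back-edge value afterwards. */
static int merge(int first_iteration, int initial, int back_edge) {
    return first_iteration ? initial : back_edge;
}

/* Sum of i*i for 0 <= i < limit, written in a dataflow-like style. */
int sum_of_squares_dataflow(int limit) {
    assert(limit >= 0 && limit <= 1000);  /* keeps the sum within int range */
    int i_back = 0, sum_back = 0;         /* values on the loop back edges */
    int first = 1;
    for (;;) {
        int i   = merge(first, 0, i_back);
        int sum = merge(first, 0, sum_back);
        first = 0;
        int p = (i < limit);              /* loop predicate */
        if (!p) return sum;               /* !p releases sum to the return */
        sum_back = sum + i * i;           /* sum's loop */
        i_back   = i + 1;                 /* i's loop */
    }
}
```

With `limit` set to 100 this computes the same value as the loop on the slide, 328350.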
23 Outline: Introduction; Compiling for ASH; ASH vs CPU; Analyzing the Results; Conclusions
24 ASH vs: 1. 4- & 8-wide VLIWs 2. Superscalar, media kernels 3. Superscalar, SpecInt95
25 OpenDIVX IDCT, Normalized Running Time
26 OpenDIVX IDCT, Sustained IPC (chart annotations: includes speculative ops; no data)
27 Media Kernels, vs 4-way OOO
28 Media Kernels, IPC
29 Cost of Performance
30 This Is Obvious! ASH runs at full dataflow speed, so a CPU cannot do any better (if the compilers are equally good).
31 SpecInt95, ASH vs 4-way OOO
32 Outline: Introduction: spatial computation; CASH: Compiling for ASH; ASH vs CPU; Dissection; Conclusions
33 The (Loop) Body, from SpecINT95 124.m88ksim, init_processor (stylized):
for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i) break;
    Y[i] = X[j].q;
}
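For readers who want to run the stylized loop, the following is a self-contained C version. Only the fields `r` and `q` and the sentinel value `0xF` appear on the slide; the type name `entry_t`, the function name `init_table`, and the parameter `n` (64 on the slide) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record layout: only r and q are named on the slide. */
typedef struct {
    int r;  /* key; the value 0xF marks the end of the table */
    int q;  /* payload copied into Y */
} entry_t;

/* For each i in [0, n), find the first entry with r == i and copy its q.
 * X must be terminated by an entry whose r is 0xF. */
void init_table(const entry_t *X, int *Y, int n) {
    assert(X != NULL && Y != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        int j;
        for (j = 0; X[j].r != 0xF; j++)
            if (X[j].r == i) break;
        Y[i] = X[j].q;  /* the sentinel's q if no entry matched i */
    }
}
```

The inner search loop is the part analyzed on the following slides: each iteration's load and both exit tests depend on the previous iteration's outcome.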
34 Dynamic Critical Path: for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; (graph annotations: load, predicate, loop predicate, sizeof(X[j]), definition)
35 MIPS gcc Code for the loop for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;
LOOP:
L1: beq   $v0,$a1,EXIT   ; X[j].r == i
L2: addiu $v1,$v1,20     ; &X[j+1].r
L3: lw    $v0,0($v1)     ; X[j+1].r
L4: addiu $a0,$a0,1      ; j++
L5: bne   $v0,$a3,LOOP   ; X[j+1].r != 0xF
EXIT:
4-instruction loop-carried dependence: L1 → L2 → L3 → L5 → L1
36 If Branch Prediction Correct: L1 → L2 → L3 → L5 → L1. Superscalar is issue-limited! 2 cycles/iteration sustained. (Same loop and MIPS code as on slide 35.)
37 SpecInt95, perfect prediction
38 Critical Path with Prediction: loads are not speculative. Loop: for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;
39 Prediction + Load Speculation: ~4 cycles! Load not pipelined (self-anti-dependence); ack edge. Loop: for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;
40 OOO Pipe Snapshot: stages IF, DA, EX, WB, CT; instructions in flight: L5 L1 L2 L1 L2 L3 L4 L1 L3 L5 L3 L2 L1 L3; register renaming. (Same MIPS code as on slide 35.)
41 Unrolling?
for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j += 2) {
        if (X[j].r == i) break;
        if (X[j+1].r == 0xF) break;
        if (X[j+1].r == i) break;
    }
    Y[i] = X[j].q;
}
(slide annotation: when 1 iteration)
42 ASH Problems: both branch and join not free; static dataflow (no re-issue of same instr); memory is “far”; fully static (no branch prediction, no dynamic unrolling, no register renaming); calls/returns not lenient; no virtualization; no dynamic optimization
43 Outline: Introduction: spatial computation + CASH: Compiling for ASH + ASH vs CPU + Result Analysis = Conclusions
44 Conclusions: ASH promising for media processing; to evaluate: power, performance, cost. Prediction does much more than avoid issue stalls. von Neumann model of computation very powerful. Hardware resources are not everything.
45 Backup Slides: Evaluation model; Control logic; Pipeline balancing; Lenient execution; Dynamic Critical Path
46 How Performance Is Evaluated (diagram labels: C; Unlimited ILP; LSQ; limited BW (2 words/c); L1 8K; L2 1/4M; Mem; 2, 8, 72)
47 Simulation Parameters: compared to 4-wide OOO SimpleScalar; same operation latencies; same cache hierarchy; no measurements in library functions/OS; 3-cycle multiply, 20-cycle divide
48 Control Logic (diagram labels: C, C, Reg; rdy in, ack in, rdy out, ack out; data in, data out)
49 Outline: Introduction; Compiling for ASH; ASH at run-time; ASH vs CPU; Conclusions
50 Critical Paths: if (x > 0) y = -x; else y = b*x; (dataflow graph nodes: x, b, 0, *, -, >, !, y)
51 Lenient Operations: if (x > 0) y = -x; else y = b*x; (dataflow graph nodes: x, b, 0, *, -, >, !, y). Solve the problem of unbalanced paths.
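The effect of lenient evaluation on unbalanced paths can be shown with a small timing model. The sketch below compares when a strict and a lenient multiplexer could produce their output, given the arrival times of the predicate and of the two data inputs; the function names and the arrival times in the usage note are assumptions for illustration, not measurements from the talk.

```c
#include <assert.h>
#include <stdbool.h>

static int max2(int a, int b) { return a > b ? a : b; }

/* Strict mux: waits for the predicate and for both data inputs. */
int strict_mux_ready(int t_pred, int t_true, int t_false) {
    assert(t_pred >= 0 && t_true >= 0 && t_false >= 0);
    return max2(t_pred, max2(t_true, t_false));
}

/* Lenient mux: waits only for the predicate and the input it selects. */
int lenient_mux_ready(bool pred, int t_pred, int t_true, int t_false) {
    assert(t_pred >= 0 && t_true >= 0 && t_false >= 0);
    return max2(t_pred, pred ? t_true : t_false);
}
```

For example, if the comparison and the negation each arrive at cycle 1 and the product arrives at cycle 3 (using the 3-cycle multiply from the Simulation Parameters slide), then for x > 0 the strict mux is ready at cycle 3 while the lenient mux is ready at cycle 1. When the predicate selects the product, both are ready at cycle 3.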
52 Pipelining: int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; (dataflow graph nodes: i, +, 1, <= 100, *, +, sum; pipelined multiplier, 8 stages). Animation, cycle=1
53 Pipelining, cycle=2
54 Pipelining, cycle=3
55 Pipelining, cycle=4
56 Pipelining, cycle=5 (annotations: i’s loop, sum’s loop, long latency pipe)
57 Pipelining, cycle=6 (annotations: i=1, i=0)
58 Pipelining, cycle=7 (annotations: i’s loop, sum’s loop, long latency pipe, predicate)
59 Pipelining: the predicate ack edge is on the critical path (annotations: critical path, i’s loop, sum’s loop).
60 Pipeline balancing, cycle=7 (annotations: i’s loop, sum’s loop, decoupling FIFO)
61 Pipeline balancing (annotations: i’s loop, sum’s loop, critical path, decoupling FIFO)
62 FIFO Impact (same loop, with a decoupling FIFO)
* Pipe   FIFO   Cycles
N        0      903
N        1
Y        0      653
Y        1      474
Y        2      408
Y        3
63 Pipelining Potential, Mediabench
64 Dataflow Loop Pipelining: related to software pipelining; copes with unknown latencies (control-flow, memory accesses); does not require parallelization; applicable to memory accesses as well
65 Last-Arrival Events: the event enabling the generation of a result; may be an ack. Critical path = collection of last-arrival edges. (diagram labels: + operator; data, valid, ack)
66 Dynamic Critical Path: 1. Start from last node. 2. Trace back along last-arrival edges. 3. Some edges may repeat.
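The three steps above can be sketched as a backward walk over a recorded event trace. This is a minimal illustration under stated assumptions: the trace format (`event_t`, with one record per node firing that names its last-arriving input edge and the event that produced it) and the function `critical_path` are hypothetical, not the tool used in the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical trace record: one per dynamic event (a node firing). */
typedef struct {
    int edge_id;       /* static edge that delivered the last-arriving input */
    int last_arrival;  /* index of the event that produced it, or -1 if none */
} event_t;

/* Walks back from the last event along last-arrival edges.
 * edge_count[e] receives how many times static edge e lies on the path.
 * Returns the path length in dynamic edges. */
size_t critical_path(const event_t *ev, size_t n_events,
                     unsigned *edge_count, size_t n_edges) {
    assert(ev != NULL && edge_count != NULL && n_events > 0);
    for (size_t e = 0; e < n_edges; e++) edge_count[e] = 0;

    size_t length = 0;
    int cur = (int)n_events - 1;              /* 1. start from the last node */
    while (ev[cur].last_arrival >= 0) {       /* 2. trace back */
        assert(ev[cur].edge_id >= 0 && (size_t)ev[cur].edge_id < n_edges);
        assert(ev[cur].last_arrival < cur);   /* producers fire earlier */
        edge_count[ev[cur].edge_id]++;        /* 3. some edges may repeat */
        length++;
        cur = ev[cur].last_arrival;
    }
    return length;
}
```

Static edges with the highest counts are the ones that most often delay a result, which is how a loop's ack or predicate edge would show up as dominating the dynamic critical path.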
67 Strengths: low power? simple verification? specialized to app.; unlimited ILP; simple hardware; no fixed window; economies of scale; highly optimized; branch prediction; register renaming; full-dataflow; global signals/decision