Spatial Computation Mihai Budiu CMU CS CALCM Seminar, Oct 21, 2003.

2 CPU Problems: design complexity; power; global signals; limited issue window => limited ILP

3 Communication vs. Computation: gate delay 5 ps, wire delay 20 ps. Power consumption on wires is also dominant.

4 Global Communication [figure: Network, Reg, Instruction unit]

5 Our Approach: ASH Application-Specific Hardware

6 1) Unroll Pipeline [figure: original processor (Network, Reg, Instruction unit)]

7 Resource Binding Time [figure: Programs; 1., 2.; CPU vs. ASH]

8 2) Specialize Pipeline [figure: Network, Reg, Instruction unit; fixed program]

9 2) Specialize Pipeline: Functional Units [figure: Network, Reg, Instruction unit; fixed program]

10 2) Specialize Pipeline: Interconnection Network [figure: Reg, Instruction unit; fixed program]

11 2) Specialize Pipeline: Register Files [figure: Instruction unit; fixed program]

12 2) Specialize Pipeline: Shrink Wires [figure: Instruction unit; fixed program]

13 2) Specialize Pipeline: No Instruction Fetch, Decode, Issue [figure]

14 Loops [figure]

15 Memory [figure: LSQ, to memory]

16 Outline: Introduction; CASH: Compiling for ASH; ASH vs CPU; Analyzing the Results; Conclusions

17 Application-Specific Hardware: C program -> Compiler -> Dataflow IR -> Reconfigurable/custom hw

18 Asynchronous Computation [figure: + operator with data, valid, and ack signals; latch]

19 Distributed Control Logic [figure: + and - operators; ack and rdy signals; FSM]

20 Forward Branches. Conditionals => Speculation
    if (x > 0)
        y = -x;
    else
        y = b*x;
    [figure: dataflow graph over x, b, 0 with *, -, > and ! operators producing y]
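A minimal C sketch of the idea on this slide, assuming integer operands: both arms of the conditional are evaluated unconditionally (the speculation), and the predicate only selects which result becomes y, as a multiplexer would. The function name select_y is hypothetical and not from the talk.

```c
/* Hypothetical illustration: the slide's if/else rewritten so that
 * both arms are computed speculatively and a predicate picks one. */
int select_y(int x, int b)
{
    int p      = (x > 0);        /* predicate */
    int y_then = -x;             /* computed whether or not p holds */
    int y_else = b * x;          /* computed whether or not p holds */
    return p ? y_then : y_else;  /* mux driven by the predicate */
}
```

For example, select_y(3, 5) takes the negation arm and select_y(-2, 5) takes the multiply arm; in both calls the other arm is still computed and then discarded.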

21 Control Flow => Data Flow [figure: Merge, Gateway, and Split (branch) nodes with data and predicate inputs; p and !p outputs]

22 Loops
    int sum=0, i;
    for (i=0; i < 100; i++)
        sum += i*i;
    return sum;
    [figure: dataflow graph with i, +1, < 100, *, + sum, ! and ret nodes; i and sum start at 0]
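One way to read the loop graph, sketched in C under the assumption that each loop-carried value (i and sum) enters through a merge and is then steered by the loop predicate: back around the loop while the predicate holds, or out to the return node when it does not. The function name sum_squares_tokens and the parameter n are hypothetical; the slide uses the constant 100.

```c
/* Hypothetical sketch: the slide's loop written so that the roles of
 * the merge and gateway nodes are visible. */
int sum_squares_tokens(int n)
{
    int i = 0, sum = 0;      /* initial values entering the merge nodes */
    for (;;) {
        int p = (i < n);     /* loop predicate */
        if (!p)
            return sum;      /* gateway on !p releases sum to ret */
        sum = sum + i * i;   /* gateways on p feed the loop body ... */
        i = i + 1;           /* ... and the back edges into the merges */
    }
}
```

With n = 100 this computes the same value as the loop on the slide, 328350.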

23 Outline: Introduction; Compiling for ASH; ASH vs CPU; Analyzing the Results; Conclusions

24 ASH vs: 1. 4- & 8-wide VLIWs; 2. Superscalar, media kernels; 3. Superscalar, SpecInt95

25 OpenDIVX IDCT, Normalized Running Time

26 OpenDIVX IDCT, Sustained IPC (includes speculative ops) [chart; one label reads: no data]

27 Media Kernels, vs 4-way OOO

28 Media Kernels, IPC

29 Cost of Performance

30 This Is Obvious! ASH runs at full dataflow speed, so the CPU cannot do any better (if the compilers are equally good).

31 SpecInt95, ASH vs 4-way OOO

32 Outline: Introduction: spatial computation; CASH: Compiling for ASH; ASH vs CPU; Dissection; Conclusions

33 The (Loop) Body (SpecINT95: 124.m88ksim: init_processor, stylized)
    for (i = 0; i < 64; i++) {
        for (j = 0; X[j].r != 0xF; j++)
            if (X[j].r == i)
                break;
        Y[i] = X[j].q;
    }

34 Dynamic Critical Path
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    [figure labels: load predicate, loop predicate, sizeof(X[j]), definition]

35 MIPS gcc Code
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    LOOP:
    L1: beq   $v0,$a1,EXIT   ; X[j].r == i
    L2: addiu $v1,$v1,20     ; &X[j+1].r
    L3: lw    $v0,0($v1)     ; X[j+1].r
    L4: addiu $a0,$a0,1      ; j++
    L5: bne   $v0,$a3,LOOP   ; X[j+1].r != 0xF
    EXIT:
    L1 -> L2 -> L3 -> L5 -> L1: 4-instruction loop-carried dependence
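To make the dependence chain easier to follow, here is a C sketch that mirrors the MIPS loop one statement per instruction. The struct layout is an assumption: the slides show only the fields r and q and a 20-byte stride, so entry20 pads the record to 20 bytes. The names entry20 and find_index are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: only r, q and the 20-byte stride come from the slides. */
struct entry20 {
    int32_t r;
    int32_t q;
    int32_t pad[3];
};
_Static_assert(sizeof(struct entry20) == 20, "stride must match addiu ...,20");

/* Returns the j at which the inner loop stops: the first match for i,
 * or the index of the 0xF sentinel if there is no match. */
size_t find_index(const struct entry20 *X, int32_t i)
{
    const struct entry20 *p = X;   /* $v1: address of X[j].r */
    size_t j = 0;                  /* $a0 */
    int32_t r = p->r;              /* $v0 */
    while (r != 0xF) {             /* L5: loop-back test */
        if (r == i)                /* L1: exit test */
            break;
        p = p + 1;                 /* L2: next address needs the old one */
        r = p->r;                  /* L3: load needs the new address */
        j = j + 1;                 /* L4: off the critical chain */
    }
    return j;
}
```

Each load needs the address produced in the same iteration, and both branches need the loaded value, which is the L1 -> L2 -> L3 -> L5 -> L1 cycle named on the slide; the j++ at L4 feeds nothing in that cycle.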

36 If Branch Prediction Correct: L1 -> L2 -> L3 -> L5 -> L1. Superscalar is issue-limited! 2 cycles/iteration sustained.
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    LOOP:
    L1: beq   $v0,$a1,EXIT   ; X[j].r == i
    L2: addiu $v1,$v1,20     ; &X[j+1].r
    L3: lw    $v0,0($v1)     ; X[j+1].r
    L4: addiu $a0,$a0,1      ; j++
    L5: bne   $v0,$a3,LOOP   ; X[j+1].r != 0xF
    EXIT:

37 SpecInt95, perfect prediction

38 Critical Path with Prediction: loads are not speculative.
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;

39 Prediction + Load Speculation: ~4 cycles! Load not pipelined (self-anti-dependence). [figure label: ack edge]
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;

40 OOO Pipe Snapshot [figure: instructions L1-L5 in flight across pipeline stages IF, DA, EX, WB, CT; annotation: register renaming]
    LOOP:
    L1: beq   $v0,$a1,EXIT   ; X[j].r == i
    L2: addiu $v1,$v1,20     ; &X[j+1].r
    L3: lw    $v0,0($v1)     ; X[j+1].r
    L4: addiu $a0,$a0,1      ; j++
    L5: bne   $v0,$a3,LOOP   ; X[j+1].r != 0xF
    EXIT:

41 Unrolling?
    for (i = 0; i < 64; i++) {
        for (j = 0; X[j].r != 0xF; j += 2) {
            if (X[j].r == i) break;
            if (X[j+1].r == 0xF) break;
            if (X[j+1].r == i) break;
        }
        Y[i] = X[j].q;
    }
    [annotation: when 1 iteration]
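As transcribed, the unrolled loop leaves through the X[j+1] tests without advancing j, so Y[i] = X[j].q would read the element before the one that stopped the search; this may simply be a simplification on the slide. A sketch of an unroll-by-two that keeps the original exit index is shown below next to the original inner loop. The struct rec layout and the function names search and search_unrolled2 are hypothetical.

```c
#include <stddef.h>

/* Assumed record: only the fields r and q appear on the slides. */
struct rec { int r; int q; };

/* Original inner loop: index of the first match, or of the 0xF sentinel. */
size_t search(const struct rec *X, int i)
{
    size_t j;
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    return j;
}

/* Unrolled by two; j is advanced before leaving through the second
 * element so that the exit index matches the original loop. */
size_t search_unrolled2(const struct rec *X, int i)
{
    size_t j;
    for (j = 0; X[j].r != 0xF; j += 2) {
        if (X[j].r == i)
            break;
        if (X[j + 1].r == 0xF || X[j + 1].r == i) {
            j++;
            break;
        }
    }
    return j;
}
```

Reading X[j + 1] is safe here because X[j] has already been checked against the sentinel, so the array extends at least one element further.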

42 ASH Problems: both branch and join not free; static dataflow (no re-issue of same instr); memory is “far”; fully static (no branch prediction, no dynamic unrolling, no register renaming); calls/returns not lenient; no virtualization; no dynamic optimization

43 Outline: Introduction: spatial computation + CASH: Compiling for ASH + ASH vs CPU + Result Analysis = Conclusions

44 Conclusions: ASH is promising for media processing; still to evaluate: power, performance, cost. Prediction does much more than avoid issue stalls. The von Neumann model of computation is very powerful; hardware resources are not everything.

45 Backup Slides: Evaluation model; Control logic; Pipeline balancing; Lenient execution; Dynamic Critical Path

46 How Performance Is Evaluated [figure: C -> unlimited ILP -> LSQ, limited BW (2 words/cycle) -> L1 (8K) -> L2 (1/4M) -> Mem; labels 2, 8, 72]

47 Simulation Parameters: compared to 4-wide OOO SimpleScalar; same operation latencies; same cache hierarchy; no measurements in library functions/OS; 3-cycle multiply, 20-cycle divide

48 Control Logic [figure: C, C, Reg; signals rdy in, ack in, rdy out, ack out, data in, data out]

49 Outline: Introduction; Compiling for ASH; ASH at run-time; ASH vs CPU; Conclusions

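50 Critical Paths
    if (x > 0)
        y = -x;
    else
        y = b*x;
    [figure: dataflow graph over x, b, 0 with *, -, > and ! operators producing y]

51 Lenient Operations: solve the problem of unbalanced paths
    if (x > 0)
        y = -x;
    else
        y = b*x;
    [figure: dataflow graph over x, b, 0 with *, -, > and ! operators producing y]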

52 Pipelining (cycle=1)
    int sum=0, i;
    for (i=0; i < 100; i++)
        sum += i*i;
    return sum;
    [figure: dataflow graph with i, + 1, <= 100, *, + sum; pipelined multiplier (8 stages)]

53 Pipelining (cycle=2) [figure: i, + 1, <= 100, *, + sum]

56 Pipelining (cycle=5) [figure: i's loop, sum's loop, long latency pipe]

57 Pipelining (cycle=6) [figure: values i=1 and i=0 shown at *]

58 Pipelining (cycle=7) [figure: i's loop, sum's loop, long latency pipe, predicate]

59 Pipelining: the predicate ack edge is on the critical path. [figure: i's loop, sum's loop, critical path]

60 Pipeline balancing (cycle=7) [figure: i's loop, sum's loop, decoupling FIFO]

61 Pipeline balancing [figure: i's loop, sum's loop, critical path, decoupling FIFO]

62 FIFO Impact [figure: i, + 1, <= 100, *, + sum, decoupling FIFO]
    * Pipe   FIFO   Cycles
    N        0      903
    N        1
    Y        0      653
    Y        1      474
    Y        2      408
    Y        3

63 Pipelining Potential, Mediabench

64 Dataflow Loop Pipelining: related to software pipelining; copes with unknown latencies (control-flow, memory accesses); does not require parallelization; applicable to memory accesses as well

65 Last-Arrival Events: the event enabling the generation of a result; may be an ack. Critical path = collection of last-arrival edges. [figure: + operator with data, valid, ack]

66 Dynamic Critical Path: 1. Start from last node. 2. Trace back along last-arrival edges. 3. Some edges may repeat.
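The three steps can be sketched in C, assuming the execution trace has already been recorded as an array of events in which each event stores the index of its last-arrival predecessor and the static graph edge that predecessor's signal travelled on. The trace format, the struct event fields and the function name critical_path_histogram are assumptions for illustration, not the tool used in the talk.

```c
#include <stddef.h>

#define NO_EVENT ((size_t)-1)

/* Assumed trace record for one dynamic event. */
struct event {
    size_t last_arrival;  /* index of the input event that arrived last,
                             or NO_EVENT for the very first event */
    size_t static_edge;   /* static edge that last arrival came in on */
};

/* Starts from event `last`, follows last-arrival links backwards, and
 * counts how often each static edge occurs on that path.
 * Every static_edge on the path must be smaller than n_edges.
 * Returns the number of dynamic edges on the path. */
size_t critical_path_histogram(const struct event *ev, size_t last,
                               size_t *edge_count, size_t n_edges)
{
    size_t steps = 0;
    for (size_t k = 0; k < n_edges; k++)
        edge_count[k] = 0;
    for (size_t e = last; ev[e].last_arrival != NO_EVENT;
         e = ev[e].last_arrival) {
        edge_count[ev[e].static_edge]++;  /* step 3: edges may repeat */
        steps++;
    }
    return steps;
}
```

Static edges with the highest counts are the ones that appear most often on the dynamic critical path, which matches the slide's remark that some edges may repeat.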

67 Strengths: low power?; simple verification?; specialized to app.; unlimited ILP; simple hardware; no fixed window; economies of scale; highly optimized; branch prediction; register renaming; full-dataflow; global signals/decision
