Spatial Computation Mihai Budiu CMU CS CALCM Seminar, Oct 21, 2003.

Presentation transcript:

2 Spatial Computation Mihai Budiu CMU CS CALCM Seminar, Oct 21, 2003

3 2 CPU Problems: Design Complexity; Power; Global Signals; Limited issue window ⇒ limited ILP

4 3 Communication vs. Computation: gate 5 ps, wire 20 ps. Power consumption on wires is also dominant.

5 4 Global Communication [Figure: Network, Reg, Instruction unit]

6 5 Our Approach: ASH Application-Specific Hardware

7 6 1) Unroll Pipeline [Figure: original processor (Instruction unit, Network, Reg) with the Network and Reg stages repeated]

8 7 Resource Binding Time: CPU vs. ASH [Figure: Programs, with steps 1. and 2. shown for each]

9 8 2) Specialize Pipeline [Figure: Network, Reg, Network, Reg, Instruction unit; Fixed program]

10 9 2) Specialize Pipeline: Functional Units [Figure: Network, Reg, Network, Reg, Instruction unit; Fixed program]

11 10 2) Specialize Pipeline: Interconnection Network Reg Instruction unit Fixed program

12 11 Instruction unit 2) Specialize Pipeline: Register Files Fixed program 1 0

13 12 Instruction unit 2) Specialize Pipeline: Shrink Wires Fixed program 1 0

14 13 2) Specialize Pipeline: No Instruction Fetch, Decode, Issue 1 0

15 14 Loops 1 0

16 15 Memory LSQ To memory 1 0

17 16 Outline Introduction CASH: Compiling for ASH ASH vs CPU Analyzing the Results Conclusions

18 17 Application-Specific Hardware C program Compiler Dataflow IR Reconfigurable/custom hw

19 18 Asynchronous Computation [Figure: + node with data, valid and ack signals, and a latch]
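The valid/ack handshake named on this slide can be sketched in C. This is only an illustrative software model, not the ASH circuit: the `Latch` type and the `latch_put`, `latch_take` and `add_fire` functions are hypothetical names introduced here. The adder fires only when both inputs are valid and its output latch has been acknowledged.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a latch guarded by a valid/ack handshake. */
typedef struct {
    int  data;   /* latched value */
    bool valid;  /* true while data holds a value not yet acknowledged */
} Latch;

/* Producer side: store a value; fails if the old value was not acked yet. */
bool latch_put(Latch *l, int value) {
    if (l->valid) return false;
    l->data = value;
    l->valid = true;
    return true;
}

/* Consumer side: read the value and acknowledge it, freeing the latch. */
bool latch_take(Latch *l, int *out) {
    if (!l->valid) return false;
    *out = l->data;
    l->valid = false;   /* models the ack */
    return true;
}

/* An adder node: fires only if both inputs are valid and the output is free. */
bool add_fire(Latch *a, Latch *b, Latch *out) {
    assert(a && b && out);
    if (!a->valid || !b->valid || out->valid) return false;
    int x, y;
    latch_take(a, &x);
    latch_take(b, &y);
    return latch_put(out, x + y);
}
```

In this model a second `add_fire` is refused until a consumer has taken the previous sum, which is the back-pressure role the ack signal plays.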

20 19 Distributed Control Logic [Figure: + and - nodes with ack and rdy signals, FSM]

21 20 Forward Branches: Conditionals ⇒ Speculation
    if (x > 0) y = -x; else y = b*x;
    [Figure: dataflow graph with inputs x, b, 0, operators *, -, >, ! and output y]
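A minimal C sketch of what "Conditionals ⇒ Speculation" could look like for the slide's example: both arms are computed unconditionally and a selection on the predicate picks the result. The function names `branch_y`, `speculated_y` and `check_range` are hypothetical, and the sketch ignores hardware timing entirely.

```c
#include <assert.h>

/* The conditional exactly as written on the slide. */
int branch_y(int x, int b) {
    int y;
    if (x > 0) y = -x; else y = b * x;
    return y;
}

/* Speculated form: both arms are evaluated, then one is selected. */
int speculated_y(int x, int b) {
    int p   = x > 0;    /* predicate */
    int neg = -x;       /* then-arm, computed regardless of p */
    int mul = b * x;    /* else-arm, computed regardless of p */
    return p ? neg : mul;
}

/* Checks that both forms agree for x in [lo, hi]. */
void check_range(int lo, int hi, int b) {
    for (int x = lo; x <= hi; x++)
        assert(branch_y(x, b) == speculated_y(x, b));
}
```

Speculating both arms is safe here because neither arm has side effects; operations with side effects, such as stores, would still need to be guarded by the predicate.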

22 21 Control Flow ⇒ Data Flow [Figure: Merge (data inputs); Gateway (data, predicate); Split (branch) on p and !p]
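The three operators named on this slide can be modeled with tokens that are either present or absent. The following is a sketch under that assumption; the `Token` type and the `gateway`, `split` and `merge` functions are illustrative names, not the CASH compiler's actual IR.

```c
#include <assert.h>
#include <stdbool.h>

/* A dataflow token: a value that may or may not be present on a wire. */
typedef struct { int value; bool present; } Token;

/* Gateway: forwards the data token only when the predicate is true. */
Token gateway(Token data, bool pred) {
    Token out = { data.value, data.present && pred };
    return out;
}

/* Split (branch): steers the token to exactly one of two destinations. */
void split(Token data, bool p, Token *if_true, Token *if_false) {
    *if_true  = gateway(data, p);
    *if_false = gateway(data, !p);
}

/* Merge: forwards whichever input carries a token (at most one may). */
Token merge(Token a, Token b) {
    assert(!(a.present && b.present));
    return a.present ? a : b;
}
```

Splitting a token on `p` and merging the two outputs returns the original token, which is how a branch followed by a join behaves when the two paths do nothing.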

23 22 Loops
    int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum;
    [Figure: dataflow graph of the loop: i, +1, < 100, initial values 0, *, +, sum, ! and ret]

24 23 Outline Introduction Compiling for ASH ASH vs CPU Analyzing the Results Conclusions

25 24 ASH vs: 1. 4- & 8-wide VLIWs 2. Superscalar, media kernels 3. Superscalar, SpecInt95

26 25 OpenDIVX IDCT, Normalized Running Time

27 26 OpenDIVX IDCT, Sustained IPC (chart annotations: "includes speculative ops", "no data")

28 27 Media Kernels, vs 4-way OOO

29 28 Media Kernels, IPC

30 29 Cost of Performance

31 30 This Is Obvious! ASH runs at full dataflow speed, so CPU cannot do any better (if compilers equally good)

32 31 SpecInt95, ASH vs 4-way OOO

33 32 Outline Introduction: spatial computation CASH: Compiling for ASH ASH vs CPU Dissection Conclusions

34 33 The (Loop) Body
    for (i = 0; i < 64; i++) {
        for (j = 0; X[j].r != 0xF; j++)
            if (X[j].r == i) break;
        Y[i] = X[j].q;
    }
    SpecINT95:124.m88ksim:init_processor, stylized

35 34 Dynamic Critical Path
    for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;
    [Figure labels: load predicate, loop predicate, sizeof(X[j]), definition]

36 35 MIPS gcc Code
    for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;
    LOOP:
    L1: beq   $v0,$a1,EXIT  ; X[j].r == i
    L2: addiu $v1,$v1,20    ; &X[j+1].r
    L3: lw    $v0,0($v1)    ; X[j+1].r
    L4: addiu $a0,$a0,1     ; j++
    L5: bne   $v0,$a3,LOOP  ; X[j+1].r != 0xF
    EXIT:
    Loop-carried dependence of 4 instructions: L1 → L2 → L3 → L5 → L1

37 36 If Branch Prediction Correct
    for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;
    LOOP:
    L1: beq   $v0,$a1,EXIT  ; X[j].r == i
    L2: addiu $v1,$v1,20    ; &X[j+1].r
    L3: lw    $v0,0($v1)    ; X[j+1].r
    L4: addiu $a0,$a0,1     ; j++
    L5: bne   $v0,$a3,LOOP  ; X[j+1].r != 0xF
    EXIT:
    L1 → L2 → L3 → L5 → L1
    Superscalar is issue-limited! 2 cycles/iteration sustained

38 37 SpecInt95, perfect prediction

39 38 Critical Path with Prediction Loads are not speculative for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;

40 39 Prediction + Load Speculation: ~4 cycles! Load not pipelined (self-anti-dependence, ack edge)
    for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;

41 40 OOO Pipe Snapshot
    Pipeline stages: IF, DA, EX, WB, CT
    [Figure: instructions L1 to L5 from several iterations in flight across the stages; annotation: register renaming]
    LOOP:
    L1: beq   $v0,$a1,EXIT  ; X[j].r == i
    L2: addiu $v1,$v1,20    ; &X[j+1].r
    L3: lw    $v0,0($v1)    ; X[j+1].r
    L4: addiu $a0,$a0,1     ; j++
    L5: bne   $v0,$a3,LOOP  ; X[j+1].r != 0xF
    EXIT:

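42 41 Unrolling?
    for (i = 0; i < 64; i++) {
        for (j = 0; X[j].r != 0xF; j += 2) {
            if (X[j].r == i) break;
            /* advance j before leaving so Y[i] reads the element that ended the search */
            if (X[j+1].r == 0xF) { j++; break; }
            if (X[j+1].r == i) { j++; break; }
        }
        Y[i] = X[j].q;
    }
    (slide annotation: "when 1 iteration")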
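A self-contained C sketch comparing the original inner loop with a version unrolled by two. The `Entry` layout is an assumption (the slides only show the `.r` and `.q` fields), and `find_simple`, `find_unrolled` and `fill` are hypothetical helper names. In the unrolled version `j` is advanced before the two early exits that test `X[j+1]`, so that both versions leave `j` at the same element.

```c
#include <assert.h>

/* Assumed layout: the slides only show fields r and q. */
typedef struct { int r; int q; } Entry;

/* Inner search as on the slide: stop at a match or at the 0xF sentinel. */
int find_simple(const Entry *X, int i) {
    int j;
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i) break;
    return j;
}

/* Same search unrolled by two; exits through X[j+1] advance j first. */
int find_unrolled(const Entry *X, int i) {
    int j;
    for (j = 0; X[j].r != 0xF; j += 2) {
        if (X[j].r == i) break;
        if (X[j + 1].r == 0xF) { j++; break; }
        if (X[j + 1].r == i)   { j++; break; }
    }
    return j;
}

/* Outer loop: Y[i] = X[j].q for each i, using the given search routine. */
void fill(const Entry *X, int *Y, int n, int (*find)(const Entry *, int)) {
    assert(X && Y && find);
    for (int i = 0; i < n; i++)
        Y[i] = X[find(X, i)].q;
}
```

The unrolled loop never reads past the sentinel, because `X[j+1].r` is tested against 0xF before `j` is advanced by two.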

43 42 ASH Problems
    Both branch and join not free
    Static dataflow (no re-issue of same instr)
    Memory is “far”
    Fully static
      – No branch prediction
      – No dynamic unrolling
      – No register renaming
    Calls/returns not lenient
    No virtualization
    No dynamic optimization

44 43 Outline: Introduction: spatial computation + CASH: Compiling for ASH + ASH vs CPU + Result Analysis = Conclusions

45 44 Conclusions
    ASH promising for media processing; to evaluate
      – power
      – performance
      – cost
    Prediction does much more than avoid issue stalls
    von Neumann model of computation very powerful
    hardware resources are not everything

46 45 Backup Slides: Evaluation model; Control logic; Pipeline balancing; Lenient execution; Dynamic Critical Path

47 46 How Performance Is Evaluated [Figure: C, unlimited ILP; LSQ, limited BW (2 words/c); L1 8K; L2 1/4M; Mem; labels 2, 8, 72]

48 47 Simulation Parameters: Compared to 4-wide OOO SimpleScalar; Same operation latencies; Same cache hierarchy; No measurements in library functions/OS; 3-cycle multiply, 20-cycle divide

49 48 Control Logic [Figure: blocks labeled C, C and Reg; signals rdy in, ack in, rdy out, ack out, data in, data out]

50 49 Outline Introduction Compiling for ASH ASH at run-time ASH vs CPU Conclusions

51 50 Critical Paths
    if (x > 0) y = -x; else y = b*x;
    [Figure: dataflow graph with inputs x, b, 0, operators *, -, >, ! and output y]

52 51 Lenient Operations: Solve the problem of unbalanced paths
    if (x > 0) y = -x; else y = b*x;
    [Figure: dataflow graph with inputs x, b, 0, operators *, -, >, ! and output y]
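One way to picture a lenient operation is a selector that produces its output as soon as the predicate and the selected input are available, without waiting for the input it will discard. The C sketch below illustrates that reading of the slide; the `Wire` type and the functions `mux_strict`, `mux_lenient` and `lenient_y` are hypothetical, and readiness is modeled as a simple flag rather than real timing.

```c
#include <assert.h>
#include <stdbool.h>

/* A wire carries a value plus a flag saying whether it has arrived yet. */
typedef struct { int value; bool ready; } Wire;

static const Wire NOT_READY = { 0, false };

/* Strict selector: waits for the predicate and for both data inputs. */
Wire mux_strict(Wire p, Wire a, Wire b) {
    if (!p.ready || !a.ready || !b.ready) return NOT_READY;
    return p.value ? a : b;
}

/* Lenient selector: needs only the predicate and the selected input. */
Wire mux_lenient(Wire p, Wire a, Wire b) {
    if (!p.ready) return NOT_READY;
    Wire chosen = p.value ? a : b;
    return chosen.ready ? chosen : NOT_READY;
}

/* y = (x > 0) ? -x : b*x, where the multiply may still be in flight. */
Wire lenient_y(int x, int b, bool mul_done) {
    Wire p   = { x > 0, true };
    Wire neg = { -x, true };
    Wire mul = { mul_done ? b * x : 0, mul_done };
    return mux_lenient(p, neg, mul);
}
```

With `x > 0` the short negate path determines `y` even though the slow multiply has not finished, which is the sense in which leniency helps with unbalanced paths.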

53 52 Pipelining: pipelined multiplier (8 stages), cycle=1
    int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum;
    [Figure: loop dataflow graph: i, +, <=, 100, 1, *, +, sum]

54 53 Pipelining i + <= 100 1 * + sum cycle=2

55 54 Pipelining i + <= 100 1 * + sum cycle=3

56 55 Pipelining i + <= 100 1 * + sum cycle=4

57 56 Pipelining i + <= 100 1 * + sum i’s loop sum’s loop Long latency pipe cycle=5

58 57 Pipelining i + <= 100 1 * i=1 i=0 + sum cycle=6

59 58 Pipelining i + <= 100 1 * + sum i’s loop sum’s loop Long latency pipe predicate cycle=7

60 59 Pipelining: Predicate ack edge is on the critical path. i + <= 100 1 * + sum critical path i’s loop sum’s loop

61 60 Pipeline balancing i + <= 100 1 * + sum i’s loop sum’s loop decoupling FIFO cycle=7

62 61 Pipeline balancing i + <= 100 1 * + sum i’s loop sum’s loop critical path decoupling FIFO
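The decoupling idea can be sketched functionally in C: the loop that generates `i` pushes values into a small FIFO and runs ahead, while the multiply-accumulate loop drains it at its own pace. This is only an illustration and does not reproduce the slides' cycle counts; `FIFO_DEPTH`, the `Fifo` type and `sum_squares_decoupled` (with its `consumer_period` parameter) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define FIFO_DEPTH 4   /* hypothetical depth, chosen for illustration */

typedef struct { int buf[FIFO_DEPTH]; int head; int count; } Fifo;

bool fifo_push(Fifo *f, int v) {
    if (f->count == FIFO_DEPTH) return false;          /* full: producer stalls */
    f->buf[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
    return true;
}

bool fifo_pop(Fifo *f, int *v) {
    if (f->count == 0) return false;                   /* empty: consumer stalls */
    *v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}

/* Sum of i*i for i in [0, n); the consumer accepts one value every
 * consumer_period steps, standing in for a slower sum loop. */
int sum_squares_decoupled(int n, int consumer_period) {
    assert(n >= 0 && consumer_period >= 1);
    Fifo f = { {0}, 0, 0 };
    int i = 0, consumed = 0, sum = 0;
    for (int step = 0; consumed < n; step++) {
        /* i's loop: issue the next index whenever the FIFO has room */
        if (i < n && fifo_push(&f, i)) i++;
        /* sum's loop: multiply and accumulate when it is ready for input */
        int v;
        if (step % consumer_period == 0 && fifo_pop(&f, &v)) {
            sum += v * v;
            consumed++;
        }
    }
    return sum;
}
```

The result is the same for any consumer rate; only the number of steps changes, which is the property that lets the FIFO absorb the rate mismatch between the two loops.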

63 62 FIFO Impact
    * Pipe   FIFO   Cycles
    N        0      903
    N        1
    Y        0      653
    Y        1      474
    Y        2      408
    Y        3
    [Figure: loop dataflow graph (i, +, <=, 100, 1, *, +, sum) with a decoupling FIFO]

64 63 Pipelining Potential, Mediabench

65 64 Dataflow Loop Pipelining: Related to software pipelining; Copes with unknown latencies (control-flow, memory accesses); Does not require parallelization; Applicable to memory accesses as well

66 65 Last-Arrival Events: Event enabling the generation of a result; May be an ack; Critical path = collection of last-arrival edges [Figure: + node with data, valid and ack signals]

67 66 Dynamic Critical Path: 1. Start from last node 2. Trace back along last-arrival edges 3. Some edges may repeat
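The trace-back procedure can be sketched in C, assuming each recorded event stores the index of the event whose arrival enabled it (its last-arrival edge). The `Event` type, the `NO_PRED` marker and the functions `pick_last_arrival` and `trace_critical_path` are hypothetical names for illustration; the actual tooling behind the slides may record events differently.

```c
#include <assert.h>

#define NO_PRED (-1)   /* marks an event with no predecessor */

typedef struct {
    int last_arrival;  /* index of the event that arrived last, or NO_PRED */
} Event;

/* Among n input arrival times, returns the index of the latest one. */
int pick_last_arrival(const int *arrival, int n) {
    assert(arrival && n > 0);
    int best = 0;
    for (int k = 1; k < n; k++)
        if (arrival[k] > arrival[best]) best = k;
    return best;
}

/* Starting from the last event, follows last-arrival edges backwards.
 * Stores the visited event indices in path[] and returns how many. */
int trace_critical_path(const Event *ev, int n, int last, int *path, int cap) {
    assert(ev && path && last >= 0 && last < n);
    int len = 0;
    for (int cur = last; cur != NO_PRED && len < cap; cur = ev[cur].last_arrival) {
        assert(cur >= 0 && cur < n);
        path[len++] = cur;
    }
    return len;
}
```

Because the events are dynamic instances, the same static edge of the circuit can appear many times along the traced path, matching the slide's remark that some edges may repeat.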

68 67 Strengths: low power?; simple verification?; specialized to app.; unlimited ILP; simple hardware; no fixed window; economies of scale; highly optimized; branch prediction; register renaming; full-dataflow; global signals/decision

