Presentation transcript: Embedded Computer Architectures, Hennessy & Patterson Chapter 4: Exploiting ILP with Software Approaches

1 Embedded Computer Architectures Hennessy & Patterson Chapter 4: Exploiting ILP with Software Approaches Gerard Smit (Zilverling 4102), smit@cs.utwente.nl André Kokkeler (Zilverling 4096), kokkeler@utwente.nl

2 Contents Introduction Processor Architecture Loop Unrolling Software Pipelining

3 Introduction

Common Name               Issue Structure  Hazard Detection  Scheduling                Characteristic                            Examples
Superscalar (static)      dynamic          hardware          static                    in-order execution                        Sun UltraSPARC II/III
Superscalar (dynamic)     dynamic          hardware          dynamic                   some out-of-order execution               IBM Power2
Superscalar (speculative) dynamic          hardware          dynamic with speculation  out-of-order execution                    Pentium III
VLIW                      static           software          static                    no hazards between issue packets          Trimedia, i860
EPIC                      mostly static    mostly software   mostly static             explicit dependences marked by compiler   Itanium

4 Processor Architecture
5-stage pipeline, static scheduling, separate integer and floating-point (FP) units:
—Integer: IF ID INT-EX MEM WB
—Floating point: IF ID FP-EX FP-EX FP-EX FP-EX MEM WB (the FP execute stage takes four cycles)

5 Processor Architecture Latencies:
—Integer ALU => Integer ALU: no latency (a dependent integer instruction can issue in the next cycle)
—Floating-point ALU => Floating-point ALU: latency = 3 (a dependent FP instruction stalls three cycles)

6 Processor Architecture Latencies:
—Load Memory => Store Memory: no latency

7 Processor Architecture Latencies:
—Integer ALU => Store Memory: no latency
—Floating-point ALU => Store Memory: latency = 2

8 Processor Architecture Latencies:
—Load Memory => Integer ALU: latency = 1
—Load Memory => Floating-point ALU: latency = 1

9 Processor Architecture Latencies:
—Integer ALU => Branch: latency = 1

10 Loop Unrolling

for i := 1000 downto 1 do x[i] := x[i] + s;

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

R1: pointer within the array; R2: last element of the array; F0: value read from the array; F2: value to be added (s); F4: value to be written back
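The latency rules from slides 5 to 9 determine where this loop stalls. As an illustration (my own sketch, not part of the slides), a small Python model that issues the instruction sequence in order and applies those latencies reproduces the cycle counts quoted later: 10 cycles for the naive loop, 6 after scheduling.

```python
# Producer -> consumer latencies (extra stall cycles), from slides 5-9.
LAT = {
    ("FP_ALU", "FP_ALU"): 3,
    ("FP_ALU", "STORE"): 2,
    ("LOAD", "INT_ALU"): 1,
    ("LOAD", "FP_ALU"): 1,
    ("LOAD", "STORE"): 0,
    ("INT_ALU", "BRANCH"): 1,
}

def schedule(instrs):
    """In-order issue model: instrs is a list of (kind, dest, sources).
    Returns the total cycle count for one pass, including stalls."""
    prod = {}   # register -> (producer kind, producer's issue cycle)
    cycle = 0
    for kind, dest, srcs in instrs:
        earliest = cycle
        for s in srcs:
            if s in prod:
                pkind, pcycle = prod[s]
                earliest = max(earliest, pcycle + 1 + LAT.get((pkind, kind), 0))
        cycle = earliest          # stall until all sources are ready
        if dest is not None:
            prod[dest] = (kind, cycle)
        cycle += 1
    return cycle

naive = [
    ("LOAD",    "F0", ["R1"]),        # L.D    F0,0(R1)
    ("FP_ALU",  "F4", ["F0", "F2"]),  # ADD.D  F4,F0,F2
    ("STORE",   None, ["F4", "R1"]),  # S.D    0(R1),F4
    ("INT_ALU", "R1", ["R1"]),        # DADDUI R1,R1,#-8
    ("BRANCH",  None, ["R1", "R2"]),  # BNE    R1,R2,Loop
    ("NOP",     None, []),            # NOP (branch delay slot)
]
# Compiler-scheduled order of slide 15: L.D, DADDUI, ADD.D, BNE, S.D
scheduled = [naive[0], naive[3], naive[1], naive[4], naive[2]]
print(schedule(naive), schedule(scheduled))  # 10 and 6
```

The model only counts cycles; it does not check that the reordered code is semantically equivalent (the compiler handles that by adjusting the S.D offset, as slide 15 shows).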

11 Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Load Memory => FP ALU: 1 stall (between L.D and ADD.D)

12 Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      stall
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

FP ALU => Store Memory: 2 stalls (between ADD.D and S.D)

13 Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      stall
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      stall
      stall
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Integer ALU => Branch: 1 stall (between DADDUI and BNE)

14 Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      stall
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      stall
      stall
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      stall
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

A smart compiler can reschedule the code to hide most of these stalls.

15 Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      DADDUI R1,R1,#-8   ; i ← i-1
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      stall
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      S.D    8(R1),F4    ; x[i] ← x[i]+s (fills the branch delay slot; offset 8 compensates for the earlier DADDUI)

1 stall remains. From 10 cycles per loop iteration down to 6.

16 Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      DADDUI R1,R1,#-8   ; i ← i-1
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      S.D    8(R1),F4    ; x[i] ← x[i]+s

5 instructions:
—3 'doing the job'
—2 control or 'overhead'
Reduce overhead => loop unrolling:
—Adds code
—From 1000 iterations to 500 iterations

17 Loop Unrolling Original code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Copy the L.D/ADD.D/S.D part, with the correct 'data pointer' (offset -8).

18 Loop Unrolling Unrolled code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      L.D    F0,-8(R1)   ; F0 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i-1]+s
      S.D    -8(R1),F4   ; x[i-1] ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

There are still a lot of stalls (1 after each L.D, 2 after each ADD.D, 1 before BNE). Removing them is easier if some additional registers are used.

19 Loop Unrolling Unrolled code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      S.D    -8(R1),F8   ; x[i-1] ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Still 1 stall after each L.D, 2 after each ADD.D, and 1 before BNE.

20 Loop Unrolling Unrolled code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      S.D    -8(R1),F8   ; x[i-1] ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Stalls remain between the ADD.D/S.D pairs and before BNE.

21 Loop Unrolling Unrolled code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      S.D    -8(R1),F8   ; x[i-1] ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Stalls remain (2 before the first S.D, 1 before BNE). If DADDUI is moved above the stores, their offsets become +16 and +8.

22 Loop Unrolling Unrolled code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      S.D    16(R1),F4   ; x[i] ← x[i]+s
      S.D    8(R1),F8    ; x[i-1] ← x[i-1]+s
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

23 Loop Unrolling Unrolled code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      S.D    16(R1),F4   ; x[i] ← x[i]+s
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      S.D    8(R1),F8    ; x[i-1] ← x[i-1]+s (branch delay slot)

Clock cycles          Original loop (1000 times)   Unrolled loop (500 times)   Savings
L.D instructions      1000                         1000                        0
ADD.D instructions    1000                         1000                        0
S.D instructions      1000                         1000                        0
DADDUI instructions   1000                         500                         500
BNE instructions      1000                         500                         500
Stall cycles          1000                         0                           1000
Totals                6000                         4000                        2000
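The totals in the table follow from simple arithmetic; as a quick check (illustrative, assuming the scheduled original loop of slide 15 costs 6 cycles per element and the unrolled loop 8 cycles per two elements):

```python
# Scheduled original loop: 5 instructions + 1 stall per element, 1000 elements.
original = 1000 * (5 + 1)
# Unrolled by 2 and rescheduled: 8 instructions, no stalls, per 2 elements.
unrolled = 500 * 8
savings = original - unrolled
print(original, unrolled, savings)  # 6000 4000 2000
```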

24 Loop Unrolling In the example: loop-unrolling factor 2. In general: loop-unrolling factor k. Limitations concerning k:
—Amdahl's law: the 3000 cycles of useful work (load, add, store per element) are always needed
—Increasing k => increasing number of registers
—Increasing k => increasing code size
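The diminishing returns can be made concrete with a small cost formula (my own sketch, assuming the unrolled body schedules without stalls and that k divides the iteration count): 3 'real' instructions per element plus 2 overhead instructions per unrolled iteration.

```python
def unrolled_cycles(n, k):
    """Total cycles for n elements with unroll factor k, assuming a
    stall-free schedule: 3 useful instructions per element (load, add,
    store) plus 2 overhead instructions (DADDUI, BNE) per iteration."""
    assert n % k == 0, "assume k divides n, for simplicity"
    return (3 * k + 2) * (n // k)

print(unrolled_cycles(1000, 2))     # 4000, as in the table
print(unrolled_cycles(1000, 8))     # 3250
print(unrolled_cycles(1000, 1000))  # 3002: approaching the 3000-cycle floor
```

Overhead shrinks as 2000/k, so most of the benefit is already captured at small k, while register pressure and code size keep growing.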

25 Software Pipelining Original loop:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]        (1 stall follows)
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s      (2 stalls follow)
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1          (1 stall follows)
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Three actions are involved in the actual calculation:
F0 ← x[i]
F4 ← x[i] + s
x[i] ← x[i] + s
Consider these as three different stages.

26 Software Pipelining Original loop (as on slide 25). Three actions involved in the actual calculation:
F0 ← x[i]         Stage 1
F4 ← x[i] + s     Stage 2
x[i] ← x[i] + s   Stage 3
Associate an array element with each stage.

27 Software Pipelining Original loop (as on slide 25). Three actions involved in the actual calculation:
F0 ← x[i]         Stage 1, element x[i]
F4 ← x[i] + s     Stage 2, element x[i]
x[i] ← x[i] + s   Stage 3, element x[i]

28 Software Pipelining Normal execution (diagram): elements x[1000], x[999], x[998], ... pass one after another through Stage 1 (fill F0), Stage 2 (read F0, fill F4) and Stage 3 (read F4). Because the three stages of one element run back to back, each element incurs 1 stall between stages 1 and 2 and 2 stalls between stages 2 and 3.

29 Software Pipelining Software-pipelined execution (diagram): the stages of consecutive elements overlap in time: while x[i] is in Stage 3, x[i-1] is in Stage 2 and x[i-2] is in Stage 1 (using registers F0 and F4 for different elements). The 2-stall gaps disappear.

30 Software Pipelining Software-pipelined execution:

      L.D    F0,0(R1)    ; F0 ← x[1000]   (start-up)
      ADD.D  F4,F0,F2    ; F4 ← x[1000]+s (start-up)
      L.D    F0,-8(R1)   ; F0 ← x[999]    (start-up)
Loop: S.D    0(R1),F4    ; x[i] ← F4      (Stage 3, x[i])
      ADD.D  F4,F0,F2    ; F4 ← x[i-1]+s  (Stage 2, x[i-1])
      L.D    F0,-16(R1)  ; F0 ← x[i-2]    (Stage 1, x[i-2])
      BNE    R1,R2,Loop  ; repeat if i ≠ 1
      DADDUI R1,R1,#-8   ; i ← i-1 (branch delay slot)

31 Software Pipelining Software-pipelined execution: the same code as on slide 30; the overlap of stages removes the stalls between ADD.D and S.D.

32 Software Pipelining
—No stalls inside the loop
—Additional start-up (and clean-up) code
—No reduction of control overhead
—No additional registers
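The same transformation can be sketched in ordinary Python (an illustration of the idea, not from the slides): the kernel performs Stage 3 for x[i], Stage 2 for x[i-1] and Stage 1 for x[i-2], with start-up (prologue) and clean-up (epilogue) code filling and draining the pipeline. The variables f0 and f4 play the roles of registers F0 and F4.

```python
def add_s_pipelined(x, s):
    """Software-pipelined x[i] += s over the whole array (walking
    downwards, like the assembly version)."""
    n = len(x)
    if n < 2:                  # degenerate case: no pipeline needed
        for i in range(n):
            x[i] += s
        return
    # Prologue: fill the pipeline for the first two elements.
    f0 = x[n - 1]              # Stage 1 for x[n-1]
    f4 = f0 + s                # Stage 2 for x[n-1]
    f0 = x[n - 2]              # Stage 1 for x[n-2]
    # Kernel: one store, one add, one load per iteration.
    for i in range(n - 1, 1, -1):
        x[i] = f4              # Stage 3 for x[i]
        f4 = f0 + s            # Stage 2 for x[i-1]
        f0 = x[i - 2]          # Stage 1 for x[i-2]
    # Epilogue: drain the pipeline for the last two elements.
    x[1] = f4
    x[0] = f0 + s

data = [1.0, 2.0, 3.0, 4.0]
add_s_pipelined(data, 10.0)
print(data)  # [11.0, 12.0, 13.0, 14.0]
```

As the slide notes, no extra registers are needed (only f0 and f4), but prologue and epilogue code is added.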

33 VLIW To simplify processor hardware, rely on sophisticated compilers (loop unrolling, software pipelining, etc.). Extreme form: Very Long Instruction Word (VLIW) processors.

34 VLIW Diagram (superscalar vs. VLIW): in a superscalar processor, hardware takes the instruction stream and performs grouping, execution-unit assignment and initiation before dispatching to the execution units; in a VLIW the compiler does all three.

35 VLIW Suppose 4 functional units:
—Memory load unit
—Floating-point unit
—Memory store unit
—Integer/Branch unit
Instruction word: [ Memory load | FP operation | Memory store | Integer/Branch ]
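A toy packer illustrates how a compiler might fill such 4-slot instruction words (my own sketch: it packs greedily by slot type and deliberately ignores data dependences and latencies, which a real VLIW compiler must also respect):

```python
def pack(instrs):
    """Greedily pack (slot_kind, name) pairs into long instruction
    words with one slot per functional unit; start a new word when
    the required slot is already taken. Dependences are NOT checked."""
    words, current = [], {}
    for kind, name in instrs:
        if kind in current:        # slot conflict: emit a new word
            words.append(current)
            current = {}
        current[kind] = name
    if current:
        words.append(current)
    return words

stream = [("LOAD", "L.D"), ("FP", "ADD.D"), ("STORE", "S.D"),
          ("INT_BR", "DADDUI"), ("INT_BR", "BNE")]
print(pack(stream))
# Two words: the first holds L.D/ADD.D/S.D/DADDUI, the second only BNE.
```

The example also shows the classic VLIW cost: the second word has three empty slots, i.e. wasted issue bandwidth.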

36 VLIW Original loop:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

VLIW schedule (one row = one long instruction word):

Memory load | FP operation | Memory store | Integer/Branch
L.D         |              |              |
(stall)     |              |              |
            | ADD.D        |              |
(stall)     |              |              |
            |              | S.D          |

Most slots stay empty. Limit stall cycles by clever compilers (loop unrolling, software pipelining).

37 VLIW Diagram (repeated from slide 34): superscalar vs. VLIW: hardware vs. compiler responsibility for grouping, execution-unit assignment and initiation.

38 VLIW Diagram (superscalar vs. dynamic VLIW): in a dynamic VLIW the compiler still does grouping and execution-unit assignment, but the hardware handles initiation.

39 Dynamic VLIW A pure VLIW has no caches, because there is no hardware to deal with cache misses. A dynamic VLIW adds hardware to stall on a cache miss. Not used frequently.

40 VLIW Diagram (dynamic VLIW vs. EPIC): in Explicitly Parallel Instruction Computing (EPIC), the hardware handles execution-unit assignment as well as initiation; the compiler does the grouping.

41 EPIC IA-64: architecture by HP and Intel. IA-64 is an instruction set architecture intended for EPIC implementations. Itanium is the first Intel product. 64-bit architecture. Basic concepts:
—Instruction-level parallelism indicated by the compiler
—Long or very long instruction words
—Branch predication (≠ prediction)
—Speculative loading

42 Key Features Large number of registers:
—The IA-64 instruction format assumes 256:
–128 × 64-bit integer, logical & general purpose
–128 × 82-bit floating point and graphic
—64 × 1-bit predicate registers for predicated execution (see later)
—To support a high degree of parallelism
Multiple execution units:
—Expected to be 8 or more
—Depends on the number of transistors available
—Execution of parallel instructions depends on the hardware available:
–8 parallel instructions may be split into two lots of four if only four execution units are available

43 IA-64 Execution Units
I-Unit:
—Integer arithmetic
—Shift and add
—Logical
—Compare
—Integer multimedia ops
M-Unit:
—Load and store
–Between register and memory
—Some integer ALU
B-Unit:
—Branch instructions
F-Unit:
—Floating-point instructions

44 Instruction Format Diagram

45 Instruction Format 128 bit bundle —Holds three instructions (syllables) plus template —Can fetch one or more bundles at a time —Template contains info on which instructions can be executed in parallel –Not confined to single bundle –e.g. a stream of 8 instructions may be executed in parallel –Compiler will have re-ordered instructions to form contiguous bundles –Can mix dependent and independent instructions in same bundle
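The 128 bits of a bundle break down into a 5-bit template plus three 41-bit instruction slots (5 + 3 × 41 = 128). A bit-level sketch of the packing (illustrative only; the placement of the template in the low-order bits follows the IA-64 layout):

```python
def make_bundle(template, slot0, slot1, slot2):
    """Pack a 128-bit IA-64-style bundle: 5-bit template in the low
    bits, then three 41-bit instruction slots."""
    assert template < (1 << 5), "template is 5 bits"
    for s in (slot0, slot1, slot2):
        assert s < (1 << 41), "each slot is 41 bits"
    return template | (slot0 << 5) | (slot1 << 46) | (slot2 << 87)

def unpack_bundle(bundle):
    """Inverse of make_bundle: recover (template, slot0, slot1, slot2)."""
    mask41 = (1 << 41) - 1
    return (bundle & 0x1F,
            (bundle >> 5) & mask41,
            (bundle >> 46) & mask41,
            (bundle >> 87) & mask41)

b = make_bundle(0x10, 1, 2, 3)
print(unpack_bundle(b))  # (16, 1, 2, 3)
```

The template field is what tells the hardware which slots may execute in parallel and where a stop falls inside the bundle.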

46 Assembly Language Format

[qp] mnemonic [.comp] dest = srcs // comment

qp: predicate register
—1 at execution: execute and commit result to hardware
—0: result is discarded
mnemonic: name of the instruction
comp: one or more instruction completers used to qualify the mnemonic
dest: one or more destination operands
srcs: one or more source operands
//: comment

Instruction groups and stops are indicated by ;;
—A group is a sequence without read-after-write or write-after-write dependences
—Groups do not need hardware register dependency checks

47 Assembly Examples

ld8 r1 = [r5] ;;   // first group
add r3 = r1, r4    // second group

The second instruction depends on the value in r1, which is changed by the first instruction, so the two cannot be in the same group for parallel execution.

48 Predication

49 Predication Example

Pseudo code:        Using branches:      Predicated:
if a == 0           cmp a,0              cmp.eq p1, p2 = 0, a ;;
then j = j+1        jne L1               (p1) add j = 1, j
else k = k+1        add j,1              (p2) add k = 1, k
                    jmp L2
                    L1: add k,1
                    L2:

If a == 0 then p1 = 1 and p2 = 0, else p1 = 0 and p2 = 1.
Slide note on the stop (;;): it should NOT be there, to enable parallelism.
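The behaviour of the predicated version can be mimicked in Python (an illustrative sketch, not IA-64 semantics in detail): both adds are issued, but only the one whose predicate is 1 commits its result; the other result is discarded.

```python
def run_predicated(a, j, k):
    """Model of the predicated sequence: cmp.eq sets complementary
    predicates p1/p2, then each add commits only if its predicate is 1."""
    p1 = 1 if a == 0 else 0   # cmp.eq p1, p2 = 0, a
    p2 = 1 - p1
    if p1:                    # (p1) add j = 1, j  (discarded when p1 == 0)
        j = j + 1
    if p2:                    # (p2) add k = 1, k  (discarded when p2 == 0)
        k = k + 1
    return j, k

print(run_predicated(0, 5, 7))  # (6, 7): the 'then' side commits
print(run_predicated(3, 5, 7))  # (5, 8): the 'else' side commits
```

The point of predication is that both adds can be fetched and executed without any branch, so no branch misprediction penalty is possible.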

50 Speculative Loading

51 Data Speculation

Without speculation:

st8 [r4] = r12
ld8 r6 = [r8] ;;    // stall: what if r4 contains the same address as r8?
add r5 = r6, r7 ;;
st8 [r18] = r5

With an advanced load:

ld8.a r6 = [r8] ;;  // advanced load, moved above the store
st8 [r4] = r12
ld8.c r6 = [r8]     // check load
add r5 = r6, r7 ;;
st8 [r18] = r5

The advanced load writes its source address (the contents of r8) into the Advanced Load Address Table (ALAT). Each store checks the ALAT and removes an entry if the addresses match. If the check load finds no matching entry in the ALAT, the load is performed again.
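The ALAT mechanism just described can be modelled in a few lines of Python (a toy model of my own; real ALAT entries also track register number and access size, which this sketch omits):

```python
class ALAT:
    """Tiny model of the Advanced Load Address Table: ld.a records the
    load address, every store evicts a matching entry, and ld.c keeps
    the speculative value only if the entry survived."""
    def __init__(self):
        self.entries = set()

    def ld_a(self, mem, addr):
        """Advanced load: record the address and return the value."""
        self.entries.add(addr)
        return mem[addr]

    def st(self, mem, addr, val):
        """Store: update memory and evict any matching ALAT entry."""
        mem[addr] = val
        self.entries.discard(addr)

    def ld_c(self, mem, addr, speculative_value):
        """Check load: if the entry survived, speculation succeeded;
        otherwise an aliasing store happened, so reload from memory."""
        if addr in self.entries:
            return speculative_value
        return mem[addr]

mem = {4: 0, 8: 11}
alat = ALAT()
r6 = alat.ld_a(mem, 8)       # ld8.a r6 = [r8]
alat.st(mem, 4, 12)          # store to a different address: entry survives
print(alat.ld_c(mem, 8, r6)) # 11: speculative value kept
```

In the aliasing case (a store to the same address between ld.a and ld.c), the entry is evicted and ld.c re-reads memory, so the program still sees the stored value.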

52 Control & Data Speculation
Control speculation:
—AKA speculative loading
—Load data from memory before it is needed
Data speculation:
—A load is moved before a store that might alter the memory location
—With a subsequent check on the value

