Presentation transcript: Embedded Computer Architectures, Hennessy & Patterson Chapter 4: Exploiting ILP with Software Approaches

1 Embedded Computer Architectures Hennessy & Patterson Chapter 4: Exploiting ILP with Software Approaches Gerard Smit (Zilverling 4102), smit@cs.utwente.nl André Kokkeler (Zilverling 4096), kokkeler@utwente.nl

2 Contents Introduction Processor Architecture Loop Unrolling Software Pipelining

3 Introduction

Common Name               Issue Structure  Hazard Detection  Scheduling                Characteristic                            Examples
Superscalar (static)      dynamic          hardware          static                    in-order execution                        Sun UltraSPARC II/III
Superscalar (dynamic)     dynamic          hardware          dynamic                   some out-of-order execution               IBM Power2
Superscalar (speculative) dynamic          hardware          dynamic with speculation  out-of-order execution                    Pentium III
VLIW                      static           software          static                    no hazards between issue packets          Trimedia, i860
EPIC                      mostly static    mostly software   mostly static             explicit dependences marked by compiler   Itanium

4 Processor Architecture
5-stage pipeline, static scheduling, separate integer and floating-point (FP) units:
—Integer: IF ID INT-EX MEM WB
—Floating point: IF ID FP-EX FP-EX FP-EX FP-EX MEM WB (the FP execute stage takes four cycles)

5 Processor Architecture Latencies:
—Integer ALU => Integer ALU: no latency (a dependent integer instruction can issue in the next cycle)
—Floating-point ALU => Floating-point ALU: latency = 3 (a dependent FP instruction stalls three cycles)

6 Processor Architecture Latencies:
—Load Memory => Store Memory: no latency

7 Processor Architecture Latencies:
—Integer ALU => Store Memory: no latency
—Floating-point ALU => Store Memory: latency = 2

8 Processor Architecture Latencies:
—Load Memory => Integer ALU: latency = 1
—Load Memory => Floating-point ALU: latency = 1

9 Processor Architecture Latencies:
—Integer ALU => Branch: latency = 1

10 Loop Unrolling

for i := 1000 downto 1 do x[i] := x[i] + s;

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

R1: pointer within the array; R2: last element of the array; F0: value read from the array; F2: value to be added (s); F4: value to be written back
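The latency rules from slides 5 to 9 determine where this loop stalls. As an illustration (my own sketch, not part of the slides), a small Python model that issues the instruction sequence in order and applies those latencies reproduces the cycle counts quoted later: 10 cycles for the naive loop, 6 after scheduling.

```python
# Producer -> consumer latencies (extra stall cycles), from slides 5-9.
LAT = {
    ("FP_ALU", "FP_ALU"): 3,
    ("FP_ALU", "STORE"): 2,
    ("LOAD", "INT_ALU"): 1,
    ("LOAD", "FP_ALU"): 1,
    ("LOAD", "STORE"): 0,
    ("INT_ALU", "BRANCH"): 1,
}

def schedule(instrs):
    """In-order issue model: instrs is a list of (kind, dest, sources).
    Returns the total cycle count for one pass, including stalls."""
    prod = {}   # register -> (producer kind, producer's issue cycle)
    cycle = 0
    for kind, dest, srcs in instrs:
        earliest = cycle
        for s in srcs:
            if s in prod:
                pkind, pcycle = prod[s]
                earliest = max(earliest, pcycle + 1 + LAT.get((pkind, kind), 0))
        cycle = earliest          # stall until all sources are ready
        if dest is not None:
            prod[dest] = (kind, cycle)
        cycle += 1
    return cycle

naive = [
    ("LOAD",    "F0", ["R1"]),        # L.D    F0,0(R1)
    ("FP_ALU",  "F4", ["F0", "F2"]),  # ADD.D  F4,F0,F2
    ("STORE",   None, ["F4", "R1"]),  # S.D    0(R1),F4
    ("INT_ALU", "R1", ["R1"]),        # DADDUI R1,R1,#-8
    ("BRANCH",  None, ["R1", "R2"]),  # BNE    R1,R2,Loop
    ("NOP",     None, []),            # NOP (branch delay slot)
]
# Compiler-scheduled order of slide 15: L.D, DADDUI, ADD.D, BNE, S.D
scheduled = [naive[0], naive[3], naive[1], naive[4], naive[2]]
print(schedule(naive), schedule(scheduled))  # 10 and 6
```

The model only counts cycles; it does not check that the reordered code is semantically equivalent (the compiler handles that by adjusting the S.D offset, as slide 15 shows).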

11 Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Load Memory => FP ALU: 1 stall (between L.D and ADD.D)

12 Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      stall
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

FP ALU => Store Memory: 2 stalls (between ADD.D and S.D)

13 Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      stall
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      stall
      stall
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Integer ALU => Branch: 1 stall (between DADDUI and BNE)

14 Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      stall
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      stall
      stall
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      stall
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

A smart compiler can reschedule the code to hide most of these stalls.

15 Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      DADDUI R1,R1,#-8   ; i ← i-1
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      stall
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      S.D    8(R1),F4    ; x[i] ← x[i]+s (fills the branch delay slot; offset 8 compensates for the earlier DADDUI)

1 stall remains. From 10 cycles per loop iteration down to 6.

16 Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      DADDUI R1,R1,#-8   ; i ← i-1
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      S.D    8(R1),F4    ; x[i] ← x[i]+s

5 instructions:
—3 'doing the job'
—2 control or 'overhead'
Reduce overhead => loop unrolling:
—Adds code
—From 1000 iterations to 500 iterations

17 Loop Unrolling Original code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Copy the L.D/ADD.D/S.D part, with the correct 'data pointer' (offset -8).

18 Loop Unrolling Unrolled code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      L.D    F0,-8(R1)   ; F0 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i-1]+s
      S.D    -8(R1),F4   ; x[i-1] ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

There are still a lot of stalls (1 after each L.D, 2 after each ADD.D, 1 before BNE). Removing them is easier if some additional registers are used.

19 Loop Unrolling Unrolled code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      S.D    -8(R1),F8   ; x[i-1] ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Still 1 stall after each L.D, 2 after each ADD.D, and 1 before BNE.

20 Loop Unrolling Unrolled code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      S.D    -8(R1),F8   ; x[i-1] ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Stalls remain between the ADD.D/S.D pairs and before BNE.

21 Loop Unrolling Unrolled code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      S.D    -8(R1),F8   ; x[i-1] ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Stalls remain (2 before the first S.D, 1 before BNE). If DADDUI is moved above the stores, their offsets become +16 and +8.

22 Loop Unrolling Unrolled code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      S.D    16(R1),F4   ; x[i] ← x[i]+s
      S.D    8(R1),F8    ; x[i-1] ← x[i-1]+s
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

23 Loop Unrolling Unrolled code sequence:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      L.D    F6,-8(R1)   ; F6 ← x[i-1]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      ADD.D  F8,F6,F2    ; F8 ← x[i-1]+s
      DADDUI R1,R1,#-16  ; i ← i-2
      S.D    16(R1),F4   ; x[i] ← x[i]+s
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      S.D    8(R1),F8    ; x[i-1] ← x[i-1]+s (branch delay slot)

Clock cycles          Original loop (1000 times)   Unrolled loop (500 times)   Savings
L.D instructions      1000                         1000                        0
ADD.D instructions    1000                         1000                        0
S.D instructions      1000                         1000                        0
DADDUI instructions   1000                         500                         500
BNE instructions      1000                         500                         500
Stall cycles          1000                         0                           1000
Totals                6000                         4000                        2000
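The totals in the table follow from simple arithmetic; as a quick check (illustrative, assuming the scheduled original loop of slide 15 costs 6 cycles per element and the unrolled loop 8 cycles per two elements):

```python
# Scheduled original loop: 5 instructions + 1 stall per element, 1000 elements.
original = 1000 * (5 + 1)
# Unrolled by 2 and rescheduled: 8 instructions, no stalls, per 2 elements.
unrolled = 500 * 8
savings = original - unrolled
print(original, unrolled, savings)  # 6000 4000 2000
```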

24 Loop Unrolling In the example: loop-unrolling factor 2. In general: loop-unrolling factor k. Limitations concerning k:
—Amdahl's law: the 3000 cycles of useful work (load, add, store per element) are always needed
—Increasing k => increasing number of registers
—Increasing k => increasing code size
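The diminishing returns can be made concrete with a small cost formula (my own sketch, assuming the unrolled body schedules without stalls and that k divides the iteration count): 3 'real' instructions per element plus 2 overhead instructions per unrolled iteration.

```python
def unrolled_cycles(n, k):
    """Total cycles for n elements with unroll factor k, assuming a
    stall-free schedule: 3 useful instructions per element (load, add,
    store) plus 2 overhead instructions (DADDUI, BNE) per iteration."""
    assert n % k == 0, "assume k divides n, for simplicity"
    return (3 * k + 2) * (n // k)

print(unrolled_cycles(1000, 2))     # 4000, as in the table
print(unrolled_cycles(1000, 8))     # 3250
print(unrolled_cycles(1000, 1000))  # 3002: approaching the 3000-cycle floor
```

Overhead shrinks as 2000/k, so most of the benefit is already captured at small k, while register pressure and code size keep growing.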

25 Software Pipelining Original loop:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]        (1 stall follows)
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s      (2 stalls follow)
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1          (1 stall follows)
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

Three actions are involved in the actual calculation:
F0 ← x[i]
F4 ← x[i] + s
x[i] ← x[i] + s
Consider these as three different stages.

26 Software Pipelining Original loop (as on slide 25). Three actions involved in the actual calculation:
F0 ← x[i]         Stage 1
F4 ← x[i] + s     Stage 2
x[i] ← x[i] + s   Stage 3
Associate an array element with each stage.

27 Software Pipelining Original loop (as on slide 25). Three actions involved in the actual calculation:
F0 ← x[i]         Stage 1, element x[i]
F4 ← x[i] + s     Stage 2, element x[i]
x[i] ← x[i] + s   Stage 3, element x[i]

28 Software Pipelining Normal execution (diagram): elements x[1000], x[999], x[998], ... pass one after another through Stage 1 (fill F0), Stage 2 (read F0, fill F4) and Stage 3 (read F4). Because the three stages of one element run back to back, each element incurs 1 stall between stages 1 and 2 and 2 stalls between stages 2 and 3.

29 Software Pipelining Software-pipelined execution (diagram): the stages of consecutive elements overlap in time: while x[i] is in Stage 3, x[i-1] is in Stage 2 and x[i-2] is in Stage 1 (using registers F0 and F4 for different elements). The 2-stall gaps disappear.

30 Software Pipelining Software-pipelined execution:

      L.D    F0,0(R1)    ; F0 ← x[1000]   (start-up)
      ADD.D  F4,F0,F2    ; F4 ← x[1000]+s (start-up)
      L.D    F0,-8(R1)   ; F0 ← x[999]    (start-up)
Loop: S.D    0(R1),F4    ; x[i] ← F4      (Stage 3, x[i])
      ADD.D  F4,F0,F2    ; F4 ← x[i-1]+s  (Stage 2, x[i-1])
      L.D    F0,-16(R1)  ; F0 ← x[i-2]    (Stage 1, x[i-2])
      BNE    R1,R2,Loop  ; repeat if i ≠ 1
      DADDUI R1,R1,#-8   ; i ← i-1 (branch delay slot)

31 Software Pipelining Software-pipelined execution: the same code as on slide 30; the overlap of stages removes the stalls between ADD.D and S.D.

32 Software Pipelining
—No stalls inside the loop
—Additional start-up (and clean-up) code
—No reduction of control overhead
—No additional registers
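The same transformation can be sketched in ordinary Python (an illustration of the idea, not from the slides): the kernel performs Stage 3 for x[i], Stage 2 for x[i-1] and Stage 1 for x[i-2], with start-up (prologue) and clean-up (epilogue) code filling and draining the pipeline. The variables f0 and f4 play the roles of registers F0 and F4.

```python
def add_s_pipelined(x, s):
    """Software-pipelined x[i] += s over the whole array (walking
    downwards, like the assembly version)."""
    n = len(x)
    if n < 2:                  # degenerate case: no pipeline needed
        for i in range(n):
            x[i] += s
        return
    # Prologue: fill the pipeline for the first two elements.
    f0 = x[n - 1]              # Stage 1 for x[n-1]
    f4 = f0 + s                # Stage 2 for x[n-1]
    f0 = x[n - 2]              # Stage 1 for x[n-2]
    # Kernel: one store, one add, one load per iteration.
    for i in range(n - 1, 1, -1):
        x[i] = f4              # Stage 3 for x[i]
        f4 = f0 + s            # Stage 2 for x[i-1]
        f0 = x[i - 2]          # Stage 1 for x[i-2]
    # Epilogue: drain the pipeline for the last two elements.
    x[1] = f4
    x[0] = f0 + s

data = [1.0, 2.0, 3.0, 4.0]
add_s_pipelined(data, 10.0)
print(data)  # [11.0, 12.0, 13.0, 14.0]
```

As the slide notes, no extra registers are needed (only f0 and f4), but prologue and epilogue code is added.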

33 VLIW To simplify processor hardware, rely on sophisticated compilers (loop unrolling, software pipelining, etc.). Extreme form: Very Long Instruction Word (VLIW) processors.

34 VLIW Diagram (superscalar vs. VLIW): in a superscalar processor, hardware takes the instruction stream and performs grouping, execution-unit assignment and initiation before dispatching to the execution units; in a VLIW the compiler does all three.

35 VLIW Suppose 4 functional units:
—Memory load unit
—Floating-point unit
—Memory store unit
—Integer/Branch unit
Instruction word: [ Memory load | FP operation | Memory store | Integer/Branch ]
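A toy packer illustrates how a compiler might fill such 4-slot instruction words (my own sketch: it packs greedily by slot type and deliberately ignores data dependences and latencies, which a real VLIW compiler must also respect):

```python
def pack(instrs):
    """Greedily pack (slot_kind, name) pairs into long instruction
    words with one slot per functional unit; start a new word when
    the required slot is already taken. Dependences are NOT checked."""
    words, current = [], {}
    for kind, name in instrs:
        if kind in current:        # slot conflict: emit a new word
            words.append(current)
            current = {}
        current[kind] = name
    if current:
        words.append(current)
    return words

stream = [("LOAD", "L.D"), ("FP", "ADD.D"), ("STORE", "S.D"),
          ("INT_BR", "DADDUI"), ("INT_BR", "BNE")]
print(pack(stream))
# Two words: the first holds L.D/ADD.D/S.D/DADDUI, the second only BNE.
```

The example also shows the classic VLIW cost: the second word has three empty slots, i.e. wasted issue bandwidth.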

36 VLIW Original loop:

Loop: L.D    F0,0(R1)    ; F0 ← x[i]
      ADD.D  F4,F0,F2    ; F4 ← x[i]+s
      S.D    0(R1),F4    ; x[i] ← x[i]+s
      DADDUI R1,R1,#-8   ; i ← i-1
      BNE    R1,R2,Loop  ; repeat if i ≠ 0
      NOP                ; branch delay slot

VLIW schedule (one row = one long instruction word):

Memory load | FP operation | Memory store | Integer/Branch
L.D         |              |              |
(stall)     |              |              |
            | ADD.D        |              |
(stall)     |              |              |
            |              | S.D          |

Most slots stay empty. Limit stall cycles by clever compilers (loop unrolling, software pipelining).

37 VLIW Diagram (repeated from slide 34): superscalar vs. VLIW: hardware vs. compiler responsibility for grouping, execution-unit assignment and initiation.

38 VLIW Diagram (superscalar vs. dynamic VLIW): in a dynamic VLIW the compiler still does grouping and execution-unit assignment, but the hardware handles initiation.

39 Dynamic VLIW A pure VLIW has no caches, because there is no hardware to deal with cache misses. A dynamic VLIW adds hardware to stall on a cache miss. Not used frequently.

40 VLIW Diagram (dynamic VLIW vs. EPIC): in Explicitly Parallel Instruction Computing (EPIC), the hardware handles execution-unit assignment as well as initiation; the compiler does the grouping.

41 EPIC IA-64: architecture by HP and Intel. IA-64 is an instruction set architecture intended for EPIC implementations. Itanium is the first Intel product. 64-bit architecture. Basic concepts:
—Instruction-level parallelism indicated by the compiler
—Long or very long instruction words
—Branch predication (≠ prediction)
—Speculative loading

42 Key Features Large number of registers:
—The IA-64 instruction format assumes 256:
–128 × 64-bit integer, logical & general purpose
–128 × 82-bit floating point and graphic
—64 × 1-bit predicate registers for predicated execution (see later)
—To support a high degree of parallelism
Multiple execution units:
—Expected to be 8 or more
—Depends on the number of transistors available
—Execution of parallel instructions depends on the hardware available:
–8 parallel instructions may be split into two lots of four if only four execution units are available

43 IA-64 Execution Units
I-Unit:
—Integer arithmetic
—Shift and add
—Logical
—Compare
—Integer multimedia ops
M-Unit:
—Load and store
–Between register and memory
—Some integer ALU
B-Unit:
—Branch instructions
F-Unit:
—Floating-point instructions

44 Instruction Format Diagram

45 Instruction Format 128 bit bundle —Holds three instructions (syllables) plus template —Can fetch one or more bundles at a time —Template contains info on which instructions can be executed in parallel –Not confined to single bundle –e.g. a stream of 8 instructions may be executed in parallel –Compiler will have re-ordered instructions to form contiguous bundles –Can mix dependent and independent instructions in same bundle
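The 128 bits of a bundle break down into a 5-bit template plus three 41-bit instruction slots (5 + 3 × 41 = 128). A bit-level sketch of the packing (illustrative only; the placement of the template in the low-order bits follows the IA-64 layout):

```python
def make_bundle(template, slot0, slot1, slot2):
    """Pack a 128-bit IA-64-style bundle: 5-bit template in the low
    bits, then three 41-bit instruction slots."""
    assert template < (1 << 5), "template is 5 bits"
    for s in (slot0, slot1, slot2):
        assert s < (1 << 41), "each slot is 41 bits"
    return template | (slot0 << 5) | (slot1 << 46) | (slot2 << 87)

def unpack_bundle(bundle):
    """Inverse of make_bundle: recover (template, slot0, slot1, slot2)."""
    mask41 = (1 << 41) - 1
    return (bundle & 0x1F,
            (bundle >> 5) & mask41,
            (bundle >> 46) & mask41,
            (bundle >> 87) & mask41)

b = make_bundle(0x10, 1, 2, 3)
print(unpack_bundle(b))  # (16, 1, 2, 3)
```

The template field is what tells the hardware which slots may execute in parallel and where a stop falls inside the bundle.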

46 Assembly Language Format

[qp] mnemonic [.comp] dest = srcs // comment

qp: predicate register
—1 at execution: execute and commit result to hardware
—0: result is discarded
mnemonic: name of the instruction
comp: one or more instruction completers used to qualify the mnemonic
dest: one or more destination operands
srcs: one or more source operands
//: comment

Instruction groups and stops are indicated by ;;
—A group is a sequence without read-after-write or write-after-write dependences
—Groups do not need hardware register dependency checks

47 Assembly Examples

ld8 r1 = [r5] ;;   // first group
add r3 = r1, r4    // second group

The second instruction depends on the value in r1, which is changed by the first instruction, so the two cannot be in the same group for parallel execution.

48 Predication

49 Predication Example

Pseudo code:        Using branches:      Predicated:
if a == 0           cmp a,0              cmp.eq p1, p2 = 0, a ;;
then j = j+1        jne L1               (p1) add j = 1, j
else k = k+1        add j,1              (p2) add k = 1, k
                    jmp L2
                    L1: add k,1
                    L2:

If a == 0 then p1 = 1 and p2 = 0, else p1 = 0 and p2 = 1.
Slide note on the stop (;;): it should NOT be there, to enable parallelism.
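The behaviour of the predicated version can be mimicked in Python (an illustrative sketch, not IA-64 semantics in detail): both adds are issued, but only the one whose predicate is 1 commits its result; the other result is discarded.

```python
def run_predicated(a, j, k):
    """Model of the predicated sequence: cmp.eq sets complementary
    predicates p1/p2, then each add commits only if its predicate is 1."""
    p1 = 1 if a == 0 else 0   # cmp.eq p1, p2 = 0, a
    p2 = 1 - p1
    if p1:                    # (p1) add j = 1, j  (discarded when p1 == 0)
        j = j + 1
    if p2:                    # (p2) add k = 1, k  (discarded when p2 == 0)
        k = k + 1
    return j, k

print(run_predicated(0, 5, 7))  # (6, 7): the 'then' side commits
print(run_predicated(3, 5, 7))  # (5, 8): the 'else' side commits
```

The point of predication is that both adds can be fetched and executed without any branch, so no branch misprediction penalty is possible.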

50 Speculative Loading

51 Data Speculation

Without speculation:

st8 [r4] = r12
ld8 r6 = [r8] ;;    // stall: what if r4 contains the same address as r8?
add r5 = r6, r7 ;;
st8 [r18] = r5

With an advanced load:

ld8.a r6 = [r8] ;;  // advanced load, moved above the store
st8 [r4] = r12
ld8.c r6 = [r8]     // check load
add r5 = r6, r7 ;;
st8 [r18] = r5

The advanced load writes its source address (the contents of r8) into the Advanced Load Address Table (ALAT). Each store checks the ALAT and removes an entry if the addresses match. If the check load finds no matching entry in the ALAT, the load is performed again.
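The ALAT mechanism just described can be modelled in a few lines of Python (a toy model of my own; real ALAT entries also track register number and access size, which this sketch omits):

```python
class ALAT:
    """Tiny model of the Advanced Load Address Table: ld.a records the
    load address, every store evicts a matching entry, and ld.c keeps
    the speculative value only if the entry survived."""
    def __init__(self):
        self.entries = set()

    def ld_a(self, mem, addr):
        """Advanced load: record the address and return the value."""
        self.entries.add(addr)
        return mem[addr]

    def st(self, mem, addr, val):
        """Store: update memory and evict any matching ALAT entry."""
        mem[addr] = val
        self.entries.discard(addr)

    def ld_c(self, mem, addr, speculative_value):
        """Check load: if the entry survived, speculation succeeded;
        otherwise an aliasing store happened, so reload from memory."""
        if addr in self.entries:
            return speculative_value
        return mem[addr]

mem = {4: 0, 8: 11}
alat = ALAT()
r6 = alat.ld_a(mem, 8)       # ld8.a r6 = [r8]
alat.st(mem, 4, 12)          # store to a different address: entry survives
print(alat.ld_c(mem, 8, r6)) # 11: speculative value kept
```

In the aliasing case (a store to the same address between ld.a and ld.c), the entry is evicted and ld.c re-reads memory, so the program still sees the stored value.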

52 Control & Data Speculation
Control speculation:
—AKA speculative loading
—Load data from memory before it is needed
Data speculation:
—A load is moved before a store that might alter the memory location
—With a subsequent check on the value

