COSC3330 Computer Architecture

Lecture 12. ILP (Cont’d)
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

Plan
Today: ILP and OOO
Next week: branch predictors and a tutorial on the hw2 programming part

Projects
  FPGA related
  HPC cluster related
  3D cellular automata and LED cube
  …

Instruction Level Parallelism (ILP)
Basic idea: execute several instructions in parallel.
We already do pipelining, but it can churn out at best 1 instr/cycle.
We want multiple instr/cycle.
Yes, it gets a bit complicated and we have to add a fan to cool the processor, but it delivers performance (power is another issue).
That’s how we got from the 486 (pipelined) to the Pentium and beyond.

Diversified Pipelines
Separate pipelines for integer, multiply, FPU, load/store
Temporal vs. spatial vs. both

ILP is Bounded
For any sequence of instructions, the available parallelism is limited.
Hazards/dependencies are what limit the ILP:
  Data dependencies
  Control dependencies
  Memory dependencies

RAW Memory Dependency
RAW (Read-After-Write): A writes to a location, B reads from the location, therefore B has a RAW dependency on A.
Also called a “true dependency”.
  A: STORE R1, 0[R2]
  B: LOAD R5, 0[R2]
Instructions executing in the same cycle cannot have a RAW dependency.

WAR Memory Dependency
WAR (Write-After-Read): A reads from a location, B writes to the location, therefore B has a WAR dependency on A.
If B executes before A has read its operand, then the operand will be lost.
Also called an anti-dependence.
  A: LOAD R5, 0[R2]
  B: STORE R3, 0[R2]
The dependence holds even with other instructions in between:
  A: LOAD R5, 0[R2]
     ADD R7, R5, R7
  B: STORE R3, 0[R2]

WAW Memory Dependency
WAW (Write-After-Write): A writes to a location, B writes to the same location.
If B writes first, then A writes, the location will end up with the wrong value.
Also called an output-dependence.
  A: STORE R1, 0[R2]
  B: STORE R3, 0[R2]
The dependence holds even with other instructions in between:
  A: STORE R1, 0[R2]
     LOAD R5, 0[R2]
  B: STORE R3, 0[R2]

Memory Location Ambiguity
When the exact location is not known:
  A: STORE R1, 0[R2]
  B: LOAD R5, 24[R8]
  C: STORE R3, -8[R9]
RAW exists if (R2+0) == (R8+24)
WAR exists if (R8+24) == (R9-8)
WAW exists if (R2+0) == (R9-8)
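The three address comparisons above can be sketched in Python. This is only an illustration, assuming the base register values are already known; the function names and the sample register values are hypothetical and not from the slides.

```python
def effective_address(base_value, offset):
    """Effective address of an access written as offset[base] in the slides."""
    return base_value + offset


def memory_dependencies(regs):
    """List the memory dependencies among A, B, C for given base register values."""
    addr_a = effective_address(regs["R2"], 0)    # A: STORE R1, 0[R2]
    addr_b = effective_address(regs["R8"], 24)   # B: LOAD  R5, 24[R8]
    addr_c = effective_address(regs["R9"], -8)   # C: STORE R3, -8[R9]
    deps = []
    if addr_a == addr_b:
        deps.append("RAW A->B")   # B loads what A stored
    if addr_b == addr_c:
        deps.append("WAR B->C")   # C overwrites what B loads
    if addr_a == addr_c:
        deps.append("WAW A->C")   # C overwrites what A stored
    return deps


# Hypothetical register values: all three accesses hit address 124.
print(memory_dependencies({"R2": 124, "R8": 100, "R9": 132}))
# Hypothetical register values: three distinct addresses, no dependency.
print(memory_dependencies({"R2": 100, "R8": 200, "R9": 300}))
```

The point of the sketch is that the answer depends on run-time register values, which is why the check cannot be done from the instruction text alone.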

Memory Dependency
An ambiguous dependency also forces “sequentiality”.
To increase ILP, the processor needs dynamic memory disambiguation mechanisms that are either safe or recoverable.
ILP could be 1, could be 3, depending on the actual dependence:
  i1: load r2, (r12)
  i2: store r7, 24(r20)
  i3: store r1, (0xFF00)

Control Dependencies
If we have a conditional branch, until we actually know the outcome, all later instructions must wait.
That is, all instructions are control dependent on all earlier branches.
This is true for unconditional branches as well (e.g., can’t return from a function until we’ve loaded the return address).
      la $8, array
      beq $20, $22, L1
      lb $10, 1($8)
      add $11, $9, $10
      sb $11, ($8)
  L1: addiu $8, $8, 4

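Pop from the Stack
MIPS snippet:
          li $v0, 1
          j fact
  fact:   blt $a0, $v0, return   # n < 1: return with $v0 = 1
          sub $sp, $sp, 8        # allocate two words on the stack
          sw $ra, 4($sp)         # save the return address
          sw $a0, 0($sp)         # save n
          sub $a0, $a0, 1        # n - 1
          jal fact               # fact(n - 1)
          lw $a0, 0($sp)         # restore n
          lw $ra, 4($sp)         # restore the return address
          mult $a0, $v0          # n * fact(n - 1)
          mflo $v0
          add $sp, $sp, 8        # pop the stack frame
  return: jr $ra
C code snippet:
  z = fact(x);
  int fact(int n) {
    if (n < 1) return(1);
    else return(n * fact(n-1));
  }
Figure: stack frame holding the return address and $a0 (= x), with $sp pointing to the top of the stack.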

Name Dependency
WAR and WAW result from the reuse of names.
Figure: example instruction R2 = R1 + R3
Would WAR and WAW exist with more registers?

ILP Example
True dependency forces “sequentiality”: ILP = 3/3 = 1
  c1: i1: load r2, (r12)
  c2: i2: add r1, r2, 9
  c3: i3: mul r2, r5, r6
False dependency removed (i3 writes r8 instead of r2): ILP = 3/2 = 1.5
  i1: load r2, (r12)
  i2: add r1, r2, 9
  i3: mul r8, r5, r6
Resulting schedule:
  c1: load r2, (r12)
  c2: add r1, r2, #9
      mul r8, r5, r6
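The ILP figures above can be reproduced with a small Python sketch. It rests on assumptions not stated on the slide: every instruction takes one cycle, resources are unlimited, and any dependency (true or false) pushes the later instruction to a later cycle. The tuple encoding and the function names are made up for illustration.

```python
def classify(instrs):
    """Return (earlier, later, kind) for each register dependency.

    Each instruction is (name, dest, sources). Simplification: every
    earlier/later pair is compared, which is exact for this short example.
    """
    deps = []
    for j, (name_j, dest_j, srcs_j) in enumerate(instrs):
        for name_i, dest_i, srcs_i in instrs[:j]:
            if dest_i in srcs_j:
                deps.append((name_i, name_j, "RAW"))
            if dest_j in srcs_i:
                deps.append((name_i, name_j, "WAR"))
            if dest_i == dest_j:
                deps.append((name_i, name_j, "WAW"))
    return deps


def ilp(instrs):
    """Instruction count divided by the number of cycles needed."""
    deps = classify(instrs)
    cycle = {}
    for name, _, _ in instrs:
        earlier = [cycle[a] for a, b, _ in deps if b == name]
        cycle[name] = 1 + max(earlier, default=0)
    return len(instrs) / max(cycle.values())


original = [("i1", "r2", ["r12"]), ("i2", "r1", ["r2"]), ("i3", "r2", ["r5", "r6"])]
renamed = [("i1", "r2", ["r12"]), ("i2", "r1", ["r2"]), ("i3", "r8", ["r5", "r6"])]

print(classify(original))            # RAW i1->i2, WAW i1->i3, WAR i2->i3
print(ilp(original), ilp(renamed))   # 1.0 1.5
```

In the renamed case the sketch places the mul in cycle 1 rather than cycle 2; the total is still two cycles, so the ILP matches the slide.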

Eliminating WAR Dependencies
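WAR dependencies are from reusing registers.
Original code:
  A: R1 = R3 / R4
  B: R3 = R2 * R4
Renamed code:
  A: R1 = R3 / R4
  B: R5 = R2 * R4
Register values without renaming:
        Program order (A, then B)      Reordered (B, then A): wrong
        start  after A  after B        start  after B  after A
  R1      5      3        3              5      5       -2
  R2     -2     -2       -2             -2     -2       -2
  R3      9      9       -6              9     -6       -6
  R4      3      3        3              3      3        3
With renaming, B writes R5 (initially 4, then -6) and R1 = 3 in either order.
With no dependencies, reordering still produces the correct results.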

Eliminating WAW Dependencies
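WAW dependencies are also from reusing registers.
Original code:
  A: R1 = R2 + R3
  B: R1 = R3 * R4
Renamed code:
  A: R5 = R2 + R3
  B: R1 = R3 * R4
Register values without renaming (initially R1 = 5, R2 = -2, R3 = 9, R4 = 3):
  Program order (A, then B): R1 = 5, 7, 27
  Reordered (B, then A): R1 = 5, 27, 7 (wrong)
With renaming, A writes R5 (initially 4, then 7) and R1 = 27 in either order.
Same solution works.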
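The WAR and WAW examples can be checked by running each instruction pair in both orders. The Python sketch below uses the register values from the slides (R1 = 5, R2 = -2, R3 = 9, R4 = 3, R5 = 4); treating "/" as integer division and the helper names are assumptions made for illustration.

```python
import operator


def run(program, regs):
    """Execute (dest, op, src1, src2) tuples in order on a copy of regs."""
    regs = dict(regs)
    for dest, op, src1, src2 in program:
        regs[dest] = op(regs[src1], regs[src2])
    return regs


def order_independent(program, regs):
    """True if executing the program forwards and backwards gives the same registers."""
    return run(program, regs) == run(list(reversed(program)), regs)


START = {"R1": 5, "R2": -2, "R3": 9, "R4": 3, "R5": 4}

# "/" is modelled as integer division (assumption).
war = [("R1", operator.floordiv, "R3", "R4"), ("R3", operator.mul, "R2", "R4")]
war_renamed = [("R1", operator.floordiv, "R3", "R4"), ("R5", operator.mul, "R2", "R4")]
waw = [("R1", operator.add, "R2", "R3"), ("R1", operator.mul, "R3", "R4")]
waw_renamed = [("R5", operator.add, "R2", "R3"), ("R1", operator.mul, "R3", "R4")]

for label, program in [("WAR", war), ("WAR renamed", war_renamed),
                       ("WAW", waw), ("WAW renamed", waw_renamed)]:
    print(label, order_independent(program, START))
# WAR False, WAR renamed True, WAW False, WAW renamed True
```

Only the renamed pairs survive reordering, which matches the tables: without renaming, the reordered WAR pair leaves R1 = -2 and the reordered WAW pair leaves R1 = 7.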

Another Register Example
When only 4 registers are available:
  R1 = 8(R0)
  R3 = R1 - 5
  R2 = R1 * R3
  24(R0) = R2
  R1 = 16(R0)
  32(R0) = R2
ILP = ?

Another Register Example
When more registers (or register renaming) are available:
Original code:
  R1 = 8(R0)
  R3 = R1 - 5
  R2 = R1 * R3
  24(R0) = R2
  R1 = 16(R0)
  32(R0) = R2
With more registers:
  R1 = 8(R0)
  R3 = R1 - 5
  R2 = R1 * R3
  24(R0) = R2
  R5 = 16(R0)
  R6 = R5 - 5
  R7 = R5 * R6
  32(R0) = R7
ILP = ?

Obvious Solution: More Registers
Add more registers to the ISA?
Changing the ISA can break binary compatibility.
All code must be recompiled.
Not a scalable solution.
BAD!!!

Better Solution: Register Renaming
Give the processor more registers than specified by the ISA: temporarily map ISA registers (“logical” or “architected” registers) to the physical registers to avoid overwrites.
Components:
  mapping mechanism
  physical registers
    allocated vs. free registers
    allocation/deallocation mechanism

Register Renaming Example
Program code:
  I1: ADD R1, R2, R3
  I2: SUB R2, R1, R6
  I3: AND R6, R11, R7
  I4: OR R8, R5, R2
  I5: XOR R2, R4, R11
RAW: I2 reads R1 written by I1; I4 reads R2 written by I2.
WAR: I3 cannot execute before I2 because I3 will overwrite R6.
WAW: I5 cannot go before I2 because I2, when it goes, will overwrite R2 with a stale value.

Register Renaming
Program code:
  I1: ADD R1, R2, R3
  I2: SUB R2, R1, R6
  I3: AND R6, R11, R7
  I4: OR R8, R5, R2
  I5: XOR R2, R4, R11
Solution: Let’s give I2 a temporary name/location (e.g., S) for the value it produces. But I4 uses that value, so we must also change that to S. In fact, all uses of R2 from I2 up to the next instruction that writes to R2 again must now be changed to S! We remove WAW deps in the same way: change R2 in I5 (and subsequent instrs) to T. Likewise, I3 writes U instead of R6, which removes its WAR dependency on I2.
Renamed code:
  I1: ADD R1, R2, R3
  I2: SUB S, R1, R6
  I3: AND U, R11, R7
  I4: OR R8, R5, S
  I5: XOR T, R4, R11

Register Renaming Implementation
Program code:
  I1: ADD R1, R2, R3
  I2: SUB S, R1, R6
  I3: AND U, R11, R7
  I4: OR R8, R5, S
  I5: XOR T, R4, R11
Implementation:
  Space for S, T, U, etc.
  How do we know when to rename a register?
Simple solution:
  Do renaming for every instruction.
  Change the name of a register each time we decode an instruction that will write to it.
  Remember what name we gave it.
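The "rename every instruction" rule can be sketched in Python as follows. This is a minimal illustration, not the hardware mechanism: it assumes an unlimited supply of fresh names, and the names P1, P2, ... are hypothetical stand-ins for S, T, U.

```python
from itertools import count


def rename(program):
    """Give every destination a fresh name and rewrite later reads to use it."""
    fresh = (f"P{n}" for n in count(1))   # hypothetical names, in the role of S, T, U
    current = {}                          # architected register -> latest name given
    renamed = []
    for op, dest, src1, src2 in program:
        # Sources are looked up first, so they see the previous writer's name.
        src1 = current.get(src1, src1)
        src2 = current.get(src2, src2)
        # Then the destination gets a new name, which we remember.
        current[dest] = next(fresh)
        renamed.append((op, current[dest], src1, src2))
    return renamed


program = [
    ("ADD", "R1", "R2", "R3"),    # I1
    ("SUB", "R2", "R1", "R6"),    # I2
    ("AND", "R6", "R11", "R7"),   # I3
    ("OR", "R8", "R5", "R2"),     # I4
    ("XOR", "R2", "R4", "R11"),   # I5
]

for instr in rename(program):
    print(instr)
```

Because every destination is renamed, I1 and I4 also get new names here; P2, P3, and P5 correspond to S, U, and T in the slide, and I4 reads P2 just as it reads S above.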

Register File Organization
We need some physical structure to store the register values:
  ARF (Architected Register File): the “outside” world sees the ARF
  RAT (Register Alias Table)
  PRF (Physical Register File): one physical register per instruction in flight

Putting it all Together
  top: R1 = R2 + R3
       R2 = R4 - R1
       R1 = R3 * R6
       R2 = R1 + R2
       R3 = R1 >> 1
       BNEZ R3, top
Free pool: X9, X11, X7, X2, X13, X4, X8, X12, X3, X5…
Figure: ARF with registers R1 to R6, PRF with registers X1 to X16, and a RAT with one entry per architected register, each initially mapping to its ARF register (R1 to R1, ..., R6 to R6).

Renaming in action
  top: R1 = R2 + R3
       R2 = R4 - R1
       R1 = R3 * R6
       R2 = R1 + R2
       R3 = R1 >> 1
       BNEZ R3, top
Free pool: X9, X11, X7, X2, X13, X4, X8, X12, X3, X5…
Renamed code, with blanks for the physical register names filled in during the animation:
       __ = R2 + R3
       __ = R4 - __
       __ = R3 * R6
       __ = __ + __
       __ = __ >> 1
       BNEZ __, top
Figure: ARF with registers R1 to R6, PRF with registers X1 to X16, and the RAT.
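One way to follow the animation is to replay it in Python with a RAT and the free pool from the slide. The sketch below is an assumption about how the blanks get filled: each destination takes the next register from the free pool, and each source is read through the RAT, falling back to the ARF name when no mapping exists. The tuple encoding is hypothetical and drops the immediate operand of the shift.

```python
from collections import deque


def rename_with_rat(program, free_pool):
    """Rename (dest, sources) pairs using a RAT and a free pool of physical registers."""
    rat = {}                  # architected -> physical; a missing entry means "read the ARF"
    free = deque(free_pool)
    renamed = []
    for dest, srcs in program:
        new_srcs = [rat.get(s, s) for s in srcs]   # sources use the current mapping
        new_dest = None
        if dest is not None:
            new_dest = free.popleft()              # raises IndexError if the pool is empty
            rat[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed, rat


loop = [
    ("R1", ["R2", "R3"]),   # R1 = R2 + R3
    ("R2", ["R4", "R1"]),   # R2 = R4 - R1
    ("R1", ["R3", "R6"]),   # R1 = R3 * R6
    ("R2", ["R1", "R2"]),   # R2 = R1 + R2
    ("R3", ["R1"]),         # R3 = R1 >> 1
    (None, ["R3"]),         # BNEZ R3, top
]
free_pool = ["X9", "X11", "X7", "X2", "X13", "X4", "X8", "X12", "X3", "X5"]

renamed, rat = rename_with_rat(loop, free_pool)
for dest, srcs in renamed:
    print(dest, srcs)
print(rat)
```

Under these assumptions the first iteration becomes X9 = R2 + R3, X11 = R4 - X9, X7 = R3 * R6, X2 = X7 + X11, X13 = X7 >> 1, BNEZ X13. A real pipeline would stall when the pool runs dry rather than raise an error.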

Even Physical Registers are Limited
We keep using new physical registers. What happens when we run out?
There must be a way to “recycle”.
When can we recycle? When we have given its value to all instructions that use it as a source operand!
This is not as easy as it sounds.

Instruction Commit (Leaving the Pipe)
The architected register file contains the “official” processor state.
When an instruction leaves the pipeline, it makes its result “official” by updating the ARF.
The ARF now contains the correct value; update the RAT.
T42 is no longer needed; return it to the physical register free pool.
Figure: ARF entry R3, RAT entry R3, PRF register T42, and the free pool.

Careful with the RAT Update!
When an instruction that is not the most recent writer of R3 exits:
  Update the ARF as usual.
  Deallocate the physical register.
  Don’t touch that RAT! (Someone else is the most recent writer to R3.)
At some point in the future, the newer writer of R3 exits:
  This instruction was the most recent writer, so now update the RAT.
  Deallocate the physical register.
Figure: ARF entry R3, RAT entry R3, PRF registers T17 and T42, and the free pool.
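The commit rule can be sketched as a small Python class. This is an illustration under assumptions: the class and method names are hypothetical, "update the RAT" is modelled as removing the mapping so that readers fall back to the ARF, and T17 is assumed to belong to the older writer and T42 to the newer one (the slide does not say which is which).

```python
class RenameState:
    """ARF, PRF, RAT and free pool for a single-threaded sketch."""

    def __init__(self, arf, free_pool):
        self.arf = dict(arf)         # official values
        self.prf = {}                # physical register -> value
        self.rat = {}                # architected -> physical of the most recent writer
        self.free = list(free_pool)

    def rename_dest(self, reg):
        phys = self.free.pop(0)
        self.rat[reg] = phys
        return phys

    def commit(self, reg, phys):
        self.arf[reg] = self.prf.pop(phys)   # update the ARF as usual
        self.free.append(phys)               # deallocate the physical register
        if self.rat.get(reg) == phys:        # only the most recent writer touches the RAT
            del self.rat[reg]                # readers now use the ARF again


state = RenameState({"R3": 0}, ["T17", "T42"])
older = state.rename_dest("R3")   # T17 (assumed older writer)
newer = state.rename_dest("R3")   # T42 (assumed newer writer)
state.prf[older] = 10             # hypothetical results
state.prf[newer] = 20

state.commit("R3", older)
print(state.arf, state.rat, state.free)   # RAT still points at T42
state.commit("R3", newer)
print(state.arf, state.rat, state.free)   # RAT entry cleared, both registers free
```

The equality check in commit is the "careful" part: without it, the older writer would clear a mapping that still belongs to the newer one.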

Cortex A9

ILP != IPC
ILP is an attribute of the program; it also depends on the ISA and the compiler.
IPC depends on the actual machine implementation.
ILP is an upper bound on the achievable IPC.
IPC depends on instruction latencies, cache hit rates, branch prediction rates, structural conflicts, instruction window size, etc.

Data Dependency Graph (Data Flow Graph)
  i1: r2 = 4(r22)
  i2: r10 = 4(r25)
  i3: r10 = r2 + r10
  i4: 4(r26) = r10
  i5: r14 = 8(r27)
  i6: r6 = (r22)
  i7: r5 = (r23)
  i8: r5 = r6 - r5
  i9: r4 = r14 * r5
  i10: r15 = 12(r27)
  i11: r7 = 4(r22)
  i12: r8 = 4(r23)
  i13: r8 = r7 - r8
  i14: r8 = r15 * r8
  i15: r8 = r4 - r8
  i16: (r28) = r8
Figure: data flow graph with one node per instruction, i1 to i16, and an edge for each data dependency.

Out-of-Order Execution
  i1: r2 = 4(r22)
  i2: r10 = 4(r25)
  i3: r10 = r2 + r10
  i4: 4(r26) = r10
  i5: r14 = 8(r27)
  i6: r6 = (r22)
  i7: r5 = (r23)
  i8: r5 = r6 - r5
  i9: r4 = r14 * r5
  i10: r15 = 12(r27)
  i11: r7 = 4(r22)
  i12: r8 = 4(r23)
  i13: r8 = r7 - r8
  i14: r8 = r15 * r8
  i15: r8 = r4 - r8
  i16: (r28) = r8
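A rough Python sketch of the data-flow limit for this listing follows. It assumes things the slide does not state: every instruction takes one cycle, there are unlimited execution units, WAR and WAW dependencies are removed by renaming, and the store in i4 does not alias any later load. The encoding of each instruction as (destination, sources) is hypothetical.

```python
def dataflow_schedule(program):
    """Earliest cycle per instruction when only true (RAW) dependencies constrain it."""
    ready = {}     # register -> cycle in which its latest value is produced
    cycles = []
    for dest, srcs in program:
        # An instruction runs one cycle after its last source is produced.
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        if dest is not None:
            ready[dest] = cycle
        cycles.append(cycle)
    return cycles


program = [
    ("r2", ["r22"]),          # i1:  r2 = 4(r22)
    ("r10", ["r25"]),         # i2:  r10 = 4(r25)
    ("r10", ["r2", "r10"]),   # i3:  r10 = r2 + r10
    (None, ["r26", "r10"]),   # i4:  4(r26) = r10
    ("r14", ["r27"]),         # i5:  r14 = 8(r27)
    ("r6", ["r22"]),          # i6:  r6 = (r22)
    ("r5", ["r23"]),          # i7:  r5 = (r23)
    ("r5", ["r6", "r5"]),     # i8:  r5 = r6 - r5
    ("r4", ["r14", "r5"]),    # i9:  r4 = r14 * r5
    ("r15", ["r27"]),         # i10: r15 = 12(r27)
    ("r7", ["r22"]),          # i11: r7 = 4(r22)
    ("r8", ["r23"]),          # i12: r8 = 4(r23)
    ("r8", ["r7", "r8"]),     # i13: r8 = r7 - r8
    ("r8", ["r15", "r8"]),    # i14: r8 = r15 * r8
    ("r8", ["r4", "r8"]),     # i15: r8 = r4 - r8
    (None, ["r28", "r8"]),    # i16: (r28) = r8
]

cycles = dataflow_schedule(program)
print(cycles)                        # [1, 1, 2, 3, 1, 1, 1, 2, 3, 1, 1, 1, 2, 3, 4, 5]
print(len(program) / max(cycles))    # 3.2
```

Under these assumptions the longest chain ends at i16 in cycle 5, so the 16 instructions give an ILP of 16/5 = 3.2. A real machine would usually do worse, since loads and multiplies take more than one cycle.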

Dynamic Execution Model
To exploit maximal ILP, an instruction can be executed immediately after:
  All source operands are ready
  An execution unit is available
  The destination is ready

Dynamic HW Scheduling
Exploit ILP at run-time: execute instructions out-of-order.
Hardware will:
  Maintain true dependencies (data flow manner)
  Find ILP within an instruction window (pool)
Pros:
  Scalable performance: allows code to be compiled on one platform, but also run efficiently on another
  Handles cases where a dependency is unknown at compile-time
Cons:
  Hardware complexity

Intel Quad Core

Dynamic Pipeline
OOO execution != out-of-order completion
OOO execution != out-of-order retirement (commit)
No instruction is allowed to retire until it is confirmed to be on the right path.
Fetch, decode, issue (i.e., front-end) are still done in the program order.
Figure: Fetch, Decode, Dispatch (in program order); Execute (out of order); Complete, Retire (in program order).
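In-order retirement can be sketched with a toy reorder buffer in Python. This is a simplified illustration: the class and method names are hypothetical, and the check that an instruction is on the right path (branch resolution) is left out.

```python
from collections import OrderedDict


class ReorderBuffer:
    """Instructions enter in program order and leave in program order."""

    def __init__(self):
        self.entries = OrderedDict()   # instruction name -> finished executing?

    def dispatch(self, name):
        self.entries[name] = False     # allocated in program order

    def finish(self, name):
        self.entries[name] = True      # execution may finish in any order

    def retire(self):
        """Retire from the head only, stopping at the first unfinished instruction."""
        retired = []
        while self.entries:
            name, done = next(iter(self.entries.items()))
            if not done:
                break
            self.entries.popitem(last=False)
            retired.append(name)
        return retired


rob = ReorderBuffer()
for name in ["i1", "i2", "i3"]:
    rob.dispatch(name)

rob.finish("i3")
print(rob.retire())   # [] because i1 at the head has not finished
rob.finish("i1")
print(rob.retire())   # ['i1']
rob.finish("i2")
print(rob.retire())   # ['i2', 'i3']
```

Even though i3 finishes first, nothing leaves the buffer until every older instruction has finished, which keeps the architected state in program order.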

Dynamic Pipeline
IF, ID, RD
  (in order)
Dispatch Buffer
  (out of order)
EX: ALU | FP1, FP2, FP3 | MEM1, MEM2 | BR
  (out of order)
Reorder Buffer
  (in order)
WB

Implementing Dynamic Scheduling
Tomasulo’s Algorithm
  Used in the IBM 360/91 (in the 60s)
  Tracks when operands are available to satisfy data dependences
  Removes name dependences through register renaming
  Very similar to what is used today: almost all modern high-performance processors use a derivative of Tomasulo’s, and much of the terminology survives to today.

Robert Tomasulo
Eckert–Mauchly Award, 1997: for the ingenious Tomasulo's algorithm, which enabled out-of-order execution processors to be implemented.
The award is named after John Presper Eckert and John William Mauchly, the designers of ENIAC, the first general-purpose electronic computer.
