Winter 2017 S. Areibi School of Engineering University of Guelph ENG3380 Computer Organization and Architecture “Instruction Level Parallelism: Pipelining: Part III”
Topics Multicycle Operations Multiple Issue Loop Unrolling Static Multiple Issue Dynamic Multiple Issue ARM & Intel Processors With thanks to W. Stallings, Hamacher, J. Hennessy, M. J. Irwin for lecture slide contents. Many slides adapted from the PPT slides accompanying the textbook and the CSE331 course.
References “Computer Organization and Architecture: Designing for Performance”, 10th edition, by William Stallings, Pearson. “Computer Organization and Design: The Hardware/Software Interface”, 4th edition, by D. Patterson and J. Hennessy, Morgan Kaufmann. “Computer Organization and Architecture: Themes and Variations”, 2014, by Alan Clements, CENGAGE Learning.
MIPS Pipeline
MIPS Canonical Pipeline (slides: Morgan Kaufmann Publishers, 10 June 2018; Chapter 4, The Processor) At this point you should be able to: understand the basic functionality of the canonical MIPS pipeline; identify hazards in the basic MIPS pipeline (structural hazards, data hazards, control hazards); solve issues related to hazards by detecting hazards and using forwarding and stalling, and by reducing branch delay via prediction.
Data Hazards Data dependencies: RaW (read-after-write), WaW (write-after-write), WaR (write-after-read). Hardware solutions: forwarding/bypassing, detection logic, stalling. Software solution: scheduling.
Data Dependences Three types: RaW, WaR and WaW
add r1, r2, 5  ; r1 := r2 + 5
sub r4, r1, r3 ; RaW of r1
add r1, r2, 5
sub r2, r4, 1  ; WaR of r2
sub r1, r1, 1  ; WaW of r1
st r1, 5(r2)   ; M[r2+5] := r1
ld r5, 0(r4)   ; RaW through memory if 5+r2 = 0+r4
WaW and WaR do not occur in simple pipelines, but they limit scheduling freedom! They are a problem for your compiler and for the Pentium: use register renaming to solve this!
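The three dependence classes above follow mechanically from which registers an instruction reads and writes. A minimal sketch in Python (an illustrative helper, not part of the slides; each instruction is modeled as a destination register plus a list of source registers):

```python
def hazards(first, second):
    """Classify register data dependences from `first` to `second`.

    Each instruction is (dest_reg, [source_regs])."""
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 in s2:
        found.add("RaW")   # second reads what first writes
    if d1 == d2:
        found.add("WaW")   # both write the same register
    if d2 in s1:
        found.add("WaR")   # second overwrites what first reads
    return found

# add r1, r2, 5  followed by  sub r4, r1, r3
print(hazards(("r1", ["r2"]), ("r4", ["r1", "r3"])))  # {'RaW'}
```

Note the memory case on the slide (st/ld with possibly equal effective addresses) cannot be decided this way: it depends on runtime address values, which is exactly why memory dependences are harder for compilers.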
Multicycle Operations
Multicycle Operations All the instructions we discussed so far execute in a single cycle: add, sub, AND, OR, sw, ... However, not all instructions in real processors (e.g., the MIPS R10K) execute in 1 cycle: multiply, divide, FP operations. So we have to handle multicycle operations!! HOW?
Instr. Type / Latency:
Integer instructions: Add/sub/logical: 1 | MULT/DMULT: 5/6, 9/10 | DIV: 34/35 | Load: 2
FP instructions: Add/sub/abs | Mul | DIV.S/DIV.D: 12, 19
Multiple Functional Units To handle multicycle operations: introduce several execution pipelines (one for every functional unit) and allow multiple outstanding operations (i.e., several operations executing at the same time). Dispatching, or moving instructions from the decode stage to the execute stage, is not as straightforward as before. (Figure: IF and ID/issue feed four parallel execution units: the integer unit (EX), the FP/int multiplier (M), the FP adder (A), and the FP/int divider (DIV), which then feed MEM and WB.)
Supporting Multiple FP Operations To support multiple outstanding FP operations we need to modify our original canonical MIPS pipeline. (Figure: the EX stage is replaced by an integer ALU, a pipelined FP multiplier, a pipelined FP adder, and an FP divider, which is not pipelined!)
Latency and Initiation Interval Definitions (according to the textbook): Latency: # cycles between the instruction that produces a result and the instruction that uses it. Initiation interval: # cycles between issuing 2 instructions of the same type. This will become clearer on the next slide!
Latency and Initiation Interval: FP Adder We assumed that the FP adder takes 4 cycles but is pipelined. Therefore: the latency is 3 and the initiation interval is 1.
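These two numbers determine how long a burst of independent operations occupies one functional unit. A small Python sketch (an illustrative model, not from the slides; it assumes back-to-back independent ops on a single unit):

```python
def exec_cycles(n_ops, depth, initiation_interval):
    """Cycles for n back-to-back independent ops on one functional unit.

    The first op occupies the unit for `depth` cycles; each later op
    may start `initiation_interval` cycles after the previous one."""
    if n_ops == 0:
        return 0
    return (n_ops - 1) * initiation_interval + depth

# 4-cycle pipelined FP adder (initiation interval 1): 10 adds take 13 cycles
print(exec_cycles(10, 4, 1))   # 13
# A non-pipelined unit is modeled with initiation interval == depth
print(exec_cycles(3, 25, 25))  # 75
```

The payoff of pipelining a unit is visible directly: with initiation interval 1 the per-op cost approaches 1 cycle, while the non-pipelined divider pays its full depth for every operation.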
Challenges for Longer Pipelines Divider not pipelined: structural hazards can occur. Instructions have varying latencies: the # of register writes in a cycle can be > 1. Instructions do not always reach WB in order: write-after-write (WAW) hazards are possible. Instructions can complete out of order: implementing exceptions becomes problematic. Longer latency: more stalls due to read-after-write (RAW) hazards, and the complexity of the forwarding logic increases!!
Hazards in Longer Pipelines Structural hazard: the divider is not pipelined. Solution: detect the divide and stall the pipeline. Is this a good solution? Yes, because dividers are expensive and divides do not occur frequently.
Hazards in Longer Pipelines Structural hazard: due to varying latencies, more than one instruction may write to the RegFile in the same cycle. Solution: stall, or increase the number of write ports. This cannot happen if instructions pass through the MEM stage sequentially.
WAW Hazards in Longer Pipelines Because instructions do not always write back in order, write-after-write (WAW) hazards are possible: a later instruction writes register Rn before an earlier instruction writes Rn, so the next instruction may read the wrong value!! Possible solutions: if instruction j in ID wants to write the same register as an instruction already issued, do not issue j. Add a busy bit to each register: set in ID, cleared in WB.
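The busy-bit interlock is simple enough to sketch directly. A minimal Python model (illustrative only; register names and the set-based bookkeeping are assumptions, not the slides' hardware):

```python
def can_issue(dst_reg, busy):
    """Busy-bit check in ID: stall if an already-issued, not-yet-written
    instruction will write the same destination register."""
    return dst_reg not in busy

def issue(dst_reg, busy):
    busy.add(dst_reg)      # ID stage sets the busy bit

def writeback(dst_reg, busy):
    busy.discard(dst_reg)  # WB stage clears the busy bit

busy = set()
issue("F6", busy)              # e.g., a long-latency L.D targeting F6 issues
print(can_issue("F6", busy))   # a later write to F6 must wait: False
writeback("F6", busy)
print(can_issue("F6", busy))   # after WB the register is free again: True
```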
WAW Hazards in Longer Pipelines (Example: SUB.D uses the result of ADD.D instead of L.D.)
RAW Hazards in Longer Pipelines Stalls due to “read-after-write” (RAW) hazards are more frequent: i.e., an instruction reads registers written by a preceding instruction.
Forwarding in Longer Pipelines Forwarding in longer pipelines requires large multiplexors and lots of wiring!! The forwarding logic is significantly more complex!!
Loop Unrolling
Loop Unrolling Definition: “A technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together.” Replicate the loop body to expose more parallelism; this reduces loop-control overhead. Use different registers per replication (called “register renaming”) to avoid loop-carried “anti-dependencies” (a store followed by a load of the same register), aka a “name dependence”: reuse of a register name.
Loop Unrolling: Increasing ILP At source level:
for (i=1; i<=1000; i++)
  x[i] = x[i] + s;
becomes
for (i=1; i<=1000; i=i+4) {
  x[i] = x[i] + s;
  x[i+1] = x[i+1] + s;
  x[i+2] = x[i+2] + s;
  x[i+3] = x[i+3] + s;
}
Why unroll loops?? Any drawbacks? Loop unrolling increases code size, and more registers are needed.
MIPS code after scheduling:
Loop: L.D  F0,0(R1)
      L.D  F6,8(R1)
      L.D  F10,16(R1)
      L.D  F14,24(R1)
      ADD.D F4,F0,F2
      ADD.D F8,F6,F2
      ADD.D F12,F10,F2
      ADD.D F16,F14,F2
      S.D  0(R1),F4
      S.D  8(R1),F8
      ADDI R1,R1,32
      S.D  -16(R1),F12
      BNE  R1,R2,Loop
      S.D  -8(R1),F16
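One detail the source-level sketch above glosses over is trip counts that are not a multiple of the unroll factor; a real unrolled loop needs a cleanup loop for the remainder. A Python illustration (hypothetical helper names, 4-way unroll fixed as on the slide):

```python
def add_scalar(x, s):
    """Rolled loop: x[i] += s for every element."""
    for i in range(len(x)):
        x[i] = x[i] + s
    return x

def add_scalar_unrolled4(x, s):
    """Same computation with the body replicated 4 times, plus a
    cleanup loop for lengths that are not a multiple of 4."""
    i, n = 0, len(x)
    while i + 4 <= n:          # unrolled body: 4 iterations per trip
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4
    while i < n:               # remainder iterations
        x[i] = x[i] + s
        i += 1
    return x

print(add_scalar_unrolled4([1, 2, 3, 4, 5, 6], 10))  # [11, 12, 13, 14, 15, 16]
```

The unrolled version runs the loop-control test one quarter as often, which is exactly the overhead reduction the slide is after; the cost is the larger body and the extra cleanup code.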
Register Renaming
Register Renaming A technique to eliminate anti- (WaR) and output (WaW) dependencies. It can be implemented by the compiler (advantage: low cost; disadvantage: “old” codes perform poorly) or in hardware (advantage: binary compatibility; disadvantage: extra hardware needed). We describe the general idea.
Register Renaming There is a physical register file larger than the logical register file; a mapping table associates logical registers with physical registers. When an instruction is decoded: its physical source registers are obtained from the mapping table; its physical destination register is obtained from a free list; the mapping table is updated.
Before: add r3,r3,4 with current mapping table r0→R8, r1→R7, r2→R5, r3→R1, r4→R9 and current free list R2, R6.
After: add R2,R1,4 with new mapping table r0→R8, r1→R7, r2→R5, r3→R2, r4→R9 and new free list R6.
Renaming Eliminates False Dependencies
Before (assume r0→R8, r1→R6, r2→R5, ...):
addi r1, r2, 1
addi r2, r0, 0 // WaR
addi r1, r2, 1 // WaW + RaW
After (free list: R7, R8, R9, R10):
addi R7, R5, 1
addi R10, R8, 0 // WaR disappeared
addi R9, R10, 1 // WaW disappeared, RaW renamed to R10
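The mapping-table-plus-free-list mechanism from the previous slide can be sketched in a few lines of Python. This is an illustrative model with made-up register names (not the slide's exact mapping, whose free list overlaps its initial mapping); allocation order from the front of the free list is an assumption:

```python
class Renamer:
    """Minimal register-renaming sketch: logical registers map to
    physical registers, and every write gets a fresh physical register
    popped from the free list."""

    def __init__(self, mapping, free_list):
        self.table = dict(mapping)   # logical -> physical
        self.free = list(free_list)  # available physical registers

    def rename(self, dst, srcs):
        phys_srcs = [self.table[r] for r in srcs]  # read current mapping
        phys_dst = self.free.pop(0)                # fresh physical register
        self.table[dst] = phys_dst                 # update mapping table
        return phys_dst, phys_srcs

r = Renamer({"r1": "R1", "r2": "R2"}, ["R7", "R8", "R9"])
print(r.rename("r1", ["r2"]))  # write r1        -> ('R7', ['R2'])
print(r.rename("r2", []))      # WaR on r2 gone  -> ('R8', [])
print(r.rename("r1", ["r2"]))  # WaW gone, RaW kept via R8 -> ('R9', ['R8'])
```

Note how the true (RaW) dependence survives renaming as the physical register R8, while the WaR and WaW name clashes simply vanish because each write targets a fresh register.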
Instruction Level Parallelism
Instruction-Level Parallelism (ILP) Pipelining: overlapping the execution of multiple instructions. To increase ILP: Deeper pipeline: less work per stage, shorter clock cycle. Multiple issue: replicate pipeline stages (multiple pipelines) and start multiple instructions per clock cycle. CPI < 1, so use instructions per cycle (IPC) instead. E.g., a 4 GHz 4-way multiple-issue processor peaks at 16 BIPS, with peak CPI = 0.25 and peak IPC = 4. But dependencies reduce this in practice.
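The peak numbers in the example follow directly from clock rate and issue width. A quick sanity check in Python (an illustrative helper, not part of the slides):

```python
def peak_metrics(clock_ghz, issue_width):
    """Peak IPC, CPI, and billions of instructions per second (BIPS)
    for a multiple-issue processor, ignoring all dependencies."""
    ipc = issue_width                 # best case: every slot filled
    cpi = 1.0 / issue_width           # reciprocal of IPC
    bips = clock_ghz * issue_width    # instructions/sec in billions
    return ipc, cpi, bips

# 4 GHz, 4-way issue: peak IPC 4, peak CPI 0.25, peak 16 BIPS
print(peak_metrics(4, 4))  # (4, 0.25, 16)
```

Real programs fall well short of these peaks: every dependence-induced stall leaves issue slots empty, which is why the rest of the chapter is about filling them.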
Instruction Pipelining Pipelining: the goal was to complete one instruction per clock cycle.
Superpipelining Superpipelining: increase the depth of the pipeline to increase the clock rate.
Superscalar (Multiple Issue) Fetch (and execute) more than one instruction at a time (expand every pipeline stage to accommodate multiple instructions).
Multiple Issue Static multiple issue: the compiler groups instructions to be issued together, packages them into “issue slots”, and detects and avoids hazards. Dynamic multiple issue: the CPU examines the instruction stream and chooses instructions to issue each cycle; the compiler can help by reordering instructions; the CPU resolves hazards using advanced techniques at runtime.
Superscalar Processors An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting them during execution (by hardware).
The VLIW Architecture A typical VLIW (very long instruction word) machine has instruction words hundreds of bits in length. Multiple functional units are used concurrently in a VLIW processor. All functional units share the use of a common register file.
Superscalar vs. VLIW VLIW – Very Long Instruction Word EPIC – Explicitly Parallel Instruction Computing
Multiple-Issue Processor Styles Static multiple-issue processors (aka VLIW): decisions on which instructions to execute simultaneously are made statically (at compile time by the compiler). E.g., Intel Itanium and Itanium 2 for the IA-64 ISA (EPIC, Explicitly Parallel Instruction Computing). Dynamic multiple-issue processors (aka superscalar): decisions on which instructions to execute simultaneously are made dynamically (at run time by the hardware). E.g., IBM Power 2, Pentium 4, MIPS R10K, HP PA 8500.
Speculation “Guess” what to do with an instruction: start the operation as soon as possible, then check whether the guess was right. If so, complete the operation; if not, roll back and do the right thing. Common to static and dynamic multiple issue. Examples: speculate on a branch outcome (roll back if the path taken is different); speculate on a load (roll back if the location is updated).
MIPS with Static Dual Issue Two-issue packets: one ALU/branch instruction and one load/store instruction, 64-bit aligned, ALU/branch first, then load/store; pad an unused slot with a nop. (Table: packets at addresses n, n+8, n+16 (ALU/branch) paired with n+4, n+12, n+20 (load/store) flow through IF ID EX MEM WB together, each packet one cycle behind the previous one.)
MIPS with Static Dual Issue (Figure: the datapath fetches and decodes 64 bits of instructions per cycle, with one pipeline for the integer ALU operation and one for the load/store operation.)
Scheduling Example Schedule this for dual-issue MIPS:
Loop: lw   $t0, 0($s1)       # $t0 = array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0
Hazards!! Reorder the instructions to avoid as many stalls as possible:
      ALU/branch            Load/store       cycle
Loop: nop                   lw $t0, 0($s1)   1
      addi $s1, $s1, -4     nop              2
      addu $t0, $t0, $s2    nop              3
      bne $s1, $zero, Loop  sw $t0, 4($s1)   4
IPC = 5/4 = 1.25 (c.f. peak IPC = 2). Not a great result!!
Loop Unrolling Example
Loop: lw   $t0, 0($s1)       # $t0 = array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0
Make 4 copies of the loop body:
      ALU/branch            Load/store        cycle
Loop: addi $s1, $s1, -16    lw $t0, 0($s1)    1
      nop                   lw $t1, 12($s1)   2
      addu $t0, $t0, $s2    lw $t2, 8($s1)    3
      addu $t1, $t1, $s2    lw $t3, 4($s1)    4
      addu $t2, $t2, $s2    sw $t0, 16($s1)   5
      addu $t3, $t3, $s2    sw $t1, 12($s1)   6
      nop                   sw $t2, 8($s1)    7
      bne $s1, $zero, Loop  sw $t3, 4($s1)    8
IPC = 14/8 = 1.75. Closer to 2, but at the cost of registers and code size.
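The IPC figures quoted for the two schedules are worth checking. A one-line Python helper (illustrative, not part of the slides):

```python
def ipc(instructions, cycles):
    """Achieved instructions per cycle for a given schedule."""
    return instructions / cycles

# dual-issue schedule of the original loop: 5 instructions in 4 cycles
print(ipc(5, 4))    # 1.25
# after unrolling 4x: 14 useful instructions in 8 cycles
print(ipc(14, 8))   # 1.75
```

Unrolling helps because the 4 copies share one addi and one bne (14 instructions instead of 20), and the extra independent work fills issue slots that nops occupied before.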
Power Efficiency The complexity of dynamic scheduling and speculation requires power; multiple simpler cores may be better. (Blank cells repeat the value above them in the original figure.)
Microprocessor      Year  Clock Rate  Pipeline Stages  Issue width  Out-of-order/Speculation  Cores  Power
i486                1989  25MHz       5                1            No                               5W
Pentium             1993  66MHz                        2                                             10W
Pentium Pro         1997  200MHz      10               3            Yes                              29W
P4 Willamette       2001  2000MHz     22                                                             75W
P4 Prescott         2004  3600MHz     31                                                             103W
Intel Core          2006  2930MHz     14               4
Intel i5 Nehalem    2010  3300MHz                                                             2-4    87W
Intel i5 Ivy Bridge 2012  3400MHz                                                             8      77W
ARM & Intel
Cortex-A8 and Intel i7
Processor:                  ARM A8                  | Intel Core i7 920
Market:                     Personal Mobile Device  | Server, cloud
Thermal design power:       2 Watts                 | 130 Watts
Clock rate:                 1 GHz                   | 2.66 GHz
Cores/chip:                 1                       | 4
Floating point?             No                      | Yes
Multiple issue?                                     | Dynamic
Peak instructions/clock:    2                       |
Pipeline stages:            14                      |
Pipeline schedule:          Static in-order         | Dynamic out-of-order with speculation
Branch prediction:          2-level                 |
1st-level caches/core:      32 KiB I, 32 KiB D      |
2nd-level caches/core:      128-1024 KiB            | 256 KiB
3rd-level caches (shared):  -                       | 2-8 MB
ARM Cortex-A8 Pipeline Fetch Instruction (3 stages): the first 3 stages fetch two instructions at a time and try to keep a 12-entry instruction prefetch buffer full. The fetch unit uses a two-level branch predictor: a 512-entry branch target buffer (BTB), a 4096-entry global history buffer (GHB), and an 8-entry return stack (RS) to predict future returns. When branch prediction is wrong, the pipeline is emptied (13-cycle misprediction penalty)!!
ARM Cortex-A8 Pipeline Decode Instruction (5 stages): these stages determine whether there are dependences between a pair of instructions, which would force sequential execution, and decide to which pipeline of the execution stages to send the instructions.
ARM Cortex-A8 Pipeline Instruction Execution (6 stages): the six stages of the instruction execution section offer one pipeline for load and store instructions and two pipelines for arithmetic operations (only the first can handle multiplies!!). The execution stages have full bypassing between the three pipelines.
ARM Cortex-A8 Performance CPI of the A8 using small versions of the SPEC2000 benchmarks: the ideal CPI is 0.5; the best case is 1.4; the median is 2.0 (80% of stalls due to pipeline hazards, 20% due to the memory hierarchy); the worst case is 5.2.
Nehalem Microarchitecture (Intel) First use: Core i7, 2008; 45 nm; hyperthreading; L3 cache; 3-channel DDR3 controller; QPI: QuickPath Interconnect; 32K+32K L1 per core; 256K L2 per core; 4-8 MB L3 shared between cores.
Core i7 Pipeline x86 microprocessors employ sophisticated pipelining approaches using: dynamic multiple issue and dynamic pipeline scheduling (with out-of-order execution and speculation). These processors face the challenge of implementing the complex x86 instruction set!! Intel fetches x86 instructions and translates them into internal MIPS-like instructions, which Intel calls micro-operations. The six independent functional units can each begin execution of a ready RISC operation every cycle. (Figure: fetch & decode, translate, six functional units, cache.)
Core i7 Performance
Summary
Concluding Remarks The ISA influences the design of the datapath and control; the datapath and control influence the design of the ISA. Pipelining improves instruction throughput using parallelism: more instructions completed per second, but the latency of each instruction is not reduced. Hazards: structural, data, control. Multiple issue and dynamic scheduling (ILP): dependencies limit achievable parallelism; complexity leads to the power wall.
End Slides
Exceptions
Exceptions and Interrupts “Unexpected” events requiring a change in the flow of control; different ISAs use the terms differently. Exception: arises within the CPU, e.g., undefined opcode, overflow, syscall, ... Interrupt: from an external I/O controller. Dealing with them without sacrificing performance is hard.
Handling Exceptions In MIPS, exceptions are managed by a System Control Coprocessor (CP0). Save the PC of the offending (or interrupted) instruction; in MIPS: the Exception Program Counter (EPC). Save an indication of the problem; in MIPS: the Cause register. We’ll assume 1 bit: 0 for undefined opcode, 1 for overflow. Jump to the handler at 8000 0180.
An Alternate Mechanism Vectored interrupts: the handler address is determined by the cause. Example: undefined opcode: C000 0000; overflow: C000 0020; ...: C000 0040. Instructions either deal with the interrupt, or jump to the real handler.
Handler Actions Read the cause, and transfer to the relevant handler. Determine the action required. If restartable: take corrective action and use the EPC to return to the program. Otherwise: terminate the program and report the error using the EPC, cause, ...
Exceptions in a Pipeline Another form of control hazard. Consider overflow on add in the EX stage: add $1, $2, $1. Prevent $1 from being clobbered; complete previous instructions; flush add and subsequent instructions; set the Cause and EPC register values; transfer control to the handler. Similar to a mispredicted branch: use much of the same hardware.
Pipeline with Exceptions
Exception Properties Restartable exceptions: the pipeline can flush the instruction; the handler executes, then returns to the instruction, which is refetched and executed from scratch. The PC is saved in the EPC register to identify the causing instruction (actually PC + 4 is saved, so the handler must adjust).
Exception Example Exception on add:
40 sub $11, $2, $4
44 and $12, $2, $5
48 or  $13, $2, $6
4C add $1, $2, $1
50 slt $15, $6, $7
54 lw  $16, 50($7)
...
Handler:
80000180 sw $25, 1000($0)
80000184 sw $26, 1004($0)
...
Exception Example
Exception Example
Multiple Exceptions Pipelining overlaps multiple instructions, so we could have multiple exceptions at once. Simple approach: deal with the exception from the earliest instruction and flush subsequent instructions (“precise” exceptions). In complex pipelines (multiple instructions issued per cycle, out-of-order completion), maintaining precise exceptions is difficult!
Imprecise Exceptions Just stop the pipeline and save the state, including the exception cause(s). Let the handler work out which instruction(s) had exceptions and which to complete or flush (may require “manual” completion). This simplifies the hardware but requires more complex handler software; it is not feasible for complex multiple-issue out-of-order pipelines.
Speculation and Exceptions What if an exception occurs on a speculatively executed instruction? E.g., a speculative load before a null-pointer check. Static speculation: can add ISA support for deferring exceptions. Dynamic speculation: can buffer exceptions until instruction completion (which may not occur).
Inst. Level Parallelism
Matrix Multiply Unrolled C code:
#include <x86intrin.h>
#define UNROLL (4)

void dgemm (int n, double* A, double* B, double* C)
{
  for ( int i = 0; i < n; i+=UNROLL*4 )
    for ( int j = 0; j < n; j++ ) {
      __m256d c[4];
      for ( int x = 0; x < UNROLL; x++ )
        c[x] = _mm256_load_pd(C+i+x*4+j*n);

      for ( int k = 0; k < n; k++ )
      {
        __m256d b = _mm256_broadcast_sd(B+k+j*n);
        for ( int x = 0; x < UNROLL; x++ )
          c[x] = _mm256_add_pd(c[x],
                   _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
      }

      for ( int x = 0; x < UNROLL; x++ )
        _mm256_store_pd(C+i+x*4+j*n, c[x]);
    }
}
Matrix Multiply Assembly code:
vmovapd (%r11),%ymm4            # Load 4 elements of C into %ymm4
mov %rbx,%rax                   # register %rax = %rbx
xor %ecx,%ecx                   # register %ecx = 0
vmovapd 0x20(%r11),%ymm3        # Load 4 elements of C into %ymm3
vmovapd 0x40(%r11),%ymm2        # Load 4 elements of C into %ymm2
vmovapd 0x60(%r11),%ymm1        # Load 4 elements of C into %ymm1
vbroadcastsd (%rcx,%r9,1),%ymm0 # Make 4 copies of B element
add $0x8,%rcx                   # register %rcx = %rcx + 8
vmulpd (%rax),%ymm0,%ymm5       # Parallel mul %ymm0, 4 A elements
vaddpd %ymm5,%ymm4,%ymm4        # Parallel add %ymm5, %ymm4
vmulpd 0x20(%rax),%ymm0,%ymm5   # Parallel mul %ymm0, 4 A elements
vaddpd %ymm5,%ymm3,%ymm3        # Parallel add %ymm5, %ymm3
vmulpd 0x40(%rax),%ymm0,%ymm5   # Parallel mul %ymm0, 4 A elements
vmulpd 0x60(%rax),%ymm0,%ymm0   # Parallel mul %ymm0, 4 A elements
add %r8,%rax                    # register %rax = %rax + %r8
cmp %r10,%rcx                   # compare %rcx to %r10
vaddpd %ymm5,%ymm2,%ymm2        # Parallel add %ymm5, %ymm2
vaddpd %ymm0,%ymm1,%ymm1        # Parallel add %ymm0, %ymm1
jne 68 <dgemm+0x68>             # jump if %rcx != %r10
add $0x1,%esi                   # register %esi = %esi + 1
vmovapd %ymm4,(%r11)            # Store %ymm4 into 4 C elements
vmovapd %ymm3,0x20(%r11)        # Store %ymm3 into 4 C elements
vmovapd %ymm2,0x40(%r11)        # Store %ymm2 into 4 C elements
vmovapd %ymm1,0x60(%r11)        # Store %ymm1 into 4 C elements
Performance Impact
Speculation
Speculation Predict a branch and continue issuing: don’t commit until the branch outcome is determined. Load speculation: avoid load and cache-miss delay by predicting the effective address, predicting the loaded value, loading before completing outstanding stores, and bypassing stored values to the load unit; don’t commit the load until the speculation is cleared.
Why Do Dynamic Scheduling? Why not just let the compiler schedule code? Not all stalls are predictable, e.g., cache misses. We can’t always schedule around branches: the branch outcome is dynamically determined. Different implementations of an ISA have different latencies and hazards.
Does Multiple Issue Work? The BIG Picture: yes, but not as much as we’d like. Programs have real dependencies that limit ILP; some dependencies are hard to eliminate (e.g., pointer aliasing); some parallelism is hard to expose (limited window size during instruction issue); memory delays and limited bandwidth make it hard to keep pipelines full. Speculation can help if done well.
Multi-Issue
Hazards in the Dual-Issue MIPS More instructions executing in parallel. EX data hazard: forwarding avoided stalls with single issue, but now we can’t use an ALU result in a load/store in the same packet: add $t0, $s0, $s1 / lw $s2, 0($t0) must be split into two packets, effectively a stall. Load-use hazard: still one cycle of use latency, but now two instructions are affected. More aggressive scheduling is required.
Dynamic Multiple Issue “Superscalar” processors: the CPU decides whether to issue 0, 1, 2, ... instructions each cycle, avoiding structural and data hazards. This avoids the need for compiler scheduling (though it may still help); code semantics are ensured by the CPU.
Dynamic Pipeline Scheduling Allow the CPU to execute instructions out of order to avoid stalls, but commit results to registers in order. Example:
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
The CPU can start sub while addu is waiting for lw.
Dynamically Scheduled CPU (Figure.) Reservation stations preserve dependencies and hold pending operands; results are also sent to any waiting reservation stations. A reorder buffer for register writes can supply operands for issued instructions.
Compiler/Hardware Speculation The compiler can reorder instructions, e.g., move a load before a branch, and can include “fix-up” instructions to recover from an incorrect guess. Hardware can look ahead for instructions to execute, buffering results until it determines they are actually needed, and flushing the buffers on incorrect speculation.
Static Multiple Issue The compiler groups instructions into “issue packets”: groups of instructions that can be issued on a single cycle, determined by the pipeline resources required. Think of an issue packet as a very long instruction that specifies multiple concurrent operations: Very Long Instruction Word (VLIW).
Scheduling Static Multiple Issue The compiler must remove some/all hazards: reorder instructions into issue packets with no dependencies within a packet (possibly some dependencies between packets; this varies between ISAs, and the compiler must know!). Pad with nop if necessary.
Multiple-Issue Datapath Responsibilities Must handle, with a combination of hardware and software fixes, the fundamental limitations of: Storage (data) dependencies, aka data hazards: this limitation is more severe in a SS/VLIW processor due to (usually) low ILP. Procedural dependencies, aka control hazards: ditto, but even more severe; use dynamic branch prediction to help resolve the ILP issue. Resource conflicts, aka structural hazards: a SS/VLIW processor has a much larger number of potential resource conflicts; functional units may have to arbitrate for result buses and register-file write ports. Resource conflicts can be eliminated by duplicating the resource or by pipelining the resource; pipelining is much less expensive than duplicating.
Instruction Issue and Completion Policies Instruction-issue – initiate execution Instruction lookahead capability – fetch, decode and issue instructions beyond the current instruction Instruction-completion – complete execution Processor lookahead capability – complete issued instructions beyond the current instruction Instruction-commit – write back results to the RegFile or D$ (i.e., change the machine state) In-order issue with in-order completion In-order issue with out-of-order completion Out-of-order issue with out-of-order completion Out-of-order issue with out-of-order completion and in-order commit
In-Order Issue with In-Order Completion Simplest policy is to issue instructions in exact program order and to complete them in the same order they were fetched (i.e., in program order) Example: Assume a pipelined processor that can fetch and decode two instructions per cycle, that has three functional units (a single cycle adder, a single cycle shifter, and a two cycle multiplier), and that can complete (and write back) two results per cycle And an instruction sequence with the following characteristics I1 – needs two execute cycles (a multiply) I2 I3 I4 – needs the same function unit as I3 I5 – needs data value produced by I4 I6 – needs the same function unit as I5
In-Order Issue, In-Order Completion Example I1: two execute cycles (a multiply); I2; I3; I4: same function unit as I3; I5: data value produced by I4 (needs forwarding hardware); I6: same function unit as I5. (Pipeline diagram omitted.) 8 cycles in total; in parallel the machine can fetch/decode 2 and commit 2 per cycle.
In-Order Issue, Out-of-Order Completion (IOI-OOC) Example Same sequence: I1: two execute cycles; I2; I3; I4: same function unit as I3; I5: data value produced by I4; I6: same function unit as I5. (Pipeline diagram omitted.) 7 cycles in total.
Superscalar vs. VLIW
How can the machine exploit available ILP? Technique vs. limitation:
Pipelining: limited by issue rate, FU stalls, FU depth.
Super-pipelining (issue 1 instruction per (fast) cycle; IF takes multiple cycles): limited by clock skew, FU stalls, FU depth.
Super-scalar (issue multiple scalar instructions per cycle): limited by hazard resolution.
VLIW (each instruction specifies multiple scalar operations): limited by packing.
(Pipeline diagrams omitted.)