Winter 2017 S. Areibi School of Engineering University of Guelph ENG3380 Computer Organization and Architecture “Instruction Level Parallelism: Pipelining: Part III”
Topics Multicycle Operations Multiple Issue Loop Unrolling Static Multiple Issue Dynamic Multiple Issue ARM & Intel Processors With thanks to W. Stallings, Hamacher, J. Hennessy, M. J. Irwin for lecture slide contents. Many slides adapted from the PPT slides accompanying the textbook and the CSE331 course.
References “Computer Organization and Architecture: Designing for Performance”, 10th edition, by William Stallings, Pearson. “Computer Organization and Design: The Hardware/Software Interface”, 4th edition, by D. Patterson and J. Hennessy, Morgan Kaufmann. “Computer Organization and Architecture: Themes and Variations”, 2014, by Alan Clements, CENGAGE Learning.
MIPS Pipeline
MIPS Canonical Pipeline (slides: Morgan Kaufmann Publishers, 10 June 2018; Chapter 4, The Processor) At this point you should be able to: understand the basic functionality of the canonical MIPS pipeline; identify hazards in the basic MIPS pipeline (structural hazards, data hazards, control hazards); solve issues related to hazards by detecting hazards and using forwarding and stalling, and by reducing branch delay via prediction.
Data Hazards Data dependencies: RaW (read-after-write), WaW (write-after-write), WaR (write-after-read). Hardware solutions: forwarding/bypassing, detection logic, stalling. Software solution: scheduling.
Data Dependences Three types: RaW, WaR and WaW
add r1, r2, 5  ; r1 := r2 + 5
sub r4, r1, r3 ; RaW of r1
add r1, r2, 5
sub r2, r4, 1  ; WaR of r2
sub r1, r1, 1  ; WaW of r1
st r1, 5(r2)   ; M[r2+5] := r1
ld r5, 0(r4)   ; RaW through memory if 5+r2 = 0+r4
WaW and WaR do not occur in simple pipelines, but they limit scheduling freedom! They are a problem for your compiler and for the Pentium: use register renaming to solve this!
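The three dependence classes above follow mechanically from which registers an instruction reads and writes. A minimal sketch in Python (an illustrative helper, not part of the slides; each instruction is modeled as a destination register plus a list of source registers):

```python
def hazards(first, second):
    """Classify register data dependences from `first` to `second`.

    Each instruction is (dest_reg, [source_regs])."""
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 in s2:
        found.add("RaW")   # second reads what first writes
    if d1 == d2:
        found.add("WaW")   # both write the same register
    if d2 in s1:
        found.add("WaR")   # second overwrites what first reads
    return found

# add r1, r2, 5  followed by  sub r4, r1, r3
print(hazards(("r1", ["r2"]), ("r4", ["r1", "r3"])))  # {'RaW'}
```

Note the memory case on the slide (st/ld with possibly equal effective addresses) cannot be decided this way: it depends on runtime address values, which is exactly why memory dependences are harder for compilers.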
Multicycle Operations
Multicycle Operations All the instructions we discussed so far execute in a single cycle: add, sub, AND, OR, sw, ... However, not all instructions in real processors (e.g., the MIPS R10K) execute in 1 cycle: multiply, divide, FP operations. So we have to handle multicycle operations!! HOW?
Instr. Type / Latency:
Integer instructions: Add/sub/logical: 1 | MULT/DMULT: 5/6, 9/10 | DIV: 34/35 | Load: 2
FP instructions: Add/sub/abs | Mul | DIV.S/DIV.D: 12, 19
Multiple Functional Units To handle multicycle operations: introduce several execution pipelines (one for every functional unit) and allow multiple outstanding operations (i.e., several operations executing at the same time). Dispatching, or moving instructions from the decode stage to the execute stage, is not as straightforward as before. (Figure: IF and ID/issue feed four parallel execution units: the integer unit (EX), the FP/int multiplier (M), the FP adder (A), and the FP/int divider (DIV), which then feed MEM and WB.)
Supporting Multiple FP Operations To support multiple outstanding FP operations we need to modify our original canonical MIPS pipeline. (Figure: the EX stage is replaced by an integer ALU, a pipelined FP multiplier, a pipelined FP adder, and an FP divider, which is not pipelined!)
Latency and Initiation Interval Definitions (according to the textbook): Latency: # cycles between the instruction that produces a result and the instruction that uses it. Initiation interval: # cycles between issuing 2 instructions of the same type. This will become clearer on the next slide!
Latency and Initiation Interval: FP Adder We assumed that the FP adder takes 4 cycles but is pipelined. Therefore: the latency is 3 and the initiation interval is 1.
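These two numbers determine how long a burst of independent operations occupies one functional unit. A small Python sketch (an illustrative model, not from the slides; it assumes back-to-back independent ops on a single unit):

```python
def exec_cycles(n_ops, depth, initiation_interval):
    """Cycles for n back-to-back independent ops on one functional unit.

    The first op occupies the unit for `depth` cycles; each later op
    may start `initiation_interval` cycles after the previous one."""
    if n_ops == 0:
        return 0
    return (n_ops - 1) * initiation_interval + depth

# 4-cycle pipelined FP adder (initiation interval 1): 10 adds take 13 cycles
print(exec_cycles(10, 4, 1))   # 13
# A non-pipelined unit is modeled with initiation interval == depth
print(exec_cycles(3, 25, 25))  # 75
```

The payoff of pipelining a unit is visible directly: with initiation interval 1 the per-op cost approaches 1 cycle, while the non-pipelined divider pays its full depth for every operation.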
Challenges for Longer Pipelines Divider not pipelined: structural hazards can occur. Instructions have varying latencies: the # of register writes in a cycle can be > 1. Instructions do not always reach WB in order: write-after-write (WAW) hazards are possible. Instructions can complete out of order: implementing exceptions becomes problematic. Longer latency: more stalls due to read-after-write (RAW) hazards, and the complexity of the forwarding logic increases!!
Hazards in Longer Pipelines Structural hazard: the divider is not pipelined. Solution: detect the divide and stall the pipeline. Is this a good solution? Yes, because dividers are expensive and divides do not occur frequently.
Hazards in Longer Pipelines Structural hazard: due to varying latencies, more than one instruction may write to the RegFile in the same cycle. Solution: stall, or increase the number of write ports. This cannot happen if instructions pass through the MEM stage sequentially.
WAW Hazards in Longer Pipelines Because instructions do not always write back in order, write-after-write (WAW) hazards are possible: a later instruction writes register Rn before an earlier instruction writes Rn, so the next instruction may read the wrong value!! Possible solutions: if instruction j in ID wants to write the same register as an instruction already issued, do not issue j. Add a busy bit to each register: set in ID, cleared in WB.
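The busy-bit interlock is simple enough to sketch directly. A minimal Python model (illustrative only; register names and the set-based bookkeeping are assumptions, not the slides' hardware):

```python
def can_issue(dst_reg, busy):
    """Busy-bit check in ID: stall if an already-issued, not-yet-written
    instruction will write the same destination register."""
    return dst_reg not in busy

def issue(dst_reg, busy):
    busy.add(dst_reg)      # ID stage sets the busy bit

def writeback(dst_reg, busy):
    busy.discard(dst_reg)  # WB stage clears the busy bit

busy = set()
issue("F6", busy)              # e.g., a long-latency L.D targeting F6 issues
print(can_issue("F6", busy))   # a later write to F6 must wait: False
writeback("F6", busy)
print(can_issue("F6", busy))   # after WB the register is free again: True
```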
WAW Hazards in Longer Pipelines (Example: SUB.D uses the result of ADD.D instead of L.D.)
RAW Hazards in Longer Pipelines Stalls due to “read-after-write” (RAW) hazards are more frequent: i.e., an instruction reads registers written by a preceding instruction.
Forwarding in Longer Pipelines Forwarding in longer pipelines requires large multiplexors and lots of wiring!! The forwarding logic is significantly more complex!!
Loop Unrolling
Loop Unrolling Definition: “A technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together.” Replicate the loop body to expose more parallelism; this reduces loop-control overhead. Use different registers per replication (called “register renaming”) to avoid loop-carried “anti-dependencies” (a store followed by a load of the same register), aka a “name dependence”: reuse of a register name.
Loop Unrolling: Increasing ILP At source level:
for (i=1; i<=1000; i++)
  x[i] = x[i] + s;
becomes
for (i=1; i<=1000; i=i+4) {
  x[i] = x[i] + s;
  x[i+1] = x[i+1] + s;
  x[i+2] = x[i+2] + s;
  x[i+3] = x[i+3] + s;
}
Why unroll loops?? Any drawbacks? Loop unrolling increases code size, and more registers are needed.
MIPS code after scheduling:
Loop: L.D  F0,0(R1)
      L.D  F6,8(R1)
      L.D  F10,16(R1)
      L.D  F14,24(R1)
      ADD.D F4,F0,F2
      ADD.D F8,F6,F2
      ADD.D F12,F10,F2
      ADD.D F16,F14,F2
      S.D  0(R1),F4
      S.D  8(R1),F8
      ADDI R1,R1,32
      S.D  -16(R1),F12
      BNE  R1,R2,Loop
      S.D  -8(R1),F16
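One detail the source-level sketch above glosses over is trip counts that are not a multiple of the unroll factor; a real unrolled loop needs a cleanup loop for the remainder. A Python illustration (hypothetical helper names, 4-way unroll fixed as on the slide):

```python
def add_scalar(x, s):
    """Rolled loop: x[i] += s for every element."""
    for i in range(len(x)):
        x[i] = x[i] + s
    return x

def add_scalar_unrolled4(x, s):
    """Same computation with the body replicated 4 times, plus a
    cleanup loop for lengths that are not a multiple of 4."""
    i, n = 0, len(x)
    while i + 4 <= n:          # unrolled body: 4 iterations per trip
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4
    while i < n:               # remainder iterations
        x[i] = x[i] + s
        i += 1
    return x

print(add_scalar_unrolled4([1, 2, 3, 4, 5, 6], 10))  # [11, 12, 13, 14, 15, 16]
```

The unrolled version runs the loop-control test one quarter as often, which is exactly the overhead reduction the slide is after; the cost is the larger body and the extra cleanup code.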
Register Renaming
Register Renaming A technique to eliminate anti- (WaR) and output (WaW) dependencies. It can be implemented by the compiler (advantage: low cost; disadvantage: “old” codes perform poorly) or in hardware (advantage: binary compatibility; disadvantage: extra hardware needed). We describe the general idea.
Register Renaming There is a physical register file larger than the logical register file; a mapping table associates logical registers with physical registers. When an instruction is decoded: its physical source registers are obtained from the mapping table; its physical destination register is obtained from a free list; the mapping table is updated.
Before: add r3,r3,4 with current mapping table r0→R8, r1→R7, r2→R5, r3→R1, r4→R9 and current free list R2, R6.
After: add R2,R1,4 with new mapping table r0→R8, r1→R7, r2→R5, r3→R2, r4→R9 and new free list R6.
Renaming Eliminates False Dependencies
Before (assume r0→R8, r1→R6, r2→R5, ...):
addi r1, r2, 1
addi r2, r0, 0 // WaR
addi r1, r2, 1 // WaW + RaW
After (free list: R7, R8, R9, R10):
addi R7, R5, 1
addi R10, R8, 0 // WaR disappeared
addi R9, R10, 1 // WaW disappeared, RaW renamed to R10
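The mapping-table-plus-free-list mechanism from the previous slide can be sketched in a few lines of Python. This is an illustrative model with made-up register names (not the slide's exact mapping, whose free list overlaps its initial mapping); allocation order from the front of the free list is an assumption:

```python
class Renamer:
    """Minimal register-renaming sketch: logical registers map to
    physical registers, and every write gets a fresh physical register
    popped from the free list."""

    def __init__(self, mapping, free_list):
        self.table = dict(mapping)   # logical -> physical
        self.free = list(free_list)  # available physical registers

    def rename(self, dst, srcs):
        phys_srcs = [self.table[r] for r in srcs]  # read current mapping
        phys_dst = self.free.pop(0)                # fresh physical register
        self.table[dst] = phys_dst                 # update mapping table
        return phys_dst, phys_srcs

r = Renamer({"r1": "R1", "r2": "R2"}, ["R7", "R8", "R9"])
print(r.rename("r1", ["r2"]))  # write r1        -> ('R7', ['R2'])
print(r.rename("r2", []))      # WaR on r2 gone  -> ('R8', [])
print(r.rename("r1", ["r2"]))  # WaW gone, RaW kept via R8 -> ('R9', ['R8'])
```

Note how the true (RaW) dependence survives renaming as the physical register R8, while the WaR and WaW name clashes simply vanish because each write targets a fresh register.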
Instruction Level Parallelism
Instruction-Level Parallelism (ILP) Pipelining: overlapping the execution of multiple instructions. To increase ILP: Deeper pipeline: less work per stage, shorter clock cycle. Multiple issue: replicate pipeline stages (multiple pipelines) and start multiple instructions per clock cycle. CPI < 1, so use instructions per cycle (IPC) instead. E.g., a 4 GHz 4-way multiple-issue processor peaks at 16 BIPS, with peak CPI = 0.25 and peak IPC = 4. But dependencies reduce this in practice.
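The peak numbers in the example follow directly from clock rate and issue width. A quick sanity check in Python (an illustrative helper, not part of the slides):

```python
def peak_metrics(clock_ghz, issue_width):
    """Peak IPC, CPI, and billions of instructions per second (BIPS)
    for a multiple-issue processor, ignoring all dependencies."""
    ipc = issue_width                 # best case: every slot filled
    cpi = 1.0 / issue_width           # reciprocal of IPC
    bips = clock_ghz * issue_width    # instructions/sec in billions
    return ipc, cpi, bips

# 4 GHz, 4-way issue: peak IPC 4, peak CPI 0.25, peak 16 BIPS
print(peak_metrics(4, 4))  # (4, 0.25, 16)
```

Real programs fall well short of these peaks: every dependence-induced stall leaves issue slots empty, which is why the rest of the chapter is about filling them.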
Instruction Pipelining Pipelining: the goal was to complete one instruction per clock cycle.
Superpipelining Superpipelining: increase the depth of the pipeline to increase the clock rate.
Superscalar (Multiple Issue) Fetch (and execute) more than one instruction at a time (expand every pipeline stage to accommodate multiple instructions).
Multiple Issue Static multiple issue: the compiler groups instructions to be issued together, packages them into “issue slots”, and detects and avoids hazards. Dynamic multiple issue: the CPU examines the instruction stream and chooses instructions to issue each cycle; the compiler can help by reordering instructions; the CPU resolves hazards using advanced techniques at runtime.
Superscalar Processors An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting them during execution (by hardware).
The VLIW Architecture A typical VLIW (very long instruction word) machine has instruction words hundreds of bits in length. Multiple functional units are used concurrently in a VLIW processor. All functional units share the use of a common register file.
Superscalar vs. VLIW VLIW – Very Long Instruction Word EPIC – Explicitly Parallel Instruction Computing
Multiple-Issue Processor Styles Static multiple-issue processors (aka VLIW): decisions on which instructions to execute simultaneously are made statically (at compile time by the compiler). E.g., Intel Itanium and Itanium 2 for the IA-64 ISA (EPIC, Explicitly Parallel Instruction Computing). Dynamic multiple-issue processors (aka superscalar): decisions on which instructions to execute simultaneously are made dynamically (at run time by the hardware). E.g., IBM Power 2, Pentium 4, MIPS R10K, HP PA 8500.
Speculation “Guess” what to do with an instruction: start the operation as soon as possible, then check whether the guess was right. If so, complete the operation; if not, roll back and do the right thing. Common to static and dynamic multiple issue. Examples: speculate on a branch outcome (roll back if the path taken is different); speculate on a load (roll back if the location is updated).
MIPS with Static Dual Issue Two-issue packets: one ALU/branch instruction and one load/store instruction, 64-bit aligned, ALU/branch first, then load/store; pad an unused slot with a nop. (Table: packets at addresses n, n+8, n+16 (ALU/branch) paired with n+4, n+12, n+20 (load/store) flow through IF ID EX MEM WB together, each packet one cycle behind the previous one.)
MIPS with Static Dual Issue (Figure: the datapath fetches and decodes 64 bits of instructions per cycle, with one pipeline for the integer ALU operation and one for the load/store operation.)
Scheduling Example Schedule this for dual-issue MIPS:
Loop: lw   $t0, 0($s1)       # $t0 = array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0
Hazards!! Reorder the instructions to avoid as many stalls as possible:
      ALU/branch            Load/store       cycle
Loop: nop                   lw $t0, 0($s1)   1
      addi $s1, $s1, -4     nop              2
      addu $t0, $t0, $s2    nop              3
      bne $s1, $zero, Loop  sw $t0, 4($s1)   4
IPC = 5/4 = 1.25 (c.f. peak IPC = 2). Not a great result!!
Loop Unrolling Example
Loop: lw   $t0, 0($s1)       # $t0 = array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0
Make 4 copies of the loop body:
      ALU/branch            Load/store        cycle
Loop: addi $s1, $s1, -16    lw $t0, 0($s1)    1
      nop                   lw $t1, 12($s1)   2
      addu $t0, $t0, $s2    lw $t2, 8($s1)    3
      addu $t1, $t1, $s2    lw $t3, 4($s1)    4
      addu $t2, $t2, $s2    sw $t0, 16($s1)   5
      addu $t3, $t3, $s2    sw $t1, 12($s1)   6
      nop                   sw $t2, 8($s1)    7
      bne $s1, $zero, Loop  sw $t3, 4($s1)    8
IPC = 14/8 = 1.75. Closer to 2, but at the cost of registers and code size.
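The IPC figures quoted for the two schedules are worth checking. A one-line Python helper (illustrative, not part of the slides):

```python
def ipc(instructions, cycles):
    """Achieved instructions per cycle for a given schedule."""
    return instructions / cycles

# dual-issue schedule of the original loop: 5 instructions in 4 cycles
print(ipc(5, 4))    # 1.25
# after unrolling 4x: 14 useful instructions in 8 cycles
print(ipc(14, 8))   # 1.75
```

Unrolling helps because the 4 copies share one addi and one bne (14 instructions instead of 20), and the extra independent work fills issue slots that nops occupied before.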
Power Efficiency The complexity of dynamic scheduling and speculation requires power; multiple simpler cores may be better. (Blank cells repeat the value above them in the original figure.)
Microprocessor      Year  Clock Rate  Pipeline Stages  Issue width  Out-of-order/Speculation  Cores  Power
i486                1989  25MHz       5                1            No                               5W
Pentium             1993  66MHz                        2                                             10W
Pentium Pro         1997  200MHz      10               3            Yes                              29W
P4 Willamette       2001  2000MHz     22                                                             75W
P4 Prescott         2004  3600MHz     31                                                             103W
Intel Core          2006  2930MHz     14               4
Intel i5 Nehalem    2010  3300MHz                                                             2-4    87W
Intel i5 Ivy Bridge 2012  3400MHz                                                             8      77W
ARM & Intel
Cortex-A8 and Intel i7
Processor:                  ARM A8                  | Intel Core i7 920
Market:                     Personal Mobile Device  | Server, cloud
Thermal design power:       2 Watts                 | 130 Watts
Clock rate:                 1 GHz                   | 2.66 GHz
Cores/chip:                 1                       | 4
Floating point?             No                      | Yes
Multiple issue?                                     | Dynamic
Peak instructions/clock:    2                       |
Pipeline stages:            14                      |
Pipeline schedule:          Static in-order         | Dynamic out-of-order with speculation
Branch prediction:          2-level                 |
1st-level caches/core:      32 KiB I, 32 KiB D      |
2nd-level caches/core:      128-1024 KiB            | 256 KiB
3rd-level caches (shared):  -                       | 2-8 MB
ARM Cortex-A8 Pipeline Fetch Instruction (3 stages): the first 3 stages fetch two instructions at a time and try to keep a 12-entry instruction prefetch buffer full. The fetch unit uses a two-level branch predictor: a 512-entry branch target buffer (BTB), a 4096-entry global history buffer (GHB), and an 8-entry return stack (RS) to predict future returns. When branch prediction is wrong, the pipeline is emptied (13-cycle misprediction penalty)!!
ARM Cortex-A8 Pipeline Decode Instruction (5 stages): these stages determine whether there are dependences between a pair of instructions, which would force sequential execution, and decide to which pipeline of the execution stages to send the instructions.
ARM Cortex-A8 Pipeline Instruction Execution (6 stages): the six stages of the instruction execution section offer one pipeline for load and store instructions and two pipelines for arithmetic operations (only the first can handle multiplies!!). The execution stages have full bypassing between the three pipelines.
ARM Cortex-A8 Performance CPI of the A8 using small versions of the SPEC2000 benchmarks: the ideal CPI is 0.5; the best case is 1.4; the median is 2.0 (80% of stalls due to pipeline hazards, 20% due to the memory hierarchy); the worst case is 5.2.
Nehalem Microarchitecture (Intel) First use: Core i7, 2008; 45 nm; hyperthreading; L3 cache; 3-channel DDR3 controller; QPI: QuickPath Interconnect; 32K+32K L1 per core; 256K L2 per core; 4-8 MB L3 shared between cores.
Core i7 Pipeline x86 microprocessors employ sophisticated pipelining approaches using: dynamic multiple issue and dynamic pipeline scheduling (with out-of-order execution and speculation). These processors face the challenge of implementing the complex x86 instruction set!! Intel fetches x86 instructions and translates them into internal MIPS-like instructions, which Intel calls micro-operations. The six independent functional units can each begin execution of a ready RISC operation every cycle. (Figure: fetch & decode, translate, six functional units, cache.)
Core i7 Performance
Summary
Concluding Remarks The ISA influences the design of the datapath and control; the datapath and control influence the design of the ISA. Pipelining improves instruction throughput using parallelism: more instructions completed per second, but the latency of each instruction is not reduced. Hazards: structural, data, control. Multiple issue and dynamic scheduling (ILP): dependencies limit achievable parallelism; complexity leads to the power wall.
End Slides
Exceptions
Exceptions and Interrupts “Unexpected” events requiring a change in the flow of control; different ISAs use the terms differently. Exception: arises within the CPU, e.g., undefined opcode, overflow, syscall, ... Interrupt: from an external I/O controller. Dealing with them without sacrificing performance is hard.
Handling Exceptions In MIPS, exceptions are managed by a System Control Coprocessor (CP0). Save the PC of the offending (or interrupted) instruction; in MIPS: the Exception Program Counter (EPC). Save an indication of the problem; in MIPS: the Cause register. We’ll assume 1 bit: 0 for undefined opcode, 1 for overflow. Jump to the handler at 8000 0180.
An Alternate Mechanism Vectored interrupts: the handler address is determined by the cause. Example: undefined opcode: C000 0000; overflow: C000 0020; ...: C000 0040. Instructions either deal with the interrupt, or jump to the real handler.
Handler Actions Read the cause, and transfer to the relevant handler. Determine the action required. If restartable: take corrective action and use the EPC to return to the program. Otherwise: terminate the program and report the error using the EPC, cause, ...
Exceptions in a Pipeline Another form of control hazard. Consider overflow on add in the EX stage: add $1, $2, $1. Prevent $1 from being clobbered; complete previous instructions; flush add and subsequent instructions; set the Cause and EPC register values; transfer control to the handler. Similar to a mispredicted branch: use much of the same hardware.
Pipeline with Exceptions
Exception Properties Restartable exceptions: the pipeline can flush the instruction; the handler executes, then returns to the instruction, which is refetched and executed from scratch. The PC is saved in the EPC register to identify the causing instruction (actually PC + 4 is saved, so the handler must adjust).
Exception Example Exception on add:
40 sub $11, $2, $4
44 and $12, $2, $5
48 or  $13, $2, $6
4C add $1, $2, $1
50 slt $15, $6, $7
54 lw  $16, 50($7)
...
Handler:
80000180 sw $25, 1000($0)
80000184 sw $26, 1004($0)
...
Exception Example
Exception Example
Multiple Exceptions Pipelining overlaps multiple instructions, so we could have multiple exceptions at once. Simple approach: deal with the exception from the earliest instruction and flush subsequent instructions (“precise” exceptions). In complex pipelines (multiple instructions issued per cycle, out-of-order completion), maintaining precise exceptions is difficult!
Imprecise Exceptions Just stop the pipeline and save the state, including the exception cause(s). Let the handler work out which instruction(s) had exceptions and which to complete or flush (may require “manual” completion). This simplifies the hardware but requires more complex handler software; it is not feasible for complex multiple-issue out-of-order pipelines.
Speculation and Exceptions What if an exception occurs on a speculatively executed instruction? E.g., a speculative load before a null-pointer check. Static speculation: can add ISA support for deferring exceptions. Dynamic speculation: can buffer exceptions until instruction completion (which may not occur).
Inst. Level Parallelism
Matrix Multiply Unrolled C code:
#include <x86intrin.h>
#define UNROLL (4)

void dgemm (int n, double* A, double* B, double* C)
{
  for ( int i = 0; i < n; i+=UNROLL*4 )
    for ( int j = 0; j < n; j++ ) {
      __m256d c[4];
      for ( int x = 0; x < UNROLL; x++ )
        c[x] = _mm256_load_pd(C+i+x*4+j*n);

      for ( int k = 0; k < n; k++ )
      {
        __m256d b = _mm256_broadcast_sd(B+k+j*n);
        for ( int x = 0; x < UNROLL; x++ )
          c[x] = _mm256_add_pd(c[x],
                   _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
      }

      for ( int x = 0; x < UNROLL; x++ )
        _mm256_store_pd(C+i+x*4+j*n, c[x]);
    }
}
Matrix Multiply Assembly code:
vmovapd (%r11),%ymm4            # Load 4 elements of C into %ymm4
mov %rbx,%rax                   # register %rax = %rbx
xor %ecx,%ecx                   # register %ecx = 0
vmovapd 0x20(%r11),%ymm3        # Load 4 elements of C into %ymm3
vmovapd 0x40(%r11),%ymm2        # Load 4 elements of C into %ymm2
vmovapd 0x60(%r11),%ymm1        # Load 4 elements of C into %ymm1
vbroadcastsd (%rcx,%r9,1),%ymm0 # Make 4 copies of B element
add $0x8,%rcx                   # register %rcx = %rcx + 8
vmulpd (%rax),%ymm0,%ymm5       # Parallel mul %ymm0, 4 A elements
vaddpd %ymm5,%ymm4,%ymm4        # Parallel add %ymm5, %ymm4
vmulpd 0x20(%rax),%ymm0,%ymm5   # Parallel mul %ymm0, 4 A elements
vaddpd %ymm5,%ymm3,%ymm3        # Parallel add %ymm5, %ymm3
vmulpd 0x40(%rax),%ymm0,%ymm5   # Parallel mul %ymm0, 4 A elements
vmulpd 0x60(%rax),%ymm0,%ymm0   # Parallel mul %ymm0, 4 A elements
add %r8,%rax                    # register %rax = %rax + %r8
cmp %r10,%rcx                   # compare %rcx to %r10
vaddpd %ymm5,%ymm2,%ymm2        # Parallel add %ymm5, %ymm2
vaddpd %ymm0,%ymm1,%ymm1        # Parallel add %ymm0, %ymm1
jne 68 <dgemm+0x68>             # jump if %rcx != %r10
add $0x1,%esi                   # register %esi = %esi + 1
vmovapd %ymm4,(%r11)            # Store %ymm4 into 4 C elements
vmovapd %ymm3,0x20(%r11)        # Store %ymm3 into 4 C elements
vmovapd %ymm2,0x40(%r11)        # Store %ymm2 into 4 C elements
vmovapd %ymm1,0x60(%r11)        # Store %ymm1 into 4 C elements
Performance Impact
Speculation
Speculation Predict a branch and continue issuing: don’t commit until the branch outcome is determined. Load speculation: avoid load and cache-miss delay by predicting the effective address, predicting the loaded value, loading before completing outstanding stores, and bypassing stored values to the load unit; don’t commit the load until the speculation is cleared.
Why Do Dynamic Scheduling? Why not just let the compiler schedule code? Not all stalls are predictable, e.g., cache misses. We can’t always schedule around branches: the branch outcome is dynamically determined. Different implementations of an ISA have different latencies and hazards.
Does Multiple Issue Work? The BIG Picture: yes, but not as much as we’d like. Programs have real dependencies that limit ILP; some dependencies are hard to eliminate (e.g., pointer aliasing); some parallelism is hard to expose (limited window size during instruction issue); memory delays and limited bandwidth make it hard to keep pipelines full. Speculation can help if done well.
Multi-Issue
Hazards in the Dual-Issue MIPS More instructions executing in parallel. EX data hazard: forwarding avoided stalls with single issue, but now we can’t use an ALU result in a load/store in the same packet: add $t0, $s0, $s1 / lw $s2, 0($t0) must be split into two packets, effectively a stall. Load-use hazard: still one cycle of use latency, but now two instructions are affected. More aggressive scheduling is required.
Dynamic Multiple Issue “Superscalar” processors: the CPU decides whether to issue 0, 1, 2, ... instructions each cycle, avoiding structural and data hazards. This avoids the need for compiler scheduling (though it may still help); code semantics are ensured by the CPU.
Dynamic Pipeline Scheduling Allow the CPU to execute instructions out of order to avoid stalls, but commit results to registers in order. Example:
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
The CPU can start sub while addu is waiting for lw.
Dynamically Scheduled CPU (Figure.) Reservation stations preserve dependencies and hold pending operands; results are also sent to any waiting reservation stations. A reorder buffer for register writes can supply operands for issued instructions.
Compiler/Hardware Speculation The compiler can reorder instructions, e.g., move a load before a branch, and can include “fix-up” instructions to recover from an incorrect guess. Hardware can look ahead for instructions to execute, buffering results until it determines they are actually needed, and flushing the buffers on incorrect speculation.
Static Multiple Issue The compiler groups instructions into “issue packets”: groups of instructions that can be issued on a single cycle, determined by the pipeline resources required. Think of an issue packet as a very long instruction that specifies multiple concurrent operations: Very Long Instruction Word (VLIW).
Scheduling Static Multiple Issue The compiler must remove some/all hazards: reorder instructions into issue packets with no dependencies within a packet (possibly some dependencies between packets; this varies between ISAs, and the compiler must know!). Pad with nop if necessary.
Multiple-Issue Datapath Responsibilities Must handle, with a combination of hardware and software fixes, the fundamental limitations of: Storage (data) dependencies, aka data hazards: this limitation is more severe in a SS/VLIW processor due to (usually) low ILP. Procedural dependencies, aka control hazards: ditto, but even more severe; use dynamic branch prediction to help resolve the ILP issue. Resource conflicts, aka structural hazards: a SS/VLIW processor has a much larger number of potential resource conflicts; functional units may have to arbitrate for result buses and register-file write ports. Resource conflicts can be eliminated by duplicating the resource or by pipelining the resource; pipelining is much less expensive than duplicating.
Instruction Issue and Completion Policies Instruction-issue – initiate execution Instruction lookahead capability – fetch, decode and issue instructions beyond the current instruction Instruction-completion – complete execution Processor lookahead capability – complete issued instructions beyond the current instruction Instruction-commit – write back results to the RegFile or D$ (i.e., change the machine state) In-order issue with in-order completion In-order issue with out-of-order completion Out-of-order issue with out-of-order completion Out-of-order issue with out-of-order completion and in-order commit
In-Order Issue with In-Order Completion Simplest policy is to issue instructions in exact program order and to complete them in the same order they were fetched (i.e., in program order) Example: Assume a pipelined processor that can fetch and decode two instructions per cycle, that has three functional units (a single cycle adder, a single cycle shifter, and a two cycle multiplier), and that can complete (and write back) two results per cycle And an instruction sequence with the following characteristics I1 – needs two execute cycles (a multiply) I2 I3 I4 – needs the same function unit as I3 I5 – needs data value produced by I4 I6 – needs the same function unit as I5
In-Order Issue, In-Order Completion Example I1: two execute cycles (a multiply); I2; I3; I4: same function unit as I3; I5: data value produced by I4 (needs forwarding hardware); I6: same function unit as I5. (Pipeline diagram omitted.) 8 cycles in total; in parallel the machine can fetch/decode 2 and commit 2 per cycle.
In-Order Issue, Out-of-Order Completion (IOI-OOC) Example Same sequence: I1: two execute cycles; I2; I3; I4: same function unit as I3; I5: data value produced by I4; I6: same function unit as I5. (Pipeline diagram omitted.) 7 cycles in total.
Superscalar vs. VLIW
How can the machine exploit available ILP? Technique vs. limitation:
Pipelining: limited by issue rate, FU stalls, FU depth.
Super-pipelining (issue 1 instruction per (fast) cycle; IF takes multiple cycles): limited by clock skew, FU stalls, FU depth.
Super-scalar (issue multiple scalar instructions per cycle): limited by hazard resolution.
VLIW (each instruction specifies multiple scalar operations): limited by packing.
(Pipeline diagrams omitted.)