CMSC 611: Advanced Computer Architecture Instruction Level Parallelism Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides.

Slides:

Advertisements

Similar presentations

CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.

Advertisements

EECC551 - Shaaban #1 Fall 2003 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

1 COMP 740: Computer Architecture and Implementation Montek Singh Tue, Feb 24, 2009 Topic: Instruction-Level Parallelism IV (Software Approaches/Compiler.

COMP 4211 Seminar Presentation Based On: Computer Architecture A Quantitative Approach by Hennessey and Patterson Presenter : Feri Danes.

Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.

Pipelining 5. Two Approaches for Multiple Issue Superscalar –Issue a variable number of instructions per clock –Instructions are scheduled either statically.

Instruction Set Issues MIPS easy –Instructions are only committed at MEM  WB transition Other architectures are more difficult –Instructions may update.

CMSC 611: Advanced Computer Architecture Scoreboard Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.

Lecture 6: Pipelining MIPS R4000 and More Kai Bu

Instruction-Level Parallelism (ILP)

1 IF IDEX MEM L.D F4,0(R2) MUL.D F0, F4, F6 ADD.D F2, F0, F8 L.D F2, 0(R2) WB IF IDM1 MEM WBM2M3M4M5M6M7 stall.

1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 3 (and Appendix C) Instruction-Level Parallelism and Its Exploitation Computer Architecture.

1 Lecture: Pipeline Wrap-Up and Static ILP Topics: multi-cycle instructions, precise exceptions, deep pipelines, compiler scheduling, loop unrolling, software.

CIS429/529 Winter 2007 Pipelining II- 1 Additional pipelining topics Why pipelining is so hard: exception handling ILP techniques: loop unrolling.

1 Lecture 4: Advanced Pipelines Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10)

1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Sep 24, 2003 Topic: Pipelining -- Intermediate Concepts (Multicycle Operations;

EECC551 - Shaaban #1 Winter 2002 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

EECC551 - Shaaban #1 Spring 2006 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

EECC551 - Shaaban #1 Fall 2005 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.

1 Lecture 5: Pipeline Wrap-up, Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2) Assignment 1 due at the start of class on Thursday.

EECC551 - Shaaban #1 Fall 2002 lec# Floating Point/Multicycle Pipelining in MIPS Completion of MIPS EX stage floating point arithmetic operations.

EENG449b/Savvides Lec 4.1 1/22/04 January 22, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG Computer.

1 Lecture 4: Advanced Pipelines Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10)

EECC551 - Shaaban #1 Winter 2011 lec# Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level.

EECC551 - Shaaban #1 Spring 2004 lec# Definition of basic instruction blocks Increasing Instruction-Level Parallelism & Size of Basic Blocks.

1 Lecture 4: Advanced Pipelines Control hazards, multi-cycle in-order pipelines, static ILP (Appendix A.4-A.10, Sections )

COMP381 by M. Hamdi 1 Loop Level Parallelism Instruction Level Parallelism: Loop Level Parallelism.

CIS429.S00: Lec12- 1 Miscellaneous pipelining topics Why pipelining is so hard: exception handling Advanced pipelining techniques: loop unrolling.

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

EECC551 - Shaaban #1 Fall 2001 lec# Floating Point/Multicycle Pipelining in DLX Completion of DLX EX stage floating point arithmetic operations.

1 Manchester Mark I, This was the second (the first was a small- scale prototype) machine built at Cambridge. A production version of this computer.

1 Appendix A Pipeline implementation Pipeline hazards, detection and forwarding Multiple-cycle operations MIPS R4000 CDA5155 Spring, 2007, Peir / University.

Pipeline Extensions prepared and Instructed by Shmuel Wimer Eng. Faculty, Bar-Ilan University MIPS Extensions1May 2015.

CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 1 Compiler Techniques for ILP  So far we have explored dynamic hardware techniques for.

Spring 2003CSE P5481 Precise Interrupts Precise interrupts preserve the model that instructions execute in program-generated order, one at a time If an.

Appendix A. Pipelining: Basic and Intermediate Concept

LECTURE 10 Pipelining: Advanced ILP. EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls,

CSC 4250 Computer Architectures September 22, 2006 Appendix A. Pipelining.

1 Lecture: Pipelining Extensions Topics: control hazards, multi-cycle instructions, pipelining equations.

Data Hazards Dependent instructions add %g1, %g2, %g3 sub %l1, %g3, %o0 Forwarding helps, but not all hazards can be avoided.

1 Lecture: Pipeline Wrap-Up and Static ILP Topics: multi-cycle instructions, precise exceptions, deep pipelines, compiler scheduling, loop unrolling, software.

Instruction-Level Parallelism and Its Dynamic Exploitation

CMSC 611: Advanced Computer Architecture

Instruction-Level Parallelism

CS203 – Advanced Computer Architecture

Pipelining Wrapup Brief overview of the rest of chapter 3

Exceptions & Multi-cycle Operations

Pipelining: Advanced ILP

Lecture 6: Advanced Pipelines

Pipelining Multicycle, MIPS R4000, and More

How to improve (decrease) CPI

Adapted from the slides of Prof

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Instruction Level Parallelism (ILP)

Project Instruction Scheduler Assembler for DLX

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Overview What are pipeline hazards? Types of hazards

Extending simple pipeline to multiple pipes

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Control Hazards Branches (conditional, unconditional, call-return)

Dynamic Hardware Prediction

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

CMSC 611: Advanced Computer Architecture

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Lecture 5: Pipeline Wrap-up, Static ILP

Interrupts and exceptions

CMSC 611: Advanced Computer Architecture

Presentation transcript:

CMSC 611: Advanced Computer Architecture Instruction Level Parallelism Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / © 2003 Elsevier Science

Floating-Point Pipeline Impractical for FP ops to complete in one clock –(complex logic and/or very long clock cycle) More complex hazards –Structural –Data

Non-pipelined DIV operation stalling the whole pipeline for 24 cycles 4-stage pipelined FP addition Integer ALU 7-stage pipelined FP multiply Multi-cycle FP Pipeline Example: blue indicate where data is needed and red when result is available

Multi-cycle FP: EX Phase Latency: cycles between instruction that produces result and instruction that uses it –Since most operations consume their operands at the beginning of the EX stage, latency is usually number of the stages of the EX an instruction uses Long latency increases the frequency of RAW hazards Initiation (Repeat) interval: cycles between issuing two operations of a given type

Example of RAW hazard caused by the long latency FP Pipeline Challenges Non-pipelined divide causes structural hazards Number of register writes required in a cycle can be larger than 1 WAW hazards are possible –Instructions no longer reach WB in order WAR hazards are NOT possible –Register reads are still taking place during the ID stage Instructions can complete out of order –Complicates exceptions Longer latency makes RAW stalls more frequent

Structural Hazard At cycle 10, MULTD, ADDD and LD instructions all in MEM At cycle 11, MULTD, ADDD and LD instructions all in WB –Additional write ports are not cost effective since they are rarely used Instead –Detect at ID and stall –Detect at MEM or WB and stall

WAW Data Hazards WAW hazards can be corrected by either: –Stalling the latter instruction at MEM until it is safe –Preventing the first instruction from overwriting the register Correcting at cycle 11 OK unless intervening RAW/use of F2 WAW hazards can be detected at the ID stage –Convert 1st instruction to no-op WAW hazards are generally very rare, designers usually go with the simplest solution

Detecting Hazards Hazards among FP instructions & and combined FP and integer instructions Separate int & fp register files limits latter to FP load and store instructions Assuming all checks are to be performed in the ID phase: –Check for structural hazards: Wait if the functional unit is busy (Divides in our case) Make sure the register write port is available when needed –Check for a RAW data hazard Requires knowledge of latency and initiation interval to decide when to forward and when to stall –Check for a WAW data hazard Write completion has to be estimated at the ID stage to check with other instructions in the pipeline Data hazard detection and forwarding logic from values stored between the stages

Stalls/Instruction, FP Pipeline

Instruction Level Parallelism (ILP) Overlap the execution of unrelated instructions Both instruction pipelining and ILP enhance instruction throughput not the execution time of the individual instruction Potential of IPL within a basic block is very limited –in “gcc” 17% of instructions are control transfer meaning on average 5 instructions per branch

Loops: Simple & Common for (i=1; i<=1000; i=i+1) x[i] = x[i] + y[i]; Techniques like loop unrolling convert loop-level parallelism into instruction-level parallelism –statically by the compiler –dynamically by hardware Loop-level parallelism can also be exploited using vector processing IPL feasibility is mainly hindered by data and control dependence among the basic blocks Level of parallelism is limited by instruction latencies

Major Assumptions Basic MIPS integer pipeline Branches with one delay cycle Functional units are fully pipelined or replicated (as many times as the pipeline depth) –An operation of any type can be issued on every clock cycle and there are no structural hazard

Motivating Example for(i=1000; i>0; i=i-1) x[i] = x[i] + s; for(i=1000; i>0; i=i-1) x[i] = x[i] + s; Standard Pipeline execution Sophisticated compiler optimization reduced execution time from 10 cycles to only 6 cycles Loop: LDF0,x(R1);F0=x[i] ADDDF4,F0,F2 ;add F2(=s) SDx(R1),F4 ;store result SUBIR1,R1,8 ;i=i-1 BNEZR1,Loop ;loop to 0 Loop: LDF0,x(R1) stall ADDD F4,F0,F2 stall SD x(R1),F4 SUBI R1,R1,8 stall BNEZ R1,Loop stall Loop:LDF0,x(R1) SUBIR1,R1,8 ADDDF4,F0,F2 stall ;F4 BNEZR1,Loop SDx+8(R1),F4 Smart compiler

Loop:LDF0,x(R1) ADDDF4,F0,F2 SDx(R1),F4 ;drop SUBI/BNEZ LDF6,x-8(R1) ADDDF8,F6,F2 SDx-8(R1),F8 ;drop again LDF10,x-16(R1) ADDDF12,F10,F2 SDx-16(R1),F12 ;drop again LDF14,x-24(R1) ADDDF16,F14,F2 SDx-24(R1),F16 SUBIR1,R1,#32 ;alter to 4*8 BNEZR1,LOOP Loop:LDF0,x(R1) ADDDF4,F0,F2 SDx(R1),F4 ;drop SUBI/BNEZ LDF6,x-8(R1) ADDDF8,F6,F2 SDx-8(R1),F8 ;drop again LDF10,x-16(R1) ADDDF12,F10,F2 SDx-16(R1),F12 ;drop again LDF14,x-24(R1) ADDDF16,F14,F2 SDx-24(R1),F16 SUBIR1,R1,#32 ;alter to 4*8 BNEZR1,LOOP Loop:LD F0,x(R1) ADDDF4,F0,F2 SDx(R1),F4 SUBIR1,R1,8 BNEZR1,Loop Replicate loop body 4 times, will need cleanup phase if loop iteration is not a multiple of 4 Loop Unrolling 6 cycles, but only 3 are loop body Loop unrolling limits overhead at the expense of a larger code –Eliminates branch delays –Enable effective scheduling Use of different registers needed to limit data hazard

Scheduling Unrolled Loops CycleInstruction 1 Loop:LDF0,x(R1) 3ADDDF4,F0,F2 6SDx(R1),F4 7LDF6,x-8(R1) 9ADDDF8,F6,F2 12SDx-8(R1),F8 13LDF10,x-16(R1) 15ADDDF12,F10,F2 18SDx-16(R1),F12 19LDF14,x-24(R1) 21ADDDF16,F14,F2 24SDx-24(R1),F16 25SUBIR1,R1,#32 27BNEZR1,LOOP 28NOOP CycleInstruction 1 Loop:LDF0,x(R1) 3ADDDF4,F0,F2 6SDx(R1),F4 7LDF6,x-8(R1) 9ADDDF8,F6,F2 12SDx-8(R1),F8 13LDF10,x-16(R1) 15ADDDF12,F10,F2 18SDx-16(R1),F12 19LDF14,x-24(R1) 21ADDDF16,F14,F2 24SDx-24(R1),F16 25SUBIR1,R1,#32 27BNEZR1,LOOP 28NOOP CycleInstruction 1 Loop:LDF0,x(R1) 2LDF6,x-8(R1) 3LDF10,x-16(R1) 4LDF14,x-24(R1) 5ADDDF4,F0,F2 6ADDDF8,F6,F2 7ADDDF12,F10,F2 8ADDDF16,F14,F2 9SDx(R1),F4 10SDx-8(R1),F8 11SUBIR1,R1,#32 12SDx+16(R1),F12 13BNEZR1,LOOP 14SDx+8(R1),F1 CycleInstruction 1 Loop:LDF0,x(R1) 2LDF6,x-8(R1) 3LDF10,x-16(R1) 4LDF14,x-24(R1) 5ADDDF4,F0,F2 6ADDDF8,F6,F2 7ADDDF12,F10,F2 8ADDDF16,F14,F2 9SDx(R1),F4 10SDx-8(R1),F8 11SUBIR1,R1,#32 12SDx+16(R1),F12 13BNEZR1,LOOP 14SDx+8(R1),F1 Loop unrolling exposes more computation that can be scheduled to minimize the pipeline stalls Understanding dependence among instructions is the key for for detecting and performing the transformation

Exception Types I/O device request Breakpoint Integer arithmetic overflow FP arithmetic anomaly Page fault Misaligned memory accesses Memory-protection violation Undefined instruction Privilege violation Hardware and power failure

Exception Requirements Synchronous vs. asynchronous –I/O exceptions: Asyncronous Allow completion of current instruction –Exceptions within instruction: Synchronous Harder to deal with User requested vs. coerced –Requested predictable and easier to handle User maskable vs. unmaskable Resume vs. terminate –Easier to implement exceptions that terminate program execution

Pipeline exceptions must follow order of execution of faulting instructions not according to the time they occur Exceptions in MIPS Multiple exceptions might occur since multiple instructions are executing –(LW followed by DIV might cause page fault and an arith. exceptions in same cycle) Exceptions can even occur out of order –IF page fault before preceeding MEM page fault

Stopping & Restarting Execution Some exceptions require restart of instruction –e.g. Page fault in MEM stage When exception occurs, pipeline control can: –Force a trap instruction into next IF stage –Until the trap is taken, turn off all writes for the faulting (and later) instructions –OS exception-handling routine saves faulting instruction PC

Stopping & Restarting Execution Precise exceptions –Instructions before the faulting one complete –Instructions after it restart –As if execution were serial Exception handling complex if faulting instruction can change state before exception occurs Precise exceptions simplifies OS Required for demand paging

Precise Exception Handling The MIPS Approach: –Hardware posts all exceptions caused by a given instruction in a status vector associated with the instruction –The exception status vector is carried along as the instruction goes down the pipeline –Once an exception indication is set in the exception status vector, any control signal that may cause a data value to be written is turned off –Upon entering the WB stage the exception status vector is checked and the exceptions, if any, will be handled according the time they occurred –Allowing an instruction to continue execution till the WB stage is not a problem since all write operations for that instruction will be disallowed

Instruction Set Complications Early-Write Instructions –MIPS only writes late in pipeline –Machines with multiple writes usually require capability to rollback the effect of an instruction e.g. VAX auto-increment, –Instructions that update memory state during execution, e.g. string copy, may need to save & restore temporary registers Branching mechanisms –Complications from condition codes, predictive execution for exceptions prior to branch Variable, multi-cycle operations –Instruction can make multiple writes

Maintaining Precise Exceptions Pipelining FP instructions can cause out- of-order completion Exceptions also a problem: DIVFF0, F2, F4 ADDFF10, F10, F8 SUBFF12, F12, F14 –No data hazards –What if DIVF exception occurs after ADDF writes F10?

Four FP Exception Solutions 1.Settle for imprecise exceptions –Some supercomputers still uses this approach –IEEE floating point standard requires precise exceptions –Some machine offer slow precise and fast imprecise exceptions 2.Buffer the results of all operations until previous instructions complete –Complex and expensive design (many comparators and large MUX) –History or future register file

Four FP Exception Solutions 3.Allow imprecise exceptions and get the handler to clean up any miss –Save PC + state about the interrupting instruction and all out-of-order completed instructions –The trap handler will consider the state modification caused by finished instructions and prepare machine to resume correctly –Issues: consider the following example Instruction1: Long running, eventual exception Instructions 2 … (n-1) : Instructions that do not complete Instruction n : An instruction that is finished –The compiler can simplify the problem by grouping FP instructions so that the trap does not have to worry about unrelated instructions

Four FP Exception Solutions 4.Allow instruction issue to continue only if previous instruction are guaranteed to cause no exceptions: –Mainly applied in the execution phase –Used on MIPS R4000 and Intel Pentium