Chapter 3: Limitations on Instruction-Level Parallelism. Bernard Chen, Ph.D., University of Central Arkansas

Overcome Data Hazards with Dynamic Scheduling. If there is a data dependence, the hazard detection hardware stalls the pipeline: no new instructions are fetched or issued until the dependence is cleared. Dynamic scheduling: the hardware rearranges the instruction execution to reduce the stalls while maintaining data flow and exception behavior.

RAW. If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped. If the data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard. I: add r1,r2,r3   J: sub r4,r1,r3 (J reads r1, which I writes).

Overcome Data Hazards with Dynamic Scheduling. Key idea: allow instructions behind a stall to proceed. DIV F0 <- F2/F4; ADD F10 <- F0+F8; SUB F12 <- F8-F14

Overcome Data Hazards with Dynamic Scheduling. Key idea: allow instructions behind a stall to proceed. DIV F0 <- F2/F4; SUB F12 <- F8-F14; ADD F10 <- F0+F8 (SUB reordered ahead of the ADD, which depends on DIV).

Overcome Data Hazards with Dynamic Scheduling. Key idea: allow instructions behind a stall to proceed. DIV F0 <- F2/F4; SUB F12 <- F8-F14; ADD F10 <- F0+F8. This enables out-of-order execution and allows out-of-order completion (e.g., SUB completes before ADD). In a dynamically scheduled pipeline, all instructions still pass through the issue stage in program order (in-order issue).
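
As an illustration only (not from the slides), here is a minimal Python sketch of the idea: instructions still issue one per cycle in program order, but each begins executing as soon as its source registers are ready, so the independent SUB finishes before the ADD that waits on the long DIV. The latencies are made up.

# Minimal sketch of in-order issue, out-of-order execution (illustrative only).
# Each instruction: (name, destination register, source registers, latency in cycles).
program = [
    ("DIV", "F0",  ("F2", "F4"),  10),   # long-latency divide
    ("ADD", "F10", ("F0", "F8"),   2),   # RAW on F0: must wait for DIV
    ("SUB", "F12", ("F8", "F14"),  2),   # independent: may finish before ADD
]

ready_at = {}          # cycle at which each register value becomes available
issue_cycle = 0        # instructions still issue one per cycle, in program order
for name, dst, srcs, latency in program:
    start = max([issue_cycle] + [ready_at.get(r, 0) for r in srcs])
    finish = start + latency
    ready_at[dst] = finish
    print(f"{name:4s} issues at cycle {issue_cycle}, executes cycles {start}-{finish}")
    issue_cycle += 1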

Overcome Data Hazards with Dynamic Scheduling. It offers several advantages: it simplifies the compiler; it allows code compiled for one pipeline to run efficiently on a different pipeline; and it allows the processor to tolerate unpredictable delays such as cache misses.

Overcome Data Hazards with Dynamic Scheduling. However, dynamic execution creates WAR and WAW hazards and makes exception handling harder. Name dependence: two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are two kinds of name dependence.

WAR. Instruction J writes an operand before the earlier instruction I reads it. This anti-dependence, if it causes a hazard in the pipeline, is called a Write After Read (WAR) hazard. I: sub r4,r1,r3   J: add r1,r2,r3   K: mul r6,r1,r7 (J writes r1, which I reads).

WAW. Instruction J writes an operand before the earlier instruction I writes it. This output dependence, if it causes a hazard in the pipeline, is called a Write After Write (WAW) hazard. I: sub r1,r4,r3   J: add r1,r2,r3   K: mul r6,r1,r7 (both I and J write r1).
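
A small illustrative Python helper (an assumption of this write-up, not part of the slides) that classifies the dependence between an instruction I and a later instruction J from the registers they read and write:

# Classify dependences between instruction I and a later instruction J (sketch).
def classify(i_dst, i_srcs, j_dst, j_srcs):
    deps = []
    if i_dst in j_srcs:        # J reads what I writes  -> true dependence (RAW hazard)
        deps.append("RAW")
    if j_dst in i_srcs:        # J writes what I reads  -> anti-dependence (WAR hazard)
        deps.append("WAR")
    if j_dst == i_dst:         # J writes what I writes -> output dependence (WAW hazard)
        deps.append("WAW")
    return deps

# I: sub r4,r1,r3   J: add r1,r2,r3  -> J writes r1, which I reads
print(classify("r4", {"r1", "r3"}, "r1", {"r2", "r3"}))   # ['WAR']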

Example. DIV r0 <- r2 / r4; ADD r6 <- r0 + r8; SUB r8 <- r10 - r14; MUL r6 <- r10 * r7; OR r3 <- r5 or r9

Example: RAW. ADD reads r0, which DIV writes (RAW on r0).

Example: WAR. SUB writes r8, which the earlier ADD reads (WAR on r8).

Example: WAW. MUL writes r6, which the earlier ADD also writes (WAW on r6).

For you to practice: DIV r0 <- r2 / r4; ADD r6 <- r0 + r8; ST r1 <- r6; SUB r8 <- r10 - r14; MUL r6 <- r10 * r8

Overcome Data Hazards with Dynamic Scheduling. Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so that the instructions do not conflict. Register renaming resolves name dependences for registers, either by the compiler or by hardware (sketched below).
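
Here is a minimal Python sketch of hardware-style renaming, assuming an unlimited pool of physical registers (p0, p1, ...); it reruns the DIV/ADD/SUB/MUL/OR example and shows that only true (RAW) dependences remain once every write gets a fresh physical name.

# Sketch of register renaming: every new write gets a fresh physical register,
# so WAR and WAW name conflicts disappear and only RAW dependences remain.
from itertools import count

phys = count()                 # unlimited supply of physical registers (idealized)
latest = {}                    # architectural register -> current physical name

def rename(dst, srcs):
    new_srcs = tuple(latest.get(r, r) for r in srcs)   # reads use the newest version
    latest[dst] = f"p{next(phys)}"                     # each write gets a fresh name
    return latest[dst], new_srcs

program = [("DIV", "r0", ("r2", "r4")),
           ("ADD", "r6", ("r0", "r8")),
           ("SUB", "r8", ("r10", "r14")),
           ("MUL", "r6", ("r10", "r7")),
           ("OR",  "r3", ("r5", "r9"))]

for op, dst, srcs in program:
    new_dst, new_srcs = rename(dst, srcs)
    print(f"{op:4s} {new_dst} <- {', '.join(new_srcs)}")

After renaming, ADD and MUL write different physical registers (no WAW on r6), and SUB's write no longer conflicts with ADD's read of r8 (no WAR).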

Limits to ILP. Assumptions for an ideal/perfect machine to start: 1. Register renaming - infinite virtual registers, so all register WAW and WAR hazards are avoided; 2. Branch prediction - perfect, no mispredictions; 3. Perfect caches.
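
Under these assumptions the only remaining limit is true (RAW) data dependence: an instruction can execute one cycle after the last of its producers. A rough Python sketch (unit latencies and a made-up four-instruction program, purely illustrative) of that dataflow limit:

# Sketch of the dataflow limit on ILP: with perfect renaming, prediction, and
# caches, an instruction can run one cycle after the last of its producers.
def dataflow_ilp(program):
    ready = {}                           # register -> cycle its value is ready
    cycles_used = 0
    for dst, srcs in program:
        cycle = 1 + max([0] + [ready.get(r, 0) for r in srcs])
        ready[dst] = cycle
        cycles_used = max(cycles_used, cycle)
    return len(program) / cycles_used    # instructions per cycle

# Hypothetical 4-instruction example: two independent dependence chains of length 2
program = [("r1", ()), ("r2", ("r1",)), ("r3", ()), ("r4", ("r3",))]
print(dataflow_ilp(program))             # 4 instructions / 2 cycles = 2.0 IPC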

Limits to ILP: HW Model comparison
                                Ideal Model    IBM Power 5
Instructions issued per clock   Infinite       4
Renaming registers              Infinite       48 integer + 40 floating point
Branch prediction               Perfect        2% to 6% misprediction
Cache                           Perfect        1.92 MB L2, 36 MB L3

Performance beyond single-thread ILP. There can be much higher natural parallelism in some applications, such as an "online processing system", which has natural parallelism among the multiple queries and updates presented by requests.

Thread-level parallelism (TLP). Thread: a process with its own instructions and data. A thread may be part of a parallel program of multiple processes, or it may be an independent program. Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute.

Thread-level parallelism (TLP). TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel. Goal: use multiple instruction streams to improve 1. the throughput of computers that run many programs and 2. the execution time of multithreaded programs. TLP can be more cost-effective to exploit than ILP.

New Approach: Multithreaded Execution. Multithreading: multiple threads share the functional units of one processor via overlapping. The processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table.
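
A minimal sketch, with illustrative (not architecturally specified) field names, of the per-thread state that must be replicated:

# Sketch of the architectural state duplicated per hardware thread (illustrative).
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    pc: int = 0                                                 # separate program counter
    registers: list = field(default_factory=lambda: [0] * 32)   # separate register file
    page_table_base: int = 0                                    # separate page table for independent programs

contexts = [ThreadContext(pc=0x1000 * i) for i in range(4)]     # e.g., 4 hardware threads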

New Approach: Multithreaded Execution. When to switch? Alternate instructions per thread (fine grain), or, when a thread is stalled, perhaps on a cache miss, execute another thread (coarse grain).

Fine-Grained Multithreading. Switches between threads on each instruction, causing the execution of multiple threads to be interleaved. Usually done in a round-robin fashion, skipping any stalled threads. The CPU must be able to switch threads every clock cycle.
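
An illustrative Python sketch (thread names and stall cycles are made up) of round-robin issue that skips stalled threads each cycle:

# Sketch of fine-grained multithreading: round-robin issue, skipping stalled threads.
stalled_until = {"T0": 0, "T1": 3, "T2": 0, "T3": 0}   # hypothetical cycle each thread becomes ready
threads = list(stalled_until)
next_thread = 0

for cycle in range(6):
    for offset in range(len(threads)):                  # look for the next ready thread, round-robin
        t = threads[(next_thread + offset) % len(threads)]
        if cycle >= stalled_until[t]:
            print(f"cycle {cycle}: issue an instruction from {t}")
            next_thread = (threads.index(t) + 1) % len(threads)
            break
    else:
        print(f"cycle {cycle}: all threads stalled")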

Multithreaded Categories (figure): fine-grained interleaving of issue slots among Threads 1-5, switching every cycle.

Fine-Grained Multithreading. Advantage: it can hide both short and long stalls, since instructions from other threads are executed when one thread stalls. Disadvantage: it slows down the execution of individual threads, since a thread ready to execute without stalls is still delayed by instructions from other threads.

Coarse-Grained Multithreading. Switches threads only on costly stalls, such as cache misses. Advantages: it relieves the need for very fast thread switching, and it doesn't slow down an individual thread, since instructions from other threads are issued only when that thread encounters a costly stall.

Coarse-Grained Multithreading. Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs. Since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen, and the new thread must fill the pipeline before its instructions can complete. Because of this start-up overhead, coarse-grained multithreading is better at reducing the penalty of high-cost stalls, where pipeline refill time << stall time.
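
An illustrative back-of-the-envelope sketch in Python (the 5-cycle refill cost is made up) of why switching only pays off when the stall is much longer than the pipeline refill:

# Sketch of the coarse-grained trade-off: switching hides the stall but costs a
# pipeline refill, so it only helps when refill time << stall time.
REFILL_CYCLES = 5            # hypothetical cycles to refill the pipeline after a thread switch

def cycles_lost(stall_cycles, switch):
    # If we switch, the lost cycles are the refill; if we stay, we wait out the stall.
    return REFILL_CYCLES if switch else stall_cycles

for stall in (2, 5, 50, 300):            # short stall vs. cache-miss-like stalls
    print(f"stall={stall:3d}: lost if we stay={cycles_lost(stall, False):3d}, "
          f"lost if we switch={cycles_lost(stall, True):3d}")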

Multithreaded Categories (figure): coarse-grained interleaving of Threads 1-5, each thread running for a block of cycles before switching (2-clock-cycle switch shown).