1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 3 (and Appendix C) Instruction-Level Parallelism and Its Exploitation Computer Architecture A Quantitative Approach, Fifth Edition

2 In the beginning… Processors fetched and finished executing one instruction in a single cycle: simple architectures (RISC, anyway), relatively low latency, relatively low throughput. Later, processors were pipelined to improve throughput: shorter cycle times, more complicated architectures, increased throughput (but also increased latency).

3 MIPS Pipeline 5 stages: Instruction Fetch, Instruction Decode, Execution, Memory, Writeback

4 MIPS Pipeline (figure)

5 MIPS Pipeline (figure)

6 MIPS Pipeline (figure)

7 Pipeline Hazards Types of pipeline hazards: Structural, Data, Control

8 Structural Hazards Two instructions need the same hardware resource in the same cycle. Solution: separate instruction and data caches, and a multiported register file that can be both read and written in one cycle.

9 Structural Hazards (figure)

10 Data Hazards An instruction requires data produced by a previous instruction that has not yet completed. Solution: "forward" execution and memory-load results immediately to earlier stages of the pipeline so that subsequent instructions can access the data as soon as it is ready. Forwarding still cannot handle a load followed immediately by an instruction that uses the loaded data for computation; that case causes a single-cycle stall.

11 Data Hazards (figure)

12 Data Hazards (figure)

13 Control Hazards The processor fetches an incorrect instruction because a previous branch instruction has not yet completed. Solution: allow the compiler to schedule an unrelated instruction in the slot just after the branch and let it complete whether or not the branch is taken; stall for one cycle if no such instruction can be found. A similar technique works for load instructions.

14 Control Hazards (figure)

15 Exploiting Pipelining Pipelining became a universal technique in 1985. It overlaps the execution of instructions, exploiting "instruction-level parallelism." When exploiting instruction-level parallelism, the goal is to minimize CPI: Pipeline CPI = Ideal pipeline CPI + Structural stalls per instruction + Data hazard stalls per instruction + Control stalls per instruction

16 Instruction-Level Parallelism There are two main approaches to ILP. Compiler-based static approaches: do not require complex hardware; work well for loop-heavy scientific applications. Hardware-based dynamic approaches: require complex hardware; used in server and desktop processors; not used as extensively in PMD processors. Modern processors have multiple execution units with varying latencies; integer units are faster than floating-point units.

17 Instruction-Level Parallelism (figure)

18 Instruction-Level Parallelism (figure)

19 Instruction-Level Parallelism Programs can be thought of as blocks of execution. A "basic block" is a sequence of instructions containing no branches. Parallelism within a basic block is limited: the typical basic block is only 3–6 instructions, so the compiler must optimize across branches.

20 Compiler Techniques for Exposing ILP Pipeline scheduling: use expected instruction latencies to separate dependent instructions. Example: for (i=999; i>=0; i=i-1) x[i] = x[i] + s;

21 Pipeline Stalls
Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall              ;(assume integer op latency is 1)
      BNE    R1,R2,Loop

22 Pipeline Scheduling Scheduled code:
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      S.D    F4,8(R1)
      BNE    R1,R2,Loop

23 Loop Unrolling Unroll by a factor of 4 (assume the number of elements is divisible by 4) and eliminate unnecessary instructions:
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)    ;drop DADDUI & BNE
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)   ;drop DADDUI & BNE
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1) ;drop DADDUI & BNE
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop
(note: stalls not shown; notice the number of live registers vs. the original loop)

24 Loop Unrolling/Pipeline Scheduling Pipeline schedule the unrolled loop to remove stalls:
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop

25 Strip Mining What if the number of loop iterations is unknown? Let the number of iterations be n; the goal is to make k copies of the loop body. Generate a pair of loops: the first executes n mod k times with the original body, the second (unrolled k times) executes n / k times. This is called "strip mining."

26 Branch Prediction Control hazards are more difficult to handle in deeper pipelines: predict the branch, and flush the pipeline if the prediction is incorrect. Simple techniques: predict never taken; predict always taken; one-bit predictor (predict the last outcome); two-bit saturating predictor (flip the prediction only after two consecutive incorrect predictions).

27 2-Bit Saturating Predictor (figure)

28 Branch Prediction Local predictor: allocate an array of 2-bit saturating predictors and use the low-order bits of the PC to select one. Global predictor: allocate an array of 2-bit saturating predictors and use the last n branch outcomes to select one; an (n, m) predictor uses n global bits to select an m-bit predictor. Correlating predictor: allocate an array of 2-bit saturating predictors and use both local and global outcomes to select one; an (n, m) correlating predictor with k local bits uses n + k bits to select from 2^(n + k) m-bit predictors. Tournament predictor: uses a predictor to select between multiple predictor types.

29 Branch Prediction Performance (figure)

30 Branch Prediction Performance (figure: branch predictor performance)

31 Core i7 Branch Predictor The details are mostly confidential. It is a two-level predictor: a small, simple first-level predictor that can make a prediction each cycle, backed by a larger second-level tournament predictor that combines a 2-bit predictor, a global predictor, and a loop-exit predictor which predicts the number of loop iterations.

32 Pitfalls Unexpected instructions can cause unexpected hazards. Extensive pipelining can impact other aspects of the design, leading to overall worse cost-performance.