

COMP381 by M. Hamdi 1 Superscalar Processors

2 Recall from Pipelining
Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
– Ideal pipeline CPI: a measure of the maximum performance attainable by the implementation
– Structural hazards: the hardware cannot support this combination of instructions
– Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps)
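As a rough illustration of the CPI equation above, the following Python sketch adds average stall cycles per instruction to an ideal CPI. The function name and the sample stall values are assumptions chosen for illustration; they are not figures from the lecture.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls


# Hypothetical stall rates (stall cycles per instruction), for illustration only.
cpi = pipeline_cpi(1.0, structural_stalls=0.05, data_hazard_stalls=0.20, control_stalls=0.15)
ipc = 1.0 / cpi  # instructions per clock is the reciprocal of CPI
print(f"CPI = {cpi:.2f}, IPC = {ipc:.2f}")
```

Every stall term pushes the CPI above the ideal value, which is why the techniques on the following slides each target one term.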

3 Techniques to Reduce Stalls and Increase ILP
Hardware schemes to reduce structural hazards:
– Memory: separate instruction and data memories
– Registers: write in the first half of the cycle and read in the second half
[Figure: pipeline datapath with Mem, Reg, and ALU units]

4 Techniques to Reduce Stalls and Increase ILP
Hardware schemes to reduce data hazards:
– Forwarding
[Figure: forwarding datapath with MUX, Zero? test, ALU, data memory, and the D/A, A/M, and M/W buffers]

5 Techniques to Reduce Stalls and Increase ILP
Hardware schemes to reduce control hazards:
– Move the calculation of the branch target earlier in the pipeline

6 Techniques to Reduce Stalls and Increase ILP
Hardware schemes to increase ILP:
– Scoreboarding: allows out-of-order execution of instructions

7 Techniques to Reduce Stalls and Increase ILP
Hardware schemes to increase ILP:
– Scoreboarding: allows out-of-order execution of instructions
Instruction status:
Instruction | j | k | Issue | Read operands | Execution complete | Write result
L.D F6 | 34+ | R2 | 1 | 2 | 3 | 4
L.D F2 | 45+ | R3 | 5 | 6 | 7 | 8
MUL.D F0 | F2 | F… | | | |
SUB.D F8 | F6 | F… | | | |
DIV.D F10 | F0 | F… | | | |
ADD.D F6 | F8 | F… | | | |
We have: in-order issue, out-of-order execution and commit.

8 Techniques to Reduce Stalls and Increase ILP
Hardware schemes to reduce:
– Data hazards: techniques similar to scoreboarding but more advanced (e.g., register renaming)
– Control hazards: dynamic branch prediction (using buffer lookup schemes)

9 Techniques to Reduce Stalls and Increase ILP
Software schemes to reduce data hazards:
– Compiler scheduling: reduce load stalls
Original code with stalls:
    LD   Rb,b
    LD   Rc,c
    DADD Ra,Rb,Rc
    SD   Ra,a
    LD   Re,e
    LD   Rf,f
    DSUB Rd,Re,Rf
    SD   Rd,d
Scheduled code with no stalls:
    LD   Rb,b
    LD   Rc,c
    LD   Re,e
    DADD Ra,Rb,Rc
    LD   Rf,f
    SD   Ra,a
    DSUB Rd,Re,Rf
    SD   Rd,d
In the original code, a stall occurs wherever an instruction uses the result of the load immediately before it.
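The effect of the scheduling above can be checked with a small Python sketch. It assumes a classic five-stage pipeline with forwarding, in which an instruction that reads a register loaded by the immediately preceding LD stalls for one cycle; the tuple encoding and the function name are illustrative choices, not part of the lecture.

```python
def load_use_stalls(program):
    """Count one stall for each instruction that reads a register written by the
    LD immediately before it (assumed five-stage pipeline with forwarding)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "LD" and prev_dest in curr_srcs:
            stalls += 1
    return stalls


# Each entry: (opcode, destination register or None, source registers)
original = [
    ("LD", "Rb", []), ("LD", "Rc", []),
    ("DADD", "Ra", ["Rb", "Rc"]), ("SD", None, ["Ra"]),
    ("LD", "Re", []), ("LD", "Rf", []),
    ("DSUB", "Rd", ["Re", "Rf"]), ("SD", None, ["Rd"]),
]
scheduled = [
    ("LD", "Rb", []), ("LD", "Rc", []), ("LD", "Re", []),
    ("DADD", "Ra", ["Rb", "Rc"]), ("LD", "Rf", []),
    ("SD", None, ["Ra"]), ("DSUB", "Rd", ["Re", "Rf"]), ("SD", None, ["Rd"]),
]

print("original stalls: ", load_use_stalls(original))
print("scheduled stalls:", load_use_stalls(scheduled))
```

Under this model the compiler removes the stalls purely by reordering independent instructions; no instruction is added or removed.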

10 Techniques to Reduce Stalls and Increase ILP
Software schemes to reduce data hazards:
– Compiler scheduling: register renaming to eliminate WAW and WAR hazards
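A minimal sketch of register renaming, written in Python for illustration. The instruction sequence, the temporary names T0, T1, and so on, and the `rename` helper are all hypothetical; the point is only that giving every result a fresh name leaves true (RAW) dependences as the sole ordering constraint.

```python
from itertools import count


def rename(program):
    """Give every destination a fresh name so only true (RAW) dependences remain."""
    fresh = (f"T{i}" for i in count())
    mapping = {}   # architectural register -> most recent renamed register
    renamed = []
    for op, dest, srcs in program:
        # Sources must be translated before the destination is remapped.
        new_srcs = [mapping.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Hypothetical sequence: (opcode, destination or None, sources)
program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D", None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),   # WAR with ADD.D on F8
    ("MUL.D", "F6", ["F10", "F8"]),    # WAW with ADD.D on F6
]
renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, SUB.D and MUL.D write registers that no earlier instruction reads or writes, so they no longer have to wait for ADD.D.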

11 Techniques to Reduce Stalls and Increase ILP
Software schemes to reduce control hazards:
– Branch prediction
  – Example: predict backward branches (loops) as taken and forward branches (if statements) as not taken
  – Tracing program behaviour
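The backward-taken, forward-not-taken rule can be written in a few lines. This Python sketch assumes a branch is described by its own address and its target address; the function name and the sample addresses are made up for illustration.

```python
def predict_taken(branch_pc, target_pc):
    """Static heuristic: a branch whose target is at or before the branch itself
    (a loop back edge) is predicted taken; a forward branch is predicted not taken."""
    return target_pc <= branch_pc


# Hypothetical addresses, for illustration only.
loop_back_edge = predict_taken(branch_pc=0x40, target_pc=0x20)
forward_if = predict_taken(branch_pc=0x40, target_pc=0x60)
print(loop_back_edge, forward_if)
```

The heuristic needs no run-time history, which is what makes it a software (static) scheme.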

12 Techniques to Reduce Stalls and Increase ILP
Software schemes to reduce control hazards:
– Loop unrolling
[Figure: loop unrolling; labels: 4n iterations, n iterations, 4 iterations]
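To make the idea of unrolling concrete, here is a small Python sketch of a loop unrolled by a factor of 4. The function names and the add-a-scalar loop body are assumptions for illustration; in practice a compiler performs this transformation on the machine-level loop.

```python
def add_scalar(x, s):
    """Rolled loop: one element, and one loop-control check, per iteration."""
    for i in range(len(x)):
        x[i] += s


def add_scalar_unrolled4(x, s):
    """Same loop unrolled by 4: the loop-control check runs once per 4 elements."""
    n = len(x)
    i = 0
    while i + 4 <= n:
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += 4
    while i < n:   # cleanup loop when n is not a multiple of 4
        x[i] += s
        i += 1


a = list(range(10))
b = list(range(10))
add_scalar(a, 5)
add_scalar_unrolled4(b, 5)
print(a == b)
```

Unrolling reduces the number of branches executed and gives the scheduler a larger block of independent instructions to reorder.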

13 Techniques to Reduce Stalls and Increase ILP
Software schemes to reduce control hazards:
– Increase loop parallelism
    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }
– The loop can be made parallel by replacing the code with the following:
    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
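The two versions of the loop above can be checked for equivalence with a short Python sketch. The array contents and the helper names are arbitrary assumptions; index 0 is left unused so that the 1-based indices match the slide.

```python
def original(A, B, C, D):
    for i in range(1, 101):
        A[i] = A[i] + B[i]        # S1 reads B[i], written by S2 one iteration earlier
        B[i + 1] = C[i] + D[i]    # S2


def transformed(A, B, C, D):
    A[1] = A[1] + B[1]
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]   # dependence is now within one iteration
    B[101] = C[100] + D[100]


def make_arrays():
    """Arbitrary test data; 102 entries so that indices 1..101 are valid."""
    return ([3 * i for i in range(102)], [i + 7 for i in range(102)],
            [2 * i for i in range(102)], [i * i for i in range(102)])


A1, B1, C1, D1 = make_arrays()
A2, B2, C2, D2 = make_arrays()
original(A1, B1, C1, D1)
transformed(A2, B2, C2, D2)
print(A1 == A2 and B1 == B2)
```

In the transformed loop no statement depends on a value produced in an earlier iteration, so the iterations can run in parallel.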

14 Using these Hardware and Software Techniques
Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
– The best we can achieve is to get close to the ideal CPI = 1
– In practice, throughput is around 0.9 instructions per cycle (a CPI slightly above 1)
– This is because we can issue only one instruction per clock cycle to the pipeline
How can we do better?

15 A Model of an Ideal Processor
– No structural hazards
– Register renaming: infinite registers, so all WAW and WAR hazards are avoided
– Perfect prediction:
  – Branch prediction: perfect, no mispredictions
  – Jump prediction: all jumps perfectly predicted
– Only true data dependences are left, for example:
    I: add r1,r2,r3
    J: sub r4,r1,r3
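The distinction between true dependences and the name dependences that renaming removes can be sketched as follows. The `classify` helper and the (destination, sources) encoding are illustrative assumptions; the I and J instructions are the ones from the slide.

```python
def classify(first, second):
    """Return the hazards that a later instruction `second` has on `first`.
    Each instruction is (destination register, list of source registers)."""
    d1, s1 = first
    d2, s2 = second
    hazards = set()
    if d1 in s2:
        hazards.add("RAW")   # true dependence: second reads what first writes
    if d2 in s1:
        hazards.add("WAR")   # anti-dependence: second overwrites a source of first
    if d1 == d2:
        hazards.add("WAW")   # output dependence: both write the same register
    return hazards


I = ("r1", ["r2", "r3"])   # add r1,r2,r3
J = ("r4", ["r1", "r3"])   # sub r4,r1,r3
print(classify(I, J))
```

For I and J the only hazard is RAW on r1, which no amount of renaming can remove because J genuinely needs the value I computes.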

16 Upper Bound on ILP
[Figure: upper bound on ILP for integer and FP programs]

17 More Realistic: Branch Impact
[Figure: impact of branch prediction on ILP. Window: 2000 instructions; maximum issue: 64 instructions per cycle]

18 Renaming Register Impact
[Figure: impact of the number of renaming registers on ILP. Window: 2000 instructions; maximum issue: 64 instructions per cycle]

19 Window Impact
[Figure: impact of window size on ILP. Window: 200 instructions; maximum issue: 64 instructions per cycle]

20 How Do We Take Advantage of This Large Amount of ILP?
– Superscalar processors
– VLIW (Very Long Instruction Word) processors
All high-performance modern processors (e.g., Pentium, Sparc, Itanium) use one of these techniques.

21 Evolution of Processor Performance
[Figure: evolution from multi-cycle (CPI > (?)), to pipelined (single issue), to multiple issue (CPI < 1): superscalar/VLIW]

22 Multiple Instruction Issue: CPI < 1
To improve a pipeline’s CPI to be better (less) than one, and to utilize ILP better, a number of independent instructions have to be issued in the same pipeline cycle.
The anticipated success of multiple issue led to the use of Instructions Per Clock cycle (IPC) instead of CPI.
Multiple-instruction-issue processors are of two types:
– Superscalar: a number of instructions (2-8) are issued in the same cycle, scheduled statically by the compiler or dynamically (scoreboarding, Tomasulo). Examples: Pentium, PowerPC, Sun UltraSparc, Alpha, HP

23 Multiple Instruction Issue: CPI < 1
– VLIW (Very Long Instruction Word): a fixed number of instructions (3-16) are formatted as one long instruction word or packet, statically scheduled by the compiler.
  – Joint HP/Intel (Itanium).
  – Intel Architecture-64 (IA-64) 64-bit processor: Explicitly Parallel Instruction Computer (EPIC), Itanium.
Limitations of the approaches:
– Available ILP in the program (both).
– Specific hardware implementation difficulties (superscalar).
– Optimal compiler design issues (VLIW).

24 Simple Statically Scheduled Superscalar Pipeline
– Two instructions can be issued per cycle (two-issue superscalar).
– One of the instructions is an integer instruction (including load/store and branch); the other is a floating-point operation.
  – This restriction reduces the complexity of hazard checking.
  – Fetch 64 bits per clock cycle; integer instruction on the left, FP instruction on the right.
  – The 2nd instruction can issue only if the 1st instruction issues.
– The hardware must fetch and decode two instructions per cycle. It then determines whether zero (a stall), one, or two instructions can be issued in that cycle.
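The zero, one, or two decision described above can be sketched in Python. The two boolean parameters are hypothetical summaries of the hazard checks for each slot; real issue logic derives them from opcodes and register specifiers.

```python
def instructions_issued(first_can_issue, second_can_issue):
    """Number of instructions issued this cycle in the two-issue pipeline.
    The second (FP) instruction may issue only if the first (integer) one does."""
    if not first_can_issue:
        return 0   # stall: nothing issues this cycle
    return 2 if second_can_issue else 1


for first in (False, True):
    for second in (False, True):
        print(first, second, instructions_issued(first, second))
```

Issuing in order like this keeps the hardware simple, at the cost of wasting the FP slot whenever the integer instruction stalls.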

25 Simple Statically Scheduled Superscalar Pipeline
[Figure: 2-issue pipeline (integer and FP). Rows alternate between an integer instruction and an FP instruction; columns show the pipe stages IF, ID, EX, MEM, WB for each instruction type]

26 Unrolled Loop that Minimizes Stalls for Scalar
 1  Loop: LD   F0,0(R1)
 2        LD   F6,-8(R1)
 3        LD   F10,-16(R1)
 4        LD   F14,-24(R1)
 5        ADDD F4,F0,F2
 6        ADDD F8,F6,F2
 7        ADDD F12,F10,F2
 8        ADDD F16,F14,F2
 9        SD   0(R1),F4
10        SD   -8(R1),F8
11        SD   -16(R1),F12
12        SUBI R1,R1,#32
13        BNEZ R1,LOOP
14        SD   8(R1),F16     ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles

27 Loop Unrolling in Superscalar
Integer instruction | FP instruction | Clock cycle
Loop: LD F0,0(R1) | | 1
LD F6,-8(R1) | | 2
LD F10,-16(R1) | ADDD F4,F0,F2 | 3
LD F14,-24(R1) | ADDD F8,F6,F2 | 4
LD F18,-32(R1) | ADDD F12,F10,F2 | 5
SD 0(R1),F4 | ADDD F16,F14,F2 | 6
SD -8(R1),F8 | ADDD F20,F18,F2 | 7
SD -16(R1),F12 | | 8
SD -24(R1),F16 | | 9
SUBI R1,R1,#40 | | 10
BNEZ R1,LOOP | | 11
SD 8(R1),F20 ; 8-40 = -32 | | 12
12 clocks, or 2.4 clocks per iteration

28 Multiple Issue Challenges
While the integer/FP split is simple for the hardware, it achieves a CPI of 0.5 only for programs with:
– Exactly 50% FP operations, and
– No hazards
If more instructions issue at the same time, decode and issue become more difficult:
– Even a 2-issue superscalar must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue.
Reducing the stalls becomes extremely difficult:
– Use all the techniques we covered, and more advanced ones.

29 VLIW Processors
Very Long Instruction Word (VLIW) processors:
– Trade instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word can execute in parallel
– E.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch: at 16 to 24 bits per field, the word is 7*16 = 112 bits to 7*24 = 168 bits wide
– Needs a compiling technique that identifies the instructions to be put in the long instruction word
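The instruction-word widths quoted above follow directly from the slot count. This Python sketch reproduces the arithmetic; the dictionary of slots mirrors the example on the slide, and the names are illustrative.

```python
# Operation slots in the example long instruction word from the slide.
SLOTS = {"integer": 2, "fp": 2, "memory": 2, "branch": 1}


def word_width(bits_per_field):
    """Total width in bits when each operation field is bits_per_field wide."""
    return sum(SLOTS.values()) * bits_per_field


narrow = word_width(16)
wide = word_width(24)
print(narrow, wide)
```

Any slot the compiler cannot fill still occupies its field, which is the instruction-space cost traded for simple decoding.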

30 Loop Unrolling in VLIW
Memory reference 1 | Memory reference 2 | FP operation 1 | FP op. 2 | Int. op/branch | Clock
LD F0,0(R1) | LD F6,-8(R1) | | | | 1
LD F10,-16(R1) | LD F14,-24(R1) | | | | 2
LD F18,-32(R1) | LD F22,-40(R1) | ADDD F4,F0,F2 | ADDD F8,F6,F2 | | 3
LD F26,-48(R1) | | ADDD F12,F10,F2 | ADDD F16,F14,F2 | | 4
 | | ADDD F20,F18,F2 | ADDD F24,F22,F2 | | 5
SD 0(R1),F4 | SD -8(R1),F8 | ADDD F28,F26,F2 | | | 6
SD -16(R1),F12 | SD -24(R1),F16 | | | | 7
SD -32(R1),F20 | SD -40(R1),F24 | | | SUBI R1,R1,#48 | 8
SD -0(R1),F28 | | | | BNEZ R1,LOOP | 9
Unrolled 7 times to avoid delays.
7 results in 9 clocks, or 1.3 clocks per iteration.
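The three unrolled schedules can be compared with a short Python sketch. The totals of 14 and 12 clocks and the unroll factors of 4 and 5 are assumptions inferred from the 3.5 and 2.4 clocks per iteration quoted on the earlier slides; the VLIW figures (7 results in 9 clocks) are taken from the slide above.

```python
# (total clocks, loop iterations covered by one pass of the unrolled body)
schedules = {
    "scalar": (14, 4),                # assumed: 14 clocks for 4 iterations
    "2-issue superscalar": (12, 5),   # assumed: 12 clocks for 5 iterations
    "VLIW": (9, 7),                   # from the slide: 7 results in 9 clocks
}


def clocks_per_iteration(total_clocks, iterations):
    return total_clocks / iterations


per_iter = {name: clocks_per_iteration(*v) for name, v in schedules.items()}
for name, value in per_iter.items():
    speedup = per_iter["scalar"] / value
    print(f"{name}: {value:.1f} clocks per iteration, {speedup:.2f}x vs. scalar")
```

The wider the issue, the more unrolling is needed to find enough independent operations to fill the slots.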