
1 Recap Superscalar and VLIW Processors

2 A Model of an Ideal Processor
Provides a baseline for ILP measurements:
– No structural hazards
– Register renaming: infinite virtual registers, so all WAW and WAR hazards are avoided
– Perfect speculation
– Branch prediction: perfect, no mispredictions
– Jump prediction: all jumps perfectly predicted
Only true data dependences are left, and these cannot be avoided.
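Under these assumptions, only RAW (true) dependences constrain scheduling, so the ILP upper bound is the instruction count divided by the length of the longest dependence chain. A minimal sketch of that measurement (a hypothetical toy, not a real trace simulator):

```python
# Ideal-processor ILP sketch: perfect prediction, infinite renaming, no
# structural hazards. Only true (RAW) dependences remain, so the ILP upper
# bound = instruction count / dataflow critical-path length.

def ideal_ilp(trace):
    """trace: list of (dests, sources) register-name tuples in program order."""
    ready = {}   # register -> cycle its value becomes available
    depth = 0    # length of the longest true-dependence chain seen so far
    for dests, srcs in trace:
        # An instruction issues as soon as all its source values are ready.
        cycle = 1 + max((ready.get(r, 0) for r in srcs), default=0)
        for r in dests:
            ready[r] = cycle   # renaming removes WAW/WAR, so plain overwrite
        depth = max(depth, cycle)
    return len(trace) / depth

# Four independent ops followed by one combining op: 5 instructions on a
# 2-cycle critical path.
trace = [(("r1",), ()), (("r2",), ()), (("r3",), ()), (("r4",), ()),
         (("r5",), ("r1", "r2"))]
print(ideal_ilp(trace))   # 2.5
```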

3 Upper Bound on ILP

4 More Realistic HW: Branch Impact
Window: 2000 instructions
Max issue: 64 instructions/cycle
Many registers

5 Renaming Register Impact
Window: 2000 instructions
Max issue: 64 instructions/cycle

6 Window Impact
Max issue: 64 instructions/cycle
64 renaming registers

7 How do we take advantage of this much ILP?
– Superscalar processors
– VLIW (Very Long Instruction Word) processors
All modern high-performance processors (e.g., Pentium, SPARC, Itanium) use one of these techniques.

8 Superscalar Pipelines
A pipeline that can complete more than one instruction per cycle is called a superscalar pipeline. We already know how to build pipelines with multiple functional units, so more than one instruction can be in execution at once. If we can also issue more than one instruction into the pipe per cycle, then it becomes possible to complete more than one instruction per cycle. This implies that we must fetch and decode two or more instructions per cycle.

9 Multiple Issue Processors
Superscalar processors
– Variable number of instructions per clock cycle
– Scheduled statically: a compiler technique; instructions execute in program order
– Scheduled dynamically: scoreboarding / Tomasulo's algorithm; instructions execute out of order
VLIW (Very Long Instruction Word) / EPIC
– Fixed number of instructions, formatted either as one long instruction or as a fixed instruction packet with the parallelism among instructions made explicit (EPIC: Explicitly Parallel Instruction Computing)
– Statically scheduled by the compiler

10 Multiple-Issue Processor Types

Common name             Issue      Hazard     Scheduling        Distinguishing                Examples
                        structure  detection                    characteristics
Superscalar (static)    dynamic    HW         static            in-order execution            Sun UltraSPARC
Superscalar (dynamic)   dynamic    HW         dynamic           some out-of-order execution   IBM Power2
Superscalar             dynamic    HW         dynamic with      out-of-order execution        Pentium III/4, Alpha,
  (speculative)                               speculation       with speculation              HP PA8500, IBM RS64III
VLIW/LIW                static     SW         static            no hazards between            Trimedia, i860
                                                                issue packets
EPIC                    mostly     mostly SW  mostly static     explicit dependences          Itanium
                        static                                  marked by compiler

11 Superscalar
0–8 instructions per cycle
Static scheduling: all pipeline hazards are checked, and instructions issue in order. The pipeline control logic checks for hazards between the instructions already in the execution phase and the newly fetched instruction sequence; when a hazard is found, only the instructions preceding it in the instruction sequence are issued. All instructions in an issue packet are checked at the same time by the issue hardware.
[Instruction Memory → Issue Packet → Issue HW → Pipeline]
Because of this hardware's complexity, the issue stage is itself pipelined in all dynamic superscalar systems.

12 Example: Superscalar of Degree 3
Stages: fetch, decode, execute, write back (three instructions enter each stage per cycle).
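The timing implied by a degree-3 machine can be sketched as follows (an illustrative toy that assumes no hazards, not a real pipeline model): groups of three instructions march through the four stages together, one cycle apart per group.

```python
# Hazard-free timing sketch for the 4-stage, degree-3 superscalar above:
# instructions i, i+1, i+2 share a start cycle; each group is one cycle
# behind the previous one.

STAGES = ["fetch", "decode", "execute", "write back"]

def timing(n_instr, width=3):
    rows = []
    for i in range(n_instr):
        start = i // width               # which issue group this instr is in
        rows.append([start + s for s in range(len(STAGES))])
    return rows

for i, row in enumerate(timing(6)):
    print(f"I{i}:", {STAGES[s]: c for s, c in enumerate(row)})
# I0..I2 occupy cycles 0-3; I3..I5 follow one cycle behind.
```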

13 A Superscalar MIPS
– Issue 2 instructions simultaneously: 1 FP and 1 integer
– Fetch two instructions per clock cycle: one integer and one FP
– Can only issue the 2nd instruction if the 1st instruction issues
– Need more ports to the register file

Type  Pipe stages
Int.  IF  ID  EX  MEM WB
FP    IF  ID  EX  MEM WB
Int.      IF  ID  EX  MEM WB
FP        IF  ID  EX  MEM WB
Int.          IF  ID  EX  MEM WB
FP            IF  ID  EX  MEM WB

14 Limits to Superscalar Execution
– Scheduling is difficult within the constraints of the number of functional units and the ILP available in the code chunk
– Instruction decode complexity increases with the number of issued instructions
– Data and control dependences are in general more costly in a superscalar processor than in a single-issue processor
– Techniques that enlarge the instruction window, to extract more ILP, are important

15 Some VLIW Characteristics
– It can be hard to find enough parallelism: n functional units with k pipeline stages require n × k independent instructions to stay busy
– High memory and register bandwidth
– Complexity increases with the number of functional units
– Large code size
– Relies heavily on compiler technology

16 Unrolled Loop that Minimizes Stalls for 1-Issue Pipelines
1  Loop: LD    F0,0(R1)
2        LD    F6,-8(R1)
3        LD    F10,-16(R1)
4        LD    F14,-24(R1)
5        ADDD  F4,F0,F2
6        ADDD  F8,F6,F2
7        ADDD  F12,F10,F2
8        ADDD  F16,F14,F2
9        SD    0(R1),F4
10       SD    -8(R1),F8
11       SD    -16(R1),F12
12       SUBI  R1,R1,#32
13       BNEZ  R1,LOOP
14       SD    8(R1),F16   ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles
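The MIPS loop above computes x[i] = x[i] + F2 over an array. A minimal Python sketch of the same 4-way unrolling transformation a compiler performs (illustrative only, not actual compiler output):

```python
# Loop unrolling sketch: the unrolled body holds four independent add/store
# pairs, which a scheduler can interleave just as slide 16 interleaves the
# LD/ADDD/SD instructions to hide latencies.

def scaled_add(x, s):
    for i in range(len(x)):
        x[i] += s
    return x

def scaled_add_unrolled(x, s):
    i, n = 0, len(x)
    while i + 4 <= n:        # four independent operations per loop trip
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += 4               # one counter update per 4 elements (cf. SUBI #32)
    while i < n:             # cleanup loop when n is not a multiple of 4
        x[i] += s
        i += 1
    return x

print(scaled_add_unrolled([1.0] * 6, 2.0))   # [3.0, 3.0, 3.0, 3.0, 3.0, 3.0]
```

The payoff is the same as on the slide: fewer loop-overhead instructions (counter update, branch) per element, and more independent work visible to the scheduler.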

17 Loop Unrolling in Superscalar
      Integer instruction   FP instruction     Clock cycle
Loop: LD   F0,0(R1)                            1
      LD   F6,-8(R1)                           2
      LD   F10,-16(R1)      ADDD F4,F0,F2      3
      LD   F14,-24(R1)      ADDD F8,F6,F2      4
      LD   F18,-32(R1)      ADDD F12,F10,F2    5
      SD   0(R1),F4         ADDD F16,F14,F2    6
      SD   -8(R1),F8        ADDD F20,F18,F2    7
      SD   -16(R1),F12                         8
      SD   -24(R1),F16                         9
      SUBI R1,R1,#40                           10
      BNEZ R1,LOOP                             11
      SD   -32(R1),F20                         12
12 clocks, or 2.4 clocks per iteration

18 Multiple Issue Challenges
While the integer/FP split is simple for the HW, a CPI of 0.5 is reached only for programs with:
– Exactly 50% FP operations, AND
– No hazards
If more instructions issue at the same time, decode and issue become harder:
– Even a 2-scalar design must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue
Reducing the stalls becomes extremely difficult; use all the techniques we covered, and more advanced ones.
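The 50%-FP condition above can be made concrete with a toy issue model (an illustrative sketch that ignores hazards, not a real issue-logic simulator): pair one integer and one FP instruction per cycle when possible, otherwise issue a single instruction.

```python
# Toy dual-issue model for the integer/FP split: a cycle issues two
# instructions only when an int/FP pair is available at the head of the
# in-order stream; otherwise it issues one. Hazards are ignored.

def dual_issue_cpi(kinds):
    """kinds: in-order list of 'int'/'fp' instruction types."""
    cycles = i = 0
    while i < len(kinds):
        if i + 1 < len(kinds) and {kinds[i], kinds[i + 1]} == {"int", "fp"}:
            i += 2           # perfect pairing: two instructions this cycle
        else:
            i += 1           # no int/FP pair at the head: single issue
        cycles += 1
    return cycles / len(kinds)

print(dual_issue_cpi(["int", "fp"] * 4))   # 0.5: perfectly alternating mix
print(dual_issue_cpi(["int"] * 8))         # 1.0: no FP work to pair with
```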

19 VLIW Processors
Very Long Instruction Word (VLIW) processors trade instruction space for simple decoding:
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word can execute in parallel
– E.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch
– At 16 to 24 bits per field, 7 fields give an instruction 7 × 16 = 112 to 7 × 24 = 168 bits wide
– Needs compiler techniques that identify the instructions to pack together

20 Loop Unrolling in VLIW
Memory ref 1     Memory ref 2     FP op 1           FP op 2           Int. op/branch   Clock
LD F0,0(R1)      LD F6,-8(R1)                                                          1
LD F10,-16(R1)   LD F14,-24(R1)                                                        2
LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2     ADDD F8,F6,F2                      3
LD F26,-48(R1)                    ADDD F12,F10,F2   ADDD F16,F14,F2                    4
                                  ADDD F20,F18,F2   ADDD F24,F22,F2                    5
SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2                                      6
SD -16(R1),F12   SD -24(R1),F16                                                        7
SD -32(R1),F20   SD -40(R1),F24                                       SUBI R1,R1,#48   8
SD -0(R1),F28                                                         BNEZ R1,LOOP     9
Unrolled 7 times to avoid delays; 9 clocks, or 1.3 clocks per iteration

21 Commercial Superscalar and VLIW Processors

22 Pentium 4 Pipeline Stages vs. Pentium 3 Pipeline Stages

Typical P6 pipeline (10 stages):
1 Fetch, 2 Fetch, 3 Decode, 4 Decode, 5 Decode, 6 Rename, 7 ROB Rd, 8 Rdy/Sch, 9 Dispatch, 10 Exec

Typical Pentium 4 pipeline (20 stages):
1-2 TC Nxt IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename, 9 Que, 10-12 Sch, 13-14 Disp, 15-16 RF, 17 Ex, 18 Flgs, 19 BrCk, 20 Drive

23 Pentium 3 Pipeline Architecture
– It is a 3-way issue superscalar
– It has 5 execution units (integer ALU, integer multiply, FP multiply, FP add, FP divide)

24 Pentium 3 Pipeline Stages
1-2  Fetch
3-5  Decode
6    Rename registers
7    ROB (reordering instructions)
8    Rdy/Sch (scheduling instructions to be executed)
9    Dispatch
10   Exec

25 Pentium 4 Pipeline Stages
Stage  Work
1-2    Trace cache next instruction pointer
3-4    Trace cache fetch
5      Drive
6      Allocation
7-8    Rename
9      Queue
10-12  Schedule
13-14  Dispatch
15-16  Register files
17     Execute
18     Flags
19     Branch check
20     Drive
Increasing the number of pipeline stages increases the clock frequency: it took the industry 28 years to hit 1 GHz and only 18 months more to reach 2 GHz. The price paid for deeper pipelines is that it becomes very difficult to avoid stalls (that is why, when the Pentium 4 was introduced, its performance was worse than the Pentium 3's). It is a 5-issue superscalar processor.
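The frequency/stall tradeoff described above can be sketched with a toy model (all numbers are assumptions chosen for illustration, not Intel measurements): clock frequency scales with pipeline depth, but so does the branch-misprediction flush penalty, so doubling the depth delivers less than double the performance.

```python
# Toy depth-vs-stalls model: deeper pipeline -> higher clock, but also a
# longer flush on each branch misprediction. All parameters are assumed
# illustrative values, not measured data.

def time_per_instr(stages, ghz_per_10_stages=1.0,
                   mispredict_rate=0.05, branch_frac=0.2):
    freq = ghz_per_10_stages * stages / 10   # clock scales with depth
    penalty = stages                          # flush cost ~ pipeline length
    cpi = 1 + branch_frac * mispredict_rate * penalty
    return cpi / freq                         # ns per instruction

print(time_per_instr(10))   # shallower pipeline (P6-like depth)
print(time_per_instr(20))   # deeper pipeline (Pentium 4-like depth)
```

With these assumed numbers the 20-stage machine is faster, but by well under the naive 2x the clock alone would suggest; with a worse branch mix or predictor the deeper pipeline can lose outright, matching the slide's observation about the early Pentium 4.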

26 TC Nxt IP: Trace cache next instruction pointer
Pointer indicating the location of the next instruction.
[Pentium 4 block diagram with this stage highlighted]

27 TC Fetch: Trace cache fetch
Read the decoded instructions (uOPs).
[Pentium 4 block diagram with this stage highlighted]

28 Drive: Wire delay
Drive the uOPs to the allocator.
[Pentium 4 block diagram with this stage highlighted]

29 Alloc: Allocate the resources required for execution
The resources include load buffers, store buffers, etc.
[Pentium 4 block diagram with this stage highlighted]

30 Rename: Register renaming
[Pentium 4 block diagram with this stage highlighted]

31 Que: Write into the uOP queue
uOPs are placed into the queues, where they are held until there is room in the schedulers.
[Pentium 4 block diagram with this stage highlighted]

32 Sch: Schedule
Write into the schedulers and compute dependencies; watch for the dependencies to resolve.
[Pentium 4 block diagram with this stage highlighted]

33 Disp: Dispatch
Send the uOPs to the appropriate execution unit.
[Pentium 4 block diagram with this stage highlighted]

34 RF: Register file
Read the register file; these are the source(s) for the pending operation (ALU or other).
[Pentium 4 block diagram with this stage highlighted]

35 Ex: Execute
Execute the uOPs on the appropriate execution port.
[Pentium 4 block diagram with this stage highlighted]

36 Flgs: Flags
Compute flags (zero, negative, etc.); these are typically input to a branch instruction.
[Pentium 4 block diagram with this stage highlighted]

37 Br Ck: Branch check
The branch operation compares the actual branch direction with the prediction.
[Pentium 4 block diagram with this stage highlighted]

38 Drive: Wire delay
Drive the result of the branch check to the front end of the machine.
[Pentium 4 block diagram with this stage highlighted]

39 Commercial EPIC Processors Itanium

40 Itanium® Processor Family Architecture
EPIC: Explicitly Parallel Instruction Computing
– Instruction encoding: bundles and templates
– Large register resources: 128 integer, 128 floating point
– Support for software pipelining, predication, and speculation (control, data, load)

41 EPIC: Explicitly Parallel Instruction Computing
– Focused on parallel execution
– Instructions are issued in bundles
– Instructions are distributed among the processor's execution units according to type
– Currently up to two complete bundles can be dispatched per clock cycle
– Pipeline stages: 10 (Itanium® 1), 8 (Itanium® 2)

42

43 Instruction Format: Bundles and Templates
– Bundle: a set of three instructions (41 bits each)
– Template: identifies the types of the instructions in the bundle

44 Instruction Format: Bundles and Templates
Instruction types:
– M: Memory
– I: Shifts and multimedia
– A: Integer arithmetic and logical unit
– B: Branch
– F: Floating point
– L+X: Long (move, branch, …)
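The arithmetic behind the 128-bit bundle follows directly from slide 43: three 41-bit instruction slots plus a 5-bit template field. A small sketch (the packer is a hypothetical toy encoding for illustration, not the real Itanium bit layout):

```python
# Bundle width arithmetic: 3 slots x 41 bits + 5-bit template = 128 bits.

INSTR_BITS, SLOTS, TEMPLATE_BITS = 41, 3, 5
bundle_bits = SLOTS * INSTR_BITS + TEMPLATE_BITS
print(bundle_bits)   # 128

def pack_bundle(template, slots):
    """Toy packer (illustrative bit layout, not the architected one):
    template in the low 5 bits, then three 41-bit slots."""
    assert len(slots) == SLOTS and all(s < 2 ** INSTR_BITS for s in slots)
    word = template & (2 ** TEMPLATE_BITS - 1)
    for i, s in enumerate(slots):
        word |= s << (TEMPLATE_BITS + i * INSTR_BITS)
    return word

w = pack_bundle(0b10001, [1, 2, 3])
print(w.bit_length() <= 128)   # the packed word fits in one 128-bit bundle
```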

45 Bundle Templates
– Not all combinations of A, I, M, F, B, L, and X are permitted
– Group "stops" are explicitly encoded as part of the template; you can't stop just anywhere
– Some bundles are identical except for the group stop

46 From Handwritten Code to Instruction Bundles
Handwritten code:
instr instr ;; instr instr ;; instr instr instr instr ;; instr instr ;; instr …
The code generator creates instruction bundles, possibly including nops:
instr instr instr tmpl | instr instr nop tmpl | instr nop nop tmpl | instr instr nop tmpl | instr instr instr tmpl | … | instr instr instr tmpl
The bundles are then fetched and executed. Itanium® fetches 2 bundles at a time for execution; the bundle pair may or may not execute in parallel. There are two difficulties:
1) Finding instruction triplets matching the defined templates.
2) Matching pairs of bundles that can execute in parallel.
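The code generator's job described above can be sketched as a greedy packer (the template set below is a small illustrative subset, not the real Itanium template table, and real encoders use typed nops such as nop.i):

```python
# Greedy bundle packing sketch: fill 3-slot bundles with instruction types,
# inserting "nop" whenever the next instruction would not extend any legal
# template -- producing patterns like "instr nop nop tmpl" from slide 46.

TEMPLATES = {("M", "I", "I"), ("M", "M", "I"), ("M", "F", "I"),
             ("M", "I", "B"), ("M", "F", "B")}   # hypothetical subset

def bundle(instrs):
    bundles, i = [], 0
    while i < len(instrs):
        slots = [instrs[i]]          # always consume one instruction
        i += 1
        for _ in range(2):
            cand = tuple(slots + [instrs[i]]) if i < len(instrs) else None
            if cand and any(t[:len(cand)] == cand for t in TEMPLATES):
                slots.append(instrs[i])   # extends a legal template prefix
                i += 1
            else:
                slots.append("nop")       # pad: no template fits here
        bundles.append(tuple(slots))
    return bundles

print(bundle(["M", "I", "I", "M", "F"]))
# [('M', 'I', 'I'), ('M', 'F', 'nop')]
```

A real encoder would also reject type sequences with no legal template at all and would reorder instructions to reduce nop padding; this toy only illustrates why difficulty (1) above, finding triplets that match templates, costs code density.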

47 EPIC Processor Operation
128-bit instruction bundles (template T, slots S0, S1, S2) come from the I-cache. The processor fetches one or more bundles for execution (the Itanium® implementation takes two) and tries to execute all instructions in parallel, depending on the available functional units (MEM, INT, FP, and three branch units). Retired instruction bundles then leave the machine.

48 Itanium 8-Stage Pipeline
– In-order issue, out-of-order completion
– All functional units are fully pipelined
– Small branch misprediction penalties
Front end: IPG, ROT → Instruction Buffer → EXP, REN, REG
Back end, after REG:
– Integer: EXE, DET, WRB
– Memory: L1D1, L1D2, L1D3
– Multimedia: MM1, MM2
– Floating point: FP1, FP2, FP3, FP4

49 Itanium 2 Eight-Stage Pipeline
Core: IPG, ROT, EXP, REN, REG, EXE, DET, WB
FP:   FP1, FP2, FP3, FP4, WB
L2:   L2N, L2I, L2A, L2M, L2D, L2C, L2W

IPG: IP generate; L1I cache (6 instructions) and TLB access
ROT: instruction rotate and buffer (6 instructions)
EXP: expand; port assignment and routing
REN: INT and FP register rename
REG: INT and FP register file read
EXE: ALU execute; L1D cache and TLB access plus L2 cache tag access
DET: exception detect; branch correction
WB: writeback; INT register update
FP1–WB: FP FMAC pipeline (2 units) plus register write
L2N–L2I: L2 queue nominate/issue (4 stages), speculatively issued with the L1 request
L2A–L2W: L2 access, rotate, correct, write (4 stages)