COMP381 by M. Hamdi 1 Commercial Superscalar and VLIW Processors.

Presentation transcript:


COMP381 by M. Hamdi 2 Superscalar Processors
- Issue 0-8 instructions per cycle.
- Static scheduling: all pipeline hazards are checked, and instructions issue in order.
- The pipeline control logic checks for hazards between the instructions in the execution phase and the new instruction sequence (the issue packet).
- In case of a hazard, only the instructions preceding the hazarding one in the instruction sequence are issued.
- Complexity of the issue hardware: this stage is pipelined in all dynamic superscalar systems.
[Figure: instruction memory delivers an issue packet to the issue hardware, which feeds the pipeline.]
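The in-order issue rule above can be sketched in a few lines of Python. This is only an illustrative model, not the logic of any particular processor: the tuple format `(dest, src1, src2)`, the function name `issue_packet`, and the `busy_regs` set are all assumptions made for the example, and structural hazards and forwarding are ignored.

```python
def issue_packet(packet, busy_regs):
    """Return the prefix of `packet` that can issue this cycle.

    packet:    list of (dest, src1, src2) register names, in program order
               (hypothetical format chosen for this sketch).
    busy_regs: registers still being written by instructions in execution.
    """
    issued = []
    pending_writes = set(busy_regs)
    for dest, *srcs in packet:
        raw = any(s in pending_writes for s in srcs)  # read-after-write hazard
        waw = dest in pending_writes                  # write-after-write hazard
        if raw or waw:
            break  # in-order issue: this instruction and all later ones wait
        issued.append((dest, *srcs))
        pending_writes.add(dest)
    return issued


packet = [("r1", "r2", "r3"),   # independent
          ("r4", "r1", "r5"),   # reads r1, written by the instruction before it
          ("r6", "r7", "r8")]   # independent, but behind the hazard
print(issue_packet(packet, busy_regs=set()))  # → [('r1', 'r2', 'r3')]
```

The third instruction has no hazard of its own, yet it is held back because issue is in order. That lost slot is what dynamic (out-of-order) scheduling tries to recover.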

COMP381 by M. Hamdi 3 Example: Superscalar of degree 3
[Figure: pipeline diagram with the stages fetch, decode, execute, write back.]

COMP381 by M. Hamdi 4 Basic Superscalar Approach
[Figure: cache/memory, fetch unit, decode/issue unit, multiple execution units (EU), and register file, with the labels "Multiple Instruction" and "Multi Operation".]

COMP381 by M. Hamdi 5 Pentium 4 Pipeline Stages vs. Pentium 3 Pipeline Stages
- Typical P6 pipeline (10 stages): 1 Fetch, 2 Fetch, 3 Decode, 4 Decode, 5 Decode, 6 Rename, 7 ROB Rd, 8 Rdy/Sch, 9 Dispatch, 10 Exec
- Typical Pentium 4 pipeline (20 stages): 1-2 TC Nxt IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename, 9 Que, 10-12 Sch, 13-14 Disp, 15-16 RF, 17 Ex, 18 Flgs, 19 BrCk, 20 Drive

COMP381 by M. Hamdi 6 Pentium 3 Pipeline Architecture
- It is a 3-way issue superscalar.
- It has 5 execution units (integer ALU, integer multiply, FP multiply, FP add, FP divide).

COMP381 by M. Hamdi 7 Pentium 3 Pipeline Stages
- 1-2: Fetch
- 3-5: Decode
- 6: Rename registers
- 7: ROB (reordering instructions)
- 8: Rdy/Sch (scheduling instructions to be executed)
- 9: Dispatch
- 10: Exec

COMP381 by M. Hamdi 8 Pentium 4 Pipeline Stages
Stage: Work
- 1-2: Trace cache next instruction pointer
- 3-4: Trace cache fetch
- 5: Drive
- 6: Allocation
- 7-8: Rename
- 9: Queue
- 10-12: Schedule
- 13-14: Dispatch
- 15-16: Register files
- 17: Execute
- 18: Flags
- 19: Branch check
- 20: Drive
Increasing the number of pipeline stages shortens the work done per stage and so allows a higher clock frequency. It took the industry 28 years to hit 1 GHz and only 18 months to reach 2 GHz. The price paid for deeper pipelines is that it is very difficult to avoid stalls. (That is why, when the Pentium 4 was introduced, its performance was worse than that of the Pentium 3.) It is a 5-issue superscalar processor.
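The trade-off between clock frequency and stall cost can be made concrete with a small stall model. The sketch below is illustrative only: the function names and every number in it (base CPI, misprediction rates, clock rates, and the assumption that a flush costs as many cycles as the pipeline has stages) are hypothetical and are not measurements of the Pentium 3 or Pentium 4.

```python
def time_per_instruction_ns(base_cpi, mispredicts_per_instr, flush_cycles, clock_ghz):
    """Average time per instruction under a simple stall model.

    CPI = base CPI + (mispredictions per instruction) * (cycles lost per flush)
    """
    cpi = base_cpi + mispredicts_per_instr * flush_cycles
    return cpi / clock_ghz  # cycles / (cycles per ns) = ns


def compare(mispredicts_per_instr):
    """Hypothetical 10-stage design at 1.0 GHz vs 20-stage design at 1.4 GHz."""
    shallow = time_per_instruction_ns(0.5, mispredicts_per_instr,
                                      flush_cycles=10, clock_ghz=1.0)
    deep = time_per_instruction_ns(0.5, mispredicts_per_instr,
                                   flush_cycles=20, clock_ghz=1.4)
    return shallow, deep


for rate in (0.005, 0.05):
    shallow, deep = compare(rate)
    print(f"mispredict rate {rate}: shallow {shallow:.2f} ns, deep {deep:.2f} ns")
```

Under these assumed numbers the deeper pipeline wins when mispredictions are rare and loses when they are frequent. In this model, a higher clock frequency alone does not guarantee better performance.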

COMP381 by M. Hamdi 9-21 Pentium 4 Pipeline Stages, Step by Step
[Figure, repeated on each slide: Pentium 4 block diagram with the 3.2 GB/s system interface, L2 cache and control, BTB, BTB & I-TLB, decoder, trace cache, rename/alloc, uOP queues, schedulers, integer RF, FP RF, microcode ROM, store AGU, load AGU, ALU, FP move, FP store, Fmul, Fadd, MMX, SSE, and L1 D-cache and D-TLB, shown alongside the 20-stage pipeline.]
- TC Nxt IP (stages 1-2): Trace cache next instruction pointer. Pointer indicating the location of the next instruction.
- TC Fetch (stages 3-4): Trace cache fetch. Read the decoded instructions (uOPs).
- Drive (stage 5): Wire delay. Drive the uOPs to the allocator.
- Alloc (stage 6): Allocate the resources required for execution. The resources include load buffers, store buffers, etc.
- Rename (stages 7-8): Register renaming.
- Que (stage 9): Write into the uOP queue. uOPs are placed into the queues, where they are held until there is room in the schedulers.
- Sch (stages 10-12): Schedule. Write into the schedulers and compute dependencies. Watch for dependencies to resolve.
- Disp (stages 13-14): Dispatch. Send the uOPs to the appropriate execution unit.
- RF (stages 15-16): Register file. Read the register file. These are the sources for the pending operation (ALU or other).
- Ex (stage 17): Execute. Execute the uOPs on the appropriate execution port.
- Flgs (stage 18): Flags. Compute flags (zero, negative, etc.). These are typically inputs to a branch instruction.
- Br Ck (stage 19): Branch check. The branch operation compares the actual branch direction with the prediction.
- Drive (stage 20): Wire delay. Drive the result of the branch check to the front end of the machine.
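The slides describe the Rename stage only as "register renaming". As a reminder of what renaming achieves, here is a minimal generic sketch. It is not the Pentium 4's actual mechanism: the instruction format `(dest, (src, ...))`, the function name `rename`, and the physical register names `p0, p1, ...` are assumptions made for illustration.

```python
import itertools


def rename(instrs, arch_regs):
    """Map architectural registers to fresh physical registers.

    instrs:    list of (dest, (src, ...)) tuples in program order
               (hypothetical format for this sketch).
    arch_regs: architectural register names that hold values on entry.
    """
    fresh = (f"p{i}" for i in itertools.count())
    mapping = {r: next(fresh) for r in arch_regs}
    renamed = []
    for dest, srcs in instrs:
        new_srcs = tuple(mapping[s] for s in srcs)  # read current mappings first
        mapping[dest] = next(fresh)                 # every write gets a new register
        renamed.append((mapping[dest], new_srcs))
    return renamed


program = [("r1", ("r2", "r3")),   # r1 = r2 op r3
           ("r2", ("r4", "r5")),   # writes r2: WAR hazard with the line above
           ("r1", ("r2", "r6"))]   # writes r1 again: WAW hazard with the first line
print(rename(program, ["r1", "r2", "r3", "r4", "r5", "r6"]))
# → [('p6', ('p1', 'p2')), ('p7', ('p3', 'p4')), ('p8', ('p7', 'p5'))]
```

After renaming, the three destinations are distinct physical registers, so the WAR and WAW hazards disappear. The true dependence of the third instruction on the second survives through `p7`.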

COMP381 by M. Hamdi 22 Commercial EPIC Processors Itanium

COMP381 by M. Hamdi 23 Itanium® Processor Family Architecture
- EPIC: explicitly parallel instruction computing
- Instruction encoding: bundles and templates
- Large register resources: 128 integer, 128 floating point
- Support for software pipelining
- Predication
- Speculation (control, data, load)

COMP381 by M. Hamdi 24 EPIC: Explicitly Parallel Instruction Computing
- Focused on parallel execution
- Instructions are issued in bundles
- Instructions are distributed among the processor's execution units according to type
- Currently, up to two complete bundles can be dispatched per clock cycle
- Pipeline stages: 10 (Itanium® 1), 8 (Itanium® 2)


COMP381 by M. Hamdi 26 Instruction Format: Bundles & Templates
- Bundle: a set of three instructions (41 bits each)
- Template: identifies the types of instructions in the bundle (5 bits, which together with the three instructions fills the 128-bit bundle)

COMP381 by M. Hamdi 27 Instruction Format: Bundles & Templates
Instruction types:
- M: Memory
- I: Shifts and multimedia
- A: Integer arithmetic and logic (ALU)
- B: Branch
- F: Floating point
- L+X: Long (move, branch, ...)
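The bundle format above (three 41-bit instructions plus a template in 128 bits) can be sketched as plain bit packing. The function and constant names are made up for this example. The 5-bit template width follows from 128 - 3 × 41 = 5, and the sketch assumes the template sits in the lowest bits with slots 0, 1, 2 above it.

```python
SLOT_BITS = 41       # each instruction slot, per the slide
TEMPLATE_BITS = 5    # 128 - 3 * 41
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
    return (template
            | (slot0 << TEMPLATE_BITS)
            | (slot1 << (TEMPLATE_BITS + SLOT_BITS))
            | (slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS)))


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & TEMPLATE_MASK
    slots = tuple((bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
                  for i in range(3))
    return template, slots


bundle = pack_bundle(0x10, 1, 2, 3)   # arbitrary example values
print(unpack_bundle(bundle))          # → (16, (1, 2, 3))
```

Because the template is decoded first, the hardware knows which kind of execution unit each slot needs without decoding the instructions themselves.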

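COMP381 by M. Hamdi 28 Explicitly Parallel Instruction Computing (EPIC)
- 128-bit instruction bundles come from the I-cache. Each bundle holds three instruction slots (S2, S1, S0) and a template (T).
- The processor fetches one or more bundles for execution. The number is implementation-dependent; Itanium® takes two.
- It tries to execute all instructions in parallel, depending on the available units.
- Instruction bundles are then retired.
[Figure: bundles flow from the I-cache to the EPIC functional units (MEM, INT, FP, B, B, B) and leave as retired instruction bundles.]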

COMP381 by M. Hamdi 29
[Figure: handwritten code (instruction groups separated by ;;) passes through the code generator, which emits instruction bundles of three slots plus a template (for example "instr instr nop tmpl"). The bundles are then fetched and executed.]
- The code generator creates bundles, possibly including nops.
- Itanium® fetches 2 bundles at a time for execution. They may or may not execute in parallel.
- There are two difficulties:
  1) Finding instruction triplets that match the defined templates.
  2) Matching pairs of bundles that can execute in parallel.
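The first difficulty, fitting instructions to templates and padding with nops, can be illustrated with a toy greedy bundler. This is a simplified sketch rather than how a real Itanium code generator works. It uses only a subset of the template shapes (no MLX and no stop-bit variants), it treats its input as a single instruction group, and the names `fits`, `fill`, and `bundle_group` are invented for the example.

```python
# Subset of template shapes; MLX and stop-bit variants are omitted for brevity.
TEMPLATES = ["MII", "MMI", "MFI", "MMF", "MIB", "MBB", "BBB", "MMB", "MFB"]


def fits(instr_type, slot_unit):
    """A-type (integer ALU) instructions may go in either an M or an I slot."""
    if instr_type == "A":
        return slot_unit in ("M", "I")
    return instr_type == slot_unit


def fill(template, pending):
    """Fill one template in program order; return (instructions used, slots)."""
    slots, used = [], 0
    for unit in template:
        if used < len(pending) and fits(pending[used], unit):
            slots.append(pending[used])
            used += 1
        else:
            slots.append("nop")  # no matching instruction for this slot
    return used, slots


def bundle_group(types):
    """Greedily pack a list of instruction types (M, I, A, F, B) into bundles."""
    pending, bundles = list(types), []
    while pending:
        # Pick the template that consumes the most instructions (first one wins ties).
        used, template, slots = max(
            ((*fill(t, pending), t) for t in TEMPLATES),
            key=lambda item: item[0],
        )[0:3:2] + (None,) if False else _best(pending)
        bundles.append((template, slots))
        pending = pending[used:]
    return bundles


def _best(pending):
    """Return (used, template, slots) for the best-fitting template."""
    best = None
    for template in TEMPLATES:
        used, slots = fill(template, pending)
        if best is None or used > best[0]:
            best = (used, template, slots)
    return best


print(bundle_group(["M", "M", "F", "B"]))
# → [('MMF', ['M', 'M', 'F']), ('MIB', ['nop', 'nop', 'B'])]
```

The second bundle carries two nops because no template in this subset starts with a branch slot followed by free slots that the group could fill. This is the code-size cost that the slide's "possibly including nops" refers to.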

COMP381 by M. Hamdi 30 Today's Architecture Challenges
Performance barriers:
- Memory latency
- Branches
- Loop pipelining and call/return overhead
- Hardware-based instruction scheduling: unable to efficiently schedule parallel execution
- Too few registers: unable to fully utilize multiple execution units

COMP381 by M. Hamdi 31 Improving Performance
To achieve improved performance, Itanium® architecture code accomplishes the following:
- Increases instruction-level parallelism (ILP)
- Improves branch handling
- Hides memory latencies

COMP381 by M. Hamdi 32 Instruction-Level Parallelism (ILP)
Increase ILP by:
- More resources: large register files, avoiding register contention
- 3-instruction-wide word (bundle): facilitates parallel processing of instructions
- Enabling the compiler/assembly writer to explicitly indicate parallelism

COMP381 by M. Hamdi 33 Itanium 8-stage Pipelines
- In-order issue, out-of-order completion
- All functional units are fully pipelined
- Small branch misprediction penalties
[Figure: pipeline diagram. Main pipeline: IPG, ROT, instruction buffer, EXP, REN, REG, EXE, DET, WRB. Memory: L1D1, L1D2, L1D3. Integer multimedia: MM1, MM2. Floating point: FP1, FP2, FP3, FP4.]