15-447 Computer Architecture, Fall 2007 © October 29th, 2007 Majd F. Sakr CS-447 – Computer Architecture.


Computer Architecture, Fall 2007 © October 29th, 2007 Majd F. Sakr CS-447 – Computer Architecture M,W 10-11:20am Lecture 16: Superscalar Processors

During This Lecture
° Introduction to Superscalar Pipelined Processors
  - Limitations of Scalar Pipelines
  - From Scalar to Superscalar Pipelines
  - Superscalar Pipeline Overview

General Performance Metrics
° Performance is determined by (1) Instruction Count, (2) Clock Rate, and (3) CPI (Clocks Per Instruction).
° Exe Time = Instruction Count * CPI * Cycle Time
° The Instruction Count is governed by compiler techniques and architectural decisions.
° The CPI is primarily governed by architectural decisions.
° The Cycle Time is governed by technology improvements and architectural decisions.
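The execution-time equation above can be tried out with a small sketch. The function name and the workload numbers are assumptions chosen for illustration; they do not come from the lecture.

```python
def execution_time(instruction_count, cpi, cycle_time_s):
    """Exe Time = Instruction Count * CPI * Cycle Time (result in seconds)."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical workload: 1 million instructions on a 1 GHz clock (1 ns cycle).
base = execution_time(1_000_000, cpi=2.0, cycle_time_s=1e-9)

# Halving the CPI, with everything else unchanged, halves the execution time.
improved = execution_time(1_000_000, cpi=1.0, cycle_time_s=1e-9)

print(f"base: {base:.6f} s, improved: {improved:.6f} s")
print(f"speedup: {base / improved:.2f}x")
```

The three factors multiply, so an improvement in any one of them pays off only if it does not worsen the other two by a comparable amount.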

Scalar Unpipelined Processor
° Only ONE instruction can be resident in the processor at any given time.
° The whole processor is considered as ONE stage, K=1.
° CPI = 1 / IPC
[Figure: one instruction resident in the processor; number of stages K=1]

Pipelined Processor
° With K pipe stages, K instructions are resident in the processor at any given time.
° In our example, K=5 stages, so the degree of parallelism (the number of concurrent instructions in the processor) is also equal to 5.
° One instruction completes each clock cycle: CPI = IPC = 1.

Example of a Six-Stage Pipelined Processor [figure]

Pipelining & Performance
° The pipeline depth is the number of stages implemented in the processor. It is an architectural decision, and it is also directly related to the technology. In the previous example, K=5.
° The stall CPI is directly related to the instructions in the code and to the density of dependences and branches among them.
° Ideally the CPI is ONE.
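One common way to make the stall contribution concrete is to add the expected stall cycles per instruction to the ideal CPI of one. The sketch below assumes that additive model; the stall frequencies and penalties are made-up illustration values, not measurements from the lecture.

```python
def effective_cpi(ideal_cpi, stall_events):
    """Return ideal CPI plus stall CPI.

    stall_events is a list of (frequency per instruction, penalty in cycles)
    pairs; each pair contributes frequency * penalty stall cycles per
    instruction on average.
    """
    stall_cpi = sum(freq * penalty for freq, penalty in stall_events)
    return ideal_cpi + stall_cpi


# Assumed mix: 20% of instructions hit a 1-cycle data-dependence stall,
# and 10% are branches that cost 2 extra cycles each.
cpi = effective_cpi(1.0, [(0.20, 1), (0.10, 2)])
print(f"effective CPI: {cpi:.2f}")
```

A denser mix of dependences and branches raises the stall term, which is why the same pipeline can show very different CPI on different programs.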

Super-Pipelining Processor
° IF (Instruction Fetch): requires a cache access to get the next instruction to be delivered to the pipeline.
° ID (Instruction Decode): requires the control unit to decode the instruction to determine what it does.
° The cache access of the fetch stage requires much more time than the combinational logic of the decode stage.
° Why not subdivide each pipe stage into m shorter sub-stages, each taking one minor cycle?

Super-Pipelining Processor
° The machine can issue a new instruction every minor cycle.
° Parallelism = K x m.
° For this figure, parallelism = K x m = 4 x 3 = 12.
[Figure: super-pipelined machine of degree m=3]

Stage Quantization
° (a) four-stage instruction pipeline
° (b) eleven-stage instruction pipeline

Superscalar Machine of Degree n=3
° The superscalar degree is determined by the issue parallelism n, the maximum number of instructions that can be issued in every machine cycle.
° Parallelism = K x n.
° For this figure, parallelism = K x n = 4 x 3 = 12.
° Is there any reason why the superscalar machine cannot also be superpipelined?

Superpipelined Superscalar Machine
° The machine parallelism is determined by the issue parallelism n, the number of sub-stages m, and the number of stages K.
° Parallelism = K x m x n.
° For this figure, parallelism = K x m x n = 5 x 3 x 3 = 45.
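The parallelism formulas from the last few slides can be collected into one small sketch. The function name and default arguments are assumptions for illustration; the numbers are the ones quoted on the slides.

```python
def machine_parallelism(k, m=1, n=1):
    """Maximum number of instructions in flight.

    k: number of pipeline stages
    m: superpipelining degree (sub-stages per stage), 1 if not superpipelined
    n: superscalar degree (instructions issued per cycle), 1 if scalar
    """
    return k * m * n


print(machine_parallelism(5))            # scalar pipeline, K=5
print(machine_parallelism(4, m=3))       # superpipelined, K x m
print(machine_parallelism(4, n=3))       # superscalar, K x n
print(machine_parallelism(5, m=3, n=3))  # superpipelined superscalar, K x m x n
```

These values are upper bounds on the instructions resident in the machine; the parallelism actually achieved depends on the dependences in the program.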

Characteristics of Superscalar Machines
° Simultaneously advance multiple instructions through the pipeline stages.
° Multiple functional units => higher instruction execution throughput.
° Able to execute instructions in an order different from that specified by the original program.
° Out-of-program-order execution allows more parallel processing of instructions.

Limitations of Scalar Pipelines
° Instructions, regardless of type, traverse the same set of pipeline stages.
° Only one instruction can be resident in each pipeline stage at any time.
° Instructions advance through the pipeline stages in a lockstep fashion.
° Upper bound on pipeline throughput.

Deeper Pipeline, a Solution?
° Performance is determined by (1) Instruction Count, (2) Clock Rate, and (3) IPC (Instructions Per Clock).
° A deeper pipeline has fewer logic gate levels in each stage, which leads to a shorter cycle time and a higher clock rate.
° But a deeper pipeline can potentially incur higher penalties for dealing with inter-instruction dependences.

Bounded Pipeline Performance
° A scalar pipeline can initiate at most one instruction every cycle, hence IPC is fundamentally bounded by ONE.
° To get more instruction throughput, we must initiate more than one instruction every machine cycle.
° Hence, having more than one instruction resident in each pipeline stage is necessary: a parallel pipeline.

Inefficient Unification into a Single Pipeline
° Different instruction types require different sets of computations.
° In the IF and ID stages, there is significant uniformity across instruction types.

Inefficient Unification into a Single Pipeline
° But in the execution stages, ALU and MEM, there is substantial diversity.
° Instructions that require long and possibly variable latencies (F.P. multiply and divide) are difficult to unify with simple instructions that require only a single-cycle latency.

Diversified Pipelines
° Specialized execution units customized for specific instruction types contribute to the need for greater diversity in the execution stages.
° For parallel pipelines, there is a strong motivation to implement multiple different execution units (subpipelines) in the execution portion of the parallel pipeline. We call such pipelines diversified pipelines.

Performance Lost Due to Rigid Pipelines
° Instructions advance through the pipeline stages in a lockstep fashion: in order and synchronously.
° If a dependent instruction is stalled in pipeline stage i, then all instructions in earlier stages, whether dependent or not, are also stalled.

Rigid Pipeline Penalty
° Only after the inter-instruction dependence is satisfied can all i stalled instructions again advance synchronously down the pipeline.
° If an independent instruction is allowed to bypass the stalled instruction and continue down the pipeline stages, an idling cycle of the pipeline can be eliminated.

Out-Of-Order Execution
° It is the act of allowing the bypassing of a stalled leading instruction by trailing instructions.
° Parallel pipelines that support out-of-order execution are called dynamic pipelines.
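The benefit of bypassing a stalled instruction can be seen in a toy model. This is a deliberately simplified sketch, not a description of any real pipeline: it assumes a single issue slot per cycle, each instruction is a hypothetical (latency, producers) pair, producers always appear earlier in the program, and a result becomes usable at issue cycle plus latency.

```python
def in_order_issue(program):
    """Rigid pipeline: an instruction cannot issue before its predecessor."""
    issue, finish = [], []
    cycle = 0
    for latency, deps in program:
        # Earliest cycle at which all producer results are available.
        ready = max((finish[d] for d in deps), default=0)
        start = max(cycle, ready)
        issue.append(start)
        finish.append(start + latency)
        cycle = start + 1  # the next instruction issues no earlier than this
    return issue


def out_of_order_issue(program):
    """Dynamic pipeline: each cycle, issue the oldest instruction that is ready."""
    issue = [None] * len(program)
    finish = [None] * len(program)
    cycle = 0
    while any(t is None for t in issue):
        for i, (latency, deps) in enumerate(program):
            deps_done = all(
                finish[d] is not None and finish[d] <= cycle for d in deps
            )
            if issue[i] is None and deps_done:
                issue[i] = cycle
                finish[i] = cycle + latency
                break  # only one issue slot per cycle in this model
        cycle += 1
    return issue


# Instruction 0 is a 3-cycle operation, instruction 1 depends on it,
# and instructions 2 and 3 are independent single-cycle operations.
program = [(3, []), (1, [0]), (1, []), (1, [])]
print("in-order issue cycles:    ", in_order_issue(program))
print("out-of-order issue cycles:", out_of_order_issue(program))
```

In the in-order case the two independent instructions wait behind the stalled dependent one; in the out-of-order case they fill the cycles that would otherwise idle.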

From Scalar to Superscalar Pipelines
° Superscalar pipelines are parallel pipelines: they are able to initiate the processing of multiple instructions in every machine cycle.
° Superscalar pipelines are diversified: they employ multiple, heterogeneous functional units in their execution stages.
° They are implemented as dynamic pipelines in order to achieve the best possible performance without requiring reordering of instructions by the compiler.

Parallel Pipelines
° A k-stage pipeline can have k instructions concurrently resident in the machine and can potentially achieve a factor of k speedup over nonpipelined machines.
° Alternatively, the same speedup can be achieved by employing k copies of the nonpipelined machine to process k instructions in parallel.

Temporal & Spatial Machine Parallelism
° Spatial parallelism requires replication of the entire processing unit, hence it needs more hardware than temporal parallelism.
° Parallel pipelines employ both temporal and spatial machine parallelism, which produces higher instruction processing throughput.

Parallel Pipelines
° For parallel pipelines, the speedup is primarily determined by the width of the parallel pipeline.
° A parallel pipeline with width s can concurrently process up to s instructions in each of its pipeline stages and produce a potential speedup of s.
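A rough way to see the potential speedup of s is to count cycles for a stream of independent instructions. The sketch below assumes an idealized pipeline with no stalls, where the first group needs k cycles to drain and every later group of s instructions completes one cycle after the previous one; the function name and numbers are illustrative assumptions.

```python
import math


def ideal_cycles(num_instructions, k, s=1):
    """Cycles to run independent instructions on a k-stage, s-wide pipeline.

    Assumes no stalls: k cycles for the first group of s instructions,
    then one more cycle for each remaining group.
    """
    return k + math.ceil(num_instructions / s) - 1


n, k = 1000, 5
scalar = ideal_cycles(n, k)          # width 1
parallel = ideal_cycles(n, k, s=4)   # width 4
speedup = scalar / parallel
print(f"scalar: {scalar} cycles, 4-wide: {parallel} cycles")
print(f"speedup: {speedup:.2f}x")
```

For long instruction streams the fill time of k cycles becomes negligible and the speedup approaches the width s, which is the upper bound the slide refers to.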

Parallel Pipeline's Interconnections
° To connect all s instruction buffers of one stage to all s instruction buffers of the next stage, the circuitry for inter-stage interconnection can increase by a factor of s^2.
° In order to support concurrent register file access by s instructions, the number of read and write ports of the register file must be increased by a factor of s. Similarly, additional I-cache and D-cache access ports must be provided.
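The two growth rates above can be tabulated to show how quickly the interconnect cost outpaces the potential speedup. This is only a sketch of the scaling factors relative to a scalar pipeline; the function and key names are assumptions, and real hardware cost depends on many other factors.

```python
def interconnect_scaling(s):
    """Scaling factors relative to a scalar (s=1) pipeline."""
    return {
        "inter_stage_paths": s * s,   # every buffer to every next-stage buffer
        "register_file_ports": s,     # one set of ports per concurrent instruction
    }


for width in (1, 2, 4, 8):
    factors = interconnect_scaling(width)
    print(width, factors["inter_stage_paths"], factors["register_file_ports"])
```

The potential speedup grows linearly with s while the inter-stage wiring grows quadratically, which is one reason practical widths stay small.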

The Sequel to the i486: Pentium Microprocessor
° Superscalar machine implementing a parallel pipeline of width s=2.
° Essentially implements two i486 pipelines, each a 5-stage scalar pipeline.

Pentium Microprocessor
° Multiple instructions can be fetched and decoded by the first 2 stages of the parallel pipeline in every machine cycle.
° In each cycle, potentially two instructions can be issued into the two execution pipelines.
° The goal is to achieve a peak execution rate of two instructions per machine cycle.
° The execute stage can perform an ALU operation or access the D-cache, hence additional ports to the register file (RF) must be provided to support 2 ALU operations.

Diversified Pipelines
° In a unified pipeline, each instruction type requires only a subset of the execution stages, yet it must traverse all of them, idling in the stages it does not need.
° The execution latency for all instruction types is equal to the total number of execution stages, resulting in unnecessary stalling of trailing instructions.

Diversified Pipelines (3)
° Instead of implementing s identical pipes in an s-wide parallel pipeline, diversified execution pipes can be implemented using multiple functional units.

Advantages of Diversified Pipelines
° Efficient hardware design due to pipes customized for a particular instruction type.
° Each instruction type incurs only the necessary latency and makes use of all the stages.
° Possible distributed and independent control of each execution pipe if all inter-instruction dependences are resolved prior to dispatching.

Diversified Pipelines Design
° The number of functional units should match the available instruction-level parallelism (I.L.P.) of the program. The mix of functional units (F.U.s) should match the dynamic mix of instruction types of the program.
° Most 1st-generation superscalars simply integrated a second execution pipe for processing F.P. instructions with the existing scalar pipeline.

Diversified Pipelines Design (2)
° Later, in 4-issue machines, 4 F.U.s were implemented for executing integer, F.P., load/store, and branch instructions.
° Recent designs incorporate multiple integer units, some dedicated to long-latency integer operations (multiply and divide) and to operations for image, graphics, and signal processing applications.

Dynamic Pipelines
° Superscalar pipelines differ from (rigid) scalar pipelines in one key aspect: the use of complex multi-entry buffers.
° In order to minimize unnecessary stalling of instructions in a parallel pipeline, trailing instructions must be allowed to bypass a stalled leading instruction.
° Such bypassing can change the order of execution of instructions from the original sequential order of the static code.

Dynamic Pipelines (2)
° With out-of-order execution, instructions are executed as soon as their operands are available; this approaches the data-flow limit of execution.

Dynamic Pipelines (3)
° A dynamic pipeline achieves out-of-order execution via the use of complex multi-entry buffers that allow instructions to enter and leave the buffers in different orders.


Pentium 4
° CISC
° Pentium: some superscalar components
  - Two separate integer execution units
° Pentium Pro: full-blown superscalar
° Subsequent models refine and enhance the superscalar design

Pentium 4 Block Diagram [figure]

Pentium 4 Operation
° Fetch instructions from memory in the order of the static program.
° Translate each instruction into one or more fixed-length RISC instructions (micro-operations).
° Execute micro-ops on the superscalar pipeline.
  - Micro-ops may be executed out of order.
° Commit results of micro-ops to the register set in original program flow order.
° Outer CISC shell with inner RISC core.
° Inner RISC core pipeline has at least 20 stages.
  - Some micro-ops require multiple execution stages.
  - Longer pipeline, cf. the five-stage pipeline on x86 up to Pentium.
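The combination of out-of-order execution with in-order commit can be illustrated with a generic reorder-buffer sketch. This is not the Pentium 4's actual structure; the class, method, and micro-op names are hypothetical, and the model tracks only whether each micro-op has finished.

```python
from collections import deque


class ReorderBuffer:
    """Toy buffer: entries are allocated in program order and retired in order."""

    def __init__(self):
        self.entries = deque()  # oldest micro-op at the left

    def dispatch(self, name):
        """Allocate an entry for a micro-op in program order."""
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def commit(self):
        """Retire finished micro-ops from the head only, preserving program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
a, b, c = (rob.dispatch(name) for name in ("uop_a", "uop_b", "uop_c"))

c["done"] = True       # the youngest micro-op finishes first, out of order
first = rob.commit()   # nothing retires: uop_a at the head is not done yet

a["done"] = True
b["done"] = True
second = rob.commit()  # all three retire, in original program order
print(first, second)
```

Because results become architecturally visible only at commit, the register set always reflects the original program order even though execution did not follow it.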

Pentium 4 Pipeline [figure]

Pentium 4 Pipeline Operation (1) to (6) [figures]

Introduction to PowerPC 620
° The first 64-bit superscalar processor to employ:
  - Aggressive branch prediction
  - Real out-of-order execution
  - Six pipelined execution units
  - Dynamic renaming for all register files
  - Distributed multientry reservation stations
  - A completion buffer to ensure program correctness
° Most of these features had not been previously implemented in a single-chip microprocessor.