EECS 470 Lecture 12: Superscalar Architectures and the Pentium 4
Optimizing CPU Performance
Golden rule: t_CPU = N_inst × CPI × t_CLK
Given this, what are our options?
– Reduce the number of instructions executed
– Reduce the cycles needed to execute an instruction (CPI)
– Reduce the clock period
Our next focus: further reducing CPI
– Approach: superscalar execution
– Capable of initiating multiple instructions per cycle
– Possible to implement for in-order or out-of-order pipelines
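As a worked example with illustrative numbers (not from the lecture): a program of 10^9 dynamic instructions at CPI = 2 on a 1 GHz clock (t_CLK = 1 ns) takes

```latex
t_{CPU} = N_{inst} \times CPI \times t_{CLK}
        = 10^{9} \times 2 \times 1\,\mathrm{ns}
        = 2\,\mathrm{s}
```

Halving CPI halves t_CPU to 1 s, but only if the extra issue hardware does not stretch t_CLK; that tension is the subject of the next slide.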
Why Superscalar?
[Figure: pipelined execution vs. superscalar + pipelined execution]
Optimization results in more complexity
– Longer wires, more logic → higher t_CLK and t_CPU
– Architects must strike a balance with reductions in CPI
Implications of Superscalar Execution
Instruction fetch?
– Taken branches, multiple branches, partial cache lines
Instruction decode?
– Simple for a fixed-length ISA, much harder for variable length
Renaming?
– Multi-ported rename table; dependencies between instructions renamed in the same cycle must be recognized (see the sketch below)
Dynamic scheduling?
– Requires multiple result buses, smarter selection logic
Execution?
– Multiple functional units, multiple result buses
Commit?
– Multiple ROB/ARF ports; dependencies must be recognized
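As a minimal sketch of the renaming complication, consider a 2-wide rename stage in C. Everything here (structure names, the 8-register file, the free-list counter) is illustrative, not the P4's actual design:

```c
#define ARCH_REGS   8   /* iA32-like: 8 architectural registers */
#define ISSUE_WIDTH 2   /* rename 2 instructions per cycle      */

/* Hypothetical decoded instruction: one destination, two sources. */
typedef struct { int dest, src1, src2; } Inst;

static int rat[ARCH_REGS] = {0, 1, 2, 3, 4, 5, 6, 7}; /* arch -> phys */
static int next_preg = ARCH_REGS;   /* stand-in for a real free list  */

/* Rename one group. Hardware reads all RAT ports in parallel, so a
 * source written by an OLDER instruction in the SAME group must be
 * overridden with that instruction's newly allocated physical reg. */
void rename_group(Inst g[ISSUE_WIDTH],
                  int p_src1[ISSUE_WIDTH],
                  int p_src2[ISSUE_WIDTH],
                  int p_dest[ISSUE_WIDTH]) {
    for (int i = 0; i < ISSUE_WIDTH; i++) {   /* parallel RAT read  */
        p_src1[i] = rat[g[i].src1];
        p_src2[i] = rat[g[i].src2];
        p_dest[i] = next_preg++;              /* allocate a phys reg */
    }
    for (int i = 1; i < ISSUE_WIDTH; i++)     /* intra-group bypass  */
        for (int j = 0; j < i; j++) {         /* youngest match wins */
            if (g[j].dest == g[i].src1) p_src1[i] = p_dest[j];
            if (g[j].dest == g[i].src2) p_src2[i] = p_dest[j];
        }
    for (int i = 0; i < ISSUE_WIDTH; i++)     /* RAT write-back      */
        rat[g[i].dest] = p_dest[i];
}
```

The second loop is the superscalar-specific cost: the hardware needs a comparator between every source and every older destination in the group, which grows quadratically with issue width.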
P4 Overview
Latest iA32 processor from Intel
– Equipped with the full set of iA32 SIMD operations
– First flagship architecture since the P6 microarchitecture
– Pentium 4 ISA = Pentium III ISA + SSE2
– SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating-point operations, plus prefetch
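A minimal sketch of what SSE2's 128-bit integer SIMD looks like in C, using the standard <emmintrin.h> intrinsics; the function and its alignment assumptions are illustrative:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add two arrays of 32-bit integers four lanes at a time with the
 * 128-bit SSE2 integer unit. Assumes n is a multiple of 4 and the
 * pointers are 16-byte aligned. */
void add_i32(const int *a, const int *b, int *out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i va = _mm_load_si128((const __m128i *)(a + i));
        __m128i vb = _mm_load_si128((const __m128i *)(b + i));
        _mm_store_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
}
```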
Comparison Between Pentium III and Pentium 4
Execution Pipeline
Front End
– Predicts branches
– Fetches/decodes code into the trace cache
– Generates µops for complex instructions
– Prefetches instructions that are likely to be executed
Branch Prediction
Dynamically predicts the direction and target of branches based on the PC, using the BTB
If no dynamic prediction is available, predict statically (implemented at decode; sketched below):
– Taken for backward (looping) branches
– Not taken for forward branches
Traces are built across (predicted) taken branches to avoid taken-branch penalties
Also includes a 16-entry return address stack predictor
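The static fallback rule is simple enough to state as code; this is a sketch of the policy, not Intel's implementation:

```c
#include <stdint.h>

/* Static fallback: backward branches (target below the branch PC) are
 * assumed to close loops and are predicted taken; forward branches
 * are predicted not taken. */
int static_predict_taken(uint32_t branch_pc, uint32_t target) {
    return target < branch_pc;
}
```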
Decoder
Single decoder available
– Operates at a maximum of 1 instruction per cycle
– Receives instructions from the L2 cache 64 bits at a time
Some complex instructions must enlist the microcode ROM
– Used for very complex iA32 instructions (> 4 µops)
– After the microcode ROM finishes, the front end resumes fetching µops from the trace cache
Execution Pipeline
Trace Cache
Primary instruction cache in the P4 architecture
– Stores 12K decoded µops
– On a miss, instructions are fetched from L2
A trace predictor connects traces
The trace cache removes
– Decode latency after mispredictions
– Decode power for all pre-decoded instructions
Branch Hints
P4 software can provide hints to branch prediction and the trace cache
– Specify the likely direction of a branch
– Implemented with conditional branch prefixes
– Used for decode-stage predictions and trace building
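At the source level, the usual way to express such a hint is GCC/Clang's __builtin_expect; whether the compiler turns it into the P4's branch-prefix bytes (rather than just laying out code favorably) depends on the target options, so treat the mapping as an assumption:

```c
/* Portable compiler hints: tell the compiler which way a branch
 * usually goes. The function below is illustrative. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int safe_deref(const int *p) {
    if (unlikely(p == 0))   /* hint: the error path is rare */
        return -1;
    return *p;
}
```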
Execution Pipeline
Execution
126 µops can be in flight at once
– Up to 48 loads and 24 stores
Can dispatch up to 6 µops per cycle
– 2x the trace cache and retirement µop bandwidth
– The extra bandwidth helps the schedule catch up after misspeculation
Execution Units
Register Renaming
– 8-entry architectural register file
– 128-entry physical register file
– 2 RATs (a front-end RAT and a retirement RAT)
– The retirement RAT eliminates copying register results into the ARF at retirement
Store and Load Scheduling
Loads and stores are scheduled out of order
– Stores always execute in program order relative to each other
48 loads and 24 stores can be in flight
Store/load buffers are allocated at the allocation stage
– 24 store buffers and 48 load buffers in total
Execution Pipeline
Retirement
Can retire 3 µops per cycle
Implements precise exceptions
A reorder buffer is used to organize completed µops
Also keeps track of branches and sends updated branch information to the BTB
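A minimal sketch of in-order retirement from a circular reorder buffer, assuming the 3-µop width on this slide; the entry fields and names are illustrative:

```c
#include <stdbool.h>

#define ROB_SIZE     128
#define RETIRE_WIDTH 3

/* Hypothetical ROB entry: completion flag plus whatever result
 * state the real machine tracks. */
typedef struct { bool done; bool is_branch; bool taken; } RobEntry;

static RobEntry rob[ROB_SIZE];
static int head;    /* index of the oldest in-flight µop */
static int count;   /* number of occupied entries        */

/* One retirement cycle: commit up to RETIRE_WIDTH µops, strictly in
 * program order, stopping at the first one that has not completed. */
void retire_cycle(void) {
    for (int n = 0; n < RETIRE_WIDTH && count > 0; n++) {
        if (!rob[head].done)
            break;                    /* oldest not done: stall */
        if (rob[head].is_branch) {
            /* send the resolved outcome back to the BTB (per slide) */
        }
        head = (head + 1) % ROB_SIZE; /* pop the oldest entry    */
        count--;
    }
}
```

Stopping at the first incomplete entry is what makes exceptions precise: everything older than a faulting µop has committed, and nothing younger has.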
Data Stream of Pentium 4 Processor
On-chip Caches
– L1 instruction cache (the trace cache)
– L1 data cache
– L2 unified cache
All caches use a pseudo-LRU replacement algorithm (sketched below)
[Table of cache parameters not reproduced]
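A common form of pseudo-LRU is the binary-tree approximation; here is a sketch for one 4-way set. The slide only names the policy, so the encoding below is illustrative, not the P4's documented scheme:

```c
#include <stdbool.h>

/* Tree pseudo-LRU for one 4-way set: 3 bits. Convention here:
 * b0 = 0 -> victim is in ways {0,1}, b0 = 1 -> victim in {2,3};
 * b1 picks between ways 0/1, b2 between ways 2/3. */
typedef struct { bool b0, b1, b2; } Plru4;

/* On a hit or fill of `way`, flip the bits to point AWAY from it. */
void plru_touch(Plru4 *s, int way) {
    if (way < 2) { s->b0 = 1; s->b1 = (way == 0); }
    else         { s->b0 = 0; s->b2 = (way == 2); }
}

/* On a miss, follow the bits to find the victim way. */
int plru_victim(const Plru4 *s) {
    return s->b0 ? (s->b2 ? 3 : 2) : (s->b1 ? 1 : 0);
}
```

Three bits per set instead of the counters true LRU needs is the point: the approximation never evicts the most recently used line, and the update is a handful of gates.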
L1 Data Cache
Non-blocking
– Supports up to 4 outstanding load misses
Load latency
– 2 clocks for integer loads
– 6 clocks for floating-point loads
1 load and 1 store per clock
Load speculation
– Assume the access will hit the cache
– "Replay" the dependent instructions when a miss is detected (see the sketch below)
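A toy model of that replay behavior, assuming the 2-cycle integer hit latency above and the 7-cycle L2 latency from the next slide; the function and its names are illustrative:

```c
#include <stdio.h>
#include <stdbool.h>

enum { HIT_LATENCY = 2, L2_LATENCY = 7 };

/* The scheduler wakes a load's dependents assuming an L1 hit. If the
 * load misses, those dependents executed with stale data and must be
 * reissued once the data actually arrives from L2. */
int dependent_issue_cycle(int load_issue_cycle, bool l1_hit) {
    int wakeup = load_issue_cycle + HIT_LATENCY;  /* optimistic wakeup */
    if (!l1_hit) {
        printf("L1 miss: replaying dependents\n");
        wakeup = load_issue_cycle + L2_LATENCY;   /* reissue later */
    }
    return wakeup;
}
```

Speculating on a hit keeps the common case fast: with a 2-cycle load-use latency, waiting for the hit/miss check before waking dependents would penalize every load.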
L2 Cache
Non-blocking
Load latency
– Net load access latency of 7 cycles
Bandwidth
– 1 load and 1 store per cycle
– New cache operations may begin every 2 cycles
– 256-bit wide bus between L1 and L2
– 48 GB/s at 1.5 GHz (32 bytes transferred per core clock)
L2 Cache Data Prefetcher
A hardware prefetcher monitors reference patterns
– Brings cache lines in automatically
– Attempts to fetch 256 bytes ahead of the current access
– Prefetches for up to 8 simultaneous, independent streams
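Software can do the same thing explicitly with the SSE prefetch instruction mentioned on the overview slide. A sketch using the standard _mm_prefetch intrinsic; the 256-byte distance mirrors this slide, and the function itself is illustrative:

```c
#include <xmmintrin.h>   /* _mm_prefetch (SSE) */

/* Software analogue of the hardware prefetcher: request data well
 * ahead of the current access so it arrives before it is needed. */
long sum_with_prefetch(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + 64 < n)  /* 64 ints = 256 bytes ahead */
            _mm_prefetch((const char *)(a + i + 64), _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}
```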
System Bus
Delivers data at 3.2 GB/s
– 64-bit wide bus
– Four data phases per clock cycle ("quad-pumped")
– 100 MHz system bus clock
– 8 bytes × 4 transfers/clock × 100 MHz = 3.2 GB/s
Execution on MPEG4 at 1 GHz
[Figure not reproduced]
Performance Trends
[Chart: Moore's Law vs. delivered speedup (SPECInt2000); the performance gap to real-time speech recognition, around 10k SPECInt2000]
Power Trends
[Chart: processor power over time vs. the 500 mW budget for real-time speech; the power gap, with hot plate, nuclear reactor, and rocket nozzle as power-density reference points]