The Microarchitecture of the Pentium 4 processor

Presentation transcript:

The Microarchitecture of the Pentium 4 Processor
Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, Patrice Roussel
Presented by: Ajay Sharma
11/15/2018

Overview of the NetBurst™ Micro-Architecture
(Block diagram with four sections:)
- Front End: BTB/Branch Prediction, Fetch/Decode, Trace Cache
- Out-of-Order Engine: Out-of-Order Execution Logic, Retirement (with a Branch History Update path)
- Integer & FP Execution Units: Execution Unit, Level 1 Data Cache
- Memory Subsystem: Bus Unit, Level 2 Cache, System Bus

In-Order Front End
Fetches instructions, decodes them, and sends them to the out-of-order execution core. It has three parts:
- Fetch/Decode Unit
- Execution Trace Cache
- BTB/Branch Prediction

Out-of-Order Engine
This is where instructions are prepared for execution. It has two parts:
- Out-of-Order Execution Logic: allows maximum utilization of the execution resources
- Retirement Unit: ensures that instructions are put back in program order
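The split between out-of-order execution and in-order retirement can be illustrated with a toy reorder buffer. This is a minimal sketch, not the Pentium 4's actual structure; the class and method names are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: results may arrive in any order, retirement is in program order."""

    def __init__(self):
        self.entries = deque()  # instruction ids in program order
        self.done = set()       # ids whose execution has finished

    def allocate(self, instr_id):
        """Record an instruction at the tail, in program order."""
        self.entries.append(instr_id)

    def complete(self, instr_id):
        """Mark an instruction as executed (may happen out of order)."""
        self.done.add(instr_id)

    def retire(self):
        """Retire finished instructions from the head; stop at the first unfinished one."""
        retired = []
        while self.entries and self.entries[0] in self.done:
            instr_id = self.entries.popleft()
            self.done.discard(instr_id)
            retired.append(instr_id)
        return retired


rob = ReorderBuffer()
for instr in ("i0", "i1", "i2"):
    rob.allocate(instr)

rob.complete("i2")     # the youngest instruction finishes first
first = rob.retire()   # nothing retires: i0 is still pending
rob.complete("i0")
second = rob.retire()  # i0 retires; unfinished i1 blocks i2
rob.complete("i1")
third = rob.retire()   # i1 and i2 retire together, in program order
```

Even though `i2` finishes first, it leaves the buffer last, which is the ordering guarantee the retirement unit provides.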

Integer and Floating-Point Units
This is the unit where instructions are actually executed. It has two parts:
- L1 Data Cache
- Execution Unit

Memory Subsystem
It holds instructions and data in the Level 2 cache when they do not fit in the Trace Cache and the L1 cache. It is also used to access main memory when the L2 cache misses, and to access the system I/O resources.

Clock Rates
The target clock rate determines the number of pipeline stages. A higher clock rate requires a deeper pipeline and costs more clocks for each cache miss and mispredicted branch. Overall, though, higher clock rates boost performance: for example, a 50% increase in frequency may yield only a 30% increase in net performance, but that is still a worthwhile gain.
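The 50% and 30% figures above imply that the work done per clock drops as frequency rises. A minimal sketch of that arithmetic, assuming performance is modeled as frequency times instructions per clock (IPC); the model and the function name are illustrative assumptions, not taken from the slides.

```python
def relative_ipc(freq_gain: float, perf_gain: float) -> float:
    """Return the IPC ratio implied by a frequency gain and a net performance gain.

    Assumes performance = frequency * IPC, a simplification for illustration.
    """
    return (1.0 + perf_gain) / (1.0 + freq_gain)


# Slide figures: 50% higher frequency, 30% higher net performance.
ipc_ratio = relative_ipc(freq_gain=0.50, perf_gain=0.30)
print(f"IPC ratio: {ipc_ratio:.3f}")
```

Under this model the deeper pipeline gives up roughly 13% of its per-clock throughput, and the frequency gain more than makes up for it.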

Clocking Trends
Clock rates have increased to about 2.5 times that of the original 286 design.

Misprediction Pipeline
As the number of pipeline stages increases, each stage does less work per clock, so the clock rate can increase. The cost is a longer branch misprediction penalty.

NetBurst™ Micro-Architecture

1. Front End
- Front-End BTB & Instruction TLB: steer the front end when a Trace Cache miss happens. The ITLB translates linear addresses to physical addresses.
- Trace Cache: stores only decoded instructions (µops), so after a misprediction there is no need to re-decode the instructions, which reduces decode latency.
- Trace Cache BTB: predicts taken/not taken for the branches held in the Trace Cache, so that the delay can be reduced.
- Microcode ROM: used to execute complex instructions.
- µop Queue: holds in-order µops from the Trace Cache and the Microcode ROM before they are sent to the out-of-order execution logic.
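The benefit of caching decoded instructions can be sketched with a toy model that counts how often the decoder runs. This is a simplified illustration keyed by fetch address; the class, the `toy_decoder` function, and the µop strings are hypothetical and do not reflect how real traces are built.

```python
class TraceCache:
    """Toy cache of decoded µops keyed by fetch address."""

    def __init__(self, decoder):
        self.decoder = decoder  # function: address -> list of µops
        self.traces = {}        # address -> cached µops
        self.decode_count = 0   # how many times the decoder actually ran

    def fetch(self, address):
        """Return µops for an address, decoding only on a miss."""
        if address not in self.traces:
            self.decode_count += 1
            self.traces[address] = self.decoder(address)
        return self.traces[address]


def toy_decoder(address):
    # Hypothetical decode: every instruction becomes two µops.
    return [f"uop_{address:#x}_a", f"uop_{address:#x}_b"]


cache = TraceCache(toy_decoder)
uops_first = cache.fetch(0x1000)   # miss: decoder runs
uops_again = cache.fetch(0x1000)   # hit: e.g. refetch after a misprediction
```

The second fetch returns the same µops without invoking the decoder, which is the latency saving the slide describes.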

2. Out-of-Order Execution Logic
Allocator: attempts to allocate as many instructions as possible that have their operands ready.

Mechanism of the Allocator
(Diagram: Instructions -> Allocator -> Register File. If the Register File is busy, stalled instructions wait in a buffer.)

Register Renaming
(Diagram: the 8 architectural registers, such as EAX, EDX, and EBP, are mapped through the Register Alias Table onto a pool of 128 physical registers. Each write creates a new register instance, e.g. EAX1 to EAX4, EDX1, EBP1, identified by a sequence number and an instance name.)
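The Register Alias Table named on this slide can be sketched as a map from architectural register names to physical register numbers. This is a minimal illustration that never frees registers; the class and method names are hypothetical, and only the pool size of 128 comes from the slide.

```python
class RegisterAliasTable:
    """Toy register renamer: each write to an architectural register gets a fresh physical one."""

    def __init__(self, num_physical=128):
        self.free = list(range(num_physical))  # unused physical register numbers
        self.alias = {}                         # architectural name -> current physical number

    def rename_dest(self, arch_reg):
        """Allocate a new physical register for a write and update the alias."""
        phys = self.free.pop(0)
        self.alias[arch_reg] = phys
        return phys

    def read(self, arch_reg):
        """Return the physical register holding the latest instance, or None if never written."""
        return self.alias.get(arch_reg)


rat = RegisterAliasTable()
eax1 = rat.rename_dest("EAX")  # first instance of EAX
eax2 = rat.rename_dest("EAX")  # second, independent instance of EAX
edx1 = rat.rename_dest("EDX")
```

Because `eax1` and `eax2` are different physical registers, two writes to EAX no longer conflict, and later readers of EAX see the newest instance.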

2.1 µop Scheduling
The scheduler determines when an instruction is ready by looking at its register operands. It has two structures:
- µop Queues
- µop Scheduler

2.1.1 µop Queues
Two queues:
1. Load and Store Queue (memory operations)
2. ALU and Branch Queue (ALU and branch instructions)
- Each queue is written and stored in strict FIFO order
- But the two queues are read out of order with respect to each other
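The queue behavior can be sketched with two FIFO queues that are drained independently. This is an illustrative model only: the `issue_order` function, the operation names, and the way a stall is resolved are assumptions made for the example.

```python
from collections import deque


def issue_order(mem_ops, alu_ops, stalled):
    """Drain two FIFO queues; each only ever issues from its own head.

    `stalled` names operations that cannot issue yet. When nothing can
    issue, the stall is modeled as resolving.
    """
    mem_q, alu_q = deque(mem_ops), deque(alu_ops)
    stalled = set(stalled)
    order = []
    while mem_q or alu_q:
        progressed = False
        for queue in (mem_q, alu_q):
            if queue and queue[0] not in stalled:
                order.append(queue.popleft())
                progressed = True
        if not progressed:
            stalled.clear()  # model: the blocking condition eventually clears
    return order


order = issue_order(
    mem_ops=["load A", "store B"],
    alu_ops=["add", "branch"],
    stalled=["load A"],
)
```

With `load A` stalled, the ALU queue runs ahead of the memory queue, yet `store B` never passes `load A` because each queue keeps its own FIFO order.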

2.1.2 µop Scheduler
It is tied to four different dispatch ports:
- Port 0
- Port 1
- Load Port
- Store Port

2.1.2.1 Mechanism of the Scheduler
The schedulers arbitrate for the dispatch ports when they have ready instructions.
- Port 0: 2 µops/cycle
- Port 1: 2 µops/cycle
- Load Port: 1 µop/cycle
- Store Port: 1 µop/cycle
Total: Load + Store + Port 0 + Port 1 = 1 + 1 + 2 + 2 = 6 µops/cycle
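The per-port limits and the peak of six per cycle can be checked with a short sketch. The port bandwidths come from the slide; the `dispatch` function, its oldest-first policy, and the µop names are hypothetical simplifications of the arbitration.

```python
# Per-cycle dispatch limits from the slide.
PORT_BANDWIDTH = {"port 0": 2, "port 1": 2, "load": 1, "store": 1}

peak_per_cycle = sum(PORT_BANDWIDTH.values())


def dispatch(ready, bandwidth=PORT_BANDWIDTH):
    """Pick µops for one cycle, oldest first, honoring each port's per-cycle limit.

    `ready` is a list of (name, port) pairs in age order.
    """
    used = {port: 0 for port in bandwidth}
    chosen = []
    for name, port in ready:
        if used[port] < bandwidth[port]:
            used[port] += 1
            chosen.append(name)
    return chosen


ready = [
    ("add1", "port 0"),
    ("add2", "port 0"),
    ("add3", "port 0"),  # third µop for port 0 must wait a cycle
    ("ld1", "load"),
    ("ld2", "load"),     # second load must wait a cycle
]
chosen = dispatch(ready)
```

The peak is only reached when the ready µops happen to be spread across all four ports.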

2.1.2.2 Types of Instructions Dispatched
- Port 0: FP move, ALU (2x speed)
- Port 1: ALU (2x speed), integer operation, FP execute
- Load Port: memory load into register
- Store Port: memory store from register

3. Integer and Floating-Point Execution Unit
This is where the instructions are actually executed. It is designed to handle the most common cases first. It has different types of units:
- Integer Operations Unit
- L1 Data Cache
- Floating-Point Unit

3.1 Integer Operations Unit
1. Low-Latency Integer ALU
2. Complex Integer Operations

3.1.1 Low-Latency Integer ALU
- Designed to handle the common cases first
- 60-70% of instructions use the ALU bypass
- Executes fully dependent instructions at 2 times the clock rate
- The core is kept as small as possible; hardware not needed for the common case is kept out of it, e.g. multiplier, shift, rotate, branch processing

3.1.2 Complex Integer Operation Unit
- Handles shift, rotate, multiply, divide, branch address calculation, etc.
- These instructions come from the complex integer dispatch port.
- Latency: 4 clocks for shift and rotate, 14 clocks for multiply, 60 clocks for divide
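The latencies above can be combined to estimate how long a chain of dependent operations takes. A rough sketch: the shift, rotate, multiply, and divide values come from the slide, while the 0.5-clock figure for a simple add is an assumption derived from the ALU running at 2 times the clock rate. The function name is hypothetical.

```python
# Latencies in core clocks. "add" is an assumption (half a clock at 2x speed);
# the other values are the ones quoted on the slide.
LATENCY_CLOCKS = {
    "add": 0.5,
    "shift": 4,
    "rotate": 4,
    "multiply": 14,
    "divide": 60,
}


def chain_latency(ops):
    """Total latency of a chain where each operation depends on the previous one."""
    return sum(LATENCY_CLOCKS[op] for op in ops)


fast_chain = chain_latency(["add", "add", "add", "add"])
mixed_chain = chain_latency(["add", "add", "shift", "multiply"])
```

Four dependent adds cost about as much as two ordinary clocks, while a single divide dominates any chain it appears in, which is why the fast ALU core leaves the complex operations out.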

3.2 Low-Latency Level 1 (L1) Cache
- Used for both integer and FP loads and stores
- 4-way set-associative, write-through cache (all data written to L1 is also written to L2)
- 8 KB in size, and very fast
- Instead of one big, slow L1 cache, the design uses one small, fast cache (L1) backed by a larger, slower one (L2)
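The cache geometry and the write-through policy can be sketched in a few lines. The 8 KB size and 4-way associativity come from the slide; the 64-byte line size is an assumption, since the slide does not state the L1 line size, and the dictionaries standing in for the caches are purely illustrative.

```python
CACHE_BYTES = 8 * 1024  # from the slide
WAYS = 4                # from the slide
LINE_BYTES = 64         # assumption: not stated on the slide

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)


def set_index(address: int) -> int:
    """Map a byte address to its cache set."""
    return (address // LINE_BYTES) % NUM_SETS


# Toy write-through: every store updates both levels.
l1_data = {}
l2_data = {}


def store(address: int, value: int) -> None:
    """Write to L1 and propagate the same write to L2."""
    l1_data[address] = value
    l2_data[address] = value


store(0x40, 7)
```

Under the assumed line size there are only 32 sets, so addresses 2 KB apart compete for the same set.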

3.3 Floating-Point (FP)/SSE Execution Unit
- Floating-point instructions are executed here.
- One instruction can start every clock.
- Two execution ports:
  a. 128-bit general execution
  b. 128-bit register-to-register moves

4. Memory Subsystem
It is responsible for handling L1 cache misses and L2 cache misses. It has two parts:
- L2 Cache: stores data that does not fit in the L1 cache
- System Bus: used to access main memory on an L2 cache miss, and to access I/O devices

4.1 L2 Cache
- Size: 256/512/1024 KB
- Used when there is a miss in the Trace Cache or the L1 cache
- 128 bytes per cache line (64 * 2)
- Bandwidth: 48 GB/s

4.2 System Bus
- Used for accessing main memory when there is an L2 cache miss
- Also used for accessing I/O devices
- Bandwidth: 3.2 GB/s
- Width: 64 bits
- Clock rate: 400 MHz
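The system bus bandwidth follows directly from the width and clock rate quoted on this slide. A quick check of that arithmetic, assuming one transfer per clock at the quoted rate and 1 GB = 10^9 bytes; the 48 GB/s L2 figure is the one from the L2 Cache slide, and the variable names are illustrative.

```python
BUS_WIDTH_BITS = 64       # from the slide
BUS_CLOCK_HZ = 400e6      # from the slide
L2_BANDWIDTH_GBPS = 48.0  # from the L2 Cache slide

# bytes per transfer * transfers per second, expressed in GB/s (10^9 bytes)
bus_bandwidth_gbps = BUS_WIDTH_BITS / 8 * BUS_CLOCK_HZ / 1e9

# How much faster the L2 cache can supply data than the system bus.
l2_to_bus_ratio = L2_BANDWIDTH_GBPS / bus_bandwidth_gbps

print(f"bus: {bus_bandwidth_gbps:.1f} GB/s, L2 is {l2_to_bus_ratio:.0f}x faster")
```

The gap between the two figures shows why an L2 miss that has to go out over the system bus is so much more expensive than an L2 hit.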

Performance
- Delivers the highest SPECint_base performance in the world
- SPECfp2000 performance is also good
- 15-20% gain in integer performance over the PIII
- 30-70% gain in floating-point and multimedia performance over the PIII
- 5% gain with SSE/SSE2 over the x87-only version

Thank you
Questions?