Pentium 4
- Deeply pipelined processor supporting multiple issue with speculation and multithreading
- 2004 version: 31 clock cycles from fetch to retire, 3.2 GHz clock rate (the deep pipeline allows a higher clock rate)
- Front-end decoder translates each IA-32 instruction into a series of RISC-like micro-operations called uops
- Uops are executed by a dynamically scheduled speculative pipeline
Section 2.10

Pentium 4, continued
- Uops are stored in an execution trace cache
- The trace cache stores sequences of instructions to be executed, including nonadjacent instructions
- It is accessed using the branch prediction bits and the address of the first instruction in the trace
- It has its own branch target buffer for predicting the outcome of uop branches
- Very high hit rate, so IA-32 instruction fetch is rarely needed
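The lookup described on this slide can be sketched as a toy model in Python. This is only an illustration, assuming a dictionary keyed by the trace's starting address and a tuple of predicted branch outcomes. The class name `TraceCache`, its methods, and the uop strings are hypothetical and do not describe Intel's actual hardware organization.

```python
from typing import Dict, List, Optional, Tuple

# Key: (address of first instruction in the trace, predicted branch outcomes).
TraceKey = Tuple[int, Tuple[bool, ...]]


class TraceCache:
    """Toy model of a trace cache; all names here are hypothetical."""

    def __init__(self) -> None:
        self._traces: Dict[TraceKey, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def fill(self, start_addr: int, predictions: Tuple[bool, ...],
             uops: List[str]) -> None:
        """Store a decoded uop sequence, which may span taken branches."""
        self._traces[(start_addr, predictions)] = list(uops)

    def lookup(self, start_addr: int,
               predictions: Tuple[bool, ...]) -> Optional[List[str]]:
        """Return the stored trace on a hit, or None on a miss.

        On a miss, a real front end would fall back to fetching and
        decoding IA-32 instructions; that path is not modeled here.
        """
        trace = self._traces.get((start_addr, predictions))
        if trace is None:
            self.misses += 1
            return None
        self.hits += 1
        return trace


tc = TraceCache()
# A trace that starts at 0x1000 and follows one taken branch to nonadjacent code.
tc.fill(0x1000, (True,), ["load r1", "add r2", "branch", "store r3"])
hit = tc.lookup(0x1000, (True,))    # same address, same prediction: hit
miss = tc.lookup(0x1000, (False,))  # same address, other prediction: miss
```

Keying on both the address and the prediction bits is what lets one starting address map to different traces depending on the predicted path.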

Pentium 4, continued
- Uops are executed by an out-of-order speculative pipeline that uses register renaming rather than a reorder buffer
- Up to three uops per clock cycle can be renamed and dispatched to the functional unit queues
- Up to three uops can commit each clock cycle
- Up to six uops can be dispatched to functional units each clock cycle

Figure 2.26

About Figure 2.26
- Front-end BTB: predicts the next IA-32 instruction to fetch; accessed only on a miss in the execution trace cache
- Execution trace cache: holds uops
- Trace cache BTB: predicts the next uop
- Registers for renaming: 128; supports up to 128 uops executing simultaneously
- Functional units: 7 (the simple ones run at twice the clock rate and accept up to two uops every clock cycle)

About Figure 2.26, continued
- L1 data cache: supports up to 8 outstanding misses; integer load latency is 4 cycles; FP load latency is 12 cycles
- L2 cache: 18-cycle access time

Pentium 4
- The deep pipeline makes speculation and branch prediction very important for high performance
- The cost of a cache miss is also very high, because the queues fill while waiting for the miss to be handled

Pentium 4: Branch misprediction
- Figure 2.28 (next slide) shows the branch-misprediction rate per 1000 instructions
- The top five are integer benchmarks (average of 186 branches per 1000 instructions)
- The bottom five are FP benchmarks (48 branches per 1000 instructions)
- The misprediction rate per 1000 instructions is 8 times higher for the integer benchmarks than for the FP benchmarks
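The numbers on this slide can be combined into a quick back-of-the-envelope check. The sketch below assumes that mispredictions per 1000 instructions equal branches per 1000 instructions times the per-branch misprediction rate. The function name is illustrative. Using only the figures quoted on the slide, it estimates how much of the 8x gap comes from branch frequency and how much from worse per-branch prediction.

```python
INT_BRANCHES_PER_KI = 186   # integer benchmarks, branches per 1000 instructions
FP_BRANCHES_PER_KI = 48     # FP benchmarks, branches per 1000 instructions
PER_INSTRUCTION_RATIO = 8   # integer vs. FP mispredictions per 1000 instructions


def per_branch_ratio(per_instr_ratio: float,
                     int_branches: float,
                     fp_branches: float) -> float:
    """Ratio of per-branch misprediction rates (integer over FP).

    Assumes mispredictions/KI = branches/KI * per-branch misprediction rate,
    so dividing out the branch-frequency ratio leaves the per-branch ratio.
    """
    return per_instr_ratio * fp_branches / int_branches


branch_frequency_ratio = INT_BRANCHES_PER_KI / FP_BRANCHES_PER_KI
ratio = per_branch_ratio(PER_INSTRUCTION_RATIO,
                         INT_BRANCHES_PER_KI,
                         FP_BRANCHES_PER_KI)

print(f"Integer code has {branch_frequency_ratio:.2f}x as many branches")
print(f"Implied per-branch misprediction ratio: {ratio:.2f}x")
```

Under this assumption, the integer benchmarks have roughly 3.9 times as many branches, and each branch is mispredicted roughly twice as often; the two factors multiply to the 8x gap.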

Figure 2.28

Pentium 4: Misspeculation
- A misprediction causes wrong instructions to be executed (misspeculated instructions), which requires recovery time and wastes energy
- Figure 2.29 (next slide) shows the percentage of issued uops that are misspeculated
- Note that Figure 2.29 closely matches Figure 2.28

Figure 2.29

Pentium 4: Cache misses
- Trace cache miss rates are almost negligible for the SPEC benchmarks
- L1 and L2 miss rates are more significant
- Figure 2.30 (next slide) shows misses per 1000 instructions for the L1 and L2 caches
- L1 misses are more frequent, but the L2 miss penalty is higher, so both affect performance
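The trade-off between miss frequency and miss penalty can be illustrated with a rough stall-cycle estimate. In the sketch below, only the 18-cycle L2 access time comes from the slides; the miss counts and the main-memory penalty are hypothetical values chosen for illustration, and the model ignores any overlap of misses in an out-of-order pipeline.

```python
def stall_cycles_per_ki(misses_per_ki: float, penalty_cycles: float) -> float:
    """Stall cycles per 1000 instructions, ignoring any overlap of misses."""
    return misses_per_ki * penalty_cycles


L2_ACCESS_CYCLES = 18         # from the slides: cost of an L1 miss that hits in L2
MEMORY_PENALTY_CYCLES = 200   # hypothetical cost of an L2 miss
L1_MISSES_PER_KI = 20         # hypothetical
L2_MISSES_PER_KI = 5          # hypothetical

l1_stall = stall_cycles_per_ki(L1_MISSES_PER_KI, L2_ACCESS_CYCLES)
l2_stall = stall_cycles_per_ki(L2_MISSES_PER_KI, MEMORY_PENALTY_CYCLES)

print(f"L1 miss stalls per 1000 instructions: {l1_stall}")
print(f"L2 miss stalls per 1000 instructions: {l2_stall}")
```

With these made-up inputs, the less frequent L2 misses still cost more cycles in total, which is why neither level can be ignored.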

Figure 2.30

Pentium 4: CPI
- Figure 2.31 (next slide) shows cycles per instruction for these same 10 SPEC benchmarks
- Note that mcf has the worst misspeculation rate and the worst L1 and L2 miss rates, and also the highest CPI
- Note that swim has high L1 and L2 miss rates and is the lowest-performing FP benchmark

Figure 2.31

Comparing Pentium 4 to AMD Opteron
- Both use a dynamically scheduled, speculative pipeline capable of issuing three IA-32 instructions per clock cycle
- Both have two levels of on-chip cache, but the Opteron L1 instruction cache is not a trace cache
- The biggest difference is that the Pentium 4 is more deeply pipelined
- The Pentium 4 has a higher CPI (Figure 2.32), which is consistent with its deeper pipeline

Figure 2.32

Comparing Pentium 4 to AMD Opteron, continued
- Deeper pipelining allows an increase in clock rate. Will this increase make up for the increase in CPI?
- Figure 2.33 (next slide) compares a 2.8 GHz AMD Opteron with a 3.8 GHz Intel Pentium 4
- Note that the Opteron has higher performance, so the Pentium 4's higher clock rate is insufficient to overcome its higher CPI
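The question on this slide can be framed with the usual relation: execution time = instruction count x CPI / clock rate. The sketch below assumes both processors execute the same number of IA-32 instructions for a given program. The clock rates come from the slide; the CPI values are hypothetical and are not read from Figure 2.32.

```python
def relative_performance(clock_a_ghz: float, cpi_a: float,
                         clock_b_ghz: float, cpi_b: float) -> float:
    """Performance of A relative to B, assuming equal instruction counts.

    Performance is proportional to clock rate / CPI, so a result above 1.0
    means A is faster and a result below 1.0 means B is faster.
    """
    return (clock_a_ghz / cpi_a) / (clock_b_ghz / cpi_b)


P4_CLOCK_GHZ = 3.8        # from the slide
OPTERON_CLOCK_GHZ = 2.8   # from the slide

# The Pentium 4 wins only if its CPI is less than this multiple of the Opteron's.
break_even_cpi_ratio = P4_CLOCK_GHZ / OPTERON_CLOCK_GHZ

# Hypothetical CPIs, chosen only to illustrate a case past the break-even point.
opteron_cpi = 1.0
p4_cpi = 1.5
speedup = relative_performance(P4_CLOCK_GHZ, p4_cpi,
                               OPTERON_CLOCK_GHZ, opteron_cpi)

print(f"Break-even CPI ratio: {break_even_cpi_ratio:.2f}")
print(f"Pentium 4 relative to Opteron: {speedup:.2f}")
```

The break-even ratio is about 1.36, so any benchmark where the Pentium 4's CPI exceeds the Opteron's by more than about 36% favors the Opteron despite its lower clock rate.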

Figure 2.33

Comparing Pentium 4 to IBM Power5
- Sophisticated multiple-issue pipelines usually have slower clock rates than simple pipelines
- A faster clock rate will win in the presence of limited ILP
- IBM Power5: designed for high-performance integer and FP; two processor cores, each capable of sustaining four instructions per clock cycle; 1.9 GHz clock rate

Comparing Pentium 4 to IBM Power5, continued
- Pentium 4: single processor with multithreading; very deep pipeline; can sustain three instructions per clock cycle; higher clock rate (3.8 GHz)
- Figure 2.34 (next slide) compares the performance of these machines
- Note that the Power5 often does better on the FP benchmarks (fewer branches, more parallelism)
- The Pentium 4 does better on the integer benchmarks (higher clock rate)
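The issue widths and clock rates quoted for the two machines give a simple upper bound on instruction throughput. The sketch below multiplies instructions per cycle by clock rate; the function name is illustrative, and the result is a peak figure only, since real programs sustain far less when ILP is limited or when branches and cache misses stall the pipeline.

```python
def peak_ginstr_per_sec(instr_per_cycle: int, clock_ghz: float,
                        cores: int = 1) -> float:
    """Peak throughput in billions of instructions per second."""
    return instr_per_cycle * clock_ghz * cores


power5_core = peak_ginstr_per_sec(4, 1.9)           # one Power5 core
power5_chip = peak_ginstr_per_sec(4, 1.9, cores=2)  # both Power5 cores
pentium4 = peak_ginstr_per_sec(3, 3.8)              # single Pentium 4 core

print(f"Power5, one core:  {power5_core:.1f} G instr/s peak")
print(f"Power5, two cores: {power5_chip:.1f} G instr/s peak")
print(f"Pentium 4:         {pentium4:.1f} G instr/s peak")
```

On peak numbers a single Pentium 4 core is ahead of a single Power5 core, which fits the slide's point that clock rate wins when ILP is limited, while the Power5's wider issue pays off on FP code with fewer branches and more parallelism.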

Figure 2.34