CS 7810 Lecture 22 Processor Case Studies, The Microarchitecture of the Pentium 4 Processor G. Hinton et al. Intel Technology Journal Q1, 2001.



Clock Frequencies
- Aggressive clocks => little work per pipeline stage => deep pipelines => low IPC, large buffers, high power, high complexity, low efficiency
- 50% increase in clock speed => 30% increase in performance
- Mispredict latency = 10 cycles
- Mispredict latency = 20 cycles
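The reason a faster clock yields a less-than-proportional speedup can be sketched with a back-of-the-envelope CPI model. This is only an illustration: the base CPI, branch fraction, and mispredict rate below are assumed values, not figures from the lecture or the paper; only the 10-cycle and 20-cycle mispredict latencies and the 50% clock increase come from the slide.

```python
def exec_time_ns(clock_ghz, penalty_cycles, base_cpi=1.0,
                 branch_frac=0.2, mispredict_rate=0.05, instrs=1_000_000):
    """Execution time in ns for a simple CPI model (all defaults are assumptions)."""
    # Each mispredicted branch adds `penalty_cycles` to the average CPI.
    cpi = base_cpi + branch_frac * mispredict_rate * penalty_cycles
    cycles = instrs * cpi
    return cycles / clock_ghz  # one cycle lasts 1/clock_ghz ns

# Shallow pipeline: slower clock, 10-cycle mispredict latency.
shallow = exec_time_ns(clock_ghz=1.0, penalty_cycles=10)
# Deep pipeline: 50% faster clock, 20-cycle mispredict latency.
deep = exec_time_ns(clock_ghz=1.5, penalty_cycles=20)

speedup = shallow / deep
print(f"speedup from a 50% faster clock: {speedup:.3f}x")
```

Under these assumed parameters the speedup is below 1.5x, because the doubled mispredict latency raises the CPI of the deeper pipeline.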

Deep Pipelines

Variable Clocks
- The fastest clock is defined as the time for an ALU operation and bypass (twice the main processor clock)
- Different parts of the chip (e.g. RAMs) operate at slower clocks to simplify the pipeline design

Microarchitecture Overview

Front End
- ITLB, RAS, decoder
- Trace cache: holds 12K µops (roughly equivalent to an 8KB-16KB I-cache), saves 3 pipe stages, reduces power
- Front-end BTB is accessed on a trace cache miss; a smaller trace-cache BTB detects the next trace line (no details on the branch prediction algorithm)
- Microcode ROM: implements µop translation for complex instructions

Execution Engine
- Allocator: resource (regs, IQ, LSQ, ROB) manager
- Rename: 8 logical regs are renamed to 128 physical regs; in the Pentium 4 the ROB (126 entries) stores only pointers and not the actual register values (unlike P6), which gives a simpler design and less power
- Two queues (memory and non-memory) and multiple schedulers (select logic); can issue six instrs/cycle
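The rename step above can be sketched as a map table plus a free list. This is a minimal illustration, assuming a single map table and a simple FIFO free list; the class and method names are hypothetical, and the actual Pentium 4 rename hardware is not described at this level in the slides. Only the 8 logical and 128 physical register counts come from the slide.

```python
from collections import deque

class Renamer:
    """Toy register renamer: 8 logical registers mapped onto 128 physical ones."""

    def __init__(self, n_logical=8, n_physical=128):
        # Assumption: logical register i starts out in physical register i.
        self.map = list(range(n_logical))
        self.free = deque(range(n_logical, n_physical))

    def rename(self, dst, srcs):
        """Return (new physical dst, physical srcs, previous mapping of dst)."""
        # Sources are read through the current map before dst is remapped.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        new_phys = self.free.popleft()
        old_phys = self.map[dst]
        self.map[dst] = new_phys
        return new_phys, phys_srcs, old_phys

    def retire(self, old_phys):
        # The previous mapping can be recycled once the instruction commits.
        self.free.append(old_phys)

renamer = Renamer()
first = renamer.rename(dst=0, srcs=[1, 2])   # r0 = r1 op r2
second = renamer.rename(dst=0, srcs=[0, 3])  # r0 = r0 op r3, reads the renamed r0
print(first, second)
```

Because the second instruction reads the physical register allocated by the first, the true dependence is preserved while the write-after-write hazard on r0 disappears.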

Schedulers
- 3GHz clock speed = time for a 16-bit add and bypass

NetBurst
- 3GHz ALU clock = time for a 16-bit add and bypass to itself (area is kept to a minimum)
- Used by 60-70% of all µops in integer programs
- Staggered addition speeds up execution of dependent instrs; a full add takes three cycles
- Early computation of lower 16 bits => early initiation of cache access
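The staggered add can be sketched as three steps, one per fast clock cycle: the low 16 bits, then the high 16 bits with the carry from the low half, then the flags. This is a functional model only; the function name and the choice of flags are assumptions for illustration, and no timing is simulated.

```python
MASK16 = 0xFFFF

def staggered_add(a, b):
    """Add two 32-bit values in 16-bit halves; returns (low half, result, flags)."""
    # Fast cycle 1: low 16 bits. This half is available early, so a
    # dependent add or a cache index computation could start here.
    low_sum = (a & MASK16) + (b & MASK16)
    low = low_sum & MASK16
    carry = low_sum >> 16

    # Fast cycle 2: high 16 bits plus the carry out of the low half.
    high_sum = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    high = high_sum & MASK16

    # Fast cycle 3: flags computed over the full 32-bit result.
    result = (high << 16) | low
    flags = {"carry": (high_sum >> 16) == 1, "zero": result == 0}
    return low, result, flags

low, result, flags = staggered_add(0x0001FFFF, 0x00000001)
print(hex(low), hex(result), flags)
```

A dependent instruction that needs only the low half can begin one fast cycle after its producer, which is how staggering speeds up chains of dependent adds.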

Detailed Microarchitecture

Data Cache
- 4-way 8KB cache; 2-cycle load-use latency for integer instrs and 6-cycle latency for fp instrs
- Distance between load scheduler and execution is longer than the load latency
- Speculative issue of load-dependent instrs and selective replay
- Store buffer (24 entries) forwards results to loads (48 load-buffer entries); no details on the load issue algorithm
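Store-to-load forwarding can be sketched as a search of the store buffer from youngest to oldest entry. This is a minimal sketch assuming exact address matches and whole-word accesses; the class and method names are hypothetical, and the real forwarding rules (partial overlaps, access sizes, replay) are not covered by the slides. Only the 24-entry capacity comes from the slide.

```python
class StoreBuffer:
    """Toy store buffer that forwards pending store data to later loads."""

    def __init__(self, capacity=24):
        self.capacity = capacity
        self.entries = []  # (address, value) pairs, oldest first

    def store(self, addr, value):
        if len(self.entries) >= self.capacity:
            raise RuntimeError("store buffer full: allocation must stall")
        self.entries.append((addr, value))

    def load(self, addr, memory):
        """Return (value, forwarded) for a load from `addr`."""
        # Search youngest-first so the load sees the most recent store.
        for entry_addr, value in reversed(self.entries):
            if entry_addr == addr:
                return value, True
        # No matching pending store: read the cache/memory model instead.
        return memory.get(addr, 0), False

memory = {0x10: 1}
buffer = StoreBuffer()
buffer.store(0x10, 5)
buffer.store(0x10, 7)
print(buffer.load(0x10, memory))  # forwarded from the youngest store
print(buffer.load(0x20, memory))  # falls through to memory
```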

Cache Hierarchy
- 256KB 8-way L2; 7-cycle latency; new operation every two cycles
- Stream prefetcher from memory to L2 stays 256 bytes ahead
- 3.2GB/s system bus: 64-bit wide bus at 400MHz
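The 3.2GB/s bus figure follows directly from the bus width and transfer rate. The short calculation below uses only the two numbers on the slide; the variable names are illustrative.

```python
bus_width_bits = 64           # from the slide
transfers_per_second = 400e6  # 400MHz effective transfer rate, from the slide

# 64 bits = 8 bytes moved per transfer.
bytes_per_second = (bus_width_bits / 8) * transfers_per_second
gb_per_second = bytes_per_second / 1e9

print(f"peak bus bandwidth: {gb_per_second:.1f} GB/s")
```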

Performance Results

Quick Facts
- November 2000: Willamette, 0.18 µm, Al interconnect, 42M transistors, 217 mm², 55W, 1.5GHz
- February 2004: Prescott, 0.09 µm, Cu interconnect, 125M transistors, 112 mm², 103W, 3.4GHz

Improvements: Willamette (2000) → Prescott (2004)
- L1 data cache: 8KB → 16KB
- L2 cache: 256KB → 1MB
- Pipeline stages: 20 → 31
- Frequency: 1.5GHz → 3.4GHz
- Technology: 0.18 µm → 0.09 µm

Pentium M
- Based on the P6 microarchitecture
- Lower design complexity (some inefficiencies persist, such as copying register values from the ROB to the architected register file)
- Improves on the P4 branch predictor

PM Changes to P6, cont.
- Intel has not released the exact length of the pipeline
- Known to be somewhere between the P4 (20 stages) and the P3 (10 stages); rumored to be 12 stages
- Trades off slightly lower clock frequencies (than P4) for better performance per clock, lower branch misprediction penalties, …

Banias
- First version
- 77 million transistors, 23 million more than P4
- 1 MB on-die Level 2 cache
- 400 MHz FSB (quad-pumped 100 MHz)
- 130 nm process
- Frequencies between 1.3 and 1.7 GHz
- Thermal Design Point of 24.5 watts

Dothan
- Launched May 10, million transistors
- 2 MB Level 2 cache
- 400 or 533 MHz FSB
- Frequencies between 1.0 and 2.26 GHz
- Thermal Design Point of 21 (400 MHz FSB) to 27 watts

Branch Prediction
- Longer pipelines mean higher penalties for mispredicted branches
- Improvements result in added performance and hence less energy spent per instruction retired

Branch Prediction in Pentium M
- Enhanced version of the Pentium 4 predictor
- Two branch predictors added that run in tandem with the P4 predictor:
  – Loop detector
  – Indirect branch detector
- 20% lower misprediction rate than PIII, resulting in up to 7% gain in real performance

Branch Prediction Based on diagram found here:

Loop Detector
- A predictor that always predicts a loop branch as taken will always mispredict the last iteration
- Detector analyzes branches for loop behavior
- Benefits a wide variety of program types
issue02/art03_pentiumm/p05_branch.htm
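The difference between an always-taken prediction and a loop detector can be sketched by counting mispredictions on a loop that runs repeatedly. This is an illustrative model that learns a single trip count per branch; the class names are hypothetical and the actual Pentium M loop detector is not specified in the slides.

```python
class AlwaysTaken:
    """Baseline: predicts the loop branch as taken every time."""
    def predict(self):
        return True
    def update(self, taken):
        pass

class LoopPredictor:
    """Learns how many times the branch is taken before the loop exits."""
    def __init__(self):
        self.trip = None  # learned count of taken outcomes per loop execution
        self.count = 0    # taken outcomes seen in the current loop execution
    def predict(self):
        if self.trip is None:
            return True   # no trip count learned yet: fall back to taken
        return self.count < self.trip
    def update(self, taken):
        if taken:
            self.count += 1
        else:
            self.trip = self.count  # loop exit: remember the trip count
            self.count = 0

def count_mispredicts(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses

# A loop whose branch is taken 3 times and then falls through, run 10 times.
outcomes = ([True] * 3 + [False]) * 10
always_misses = count_mispredicts(AlwaysTaken(), outcomes)
loop_misses = count_mispredicts(LoopPredictor(), outcomes)
print(always_misses, loop_misses)
```

The always-taken baseline misses every loop exit, while the trip-count predictor misses only the first exit, before it has learned the count.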

Indirect Branch Predictor
- Picks targets based on global flow control history
- Benefits programs compiled to branch to calculated addresses
ue02/art03_pentiumm/p05_branch.htm
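Choosing an indirect target from global history can be sketched as a table indexed by the branch address together with the most recent targets. This is a sketch under stated assumptions: the history here is simply the last two targets, and the table is an unbounded dictionary; the class names are hypothetical and the real Pentium M indexing scheme is not given in the slides. A last-target predictor is included for comparison.

```python
class LastTargetPredictor:
    """Baseline: predicts the target this branch jumped to last time."""
    def __init__(self):
        self.last = {}
    def predict(self, pc):
        return self.last.get(pc)
    def update(self, pc, target):
        self.last[pc] = target

class HistoryIndirectPredictor:
    """Indexes targets by (branch pc, recent global target history)."""
    def __init__(self, depth=2):
        self.depth = depth
        self.history = ()  # most recent targets, oldest first
        self.table = {}
    def predict(self, pc):
        return self.table.get((pc, self.history))
    def update(self, pc, target):
        self.table[(pc, self.history)] = target
        # Shift the new target into the history, keeping `depth` entries.
        self.history = (self.history + (target,))[-self.depth:]

def count_misses(predictor, pc, targets):
    misses = 0
    for target in targets:
        if predictor.predict(pc) != target:
            misses += 1
        predictor.update(pc, target)
    return misses

# One indirect branch that alternates between two targets.
targets = [0x1000, 0x2000] * 10
last_target_misses = count_misses(LastTargetPredictor(), 0x400, targets)
history_misses = count_misses(HistoryIndirectPredictor(), 0x400, targets)
print(last_target_misses, history_misses)
```

On this alternating pattern the last-target predictor is wrong every time, while the history-indexed table is wrong only during warm-up.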

Benchmark

Battery Life

UltraSPARC IV
- CMP with 2 UltraSPARC IIIs: speedups of 1.6 and 1.14 for swim and lucas (static parallelization)
- UltraSPARC III: 4-wide, 16 queue entries, 14 pipeline stages
- 4KB branch predictor: 95% accuracy, 7-cycle penalty
- 2KB prefetch buffer between L1 and L2

Alpha
- Tournament predictor (local and global); 36Kb
- Issue queue (20-Int, 15-FP), 4-wide Int, 2-wide FP
- Two clusters, each with 2 FUs and a copy of the 80-entry register file
