1 Lecture 12: Limits of ILP and Pentium Processors ILP limits, Study strategy, Results, P-III and Pentium 4 processors Adapted from UCB CS252 S01

2 Limits to ILP
Conflicting studies of the amount: benchmarks (vectorized Fortran FP vs. integer C programs), hardware sophistication, compiler sophistication.
How much ILP is available using existing mechanisms with increasing HW budgets?
Do we need to invent new HW/SW mechanisms to stay on the processor performance curve?
Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints
Intel SSE2: 128-bit, including 2 64-bit FP operations per clock
Motorola AltiVec: 128-bit ints and FPs
SuperSPARC multimedia ops, etc.
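As a concrete taste of the SIMD extensions listed above, the sketch below uses the standard SSE2 intrinsics from emmintrin.h to add two arrays of doubles, two elements per 128-bit operation. The function name, array names, and loop bounds are invented for the example, and it assumes an SSE2-capable x86 compiler (e.g., gcc -msse2).

/* Minimal SSE2 sketch: add two arrays of doubles, two elements per
 * 128-bit operation. Names and sizes are illustrative only. */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

void add_arrays(const double *a, const double *b, double *out, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);                 /* load 2 doubles */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));       /* 2 FP adds in one op */
    }
    for (; i < n; ++i)                                    /* scalar tail */
        out[i] = a[i] + b[i];
}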

3 Limits to ILP
Initial HW model here; MIPS compilers. Assumptions for an ideal/perfect machine to start:
1. Register renaming - infinite virtual registers => all register WAW & WAR hazards are avoided
2. Branch prediction - perfect; no mispredictions
3. Jump prediction - all jumps perfectly predicted
2 & 3 => a machine with perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis - addresses are known & a load can be moved before a store provided the addresses are not equal
Also: unlimited number of instructions issued per clock cycle; perfect caches; 1-cycle latency for all instructions (including FP multiply and divide).
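To make assumption 1 concrete, here is a small, hypothetical C fragment in which reusing one temporary creates WAR and WAW (name) dependences between otherwise independent computations; with unlimited rename registers the hardware gives each write its own physical register, so only the true (RAW) dependences remain.

/* Name dependences (WAR/WAW): reusing the temporary t makes the second
 * product appear dependent on the first, even though only the RAW (true)
 * dependences matter. Register renaming (assumption 1 above) gives each
 * write of t its own physical register, so both halves can issue in the
 * same cycle. Variable names and values are illustrative only. */
#include <stdio.h>

int main(void)
{
    double a = 1, b = 2, c = 3, d = 4, e = 5, f = 6;
    double t, x, y;

    t = a * b;     /* write t                                  */
    x = t + c;     /* read t (RAW)                             */
    t = d * e;     /* WAW with first write, WAR with read above */
    y = t + f;     /* RAW on the second write only             */

    printf("%f %f\n", x, y);
    return 0;
}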

4 Study Strategy
First, observe ILP on the ideal machine using simulation.
Then, observe how ideal ILP decreases when we:
- Add branch impact
- Add register impact
- Add memory address alias impact
More restrictions in practice:
- Functional unit latency: floating point
- Memory latency: cache hits of more than one cycle, cache miss penalty
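As a hedged sketch of how such a limit study can be carried out, the C fragment below computes the "ideal machine" IPC of a small instruction trace: with perfect branch/jump prediction, perfect renaming and alias analysis, unit latencies and an unbounded window, each instruction can issue one cycle after the latest producer of its source operands, so the achievable IPC is the trace length divided by the depth of the true-dependence graph. The trace format, field names, and register count are invented for illustration.

/* Sketch of the ideal-machine measurement: an instruction issues one cycle
 * after the latest producer of any of its source values; IPC is the trace
 * length divided by the resulting dependence depth. */
#include <stdio.h>

#define MAX_REGS 64

typedef struct {
    int dst;            /* architectural register written (-1 if none) */
    int src1, src2;     /* architectural registers read   (-1 if none) */
} TraceEntry;

static double ideal_ipc(const TraceEntry *trace, int n)
{
    int ready[MAX_REGS] = {0};   /* cycle in which each register's value is produced */
    int last_cycle = 0;

    for (int i = 0; i < n; i++) {
        int issue = 1;           /* earliest possible issue cycle */
        if (trace[i].src1 >= 0 && ready[trace[i].src1] + 1 > issue)
            issue = ready[trace[i].src1] + 1;
        if (trace[i].src2 >= 0 && ready[trace[i].src2] + 1 > issue)
            issue = ready[trace[i].src2] + 1;
        if (trace[i].dst >= 0)
            ready[trace[i].dst] = issue;   /* unit latency */
        if (issue > last_cycle)
            last_cycle = issue;
    }
    return last_cycle ? (double)n / last_cycle : 0.0;
}

int main(void)
{
    /* r2 = r0 op r1; r3 = r0 op r1; r4 = r2 op r3  -> depth 2, 3 instrs */
    TraceEntry trace[] = { {2, 0, 1}, {3, 0, 1}, {4, 2, 3} };
    printf("ideal IPC = %.2f\n", ideal_ipc(trace, 3));   /* prints 1.50 */
    return 0;
}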

5 Upper Limit to ILP: Ideal Machine (Figure 3.35, page 242)
[Chart: instruction issues per cycle (IPC) on the ideal machine for the integer benchmarks gcc, espresso, li and the FP benchmarks fpppp, doduc, tomcatv.]

6 More Realistic HW: Window Size Impact

7 More Realistic HW: Branch Impact

8 Memory Alias Impact

9 How to Exceed the ILP Limits of this Study?
WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards on registers through renaming, but not those through memory.
Unnecessary dependences (e.g., a compiler not unrolling loops, so the loop-iteration variable creates a dependence).
Overcoming the data-flow limit: value prediction, i.e., predicting values and speculating on the prediction.
Address value prediction and speculation predicts addresses and speculates by reordering loads and stores; it could also provide better alias analysis, since we only need to predict whether two addresses are equal.
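As a hedged illustration of the value-prediction idea, the sketch below implements the simplest variant, a last-value predictor: predict that an instruction will produce the same result it produced last time, and speculate only when a small confidence counter saturates. The table size, indexing, thresholds, and function names are all invented for this example.

/* Last-value predictor sketch: a direct-mapped table indexed by instruction
 * address, with a saturating confidence counter per entry. */
#include <stdint.h>
#include <stdbool.h>

#define VP_ENTRIES 1024

typedef struct {
    uint64_t last_value;
    uint8_t  confidence;         /* 0..3 saturating counter */
} VPEntry;

static VPEntry vp_table[VP_ENTRIES];

/* Returns true if the predictor is confident enough to speculate;
 * the predicted value is written to *value. */
bool vp_predict(uint64_t pc, uint64_t *value)
{
    VPEntry *e = &vp_table[(pc >> 2) % VP_ENTRIES];
    *value = e->last_value;
    return e->confidence >= 3;
}

/* Called at commit with the actual result, to train the predictor. */
void vp_update(uint64_t pc, uint64_t actual)
{
    VPEntry *e = &vp_table[(pc >> 2) % VP_ENTRIES];
    if (e->last_value == actual) {
        if (e->confidence < 3) e->confidence++;
    } else {
        e->last_value = actual;
        e->confidence = 0;
    }
}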

10 Workstation Microprocessors, 3/2001
Source: Microprocessor Report
Max issue: 4 instructions (many CPUs)
Max rename registers: 128 (Pentium 4)
Max BHT: 4K x 9 (Alpha 21264B), 16K x 2 (Ultra III)
Max window size (OOO): 126 instructions (Pentium 4)
Max pipeline: 22/24 stages (Pentium 4)

11 SPEC 2000 Performance, 3/2001
Source: Microprocessor Report
[Chart: SPEC CPU2000 results for the processors above; the original figure carries relative-performance callouts of 1.6X, 3.8X, 1.2X, 1.7X, and 1.5X.]

12 Conclusion: 1000X Performance
Moore's Law for transistors/chip => Moore's Law for performance/MPU.
Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit instruction-level parallelism and (real) Moore's Law to get 1.55X/year: caches, pipelining, superscalar, branch prediction, out-of-order execution, …
ILP limits: to make performance progress in the future, do we need explicit parallelism from the programmer instead of the implicit parallelism of ILP exploited by the compiler and HW?
Otherwise, do we drop back to the old rate of 1.3X per year? Or to less than 1.3X because of the processor-memory performance gap?
Impact on you: if you care about performance, better to think about explicitly parallel algorithms than to rely on ILP.

13 Dynamic Scheduling in P6 (Pentium Pro, II, III)
Q: How do you pipeline 1- to 17-byte 80x86 instructions?
P6 doesn't pipeline 80x86 instructions directly. The P6 decode unit translates the Intel instructions into 72-bit micro-operations (~ MIPS instructions) and sends the micro-operations to the reorder buffer & reservation stations.
Many instructions translate to 1 to 4 micro-operations; see the illustrative sketch below.
Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations.
14 clocks in the total pipeline (~ 3 state machines).
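To make the translation step concrete, here is a hypothetical illustration in plain C data (not the actual P6 uop encoding, which is not public at this level of detail) of how one register-memory 80x86 instruction might decompose into a load, an ALU operation, and a store; the enum, struct, and operand names are invented for the sketch.

/* Illustration (not the real encoding) of how a register-memory 80x86
 * instruction, e.g.  add DWORD PTR [ebx+8], eax,  might decompose into
 * a short sequence of RISC-like micro-operations. */
typedef enum { UOP_LOAD, UOP_ALU_ADD, UOP_STORE } UopKind;

typedef struct {
    UopKind kind;
    const char *dst, *src1, *src2;   /* symbolic operands for readability */
} Uop;

/* add [ebx+8], eax  ->  three uops, renamed and scheduled independently */
static const Uop add_mem_reg[] = {
    { UOP_LOAD,    "tmp0", "ebx+8", 0      },   /* tmp0 <- MEM[ebx+8] */
    { UOP_ALU_ADD, "tmp1", "tmp0",  "eax"  },   /* tmp1 <- tmp0 + eax */
    { UOP_STORE,   0,      "ebx+8", "tmp1" },   /* MEM[ebx+8] <- tmp1 */
};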

14 Dynamic Scheduling in P6 (parameters, 80x86 vs. microops)
Max. instructions issued/clock: 3 80x86 instructions / 6 microops
Max. instr. complete exec./clock: 5 microops
Max. instr. committed/clock: 3 microops
Window (instrs in reorder buffer): 40
Number of reservation stations: 20
Number of rename registers: 40
No. integer functional units (FUs): 2
No. floating point FUs: 1
No. SIMD Fl. Pt. FUs: 1
No. memory FUs: 1 load + 1 store

15 P6 Pipeline
14 clocks in total (~3 state machines):
8 stages are used for in-order instruction fetch, decode, and issue; it takes 1 clock cycle to determine the length of an 80x86 instruction plus 2 more to create the micro-operations (uops).
3 stages are used for out-of-order execution in one of 5 separate functional units.
3 stages are used for instruction commit.
[Pipeline sketch: Instruction Fetch (16B/clk) -> Instruction Decode (3 instrs/clk, up to 6 uops) -> Renaming (3 uops/clk) -> Reservation Stations -> Execution units (5) -> Reorder Buffer -> Graduation (3 uops/clk).]

16 P6 Block Diagram

17 Pentium III Die Photo
EBL/BBL - Bus logic, Front, Back
MOB - Memory Order Buffer
Packed FPU - MMX Fl. Pt. (SSE)
IEU - Integer Execution Unit
FAU - Fl. Pt. Arithmetic Unit
MIU - Memory Interface Unit
DCU - Data Cache Unit
PMH - Page Miss Handler
DTLB - Data TLB
BAC - Branch Address Calculator
RAT - Register Alias Table
SIMD - Packed Fl. Pt.
RS - Reservation Station
BTB - Branch Target Buffer
IFU - Instruction Fetch Unit (+I$)
ID - Instruction Decode
ROB - Reorder Buffer
MS - Micro-instruction Sequencer
1st Pentium III, Katmai: 9.5M transistors, 12.3 x 10.4 mm, in a 0.25-micron process with 5 layers of aluminum.

18 P6 Performance: stalls at the decode stage caused by I$ misses or lack of a reservation station / reorder buffer entry

19 P6 Performance: uops/x86 instr 200 MHz, 8KI$/8KD$/256KL2$, 66 MHz bus

20 P6 Performance: Branch Mispredict Rate

21 P6 Performance: Speculation rate (% instructions issued that do not commit)

22 P6 Performance: Cache Misses/1k instr

23 P6 Performance: uops committed per clock
Average: 0 uops 55%, 1 uop 13%, 2 uops 8%, 3 uops 23%
Integer programs: 0 uops 40%, 1 uop 21%, 2 uops 12%, 3 uops 27%
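A rough back-of-the-envelope reading of these distributions: the average commit rate works out to about 0.55×0 + 0.13×1 + 0.08×2 + 0.23×3 ≈ 1.0 uop per clock overall, and about 0.40×0 + 0.21×1 + 0.12×2 + 0.27×3 ≈ 1.3 uops per clock on the integer programs, well below the machine's 3-uop commit limit.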

24 P6 Dynamic Benefit? Sum-of-parts CPI vs. actual CPI. The ratio of the sum of the individual stall components to the actual measured CPI is 1.38X on average (1.29X for integer programs); that is, dynamic scheduling overlaps enough stalls that the machine runs faster than a simple additive stall model would predict.

25 AMD Athlon
Similar to the P6 microarchitecture (Pentium III), but with more resources.
Transistors: PIII 24M v. Athlon 37M
Die size: 106 mm² v. 117 mm²
Power: 30W v. 76W
Cache: 16K/16K/256K v. 64K/64K/256K
Window size: 40 v. 72 uops
Rename registers: 40 v. 36 int + 36 Fl. Pt.
BTB: 512 x 2 v. x 2
Pipeline: stages v. stages
Clock rate: 1.0 GHz v. 1.2 GHz
Memory bandwidth: 1.06 GB/s v. GB/s

26 Pentium 4
Still translates from 80x86 instructions to micro-ops.
P4 has a better branch predictor and more FUs.
The instruction cache holds micro-operations instead of 80x86 instructions, so there are no 80x86 decode stages on a cache hit; it is called the "trace cache" (TC) - see the sketch below.
Faster memory bus: 400 MHz v. 133 MHz.
Caches - Pentium III: L1I 16KB, L1D 16KB, L2 256 KB; Pentium 4: L1I 12K uops, L1D 8 KB, L2 256 KB.
Block size: PIII 32B v. P4 128B; 128 v. 256 bits/clock.
Clock rates: Pentium III 1 GHz v. Pentium 4 1.5 GHz.
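As a rough, hypothetical sketch of the trace-cache idea (not Intel's actual implementation), the C fragment below caches already-decoded uops indexed by the fetch address, so that a hit bypasses the 80x86 decoder entirely. The geometry, tag scheme, and Uop type are invented; the real trace cache additionally builds traces that follow predicted taken branches.

/* Simplified trace-cache sketch: cache decoded uops keyed by the starting
 * instruction pointer; a hit skips the x86 decode stages entirely. */
#include <stdint.h>
#include <stdbool.h>

#define TC_LINES      256
#define UOPS_PER_LINE 6

typedef struct { uint32_t opcode, operands; } Uop;   /* placeholder uop */

typedef struct {
    bool     valid;
    uint64_t tag;                 /* starting x86 instruction pointer */
    int      n_uops;
    Uop      uops[UOPS_PER_LINE];
} TraceLine;

static TraceLine trace_cache[TC_LINES];

/* On a hit, uops flow straight to rename/allocate; on a miss the x86
 * decoder must run and the resulting trace is filled into the cache. */
int tc_lookup(uint64_t ip, const Uop **uops_out)
{
    TraceLine *line = &trace_cache[(ip >> 2) % TC_LINES];
    if (line->valid && line->tag == ip) {
        *uops_out = line->uops;
        return line->n_uops;      /* hit: decoded uops, no decode stages */
    }
    return 0;                     /* miss: fall back to the decoder      */
}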

27 Pentium 4 Features
Multimedia instructions 128 bits wide vs. 64 bits wide => 144 new instructions. When are they used by programs?
Faster floating point: executes 2 64-bit FP operations per clock.
Memory FU: 128-bit load, store per clock to the MMX regs.
Uses RAMBUS DRAM: bandwidth is faster, latency is the same as SDRAM; cost is 2X-3X vs. SDRAM.
ALUs operate at 2X the clock rate for many ops.
The pipeline doesn't stall at this clock rate: uops are replayed.
Rename registers: 40 v. 128; window: 40 v. 126.
BTB: 512 v. 4096 entries (Intel: 1/3 improvement).

28 Basic Pentium 4 Pipeline
1-2: trace cache next-instruction pointer
3-4: fetch uops from the trace cache
5: drive uops to the allocator
6: allocate resources (ROB entry, registers, …)
7-8: rename logic maps to the 128 physical registers
9: put renamed uops into the queue
10-12: write uops into the scheduler
13-14: dispatch up to 6 uops to the FUs
15-16: read registers
17: FU execution
18: compute flags, e.g. for branch instructions
19: check the branch outcome against the branch prediction
20: drive the branch-check result to the front end

29 Block Diagram of Pentium 4 Microarchitecture
BTB = Branch Target Buffer (branch predictor)
I-TLB = Instruction TLB; Trace Cache = instruction cache
RF = Register File; AGU = Address Generation Unit
"Double-pumped ALU" means the ALU clock rate is 2X => effectively 2X ALU FUs.
From "Pentium 4 (Partially) Previewed," Microprocessor Report, 8/28/00.

30 Pentium 4 Die Photo
42M transistors (PIII: 26M)
217 mm² (PIII: 106 mm²)
L1 execution trace cache: 12,000 micro-ops
8KB data cache
256KB L2$

31 Benchmarks: Pentium 4 v. PIII v. Athlon
SPECbase2000 Int, GHz: 524 (P4), 454 (PIII), AMD; FP, GHz: 549 (P4), 329 (PIII), AMD.
WorldBench 2000 benchmark (business), PC World magazine, Nov. 20, 2000 (bigger is better): P4 164, PIII 167, AMD Athlon 180.
Quake 3 Arena: P4 172, Athlon 151.
SYSmark 2000 composite: P4 209, Athlon 221.
Office productivity: P4 197, Athlon 209.
S.F. Chronicle 11/20/00: "… the challenge for AMD now will be to argue that frequency is not the most important thing -- precisely the position Intel has argued while its Pentium III lagged behind the Athlon in clock speed."