CC 423: Advanced Computer Architecture Limits to ILP

Slides:



Advertisements
Similar presentations
CS136, Advanced Architecture Limits to ILP Simultaneous Multithreading.
Advertisements

Speculative ExecutionCS510 Computer ArchitecturesLecture Lecture 11 Trace Scheduling, Conditional Execution, Speculation, Limits of ILP.
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.
1 Advanced Computer Architecture Limits to ILP Lecture 3.
1 Lecture 12: Limits of ILP and Pentium Processors ILP limits, Study strategy, Results, P-III and Pentium 4 processors Adapted from UCB CS252 S01.
Lecture 8: More ILP stuff Professor Alvin R. Lebeck Computer Science 220 Fall 2001.
EEL 5708 Speculation. Branch prediction. Superscalar processors. Lotzi Bölöni.
Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.
W04S1 COMP s1 Seminar 4: Branch Prediction Slides due to David A. Patterson, 2001.
1 Digital Equipment Corporation (DEC) PDP-1. Spacewar (below), the first video game was developed on a PDP-1. DEC PDP-8. Extremely successful. Made DEC.
CS252 Graduate Computer Architecture Lecture 11 Limits to ILP / Multithreading March 1 st, 2010 John Kubiatowicz Electrical Engineering and Computer Sciences.
Limits of Instruction-Level Parallelism CS 282 – KAUST – Spring 2010 Muhamed Mudawar Original slides created by: David Patterson.
1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )
1 Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections , )
CSE 820 Graduate Computer Architecture Lec 9 – Limits to ILP and Simultaneous Multithreading Base on slides by David Patterson.
8 – Simultaneous Multithreading. 2 Review from Last Time Limits to ILP (power efficiency, compilers, dependencies …) seem to limit to 3 to 6 issue for.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
Limits to ILP How much ILP is available using existing mechanisms with increasing HW budgets? Do we need to invent new HW/SW mechanisms to keep on processor.
Microprocessor Microarchitecture Limits of Instruction-Level Parallelism Lynn Choi Dept. Of Computer and Electronics Engineering.
Korea UniversityG. Lee CRE652 Processor Architecture Course Objective: To gain (1). knowledge on the current issues in processor architectures,
Computer Architecture Lec 10 –Simultaneous Multithreading.
ECE 4100/6100 Advanced Computer Architecture Lecture 2 Instruction-Level Parallelism (ILP) Prof. Hsien-Hsin Sean Lee School of Electrical and Computer.
Csci 211 Computer System Architecture Limits on ILP and Simultaneous Multithreading Xiuzhen Cheng Department of Computer Sciences The George Washington.
EECS 252 Graduate Computer Architecture Lec 9 – Limits to ILP and Simultaneous Multithreading David Patterson Electrical Engineering and Computer Sciences.
1 CPRE 585 Term Review Performance evaluation, ISA design, dynamically scheduled pipeline, and memory hierarchy.
CSC 7080 Graduate Computer Architecture Lec 6 – Limits to ILP and Simultaneous Multithreading Dr. Khalaf Notes adapted from: David Patterson Electrical.
Lecture 1: Introduction Instruction Level Parallelism & Processor Architectures.
1 Chapter 3: Limits on ILP Limits to ILP (another perspective) Thread Level Parallelism Multithreading Simultaneous Multithreading Power 4 vs. Power 5.
Advanced Pipelining 7.1 – 7.5. Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike.
现代计算机体系结构 主讲教师:张钢天津大学计算机学院 2009 年.
CSE431 L13 SS Execute & Commit.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 13: SS Backend (Execute, Writeback & Commit) Mary Jane.
Pentium 4 Deeply pipelined processor supporting multiple issue with speculation and multi-threading 2004 version: 31 clock cycles from fetch to retire,
CS 352H: Computer Systems Architecture
Limits to ILP How much ILP is available using existing mechanisms with increasing HW budgets? Do we need to invent new HW/SW mechanisms to keep on processor.
/ Computer Architecture and Design
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue
Simultaneous Multithreading
Lynn Choi School of Electrical Engineering
Lynn Choi Dept. Of Computer and Electronics Engineering
CS 704 Advanced Computer Architecture
Limits on ILP and Multithreading
CS203 – Advanced Computer Architecture
Chapter 3: ILP and Its Exploitation
Henk Corporaal TUEindhoven 2009
David Patterson Electrical Engineering and Computer Sciences
John Kubiatowicz Electrical Engineering and Computer Sciences
April 2, 2002 Prof. David E. Culler Computer Science 252 Spring 2002
Electrical and Computer Engineering
CS152 Computer Architecture and Engineering Lecture 18 Dynamic Scheduling (Cont), Speculation, and ILP.
A Dynamic Algorithm: Tomasulo’s
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Limits to ILP Conflicting studies of amount
Lec 9 – Limits to ILP and Simultaneous Multithreading
Yingmin Li Ting Yan Qi Zhao
Adapted from the slides of Prof
CS 704 Advanced Computer Architecture
Advanced Computer Architecture
Lecture on High Performance Processor Architecture (CS05162)
Henk Corporaal TUEindhoven 2011
Control unit extension for data hazards
Adapted from the slides of Prof
CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue
Adapted from the slides of Prof
Dynamic Hardware Prediction
8 – Simultaneous Multithreading
Chapter 3: Limits of Instruction-Level Parallelism
Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures
Presentation transcript:

CC 423: Advanced Computer Architecture Limits to ILP Adapted from the slides of Prof. David Patterson, University of California, Berkeley CS252 S05

Limits to ILP Conflicting studies of amount Benchmarks (vectorized Fortran FP vs. integer C programs) Hardware sophistication Compiler sophistication How much ILP is available using existing mechanisms with increasing HW budgets? Do we need to invent new HW/SW mechanisms to keep on processor performance curve? Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock Motorola AltaVec: 128 bit ints and FPs Supersparc Multimedia ops, etc.

Overcoming Limits Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies However, unlikely such advances when coupled with realistic hardware will overcome these limits in near future

Limits to ILP Initial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start: 1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided 2. Branch prediction – perfect; no mispredictions 3. Jump prediction – all jumps perfectly predicted (returns, case statements) 2 & 3  no control dependencies; perfect speculation & an unbounded buffer of instructions available 4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal; 1&4 eliminates all but RAW Also: perfect caches; 1 cycle latency for all instructions (FP *,/); unlimited instructions issued/clock cycle;

Limits to ILP HW Model comparison Power 5 Instructions Issued per clock Infinite 4 Instruction Window Size 200 Renaming Registers 48 integer + 40 Fl. Pt. Branch Prediction Perfect 2% to 6% misprediction (Tournament Branch Predictor) Cache 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Analysis ??

Upper Limit to ILP: Ideal Machine (Figure 3.1) FP: 75 - 150 Integer: 18 - 60 Instructions Per Clock

Limits to ILP HW Model comparison New Model Model Power 5 Instructions Issued per clock Infinite 4 Instruction Window Size Infinite, 2K, 512, 128, 32 200 Renaming Registers 48 integer + 40 Fl. Pt. Branch Prediction Perfect 2% to 6% misprediction (Tournament Branch Predictor) Cache 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias ??

More Realistic HW: Window Impact Figure 3.2 Change from Infinite window 2048, 512, 128, 32 FP: 9 - 150 Integer: 8 - 63 IPC

Limits to ILP HW Model comparison New Model Model Power 5 Instructions Issued per clock 64 Infinite 4 Instruction Window Size 2048 200 Renaming Registers 48 integer + 40 Fl. Pt. Branch Prediction Perfect vs. 8K Tournament vs. 512 2-bit vs. profile vs. none Perfect 2% to 6% misprediction (Tournament Branch Predictor) Cache 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias ??

More Realistic HW: Branch Impact Figure 3.3 Change from Infinite window to examine to 2048 and maximum issue of 64 instructions per clock cycle FP: 15 - 45 Integer: 6 - 12 IPC Perfect Tournament BHT (512) Profile No prediction

Limits to ILP HW Model comparison New Model Model Power 5 Instructions Issued per clock 64 Infinite 4 Instruction Window Size 2048 200 Renaming Registers Infinite v. 256, 128, 64, 32, none 48 integer + 40 Fl. Pt. Branch Prediction 8K 2-bit Perfect Tournament Branch Predictor Cache 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias

More Realistic HW: Renaming Register Impact (N int + N fp) Figure 3.5 Change 2048 instr window, 64 instr issue, 8K 2 level Prediction Integer: 5 - 15 IPC Infinite 256 128 64 32 None

Limits to ILP HW Model comparison New Model Model Power 5 Instructions Issued per clock 64 Infinite 4 Instruction Window Size 2048 200 Renaming Registers 256 Int + 256 FP 48 integer + 40 Fl. Pt. Branch Prediction 8K 2-bit Perfect Tournament Cache 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Perfect v. Stack v. Inspect v. none

More Realistic HW: Memory Address Alias Impact Figure 3.6 Change 2048 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers FP: 4 - 45 (Fortran, no heap) Integer: 4 - 9 IPC Perfect Global/Stack perf; heap conflicts Inspec. Assem. None

Limits to ILP HW Model comparison New Model Model Power 5 Instructions Issued per clock 64 (no restrictions) Infinite 4 Instruction Window Size Infinite vs. 256, 128, 64, 32 200 Renaming Registers 64 Int + 64 FP 48 integer + 40 Fl. Pt. Branch Prediction 1K 2-bit Perfect Tournament Cache 64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias HW disambiguation

Realistic HW: Window Impact (Figure 3.7) Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window FP: 8 - 45 IPC Integer: 6 - 12 Infinite 256 128 64 32 16 8 4

How to Exceed ILP Limits of this study? These are not laws of physics; just practical limits for today, and perhaps overcome via research Compiler and ISA advances could change results WAR and WAW hazards through memory: eliminated WAW and WAR hazards through register renaming, but not in memory usage Can get conflicts via allocation of stack frames as a called procedure reuses the memory addresses of a previous frame on the stack

Limits to ILP Doubling issue rates above today’s 3-6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to issue 3 or 4 data memory accesses per cycle, resolve 2 or 3 branches per cycle, rename and access more than 20 registers per cycle, and fetch 12 to 24 instructions per cycle. The complexities of implementing these capabilities is likely to mean sacrifices in the maximum clock rate E.g, widest issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most power!

Limits to ILP Most techniques for increasing performance increase power consumption The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance? Multiple issue processors techniques all are energy inefficient: Issuing multiple instructions incurs some overhead in logic that grows faster than the issue rate grows Growing gap between peak issue rates and sustained performance Number of transistors switching = f(peak issue rate), and performance = f( sustained rate), growing gap between peak and sustained performance  increasing energy per unit of performance