CRE652 Processor Architecture, G. Lee, Korea University, 2009


CRE652 Processor Architecture
Course Objective: To gain (1) knowledge of current issues in processor architectures, and (2) skills for performing architecture research.
References:
- J. Smith and G. Sohi, "The Microarchitecture of Superscalar Processors", Proceedings of the IEEE, Dec. 1995
- Papers from ISCA, MICRO, and ICCD
- J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann

Superscalar Processor Model
[block diagram: I-Cache, I-Buffer, BTB, register renaming, dispatch (scheduler), instruction window / reservation stations, function units, register file, ROB, D-Cache; pipeline stages IF, DP, IS, WB, CT]
Related paradigms: VLIW / EPIC, SMT

Memory Access Flow
[diagram: virtual address, from the program counter or a load/store instruction, translated through the I-TLB / D-TLB; the page table is reached via the page-table pointer register (entry with Dirty = 1); translated accesses go to the cache and then to memory]

Walls: Limits on Performance
- ILP Wall
- Memory Wall
- Power Wall

ILP (Instruction-Level Parallelism)
Fundamental limitation: data-flow dependency
Practical limiting factors:
- Instruction window size
- Branch prediction
- Data dependency
- Register renaming
- Memory-address aliasing / memory disambiguation
- (Resource conflicts)
- (Memory latency due to cache misses and lack of ports)

ILP (Instruction-Level Parallelism)
With no limiting factors, i.e. an infinite window, infinite renaming registers, perfect branch prediction, and all memory addresses exactly known, the average ILP in programs is known to be quite high. With realistic limiting factors, however, IPC becomes fairly restricted.
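The ideal-ILP claim can be made concrete with a toy calculation: under the stated assumptions (unbounded window, perfect prediction, perfect renaming and aliasing), the achievable ILP is just the instruction count divided by the length of the longest RAW dependence chain. The sketch below, with a hypothetical unit-latency trace, only illustrates that dataflow limit; it is not the methodology of the studies cited later.

```python
# Toy dataflow-limit ILP: unit-latency instructions, unlimited window,
# perfect branch prediction/renaming/aliasing; only RAW deps constrain issue.
def ideal_ilp(trace):
    """trace: list of (dest, [sources]); returns instr count / critical path."""
    ready = {}  # register -> cycle its value becomes available
    depth = 0
    for dest, srcs in trace:
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle
        depth = max(depth, cycle)
    return len(trace) / depth

# Hypothetical 4-instruction trace: two independent chains of length 2.
trace = [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r3"])]
print(ideal_ilp(trace))  # 2.0: 4 instructions / critical path of 2
```

With long independent chains, this ratio grows without bound, which is why the "no limiting factors" studies report very high ILP.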

ILP Limit
1. Foster and Riseman, "Percolation of Code to Enhance Parallel Dispatching and Execution", IEEE Trans. Computers, Vol. C-21, Dec. 1972.
   Measured ILP as a function of the number of branches bypassed: with 0 branches bypassed (basic block only) ILP is small; with ∞ branches bypassed, ILP reaches 51.2.

ILP Limit
2. SPEC92 (H&P-Text Fig. 3.1, p. 157): ILP ranges from 17.9 for li up to the maximum for tomcatv.
3. M. A. Postiff et al., "The Limits of Instruction Level Parallelism in SPEC95 Applications", INTERACT-3, ACM Computer Architecture News, Vol. 27, No. 1, Mar. 1999.
   - With no memory aliasing: lowest for li, highest for mgrid (61.47 for tomcatv)
   - With stack dependencies (from activation-record allocation) removed: lowest for li, highest for mgrid

ILP due to practical limiting factors (H&P-Text pp. 152–170)
- Instruction window size: the more instructions considered, the better the ILP potential
- Branch prediction accuracy: fewer wasted cycles
- Renaming registers: more registers give a better chance to remove WAR and WAW hazards
- Memory aliasing: more accurate memory dependency information
- Resources: function unit types must match the available ILP

ILP due to practical limiting factors
Limiting Factor – Instruction Window Size
- Instruction window: the set of instructions examined for simultaneous execution (reservation stations + current fetch)
- Max. no. of comparisons = (no. of completing instructions) × (no. of instructions waiting to be issued) × 2, assuming at most two source operands per instruction
- With a typical window size of 64 to 128, this logic is time-critical
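As a rough illustration of the comparison-count formula above (the completion width of 4 is an assumed value, not from the slide):

```python
# Wakeup comparisons per cycle: each completing result is broadcast and
# compared against both source tags of every waiting instruction.
def wakeup_comparisons(completing, waiting, operands_per_instr=2):
    return completing * waiting * operands_per_instr

# Assume 4 results complete per cycle; cost grows linearly with window size.
for window in (64, 128, 256):
    print(window, wakeup_comparisons(4, window))  # e.g. 64 -> 512
```

The linear growth in tag comparators (and the quadratic growth once completion width scales with window size) is what makes large windows time-critical.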

ILP due to practical limiting factors
Limiting Factor – Instruction Window Size
e.g. (from H&P-Text Fig. 3.2, p. 159): ILP vs. window size
Note: (1) the effect of window size; (2) the diminishing returns of larger windows

ILP due to practical limiting factors
Limiting Factor – Branch Prediction
e.g. (from H&P-Text Fig. 3.3, p. 160): ILP vs. branch prediction scheme
- perf: perfect branch prediction
- comb: tournament predictor
- bi: bimodal predictor (2-bit counters)
- stat: static prediction with profiling
- none: no prediction
Configuration: instruction window size 2K; issue limit 64; jump prediction with a 2K-entry table
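For concreteness, a minimal sketch of the "bi" scheme, a bimodal predictor built from 2-bit saturating counters; the table size and PC indexing here are arbitrary assumptions, not the figure's configuration:

```python
# Minimal bimodal branch predictor: one 2-bit saturating counter per entry,
# indexed by low-order PC bits. Counter value >= 2 predicts taken.
class Bimodal:
    def __init__(self, entries=4096):        # entries must be a power of two
        self.table = [1] * entries           # start at "weakly not-taken"
        self.mask = entries - 1

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0

bp = Bimodal()
for outcome in [True, True, True, False, True]:  # a mostly-taken branch
    bp.update(0x400, outcome)
print(bp.predict(0x400))  # True: counter saturated toward taken
```

The 2-bit hysteresis is the point: a single anomalous not-taken outcome does not flip the prediction, which is why this simple scheme already recovers much of the ILP lost to "none" or "stat" in the figure.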

ILP due to practical limiting factors
Limiting Factor – Renaming Registers
e.g. (from H&P-Text Fig. 3.5, p. 163): ILP vs. number of additional rename registers
Configuration: instruction window size 2K; issue limit 64; combining predictor with 8K entries total; jump prediction with a 2K-entry table
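A toy sketch of why more rename registers help: every architectural destination is given a fresh physical register, so WAR and WAW hazards disappear and only true (RAW) dependencies remain. The register names and the unbounded free list are illustrative assumptions:

```python
# Toy register renaming: each architectural destination gets a fresh physical
# register, so only true (RAW) dependencies remain between instructions.
def rename(trace, num_arch=32):
    # initial mapping: architectural r0..r31 -> physical p0..p31
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch)}
    next_phys = num_arch
    out = []
    for dest, srcs in trace:
        srcs = [mapping[s] for s in srcs]   # read current mappings (RAW kept)
        mapping[dest] = f"p{next_phys}"     # fresh register breaks WAR/WAW
        next_phys += 1
        out.append((mapping[dest], srcs))
    return out

# WAW on r1: the two writes get distinct physical registers p32 and p33,
# so they can complete in any order.
print(rename([("r1", ["r2"]), ("r1", ["r3"])]))
# [('p32', ['p2']), ('p33', ['p3'])]
```

In hardware the free list is finite, which is exactly why the figure shows ILP rising with the number of additional rename registers.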

ILP due to practical limiting factors
Limiting Factor – Memory Aliasing
e.g.
  ld $3, 200($4)
  st $5, 150($6)
How can we be sure about the dependency between the two memory locations ($4)+200 and ($6)+150?
- Perfect: known only after executing the program
- Global/stack: references to the global data region and stack accesses for local variables (activation records) can be disambiguated; heap references (dynamic data structures) are unknown, i.e. assumed to conflict
- Inspection: compile-time region analysis
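The "inspection" case can be sketched as a conservative compile-time check on base register + offset: only accesses with the same base register and different offsets are provably independent; everything else must be assumed to conflict. This is a simplified illustration (assuming equal-size, aligned accesses and an unmodified base register), not a full region analysis:

```python
# Conservative compile-time disambiguation ("inspection"): two accesses are
# known independent only if they use the same base register with different
# offsets (assuming equal-size aligned accesses, base not redefined between).
def may_alias(acc1, acc2):
    base1, off1 = acc1
    base2, off2 = acc2
    if base1 == base2:
        return off1 == off2   # same base: alias exactly when offsets match
    return True               # different bases: must assume a conflict

print(may_alias(("r4", 200), ("r4", 150)))  # False: provably disjoint
print(may_alias(("r4", 200), ("r6", 150)))  # True: unknown, assume conflict
```

The second case is the slide's ld/st pair: with different base registers, inspection cannot prove independence, so the load must wait on the store.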

ILP due to practical limiting factors
Limiting Factor – Memory Aliasing
e.g. (from H&P-Text Fig. 3.6, p. 164): ILP vs. alias-detection scheme
- P: perfect alias resolution
- G/S: global/stack
- Ins: inspection
Configuration: instruction window size 2K; issue limit 64; 256 registers; combining predictor with 8K entries total; jump prediction with a 2K-entry table

ILP Limit
A Realizable Superscalar Processor (H&P-Text Sec. 3.3), with rather realistic assumptions:
- 64-issue with no issue restrictions
- Tournament predictor with 1K entries
- 16-entry return-address predictor
- 256-entry instruction window
- No aliasing within the window
- 64 additional renaming registers
Note: "no issue restriction" is virtually impossible even for a lower issue count, say 16.

ILP Limit – Realistic Processor
[chart: ILP achieved under the assumptions above; annotation: around 25%]

ILP Limit – Realistic Processor
ILP potential in software is limited by:
- Resources: window size, function-unit mismatch, registers
- Dependencies: branch prediction; false dependency, i.e. output dependency (WAW); data dependency (RAW)

Processor Architecture Comparison (H&P-Text Sec. 3.6)
Processor               | Microarchitecture                                         | Fetch/Issue/Execute | FUs         | Clock (GHz) | Transistors, die size | Power
Intel Pentium 4 Extreme | speculative, dynamically scheduled; deeply pipelined; SMT | 3/3/4               | 7 int, 1 FP | –           | – M, 122 mm²          | – W
AMD Athlon 64 FX-57     | speculative, dynamically scheduled                        | 3/3/4               | 6 int, 3 FP | –           | – M, 115 mm²          | – W
IBM Power5 (1 CPU only) | speculative, dynamically scheduled; SMT; 2 CPU cores/chip | 8/4/8               | 6 int, 2 FP | –           | – M, 300 mm² (est.)   | 80 W (est.)
Intel Itanium 2         | statically scheduled, VLIW-style                          | 6/5/11              | 9 int, 2 FP | –           | – M, 423 mm²          | – W

Performance on SPECint2000

Performance on SPECfp2000

Normalized Performance: Efficiency Rank (1 = best)
Metric    | Itanium 2 | Pentium 4 | Athlon | Power5
Int/Trans |     4     |     2     |    1   |   3
FP/Trans  |     4     |     2     |    1   |   3
Int/area  |     4     |     2     |    1   |   3
FP/area   |     4     |     2     |    1   |   3
Int/Watt  |     4     |     3     |    1   |   2
FP/Watt   |     2     |     4     |    3   |   1

Superscalar Processor
N-way superscalar:
- fetch and decode N instructions per cycle
- N "ready" instructions are "issued" to function units
- pipeline: fetch, decode, renaming, dispatch, issue, execution, writeback/commit; after issue, execution begins
The maximum number of instructions a processor can send to function units simultaneously is the "issue width"; the actual issue rate is much less.
Fetch = Decode > Issue = Execute > Commit
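The gap between issue width and actual issue rate can be seen in a toy issue model (unit latency, hypothetical traces, out-of-order selection within the window): each cycle at most `width` ready instructions issue, so dependence chains drag IPC well below the width.

```python
# Toy issue model: each cycle, issue up to `width` instructions whose source
# operands were produced in an earlier cycle (unit latency). RAW dependence
# chains keep the achieved IPC below the machine's issue width.
def simulate(trace, width):
    produced = {dest for dest, _ in trace}   # registers written by the trace
    avail = {}                               # dest -> cycle result available
    pending = list(trace)
    cycle = 0
    while pending:
        cycle += 1
        issued = []
        for instr in pending:
            if len(issued) == width:         # issue-width limit this cycle
                break
            dest, srcs = instr
            # a source is ready if it predates the trace, or was produced
            # in a strictly earlier cycle
            if all(s not in produced or (s in avail and avail[s] < cycle)
                   for s in srcs):
                issued.append(instr)
        for dest, _ in issued:
            avail[dest] = cycle
        pending = [i for i in pending if i not in issued]
    return len(trace) / cycle                # achieved IPC

chain = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]
print(simulate(chain, 4))  # 1.0: dependency-limited despite 4-wide issue
```

Four independent instructions on the same 4-wide machine would reach IPC 4.0; the dependence chain, not the width, sets the rate.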

Note: Can we keep going down the superscalar path for better performance?
- Increase the instruction window, issue width, and datapath width
  → wire delay becomes a more important factor
  → a clustered organization may help: frequent intra-cluster operations, infrequent inter-cluster operations
- Simpler may be better? But a simple core does not fully utilize the available on-chip resources
- Adopt a multiprocessor approach? Then how do we control multiple processors for multiple instruction streams?

Note: Removing the dependency limit
1. Current programming models and conventions impose unnecessary dependencies:
   - WAR and WAW through memory: because of the way stack frames are allocated and deallocated, a procedure may reuse memory locations that a previous procedure on the stack used
   - Specific uses of registers: loop counter, return-address register, stack pointer, ...
2. Going beyond the data-flow limit: data value prediction with speculation
   - general value prediction: unlikely to pay off
   - address value prediction
   - constant / loop-index value prediction
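The value-prediction idea can be sketched with its simplest variant, a last-value predictor: assume an instruction produces the same value as in its previous execution and let dependents proceed speculatively (the squash-on-misprediction machinery is omitted). The table organization here is a hypothetical simplification:

```python
# Last-value predictor sketch: predict an instruction produces the same value
# as its previous execution; a correct prediction lets dependents start early.
class LastValuePredictor:
    def __init__(self):
        self.last = {}                        # pc -> last observed value

    def predict(self, pc):
        return self.last.get(pc)              # None: no prediction yet

    def update(self, pc, actual):
        ok = self.last.get(pc) == actual      # True -> speculation was correct
        self.last[pc] = actual
        return ok

lvp = LastValuePredictor()
# Hypothetical value stream for one static instruction: mostly repeating,
# one change (loop-invariant-like behavior favors last-value prediction).
hits = sum(lvp.update(0x100, v) for v in [7, 7, 7, 8, 8])
print(hits)  # 3 correct predictions: misses on first use and on the change
```

Correct predictions break the RAW chain at that instruction, which is how value speculation can exceed the pure data-flow limit; the constant and loop-index cases on the slide are exactly the streams this predictor captures well.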

Dealing with the Other Walls
Memory Wall:
- faster multilevel caches
- non-blocking, pipelined caches
- caches in multicore processors
- transactional memory
Power Wall:
- lower supply voltage
- allowing (and tolerating) errors

Adding New Functionality
- Network and I/O: bypassing OS intervention
- Multimedia: vector instructions
- Trusted computing: Trusted Platform Module