Microarchitecture of Superscalars (4): Decoding. Dezső Sima, Fall 2007 (Ver. 2.0). © Dezső Sima, 2007.



Overview
1. Overview
2. Straightforward parallel decoding
3. Predecoding
4. Decoding with CISC/RISC conversion
   4.1 Overview
   4.2 Decoding into µops
   4.3 Decoding into macroops
5. Using a trace cache
6. Decoding with instruction grouping
   6.1 Overview
   6.2 Grouping of RISC instructions
   6.3 Grouping of CISC instructions

1. Overview
Decoding techniques used in superscalars:
- Straightforward parallel decoding (1st gen. RISC superscalars)
- Predecoding (beginning with 2nd gen. superscalars)
- Decoding with CISC/RISC conversion (beginning with 2nd gen. superscalar CISCs)
  - Decoding into µops (Intel, beginning with the Pentium Pro)
  - Decoding into macroops (AMD, beginning with the K7; up to two µops)
- Using a trace cache (P4-family)
- Decoding with instruction grouping
  - Grouping of RISC instructions (POWER4, POWER5; K7 (Athlon), K8 (Hammer))
  - Grouping of CISC instructions (Pentium M, Core)

2 Straightforward parallel decoding Figure 2.1: The PowerPC 601’s front end Source: Stokes, J.H., „PowerPC on Apple: An architecture history”, Aug

3 Predecoding (1)
Figure 3.1: Contrasting decoding and instruction issue in a scalar and a 4-way superscalar processor
[Figure: in both cases the I-cache feeds an instruction buffer, followed by decode / issue / check. Typical FX pipeline layout: scalar issue F, D/I, ...; superscalar issue F, D, I, ...]

3 Predecoding (1)
Figure 3.2: The principle of predecoding
[Figure: second-level cache (or memory) → predecode unit (typically 128 bits/cycle) → I-cache (e.g. 148 bits/cycle)]
When instructions are written into the I-cache, the predecode unit of a RISC processor appends 4-7 bits to each instruction.
AMD’s CISC processors append n bits to each byte (K5, K6: 5 bits/byte; K7, K8: 3 bits/byte).
Source: Sima, D. et al., „ACA”, Addison-Wesley 1997
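The bandwidth figures in Figure 3.2 follow from simple arithmetic. The sketch below is only an illustration: it assumes 32-bit RISC instructions with 5 predecode bits each (one value from the 4-7 bit range quoted above) and, for the byte-oriented case, a hypothetical 16-byte fetch width. The function names are made up for this example.

```python
def predecoded_width_risc(fetch_bits=128, instr_bits=32, predecode_bits=5):
    """Bits written into the I-cache per cycle when every fixed-length
    instruction gets `predecode_bits` extra bits appended."""
    instructions = fetch_bits // instr_bits
    return instructions * (instr_bits + predecode_bits)


def predecoded_width_cisc(fetch_bytes=16, predecode_bits_per_byte=3):
    """Same idea for a byte-oriented (x86) stream: bits are appended per byte.
    The 16-byte fetch width is an assumption, not taken from the slide."""
    return fetch_bytes * (8 + predecode_bits_per_byte)


print(predecoded_width_risc())                           # 148
print(predecoded_width_cisc(predecode_bits_per_byte=5))  # 208 (5 bits/byte, as in K5/K6)
print(predecoded_width_cisc(predecode_bits_per_byte=3))  # 176 (3 bits/byte, as in K7/K8)
```

With four 32-bit instructions per cycle and 5 predecode bits each, the 128 bits/cycle entering the predecode unit become the 148 bits/cycle shown in the figure.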

3 Predecoding (2) Figure 3.3: The introduction of predecoding Source: Sima, D. et al., „ACA”, Addison-Wesley 1997

3 Predecoding (3) Figure 3.4: Variable length instruction decoding in the Athlon Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept. 2003

3 Predecoding (4) Figure 3.5: Opteron’s instruction cache and decoding Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept. 2003

4 Decoding with CISC/RISC conversion
4.1 Overview
Figure 4.1: Principle of decoding with CISC/RISC conversion
[Figure: CISC instructions → decoding with CISC/RISC conversion (into µops or macroops) → RISC core → retiring with RISC/CISC conversion → modification of the program state after RISC/CISC re-conversion. Examples: PPro, K6.]
Source: Sima, D. et al., „ACA”, Addison-Wesley

4.2 Decoding into µops (1) Figure 4.2: The Microarchitecture of the Pentium Pro Source: Shanley, T., „Pentium Pro Processor System Architecture”, Addison-Wesley Press, 1997

4.2 Decoding into µops (2) Figure 4.3: Basic misprediction pipeline of the Pentium III Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001

4.2 Decoding into µops (3) Figure 4.4: Decoding in AMD’s K6 Source: Shriver, B., Smith, B., „The Anatomy of a High-Performance Microprocessor”, IEEE Computer Society Press

4.2 Decoding into µops (4) Figure 4.5: The Microarchitecture of the Pentium M (Yonah) Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

4.2 Decoding into µops (5) Figure 4.6: The Microarchitecture of the Core processor family Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

4.3 Decoding into macroops (1) Figure 4.7: The Microarchitecture of the AMD Athlon™ Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998

4.3 Decoding into macroops (2) Figure 4.8: Decoding in the Athlon (1) Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998

4.3 Decoding into macroops (3) Figure 4.9: Decoding in the Athlon (2) Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998

4.3 Decoding into macroops (4)
Each MacroOp consists of 1 or 2 operations (OPs), e.g.:
  ADD EAX, EBX        → 1 ADD OP
  AND EAX, [EBX+16]   → 1 LOAD OP + 1 AND OP
Up to 3 MacroOps per cycle, with up to 3 FX + 2 L/S OPs per cycle (dual-ported D$!).
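The two examples above can be mimicked with a toy decoder. This is a minimal sketch, not AMD's decoding logic: it only handles two-operand `MNEMONIC dst, src` forms and assumes that a bracketed source operand means "memory operand, needs a LOAD OP first". `MacroOp` and `decode_to_macroop` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class MacroOp:
    text: str   # the original x86 instruction
    ops: list   # the 1 or 2 internal operations (OPs) it carries


def decode_to_macroop(instr: str) -> MacroOp:
    """Toy decoder for 'MNEMONIC dst, src' with a register or memory source."""
    mnemonic, operands = instr.split(None, 1)
    dst, src = [o.strip() for o in operands.split(",", 1)]
    ops = []
    if src.startswith("["):       # memory source operand: a LOAD OP comes first
        ops.append("LOAD")
    ops.append(mnemonic.upper())  # the arithmetic/logic OP itself
    return MacroOp(instr, ops)


print(decode_to_macroop("ADD EAX, EBX").ops)        # ['ADD']
print(decode_to_macroop("AND EAX, [EBX+16]").ops)   # ['LOAD', 'AND']
```

Both results stay within the limit of two OPs per MacroOp; instruction forms outside this pattern (for example a memory destination) are deliberately not modelled.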

4.3 Decoding into macroops (5) Figure 4.10: The Microarchitecture of the Hammer Source: Weber, F., „AMD’s Next Generation Microprocessor Architecture”, MPF. Oct. 2001

5 Using a trace cache (1) Figure 5.1: The Microarchitecture of the Pentium 4 (Willamette)

5 Using a trace cache (2) Figure 5.2: Basic misprediction pipeline of the Pentium 4 (Willamette) Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001

5 Using a trace cache (3) Figure 5.3: The Microarchitecture of the Pentium 4 (Prescott) Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

6 Decoding with instruction grouping
6.1 Overview
Decoding with instruction grouping:
- Grouping of RISC instructions: POWER4, POWER5; K7 (Athlon), K8 (Hammer)
- Grouping of CISC instructions: Pentium M, Core arch.

6.2 Grouping of RISC instructions (1)
Figure 5.3: Instruction grouping in the K7 and K8
[Figure: operation of the reorder buffer (ROB), organized by line index with lanes 0, 1 and 2. Legend: instructions finished out of order, results still speculative; instructions being retired now; retired instructions, no longer speculative.]
Up to 3 MacroOps are decoded per cycle; these MacroOps are allocated one line in the ROB.
The ROB has 24 lines of 3 entries each.
The ROB retires a line if it is the oldest one and all MacroOps in that line are completed.
Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept. 2003
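The line-based allocation and retirement rules of this slide can be sketched in a few lines. This is an illustrative model only, assuming unique MacroOp names and ignoring everything except allocation, completion and retirement; `LineROB` and its methods are hypothetical names, while the sizes (24 lines, 3 lanes) come from the slide.

```python
from collections import deque

LANES = 3        # MacroOps per ROB line (from the slide)
MAX_LINES = 24   # number of ROB lines (from the slide)


class LineROB:
    def __init__(self):
        self.lines = deque()   # oldest line sits at the left end

    def allocate(self, macroops):
        """Allocate one line for the MacroOps decoded in the same cycle.
        Returns False (decode would stall) when all lines are in use."""
        if not 1 <= len(macroops) <= LANES:
            raise ValueError("a line holds 1 to 3 MacroOps")
        if len(self.lines) == MAX_LINES:
            return False
        self.lines.append({name: False for name in macroops})
        return True

    def complete(self, name):
        """Mark a MacroOp as finished; this may happen out of order."""
        for line in self.lines:
            if name in line:
                line[name] = True
                return
        raise KeyError(name)

    def retire(self):
        """Retire the oldest line only if all of its MacroOps are completed."""
        if self.lines and all(self.lines[0].values()):
            return list(self.lines.popleft())
        return None


rob = LineROB()
rob.allocate(["m0", "m1", "m2"])
rob.allocate(["m3", "m4"])
rob.complete("m3")
rob.complete("m4")              # the younger line finishes first
print(rob.retire())             # None: the oldest line is not complete yet
for m in ("m0", "m1", "m2"):
    rob.complete(m)
print(rob.retire())             # ['m0', 'm1', 'm2']
print(rob.retire())             # ['m3', 'm4']
```

The younger line stays speculative until the older one has retired, which is the in-order retirement behaviour the legend of the figure describes.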

6.2 Grouping of RISC instructions (2)
Figure 6.1: Out-of-order execution of MacroOps from the FX schedulers in the K8L (to be introduced in Q2 2007)
[Figure labels: decoders, schedulers, EUs]
The K8L scheduler has 8*3 entries vs. 6*3 in the K8.
Source: Malich, Y., „AMD's Next Generation Microarchitecture Preview: from K8 to K8L”, Aug

6.2 Grouping of RISC instructions (3)
Figure 6.1: The principle of instruction grouping in IBM’s POWER4 and POWER5 processors
[Figure: instruction groups → ROB and issue queues → execution units (EUs) → retire]
- Dispatch instruction groups in order; forward individual instructions to the issue queues.
- Execute individual instructions out of order (ooo).
- Retire instruction groups in order; modify the program state.

6.2 Grouping of RISC instructions (4) Figure 6.2: Implementation of instruction grouping in IBM’s POWER 5 processor Source: Sinharoy, B. et al. „POWER5 system microarchitecture”, IBM J.,Res.& Dev., July/Sept

6.3 Grouping of CISC instructions (1) (Intel: macro-op fusion)
x86 instructions: macro-ops; internal instructions: μops.
Macro-op fusion combines two macro-ops into a single μop. Specifically, x86 compare or test instructions are fused with x86 jumps to produce a single μop.
Any decoder can perform macro-op fusion, but only one macro-op fusion can be performed in each cycle.
In the Core architecture the max. decode bandwidth is 4+1 x86 instructions/cycle.
Macro-op fusion can reduce the number of μops by about 10%.
Introduced in the Core architecture.
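The "4+1" decode bandwidth and the one-fusion-per-cycle rule can be illustrated with a small model. It is a sketch under strong simplifying assumptions: every macro-op is taken to decode into exactly one μop, any mnemonic starting with `j` is treated as a jump, and the real hardware's further restrictions on fusible pairs are ignored. `decode_cycle` and the constants are hypothetical names.

```python
DECODERS = 4                      # decoders available per cycle (the "4" in "4+1")
FUSIBLE_FIRST = {"cmp", "test"}   # compare/test macro-ops that may be fused


def is_jump(mnemonic):
    # Assumption: any mnemonic starting with 'j' counts as an x86 jump
    return mnemonic.startswith("j")


def decode_cycle(stream):
    """Decode one cycle's worth of x86 macro-ops.
    Returns (uops, consumed): the μops produced and the macro-ops used up."""
    uops, i, fused = [], 0, False
    while i < len(stream) and len(uops) < DECODERS:
        cur = stream[i]
        nxt = stream[i + 1] if i + 1 < len(stream) else None
        if not fused and cur in FUSIBLE_FIRST and nxt is not None and is_jump(nxt):
            uops.append(f"{cur}+{nxt}")   # two macro-ops become one μop
            fused = True                  # at most one fusion per cycle
            i += 2
        else:
            uops.append(cur)
            i += 1
    return uops, i


stream = ["mov", "cmp", "jne", "add", "sub", "test", "jz"]
print(decode_cycle(stream))       # (['mov', 'cmp+jne', 'add', 'sub'], 5)
print(decode_cycle(stream[5:]))   # (['test+jz'], 2)
```

In the first cycle five x86 instructions are consumed but only four μops are produced, which is where the "+1" comes from.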

6.3 Grouping of CISC instructions (2)
Benefits:
- Fewer μops
- Increased performance
- ooo execution becomes more effective, as the instruction window now holds more (~10%) x86 instructions