Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0)  Dezső Sima, 2007.


1 Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0)  Dezső Sima, 2007

2 Overview
1. Overview
2. Straightforward parallel decoding
3. Predecoding
4. Decoding with CISC/RISC conversion
   4.1 Overview
   4.2 Decoding into µops
   4.3 Decoding into macroops
5. Using a trace cache
6. Decoding with instruction grouping
   6.1 Overview
   6.2 Grouping of RISC instructions
   6.3 Grouping of CISC instructions

3 1. Overview
Decoding techniques used in superscalars:
- Straightforward parallel decoding (1. gen. RISC superscalars)
- Predecoding (beginning with 2. gen. superscalars)
- Decoding with CISC/RISC conversion (beginning with 2. gen. superscalar CISCs)
  - Decoding into µops (Intel, beginning with the Pentium Pro)
  - Decoding into macroops of up to two µops (AMD, beginning with the K7)
- Using a trace cache (P4-family)
- Decoding with instruction grouping
  - Grouping of RISC instructions: POWER4, POWER5
  - Grouping of CISC instructions: Pentium M, Core
  - K7 (Athlon), K8 (Hammer)

4 2 Straightforward parallel decoding Figure 2.1: The PowerPC 601’s front end Source: Stokes, J.H., „PowerPC on Apple: An architecture history”, Aug. 2004. http://arstechnica.com/articles

5 3 Predecoding (1) Figure 3.1: Contrasting the decoding and instruction issues in a scalar and a 4-way superscalar processor. [Figure: typical FX pipeline layouts. Scalar issue: I-cache, instruction buffer, a combined Decode / Issue / Check stage (F, D/I, ...). Superscalar issue: I-cache, instruction buffer, Decode / Issue / Check spread over separate stages (F, D, I, ...).]

6 3 Predecoding (1) Figure 3.2: The principle of predecoding. [Figure: second-level cache (or memory), feeding the predecode unit at typically 128 bits/cycle; predecode unit, feeding the I-cache at e.g. 148 bits/cycle; I-cache.] When instructions are written into the I-cache, the predecode unit of a RISC processor appends 4-7 bits to each instruction. AMD’s CISC processors append n bits to each byte (K5, K6: 5 bits/byte; K7, K8: 3 bits/byte). Source: Sima, D. et al., „ACA”, Addison-Wesley 1997
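The bit widths quoted for Figure 3.2 can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the RISC case assumes 32-bit instructions with 5 predecode bits each (one value in the 4-7 range that reproduces the 148-bit figure), and the 16-byte fetch block used for the AMD case is an assumption, not a number from the slide.

```python
def risc_predecoded_width(fetch_bits=128, instr_bits=32, predecode_bits_per_instr=5):
    """Width of a fetch block after predecode bits are appended to each instruction.

    Assumption: fixed-length 32-bit instructions, 5 predecode bits each.
    """
    instructions = fetch_bits // instr_bits
    return instructions * (instr_bits + predecode_bits_per_instr)


def cisc_predecoded_width(fetch_bytes=16, predecode_bits_per_byte=3):
    """Width of a fetch block when predecode bits are appended to each byte.

    Assumption: a 16-byte fetch block (not stated on the slide).
    """
    return fetch_bytes * (8 + predecode_bits_per_byte)


# RISC: 4 instructions * (32 + 5) bits
print("RISC, 128-bit block:", risc_predecoded_width(), "bits")
# K5/K6 style: 5 predecode bits per byte
print("CISC, 5 bits/byte:", cisc_predecoded_width(predecode_bits_per_byte=5), "bits")
# K7/K8 style: 3 predecode bits per byte
print("CISC, 3 bits/byte:", cisc_predecoded_width(predecode_bits_per_byte=3), "bits")
```

Per-byte tagging costs more storage than per-instruction tagging because x86 instruction boundaries are not known in advance, so every byte has to carry its own marks.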

7 3 Predecoding (2) Figure 3.3: The introduction of predecoding Source: Sima, D. et al., „ACA”, Addison-Wesley 1997

8 3. Predecoding (3) Figure 3.4: Variable length instruction decoding in the Athlon Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept.2003, http://www.chip-architect.com

9 3 Predecoding (4) Figure 3.5: Opteron’s instruction cache and decoding Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept.2003, http://www.chip-architect.com

10 4 Decoding with CISC/RISC conversion 4.1 Overview Figure 4.1: Principle of decoding with CISC/RISC conversion. [Figure: CISC instructions; decoding with CISC/RISC conversion; RISC core; retiring with RISC/CISC conversion; modification of the program state after RISC/CISC re-conversion. Examples: PPro, K6 (µops, macroops).] Source: Sima, D. et al., „ACA”, Addison-Wesley 1997

11 4.2 Decoding into µops (1) Figure 4.2: The Microarchitecture of the Pentium Pro Source: Shanley, T., „Pentium Pro Processor System Architecture”, Addison-Wesley Press, 1997

12 4.2 Decoding into µops (2) Figure 4.3: Basic misprediction pipeline of the Pentium III Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001

13 4.2 Decoding into µops (3) Figure 4.4: Decoding in AMD’s K6 Source: Shriver, B., Smith, B., „The Anatomy of a High-Performance Microprocessor”, IEEE Computer Society Press, 1998

14 4.2 Decoding into µops (4) Figure 4.5: The Microarchitecture of the Pentium M (Yonah) Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

15 4.2 Decoding into µops (5) Figure 4.6: The Microarchitecture of the Core processor family Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

16 4.3 Decoding into macroops (1) Figure 4.7: The Microarchitecture of the AMD Athlon™ Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998

17 4.3 Decoding into macroops (2) Figure 4.8: Decoding in the Athlon (1) Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998

18 4.3 Decoding into macroops (3) Figure 4.9: Decoding in the Athlon (2) Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998

19 4.3 Decoding into macroops (4)
Each MacroOp consists of 1 or 2 operations (OPs), e.g.:
  ADD EAX, EBX         1 ADD OP
  AND EAX, [EBX+16]    1 LOAD OP + 1 AND OP
Up to 3 MacroOps per cycle, with up to 3 FX + 2 L/S OPs per cycle (dual-ported D$!).
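The two examples on this slide can be reproduced with a toy decoder. This is a minimal sketch, not AMD's decoding logic: `MacroOp` and `decode_to_macroop` are hypothetical names, only two-operand instructions of the form `OP dst, src` are handled, and a memory source operand is recognized purely by its opening bracket.

```python
from dataclasses import dataclass


@dataclass
class MacroOp:
    text: str    # the original x86 instruction
    ops: tuple   # the 1 or 2 operations (OPs) it carries


def decode_to_macroop(instr: str) -> MacroOp:
    """Map one simplified x86 instruction to a MacroOp of 1 or 2 OPs."""
    mnemonic, operands = instr.split(None, 1)
    dst, src = [part.strip() for part in operands.split(",", 1)]
    if src.startswith("["):
        # Memory source operand: one LOAD OP plus the ALU OP.
        ops = ("LOAD", mnemonic.upper())
    else:
        # Register-to-register: a single ALU OP.
        ops = (mnemonic.upper(),)
    return MacroOp(instr, ops)


for line in ["ADD EAX, EBX", "AND EAX, [EBX+16]"]:
    print(line, "->", decode_to_macroop(line).ops)
```

Keeping the load and the ALU operation in one MacroOp means a single tracking entry covers both OPs until they are sent to the FX and L/S units.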

20 4.3 Decoding into macroops (5) Figure 4.10: The Microarchitecture of the Hammer Source: Weber, F., „AMD’s Next Generation Microprocessor Architecture”, MPF. Oct. 2001

21 5 Using a trace cache (1) Figure 5.1: The Microarchitecture of the Pentium 4 (Willamette)

22 5 Using a trace cache (2) Figure 5.2: Basic misprediction pipeline of the Pentium 4 (Willamette) Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001

23 5 Using a trace cache (3) Figure 5.3: The Microarchitecture of the Pentium 4 (Prescott) Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

24 6. Decoding with instruction grouping 6.1 Overview
Decoding with instruction grouping:
- Grouping of RISC instructions: POWER4, POWER5
- Grouping of CISC instructions: Pentium M, Core arch.
- K7 (Athlon), K8 (Hammer)

25 6.2 Grouping of RISC instructions (1) Figure 5.3: Instruction grouping in the K7 and K8. [Figure: operation of the Reorder Buffer (ROB); lines indexed 1 to 12, each with lane 0, lane 1 and lane 2. Legend: out-of-order finished instructions, results still speculative; instructions being retired now; retired instructions, no longer speculative.] Up to 3 MacroOps are decoded per cycle; these MacroOps are allocated a line in the ROB. The ROB has 24 lines of 3 entries each. The ROB retires a line if it is the oldest one and all MacroOps in that line are completed. Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept.2003, http://www.chip-architect.com
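The line-based retirement rule can be sketched in a few lines. The class below is a hypothetical model, not AMD's implementation: it takes the 24 lines of 3 entries from the slide, identifies MacroOps by unique string names (an assumption made for the example), and ignores everything else a real ROB tracks.

```python
from collections import deque

LANES = 3    # MacroOps per ROB line (lane 0..2)
LINES = 24   # ROB lines in the K7/K8, per the slide


class LineROB:
    def __init__(self):
        self.lines = deque()   # oldest line sits at the left end

    def allocate(self, macroops):
        """Allocate one line for the (up to 3) MacroOps decoded in a cycle."""
        if not 1 <= len(macroops) <= LANES:
            raise ValueError("a line holds 1 to 3 MacroOps")
        if len(self.lines) == LINES:
            return False       # ROB full: decoding has to stall
        self.lines.append({op: False for op in macroops})  # False = not completed
        return True

    def complete(self, op):
        """Mark a MacroOp as finished (possibly out of order)."""
        for line in self.lines:
            if op in line:
                line[op] = True
                return
        raise KeyError(op)

    def retire(self):
        """Retire the oldest line only if all of its MacroOps are completed."""
        if self.lines and all(self.lines[0].values()):
            return list(self.lines.popleft())
        return []


rob = LineROB()
rob.allocate(["m0", "m1", "m2"])
rob.allocate(["m3", "m4", "m5"])
for op in ["m3", "m4", "m5", "m0", "m1"]:   # the younger line finishes first
    rob.complete(op)
print(rob.retire())   # nothing retires: m2 in the oldest line is still pending
rob.complete("m2")
print(rob.retire())   # the oldest line retires as a whole
```

The second line is fully completed early on, yet it cannot retire until the older line has gone, which is what keeps the program state update in order.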

26 6.2 Grouping of RISC instructions (2) Figure 6.1: Out-of-order execution of MacroOps from the FX schedulers in the K8L (to be introduced in Q2 2007). The K8L scheduler has 8*3 entries vs. 6*3 in the K8. [Figure labels: decoders, schedulers, EUs.] Source: Malich, Y., „AMD's Next Generation Microarchitecture Preview: from K8 to K8L”, Aug. 2006.

27 6.2 Grouping of RISC instructions (3) Figure 6.1: The principle of instruction grouping in IBM’s POWER4 and POWER5 processors. [Figure labels: instruction groups, issue queues, execution units (EU), ROB, retire.]
1. Dispatch instruction groups in order and forward the individual instructions to the issue queues.
2. Execute individual instructions out of order (ooo).
3. Retire instruction groups in order, modifying the program state.
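The dispatch, execute and retire steps of Figure 6.1 can be mimicked with a small simulation. This is a sketch under stated assumptions: `run_grouped` is a hypothetical name, the group size of 5 is a parameter chosen for illustration (the slide does not state one), and a seeded shuffle stands in for the real out-of-order issue logic.

```python
import random


def run_grouped(program, group_size=5, seed=0):
    """Return the order in which instructions retire under group-based tracking."""
    # Dispatch: form groups in program order.
    groups = [program[i:i + group_size] for i in range(0, len(program), group_size)]

    # Forward individual instructions to the issue queue, tagged with their group.
    issue_queue = [(g, instr) for g, group in enumerate(groups) for instr in group]

    # Execute: individual instructions finish in an arbitrary (shuffled) order.
    random.Random(seed).shuffle(issue_queue)

    finished = [0] * len(groups)   # one completion counter per group
    retired = []
    oldest = 0
    for g, _instr in issue_queue:
        finished[g] += 1
        # Retire: only whole groups, and only starting from the oldest one.
        while oldest < len(groups) and finished[oldest] == len(groups[oldest]):
            retired.extend(groups[oldest])
            oldest += 1
    return retired


program = [f"i{n}" for n in range(12)]
print(run_grouped(program) == program)   # retirement follows program order
```

The tracking state here is one counter per group rather than one entry per instruction, which is the bookkeeping saving that grouping is after.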

28 6.2 Grouping of RISC instructions (4) Figure 6.2: Implementation of instruction grouping in IBM’s POWER5 processor Source: Sinharoy, B. et al., „POWER5 system microarchitecture”, IBM J. Res. & Dev., July/Sept. 2005.

29 6.3 Grouping of CISC instructions (1) (Intel: macro-op fusion)
Introduced in the Core architecture.
x86 instructions: macro-ops; internal instructions: μops.
Macro-op fusion combines two macro-ops into a single μop. Specifically, x86 compare or test instructions are fused with x86 jumps to produce a single μop.
Any decoder can perform macro-op fusion, but only one macro-op fusion can be performed in each cycle.
In the Core architecture the max. decode bandwidth is 4+1 x86 instructions/cycle.
Macro-op fusion can reduce the number of μops by about 10%.
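The "4+1" decode bandwidth follows from the one-fusion-per-cycle rule, as the toy decoder below illustrates. It is a sketch, not Intel's decoder: `decode_cycle` and `is_cond_jump` are hypothetical names, every unfused x86 instruction is assumed to yield exactly one μop, the jump is modeled as a conditional jump (an assumption), and the four decode slots are a parameter.

```python
COMPARES = {"cmp", "test"}


def is_cond_jump(mnemonic):
    """Treat any j* mnemonic other than jmp as a conditional jump (simplification)."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def decode_cycle(stream, pos, decoders=4):
    """Decode one cycle's worth of instructions starting at stream[pos].

    Returns the μops produced and the position of the next undecoded instruction.
    """
    uops = []
    fused = False                      # at most one macro-op fusion per cycle
    while pos < len(stream) and len(uops) < decoders:
        current = stream[pos]
        following = stream[pos + 1] if pos + 1 < len(stream) else None
        if (not fused and current in COMPARES
                and following is not None and is_cond_jump(following)):
            uops.append(f"{current}+{following}")   # two macro-ops, one μop
            pos += 2
            fused = True
        else:
            uops.append(current)
            pos += 1
    return uops, pos


stream = ["mov", "cmp", "jne", "add", "test", "jz", "sub"]
uops, pos = decode_cycle(stream, 0)
print(uops, "x86 instructions consumed:", pos)
```

In the first cycle the `cmp`/`jne` pair takes a single slot, so five x86 instructions are consumed by four decode slots; the later `test`/`jz` pair is not fused in that cycle because the single fusion has already been used.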

30 6.3 Grouping of CISC instructions (2) Benefits: fewer μops; increased performance; ooo execution becomes more effective, as the instruction window now includes more (~10%) x86 instructions.

